linux-coco.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
@ 2021-08-24  0:52 Sean Christopherson
  2021-08-24 10:48 ` Yu Zhang
                   ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Sean Christopherson @ 2021-08-24  0:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen,
	Yu Zhang

The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the
game, on an acceptable direction for supporting guest private memory, e.g. for
Intel's TDX.  The TDX architectural effectively allows KVM guests to crash the
host if guest private memory is accessible to host userspace, and thus does not
play nice with KVM's existing approach of pulling the pfn and mapping level from
the host page tables.

This is by no means a complete patch; it's a rough sketch of the KVM changes that
would be needed.  The kernel side of things is completely omitted from the patch;
the design concept is below.

There's also fair bit of hand waving on implementation details that shouldn't
fundamentally change the overall ABI, e.g. how the backing store will ensure
there are no mappings when "converting" to guest private.

Background
==========

This is a loose continuation of Kirill's RFC[*] to support TDX guest private
memory by tracking guest memory at the 'struct page' level.  This proposal is the
result of several offline discussions that were prompted by Andy Lutomirksi's
concerns with tracking via 'struct page':

  1. The kernel wouldn't easily be able to enforce a 1:1 page:guest association,
     let alone a 1:1 pfn:gfn mapping.

  2. Does not work for memory that isn't backed by 'struct page', e.g. if devices
     gain support for exposing encrypted memory regions to guests.

  3. Does not help march toward page migration or swap support (though it doesn't
     hurt either).

[*] https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com

Concept
=======

Guest private memory must be backed by an "enlightened" file descriptor, where
"enlightened" means the implementing subsystem supports a one-way "conversion" to
guest private memory and provides bi-directional hooks to communicate directly
with KVM.  Creating a private fd doesn't necessarily have to be a conversion, e.g. it
could also be a flag provided at file creation, a property of the file system itself,
etc...

Before a private fd can be mapped into a KVM guest, it must be paired 1:1 with a
KVM guest, i.e. multiple guests cannot share a fd.  At pairing, KVM and the fd's
subsystem exchange a set of function pointers to allow KVM to call into the subsystem,
e.g. to translate gfn->pfn, and vice versa to allow the subsystem to call into KVM,
e.g. to invalidate/move/swap a gfn range.

Mapping a private fd in host userspace is disallowed, i.e. there is never a host
virtual address associated with the fd and thus no userspace page tables pointing
at the private memory.

Pinning _from KVM_ is not required.  If the backing store supports page migration
and/or swap, it can query the KVM-provided function pointers to see if KVM supports
the operation.  If the operation is not supported (this will be the case initially
in KVM), the backing store is responsible for ensuring correct functionality.

Unmapping guest memory, e.g. to prevent use-after-free, is handled via a callback
from the backing store to KVM.  KVM will employ techniques similar to those it uses
for mmu_notifiers to ensure the guest cannot access freed memory.

A key point is that, unlike similar failed proposals of the past, e.g. /dev/mktme,
existing backing stores can be englightened, a from-scratch implementations is not
required (though would obviously be possible as well).

One idea for extending existing backing stores, e.g. HugeTLBFS and tmpfs, is
to add F_SEAL_GUEST, which would convert the entire file to guest private memory
and either fail if the current size is non-zero or truncate the size to zero.

KVM
===

Guest private memory is managed as a new address space, i.e. as a different set of
memslots, similar to how KVM has a separate memory view for when a guest vCPU is
executing in virtual SMM.  SMM is mutually exclusive with guest private memory.

The fd (the actual integer) is provided to KVM when a private memslot is added
via KVM_SET_USER_MEMORY_REGION.  This is when the aforementioned pairing occurs.

By default, KVM memslot lookups will be "shared", only specific touchpoints will
be modified to work with private memslots, e.g. guest page faults.  All host
accesses to guest memory, e.g. for emulation, will thus look for shared memory
and naturally fail without attempting copy_to/from_user() if the guest attempts
to coerce KVM into access private memory.  Note, avoiding copy_to/from_user() and
friends isn't strictly necessary, it's more of a happy side effect.

A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in vcpu->run
is added to propagate illegal accesses (see above) and implicit conversions
to userspace (see below).  Note, the new exit reason + struct can also be to
support several other feature requests in KVM[1][2].

The guest may explicitly or implicity request KVM to map a shared/private variant
of a GFN.  An explicit map request is done via hypercall (out of scope for this
proposal as both TDX and SNP ABIs define such a hypercall).  An implicit map request
is triggered simply by the guest accessing the shared/private variant, which KVM
sees as a guest page fault (EPT violation or #NPF).  Ideally only explicit requests
would be supported, but neither TDX nor SNP require this in their guest<->host ABIs.

For implicit or explicit mappings, if a memslot is found that fully covers the
requested range (which is a single gfn for implicit mappings), KVM's normal guest
page fault handling works with minimal modification.

If a memslot is not found, for explicit mappings, KVM will exit to userspace with
the aforementioned dedicated exit reason.  For implict _private_ mappings, KVM will
also immediately exit with the same dedicated reason.  For implicit shared mappings,
an additional check is required to differentiate between emulated MMIO and an
implicit private->shared conversion[*].  If there is an existing private memslot
for the gfn, KVM will exit to userspace, otherwise KVM will treat the access as an
emulated MMIO access and handle the page fault accordingly.

[1] https://lkml.kernel.org/r/YKxJLcg/WomPE422@google.com
[2] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com

Punching Holes
==============

The expected userspace memory model is that mapping requests will be handled as
conversions, e.g. on a shared mapping request, first unmap the private gfn range,
then map the shared gfn range.  A new KVM ioctl() will likely be needed to allow
userspace to punch a hole in a memslot, as expressing such an operation isn't
possible with KVM_SET_USER_MEMORY_REGION.  While userspace could delete the
memslot, then recreate three new memslots, doing so would be destructive to guest
data as unmapping guest private memory (from the EPT/NPT tables) is destructive
to the data for both TDX and SEV-SNP guests.

Pros (vs. struct page)
======================

Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping.

Userspace page tables are not populated, e.g. reduced memory footprint, lower
probability of making private memory accessible to userspace.

Provides line of sight to supporting page migration and swap.

Provides line of sight to mapping MMIO pages into guest private memory.

Cons (vs. struct page)
======================

Significantly more churn in KVM, e.g. to plumb 'private' through where needed,
support memslot hole punching, etc...

KVM's MMU gets another method of retrieving host pfn and page size.

Requires enabling in every backing store that someone wants to support.

Because the NUMA APIs work on virtual addresses, new syscalls fmove_pages(),
fbind(), etc... would be required to provide equivalents to existing NUMA
functionality (though those syscalls would likely be useful irrespective of guest
private memory).

Washes (vs. struct page)
========================

A misbehaving guest that triggers a large number of shared memory mappings will
consume a large number of memslots.  But, this is likely a wash as similar effect
would happen with VMAs in the struct page approach.

Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Varad Gautam <varad.gautam@suse.com>
Cc: Dario Faggioli <dfaggioli@suse.com>
Cc: x86@kernel.org
Cc: linux-mm@kvack.org
Cc: linux-coco@lists.linux.dev
Cc: linux-kernel@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Not-signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  7 +++---
 arch/x86/include/uapi/asm/kvm.h |  4 ++++
 arch/x86/kvm/mmu.h              |  3 +++
 arch/x86/kvm/mmu/mmu.c          | 26 ++++++++++++++++++++--
 include/linux/kvm_host.h        | 24 ++++++++++++++++++---
 include/uapi/linux/kvm.h        |  4 ++++
 virt/kvm/kvm_main.c             | 38 ++++++++++++++++++++++++++++-----
 7 files changed, 93 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 20daaf67a5bf..5ab6e5e9f38b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1810,10 +1810,11 @@ enum {
 #define HF_SMM_INSIDE_NMI_MASK	(1 << 7)
 
 #define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
-#define KVM_ADDRESS_SPACE_NUM 2
+#define KVM_ADDRESS_SPACE_NUM 3
 
-#define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
-#define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
+#define kvm_arch_vcpu_memslots_id(vcpu, private)	\
+	(((vcpu)->arch.hflags & HF_SMM_MASK) ? 1 : (!!private) << 1)
+#define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm | (role).private << 1)
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a6c327f8ad9e..600bf108741d 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -53,6 +53,10 @@
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
 
+#define KVM_DEFAULT_ADDRESS_SPACE	0
+#define KVM_SMM_ADDRESS_SPACE		1
+#define KVM_PRIVATE_ADDRESS_SPACE	2
+
 struct kvm_memory_alias {
 	__u32 slot;  /* this has a different namespace than memory slots */
 	__u32 flags;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ce369a533800..fc620eda27fd 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -127,6 +127,9 @@ struct kvm_page_fault {
 	const bool rsvd;
 	const bool user;
 
+	/* Guest private, derived from error_code (SNP) or gfn (TDX). */
+	const bool private;
+
 	/* Derived from mmu and global state.  */
 	const bool is_tdp;
 	const bool nx_huge_page_workaround_enabled;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a272ccbddfa1..771080235b2d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2896,6 +2896,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
+	if (memslot_is_private(slot))
+		return slot->private_ops->pfn_mapping_level(...);
+
 	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
 	return min(host_level, max_level);
 }
@@ -3835,9 +3838,11 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
 {
-	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
+	struct kvm_memory_slot *slot;
 	bool async;
 
+	slot = __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, fault->private);
+
 	/*
 	 * Retry the page fault if the gfn hit a memslot that is being deleted
 	 * or moved.  This ensures any existing SPTEs for the old memslot will
@@ -3846,8 +3851,19 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
 		goto out_retry;
 
+	/*
+	 * Exit to userspace to map the requested private/shared memory region
+	 * if there is no memslot and (a) the access is private or (b) there is
+	 * an existing private memslot.  Emulated MMIO must be accessed through
+	 * shared GPAs, thus a memslot miss on a private GPA is always handled
+	 * as an implicit conversion "request".
+	 */
+	if (!slot &&
+	    (fault->private || __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, true)))
+		goto out_convert;
+
 	if (!kvm_is_visible_memslot(slot)) {
-		/* Don't expose private memslots to L2. */
+		/* Don't expose KVM's internal memslots to L2. */
 		if (is_guest_mode(vcpu)) {
 			fault->pfn = KVM_PFN_NOSLOT;
 			fault->map_writable = false;
@@ -3890,6 +3906,12 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 out_retry:
 	*r = RET_PF_RETRY;
 	return true;
+
+out_convert:
+	vcpu->run->exit_reason = KVM_EXIT_MAP_MEMORY;
+	/* TODO: fill vcpu->run with more info. */
+	*r = 0;
+	return true;
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d447b21cdd73..21ff766f98d0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -434,6 +434,7 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct guest_private_memory_ops *private_ops;
 };
 
 static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
@@ -514,7 +515,7 @@ struct kvm_irq_routing_table {
 #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
 
 #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
-static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
+static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu, bool private)
 {
 	return 0;
 }
@@ -785,13 +786,19 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
 	return __kvm_memslots(kvm, 0);
 }
 
-static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
+static inline struct kvm_memslots *__kvm_vcpu_memslots(struct kvm_vcpu *vcpu,
+						       bool private)
 {
-	int as_id = kvm_arch_vcpu_memslots_id(vcpu);
+	int as_id = kvm_arch_vcpu_memslots_id(vcpu, private);
 
 	return __kvm_memslots(vcpu->kvm, as_id);
 }
 
+static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
+{
+	return __kvm_vcpu_memslots(vcpu, false);
+}
+
 static inline
 struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
 {
@@ -807,6 +814,15 @@ struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
 	return slot;
 }
 
+static inline bool memslot_is_private(struct kvm_memory_slot *slot)
+{
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+	return slot && slot->as_id == KVM_PRIVATE_ADDRESS_SPACE;
+#else
+	return false;
+#endif
+}
+
 /*
  * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
  * - create a new memory slot
@@ -946,6 +962,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot, g
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
 struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
+struct kvm_memory_slot *__kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu,
+						  gfn_t gfn, bool private);
 struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a067410ebea5..bd01cff9aeff 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -101,6 +101,9 @@ struct kvm_userspace_memory_region {
 	__u64 guest_phys_addr;
 	__u64 memory_size; /* bytes */
 	__u64 userspace_addr; /* start of the userspace allocated memory */
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+	__u32 fd; /* valid iff memslot is guest private memory */
+#endif
 };
 
 /*
@@ -269,6 +272,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_AP_RESET_HOLD    32
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
+#define KVM_EXIT_MEMORY_ERROR	  35
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3e67c93ca403..8747a39d4173 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1583,7 +1583,19 @@ static int kvm_set_memslot(struct kvm *kvm,
 		kvm_copy_memslots(slots, __kvm_memslots(kvm, as_id));
 	}
 
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+	if (change == KVM_MR_CREATE && as_id == KVM_PRIVATE_ADDRESS_SPACE) {
+		r = kvm_register_private_memslot(kvm, mem, old, new);
+		if (r)
+			goto out_slots;
+	}
+#endif
+
 	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+	if ((r || change == KVM_MR_DELETE) && as_id == KVM_PRIVATE_ADDRESS_SPACE)
+		kvm_unregister_private_memslot(kvm, mem, old, new);
+#endif
 	if (r)
 		goto out_slots;
 
@@ -1706,6 +1718,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		new.dirty_bitmap = NULL;
 		memset(&new.arch, 0, sizeof(new.arch));
 	} else { /* Modify an existing slot. */
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+		/* Private memslots are immutable, they can only be deleted. */
+		if (as_id == KVM_PRIVATE_ADDRESS_SPACE)
+			return -EINVAL;
+#endif
+
 		if ((new.userspace_addr != old.userspace_addr) ||
 		    (new.npages != old.npages) ||
 		    ((new.flags ^ old.flags) & KVM_MEM_READONLY))
@@ -2059,9 +2077,10 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(gfn_to_memslot);
 
-struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
+struct kvm_memory_slot *__kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu,
+						  gfn_t gfn, bool private)
 {
-	struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
+	struct kvm_memslots *slots = __kvm_vcpu_memslots(vcpu, private);
 	struct kvm_memory_slot *slot;
 	int slot_index;
 
@@ -2082,6 +2101,12 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
 
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(__kvm_vcpu_gfn_to_memslot);
+
+struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	return __kvm_vcpu_gfn_to_memslot(vcpu, gfn, false);
+}
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
 
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
@@ -2428,8 +2453,12 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool *async, bool write_fault,
 			       bool *writable, hva_t *hva)
 {
-	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
+	unsigned long addr;
 
+	if (memslot_is_private(slot))
+		return slot->private_ops->gfn_to_pfn(...);
+
+	addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 	if (hva)
 		*hva = addr;
 
@@ -2624,8 +2653,7 @@ EXPORT_SYMBOL_GPL(kvm_map_gfn);
 
 int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
 {
-	return __kvm_map_gfn(kvm_vcpu_memslots(vcpu), gfn, map,
-		NULL, false);
+	return __kvm_map_gfn(kvm_vcpu_memslots(vcpu), gfn, map, NULL, false);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_map);
 
-- 
2.33.0.rc2.250.ged5fa647cd-goog


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-24  0:52 [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory Sean Christopherson
@ 2021-08-24 10:48 ` Yu Zhang
  2021-08-26  0:35   ` Sean Christopherson
  2021-08-26 10:15 ` David Hildenbrand
  2021-09-02 18:47 ` Kirill A. Shutemov
  2 siblings, 1 reply; 71+ messages in thread
From: Yu Zhang @ 2021-08-24 10:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen,
	Isaku Yamahata

On Mon, Aug 23, 2021 at 05:52:48PM -0700, Sean Christopherson wrote:

Thanks a lot for sharing these ideas. Lots of questions are inlined below. :)

> The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the
> game, on an acceptable direction for supporting guest private memory, e.g. for
> Intel's TDX.  The TDX architectural effectively allows KVM guests to crash the
> host if guest private memory is accessible to host userspace, and thus does not

What about incorrect/malicious accesses from host kernel? Should the direct mapping
also be removed for guest private memory?

> play nice with KVM's existing approach of pulling the pfn and mapping level from
> the host page tables.
> 
> This is by no means a complete patch; it's a rough sketch of the KVM changes that
> would be needed.  The kernel side of things is completely omitted from the patch;
> the design concept is below.
> 
> There's also fair bit of hand waving on implementation details that shouldn't
> fundamentally change the overall ABI, e.g. how the backing store will ensure
> there are no mappings when "converting" to guest private.
> 
> Background
> ==========
> 
> This is a loose continuation of Kirill's RFC[*] to support TDX guest private
> memory by tracking guest memory at the 'struct page' level.  This proposal is the
> result of several offline discussions that were prompted by Andy Lutomirksi's
> concerns with tracking via 'struct page':
> 
>   1. The kernel wouldn't easily be able to enforce a 1:1 page:guest association,
>      let alone a 1:1 pfn:gfn mapping.

May I ask why? Doesn't FOLL_GUEST in Kirill's earlier patch work? Or just
because traversing the host PT to get a PFN(for a PageGuest(page)) is too
heavy?

> 
>   2. Does not work for memory that isn't backed by 'struct page', e.g. if devices
>      gain support for exposing encrypted memory regions to guests.

Do you mean that a page not backed by 'struct page' might be mapped to other
user space? I thought the VM_GUEST flags for the VMA could prevent that(though
I may possiblely be wrong). Could you explain more? Thanks!

> 
>   3. Does not help march toward page migration or swap support (though it doesn't
>      hurt either).
> 
> [*] https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com
> 
> Concept
> =======
> 
> Guest private memory must be backed by an "enlightened" file descriptor, where
> "enlightened" means the implementing subsystem supports a one-way "conversion" to
> guest private memory and provides bi-directional hooks to communicate directly
> with KVM.  Creating a private fd doesn't necessarily have to be a conversion, e.g. it
> could also be a flag provided at file creation, a property of the file system itself,
> etc...
> 
> Before a private fd can be mapped into a KVM guest, it must be paired 1:1 with a
> KVM guest, i.e. multiple guests cannot share a fd.  At pairing, KVM and the fd's
> subsystem exchange a set of function pointers to allow KVM to call into the subsystem,
> e.g. to translate gfn->pfn, and vice versa to allow the subsystem to call into KVM,
> e.g. to invalidate/move/swap a gfn range.

So the gfn->pfn translation is done by the fd's subsystem? Again, could you
please elaborate how?

And each private memory region would need a seperate group of callbacks? 

> 
> Mapping a private fd in host userspace is disallowed, i.e. there is never a host
> virtual address associated with the fd and thus no userspace page tables pointing
> at the private memory.
> 
> Pinning _from KVM_ is not required.  If the backing store supports page migration
> and/or swap, it can query the KVM-provided function pointers to see if KVM supports
> the operation.  If the operation is not supported (this will be the case initially
> in KVM), the backing store is responsible for ensuring correct functionality.
> 
> Unmapping guest memory, e.g. to prevent use-after-free, is handled via a callback
> from the backing store to KVM.  KVM will employ techniques similar to those it uses
> for mmu_notifiers to ensure the guest cannot access freed memory.
> 
> A key point is that, unlike similar failed proposals of the past, e.g. /dev/mktme,
> existing backing stores can be englightened, a from-scratch implementations is not
> required (though would obviously be possible as well).
> 
> One idea for extending existing backing stores, e.g. HugeTLBFS and tmpfs, is
> to add F_SEAL_GUEST, which would convert the entire file to guest private memory
> and either fail if the current size is non-zero or truncate the size to zero.

Have you discussed memfd_secret(if host direct mapping is also to be removed)? 

And how does this F_SEAL_GUEST work?

> 
> KVM
> ===
> 
> Guest private memory is managed as a new address space, i.e. as a different set of
> memslots, similar to how KVM has a separate memory view for when a guest vCPU is
> executing in virtual SMM.  SMM is mutually exclusive with guest private memory.
> 
> The fd (the actual integer) is provided to KVM when a private memslot is added
> via KVM_SET_USER_MEMORY_REGION.  This is when the aforementioned pairing occurs.

My understanding of KVM_SET_USER_MEMORY_REGION is that, this ioctl is to
facilitate the binding of HVA and GPA ranges. But if there's no HVAs for
a private region at all, why do we need a memslot for it? Besides to keep
track of the private GFN ranges, and provide the callbacks, is there any
other reason?

Another question is: why do we need a whole new address space, instead of
one address space accommodating memslot types?

> 
> By default, KVM memslot lookups will be "shared", only specific touchpoints will
> be modified to work with private memslots, e.g. guest page faults.  All host
> accesses to guest memory, e.g. for emulation, will thus look for shared memory
> and naturally fail without attempting copy_to/from_user() if the guest attempts

Becasue gfn_to_hva() will fail first?

> to coerce KVM into access private memory.  Note, avoiding copy_to/from_user() and
> friends isn't strictly necessary, it's more of a happy side effect.
> 
> A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in vcpu->run
> is added to propagate illegal accesses (see above) and implicit conversions

Sorry, illegal accesses from VM?

Do you actually mean a KVM page fault caused by private access from VM, which
implicitly notifies KVM to mark it as private(e.g. by bouncing to Qemu, which
then creates a private memory region and ioctls into KVM)?

If the answer is yes, how about naming the exit reason as KVM_EXIT_MEMORY_PRIVATE?
Meanwhile, is Qemu also supposed to invoke some system call into host kernel
before ioctls into KVM? I'm still confused where the kernel callbacks like
the gfn_to_pfn() come from(and how they function)... :)

> to userspace (see below).  Note, the new exit reason + struct can also be to
> support several other feature requests in KVM[1][2].
> 
> The guest may explicitly or implicity request KVM to map a shared/private variant
> of a GFN.  An explicit map request is done via hypercall (out of scope for this
> proposal as both TDX and SNP ABIs define such a hypercall).  An implicit map request
> is triggered simply by the guest accessing the shared/private variant, which KVM
> sees as a guest page fault (EPT violation or #NPF).  Ideally only explicit requests
> would be supported, but neither TDX nor SNP require this in their guest<->host ABIs.

Well, I am wondering, should we assume all guest pages as shared or private by
default? I mean, if all guest pages are private when the VM is created, maybe
the private memslots can be initialized in VM creation time, and be deleted/splited
later(e.g. in response to guest sharing  hypercalls)?

It may simplify the logic, but may also restrict the VM type(e.g. to be TD guest).

> 
> For implicit or explicit mappings, if a memslot is found that fully covers the
> requested range (which is a single gfn for implicit mappings), KVM's normal guest
> page fault handling works with minimal modification.
> 
> If a memslot is not found, for explicit mappings, KVM will exit to userspace with
> the aforementioned dedicated exit reason.  For implict _private_ mappings, KVM will
> also immediately exit with the same dedicated reason.  For implicit shared mappings,
> an additional check is required to differentiate between emulated MMIO and an
> implicit private->shared conversion[*].  If there is an existing private memslot
> for the gfn, KVM will exit to userspace, otherwise KVM will treat the access as an
> emulated MMIO access and handle the page fault accordingly.
> 
> [1] https://lkml.kernel.org/r/YKxJLcg/WomPE422@google.com
> [2] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
> 
> Punching Holes
> ==============
> 
> The expected userspace memory model is that mapping requests will be handled as
> conversions, e.g. on a shared mapping request, first unmap the private gfn range,
> then map the shared gfn range.  A new KVM ioctl() will likely be needed to allow
> userspace to punch a hole in a memslot, as expressing such an operation isn't
> possible with KVM_SET_USER_MEMORY_REGION.  While userspace could delete the
> memslot, then recreate three new memslots, doing so would be destructive to guest
> data as unmapping guest private memory (from the EPT/NPT tables) is destructive
> to the data for both TDX and SEV-SNP guests.

May I ask why? Thanks!

> 
> Pros (vs. struct page)
> ======================
> 
> Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping.
> 
> Userspace page tables are not populated, e.g. reduced memory footprint, lower
> probability of making private memory accessible to userspace.
> 
> Provides line of sight to supporting page migration and swap.
> 
> Provides line of sight to mapping MMIO pages into guest private memory.
> 
> Cons (vs. struct page)
> ======================
> 
> Significantly more churn in KVM, e.g. to plumb 'private' through where needed,
> support memslot hole punching, etc...
> 
> KVM's MMU gets another method of retrieving host pfn and page size.

And the method is provided by host kernel? How does this method work?

[...]
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a272ccbddfa1..771080235b2d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2896,6 +2896,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  	if (max_level == PG_LEVEL_4K)
>  		return PG_LEVEL_4K;
>  
> +	if (memslot_is_private(slot))
> +		return slot->private_ops->pfn_mapping_level(...);
> +

Oh, any suggestion how host kernel decides the mapping level here?

>  	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
>  	return min(host_level, max_level);
>  }
> @@ -3835,9 +3838,11 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  
>  static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
>  {
> -	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
> +	struct kvm_memory_slot *slot;
>  	bool async;
>  
> +	slot = __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, fault->private);
> +
>  	/*
>  	 * Retry the page fault if the gfn hit a memslot that is being deleted
>  	 * or moved.  This ensures any existing SPTEs for the old memslot will
> @@ -3846,8 +3851,19 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
>  		goto out_retry;
>  
> +	/*
> +	 * Exit to userspace to map the requested private/shared memory region
> +	 * if there is no memslot and (a) the access is private or (b) there is
> +	 * an existing private memslot.  Emulated MMIO must be accessed through
> +	 * shared GPAs, thus a memslot miss on a private GPA is always handled
> +	 * as an implicit conversion "request".
> +	 */

For (b), do you mean this fault is for a GFN which marked as private, but now
converted to a shared? If true, could we just disallow it if no explict sharing
hypercall is triggered?

> +	if (!slot &&
> +	    (fault->private || __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, true)))
> +		goto out_convert;
> +
>  	if (!kvm_is_visible_memslot(slot)) {
> -		/* Don't expose private memslots to L2. */
> +		/* Don't expose KVM's internal memslots to L2. */
>  		if (is_guest_mode(vcpu)) {
>  			fault->pfn = KVM_PFN_NOSLOT;
>  			fault->map_writable = false;
> @@ -3890,6 +3906,12 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  out_retry:
>  	*r = RET_PF_RETRY;
>  	return true;
> +
> +out_convert:
> +	vcpu->run->exit_reason = KVM_EXIT_MAP_MEMORY;
> +	/* TODO: fill vcpu->run with more info. */
> +	*r = 0;
> +	return true;
>  }


B.R.
Yu


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-24 10:48 ` Yu Zhang
@ 2021-08-26  0:35   ` Sean Christopherson
  2021-08-26 13:23     ` Yu Zhang
  0 siblings, 1 reply; 71+ messages in thread
From: Sean Christopherson @ 2021-08-26  0:35 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen,
	Isaku Yamahata

On Tue, Aug 24, 2021, Yu Zhang wrote:
> On Mon, Aug 23, 2021 at 05:52:48PM -0700, Sean Christopherson wrote:
> 
> Thanks a lot for sharing these ideas. Lots of questions are inlined below. :)
> 
> > The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the
> > game, on an acceptable direction for supporting guest private memory, e.g. for
> > Intel's TDX.  The TDX architectural effectively allows KVM guests to crash the
> > host if guest private memory is accessible to host userspace, and thus does not
> 
> What about incorrect/malicious accesses from host kernel? Should the direct mapping
> also be removed for guest private memory?

I would say that's out of scope for this RFC as it should not require any
coordination between KVM and MM.

> > play nice with KVM's existing approach of pulling the pfn and mapping level from
> > the host page tables.
> > 
> > This is by no means a complete patch; it's a rough sketch of the KVM changes that
> > would be needed.  The kernel side of things is completely omitted from the patch;
> > the design concept is below.
> > 
> > There's also fair bit of hand waving on implementation details that shouldn't
> > fundamentally change the overall ABI, e.g. how the backing store will ensure
> > there are no mappings when "converting" to guest private.
> > 
> > Background
> > ==========
> > 
> > This is a loose continuation of Kirill's RFC[*] to support TDX guest private
> > memory by tracking guest memory at the 'struct page' level.  This proposal is the
> > result of several offline discussions that were prompted by Andy Lutomirksi's
> > concerns with tracking via 'struct page':
> > 
> >   1. The kernel wouldn't easily be able to enforce a 1:1 page:guest association,
> >      let alone a 1:1 pfn:gfn mapping.
> 
> May I ask why? Doesn't FOLL_GUEST in Kirill's earlier patch work? Or just
> because traversing the host PT to get a PFN(for a PageGuest(page)) is too
> heavy?

To ensure a 1:1 page:guest, 'struct page' would need to track enough information
to uniquely identify which guest owns the page.  With TDX, one can argue that the
kernel can rely on the TDX-Module to do that tracking (I argued this :-)), but
for SEV and pure software implementations there is no third party that's tracking
who owns what.  In other words, without stashing an identifier in 'struct page',
the kernel would only know that _a_ guest owns the page, it couldn't identify which
guest owns the page.  And allocating that much space in 'struct page' would be
painfully expensive.

1:1 pfn:gfn is even worse.  E.g. if userspace maps the same file as MAP_GUEST at
multiple host virtual adddress, then configures KVM's memslots to have multiple
regions, one for each alias, the guest will end up with 1:N pfn:gfn mappings.
Preventing that would be nigh impossible without ending up with a plethora of
races.

> >   2. Does not work for memory that isn't backed by 'struct page', e.g. if devices
> >      gain support for exposing encrypted memory regions to guests.
> 
> Do you mean that a page not backed by 'struct page' might be mapped to other
> user space? I thought the VM_GUEST flags for the VMA could prevent that(though
> I may possiblely be wrong). Could you explain more? Thanks!

It's about enabling, not preventing.  If in the future there is some form of memory
that is not backed by 'struct page' (CXL memory?), that can also be mapped into
a protected KVM guest, relying on 'struct page' to ensure that only the owning KVM
guest has access to that memory would not work.

> >   3. Does not help march toward page migration or swap support (though it doesn't
> >      hurt either).
> > 
> > [*] https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com
> > 
> > Concept
> > =======
> > 
> > Guest private memory must be backed by an "enlightened" file descriptor, where
> > "enlightened" means the implementing subsystem supports a one-way "conversion" to
> > guest private memory and provides bi-directional hooks to communicate directly
> > with KVM.  Creating a private fd doesn't necessarily have to be a conversion, e.g. it
> > could also be a flag provided at file creation, a property of the file system itself,
> > etc...
> > 
> > Before a private fd can be mapped into a KVM guest, it must be paired 1:1 with a
> > KVM guest, i.e. multiple guests cannot share a fd.  At pairing, KVM and the fd's
> > subsystem exchange a set of function pointers to allow KVM to call into the subsystem,
> > e.g. to translate gfn->pfn, and vice versa to allow the subsystem to call into KVM,
> > e.g. to invalidate/move/swap a gfn range.
> 
> So the gfn->pfn translation is done by the fd's subsystem? Again, could you
> please elaborate how?

Rats, I meant to capture this in the "patch" and did not.  The memslot would hold
the base offset into the file instead of a host virtual address.  I.e. instead of
gfn -> hva -> pfn, it would roughly be gfn -> offset -> pfn.

> And each private memory region would need a seperate group of callbacks?

Each private memslot would have its own pointer to a virtual function table, but
there would only be a single table per backing store implementation.

> > Mapping a private fd in host userspace is disallowed, i.e. there is never a host
> > virtual address associated with the fd and thus no userspace page tables pointing
> > at the private memory.
> > 
> > Pinning _from KVM_ is not required.  If the backing store supports page migration
> > and/or swap, it can query the KVM-provided function pointers to see if KVM supports
> > the operation.  If the operation is not supported (this will be the case initially
> > in KVM), the backing store is responsible for ensuring correct functionality.
> > 
> > Unmapping guest memory, e.g. to prevent use-after-free, is handled via a callback
> > from the backing store to KVM.  KVM will employ techniques similar to those it uses
> > for mmu_notifiers to ensure the guest cannot access freed memory.
> > 
> > A key point is that, unlike similar failed proposals of the past, e.g. /dev/mktme,
> > existing backing stores can be englightened, a from-scratch implementations is not
> > required (though would obviously be possible as well).
> > 
> > One idea for extending existing backing stores, e.g. HugeTLBFS and tmpfs, is
> > to add F_SEAL_GUEST, which would convert the entire file to guest private memory
> > and either fail if the current size is non-zero or truncate the size to zero.
> 
> Have you discussed memfd_secret(if host direct mapping is also to be removed)? 

No.  From a userspace perspective, the two would be mutually exclusive

> And how does this F_SEAL_GUEST work?

Are you asking about semantics or implementation?  If it's implementation, that's
firmly in the handwaving part :-)  Semantically, once F_SEAL_GUEST is set it can't
be removed, i.e. the file is forever "guest private".  In a way it's destructive,
e.g. the host can't ever read out the memory, but that's really just a reflection
of the hardware, e.g. even with SEV, the host can read the memory but it can never
decrypt the memory.

> > 
> > KVM
> > ===
> > 
> > Guest private memory is managed as a new address space, i.e. as a different set of
> > memslots, similar to how KVM has a separate memory view for when a guest vCPU is
> > executing in virtual SMM.  SMM is mutually exclusive with guest private memory.
> > 
> > The fd (the actual integer) is provided to KVM when a private memslot is added
> > via KVM_SET_USER_MEMORY_REGION.  This is when the aforementioned pairing occurs.
> 
> My understanding of KVM_SET_USER_MEMORY_REGION is that, this ioctl is to
> facilitate the binding of HVA and GPA ranges. But if there's no HVAs for
> a private region at all, why do we need a memslot for it? Besides to keep
> track of the private GFN ranges, and provide the callbacks, is there any
> other reason?

The short answer is that something in KVM needs to translate gfn->pfn.  Using just
the gfn isn't feasible because there's no anchor (see above regarding the base file
offset).  And even if the offset weren't needed, KVM still needs a lot of the same
metadata, e.g. some day there may be read-only private memsots, dirty logging of
private memslots, etc...  Implementing something else entirely would also require
an absurd amount of code churn as all of KVM revolves around gfn -> slot -> pfn.

> Another question is: why do we need a whole new address space, instead of
> one address space accommodating memslot types?

Either way could be made to work, and there isn't thaaat much code difference.  My
initial thoughts were to use a flag, but there are some niceties that a separate
address space provides (more below).

> > By default, KVM memslot lookups will be "shared", only specific touchpoints will
> > be modified to work with private memslots, e.g. guest page faults.  All host
> > accesses to guest memory, e.g. for emulation, will thus look for shared memory
> > and naturally fail without attempting copy_to/from_user() if the guest attempts
> 
> Becasue gfn_to_hva() will fail first?

Yes.  More precisely, gfn_to_hva() will fail because there is no memslot for the
current address space (shared).  This is one advantage over a flag, e.g. there's no
need to check a flag after getting a memslot.  It's somewhat superficial as it
wouldn't be too difficult to add low level helpers to default to "shared", but there
are multiple instances of those types of patterns, so I do hope/think it will yield
to cleaner code.

> > to coerce KVM into access private memory.  Note, avoiding copy_to/from_user() and
> > friends isn't strictly necessary, it's more of a happy side effect.
> > 
> > A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in vcpu->run
> > is added to propagate illegal accesses (see above) and implicit conversions
> 
> Sorry, illegal accesses from VM?

Illegal accesses from the host, e.g. attempting to read/write guest private memory
via gfn_to_hva().

> Do you actually mean a KVM page fault caused by private access from VM, which
> implicitly notifies KVM to mark it as private(e.g. by bouncing to Qemu, which
> then creates a private memory region and ioctls into KVM)?
> 
> If the answer is yes, how about naming the exit reason as KVM_EXIT_MEMORY_PRIVATE?
> Meanwhile, is Qemu also supposed to invoke some system call into host kernel
> before ioctls into KVM? I'm still confused where the kernel callbacks like
> the gfn_to_pfn() come from(and how they function)... :)

I would like the exit to be generic so that it can be reused for other, completely
unrelated (yet similar) scenarios (see links below).

> > to userspace (see below).  Note, the new exit reason + struct can also be to
> > support several other feature requests in KVM[1][2].
> > 
> > The guest may explicitly or implicity request KVM to map a shared/private variant
> > of a GFN.  An explicit map request is done via hypercall (out of scope for this
> > proposal as both TDX and SNP ABIs define such a hypercall).  An implicit map request
> > is triggered simply by the guest accessing the shared/private variant, which KVM
> > sees as a guest page fault (EPT violation or #NPF).  Ideally only explicit requests
> > would be supported, but neither TDX nor SNP require this in their guest<->host ABIs.
> 
> Well, I am wondering, should we assume all guest pages as shared or private by
> default? I mean, if all guest pages are private when the VM is created, maybe
> the private memslots can be initialized in VM creation time, and be deleted/splited
> later(e.g. in response to guest sharing  hypercalls)?

Define "we".  I absolutely expect the userspace VMM to assume all guest pages are
private by default.  But I feel very strongly that KVM should not be involved in
the control logic for decided when to convert a given page between shared and
private.  Thus, deciding the default state is not KVM's responsibility.

> It may simplify the logic, but may also restrict the VM type(e.g. to be TD guest).
> 
> > 
> > For implicit or explicit mappings, if a memslot is found that fully covers the
> > requested range (which is a single gfn for implicit mappings), KVM's normal guest
> > page fault handling works with minimal modification.
> > 
> > If a memslot is not found, for explicit mappings, KVM will exit to userspace with
> > the aforementioned dedicated exit reason.  For implict _private_ mappings, KVM will
> > also immediately exit with the same dedicated reason.  For implicit shared mappings,
> > an additional check is required to differentiate between emulated MMIO and an
> > implicit private->shared conversion[*].  If there is an existing private memslot
> > for the gfn, KVM will exit to userspace, otherwise KVM will treat the access as an
> > emulated MMIO access and handle the page fault accordingly.
> > 
> > [1] https://lkml.kernel.org/r/YKxJLcg/WomPE422@google.com
> > [2] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
> > 
> > Punching Holes
> > ==============
> > 
> > The expected userspace memory model is that mapping requests will be handled as
> > conversions, e.g. on a shared mapping request, first unmap the private gfn range,
> > then map the shared gfn range.  A new KVM ioctl() will likely be needed to allow
> > userspace to punch a hole in a memslot, as expressing such an operation isn't
> > possible with KVM_SET_USER_MEMORY_REGION.  While userspace could delete the
> > memslot, then recreate three new memslots, doing so would be destructive to guest
> > data as unmapping guest private memory (from the EPT/NPT tables) is destructive
> > to the data for both TDX and SEV-SNP guests.
> 
> May I ask why? Thanks!

Hmm, for SNP it might not actually be destructive, I forget the exact flows for
unmapping memory.

Anyways, for TDX it's most certainly destructive.  When mapping private pages into
the guest (ignore pre-boot), the guest must accept a page before accessing the page,
as a #VE will occur if the page is not accepted.  This holds true even if the host
unmaps and remaps the same PFN -> GFN.  The TDX module also explicitly zeroes the
page when installing a new mapping.

And to prevent use-after-free, KVM must fully unmap a page if its corresponding
memslot is deleted, e.g. it can't trust userspace to remap the memory at the exact
gfn:pfn combo.

Without a new API to punch a hole, deleting private memslots to create two smaller,
discontiguous memslots would destroy the data in the two remaining/new memslots.

> > Pros (vs. struct page)
> > ======================
> > 
> > Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping.
> > 
> > Userspace page tables are not populated, e.g. reduced memory footprint, lower
> > probability of making private memory accessible to userspace.
> > 
> > Provides line of sight to supporting page migration and swap.
> > 
> > Provides line of sight to mapping MMIO pages into guest private memory.
> > 
> > Cons (vs. struct page)
> > ======================
> > 
> > Significantly more churn in KVM, e.g. to plumb 'private' through where needed,
> > support memslot hole punching, etc...
> > 
> > KVM's MMU gets another method of retrieving host pfn and page size.
> 
> And the method is provided by host kernel? How does this method work?

It's a function callback provided by the backing store.  The details of the
function would be likely specific to the backing store's implementation, e.g.
HugeTLBFS would likely have its own implementation.

> [...]
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a272ccbddfa1..771080235b2d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2896,6 +2896,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  	if (max_level == PG_LEVEL_4K)
> >  		return PG_LEVEL_4K;
> >  
> > +	if (memslot_is_private(slot))
> > +		return slot->private_ops->pfn_mapping_level(...);
> > +
> 
> Oh, any suggestion how host kernel decides the mapping level here?

That decision comes from the backing store.  E.g. HugeTLBFS would simply return
its static/configured size/level.
 
> >  	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> >  	return min(host_level, max_level);
> >  }
> > @@ -3835,9 +3838,11 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  
> >  static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
> >  {
> > -	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
> > +	struct kvm_memory_slot *slot;
> >  	bool async;
> >  
> > +	slot = __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, fault->private);
> > +
> >  	/*
> >  	 * Retry the page fault if the gfn hit a memslot that is being deleted
> >  	 * or moved.  This ensures any existing SPTEs for the old memslot will
> > @@ -3846,8 +3851,19 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> >  		goto out_retry;
> >  
> > +	/*
> > +	 * Exit to userspace to map the requested private/shared memory region
> > +	 * if there is no memslot and (a) the access is private or (b) there is
> > +	 * an existing private memslot.  Emulated MMIO must be accessed through
> > +	 * shared GPAs, thus a memslot miss on a private GPA is always handled
> > +	 * as an implicit conversion "request".
> > +	 */
> 
> For (b), do you mean this fault is for a GFN which marked as private, but now
> converted to a shared?

Yes.

> If true, could we just disallow it if no explict sharing hypercall is triggered?

Sadly, no.  Even if all hardware vendor specs _required_ such behavior, KVM has no
recourse but to exit to userspace because there's no way to communicate to the
guest that it accessed a "bad" address.  E.g. KVM can't inject exceptions in TDX,
and it can't touch guest register state to "return" an error code,  Not to mention
that the guest would be in some random flow accessing memory.

Happily for KVM, it's userspace's problem^Wdecision.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-24  0:52 [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory Sean Christopherson
  2021-08-24 10:48 ` Yu Zhang
@ 2021-08-26 10:15 ` David Hildenbrand
  2021-08-26 17:05   ` Andy Lutomirski
                     ` (2 more replies)
  2021-09-02 18:47 ` Kirill A. Shutemov
  2 siblings, 3 replies; 71+ messages in thread
From: David Hildenbrand @ 2021-08-26 10:15 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Borislav Petkov, Andy Lutomirski, Andrew Morton,
	Joerg Roedel, Andi Kleen, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Varad Gautam, Dario Faggioli, x86, linux-mm, linux-coco,
	Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, Dave Hansen, Yu Zhang

On 24.08.21 02:52, Sean Christopherson wrote:
> The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the
> game, on an acceptable direction for supporting guest private memory, e.g. for
> Intel's TDX.  The TDX architectural effectively allows KVM guests to crash the
> host if guest private memory is accessible to host userspace, and thus does not
> play nice with KVM's existing approach of pulling the pfn and mapping level from
> the host page tables.
> 
> This is by no means a complete patch; it's a rough sketch of the KVM changes that
> would be needed.  The kernel side of things is completely omitted from the patch;
> the design concept is below.
> 
> There's also fair bit of hand waving on implementation details that shouldn't
> fundamentally change the overall ABI, e.g. how the backing store will ensure
> there are no mappings when "converting" to guest private.
> 

This is a lot of complexity and rather advanced approaches (not saying 
they are bad, just that we try to teach the whole stack something 
completely new).


What I think would really help is a list of requirements, such that 
everybody is aware of what we actually want to achieve. Let me start:

GFN: Guest Frame Number
EPFN: Encrypted Physical Frame Number


1) An EPFN must not get mapped into more than one VM: it belongs exactly 
to one VM. It must neither be shared between VMs between processes nor 
between VMs within a processes.


2) User space (well, and actually the kernel) must never access an EPFN:

- If we go for an fd, essentially all operations (read/write) have to
   fail.
- If we have to map an EPFN into user space page tables (e.g., to
   simplify KVM), we could only allow fake swap entries such that "there
   is something" but it cannot be  accessed and is flagged accordingly.
- /proc/kcore and friends have to be careful as well and should not read
   this memory. So there has to be a way to flag these pages.

3) We need a way to express the GFN<->EPFN mapping and essentially 
assign an EPFN to a GFN.


4) Once we assigned a EPFN to a GFN, that assignment must not longer 
change. Further, an EPFN must not get assigned to multiple GFNs.


5) There has to be a way to "replace" encrypted parts by "shared" parts
    and the other way around.

What else?



> Background
> ==========
> 
> This is a loose continuation of Kirill's RFC[*] to support TDX guest private
> memory by tracking guest memory at the 'struct page' level.  This proposal is the
> result of several offline discussions that were prompted by Andy Lutomirksi's
> concerns with tracking via 'struct page':
> 
>    1. The kernel wouldn't easily be able to enforce a 1:1 page:guest association,
>       let alone a 1:1 pfn:gfn mapping.

Well, it could with some help on higher layers. Someone has to do the 
tracking. Marking EPFNs as EPFNs can actually be very helpful,  e.g., 
allow /proc/kcore to just not touch such pages. If we want to do all the 
tracking in the struct page is a different story.

> 
>    2. Does not work for memory that isn't backed by 'struct page', e.g. if devices
>       gain support for exposing encrypted memory regions to guests.

Let's keep it simple. If a struct page is right now what we need to 
properly track it, so be it. If not, good. But let's not make this a 
requirement right from the start if it's stuff for the far future.

> 
>    3. Does not help march toward page migration or swap support (though it doesn't
>       hurt either).

"Does not help towards world peace, (though it doesn't hurt either)".

Maybe let's ignore that for now, as it doesn't seem to be required to 
get something reasonable running.

> 
> [*] https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com
> 
> Concept
> =======
> 
> Guest private memory must be backed by an "enlightened" file descriptor, where
> "enlightened" means the implementing subsystem supports a one-way "conversion" to
> guest private memory and provides bi-directional hooks to communicate directly
> with KVM.  Creating a private fd doesn't necessarily have to be a conversion, e.g. it
> could also be a flag provided at file creation, a property of the file system itself,
> etc...

Doesn't sound too crazy. Maybe even introducing memfd_encrypted() if 
extending the other ones turns out too complicated.

> 
> Before a private fd can be mapped into a KVM guest, it must be paired 1:1 with a
> KVM guest, i.e. multiple guests cannot share a fd.  At pairing, KVM and the fd's
> subsystem exchange a set of function pointers to allow KVM to call into the subsystem,
> e.g. to translate gfn->pfn, and vice versa to allow the subsystem to call into KVM,
> e.g. to invalidate/move/swap a gfn range.
> 
> Mapping a private fd in host userspace is disallowed, i.e. there is never a host
> virtual address associated with the fd and thus no userspace page tables pointing
> at the private memory.

To keep the primary vs. secondary MMU thing working, I think it would 
actually be nice to go with special swap entries instead; it just keeps 
most things working as expected. But let's see where we end up.

> 
> Pinning _from KVM_ is not required.  If the backing store supports page migration
> and/or swap, it can query the KVM-provided function pointers to see if KVM supports
> the operation.  If the operation is not supported (this will be the case initially
> in KVM), the backing store is responsible for ensuring correct functionality.
> 
> Unmapping guest memory, e.g. to prevent use-after-free, is handled via a callback
> from the backing store to KVM.  KVM will employ techniques similar to those it uses
> for mmu_notifiers to ensure the guest cannot access freed memory.
> 
> A key point is that, unlike similar failed proposals of the past, e.g. /dev/mktme,
> existing backing stores can be englightened, a from-scratch implementations is not
> required (though would obviously be possible as well).

Right. But if it's just a bad fit, let's do something new. Just like we 
did with memfd_secret.

> 
> One idea for extending existing backing stores, e.g. HugeTLBFS and tmpfs, is
> to add F_SEAL_GUEST, which would convert the entire file to guest private memory
> and either fail if the current size is non-zero or truncate the size to zero.

While possible, I actually do have the feeling that we want eventually 
to have something new, as the semantics are just too different. But 
let's see.


> KVM
> ===
> 
> Guest private memory is managed as a new address space, i.e. as a different set of
> memslots, similar to how KVM has a separate memory view for when a guest vCPU is
> executing in virtual SMM.  SMM is mutually exclusive with guest private memory.
> 
> The fd (the actual integer) is provided to KVM when a private memslot is added
> via KVM_SET_USER_MEMORY_REGION.  This is when the aforementioned pairing occurs.
> 
> By default, KVM memslot lookups will be "shared", only specific touchpoints will
> be modified to work with private memslots, e.g. guest page faults.  All host
> accesses to guest memory, e.g. for emulation, will thus look for shared memory
> and naturally fail without attempting copy_to/from_user() if the guest attempts
> to coerce KVM into access private memory.  Note, avoiding copy_to/from_user() and
> friends isn't strictly necessary, it's more of a happy side effect.
> 
> A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in vcpu->run
> is added to propagate illegal accesses (see above) and implicit conversions
> to userspace (see below).  Note, the new exit reason + struct can also be to
> support several other feature requests in KVM[1][2].
> 
> The guest may explicitly or implicity request KVM to map a shared/private variant
> of a GFN.  An explicit map request is done via hypercall (out of scope for this
> proposal as both TDX and SNP ABIs define such a hypercall).  An implicit map request
> is triggered simply by the guest accessing the shared/private variant, which KVM
> sees as a guest page fault (EPT violation or #NPF).  Ideally only explicit requests
> would be supported, but neither TDX nor SNP require this in their guest<->host ABIs.
> 
> For implicit or explicit mappings, if a memslot is found that fully covers the
> requested range (which is a single gfn for implicit mappings), KVM's normal guest
> page fault handling works with minimal modification.
> 
> If a memslot is not found, for explicit mappings, KVM will exit to userspace with
> the aforementioned dedicated exit reason.  For implict _private_ mappings, KVM will
> also immediately exit with the same dedicated reason.  For implicit shared mappings,
> an additional check is required to differentiate between emulated MMIO and an
> implicit private->shared conversion[*].  If there is an existing private memslot
> for the gfn, KVM will exit to userspace, otherwise KVM will treat the access as an
> emulated MMIO access and handle the page fault accordingly.

Do you mean some kind of overlay. "Ordinary" user memory regions overlay 
"private user memory regions"? So when marking something shared, you'd 
leave the private user memory region alone and only create a new 
"ordinary"user memory regions that references shared memory in QEMU 
(IOW, a different mapping)?

Reading below, I think you were not actually thinking about an overlay, 
but maybe overlays might actually be a nice concept to have instead.


> Punching Holes
> ==============
> 
> The expected userspace memory model is that mapping requests will be handled as
> conversions, e.g. on a shared mapping request, first unmap the private gfn range,
> then map the shared gfn range.  A new KVM ioctl() will likely be needed to allow
> userspace to punch a hole in a memslot, as expressing such an operation isn't
> possible with KVM_SET_USER_MEMORY_REGION.  While userspace could delete the
> memslot, then recreate three new memslots, doing so would be destructive to guest
> data as unmapping guest private memory (from the EPT/NPT tables) is destructive
> to the data for both TDX and SEV-SNP guests.

If you'd treat it like an overlay, you'd not actually be punching holes. 
You'd only be creating/removing ordinary user memory regions when 
marking something shared/unshared.

> 
> Pros (vs. struct page)
> ======================
> 
> Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping.
> 
> Userspace page tables are not populated, e.g. reduced memory footprint, lower
> probability of making private memory accessible to userspace.

Agreed to the first part, although I consider that a secondary concern. 
The second part, I'm not sure if that is really the case. Fake swap 
entries are just a marker.

> 
> Provides line of sight to supporting page migration and swap.

Again, let's leave that out for now. I think that's an kernel internal 
that will require quite some thought either way.

> 
> Provides line of sight to mapping MMIO pages into guest private memory.

That's an interesting thought. Would it work via overlays as well? Can 
you elaborate?

> 
> Cons (vs. struct page)
> ======================
> 
> Significantly more churn in KVM, e.g. to plumb 'private' through where needed,
> support memslot hole punching, etc...
> 
> KVM's MMU gets another method of retrieving host pfn and page size.
> 
> Requires enabling in every backing store that someone wants to support.

I think we will only care about anonymous memory eventually with 
huge/gigantic pages in the next years. Just to what memfd() is already 
limited. File-backed -- I don't know ... if at all, swapping ... in a 
couple of years ...

> 
> Because the NUMA APIs work on virtual addresses, new syscalls fmove_pages(),
> fbind(), etc... would be required to provide equivalents to existing NUMA
> functionality (though those syscalls would likely be useful irrespective of guest
> private memory).

Right, that's because we don't have a VMA that describes all this. E.g., 
mbind().

> 
> Washes (vs. struct page)
> ========================
> 
> A misbehaving guest that triggers a large number of shared memory mappings will
> consume a large number of memslots.  But, this is likely a wash as similar effect
> would happen with VMAs in the struct page approach.

Just cap it then to something sane. 32k which we have right now is crazy 
and only required in very special setups. You can just make QEMU 
override/set the KVM default.





My wild idea after reading everything so far (full of flaws, just want 
to mention it, maybe it gives some ideas):

Introduce memfd_encrypted().

Similar to like memfd_secret()
- Most system calls will just fail.
- Allow MAP_SHARED only.
- Enforce VM_DONTDUMP and skip during fork().
- File size can change exactly once, before any mmap. (IIRC)

Different to memfd_secret(), allow mapping each page of the fd exactly 
one time via mmap() into a single process.

The simplest way would be requiring that only the whole file can be 
mmaped, and that can happen exactly once. memremap() and friends will 
fail. Splitting the VMA will fail. munmap()/mmap(MAP_FIXED) will fail. 
You'll have it in a single process mapped ever only and persistent. 
Unmap will only work when tearing down the MM.

Hole punching via fallocate() has to be evaluated (below).


You'll end up with a VMA that corresponds to the whole file in a single 
process only, and that cannot vanish, not even in parts.

Define "ordinary" user memory slots as overlay on top of "encrypted" 
memory slots.  Inside KVM, bail out if you encounter such a VMA inside a 
normal user memory slot. When creating a "encryped" user memory slot, 
require that the whole VMA is covered at creation time. You know the VMA 
can't change later.

In QEMU, allocate for each RAMBlock a MAP_PRIVATE memfd_encrypted() and 
a MAP_SHARED memfd(). Make QEMU and other processes always access the 
MAP_SHARED part only. Initially, create "encrypted" user memory regions 
in KVM when defining the RAM layout, disallowing changes. Define the 
MAP_SHARED memfd() overlay user memory regions in KVM depending on the 
shared/private state.

In the actual memory backend, flag all new allocated pages as 
PG_encrypted. Pages can be faulted into a process page table via a new 
fake swap entry, where KVM can look them up via a special GUP flag, as 
Kirill suggested. If access via user space or another GUP user, SIGBUS 
instead of converting the special swap entry.

Allow only a single encrypted VM per process ever in KVM for now.
All that handshake regarding swapping/migration ... can be handled 
internally if ever required.

Memory hotplug: should not be an issue.
Memory hotunplug would require some thought -- maybe 
fallcoate(PUNCHHOLE) will do.

Reducing memory consumption: With MADV_DONTNEED/fallocate(punchole) we 
can reclaim memory at least within the shared memory part when switching 
back and forth. Maybe even in the MAP_PRIVATE/memfd_encrypted() part 
when marking something shared. TBD.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-26  0:35   ` Sean Christopherson
@ 2021-08-26 13:23     ` Yu Zhang
  0 siblings, 0 replies; 71+ messages in thread
From: Yu Zhang @ 2021-08-26 13:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen,
	Isaku Yamahata

Thanks a lot for your explaination, Sean.
Still many questions below. :)

On Thu, Aug 26, 2021 at 12:35:21AM +0000, Sean Christopherson wrote:
> On Tue, Aug 24, 2021, Yu Zhang wrote:
> > On Mon, Aug 23, 2021 at 05:52:48PM -0700, Sean Christopherson wrote:
> > 
> > Thanks a lot for sharing these ideas. Lots of questions are inlined below. :)
> > 
> > > The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the
> > > game, on an acceptable direction for supporting guest private memory, e.g. for
> > > Intel's TDX.  The TDX architectural effectively allows KVM guests to crash the
> > > host if guest private memory is accessible to host userspace, and thus does not
> > 
> > What about incorrect/malicious accesses from host kernel? Should the direct mapping
> > also be removed for guest private memory?
> 
> I would say that's out of scope for this RFC as it should not require any
> coordination between KVM and MM.

So Linux MM still has choice to unmap the private memory in direct mapping, right?

> 
> > > play nice with KVM's existing approach of pulling the pfn and mapping level from
> > > the host page tables.
> > > 
> > > This is by no means a complete patch; it's a rough sketch of the KVM changes that
> > > would be needed.  The kernel side of things is completely omitted from the patch;
> > > the design concept is below.
> > > 
> > > There's also fair bit of hand waving on implementation details that shouldn't
> > > fundamentally change the overall ABI, e.g. how the backing store will ensure
> > > there are no mappings when "converting" to guest private.
> > > 
> > > Background
> > > ==========
> > > 
> > > This is a loose continuation of Kirill's RFC[*] to support TDX guest private
> > > memory by tracking guest memory at the 'struct page' level.  This proposal is the
> > > result of several offline discussions that were prompted by Andy Lutomirksi's
> > > concerns with tracking via 'struct page':
> > > 
> > >   1. The kernel wouldn't easily be able to enforce a 1:1 page:guest association,
> > >      let alone a 1:1 pfn:gfn mapping.
> > 
> > May I ask why? Doesn't FOLL_GUEST in Kirill's earlier patch work? Or just
> > because traversing the host PT to get a PFN(for a PageGuest(page)) is too
> > heavy?
> 
> To ensure a 1:1 page:guest, 'struct page' would need to track enough information
> to uniquely identify which guest owns the page.  With TDX, one can argue that the
> kernel can rely on the TDX-Module to do that tracking (I argued this :-)), but
> for SEV and pure software implementations there is no third party that's tracking
> who owns what.  In other words, without stashing an identifier in 'struct page',
> the kernel would only know that _a_ guest owns the page, it couldn't identify which
> guest owns the page.  And allocating that much space in 'struct page' would be
> painfully expensive.

So it's to make sure a private page can only be assigned to one guest.
I thought PAMT in TDX and RMP in SEV-SNP can do this, is this understanding
correct?

But for pure software implementations, it's a problem indeed.

And why would using a specific fd can achieve this? I saw requirement
to only bind the fd to one guest, but is this enough to guarantee the
1:1 page:guest binding?

> 
> 1:1 pfn:gfn is even worse.  E.g. if userspace maps the same file as MAP_GUEST at
> multiple host virtual adddress, then configures KVM's memslots to have multiple
> regions, one for each alias, the guest will end up with 1:N pfn:gfn mappings.
> Preventing that would be nigh impossible without ending up with a plethora of
> races.

You mean different memslots(in one address space) may contain the same
gfn? IIUC, gfn overlap is not allowed: kvm_set_memory_region() will just
return -EEXIST.

> 
> > >   2. Does not work for memory that isn't backed by 'struct page', e.g. if devices
> > >      gain support for exposing encrypted memory regions to guests.
> > 
> > Do you mean that a page not backed by 'struct page' might be mapped to other
> > user space? I thought the VM_GUEST flags for the VMA could prevent that(though
> > I may possiblely be wrong). Could you explain more? Thanks!
> 
> It's about enabling, not preventing.  If in the future there is some form of memory
> that is not backed by 'struct page' (CXL memory?), that can also be mapped into
> a protected KVM guest, relying on 'struct page' to ensure that only the owning KVM
> guest has access to that memory would not work.

Oh. Keeping the page owner info in 'struct page' is not a good choice.
But still, my quesions are:

1> If TDX module/ SEV-SNP can do this, do we still need to enforce a 1:1
page:guest association in host kernel - if we are not so ambitious to also
make this work as a pure software design? :-)

2> Besides, could you please explain how this design achieves the 1:1
association? 

> 
> > >   3. Does not help march toward page migration or swap support (though it doesn't
> > >      hurt either).
> > > 
> > > [*] https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com
> > > 
> > > Concept
> > > =======
> > > 
> > > Guest private memory must be backed by an "enlightened" file descriptor, where
> > > "enlightened" means the implementing subsystem supports a one-way "conversion" to
> > > guest private memory and provides bi-directional hooks to communicate directly
> > > with KVM.  Creating a private fd doesn't necessarily have to be a conversion, e.g. it
> > > could also be a flag provided at file creation, a property of the file system itself,
> > > etc...
> > > 
> > > Before a private fd can be mapped into a KVM guest, it must be paired 1:1 with a
> > > KVM guest, i.e. multiple guests cannot share a fd.  At pairing, KVM and the fd's
> > > subsystem exchange a set of function pointers to allow KVM to call into the subsystem,
> > > e.g. to translate gfn->pfn, and vice versa to allow the subsystem to call into KVM,
> > > e.g. to invalidate/move/swap a gfn range.
> > 
> > So the gfn->pfn translation is done by the fd's subsystem? Again, could you
> > please elaborate how?
> 
> Rats, I meant to capture this in the "patch" and did not.  The memslot would hold
> the base offset into the file instead of a host virtual address.  I.e. instead of
> gfn -> hva -> pfn, it would roughly be gfn -> offset -> pfn.
> 

And the offset just equals gfn?

> > And each private memory region would need a seperate group of callbacks?
> 
> Each private memslot would have its own pointer to a virtual function table, but
> there would only be a single table per backing store implementation.
> 

Hmm. So we can have various backing stores for different memslots? And Qemu would
inform KVM about this for each memslot?

> > > Mapping a private fd in host userspace is disallowed, i.e. there is never a host
> > > virtual address associated with the fd and thus no userspace page tables pointing
> > > at the private memory.
> > > 
> > > Pinning _from KVM_ is not required.  If the backing store supports page migration
> > > and/or swap, it can query the KVM-provided function pointers to see if KVM supports
> > > the operation.  If the operation is not supported (this will be the case initially
> > > in KVM), the backing store is responsible for ensuring correct functionality.
> > > 
> > > Unmapping guest memory, e.g. to prevent use-after-free, is handled via a callback
> > > from the backing store to KVM.  KVM will employ techniques similar to those it uses
> > > for mmu_notifiers to ensure the guest cannot access freed memory.
> > > 
> > > A key point is that, unlike similar failed proposals of the past, e.g. /dev/mktme,
> > > existing backing stores can be englightened, a from-scratch implementations is not
> > > required (though would obviously be possible as well).
> > > 
> > > One idea for extending existing backing stores, e.g. HugeTLBFS and tmpfs, is
> > > to add F_SEAL_GUEST, which would convert the entire file to guest private memory
> > > and either fail if the current size is non-zero or truncate the size to zero.
> > 
> > Have you discussed memfd_secret(if host direct mapping is also to be removed)? 
> 
> No.  From a userspace perspective, the two would be mutually exclusive
> 

Sorry? By "mutually exclusive", do you mean we can NOT remove both userspace
mappings and kernel direct mappings at the same time?

> > And how does this F_SEAL_GUEST work?
> 
> Are you asking about semantics or implementation?  If it's implementation, that's
> firmly in the handwaving part :-)  Semantically, once F_SEAL_GUEST is set it can't
> be removed, i.e. the file is forever "guest private".  In a way it's destructive,
> e.g. the host can't ever read out the memory, but that's really just a reflection
> of the hardware, e.g. even with SEV, the host can read the memory but it can never
> decrypt the memory.
> 

Acctually, I am asking both. :-)

E.g., what do we expect of this fd. For now, my understandings are:
1> It's a dedicated fd, and can not be shared between different processes.
2> mmap() shall fail on this fd.
3> With F_SEAL_GUEST, size of this file is set to 0, or need be truncated
   to 0.

But when should this fd be created?
What operations do we expect this fd to offer? 
And how does this fd function as a channel between MM and KVM?

> > > 
> > > KVM
> > > ===
> > > 
> > > Guest private memory is managed as a new address space, i.e. as a different set of
> > > memslots, similar to how KVM has a separate memory view for when a guest vCPU is
> > > executing in virtual SMM.  SMM is mutually exclusive with guest private memory.
> > > 
> > > The fd (the actual integer) is provided to KVM when a private memslot is added
> > > via KVM_SET_USER_MEMORY_REGION.  This is when the aforementioned pairing occurs.
> > 
> > My understanding of KVM_SET_USER_MEMORY_REGION is that, this ioctl is to
> > facilitate the binding of HVA and GPA ranges. But if there's no HVAs for
> > a private region at all, why do we need a memslot for it? Besides to keep
> > track of the private GFN ranges, and provide the callbacks, is there any
> > other reason?
> 
> The short answer is that something in KVM needs to translate gfn->pfn.  Using just
> the gfn isn't feasible because there's no anchor (see above regarding the base file
> offset).  And even if the offset weren't needed, KVM still needs a lot of the same
> metadata, e.g. some day there may be read-only private memsots, dirty logging of
> private memslots, etc...  Implementing something else entirely would also require
> an absurd amount of code churn as all of KVM revolves around gfn -> slot -> pfn.
> 

Yes. Thanks!

> > Another question is: why do we need a whole new address space, instead of
> > one address space accommodating memslot types?
> 
> Either way could be made to work, and there isn't thaaat much code difference.  My
> initial thoughts were to use a flag, but there are some niceties that a separate
> address space provides (more below).
> 
> > > By default, KVM memslot lookups will be "shared", only specific touchpoints will
> > > be modified to work with private memslots, e.g. guest page faults.  All host
> > > accesses to guest memory, e.g. for emulation, will thus look for shared memory
> > > and naturally fail without attempting copy_to/from_user() if the guest attempts
> > 
> > Becasue gfn_to_hva() will fail first?
> 
> Yes.  More precisely, gfn_to_hva() will fail because there is no memslot for the
> current address space (shared).  This is one advantage over a flag, e.g. there's no
> need to check a flag after getting a memslot.  It's somewhat superficial as it
> wouldn't be too difficult to add low level helpers to default to "shared", but there
> are multiple instances of those types of patterns, so I do hope/think it will yield
> to cleaner code.
> 
> > > to coerce KVM into access private memory.  Note, avoiding copy_to/from_user() and
> > > friends isn't strictly necessary, it's more of a happy side effect.
> > > 
> > > A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in vcpu->run
> > > is added to propagate illegal accesses (see above) and implicit conversions
> > 
> > Sorry, illegal accesses from VM?
> 
> Illegal accesses from the host, e.g. attempting to read/write guest private memory
> via gfn_to_hva().
> 

Sorry, I do not get it. Without a valid HVA, how would the host perform
such illegal accesses?

> > Do you actually mean a KVM page fault caused by private access from VM, which
> > implicitly notifies KVM to mark it as private(e.g. by bouncing to Qemu, which
> > then creates a private memory region and ioctls into KVM)?
> > 
> > If the answer is yes, how about naming the exit reason as KVM_EXIT_MEMORY_PRIVATE?
> > Meanwhile, is Qemu also supposed to invoke some system call into host kernel
> > before ioctls into KVM? I'm still confused where the kernel callbacks like
> > the gfn_to_pfn() come from(and how they function)... :)
> 
> I would like the exit to be generic so that it can be reused for other, completely
> unrelated (yet similar) scenarios (see links below).
> 
> > > to userspace (see below).  Note, the new exit reason + struct can also be to
> > > support several other feature requests in KVM[1][2].
> > > 
> > > The guest may explicitly or implicity request KVM to map a shared/private variant
> > > of a GFN.  An explicit map request is done via hypercall (out of scope for this
> > > proposal as both TDX and SNP ABIs define such a hypercall).  An implicit map request
> > > is triggered simply by the guest accessing the shared/private variant, which KVM
> > > sees as a guest page fault (EPT violation or #NPF).  Ideally only explicit requests
> > > would be supported, but neither TDX nor SNP require this in their guest<->host ABIs.
> > 
> > Well, I am wondering, should we assume all guest pages as shared or private by
> > default? I mean, if all guest pages are private when the VM is created, maybe
> > the private memslots can be initialized in VM creation time, and be deleted/splited
> > later(e.g. in response to guest sharing  hypercalls)?
> 
> Define "we".  I absolutely expect the userspace VMM to assume all guest pages are
> private by default.  But I feel very strongly that KVM should not be involved in
> the control logic for decided when to convert a given page between shared and
> private.  Thus, deciding the default state is not KVM's responsibility.
> 

Could qemu just tell KVM this is a protected VM when creating it?

For runtime conversion, KVM can handle hypercalls from guest, and forward to Qemu,
just like what Kirill did in previous RFC patches.

> > It may simplify the logic, but may also restrict the VM type(e.g. to be TD guest).
> > 
> > > 
> > > For implicit or explicit mappings, if a memslot is found that fully covers the
> > > requested range (which is a single gfn for implicit mappings), KVM's normal guest
> > > page fault handling works with minimal modification.
> > > 
> > > If a memslot is not found, for explicit mappings, KVM will exit to userspace with
> > > the aforementioned dedicated exit reason.  For implict _private_ mappings, KVM will
> > > also immediately exit with the same dedicated reason.  For implicit shared mappings,
> > > an additional check is required to differentiate between emulated MMIO and an
> > > implicit private->shared conversion[*].  If there is an existing private memslot
> > > for the gfn, KVM will exit to userspace, otherwise KVM will treat the access as an
> > > emulated MMIO access and handle the page fault accordingly.
> > > 
> > > [1] https://lkml.kernel.org/r/YKxJLcg/WomPE422@google.com
> > > [2] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
> > > 
> > > Punching Holes
> > > ==============
> > > 
> > > The expected userspace memory model is that mapping requests will be handled as
> > > conversions, e.g. on a shared mapping request, first unmap the private gfn range,
> > > then map the shared gfn range.  A new KVM ioctl() will likely be needed to allow
> > > userspace to punch a hole in a memslot, as expressing such an operation isn't
> > > possible with KVM_SET_USER_MEMORY_REGION.  While userspace could delete the
> > > memslot, then recreate three new memslots, doing so would be destructive to guest
> > > data as unmapping guest private memory (from the EPT/NPT tables) is destructive
> > > to the data for both TDX and SEV-SNP guests.
> > 
> > May I ask why? Thanks!
> 
> Hmm, for SNP it might not actually be destructive, I forget the exact flows for
> unmapping memory.
> 
> Anyways, for TDX it's most certainly destructive.  When mapping private pages into
> the guest (ignore pre-boot), the guest must accept a page before accessing the page,
> as a #VE will occur if the page is not accepted.  This holds true even if the host
> unmaps and remaps the same PFN -> GFN.  The TDX module also explicitly zeroes the
> page when installing a new mapping.
> 
> And to prevent use-after-free, KVM must fully unmap a page if its corresponding
> memslot is deleted, e.g. it can't trust userspace to remap the memory at the exact
> gfn:pfn combo.
> 
> Without a new API to punch a hole, deleting private memslots to create two smaller,
> discontiguous memslots would destroy the data in the two remaining/new memslots.
> 

So that's because deleting a memslot will inevitably zeros guest private pages?

> > > Pros (vs. struct page)
> > > ======================
> > > 
> > > Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping.
> > > 
> > > Userspace page tables are not populated, e.g. reduced memory footprint, lower
> > > probability of making private memory accessible to userspace.
> > > 
> > > Provides line of sight to supporting page migration and swap.
> > > 
> > > Provides line of sight to mapping MMIO pages into guest private memory.
> > > 
> > > Cons (vs. struct page)
> > > ======================
> > > 
> > > Significantly more churn in KVM, e.g. to plumb 'private' through where needed,
> > > support memslot hole punching, etc...
> > > 
> > > KVM's MMU gets another method of retrieving host pfn and page size.
> > 
> > And the method is provided by host kernel? How does this method work?
> 
> It's a function callback provided by the backing store.  The details of the
> function would be likely specific to the backing store's implementation, e.g.
> HugeTLBFS would likely have its own implementation.
> 

HugeTLBFS is a nice-to-have, not a must, right?
Is there any other backing store we need support first?

> > [...]
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index a272ccbddfa1..771080235b2d 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -2896,6 +2896,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > >  	if (max_level == PG_LEVEL_4K)
> > >  		return PG_LEVEL_4K;
> > >  
> > > +	if (memslot_is_private(slot))
> > > +		return slot->private_ops->pfn_mapping_level(...);
> > > +
> > 
> > Oh, any suggestion how host kernel decides the mapping level here?
> 
> That decision comes from the backing store.  E.g. HugeTLBFS would simply return
> its static/configured size/level.
>  
> > >  	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> > >  	return min(host_level, max_level);
> > >  }
> > > @@ -3835,9 +3838,11 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >  
> > >  static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
> > >  {
> > > -	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
> > > +	struct kvm_memory_slot *slot;
> > >  	bool async;
> > >  
> > > +	slot = __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, fault->private);
> > > +
> > >  	/*
> > >  	 * Retry the page fault if the gfn hit a memslot that is being deleted
> > >  	 * or moved.  This ensures any existing SPTEs for the old memslot will
> > > @@ -3846,8 +3851,19 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >  	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> > >  		goto out_retry;
> > >  
> > > +	/*
> > > +	 * Exit to userspace to map the requested private/shared memory region
> > > +	 * if there is no memslot and (a) the access is private or (b) there is
> > > +	 * an existing private memslot.  Emulated MMIO must be accessed through
> > > +	 * shared GPAs, thus a memslot miss on a private GPA is always handled
> > > +	 * as an implicit conversion "request".
> > > +	 */
> > 
> > For (b), do you mean this fault is for a GFN which marked as private, but now
> > converted to a shared?
> 
> Yes.
> 
> > If true, could we just disallow it if no explict sharing hypercall is triggered?
> 
> Sadly, no.  Even if all hardware vendor specs _required_ such behavior, KVM has no
> recourse but to exit to userspace because there's no way to communicate to the
> guest that it accessed a "bad" address.  E.g. KVM can't inject exceptions in TDX,
> and it can't touch guest register state to "return" an error code,  Not to mention
> that the guest would be in some random flow accessing memory.

So, guest can just use a shared GPA(with shared-bit set) directly, without the
need to explicitly ask KVM to convert the private one?

> 
> Happily for KVM, it's userspace's problem^Wdecision.
> 

And with KVM_EXIT_MAP_MEMORY returned, qemu should trigger a new ioctl into KVM,
which puches holes in KVM private address space's memslots? 

B.R.
Yu

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-26 10:15 ` David Hildenbrand
@ 2021-08-26 17:05   ` Andy Lutomirski
  2021-08-26 21:26     ` David Hildenbrand
  2021-08-27  2:31   ` Yu Zhang
  2021-08-27 22:18   ` Sean Christopherson
  2 siblings, 1 reply; 71+ messages in thread
From: Andy Lutomirski @ 2021-08-26 17:05 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Borislav Petkov, Andrew Morton, Joerg Roedel,
	Andi Kleen, David Rientjes, Vlastimil Babka, Tom Lendacky,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, Kirill A . Shutemov,
	Kirill A . Shutemov, Kuppuswamy Sathyanarayanan, Dave Hansen,
	Yu Zhang

On 8/26/21 3:15 AM, David Hildenbrand wrote:
> On 24.08.21 02:52, Sean Christopherson wrote:
>> The goal of this RFC is to try and align KVM, mm, and anyone else with
>> skin in the
>> game, on an acceptable direction for supporting guest private memory,
>> e.g. for
>> Intel's TDX.  The TDX architectural effectively allows KVM guests to
>> crash the
>> host if guest private memory is accessible to host userspace, and thus
>> does not
>> play nice with KVM's existing approach of pulling the pfn and mapping
>> level from
>> the host page tables.
>>
>> This is by no means a complete patch; it's a rough sketch of the KVM
>> changes that
>> would be needed.  The kernel side of things is completely omitted from
>> the patch;
>> the design concept is below.
>>
>> There's also fair bit of hand waving on implementation details that
>> shouldn't
>> fundamentally change the overall ABI, e.g. how the backing store will
>> ensure
>> there are no mappings when "converting" to guest private.
>>
> 
> This is a lot of complexity and rather advanced approaches (not saying
> they are bad, just that we try to teach the whole stack something
> completely new).
> 
> 
> What I think would really help is a list of requirements, such that
> everybody is aware of what we actually want to achieve. Let me start:
> 
> GFN: Guest Frame Number
> EPFN: Encrypted Physical Frame Number
> 
> 
> 1) An EPFN must not get mapped into more than one VM: it belongs exactly
> to one VM. It must neither be shared between VMs between processes nor
> between VMs within a processes.
> 
> 
> 2) User space (well, and actually the kernel) must never access an EPFN:
> 
> - If we go for an fd, essentially all operations (read/write) have to
>   fail.
> - If we have to map an EPFN into user space page tables (e.g., to
>   simplify KVM), we could only allow fake swap entries such that "there
>   is something" but it cannot be  accessed and is flagged accordingly.
> - /proc/kcore and friends have to be careful as well and should not read
>   this memory. So there has to be a way to flag these pages.
> 
> 3) We need a way to express the GFN<->EPFN mapping and essentially
> assign an EPFN to a GFN.
> 
> 
> 4) Once we assigned a EPFN to a GFN, that assignment must not longer
> change. Further, an EPFN must not get assigned to multiple GFNs.
> 
> 
> 5) There has to be a way to "replace" encrypted parts by "shared" parts
>    and the other way around.
> 
> What else?
> 
> 
> 
>> Background
>> ==========
>>
>> This is a loose continuation of Kirill's RFC[*] to support TDX guest
>> private
>> memory by tracking guest memory at the 'struct page' level.  This
>> proposal is the
>> result of several offline discussions that were prompted by Andy
>> Lutomirksi's
>> concerns with tracking via 'struct page':
>>
>>    1. The kernel wouldn't easily be able to enforce a 1:1 page:guest
>> association,
>>       let alone a 1:1 pfn:gfn mapping.
> 
> Well, it could with some help on higher layers. Someone has to do the
> tracking. Marking EPFNs as EPFNs can actually be very helpful,  e.g.,
> allow /proc/kcore to just not touch such pages. If we want to do all the
> tracking in the struct page is a different story.
> 
>>
>>    2. Does not work for memory that isn't backed by 'struct page',
>> e.g. if devices
>>       gain support for exposing encrypted memory regions to guests.
> 
> Let's keep it simple. If a struct page is right now what we need to
> properly track it, so be it. If not, good. But let's not make this a
> requirement right from the start if it's stuff for the far future.
> 
>>
>>    3. Does not help march toward page migration or swap support
>> (though it doesn't
>>       hurt either).
> 
> "Does not help towards world peace, (though it doesn't hurt either)".
> 
> Maybe let's ignore that for now, as it doesn't seem to be required to
> get something reasonable running.
> 
>>
>> [*]
>> https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com
>>
>>
>> Concept
>> =======
>>
>> Guest private memory must be backed by an "enlightened" file
>> descriptor, where
>> "enlightened" means the implementing subsystem supports a one-way
>> "conversion" to
>> guest private memory and provides bi-directional hooks to communicate
>> directly
>> with KVM.  Creating a private fd doesn't necessarily have to be a
>> conversion, e.g. it
>> could also be a flag provided at file creation, a property of the file
>> system itself,
>> etc...
> 
> Doesn't sound too crazy. Maybe even introducing memfd_encrypted() if
> extending the other ones turns out too complicated.
> 
>>
>> Before a private fd can be mapped into a KVM guest, it must be paired
>> 1:1 with a
>> KVM guest, i.e. multiple guests cannot share a fd.  At pairing, KVM
>> and the fd's
>> subsystem exchange a set of function pointers to allow KVM to call
>> into the subsystem,
>> e.g. to translate gfn->pfn, and vice versa to allow the subsystem to
>> call into KVM,
>> e.g. to invalidate/move/swap a gfn range.
>>
>> Mapping a private fd in host userspace is disallowed, i.e. there is
>> never a host
>> virtual address associated with the fd and thus no userspace page
>> tables pointing
>> at the private memory.
> 
> To keep the primary vs. secondary MMU thing working, I think it would
> actually be nice to go with special swap entries instead; it just keeps
> most things working as expected. But let's see where we end up.
> 
>>
>> Pinning _from KVM_ is not required.  If the backing store supports
>> page migration
>> and/or swap, it can query the KVM-provided function pointers to see if
>> KVM supports
>> the operation.  If the operation is not supported (this will be the
>> case initially
>> in KVM), the backing store is responsible for ensuring correct
>> functionality.
>>
>> Unmapping guest memory, e.g. to prevent use-after-free, is handled via
>> a callback
>> from the backing store to KVM.  KVM will employ techniques similar to
>> those it uses
>> for mmu_notifiers to ensure the guest cannot access freed memory.
>>
>> A key point is that, unlike similar failed proposals of the past, e.g.
>> /dev/mktme,
>> existing backing stores can be englightened, a from-scratch
>> implementations is not
>> required (though would obviously be possible as well).
> 
> Right. But if it's just a bad fit, let's do something new. Just like we
> did with memfd_secret.
> 
>>
>> One idea for extending existing backing stores, e.g. HugeTLBFS and
>> tmpfs, is
>> to add F_SEAL_GUEST, which would convert the entire file to guest
>> private memory
>> and either fail if the current size is non-zero or truncate the size
>> to zero.
> 
> While possible, I actually do have the feeling that we want eventually
> to have something new, as the semantics are just too different. But
> let's see.
> 
> 
>> KVM
>> ===
>>
>> Guest private memory is managed as a new address space, i.e. as a
>> different set of
>> memslots, similar to how KVM has a separate memory view for when a
>> guest vCPU is
>> executing in virtual SMM.  SMM is mutually exclusive with guest
>> private memory.
>>
>> The fd (the actual integer) is provided to KVM when a private memslot
>> is added
>> via KVM_SET_USER_MEMORY_REGION.  This is when the aforementioned
>> pairing occurs.
>>
>> By default, KVM memslot lookups will be "shared", only specific
>> touchpoints will
>> be modified to work with private memslots, e.g. guest page faults. 
>> All host
>> accesses to guest memory, e.g. for emulation, will thus look for
>> shared memory
>> and naturally fail without attempting copy_to/from_user() if the guest
>> attempts
>> to coerce KVM into access private memory.  Note, avoiding
>> copy_to/from_user() and
>> friends isn't strictly necessary, it's more of a happy side effect.
>>
>> A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in
>> vcpu->run
>> is added to propagate illegal accesses (see above) and implicit
>> conversions
>> to userspace (see below).  Note, the new exit reason + struct can also
>> be to
>> support several other feature requests in KVM[1][2].
>>
>> The guest may explicitly or implicity request KVM to map a
>> shared/private variant
>> of a GFN.  An explicit map request is done via hypercall (out of scope
>> for this
>> proposal as both TDX and SNP ABIs define such a hypercall).  An
>> implicit map request
>> is triggered simply by the guest accessing the shared/private variant,
>> which KVM
>> sees as a guest page fault (EPT violation or #NPF).  Ideally only
>> explicit requests
>> would be supported, but neither TDX nor SNP require this in their
>> guest<->host ABIs.
>>
>> For implicit or explicit mappings, if a memslot is found that fully
>> covers the
>> requested range (which is a single gfn for implicit mappings), KVM's
>> normal guest
>> page fault handling works with minimal modification.
>>
>> If a memslot is not found, for explicit mappings, KVM will exit to
>> userspace with
>> the aforementioned dedicated exit reason.  For implict _private_
>> mappings, KVM will
>> also immediately exit with the same dedicated reason.  For implicit
>> shared mappings,
>> an additional check is required to differentiate between emulated MMIO
>> and an
>> implicit private->shared conversion[*].  If there is an existing
>> private memslot
>> for the gfn, KVM will exit to userspace, otherwise KVM will treat the
>> access as an
>> emulated MMIO access and handle the page fault accordingly.
> 
> Do you mean some kind of overlay. "Ordinary" user memory regions overlay
> "private user memory regions"? So when marking something shared, you'd
> leave the private user memory region alone and only create a new
> "ordinary"user memory regions that references shared memory in QEMU
> (IOW, a different mapping)?
> 
> Reading below, I think you were not actually thinking about an overlay,
> but maybe overlays might actually be a nice concept to have instead.
> 
> 
>> Punching Holes
>> ==============
>>
>> The expected userspace memory model is that mapping requests will be
>> handled as
>> conversions, e.g. on a shared mapping request, first unmap the private
>> gfn range,
>> then map the shared gfn range.  A new KVM ioctl() will likely be
>> needed to allow
>> userspace to punch a hole in a memslot, as expressing such an
>> operation isn't
>> possible with KVM_SET_USER_MEMORY_REGION.  While userspace could
>> delete the
>> memslot, then recreate three new memslots, doing so would be
>> destructive to guest
>> data as unmapping guest private memory (from the EPT/NPT tables) is
>> destructive
>> to the data for both TDX and SEV-SNP guests.
> 
> If you'd treat it like an overlay, you'd not actually be punching holes.
> You'd only be creating/removing ordinary user memory regions when
> marking something shared/unshared.
> 
>>
>> Pros (vs. struct page)
>> ======================
>>
>> Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping.
>>
>> Userspace page tables are not populated, e.g. reduced memory
>> footprint, lower
>> probability of making private memory accessible to userspace.
> 
> Agreed to the first part, although I consider that a secondary concern.
> The second part, I'm not sure if that is really the case. Fake swap
> entries are just a marker.
> 
>>
>> Provides line of sight to supporting page migration and swap.
> 
> Again, let's leave that out for now. I think that's an kernel internal
> that will require quite some thought either way.
> 
>>
>> Provides line of sight to mapping MMIO pages into guest private memory.
> 
> That's an interesting thought. Would it work via overlays as well? Can
> you elaborate?
> 
>>
>> Cons (vs. struct page)
>> ======================
>>
>> Significantly more churn in KVM, e.g. to plumb 'private' through where
>> needed,
>> support memslot hole punching, etc...
>>
>> KVM's MMU gets another method of retrieving host pfn and page size.
>>
>> Requires enabling in every backing store that someone wants to support.
> 
> I think we will only care about anonymous memory eventually with
> huge/gigantic pages in the next years. Just to what memfd() is already
> limited. File-backed -- I don't know ... if at all, swapping ... in a
> couple of years ...
> 
>>
>> Because the NUMA APIs work on virtual addresses, new syscalls
>> fmove_pages(),
>> fbind(), etc... would be required to provide equivalents to existing NUMA
>> functionality (though those syscalls would likely be useful
>> irrespective of guest
>> private memory).
> 
> Right, that's because we don't have a VMA that describes all this. E.g.,
> mbind().
> 
>>
>> Washes (vs. struct page)
>> ========================
>>
>> A misbehaving guest that triggers a large number of shared memory
>> mappings will
>> consume a large number of memslots.  But, this is likely a wash as
>> similar effect
>> would happen with VMAs in the struct page approach.
> 
> Just cap it then to something sane. 32k which we have right now is crazy
> and only required in very special setups. You can just make QEMU
> override/set the KVM default.
> 
> 
> 
> 
> 
> My wild idea after reading everything so far (full of flaws, just want
> to mention it, maybe it gives some ideas):
> 
> Introduce memfd_encrypted().
> 
> Similar to like memfd_secret()
> - Most system calls will just fail.
> - Allow MAP_SHARED only.
> - Enforce VM_DONTDUMP and skip during fork().

This seems like it would work, but integrating it with the hugetlb
reserve mechanism might be rather messy.

> - File size can change exactly once, before any mmap. (IIRC)

Why is this needed?  Obviously if the file size can be reduced, then the
pages need to get removed safely, but this seems doable if there's a use
case.

> 
> Different to memfd_secret(), allow mapping each page of the fd exactly
> one time via mmap() into a single process.

This doesn't solve the case of multiple memslots pointing at the same
address.  It also doesn't help with future operations that need to map
from a memfd_encrypted() backing page to the GPA that maps it.

> You'll end up with a VMA that corresponds to the whole file in a single
> process only, and that cannot vanish, not even in parts.
> 
> Define "ordinary" user memory slots as overlay on top of "encrypted"
> memory slots.  Inside KVM, bail out if you encounter such a VMA inside a
> normal user memory slot. When creating a "encryped" user memory slot,
> require that the whole VMA is covered at creation time. You know the VMA
> can't change later.

Oof.  That's quite a requirement.  What's the point of the VMA once all
this is done?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-26 17:05   ` Andy Lutomirski
@ 2021-08-26 21:26     ` David Hildenbrand
  2021-08-27 18:24       ` Andy Lutomirski
  0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-08-26 21:26 UTC (permalink / raw)
  To: Andy Lutomirski, Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Borislav Petkov, Andrew Morton, Joerg Roedel,
	Andi Kleen, David Rientjes, Vlastimil Babka, Tom Lendacky,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, Kirill A . Shutemov,
	Kirill A . Shutemov, Kuppuswamy Sathyanarayanan, Dave Hansen,
	Yu Zhang

On 26.08.21 19:05, Andy Lutomirski wrote:
> On 8/26/21 3:15 AM, David Hildenbrand wrote:
>> On 24.08.21 02:52, Sean Christopherson wrote:
>>> The goal of this RFC is to try and align KVM, mm, and anyone else with
>>> skin in the
>>> game, on an acceptable direction for supporting guest private memory,
>>> e.g. for
>>> Intel's TDX.  The TDX architectural effectively allows KVM guests to
>>> crash the
>>> host if guest private memory is accessible to host userspace, and thus
>>> does not
>>> play nice with KVM's existing approach of pulling the pfn and mapping
>>> level from
>>> the host page tables.
>>>
>>> This is by no means a complete patch; it's a rough sketch of the KVM
>>> changes that
>>> would be needed.  The kernel side of things is completely omitted from
>>> the patch;
>>> the design concept is below.
>>>
>>> There's also fair bit of hand waving on implementation details that
>>> shouldn't
>>> fundamentally change the overall ABI, e.g. how the backing store will
>>> ensure
>>> there are no mappings when "converting" to guest private.
>>>
>>
>> This is a lot of complexity and rather advanced approaches (not saying
>> they are bad, just that we try to teach the whole stack something
>> completely new).
>>
>>
>> What I think would really help is a list of requirements, such that
>> everybody is aware of what we actually want to achieve. Let me start:
>>
>> GFN: Guest Frame Number
>> EPFN: Encrypted Physical Frame Number
>>
>>
>> 1) An EPFN must not get mapped into more than one VM: it belongs exactly
>> to one VM. It must neither be shared between VMs between processes nor
>> between VMs within a processes.
>>
>>
>> 2) User space (well, and actually the kernel) must never access an EPFN:
>>
>> - If we go for an fd, essentially all operations (read/write) have to
>>    fail.
>> - If we have to map an EPFN into user space page tables (e.g., to
>>    simplify KVM), we could only allow fake swap entries such that "there
>>    is something" but it cannot be  accessed and is flagged accordingly.
>> - /proc/kcore and friends have to be careful as well and should not read
>>    this memory. So there has to be a way to flag these pages.
>>
>> 3) We need a way to express the GFN<->EPFN mapping and essentially
>> assign an EPFN to a GFN.
>>
>>
>> 4) Once we assigned a EPFN to a GFN, that assignment must not longer
>> change. Further, an EPFN must not get assigned to multiple GFNs.
>>
>>
>> 5) There has to be a way to "replace" encrypted parts by "shared" parts
>>     and the other way around.
>>
>> What else?
>>
>>
>>
>>> Background
>>> ==========
>>>
>>> This is a loose continuation of Kirill's RFC[*] to support TDX guest
>>> private
>>> memory by tracking guest memory at the 'struct page' level.  This
>>> proposal is the
>>> result of several offline discussions that were prompted by Andy
>>> Lutomirksi's
>>> concerns with tracking via 'struct page':
>>>
>>>     1. The kernel wouldn't easily be able to enforce a 1:1 page:guest
>>> association,
>>>        let alone a 1:1 pfn:gfn mapping.
>>
>> Well, it could with some help on higher layers. Someone has to do the
>> tracking. Marking EPFNs as EPFNs can actually be very helpful,  e.g.,
>> allow /proc/kcore to just not touch such pages. If we want to do all the
>> tracking in the struct page is a different story.
>>
>>>
>>>     2. Does not work for memory that isn't backed by 'struct page',
>>> e.g. if devices
>>>        gain support for exposing encrypted memory regions to guests.
>>
>> Let's keep it simple. If a struct page is right now what we need to
>> properly track it, so be it. If not, good. But let's not make this a
>> requirement right from the start if it's stuff for the far future.
>>
>>>
>>>     3. Does not help march toward page migration or swap support
>>> (though it doesn't
>>>        hurt either).
>>
>> "Does not help towards world peace, (though it doesn't hurt either)".
>>
>> Maybe let's ignore that for now, as it doesn't seem to be required to
>> get something reasonable running.
>>
>>>
>>> [*]
>>> https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com
>>>
>>>
>>> Concept
>>> =======
>>>
>>> Guest private memory must be backed by an "enlightened" file
>>> descriptor, where
>>> "enlightened" means the implementing subsystem supports a one-way
>>> "conversion" to
>>> guest private memory and provides bi-directional hooks to communicate
>>> directly
>>> with KVM.  Creating a private fd doesn't necessarily have to be a
>>> conversion, e.g. it
>>> could also be a flag provided at file creation, a property of the file
>>> system itself,
>>> etc...
>>
>> Doesn't sound too crazy. Maybe even introducing memfd_encrypted() if
>> extending the other ones turns out too complicated.
>>
>>>
>>> Before a private fd can be mapped into a KVM guest, it must be paired
>>> 1:1 with a
>>> KVM guest, i.e. multiple guests cannot share a fd.  At pairing, KVM
>>> and the fd's
>>> subsystem exchange a set of function pointers to allow KVM to call
>>> into the subsystem,
>>> e.g. to translate gfn->pfn, and vice versa to allow the subsystem to
>>> call into KVM,
>>> e.g. to invalidate/move/swap a gfn range.
>>>
>>> Mapping a private fd in host userspace is disallowed, i.e. there is
>>> never a host
>>> virtual address associated with the fd and thus no userspace page
>>> tables pointing
>>> at the private memory.
>>
>> To keep the primary vs. secondary MMU thing working, I think it would
>> actually be nice to go with special swap entries instead; it just keeps
>> most things working as expected. But let's see where we end up.
>>
>>>
>>> Pinning _from KVM_ is not required.  If the backing store supports
>>> page migration
>>> and/or swap, it can query the KVM-provided function pointers to see if
>>> KVM supports
>>> the operation.  If the operation is not supported (this will be the
>>> case initially
>>> in KVM), the backing store is responsible for ensuring correct
>>> functionality.
>>>
>>> Unmapping guest memory, e.g. to prevent use-after-free, is handled via
>>> a callback
>>> from the backing store to KVM.  KVM will employ techniques similar to
>>> those it uses
>>> for mmu_notifiers to ensure the guest cannot access freed memory.
>>>
>>> A key point is that, unlike similar failed proposals of the past, e.g.
>>> /dev/mktme,
>>> existing backing stores can be englightened, a from-scratch
>>> implementations is not
>>> required (though would obviously be possible as well).
>>
>> Right. But if it's just a bad fit, let's do something new. Just like we
>> did with memfd_secret.
>>
>>>
>>> One idea for extending existing backing stores, e.g. HugeTLBFS and
>>> tmpfs, is
>>> to add F_SEAL_GUEST, which would convert the entire file to guest
>>> private memory
>>> and either fail if the current size is non-zero or truncate the size
>>> to zero.
>>
>> While possible, I actually do have the feeling that we want eventually
>> to have something new, as the semantics are just too different. But
>> let's see.
>>
>>
>>> KVM
>>> ===
>>>
>>> Guest private memory is managed as a new address space, i.e. as a
>>> different set of
>>> memslots, similar to how KVM has a separate memory view for when a
>>> guest vCPU is
>>> executing in virtual SMM.  SMM is mutually exclusive with guest
>>> private memory.
>>>
>>> The fd (the actual integer) is provided to KVM when a private memslot
>>> is added
>>> via KVM_SET_USER_MEMORY_REGION.  This is when the aforementioned
>>> pairing occurs.
>>>
>>> By default, KVM memslot lookups will be "shared", only specific
>>> touchpoints will
>>> be modified to work with private memslots, e.g. guest page faults.
>>> All host
>>> accesses to guest memory, e.g. for emulation, will thus look for
>>> shared memory
>>> and naturally fail without attempting copy_to/from_user() if the guest
>>> attempts
>>> to coerce KVM into access private memory.  Note, avoiding
>>> copy_to/from_user() and
>>> friends isn't strictly necessary, it's more of a happy side effect.
>>>
>>> A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in
>>> vcpu->run
>>> is added to propagate illegal accesses (see above) and implicit
>>> conversions
>>> to userspace (see below).  Note, the new exit reason + struct can also
>>> be to
>>> support several other feature requests in KVM[1][2].
>>>
>>> The guest may explicitly or implicity request KVM to map a
>>> shared/private variant
>>> of a GFN.  An explicit map request is done via hypercall (out of scope
>>> for this
>>> proposal as both TDX and SNP ABIs define such a hypercall).  An
>>> implicit map request
>>> is triggered simply by the guest accessing the shared/private variant,
>>> which KVM
>>> sees as a guest page fault (EPT violation or #NPF).  Ideally only
>>> explicit requests
>>> would be supported, but neither TDX nor SNP require this in their
>>> guest<->host ABIs.
>>>
>>> For implicit or explicit mappings, if a memslot is found that fully
>>> covers the
>>> requested range (which is a single gfn for implicit mappings), KVM's
>>> normal guest
>>> page fault handling works with minimal modification.
>>>
>>> If a memslot is not found, for explicit mappings, KVM will exit to
>>> userspace with
>>> the aforementioned dedicated exit reason.  For implict _private_
>>> mappings, KVM will
>>> also immediately exit with the same dedicated reason.  For implicit
>>> shared mappings,
>>> an additional check is required to differentiate between emulated MMIO
>>> and an
>>> implicit private->shared conversion[*].  If there is an existing
>>> private memslot
>>> for the gfn, KVM will exit to userspace, otherwise KVM will treat the
>>> access as an
>>> emulated MMIO access and handle the page fault accordingly.
>>
>> Do you mean some kind of overlay. "Ordinary" user memory regions overlay
>> "private user memory regions"? So when marking something shared, you'd
>> leave the private user memory region alone and only create a new
>> "ordinary"user memory regions that references shared memory in QEMU
>> (IOW, a different mapping)?
>>
>> Reading below, I think you were not actually thinking about an overlay,
>> but maybe overlays might actually be a nice concept to have instead.
>>
>>
>>> Punching Holes
>>> ==============
>>>
>>> The expected userspace memory model is that mapping requests will be
>>> handled as
>>> conversions, e.g. on a shared mapping request, first unmap the private
>>> gfn range,
>>> then map the shared gfn range.  A new KVM ioctl() will likely be
>>> needed to allow
>>> userspace to punch a hole in a memslot, as expressing such an
>>> operation isn't
>>> possible with KVM_SET_USER_MEMORY_REGION.  While userspace could
>>> delete the
>>> memslot, then recreate three new memslots, doing so would be
>>> destructive to guest
>>> data as unmapping guest private memory (from the EPT/NPT tables) is
>>> destructive
>>> to the data for both TDX and SEV-SNP guests.
>>
>> If you'd treat it like an overlay, you'd not actually be punching holes.
>> You'd only be creating/removing ordinary user memory regions when
>> marking something shared/unshared.
>>
>>>
>>> Pros (vs. struct page)
>>> ======================
>>>
>>> Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping.
>>>
>>> Userspace page tables are not populated, e.g. reduced memory
>>> footprint, lower
>>> probability of making private memory accessible to userspace.
>>
>> Agreed to the first part, although I consider that a secondary concern.
>> The second part, I'm not sure if that is really the case. Fake swap
>> entries are just a marker.
>>
>>>
>>> Provides line of sight to supporting page migration and swap.
>>
>> Again, let's leave that out for now. I think that's an kernel internal
>> that will require quite some thought either way.
>>
>>>
>>> Provides line of sight to mapping MMIO pages into guest private memory.
>>
>> That's an interesting thought. Would it work via overlays as well? Can
>> you elaborate?
>>
>>>
>>> Cons (vs. struct page)
>>> ======================
>>>
>>> Significantly more churn in KVM, e.g. to plumb 'private' through where
>>> needed,
>>> support memslot hole punching, etc...
>>>
>>> KVM's MMU gets another method of retrieving host pfn and page size.
>>>
>>> Requires enabling in every backing store that someone wants to support.
>>
>> I think we will only care about anonymous memory eventually with
>> huge/gigantic pages in the next years. Just to what memfd() is already
>> limited. File-backed -- I don't know ... if at all, swapping ... in a
>> couple of years ...
>>
>>>
>>> Because the NUMA APIs work on virtual addresses, new syscalls
>>> fmove_pages(),
>>> fbind(), etc... would be required to provide equivalents to existing NUMA
>>> functionality (though those syscalls would likely be useful
>>> irrespective of guest
>>> private memory).
>>
>> Right, that's because we don't have a VMA that describes all this. E.g.,
>> mbind().
>>
>>>
>>> Washes (vs. struct page)
>>> ========================
>>>
>>> A misbehaving guest that triggers a large number of shared memory
>>> mappings will
>>> consume a large number of memslots.  But, this is likely a wash as
>>> similar effect
>>> would happen with VMAs in the struct page approach.
>>
>> Just cap it then to something sane. 32k which we have right now is crazy
>> and only required in very special setups. You can just make QEMU
>> override/set the KVM default.
>>
>>
>>
>>
>>
>> My wild idea after reading everything so far (full of flaws, just want
>> to mention it, maybe it gives some ideas):
>>
>> Introduce memfd_encrypted().
>>
>> Similar to like memfd_secret()
>> - Most system calls will just fail.
>> - Allow MAP_SHARED only.
>> - Enforce VM_DONTDUMP and skip during fork().
> 
> This seems like it would work, but integrating it with the hugetlb
> reserve mechanism might be rather messy.

One step at a time.

> 
>> - File size can change exactly once, before any mmap. (IIRC)
> 
> Why is this needed?  Obviously if the file size can be reduced, then the
> pages need to get removed safely, but this seems doable if there's a use
> case.

Right, but we usually don't resize memfd either.

> 
>>
>> Different to memfd_secret(), allow mapping each page of the fd exactly
>> one time via mmap() into a single process.
> 
> This doesn't solve the case of multiple memslots pointing at the same
> address.  It also doesn't help with future operations that need to map
> from a memfd_encrypted() backing page to the GPA that maps it.

That's trivial to enforce inside KVM when mapping. Encryted user memory 
regions, just like such VMAs just have to be sticky until we can come up 
with something different.

> 
>> You'll end up with a VMA that corresponds to the whole file in a single
>> process only, and that cannot vanish, not even in parts.
>>
>> Define "ordinary" user memory slots as overlay on top of "encrypted"
>> memory slots.  Inside KVM, bail out if you encounter such a VMA inside a
>> normal user memory slot. When creating a "encryped" user memory slot,
>> require that the whole VMA is covered at creation time. You know the VMA
>> can't change later.
> 
> Oof.  That's quite a requirement.  What's the point of the VMA once all
> this is done?

You can keep using things like mbind(), madvise(), ... and the GUP code 
with a special flag might mostly just do what you want. You won't have 
to reinvent too many wheels on the page fault logic side at least.

Just a brain dump. Feel free to refine if you think any of this makes sense.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-26 10:15 ` David Hildenbrand
  2021-08-26 17:05   ` Andy Lutomirski
@ 2021-08-27  2:31   ` Yu Zhang
  2021-08-31 19:08     ` David Hildenbrand
  2021-09-01  4:53     ` Andy Lutomirski
  2021-08-27 22:18   ` Sean Christopherson
  2 siblings, 2 replies; 71+ messages in thread
From: Yu Zhang @ 2021-08-27  2:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, Dave Hansen

On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
> On 24.08.21 02:52, Sean Christopherson wrote:
> > The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the
> > game, on an acceptable direction for supporting guest private memory, e.g. for
> > Intel's TDX.  The TDX architectural effectively allows KVM guests to crash the
> > host if guest private memory is accessible to host userspace, and thus does not
> > play nice with KVM's existing approach of pulling the pfn and mapping level from
> > the host page tables.
> > 
> > This is by no means a complete patch; it's a rough sketch of the KVM changes that
> > would be needed.  The kernel side of things is completely omitted from the patch;
> > the design concept is below.
> > 
> > There's also fair bit of hand waving on implementation details that shouldn't
> > fundamentally change the overall ABI, e.g. how the backing store will ensure
> > there are no mappings when "converting" to guest private.
> > 
> 
> This is a lot of complexity and rather advanced approaches (not saying they
> are bad, just that we try to teach the whole stack something completely
> new).
> 
> 
> What I think would really help is a list of requirements, such that
> everybody is aware of what we actually want to achieve. Let me start:
> 
> GFN: Guest Frame Number
> EPFN: Encrypted Physical Frame Number
> 
> 
> 1) An EPFN must not get mapped into more than one VM: it belongs exactly to
> one VM. It must neither be shared between VMs between processes nor between
> VMs within a processes.
> 
> 
> 2) User space (well, and actually the kernel) must never access an EPFN:
> 
> - If we go for an fd, essentially all operations (read/write) have to
>   fail.
> - If we have to map an EPFN into user space page tables (e.g., to
>   simplify KVM), we could only allow fake swap entries such that "there
>   is something" but it cannot be  accessed and is flagged accordingly.
> - /proc/kcore and friends have to be careful as well and should not read
>   this memory. So there has to be a way to flag these pages.
> 
> 3) We need a way to express the GFN<->EPFN mapping and essentially assign an
> EPFN to a GFN.
> 
> 
> 4) Once we assigned a EPFN to a GFN, that assignment must not longer change.
> Further, an EPFN must not get assigned to multiple GFNs.
> 
> 
> 5) There has to be a way to "replace" encrypted parts by "shared" parts
>    and the other way around.
> 
> What else?

Thanks a lot for this summary. A question about the requirement: do we or
do we not have plan to support assigned device to the protected VM?

If yes. The fd based solution may need change the VFIO interface as well(
though the fake swap entry solution need mess with VFIO too). Because:

1> KVM uses VFIO when assigning devices into a VM.

2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
guest pages will have to be mapped in host IOMMU page table to host pages,
which are pinned during the whole life cycle fo the VM.

3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
in vfio_dma_do_map().

4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
and pin the page. 

But if we are using fd based solution, not every GPA can have a HVA, thus
the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
doubt if VFIO can be modified to support this easily.


B.R.
Yu

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-26 21:26     ` David Hildenbrand
@ 2021-08-27 18:24       ` Andy Lutomirski
  2021-08-27 22:28         ` Sean Christopherson
  0 siblings, 1 reply; 71+ messages in thread
From: Andy Lutomirski @ 2021-08-27 18:24 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang



On Thu, Aug 26, 2021, at 2:26 PM, David Hildenbrand wrote:
> On 26.08.21 19:05, Andy Lutomirski wrote:

> > Oof.  That's quite a requirement.  What's the point of the VMA once all
> > this is done?
> 
> You can keep using things like mbind(), madvise(), ... and the GUP code 
> with a special flag might mostly just do what you want. You won't have 
> to reinvent too many wheels on the page fault logic side at least.
> 

You can keep calling the functions.  The implementations working is a different story: you can't just unmap (pte_numa-style or otherwise) a private guest page to quiesce it, move it with memcpy(), and then fault it back in.

In any event, adding fd-based NUMA APIs would be quite nice.  Look at the numactl command.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-26 10:15 ` David Hildenbrand
  2021-08-26 17:05   ` Andy Lutomirski
  2021-08-27  2:31   ` Yu Zhang
@ 2021-08-27 22:18   ` Sean Christopherson
  2021-08-31 19:07     ` David Hildenbrand
  2 siblings, 1 reply; 71+ messages in thread
From: Sean Christopherson @ 2021-08-27 22:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, Dave Hansen, Yu Zhang

On Thu, Aug 26, 2021, David Hildenbrand wrote:
> You'll end up with a VMA that corresponds to the whole file in a single
> process only, and that cannot vanish, not even in parts.

How would userspace tell the kernel to free parts of memory that it doesn't want
assigned to the guest, e.g. to free memory that the guest has converted to
not-private?

> Define "ordinary" user memory slots as overlay on top of "encrypted" memory
> slots.  Inside KVM, bail out if you encounter such a VMA inside a normal
> user memory slot. When creating a "encryped" user memory slot, require that
> the whole VMA is covered at creation time. You know the VMA can't change
> later.

This can work for the basic use cases, but even then I'd strongly prefer not to
tie memslot correctness to the VMAs.  KVM doesn't truly care what lies behind
the virtual address of a memslot, and when it does care, it tends to do poorly,
e.g. see the whole PFNMAP snafu.  KVM cares about the pfn<->gfn mappings, and
that's reflected in the infrastructure.  E.g. KVM relies on the mmu_notifiers
to handle mprotect()/munmap()/etc...

As is, I don't think KVM would get any kind of notification if userpaces unmaps
the VMA for a private memslot that does not have any entries in the host page
tables.   I'm sure it's a solvable problem, e.g. by ensuring at least one page
is touched by the backing store, but I don't think the end result would be any
prettier than a dedicated API for KVM to consume.

Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight
to page migration or swap.  For those types of operations, KVM currently just
reacts to invalidation notifications by zapping guest PTEs, and then gets the
new pfn when the guest re-faults on the page.  That sequence doesn't work for
TDX or SEV-SNP because the trusteday agent needs to do the memcpy() of the page
contents, i.e. the host needs to call into KVM for the actual migration.

There's also the memory footprint side of things; the fd-based approach avoids
having to create host page tables for memory that by definition will never be
used by the host.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-27 18:24       ` Andy Lutomirski
@ 2021-08-27 22:28         ` Sean Christopherson
  2021-08-31 19:12           ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: Sean Christopherson @ 2021-08-27 22:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Hildenbrand, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On Fri, Aug 27, 2021, Andy Lutomirski wrote:
> 
> On Thu, Aug 26, 2021, at 2:26 PM, David Hildenbrand wrote:
> > On 26.08.21 19:05, Andy Lutomirski wrote:
> 
> > > Oof.  That's quite a requirement.  What's the point of the VMA once all
> > > this is done?
> > 
> > You can keep using things like mbind(), madvise(), ... and the GUP code 
> > with a special flag might mostly just do what you want. You won't have 
> > to reinvent too many wheels on the page fault logic side at least.

Ya, Kirill's RFC more or less proved a special GUP flag would indeed Just Work.
However, the KVM page fault side of things would require only a handful of small
changes to send private memslots down a different path.  Compared to the rest of
the enabling, it's quite minor.

The counter to that is other KVM architectures would need to learn how to use the
new APIs, though I suspect that there will be a fair bit of arch enabling regardless
of what route we take.

> You can keep calling the functions.  The implementations working is a
> different story: you can't just unmap (pte_numa-style or otherwise) a private
> guest page to quiesce it, move it with memcpy(), and then fault it back in.

Ya, I brought this up in my earlier reply.  Even the initial implementation (without
real NUMA support) would likely be painful, e.g. the KVM TDX RFC/PoC adds dedicated
logic in KVM to handle the case where NUMA balancing zaps a _pinned_ page and then
KVM fault in the same pfn.  It's not thaaat ugly, but it's arguably more invasive
to KVM's page fault flows than a new fd-based private memslot scheme.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-27 22:18   ` Sean Christopherson
@ 2021-08-31 19:07     ` David Hildenbrand
  2021-08-31 21:54       ` Sean Christopherson
  2021-09-01  4:58       ` Andy Lutomirski
  0 siblings, 2 replies; 71+ messages in thread
From: David Hildenbrand @ 2021-08-31 19:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, Dave Hansen, Yu Zhang

On 28.08.21 00:18, Sean Christopherson wrote:
> On Thu, Aug 26, 2021, David Hildenbrand wrote:
>> You'll end up with a VMA that corresponds to the whole file in a single
>> process only, and that cannot vanish, not even in parts.
> 
> How would userspace tell the kernel to free parts of memory that it doesn't want
> assigned to the guest, e.g. to free memory that the guest has converted to
> not-private?

I'd guess one possibility could be fallocate(FALLOC_FL_PUNCH_HOLE).

Questions are: when would it actually be allowed to perform such a 
destructive operation? Do we have to protect from that? How would KVM 
protect from user space replacing private pages by shared pages in any 
of the models we discuss?

> 
>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory
>> slots.  Inside KVM, bail out if you encounter such a VMA inside a normal
>> user memory slot. When creating a "encryped" user memory slot, require that
>> the whole VMA is covered at creation time. You know the VMA can't change
>> later.
> 
> This can work for the basic use cases, but even then I'd strongly prefer not to
> tie memslot correctness to the VMAs.  KVM doesn't truly care what lies behind
> the virtual address of a memslot, and when it does care, it tends to do poorly,
> e.g. see the whole PFNMAP snafu.  KVM cares about the pfn<->gfn mappings, and
> that's reflected in the infrastructure.  E.g. KVM relies on the mmu_notifiers
> to handle mprotect()/munmap()/etc...

Right, and for the existing use cases this worked. But encrypted memory 
breaks many assumptions we once made ...

I have somewhat mixed feelings about pages that are mapped into 
$WHATEVER page tables but not actually mapped into user space page 
tables. There is no way to reach these via the rmap.

We have something like that already via vfio. And that is fundamentally 
broken when it comes to mmu notifiers, page pinning, page migration, ...

> 
> As is, I don't think KVM would get any kind of notification if userpaces unmaps
> the VMA for a private memslot that does not have any entries in the host page
> tables.   I'm sure it's a solvable problem, e.g. by ensuring at least one page
> is touched by the backing store, but I don't think the end result would be any
> prettier than a dedicated API for KVM to consume.
> 
> Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight
> to page migration or swap.  For those types of operations, KVM currently just
> reacts to invalidation notifications by zapping guest PTEs, and then gets the
> new pfn when the guest re-faults on the page.  That sequence doesn't work for
> TDX or SEV-SNP because the trusteday agent needs to do the memcpy() of the page
> contents, i.e. the host needs to call into KVM for the actual migration.

Right, but I still think this is a kernel internal. You can do such 
handshake later in the kernel IMHO.

But I also already thought: is it really KVM that is to perform the 
migration or is it the fd-provider that performs the migration? Who says 
memfd_encrypted() doesn't default to a TDX "backend" on Intel CPUs that 
just knows how to migrate such a page?

I'd love to have some details on how that's supposed to work, and which 
information we'd need to migrate/swap/... in addition to the EPFN and a 
new SPFN.

> 
> There's also the memory footprint side of things; the fd-based approach avoids
> having to create host page tables for memory that by definition will never be
> used by the host.

While that is true, that is not a compelling argument IMHO. No need to 
try to be better than state of the art if it results in something 
cleaner/better* just sticking with state of the art. Just like we don't 
have special interfaces to map $WHATEVER into a guest and bypassing user 
space page tables.

* to be shown what actually is cleaner/better. We don't really have 
prototypes for either.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-27  2:31   ` Yu Zhang
@ 2021-08-31 19:08     ` David Hildenbrand
  2021-08-31 20:01       ` Andi Kleen
  2021-09-01  4:53     ` Andy Lutomirski
  1 sibling, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-08-31 19:08 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, Dave Hansen

On 27.08.21 04:31, Yu Zhang wrote:
> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
>> On 24.08.21 02:52, Sean Christopherson wrote:
>>> The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the
>>> game, on an acceptable direction for supporting guest private memory, e.g. for
>>> Intel's TDX.  The TDX architectural effectively allows KVM guests to crash the
>>> host if guest private memory is accessible to host userspace, and thus does not
>>> play nice with KVM's existing approach of pulling the pfn and mapping level from
>>> the host page tables.
>>>
>>> This is by no means a complete patch; it's a rough sketch of the KVM changes that
>>> would be needed.  The kernel side of things is completely omitted from the patch;
>>> the design concept is below.
>>>
>>> There's also fair bit of hand waving on implementation details that shouldn't
>>> fundamentally change the overall ABI, e.g. how the backing store will ensure
>>> there are no mappings when "converting" to guest private.
>>>
>>
>> This is a lot of complexity and rather advanced approaches (not saying they
>> are bad, just that we try to teach the whole stack something completely
>> new).
>>
>>
>> What I think would really help is a list of requirements, such that
>> everybody is aware of what we actually want to achieve. Let me start:
>>
>> GFN: Guest Frame Number
>> EPFN: Encrypted Physical Frame Number
>>
>>
>> 1) An EPFN must not get mapped into more than one VM: it belongs exactly to
>> one VM. It must neither be shared between VMs between processes nor between
>> VMs within a processes.
>>
>>
>> 2) User space (well, and actually the kernel) must never access an EPFN:
>>
>> - If we go for an fd, essentially all operations (read/write) have to
>>    fail.
>> - If we have to map an EPFN into user space page tables (e.g., to
>>    simplify KVM), we could only allow fake swap entries such that "there
>>    is something" but it cannot be  accessed and is flagged accordingly.
>> - /proc/kcore and friends have to be careful as well and should not read
>>    this memory. So there has to be a way to flag these pages.
>>
>> 3) We need a way to express the GFN<->EPFN mapping and essentially assign an
>> EPFN to a GFN.
>>
>>
>> 4) Once we assigned a EPFN to a GFN, that assignment must not longer change.
>> Further, an EPFN must not get assigned to multiple GFNs.
>>
>>
>> 5) There has to be a way to "replace" encrypted parts by "shared" parts
>>     and the other way around.
>>
>> What else?
> 
> Thanks a lot for this summary. A question about the requirement: do we or
> do we not have plan to support assigned device to the protected VM?

Good question, I assume that is stuff for the far far future.

> 
> If yes. The fd based solution may need change the VFIO interface as well(
> though the fake swap entry solution need mess with VFIO too). Because:
> 
> 1> KVM uses VFIO when assigning devices into a VM.
> 
> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
> guest pages will have to be mapped in host IOMMU page table to host pages,
> which are pinned during the whole life cycle fo the VM.
> 
> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
> in vfio_dma_do_map().
> 
> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
> and pin the page.
> 
> But if we are using fd based solution, not every GPA can have a HVA, thus
> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
> doubt if VFIO can be modified to support this easily.

I fully agree. Maybe Intel folks have some idea how that's supposed to 
look like in the future.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-27 22:28         ` Sean Christopherson
@ 2021-08-31 19:12           ` David Hildenbrand
  2021-08-31 20:45             ` Sean Christopherson
  0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-08-31 19:12 UTC (permalink / raw)
  To: Sean Christopherson, Andy Lutomirski
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On 28.08.21 00:28, Sean Christopherson wrote:
> On Fri, Aug 27, 2021, Andy Lutomirski wrote:
>>
>> On Thu, Aug 26, 2021, at 2:26 PM, David Hildenbrand wrote:
>>> On 26.08.21 19:05, Andy Lutomirski wrote:
>>
>>>> Oof.  That's quite a requirement.  What's the point of the VMA once all
>>>> this is done?
>>>
>>> You can keep using things like mbind(), madvise(), ... and the GUP code
>>> with a special flag might mostly just do what you want. You won't have
>>> to reinvent too many wheels on the page fault logic side at least.
> 
> Ya, Kirill's RFC more or less proved a special GUP flag would indeed Just Work.
> However, the KVM page fault side of things would require only a handful of small
> changes to send private memslots down a different path.  Compared to the rest of
> the enabling, it's quite minor.
> 
> The counter to that is other KVM architectures would need to learn how to use the
> new APIs, though I suspect that there will be a fair bit of arch enabling regardless
> of what route we take.
> 
>> You can keep calling the functions.  The implementations working is a
>> different story: you can't just unmap (pte_numa-style or otherwise) a private
>> guest page to quiesce it, move it with memcpy(), and then fault it back in.
> 
> Ya, I brought this up in my earlier reply.  Even the initial implementation (without
> real NUMA support) would likely be painful, e.g. the KVM TDX RFC/PoC adds dedicated
> logic in KVM to handle the case where NUMA balancing zaps a _pinned_ page and then
> KVM fault in the same pfn.  It's not thaaat ugly, but it's arguably more invasive
> to KVM's page fault flows than a new fd-based private memslot scheme.

I might have a different mindset, but less code churn doesn't 
necessarily translate to "better approach".

I'm certainly not pushing for what I proposed (it's a rough, broken 
sketch). I'm much rather trying to come up with alternatives that try 
solving the same issue, handling the identified requirements.

I have a gut feeling that the list of requirements might not be complete 
yet. For example, I wonder if we have to protect against user space 
replacing private pages by shared pages or punishing random holes into 
the encrypted memory fd.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-31 19:08     ` David Hildenbrand
@ 2021-08-31 20:01       ` Andi Kleen
  2021-08-31 20:15         ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: Andi Kleen @ 2021-08-31 20:01 UTC (permalink / raw)
  To: David Hildenbrand, Yu Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, Dave Hansen


>> Thanks a lot for this summary. A question about the requirement: do 
>> we or
>> do we not have plan to support assigned device to the protected VM?
>
> Good question, I assume that is stuff for the far far future.

It is in principle possible with the current TDX, but not secure. But 
someone might decide to do it. So it would be good to have basic support 
at least.

-Andi

>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-31 20:01       ` Andi Kleen
@ 2021-08-31 20:15         ` David Hildenbrand
  2021-08-31 20:39           ` Andi Kleen
  0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-08-31 20:15 UTC (permalink / raw)
  To: Andi Kleen, Yu Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, Dave Hansen

On 31.08.21 22:01, Andi Kleen wrote:
> 
>>> Thanks a lot for this summary. A question about the requirement: do
>>> we or
>>> do we not have plan to support assigned device to the protected VM?
>>
>> Good question, I assume that is stuff for the far far future.
> 
> It is in principle possible with the current TDX, but not secure. But
> someone might decide to do it. So it would be good to have basic support
> at least.

Can you elaborate the "not secure" part? Do you mean, making the device 
only access "shared" memory, not secure/encrypted/whatsoever?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-31 20:15         ` David Hildenbrand
@ 2021-08-31 20:39           ` Andi Kleen
  2021-09-01  3:34             ` Yu Zhang
  0 siblings, 1 reply; 71+ messages in thread
From: Andi Kleen @ 2021-08-31 20:39 UTC (permalink / raw)
  To: David Hildenbrand, Yu Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, Dave Hansen


On 8/31/2021 1:15 PM, David Hildenbrand wrote:
> On 31.08.21 22:01, Andi Kleen wrote:
>>
>>>> Thanks a lot for this summary. A question about the requirement: do
>>>> we or
>>>> do we not have plan to support assigned device to the protected VM?
>>>
>>> Good question, I assume that is stuff for the far far future.
>>
>> It is in principle possible with the current TDX, but not secure. But
>> someone might decide to do it. So it would be good to have basic support
>> at least.
>
> Can you elaborate the "not secure" part? Do you mean, making the 
> device only access "shared" memory, not secure/encrypted/whatsoever?


Yes that's right. It can only access shared areas.


-Andi


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-31 19:12           ` David Hildenbrand
@ 2021-08-31 20:45             ` Sean Christopherson
  2021-09-01  7:51               ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: Sean Christopherson @ 2021-08-31 20:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andy Lutomirski, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On Tue, Aug 31, 2021, David Hildenbrand wrote:
> On 28.08.21 00:28, Sean Christopherson wrote:
> > On Fri, Aug 27, 2021, Andy Lutomirski wrote:
> > > 
> > > On Thu, Aug 26, 2021, at 2:26 PM, David Hildenbrand wrote:
> > > > On 26.08.21 19:05, Andy Lutomirski wrote:
> > > 
> > > > > Oof.  That's quite a requirement.  What's the point of the VMA once all
> > > > > this is done?
> > > > 
> > > > You can keep using things like mbind(), madvise(), ... and the GUP code
> > > > with a special flag might mostly just do what you want. You won't have
> > > > to reinvent too many wheels on the page fault logic side at least.
> > 
> > Ya, Kirill's RFC more or less proved a special GUP flag would indeed Just Work.
> > However, the KVM page fault side of things would require only a handful of small
> > changes to send private memslots down a different path.  Compared to the rest of
> > the enabling, it's quite minor.
> > 
> > The counter to that is other KVM architectures would need to learn how to use the
> > new APIs, though I suspect that there will be a fair bit of arch enabling regardless
> > of what route we take.
> > 
> > > You can keep calling the functions.  The implementations working is a
> > > different story: you can't just unmap (pte_numa-style or otherwise) a private
> > > guest page to quiesce it, move it with memcpy(), and then fault it back in.
> > 
> > Ya, I brought this up in my earlier reply.  Even the initial implementation (without
> > real NUMA support) would likely be painful, e.g. the KVM TDX RFC/PoC adds dedicated
> > logic in KVM to handle the case where NUMA balancing zaps a _pinned_ page and then
> > KVM fault in the same pfn.  It's not thaaat ugly, but it's arguably more invasive
> > to KVM's page fault flows than a new fd-based private memslot scheme.
> 
> I might have a different mindset, but less code churn doesn't necessarily
> translate to "better approach".

I wasn't referring to code churn.  By "invasive" I mean number of touchpoints in
KVM as well as the nature of the touchpoints.  E.g. poking into how KVM uses
available bits in its shadow PTEs and adding multiple checks through KVM's page
fault handler, versus two callbacks to get the PFN and page size.

> I'm certainly not pushing for what I proposed (it's a rough, broken sketch).
> I'm much rather trying to come up with alternatives that try solving the
> same issue, handling the identified requirements.
> 
> I have a gut feeling that the list of requirements might not be complete
> yet. For example, I wonder if we have to protect against user space
> replacing private pages by shared pages or punishing random holes into the
> encrypted memory fd.

Replacing a private page with a shared page for a given GFN is very much a
requirement as it's expected behavior for all VMM+guests when converting guest
memory between shared and private.

Punching holes is a sort of optional requirement.  It's a "requirement" in that
it's allowed if the backing store supports such a behavior, optional in that
support wouldn't be strictly necessary and/or could come with constraints.  The
expected use case is that host userspace would punch a hole to free unreachable
private memory, e.g. after the corresponding GFN(s) is converted to shared, so
that it doesn't consume 2x memory for the guest.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-31 19:07     ` David Hildenbrand
@ 2021-08-31 21:54       ` Sean Christopherson
  2021-09-01  8:09         ` David Hildenbrand
  2021-09-01  4:58       ` Andy Lutomirski
  1 sibling, 1 reply; 71+ messages in thread
From: Sean Christopherson @ 2021-08-31 21:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, Dave Hansen, Yu Zhang

On Tue, Aug 31, 2021, David Hildenbrand wrote:
> On 28.08.21 00:18, Sean Christopherson wrote:
> > On Thu, Aug 26, 2021, David Hildenbrand wrote:
> > > You'll end up with a VMA that corresponds to the whole file in a single
> > > process only, and that cannot vanish, not even in parts.
> > 
> > How would userspace tell the kernel to free parts of memory that it doesn't want
> > assigned to the guest, e.g. to free memory that the guest has converted to
> > not-private?
> 
> I'd guess one possibility could be fallocate(FALLOC_FL_PUNCH_HOLE).
> 
> Questions are: when would it actually be allowed to perform such a
> destructive operation?

From the kernel's perspective, userspace is allowed to perform destructive
operations at will.  It is ultimately the userspace VMM's responsibility to not
DoS the guest.

> Do we have to protect from that? How would KVM protect from user space
> replacing private pages by shared pages in any of the models we discuss?

The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*]
as both private and shared, where "shared" also incorporates any mapping from the
host.  Essentially it boils down to the kernel ensuring that a pfn is unmapped
before it's converted to/from private, and KVM ensuring that it honors any
unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback
as proposed in this RFC.

As it pertains to PUNCH_HOLE, the responsibilities are no different than when the
backing-store is destroyed; the backing-store needs to notify downstream MMUs
(a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory.

[*] Whether or not the kernel's direct mapping needs to be removed is debatable,
    but my argument is that that behavior is not visible to userspace and thus
    out of scope for this discussion, e.g. zapping/restoring the direct map can
    be added/removed without impacting the userspace ABI.

> > > Define "ordinary" user memory slots as overlay on top of "encrypted" memory
> > > slots.  Inside KVM, bail out if you encounter such a VMA inside a normal
> > > user memory slot. When creating a "encryped" user memory slot, require that
> > > the whole VMA is covered at creation time. You know the VMA can't change
> > > later.
> > 
> > This can work for the basic use cases, but even then I'd strongly prefer not to
> > tie memslot correctness to the VMAs.  KVM doesn't truly care what lies behind
> > the virtual address of a memslot, and when it does care, it tends to do poorly,
> > e.g. see the whole PFNMAP snafu.  KVM cares about the pfn<->gfn mappings, and
> > that's reflected in the infrastructure.  E.g. KVM relies on the mmu_notifiers
> > to handle mprotect()/munmap()/etc...
> 
> Right, and for the existing use cases this worked. But encrypted memory
> breaks many assumptions we once made ...
> 
> I have somewhat mixed feelings about pages that are mapped into $WHATEVER
> page tables but not actually mapped into user space page tables. There is no
> way to reach these via the rmap.
> 
> We have something like that already via vfio. And that is fundamentally
> broken when it comes to mmu notifiers, page pinning, page migration, ...

I'm not super familiar with VFIO internals, but the idea with the fd-based
approach is that the backing-store would be in direct communication with KVM and
would handle those operations through that direct channel.

> > As is, I don't think KVM would get any kind of notification if userpaces unmaps
> > the VMA for a private memslot that does not have any entries in the host page
> > tables.   I'm sure it's a solvable problem, e.g. by ensuring at least one page
> > is touched by the backing store, but I don't think the end result would be any
> > prettier than a dedicated API for KVM to consume.
> > 
> > Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight
> > to page migration or swap.  For those types of operations, KVM currently just
> > reacts to invalidation notifications by zapping guest PTEs, and then gets the
> > new pfn when the guest re-faults on the page.  That sequence doesn't work for
> > TDX or SEV-SNP because the trusteday agent needs to do the memcpy() of the page
> > contents, i.e. the host needs to call into KVM for the actual migration.
> 
> Right, but I still think this is a kernel internal. You can do such
> handshake later in the kernel IMHO.

It is kernel internal, but AFAICT it will be ugly because KVM "needs" to do the
migration and that would invert the mmu_notifer API, e.g. instead of "telling"
secondary MMUs to invalidate/change a mappings, the mm would be "asking"
secondary MMus "can you move this?".  More below.

> But I also already thought: is it really KVM that is to perform the
> migration or is it the fd-provider that performs the migration? Who says
> memfd_encrypted() doesn't default to a TDX "backend" on Intel CPUs that just
> knows how to migrate such a page?
> 
> I'd love to have some details on how that's supposed to work, and which
> information we'd need to migrate/swap/... in addition to the EPFN and a new
> SPFN.

KVM "needs" to do the migration.  On TDX, the migration will be a SEAMCALL,
a post-VMXON instruction that transfers control to the TDX-Module, that at
minimum needs a per-VM identifier, the gfn, and the page table level.  The call
into the TDX-Module would also need to take a KVM lock (probably KVM's mmu_lock)
to satisfy TDX's concurrency requirement, e.g. to avoid "spurious" errors due to
the backing-store attempting to migrate memory that KVM is unmapping due to a
memslot change.

The per-VM identifier may not apply to SEV-SNP, but I believe everything else
holds true.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-31 20:39           ` Andi Kleen
@ 2021-09-01  3:34             ` Yu Zhang
  0 siblings, 0 replies; 71+ messages in thread
From: Yu Zhang @ 2021-09-01  3:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Hildenbrand, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Borislav Petkov, Andy Lutomirski, Andrew Morton,
	Joerg Roedel, David Rientjes, Vlastimil Babka, Tom Lendacky,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, Kirill A . Shutemov,
	Kirill A . Shutemov, Kuppuswamy Sathyanarayanan, Dave Hansen

On Tue, Aug 31, 2021 at 01:39:31PM -0700, Andi Kleen wrote:
> 
> On 8/31/2021 1:15 PM, David Hildenbrand wrote:
> > On 31.08.21 22:01, Andi Kleen wrote:
> > > 
> > > > > Thanks a lot for this summary. A question about the requirement: do
> > > > > we or
> > > > > do we not have plan to support assigned device to the protected VM?
> > > > 
> > > > Good question, I assume that is stuff for the far far future.
> > > 
> > > It is in principle possible with the current TDX, but not secure. But
> > > someone might decide to do it. So it would be good to have basic support
> > > at least.
> > 
> > Can you elaborate the "not secure" part? Do you mean, making the device
> > only access "shared" memory, not secure/encrypted/whatsoever?
> 
> 
> Yes that's right. It can only access shared areas.

Thanks, Andy & David.

Actually, enabling of device assinment needs quite some effort, e.g.,
to guarantee only shared pages are mapped in IOMMU page table (using
shared GFNs). And the buffer copying inside TD is still unavoidable,
thus not much performance benefit.

Maybe we should just *disable* VFIO device in TDX first. 

As to the fd-based private memory, enventually we will have to tolerate
its impact on any place where GUP is needed in virtualization. :)

B.R.
Yu

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-27  2:31   ` Yu Zhang
  2021-08-31 19:08     ` David Hildenbrand
@ 2021-09-01  4:53     ` Andy Lutomirski
  2021-09-01  7:12       ` Tian, Kevin
  2021-09-01 10:24       ` Yu Zhang
  1 sibling, 2 replies; 71+ messages in thread
From: Andy Lutomirski @ 2021-09-01  4:53 UTC (permalink / raw)
  To: Yu Zhang, David Hildenbrand
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen



On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:

> Thanks a lot for this summary. A question about the requirement: do we or
> do we not have plan to support assigned device to the protected VM?
> 
> If yes. The fd based solution may need change the VFIO interface as well(
> though the fake swap entry solution need mess with VFIO too). Because:
> 
> 1> KVM uses VFIO when assigning devices into a VM.
> 
> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
> guest pages will have to be mapped in host IOMMU page table to host pages,
> which are pinned during the whole life cycle fo the VM.
> 
> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
> in vfio_dma_do_map().
> 
> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
> and pin the page. 
> 
> But if we are using fd based solution, not every GPA can have a HVA, thus
> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
> doubt if VFIO can be modified to support this easily.
> 
> 

Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device?

If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it.  I'm not sure whether anyone will actually do this, but it should be possible, at least in principle.  Maybe someone will want to assign a NIC to a TDX guest.  An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether.

If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-31 19:07     ` David Hildenbrand
  2021-08-31 21:54       ` Sean Christopherson
@ 2021-09-01  4:58       ` Andy Lutomirski
  2021-09-01  7:49         ` David Hildenbrand
  1 sibling, 1 reply; 71+ messages in thread
From: Andy Lutomirski @ 2021-09-01  4:58 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On Tue, Aug 31, 2021, at 12:07 PM, David Hildenbrand wrote:
> On 28.08.21 00:18, Sean Christopherson wrote:
> > On Thu, Aug 26, 2021, David Hildenbrand wrote:
> >> You'll end up with a VMA that corresponds to the whole file in a single
> >> process only, and that cannot vanish, not even in parts.
> > 
> > How would userspace tell the kernel to free parts of memory that it doesn't want
> > assigned to the guest, e.g. to free memory that the guest has converted to
> > not-private?
> 
> I'd guess one possibility could be fallocate(FALLOC_FL_PUNCH_HOLE).
> 
> Questions are: when would it actually be allowed to perform such a 
> destructive operation? Do we have to protect from that? How would KVM 
> protect from user space replacing private pages by shared pages in any 
> of the models we discuss?
> 

What do you mean?  If userspace maliciously replaces a shared page by a private page, then the guest crashes.

(The actual meaning here is a bit different on SNP-ES vs TDX.  In SNP-ES, a given GPA can be shared, private, or nonexistent.  A guest accesses it with a special bit set in the guest page tables to indicate whether it expects shared or private, and the CPU will produce an appropriate error if the bit doesn't match the page.  In TDX, there is actually an entirely separate shared vs private address space, and, in theory, a given "GPA" can exist as shared and as private at once.  The full guest n-bit GPA plus the shared/private bit is logically an N+1 bit address, and it's possible to map all of it at once, half shared, and half private.  In practice, the defined guest->host APIs don't really support that usage.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01  4:53     ` Andy Lutomirski
@ 2021-09-01  7:12       ` Tian, Kevin
  2021-09-01 10:24       ` Yu Zhang
  1 sibling, 0 replies; 71+ messages in thread
From: Tian, Kevin @ 2021-09-01  7:12 UTC (permalink / raw)
  To: Lutomirski, Andy, Yu Zhang, David Hildenbrand
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Hansen, Dave

> From: Andy Lutomirski <luto@kernel.org>
> Sent: Wednesday, September 1, 2021 12:53 PM
> 
> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
> > On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
> 
> > Thanks a lot for this summary. A question about the requirement: do we or
> > do we not have plan to support assigned device to the protected VM?
> >
> > If yes. The fd based solution may need change the VFIO interface as well(
> > though the fake swap entry solution need mess with VFIO too). Because:
> >
> > 1> KVM uses VFIO when assigning devices into a VM.
> >
> > 2> Not knowing which GPA ranges may be used by the VM as DMA buffer,
> all
> > guest pages will have to be mapped in host IOMMU page table to host
> pages,
> > which are pinned during the whole life cycle fo the VM.
> >
> > 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU
> driver,
> > in vfio_dma_do_map().
> >
> > 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get
> the HPA
> > and pin the page.
> >
> > But if we are using fd based solution, not every GPA can have a HVA, thus
> > the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
> > doubt if VFIO can be modified to support this easily.
> >
> >
> 
> Do you mean assigning a normal device to a protected VM or a hypothetical
> protected-MMIO device?
> 
> If the former, it should work more or less like with a non-protected VM.
> mmap the VFIO device, set up a memslot, and use it.  I'm not sure whether
> anyone will actually do this, but it should be possible, at least in principle.
> Maybe someone will want to assign a NIC to a TDX guest.  An NVMe device
> with the understanding that the guest can't trust it wouldn't be entirely crazy
> ether.
> 
> If the latter, AFAIK there is no spec for how it would work even in principle.
> Presumably it wouldn't work quite like VFIO -- instead, the kernel could have
> a protection-virtual-io-fd mechanism, and that fd could be bound to a
> memslot in whatever way we settle on for binding secure memory to a
> memslot.

FYI the iommu logic in VFIO is being refactored out into an unified /dev/iommu
framework [1]. Currently it plans to support the same DMA mapping semantics
as what VFIO provides today (HVA-based). in the future it could be extended
to support another mapping protocol which accepts fd+offset instead of HVA 
and then calls the helper function from whatever backing store which can help 
translate fd+offset to HPA instead of using GUP.

Thanks
Kevin

[1]https://lore.kernel.org/kvm/BN9PR11MB5433B1E4AE5B0480369F97178C189@BN9PR11MB5433.namprd11.prod.outlook.com/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01  4:58       ` Andy Lutomirski
@ 2021-09-01  7:49         ` David Hildenbrand
  0 siblings, 0 replies; 71+ messages in thread
From: David Hildenbrand @ 2021-09-01  7:49 UTC (permalink / raw)
  To: Andy Lutomirski, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On 01.09.21 06:58, Andy Lutomirski wrote:
> On Tue, Aug 31, 2021, at 12:07 PM, David Hildenbrand wrote:
>> On 28.08.21 00:18, Sean Christopherson wrote:
>>> On Thu, Aug 26, 2021, David Hildenbrand wrote:
>>>> You'll end up with a VMA that corresponds to the whole file in a single
>>>> process only, and that cannot vanish, not even in parts.
>>>
>>> How would userspace tell the kernel to free parts of memory that it doesn't want
>>> assigned to the guest, e.g. to free memory that the guest has converted to
>>> not-private?
>>
>> I'd guess one possibility could be fallocate(FALLOC_FL_PUNCH_HOLE).
>>
>> Questions are: when would it actually be allowed to perform such a
>> destructive operation? Do we have to protect from that? How would KVM
>> protect from user space replacing private pages by shared pages in any
>> of the models we discuss?
>>
> 
> What do you mean?  If userspace maliciously replaces a shared page by a private page, then the guest crashes.

Assume we have private pages in a fd and fallocate(FALLOC_FL_PUNCH_HOLE) 
random pages the guest is still using. If we "only" crash the guest, 
everything is fine.

> 
> (The actual meaning here is a bit different on SNP-ES vs TDX.  In SNP-ES, a given GPA can be shared, private, or nonexistent.  A guest accesses it with a special bit set in the guest page tables to indicate whether it expects shared or private, and the CPU will produce an appropriate error if the bit doesn't match the page.

Rings a bell, thanks for reminding me.

  In TDX, there is actually an entirely separate shared vs private 
address space, and, in theory, a given "GPA" can exist as shared and as 
private at once.  The full guest n-bit GPA plus the shared/private bit 
is logically an N+1 bit address, and it's possible to map all of it at 
once, half shared, and half private.  In practice, the defined 
guest->host APIs don't really support that usage.

Thanks, that explains a lot.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-31 20:45             ` Sean Christopherson
@ 2021-09-01  7:51               ` David Hildenbrand
  0 siblings, 0 replies; 71+ messages in thread
From: David Hildenbrand @ 2021-09-01  7:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On 31.08.21 22:45, Sean Christopherson wrote:
> On Tue, Aug 31, 2021, David Hildenbrand wrote:
>> On 28.08.21 00:28, Sean Christopherson wrote:
>>> On Fri, Aug 27, 2021, Andy Lutomirski wrote:
>>>>
>>>> On Thu, Aug 26, 2021, at 2:26 PM, David Hildenbrand wrote:
>>>>> On 26.08.21 19:05, Andy Lutomirski wrote:
>>>>
>>>>>> Oof.  That's quite a requirement.  What's the point of the VMA once all
>>>>>> this is done?
>>>>>
>>>>> You can keep using things like mbind(), madvise(), ... and the GUP code
>>>>> with a special flag might mostly just do what you want. You won't have
>>>>> to reinvent too many wheels on the page fault logic side at least.
>>>
>>> Ya, Kirill's RFC more or less proved a special GUP flag would indeed Just Work.
>>> However, the KVM page fault side of things would require only a handful of small
>>> changes to send private memslots down a different path.  Compared to the rest of
>>> the enabling, it's quite minor.
>>>
>>> The counter to that is other KVM architectures would need to learn how to use the
>>> new APIs, though I suspect that there will be a fair bit of arch enabling regardless
>>> of what route we take.
>>>
>>>> You can keep calling the functions.  The implementations working is a
>>>> different story: you can't just unmap (pte_numa-style or otherwise) a private
>>>> guest page to quiesce it, move it with memcpy(), and then fault it back in.
>>>
>>> Ya, I brought this up in my earlier reply.  Even the initial implementation (without
>>> real NUMA support) would likely be painful, e.g. the KVM TDX RFC/PoC adds dedicated
>>> logic in KVM to handle the case where NUMA balancing zaps a _pinned_ page and then
>>> KVM fault in the same pfn.  It's not thaaat ugly, but it's arguably more invasive
>>> to KVM's page fault flows than a new fd-based private memslot scheme.
>>
>> I might have a different mindset, but less code churn doesn't necessarily
>> translate to "better approach".
> 
> I wasn't referring to code churn.  By "invasive" I mean number of touchpoints in
> KVM as well as the nature of the touchpoints.  E.g. poking into how KVM uses
> available bits in its shadow PTEs and adding multiple checks through KVM's page
> fault handler, versus two callbacks to get the PFN and page size.
> 
>> I'm certainly not pushing for what I proposed (it's a rough, broken sketch).
>> I'm much rather trying to come up with alternatives that try solving the
>> same issue, handling the identified requirements.
>>
>> I have a gut feeling that the list of requirements might not be complete
>> yet. For example, I wonder if we have to protect against user space
>> replacing private pages by shared pages or punishing random holes into the
>> encrypted memory fd.
> 
> Replacing a private page with a shared page for a given GFN is very much a
> requirement as it's expected behavior for all VMM+guests when converting guest
> memory between shared and private.
> 
> Punching holes is a sort of optional requirement.  It's a "requirement" in that
> it's allowed if the backing store supports such a behavior, optional in that
> support wouldn't be strictly necessary and/or could come with constraints.  The
> expected use case is that host userspace would punch a hole to free unreachable
> private memory, e.g. after the corresponding GFN(s) is converted to shared, so
> that it doesn't consume 2x memory for the guest.
> 

Okay, that matches my understanding then. I was rather thinking about 
"what happens if we punch a hole where private memory was not converted 
to shared yet". AFAIU, we will simply crash the guest then.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-31 21:54       ` Sean Christopherson
@ 2021-09-01  8:09         ` David Hildenbrand
  2021-09-01 15:54           ` Andy Lutomirski
  0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-09-01  8:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, Dave Hansen, Yu Zhang

>> Do we have to protect from that? How would KVM protect from user space
>> replacing private pages by shared pages in any of the models we discuss?
> 
> The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*]
> as both private and shared, where "shared" also incorporates any mapping from the
> host.  Essentially it boils down to the kernel ensuring that a pfn is unmapped
> before it's converted to/from private, and KVM ensuring that it honors any
> unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback
> as proposed in this RFC.

Okay, so the fallocate(PUNCHHOLE) from user space could trigger the 
respective unmapping and freeing of backing storage.

> 
> As it pertains to PUNCH_HOLE, the responsibilities are no different than when the
> backing-store is destroyed; the backing-store needs to notify downstream MMUs
> (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory.

Right.

> 
> [*] Whether or not the kernel's direct mapping needs to be removed is debatable,
>      but my argument is that that behavior is not visible to userspace and thus
>      out of scope for this discussion, e.g. zapping/restoring the direct map can
>      be added/removed without impacting the userspace ABI.

Right. Removing it shouldn't also be requited IMHO. There are other ways 
to teach the kernel to not read/write some online pages (filter 
/proc/kcore, disable hibernation, strict access checks for /dev/mem ...).

> 
>>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory
>>>> slots.  Inside KVM, bail out if you encounter such a VMA inside a normal
>>>> user memory slot. When creating a "encryped" user memory slot, require that
>>>> the whole VMA is covered at creation time. You know the VMA can't change
>>>> later.
>>>
>>> This can work for the basic use cases, but even then I'd strongly prefer not to
>>> tie memslot correctness to the VMAs.  KVM doesn't truly care what lies behind
>>> the virtual address of a memslot, and when it does care, it tends to do poorly,
>>> e.g. see the whole PFNMAP snafu.  KVM cares about the pfn<->gfn mappings, and
>>> that's reflected in the infrastructure.  E.g. KVM relies on the mmu_notifiers
>>> to handle mprotect()/munmap()/etc...
>>
>> Right, and for the existing use cases this worked. But encrypted memory
>> breaks many assumptions we once made ...
>>
>> I have somewhat mixed feelings about pages that are mapped into $WHATEVER
>> page tables but not actually mapped into user space page tables. There is no
>> way to reach these via the rmap.
>>
>> We have something like that already via vfio. And that is fundamentally
>> broken when it comes to mmu notifiers, page pinning, page migration, ...
> 
> I'm not super familiar with VFIO internals, but the idea with the fd-based
> approach is that the backing-store would be in direct communication with KVM and
> would handle those operations through that direct channel.

Right. The problem I am seeing is that e.g., try_to_unmap() might not be 
able to actually fully unmap a page, because some non-synchronized KVM 
MMU still maps a page. It would be great to evaluate how the fd 
callbacks would fit into the whole picture, including the current rmap.

I guess I'm missing the bigger picture how it all fits together on the 
!KVM side.

> 
>>> As is, I don't think KVM would get any kind of notification if userpaces unmaps
>>> the VMA for a private memslot that does not have any entries in the host page
>>> tables.   I'm sure it's a solvable problem, e.g. by ensuring at least one page
>>> is touched by the backing store, but I don't think the end result would be any
>>> prettier than a dedicated API for KVM to consume.
>>>
>>> Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight
>>> to page migration or swap.  For those types of operations, KVM currently just
>>> reacts to invalidation notifications by zapping guest PTEs, and then gets the
>>> new pfn when the guest re-faults on the page.  That sequence doesn't work for
>>> TDX or SEV-SNP because the trusteday agent needs to do the memcpy() of the page
>>> contents, i.e. the host needs to call into KVM for the actual migration.
>>
>> Right, but I still think this is a kernel internal. You can do such
>> handshake later in the kernel IMHO.
> 
> It is kernel internal, but AFAICT it will be ugly because KVM "needs" to do the
> migration and that would invert the mmu_notifer API, e.g. instead of "telling"
> secondary MMUs to invalidate/change a mappings, the mm would be "asking"
> secondary MMus "can you move this?".  More below.

In my thinking, the the rmap via mmu notifiers would do the unmapping 
just as we know it (from primary MMU -> secondary MMU). Once 
try_to_unmap() succeeded, the fd provider could kick-off the migration 
via whatever callback.

> 
>> But I also already thought: is it really KVM that is to perform the
>> migration or is it the fd-provider that performs the migration? Who says
>> memfd_encrypted() doesn't default to a TDX "backend" on Intel CPUs that just
>> knows how to migrate such a page?
>>
>> I'd love to have some details on how that's supposed to work, and which
>> information we'd need to migrate/swap/... in addition to the EPFN and a new
>> SPFN.
> 
> KVM "needs" to do the migration.  On TDX, the migration will be a SEAMCALL,
> a post-VMXON instruction that transfers control to the TDX-Module, that at
> minimum needs a per-VM identifier, the gfn, and the page table level.  The call

The per-VM identifier and the GFN would be easy to grab. Page table 
level, not so sure -- do you mean the general page table depth? Or if 
it's mapped as 4k vs. 2M ... ? The latter could be answered by the fd 
provider already I assume.

Does the page still have to be mapped into the secondary MMU when 
performing the migration via TDX? I assume not, which would simplify 
things a lot.

> into the TDX-Module would also need to take a KVM lock (probably KVM's mmu_lock)
> to satisfy TDX's concurrency requirement, e.g. to avoid "spurious" errors due to
> the backing-store attempting to migrate memory that KVM is unmapping due to a
> memslot change.

Something like that might be handled by fixing private memory slots 
similar to in my draft, right?

> 
> The per-VM identifier may not apply to SEV-SNP, but I believe everything else
> holds true.

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01  4:53     ` Andy Lutomirski
  2021-09-01  7:12       ` Tian, Kevin
@ 2021-09-01 10:24       ` Yu Zhang
  2021-09-01 16:07         ` Andy Lutomirski
  1 sibling, 1 reply; 71+ messages in thread
From: Yu Zhang @ 2021-09-01 10:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Hildenbrand, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen

On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:
> 
> 
> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
> > On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
> 
> > Thanks a lot for this summary. A question about the requirement: do we or
> > do we not have plan to support assigned device to the protected VM?
> > 
> > If yes. The fd based solution may need change the VFIO interface as well(
> > though the fake swap entry solution need mess with VFIO too). Because:
> > 
> > 1> KVM uses VFIO when assigning devices into a VM.
> > 
> > 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
> > guest pages will have to be mapped in host IOMMU page table to host pages,
> > which are pinned during the whole life cycle fo the VM.
> > 
> > 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
> > in vfio_dma_do_map().
> > 
> > 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
> > and pin the page. 
> > 
> > But if we are using fd based solution, not every GPA can have a HVA, thus
> > the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
> > doubt if VFIO can be modified to support this easily.
> > 
> > 
> 
> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device?
> 
> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it.  I'm not sure whether anyone will actually do this, but it should be possible, at least in principle.  Maybe someone will want to assign a NIC to a TDX guest.  An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether.
> 
> If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.
> 

Thanks Andy. I was asking the first scenario.

Well, I agree it is doable if someone really want some assigned
device in TD guest. As Kevin mentioned in his reply, HPA can be
generated, by extending VFIO with a new mapping protocol which
uses fd+offset, instead of HVA. 

Another issue is current TDX does not support DMA encryption, and
only shared GPA memory shall be mapped in the VT-d. So to support
this, KVM may need to work with VFIO to dynamically program host
IOPT(IOMMU Page Table) when TD guest notifies a shared GFN range(e.g.,
with a MAP_GPA TDVMCALL), instead of prepopulating the IOPT at VM
creation time, by mapping entire GFN ranges of a guest.

So my inclination would be to just disallow using of VFIO device in
TDX first, until we have real requirement(with above enabling work
finished). 

B.R.
Yu

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01  8:09         ` David Hildenbrand
@ 2021-09-01 15:54           ` Andy Lutomirski
  2021-09-01 16:16             ` David Hildenbrand
  2021-09-01 16:18             ` James Bottomley
  0 siblings, 2 replies; 71+ messages in thread
From: Andy Lutomirski @ 2021-09-01 15:54 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On Wed, Sep 1, 2021, at 1:09 AM, David Hildenbrand wrote:
> >> Do we have to protect from that? How would KVM protect from user space
> >> replacing private pages by shared pages in any of the models we discuss?
> > 
> > The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*]
> > as both private and shared, where "shared" also incorporates any mapping from the
> > host.  Essentially it boils down to the kernel ensuring that a pfn is unmapped
> > before it's converted to/from private, and KVM ensuring that it honors any
> > unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback
> > as proposed in this RFC.
> 
> Okay, so the fallocate(PUNCHHOLE) from user space could trigger the 
> respective unmapping and freeing of backing storage.
> 
> > 
> > As it pertains to PUNCH_HOLE, the responsibilities are no different than when the
> > backing-store is destroyed; the backing-store needs to notify downstream MMUs
> > (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory.
> 
> Right.
> 
> > 
> > [*] Whether or not the kernel's direct mapping needs to be removed is debatable,
> >      but my argument is that that behavior is not visible to userspace and thus
> >      out of scope for this discussion, e.g. zapping/restoring the direct map can
> >      be added/removed without impacting the userspace ABI.
> 
> Right. Removing it shouldn't also be requited IMHO. There are other ways 
> to teach the kernel to not read/write some online pages (filter 
> /proc/kcore, disable hibernation, strict access checks for /dev/mem ...).
> 
> > 
> >>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory
> >>>> slots.  Inside KVM, bail out if you encounter such a VMA inside a normal
> >>>> user memory slot. When creating a "encryped" user memory slot, require that
> >>>> the whole VMA is covered at creation time. You know the VMA can't change
> >>>> later.
> >>>
> >>> This can work for the basic use cases, but even then I'd strongly prefer not to
> >>> tie memslot correctness to the VMAs.  KVM doesn't truly care what lies behind
> >>> the virtual address of a memslot, and when it does care, it tends to do poorly,
> >>> e.g. see the whole PFNMAP snafu.  KVM cares about the pfn<->gfn mappings, and
> >>> that's reflected in the infrastructure.  E.g. KVM relies on the mmu_notifiers
> >>> to handle mprotect()/munmap()/etc...
> >>
> >> Right, and for the existing use cases this worked. But encrypted memory
> >> breaks many assumptions we once made ...
> >>
> >> I have somewhat mixed feelings about pages that are mapped into $WHATEVER
> >> page tables but not actually mapped into user space page tables. There is no
> >> way to reach these via the rmap.
> >>
> >> We have something like that already via vfio. And that is fundamentally
> >> broken when it comes to mmu notifiers, page pinning, page migration, ...
> > 
> > I'm not super familiar with VFIO internals, but the idea with the fd-based
> > approach is that the backing-store would be in direct communication with KVM and
> > would handle those operations through that direct channel.
> 
> Right. The problem I am seeing is that e.g., try_to_unmap() might not be 
> able to actually fully unmap a page, because some non-synchronized KVM 
> MMU still maps a page. It would be great to evaluate how the fd 
> callbacks would fit into the whole picture, including the current rmap.
> 
> I guess I'm missing the bigger picture how it all fits together on the 
> !KVM side.

The big picture is fundamentally a bit nasty.  Logically (ignoring the implementation details of rmap, mmu_notifier, etc), you can call try_to_unmap and end up with a page that is Just A Bunch Of Bytes (tm).  Then you can write it to disk, memcpy to another node, compress it, etc. When it gets faulted back in, you make sure the same bytes end up somewhere and put the PTEs back.

With guest-private memory, this doesn't work.  Forget about the implementation: you simply can't take a page of private memory, quiesce it so the guest can't access it without faulting, and turn it into Just A Bunch Of Bytes.  TDX does not have that capability.  (Any system with integrity-protected memory won't without having >PAGE_SIZE bytes or otherwise storing metadata, but TDX can't do this at all.)  SEV-ES *can* (I think -- I asked the lead architect), but that's not the same thing as saying it's a good idea.

So if you want to migrate a TDX page from one NUMA node to another, you need to do something different (I don't know all the details), you will have to ask Intel to explain how this might work in the future (it wasn't in the public docs last time I looked), but I'm fairly confident that it does not resemble try_to_unmap().

Even on SEV, where a page *can* be transformed into a Just A Bunch Of Bytes, the operation doesn't really look like try_to_unmap().  As I understand it, it's more of:

look up the one-and-only guest and GPA at which this page is mapped.
tear down the NPT PTE.  (SEV, unlike TDX, uses paging structures in normal memory.)
Ask the magic SEV mechanism to turn the page into swappable data
Go to town.

This doesn't really resemble the current rmap + mmu_notifier code, and shoehorning this into mmu_notifier seems like it may be uglier and more code than implementing it more directly in the backing store.

If you actually just try_to_unmap() a SEV-ES page (and it worked, which it currently won't), you will have data corruption or cache incoherency.

If you want to swap a page on TDX, you can't.  Sorry, go directly to jail, do not collect $200.

So I think there are literally zero code paths that currently call try_to_unmap() that will actually work like that on TDX.  If we run out of memory on a TDX host, we can kill the guest completely and reclaim all of its memory (which probably also involves killing QEMU or whatever other user program is in charge), but that's really our only option.

--Andy

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 10:24       ` Yu Zhang
@ 2021-09-01 16:07         ` Andy Lutomirski
  2021-09-01 16:27           ` David Hildenbrand
                             ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Andy Lutomirski @ 2021-09-01 16:07 UTC (permalink / raw)
  To: Yu Zhang
  Cc: David Hildenbrand, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen

On 9/1/21 3:24 AM, Yu Zhang wrote:
> On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:
>>
>>
>> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
>>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
>>
>>> Thanks a lot for this summary. A question about the requirement: do we or
>>> do we not have plan to support assigned device to the protected VM?
>>>
>>> If yes. The fd based solution may need change the VFIO interface as well(
>>> though the fake swap entry solution need mess with VFIO too). Because:
>>>
>>> 1> KVM uses VFIO when assigning devices into a VM.
>>>
>>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
>>> guest pages will have to be mapped in host IOMMU page table to host pages,
>>> which are pinned during the whole life cycle fo the VM.
>>>
>>> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
>>> in vfio_dma_do_map().
>>>
>>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
>>> and pin the page. 
>>>
>>> But if we are using fd based solution, not every GPA can have a HVA, thus
>>> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
>>> doubt if VFIO can be modified to support this easily.
>>>
>>>
>>
>> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device?
>>
>> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it.  I'm not sure whether anyone will actually do this, but it should be possible, at least in principle.  Maybe someone will want to assign a NIC to a TDX guest.  An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether.
>>
>> If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.
>>
> 
> Thanks Andy. I was asking the first scenario.
> 
> Well, I agree it is doable if someone really want some assigned
> device in TD guest. As Kevin mentioned in his reply, HPA can be
> generated, by extending VFIO with a new mapping protocol which
> uses fd+offset, instead of HVA. 

I'm confused.  I don't see why any new code is needed for this at all.
Every proposal I've seen for handling TDX memory continues to handle TDX
*shared* memory exactly like regular guest memory today.  The only
differences are that more hole punching will be needed, which will
require lightweight memslots (to have many of them), memslots with
holes, or mappings backing memslots with holes (which can be done with
munmap() on current kernels).

So you can literally just mmap a VFIO device and expect it to work,
exactly like it does right now.  Whether the guest will be willing to
use the device will depend on the guest security policy (all kinds of
patches about that are flying around), but if the guest tries to use it,
it really should just work.

> 
> Another issue is current TDX does not support DMA encryption, and
> only shared GPA memory shall be mapped in the VT-d. So to support
> this, KVM may need to work with VFIO to dynamically program host
> IOPT(IOMMU Page Table) when TD guest notifies a shared GFN range(e.g.,
> with a MAP_GPA TDVMCALL), instead of prepopulating the IOPT at VM
> creation time, by mapping entire GFN ranges of a guest.

Given that there is no encrypted DMA support, shouldn't the only IOMMU
mappings (real host-side IOMMU) that point at guest memory be for
non-encrypted DMA?  I don't see how this interacts at all.  If the guest
tries to MapGPA to turn a shared MMIO page into private, the host should
fail the hypercall because the operation makes no sense.

It is indeed the case that, with a TDX guest, MapGPA shared->private to
a page that was previously used for unencrypted DMA will need to avoid
having IOPT entries to the new private page, but even that doesn't seem
particularly bad.  The fd+special memslot proposal for private memory
means that shared *backing store* pages never actually transition
between shared and private without being completely freed.

As far as I can tell, the actual problem you're referring to is:

>>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
>>> guest pages will have to be mapped in host IOMMU page table to host
pages,
>>> which are pinned during the whole life cycle fo the VM.

In principle, you could actually initialize a TDX guest with all of its
memory shared and all of it mapped in the host IOMMU.  When a guest
turns some pages private, user code could punch a hole in the memslot,
allocate private memory at that address, but leave the shared backing
store in place and still mapped in the host IOMMU.  The result would be
that guest-initiated DMA to the previously shared address would actually
work but would hit pages that are invisible to the guest.  And a whole
bunch of memory would be waste, but the whole system should stll work.

Improving this may well need VFIO changes, but I think those would be
roughly the same pages needed to have VFIO DMA be filtered through a
guest-controlled virtual IOMMU and wouldn't have anything to do with
private memory per se.

--Andy

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 15:54           ` Andy Lutomirski
@ 2021-09-01 16:16             ` David Hildenbrand
  2021-09-01 17:09               ` Andy Lutomirski
  2021-09-01 16:18             ` James Bottomley
  1 sibling, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-09-01 16:16 UTC (permalink / raw)
  To: Andy Lutomirski, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On 01.09.21 17:54, Andy Lutomirski wrote:
> On Wed, Sep 1, 2021, at 1:09 AM, David Hildenbrand wrote:
>>>> Do we have to protect from that? How would KVM protect from user space
>>>> replacing private pages by shared pages in any of the models we discuss?
>>>
>>> The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*]
>>> as both private and shared, where "shared" also incorporates any mapping from the
>>> host.  Essentially it boils down to the kernel ensuring that a pfn is unmapped
>>> before it's converted to/from private, and KVM ensuring that it honors any
>>> unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback
>>> as proposed in this RFC.
>>
>> Okay, so the fallocate(PUNCHHOLE) from user space could trigger the
>> respective unmapping and freeing of backing storage.
>>
>>>
>>> As it pertains to PUNCH_HOLE, the responsibilities are no different than when the
>>> backing-store is destroyed; the backing-store needs to notify downstream MMUs
>>> (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory.
>>
>> Right.
>>
>>>
>>> [*] Whether or not the kernel's direct mapping needs to be removed is debatable,
>>>       but my argument is that that behavior is not visible to userspace and thus
>>>       out of scope for this discussion, e.g. zapping/restoring the direct map can
>>>       be added/removed without impacting the userspace ABI.
>>
>> Right. Removing it shouldn't also be requited IMHO. There are other ways
>> to teach the kernel to not read/write some online pages (filter
>> /proc/kcore, disable hibernation, strict access checks for /dev/mem ...).
>>
>>>
>>>>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory
>>>>>> slots.  Inside KVM, bail out if you encounter such a VMA inside a normal
>>>>>> user memory slot. When creating a "encryped" user memory slot, require that
>>>>>> the whole VMA is covered at creation time. You know the VMA can't change
>>>>>> later.
>>>>>
>>>>> This can work for the basic use cases, but even then I'd strongly prefer not to
>>>>> tie memslot correctness to the VMAs.  KVM doesn't truly care what lies behind
>>>>> the virtual address of a memslot, and when it does care, it tends to do poorly,
>>>>> e.g. see the whole PFNMAP snafu.  KVM cares about the pfn<->gfn mappings, and
>>>>> that's reflected in the infrastructure.  E.g. KVM relies on the mmu_notifiers
>>>>> to handle mprotect()/munmap()/etc...
>>>>
>>>> Right, and for the existing use cases this worked. But encrypted memory
>>>> breaks many assumptions we once made ...
>>>>
>>>> I have somewhat mixed feelings about pages that are mapped into $WHATEVER
>>>> page tables but not actually mapped into user space page tables. There is no
>>>> way to reach these via the rmap.
>>>>
>>>> We have something like that already via vfio. And that is fundamentally
>>>> broken when it comes to mmu notifiers, page pinning, page migration, ...
>>>
>>> I'm not super familiar with VFIO internals, but the idea with the fd-based
>>> approach is that the backing-store would be in direct communication with KVM and
>>> would handle those operations through that direct channel.
>>
>> Right. The problem I am seeing is that e.g., try_to_unmap() might not be
>> able to actually fully unmap a page, because some non-synchronized KVM
>> MMU still maps a page. It would be great to evaluate how the fd
>> callbacks would fit into the whole picture, including the current rmap.
>>
>> I guess I'm missing the bigger picture how it all fits together on the
>> !KVM side.
> 
> The big picture is fundamentally a bit nasty.  Logically (ignoring the implementation details of rmap, mmu_notifier, etc), you can call try_to_unmap and end up with a page that is Just A Bunch Of Bytes (tm).  Then you can write it to disk, memcpy to another node, compress it, etc. When it gets faulted back in, you make sure the same bytes end up somewhere and put the PTEs back.
> 
> With guest-private memory, this doesn't work.  Forget about the implementation: you simply can't take a page of private memory, quiesce it so the guest can't access it without faulting, and turn it into Just A Bunch Of Bytes.  TDX does not have that capability.  (Any system with integrity-protected memory won't without having >PAGE_SIZE bytes or otherwise storing metadata, but TDX can't do this at all.)  SEV-ES *can* (I think -- I asked the lead architect), but that's not the same thing as saying it's a good idea.
> 
> So if you want to migrate a TDX page from one NUMA node to another, you need to do something different (I don't know all the details), you will have to ask Intel to explain how this might work in the future (it wasn't in the public docs last time I looked), but I'm fairly confident that it does not resemble try_to_unmap().
> 
> Even on SEV, where a page *can* be transformed into a Just A Bunch Of Bytes, the operation doesn't really look like try_to_unmap().  As I understand it, it's more of:
> 
> look up the one-and-only guest and GPA at which this page is mapped.
> tear down the NPT PTE.  (SEV, unlike TDX, uses paging structures in normal memory.)
> Ask the magic SEV mechanism to turn the page into swappable data
> Go to town.
> 
> This doesn't really resemble the current rmap + mmu_notifier code, and shoehorning this into mmu_notifier seems like it may be uglier and more code than implementing it more directly in the backing store.
> 
> If you actually just try_to_unmap() a SEV-ES page (and it worked, which it currently won't), you will have data corruption or cache incoherency.
> 
> If you want to swap a page on TDX, you can't.  Sorry, go directly to jail, do not collect $200.
> 
> So I think there are literally zero code paths that currently call try_to_unmap() that will actually work like that on TDX.  If we run out of memory on a TDX host, we can kill the guest completely and reclaim all of its memory (which probably also involves killing QEMU or whatever other user program is in charge), but that's really our only option.

try_to_unmap() would actually do the right thing I think. It's the users 
that access page content (migration, swapping) that need additional care 
to special-case these pages. Completely agree that these would be broken.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 15:54           ` Andy Lutomirski
  2021-09-01 16:16             ` David Hildenbrand
@ 2021-09-01 16:18             ` James Bottomley
  2021-09-01 16:22               ` David Hildenbrand
                                 ` (4 more replies)
  1 sibling, 5 replies; 71+ messages in thread
From: James Bottomley @ 2021-09-01 16:18 UTC (permalink / raw)
  To: Andy Lutomirski, David Hildenbrand, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote:
[...]
> If you want to swap a page on TDX, you can't.  Sorry, go directly to
> jail, do not collect $200.

Actually, even on SEV-ES you can't either.  You can read the encrypted
page and write it out if you want, but unless you swap it back to the
exact same physical memory location, the encryption key won't work. 
Since we don't guarantee this for swap, I think swap won't actually
work for any confidential computing environment.

> So I think there are literally zero code paths that currently call
> try_to_unmap() that will actually work like that on TDX.  If we run
> out of memory on a TDX host, we can kill the guest completely and
> reclaim all of its memory (which probably also involves killing QEMU
> or whatever other user program is in charge), but that's really our
> only option.

I think our only option for swap is guest co-operation.  We're going to
have to inflate a balloon or something in the guest and have the guest
driver do some type of bounce of the page, where it becomes an
unencrypted page in the guest (so the host can read it without the
physical address keying of the encryption getting in the way) but
actually encrypted with a swap transfer key known only to the guest.  I
assume we can use the page acceptance infrastructure currently being
discussed elsewhere to do swap back in as well ... the host provides
the guest with the encrypted swap page and the guest has to decrypt it
and place it in encrypted guest memory.

That way the swapped memory is securely encrypted, but can be swapped
back in.

James



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:18             ` James Bottomley
@ 2021-09-01 16:22               ` David Hildenbrand
  2021-09-01 16:31                 ` James Bottomley
  2021-09-01 17:08               ` Andy Lutomirski
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-09-01 16:22 UTC (permalink / raw)
  To: jejb, Andy Lutomirski, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On 01.09.21 18:18, James Bottomley wrote:
> On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote:
> [...]
>> If you want to swap a page on TDX, you can't.  Sorry, go directly to
>> jail, do not collect $200.
> 
> Actually, even on SEV-ES you can't either.  You can read the encrypted
> page and write it out if you want, but unless you swap it back to the
> exact same physical memory location, the encryption key won't work.
> Since we don't guarantee this for swap, I think swap won't actually
> work for any confidential computing environment.
> 
>> So I think there are literally zero code paths that currently call
>> try_to_unmap() that will actually work like that on TDX.  If we run
>> out of memory on a TDX host, we can kill the guest completely and
>> reclaim all of its memory (which probably also involves killing QEMU
>> or whatever other user program is in charge), but that's really our
>> only option.
> 
> I think our only option for swap is guest co-operation.  We're going to
> have to inflate a balloon or something in the guest and have the guest
> driver do some type of bounce of the page, where it becomes an
> unencrypted page in the guest (so the host can read it without the
> physical address keying of the encryption getting in the way) but
> actually encrypted with a swap transfer key known only to the guest.  I
> assume we can use the page acceptance infrastructure currently being
> discussed elsewhere to do swap back in as well ... the host provides
> the guest with the encrypted swap page and the guest has to decrypt it
> and place it in encrypted guest memory.

Ballooning is indeed *the* mechanism to avoid swapping in the hypervisor 
and much rather let the guest swap. Shame it requires trusting a guest, 
which we, in general, can't. Not to mention other issues we already do 
have with ballooning (latency, broken auto-ballooning, over-inflating, ...).

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:07         ` Andy Lutomirski
@ 2021-09-01 16:27           ` David Hildenbrand
  2021-09-02  8:34             ` Yu Zhang
  2021-09-02  8:19           ` Yu Zhang
  2021-09-02  9:27           ` Joerg Roedel
  2 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-09-01 16:27 UTC (permalink / raw)
  To: Andy Lutomirski, Yu Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen

On 01.09.21 18:07, Andy Lutomirski wrote:
> On 9/1/21 3:24 AM, Yu Zhang wrote:
>> On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:
>>>
>>>
>>> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
>>>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
>>>
>>>> Thanks a lot for this summary. A question about the requirement: do we or
>>>> do we not have plan to support assigned device to the protected VM?
>>>>
>>>> If yes. The fd based solution may need change the VFIO interface as well(
>>>> though the fake swap entry solution need mess with VFIO too). Because:
>>>>
>>>> 1> KVM uses VFIO when assigning devices into a VM.
>>>>
>>>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
>>>> guest pages will have to be mapped in host IOMMU page table to host pages,
>>>> which are pinned during the whole life cycle fo the VM.
>>>>
>>>> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
>>>> in vfio_dma_do_map().
>>>>
>>>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
>>>> and pin the page.
>>>>
>>>> But if we are using fd based solution, not every GPA can have a HVA, thus
>>>> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
>>>> doubt if VFIO can be modified to support this easily.
>>>>
>>>>
>>>
>>> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device?
>>>
>>> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it.  I'm not sure whether anyone will actually do this, but it should be possible, at least in principle.  Maybe someone will want to assign a NIC to a TDX guest.  An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether.
>>>
>>> If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.
>>>
>>
>> Thanks Andy. I was asking the first scenario.
>>
>> Well, I agree it is doable if someone really want some assigned
>> device in TD guest. As Kevin mentioned in his reply, HPA can be
>> generated, by extending VFIO with a new mapping protocol which
>> uses fd+offset, instead of HVA.
> 
> I'm confused.  I don't see why any new code is needed for this at all.
> Every proposal I've seen for handling TDX memory continues to handle TDX
> *shared* memory exactly like regular guest memory today.  The only
> differences are that more hole punching will be needed, which will
> require lightweight memslots (to have many of them), memslots with
> holes, or mappings backing memslots with holes (which can be done with
> munmap() on current kernels).
> 
> So you can literally just mmap a VFIO device and expect it to work,
> exactly like it does right now.  Whether the guest will be willing to
> use the device will depend on the guest security policy (all kinds of
> patches about that are flying around), but if the guest tries to use it,
> it really should just work.

... but if you end up mapping private memory into IOMMU of the device 
and the device ends up accessing that memory, we're in the same position 
that the host might get capped, just like access from user space, no?

Sure, you can map only the complete duplicate shared-memory region into 
the IOMMU of the device, that would work. Shame vfio mostly always pins 
all guest memory and you essentially cannot punch holes into the shared 
memory anymore -- resulting in the worst case in a duplicate memory 
consumption for your VM.

So you'd actually want to map only the *currently* shared pieces into 
the IOMMU and update the mappings on demand. Having worked on something 
related, I can only say that 64k individual mappings, and not being able 
to modify existing mappings except completely deleting them to replace 
with something new (!atomic), can be quite an issue for bigger VMs.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:22               ` David Hildenbrand
@ 2021-09-01 16:31                 ` James Bottomley
  2021-09-01 16:37                   ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: James Bottomley @ 2021-09-01 16:31 UTC (permalink / raw)
  To: David Hildenbrand, Andy Lutomirski, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On Wed, 2021-09-01 at 18:22 +0200, David Hildenbrand wrote:
> On 01.09.21 18:18, James Bottomley wrote:
> > On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote:
> > [...]
> > > If you want to swap a page on TDX, you can't.  Sorry, go directly
> > > to jail, do not collect $200.
> > 
> > Actually, even on SEV-ES you can't either.  You can read the
> > encrypted page and write it out if you want, but unless you swap it
> > back to the exact same physical memory location, the encryption key
> > won't work.  Since we don't guarantee this for swap, I think swap
> > won't actually work for any confidential computing environment.
> > 
> > > So I think there are literally zero code paths that currently
> > > call try_to_unmap() that will actually work like that on TDX.  If
> > > we run out of memory on a TDX host, we can kill the guest
> > > completely and reclaim all of its memory (which probably also
> > > involves killing QEMU or whatever other user program is in
> > > charge), but that's really our only option.
> > 
> > I think our only option for swap is guest co-operation.  We're
> > going to have to inflate a balloon or something in the guest and
> > have the guest driver do some type of bounce of the page, where it
> > becomes an unencrypted page in the guest (so the host can read it
> > without the physical address keying of the encryption getting in
> > the way) but actually encrypted with a swap transfer key known only
> > to the guest.  I assume we can use the page acceptance
> > infrastructure currently being discussed elsewhere to do swap back
> > in as well ... the host provides the guest with the encrypted swap
> > page and the guest has to decrypt it and place it in encrypted
> > guest memory.
> 
> Ballooning is indeed *the* mechanism to avoid swapping in the
> hypervisor  and much rather let the guest swap. Shame it requires
> trusting a guest, which we, in general, can't. Not to mention other
> issues we already do have with ballooning (latency, broken auto-
> ballooning, over-inflating, ...).


Well not necessarily, but it depends how clever we want to get.  If you
look over on the OVMF/edk2 list, there's a proposal to do guest
migration via a mirror VM that invokes a co-routine embedded in the
OVMF binary:

https://patchew.org/EDK2/20210818212048.162626-1-tobin@linux.ibm.com/

This gives us a page encryption mechanism that's provided by the host
but accepted via the guest using attestation, meaning we have a
mutually trusted piece of code that can use to extract encrypted pages.
It does seem it could be enhanced to do swapping for us as well if
that's a road we want to go down?

James



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:31                 ` James Bottomley
@ 2021-09-01 16:37                   ` David Hildenbrand
  2021-09-01 16:45                     ` James Bottomley
  0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-09-01 16:37 UTC (permalink / raw)
  To: jejb, Andy Lutomirski, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On 01.09.21 18:31, James Bottomley wrote:
> On Wed, 2021-09-01 at 18:22 +0200, David Hildenbrand wrote:
>> On 01.09.21 18:18, James Bottomley wrote:
>>> On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote:
>>> [...]
>>>> If you want to swap a page on TDX, you can't.  Sorry, go directly
>>>> to jail, do not collect $200.
>>>
>>> Actually, even on SEV-ES you can't either.  You can read the
>>> encrypted page and write it out if you want, but unless you swap it
>>> back to the exact same physical memory location, the encryption key
>>> won't work.  Since we don't guarantee this for swap, I think swap
>>> won't actually work for any confidential computing environment.
>>>
>>>> So I think there are literally zero code paths that currently
>>>> call try_to_unmap() that will actually work like that on TDX.  If
>>>> we run out of memory on a TDX host, we can kill the guest
>>>> completely and reclaim all of its memory (which probably also
>>>> involves killing QEMU or whatever other user program is in
>>>> charge), but that's really our only option.
>>>
>>> I think our only option for swap is guest co-operation.  We're
>>> going to have to inflate a balloon or something in the guest and
>>> have the guest driver do some type of bounce of the page, where it
>>> becomes an unencrypted page in the guest (so the host can read it
>>> without the physical address keying of the encryption getting in
>>> the way) but actually encrypted with a swap transfer key known only
>>> to the guest.  I assume we can use the page acceptance
>>> infrastructure currently being discussed elsewhere to do swap back
>>> in as well ... the host provides the guest with the encrypted swap
>>> page and the guest has to decrypt it and place it in encrypted
>>> guest memory.
>>
>> Ballooning is indeed *the* mechanism to avoid swapping in the
>> hypervisor  and much rather let the guest swap. Shame it requires
>> trusting a guest, which we, in general, can't. Not to mention other
>> issues we already do have with ballooning (latency, broken auto-
>> ballooning, over-inflating, ...).
> 
> 
> Well not necessarily, but it depends how clever we want to get.  If you
> look over on the OVMF/edk2 list, there's a proposal to do guest
> migration via a mirror VM that invokes a co-routine embedded in the
> OVMF binary:

Yes, I heard of that. "Interesting" design.

> 
> https://patchew.org/EDK2/20210818212048.162626-1-tobin@linux.ibm.com/
> 
> This gives us a page encryption mechanism that's provided by the host
> but accepted via the guest using attestation, meaning we have a
> mutually trusted piece of code that can use to extract encrypted pages.
> It does seem it could be enhanced to do swapping for us as well if
> that's a road we want to go down?

Right, but that's than no longer ballooning, unless I am missing 
something important. You'd ask the guest to export/import, and you can 
trust it. But do we want to call something like that out of random 
kernel context when swapping/writeback, ...? Hard to tell. Feels like it 
won't win in a beauty contest.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:37                   ` David Hildenbrand
@ 2021-09-01 16:45                     ` James Bottomley
  2021-09-01 17:08                       ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: James Bottomley @ 2021-09-01 16:45 UTC (permalink / raw)
  To: David Hildenbrand, Andy Lutomirski, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On Wed, 2021-09-01 at 18:37 +0200, David Hildenbrand wrote:
> On 01.09.21 18:31, James Bottomley wrote:
> > On Wed, 2021-09-01 at 18:22 +0200, David Hildenbrand wrote:
> > > On 01.09.21 18:18, James Bottomley wrote:
> > > > On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote:
> > > > [...]
> > > > > If you want to swap a page on TDX, you can't.  Sorry, go
> > > > > directly
> > > > > to jail, do not collect $200.
> > > > 
> > > > Actually, even on SEV-ES you can't either.  You can read the
> > > > encrypted page and write it out if you want, but unless you
> > > > swap it back to the exact same physical memory location, the
> > > > encryption key won't work.  Since we don't guarantee this for
> > > > swap, I think swap won't actually work for any confidential
> > > > computing environment.
> > > > 
> > > > > So I think there are literally zero code paths that currently
> > > > > call try_to_unmap() that will actually work like that on
> > > > > TDX.  If we run out of memory on a TDX host, we can kill the
> > > > > guest completely and reclaim all of its memory (which
> > > > > probably also involves killing QEMU or whatever other user
> > > > > program is in charge), but that's really our only option.
> > > > 
> > > > I think our only option for swap is guest co-operation.  We're
> > > > going to have to inflate a balloon or something in the guest
> > > > and have the guest driver do some type of bounce of the page,
> > > > where it becomes an unencrypted page in the guest (so the host
> > > > can read it without the physical address keying of the
> > > > encryption getting in the way) but actually encrypted with a
> > > > swap transfer key known only to the guest.  I assume we can use
> > > > the page acceptance infrastructure currently being discussed
> > > > elsewhere to do swap back in as well ... the host provides the
> > > > guest with the encrypted swap page and the guest has to decrypt
> > > > it and place it in encrypted guest memory.
> > > 
> > > Ballooning is indeed *the* mechanism to avoid swapping in the
> > > hypervisor  and much rather let the guest swap. Shame it requires
> > > trusting a guest, which we, in general, can't. Not to mention
> > > other issues we already do have with ballooning (latency, broken
> > > auto-ballooning, over-inflating, ...).
> > 
> > Well not necessarily, but it depends how clever we want to get.  If
> > you look over on the OVMF/edk2 list, there's a proposal to do guest
> > migration via a mirror VM that invokes a co-routine embedded in the
> > OVMF binary:
> 
> Yes, I heard of that. "Interesting" design.

Heh, well what other suggestion do you have?  The problem is there
needs to be code somewhere to perform some operations that's trusted by
both the guest and the host.  The only element for a confidential VM
that has this shared trust is the OVMF firmware, so it seems logical to
use it.

> 
> > https://patchew.org/EDK2/20210818212048.162626-1-tobin@linux.ibm.com/
> > 
> > This gives us a page encryption mechanism that's provided by the
> > host but accepted via the guest using attestation, meaning we have
> > a mutually trusted piece of code that can use to extract encrypted
> > pages. It does seem it could be enhanced to do swapping for us as
> > well if that's a road we want to go down?
> 
> Right, but that's than no longer ballooning, unless I am missing 
> something important. You'd ask the guest to export/import, and you
> can trust it. But do we want to call something like that out of
> random kernel context when swapping/writeback, ...? Hard to tell.
> Feels like it won't win in a beauty contest.

What I was thinking is that OVMF can emulate devices in this trusted
code ... another potential use for it is a trusted vTPM for SEV-SNP so
we can do measured boot.  To use it we'd give the guest kernel some
type of virtual swap driver that attaches to this OVMF device.  I
suppose by the time we've done this, it really does look like a
balloon, but I'd like to think of it more as a paravirt memory
controller since it might be used to make a guest more co-operative in
a host overcommit situation.

That's not to say we *should* do this, merely that it doesn't have to
look like a pig with lipstick.

James



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:18             ` James Bottomley
  2021-09-01 16:22               ` David Hildenbrand
@ 2021-09-01 17:08               ` Andy Lutomirski
  2021-09-01 17:13                 ` James Bottomley
  2021-09-02 10:18                 ` Joerg Roedel
  2021-09-01 18:24               ` Andy Lutomirski
                                 ` (2 subsequent siblings)
  4 siblings, 2 replies; 71+ messages in thread
From: Andy Lutomirski @ 2021-09-01 17:08 UTC (permalink / raw)
  To: James Bottomley, David Hildenbrand, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang



On Wed, Sep 1, 2021, at 9:18 AM, James Bottomley wrote:
> On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote:
> [...]
> > If you want to swap a page on TDX, you can't.  Sorry, go directly to
> > jail, do not collect $200.
> 
> Actually, even on SEV-ES you can't either.  You can read the encrypted
> page and write it out if you want, but unless you swap it back to the
> exact same physical memory location, the encryption key won't work. 
> Since we don't guarantee this for swap, I think swap won't actually
> work for any confidential computing environment.
> 
> > So I think there are literally zero code paths that currently call
> > try_to_unmap() that will actually work like that on TDX.  If we run
> > out of memory on a TDX host, we can kill the guest completely and
> > reclaim all of its memory (which probably also involves killing QEMU
> > or whatever other user program is in charge), but that's really our
> > only option.
> 
> I think our only option for swap is guest co-operation.  We're going to
> have to inflate a balloon or something in the guest and have the guest
> driver do some type of bounce of the page, where it becomes an
> unencrypted page in the guest (so the host can read it without the
> physical address keying of the encryption getting in the way) but
> actually encrypted with a swap transfer key known only to the guest.  I
> assume we can use the page acceptance infrastructure currently being
> discussed elsewhere to do swap back in as well ... the host provides
> the guest with the encrypted swap page and the guest has to decrypt it
> and place it in encrypted guest memory.

I asked David, and he said the PSP offers a swapping mechanism for SEV-ES.  I haven’t read the details, but they should all be public.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:45                     ` James Bottomley
@ 2021-09-01 17:08                       ` David Hildenbrand
  2021-09-01 17:50                         ` Sean Christopherson
  0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-09-01 17:08 UTC (permalink / raw)
  To: jejb, Andy Lutomirski, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

>>> Well not necessarily, but it depends how clever we want to get.  If
>>> you look over on the OVMF/edk2 list, there's a proposal to do guest
>>> migration via a mirror VM that invokes a co-routine embedded in the
>>> OVMF binary:
>>
>> Yes, I heard of that. "Interesting" design.
> 
> Heh, well what other suggestion do you have?  The problem is there
> needs to be code somewhere to perform some operations that's trusted by
> both the guest and the host.  The only element for a confidential VM
> that has this shared trust is the OVMF firmware, so it seems logical to
> use it.

<offtopic>

Let me put it this way: I worked with another architecture that doesn't 
fault on access of a secure page, but instead automatically 
exports/encrypts it so it can be swapped. It doesn't send a MCE and 
kills the host. It doesn't require fancy code in the guest firmware to 
export a page.

The code runs in the ultravisor -- yes, I'm talking about s390x. Now, I 
am not an expert on all of the glory details of TDX, SEV, ... to say 
which attack surface they introduced with that design, and if it can't 
be mitigated. I can only assume that there are real reasons (e.g., 
supporting an ultravisor is problematic, patents? ;) ) why x86-64 is 
different.

So whenever I see something really complicated to work around such 
issues, it feels to me like a hardware/platform limitation is making our 
life hard and forces us to come up with such "interesting" designs.

Sure, it's logical in this context, but it feels like "The house doesn't 
have a door, so I'll have to climb through the window.". It gets the job 
done but isn't ideally what you'd want to have. If you understand what I 
am trying to say :)

</offtopic>

> 
>>
>>> https://patchew.org/EDK2/20210818212048.162626-1-tobin@linux.ibm.com/
>>>
>>> This gives us a page encryption mechanism that's provided by the
>>> host but accepted via the guest using attestation, meaning we have
>>> a mutually trusted piece of code that can use to extract encrypted
>>> pages. It does seem it could be enhanced to do swapping for us as
>>> well if that's a road we want to go down?
>>
>> Right, but that's than no longer ballooning, unless I am missing
>> something important. You'd ask the guest to export/import, and you
>> can trust it. But do we want to call something like that out of
>> random kernel context when swapping/writeback, ...? Hard to tell.
>> Feels like it won't win in a beauty contest.
> 
> What I was thinking is that OVMF can emulate devices in this trusted
> code ... another potential use for it is a trusted vTPM for SEV-SNP so
> we can do measured boot.  To use it we'd give the guest kernel some
> type of virtual swap driver that attaches to this OVMF device.  I
> suppose by the time we've done this, it really does look like a
> balloon, but I'd like to think of it more as a paravirt memory
> controller since it might be used to make a guest more co-operative in
> a host overcommit situation.
> 
> That's not to say we *should* do this, merely that it doesn't have to
> look like a pig with lipstick.

It's an interesting approach: it would essentially mean that the OVMF 
would swap pages out to some virtual device and then essentially 
"inflate" the pages like a balloon. Still, it doesn't sound like 
something you want to trigger from actual kernel context when actually 
swapping in the kernel. It would much rather be something like other 
balloon implementations: completely controlled by user space.

So yes, "doesn't look like a pig with lipstick", but still compared to 
proper in-kernel swapping, looks like a workaround.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:16             ` David Hildenbrand
@ 2021-09-01 17:09               ` Andy Lutomirski
  0 siblings, 0 replies; 71+ messages in thread
From: Andy Lutomirski @ 2021-09-01 17:09 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang



On Wed, Sep 1, 2021, at 9:16 AM, David Hildenbrand wrote:
> On 01.09.21 17:54, Andy Lutomirski wrote:
> > On Wed, Sep 1, 2021, at 1:09 AM, David Hildenbrand wrote:
> >>>> Do we have to protect from that? How would KVM protect from user space
> >>>> replacing private pages by shared pages in any of the models we discuss?
> >>>
> >>> The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*]
> >>> as both private and shared, where "shared" also incorporates any mapping from the
> >>> host.  Essentially it boils down to the kernel ensuring that a pfn is unmapped
> >>> before it's converted to/from private, and KVM ensuring that it honors any
> >>> unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback
> >>> as proposed in this RFC.
> >>
> >> Okay, so the fallocate(PUNCHHOLE) from user space could trigger the
> >> respective unmapping and freeing of backing storage.
> >>
> >>>
> >>> As it pertains to PUNCH_HOLE, the responsibilities are no different than when the
> >>> backing-store is destroyed; the backing-store needs to notify downstream MMUs
> >>> (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory.
> >>
> >> Right.
> >>
> >>>
> >>> [*] Whether or not the kernel's direct mapping needs to be removed is debatable,
> >>>       but my argument is that that behavior is not visible to userspace and thus
> >>>       out of scope for this discussion, e.g. zapping/restoring the direct map can
> >>>       be added/removed without impacting the userspace ABI.
> >>
> >> Right. Removing it shouldn't also be requited IMHO. There are other ways
> >> to teach the kernel to not read/write some online pages (filter
> >> /proc/kcore, disable hibernation, strict access checks for /dev/mem ...).
> >>
> >>>
> >>>>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory
> >>>>>> slots.  Inside KVM, bail out if you encounter such a VMA inside a normal
> >>>>>> user memory slot. When creating a "encryped" user memory slot, require that
> >>>>>> the whole VMA is covered at creation time. You know the VMA can't change
> >>>>>> later.
> >>>>>
> >>>>> This can work for the basic use cases, but even then I'd strongly prefer not to
> >>>>> tie memslot correctness to the VMAs.  KVM doesn't truly care what lies behind
> >>>>> the virtual address of a memslot, and when it does care, it tends to do poorly,
> >>>>> e.g. see the whole PFNMAP snafu.  KVM cares about the pfn<->gfn mappings, and
> >>>>> that's reflected in the infrastructure.  E.g. KVM relies on the mmu_notifiers
> >>>>> to handle mprotect()/munmap()/etc...
> >>>>
> >>>> Right, and for the existing use cases this worked. But encrypted memory
> >>>> breaks many assumptions we once made ...
> >>>>
> >>>> I have somewhat mixed feelings about pages that are mapped into $WHATEVER
> >>>> page tables but not actually mapped into user space page tables. There is no
> >>>> way to reach these via the rmap.
> >>>>
> >>>> We have something like that already via vfio. And that is fundamentally
> >>>> broken when it comes to mmu notifiers, page pinning, page migration, ...
> >>>
> >>> I'm not super familiar with VFIO internals, but the idea with the fd-based
> >>> approach is that the backing-store would be in direct communication with KVM and
> >>> would handle those operations through that direct channel.
> >>
> >> Right. The problem I am seeing is that e.g., try_to_unmap() might not be
> >> able to actually fully unmap a page, because some non-synchronized KVM
> >> MMU still maps a page. It would be great to evaluate how the fd
> >> callbacks would fit into the whole picture, including the current rmap.
> >>
> >> I guess I'm missing the bigger picture how it all fits together on the
> >> !KVM side.
> > 
> > The big picture is fundamentally a bit nasty.  Logically (ignoring the implementation details of rmap, mmu_notifier, etc), you can call try_to_unmap and end up with a page that is Just A Bunch Of Bytes (tm).  Then you can write it to disk, memcpy to another node, compress it, etc. When it gets faulted back in, you make sure the same bytes end up somewhere and put the PTEs back.
> > 
> > With guest-private memory, this doesn't work.  Forget about the implementation: you simply can't take a page of private memory, quiesce it so the guest can't access it without faulting, and turn it into Just A Bunch Of Bytes.  TDX does not have that capability.  (Any system with integrity-protected memory won't without having >PAGE_SIZE bytes or otherwise storing metadata, but TDX can't do this at all.)  SEV-ES *can* (I think -- I asked the lead architect), but that's not the same thing as saying it's a good idea.
> > 
> > So if you want to migrate a TDX page from one NUMA node to another, you need to do something different (I don't know all the details), you will have to ask Intel to explain how this might work in the future (it wasn't in the public docs last time I looked), but I'm fairly confident that it does not resemble try_to_unmap().
> > 
> > Even on SEV, where a page *can* be transformed into a Just A Bunch Of Bytes, the operation doesn't really look like try_to_unmap().  As I understand it, it's more of:
> > 
> > look up the one-and-only guest and GPA at which this page is mapped.
> > tear down the NPT PTE.  (SEV, unlike TDX, uses paging structures in normal memory.)
> > Ask the magic SEV mechanism to turn the page into swappable data
> > Go to town.
> > 
> > This doesn't really resemble the current rmap + mmu_notifier code, and shoehorning this into mmu_notifier seems like it may be uglier and more code than implementing it more directly in the backing store.
> > 
> > If you actually just try_to_unmap() a SEV-ES page (and it worked, which it currently won't), you will have data corruption or cache incoherency.
> > 
> > If you want to swap a page on TDX, you can't.  Sorry, go directly to jail, do not collect $200.
> > 
> > So I think there are literally zero code paths that currently call try_to_unmap() that will actually work like that on TDX.  If we run out of memory on a TDX host, we can kill the guest completely and reclaim all of its memory (which probably also involves killing QEMU or whatever other user program is in charge), but that's really our only option.
> 
> try_to_unmap() would actually do the right thing I think. It's the users 
> that access page content (migration, swapping) that need additional care 
> to special-case these pages. Completely agree that these would be broken.

I suspect that the eventual TDX numa migration flow will involve asking the TD module to relocate the page without unmapping it first. TDX doesn’t really have a nondestructive unmap operation.

> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 17:08               ` Andy Lutomirski
@ 2021-09-01 17:13                 ` James Bottomley
  2021-09-02 10:18                 ` Joerg Roedel
  1 sibling, 0 replies; 71+ messages in thread
From: James Bottomley @ 2021-09-01 17:13 UTC (permalink / raw)
  To: Andy Lutomirski, David Hildenbrand, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On Wed, 2021-09-01 at 10:08 -0700, Andy Lutomirski wrote:
> 
> On Wed, Sep 1, 2021, at 9:18 AM, James Bottomley wrote:
> > On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote:
> > [...]
> > > If you want to swap a page on TDX, you can't.  Sorry, go directly
> > > to jail, do not collect $200.
> > 
> > Actually, even on SEV-ES you can't either.  You can read the
> > encrypted page and write it out if you want, but unless you swap it
> > back to the exact same physical memory location, the encryption key
> > won't work.  Since we don't guarantee this for swap, I think swap
> > won't actually work for any confidential computing environment.
> > 
> > > So I think there are literally zero code paths that currently
> > > call try_to_unmap() that will actually work like that on TDX.  If
> > > we run out of memory on a TDX host, we can kill the guest
> > > completely and reclaim all of its memory (which probably also
> > > involves killing QEMU or whatever other user program is in
> > > charge), but that's really our only option.
> > 
> > I think our only option for swap is guest co-operation.  We're
> > going to have to inflate a balloon or something in the guest and
> > have the guest driver do some type of bounce of the page, where it
> > becomes an unencrypted page in the guest (so the host can read it
> > without the physical address keying of the encryption getting in
> > the way) but actually encrypted with a swap transfer key known only
> > to the guest.  I assume we can use the page acceptance
> > infrastructure currently being discussed elsewhere to do swap back
> > in as well ... the host provides the guest with the encrypted swap
> > page and the guest has to decrypt it and place it in encrypted
> > guest memory.
> 
> I asked David, and he said the PSP offers a swapping mechanism for
> SEV-ES.  I haven’t read the details, but they should all be public.

Well it does, but it's not useful: we can't use the PSP for bulk
encryption, it's too slow.  That's why we're having to fuss about fast
migration in the first place.  In theory the two PSPs can co-operate to
migrate a guest but only if you have about a year to wait for it to
happen.

James



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 17:08                       ` David Hildenbrand
@ 2021-09-01 17:50                         ` Sean Christopherson
  2021-09-01 17:53                           ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: Sean Christopherson @ 2021-09-01 17:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: jejb, Andy Lutomirski, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm list,
	Linux Kernel Mailing List, Borislav Petkov, Andrew Morton,
	Joerg Roedel, Andi Kleen, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On Wed, Sep 01, 2021, David Hildenbrand wrote:
> > > > Well not necessarily, but it depends how clever we want to get.  If
> > > > you look over on the OVMF/edk2 list, there's a proposal to do guest
> > > > migration via a mirror VM that invokes a co-routine embedded in the
> > > > OVMF binary:
> > > 
> > > Yes, I heard of that. "Interesting" design.
> > 
> > Heh, well what other suggestion do you have?  The problem is there
> > needs to be code somewhere to perform some operations that's trusted by
> > both the guest and the host.  The only element for a confidential VM
> > that has this shared trust is the OVMF firmware, so it seems logical to
> > use it.
> 
> <offtopic>
> 
> Let me put it this way: I worked with another architecture that doesn't
> fault on access of a secure page, but instead automatically exports/encrypts

I thought s390 does fault on insecure accesses to secure pages, and it's the
kernel's fault handler that "automatically" converts the page?  E.g. trap 0x3d
-> do_secure_storage_access() -> arch_make_page_accessible().

> it so it can be swapped. It doesn't send a MCE and kills the host. It
> doesn't require fancy code in the guest firmware to export a page.
> 
> The code runs in the ultravisor -- yes, I'm talking about s390x. Now, I am
> not an expert on all of the glory details of TDX, SEV, ... to say which
> attack surface they introduced with that design, and if it can't be
> mitigated. I can only assume that there are real reasons (e.g., supporting
> an ultravisor is problematic, patents? ;) ) why x86-64 is different.
>
> So whenever I see something really complicated to work around such issues,
> it feels to me like a hardware/platform limitation is making our life hard
> and forces us to come up with such "interesting" designs.

Oh, 100% agree, the TDX "limitation" of poisoning cache line and leaving a land
mine to trip over is absolutely abhorrent.  SEV-ES and SEV aren't much better in
that they happily corrupt guest memory.  I am quite jealous of s390 behavior of
simply faulting on the actual access.

SEV-SNP does fault on the access and could do something similar to s390, but I'm
not totally convinced that's actually desirable as it has ramifications with
respect to debugging host and/or guest.  But it'd be nice to have the option...

> Sure, it's logical in this context, but it feels like "The house doesn't

Heh, for some definitions of "logical".

> have a door, so I'll have to climb through the window.". It gets the job
> done but isn't ideally what you'd want to have. If you understand what I am
> trying to say :)
> 
> </offtopic>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 17:50                         ` Sean Christopherson
@ 2021-09-01 17:53                           ` David Hildenbrand
  0 siblings, 0 replies; 71+ messages in thread
From: David Hildenbrand @ 2021-09-01 17:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: jejb, Andy Lutomirski, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, kvm list,
	Linux Kernel Mailing List, Borislav Petkov, Andrew Morton,
	Joerg Roedel, Andi Kleen, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On 01.09.21 19:50, Sean Christopherson wrote:
> On Wed, Sep 01, 2021, David Hildenbrand wrote:
>>>>> Well not necessarily, but it depends how clever we want to get.  If
>>>>> you look over on the OVMF/edk2 list, there's a proposal to do guest
>>>>> migration via a mirror VM that invokes a co-routine embedded in the
>>>>> OVMF binary:
>>>>
>>>> Yes, I heard of that. "Interesting" design.
>>>
>>> Heh, well what other suggestion do you have?  The problem is there
>>> needs to be code somewhere to perform some operations that's trusted by
>>> both the guest and the host.  The only element for a confidential VM
>>> that has this shared trust is the OVMF firmware, so it seems logical to
>>> use it.
>>
>> <offtopic>
>>
>> Let me put it this way: I worked with another architecture that doesn't
>> fault on access of a secure page, but instead automatically exports/encrypts
> 
> I thought s390 does fault on insecure accesses to secure pages, and it's the
> kernel's fault handler that "automatically" converts the page?  E.g. trap 0x3d
> -> do_secure_storage_access() -> arch_make_page_accessible().

"automatic" as in "the kernel can do it easily automatically under the 
hood when accessing such memory", yes that's what I meant :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:18             ` James Bottomley
  2021-09-01 16:22               ` David Hildenbrand
  2021-09-01 17:08               ` Andy Lutomirski
@ 2021-09-01 18:24               ` Andy Lutomirski
  2021-09-01 19:26               ` Dave Hansen
  2021-09-07 15:00               ` Tom Lendacky
  4 siblings, 0 replies; 71+ messages in thread
From: Andy Lutomirski @ 2021-09-01 18:24 UTC (permalink / raw)
  To: James Bottomley, David Hildenbrand, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang



On Wed, Sep 1, 2021, at 9:18 AM, James Bottomley wrote:
> On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote:
> [...]
> > If you want to swap a page on TDX, you can't.  Sorry, go directly to
> > jail, do not collect $200.
> 
> Actually, even on SEV-ES you can't either.  You can read the encrypted
> page and write it out if you want, but unless you swap it back to the
> exact same physical memory location, the encryption key won't work. 
> Since we don't guarantee this for swap, I think swap won't actually
> work for any confidential computing environment.
> 
> > So I think there are literally zero code paths that currently call
> > try_to_unmap() that will actually work like that on TDX.  If we run
> > out of memory on a TDX host, we can kill the guest completely and
> > reclaim all of its memory (which probably also involves killing QEMU
> > or whatever other user program is in charge), but that's really our
> > only option.
> 
> I think our only option for swap is guest co-operation.  We're going to
> have to inflate a balloon or something in the guest and have the guest
> driver do some type of bounce of the page, where it becomes an
> unencrypted page in the guest (so the host can read it without the
> physical address keying of the encryption getting in the way) but
> actually encrypted with a swap transfer key known only to the guest.  I
> assume we can use the page acceptance infrastructure currently being
> discussed elsewhere to do swap back in as well ... the host provides
> the guest with the encrypted swap page and the guest has to decrypt it
> and place it in encrypted guest memory.

I think the TD module could, without hardware changes, fairly efficiently re-encrypt guest pages for swap.  The TD module has access to the full CPU crypto capabilities, so this wouldn't be slow.  It would just require convincing the TD module team to implement such a feature.

It would be very, very nice for whatever swap support ends up existing to work without guest cooperation -- even if the guest is cooperative, we don't want to end up waiting for the guest to help when the host is under memory pressure.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:18             ` James Bottomley
                                 ` (2 preceding siblings ...)
  2021-09-01 18:24               ` Andy Lutomirski
@ 2021-09-01 19:26               ` Dave Hansen
  2021-09-07 15:00               ` Tom Lendacky
  4 siblings, 0 replies; 71+ messages in thread
From: Dave Hansen @ 2021-09-01 19:26 UTC (permalink / raw)
  To: jejb, Andy Lutomirski, David Hildenbrand, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Yu Zhang

On 9/1/21 9:18 AM, James Bottomley wrote:
>> So I think there are literally zero code paths that currently call
>> try_to_unmap() that will actually work like that on TDX.  If we run
>> out of memory on a TDX host, we can kill the guest completely and
>> reclaim all of its memory (which probably also involves killing QEMU
>> or whatever other user program is in charge), but that's really our
>> only option.
> I think our only option for swap is guest co-operation.

Yes, today that's the only choice.  Both TDX and SEV-SNP can
*theoretically* extend their architectures to enable swap with
uncooperative guests.  But, frankly, nobody has been asking for it.

If you care, please ask Intel and AMD to fix this gap.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:07         ` Andy Lutomirski
  2021-09-01 16:27           ` David Hildenbrand
@ 2021-09-02  8:19           ` Yu Zhang
  2021-09-02 18:41             ` Andy Lutomirski
  2021-09-07  1:33             ` Yan Zhao
  2021-09-02  9:27           ` Joerg Roedel
  2 siblings, 2 replies; 71+ messages in thread
From: Yu Zhang @ 2021-09-02  8:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Hildenbrand, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen

On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote:
> On 9/1/21 3:24 AM, Yu Zhang wrote:
> > On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:
> >>
> >>
> >> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
> >>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
> >>
> >>> Thanks a lot for this summary. A question about the requirement: do we or
> >>> do we not have plan to support assigned device to the protected VM?
> >>>
> >>> If yes. The fd based solution may need change the VFIO interface as well(
> >>> though the fake swap entry solution need mess with VFIO too). Because:
> >>>
> >>> 1> KVM uses VFIO when assigning devices into a VM.
> >>>
> >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
> >>> guest pages will have to be mapped in host IOMMU page table to host pages,
> >>> which are pinned during the whole life cycle fo the VM.
> >>>
> >>> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
> >>> in vfio_dma_do_map().
> >>>
> >>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
> >>> and pin the page. 
> >>>
> >>> But if we are using fd based solution, not every GPA can have a HVA, thus
> >>> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
> >>> doubt if VFIO can be modified to support this easily.
> >>>
> >>>
> >>
> >> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device?
> >>
> >> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it.  I'm not sure whether anyone will actually do this, but it should be possible, at least in principle.  Maybe someone will want to assign a NIC to a TDX guest.  An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether.
> >>
> >> If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.
> >>
> > 
> > Thanks Andy. I was asking the first scenario.
> > 
> > Well, I agree it is doable if someone really want some assigned
> > device in TD guest. As Kevin mentioned in his reply, HPA can be
> > generated, by extending VFIO with a new mapping protocol which
> > uses fd+offset, instead of HVA. 
> 
> I'm confused.  I don't see why any new code is needed for this at all.
> Every proposal I've seen for handling TDX memory continues to handle TDX
> *shared* memory exactly like regular guest memory today.  The only
> differences are that more hole punching will be needed, which will
> require lightweight memslots (to have many of them), memslots with
> holes, or mappings backing memslots with holes (which can be done with
> munmap() on current kernels).

Thanks for pointing this out. And yes, for DMAs not capable of encryption(
which is the case in current TDX). GUP shall work as it is in VFIO. :)

> 
> So you can literally just mmap a VFIO device and expect it to work,
> exactly like it does right now.  Whether the guest will be willing to
> use the device will depend on the guest security policy (all kinds of
> patches about that are flying around), but if the guest tries to use it,
> it really should just work.
> 

But I think there's still problem. For now,

1> Qemu mmap()s all GPAs into its HVA space, when the VM is created.
2> With no idea which part of guest memory shall be shared, VFIO will just
set up the IOPT, by mapping whole GPA ranges in IOPT. 
3> And those GPAs are actually private ones, with no shared-bit set.

Later when guest tries to perform a DMA(using a shared GPA), IO page fault
shall happen.

> > 
> > Another issue is current TDX does not support DMA encryption, and
> > only shared GPA memory shall be mapped in the VT-d. So to support
> > this, KVM may need to work with VFIO to dynamically program host
> > IOPT(IOMMU Page Table) when TD guest notifies a shared GFN range(e.g.,
> > with a MAP_GPA TDVMCALL), instead of prepopulating the IOPT at VM
> > creation time, by mapping entire GFN ranges of a guest.
> 
> Given that there is no encrypted DMA support, shouldn't the only IOMMU
> mappings (real host-side IOMMU) that point at guest memory be for
> non-encrypted DMA?  I don't see how this interacts at all.  If the guest
> tries to MapGPA to turn a shared MMIO page into private, the host should
> fail the hypercall because the operation makes no sense.
> 
> It is indeed the case that, with a TDX guest, MapGPA shared->private to
> a page that was previously used for unencrypted DMA will need to avoid
> having IOPT entries to the new private page, but even that doesn't seem
> particularly bad.  The fd+special memslot proposal for private memory
> means that shared *backing store* pages never actually transition
> between shared and private without being completely freed.
> 
> As far as I can tell, the actual problem you're referring to is:
> 
> >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
> >>> guest pages will have to be mapped in host IOMMU page table to host
> pages,
> >>> which are pinned during the whole life cycle fo the VM.

Yes. That's the primary concern. :)

> 
> In principle, you could actually initialize a TDX guest with all of its
> memory shared and all of it mapped in the host IOMMU.  When a guest
> turns some pages private, user code could punch a hole in the memslot,
> allocate private memory at that address, but leave the shared backing
> store in place and still mapped in the host IOMMU.  The result would be
> that guest-initiated DMA to the previously shared address would actually
> work but would hit pages that are invisible to the guest.  And a whole
> bunch of memory would be waste, but the whole system should stll work.

Do you mean to let VFIO & IOMMU to treat all guest memory as shared first,
and then just allocate the private pages in another backing store? I guess
that could work, but with the cost of allocating roughly 2x physical pages
of the guest RAM size. After all, the shared pages shall be only a small
part of guest memory.

If device assignment is desired in current TDX. My understanding of the 
enabling work would be like this:
1> Change qemu to not trigger VFIO_IOMMU_MAP_DMA for the TD, thus no IOPT
prepopulated, and no physical page allocated.
2> KVM forwards MapGPA(private -> shared) request to Qemu.
3> Qemu asks VFIO to pin and map the shared GPAs.

For private -> shared transitions, the memslot punching, IOPT unmapping,
and iotlb flushing are necessary. Possibly new interface between VFIO and
KVM is needed.

But actually I am not sure if people really want assigned device in current
TDX. Bottleneck of the performance should be the copying to/from swiotlb
buffers.

B.R.
Yu

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:27           ` David Hildenbrand
@ 2021-09-02  8:34             ` Yu Zhang
  2021-09-02  8:44               ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: Yu Zhang @ 2021-09-02  8:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andy Lutomirski, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen

On Wed, Sep 01, 2021 at 06:27:20PM +0200, David Hildenbrand wrote:
> On 01.09.21 18:07, Andy Lutomirski wrote:
> > On 9/1/21 3:24 AM, Yu Zhang wrote:
> > > On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:
> > > > 
> > > > 
> > > > On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
> > > > > On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
> > > > 
> > > > > Thanks a lot for this summary. A question about the requirement: do we or
> > > > > do we not have plan to support assigned device to the protected VM?
> > > > > 
> > > > > If yes. The fd based solution may need change the VFIO interface as well(
> > > > > though the fake swap entry solution need mess with VFIO too). Because:
> > > > > 
> > > > > 1> KVM uses VFIO when assigning devices into a VM.
> > > > > 
> > > > > 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
> > > > > guest pages will have to be mapped in host IOMMU page table to host pages,
> > > > > which are pinned during the whole life cycle fo the VM.
> > > > > 
> > > > > 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
> > > > > in vfio_dma_do_map().
> > > > > 
> > > > > 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
> > > > > and pin the page.
> > > > > 
> > > > > But if we are using fd based solution, not every GPA can have a HVA, thus
> > > > > the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
> > > > > doubt if VFIO can be modified to support this easily.
> > > > > 
> > > > > 
> > > > 
> > > > Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device?
> > > > 
> > > > If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it.  I'm not sure whether anyone will actually do this, but it should be possible, at least in principle.  Maybe someone will want to assign a NIC to a TDX guest.  An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether.
> > > > 
> > > > If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.
> > > > 
> > > 
> > > Thanks Andy. I was asking the first scenario.
> > > 
> > > Well, I agree it is doable if someone really want some assigned
> > > device in TD guest. As Kevin mentioned in his reply, HPA can be
> > > generated, by extending VFIO with a new mapping protocol which
> > > uses fd+offset, instead of HVA.
> > 
> > I'm confused.  I don't see why any new code is needed for this at all.
> > Every proposal I've seen for handling TDX memory continues to handle TDX
> > *shared* memory exactly like regular guest memory today.  The only
> > differences are that more hole punching will be needed, which will
> > require lightweight memslots (to have many of them), memslots with
> > holes, or mappings backing memslots with holes (which can be done with
> > munmap() on current kernels).
> > 
> > So you can literally just mmap a VFIO device and expect it to work,
> > exactly like it does right now.  Whether the guest will be willing to
> > use the device will depend on the guest security policy (all kinds of
> > patches about that are flying around), but if the guest tries to use it,
> > it really should just work.
> 
> ... but if you end up mapping private memory into IOMMU of the device and
> the device ends up accessing that memory, we're in the same position that
> the host might get capped, just like access from user space, no?

Well, according to the spec:

  - If the result of the translation results in a physical address with a TD
  private key ID, then the IOMMU will abort the transaction and report a VT-d
  DMA remapping failure.

  - If the GPA in the transaction that is input to the IOMMU is private (SHARED
  bit is 0), then the IOMMU may abort the transaction and report a VT-d DMA
  remapping failure.

So I guess mapping private GPAs in IOMMU is not that dangerous as mapping
into userspace. Though still wrong.

> 
> Sure, you can map only the complete duplicate shared-memory region into the
> IOMMU of the device, that would work. Shame vfio mostly always pins all
> guest memory and you essentially cannot punch holes into the shared memory
> anymore -- resulting in the worst case in a duplicate memory consumption for
> your VM.
> 
> So you'd actually want to map only the *currently* shared pieces into the
> IOMMU and update the mappings on demand. Having worked on something related,

Exactly. On demand mapping and page pinning for shared memory is necessary.

> I can only say that 64k individual mappings, and not being able to modify
> existing mappings except completely deleting them to replace with something
> new (!atomic), can be quite an issue for bigger VMs.

Do you mean atomicity in mapping/unmapping can hardly be guaranteed during
the shared <-> private transition? May I ask for elaboration? Thanks!

B.R.
Yu

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-02  8:34             ` Yu Zhang
@ 2021-09-02  8:44               ` David Hildenbrand
  2021-09-02 11:02                 ` Yu Zhang
  0 siblings, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-09-02  8:44 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Andy Lutomirski, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen

On 02.09.21 10:34, Yu Zhang wrote:
> On Wed, Sep 01, 2021 at 06:27:20PM +0200, David Hildenbrand wrote:
>> On 01.09.21 18:07, Andy Lutomirski wrote:
>>> On 9/1/21 3:24 AM, Yu Zhang wrote:
>>>> On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:
>>>>>
>>>>>
>>>>> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
>>>>>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
>>>>>
>>>>>> Thanks a lot for this summary. A question about the requirement: do we or
>>>>>> do we not have plan to support assigned device to the protected VM?
>>>>>>
>>>>>> If yes. The fd based solution may need change the VFIO interface as well(
>>>>>> though the fake swap entry solution need mess with VFIO too). Because:
>>>>>>
>>>>>> 1> KVM uses VFIO when assigning devices into a VM.
>>>>>>
>>>>>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
>>>>>> guest pages will have to be mapped in host IOMMU page table to host pages,
>>>>>> which are pinned during the whole life cycle fo the VM.
>>>>>>
>>>>>> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
>>>>>> in vfio_dma_do_map().
>>>>>>
>>>>>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
>>>>>> and pin the page.
>>>>>>
>>>>>> But if we are using fd based solution, not every GPA can have a HVA, thus
>>>>>> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
>>>>>> doubt if VFIO can be modified to support this easily.
>>>>>>
>>>>>>
>>>>>
>>>>> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device?
>>>>>
>>>>> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it.  I'm not sure whether anyone will actually do this, but it should be possible, at least in principle.  Maybe someone will want to assign a NIC to a TDX guest.  An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether.
>>>>>
>>>>> If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.
>>>>>
>>>>
>>>> Thanks Andy. I was asking the first scenario.
>>>>
>>>> Well, I agree it is doable if someone really want some assigned
>>>> device in TD guest. As Kevin mentioned in his reply, HPA can be
>>>> generated, by extending VFIO with a new mapping protocol which
>>>> uses fd+offset, instead of HVA.
>>>
>>> I'm confused.  I don't see why any new code is needed for this at all.
>>> Every proposal I've seen for handling TDX memory continues to handle TDX
>>> *shared* memory exactly like regular guest memory today.  The only
>>> differences are that more hole punching will be needed, which will
>>> require lightweight memslots (to have many of them), memslots with
>>> holes, or mappings backing memslots with holes (which can be done with
>>> munmap() on current kernels).
>>>
>>> So you can literally just mmap a VFIO device and expect it to work,
>>> exactly like it does right now.  Whether the guest will be willing to
>>> use the device will depend on the guest security policy (all kinds of
>>> patches about that are flying around), but if the guest tries to use it,
>>> it really should just work.
>>
>> ... but if you end up mapping private memory into IOMMU of the device and
>> the device ends up accessing that memory, we're in the same position that
>> the host might get capped, just like access from user space, no?
> 
> Well, according to the spec:
> 
>    - If the result of the translation results in a physical address with a TD
>    private key ID, then the IOMMU will abort the transaction and report a VT-d
>    DMA remapping failure.
> 
>    - If the GPA in the transaction that is input to the IOMMU is private (SHARED
>    bit is 0), then the IOMMU may abort the transaction and report a VT-d DMA
>    remapping failure.
> 
> So I guess mapping private GPAs in IOMMU is not that dangerous as mapping
> into userspace. Though still wrong.
> 
>>
>> Sure, you can map only the complete duplicate shared-memory region into the
>> IOMMU of the device, that would work. Shame vfio mostly always pins all
>> guest memory and you essentially cannot punch holes into the shared memory
>> anymore -- resulting in the worst case in a duplicate memory consumption for
>> your VM.
>>
>> So you'd actually want to map only the *currently* shared pieces into the
>> IOMMU and update the mappings on demand. Having worked on something related,
> 
> Exactly. On demand mapping and page pinning for shared memory is necessary.
> 
>> I can only say that 64k individual mappings, and not being able to modify
>> existing mappings except completely deleting them to replace with something
>> new (!atomic), can be quite an issue for bigger VMs.
> 
> Do you mean atomicity in mapping/unmapping can hardly be guaranteed during
> the shared <-> private transition? May I ask for elaboration? Thanks!

If we expect to really only have little shared memory, and expect a VM 
always has no shared memory when booting up (I think this is the case), 
I guess this could work.

The issue is if the guest e.g., makes contiguous 2 MiB shared and later 
wants to unshare individual pages/parts.

You'll have to DMA map the 2 MiB in page granularity, otherwise you'll 
have to DMA unmap 2 MiB and DMA remap all still-shared pieces. That is 
not atomic and can be problematic if the device is accessing some of the 
shared parts at that time.

Consequently that means, that large shared regions can be problematic 
when mapped, because we'll have to map in page granularity. We have 64k 
such individual mappings in general.

64k * 4KiB == 256 MiB

Not sure if there would be use cases, e.g., with GPGPUs and similar, 
where you'd want to share a lot of memory with a device ...

But these are just my thoughts, maybe I am missing something important.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:07         ` Andy Lutomirski
  2021-09-01 16:27           ` David Hildenbrand
  2021-09-02  8:19           ` Yu Zhang
@ 2021-09-02  9:27           ` Joerg Roedel
  2021-09-02 18:41             ` Andy Lutomirski
  2 siblings, 1 reply; 71+ messages in thread
From: Joerg Roedel @ 2021-09-02  9:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Yu Zhang, David Hildenbrand, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Andi Kleen, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen

On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote:
> In principle, you could actually initialize a TDX guest with all of its
> memory shared and all of it mapped in the host IOMMU.

Not sure how this works in TDX, but in SEV code fetches are always
treated as encrypted. So this approach would not work with SEV, not to
speak about attestation, which will not work with this approach either
:)

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 17:08               ` Andy Lutomirski
  2021-09-01 17:13                 ` James Bottomley
@ 2021-09-02 10:18                 ` Joerg Roedel
  1 sibling, 0 replies; 71+ messages in thread
From: Joerg Roedel @ 2021-09-02 10:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: James Bottomley, David Hildenbrand, Sean Christopherson,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On Wed, Sep 01, 2021 at 10:08:33AM -0700, Andy Lutomirski wrote:
> I asked David, and he said the PSP offers a swapping mechanism for
> SEV-ES.  I haven’t read the details, but they should all be public.

Right, the details are here:

	https://www.amd.com/system/files/TechDocs/55766_SEV-KM_API_Specification.pdf

Namely section 6.25 and 6.26 which describe the SWAP_OUT and SWAP_IN
commands.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-02  8:44               ` David Hildenbrand
@ 2021-09-02 11:02                 ` Yu Zhang
  0 siblings, 0 replies; 71+ messages in thread
From: Yu Zhang @ 2021-09-02 11:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andy Lutomirski, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen

On Thu, Sep 02, 2021 at 10:44:02AM +0200, David Hildenbrand wrote:
> On 02.09.21 10:34, Yu Zhang wrote:
> > On Wed, Sep 01, 2021 at 06:27:20PM +0200, David Hildenbrand wrote:
> > > On 01.09.21 18:07, Andy Lutomirski wrote:
> > > > On 9/1/21 3:24 AM, Yu Zhang wrote:
> > > > > On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:
> > > > > > 
> > > > > > 
> > > > > > On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
> > > > > > > On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
> > > > > > 
> > > > > > > Thanks a lot for this summary. A question about the requirement: do we or
> > > > > > > do we not have plan to support assigned device to the protected VM?
> > > > > > > 
> > > > > > > If yes. The fd based solution may need change the VFIO interface as well(
> > > > > > > though the fake swap entry solution need mess with VFIO too). Because:
> > > > > > > 
> > > > > > > 1> KVM uses VFIO when assigning devices into a VM.
> > > > > > > 
> > > > > > > 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
> > > > > > > guest pages will have to be mapped in host IOMMU page table to host pages,
> > > > > > > which are pinned during the whole life cycle fo the VM.
> > > > > > > 
> > > > > > > 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
> > > > > > > in vfio_dma_do_map().
> > > > > > > 
> > > > > > > 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
> > > > > > > and pin the page.
> > > > > > > 
> > > > > > > But if we are using fd based solution, not every GPA can have a HVA, thus
> > > > > > > the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
> > > > > > > doubt if VFIO can be modified to support this easily.
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device?
> > > > > > 
> > > > > > If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it.  I'm not sure whether anyone will actually do this, but it should be possible, at least in principle.  Maybe someone will want to assign a NIC to a TDX guest.  An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether.
> > > > > > 
> > > > > > If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.
> > > > > > 
> > > > > 
> > > > > Thanks Andy. I was asking the first scenario.
> > > > > 
> > > > > Well, I agree it is doable if someone really want some assigned
> > > > > device in TD guest. As Kevin mentioned in his reply, HPA can be
> > > > > generated, by extending VFIO with a new mapping protocol which
> > > > > uses fd+offset, instead of HVA.
> > > > 
> > > > I'm confused.  I don't see why any new code is needed for this at all.
> > > > Every proposal I've seen for handling TDX memory continues to handle TDX
> > > > *shared* memory exactly like regular guest memory today.  The only
> > > > differences are that more hole punching will be needed, which will
> > > > require lightweight memslots (to have many of them), memslots with
> > > > holes, or mappings backing memslots with holes (which can be done with
> > > > munmap() on current kernels).
> > > > 
> > > > So you can literally just mmap a VFIO device and expect it to work,
> > > > exactly like it does right now.  Whether the guest will be willing to
> > > > use the device will depend on the guest security policy (all kinds of
> > > > patches about that are flying around), but if the guest tries to use it,
> > > > it really should just work.
> > > 
> > > ... but if you end up mapping private memory into IOMMU of the device and
> > > the device ends up accessing that memory, we're in the same position that
> > > the host might get capped, just like access from user space, no?
> > 
> > Well, according to the spec:
> > 
> >    - If the result of the translation results in a physical address with a TD
> >    private key ID, then the IOMMU will abort the transaction and report a VT-d
> >    DMA remapping failure.
> > 
> >    - If the GPA in the transaction that is input to the IOMMU is private (SHARED
> >    bit is 0), then the IOMMU may abort the transaction and report a VT-d DMA
> >    remapping failure.
> > 
> > So I guess mapping private GPAs in IOMMU is not that dangerous as mapping
> > into userspace. Though still wrong.
> > 
> > > 
> > > Sure, you can map only the complete duplicate shared-memory region into the
> > > IOMMU of the device, that would work. Shame vfio mostly always pins all
> > > guest memory and you essentially cannot punch holes into the shared memory
> > > anymore -- resulting in the worst case in a duplicate memory consumption for
> > > your VM.
> > > 
> > > So you'd actually want to map only the *currently* shared pieces into the
> > > IOMMU and update the mappings on demand. Having worked on something related,
> > 
> > Exactly. On demand mapping and page pinning for shared memory is necessary.
> > 
> > > I can only say that 64k individual mappings, and not being able to modify
> > > existing mappings except completely deleting them to replace with something
> > > new (!atomic), can be quite an issue for bigger VMs.
> > 
> > Do you mean atomicity in mapping/unmapping can hardly be guaranteed during
> > the shared <-> private transition? May I ask for elaboration? Thanks!
> 
> If we expect to really only have little shared memory, and expect a VM
> always has no shared memory when booting up (I think this is the case), I
> guess this could work.
> 
> The issue is if the guest e.g., makes contiguous 2 MiB shared and later
> wants to unshare individual pages/parts.
> 
> You'll have to DMA map the 2 MiB in page granularity, otherwise you'll have
> to DMA unmap 2 MiB and DMA remap all still-shared pieces. That is not atomic
> and can be problematic if the device is accessing some of the shared parts
> at that time.
> 
> Consequently that means, that large shared regions can be problematic when
> mapped, because we'll have to map in page granularity. We have 64k such
> individual mappings in general.
> 
> 64k * 4KiB == 256 MiB
> 
> Not sure if there would be use cases, e.g., with GPGPUs and similar, where
> you'd want to share a lot of memory with a device ...
> 
> But these are just my thoughts, maybe I am missing something important.

Good point. And thanks!

So to guarantee the atomicity, we shall only use 4K mappings for
shared pages, thus having to bear with bigger memory footprint and
more IOTLB miss penalties. :)

B.R.
Yu

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-02  8:19           ` Yu Zhang
@ 2021-09-02 18:41             ` Andy Lutomirski
  2021-09-07  1:33             ` Yan Zhao
  1 sibling, 0 replies; 71+ messages in thread
From: Andy Lutomirski @ 2021-09-02 18:41 UTC (permalink / raw)
  To: Yu Zhang
  Cc: David Hildenbrand, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen

>>
>> In principle, you could actually initialize a TDX guest with all of its
>> memory shared and all of it mapped in the host IOMMU.  When a guest
>> turns some pages private, user code could punch a hole in the memslot,
>> allocate private memory at that address, but leave the shared backing
>> store in place and still mapped in the host IOMMU.  The result would be
>> that guest-initiated DMA to the previously shared address would actually
>> work but would hit pages that are invisible to the guest.  And a whole
>> bunch of memory would be waste, but the whole system should stll work.
> 
> Do you mean to let VFIO & IOMMU to treat all guest memory as shared first,
> and then just allocate the private pages in another backing store? I guess
> that could work, but with the cost of allocating roughly 2x physical pages
> of the guest RAM size. After all, the shared pages shall be only a small
> part of guest memory.

Yes.

My point is that I don't think there should be any particular danger in
leaving the VFIO code alone as part of TDX enablement.  The code ought
to *work* even if it will be wildly inefficient.  If someone cares to
make it work better, they're welcome to do so.

--Andy

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-02  9:27           ` Joerg Roedel
@ 2021-09-02 18:41             ` Andy Lutomirski
  2021-09-02 18:57               ` Sean Christopherson
  0 siblings, 1 reply; 71+ messages in thread
From: Andy Lutomirski @ 2021-09-02 18:41 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Yu Zhang, David Hildenbrand, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Andi Kleen, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen

On 9/2/21 2:27 AM, Joerg Roedel wrote:
> On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote:
>> In principle, you could actually initialize a TDX guest with all of its
>> memory shared and all of it mapped in the host IOMMU.
> 
> Not sure how this works in TDX, but in SEV code fetches are always
> treated as encrypted. So this approach would not work with SEV, not to
> speak about attestation, which will not work with this approach either
> :)
> 

Oof.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-08-24  0:52 [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory Sean Christopherson
  2021-08-24 10:48 ` Yu Zhang
  2021-08-26 10:15 ` David Hildenbrand
@ 2021-09-02 18:47 ` Kirill A. Shutemov
  2021-09-02 20:33   ` Sean Christopherson
  2 siblings, 1 reply; 71+ messages in thread
From: Kirill A. Shutemov @ 2021-09-02 18:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen,
	Yu Zhang

Hi folks,

I try to sketch how the memfd changes would look like.

I've added F_SEAL_GUEST. The new seal is only allowed if there's no
pre-existing pages in the fd (i_mapping->nrpages check) and there's
no existing mapping of the file (RB_EMPTY_ROOT(&i_mapping->i_mmap.rb_root check).

After the seal is set, no read/write/mmap from userspace is allowed.

Although it's not clear how to serialize read check vs. seal setup: seal
is protected with inode_lock() which we don't hold in read path because it
is expensive. I don't know yet how to get it right. For TDX, it's okay to
allow read as it cannot trigger #MCE. Maybe we can allow it?

Truncate and punch hole are tricky.

We want to allow it to save memory if substantial range is converted to
shared. Partial truncate and punch hole effectively writes zeros to
partially truncated page and may lead to #MCE. We can reject any partial
truncate/punch requests, but it doesn't help the situation with THPs.

If we truncate to the middle of THP page, we try to split it into small
pages and proceed as usual for small pages. But split is allowed to fail.
If it happens we zero part of THP.
I guess we may reject truncate if split fails. It should work fine if we
only use it for saving memory.

We need to modify truncation/punch path to notify kvm that pages are about
to be freed. I think we will register callback in the memfd on adding the
fd to KVM memslot that going to be called for the notification. That means
1:1 between memfd and memslot. I guess it's okay.

Migration going to always fail on F_SEAL_GUEST for now. Can be modified to
use a callback in the future.

Swapout will also always fail on F_SEAL_GUEST. It seems trivial. Again, it
can be a callback in the future.

For GPA->PFN translation KVM could use vm_ops->fault(). Semantically it is
a good fit, but we don't have any VMAs around and ->mmap is forbidden for
F_SEAL_GUEST.
Other option is call shmem_getpage() directly, but it looks like a
layering violation to me. And it's not available to modules :/

Any comments?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-02 18:41             ` Andy Lutomirski
@ 2021-09-02 18:57               ` Sean Christopherson
  2021-09-02 19:07                 ` Dave Hansen
  0 siblings, 1 reply; 71+ messages in thread
From: Sean Christopherson @ 2021-09-02 18:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Yu Zhang, David Hildenbrand, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Andi Kleen, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen

On Thu, Sep 02, 2021, Andy Lutomirski wrote:
> On 9/2/21 2:27 AM, Joerg Roedel wrote:
> > On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote:
> >> In principle, you could actually initialize a TDX guest with all of its
> >> memory shared and all of it mapped in the host IOMMU.
> > 
> > Not sure how this works in TDX, but in SEV code fetches are always
> > treated as encrypted. So this approach would not work with SEV, not to
> > speak about attestation, which will not work with this approach either
> > :)
> > 
> 
> Oof.

TDX is kinda similar.  _All_ accesses are private if paging is disabled because
the shared bit is either bit 48 or bit 51 in the GPA, i.e. can't be reached if
paging is disabled.  The vCPU is hardcoded to start in unpaged protected mode,
so at least some amount of guest memory needs to be private.

I also could've sworn code fetches from shared memory would #VE, but I can't find
anything in the specs that confirm that.  I may be conflating TDX with SGX's #GP
on a code fetch outside of ELRANGE...

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-02 18:57               ` Sean Christopherson
@ 2021-09-02 19:07                 ` Dave Hansen
  2021-09-02 20:42                   ` Andy Lutomirski
  0 siblings, 1 reply; 71+ messages in thread
From: Dave Hansen @ 2021-09-02 19:07 UTC (permalink / raw)
  To: Sean Christopherson, Andy Lutomirski
  Cc: Joerg Roedel, Yu Zhang, David Hildenbrand, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Andi Kleen, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy

On 9/2/21 11:57 AM, Sean Christopherson wrote:
> On Thu, Sep 02, 2021, Andy Lutomirski wrote:
>> On 9/2/21 2:27 AM, Joerg Roedel wrote:
>>> On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote:
>>>> In principle, you could actually initialize a TDX guest with all of its
>>>> memory shared and all of it mapped in the host IOMMU.
>>> Not sure how this works in TDX, but in SEV code fetches are always
>>> treated as encrypted. So this approach would not work with SEV, not to
>>> speak about attestation, which will not work with this approach either
>>> :)
>>>
>> Oof.
> TDX is kinda similar.  _All_ accesses are private if paging is disabled because
> the shared bit is either bit 48 or bit 51 in the GPA, i.e. can't be reached if
> paging is disabled.  The vCPU is hardcoded to start in unpaged protected mode,
> so at least some amount of guest memory needs to be private.

That's a rule we should definitely add to our page table checker.  Just
like how we can look for W+X, we should also look for Shared+X.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-02 18:47 ` Kirill A. Shutemov
@ 2021-09-02 20:33   ` Sean Christopherson
  2021-09-03 19:14     ` Kirill A. Shutemov
  0 siblings, 1 reply; 71+ messages in thread
From: Sean Christopherson @ 2021-09-02 20:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen,
	Yu Zhang

On Thu, Sep 02, 2021, Kirill A. Shutemov wrote:
> Hi folks,
> 
> I try to sketch how the memfd changes would look like.
> 
> I've added F_SEAL_GUEST. The new seal is only allowed if there's no
> pre-existing pages in the fd (i_mapping->nrpages check) and there's
> no existing mapping of the file (RB_EMPTY_ROOT(&i_mapping->i_mmap.rb_root check).
> 
> After the seal is set, no read/write/mmap from userspace is allowed.
> 
> Although it's not clear how to serialize read check vs. seal setup: seal
> is protected with inode_lock() which we don't hold in read path because it
> is expensive. I don't know yet how to get it right. For TDX, it's okay to
> allow read as it cannot trigger #MCE. Maybe we can allow it?

Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?

> Truncate and punch hole are tricky.
> 
> We want to allow it to save memory if substantial range is converted to
> shared. Partial truncate and punch hole effectively writes zeros to
> partially truncated page and may lead to #MCE. We can reject any partial
> truncate/punch requests, but it doesn't help the situation with THPs.
> 
> If we truncate to the middle of THP page, we try to split it into small
> pages and proceed as usual for small pages. But split is allowed to fail.
> If it happens we zero part of THP.
> I guess we may reject truncate if split fails. It should work fine if we
> only use it for saving memory.

FWIW, splitting a THP will also require a call into KVM to demote the huge page
to the equivalent small pages.

> We need to modify truncation/punch path to notify kvm that pages are about
> to be freed. I think we will register callback in the memfd on adding the
> fd to KVM memslot that going to be called for the notification. That means
> 1:1 between memfd and memslot. I guess it's okay.

Hmm, 1:1 memfd to memslot will be problematic as that would prevent punching a
hole in KVM's memslots, e.g. to convert a subset to shared.  It would also
disallow backing guest memory with a single memfd that's split across two
memslots for <4gb and >4gb.

But I don't think we need a 1:1 relationship.  To keep KVM sane, we can require
each private memslot to be wholly contained in a single memfd, I can't think of
any reason that would be problematic for userspace.

For the callbacks, I believe the rule should be 1:1 between memfd and KVM instance.
That would allow mapping multiple memslots to a single memfd so long as they're
all coming from the same KVM instance.

> Migration going to always fail on F_SEAL_GUEST for now. Can be modified to
> use a callback in the future.
> 
> Swapout will also always fail on F_SEAL_GUEST. It seems trivial. Again, it
> can be a callback in the future.
> 
> For GPA->PFN translation KVM could use vm_ops->fault(). Semantically it is
> a good fit, but we don't have any VMAs around and ->mmap is forbidden for
> F_SEAL_GUEST.
> Other option is call shmem_getpage() directly, but it looks like a
> layering violation to me. And it's not available to modules :/

My idea for this was to have the memfd:KVM exchange callbacks, i.e. memfd would
have callbacks into KVM, but KVM would also have callbacks into memfd.  To avoid
circular refcounts, KVM would hold a reference to the memfd (since it's the
instigator) and KVM would be responsible for unregistering itself before freeing
it's reference to the memfd.

The memfd callbacks would be tracked per private memslot, which meshes nicely
without how KVM uses memslots to translate gfn->pfn.  In effect, the ops pointer
in the memslots replaces the host virtual address that's used to get the pfn for
non-private memslots.

@@ -2428,8 +2453,12 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
                               bool atomic, bool *async, bool write_fault,
                               bool *writable, hva_t *hva)
 {
-       unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
+       unsigned long addr;

+       if (memslot_is_private(slot))
+               return slot->private_ops->gfn_to_pfn(...);
+
+       addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
        if (hva)
                *hva = addr;

> 
> Any comments?
> 
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-02 19:07                 ` Dave Hansen
@ 2021-09-02 20:42                   ` Andy Lutomirski
  0 siblings, 0 replies; 71+ messages in thread
From: Andy Lutomirski @ 2021-09-02 20:42 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Joerg Roedel, Yu Zhang, David Hildenbrand, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	kvm list, Linux Kernel Mailing List, Borislav Petkov,
	Andrew Morton, Andi Kleen, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy



On Thu, Sep 2, 2021, at 12:07 PM, Dave Hansen wrote:
> On 9/2/21 11:57 AM, Sean Christopherson wrote:
> > On Thu, Sep 02, 2021, Andy Lutomirski wrote:
> >> On 9/2/21 2:27 AM, Joerg Roedel wrote:
> >>> On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote:
> >>>> In principle, you could actually initialize a TDX guest with all of its
> >>>> memory shared and all of it mapped in the host IOMMU.
> >>> Not sure how this works in TDX, but in SEV code fetches are always
> >>> treated as encrypted. So this approach would not work with SEV, not to
> >>> speak about attestation, which will not work with this approach either
> >>> :)
> >>>
> >> Oof.
> > TDX is kinda similar.  _All_ accesses are private if paging is disabled because
> > the shared bit is either bit 48 or bit 51 in the GPA, i.e. can't be reached if
> > paging is disabled.  The vCPU is hardcoded to start in unpaged protected mode,
> > so at least some amount of guest memory needs to be private.
> 
> That's a rule we should definitely add to our page table checker.  Just
> like how we can look for W+X, we should also look for Shared+X.
> 

The only case I can thing of where the TDX vs SEV rule matters is for some mildly crazy user who wants to run user code out of an unencrypted DAX device (or virtio-fs, I guess).  We can save that for another year :)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-02 20:33   ` Sean Christopherson
@ 2021-09-03 19:14     ` Kirill A. Shutemov
  2021-09-03 19:15       ` Andy Lutomirski
  0 siblings, 1 reply; 71+ messages in thread
From: Kirill A. Shutemov @ 2021-09-03 19:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen,
	Yu Zhang

On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote:
> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?

I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look
closer into locking next.

> FWIW, splitting a THP will also require a call into KVM to demote the huge page
> to the equivalent small pages.

But will it though? Even if the page got split we still don't allow
migration so EPT can stay 2M. Or do I miss something?

For truncate/punch I disallowed non-page aligned operations. I ignore
failed split. Yes, we zero the truncated parts, but these parts are
already removed from KVM, so it should be safe.

> > We need to modify truncation/punch path to notify kvm that pages are about
> > to be freed. I think we will register callback in the memfd on adding the
> > fd to KVM memslot that going to be called for the notification. That means
> > 1:1 between memfd and memslot. I guess it's okay.
> 
> Hmm, 1:1 memfd to memslot will be problematic as that would prevent punching a
> hole in KVM's memslots, e.g. to convert a subset to shared.  It would also
> disallow backing guest memory with a single memfd that's split across two
> memslots for <4gb and >4gb.
> 
> But I don't think we need a 1:1 relationship.  To keep KVM sane, we can require
> each private memslot to be wholly contained in a single memfd, I can't think of
> any reason that would be problematic for userspace.
> 
> For the callbacks, I believe the rule should be 1:1 between memfd and KVM instance.
> That would allow mapping multiple memslots to a single memfd so long as they're
> all coming from the same KVM instance.

Okay, I've added guest_owner field that I check on registering guest. I
don't think we want to tie it directly to kvm, so I left it just void *.

> 
> > Migration going to always fail on F_SEAL_GUEST for now. Can be modified to
> > use a callback in the future.
> > 
> > Swapout will also always fail on F_SEAL_GUEST. It seems trivial. Again, it
> > can be a callback in the future.
> > 
> > For GPA->PFN translation KVM could use vm_ops->fault(). Semantically it is
> > a good fit, but we don't have any VMAs around and ->mmap is forbidden for
> > F_SEAL_GUEST.
> > Other option is call shmem_getpage() directly, but it looks like a
> > layering violation to me. And it's not available to modules :/
> 
> My idea for this was to have the memfd:KVM exchange callbacks, i.e. memfd would
> have callbacks into KVM, but KVM would also have callbacks into memfd.  To avoid
> circular refcounts, KVM would hold a reference to the memfd (since it's the
> instigator) and KVM would be responsible for unregistering itself before freeing
> it's reference to the memfd.

Okay I went this path. This callback exchange looks wierd to me, but I
suppose it works.

Below is how I see it can look like. It only compile-tested and might be
broken in many ways.

I have not found a right abstaction for shmem_test_guest(). Maybe it has
to be a new inode operation? I donno.
For initial prototyping we can call it directly.

Any comments?

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 8e775ce517bb..6f94fc46a3b1 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -10,6 +10,17 @@
 #include <linux/xattr.h>
 #include <linux/fs_parser.h>
 
+struct guest_ops {
+	void (*invalidate_page_range)(struct inode *inode,
+				      pgoff_t start, pgoff_t end);
+};
+
+struct guest_mem_ops {
+	unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset);
+	void (*put_unlock_pfn)(unsigned long pfn);
+
+};
+
 /* inode in-kernel data */
 
 struct shmem_inode_info {
@@ -24,6 +35,8 @@ struct shmem_inode_info {
 	struct simple_xattrs	xattrs;		/* list of xattrs */
 	atomic_t		stop_eviction;	/* hold when working on inode */
 	struct inode		vfs_inode;
+	void			*guest_owner;
+	const struct guest_ops	*guest_ops;
 };
 
 struct shmem_sb_info {
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..c79bc8572721 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
 #define F_SEAL_GROW	0x0004	/* prevent file from growing */
 #define F_SEAL_WRITE	0x0008	/* prevent writes */
 #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
+#define F_SEAL_GUEST		0x0020
 /* (1U << 31) is reserved for signed error codes */
 
 /*
diff --git a/mm/memfd.c b/mm/memfd.c
index 081dd33e6a61..ccaa4edb21f7 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -134,7 +134,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
 		     F_SEAL_SHRINK | \
 		     F_SEAL_GROW | \
 		     F_SEAL_WRITE | \
-		     F_SEAL_FUTURE_WRITE)
+		     F_SEAL_FUTURE_WRITE | \
+		     F_SEAL_GUEST)
 
 static int memfd_add_seals(struct file *file, unsigned int seals)
 {
@@ -203,10 +204,27 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
 		}
 	}
 
+	if (seals & F_SEAL_GUEST) {
+		i_mmap_lock_read(inode->i_mapping);
+
+		if (!RB_EMPTY_ROOT(&inode->i_mapping->i_mmap.rb_root)) {
+			error = -EBUSY;
+			goto unlock;
+		}
+
+		if (i_size_read(inode)) {
+			error = -EBUSY;
+			goto unlock;
+		}
+	}
+
 	*file_seals |= seals;
 	error = 0;
 
 unlock:
+	if (seals & F_SEAL_GUEST)
+		i_mmap_unlock_read(inode->i_mapping);
+
 	inode_unlock(inode);
 	return error;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index dacda7463d54..761cf4e2152e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -883,6 +883,17 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
 	return split_huge_page(page) >= 0;
 }
 
+static void guest_invalidate_page(struct inode *inode,
+				  struct page *page, pgoff_t start, pgoff_t end)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	start = max(start, page->index);
+	end = min(end, page->index + HPAGE_PMD_NR) - 1;
+
+	info->guest_ops->invalidate_page_range(inode, start, end);
+}
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -923,6 +934,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			index += thp_nr_pages(page) - 1;
 
+			guest_invalidate_page(inode, page, start, end);
+
 			if (!unfalloc || !PageUptodate(page))
 				truncate_inode_page(mapping, page);
 			unlock_page(page);
@@ -999,6 +1012,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 					index--;
 					break;
 				}
+
+				guest_invalidate_page(inode, page, start, end);
+
 				VM_BUG_ON_PAGE(PageWriteback(page), page);
 				if (shmem_punch_compound(page, start, end))
 					truncate_inode_page(mapping, page);
@@ -1074,6 +1090,9 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
 			return -EPERM;
 
+		if ((info->seals & F_SEAL_GUEST) && (newsize & ~PAGE_MASK))
+			return -EINVAL;
+
 		if (newsize != oldsize) {
 			error = shmem_reacct_size(SHMEM_I(inode)->flags,
 					oldsize, newsize);
@@ -1348,6 +1367,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		goto redirty;
 	if (!total_swap_pages)
 		goto redirty;
+	if (info->seals & F_SEAL_GUEST)
+		goto redirty;
 
 	/*
 	 * Our capabilities prevent regular writeback or sync from ever calling
@@ -2274,6 +2295,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if (ret)
 		return ret;
 
+	if (info->seals & F_SEAL_GUEST)
+		return -EPERM;
+
 	/* arm64 - allow memory tagging on RAM-based files */
 	vma->vm_flags |= VM_MTE_ALLOWED;
 
@@ -2471,12 +2495,14 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 	pgoff_t index = pos >> PAGE_SHIFT;
 
 	/* i_mutex is held by caller */
-	if (unlikely(info->seals & (F_SEAL_GROW |
-				   F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
+	if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE |
+				    F_SEAL_FUTURE_WRITE | F_SEAL_GUEST))) {
 		if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))
 			return -EPERM;
 		if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
 			return -EPERM;
+		if (info->seals & F_SEAL_GUEST)
+			return -EPERM;
 	}
 
 	return shmem_getpage(inode, index, pagep, SGP_WRITE);
@@ -2530,6 +2556,8 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t retval = 0;
 	loff_t *ppos = &iocb->ki_pos;
 
+	if (SHMEM_I(inode)->seals & F_SEAL_GUEST)
+		return -EPERM;
 	/*
 	 * Might this read be for a stacking filesystem?  Then when reading
 	 * holes of a sparse file, we actually need to allocate those pages,
@@ -2675,6 +2703,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			goto out;
 		}
 
+		if ((info->seals & F_SEAL_GUEST) &&
+		    (offset & ~PAGE_MASK || len & ~PAGE_MASK)) {
+			error = -EINVAL;
+			goto out;
+		}
+
 		shmem_falloc.waitq = &shmem_falloc_waitq;
 		shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
 		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
@@ -3761,6 +3795,20 @@ static void shmem_destroy_inodecache(void)
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
+#ifdef CONFIG_MIGRATION
+int shmem_migrate_page(struct address_space *mapping,
+		struct page *newpage, struct page *page,
+		enum migrate_mode mode)
+{
+	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (info->seals & F_SEAL_GUEST)
+		return -ENOTSUPP;
+	return migrate_page(mapping, newpage, page, mode);
+}
+#endif
+
 const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
@@ -3769,12 +3817,57 @@ const struct address_space_operations shmem_aops = {
 	.write_end	= shmem_write_end,
 #endif
 #ifdef CONFIG_MIGRATION
-	.migratepage	= migrate_page,
+	.migratepage	= shmem_migrate_page,
 #endif
 	.error_remove_page = generic_error_remove_page,
 };
 EXPORT_SYMBOL(shmem_aops);
 
+static unsigned long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset)
+{
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(inode, offset, &page, SGP_WRITE);
+	if (ret)
+		return ret;
+
+	return page_to_pfn(page);
+}
+
+static void shmem_put_unlock_pfn(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	set_page_dirty(page);
+	unlock_page(page);
+	put_page(page);
+}
+
+static const struct guest_mem_ops shmem_guest_ops = {
+	.get_lock_pfn = shmem_get_lock_pfn,
+	.put_unlock_pfn = shmem_put_unlock_pfn,
+};
+
+int shmem_register_guest(struct inode *inode, void *owner,
+			 const struct guest_ops *guest_ops,
+			 const struct guest_mem_ops **guest_mem_ops)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (!owner)
+		return -EINVAL;
+
+	if (info->guest_owner && info->guest_owner != owner)
+		return -EPERM;
+
+	info->guest_ops = guest_ops;
+	*guest_mem_ops = &shmem_guest_ops;
+	return 0;
+}
+
 static const struct file_operations shmem_file_operations = {
 	.mmap		= shmem_mmap,
 	.get_unmapped_area = shmem_get_unmapped_area,
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-03 19:14     ` Kirill A. Shutemov
@ 2021-09-03 19:15       ` Andy Lutomirski
  2021-09-10 17:18         ` Kirill A. Shutemov
  0 siblings, 1 reply; 71+ messages in thread
From: Andy Lutomirski @ 2021-09-03 19:15 UTC (permalink / raw)
  To: Kirill A. Shutemov, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov, Andrew Morton,
	Joerg Roedel, Andi Kleen, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Varad Gautam, Dario Faggioli, x86, linux-mm, linux-coco,
	Kirill A . Shutemov, Kuppuswamy Sathyanarayanan,
	David Hildenbrand, Dave Hansen, Yu Zhang

On 9/3/21 12:14 PM, Kirill A. Shutemov wrote:
> On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote:
>> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?
> 
> I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look
> closer into locking next.

We can decisively eliminate this sort of failure by making the switch
happen at open time instead of after.  For a memfd-like API, this would
be straightforward.  For a filesystem, it would take a bit more thought.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-02  8:19           ` Yu Zhang
  2021-09-02 18:41             ` Andy Lutomirski
@ 2021-09-07  1:33             ` Yan Zhao
  1 sibling, 0 replies; 71+ messages in thread
From: Yan Zhao @ 2021-09-07  1:33 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Andy Lutomirski, David Hildenbrand, Sean Christopherson,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Hansen, Dave

On Thu, Sep 02, 2021 at 04:19:23PM +0800, Yu Zhang wrote:
> On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote:
> > On 9/1/21 3:24 AM, Yu Zhang wrote:
> > > On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:
> > >>
> > >>
> > >> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
> > >>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
> > >>
> > >>> Thanks a lot for this summary. A question about the requirement: do we or
> > >>> do we not have plan to support assigned device to the protected VM?
> > >>>
> > >>> If yes. The fd based solution may need change the VFIO interface as well(
> > >>> though the fake swap entry solution need mess with VFIO too). Because:
> > >>>
> > >>> 1> KVM uses VFIO when assigning devices into a VM.
> > >>>
> > >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
> > >>> guest pages will have to be mapped in host IOMMU page table to host pages,
> > >>> which are pinned during the whole life cycle fo the VM.
> > >>>
> > >>> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
> > >>> in vfio_dma_do_map().
> > >>>
> > >>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
> > >>> and pin the page. 
> > >>>
> > >>> But if we are using fd based solution, not every GPA can have a HVA, thus
> > >>> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I
> > >>> doubt if VFIO can be modified to support this easily.
> > >>>
> > >>>
> > >>
> > >> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device?
> > >>
> > >> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it.  I'm not sure whether anyone will actually do this, but it should be possible, at least in principle.  Maybe someone will want to assign a NIC to a TDX guest.  An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether.
> > >>
> > >> If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.
> > >>
> > > 
> > > Thanks Andy. I was asking the first scenario.
> > > 
> > > Well, I agree it is doable if someone really want some assigned
> > > device in TD guest. As Kevin mentioned in his reply, HPA can be
> > > generated, by extending VFIO with a new mapping protocol which
> > > uses fd+offset, instead of HVA. 
> > 
> > I'm confused.  I don't see why any new code is needed for this at all.
> > Every proposal I've seen for handling TDX memory continues to handle TDX
> > *shared* memory exactly like regular guest memory today.  The only
> > differences are that more hole punching will be needed, which will
> > require lightweight memslots (to have many of them), memslots with
> > holes, or mappings backing memslots with holes (which can be done with
> > munmap() on current kernels).
> 
> Thanks for pointing this out. And yes, for DMAs not capable of encryption(
> which is the case in current TDX). GUP shall work as it is in VFIO. :)
> 
> > 
> > So you can literally just mmap a VFIO device and expect it to work,
> > exactly like it does right now.  Whether the guest will be willing to
> > use the device will depend on the guest security policy (all kinds of
> > patches about that are flying around), but if the guest tries to use it,
> > it really should just work.
> > 
> 
> But I think there's still problem. For now,
> 
> 1> Qemu mmap()s all GPAs into its HVA space, when the VM is created.
> 2> With no idea which part of guest memory shall be shared, VFIO will just
> set up the IOPT, by mapping whole GPA ranges in IOPT. 
actually, it's not GPA in IOPT. It's IOVA in IOPT, and the IOVA is equal
to GPA in this case.

> 3> And those GPAs are actually private ones, with no shared-bit set.
> 
in Guest, IOVA is set to GPA (without shared bit) for shared memory in
TDX now.

> Later when guest tries to perform a DMA(using a shared GPA), IO page fault
> shall happen.
So, this situation should not happen.



> > > 
> > > Another issue is current TDX does not support DMA encryption, and
> > > only shared GPA memory shall be mapped in the VT-d. So to support
> > > this, KVM may need to work with VFIO to dynamically program host
> > > IOPT(IOMMU Page Table) when TD guest notifies a shared GFN range(e.g.,
> > > with a MAP_GPA TDVMCALL), instead of prepopulating the IOPT at VM
> > > creation time, by mapping entire GFN ranges of a guest.
> > 
> > Given that there is no encrypted DMA support, shouldn't the only IOMMU
> > mappings (real host-side IOMMU) that point at guest memory be for
> > non-encrypted DMA?  I don't see how this interacts at all.  If the guest
> > tries to MapGPA to turn a shared MMIO page into private, the host should
> > fail the hypercall because the operation makes no sense.
> > 
> > It is indeed the case that, with a TDX guest, MapGPA shared->private to
> > a page that was previously used for unencrypted DMA will need to avoid
> > having IOPT entries to the new private page, but even that doesn't seem
> > particularly bad.  The fd+special memslot proposal for private memory
> > means that shared *backing store* pages never actually transition
> > between shared and private without being completely freed.
> > 
> > As far as I can tell, the actual problem you're referring to is:
> > 
> > >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
> > >>> guest pages will have to be mapped in host IOMMU page table to host
> > pages,
> > >>> which are pinned during the whole life cycle fo the VM.
> 
> Yes. That's the primary concern. :)
> 
> > 
> > In principle, you could actually initialize a TDX guest with all of its
> > memory shared and all of it mapped in the host IOMMU.  When a guest
> > turns some pages private, user code could punch a hole in the memslot,
> > allocate private memory at that address, but leave the shared backing
> > store in place and still mapped in the host IOMMU.  The result would be
> > that guest-initiated DMA to the previously shared address would actually
> > work but would hit pages that are invisible to the guest.  And a whole
> > bunch of memory would be waste, but the whole system should stll work.
> 
> Do you mean to let VFIO & IOMMU to treat all guest memory as shared first,
> and then just allocate the private pages in another backing store? I guess
just a curious question.
what if this shared->private conversion is on MMIO ranges that are mapped
into the device?
do you have enough info to get MMIO hpa for the private mapping?

Thanks
Yan


> that could work, but with the cost of allocating roughly 2x physical pages
> of the guest RAM size. After all, the shared pages shall be only a small
> part of guest memory.
> 
> If device assignment is desired in current TDX. My understanding of the 
> enabling work would be like this:
> 1> Change qemu to not trigger VFIO_IOMMU_MAP_DMA for the TD, thus no IOPT
> prepopulated, and no physical page allocated.
> 2> KVM forwards MapGPA(private -> shared) request to Qemu.
> 3> Qemu asks VFIO to pin and map the shared GPAs.
> 
> For private -> shared transitions, the memslot punching, IOPT unmapping,
> and iotlb flushing are necessary. Possibly new interface between VFIO and
> KVM is needed.
> 
> But actually I am not sure if people really want assigned device in current
> TDX. Bottleneck of the performance should be the copying to/from swiotlb
> buffers.
> 
> B.R.
> Yu
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-01 16:18             ` James Bottomley
                                 ` (3 preceding siblings ...)
  2021-09-01 19:26               ` Dave Hansen
@ 2021-09-07 15:00               ` Tom Lendacky
  4 siblings, 0 replies; 71+ messages in thread
From: Tom Lendacky @ 2021-09-07 15:00 UTC (permalink / raw)
  To: jejb, Andy Lutomirski, David Hildenbrand, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm list, Linux Kernel Mailing List,
	Borislav Petkov, Andrew Morton, Joerg Roedel, Andi Kleen,
	David Rientjes, Vlastimil Babka, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Ingo Molnar, Varad Gautam, Dario Faggioli,
	the arch/x86 maintainers, linux-mm, linux-coco,
	Kirill A. Shutemov, Kirill A . Shutemov,
	Sathyanarayanan Kuppuswamy, Dave Hansen, Yu Zhang

On 9/1/21 11:18 AM, James Bottomley wrote:
> On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote:
> [...]
>> If you want to swap a page on TDX, you can't.  Sorry, go directly to
>> jail, do not collect $200.
> 
> Actually, even on SEV-ES you can't either.  You can read the encrypted
> page and write it out if you want, but unless you swap it back to the
> exact same physical memory location, the encryption key won't work.
> Since we don't guarantee this for swap, I think swap won't actually
> work for any confidential computing environment.

There are SEV/SEV-ES and SEV-SNP APIs to swap a page out and in. There is 
also an API to copy (for SEV/SEV-ES) or move (for SEV-SNP) a page in 
memory to help with balancing if needed.

These aren't currently implemented in the kernel, but can be. It requires 
the changes going in for live migration to recognize that a page is 
encrypted/private and invoke the APIs to transform the page before doing 
the operation.

Thanks,
Tom

> 
>> So I think there are literally zero code paths that currently call
>> try_to_unmap() that will actually work like that on TDX.  If we run
>> out of memory on a TDX host, we can kill the guest completely and
>> reclaim all of its memory (which probably also involves killing QEMU
>> or whatever other user program is in charge), but that's really our
>> only option.
> 
> I think our only option for swap is guest co-operation.  We're going to
> have to inflate a balloon or something in the guest and have the guest
> driver do some type of bounce of the page, where it becomes an
> unencrypted page in the guest (so the host can read it without the
> physical address keying of the encryption getting in the way) but
> actually encrypted with a swap transfer key known only to the guest.  I
> assume we can use the page acceptance infrastructure currently being
> discussed elsewhere to do swap back in as well ... the host provides
> the guest with the encrypted swap page and the guest has to decrypt it
> and place it in encrypted guest memory.
> 
> That way the swapped memory is securely encrypted, but can be swapped
> back in.
> 
> James
> 
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-03 19:15       ` Andy Lutomirski
@ 2021-09-10 17:18         ` Kirill A. Shutemov
  2021-09-15 19:58           ` Chao Peng
  0 siblings, 1 reply; 71+ messages in thread
From: Kirill A. Shutemov @ 2021-09-10 17:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, Kirill A . Shutemov, Kuppuswamy Sathyanarayanan,
	David Hildenbrand, Dave Hansen, Yu Zhang

[-- Attachment #1: Type: text/plain, Size: 11656 bytes --]

On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote:
> On 9/3/21 12:14 PM, Kirill A. Shutemov wrote:
> > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote:
> >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?
> > 
> > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look
> > closer into locking next.
> 
> We can decisively eliminate this sort of failure by making the switch
> happen at open time instead of after.  For a memfd-like API, this would
> be straightforward.  For a filesystem, it would take a bit more thought.

I think it should work fine as long as we check seals after i_size in the
read path. See the comment in shmem_file_read_iter().

Below is updated version. I think it should be good enough to start
integrate with KVM.

I also attach a test-case that consists of kernel patch and userspace
program. It demonstrates how it can be integrated into KVM code.

One caveat I noticed is that guest_ops::invalidate_page_range() can be
called after the owner (struct kvm) has being freed. It happens because
memfd can outlive KVM. So the callback has to check if such owner exists,
than check that there's a memslot with such inode.

I guess it should be okay: we have vm_list we can check owner against.
We may consider replace vm_list with something more scalable if number of
VMs will get too high.

Any comments?

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..3005e233140a 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -4,13 +4,34 @@
 
 #include <linux/file.h>
 
+struct guest_ops {
+	void (*invalidate_page_range)(struct inode *inode, void *owner,
+				      pgoff_t start, pgoff_t end);
+};
+
+struct guest_mem_ops {
+	unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset);
+	void (*put_unlock_pfn)(unsigned long pfn);
+
+};
+
 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
+
+extern inline int memfd_register_guest(struct inode *inode, void *owner,
+				       const struct guest_ops *guest_ops,
+				       const struct guest_mem_ops **guest_mem_ops);
 #else
 static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 {
 	return -EINVAL;
 }
+static inline int memfd_register_guest(struct inode *inode, void *owner,
+				       const struct guest_ops *guest_ops,
+				       const struct guest_mem_ops **guest_mem_ops)
+{
+	return -EINVAL;
+}
 #endif
 
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 8e775ce517bb..265d0c13bc5e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -12,6 +12,9 @@
 
 /* inode in-kernel data */
 
+struct guest_ops;
+struct guest_mem_ops;
+
 struct shmem_inode_info {
 	spinlock_t		lock;
 	unsigned int		seals;		/* shmem seals */
@@ -24,6 +27,8 @@ struct shmem_inode_info {
 	struct simple_xattrs	xattrs;		/* list of xattrs */
 	atomic_t		stop_eviction;	/* hold when working on inode */
 	struct inode		vfs_inode;
+	void			*guest_owner;
+	const struct guest_ops	*guest_ops;
 };
 
 struct shmem_sb_info {
@@ -90,6 +95,10 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 						pgoff_t start, pgoff_t end);
 
+extern int shmem_register_guest(struct inode *inode, void *owner,
+				const struct guest_ops *guest_ops,
+				const struct guest_mem_ops **guest_mem_ops);
+
 /* Flag allocation requirements to shmem_getpage */
 enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..c79bc8572721 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
 #define F_SEAL_GROW	0x0004	/* prevent file from growing */
 #define F_SEAL_WRITE	0x0008	/* prevent writes */
 #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
+#define F_SEAL_GUEST		0x0020
 /* (1U << 31) is reserved for signed error codes */
 
 /*
diff --git a/mm/memfd.c b/mm/memfd.c
index 081dd33e6a61..ae43454789f4 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -130,11 +130,24 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
 	return NULL;
 }
 
+int memfd_register_guest(struct inode *inode, void *owner,
+			 const struct guest_ops *guest_ops,
+			 const struct guest_mem_ops **guest_mem_ops)
+{
+	if (shmem_mapping(inode->i_mapping)) {
+		return shmem_register_guest(inode, owner,
+					    guest_ops, guest_mem_ops);
+	}
+
+	return -EINVAL;
+}
+
 #define F_ALL_SEALS (F_SEAL_SEAL | \
 		     F_SEAL_SHRINK | \
 		     F_SEAL_GROW | \
 		     F_SEAL_WRITE | \
-		     F_SEAL_FUTURE_WRITE)
+		     F_SEAL_FUTURE_WRITE | \
+		     F_SEAL_GUEST)
 
 static int memfd_add_seals(struct file *file, unsigned int seals)
 {
@@ -203,10 +216,27 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
 		}
 	}
 
+	if (seals & F_SEAL_GUEST) {
+		i_mmap_lock_read(inode->i_mapping);
+
+		if (!RB_EMPTY_ROOT(&inode->i_mapping->i_mmap.rb_root)) {
+			error = -EBUSY;
+			goto unlock;
+		}
+
+		if (i_size_read(inode)) {
+			error = -EBUSY;
+			goto unlock;
+		}
+	}
+
 	*file_seals |= seals;
 	error = 0;
 
 unlock:
+	if (seals & F_SEAL_GUEST)
+		i_mmap_unlock_read(inode->i_mapping);
+
 	inode_unlock(inode);
 	return error;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index dacda7463d54..54c213b7b42a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -80,6 +80,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/userfaultfd_k.h>
 #include <linux/rmap.h>
 #include <linux/uuid.h>
+#include <linux/memfd.h>
 
 #include <linux/uaccess.h>
 
@@ -883,6 +884,18 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
 	return split_huge_page(page) >= 0;
 }
 
+static void guest_invalidate_page(struct inode *inode,
+				  struct page *page, pgoff_t start, pgoff_t end)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	start = max(start, page->index);
+	end = min(end, page->index + HPAGE_PMD_NR) - 1;
+
+	info->guest_ops->invalidate_page_range(inode, info->guest_owner,
+					       start, end);
+}
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -923,6 +936,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			index += thp_nr_pages(page) - 1;
 
+			guest_invalidate_page(inode, page, start, end);
+
 			if (!unfalloc || !PageUptodate(page))
 				truncate_inode_page(mapping, page);
 			unlock_page(page);
@@ -999,6 +1014,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 					index--;
 					break;
 				}
+
+				guest_invalidate_page(inode, page, start, end);
+
 				VM_BUG_ON_PAGE(PageWriteback(page), page);
 				if (shmem_punch_compound(page, start, end))
 					truncate_inode_page(mapping, page);
@@ -1074,6 +1092,9 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
 			return -EPERM;
 
+		if ((info->seals & F_SEAL_GUEST) && (newsize & ~PAGE_MASK))
+			return -EINVAL;
+
 		if (newsize != oldsize) {
 			error = shmem_reacct_size(SHMEM_I(inode)->flags,
 					oldsize, newsize);
@@ -1348,6 +1369,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		goto redirty;
 	if (!total_swap_pages)
 		goto redirty;
+	if (info->seals & F_SEAL_GUEST)
+		goto redirty;
 
 	/*
 	 * Our capabilities prevent regular writeback or sync from ever calling
@@ -2274,6 +2297,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if (ret)
 		return ret;
 
+	if (info->seals & F_SEAL_GUEST)
+		return -EPERM;
+
 	/* arm64 - allow memory tagging on RAM-based files */
 	vma->vm_flags |= VM_MTE_ALLOWED;
 
@@ -2471,12 +2497,14 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 	pgoff_t index = pos >> PAGE_SHIFT;
 
 	/* i_mutex is held by caller */
-	if (unlikely(info->seals & (F_SEAL_GROW |
-				   F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
+	if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE |
+				    F_SEAL_FUTURE_WRITE | F_SEAL_GUEST))) {
 		if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))
 			return -EPERM;
 		if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
 			return -EPERM;
+		if (info->seals & F_SEAL_GUEST)
+			return -EPERM;
 	}
 
 	return shmem_getpage(inode, index, pagep, SGP_WRITE);
@@ -2550,6 +2578,20 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		end_index = i_size >> PAGE_SHIFT;
 		if (index > end_index)
 			break;
+
+		/*
+		 * inode_lock protects setting up seals as well as write to
+		 * i_size. Setting F_SEAL_GUEST only allowed with i_size == 0.
+		 *
+		 * Check F_SEAL_GUEST after i_size. It effectively serialize
+		 * read vs. setting F_SEAL_GUEST without taking inode_lock in
+		 * read path.
+		 */
+		if (SHMEM_I(inode)->seals & F_SEAL_GUEST) {
+			error = -EPERM;
+			break;
+		}
+
 		if (index == end_index) {
 			nr = i_size & ~PAGE_MASK;
 			if (nr <= offset)
@@ -2675,6 +2717,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			goto out;
 		}
 
+		if ((info->seals & F_SEAL_GUEST) &&
+		    (offset & ~PAGE_MASK || len & ~PAGE_MASK)) {
+			error = -EINVAL;
+			goto out;
+		}
+
 		shmem_falloc.waitq = &shmem_falloc_waitq;
 		shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
 		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
@@ -3761,6 +3809,20 @@ static void shmem_destroy_inodecache(void)
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
+#ifdef CONFIG_MIGRATION
+int shmem_migrate_page(struct address_space *mapping,
+		struct page *newpage, struct page *page,
+		enum migrate_mode mode)
+{
+	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (info->seals & F_SEAL_GUEST)
+		return -ENOTSUPP;
+	return migrate_page(mapping, newpage, page, mode);
+}
+#endif
+
 const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
@@ -3769,12 +3831,57 @@ const struct address_space_operations shmem_aops = {
 	.write_end	= shmem_write_end,
 #endif
 #ifdef CONFIG_MIGRATION
-	.migratepage	= migrate_page,
+	.migratepage	= shmem_migrate_page,
 #endif
 	.error_remove_page = generic_error_remove_page,
 };
 EXPORT_SYMBOL(shmem_aops);
 
+static unsigned long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset)
+{
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(inode, offset, &page, SGP_WRITE);
+	if (ret)
+		return ret;
+
+	return page_to_pfn(page);
+}
+
+static void shmem_put_unlock_pfn(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	set_page_dirty(page);
+	unlock_page(page);
+	put_page(page);
+}
+
+static const struct guest_mem_ops shmem_guest_ops = {
+	.get_lock_pfn = shmem_get_lock_pfn,
+	.put_unlock_pfn = shmem_put_unlock_pfn,
+};
+
+int shmem_register_guest(struct inode *inode, void *owner,
+			 const struct guest_ops *guest_ops,
+			 const struct guest_mem_ops **guest_mem_ops)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (!owner)
+		return -EINVAL;
+
+	if (info->guest_owner && info->guest_owner != owner)
+		return -EPERM;
+
+	info->guest_ops = guest_ops;
+	*guest_mem_ops = &shmem_guest_ops;
+	return 0;
+}
+
 static const struct file_operations shmem_file_operations = {
 	.mmap		= shmem_mmap,
 	.get_unmapped_area = shmem_get_unmapped_area,
-- 
 Kirill A. Shutemov

[-- Attachment #2: test-case.patch --]
[-- Type: text/x-diff, Size: 2424 bytes --]

diff --git a/mm/shmem.c b/mm/shmem.c
index 54c213b7b42a..78d07299ef04 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -81,6 +81,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/rmap.h>
 #include <linux/uuid.h>
 #include <linux/memfd.h>
+#include <linux/debugfs.h>
 
 #include <linux/uaccess.h>
 
@@ -3882,6 +3883,85 @@ int shmem_register_guest(struct inode *inode, void *owner,
 	return 0;
 }
 
+static void test_guest_invalidate_page_range(struct inode *inode, void *owner,
+					     pgoff_t start, pgoff_t end)
+{
+	/*
+	 * XXX: We can get here after the owner no longer exists. Need to check
+	 * somehow if the owner is still alive.
+	 */
+
+	printk("invalidate: %#lx-%#lx\n", start, end);
+}
+
+static const struct guest_ops guest_ops = {
+	.invalidate_page_range = test_guest_invalidate_page_range,
+};
+
+#define GUEST_MEM_REGISTER	_IOW('m', 1, int)
+#define GUEST_MEM_UNREGISTER	_IOW('m', 2, int)
+#define GUEST_MEM_GET_PFN	_IOW('m', 3, unsigned long)
+
+static long guest_mem_test_ioctl(struct file *file,
+				 unsigned int cmd, unsigned long arg)
+{
+	static struct file *memfd_file = NULL;
+	static const struct guest_mem_ops *guest_mem_ops = NULL;
+
+	switch (cmd) {
+	case GUEST_MEM_REGISTER: {
+		struct fd memfd = fdget(arg);
+
+		if (memfd_file)
+			return -EBUSY;
+
+		if (!memfd.file)
+			return -EINVAL;
+
+		memfd_file = memfd.file;
+		return memfd_register_guest(memfd_file->f_inode, file,
+					    &guest_ops, &guest_mem_ops);
+	}
+	case GUEST_MEM_UNREGISTER: {
+		if (!memfd_file)
+			return -EINVAL;
+
+		fput(memfd_file);
+		memfd_file = NULL;
+		guest_mem_ops = NULL;
+		return 0;
+	}
+	case GUEST_MEM_GET_PFN: {
+		unsigned long pfn;
+		printk("guest_mem_ops: %px\n", guest_mem_ops);
+		pfn = guest_mem_ops->get_lock_pfn(memfd_file->f_inode, 0);
+		printk("pfn: %#lx\n");
+		if (pfn < 0)
+			return pfn;
+		guest_mem_ops->put_unlock_pfn(pfn);
+		return 0;
+	}
+	default:
+		return -EINVAL;
+	}
+
+	return -EINVAL;
+}
+
+static const struct file_operations guest_mem_test_fops = {
+	.owner	 = THIS_MODULE,
+	.unlocked_ioctl = guest_mem_test_ioctl,
+	.llseek  = no_llseek,
+};
+
+static int __init guest_mem_test(void)
+{
+	debugfs_create_file("guest_mem_test", 0200, NULL, NULL,
+			    &guest_mem_test_fops);
+	return 0;
+}
+late_initcall(guest_mem_test);
+
 static const struct file_operations shmem_file_operations = {
 	.mmap		= shmem_mmap,
 	.get_unmapped_area = shmem_get_unmapped_area,

[-- Attachment #3: memfd_guest.c --]
[-- Type: text/x-c, Size: 1716 bytes --]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#define GUEST_MEM_REGISTER	_IOW('m', 1, int)
#define GUEST_MEM_UNREGISTER	_IOW('m', 2, int)
#define GUEST_MEM_GET_PFN	_IOW('m', 3, unsigned long)

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); } while (0)

#define F_SEAL_GUEST	0x0020

int main(int argc, char *argv[])
{
	int fd, test_fd;
	char buf[4];

	/* Create an anonymous file in tmpfs; allow seals to be
	   placed on the file. */

	fd = memfd_create("memfd_guest", MFD_ALLOW_SEALING);
	if (fd == -1)
		errExit("memfd_create");

#if 0
	if (write(fd, "test", 4) != 4)
		errExit("write");
#endif

#if 0
	if (ftruncate(fd, 4096) == -1)
		errExit("ftruncate");
#endif

#if 0
	if (mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0) == MAP_FAILED)
		errExit("mmap");
#endif

	if (fcntl(fd, F_ADD_SEALS, F_SEAL_GUEST) == -1)
		errExit("fcntl");

	if (ftruncate(fd, 2UL << 20) == -1)
		errExit("ftruncate");

	/* Expected to fail */
	if (write(fd, "test", 4) != -1)
		errExit("write");

	if (read(fd, buf, 4) != -1)
		errExit("read");

	if (mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0) != MAP_FAILED)
		errExit("mmap");

	test_fd = open("/sys/kernel/debug/guest_mem_test", O_RDWR);
	if (fd == -1)
		errExit("open");

	if (ioctl(test_fd, GUEST_MEM_REGISTER, fd) == -1)
		errExit("ioctl");

	if (ioctl(test_fd, GUEST_MEM_GET_PFN, 0x100) == -1)
		errExit("ioctl");

	if (ioctl(test_fd, GUEST_MEM_GET_PFN, 0x1) == -1)
		errExit("ioctl");

	if (ftruncate(fd, 0))
		errExit("ftruncate");

	if (ioctl(test_fd, GUEST_MEM_UNREGISTER, fd) == -1)
		errExit("ioctl");

	return 0;
}

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-15 19:58           ` Chao Peng
@ 2021-09-15 13:51             ` David Hildenbrand
  2021-09-15 14:29               ` Kirill A. Shutemov
  2021-09-15 14:11             ` Kirill A. Shutemov
  1 sibling, 1 reply; 71+ messages in thread
From: David Hildenbrand @ 2021-09-15 13:51 UTC (permalink / raw)
  To: Chao Peng, Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Borislav Petkov, Andrew Morton, Joerg Roedel,
	Andi Kleen, David Rientjes, Vlastimil Babka, Tom Lendacky,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, Dave Hansen, Yu Zhang

>> diff --git a/mm/memfd.c b/mm/memfd.c
>> index 081dd33e6a61..ae43454789f4 100644
>> --- a/mm/memfd.c
>> +++ b/mm/memfd.c
>> @@ -130,11 +130,24 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
>>   	return NULL;
>>   }
>>   
>> +int memfd_register_guest(struct inode *inode, void *owner,
>> +			 const struct guest_ops *guest_ops,
>> +			 const struct guest_mem_ops **guest_mem_ops)
>> +{
>> +	if (shmem_mapping(inode->i_mapping)) {
>> +		return shmem_register_guest(inode, owner,
>> +					    guest_ops, guest_mem_ops);
>> +	}
>> +
>> +	return -EINVAL;
>> +}
> 
> Are we stick our design to memfd interface (e.g other memory backing
> stores like tmpfs and hugetlbfs will all rely on this memfd interface to
> interact with KVM), or this is just the initial implementation for PoC?

I don't think we are, it still feels like we are in the early prototype 
phase (even way before a PoC). I'd be happy to see something "cleaner" 
so to say -- it still feels kind of hacky to me, especially there seem 
to be many pieces of the big puzzle missing so far. Unfortunately, this 
series hasn't caught the attention of many -MM people so far, maybe 
because other people miss the big picture as well and are waiting for a 
complete design proposal.

For example, what's unclear to me: we'll be allocating pages with 
GFP_HIGHUSER_MOVABLE, making them land on MIGRATE_CMA or ZONE_MOVABLE; 
then we silently turn them unmovable, which breaks these concepts. Who'd 
migrate these pages away just like when doing long-term pinning, or how 
is that supposed to work?

Also unclear to me is how refcount and mapcount will be handled to 
prevent swapping, who will actually do some kind of gfn-epfn etc. 
mapping, how we'll forbid access to this memory e.g., via /proc/kcore or 
when dumping memory ... and how it would ever work with 
migration/swapping/rmap (it's clearly future work, but it's been raised 
that this would be the way to make it work, I don't quite see how it 
would all come together).

<note>
Last but not least, I raised to Intel via a different channel that I'd 
appreciate updated hardware that avoids essentially crashing the 
hypervisor when writing to encrypted memory from user space. It has the 
smell of "broken hardware" to it that might just be fixed by a new 
hardware generation to make it look more similar to other successful 
implementations of secure/encrypted memory. That might it much easier to 
support an initial version of TDX -- instead of having to reinvent the 
way we map guest memory just now to support hardware that might sort out 
the root problem later.

Having that said, there might be benefits to mapping guest memory 
differently, but my gut feeling is that it might take quite a long time 
to get something reasonable working, to settle on a design, and to get 
it accepted by all involved parties to merge it upstream.

Just my 2 cents, I might be all wrong as so often.
<\note>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-15 19:58           ` Chao Peng
  2021-09-15 13:51             ` David Hildenbrand
@ 2021-09-15 14:11             ` Kirill A. Shutemov
  2021-09-16  7:36               ` Chao Peng
  2021-09-16  9:24               ` Paolo Bonzini
  1 sibling, 2 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2021-09-15 14:11 UTC (permalink / raw)
  To: Chao Peng
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov, Andrew Morton,
	Joerg Roedel, Andi Kleen, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Varad Gautam, Dario Faggioli, x86, linux-mm, linux-coco,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen,
	Yu Zhang

On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote:
> On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote:
> > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote:
> > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote:
> > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?
> > > > 
> > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look
> > > > closer into locking next.
> > > 
> > > We can decisively eliminate this sort of failure by making the switch
> > > happen at open time instead of after.  For a memfd-like API, this would
> > > be straightforward.  For a filesystem, it would take a bit more thought.
> > 
> > I think it should work fine as long as we check seals after i_size in the
> > read path. See the comment in shmem_file_read_iter().
> > 
> > Below is updated version. I think it should be good enough to start
> > integrate with KVM.
> > 
> > I also attach a test-case that consists of kernel patch and userspace
> > program. It demonstrates how it can be integrated into KVM code.
> > 
> > One caveat I noticed is that guest_ops::invalidate_page_range() can be
> > called after the owner (struct kvm) has being freed. It happens because
> > memfd can outlive KVM. So the callback has to check if such owner exists,
> > than check that there's a memslot with such inode.
> 
> Would introducing memfd_unregister_guest() fix this?

I considered this, but it get complex quickly.

At what point it gets called? On KVM memslot destroy?

What if multiple KVM slot share the same memfd? Add refcount into memfd on
how many times the owner registered the memfd?

It would leave us in strange state: memfd refcount owners (struct KVM) and
KVM memslot pins the struct file. Weird refcount exchnage program.

I hate it.

> > I guess it should be okay: we have vm_list we can check owner against.
> > We may consider replace vm_list with something more scalable if number of
> > VMs will get too high.
> > 
> > Any comments?
> > 
> > diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> > index 4f1600413f91..3005e233140a 100644
> > --- a/include/linux/memfd.h
> > +++ b/include/linux/memfd.h
> > @@ -4,13 +4,34 @@
> >  
> >  #include <linux/file.h>
> >  
> > +struct guest_ops {
> > +	void (*invalidate_page_range)(struct inode *inode, void *owner,
> > +				      pgoff_t start, pgoff_t end);
> > +};
> 
> I can see there are two scenarios to invalidate page(s), when punching a
> hole or ftruncating to 0, in either cases KVM should already been called
> with necessary information from usersapce with memory slot punch hole
> syscall or memory slot delete syscall, so wondering this callback is
> really needed.

So what you propose? Forbid truncate/punch from userspace and make KVM
handle punch hole/truncate from within kernel? I think it's layering
violation.

> > +
> > +struct guest_mem_ops {
> > +	unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset);
> > +	void (*put_unlock_pfn)(unsigned long pfn);
> 
> Same as above, I’m not clear on which time put_unlock_pfn() would be
> called, I’m thinking the page can be put_and_unlock when userspace
> punching a hole or ftruncating to 0 on the fd.

No. put_unlock_pfn() has to be called after the pfn is in SEPT. This way
we close race between SEPT population and truncate/punch. get_lock_pfn()
would stop truncate untile put_unlock_pfn() called.

> We did miss pfn_mapping_level() callback which is needed for KVM to query
> the page size level (e.g. 4K or 2M) that the backing store can support.

Okay, makes sense. We can return the information as part of get_lock_pfn()
call.

> Are we stick our design to memfd interface (e.g other memory backing
> stores like tmpfs and hugetlbfs will all rely on this memfd interface to
> interact with KVM), or this is just the initial implementation for PoC?
> 
> If we really want to expose multiple memory backing stores directly to
> KVM (as opposed to expose backing stores to memfd and then memfd expose
> single interface to KVM), I feel we need a third layer between KVM and
> backing stores to eliminate the direct call like this. Backing stores
> can register ‘memory fd providers’ and KVM should be able to connect to
> the right backing store provider with the fd provided by usersapce under
> the help of this third layer.

memfd can provide shmem and hugetlbfs. That's should be enough for now.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-15 13:51             ` David Hildenbrand
@ 2021-09-15 14:29               ` Kirill A. Shutemov
  2021-09-15 14:59                 ` David Hildenbrand
  0 siblings, 1 reply; 71+ messages in thread
From: Kirill A. Shutemov @ 2021-09-15 14:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chao Peng, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, Kuppuswamy Sathyanarayanan, Dave Hansen, Yu Zhang

On Wed, Sep 15, 2021 at 03:51:25PM +0200, David Hildenbrand wrote:
> > > diff --git a/mm/memfd.c b/mm/memfd.c
> > > index 081dd33e6a61..ae43454789f4 100644
> > > --- a/mm/memfd.c
> > > +++ b/mm/memfd.c
> > > @@ -130,11 +130,24 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
> > >   	return NULL;
> > >   }
> > > +int memfd_register_guest(struct inode *inode, void *owner,
> > > +			 const struct guest_ops *guest_ops,
> > > +			 const struct guest_mem_ops **guest_mem_ops)
> > > +{
> > > +	if (shmem_mapping(inode->i_mapping)) {
> > > +		return shmem_register_guest(inode, owner,
> > > +					    guest_ops, guest_mem_ops);
> > > +	}
> > > +
> > > +	return -EINVAL;
> > > +}
> > 
> > Are we stick our design to memfd interface (e.g other memory backing
> > stores like tmpfs and hugetlbfs will all rely on this memfd interface to
> > interact with KVM), or this is just the initial implementation for PoC?
> 
> I don't think we are, it still feels like we are in the early prototype
> phase (even way before a PoC). I'd be happy to see something "cleaner" so to
> say -- it still feels kind of hacky to me, especially there seem to be many
> pieces of the big puzzle missing so far. Unfortunately, this series hasn't
> caught the attention of many -MM people so far, maybe because other people
> miss the big picture as well and are waiting for a complete design proposal.
> 
> For example, what's unclear to me: we'll be allocating pages with
> GFP_HIGHUSER_MOVABLE, making them land on MIGRATE_CMA or ZONE_MOVABLE; then
> we silently turn them unmovable, which breaks these concepts. Who'd migrate
> these pages away just like when doing long-term pinning, or how is that
> supposed to work?

That's fair point. We can fix it by changing mapping->gfp_mask.

> Also unclear to me is how refcount and mapcount will be handled to prevent
> swapping,

refcount and mapcount are unchanged. Pages not pinned per se. Swapping
prevented with the change in shmem_writepage().

> who will actually do some kind of gfn-epfn etc. mapping, how we'll
> forbid access to this memory e.g., via /proc/kcore or when dumping memory

It's not aimed to prevent root to shoot into his leg. Root do root.

> ... and how it would ever work with migration/swapping/rmap (it's clearly
> future work, but it's been raised that this would be the way to make it
> work, I don't quite see how it would all come together).

Given that hardware supports it migration and swapping can be implemented
by providing new callbacks in guest_ops. Like ->migrate_page would
transfer encrypted data between pages and ->swapout would provide
encrypted blob that can be put on disk or handled back to ->swapin to
bring back to memory.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-15 14:29               ` Kirill A. Shutemov
@ 2021-09-15 14:59                 ` David Hildenbrand
  2021-09-15 15:35                   ` David Hildenbrand
  2021-09-15 20:04                   ` Kirill A. Shutemov
  0 siblings, 2 replies; 71+ messages in thread
From: David Hildenbrand @ 2021-09-15 14:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Chao Peng, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, Kuppuswamy Sathyanarayanan, Dave Hansen, Yu Zhang


>> I don't think we are, it still feels like we are in the early prototype
>> phase (even way before a PoC). I'd be happy to see something "cleaner" so to
>> say -- it still feels kind of hacky to me, especially there seem to be many
>> pieces of the big puzzle missing so far. Unfortunately, this series hasn't
>> caught the attention of many -MM people so far, maybe because other people
>> miss the big picture as well and are waiting for a complete design proposal.
>>
>> For example, what's unclear to me: we'll be allocating pages with
>> GFP_HIGHUSER_MOVABLE, making them land on MIGRATE_CMA or ZONE_MOVABLE; then
>> we silently turn them unmovable, which breaks these concepts. Who'd migrate
>> these pages away just like when doing long-term pinning, or how is that
>> supposed to work?
> 
> That's fair point. We can fix it by changing mapping->gfp_mask.

That's essentially what secretmem does when setting up a file.

> 
>> Also unclear to me is how refcount and mapcount will be handled to prevent
>> swapping,
> 
> refcount and mapcount are unchanged. Pages not pinned per se. Swapping
> prevented with the change in shmem_writepage().

So when mapping into the guest, we'd increment the refcount but not the 
mapcount I assume?

> 
>> who will actually do some kind of gfn-epfn etc. mapping, how we'll
>> forbid access to this memory e.g., via /proc/kcore or when dumping memory
> 
> It's not aimed to prevent root to shoot into his leg. Root do root.

IMHO being root is not an excuse to read some random file (actually used 
in production environments) to result in the machine crashing. Not 
acceptable for distributions.

I'm still missing the whole gfn-epfn 1:1 mapping discussion we 
identified as requirements. Is that supposed to be done by KVM? How?

> 
>> ... and how it would ever work with migration/swapping/rmap (it's clearly
>> future work, but it's been raised that this would be the way to make it
>> work, I don't quite see how it would all come together).
> 
> Given that hardware supports it migration and swapping can be implemented
> by providing new callbacks in guest_ops. Like ->migrate_page would
> transfer encrypted data between pages and ->swapout would provide
> encrypted blob that can be put on disk or handled back to ->swapin to
> bring back to memory.

Again, I'm missing the complete picture. To make swapping decisions 
vmscan code needs track+handle dirty+reference information. How would we 
be able to track references? Does the hardware allow for temporary 
unmapping of encrypted memory and faulting on it? How would 
page_referenced() continue working? "we can add callbacks" is not a 
satisfying answer, at least for me. Especially, when it comes to 
eventual locking problems and races.

Maybe saying "migration+swap is not supported" is clearer than "we can 
add callbacks" and missing some details on the bigger picture.

Again, a complete design proposal would be highly valuable, especially 
to get some more review from other -MM folks. Otherwise there is a high 
chance that this will be rejected late when trying to upstream and -MM 
people stumbling over it (we've had some similar thing happening just 
recently unfortunately ...).

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-15 14:59                 ` David Hildenbrand
@ 2021-09-15 15:35                   ` David Hildenbrand
  2021-09-15 20:04                   ` Kirill A. Shutemov
  1 sibling, 0 replies; 71+ messages in thread
From: David Hildenbrand @ 2021-09-15 15:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Chao Peng, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, Kuppuswamy Sathyanarayanan, Dave Hansen, Yu Zhang

>>
>>> who will actually do some kind of gfn-epfn etc. mapping, how we'll
>>> forbid access to this memory e.g., via /proc/kcore or when dumping memory
>>
>> It's not aimed to prevent root to shoot into his leg. Root do root.
> 
> IMHO being root is not an excuse to read some random file (actually used
> in production environments) to result in the machine crashing. Not
> acceptable for distributions.
I just realized that reading encrypted memory should be ok and only 
writing is an issue, right?


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-10 17:18         ` Kirill A. Shutemov
@ 2021-09-15 19:58           ` Chao Peng
  2021-09-15 13:51             ` David Hildenbrand
  2021-09-15 14:11             ` Kirill A. Shutemov
  0 siblings, 2 replies; 71+ messages in thread
From: Chao Peng @ 2021-09-15 19:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Borislav Petkov, Andrew Morton, Joerg Roedel,
	Andi Kleen, David Rientjes, Vlastimil Babka, Tom Lendacky,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, Kirill A . Shutemov,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen,
	Yu Zhang

On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote:
> > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote:
> > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote:
> > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?
> > > 
> > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look
> > > closer into locking next.
> > 
> > We can decisively eliminate this sort of failure by making the switch
> > happen at open time instead of after.  For a memfd-like API, this would
> > be straightforward.  For a filesystem, it would take a bit more thought.
> 
> I think it should work fine as long as we check seals after i_size in the
> read path. See the comment in shmem_file_read_iter().
> 
> Below is updated version. I think it should be good enough to start
> integrate with KVM.
> 
> I also attach a test-case that consists of kernel patch and userspace
> program. It demonstrates how it can be integrated into KVM code.
> 
> One caveat I noticed is that guest_ops::invalidate_page_range() can be
> called after the owner (struct kvm) has being freed. It happens because
> memfd can outlive KVM. So the callback has to check if such owner exists,
> than check that there's a memslot with such inode.

Would introducing memfd_unregister_guest() fix this?

> 
> I guess it should be okay: we have vm_list we can check owner against.
> We may consider replace vm_list with something more scalable if number of
> VMs will get too high.
> 
> Any comments?
> 
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..3005e233140a 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -4,13 +4,34 @@
>  
>  #include <linux/file.h>
>  
> +struct guest_ops {
> +	void (*invalidate_page_range)(struct inode *inode, void *owner,
> +				      pgoff_t start, pgoff_t end);
> +};

I can see there are two scenarios to invalidate page(s), when punching a
hole or ftruncating to 0, in either cases KVM should already been called
with necessary information from usersapce with memory slot punch hole
syscall or memory slot delete syscall, so wondering this callback is
really needed.

> +
> +struct guest_mem_ops {
> +	unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset);
> +	void (*put_unlock_pfn)(unsigned long pfn);

Same as above, I’m not clear on which time put_unlock_pfn() would be
called, I’m thinking the page can be put_and_unlock when userspace
punching a hole or ftruncating to 0 on the fd.

We did miss pfn_mapping_level() callback which is needed for KVM to query
the page size level (e.g. 4K or 2M) that the backing store can support.

> +
> +};
> +
>  #ifdef CONFIG_MEMFD_CREATE
>  extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> +
> +extern inline int memfd_register_guest(struct inode *inode, void *owner,
> +				       const struct guest_ops *guest_ops,
> +				       const struct guest_mem_ops **guest_mem_ops);
>  #else
>  static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
>  {
>  	return -EINVAL;
>  }
> +static inline int memfd_register_guest(struct inode *inode, void *owner,
> +				       const struct guest_ops *guest_ops,
> +				       const struct guest_mem_ops **guest_mem_ops)
> +{
> +	return -EINVAL;
> +}
>  #endif
>  
>  #endif /* __LINUX_MEMFD_H */
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 8e775ce517bb..265d0c13bc5e 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -12,6 +12,9 @@
>  
>  /* inode in-kernel data */
>  
> +struct guest_ops;
> +struct guest_mem_ops;
> +
>  struct shmem_inode_info {
>  	spinlock_t		lock;
>  	unsigned int		seals;		/* shmem seals */
> @@ -24,6 +27,8 @@ struct shmem_inode_info {
>  	struct simple_xattrs	xattrs;		/* list of xattrs */
>  	atomic_t		stop_eviction;	/* hold when working on inode */
>  	struct inode		vfs_inode;
> +	void			*guest_owner;
> +	const struct guest_ops	*guest_ops;
>  };
>  
>  struct shmem_sb_info {
> @@ -90,6 +95,10 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
>  extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
>  						pgoff_t start, pgoff_t end);
>  
> +extern int shmem_register_guest(struct inode *inode, void *owner,
> +				const struct guest_ops *guest_ops,
> +				const struct guest_mem_ops **guest_mem_ops);
> +
>  /* Flag allocation requirements to shmem_getpage */
>  enum sgp_type {
>  	SGP_READ,	/* don't exceed i_size, don't allocate page */
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 2f86b2ad6d7e..c79bc8572721 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -43,6 +43,7 @@
>  #define F_SEAL_GROW	0x0004	/* prevent file from growing */
>  #define F_SEAL_WRITE	0x0008	/* prevent writes */
>  #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
> +#define F_SEAL_GUEST		0x0020
>  /* (1U << 31) is reserved for signed error codes */
>  
>  /*
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 081dd33e6a61..ae43454789f4 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -130,11 +130,24 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
>  	return NULL;
>  }
>  
> +int memfd_register_guest(struct inode *inode, void *owner,
> +			 const struct guest_ops *guest_ops,
> +			 const struct guest_mem_ops **guest_mem_ops)
> +{
> +	if (shmem_mapping(inode->i_mapping)) {
> +		return shmem_register_guest(inode, owner,
> +					    guest_ops, guest_mem_ops);
> +	}
> +
> +	return -EINVAL;
> +}

Are we stick our design to memfd interface (e.g other memory backing
stores like tmpfs and hugetlbfs will all rely on this memfd interface to
interact with KVM), or this is just the initial implementation for PoC?

If we really want to expose multiple memory backing stores directly to
KVM (as opposed to expose backing stores to memfd and then memfd expose
single interface to KVM), I feel we need a third layer between KVM and
backing stores to eliminate the direct call like this. Backing stores
can register ‘memory fd providers’ and KVM should be able to connect to
the right backing store provider with the fd provided by usersapce under
the help of this third layer.

Thanks,
Chao

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-15 14:59                 ` David Hildenbrand
  2021-09-15 15:35                   ` David Hildenbrand
@ 2021-09-15 20:04                   ` Kirill A. Shutemov
  1 sibling, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2021-09-15 20:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A. Shutemov, Chao Peng, Andy Lutomirski,
	Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Borislav Petkov,
	Andrew Morton, Joerg Roedel, Andi Kleen, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, Kuppuswamy Sathyanarayanan, Dave Hansen, Yu Zhang

On Wed, Sep 15, 2021 at 04:59:46PM +0200, David Hildenbrand wrote:
> 
> > > I don't think we are, it still feels like we are in the early prototype
> > > phase (even way before a PoC). I'd be happy to see something "cleaner" so to
> > > say -- it still feels kind of hacky to me, especially there seem to be many
> > > pieces of the big puzzle missing so far. Unfortunately, this series hasn't
> > > caught the attention of many -MM people so far, maybe because other people
> > > miss the big picture as well and are waiting for a complete design proposal.
> > > 
> > > For example, what's unclear to me: we'll be allocating pages with
> > > GFP_HIGHUSER_MOVABLE, making them land on MIGRATE_CMA or ZONE_MOVABLE; then
> > > we silently turn them unmovable, which breaks these concepts. Who'd migrate
> > > these pages away just like when doing long-term pinning, or how is that
> > > supposed to work?
> > 
> > That's fair point. We can fix it by changing mapping->gfp_mask.
> 
> That's essentially what secretmem does when setting up a file.
> 
> > 
> > > Also unclear to me is how refcount and mapcount will be handled to prevent
> > > swapping,
> > 
> > refcount and mapcount are unchanged. Pages not pinned per se. Swapping
> > prevented with the change in shmem_writepage().
> 
> So when mapping into the guest, we'd increment the refcount but not the
> mapcount I assume?

No. The only refcount hold page cache. But we inform KVM via callback
before removing the page from the page cache. It is similar to
mmu_notifier scheme KVM uses at the moment.

> 
> > 
> > > who will actually do some kind of gfn-epfn etc. mapping, how we'll
> > > forbid access to this memory e.g., via /proc/kcore or when dumping memory
> > 
> > It's not aimed to prevent root to shoot into his leg. Root do root.
> 
> IMHO being root is not an excuse to read some random file (actually used in
> production environments) to result in the machine crashing. Not acceptable
> for distributions.

Reading does not cause problems. Writing does.

> I'm still missing the whole gfn-epfn 1:1 mapping discussion we identified as
> requirements. Is that supposed to be done by KVM? How?

KVM memslots that represents a range of GFNs refers to memfd (and holds
file pin) plus offset in the file. This info enough to calculate offset in
the file and find PFN. memfd tied 1:1 to struct kvm and KVM would make
sure that there's only one possible gfn for a file offset.

> > > ... and how it would ever work with migration/swapping/rmap (it's clearly
> > > future work, but it's been raised that this would be the way to make it
> > > work, I don't quite see how it would all come together).
> > 
> > Given that hardware supports it migration and swapping can be implemented
> > by providing new callbacks in guest_ops. Like ->migrate_page would
> > transfer encrypted data between pages and ->swapout would provide
> > encrypted blob that can be put on disk or handled back to ->swapin to
> > bring back to memory.
> 
> Again, I'm missing the complete picture. To make swapping decisions vmscan
> code needs track+handle dirty+reference information. How would we be able to
> track references? Does the hardware allow for temporary unmapping of
> encrypted memory and faulting on it? How would page_referenced() continue
> working? "we can add callbacks" is not a satisfying answer, at least for me.
> Especially, when it comes to eventual locking problems and races.

HW doesn't support swapping yet, so details will be just speculation.
IIUC, there's an accessed bit in EPT that can be used for tracking.

> Maybe saying "migration+swap is not supported" is clearer than "we can add
> callbacks" and missing some details on the bigger picture.
> 
> Again, a complete design proposal would be highly valuable, especially to
> get some more review from other -MM folks. Otherwise there is a high chance
> that this will be rejected late when trying to upstream and -MM people
> stumbling over it (we've had some similar thing happening just recently
> unfortunately ...).

I only work on core-mm side of the story. We will definitely need to look
at whole picture again once all pieces are somewhat ready.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-15 14:11             ` Kirill A. Shutemov
@ 2021-09-16  7:36               ` Chao Peng
  2021-09-16  9:24               ` Paolo Bonzini
  1 sibling, 0 replies; 71+ messages in thread
From: Chao Peng @ 2021-09-16  7:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Borislav Petkov, Andrew Morton,
	Joerg Roedel, Andi Kleen, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Varad Gautam, Dario Faggioli, x86, linux-mm, linux-coco,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen,
	Yu Zhang

On Wed, Sep 15, 2021 at 05:11:47PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote:
> > On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote:
> > > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote:
> > > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote:
> > > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?
> > > > > 
> > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look
> > > > > closer into locking next.
> > > > 
> > > > We can decisively eliminate this sort of failure by making the switch
> > > > happen at open time instead of after.  For a memfd-like API, this would
> > > > be straightforward.  For a filesystem, it would take a bit more thought.
> > > 
> > > I think it should work fine as long as we check seals after i_size in the
> > > read path. See the comment in shmem_file_read_iter().
> > > 
> > > Below is updated version. I think it should be good enough to start
> > > integrate with KVM.
> > > 
> > > I also attach a test-case that consists of kernel patch and userspace
> > > program. It demonstrates how it can be integrated into KVM code.
> > > 
> > > One caveat I noticed is that guest_ops::invalidate_page_range() can be
> > > called after the owner (struct kvm) has being freed. It happens because
> > > memfd can outlive KVM. So the callback has to check if such owner exists,
> > > than check that there's a memslot with such inode.
> > 
> > Would introducing memfd_unregister_guest() fix this?
> 
> I considered this, but it get complex quickly.
> 
> At what point it gets called? On KVM memslot destroy?

I meant when the VM gets destroyed.

> 
> What if multiple KVM slot share the same memfd? Add refcount into memfd on
> how many times the owner registered the memfd?
> 
> It would leave us in strange state: memfd refcount owners (struct KVM) and
> KVM memslot pins the struct file. Weird refcount exchnage program.
> 
> I hate it.

But yes agree things will get much complex in practice.

> 
> > > I guess it should be okay: we have vm_list we can check owner against.
> > > We may consider replace vm_list with something more scalable if number of
> > > VMs will get too high.
> > > 
> > > Any comments?
> > > 
> > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> > > index 4f1600413f91..3005e233140a 100644
> > > --- a/include/linux/memfd.h
> > > +++ b/include/linux/memfd.h
> > > @@ -4,13 +4,34 @@
> > >  
> > >  #include <linux/file.h>
> > >  
> > > +struct guest_ops {
> > > +	void (*invalidate_page_range)(struct inode *inode, void *owner,
> > > +				      pgoff_t start, pgoff_t end);
> > > +};
> > 
> > I can see there are two scenarios to invalidate page(s), when punching a
> > hole or ftruncating to 0, in either cases KVM should already been called
> > with necessary information from usersapce with memory slot punch hole
> > syscall or memory slot delete syscall, so wondering this callback is
> > really needed.
> 
> So what you propose? Forbid truncate/punch from userspace and make KVM
> handle punch hole/truncate from within kernel? I think it's layering
> violation.

As far as I understand the flow for punch hole/truncate in this design,
there will be two steps for userspace:
  1. punch hole/delete kvm memory slot, and then
  2. puncn hole/truncate on the memory backing store fd.

In concept we can do whatever needed for invalidation in either steps.
If we can do the invalidation in step 1 then we don’t need bothering
this callback. This is what I mean but agree the current callback can
also work.

> 
> > > +
> > > +struct guest_mem_ops {
> > > +	unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset);
> > > +	void (*put_unlock_pfn)(unsigned long pfn);
> > 
> > Same as above, I’m not clear on which time put_unlock_pfn() would be
> > called, I’m thinking the page can be put_and_unlock when userspace
> > punching a hole or ftruncating to 0 on the fd.
> 
> No. put_unlock_pfn() has to be called after the pfn is in SEPT. This way
> we close race between SEPT population and truncate/punch. get_lock_pfn()
> would stop truncate untile put_unlock_pfn() called.

Okay, makes sense.

Thanks,
Chao

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-09-15 14:11             ` Kirill A. Shutemov
  2021-09-16  7:36               ` Chao Peng
@ 2021-09-16  9:24               ` Paolo Bonzini
  1 sibling, 0 replies; 71+ messages in thread
From: Paolo Bonzini @ 2021-09-16  9:24 UTC (permalink / raw)
  To: Kirill A. Shutemov, Chao Peng
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Borislav Petkov, Andrew Morton, Joerg Roedel,
	Andi Kleen, David Rientjes, Vlastimil Babka, Tom Lendacky,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen,
	Yu Zhang

On 15/09/21 16:11, Kirill A. Shutemov wrote:
>> Would introducing memfd_unregister_guest() fix this?
> I considered this, but it get complex quickly.
> 
> At what point it gets called? On KVM memslot destroy?
> 
> What if multiple KVM slot share the same memfd? Add refcount into memfd on
> how many times the owner registered the memfd?

You will always have multiple KVM slots sharing the same memfd, because 
memslots are SRCU-protected.  So there will be multiple generations of 
memslots around and unregistering must be delayed to after 
synchronize_srcu (around the call to kvm_arch_commit_memory_region).

So KVM could just call memfd_{,un}register_guest as many times as it 
calls fdget/fput.  Looking at your test device, it would be like the 
following pseudo-patch:

	case GUEST_MEM_REGISTER: {
		struct fd memfd = fdget(arg);
		memfd_file = memfd.file;
		return memfd_register_guest(memfd_file->f_inode, file,
					    &guest_ops, &guest_mem_ops);
	}
	case GUEST_MEM_UNREGISTER: {
		if (!memfd_file)
			return -EINVAL;

+		memfd_unregister_guest(memfd_file->f_inode, file);
		fput(memfd_file);
		memfd_file = NULL;
		guest_mem_ops = NULL;
		return 0;

and shmem_unregister_guest would be something like

	struct shmem_inode_info *info = SHMEM_I(inode);

	if (WARN_ON_ONCE(info->guest_owner != owner))
		return;
	if (--info->guest_usage_count)
		return;
	info->guest_owner = NULL;
	info->guest_ops = NULL;

Paolo

> It would leave us in strange state: memfd refcount owners (struct KVM) and
> KVM memslot pins the struct file. Weird refcount exchnage program.


^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2021-09-16  9:24 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-24  0:52 [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory Sean Christopherson
2021-08-24 10:48 ` Yu Zhang
2021-08-26  0:35   ` Sean Christopherson
2021-08-26 13:23     ` Yu Zhang
2021-08-26 10:15 ` David Hildenbrand
2021-08-26 17:05   ` Andy Lutomirski
2021-08-26 21:26     ` David Hildenbrand
2021-08-27 18:24       ` Andy Lutomirski
2021-08-27 22:28         ` Sean Christopherson
2021-08-31 19:12           ` David Hildenbrand
2021-08-31 20:45             ` Sean Christopherson
2021-09-01  7:51               ` David Hildenbrand
2021-08-27  2:31   ` Yu Zhang
2021-08-31 19:08     ` David Hildenbrand
2021-08-31 20:01       ` Andi Kleen
2021-08-31 20:15         ` David Hildenbrand
2021-08-31 20:39           ` Andi Kleen
2021-09-01  3:34             ` Yu Zhang
2021-09-01  4:53     ` Andy Lutomirski
2021-09-01  7:12       ` Tian, Kevin
2021-09-01 10:24       ` Yu Zhang
2021-09-01 16:07         ` Andy Lutomirski
2021-09-01 16:27           ` David Hildenbrand
2021-09-02  8:34             ` Yu Zhang
2021-09-02  8:44               ` David Hildenbrand
2021-09-02 11:02                 ` Yu Zhang
2021-09-02  8:19           ` Yu Zhang
2021-09-02 18:41             ` Andy Lutomirski
2021-09-07  1:33             ` Yan Zhao
2021-09-02  9:27           ` Joerg Roedel
2021-09-02 18:41             ` Andy Lutomirski
2021-09-02 18:57               ` Sean Christopherson
2021-09-02 19:07                 ` Dave Hansen
2021-09-02 20:42                   ` Andy Lutomirski
2021-08-27 22:18   ` Sean Christopherson
2021-08-31 19:07     ` David Hildenbrand
2021-08-31 21:54       ` Sean Christopherson
2021-09-01  8:09         ` David Hildenbrand
2021-09-01 15:54           ` Andy Lutomirski
2021-09-01 16:16             ` David Hildenbrand
2021-09-01 17:09               ` Andy Lutomirski
2021-09-01 16:18             ` James Bottomley
2021-09-01 16:22               ` David Hildenbrand
2021-09-01 16:31                 ` James Bottomley
2021-09-01 16:37                   ` David Hildenbrand
2021-09-01 16:45                     ` James Bottomley
2021-09-01 17:08                       ` David Hildenbrand
2021-09-01 17:50                         ` Sean Christopherson
2021-09-01 17:53                           ` David Hildenbrand
2021-09-01 17:08               ` Andy Lutomirski
2021-09-01 17:13                 ` James Bottomley
2021-09-02 10:18                 ` Joerg Roedel
2021-09-01 18:24               ` Andy Lutomirski
2021-09-01 19:26               ` Dave Hansen
2021-09-07 15:00               ` Tom Lendacky
2021-09-01  4:58       ` Andy Lutomirski
2021-09-01  7:49         ` David Hildenbrand
2021-09-02 18:47 ` Kirill A. Shutemov
2021-09-02 20:33   ` Sean Christopherson
2021-09-03 19:14     ` Kirill A. Shutemov
2021-09-03 19:15       ` Andy Lutomirski
2021-09-10 17:18         ` Kirill A. Shutemov
2021-09-15 19:58           ` Chao Peng
2021-09-15 13:51             ` David Hildenbrand
2021-09-15 14:29               ` Kirill A. Shutemov
2021-09-15 14:59                 ` David Hildenbrand
2021-09-15 15:35                   ` David Hildenbrand
2021-09-15 20:04                   ` Kirill A. Shutemov
2021-09-15 14:11             ` Kirill A. Shutemov
2021-09-16  7:36               ` Chao Peng
2021-09-16  9:24               ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).