linux-coco.lists.linux.dev archive mirror
* [RFC PATCH 0/6] KVM: guest memory: Misc enhancement
@ 2023-06-15 20:12 isaku.yamahata
  2023-06-15 20:12 ` [RFC PATCH 1/6] KVM: selftests: Fix test_add_overlapping_private_memory_regions() isaku.yamahata
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: isaku.yamahata @ 2023-06-15 20:12 UTC
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Vishal Annapurve, Michael Roth

From: Isaku Yamahata <isaku.yamahata@intel.com>

Hello. This is an RFC patch series based on KVM gmem [1] and the SEV-SNP
hypervisor patches [2], aiming at infrastructure common to TDX and SEV-SNP.
I'd like to discuss three items.

* Flags for kvm_gfn_range:
Mike Roth: by adding flags to kvm_gfn_range, kvm_arch_gmem_invalidate()
may no longer be needed.  Maybe the x86 gmem_invalidate callback is still
needed, though.

* Page fault error code, i.e. how to determine that a page fault is private
There is an issue with how to set kvm_page_fault.is_private.  I came up
with two new error code bits.  Is this a reasonable way, or is there a
better one?
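
For concreteness, the encoding proposed in patch 4 boils down to two new
bits in the 64-bit error code:

  /* The page fault is to private (encrypted) memory. */
  #define PFERR_GUEST_ENC_BIT	34
  /* The error code actually carries the private/shared attribute. */
  #define PFERR_HASATTR_BIT	63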

* VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
  - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
  - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
    the future)
  - any other way?

[1] KVM gmem patches
https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo

[2] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support
https://lore.kernel.org/lkml/20230612042559.375660-1-michael.roth@amd.com/

Isaku Yamahata (6):
  KVM: selftests: Fix test_add_overlapping_private_memory_regions()
  KVM: selftests: Fix guest_memfd()
  KVM: x86/mmu: Pass round full 64-bit error code for the KVM page fault
  KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private
  KVM: Add flags to struct kvm_gfn_range
  KVM: x86: Add is_vm_type_supported callback

 arch/x86/include/asm/kvm-x86-ops.h               |  1 +
 arch/x86/include/asm/kvm_host.h                  |  5 +++++
 arch/x86/include/uapi/asm/kvm.h                  |  1 +
 arch/x86/kvm/mmu/mmu.c                           | 14 +++++++++-----
 arch/x86/kvm/mmu/mmu_internal.h                  |  8 ++++----
 arch/x86/kvm/mmu/mmutrace.h                      |  2 +-
 arch/x86/kvm/mmu/paging_tmpl.h                   |  4 ++--
 arch/x86/kvm/svm/svm.c                           |  7 +++++++
 arch/x86/kvm/vmx/vmx.c                           |  6 ++++++
 arch/x86/kvm/x86.c                               | 10 +++++++++-
 arch/x86/kvm/x86.h                               |  2 ++
 include/linux/kvm_host.h                         | 11 ++++++++++-
 tools/testing/selftests/kvm/guest_memfd_test.c   |  4 ++--
 .../selftests/kvm/set_memory_region_test.c       | 16 ++++++++++++++--
 virt/kvm/guest_mem.c                             | 10 +++++++---
 virt/kvm/kvm_main.c                              |  4 +++-
 16 files changed, 83 insertions(+), 22 deletions(-)


base-commit: be8abcec83c87d4e15ae04816b685fe260c4bcfd
-- 
2.25.1



* [RFC PATCH 1/6] KVM: selftests: Fix test_add_overlapping_private_memory_regions()
  2023-06-15 20:12 [RFC PATCH 0/6] KVM: guest memory: Misc enhancement isaku.yamahata
@ 2023-06-15 20:12 ` isaku.yamahata
  2023-06-15 20:12 ` [RFC PATCH 2/6] KVM: selftests: Fix guest_memfd() isaku.yamahata
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: isaku.yamahata @ 2023-06-15 20:12 UTC
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Vishal Annapurve, Michael Roth

From: Isaku Yamahata <isaku.yamahata@intel.com>

The last test in test_add_overlapping_private_memory_regions() doesn't
actually use overlapping regions, so it fails.  When a region overlaps
with existing ones, the error code is EEXIST, not EINVAL.  Pass truly
overlapping regions and check that errno is EEXIST.

Fixes: bdb645960cb5 ("KVM: selftests: Expand set_memory_region_test to validate guest_memfd()")
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 .../selftests/kvm/set_memory_region_test.c       | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index f46841843300..ea7da324c4d6 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -432,6 +432,7 @@ static void test_add_overlapping_private_memory_regions(void)
 {
 	struct kvm_vm *vm;
 	int memfd;
+	int r;
 
 	pr_info("Testing ADD of overlapping KVM_MEM_PRIVATE memory regions\n");
 
@@ -453,8 +454,19 @@ static void test_add_overlapping_private_memory_regions(void)
 	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
 				   MEM_REGION_GPA, 0, NULL, -1, 0);
 
-	test_invalid_guest_memfd(vm, memfd, MEM_REGION_SIZE,
-				 "Overlapping guest_memfd() bindings should fail");
+	r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+					 MEM_REGION_GPA * 2 - MEM_REGION_SIZE,
+					 MEM_REGION_SIZE * 2,
+					 0, memfd, 0);
+	TEST_ASSERT(r == -1 && errno == EEXIST, "%s",
+		    "Overlapping guest_memfd() bindings should fail");
+
+	r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+					 MEM_REGION_GPA * 2 + MEM_REGION_SIZE,
+					 MEM_REGION_SIZE * 2,
+					 0, memfd, 0);
+	TEST_ASSERT(r == -1 && errno == EEXIST, "%s",
+		    "Overlapping guest_memfd() bindings should fail");
 
 	close(memfd);
 	kvm_vm_free(vm);
-- 
2.25.1



* [RFC PATCH 2/6] KVM: selftests: Fix guest_memfd()
  2023-06-15 20:12 [RFC PATCH 0/6] KVM: guest memory: Misc enhancement isaku.yamahata
  2023-06-15 20:12 ` [RFC PATCH 1/6] KVM: selftests: Fix test_add_overlapping_private_memory_regions() isaku.yamahata
@ 2023-06-15 20:12 ` isaku.yamahata
  2023-06-15 20:12 ` [RFC PATCH 3/6] KVM: x86/mmu: Pass round full 64-bit error code for the KVM page fault isaku.yamahata
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: isaku.yamahata @ 2023-06-15 20:12 UTC
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Vishal Annapurve, Michael Roth

From: Isaku Yamahata <isaku.yamahata@intel.com>

Some test cases are expected to succeed, i.e. to return 0.  Check !ret
instead of ret so the assertions pass on success.

Fixes: 36eedd5b91e3 ("KVM: selftests: Add basic selftest for guest_memfd()")
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 tools/testing/selftests/kvm/guest_memfd_test.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 3b6532b833b2..f3b99c1e5464 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -72,11 +72,11 @@ static void test_fallocate(int fd, size_t page_size, size_t total_size)
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
 			total_size, page_size);
-	TEST_ASSERT(ret, "fallocate(PUNCH_HOLE) at total_size should be fine (no-op)");
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should be fine (no-op)");
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
 			total_size + page_size, page_size);
-	TEST_ASSERT(ret, "fallocate(PUNCH_HOLE) after total_size should be fine (no-op)");
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should be fine (no-op)");
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
 			page_size, page_size - 1);
-- 
2.25.1



* [RFC PATCH 3/6] KVM: x86/mmu: Pass round full 64-bit error code for the KVM page fault
  2023-06-15 20:12 [RFC PATCH 0/6] KVM: guest memory: Misc enhancement isaku.yamahata
  2023-06-15 20:12 ` [RFC PATCH 1/6] KVM: selftests: Fix test_add_overlapping_private_memory_regions() isaku.yamahata
  2023-06-15 20:12 ` [RFC PATCH 2/6] KVM: selftests: Fix guest_memfd() isaku.yamahata
@ 2023-06-15 20:12 ` isaku.yamahata
  2023-06-15 20:12 ` [RFC PATCH 4/6] KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private isaku.yamahata
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: isaku.yamahata @ 2023-06-15 20:12 UTC
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Vishal Annapurve, Michael Roth

From: Isaku Yamahata <isaku.yamahata@intel.com>

Protected VMs, TDX and SEV-SNP will need the full 64-bit error code, i.e.
more information about the fault, for the KVM page fault, so update
kvm_mmu_do_page_fault() to accept the 64-bit value and pass it on to the
callbacks.

Currently the upper 32 bits of the error code are discarded at
kvm_mmu_page_fault() by lower_32_bits(); now the full 64 bits are passed
down.  Because only FNAME(page_fault) depends on the truncation, move
lower_32_bits() into FNAME(page_fault).

The accesses of fault->error_code are adjusted as follows:
- FNAME(page_fault): change to explicitly use lower_32_bits()
- kvm_mmu_page_fault(): explicit mask with PFERR_RSVD_MASK,
                        PFERR_NESTED_GUEST_PAGE
- mmutrace: changed u32 -> u64
- pgprintk(): change %x -> %llx

No functional change is intended.  This is a preparation to pass on more
info with page fault error code.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/mmu.c          | 5 ++---
 arch/x86/kvm/mmu/mmu_internal.h | 4 ++--
 arch/x86/kvm/mmu/mmutrace.h     | 2 +-
 arch/x86/kvm/mmu/paging_tmpl.h  | 4 ++--
 4 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index dc2b9a2f717c..b8ba7f11c3cb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4510,7 +4510,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu,
 				struct kvm_page_fault *fault)
 {
-	pgprintk("%s: gva %lx error %x\n", __func__, fault->addr, fault->error_code);
+	pgprintk("%s: gva %llx error %llx\n", __func__, fault->addr, fault->error_code);
 
 	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
 	fault->max_level = PG_LEVEL_2M;
@@ -5820,8 +5820,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 	}
 
 	if (r == RET_PF_INVALID) {
-		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
-					  lower_32_bits(error_code), false,
+		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
 					  &emulation_type);
 		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
 			return -EIO;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index f1786698ae00..7f9ec1e5b136 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -191,7 +191,7 @@ static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
 struct kvm_page_fault {
 	/* arguments to kvm_mmu_do_page_fault.  */
 	const gpa_t addr;
-	const u32 error_code;
+	const u64 error_code;
 	const bool prefetch;
 
 	/* Derived from error_code.  */
@@ -283,7 +283,7 @@ enum {
 };
 
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-					u32 err, bool prefetch, int *emulation_type)
+					u64 err, bool prefetch, int *emulation_type)
 {
 	struct kvm_page_fault fault = {
 		.addr = cr2_or_gpa,
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index 2d7555381955..2e77883c92f6 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -261,7 +261,7 @@ TRACE_EVENT(
 	TP_STRUCT__entry(
 		__field(int, vcpu_id)
 		__field(gpa_t, cr2_or_gpa)
-		__field(u32, error_code)
+		__field(u64, error_code)
 		__field(u64 *, sptep)
 		__field(u64, old_spte)
 		__field(u64, new_spte)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0662e0278e70..ee4b881c5b39 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -758,7 +758,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	struct guest_walker walker;
 	int r;
 
-	pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code);
+	pgprintk("%s: addr %llx err %llx\n", __func__, fault->addr, fault->error_code);
 	WARN_ON_ONCE(fault->is_tdp);
 
 	/*
@@ -767,7 +767,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 * The bit needs to be cleared before walking guest page tables.
 	 */
 	r = FNAME(walk_addr)(&walker, vcpu, fault->addr,
-			     fault->error_code & ~PFERR_RSVD_MASK);
+			     lower_32_bits(fault->error_code) & ~PFERR_RSVD_MASK);
 
 	/*
 	 * The page is not mapped by the guest.  Let the guest handle it.
-- 
2.25.1



* [RFC PATCH 4/6] KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private
  2023-06-15 20:12 [RFC PATCH 0/6] KVM: guest memory: Misc enhancement isaku.yamahata
                   ` (2 preceding siblings ...)
  2023-06-15 20:12 ` [RFC PATCH 3/6] KVM: x86/mmu: Pass round full 64-bit error code for the KVM page fault isaku.yamahata
@ 2023-06-15 20:12 ` isaku.yamahata
  2023-06-15 20:12 ` [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range isaku.yamahata
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: isaku.yamahata @ 2023-06-15 20:12 UTC
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Vishal Annapurve, Michael Roth

From: Isaku Yamahata <isaku.yamahata@intel.com>

It is unfortunate and inflexible for kvm_mmu_do_page_fault() to call
kvm_mem_is_private(), which ultimately looks up the memory attributes,
only for __kvm_faultin_pfn() to look up the memory attributes again later.
Because the lookups are done without holding the mmu lock, other threads
can change the memory attributes in between: a race condition.  SEV-SNP
and TDX each define their own way to indicate that a page fault is
private.

Add two PFERR bits: one to designate that the page fault is private, and
one to indicate that the error code actually carries that attribute, so
that KVM knows whether to honor it or to fall back to looking up the
memory attributes.
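
A minimal sketch of the intended usage (illustrative only; how a backend
learns the fault type, and the backend_* names, are assumptions rather
than part of this patch):

	u64 err = hw_error_code;

	if (backend_knows_fault_type) {	/* e.g. TDX or SEV-SNP */
		/* Mark the private/shared information as authoritative. */
		err |= PFERR_HASATTR_MASK;
		if (backend_fault_is_private(vcpu))
			err |= PFERR_GUEST_ENC_MASK;
	}
	/*
	 * With PFERR_HASATTR_MASK set, __kvm_faultin_pfn() exits to
	 * userspace on a mismatch with the memory attributes; without it,
	 * fault->is_private is derived from the memory attributes.
	 */
	kvm_mmu_page_fault(vcpu, cr2_or_gpa, err, insn, insn_len);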

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 4 ++++
 arch/x86/kvm/mmu/mmu.c          | 9 +++++++--
 arch/x86/kvm/mmu/mmu_internal.h | 4 ++--
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8ae131dc645d..2763f9837a0b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -255,7 +255,9 @@ enum x86_intercept_stage;
 #define PFERR_SGX_BIT 15
 #define PFERR_GUEST_FINAL_BIT 32
 #define PFERR_GUEST_PAGE_BIT 33
+#define PFERR_GUEST_ENC_BIT 34
 #define PFERR_IMPLICIT_ACCESS_BIT 48
+#define PFERR_HASATTR_BIT 63
 
 #define PFERR_PRESENT_MASK	BIT(PFERR_PRESENT_BIT)
 #define PFERR_WRITE_MASK	BIT(PFERR_WRITE_BIT)
@@ -266,7 +268,9 @@ enum x86_intercept_stage;
 #define PFERR_SGX_MASK		BIT(PFERR_SGX_BIT)
 #define PFERR_GUEST_FINAL_MASK	BIT_ULL(PFERR_GUEST_FINAL_BIT)
 #define PFERR_GUEST_PAGE_MASK	BIT_ULL(PFERR_GUEST_PAGE_BIT)
+#define PFERR_GUEST_ENC_MASK	BIT_ULL(PFERR_GUEST_ENC_BIT)
 #define PFERR_IMPLICIT_ACCESS	BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)
+#define PFERR_HASATTR_MASK	BIT_ULL(PFERR_HASATTR_BIT)
 
 #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK |	\
 				 PFERR_WRITE_MASK |		\
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b8ba7f11c3cb..e9c9780bab89 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4358,6 +4358,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
+	bool is_private;
 
 	/*
 	 * Retry the page fault if the gfn hit a memslot that is being deleted
@@ -4386,8 +4387,12 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			return RET_PF_EMULATE;
 	}
 
-	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
-		return kvm_do_memory_fault_exit(vcpu, fault);
+	is_private = kvm_mem_is_private(vcpu->kvm, fault->gfn);
+	if (fault->error_code & PFERR_HASATTR_MASK) {
+		if (fault->is_private != is_private)
+			return kvm_do_memory_fault_exit(vcpu, fault);
+	} else
+		fault->is_private = is_private;
 
 	if (fault->is_private)
 		return kvm_faultin_pfn_private(vcpu, fault);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 7f9ec1e5b136..22f2cd60cabf 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -203,7 +203,7 @@ struct kvm_page_fault {
 
 	/* Derived from mmu and global state.  */
 	const bool is_tdp;
-	const bool is_private;
+	bool is_private;
 	const bool nx_huge_page_workaround_enabled;
 
 	/*
@@ -301,7 +301,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
 		.req_level = PG_LEVEL_4K,
 		.goal_level = PG_LEVEL_4K,
-		.is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+		.is_private = err & PFERR_GUEST_ENC_MASK,
 	};
 	int r;
 
-- 
2.25.1



* [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range
  2023-06-15 20:12 [RFC PATCH 0/6] KVM: guest memory: Misc enhancement isaku.yamahata
                   ` (3 preceding siblings ...)
  2023-06-15 20:12 ` [RFC PATCH 4/6] KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private isaku.yamahata
@ 2023-06-15 20:12 ` isaku.yamahata
  2023-06-20 16:28   ` Michael Roth
  2023-06-15 20:12 ` [RFC PATCH 6/6] KVM: x86: Add is_vm_type_supported callback isaku.yamahata
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: isaku.yamahata @ 2023-06-15 20:12 UTC
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Vishal Annapurve, Michael Roth

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX and SEV-SNP need to know why kvm_unmap_gfn_range() was called: from an
mmu notifier, from the set memory attributes ioctl, or from a KVM gmem
callback.  The callback handler changes its behavior or does additional
housekeeping depending on the reason.  For an mmu notifier, it zaps shared
PTEs.  For set memory attributes, it handles the conversion of memory
attributes (private <=> shared).  For KVM gmem, it handles punching a hole
in the range or releasing the file.
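
As a sketch, a backend callback could then dispatch on the flags roughly
as follows (the tdx_zap_*() helpers are hypothetical):

	static bool tdx_unmap_gfn_range(struct kvm *kvm,
					struct kvm_gfn_range *range)
	{
		/*
		 * Converting private <=> shared; range->attrs says which
		 * direction, zap only the old alias.
		 */
		if (range->flags & KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR)
			return tdx_zap_for_conversion(kvm, range);

		/* gmem is discarding the pages; tear down private mappings. */
		if (range->flags & (KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE |
				    KVM_GFN_RANGE_FLAGS_GMEM_RELEASE))
			return tdx_zap_private(kvm, range);

		/* flags == 0: mmu notifier, zap only shared PTEs. */
		return tdx_zap_shared(kvm, range);
	}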

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 include/linux/kvm_host.h | 11 ++++++++++-
 virt/kvm/guest_mem.c     | 10 +++++++---
 virt/kvm/kvm_main.c      |  4 +++-
 3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1a47cedae8a1..c049c0aa44d6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
+
+#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR		BIT(0)
+#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE		BIT(1)
+#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE		BIT(2)
+
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
 	gfn_t end;
-	pte_t pte;
+	union {
+		pte_t pte;
+		u64 attrs;
+	};
 	bool may_block;
+	unsigned int flags;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index cdf2d84683c8..30b8f66784d4 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -99,7 +99,8 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
 }
 
 static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem,
-				      pgoff_t start, pgoff_t end)
+				      pgoff_t start, pgoff_t end,
+				      unsigned int flags)
 {
 	struct kvm_memory_slot *slot;
 	unsigned long index;
@@ -118,6 +119,7 @@ static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem,
 			.slot = slot,
 			.pte = __pte(0),
 			.may_block = true,
+			.flags = flags,
 		};
 
 		kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);
@@ -156,7 +158,8 @@ static long kvm_gmem_punch_hole(struct file *file, loff_t offset, loff_t len)
 	 */
 	filemap_invalidate_lock(file->f_mapping);
 
-	kvm_gmem_invalidate_begin(kvm, gmem, start, end);
+	kvm_gmem_invalidate_begin(kvm, gmem, start, end,
+				  KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE);
 
 	truncate_inode_pages_range(file->f_mapping, offset, offset + len - 1);
 
@@ -263,7 +266,8 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
 	 * Free the backing memory, and more importantly, zap all SPTEs that
 	 * pointed at this file.
 	 */
-	kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul);
+	kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul,
+				  KVM_GFN_RANGE_FLAGS_GMEM_RELEASE);
 	truncate_inode_pages_final(file->f_mapping);
 	kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 422d49634c56..9cdfa2fb675f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -613,6 +613,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
+			gfn_range.flags = 0;
 
 			if (!locked) {
 				locked = true;
@@ -2391,8 +2392,9 @@ static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
 	bool flush = false;
 	int i;
 
-	gfn_range.pte = __pte(0);
+	gfn_range.attrs = attrs;
 	gfn_range.may_block = true;
+	gfn_range.flags = KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR;
 
 	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		slots = __kvm_memslots(kvm, i);
-- 
2.25.1



* [RFC PATCH 6/6] KVM: x86: Add is_vm_type_supported callback
  2023-06-15 20:12 [RFC PATCH 0/6] KVM: guest memory: Misc enhancement isaku.yamahata
                   ` (4 preceding siblings ...)
  2023-06-15 20:12 ` [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range isaku.yamahata
@ 2023-06-15 20:12 ` isaku.yamahata
  2023-06-19 19:11 ` [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement Vishal Annapurve
  2023-06-24  0:09 ` Sean Christopherson
  7 siblings, 0 replies; 18+ messages in thread
From: isaku.yamahata @ 2023-06-15 20:12 UTC
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Vishal Annapurve, Michael Roth

From: Isaku Yamahata <isaku.yamahata@intel.com>

For TDX, allow the backend to override the set of supported VM types.  Add
KVM_X86_TDX_VM to reserve the bit.
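
As a sketch, the TDX backend would then override the new hook roughly as
follows (enable_tdx and the vt_ prefix are assumptions, borrowed from the
out-of-tree TDX series):

	static bool vt_is_vm_type_supported(unsigned long type)
	{
		return __kvm_is_vm_type_supported(type) ||
		       (enable_tdx && type == KVM_X86_TDX_VM);
	}

Userspace can then discover the supported types via the capability bitmask
built in kvm_vm_ioctl_check_extension() below and pass the chosen type to
KVM_CREATE_VM.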

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/include/uapi/asm/kvm.h    |  1 +
 arch/x86/kvm/svm/svm.c             |  7 +++++++
 arch/x86/kvm/vmx/vmx.c             |  6 ++++++
 arch/x86/kvm/x86.c                 | 10 +++++++++-
 arch/x86/kvm/x86.h                 |  2 ++
 7 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 13bc212cd4bc..c0143906fe6d 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,6 +20,7 @@ KVM_X86_OP(hardware_disable)
 KVM_X86_OP(hardware_unsetup)
 KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
+KVM_X86_OP(is_vm_type_supported)
 KVM_X86_OP(vm_init)
 KVM_X86_OP_OPTIONAL(vm_destroy)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2763f9837a0b..ce83e24a538d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1547,6 +1547,7 @@ struct kvm_x86_ops {
 	bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
 	void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
 
+	bool (*is_vm_type_supported)(unsigned long vm_type);
 	unsigned int vm_size;
 	int (*vm_init)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 6afbfbb32d56..53d382b3b423 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -561,5 +561,6 @@ struct kvm_pmu_event_filter {
 
 #define KVM_X86_DEFAULT_VM	0
 #define KVM_X86_PROTECTED_VM	1
+#define KVM_X86_TDX_VM		2
 
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index eb308c9994f9..e9ed8729f63b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4756,6 +4756,12 @@ static void svm_vm_destroy(struct kvm *kvm)
 	sev_vm_destroy(kvm);
 }
 
+static bool svm_is_vm_type_supported(unsigned long type)
+{
+	/* FIXME: Check if CPU is capable of SEV. */
+	return __kvm_is_vm_type_supported(type);
+}
+
 static int svm_vm_init(struct kvm *kvm)
 {
 	if (!pause_filter_count || !pause_filter_thresh)
@@ -4784,6 +4790,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_free = svm_vcpu_free,
 	.vcpu_reset = svm_vcpu_reset,
 
+	.is_vm_type_supported = svm_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_svm),
 	.vm_init = svm_vm_init,
 	.vm_destroy = svm_vm_destroy,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 44fb619803b8..b5394ba8cb9c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7469,6 +7469,11 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 	return err;
 }
 
+static bool vmx_is_vm_type_supported(unsigned long type)
+{
+	return __kvm_is_vm_type_supported(type);
+}
+
 #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 
@@ -8138,6 +8143,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.hardware_disable = vmx_hardware_disable,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
+	.is_vm_type_supported = vmx_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vmx_vm_init,
 	.vm_destroy = vmx_vm_destroy,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c9e1c9369be2..b5f865f39a00 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4418,12 +4418,18 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
-static bool kvm_is_vm_type_supported(unsigned long type)
+bool __kvm_is_vm_type_supported(unsigned long type)
 {
 	return type == KVM_X86_DEFAULT_VM ||
 	       (type == KVM_X86_PROTECTED_VM &&
 	        IS_ENABLED(CONFIG_KVM_PROTECTED_VM) && tdp_enabled);
 }
+EXPORT_SYMBOL_GPL(__kvm_is_vm_type_supported);
+
+static bool kvm_is_vm_type_supported(unsigned long type)
+{
+	return static_call(kvm_x86_is_vm_type_supported)(type);
+}
 
 int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 {
@@ -4618,6 +4624,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = BIT(KVM_X86_DEFAULT_VM);
 		if (kvm_is_vm_type_supported(KVM_X86_PROTECTED_VM))
 			r |= BIT(KVM_X86_PROTECTED_VM);
+		if (kvm_is_vm_type_supported(KVM_X86_TDX_VM))
+			r |= BIT(KVM_X86_TDX_VM);
 		break;
 	default:
 		break;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c544602d07a3..7d5aa8f0571a 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -9,6 +9,8 @@
 #include "kvm_cache_regs.h"
 #include "kvm_emulate.h"
 
+bool __kvm_is_vm_type_supported(unsigned long type);
+
 struct kvm_caps {
 	/* control of guest tsc rate supported? */
 	bool has_tsc_control;
-- 
2.25.1



* Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement
  2023-06-15 20:12 [RFC PATCH 0/6] KVM: guest memory: Misc enhancement isaku.yamahata
                   ` (5 preceding siblings ...)
  2023-06-15 20:12 ` [RFC PATCH 6/6] KVM: x86: Add is_vm_type_supported callback isaku.yamahata
@ 2023-06-19 19:11 ` Vishal Annapurve
  2023-06-19 20:11   ` Zhi Wang
  2023-06-24  0:09 ` Sean Christopherson
  7 siblings, 1 reply; 18+ messages in thread
From: Vishal Annapurve @ 2023-06-19 19:11 UTC
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Michael Roth

On Thu, Jun 15, 2023 at 1:12 PM <isaku.yamahata@intel.com> wrote:
> ...
>
> * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
>   - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
>   - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
>     the future)
>   - any other way?

There are selftests posted [1] in the context of this work, which rely on
KVM_X86_PROTECTED_VM being just software-only pseudo-confidential
VMs. In the future there might be more work to expand this use case to
full-scale VMs. So it would be better to treat protected VMs as a
separate type which can be used on any platform without needing to
enable TDX/SEV functionality.

The TDX VM type can then serve as a specialized type of protected VM
with additional arch-specific capabilities enabled.

[1] - https://github.com/sean-jc/linux/commits/x86/kvm_gmem_solo


* Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement
  2023-06-19 19:11 ` [RFC PATCH 0/6] KVM: guest memory: Misc enhancement Vishal Annapurve
@ 2023-06-19 20:11   ` Zhi Wang
  2023-06-19 21:55     ` Vishal Annapurve
  0 siblings, 1 reply; 18+ messages in thread
From: Zhi Wang @ 2023-06-19 20:11 UTC
  To: Vishal Annapurve
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack,
	Kai Huang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Michael Roth

On Mon, 19 Jun 2023 12:11:50 -0700
Vishal Annapurve <vannapurve@google.com> wrote:

> On Thu, Jun 15, 2023 at 1:12 PM <isaku.yamahata@intel.com> wrote:
> > ...
> >
> > * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
> >   - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
> >   - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
> >     the future)
> >   - any other way?  
> 
> There are selftests posted[1] in context of this work, which rely on
> KVM_X86_PROTECTED_VM being just the software-only pseudo-confidential
> VMs. In future there might be more work to expand this usecase to
> full-scale VMs. So it would be better to treat protected VMs as a
> separate type which can be used on any platform without the need of
> enabling TDX/SEV functionality.
> 

Out of curiosity, is this really a valid case in practice beyond selftests?
It sounds to me like whenever KVM_X86_PROTECTED_VM is used, it has to be
tied to a platform-specific CC type.

> TDX VM type can possibly serve as a specialized type of protected VM
> with additional arch specific capabilities enabled.
> 
> [1] - https://github.com/sean-jc/linux/commits/x86/kvm_gmem_solo



* Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement
  2023-06-19 20:11   ` Zhi Wang
@ 2023-06-19 21:55     ` Vishal Annapurve
  2023-06-21  8:16       ` Zhi Wang
  2023-06-21 18:19       ` Dong, Eddie
  0 siblings, 2 replies; 18+ messages in thread
From: Vishal Annapurve @ 2023-06-19 21:55 UTC
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack,
	Kai Huang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Michael Roth

On Mon, Jun 19, 2023 at 1:11 PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
>
> On Mon, 19 Jun 2023 12:11:50 -0700
> Vishal Annapurve <vannapurve@google.com> wrote:
>
> > On Thu, Jun 15, 2023 at 1:12 PM <isaku.yamahata@intel.com> wrote:
> > > ...
> > >
> > > * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
> > >   - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
> > >   - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
> > >     the future)
> > >   - any other way?
> >
> > There are selftests posted[1] in context of this work, which rely on
> > KVM_X86_PROTECTED_VM being just the software-only pseudo-confidential
> > VMs. In future there might be more work to expand this usecase to
> > full-scale VMs. So it would be better to treat protected VMs as a
> > separate type which can be used on any platform without the need of
> > enabling TDX/SEV functionality.
> >
>
> Out of curiosity, is this really a valid case in practice except selftest?
> It sounds to me whenever KVM_X86_PROTECTED_VM is used, it has to be tied
> with a platform-specific CC type.

The protected VM effort is about having guest memory ranges that are not
mapped into the userspace VMM and so are unreachable, for most cases,
from KVM as well. Non-CC VMs, possibly running enlightened kernels, can
use this support to mitigate unintended accesses from the userspace
VMM/KVM.

The exact implementation of such support warrants more discussion, but it
should be in the line of sight here as a future work item.




>
> > TDX VM type can possibly serve as a specialized type of protected VM
> > with additional arch specific capabilities enabled.
> >
> > [1] - https://github.com/sean-jc/linux/commits/x86/kvm_gmem_solo
>


* Re: [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range
  2023-06-15 20:12 ` [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range isaku.yamahata
@ 2023-06-20 16:28   ` Michael Roth
  2023-06-20 19:04     ` Isaku Yamahata
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Roth @ 2023-06-20 16:28 UTC
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang,
	Zhi Wang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Vishal Annapurve

On Thu, Jun 15, 2023 at 01:12:18PM -0700, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX and SEV-SNP need to know the reason for a callback by
> kvm_unmap_gfn_range().  mmu notifier, set memory attributes ioctl or KVM
> gmem callback.  The callback handler changes the behavior or does the
> additional housekeeping operation.  For mmu notifier, it's zapping shared
> PTE.  For set memory attributes, it's the conversion of memory attributes
> (private <=> shared).  For KVM gmem, it's punching a hole in the range, and
> releasing the file.

I think it's still an open topic that we need to hear more from Sean about:

  https://lore.kernel.org/lkml/20230522235838.ov3722lcusotzlvo@amd.com/

but I *think* we were leaning toward decoupling the act of invalidating
GFNs, vs. the act of invalidating/free'ing gmem pages.

One concrete example of where this separation makes sense is with
hole-punching. SNP has unique platform-specific work it has to do before
free'ing that gmem page back to the host. If we try to plumb this through
kvm_unmap_gfn_range() via a special flag then it's a little awkward
because:

a) Presumably that hole-punch would have occurred after a preceding
   KVM_SET_MEMORY_ATTRIBUTES was issued to switch the page to shared
   state in the xarray. So all it should really need to do is handle
   that platform-specific behavior, like updating the RMP table in case
   of SNP. But none of the other details like GFN ranges are relevant in
   that case; RMP updates here only need the PFN, so we end up walking
   memslots to do GFN->PFN translations, when it would actually be much
   more efficient to do these translations by translating the
   hole-punched FD offset range to the corresponding folios backing
   those ranges.

b) It places an unnecessary dependency on having an active memslot to do
   those translations. This ends up not being an issue with the current
   version of the gmem patchset because the release() happens *before*
   gmem_unbind(), so there is a memslot associated with the ranges at
   gmem_release() time, but in the initial version of gmem it was the
   reverse, so if things ever changed again in this regard we'd once
   again have to completely rework how to issue these platform-specific
   invalidation callbacks.

I really *really* like having a separate, simple invalidation mechanism
in place that just converts FD offsets to PFNs and then passes those on
to platform-defined handlers to clean up pages before free'ing them back
to the system. It's versatile in that it can be called pretty much
anywhere regardless of where we are in the KVM lifecycle, it's robust in
that it doesn't rely on unnecessary outside dependencies, and it avoids
adding unneeded complexity to paths like kvm_unmap_gfn_range().

That's the approach taken with the SNP hypervisor v9 series, with the
gmem hook being introduced here:

  https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@amd.com/T/#m3ad8245235a27ed0f41c359c191dcda6c77af043

and the SEV-SNP implementation of that hook being here:

  https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@amd.com/T/#m6ac04b44722dbc07839011816e94fadf5ad6794e

Would a similar approach work for TDX? At least WRT cleaning up pages
before returning them to the host? If we could isolate that
requirement/handling from all the other aspects of invalidations, it
really seems like it would cause us fewer headaches down the road.
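
For illustration, the rough shape of such a mechanism (helper names only
approximate the linked series; kvm_arch_gmem_invalidate() is the arch hook
mentioned in the cover letter):

	/* Invalidate by FD offset -> folio -> PFN, no memslots involved. */
	static void kvm_gmem_invalidate_folios(struct address_space *mapping,
					       pgoff_t start, pgoff_t end)
	{
		pgoff_t index = start;

		while (index < end) {
			struct folio *folio = filemap_get_folio(mapping, index);

			if (IS_ERR_OR_NULL(folio)) {
				index++;
				continue;
			}
			/*
			 * Platform-specific cleanup, e.g. RMP updates for
			 * SNP, before the pages go back to the host.
			 */
			kvm_arch_gmem_invalidate(folio_pfn(folio),
						 folio_pfn(folio) +
						 folio_nr_pages(folio));
			index = folio_next_index(folio);
			folio_put(folio);
		}
	}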

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  include/linux/kvm_host.h | 11 ++++++++++-
>  virt/kvm/guest_mem.c     | 10 +++++++---
>  virt/kvm/kvm_main.c      |  4 +++-
>  3 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1a47cedae8a1..c049c0aa44d6 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>  
>  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> +
> +#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR		BIT(0)

Can you go into more detail on why special handling is needed for
SET_MEM_ATTR?

> +#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE		BIT(1)
> +#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE		BIT(2)

Would the need to distinguish between PUNCH_HOLE/RELEASE go away in the
TDX case if you take the above approach? For SNP, the answer is yes. If
that's also the case for TDX I think that would be another argument in
favor of decoupling these from the existing KVM MMU invalidation path.

-Mike

> +
>  struct kvm_gfn_range {
>  	struct kvm_memory_slot *slot;
>  	gfn_t start;
>  	gfn_t end;
> -	pte_t pte;
> +	union {
> +		pte_t pte;
> +		u64 attrs;
> +	};
>  	bool may_block;
> +	unsigned int flags;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> index cdf2d84683c8..30b8f66784d4 100644
> --- a/virt/kvm/guest_mem.c
> +++ b/virt/kvm/guest_mem.c
> @@ -99,7 +99,8 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
>  }
>  
>  static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem,
> -				      pgoff_t start, pgoff_t end)
> +				      pgoff_t start, pgoff_t end,
> +				      unsigned int flags)
>  {
>  	struct kvm_memory_slot *slot;
>  	unsigned long index;
> @@ -118,6 +119,7 @@ static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem,
>  			.slot = slot,
>  			.pte = __pte(0),
>  			.may_block = true,
> +			.flags = flags,
>  		};
>  
>  		kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);
> @@ -156,7 +158,8 @@ static long kvm_gmem_punch_hole(struct file *file, loff_t offset, loff_t len)
>  	 */
>  	filemap_invalidate_lock(file->f_mapping);
>  
> -	kvm_gmem_invalidate_begin(kvm, gmem, start, end);
> +	kvm_gmem_invalidate_begin(kvm, gmem, start, end,
> +				  KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE);
>  
>  	truncate_inode_pages_range(file->f_mapping, offset, offset + len - 1);
>  
> @@ -263,7 +266,8 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
>  	 * Free the backing memory, and more importantly, zap all SPTEs that
>  	 * pointed at this file.
>  	 */
> -	kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul);
> +	kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul,
> +				  KVM_GFN_RANGE_FLAGS_GMEM_RELEASE);
>  	truncate_inode_pages_final(file->f_mapping);
>  	kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul);
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 422d49634c56..9cdfa2fb675f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -613,6 +613,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
>  			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
>  			gfn_range.slot = slot;
> +			gfn_range.flags = 0;
>  
>  			if (!locked) {
>  				locked = true;
> @@ -2391,8 +2392,9 @@ static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
>  	bool flush = false;
>  	int i;
>  
> -	gfn_range.pte = __pte(0);
> +	gfn_range.attrs = attrs;
>  	gfn_range.may_block = true;
> +	gfn_range.flags = KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR;
>  
>  	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
>  		slots = __kvm_memslots(kvm, i);
> -- 
> 2.25.1
> 


* Re: [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range
  2023-06-20 16:28   ` Michael Roth
@ 2023-06-20 19:04     ` Isaku Yamahata
  2023-06-23 19:37       ` Sean Christopherson
  0 siblings, 1 reply; 18+ messages in thread
From: Isaku Yamahata @ 2023-06-20 19:04 UTC
  To: Michael Roth
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack,
	Kai Huang, Zhi Wang, chen.bo, linux-coco, Chao Peng,
	Ackerley Tng, Vishal Annapurve

On Tue, Jun 20, 2023 at 11:28:35AM -0500,
Michael Roth <michael.roth@amd.com> wrote:

> On Thu, Jun 15, 2023 at 01:12:18PM -0700, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > TDX and SEV-SNP need to know the reason for a callback by
> > kvm_unmap_gfn_range().  mmu notifier, set memory attributes ioctl or KVM
> > gmem callback.  The callback handler changes the behavior or does the
> > additional housekeeping operation.  For mmu notifier, it's zapping shared
> > PTE.  For set memory attributes, it's the conversion of memory attributes
> > (private <=> shared).  For KVM gmem, it's punching a hole in the range, and
> > releasing the file.
> 
> I think it's still an open topic that we need to hear more from Sean about:
> 
>   https://lore.kernel.org/lkml/20230522235838.ov3722lcusotzlvo@amd.com/
> 
> but I *think* we were leaning toward decoupling the act of invalidating
> GFNs, vs. the act of invalidating/free'ing gmem pages.
> 
> > One concrete example of where this separation makes sense is with
> hole-punching. SNP has unique platform-specific stuff it has to do before
> free'ing that gmem page back to the host. If we try to plumb this through
> kvm_unmap_gfn_range() via a special flag then it's a little awkward
> because:
> 
> > a) Presumably that hole-punch would have occurred after a preceding
>    KVM_SET_MEMORY_ATTRIBUTES was issued to switch the page to shared
>    state in the xarray. So all it should really need to do is handle
>    that platform-specific behavior, like updating RMP table in case of
>    SNP. But none of the other details like GFN ranges are relevant in
>    that case, RMP updates here only need the PFN, so we end up walking
>    memslots to do GFN->PFN translations, when it would actually be much
>    more efficient to do these translations by translating the
>    hole-punched FD offset range to the corresponding folio()'s backing
>    those ranges
> 
> > b) It places an unnecessary dependency on having an active memslot to do
>    those translations. This ends up not being an issue with current
>    version of gmem patchset because the release() happens *before*
>    gmem_unbind(), so there is a memslot associated with the ranges at
>    gmem_release() time, but in the initial version of gmem it was the
>    reverse, so if things ever changed again in this regard we'd once
>    again have to completely rework how to issue these platform-specific
>    invalidation callbacks.
> 
> I really *really* like having a separate, simple invalidation mechanism
> in place that just converts FD offsets to PFNs and then passes those on
> to platform-defined handlers to clean up pages before free'ing them back
> to the system. It's versatile in that it can be called pretty much
> anywhere regardless of where we are in KVM lifecycle, it's robust in
> > that it doesn't rely on unnecessary outside dependencies, and it avoids
> > adding unneeded complexity to paths like kvm_unmap_gfn_range().
> 
> That's the approach taken with SNP hypervisor v9 series, with the
> gmem hook being introduced here:
> 
>   https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@amd.com/T/#m3ad8245235a27ed0f41c359c191dcda6c77af043
> 
> and the SEV-SNP implementation of that hook being here:
> 
>   https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@amd.com/T/#m6ac04b44722dbc07839011816e94fadf5ad6794e
> 
> Would a similar approach work for TDX? At least WRT cleaning up pages
> before returning them back to the host? If we could isolate that
> requirement/handling from all the other aspects of invalidations it
> really seems like it would cause us less headaches down the road.

In short, TDX KVM doesn't need an extra callback for invalidating/freeing
gmem pages; the kvm_unmap_gfn_range() callback works.  Instead, TDX needs
the attributes (private-or-shared) for it.

The reason is that TDX uses an encrypted page table for the guest (we call
it Secure-EPT) with dedicated operations on it.  The TDX KVM MMU hooks the
TDP MMU operations.  In the case of invalidating/releasing a page, it
eventually hooks handle_removed_pt() for the additional operation.

Because TDX simply won't use the gmem_invalidate callback, I'm fine with it.


> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  include/linux/kvm_host.h | 11 ++++++++++-
> >  virt/kvm/guest_mem.c     | 10 +++++++---
> >  virt/kvm/kvm_main.c      |  4 +++-
> >  3 files changed, 20 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 1a47cedae8a1..c049c0aa44d6 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> >  #endif
> >  
> >  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> > +
> > +#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR		BIT(0)
> 
> Can you go into more detail on why special handling is needed for
> SET_MEM_ATTR?

In TDX, when the VMM zaps a private page from the encrypted page table and
then adds the page back at the same GPA, the page is zeroed and the guest
needs to accept it again.  When converting a page from shared to private,
KVM needs to zap only the shared pages.  So the callback needs to know
whether a zap is for a shared-to-private or a private-to-shared conversion.
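
A sketch of that decision (KVM_MEMORY_ATTRIBUTE_PRIVATE is the attribute
bit from the gmem series; the tdx_zap_*() helpers are hypothetical):

	if (range->flags & KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR) {
		if (range->attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE)
			/*
			 * shared -> private: zap only the shared alias.
			 * Zapping and re-adding a private page would zero
			 * it and force the guest to re-accept it.
			 */
			tdx_zap_shared_only(kvm, range);
		else
			/* private -> shared: zap the Secure-EPT mapping. */
			tdx_zap_private_only(kvm, range);
	}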


> > +#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE		BIT(1)
> > +#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE		BIT(2)
> 
> Would the need to distinguish between PUNCH_HOLE/RELEASE go away in the
> TDX case if you take the above approach? For SNP, the answer is yes. If
> that's also the case for TDX I think that would be another argument in
> favor of decoupling these from existing KVM MMU invalidation path.

TDX doesn't need the gmem_invalidate callback.  TDX doesn't need the
distinction between punch-hole and release.  So, in short, TDX needs
KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR and a single KVM_GFN_RANGE_FLAGS_GMEM.

With KVM_GFN_RANGE_FLAGS_GMEM, the callback can know that it is
invalidating private pages.  Maybe by (ab)using
KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR (attr=shared), KVM_GFN_RANGE_FLAGS_GMEM
could be dropped.

Thanks,
Isaku Yamahata
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>


* Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement
  2023-06-19 21:55     ` Vishal Annapurve
@ 2023-06-21  8:16       ` Zhi Wang
  2023-06-21 18:19       ` Dong, Eddie
  1 sibling, 0 replies; 18+ messages in thread
From: Zhi Wang @ 2023-06-21  8:16 UTC
  To: Vishal Annapurve
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack,
	Kai Huang, chen.bo, linux-coco, Chao Peng, Ackerley Tng,
	Michael Roth

On Mon, 19 Jun 2023 14:55:09 -0700
Vishal Annapurve <vannapurve@google.com> wrote:

> On Mon, Jun 19, 2023 at 1:11 PM Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> >
> > On Mon, 19 Jun 2023 12:11:50 -0700
> > Vishal Annapurve <vannapurve@google.com> wrote:
> >
> > > On Thu, Jun 15, 2023 at 1:12 PM <isaku.yamahata@intel.com> wrote:
> > > > ...
> > > >
> > > > * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
> > > >   - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
> > > >   - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
> > > >     the future)
> > > >   - any other way?
> > >
> > > There are selftests posted[1] in context of this work, which rely on
> > > KVM_X86_PROTECTED_VM being just the software-only pseudo-confidential
> > > VMs. In future there might be more work to expand this usecase to
> > > full-scale VMs. So it would be better to treat protected VMs as a
> > > separate type which can be used on any platform without the need of
> > > enabling TDX/SEV functionality.
> > >
> >
> > Out of curiosity, is this really a valid case in practice except selftest?
> > It sounds to me whenever KVM_X86_PROTECTED_VM is used, it has to be tied
> > with a platform-specific CC type.
> 
> Protected VM effort is about being able to have guest memory ranges
> not mapped into Userspace VMM and so are unreachable for most of the
> cases from KVM as well. Non-CC VMs can use this support to mitigate
> any unintended accesses from userspace VMM/KVM possibly using
> enlightened kernels.
> 
> Exact implementation of such a support warrants more discussion but it
> should be in the line of sight here as a future work item.
> 
>

IIUC, what you are saying is (KVM_X86_PROTECTED_VM == (VMs with UPM or GMEM))
&& (KVM_X86_PROTECTED_VM != KVM_X86_CC_VM) && (KVM_X86_CC_VM requires
KVM_X86_PROTECTED_VM).

If we think of KVM_X86_PROTECTED_VM as a dedicated feature, not tightly
coupled with CC techs, it seems we need another definition, KVM_X86_CC_VM,
to represent CC VMs, plus CC platform types like KVM_X86_CC_TDX_VM to tell
which CC tech sits behind it?

I don't think it is good to mix the usages of KVM_X86_PROTECTED_VM and
KVM_X86_CC_VM together if we are sure KVM_X86_PROTECTED_VM is not going to
represent CC VMs in the code.

> 
> 
> >
> > > TDX VM type can possibly serve as a specialized type of protected VM
> > > with additional arch specific capabilities enabled.
> > >
> > > [1] - https://github.com/sean-jc/linux/commits/x86/kvm_gmem_solo
> >



* RE: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement
  2023-06-19 21:55     ` Vishal Annapurve
  2023-06-21  8:16       ` Zhi Wang
@ 2023-06-21 18:19       ` Dong, Eddie
  2023-06-21 20:46         ` Vishal Annapurve
  1 sibling, 1 reply; 18+ messages in thread
From: Dong, Eddie @ 2023-06-21 18:19 UTC
  To: Annapurve, Vishal, Zhi Wang
  Cc: Yamahata, Isaku, kvm, linux-kernel, isaku.yamahata,
	Paolo Bonzini, Aktas, Erdem, Christopherson,,
	Sean, Shahar, Sagi, David Matlack, Huang, Kai, Chen, Bo2,
	linux-coco, Chao Peng, Ackerley Tng, Michael Roth



> -----Original Message-----
> From: Vishal Annapurve <vannapurve@google.com>
> Sent: Monday, June 19, 2023 2:55 PM
> To: Zhi Wang <zhi.wang.linux@gmail.com>
> Cc: Yamahata, Isaku <isaku.yamahata@intel.com>; kvm@vger.kernel.org;
> linux-kernel@vger.kernel.org; isaku.yamahata@gmail.com; Paolo Bonzini
> <pbonzini@redhat.com>; Aktas, Erdem <erdemaktas@google.com>;
> Christopherson,, Sean <seanjc@google.com>; Shahar, Sagi
> <sagis@google.com>; David Matlack <dmatlack@google.com>; Huang, Kai
> <kai.huang@intel.com>; Chen, Bo2 <chen.bo@intel.com>; linux-
> coco@lists.linux.dev; Chao Peng <chao.p.peng@linux.intel.com>; Ackerley
> Tng <ackerleytng@google.com>; Michael Roth <michael.roth@amd.com>
> Subject: Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement
> 
> On Mon, Jun 19, 2023 at 1:11 PM Zhi Wang <zhi.wang.linux@gmail.com>
> wrote:
> >
> > On Mon, 19 Jun 2023 12:11:50 -0700
> > Vishal Annapurve <vannapurve@google.com> wrote:
> >
> > > > On Thu, Jun 15, 2023 at 1:12 PM <isaku.yamahata@intel.com> wrote:
> > > > ...
> > > >
> > > > * VM type: Now we have KVM_X86_PROTECTED_VM. How do we
> proceed?
> > > >   - Keep KVM_X86_PROTECTED_VM for its use. Introduce
> KVM_X86_TDX_VM
> > > >   - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce
> another type in
> > > >     the future)
> > > >   - any other way?
> > >
> > > There are selftests posted[1] in context of this work, which rely on
> > > KVM_X86_PROTECTED_VM being just the software-only
> > > pseudo-confidential VMs. In future there might be more work to
> > > expand this usecase to full-scale VMs. So it would be better to
> > > treat protected VMs as a separate type which can be used on any
> > > platform without the need of enabling TDX/SEV functionality.
> > >
> >
> > Out of curiosity, is this really a valid case in practice except selftest?
> > It sounds to me whenever KVM_X86_PROTECTED_VM is used, it has to be
> > tied with a platform-specific CC type.
> 
> Protected VM effort is about being able to have guest memory ranges not
> mapped into Userspace VMM and so are unreachable for most of the cases
> from KVM as well. Non-CC VMs can use this support to mitigate any
> unintended accesses from userspace VMM/KVM possibly using enlightened
> kernels.

"PROTECTED" seems to be not very close to what you mean here. "PROTECTED_MEM" ?
What case of non-CC VMs may use this feature in reality?  Or do you have any expected cases?

> 
> Exact implementation of such a support warrants more discussion but it
> should be in the line of sight here as a future work item.
> 
> 
> 
> 
> >
> > > TDX VM type can possibly serve as a specialized type of protected VM
> > > with additional arch specific capabilities enabled.
> > >
> > > [1] - https://github.com/sean-jc/linux/commits/x86/kvm_gmem_solo
> >



* Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement
  2023-06-21 18:19       ` Dong, Eddie
@ 2023-06-21 20:46         ` Vishal Annapurve
  0 siblings, 0 replies; 18+ messages in thread
From: Vishal Annapurve @ 2023-06-21 20:46 UTC
  To: Dong, Eddie
  Cc: Zhi Wang, Yamahata, Isaku, kvm, linux-kernel, isaku.yamahata,
	Paolo Bonzini, Aktas, Erdem, Christopherson,,
	Sean, Shahar, Sagi, David Matlack, Huang, Kai, Chen, Bo2,
	linux-coco, Chao Peng, Ackerley Tng, Michael Roth

On Wed, Jun 21, 2023 at 11:20 AM Dong, Eddie <eddie.dong@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Vishal Annapurve <vannapurve@google.com>
> > Sent: Monday, June 19, 2023 2:55 PM
> > To: Zhi Wang <zhi.wang.linux@gmail.com>
> > Cc: Yamahata, Isaku <isaku.yamahata@intel.com>; kvm@vger.kernel.org;
> > linux-kernel@vger.kernel.org; isaku.yamahata@gmail.com; Paolo Bonzini
> > <pbonzini@redhat.com>; Aktas, Erdem <erdemaktas@google.com>;
> > Christopherson,, Sean <seanjc@google.com>; Shahar, Sagi
> > <sagis@google.com>; David Matlack <dmatlack@google.com>; Huang, Kai
> > <kai.huang@intel.com>; Chen, Bo2 <chen.bo@intel.com>; linux-
> > coco@lists.linux.dev; Chao Peng <chao.p.peng@linux.intel.com>; Ackerley
> > Tng <ackerleytng@google.com>; Michael Roth <michael.roth@amd.com>
> > Subject: Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement
> >
> > On Mon, Jun 19, 2023 at 1:11 PM Zhi Wang <zhi.wang.linux@gmail.com>
> > wrote:
> > >
> > > On Mon, 19 Jun 2023 12:11:50 -0700
> ...
> >
> > The protected VM effort is about having guest memory ranges that are not
> > mapped into the userspace VMM and are, in most cases, unreachable from
> > KVM as well. Non-CC VMs can use this support, possibly together with
> > enlightened kernels, to mitigate unintended accesses from the userspace
> > VMM/KVM.
>
> "PROTECTED" seems to be not very close to what you mean here. "PROTECTED_MEM" ?
> What case of non-CC VMs may use this feature in reality?  Or do you have any expected cases?
>

Similar to the pKVM efforts [1], PROTECTED_VM functionality may be used
to unmap guest memory ranges from the host and userspace VMM on x86
platforms. If the KVM/host kernel and the guest VMs are enlightened for
this use case, then it should be possible to deploy this feature for
normal VMs irrespective of the platform they are running on.

The primary use case here would be to prevent unintended accesses from
KVM/the userspace VMM, which would normally go undetected at runtime or
be hard to trace back to the original culprit.

[1] https://source.android.com/docs/core/virtualization/architecture#hypervisor

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range
  2023-06-20 19:04     ` Isaku Yamahata
@ 2023-06-23 19:37       ` Sean Christopherson
  0 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2023-06-23 19:37 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Michael Roth, isaku.yamahata, kvm, linux-kernel, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Kai Huang, Zhi Wang,
	chen.bo, linux-coco, Chao Peng, Ackerley Tng, Vishal Annapurve

On Tue, Jun 20, 2023, Isaku Yamahata wrote:
> On Tue, Jun 20, 2023 at 11:28:35AM -0500,
> Michael Roth <michael.roth@amd.com> wrote:
> 
> > On Thu, Jun 15, 2023 at 01:12:18PM -0700, isaku.yamahata@intel.com wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > TDX and SEV-SNP need to know the reason why kvm_unmap_gfn_range() was
> > > called: an mmu notifier, the set-memory-attributes ioctl, or a KVM
> > > gmem callback.  The callback handler changes its behavior or performs
> > > additional housekeeping accordingly.  For the mmu notifier, that is
> > > zapping shared PTEs.  For set memory attributes, it is the conversion
> > > of memory attributes (private <=> shared).  For KVM gmem, it is
> > > punching a hole in the range or releasing the file.
> > 
> > I think it's still an open topic that we need to hear more from Sean about:
> > 
> >   https://lore.kernel.org/lkml/20230522235838.ov3722lcusotzlvo@amd.com/
> > 
> > but I *think* we were leaning toward decoupling the act of invalidating
> > GFNs, vs. the act of invalidating/free'ing gmem pages.

Yes.  Ignore any comments I made about not introducing new hooks, I messed up and
forgot the full context.

> > One concrete example of where this separation makes sense is with
> > hole-punching. SNP has unique platform-specific work it has to do before
> > freeing that gmem page back to the host. If we try to plumb this through
> > kvm_unmap_gfn_range() via a special flag then it's a little awkward
> > because:
> > 
> > a) Presumably that hole-punch would have occurred after a preceding
> >    KVM_SET_MEMORY_ATTRIBUTES was issued to switch the page to shared
> >    state in the xarray (see the ordering sketch after this list). So
> >    all it should really need to do is handle that platform-specific
> >    behavior, like updating the RMP table in the case of SNP. But none
> >    of the other details like GFN ranges are relevant in that case; RMP
> >    updates here only need the PFN, so we end up walking memslots to do
> >    GFN->PFN translations, when it would actually be much more efficient
> >    to translate the hole-punched FD offset range directly to the
> >    folios backing those ranges
> > 
> > b) It places an unnecessary dependency on having an active memslot to do
> >    those translations. This ends up not being an issue with the current
> >    version of the gmem patchset because the release() happens *before*
> >    gmem_unbind(), so there is a memslot associated with the ranges at
> >    gmem_release() time, but in the initial version of gmem it was the
> >    reverse, so if things ever changed again in this regard we'd once
> >    again have to completely rework how these platform-specific
> >    invalidation callbacks are issued.
> > 
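(For illustration, a minimal sketch of the userspace-side ordering described
in (a).  The ioctl and struct names follow the linked gmem/SNP RFCs and may
differ in detail; gpa, size, vm_fd, gmem_fd and gmem_offset are assumed to be
set up elsewhere.)

	/*
	 * Sketch only: convert the range back to shared first, then punch a
	 * hole in the gmem fd so any platform-specific cleanup can happen at
	 * page-free time rather than in the zap path.
	 */
	struct kvm_memory_attributes attrs = {
		.address    = gpa,
		.size       = size,
		.attributes = 0,	/* clear KVM_MEMORY_ATTRIBUTE_PRIVATE */
	};

	ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
	fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  gmem_offset, size);
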
> > I really *really* like having a separate, simple invalidation mechanism
> > in place that just converts FD offsets to PFNs and then passes those on
> > to platform-defined handlers to clean up pages before freeing them back
> > to the system. It's versatile in that it can be called pretty much
> > anywhere regardless of where we are in the KVM lifecycle, it's robust in
> > that it doesn't rely on unnecessary outside dependencies, and it avoids
> > adding unneeded complexity to paths like kvm_unmap_gfn_range().
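
(As a sketch of that shape — names here are illustrative only; the real hook
is introduced in the SNP series linked below:)

	/*
	 * Illustrative only: walk the folios backing a punched/released
	 * range and hand their PFNs to an arch hook for cleanup (e.g. RMP
	 * updates) before the pages go back to the allocator.  No GFNs or
	 * memslots required.
	 */
	static void kvm_gmem_invalidate(struct file *file, pgoff_t start, pgoff_t end)
	{
		struct folio_batch fbatch;
		pgoff_t index = start;

		folio_batch_init(&fbatch);
		while (filemap_get_folios(file->f_mapping, &index, end - 1, &fbatch)) {
			unsigned int i;

			for (i = 0; i < folio_batch_count(&fbatch); i++) {
				struct folio *folio = fbatch.folios[i];

				kvm_arch_gmem_invalidate(folio_pfn(folio),
							 folio_pfn(folio) + folio_nr_pages(folio));
			}
			folio_batch_release(&fbatch);
			cond_resched();
		}
	}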
> > 
> > That's the approach taken with SNP hypervisor v9 series, with the
> > gmem hook being introduced here:
> > 
> >   https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@amd.com/T/#m3ad8245235a27ed0f41c359c191dcda6c77af043
> > 
> > and the SEV-SNP implementation of that hook being here:
> > 
> >   https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@amd.com/T/#m6ac04b44722dbc07839011816e94fadf5ad6794e
> > 
> > Would a similar approach work for TDX? At least WRT cleaning up pages
> > before returning them back to the host? If we could isolate that
> > requirement/handling from all the other aspects of invalidations it
> > really seems like it would cause us less headaches down the road.
> 
> In short, TDX KVM doesn't need an extra callback for invalidating/freeing
> gmem pages; the kvm_unmap_gfn_range() callback works.  Instead, TDX needs
> the attributes (private-or-shared) for it.

Just because TDX doesn't strictly need a separate callback doesn't mean it
wouldn't be useful and beneficial.  SNP doesn't "need" a separate callback either,
i.e. we could make kvm_unmap_gfn_range() work, but it would be ugly and inflexible.

E.g. as Mike pointed out in the other thread, reclaiming physical memory when
SPTEs are zapped is suboptimal if the memory is not actually discarded.

  : There's also cases like userspaces that opt to not discard memory after
  : conversions because they highly favor performance over memory usage. In
  : those cases it would make sense to defer marking the pages as shared in
  : the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via
  : KVM MMU invalidation path after a conversion.

And to some extent, I would even argue that TDX does "need" the separate hook,
because doing PAGE.WBINVD while holding mmu_lock for write is going to be slooow.
Even if the PAGE.WBINVD isn't redundant, i.e. the memory is never converted back
to private, deferring the cache flush until the backing store is freed is a win
for guest performance during conversions.
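
As a sketch of the idea (hypothetical hook and helper names, not from any
posted patch): defer the flush to a gmem free-time hook so it runs outside
mmu_lock.

	/*
	 * Hypothetical TDX backend for a gmem free-time invalidation hook:
	 * flush the pages' cachelines here, when the backing store is
	 * actually freed, instead of during the zap under mmu_lock.
	 */
	static void tdx_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
	{
		kvm_pfn_t pfn;

		for (pfn = start; pfn < end; pfn++)
			tdx_wbinvd_page(pfn_to_page(pfn));	/* assumed wrapper for TDH.PHYMEM.PAGE.WBINVD */
	}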

> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 1a47cedae8a1..c049c0aa44d6 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > >  #endif
> > >  
> > >  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> > > +
> > > +#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR		BIT(0)
> > 
> > Can you go into more detail on why special handling is needed for
> > SET_MEM_ATTR?
> 
> In TDX, when the VMM zaps a private page from the encrypted page table and
> then adds the page back at the same GPA, the page is zeroed and the guest
> needs to accept it again.  When converting a page from shared to private,
> KVM needs to zap only the shared pages.  So the callback needs to know
> whether the zap is for a shared-to-private or a private-to-shared conversion.

That doesn't answer Mike's question.  You answered why KVM needs to know whether
to zap shared vs. private, but not why SET_MEM_ATTR is a special case.

> > > +#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE		BIT(1)
> > > +#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE		BIT(2)
> > 
> > Would the need to distinguish between PUNCH_HOLE/RELEASE go away in the
> > TDX case if you take the above approach? For SNP, the answer is yes. If
> > that's also the case for TDX I think that would be another argument in
> > favor of decoupling these from existing KVM MMU invalidation path.
> 
> TDX doesn't need the gmem_invalidate callback.  TDX doesn't need to
> distinguish between punch-hole and release. So, in short, TDX needs
> KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR and KVM_GFN_RANGE_FLAGS_GMEM.

TDX needs to know what flavor of mappings, i.e. EPT vs. S-EPT, is in scope; TDX
doesn't need to know who/what initiated a zap.  And for that, a simple private vs.
shared flag would suffice.

However, looking at kvm_mem_attrs_changed() again, I think invoking kvm_unmap_gfn_range()
from generic KVM code is a mistake and shortsighted.  Zapping in response to
*any* attribute change is very private/shared centric.  E.g. if/when we extend
attributes to provide per-page RWX protections, zapping existing SPTEs in response
to granting *more* permissions may not be necessary or even desirable.

And for SNP, isn't zapping unnecessary?  KVM needs to update the RMP, but there's
no need to zap NPT entries.  Or do RMP lookups need to be blocked while the RMP
is being updated?

Regardless of what SNP needs to do on attribute changes, rather than using
kvm_unmap_gfn_range(), I think kvm_mem_attrs_changed() should look something like:


	struct kvm_gfn_range gfn_range;
	struct kvm_memory_slot *slot;
	struct kvm_memslots *slots;
	struct kvm_memslot_iter iter;
	bool flush = false;
	int i;

	gfn_range.may_block = true;
	gfn_range.attrs = attrs;

	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
		slots = __kvm_memslots(kvm, i);

		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
			slot = iter.slot;
			gfn_range.start = max(start, slot->base_gfn);
			gfn_range.end = min(end, slot->base_gfn + slot->npages);
			if (gfn_range.start >= gfn_range.end)
				continue;
			gfn_range.slot = slot;

			flush |= kvm_arch_set_memory_attributes(kvm, &gfn_range);
		}
	}

	if (flush)
		kvm_flush_remote_tlbs(kvm);

At that point, a single "is_private" or "is_shared" flag is awkward and arguably
wrong, because the changing attributes affect both.  The fact that TDX only needs
to zap one or the other is an implementation detail, not a fundamental truth of
the update itself.

We could just have kvm_arch_set_memory_attributes() take individual params instead
of taking a kvm_gfn_range pointer, but that's a missed opportunity IMO as this
really is just another variation of gfn-based notification to arch MMU code.

Going with my union suggestion from the MGLRU thread[*], I think we should aim
for making kvm_gfn_range look like this:

struct kvm_gfn_range {
	struct kvm_memory_slot *slot;
	gfn_t start;
	gfn_t end;
	union {
		struct test_clear_young_metadata *metadata;
		unsigned long attributes;
		pte_t pte;
		unsigned long callback_arg; /* needs a better name */
	};
	bool only_private;
	bool only_shared;
	bool may_block;
};

That way kvm_arch_set_memory_attributes() can communicate that it affects both
shared and private mappings, i.e. can leave it to TDX to precisely zap only the
necessary tree.  It's a bit unfortunate that the existing mmu_notifier usage
would need to explicitly say "only_shared", but that's literally one line of code
in __kvm_handle_hva_range(), and on the plus side it would clearly document that
hva-based notifications are shared-only.

[*] https://lore.kernel.org/all/ZItNoeWriZgLUaon@google.com
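
For illustration, that one line in context (a sketch against
__kvm_handle_hva_range(); the surrounding lines are approximate):

		gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
		gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
		gfn_range.slot = slot;
		gfn_range.may_block = range->may_block;
		/* hva-based (mmu_notifier) events never touch private mappings */
		gfn_range.only_shared = true;

		ret |= range->handler(kvm, &gfn_range);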

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement
  2023-06-15 20:12 [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement isaku.yamahata
                   ` (6 preceding siblings ...)
  2023-06-19 19:11 ` [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement Vishal Annapurve
@ 2023-06-24  0:09 ` Sean Christopherson
  2023-06-26 20:54   ` Isaku Yamahata
  7 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2023-06-24  0:09 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sagi Shahar, David Matlack, Kai Huang, Zhi Wang, chen.bo,
	linux-coco, Chao Peng, Ackerley Tng, Vishal Annapurve,
	Michael Roth

On Thu, Jun 15, 2023, isaku.yamahata@intel.com wrote:
> * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
>   - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
>   - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
>     the future)

How would KVM differentiate between software-protected VMs and TDX VMs if we go
with this option?

>   - any other way?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement
  2023-06-24  0:09 ` Sean Christopherson
@ 2023-06-26 20:54   ` Isaku Yamahata
  0 siblings, 0 replies; 18+ messages in thread
From: Isaku Yamahata @ 2023-06-26 20:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Kai Huang, Zhi Wang,
	chen.bo, linux-coco, Chao Peng, Ackerley Tng, Vishal Annapurve,
	Michael Roth

On Fri, Jun 23, 2023 at 05:09:26PM -0700,
Sean Christopherson <seanjc@google.com> wrote:

> On Thu, Jun 15, 2023, isaku.yamahata@intel.com wrote:
> > * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
> >   - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
> >   - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
> >     the future)
> 
> How would KVM differentiate between software-protected VMs and TDX VMs if we go
> with this option?

Let's introduce two new VM types; I'm fine with that.  I had a KVM capability
in mind in case the SEV implementation doesn't like new VM types.
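
For illustration, the two-new-types option might look like this in the uapi
header (names and values are illustrative, not from a posted patch):

	#define KVM_X86_DEFAULT_VM	0
	#define KVM_X86_PROTECTED_VM	1	/* software-only, no TDX/SEV needed */
	#define KVM_X86_TDX_VM		2	/* new */
	#define KVM_X86_SNP_VM		3	/* new */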
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-06-26 20:54 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-15 20:12 [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement isaku.yamahata
2023-06-15 20:12 ` [RFC PATCH 1/6] KVM: selftests: Fix test_add_overlapping_private_memory_regions() isaku.yamahata
2023-06-15 20:12 ` [RFC PATCH 2/6] KVM: selftests: Fix guest_memfd() isaku.yamahata
2023-06-15 20:12 ` [RFC PATCH 3/6] KVM: x86/mmu: Pass round full 64-bit error code for the KVM page fault isaku.yamahata
2023-06-15 20:12 ` [RFC PATCH 4/6] KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private isaku.yamahata
2023-06-15 20:12 ` [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range isaku.yamahata
2023-06-20 16:28   ` Michael Roth
2023-06-20 19:04     ` Isaku Yamahata
2023-06-23 19:37       ` Sean Christopherson
2023-06-15 20:12 ` [RFC PATCH 6/6] KVM: x86: Add is_vm_type_supported callback isaku.yamahata
2023-06-19 19:11 ` [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement Vishal Annapurve
2023-06-19 20:11   ` Zhi Wang
2023-06-19 21:55     ` Vishal Annapurve
2023-06-21  8:16       ` Zhi Wang
2023-06-21 18:19       ` Dong, Eddie
2023-06-21 20:46         ` Vishal Annapurve
2023-06-24  0:09 ` Sean Christopherson
2023-06-26 20:54   ` Isaku Yamahata
