* [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes
@ 2023-09-14  1:54 Sean Christopherson
  2023-09-14  1:54 ` [RFC PATCH v12 01/33] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
                   ` (32 more replies)
  0 siblings, 33 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:54 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

This is hopefully the last RFC for implementing fd-based (instead of vma-based)
memory for KVM guests.  If you want the full background of why we are doing
this, please go read the v10 cover letter.  With luck, v13 will be a "normal"
series that's ready for inclusion.

Tagged RFC as there are still several empty changelogs, a lot of missing
documentation, and a handful of TODOs.  And I haven't tested or proofread this
anywhere near as much as I normally would.  I am posting even though the
remaining TODOs aren't _that_ big so that people can test this new version
without having to wait a few weeks to close out the remaining TODOs, i.e. to
give us at least some chance of hitting v6.7.

The most relevant TODO item for non-KVM folks is that we are planning on
dropping the dedicated "gmem" file system.  Assuming that pans out, the patch
to export security_inode_init_security_anon() should go away.

KVM folks, there are a few changes I want to highlight and get feedback on, all of
which are directly related to the "annotated memory faults" series[*]:

 - Rename kvm_run.memory to kvm_run.memory_fault
 - Place "memory_fault" in a separate union
 - Return -EFAULT or -EHWPOISON when exiting with KVM_EXIT_MEMORY_FAULT

The first one is pretty self-explanatory: "run->memory.gpa" looks quite odd, and
claiming the generic "memory" name would prevent ever doing something directly
with memory.

Putting the struct in a separate union is not at all necessary for supporting
private memory, it's purely forward-looking to Anish's series, which wants to
annotate (fill memory_fault) on all faults, even if KVM ultimately doesn't exit
to userspace (x86 has a few unfortunate flows where KVM can clobber a previous
exit, or suppress a memory fault exit).  Using a separate union, i.e. different
bytes in kvm_run, allows exiting to userspace with both memory_fault and the
"normal" union filled, e.g. if KVM starts an MMIO exit and then hits a memory
fault exit, the MMIO exit will be preserved.  It's unlikely userspace will be
able to do anything useful with the info in that case, but the reverse will
likely be much more interesting, e.g. if KVM hits a memory fault and then doesn't
report it to userspace for whatever reason.
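
To make the layout concrete, here is a rough sketch of the idea (illustrative
only; the struct name is made up and the field list is trimmed, the real uapi
layout is in the KVM_EXIT_MEMORY_FAULT patch):

  #include <linux/types.h>

  /*
   * Sketch, NOT the actual kvm_run definition: memory_fault lives in its own
   * union, so filling it can never clobber the "normal" exit union (mmio,
   * hypercall, etc.), and vice versa.
   */
  struct kvm_run_sketch {
          __u32 exit_reason;

          /* "Normal" union of exit structs, only one member shown. */
          union {
                  struct {
                          __u64 phys_addr;
                          __u8  data[8];
                          __u32 len;
                          __u8  is_write;
                  } mmio;
                  char padding[256];
          };

          /* Separate bytes, so a pending MMIO (or other) exit is preserved. */
          union {
                  struct {
                          __u64 flags;
                          __u64 gpa;
                          __u64 size;
                  } memory_fault;
          };
  };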

As for returning -EFAULT/-EHWPOISON, far too many helpers that touch guest
memory, i.e. can "fault", return 0 on success, which makes it all but impossible
to use '0' to signal "exit to userspace".  Rather than use '0' for _just_ the
case where the guest accesses memory whose private vs. shared state is
mismatched, my thought is to use -EFAULT everywhere except for the poisoned
page case.
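
From userspace's perspective, consuming this would look roughly like the sketch
below (assumes uapi headers with this series applied; vmm_handle_memory_fault()
and the conversion policy are hypothetical, not part of this series):

  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Hypothetical VMM helper: convert the reported range, then resume the vCPU. */
  int vmm_handle_memory_fault(__u64 gpa, __u64 size, __u64 flags);

  static int vcpu_run_once(int vcpu_fd, struct kvm_run *run)
  {
          int ret = ioctl(vcpu_fd, KVM_RUN, 0);

          /*
           * A memory_fault exit is signaled by KVM_RUN failing with EFAULT or
           * EHWPOISON (ret == -1 with errno set), not by returning '0'.
           */
          if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
              run->exit_reason == KVM_EXIT_MEMORY_FAULT)
                  return vmm_handle_memory_fault(run->memory_fault.gpa,
                                                 run->memory_fault.size,
                                                 run->memory_fault.flags);
          return ret;
  }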

[*] https://lore.kernel.org/all/20230908222905.1321305-1-amoorthy@google.com

TODOs [owner]:
 - Documentation [none]
 - Changelogs [Sean]
 - Fully anonymous inode vs. proper filesystem [Paolo]
 - kvm_gmem_error_page() testing (my version is untested) [Isaku?]

v12:
 - Squash fixes from others. [Many people]
 - Kill off the .on_unlock() callback and use .on_lock() when handling
   memory attributes updates. [Isaku]
 - Add more tests. [Ackerley]
 - Move range_has_attrs() to common code. [Paolo]
 - Return the actual number of address spaces for the VM-scoped version of
   KVM_CAP_MULTI_ADDRESS_SPACE. [Paolo]
 - Move forward declaration of "struct kvm_gfn_range" to kvm_types.h. [Yuan]
 - Plumb code to have HVA-based mmu_notifier events affect only shared
   mappings. [Anish]
 - Clean up kvm_vm_ioctl_set_mem_attributes() math. [Binbin]
 - Collect a few reviews and acks. [Paolo, Paul]
 - Unconditionally advertise a synchronized MMU on PPC. [Paolo]
 - Check for error return from filemap_grab_folio(). [A
 - Make max_order optional. [Fuad]
 - Remove signal injection, zap SPTEs on memory error. [Isaku]
 - Add KVM_CAP_GUEST_MEMFD. [Xiaoyao]
 - Invoke kvm_arch_pre_set_memory_attributes() instead of
   kvm_mmu_unmap_gfn_range().
 - Rename kvm_run.memory to kvm_run.memory_fault
 - Place "memory_fault" in a separate union
 - Return -EFAULT and -EHWPOISON with KVM_EXIT_MEMORY_FAULT
 - "Init" run->exit_reason in x86's vcpu_run()

v11:
 - https://lore.kernel.org/all/20230718234512.1690985-1-seanjc@google.com
 - Test private<=>shared conversions *without* doing fallocate()
 - PUNCH_HOLE all memory between iterations of the conversion test so that
   KVM doesn't retain pages in the guest_memfd
 - Rename hugepage control to be a very generic ALLOW_HUGEPAGE, instead of
   giving it a THP or PMD specific name.
 - Fold in fixes from a lot of people (thank you!)
 - Zap SPTEs *before* updating attributes to ensure no weirdness, e.g. if
   KVM handles a page fault and looks at inconsistent attributes
 - Refactor MMU interaction with attributes updates to reuse much of KVM's
   framework for mmu_notifiers.

v10: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com

Ackerley Tng (1):
  KVM: selftests: Test KVM exit behavior for private memory/access

Chao Peng (8):
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  KVM: Introduce per-page memory attributes
  KVM: x86: Disallow hugepages when memory attributes are mixed
  KVM: x86/mmu: Handle page fault for private memory
  KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
  KVM: selftests: Expand set_memory_region_test to validate
    guest_memfd()
  KVM: selftests: Add basic selftest for guest_memfd()

Sean Christopherson (21):
  KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn
    ranges
  KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
  KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU
  KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to
    CONFIG_KVM_GENERIC_MMU_NOTIFIER
  KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
  KVM: Drop .on_unlock() mmu_notifier hook
  KVM: Set the stage for handling only shared mappings in mmu_notifier
    events
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
  security: Export security_inode_init_security_anon() for use by KVM
  KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing
    memory
  KVM: Add transparent hugepage support for dedicated guest memory
  KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
  KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
  KVM: Allow arch code to track number of memslot address spaces per VM
  KVM: x86: Add support for "protected VMs" that can utilize private
    memory
  KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
  KVM: selftests: Convert lib's mem regions to
    KVM_SET_USER_MEMORY_REGION2
  KVM: selftests: Add support for creating private memslots
  KVM: selftests: Introduce VM "shape" to allow tests to specify the VM
    type
  KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data

Vishal Annapurve (3):
  KVM: selftests: Add helpers to convert guest memory b/w private and
    shared
  KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls
    (x86)
  KVM: selftests: Add x86-only selftest for private memory conversions

 Documentation/virt/kvm/api.rst                | 116 ++++
 arch/arm64/include/asm/kvm_host.h             |   2 -
 arch/arm64/kvm/Kconfig                        |   2 +-
 arch/mips/include/asm/kvm_host.h              |   2 -
 arch/mips/kvm/Kconfig                         |   2 +-
 arch/powerpc/include/asm/kvm_host.h           |   2 -
 arch/powerpc/kvm/Kconfig                      |   8 +-
 arch/powerpc/kvm/book3s_hv.c                  |   2 +-
 arch/powerpc/kvm/powerpc.c                    |   7 +-
 arch/riscv/include/asm/kvm_host.h             |   2 -
 arch/riscv/kvm/Kconfig                        |   2 +-
 arch/x86/include/asm/kvm_host.h               |  17 +-
 arch/x86/include/uapi/asm/kvm.h               |   3 +
 arch/x86/kvm/Kconfig                          |  14 +-
 arch/x86/kvm/debugfs.c                        |   2 +-
 arch/x86/kvm/mmu/mmu.c                        | 264 +++++++-
 arch/x86/kvm/mmu/mmu_internal.h               |   2 +
 arch/x86/kvm/mmu/tdp_mmu.c                    |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/kvm/x86.c                            |  25 +-
 include/linux/kvm_host.h                      | 143 +++-
 include/linux/kvm_types.h                     |   1 +
 include/linux/pagemap.h                       |  19 +-
 include/uapi/linux/kvm.h                      |  67 ++
 include/uapi/linux/magic.h                    |   1 +
 mm/compaction.c                               |  43 +-
 mm/migrate.c                                  |   2 +
 security/security.c                           |   1 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 tools/testing/selftests/kvm/dirty_log_test.c  |   2 +-
 .../testing/selftests/kvm/guest_memfd_test.c  | 165 +++++
 .../selftests/kvm/include/kvm_util_base.h     | 148 +++-
 .../testing/selftests/kvm/include/test_util.h |   5 +
 .../selftests/kvm/include/ucall_common.h      |  11 +
 .../selftests/kvm/include/x86_64/processor.h  |  15 +
 .../selftests/kvm/kvm_page_table_test.c       |   2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    | 231 ++++---
 tools/testing/selftests/kvm/lib/memstress.c   |   3 +-
 .../selftests/kvm/set_memory_region_test.c    | 100 +++
 .../kvm/x86_64/private_mem_conversions_test.c | 410 +++++++++++
 .../kvm/x86_64/private_mem_kvm_exits_test.c   | 121 ++++
 .../kvm/x86_64/ucna_injection_test.c          |   2 +-
 virt/kvm/Kconfig                              |  17 +
 virt/kvm/Makefile.kvm                         |   1 +
 virt/kvm/dirty_ring.c                         |   2 +-
 virt/kvm/guest_mem.c                          | 637 ++++++++++++++++++
 virt/kvm/kvm_main.c                           | 482 +++++++++++--
 virt/kvm/kvm_mm.h                             |  38 ++
 48 files changed, 2888 insertions(+), 271 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
 create mode 100644 virt/kvm/guest_mem.c


base-commit: 0bb80ecc33a8fb5a682236443c1e740d5c917d1d
-- 
2.42.0.283.g2d96d420d3-goog


* [RFC PATCH v12 01/33] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
@ 2023-09-14  1:54 ` Sean Christopherson
  2023-09-15  6:47   ` Xiaoyao Li
  2023-09-14  1:55 ` [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
                   ` (31 subsequent siblings)
  32 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:54 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so
that the structure can be used to handle notifications that operate on gfn
context, i.e. that aren't tied to a host virtual address.

Practically speaking, this is a nop for 64-bit kernels as the only
meaningful change is to store start+end as u64s instead of unsigned longs.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 486800a7024b..0524933856d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -541,18 +541,22 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 	return container_of(mn, struct kvm, mmu_notifier);
 }
 
-typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
 			     unsigned long end);
 
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
-struct kvm_hva_range {
-	unsigned long start;
-	unsigned long end;
+struct kvm_mmu_notifier_range {
+	/*
+	 * 64-bit addresses, as KVM notifiers can operate on host virtual
+	 * addresses (unsigned long) and guest physical addresses (64-bit).
+	 */
+	u64 start;
+	u64 end;
 	union kvm_mmu_notifier_arg arg;
-	hva_handler_t handler;
+	gfn_handler_t handler;
 	on_lock_fn_t on_lock;
 	on_unlock_fn_t on_unlock;
 	bool flush_on_ret;
@@ -581,7 +585,7 @@ static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
 	     node = interval_tree_iter_next(node, start, last))	     \
 
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
-						  const struct kvm_hva_range *range)
+						  const struct kvm_mmu_notifier_range *range)
 {
 	bool ret = false, locked = false;
 	struct kvm_gfn_range gfn_range;
@@ -608,9 +612,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			unsigned long hva_start, hva_end;
 
 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
-			hva_start = max(range->start, slot->userspace_addr);
-			hva_end = min(range->end, slot->userspace_addr +
-						  (slot->npages << PAGE_SHIFT));
+			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
+			hva_end = min_t(unsigned long, range->end,
+					slot->userspace_addr + (slot->npages << PAGE_SHIFT));
 
 			/*
 			 * To optimize for the likely case where the address
@@ -660,10 +664,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 						unsigned long start,
 						unsigned long end,
 						union kvm_mmu_notifier_arg arg,
-						hva_handler_t handler)
+						gfn_handler_t handler)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range range = {
+	const struct kvm_mmu_notifier_range range = {
 		.start		= start,
 		.end		= end,
 		.arg		= arg,
@@ -680,10 +684,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
 							 unsigned long start,
 							 unsigned long end,
-							 hva_handler_t handler)
+							 gfn_handler_t handler)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range range = {
+	const struct kvm_mmu_notifier_range range = {
 		.start		= start,
 		.end		= end,
 		.handler	= handler,
@@ -771,7 +775,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range hva_range = {
+	const struct kvm_mmu_notifier_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
 		.handler	= kvm_unmap_gfn_range,
@@ -835,7 +839,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range hva_range = {
+	const struct kvm_mmu_notifier_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
 		.handler	= (void *)kvm_null_fn,
-- 
2.42.0.283.g2d96d420d3-goog


* [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
  2023-09-14  1:54 ` [RFC PATCH v12 01/33] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  3:07   ` Binbin Wu
  2023-09-20  6:07   ` Xu Yilun
  2023-09-14  1:55 ` [RFC PATCH v12 03/33] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER Sean Christopherson
                   ` (30 subsequent siblings)
  32 siblings, 2 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Currently in the mmu_notifier invalidate path, the hva range is recorded and
then checked against by mmu_invalidate_retry_hva() in the page fault handling
path.  However, for the soon-to-be-introduced private memory, a page fault may
not have an hva associated with it, so checking the gfn (gpa) makes more sense.

For existing hva-based shared memory, checking the gfn is expected to work as
well.  The only downside is that when multiple gfns alias a single hva, the
current algorithm of checking multiple ranges could result in a much larger
range being rejected.  Such aliasing should be uncommon, so the impact is
expected to be small.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c   | 10 ++++++----
 arch/x86/kvm/vmx/vmx.c   | 11 +++++------
 include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
 virt/kvm/kvm_main.c      | 40 +++++++++++++++++++++++++++++++---------
 4 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e1d011c67cc6..0f0231d2b74f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3056,7 +3056,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
  *
  * There are several ways to safely use this helper:
  *
- * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
+ * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
  *   consuming it.  In this case, mmu_lock doesn't need to be held during the
  *   lookup, but it does need to be held while checking the MMU notifier.
  *
@@ -4358,7 +4358,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 		return true;
 
 	return fault->slot &&
-	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+	       mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -6253,7 +6253,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	write_lock(&kvm->mmu_lock);
 
-	kvm_mmu_invalidate_begin(kvm, 0, -1ul);
+	kvm_mmu_invalidate_begin(kvm);
+
+	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
 
 	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6266,7 +6268,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	if (flush)
 		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
 
-	kvm_mmu_invalidate_end(kvm, 0, -1ul);
+	kvm_mmu_invalidate_end(kvm);
 
 	write_unlock(&kvm->mmu_lock);
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 72e3943f3693..6e502ba93141 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6757,10 +6757,10 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 		return;
 
 	/*
-	 * Grab the memslot so that the hva lookup for the mmu_notifier retry
-	 * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
-	 * on the pfn lookup's validation of the memslot to ensure a valid hva
-	 * is used for the retry check.
+	 * Explicitly grab the memslot using KVM's internal slot ID to ensure
+	 * KVM doesn't unintentionally grab a userspace memslot.  It _should_
+	 * be impossible for userspace to create a memslot for the APIC when
+	 * APICv is enabled, but paranoia won't hurt in this case.
 	 */
 	slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
 	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
@@ -6785,8 +6785,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 		return;
 
 	read_lock(&vcpu->kvm->mmu_lock);
-	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
-				     gfn_to_hva_memslot(slot, gfn))) {
+	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
 		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
 		read_unlock(&vcpu->kvm->mmu_lock);
 		goto out;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb6c6109fdca..11d091688346 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -787,8 +787,8 @@ struct kvm {
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_invalidate_seq;
 	long mmu_invalidate_in_progress;
-	unsigned long mmu_invalidate_range_start;
-	unsigned long mmu_invalidate_range_end;
+	gfn_t mmu_invalidate_range_start;
+	gfn_t mmu_invalidate_range_end;
 #endif
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
@@ -1392,10 +1392,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm);
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -1970,9 +1969,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 	return 0;
 }
 
-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
 					   unsigned long mmu_seq,
-					   unsigned long hva)
+					   gfn_t gfn)
 {
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
@@ -1981,10 +1980,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 	 * that might be being invalidated. Note that it may include some false
 	 * positives, due to shortcuts when handing concurrent invalidations.
 	 */
-	if (unlikely(kvm->mmu_invalidate_in_progress) &&
-	    hva >= kvm->mmu_invalidate_range_start &&
-	    hva < kvm->mmu_invalidate_range_end)
-		return 1;
+	if (unlikely(kvm->mmu_invalidate_in_progress)) {
+		/*
+		 * Dropping mmu_lock after bumping mmu_invalidate_in_progress
+		 * but before updating the range is a KVM bug.
+		 */
+		if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
+				 kvm->mmu_invalidate_range_end == INVALID_GPA))
+			return 1;
+
+		if (gfn >= kvm->mmu_invalidate_range_start &&
+		    gfn < kvm->mmu_invalidate_range_end)
+			return 1;
+	}
+
 	if (kvm->mmu_invalidate_seq != mmu_seq)
 		return 1;
 	return 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0524933856d4..4fad3b01dc1f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -543,9 +543,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
-			     unsigned long end);
-
+typedef void (*on_lock_fn_t)(struct kvm *kvm);
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
 struct kvm_mmu_notifier_range {
@@ -637,7 +635,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 				locked = true;
 				KVM_MMU_LOCK(kvm);
 				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm, range->start, range->end);
+					range->on_lock(kvm);
+
 				if (IS_KVM_NULL_FN(range->handler))
 					break;
 			}
@@ -742,15 +741,26 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn);
 }
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
 {
+	lockdep_assert_held_write(&kvm->mmu_lock);
 	/*
 	 * The count increase must become visible at unlock time as no
 	 * spte can be established without taking the mmu_lock and
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_invalidate_in_progress++;
+
+	if (likely(kvm->mmu_invalidate_in_progress == 1))
+		kvm->mmu_invalidate_range_start = INVALID_GPA;
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
 	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
 		kvm->mmu_invalidate_range_start = start;
 		kvm->mmu_invalidate_range_end = end;
@@ -771,6 +781,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 	}
 }
 
+static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+	return kvm_unmap_gfn_range(kvm, range);
+}
+
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
@@ -778,7 +794,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	const struct kvm_mmu_notifier_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
-		.handler	= kvm_unmap_gfn_range,
+		.handler	= kvm_mmu_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
 		.on_unlock	= kvm_arch_guest_memory_reclaimed,
 		.flush_on_ret	= true,
@@ -817,8 +833,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	return 0;
 }
 
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm)
 {
 	/*
 	 * This sequence increase will notify the kvm page fault that
@@ -833,6 +848,13 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
 	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
 	 */
 	kvm->mmu_invalidate_in_progress--;
+
+	/*
+	 * Assert that at least one range must be added between start() and
+	 * end().  Not adding a range isn't fatal, but it is a KVM bug.
+	 */
+	WARN_ON_ONCE(kvm->mmu_invalidate_in_progress &&
+		     kvm->mmu_invalidate_range_start == INVALID_GPA);
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
-- 
2.42.0.283.g2d96d420d3-goog


* [RFC PATCH v12 03/33] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
  2023-09-14  1:54 ` [RFC PATCH v12 01/33] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 04/33] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU Sean Christopherson
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
defined when KVM is enabled, and return '1' unconditionally for the
CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path.  All flavors of PPC support for KVM
select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
defined by arch/powerpc/include/asm/kvm_host.h.

Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
will allow combining all of the

  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)

checks into a single

  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER

without having to worry about PPC's "bare" usage of
KVM_ARCH_WANT_MMU_NOTIFIER.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/powerpc/kvm/powerpc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7197c8256668..b0a512ede764 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -632,12 +632,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 #endif
 	case KVM_CAP_SYNC_MMU:
+#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+		BUILD_BUG();
+#endif
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 		r = hv_enabled;
-#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-		r = 1;
 #else
-		r = 0;
+		r = 1;
 #endif
 		break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-- 
2.42.0.283.g2d96d420d3-goog


* [RFC PATCH v12 04/33] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (2 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 03/33] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 05/33] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Advertise that KVM's MMU is synchronized with the primary MMU for all
flavors of PPC KVM support, i.e. advertise that the MMU is synchronized
when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y but the VM is not using hypervisor
mode (a.k.a. PR VMs).  PR VMs, via kvm_unmap_gfn_range_pr(), do the right
thing for mmu_notifier invalidation events, and more tellingly, KVM
returns '1' for KVM_CAP_SYNC_MMU when CONFIG_KVM_BOOK3S_HV_POSSIBLE=n
and CONFIG_KVM_BOOK3S_PR_POSSIBLE=y, i.e. KVM already advertises a
synchronized MMU for PR VMs, just not when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/powerpc/kvm/powerpc.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b0a512ede764..8d3ec483bc2b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -635,11 +635,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 		BUILD_BUG();
 #endif
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-		r = hv_enabled;
-#else
 		r = 1;
-#endif
 		break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 	case KVM_CAP_PPC_HTAB_FD:
-- 
2.42.0.283.g2d96d420d3-goog


* [RFC PATCH v12 05/33] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (3 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 04/33] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-10-09 16:42   ` Anup Patel
  2023-09-14  1:55 ` [RFC PATCH v12 06/33] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
                   ` (27 subsequent siblings)
  32 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior.  Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared.  PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/include/asm/kvm_host.h   |  2 --
 arch/arm64/kvm/Kconfig              |  2 +-
 arch/mips/include/asm/kvm_host.h    |  2 --
 arch/mips/kvm/Kconfig               |  2 +-
 arch/powerpc/include/asm/kvm_host.h |  2 --
 arch/powerpc/kvm/Kconfig            |  8 ++++----
 arch/powerpc/kvm/powerpc.c          |  4 +---
 arch/riscv/include/asm/kvm_host.h   |  2 --
 arch/riscv/kvm/Kconfig              |  2 +-
 arch/x86/include/asm/kvm_host.h     |  2 --
 arch/x86/kvm/Kconfig                |  2 +-
 include/linux/kvm_host.h            |  6 +++---
 include/linux/kvm_types.h           |  1 +
 virt/kvm/Kconfig                    |  4 ++++
 virt/kvm/kvm_main.c                 | 10 +++++-----
 15 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index af06ccb7ee34..9e046b64847a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -921,8 +921,6 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
 int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 			      struct kvm_vcpu_events *events);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 83c1e09be42e..1a777715199f 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,7 +22,7 @@ menuconfig KVM
 	bool "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select KVM_GENERIC_HARDWARE_ENABLING
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select PREEMPT_NOTIFIERS
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select KVM_MMIO
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 54a85f1d4f2c..179f320cc231 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -810,8 +810,6 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t start_gfn, gfn_t end_gfn);
 pgd_t *kvm_pgd_alloc(void);
 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 /* Emulation */
 enum emulation_result update_pc(struct kvm_vcpu *vcpu, u32 cause);
 int kvm_get_badinstr(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index a8cdba75f98d..c04987d2ed2e 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -25,7 +25,7 @@ config KVM
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_MMIO
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select INTERVAL_TREE
 	select KVM_GENERIC_HARDWARE_ENABLING
 	help
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..4b5c3f2acf78 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -62,8 +62,6 @@
 
 #include <linux/mmu_notifier.h>
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 #define HPTEG_CACHE_NUM			(1 << 15)
 #define HPTEG_HASH_BITS_PTE		13
 #define HPTEG_HASH_BITS_PTE_LONG	12
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 902611954200..b33358ee6424 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -42,7 +42,7 @@ config KVM_BOOK3S_64_HANDLER
 config KVM_BOOK3S_PR_POSSIBLE
 	bool
 	select KVM_MMIO
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 
 config KVM_BOOK3S_HV_POSSIBLE
 	bool
@@ -85,7 +85,7 @@ config KVM_BOOK3S_64_HV
 	tristate "KVM for POWER7 and later using hypervisor mode in host"
 	depends on KVM_BOOK3S_64 && PPC_POWERNV
 	select KVM_BOOK3S_HV_POSSIBLE
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select CMA
 	help
 	  Support running unmodified book3s_64 guest kernels in
@@ -194,7 +194,7 @@ config KVM_E500V2
 	depends on !CONTEXT_TRACKING_USER
 	select KVM
 	select KVM_MMIO
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	help
 	  Support running unmodified E500 guest kernels in virtual machines on
 	  E500v2 host processors.
@@ -211,7 +211,7 @@ config KVM_E500MC
 	select KVM
 	select KVM_MMIO
 	select KVM_BOOKE_HV
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	help
 	  Support running unmodified E500MC/E5500/E6500 guest kernels in
 	  virtual machines on E500MC/E5500/E6500 host processors.
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8d3ec483bc2b..aac75c98a956 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -632,9 +632,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 #endif
 	case KVM_CAP_SYNC_MMU:
-#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-		BUILD_BUG();
-#endif
+		BUILD_BUG_ON(!IS_ENABLED(CONFIG_KVM_GENERIC_MMU_NOTIFIER));
 		r = 1;
 		break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 1ebf20dfbaa6..66ee9ff483e9 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -249,8 +249,6 @@ struct kvm_vcpu_arch {
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 #define KVM_RISCV_GSTAGE_TLB_MIN_ORDER		12
 
 void kvm_riscv_local_hfence_gvma_vmid_gpa(unsigned long vmid,
diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
index dfc237d7875b..ae2e05f050ec 100644
--- a/arch/riscv/kvm/Kconfig
+++ b/arch/riscv/kvm/Kconfig
@@ -30,7 +30,7 @@ config KVM
 	select KVM_GENERIC_HARDWARE_ENABLING
 	select KVM_MMIO
 	select KVM_XFER_TO_GUEST_WORK
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select PREEMPT_NOTIFIERS
 	help
 	  Support hosting virtualized guest machines.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a4def36d5bb..3a2b53483524 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2131,8 +2131,6 @@ enum {
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_extint(struct kvm_vcpu *v);
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ed90f148140d..091b74599c22 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -24,7 +24,7 @@ config KVM
 	depends on HIGH_RES_TIMERS
 	depends on X86_LOCAL_APIC
 	select PREEMPT_NOTIFIERS
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_PFNCACHE
 	select HAVE_KVM_IRQFD
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 11d091688346..5faba69403ac 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -253,7 +253,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
 	pte_t pte;
 };
@@ -783,7 +783,7 @@ struct kvm {
 	struct hlist_head irq_ack_notifier_list;
 #endif
 
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_invalidate_seq;
 	long mmu_invalidate_in_progress;
@@ -1946,7 +1946,7 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
 extern const struct kvm_stats_header kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
 	if (unlikely(kvm->mmu_invalidate_in_progress))
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 6f4737d5046a..9d1f7835d8c1 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -6,6 +6,7 @@
 struct kvm;
 struct kvm_async_pf;
 struct kvm_device_ops;
+struct kvm_gfn_range;
 struct kvm_interrupt;
 struct kvm_irq_routing_table;
 struct kvm_memory_slot;
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 484d0873061c..ecae2914c97e 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -92,3 +92,7 @@ config HAVE_KVM_PM_NOTIFIER
 
 config KVM_GENERIC_HARDWARE_ENABLING
        bool
+
+config KVM_GENERIC_MMU_NOTIFIER
+       select MMU_NOTIFIER
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4fad3b01dc1f..8d21757cd5e9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -535,7 +535,7 @@ void kvm_destroy_vcpus(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
 
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 {
 	return container_of(mn, struct kvm, mmu_notifier);
@@ -960,14 +960,14 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
 }
 
-#else  /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
+#else  /* !CONFIG_KVM_GENERIC_MMU_NOTIFIER */
 
 static int kvm_init_mmu_notifier(struct kvm *kvm)
 {
 	return 0;
 }
 
-#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
+#endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_pm_notifier_call(struct notifier_block *bl,
@@ -1287,7 +1287,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 out_err_no_debugfs:
 	kvm_coalesced_mmio_free(kvm);
 out_no_coalesced_mmio:
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 	if (kvm->mmu_notifier.ops)
 		mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
 #endif
@@ -1347,7 +1347,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 		kvm->buses[i] = NULL;
 	}
 	kvm_coalesced_mmio_free(kvm);
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
 	/*
 	 * At this point, pending calls to invalidate_range_start()
-- 
2.42.0.283.g2d96d420d3-goog


* [RFC PATCH v12 06/33] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (4 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 05/33] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-15  6:59   ` Xiaoyao Li
  2023-09-14  1:55 ` [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
                   ` (26 subsequent siblings)
  32 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
information can be supplied without setting userspace up to fail.  The
padding in the new kvm_userspace_memory_region2 structure will be used to
pass a file descriptor in addition to the userspace_addr, i.e. allow
userspace to point at a file descriptor and map memory into a guest that
is NOT mapped into host userspace.

Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
makes detection of bad flags a bit more robust, e.g. if the new fd field
is guarded only by a flag and not a new ioctl(), then a userspace bug
(setting a "bad" flag) would generate an out-of-bounds access instead of an
-EINVAL error.

Cc: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c       |  2 +-
 include/linux/kvm_host.h |  4 ++--
 include/uapi/linux/kvm.h | 13 +++++++++++++
 virt/kvm/kvm_main.c      | 38 ++++++++++++++++++++++++++++++--------
 4 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..8356907079e1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12447,7 +12447,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_userspace_memory_region2 m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5faba69403ac..4e741ff27af3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1146,9 +1146,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_userspace_memory_region2 *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_userspace_memory_region2 *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 13065dd96132..bd1abe067f28 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+/* for KVM_SET_USER_MEMORY_REGION2 */
+struct kvm_userspace_memory_region2 {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 pad[16];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
  * userspace, other bits are reserved for kvm internal use which are defined
@@ -1192,6 +1202,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_USER_MEMORY2 230
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1473,6 +1484,8 @@ struct kvm_vfio_spapr_tce {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
+					 struct kvm_userspace_memory_region2)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8d21757cd5e9..7c0e38752526 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1571,7 +1571,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_userspace_memory_region2 *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
@@ -1973,7 +1973,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_userspace_memory_region2 *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -2077,7 +2077,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_userspace_memory_region2 *mem)
 {
 	int r;
 
@@ -2089,7 +2089,7 @@ int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+					  struct kvm_userspace_memory_region2 *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4559,6 +4559,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 {
 	switch (arg) {
 	case KVM_CAP_USER_MEMORY:
+	case KVM_CAP_USER_MEMORY2:
 	case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
 	case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
 	case KVM_CAP_INTERNAL_ERROR_DATA:
@@ -4814,6 +4815,14 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
 	return fd;
 }
 
+#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_userspace_memory_region, field) !=		\
+		     offsetof(struct kvm_userspace_memory_region2, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_userspace_memory_region, field) !=		\
+		     sizeof_field(struct kvm_userspace_memory_region2, field));	\
+} while (0)
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -4836,15 +4845,28 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
 		break;
 	}
+	case KVM_SET_USER_MEMORY_REGION2:
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_userspace_memory_region2 mem;
+		unsigned long size;
+
+		if (ioctl == KVM_SET_USER_MEMORY_REGION)
+			size = sizeof(struct kvm_userspace_memory_region);
+		else
+			size = sizeof(struct kvm_userspace_memory_region2);
+
+		/* Ensure the common parts of the two structs are identical. */
+		SANITY_CHECK_MEM_REGION_FIELD(slot);
+		SANITY_CHECK_MEM_REGION_FIELD(flags);
+		SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+		SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+		SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+		if (copy_from_user(&mem, argp, size))
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {
-- 
2.42.0.283.g2d96d420d3-goog


* [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (5 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 06/33] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-22  6:03   ` Xiaoyao Li
  2023-09-14  1:55 ` [RFC PATCH v12 08/33] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory Sean Christopherson
                   ` (25 subsequent siblings)
  32 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory.  With guest private memory,
there will be two kinds of memory conversions:

  - explicit conversion: happens when the guest explicitly calls into KVM
    to map a range (as private or shared)

  - implicit conversion: happens when the guest attempts to access a gfn
    that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Place "struct memory_fault" in a second anonymous union so that filling
memory_fault doesn't clobber state from other yet-to-be-fulfilled exits,
and to provide additional information if KVM does NOT ultimately exit to
userspace with KVM_EXIT_MEMORY_FAULT, e.g. if KVM suppresses (or worse,
loses) the exit, as KVM often suppresses exits for memory failures that
occur when accessing paravirt data structures.  The initial usage for
private memory will be all-or-nothing, but other features such as the
proposed "userfault on missing mappings" support will use
KVM_EXIT_MEMORY_FAULT for potentially _all_ guest memory accesses, i.e.
will run afoul of KVM's various quirks.

Use bit 3 for flagging private memory so that KVM can use bits 0-2 for
capturing RWX behavior if/when userspace needs such information.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.
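
For reference, a rough userspace sketch of consuming this exit is below.  The
vcpu_fd and the mmap()ed "run" structure are assumed to already exist, and
convert_range() is a hypothetical VMM helper, not something provided by KVM.

  /* Needs <sys/ioctl.h>, <errno.h>, <stdbool.h> and <linux/kvm.h>. */
  static void vcpu_run_once(int vcpu_fd, struct kvm_run *run)
  {
          int ret = ioctl(vcpu_fd, KVM_RUN, NULL);

          if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
              run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                  bool is_private = run->memory_fault.flags &
                                    KVM_MEMORY_EXIT_FLAG_PRIVATE;

                  /* E.g. convert the range and then re-enter the guest. */
                  convert_range(run->memory_fault.gpa,
                                run->memory_fault.size, is_private);
          }
  }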

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
Cc: Anish Moorthy <amoorthy@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst | 24 ++++++++++++++++++++++++
 include/linux/kvm_host.h       | 15 +++++++++++++++
 include/uapi/linux/kvm.h       | 24 ++++++++++++++++++++++++
 3 files changed, 63 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 21a7578142a1..e28a13439a95 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6702,6 +6702,30 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+describes properties of the faulting access that are likely pertinent:
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
+   on a private memory access.  When clear, indicates the fault occurred on a
+   shared access.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT; userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4e741ff27af3..d8c6ce6c8211 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2327,4 +2327,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+						 gpa_t gpa, gpa_t size,
+						 bool is_write, bool is_exec,
+						 bool is_private)
+{
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.size = size;
+
+	/* RWX flags are not (yet) defined or communicated to userspace. */
+	vcpu->run->memory_fault.flags = 0;
+	if (is_private)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bd1abe067f28..d2d913acf0df 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -274,6 +274,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -541,6 +542,29 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	/*
+	 * This second exit union holds structs for exit types which may be
+	 * triggered after KVM has already initiated a different exit, or which
+	 * may be ultimately dropped by KVM.
+	 *
+	 * For example, because of limitations in KVM's uAPI, KVM x86 can
+	 * generate a memory fault exit after an MMIO exit has been initiated
+	 * (exit_reason and kvm_run.mmio are filled).  Conversely, KVM often
+	 * disables paravirt features if a memory fault occurs when accessing
+	 * paravirt data instead of reporting the error to userspace.
+	 */
+	union {
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
+		/* Fix the size of the union. */
+		char padding2[256];
+	};
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 08/33] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (6 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 09/33] KVM: Drop .on_unlock() mmu_notifier hook Sean Christopherson
                   ` (24 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.

Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found.  Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 53 +++++++++++++++++++++++++++++++--------------
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7c0e38752526..76d01de7838f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -561,6 +561,19 @@ struct kvm_mmu_notifier_range {
 	bool may_block;
 };
 
+/*
+ * The inner-most helper returns a tuple containing the return value from the
+ * arch- and action-specific handler, plus a flag indicating whether or not at
+ * least one memslot was found, i.e. if the handler found guest memory.
+ *
+ * Note, most notifiers are averse to booleans, so even though KVM tracks the
+ * return from arch code as a bool, outer helpers will cast it to an int. :-(
+ */
+typedef struct kvm_mmu_notifier_return {
+	bool ret;
+	bool found_memslot;
+} kvm_mn_ret_t;
+
 /*
  * Use a dedicated stub instead of NULL to indicate that there is no callback
  * function/handler.  The compiler technically can't guarantee that a real
@@ -582,22 +595,25 @@ static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
 	     node;							     \
 	     node = interval_tree_iter_next(node, start, last))	     \
 
-static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
-						  const struct kvm_mmu_notifier_range *range)
+static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
+							   const struct kvm_mmu_notifier_range *range)
 {
-	bool ret = false, locked = false;
+	struct kvm_mmu_notifier_return r = {
+		.ret = false,
+		.found_memslot = false,
+	};
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
 	struct kvm_memslots *slots;
 	int i, idx;
 
 	if (WARN_ON_ONCE(range->end <= range->start))
-		return 0;
+		return r;
 
 	/* A null handler is allowed if and only if on_lock() is provided. */
 	if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
 			 IS_KVM_NULL_FN(range->handler)))
-		return 0;
+		return r;
 
 	idx = srcu_read_lock(&kvm->srcu);
 
@@ -631,8 +647,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
 
-			if (!locked) {
-				locked = true;
+			if (!r.found_memslot) {
+				r.found_memslot = true;
 				KVM_MMU_LOCK(kvm);
 				if (!IS_KVM_NULL_FN(range->on_lock))
 					range->on_lock(kvm);
@@ -640,14 +656,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 				if (IS_KVM_NULL_FN(range->handler))
 					break;
 			}
-			ret |= range->handler(kvm, &gfn_range);
+			r.ret |= range->handler(kvm, &gfn_range);
 		}
 	}
 
-	if (range->flush_on_ret && ret)
+	if (range->flush_on_ret && r.ret)
 		kvm_flush_remote_tlbs(kvm);
 
-	if (locked) {
+	if (r.found_memslot) {
 		KVM_MMU_UNLOCK(kvm);
 		if (!IS_KVM_NULL_FN(range->on_unlock))
 			range->on_unlock(kvm);
@@ -655,8 +671,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 
 	srcu_read_unlock(&kvm->srcu, idx);
 
-	/* The notifiers are averse to booleans. :-( */
-	return (int)ret;
+	return r;
 }
 
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
@@ -677,7 +692,7 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.may_block	= false,
 	};
 
-	return __kvm_handle_hva_range(kvm, &range);
+	return __kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
@@ -696,7 +711,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.may_block	= false,
 	};
 
-	return __kvm_handle_hva_range(kvm, &range);
+	return __kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -796,7 +811,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.end		= range->end,
 		.handler	= kvm_mmu_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
-		.on_unlock	= kvm_arch_guest_memory_reclaimed,
+		.on_unlock	= (void *)kvm_null_fn,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
@@ -828,7 +843,13 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
 					  hva_range.may_block);
 
-	__kvm_handle_hva_range(kvm, &hva_range);
+	/*
+	 * If one or more memslots were found and thus zapped, notify arch code
+	 * that guest memory has been reclaimed.  This needs to be done *after*
+	 * dropping mmu_lock, as x86's reclaim path is slooooow.
+	 */
+	if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
+		kvm_arch_guest_memory_reclaimed(kvm);
 
 	return 0;
 }
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 09/33] KVM: Drop .on_unlock() mmu_notifier hook
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (7 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 08/33] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events Sean Christopherson
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Drop the .on_unlock() mmu_notifier hook now that it's no longer used for
notifying arch code that memory has been reclaimed.  Adding .on_unlock()
and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
resulted in .on_lock() and .on_unlock() having divergent and asymmetric
behavior, and set future developers up for failure, i.e. all but asked for
bugs where KVM relied on using .on_unlock() to try to run a callback while
holding mmu_lock.

Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
guard against future bugs of this nature.

Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
Link: https://lore.kernel.org/all/20230802203119.GB2021422@ls.amr.corp.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 76d01de7838f..174de2789657 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -544,7 +544,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm);
-typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
 struct kvm_mmu_notifier_range {
 	/*
@@ -556,7 +555,6 @@ struct kvm_mmu_notifier_range {
 	union kvm_mmu_notifier_arg arg;
 	gfn_handler_t handler;
 	on_lock_fn_t on_lock;
-	on_unlock_fn_t on_unlock;
 	bool flush_on_ret;
 	bool may_block;
 };
@@ -663,11 +661,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 	if (range->flush_on_ret && r.ret)
 		kvm_flush_remote_tlbs(kvm);
 
-	if (r.found_memslot) {
+	if (r.found_memslot)
 		KVM_MMU_UNLOCK(kvm);
-		if (!IS_KVM_NULL_FN(range->on_unlock))
-			range->on_unlock(kvm);
-	}
 
 	srcu_read_unlock(&kvm->srcu, idx);
 
@@ -687,7 +682,6 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.arg		= arg,
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
-		.on_unlock	= (void *)kvm_null_fn,
 		.flush_on_ret	= true,
 		.may_block	= false,
 	};
@@ -706,7 +700,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.end		= end,
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
-		.on_unlock	= (void *)kvm_null_fn,
 		.flush_on_ret	= false,
 		.may_block	= false,
 	};
@@ -811,7 +804,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.end		= range->end,
 		.handler	= kvm_mmu_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
-		.on_unlock	= (void *)kvm_null_fn,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
@@ -856,6 +848,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 void kvm_mmu_invalidate_end(struct kvm *kvm)
 {
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
 	/*
 	 * This sequence increase will notify the kvm page fault that
 	 * the page that is going to be mapped in the spte could have
@@ -887,7 +881,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		.end		= range->end,
 		.handler	= (void *)kvm_null_fn,
 		.on_lock	= kvm_mmu_invalidate_end,
-		.on_unlock	= (void *)kvm_null_fn,
 		.flush_on_ret	= false,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (8 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 09/33] KVM: Drop .on_unlock() mmu_notifier hook Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-18  1:14   ` Binbin Wu
  2023-09-18 18:07   ` Michael Roth
  2023-09-14  1:55 ` [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes Sean Christopherson
                   ` (22 subsequent siblings)
  32 siblings, 2 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add flags to "struct kvm_gfn_range" to let notifier events target only
shared and only private mappings, and wire up the existing mmu_notifier
events to be shared-only (private memory is never associated with a
userspace virtual address, i.e. can't be reached via mmu_notifiers).

Add two flags so that KVM can handle the three possibilities (shared,
private, and shared+private) without needing something like a tri-state
enum.
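
As a rough sketch of how an arch gfn handler might consume the flags once
private memory is supported (kvm_zap_shared_range() and kvm_zap_private_range()
are hypothetical stand-ins, not functions added by this series):

  static bool example_unmap_gfn_range(struct kvm *kvm,
                                      struct kvm_gfn_range *range)
  {
          bool flush = false;

          /* Both flags clear means all mappings are in play. */
          if (!range->only_private)
                  flush |= kvm_zap_shared_range(kvm, range->slot,
                                                range->start, range->end);
          if (!range->only_shared)
                  flush |= kvm_zap_private_range(kvm, range->slot,
                                                 range->start, range->end);

          return flush;
  }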

Link: https://lore.kernel.org/all/ZJX0hk+KpQP0KUyB@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_host.h | 2 ++
 virt/kvm/kvm_main.c      | 7 +++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d8c6ce6c8211..b5373cee2b08 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -263,6 +263,8 @@ struct kvm_gfn_range {
 	gfn_t start;
 	gfn_t end;
 	union kvm_mmu_notifier_arg arg;
+	bool only_private;
+	bool only_shared;
 	bool may_block;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 174de2789657..a41f8658dfe0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 			 * the second or later invocation of the handler).
 			 */
 			gfn_range.arg = range->arg;
+
+			/*
+			 * HVA-based notifications aren't relevant to private
+			 * mappings as they don't have a userspace mapping.
+			 */
+			gfn_range.only_private = false;
+			gfn_range.only_shared = true;
 			gfn_range.may_block = range->may_block;
 
 			/*
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (9 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-15  6:32   ` Yan Zhao
                     ` (2 more replies)
  2023-09-14  1:55 ` [RFC PATCH v12 12/33] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable Sean Christopherson
                   ` (21 subsequent siblings)
  32 siblings, 3 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
userspace to operate on the per-page memory attributes.
  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
    a guest memory range.
  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
    memory attributes.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Because setting memory attributes is roughly analogous to mprotect() on
memory that is mapped into the guest, zap existing mappings prior to
updating the memory attributes.  Opportunistically provide an arch hook
for the post-set path (needed to complete invalidation anyways) in
anticipation of x86 needing the hook to update metadata related to
determining whether or not a given gfn can be backed with various sizes
of hugepages.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
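
For reference, a minimal userspace sketch of converting a range to private
(vm_fd, gpa, and size are assumed to exist and to be page aligned; error
handling omitted):

  struct kvm_memory_attributes attrs = {
          .address    = gpa,
          .size       = size,
          .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
          .flags      = 0,
  };
  __u64 supported = 0;

  ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &supported);
  if (supported & KVM_MEMORY_ATTRIBUTE_PRIVATE)
          ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);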

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst |  60 ++++++++++
 include/linux/kvm_host.h       |  18 +++
 include/uapi/linux/kvm.h       |  14 +++
 virt/kvm/Kconfig               |   4 +
 virt/kvm/kvm_main.c            | 212 +++++++++++++++++++++++++++++++++
 5 files changed, 308 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e28a13439a95..c44ef5295a12 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6070,6 +6070,56 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG
 interface. No error will be returned, but the resulting offset will not be
 applied.
 
+4.139 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
+-----------------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: u64 memory attributes bitmask(out)
+:Returns: 0 on success, <0 on error
+
+Returns the supported memory attributes bitmask.  Supported memory attributes
+will have the corresponding bits set in the u64 memory attributes bitmask.
+
+The following memory attributes are defined::
+
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
+
+4.140 KVM_SET_MEMORY_ATTRIBUTES
+-----------------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes(in/out)
+:Returns: 0 on success, <0 on error
+
+Sets memory attributes for pages in a guest memory range. Parameters are
+specified via the following structure::
+
+  struct kvm_memory_attributes {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+  };
+
+The user sets the per-page memory attributes for a guest memory range indicated
+by address/size, and in return KVM adjusts address and size to reflect which
+pages of the memory range have actually been set to the new attributes.
+If the call returns 0, "address" is updated to the last successful address + 1
+and "size" is updated to the remaining address size that has not been set
+successfully. The user should check the return value as well as the size to
+decide if the operation succeeded for the whole range or not. The user may want
+to retry the operation with the returned address/size if the previous range was
+partially successful.
+
+Both address and size should be page aligned and the supported attributes can be
+retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
+
+The "flags" field may be used for future extensions and should be set to 0s.
+
 5. The kvm_run structure
 ========================
 
@@ -8498,6 +8548,16 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
 64-bit bitmap (each bit describing a block size). The default value is
 0, to disable the eager page splitting.
 
+8.41 KVM_CAP_MEMORY_ATTRIBUTES
+------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm
+
+This capability indicates KVM supports per-page memory attributes and ioctls
+KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
+
 9. Known KVM API problems
 =========================
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b5373cee2b08..9b695391b11c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,6 +256,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
 	pte_t pte;
+	unsigned long attributes;
 };
 
 struct kvm_gfn_range {
@@ -808,6 +809,9 @@ struct kvm {
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
+#endif
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	struct xarray mem_attr_array;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
 };
@@ -2344,4 +2348,18 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
+{
+	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
+}
+
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long attrs);
+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+					struct kvm_gfn_range *range);
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+					 struct kvm_gfn_range *range);
+#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d2d913acf0df..f8642ff2eb9d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1227,6 +1227,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_USER_MEMORY2 230
+#define KVM_CAP_MEMORY_ATTRIBUTES 231
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2293,4 +2294,17 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
+#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
+#define KVM_SET_MEMORY_ATTRIBUTES              _IOW(KVMIO,  0xd3, struct kvm_memory_attributes)
+
+struct kvm_memory_attributes {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+};
+
+#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index ecae2914c97e..5bd7fcaf9089 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -96,3 +96,7 @@ config KVM_GENERIC_HARDWARE_ENABLING
 config KVM_GENERIC_MMU_NOTIFIER
        select MMU_NOTIFIER
        bool
+
+config KVM_GENERIC_MEMORY_ATTRIBUTES
+       select KVM_GENERIC_MMU_NOTIFIER
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a41f8658dfe0..2726938b684b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1218,6 +1218,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	xa_init(&kvm->mem_attr_array);
+#endif
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
@@ -1391,6 +1394,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	}
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	cleanup_srcu_struct(&kvm->srcu);
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	xa_destroy(&kvm->mem_attr_array);
+#endif
 	kvm_arch_free_vm(kvm);
 	preempt_notifier_dec();
 	hardware_disable_all();
@@ -2389,6 +2395,188 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+/*
+ * Returns true if _all_ gfns in the range [@start, @end) have attributes
+ * matching @attrs.
+ */
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long attrs)
+{
+	XA_STATE(xas, &kvm->mem_attr_array, start);
+	unsigned long index;
+	bool has_attrs;
+	void *entry;
+
+	rcu_read_lock();
+
+	if (!attrs) {
+		has_attrs = !xas_find(&xas, end);
+		goto out;
+	}
+
+	has_attrs = true;
+	for (index = start; index < end; index++) {
+		do {
+			entry = xas_next(&xas);
+		} while (xas_retry(&xas, entry));
+
+		if (xas.xa_index != index || xa_to_value(entry) != attrs) {
+			has_attrs = false;
+			break;
+		}
+	}
+
+out:
+	rcu_read_unlock();
+	return has_attrs;
+}
+
+static u64 kvm_supported_mem_attributes(struct kvm *kvm)
+{
+	return 0;
+}
+
+static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
+						 struct kvm_mmu_notifier_range *range)
+{
+	struct kvm_gfn_range gfn_range;
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	struct kvm_memslot_iter iter;
+	bool found_memslot = false;
+	bool ret = false;
+	int i;
+
+	gfn_range.arg = range->arg;
+	gfn_range.may_block = range->may_block;
+
+	/*
+	 * If/when KVM supports more attributes beyond private vs. shared, this
+	 * _could_ set only_{private,shared} appropriately if the entire target
+	 * range already has the desired private vs. shared state (it's unclear
+	 * if that is a net win).  For now, KVM reaches this point if and only
+	 * if the private flag is being toggled, i.e. all mappings are in play.
+	 */
+	gfn_range.only_private = false;
+	gfn_range.only_shared = false;
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, range->start, range->end) {
+			slot = iter.slot;
+			gfn_range.slot = slot;
+
+			gfn_range.start = max(range->start, slot->base_gfn);
+			gfn_range.end = min(range->end, slot->base_gfn + slot->npages);
+			if (gfn_range.start >= gfn_range.end)
+				continue;
+
+			if (!found_memslot) {
+				found_memslot = true;
+				KVM_MMU_LOCK(kvm);
+				if (!IS_KVM_NULL_FN(range->on_lock))
+					range->on_lock(kvm);
+			}
+
+			ret |= range->handler(kvm, &gfn_range);
+		}
+	}
+
+	if (range->flush_on_ret && ret)
+		kvm_flush_remote_tlbs(kvm);
+
+	if (found_memslot)
+		KVM_MMU_UNLOCK(kvm);
+}
+
+/* Set @attributes for the gfn range [@start, @end). */
+static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long attributes)
+{
+	struct kvm_mmu_notifier_range pre_set_range = {
+		.start = start,
+		.end = end,
+		.handler = kvm_arch_pre_set_memory_attributes,
+		.on_lock = kvm_mmu_invalidate_begin,
+		.flush_on_ret = true,
+		.may_block = true,
+	};
+	struct kvm_mmu_notifier_range post_set_range = {
+		.start = start,
+		.end = end,
+		.arg.attributes = attributes,
+		.handler = kvm_arch_post_set_memory_attributes,
+		.on_lock = kvm_mmu_invalidate_end,
+		.may_block = true,
+	};
+	unsigned long i;
+	void *entry;
+	int r = 0;
+
+	entry = attributes ? xa_mk_value(attributes) : NULL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	/* Nothing to do if the entire range has the desired attributes. */
+	if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
+		goto out_unlock;
+
+	/*
+	 * Reserve memory ahead of time to avoid having to deal with failures
+	 * partway through setting the new attributes.
+	 */
+	for (i = start; i < end; i++) {
+		r = xa_reserve(&kvm->mem_attr_array, i, GFP_KERNEL_ACCOUNT);
+		if (r)
+			goto out_unlock;
+	}
+
+	kvm_handle_gfn_range(kvm, &pre_set_range);
+
+	for (i = start; i < end; i++) {
+		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
+				    GFP_KERNEL_ACCOUNT));
+		KVM_BUG_ON(r, kvm);
+	}
+
+	kvm_handle_gfn_range(kvm, &post_set_range);
+
+out_unlock:
+	mutex_unlock(&kvm->slots_lock);
+
+	return r;
+}
+static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
+					   struct kvm_memory_attributes *attrs)
+{
+	gfn_t start, end;
+
+	/* flags is currently not used. */
+	if (attrs->flags)
+		return -EINVAL;
+	if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
+		return -EINVAL;
+	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
+		return -EINVAL;
+	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
+		return -EINVAL;
+
+	start = attrs->address >> PAGE_SHIFT;
+	end = (attrs->address + attrs->size) >> PAGE_SHIFT;
+
+	/*
+	 * xarray tracks data using "unsigned long", and as a result so does
+	 * KVM.  For simplicity, generic attributes are supported only on
+	 * 64-bit architectures.
+	 */
+	BUILD_BUG_ON(sizeof(attrs->attributes) != sizeof(unsigned long));
+
+	return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
+}
+#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
 	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
@@ -4587,6 +4775,9 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #ifdef CONFIG_HAVE_KVM_MSI
 	case KVM_CAP_SIGNAL_MSI:
 #endif
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	case KVM_CAP_MEMORY_ATTRIBUTES:
+#endif
 #ifdef CONFIG_HAVE_KVM_IRQFD
 	case KVM_CAP_IRQFD:
 #endif
@@ -5015,6 +5206,27 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
+		u64 attrs = kvm_supported_mem_attributes(kvm);
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_MEMORY_ATTRIBUTES: {
+		struct kvm_memory_attributes attrs;
+
+		r = -EFAULT;
+		if (copy_from_user(&attrs, argp, sizeof(attrs)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
+		break;
+	}
+#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 	case KVM_CREATE_DEVICE: {
 		struct kvm_create_device cd;
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 12/33] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (10 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 13/33] security: Export security_inode_init_security_anon() for use by KVM Sean Christopherson
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add an "unmovable" flag for mappings that cannot be migrated under any
circumstance.  KVM will use the flag for its upcoming GUEST_MEMFD support,
which will not support compaction/migration, at least not in the
foreseeable future.

Test AS_UNMOVABLE under folio lock as already done for the async
compaction/dirty folio case, as the mapping can be removed by truncation
while compaction is running.  To avoid having to lock every folio with a
mapping, assume/require that unmovable mappings are also unevictable, and
have mapping_set_unmovable() also set AS_UNEVICTABLE.
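
As an illustrative sketch (the actual GUEST_MEMFD user arrives in a later
patch), a backing store that never wants its folios migrated would mark its
mapping at inode creation time:

  static void example_mark_unmovable(struct inode *inode)
  {
          /*
           * Sets both AS_UNMOVABLE and AS_UNEVICTABLE; compaction relies on
           * the latter to skip such folios without taking the folio lock,
           * and migration will reject them with -EOPNOTSUPP.
           */
          mapping_set_unmovable(inode->i_mapping);
  }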

Cc: Matthew Wilcox <willy@infradead.org>
Co-developed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/pagemap.h | 19 +++++++++++++++++-
 mm/compaction.c         | 43 +++++++++++++++++++++++++++++------------
 mm/migrate.c            |  2 ++
 3 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 351c3b7f93a1..82c9bf506b79 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -203,7 +203,8 @@ enum mapping_flags {
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
 	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
+	AS_RELEASE_ALWAYS = 7,	/* Call ->release_folio(), even if no private data */
+	AS_UNMOVABLE	= 8,	/* The mapping cannot be moved, ever */
 };
 
 /**
@@ -289,6 +290,22 @@ static inline void mapping_clear_release_always(struct address_space *mapping)
 	clear_bit(AS_RELEASE_ALWAYS, &mapping->flags);
 }
 
+static inline void mapping_set_unmovable(struct address_space *mapping)
+{
+	/*
+	 * It's expected unmovable mappings are also unevictable. Compaction
+	 * migrate scanner (isolate_migratepages_block()) relies on this to
+	 * reduce page locking.
+	 */
+	set_bit(AS_UNEVICTABLE, &mapping->flags);
+	set_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
+static inline bool mapping_unmovable(struct address_space *mapping)
+{
+	return test_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return mapping->gfp_mask;
diff --git a/mm/compaction.c b/mm/compaction.c
index 38c8d216c6a3..12b828aed7c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -883,6 +883,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 	/* Time to isolate some pages for migration */
 	for (; low_pfn < end_pfn; low_pfn++) {
+		bool is_dirty, is_unevictable;
 
 		if (skip_on_failure && low_pfn >= next_skip_pfn) {
 			/*
@@ -1080,8 +1081,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		if (!folio_test_lru(folio))
 			goto isolate_fail_put;
 
+		is_unevictable = folio_test_unevictable(folio);
+
 		/* Compaction might skip unevictable pages but CMA takes them */
-		if (!(mode & ISOLATE_UNEVICTABLE) && folio_test_unevictable(folio))
+		if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
 			goto isolate_fail_put;
 
 		/*
@@ -1093,26 +1096,42 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_writeback(folio))
 			goto isolate_fail_put;
 
-		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
-			bool migrate_dirty;
+		is_dirty = folio_test_dirty(folio);
+
+		if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) ||
+		    (mapping && is_unevictable)) {
+			bool migrate_dirty = true;
+			bool is_unmovable;
 
 			/*
 			 * Only folios without mappings or that have
-			 * a ->migrate_folio callback are possible to
-			 * migrate without blocking.  However, we may
-			 * be racing with truncation, which can free
-			 * the mapping.  Truncation holds the folio lock
-			 * until after the folio is removed from the page
-			 * cache so holding it ourselves is sufficient.
+			 * a ->migrate_folio callback are possible to migrate
+			 * without blocking.
+			 *
+			 * Folios from unmovable mappings are not migratable.
+			 *
+			 * However, we can be racing with truncation, which can
+			 * free the mapping that we need to check. Truncation
+			 * holds the folio lock until after the folio is removed
+			 * from the page cache so holding it ourselves is sufficient.
+			 *
+			 * To avoid locking the folio just to check unmovable,
+			 * assume every unmovable folio is also unevictable,
+			 * which is a cheaper test.  If our assumption goes
+			 * wrong, it's not a correctness bug, just potentially
+			 * wasted cycles.
 			 */
 			if (!folio_trylock(folio))
 				goto isolate_fail_put;
 
 			mapping = folio_mapping(folio);
-			migrate_dirty = !mapping ||
-					mapping->a_ops->migrate_folio;
+			if ((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) {
+				migrate_dirty = !mapping ||
+						mapping->a_ops->migrate_folio;
+			}
+			is_unmovable = mapping && mapping_unmovable(mapping);
 			folio_unlock(folio);
-			if (!migrate_dirty)
+			if (!migrate_dirty || is_unmovable)
 				goto isolate_fail_put;
 		}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index b7fa020003f3..3d25c145098d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -953,6 +953,8 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
 
 		if (!mapping)
 			rc = migrate_folio(mapping, dst, src, mode);
+		else if (mapping_unmovable(mapping))
+			rc = -EOPNOTSUPP;
 		else if (mapping->a_ops->migrate_folio)
 			/*
 			 * Most folios have a mapping and most filesystems
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 13/33] security: Export security_inode_init_security_anon() for use by KVM
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (11 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 12/33] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

TODO: Throw this away, assuming KVM drops its dedicated file system.

Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 security/security.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/security/security.c b/security/security.c
index 23b129d482a7..0024156f867a 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1693,6 +1693,7 @@ int security_inode_init_security_anon(struct inode *inode,
 	return call_int_hook(inode_init_security_anon, 0, inode, name,
 			     context_inode);
 }
+EXPORT_SYMBOL_GPL(security_inode_init_security_anon);
 
 #ifdef CONFIG_SECURITY_PATH
 /**
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (12 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 13/33] security: Export security_inode_init_security_anon() for use by KVM Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-15  6:11   ` Yan Zhao
                     ` (3 more replies)
  2023-09-14  1:55 ` [RFC PATCH v12 15/33] KVM: Add transparent hugepage support for dedicated guest memory Sean Christopherson
                   ` (18 subsequent siblings)
  32 siblings, 4 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

TODO
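
A rough userspace sketch, based only on the uAPI added below, of creating a
guest_memfd and binding it to a private memslot (vm_fd, gpa, size, and
shared_backing are assumed; KVM_CREATE_GUEST_MEMFD is expected to return the
new file descriptor; error handling omitted):

  struct kvm_create_guest_memfd gmem = {
          .size  = size,
          .flags = 0,
  };
  struct kvm_userspace_memory_region2 region = {
          .slot            = 0,
          .flags           = KVM_MEM_PRIVATE,
          .guest_phys_addr = gpa,
          .memory_size     = size,
          .userspace_addr  = (unsigned long)shared_backing, /* e.g. from mmap() */
          .gmem_offset     = 0,
  };

  region.gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);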

Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_host.h   |  48 +++
 include/uapi/linux/kvm.h   |  15 +-
 include/uapi/linux/magic.h |   1 +
 virt/kvm/Kconfig           |   4 +
 virt/kvm/Makefile.kvm      |   1 +
 virt/kvm/guest_mem.c       | 593 +++++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c        |  61 +++-
 virt/kvm/kvm_mm.h          |  38 +++
 8 files changed, 756 insertions(+), 5 deletions(-)
 create mode 100644 virt/kvm/guest_mem.c

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9b695391b11c..18d8f02a99a3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -591,8 +591,20 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	struct {
+		struct file __rcu *file;
+		pgoff_t pgoff;
+	} gmem;
+#endif
 };
 
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -687,6 +699,17 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
+/*
+ * Arch code must define kvm_arch_has_private_mem if support for private memory
+ * is enabled.
+ */
+#if !defined(kvm_arch_has_private_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
+static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
+{
+	return false;
+}
+#endif
+
 struct kvm_memslots {
 	u64 generation;
 	atomic_long_t last_used_slot;
@@ -1401,6 +1424,7 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_invalidate_begin(struct kvm *kvm);
 void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
 void kvm_mmu_invalidate_end(struct kvm *kvm);
+bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -2360,6 +2384,30 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range);
+
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
+	       kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+#else
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+			      gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+#else
+static inline int kvm_gmem_get_pfn(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, gfn_t gfn,
+				   kvm_pfn_t *pfn, int *max_order)
+{
+	KVM_BUG_ON(1, kvm);
+	return -EIO;
+}
+#endif /* CONFIG_KVM_PRIVATE_MEM */
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f8642ff2eb9d..b6f90a273e2e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -102,7 +102,10 @@ struct kvm_userspace_memory_region2 {
 	__u64 guest_phys_addr;
 	__u64 memory_size;
 	__u64 userspace_addr;
-	__u64 pad[16];
+	__u64 gmem_offset;
+	__u32 gmem_fd;
+	__u32 pad1;
+	__u64 pad2[14];
 };
 
 /*
@@ -112,6 +115,7 @@ struct kvm_userspace_memory_region2 {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1228,6 +1232,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_USER_MEMORY2 230
 #define KVM_CAP_MEMORY_ATTRIBUTES 231
+#define KVM_CAP_GUEST_MEMFD 232
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2307,4 +2312,12 @@ struct kvm_memory_attributes {
 
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
+#define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
+
+struct kvm_create_guest_memfd {
+	__u64 size;
+	__u64 flags;
+	__u64 reserved[6];
+};
+
 #endif /* __LINUX_KVM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..afe9c376c9a5 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define KVM_GUEST_MEMORY_MAGIC	0x474d454d	/* "GMEM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 5bd7fcaf9089..08afef022db9 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -100,3 +100,7 @@ config KVM_GENERIC_MMU_NOTIFIER
 config KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_GENERIC_MMU_NOTIFIER
        bool
+
+config KVM_PRIVATE_MEM
+       select XARRAY_MULTI
+       bool
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index 2c27d5d0c367..a5a61bbe7f4c 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -12,3 +12,4 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
 kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
 kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
 kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
+kvm-$(CONFIG_KVM_PRIVATE_MEM) += $(KVM)/guest_mem.o
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
new file mode 100644
index 000000000000..0dd3f836cf9c
--- /dev/null
+++ b/virt/kvm/guest_mem.c
@@ -0,0 +1,593 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/backing-dev.h>
+#include <linux/falloc.h>
+#include <linux/kvm_host.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+
+#include <uapi/linux/magic.h>
+
+#include "kvm_mm.h"
+
+static struct vfsmount *kvm_gmem_mnt;
+
+struct kvm_gmem {
+	struct kvm *kvm;
+	struct xarray bindings;
+	struct list_head entry;
+};
+
+static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
+{
+	struct folio *folio;
+
+	/* TODO: Support huge pages. */
+	folio = filemap_grab_folio(file->f_mapping, index);
+	if (IS_ERR_OR_NULL(folio))
+		return NULL;
+
+	/*
+	 * Use the up-to-date flag to track whether or not the memory has been
+	 * zeroed before being handed off to the guest.  There is no backing
+	 * storage for the memory, so the folio will remain up-to-date until
+	 * it's removed.
+	 *
+	 * TODO: Skip clearing pages when trusted firmware will do it when
+	 * assigning memory to the guest.
+	 */
+	if (!folio_test_uptodate(folio)) {
+		unsigned long nr_pages = folio_nr_pages(folio);
+		unsigned long i;
+
+		for (i = 0; i < nr_pages; i++)
+			clear_highpage(folio_page(folio, i));
+
+		folio_mark_uptodate(folio);
+	}
+
+	/*
+	 * Ignore accessed, referenced, and dirty flags.  The memory is
+	 * unevictable and there is no storage to write back to.
+	 */
+	return folio;
+}
+
+static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
+				      pgoff_t end)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = gmem->kvm;
+	unsigned long index;
+	bool flush = false;
+
+	KVM_MMU_LOCK(kvm);
+
+	kvm_mmu_invalidate_begin(kvm);
+
+	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+		pgoff_t pgoff = slot->gmem.pgoff;
+
+		struct kvm_gfn_range gfn_range = {
+			.start = slot->base_gfn + max(pgoff, start) - pgoff,
+			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
+			.slot = slot,
+			.may_block = true,
+		};
+
+		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+	KVM_MMU_UNLOCK(kvm);
+}
+
+static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
+				    pgoff_t end)
+{
+	struct kvm *kvm = gmem->kvm;
+
+	KVM_MMU_LOCK(kvm);
+	if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT))
+		kvm_mmu_invalidate_end(kvm);
+	KVM_MMU_UNLOCK(kvm);
+}
+
+static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct list_head *gmem_list = &inode->i_mapping->private_list;
+	pgoff_t start = offset >> PAGE_SHIFT;
+	pgoff_t end = (offset + len) >> PAGE_SHIFT;
+	struct kvm_gmem *gmem;
+
+	/*
+	 * Bindings must be stable across invalidation to ensure the start+end
+	 * are balanced.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+
+	list_for_each_entry(gmem, gmem_list, entry) {
+		kvm_gmem_invalidate_begin(gmem, start, end);
+		kvm_gmem_invalidate_end(gmem, start, end);
+	}
+
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	return 0;
+}
+
+static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t start, index, end;
+	int r;
+
+	/* Guest memfd size is immutable; allocation can't grow the file. */
+	if (offset + len > i_size_read(inode))
+		return -EINVAL;
+
+	filemap_invalidate_lock_shared(mapping);
+
+	start = offset >> PAGE_SHIFT;
+	end = (offset + len) >> PAGE_SHIFT;
+
+	r = 0;
+	for (index = start; index < end; ) {
+		struct folio *folio;
+
+		if (signal_pending(current)) {
+			r = -EINTR;
+			break;
+		}
+
+		folio = kvm_gmem_get_folio(inode, index);
+		if (!folio) {
+			r = -ENOMEM;
+			break;
+		}
+
+		index = folio_next_index(folio);
+
+		folio_unlock(folio);
+		folio_put(folio);
+
+		/* 64-bit only, wrapping the index should be impossible. */
+		if (WARN_ON_ONCE(!index))
+			break;
+
+		cond_resched();
+	}
+
+	filemap_invalidate_unlock_shared(mapping);
+
+	return r;
+}
+
+static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
+			       loff_t len)
+{
+	int ret;
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE))
+		return -EOPNOTSUPP;
+
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+		return -EOPNOTSUPP;
+
+	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+		return -EINVAL;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE)
+		ret = kvm_gmem_punch_hole(file_inode(file), offset, len);
+	else
+		ret = kvm_gmem_allocate(file_inode(file), offset, len);
+
+	if (!ret)
+		file_modified(file);
+	return ret;
+}
+
+static int kvm_gmem_release(struct inode *inode, struct file *file)
+{
+	struct kvm_gmem *gmem = file->private_data;
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = gmem->kvm;
+	unsigned long index;
+
+	/*
+	 * Prevent concurrent attempts to *unbind* a memslot.  This is the last
+	 * reference to the file and thus no new bindings can be created, but
+	 * dereferencing the slot for existing bindings needs to be protected
+	 * against memslot updates, specifically so that unbind doesn't race
+	 * and free the memslot (kvm_gmem_get_file() will return NULL).
+	 */
+	mutex_lock(&kvm->slots_lock);
+
+	filemap_invalidate_lock(inode->i_mapping);
+
+	xa_for_each(&gmem->bindings, index, slot)
+		rcu_assign_pointer(slot->gmem.file, NULL);
+
+	synchronize_rcu();
+
+	/*
+	 * All in-flight operations are gone and new bindings can be created.
+	 * Zap all SPTEs pointed at by this file.  Do not free the backing
+	 * memory, as its lifetime is associated with the inode, not the file.
+	 */
+	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+	kvm_gmem_invalidate_end(gmem, 0, -1ul);
+
+	list_del(&gmem->entry);
+
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	mutex_unlock(&kvm->slots_lock);
+
+	xa_destroy(&gmem->bindings);
+	kfree(gmem);
+
+	kvm_put_kvm(kvm);
+
+	return 0;
+}
+
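+/*
+ * Grab a reference to the file backing @slot's private memory.  Returns NULL
+ * if the file has been closed (or is being closed), i.e. if kvm_gmem_release()
+ * has already cleared slot->gmem.file or the last file reference is being
+ * dropped.  Pairs with the RCU-protected assignments in kvm_gmem_bind(),
+ * kvm_gmem_unbind() and kvm_gmem_release().
+ */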
+static struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
+{
+	struct file *file;
+
+	rcu_read_lock();
+
+	file = rcu_dereference(slot->gmem.file);
+	if (file && !get_file_rcu(file))
+		file = NULL;
+
+	rcu_read_unlock();
+
+	return file;
+}
+
+static const struct file_operations kvm_gmem_fops = {
+	.open		= generic_file_open,
+	.release	= kvm_gmem_release,
+	.fallocate	= kvm_gmem_fallocate,
+};
+
+static int kvm_gmem_migrate_folio(struct address_space *mapping,
+				  struct folio *dst, struct folio *src,
+				  enum migrate_mode mode)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+
+static int kvm_gmem_error_page(struct address_space *mapping, struct page *page)
+{
+	struct list_head *gmem_list = &mapping->private_list;
+	struct kvm_gmem *gmem;
+	pgoff_t start, end;
+
+	filemap_invalidate_lock_shared(mapping);
+
+	start = page->index;
+	end = start + thp_nr_pages(page);
+
+	list_for_each_entry(gmem, gmem_list, entry)
+		kvm_gmem_invalidate_begin(gmem, start, end);
+
+	/*
+	 * Do not truncate the range, what action is taken in response to the
+	 * error is userspace's decision (assuming the architecture supports
+	 * gracefully handling memory errors).  If/when the guest attempts to
+	 * access a poisoned page, kvm_gmem_get_pfn() will return -EHWPOISON,
+	 * at which point KVM can either terminate the VM or propagate the
+	 * error to userspace.
+	 */
+
+	list_for_each_entry(gmem, gmem_list, entry)
+		kvm_gmem_invalidate_end(gmem, start, end);
+
+	filemap_invalidate_unlock_shared(mapping);
+
+	return MF_DELAYED;
+}
+
+static const struct address_space_operations kvm_gmem_aops = {
+	.dirty_folio = noop_dirty_folio,
+#ifdef CONFIG_MIGRATION
+	.migrate_folio	= kvm_gmem_migrate_folio,
+#endif
+	.error_remove_page = kvm_gmem_error_page,
+};
+
+static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
+			    struct kstat *stat, u32 request_mask,
+			    unsigned int query_flags)
+{
+	struct inode *inode = path->dentry->d_inode;
+
+	/* TODO */
+	generic_fillattr(idmap, request_mask, inode, stat);
+	return 0;
+}
+
+static int kvm_gmem_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+			    struct iattr *attr)
+{
+	/* TODO */
+	return -EINVAL;
+}
+static const struct inode_operations kvm_gmem_iops = {
+	.getattr	= kvm_gmem_getattr,
+	.setattr	= kvm_gmem_setattr,
+};
+
+static int __kvm_gmem_create(struct kvm *kvm, loff_t size, struct vfsmount *mnt)
+{
+	const char *anon_name = "[kvm-gmem]";
+	const struct qstr qname = QSTR_INIT(anon_name, strlen(anon_name));
+	struct kvm_gmem *gmem;
+	struct inode *inode;
+	struct file *file;
+	int fd, err;
+
+	inode = alloc_anon_inode(mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	err = security_inode_init_security_anon(inode, &qname, NULL);
+	if (err)
+		goto err_inode;
+
+	inode->i_private = (void *)(unsigned long)flags;
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_unmovable(inode->i_mapping);
+	/* Unmovable mappings are supposed to be marked unevictable as well. */
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	fd = get_unused_fd_flags(0);
+	if (fd < 0) {
+		err = fd;
+		goto err_inode;
+	}
+
+	file = alloc_file_pseudo(inode, mnt, "kvm-gmem", O_RDWR, &kvm_gmem_fops);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+	file->f_mapping = inode->i_mapping;
+
+	gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
+	if (!gmem) {
+		err = -ENOMEM;
+		goto err_file;
+	}
+
+	kvm_get_kvm(kvm);
+	gmem->kvm = kvm;
+	xa_init(&gmem->bindings);
+
+	file->private_data = gmem;
+
+	list_add(&gmem->entry, &inode->i_mapping->private_list);
+
+	fd_install(fd, file);
+	return fd;
+
+err_file:
+	fput(file);
+err_fd:
+	put_unused_fd(fd);
+err_inode:
+	iput(inode);
+	return err;
+}
+
+static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
+{
+	if (size < 0 || !PAGE_ALIGNED(size))
+		return false;
+
+	return true;
+}
+
+int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
+{
+	loff_t size = args->size;
+	u64 flags = args->flags;
+	u64 valid_flags = 0;
+
+	if (flags & ~valid_flags)
+		return -EINVAL;
+
+	if (!kvm_gmem_is_valid_size(size, flags))
+		return -EINVAL;
+
+	return __kvm_gmem_create(kvm, size, flags, kvm_gmem_mnt);
+}
+
+int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
+		  unsigned int fd, loff_t offset)
+{
+	loff_t size = slot->npages << PAGE_SHIFT;
+	unsigned long start, end, flags;
+	struct kvm_gmem *gmem;
+	struct inode *inode;
+	struct file *file;
+
+	BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
+
+	file = fget(fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kvm_gmem_fops)
+		goto err;
+
+	gmem = file->private_data;
+	if (gmem->kvm != kvm)
+		goto err;
+
+	inode = file_inode(file);
+	flags = (unsigned long)inode->i_private;
+
+	/*
+	 * For simplicity, require the offset into the file and the size of the
+	 * memslot to be aligned to the largest possible page size used to back
+	 * the file (same as the size of the file itself).
+	 */
+	if (!kvm_gmem_is_valid_size(offset, flags) ||
+	    !kvm_gmem_is_valid_size(size, flags))
+		goto err;
+
+	if (offset + size > i_size_read(inode))
+		goto err;
+
+	filemap_invalidate_lock(inode->i_mapping);
+
+	start = offset >> PAGE_SHIFT;
+	end = start + slot->npages;
+
+	if (!xa_empty(&gmem->bindings) &&
+	    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
+		filemap_invalidate_unlock(inode->i_mapping);
+		goto err;
+	}
+
+	/*
+	 * No synchronize_rcu() needed, any in-flight readers are guaranteed to
+	 * see either a NULL file or this new file, no need for them to go
+	 * away.
+	 */
+	rcu_assign_pointer(slot->gmem.file, file);
+	slot->gmem.pgoff = start;
+
+	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	/*
+	 * Drop the reference to the file, even on success.  The file pins KVM,
+	 * not the other way 'round.  Active bindings are invalidated if the
+	 * file is closed before memslots are destroyed.
+	 */
+	fput(file);
+	return 0;
+
+err:
+	fput(file);
+	return -EINVAL;
+}
+
+void kvm_gmem_unbind(struct kvm_memory_slot *slot)
+{
+	unsigned long start = slot->gmem.pgoff;
+	unsigned long end = start + slot->npages;
+	struct kvm_gmem *gmem;
+	struct file *file;
+
+	/*
+	 * Nothing to do if the underlying file was already closed (or is being
+	 * closed right now), kvm_gmem_release() invalidates all bindings.
+	 */
+	file = kvm_gmem_get_file(slot);
+	if (!file)
+		return;
+
+	gmem = file->private_data;
+
+	filemap_invalidate_lock(file->f_mapping);
+	xa_store_range(&gmem->bindings, start, end - 1, NULL, GFP_KERNEL);
+	rcu_assign_pointer(slot->gmem.file, NULL);
+	synchronize_rcu();
+	filemap_invalidate_unlock(file->f_mapping);
+
+	fput(file);
+}
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+	struct kvm_gmem *gmem;
+	struct folio *folio;
+	struct page *page;
+	struct file *file;
+	int r;
+
+	file = kvm_gmem_get_file(slot);
+	if (!file)
+		return -EFAULT;
+
+	gmem = file->private_data;
+
+	if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
+		r = -EIO;
+		goto out_fput;
+	}
+
+	folio = kvm_gmem_get_folio(file_inode(file), index);
+	if (!folio) {
+		r = -ENOMEM;
+		goto out_fput;
+	}
+
+	if (folio_test_hwpoison(folio)) {
+		r = -EHWPOISON;
+		goto out_unlock;
+	}
+
+	page = folio_file_page(folio, index);
+
+	*pfn = page_to_pfn(page);
+	if (max_order)
+		*max_order = compound_order(compound_head(page));
+	r = 0;
+
+out_unlock:
+	folio_unlock(folio);
+out_fput:
+	fput(file);
+
+	return r;
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
+
+static int kvm_gmem_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, KVM_GUEST_MEMORY_MAGIC))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static struct file_system_type kvm_gmem_fs = {
+	.name		 = "kvm_guest_memory",
+	.init_fs_context = kvm_gmem_init_fs_context,
+	.kill_sb	 = kill_anon_super,
+};
+
+int kvm_gmem_init(void)
+{
+	kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
+	if (IS_ERR(kvm_gmem_mnt))
+		return PTR_ERR(kvm_gmem_mnt);
+
+	/* For giggles.  Userspace can never map this anyways. */
+	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
+
+	return 0;
+}
+
+void kvm_gmem_exit(void)
+{
+	kern_unmount(kvm_gmem_mnt);
+	kvm_gmem_mnt = NULL;
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2726938b684b..68a6119e09e4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -796,7 +796,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
 	}
 }
 
-static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
 	return kvm_unmap_gfn_range(kvm, range);
@@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 /* This does not remove the slot from struct kvm_memslots data structures */
 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
+	if (slot->flags & KVM_MEM_PRIVATE)
+		kvm_gmem_unbind(slot);
+
 	kvm_destroy_dirty_bitmap(slot);
 
 	kvm_arch_free_memslot(kvm, slot);
@@ -1598,10 +1601,18 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region2 *mem)
+static int check_memory_region_flags(struct kvm *kvm,
+				     const struct kvm_userspace_memory_region2 *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
+	if (kvm_arch_has_private_mem(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+
+	/* Dirty logging private memory is not currently supported. */
+	if (mem->flags & KVM_MEM_PRIVATE)
+		valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
+
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
@@ -2010,7 +2021,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
@@ -2029,6 +2040,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size))
 		return -EINVAL;
+	if (mem->flags & KVM_MEM_PRIVATE &&
+	    (mem->gmem_offset & (PAGE_SIZE - 1) ||
+	     mem->gmem_offset + mem->memory_size < mem->gmem_offset))
+		return -EINVAL;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -2067,6 +2082,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
+		/* Private memslots are immutable, they can only be deleted. */
+		if (mem->flags & KVM_MEM_PRIVATE)
+			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2095,10 +2113,23 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
+	if (mem->flags & KVM_MEM_PRIVATE) {
+		r = kvm_gmem_bind(kvm, new, mem->gmem_fd, mem->gmem_offset);
+		if (r)
+			goto out;
+	}
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
-		kfree(new);
+		goto out_restricted;
+
+	return 0;
+
+out_restricted:
+	if (mem->flags & KVM_MEM_PRIVATE)
+		kvm_gmem_unbind(new);
+out:
+	kfree(new);
 	return r;
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
@@ -2434,6 +2465,8 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
+	if (kvm_arch_has_private_mem(kvm))
+		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
 	return 0;
 }
 
@@ -4824,6 +4857,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA:
 		return 1;
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	case KVM_CAP_GUEST_MEMFD:
+		return !kvm || kvm_arch_has_private_mem(kvm);
+#endif
 	default:
 		break;
 	}
@@ -5254,6 +5291,16 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_GET_STATS_FD:
 		r = kvm_vm_ioctl_get_stats_fd(kvm);
 		break;
+	case KVM_CREATE_GUEST_MEMFD: {
+		struct kvm_create_guest_memfd guest_memfd;
+
+		r = -EFAULT;
+		if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd)))
+			goto out;
+
+		r = kvm_gmem_create(kvm, &guest_memfd);
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
@@ -6375,6 +6422,10 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	if (r)
 		goto err_async_pf;
 
+	r = kvm_gmem_init();
+	if (r)
+		goto err_gmem;
+
 	kvm_chardev_ops.owner = module;
 
 	kvm_preempt_ops.sched_in = kvm_sched_in;
@@ -6401,6 +6452,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 err_register:
 	kvm_vfio_ops_exit();
 err_vfio:
+	kvm_gmem_exit();
+err_gmem:
 	kvm_async_pf_deinit();
 err_async_pf:
 	kvm_irqfd_exit();
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 180f1a09e6ba..798f20d612bb 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -37,4 +37,42 @@ static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 }
 #endif /* HAVE_KVM_PFNCACHE */
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+int kvm_gmem_init(void);
+void kvm_gmem_exit(void);
+int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
+int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
+		  unsigned int fd, loff_t offset);
+void kvm_gmem_unbind(struct kvm_memory_slot *slot);
+#else
+static inline int kvm_gmem_init(void)
+{
+	return 0;
+}
+
+static inline void kvm_gmem_exit(void)
+{
+
+}
+
+static inline int kvm_gmem_create(struct kvm *kvm,
+				  struct kvm_create_guest_memfd *args)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int kvm_gmem_bind(struct kvm *kvm,
+					 struct kvm_memory_slot *slot,
+					 unsigned int fd, loff_t offset)
+{
+	WARN_ON_ONCE(1);
+	return -EIO;
+}
+
+static inline void kvm_gmem_unbind(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+}
+#endif /* CONFIG_KVM_PRIVATE_MEM */
+
 #endif /* __KVM_MM_H__ */
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 15/33] KVM: Add transparent hugepage support for dedicated guest memory
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (13 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 16/33] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN Sean Christopherson
                   ` (17 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

TODO: writeme
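
A rough sketch of the intended usage from userspace (hypothetical snippet, not
part of the patch; vm_fd is an open VM fd, #includes and most error handling
are elided, and 2MiB stands in for HPAGE_PMD_SIZE on x86):

	struct kvm_create_guest_memfd gmem = {
		/* Size must be 2MiB-aligned when hugepages are requested. */
		.size	= 512ull * 1024 * 1024,
		.flags	= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
	};
	int gmem_fd;

	gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	if (gmem_fd < 0)
		err(1, "KVM_CREATE_GUEST_MEMFD");

	/*
	 * Allocation opportunistically uses 2MiB folios and silently falls
	 * back to order-0 folios, e.g. if a hugepage can't be allocated.
	 */
	if (fallocate(gmem_fd, FALLOC_FL_KEEP_SIZE, 0, gmem.size))
		err(1, "fallocate");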

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/uapi/linux/kvm.h |  2 ++
 virt/kvm/guest_mem.c     | 54 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b6f90a273e2e..2df18796fd8e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2314,6 +2314,8 @@ struct kvm_memory_attributes {
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
 
+#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE		(1ULL << 0)
+
 struct kvm_create_guest_memfd {
 	__u64 size;
 	__u64 flags;
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index 0dd3f836cf9c..a819367434e9 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -17,15 +17,48 @@ struct kvm_gmem {
 	struct list_head entry;
 };
 
-static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
+static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
 {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long huge_index = round_down(index, HPAGE_PMD_NR);
+	unsigned long flags = (unsigned long)inode->i_private;
+	struct address_space *mapping  = inode->i_mapping;
+	gfp_t gfp = mapping_gfp_mask(mapping);
 	struct folio *folio;
 
-	/* TODO: Support huge pages. */
-	folio = filemap_grab_folio(file->f_mapping, index);
-	if (IS_ERR_OR_NULL(folio))
+	if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE))
 		return NULL;
 
+	if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT,
+				   (huge_index + HPAGE_PMD_NR - 1) << PAGE_SHIFT))
+		return NULL;
+
+	folio = filemap_alloc_folio(gfp, HPAGE_PMD_ORDER);
+	if (!folio)
+		return NULL;
+
+	if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
+		folio_put(folio);
+		return NULL;
+	}
+
+	return folio;
+#else
+	return NULL;
+#endif
+}
+
+static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
+{
+	struct folio *folio;
+
+	folio = kvm_gmem_get_huge_folio(inode, index);
+	if (!folio) {
+		folio = filemap_grab_folio(inode->i_mapping, index);
+		if (IS_ERR_OR_NULL(folio))
+			return NULL;
+	}
+
 	/*
 	 * Use the up-to-date flag to track whether or not the memory has been
 	 * zeroed before being handed off to the guest.  There is no backing
@@ -323,7 +356,8 @@ static const struct inode_operations kvm_gmem_iops = {
 	.setattr	= kvm_gmem_setattr,
 };
 
-static int __kvm_gmem_create(struct kvm *kvm, loff_t size, struct vfsmount *mnt)
+static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags,
+			     struct vfsmount *mnt)
 {
 	const char *anon_name = "[kvm-gmem]";
 	const struct qstr qname = QSTR_INIT(anon_name, strlen(anon_name));
@@ -346,6 +380,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, struct vfsmount *mnt)
 	inode->i_mode |= S_IFREG;
 	inode->i_size = size;
 	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_large_folios(inode->i_mapping);
 	mapping_set_unmovable(inode->i_mapping);
 	/* Unmovable mappings are supposed to be marked unevictable as well. */
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
@@ -396,6 +431,12 @@ static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
 	if (size < 0 || !PAGE_ALIGNED(size))
 		return false;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
+		return false;
+#endif
+
 	return true;
 }
 
@@ -405,6 +446,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	u64 flags = args->flags;
 	u64 valid_flags = 0;
 
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 16/33] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (14 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 15/33] KVM: Add transparent hugepage support for dedicated guest memory Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 17/33] KVM: x86: Disallow hugepages when memory attributes are mixed Sean Christopherson
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
the probability of exiting to userspace with a stale run->exit_reason that
*appears* to be valid.

To support fd-based guest memory (guest memory without a corresponding
userspace virtual address), KVM will exit to userspace for various memory
related errors, which userspace *may* be able to resolve, instead of using
e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
utilize the same functionality to let userspace "intercept" and handle
memory faults when the userspace mapping is missing, i.e. when fast gup()
fails.

Because many of KVM's internal APIs related to guest memory use '0' to
indicate "success, continue on" and not "exit to userspace", reporting
memory faults/errors to userspace will set run->exit_reason and
corresponding fields in the run structure in conjunction with a non-zero,
negative return code, e.g. -EFAULT or -EHWPOISON.  And because KVM already
returns -EFAULT in many paths, there's a relatively high
probability that KVM could return -EFAULT without setting run->exit_reason,
in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
whatever exit reason happened to be in the run structure.
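
E.g. a VMM's run loop might do something like the below (hypothetical
userspace sketch; "run" is the vCPU's mmap()ed kvm_run struct, the
memory_fault field names follow the KVM_EXIT_MEMORY_FAULT patch earlier in
this series, and handle_guest_memory_fault() is a placeholder):

	ret = ioctl(vcpu_fd, KVM_RUN, NULL);
	if (ret < 0 && errno == EFAULT) {
		if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			/* E.g. convert the range and re-enter the guest. */
			handle_guest_memory_fault(run->memory_fault.gpa,
						  run->memory_fault.size,
						  run->memory_fault.flags);
		} else {
			/* Stale/unknown exit reason, i.e. KVM_EXIT_UNKNOWN. */
			errx(1, "KVM_RUN failed, exit_reason = %u",
			     run->exit_reason);
		}
	}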

Note, KVM must wait until after run->immediate_exit is serviced to
sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
preserved across KVM_RUN when run->immediate_exit is true.

Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoorthy@google.com
Link: https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8356907079e1..8d21b7b09bb5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10951,6 +10951,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 {
 	int r;
 
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
 	vcpu->arch.l1tf_flush_l1d = true;
 
 	for (;;) {
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 17/33] KVM: x86: Disallow hugepages when memory attributes are mixed
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (15 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 16/33] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Disallow creating hugepages with mixed memory attributes, e.g. shared
versus private, as mapping a hugepage in this case would allow the guest
to access memory with the wrong attributes, e.g. overlaying private memory
with a shared hugepage.

Track whether or not attributes are mixed via the existing
disallow_lpage field, but use the most significant bit in 'disallow_lpage'
to indicate a hugepage has mixed attributes instead of using the normal
refcounting.  Whether or not attributes are mixed is binary; either they
are or they aren't.  Attempting to squeeze that info into the refcount is
unnecessarily complex as it would require knowing the previous state of
the mixed count when updating attributes.  Using a flag means KVM just
needs to ensure the current status is reflected in the memslots.
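
Concretely, with the mixed flag in bit 31, the encoding looks like this
(illustrative values):

	disallow_lpage == 0x00000000  =>  hugepage allowed
	disallow_lpage == 0x00000002  =>  disallowed, two refcounted reasons
	disallow_lpage == 0x80000000  =>  disallowed, attributes are mixed
	disallow_lpage == 0x80000001  =>  disallowed, one refcounted reason
	                                  *and* mixed attributes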

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/kvm/mmu/mmu.c          | 152 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   4 +
 3 files changed, 157 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3a2b53483524..91a28ddf7cfd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1838,6 +1838,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu);
 int kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 
+void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
+					    struct kvm_memory_slot *slot);
+
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu);
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0f0231d2b74f..a079f36a8bf5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -795,16 +795,26 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
 	return &slot->arch.lpage_info[level - 2][idx];
 }
 
+/*
+ * The most significant bit in disallow_lpage tracks whether or not memory
+ * attributes are mixed, i.e. not identical for all gfns at the current level.
+ * The lower order bits are used to refcount other cases where a hugepage is
+ * disallowed, e.g. if KVM has shadowed a page table at the gfn.
+ */
+#define KVM_LPAGE_MIXED_FLAG	BIT(31)
+
 static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
 					    gfn_t gfn, int count)
 {
 	struct kvm_lpage_info *linfo;
-	int i;
+	int old, i;
 
 	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
 		linfo = lpage_info_slot(gfn, slot, i);
+
+		old = linfo->disallow_lpage;
 		linfo->disallow_lpage += count;
-		WARN_ON_ONCE(linfo->disallow_lpage < 0);
+		WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
 	}
 }
 
@@ -7172,3 +7182,141 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 	if (kvm->arch.nx_huge_page_recovery_thread)
 		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
+
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+				int level)
+{
+	return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_clear_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+				 int level)
+{
+	lpage_info_slot(gfn, slot, level)->disallow_lpage &= ~KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+			       int level)
+{
+	lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG;
+}
+
+static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
+			       gfn_t gfn, int level, unsigned long attrs)
+{
+	const unsigned long start = gfn;
+	const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
+
+	if (level == PG_LEVEL_2M)
+		return kvm_range_has_memory_attributes(kvm, start, end, attrs);
+
+	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
+		if (hugepage_test_mixed(slot, gfn, level - 1) ||
+		    attrs != kvm_get_memory_attributes(kvm, gfn))
+			return false;
+	}
+	return true;
+}
+
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+					 struct kvm_gfn_range *range)
+{
+	unsigned long attrs = range->arg.attributes;
+	struct kvm_memory_slot *slot = range->slot;
+	int level;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+	lockdep_assert_held(&kvm->slots_lock);
+
+	/*
+	 * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
+	 * the slot if the slot will never consume the PRIVATE attribute.
+	 */
+	if (!kvm_slot_can_be_private(slot))
+		return false;
+
+	/*
+	 * The sequence matters here: upper levels consume the result of lower
+	 * level's scanning.
+	 */
+	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+		gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level);
+		gfn_t gfn = gfn_round_for_level(range->start, level);
+
+		/* Process the head page if it straddles the range. */
+		if (gfn != range->start || gfn + nr_pages > range->end) {
+			/*
+			 * Skip mixed tracking if the aligned gfn isn't covered
+			 * by the memslot, KVM can't use a hugepage due to the
+			 * misaligned address regardless of memory attributes.
+			 */
+			if (gfn >= slot->base_gfn) {
+				if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+					hugepage_clear_mixed(slot, gfn, level);
+				else
+					hugepage_set_mixed(slot, gfn, level);
+			}
+			gfn += nr_pages;
+		}
+
+		/*
+		 * Pages entirely covered by the range are guaranteed to have
+		 * only the attributes which were just set.
+		 */
+		for ( ; gfn + nr_pages <= range->end; gfn += nr_pages)
+			hugepage_clear_mixed(slot, gfn, level);
+
+		/*
+		 * Process the last tail page if it straddles the range and is
+		 * contained by the memslot.  Like the head page, KVM can't
+		 * create a hugepage if the slot size is misaligned.
+		 */
+		if (gfn < range->end &&
+		    (gfn + nr_pages) <= (slot->base_gfn + slot->npages)) {
+			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+				hugepage_clear_mixed(slot, gfn, level);
+			else
+				hugepage_set_mixed(slot, gfn, level);
+		}
+	}
+	return false;
+}
+
+void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
+					    struct kvm_memory_slot *slot)
+{
+	int level;
+
+	if (!kvm_slot_can_be_private(slot))
+		return;
+
+	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+		/*
+		 * Don't bother tracking mixed attributes for pages that can't
+		 * be huge due to alignment, i.e. process only pages that are
+		 * entirely contained by the memslot.
+		 */
+		gfn_t end = gfn_round_for_level(slot->base_gfn + slot->npages, level);
+		gfn_t start = gfn_round_for_level(slot->base_gfn, level);
+		gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level);
+		gfn_t gfn;
+
+		if (start < slot->base_gfn)
+			start += nr_pages;
+
+		/*
+		 * Unlike setting attributes, every potential hugepage needs to
+		 * be manually checked as the attributes may already be mixed.
+		 */
+		for (gfn = start; gfn < end; gfn += nr_pages) {
+			unsigned long attrs = kvm_get_memory_attributes(kvm, gfn);
+
+			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+				hugepage_clear_mixed(slot, gfn, level);
+			else
+				hugepage_set_mixed(slot, gfn, level);
+		}
+	}
+}
+#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8d21b7b09bb5..ac36a5b7b5a3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12598,6 +12598,10 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 		}
 	}
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	kvm_mmu_init_memslot_memory_attributes(kvm, slot);
+#endif
+
 	if (kvm_page_track_create_memslot(kvm, slot, npages))
 		goto out_free;
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (16 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 17/33] KVM: x86: Disallow hugepages when memory attributes are mixed Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-15  5:40   ` Yan Zhao
  2023-09-14  1:55 ` [RFC PATCH v12 19/33] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro Sean Christopherson
                   ` (14 subsequent siblings)
  32 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
hva-based shared memory.  Architecture code (like TDX code) can tell
whether the ongoing fault is private or not.  Add an 'is_private' field
to kvm_page_fault to indicate this; architecture code is expected to set
it.

To handle page faults for such memslots, the handling logic differs
depending on whether the fault is private or shared.  KVM checks if
'is_private' matches the host's view of the page (maintained in
mem_attr_array).
  - For a successful match, the private pfn is obtained with
    kvm_gmem_get_pfn() and the shared pfn is obtained with the existing
    get_user_pages().
  - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
    userspace.  Userspace can then convert memory between private/shared
    in the host's view and retry the fault (see the sketch below).
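
For the failed match case, a rough sketch of the userspace side (hypothetical
snippet; struct kvm_memory_attributes and KVM_SET_MEMORY_ATTRIBUTES come from
the memory attributes patches earlier in this series):

	/* The guest accessed the gfn as private, the host still sees it as shared. */
	struct kvm_memory_attributes attrs = {
		.address	= run->memory_fault.gpa,
		.size		= run->memory_fault.size,
		.attributes	= KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};

	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
		err(1, "KVM_SET_MEMORY_ATTRIBUTES");

	/* Re-enter the guest to retry the faulting access. */
	ret = ioctl(vcpu_fd, KVM_RUN, NULL);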

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 94 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 2 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a079f36a8bf5..9b48d8d0300b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3147,9 +3147,9 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
 	return level;
 }
 
-int kvm_mmu_max_mapping_level(struct kvm *kvm,
-			      const struct kvm_memory_slot *slot, gfn_t gfn,
-			      int max_level)
+static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
+				       const struct kvm_memory_slot *slot,
+				       gfn_t gfn, int max_level, bool is_private)
 {
 	struct kvm_lpage_info *linfo;
 	int host_level;
@@ -3161,6 +3161,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			break;
 	}
 
+	if (is_private)
+		return max_level;
+
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
@@ -3168,6 +3171,16 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 	return min(host_level, max_level);
 }
 
+int kvm_mmu_max_mapping_level(struct kvm *kvm,
+			      const struct kvm_memory_slot *slot, gfn_t gfn,
+			      int max_level)
+{
+	bool is_private = kvm_slot_can_be_private(slot) &&
+			  kvm_mem_is_private(kvm, gfn);
+
+	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
+}
+
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
@@ -3188,8 +3201,9 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 * Enforce the iTLB multihit workaround after capturing the requested
 	 * level, which will be used to do precise, accurate accounting.
 	 */
-	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
-						     fault->gfn, fault->max_level);
+	fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
+						       fault->gfn, fault->max_level,
+						       fault->is_private);
 	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
 		return;
 
@@ -4261,6 +4275,55 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
 }
 
+static inline u8 kvm_max_level_for_order(int order)
+{
+	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+	KVM_MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) &&
+			order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) &&
+			order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K));
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+		return PG_LEVEL_1G;
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+		return PG_LEVEL_2M;
+
+	return PG_LEVEL_4K;
+}
+
+static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+					      struct kvm_page_fault *fault)
+{
+	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
+				      PAGE_SIZE, fault->write, fault->exec,
+				      fault->is_private);
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+				   struct kvm_page_fault *fault)
+{
+	int max_order, r;
+
+	if (!kvm_slot_can_be_private(fault->slot)) {
+		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+		return -EFAULT;
+	}
+
+	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
+			     &max_order);
+	if (r) {
+		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+		return r;
+	}
+
+	fault->max_level = min(kvm_max_level_for_order(max_order),
+			       fault->max_level);
+	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
+
+	return RET_PF_CONTINUE;
+}
+
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
@@ -4293,6 +4356,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			return RET_PF_EMULATE;
 	}
 
+	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+		return -EFAULT;
+	}
+
+	if (fault->is_private)
+		return kvm_faultin_pfn_private(vcpu, fault);
+
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
 					  fault->write, &fault->map_writable,
@@ -7184,6 +7255,19 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 }
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+					struct kvm_gfn_range *range)
+{
+	/*
+	 * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
+	 * the slot if the slot will never consume the PRIVATE attribute.
+	 */
+	if (!kvm_slot_can_be_private(range->slot))
+		return false;
+
+	return kvm_mmu_unmap_gfn_range(kvm, range);
+}
+
 static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
 				int level)
 {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b102014e2c60..4efbf43b4b18 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -202,6 +202,7 @@ struct kvm_page_fault {
 
 	/* Derived from mmu and global state.  */
 	const bool is_tdp;
+	const bool is_private;
 	const bool nx_huge_page_workaround_enabled;
 
 	/*
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 19/33] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (17 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 20/33] KVM: Allow arch code to track number of memslot address spaces per VM Sean Christopherson
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
KVM_ADDRESS_SPACE_NUM.

No functional change intended.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 1 -
 include/linux/kvm_host.h        | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 91a28ddf7cfd..78d641056ec5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2126,7 +2126,6 @@ enum {
 #define HF_SMM_MASK		(1 << 1)
 #define HF_SMM_INSIDE_NMI_MASK	(1 << 2)
 
-# define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
 # define KVM_ADDRESS_SPACE_NUM 2
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 18d8f02a99a3..aea1b4306129 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -692,7 +692,7 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
 #define KVM_MEM_SLOTS_NUM SHRT_MAX
 #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
 
-#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
+#if KVM_ADDRESS_SPACE_NUM == 1
 static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 {
 	return 0;
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 20/33] KVM: Allow arch code to track number of memslot address spaces per VM
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (18 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 19/33] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 21/33] KVM: x86: Add support for "protected VMs" that can utilize private memory Sean Christopherson
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs.  Confidential VMs are
fundamentally incompatible with emulating SMM, which, as the name suggests,
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).
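
The end state reached later in the series (see the "protected VMs" patch) is
presumably along these lines; a sketch of the intent, not the literal
implementation:

	static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
	{
		/*
		 * SMM, i.e. the second (non-default) address space, is not
		 * supported for VMs that can have private memory.
		 */
		return kvm_arch_has_private_mem(kvm) ? 1 : KVM_MAX_NR_ADDRESS_SPACES;
	}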

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/powerpc/kvm/book3s_hv.c    |  2 +-
 arch/x86/include/asm/kvm_host.h |  8 +++++++-
 arch/x86/kvm/debugfs.c          |  2 +-
 arch/x86/kvm/mmu/mmu.c          |  8 ++++----
 arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
 arch/x86/kvm/x86.c              |  2 +-
 include/linux/kvm_host.h        | 17 +++++++++++------
 virt/kvm/dirty_ring.c           |  2 +-
 virt/kvm/kvm_main.c             | 26 ++++++++++++++------------
 9 files changed, 41 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..9b0eaa17275a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6084,7 +6084,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
 	}
 
 	srcu_idx = srcu_read_lock(&kvm->srcu);
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		struct kvm_memory_slot *memslot;
 		struct kvm_memslots *slots = __kvm_memslots(kvm, i);
 		int bkt;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 78d641056ec5..44d67a97304e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2126,9 +2126,15 @@ enum {
 #define HF_SMM_MASK		(1 << 1)
 #define HF_SMM_INSIDE_NMI_MASK	(1 << 2)
 
-# define KVM_ADDRESS_SPACE_NUM 2
+# define KVM_MAX_NR_ADDRESS_SPACES	2
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
+
+static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
+{
+	return KVM_MAX_NR_ADDRESS_SPACES;
+}
+
 #else
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index ee8c4c3496ed..42026b3f3ff3 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -111,7 +111,7 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
 	mutex_lock(&kvm->slots_lock);
 	write_lock(&kvm->mmu_lock);
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		int bkt;
 
 		slots = __kvm_memslots(kvm, i);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9b48d8d0300b..269d4dc47c98 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3755,7 +3755,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 	    kvm_page_track_write_tracking_enabled(kvm))
 		goto out_success;
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		slots = __kvm_memslots(kvm, i);
 		kvm_for_each_memslot(slot, bkt, slots) {
 			/*
@@ -6301,7 +6301,7 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
 	if (!kvm_memslots_have_rmaps(kvm))
 		return flush;
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		slots = __kvm_memslots(kvm, i);
 
 		kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
@@ -6341,7 +6341,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
 	if (tdp_mmu_enabled) {
-		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
+		for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++)
 			flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
 						      gfn_end, true, flush);
 	}
@@ -6802,7 +6802,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
 	 * modifier prior to checking for a wrap of the MMIO generation so
 	 * that a wrap in any address space is detected.
 	 */
-	gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1);
+	gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1);
 
 	/*
 	 * The very rare case: if the MMIO generation number has wrapped,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6c63f2d1675f..ca7ec39f17d3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -905,7 +905,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 	 * is being destroyed or the userspace VMM has exited.  In both cases,
 	 * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
 	 */
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		for_each_tdp_mmu_root_yield_safe(kvm, root, i)
 			tdp_mmu_zap_root(kvm, root, false);
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac36a5b7b5a3..f1da61236670 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12447,7 +12447,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 		hva = slot->userspace_addr;
 	}
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		struct kvm_userspace_memory_region2 m;
 
 		m.slot = id | (i << 16);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index aea1b4306129..8c5c017ab4e9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -80,8 +80,8 @@
 /* Two fragments for cross MMIO pages. */
 #define KVM_MAX_MMIO_FRAGMENTS	2
 
-#ifndef KVM_ADDRESS_SPACE_NUM
-#define KVM_ADDRESS_SPACE_NUM	1
+#ifndef KVM_MAX_NR_ADDRESS_SPACES
+#define KVM_MAX_NR_ADDRESS_SPACES	1
 #endif
 
 /*
@@ -692,7 +692,12 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
 #define KVM_MEM_SLOTS_NUM SHRT_MAX
 #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
 
-#if KVM_ADDRESS_SPACE_NUM == 1
+#if KVM_MAX_NR_ADDRESS_SPACES == 1
+static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
+{
+	return KVM_MAX_NR_ADDRESS_SPACES;
+}
+
 static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 {
 	return 0;
@@ -747,9 +752,9 @@ struct kvm {
 	struct mm_struct *mm; /* userspace tied to this vm */
 	unsigned long nr_memslot_pages;
 	/* The two memslot sets - active and inactive (per address space) */
-	struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2];
+	struct kvm_memslots __memslots[KVM_MAX_NR_ADDRESS_SPACES][2];
 	/* The current active memslot set for each address space */
-	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
+	struct kvm_memslots __rcu *memslots[KVM_MAX_NR_ADDRESS_SPACES];
 	struct xarray vcpu_array;
 	/*
 	 * Protected by slots_lock, but can be read outside if an
@@ -1018,7 +1023,7 @@ void kvm_put_kvm_no_destroy(struct kvm *kvm);
 
 static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
 {
-	as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
+	as_id = array_index_nospec(as_id, KVM_MAX_NR_ADDRESS_SPACES);
 	return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
 			lockdep_is_held(&kvm->slots_lock) ||
 			!refcount_read(&kvm->users_count));
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index c1cd7dfe4a90..86d267db87bb 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -58,7 +58,7 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
 	as_id = slot >> 16;
 	id = (u16)slot;
 
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
 		return;
 
 	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 68a6119e09e4..a83dfef1316e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -615,7 +615,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 
 	idx = srcu_read_lock(&kvm->srcu);
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		struct interval_tree_node *node;
 
 		slots = __kvm_memslots(kvm, i);
@@ -1248,7 +1248,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 		goto out_err_no_irq_srcu;
 
 	refcount_set(&kvm->users_count, 1);
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		for (j = 0; j < 2; j++) {
 			slots = &kvm->__memslots[i][j];
 
@@ -1391,7 +1391,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 #endif
 	kvm_arch_destroy_vm(kvm);
 	kvm_destroy_devices(kvm);
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
 		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
 	}
@@ -1674,7 +1674,7 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
 	 * space 0 will use generations 0, 2, 4, ... while address space 1 will
 	 * use generations 1, 3, 5, ...
 	 */
-	gen += KVM_ADDRESS_SPACE_NUM;
+	gen += kvm_arch_nr_memslot_as_ids(kvm);
 
 	kvm_arch_memslots_updated(kvm, gen);
 
@@ -2044,7 +2044,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	    (mem->gmem_offset & (PAGE_SIZE - 1) ||
 	     mem->gmem_offset + mem->memory_size < mem->gmem_offset))
 		return -EINVAL;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
 		return -EINVAL;
@@ -2180,7 +2180,7 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
 
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
 
 	slots = __kvm_memslots(kvm, as_id);
@@ -2242,7 +2242,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
 
 	slots = __kvm_memslots(kvm, as_id);
@@ -2354,7 +2354,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
 
 	if (log->first_page & 63)
@@ -2494,7 +2494,7 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 	gfn_range.only_private = false;
 	gfn_range.only_shared = false;
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		slots = __kvm_memslots(kvm, i);
 
 		kvm_for_each_memslot_in_gfn_range(&iter, slots, range->start, range->end) {
@@ -4833,9 +4833,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_IRQ_ROUTING:
 		return KVM_MAX_IRQ_ROUTES;
 #endif
-#if KVM_ADDRESS_SPACE_NUM > 1
+#if KVM_MAX_NR_ADDRESS_SPACES > 1
 	case KVM_CAP_MULTI_ADDRESS_SPACE:
-		return KVM_ADDRESS_SPACE_NUM;
+		if (kvm)
+			return kvm_arch_nr_memslot_as_ids(kvm);
+		return KVM_MAX_NR_ADDRESS_SPACES;
 #endif
 	case KVM_CAP_NR_MEMSLOTS:
 		return KVM_USER_MEM_SLOTS;
@@ -4939,7 +4941,7 @@ bool kvm_are_all_memslots_empty(struct kvm *kvm)
 
 	lockdep_assert_held(&kvm->slots_lock);
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		if (!kvm_memslots_empty(__kvm_memslots(kvm, i)))
 			return false;
 	}
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 21/33] KVM: x86: Add support for "protected VMs" that can utilize private memory
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (19 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 20/33] KVM: Allow arch code to track number of memslot address spaces per VM Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 22/33] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper Sean Christopherson
                   ` (11 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Signed-off-by: Sean Christopherson <seanjc@google.com>
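
A rough sketch of the expected userspace flow (not part of the patch itself;
error handling is trimmed, and KVM_CAP_VM_TYPES / KVM_X86_SW_PROTECTED_VM are
the uapi definitions added below):

  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int create_sw_protected_vm(void)
  {
  	int kvm_fd, vm_fd, types;

  	kvm_fd = open("/dev/kvm", O_RDWR);
  	if (kvm_fd < 0)
  		return kvm_fd;

  	/* KVM_CAP_VM_TYPES returns a bitmap of supported VM types. */
  	types = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);
  	if (!(types & (1 << KVM_X86_SW_PROTECTED_VM))) {
  		close(kvm_fd);
  		return -1;
  	}

  	/* The VM type is passed as the KVM_CREATE_VM "machine type". */
  	vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_SW_PROTECTED_VM);
  	close(kvm_fd);
  	return vm_fd;
  }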
---
 Documentation/virt/kvm/api.rst  | 32 ++++++++++++++++++++++++++++++++
 arch/x86/include/asm/kvm_host.h | 15 +++++++++------
 arch/x86/include/uapi/asm/kvm.h |  3 +++
 arch/x86/kvm/Kconfig            | 12 ++++++++++++
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 arch/x86/kvm/x86.c              | 16 +++++++++++++++-
 include/uapi/linux/kvm.h        |  1 +
 virt/kvm/Kconfig                |  5 +++++
 8 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c44ef5295a12..5e08f2a157ef 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,10 +147,29 @@ described as 'basic' will be available.
 The new VM has no virtual cpus and no memory.
 You probably want to use 0 as machine type.
 
+X86:
+^^^^
+
+Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
+
+S390:
+^^^^^
+
 In order to create user controlled virtual machines on S390, check
 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
 privileged user (CAP_SYS_ADMIN).
 
+MIPS:
+^^^^^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^^^^^
+
 On arm64, the physical address size for a VM (IPA Size limit) is limited
 to 40bits by default. The limit can be configured if the host supports the
 extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -8558,6 +8577,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
 This capability indicates KVM supports per-page memory attributes and ioctls
 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
 
+8.41 KVM_CAP_VM_TYPES
+---------------------
+
+:Capability: KVM_CAP_VM_TYPES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of supported VM types.  The 1-setting of bit @n
+means the VM type with value @n is supported.  Possible values of @n are::
+
+  #define KVM_X86_DEFAULT_VM	0
+  #define KVM_X86_SW_PROTECTED_VM	1
+
 9. Known KVM API problems
 =========================
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 44d67a97304e..95018cc653f5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1245,6 +1245,7 @@ enum kvm_apicv_inhibit {
 };
 
 struct kvm_arch {
+	unsigned long vm_type;
 	unsigned long n_used_mmu_pages;
 	unsigned long n_requested_mmu_pages;
 	unsigned long n_max_mmu_pages;
@@ -2079,6 +2080,12 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
 void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 		       int tdp_max_root_level, int tdp_huge_page_level);
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != KVM_X86_DEFAULT_VM)
+#else
+#define kvm_arch_has_private_mem(kvm) false
+#endif
+
 static inline u16 kvm_read_ldt(void)
 {
 	u16 ldt;
@@ -2127,14 +2134,10 @@ enum {
 #define HF_SMM_INSIDE_NMI_MASK	(1 << 2)
 
 # define KVM_MAX_NR_ADDRESS_SPACES	2
+/* SMM is currently unsupported for guests with private memory. */
+# define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 2)
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
-
-static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
-{
-	return KVM_MAX_NR_ADDRESS_SPACES;
-}
-
 #else
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 1a6a1f987949..a448d0964fc0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -562,4 +562,7 @@ struct kvm_pmu_event_filter {
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE	BIT(0)
 
+#define KVM_X86_DEFAULT_VM	0
+#define KVM_X86_SW_PROTECTED_VM	1
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 091b74599c22..8452ed0228cb 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -77,6 +77,18 @@ config KVM_WERROR
 
 	  If in doubt, say "N".
 
+config KVM_SW_PROTECTED_VM
+	bool "Enable support for KVM software-protected VMs"
+	depends on EXPERT
+	depends on X86_64
+	select KVM_GENERIC_PRIVATE_MEM
+	help
+	  Enable support for KVM software-protected VMs.  Currently "protected"
+	  means the VM can be backed with memory provided by
+	  KVM_CREATE_GUEST_MEMFD.
+
+	  If unsure, say "N".
+
 config KVM_INTEL
 	tristate "KVM for Intel (and compatible) processors support"
 	depends on KVM && IA32_FEAT_CTL
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 4efbf43b4b18..71ba4f833dc1 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -298,6 +298,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
 		.req_level = PG_LEVEL_4K,
 		.goal_level = PG_LEVEL_4K,
+		.is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
 	};
 	int r;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f1da61236670..767236b4d771 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4441,6 +4441,13 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+static bool kvm_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_DEFAULT_VM ||
+	       (type == KVM_X86_SW_PROTECTED_VM &&
+		IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
+}
+
 int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 {
 	int r = 0;
@@ -4631,6 +4638,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_NOTIFY_VMEXIT:
 		r = kvm_caps.has_notify_vmexit;
 		break;
+	case KVM_CAP_VM_TYPES:
+		r = BIT(KVM_X86_DEFAULT_VM);
+		if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
+			r |= BIT(KVM_X86_SW_PROTECTED_VM);
+		break;
 	default:
 		break;
 	}
@@ -12302,9 +12314,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	int ret;
 	unsigned long flags;
 
-	if (type)
+	if (!kvm_is_vm_type_supported(type))
 		return -EINVAL;
 
+	kvm->arch.vm_type = type;
+
 	ret = kvm_page_track_init(kvm);
 	if (ret)
 		goto out;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2df18796fd8e..65fc983af840 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1233,6 +1233,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_USER_MEMORY2 230
 #define KVM_CAP_MEMORY_ATTRIBUTES 231
 #define KVM_CAP_GUEST_MEMFD 232
+#define KVM_CAP_VM_TYPES 233
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 08afef022db9..2c964586aa14 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -104,3 +104,8 @@ config KVM_GENERIC_MEMORY_ATTRIBUTES
 config KVM_PRIVATE_MEM
        select XARRAY_MULTI
        bool
+
+config KVM_GENERIC_PRIVATE_MEM
+       select KVM_GENERIC_MEMORY_ATTRIBUTES
+       select KVM_PRIVATE_MEM
+       bool
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 22/33] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (20 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 21/33] KVM: x86: Add support for "protected VMs" that can utilize private memory Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 23/33] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
                   ` (10 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Drop kvm_userspace_memory_region_find(); it's unused and a terrible API
(probably why it's unused).  If anything outside of kvm_util.c needs to
get at the memslot, userspace_mem_region_find() can be exposed to give
others full access to all memory region/slot information.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/include/kvm_util_base.h     |  4 ---
 tools/testing/selftests/kvm/lib/kvm_util.c    | 29 -------------------
 2 files changed, 33 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index a18db6a7b3cf..967eaaeacd75 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -776,10 +776,6 @@ vm_adjust_num_guest_pages(enum vm_guest_mode mode, unsigned int num_guest_pages)
 	return n;
 }
 
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-				 uint64_t end);
-
 #define sync_global_to_guest(vm, g) ({				\
 	typeof(g) *_p = addr_gva2hva(vm, (vm_vaddr_t)&(g));	\
 	memcpy(_p, &(g), sizeof(g));				\
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 7a8af1821f5d..f09295d56c23 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -590,35 +590,6 @@ userspace_mem_region_find(struct kvm_vm *vm, uint64_t start, uint64_t end)
 	return NULL;
 }
 
-/*
- * KVM Userspace Memory Region Find
- *
- * Input Args:
- *   vm - Virtual Machine
- *   start - Starting VM physical address
- *   end - Ending VM physical address, inclusive.
- *
- * Output Args: None
- *
- * Return:
- *   Pointer to overlapping region, NULL if no such region.
- *
- * Public interface to userspace_mem_region_find. Allows tests to look up
- * the memslot datastructure for a given range of guest physical memory.
- */
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-				 uint64_t end)
-{
-	struct userspace_mem_region *region;
-
-	region = userspace_mem_region_find(vm, start, end);
-	if (!region)
-		return NULL;
-
-	return &region->region;
-}
-
 __weak void vcpu_arch_free(struct kvm_vcpu *vcpu)
 {
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 23/33] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (21 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 22/33] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 24/33] KVM: selftests: Add support for creating private memslots Sean Christopherson
                   ` (9 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so
that support for guest private memory can be added without needing an
entirely separate set of helpers.

Signed-off-by: Sean Christopherson <seanjc@google.com>
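
For non-private memslots the conversion is purely mechanical.  A minimal
sketch of what a converted call looks like (vm/host_mem and the addresses are
made-up; the gmem_fd/gmem_offset fields added by this series only matter when
KVM_MEM_PRIVATE is set and can be left zeroed here):

  	struct kvm_userspace_memory_region2 region = {
  		.slot = 0,
  		.flags = 0,
  		.guest_phys_addr = 0x100000,
  		.memory_size = 0x200000,
  		.userspace_addr = (uintptr_t)host_mem,
  	};

  	vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region);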
---
 .../selftests/kvm/include/kvm_util_base.h     |  2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    | 19 ++++++++++---------
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 967eaaeacd75..9f144841c2ee 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -44,7 +44,7 @@ typedef uint64_t vm_paddr_t; /* Virtual Machine (Guest) physical address */
 typedef uint64_t vm_vaddr_t; /* Virtual Machine (Guest) virtual address */
 
 struct userspace_mem_region {
-	struct kvm_userspace_memory_region region;
+	struct kvm_userspace_memory_region2 region;
 	struct sparsebit *unused_phy_pages;
 	int fd;
 	off_t offset;
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index f09295d56c23..3676b37bea38 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -453,8 +453,9 @@ void kvm_vm_restart(struct kvm_vm *vmp)
 		vm_create_irqchip(vmp);
 
 	hash_for_each(vmp->regions.slot_hash, ctr, region, slot_node) {
-		int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION, &region->region);
-		TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+		int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION2, &region->region);
+
+		TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
 			    "  rc: %i errno: %i\n"
 			    "  slot: %u flags: 0x%x\n"
 			    "  guest_phys_addr: 0x%llx size: 0x%llx",
@@ -657,7 +658,7 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
 	}
 
 	region->region.memory_size = 0;
-	vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+	vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
 	sparsebit_free(&region->unused_phy_pages);
 	ret = munmap(region->mmap_start, region->mmap_size);
@@ -1014,8 +1015,8 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	region->region.guest_phys_addr = guest_paddr;
 	region->region.memory_size = npages * vm->page_size;
 	region->region.userspace_addr = (uintptr_t) region->host_mem;
-	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
-	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
+	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
 		"  rc: %i errno: %i\n"
 		"  slot: %u flags: 0x%x\n"
 		"  guest_phys_addr: 0x%lx size: 0x%lx",
@@ -1097,9 +1098,9 @@ void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags)
 
 	region->region.flags = flags;
 
-	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
-	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
 		"  rc: %i errno: %i slot: %u flags: 0x%x",
 		ret, errno, slot, flags);
 }
@@ -1127,9 +1128,9 @@ void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa)
 
 	region->region.guest_phys_addr = new_gpa;
 
-	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
-	TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION failed\n"
+	TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed\n"
 		    "ret: %i errno: %i slot: %u new_gpa: 0x%lx",
 		    ret, errno, slot, new_gpa);
 }
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 24/33] KVM: selftests: Add support for creating private memslots
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (22 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 23/33] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 25/33] KVM: selftests: Add helpers to convert guest memory b/w private and shared Sean Christopherson
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add support for creating "private" memslots via KVM_CREATE_GUEST_MEMFD and
KVM_SET_USER_MEMORY_REGION2.  Make vm_userspace_mem_region_add() a wrapper
to its effective replacement, vm_mem_add(), so that private memslots are
fully opt-in, i.e. don't require updating all tests that add memory regions.

Pivot on the KVM_MEM_PRIVATE flag instead of the validity of the "gmem"
file descriptor so that simple tests can let vm_mem_add() do the heavy
lifting of creating the guest memfd, but also allow the caller to pass in
an explicit fd+offset so that fancier tests can do things like back
multiple memslots with a single file.  If the caller passes in a fd, dup()
the fd so that (a) __vm_mem_region_delete() can close the fd associated
with the memory region without needing yet another flag, and (b) the caller
can safely close its copy of the fd without having to first
destroy memslots.

Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
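
A sketch of the "fancier" usage described above, i.e. backing two memslots
with a single guest_memfd (the slot numbers, gpa, sizes and the helper name
are purely illustrative):

  static void back_two_slots_with_one_gmem(struct kvm_vm *vm, uint64_t gpa)
  {
  	size_t size = 2 * 1024 * 1024;
  	int gmem_fd = vm_create_guest_memfd(vm, 2 * size, 0);

  	/* vm_mem_add() dup()s the fd, so this copy can be closed right away. */
  	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, gpa, 10,
  		   size / vm->page_size, KVM_MEM_PRIVATE, gmem_fd, 0);
  	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, gpa + size, 11,
  		   size / vm->page_size, KVM_MEM_PRIVATE, gmem_fd, size);
  	close(gmem_fd);
  }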
---
 .../selftests/kvm/include/kvm_util_base.h     | 23 +++++
 .../testing/selftests/kvm/include/test_util.h |  5 ++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 85 ++++++++++++-------
 3 files changed, 82 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 9f144841c2ee..47ea25f9dc97 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -431,6 +431,26 @@ static inline uint64_t vm_get_stat(struct kvm_vm *vm, const char *stat_name)
 
 void vm_create_irqchip(struct kvm_vm *vm);
 
+static inline int __vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+					uint64_t flags)
+{
+	struct kvm_create_guest_memfd gmem = {
+		.size = size,
+		.flags = flags,
+	};
+
+	return __vm_ioctl(vm, KVM_CREATE_GUEST_MEMFD, &gmem);
+}
+
+static inline int vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+					uint64_t flags)
+{
+	int fd = __vm_create_guest_memfd(vm, size, flags);
+
+	TEST_ASSERT(fd >= 0, KVM_IOCTL_ERROR(KVM_CREATE_GUEST_MEMFD, fd));
+	return fd;
+}
+
 void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 			       uint64_t gpa, uint64_t size, void *hva);
 int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
@@ -439,6 +459,9 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	enum vm_mem_backing_src_type src_type,
 	uint64_t guest_paddr, uint32_t slot, uint64_t npages,
 	uint32_t flags);
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+		uint32_t flags, int gmem_fd, uint64_t gmem_offset);
 
 void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags);
 void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa);
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 7e614adc6cf4..7257f2243ab9 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -142,6 +142,11 @@ static inline bool backing_src_is_shared(enum vm_mem_backing_src_type t)
 	return vm_mem_backing_src_alias(t)->flag & MAP_SHARED;
 }
 
+static inline bool backing_src_can_be_huge(enum vm_mem_backing_src_type t)
+{
+	return t != VM_MEM_SRC_ANONYMOUS && t != VM_MEM_SRC_SHMEM;
+}
+
 /* Aligns x up to the next multiple of size. Size must be a power of 2. */
 static inline uint64_t align_up(uint64_t x, uint64_t size)
 {
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 3676b37bea38..127f44c6c83c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -669,6 +669,8 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
 		TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret));
 		close(region->fd);
 	}
+	if (region->region.gmem_fd >= 0)
+		close(region->region.gmem_fd);
 
 	free(region);
 }
@@ -870,36 +872,15 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 		    errno, strerror(errno));
 }
 
-/*
- * VM Userspace Memory Region Add
- *
- * Input Args:
- *   vm - Virtual Machine
- *   src_type - Storage source for this region.
- *              NULL to use anonymous memory.
- *   guest_paddr - Starting guest physical address
- *   slot - KVM region slot
- *   npages - Number of physical pages
- *   flags - KVM memory region flags (e.g. KVM_MEM_LOG_DIRTY_PAGES)
- *
- * Output Args: None
- *
- * Return: None
- *
- * Allocates a memory area of the number of pages specified by npages
- * and maps it to the VM specified by vm, at a starting physical address
- * given by guest_paddr.  The region is created with a KVM region slot
- * given by slot, which must be unique and < KVM_MEM_SLOTS_NUM.  The
- * region is created with the flags given by flags.
- */
-void vm_userspace_mem_region_add(struct kvm_vm *vm,
-	enum vm_mem_backing_src_type src_type,
-	uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-	uint32_t flags)
+/* FIXME: This thing needs to be ripped apart and rewritten. */
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+		uint32_t flags, int gmem_fd, uint64_t gmem_offset)
 {
 	int ret;
 	struct userspace_mem_region *region;
 	size_t backing_src_pagesz = get_backing_src_pagesz(src_type);
+	size_t mem_size = npages * vm->page_size;
 	size_t alignment;
 
 	TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
@@ -952,7 +933,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	/* Allocate and initialize new mem region structure. */
 	region = calloc(1, sizeof(*region));
 	TEST_ASSERT(region != NULL, "Insufficient Memory");
-	region->mmap_size = npages * vm->page_size;
+	region->mmap_size = mem_size;
 
 #ifdef __s390x__
 	/* On s390x, the host address must be aligned to 1M (due to PGSTEs) */
@@ -999,14 +980,47 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	/* As needed perform madvise */
 	if ((src_type == VM_MEM_SRC_ANONYMOUS ||
 	     src_type == VM_MEM_SRC_ANONYMOUS_THP) && thp_configured()) {
-		ret = madvise(region->host_mem, npages * vm->page_size,
+		ret = madvise(region->host_mem, mem_size,
 			      src_type == VM_MEM_SRC_ANONYMOUS ? MADV_NOHUGEPAGE : MADV_HUGEPAGE);
 		TEST_ASSERT(ret == 0, "madvise failed, addr: %p length: 0x%lx src_type: %s",
-			    region->host_mem, npages * vm->page_size,
+			    region->host_mem, mem_size,
 			    vm_mem_backing_src_alias(src_type)->name);
 	}
 
 	region->backing_src_type = src_type;
+
+	if (flags & KVM_MEM_PRIVATE) {
+		if (gmem_fd < 0) {
+			uint32_t gmem_flags = 0;
+
+			/*
+			 * Allow hugepages for the guest memfd backing if the
+			 * "normal" backing is allowed/required to be huge.
+			 */
+			if (src_type != VM_MEM_SRC_ANONYMOUS &&
+			    src_type != VM_MEM_SRC_SHMEM)
+				gmem_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+
+			TEST_ASSERT(!gmem_offset,
+				    "Offset must be zero when creating new guest_memfd");
+			gmem_fd = vm_create_guest_memfd(vm, mem_size, gmem_flags);
+		} else {
+			/*
+			 * Install a unique fd for each memslot so that the fd
+			 * can be closed when the region is deleted without
+			 * needing to track if the fd is owned by the framework
+			 * or by the caller.
+			 */
+			gmem_fd = dup(gmem_fd);
+			TEST_ASSERT(gmem_fd >= 0, __KVM_SYSCALL_ERROR("dup()", gmem_fd));
+		}
+
+		region->region.gmem_fd = gmem_fd;
+		region->region.gmem_offset = gmem_offset;
+	} else {
+		region->region.gmem_fd = -1;
+	}
+
 	region->unused_phy_pages = sparsebit_alloc();
 	sparsebit_set_num(region->unused_phy_pages,
 		guest_paddr >> vm->page_shift, npages);
@@ -1019,9 +1033,10 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
 		"  rc: %i errno: %i\n"
 		"  slot: %u flags: 0x%x\n"
-		"  guest_phys_addr: 0x%lx size: 0x%lx",
+		"  guest_phys_addr: 0x%lx size: 0x%lx guest_memfd: %d\n",
 		ret, errno, slot, flags,
-		guest_paddr, (uint64_t) region->region.memory_size);
+		guest_paddr, (uint64_t) region->region.memory_size,
+		region->region.gmem_fd);
 
 	/* Add to quick lookup data structures */
 	vm_userspace_mem_region_gpa_insert(&vm->regions.gpa_tree, region);
@@ -1042,6 +1057,14 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	}
 }
 
+void vm_userspace_mem_region_add(struct kvm_vm *vm,
+				 enum vm_mem_backing_src_type src_type,
+				 uint64_t guest_paddr, uint32_t slot,
+				 uint64_t npages, uint32_t flags)
+{
+	vm_mem_add(vm, src_type, guest_paddr, slot, npages, flags, -1, 0);
+}
+
 /*
  * Memslot to region
  *
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 25/33] KVM: selftests: Add helpers to convert guest memory b/w private and shared
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (23 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 24/33] KVM: selftests: Add support for creating private memslots Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 26/33] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86) Sean Christopherson
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Vishal Annapurve <vannapurve@google.com>

Add helpers to convert memory between private and shared via KVM's
memory attributes, as well as helpers to free/allocate guest_memfd memory
via fallocate().  Userspace, i.e. tests, is NOT required to do fallocate()
when converting memory, as the attributes are the single source of true.
The fallocate() helpers are provided so that tests can mimic a userspace
that frees private memory on conversion, e.g. to prioritize memory usage
over performance.

Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
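
E.g. a test mimicking a userspace that frees private memory on conversion
would pair the helpers roughly as follows (vm, gpa and size come from the
test's own setup):

  	/* Guest converted [gpa, gpa + size) to shared; discard the private side. */
  	vm_mem_set_shared(vm, gpa, size);
  	vm_guest_mem_punch_hole(vm, gpa, size);

  	/* Convert back to private; preallocating via fallocate() is optional. */
  	vm_mem_set_private(vm, gpa, size);
  	vm_guest_mem_allocate(vm, gpa, size);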
---
 .../selftests/kvm/include/kvm_util_base.h     | 48 +++++++++++++++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 26 ++++++++++
 2 files changed, 74 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 47ea25f9dc97..a0315503ac3e 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -333,6 +333,54 @@ static inline void vm_enable_cap(struct kvm_vm *vm, uint32_t cap, uint64_t arg0)
 	vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
 }
 
+static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
+					    uint64_t size, uint64_t attributes)
+{
+	struct kvm_memory_attributes attr = {
+		.attributes = attributes,
+		.address = gpa,
+		.size = size,
+		.flags = 0,
+	};
+
+	/*
+	 * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes.  These flows
+	 * need significant enhancements to support multiple attributes.
+	 */
+	TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
+		    "Update me to support multiple attributes!");
+
+	vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
+}
+
+
+static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa,
+				      uint64_t size)
+{
+	vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
+				     uint64_t size)
+{
+	vm_set_memory_attributes(vm, gpa, size, 0);
+}
+
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
+			    bool punch_hole);
+
+static inline void vm_guest_mem_punch_hole(struct kvm_vm *vm, uint64_t gpa,
+					   uint64_t size)
+{
+	vm_guest_mem_fallocate(vm, gpa, size, true);
+}
+
+static inline void vm_guest_mem_allocate(struct kvm_vm *vm, uint64_t gpa,
+					 uint64_t size)
+{
+	vm_guest_mem_fallocate(vm, gpa, size, false);
+}
+
 void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
 const char *vm_guest_mode_string(uint32_t i);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 127f44c6c83c..bf2bd5c39a96 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1176,6 +1176,32 @@ void vm_mem_region_delete(struct kvm_vm *vm, uint32_t slot)
 	__vm_mem_region_delete(vm, memslot2region(vm, slot), true);
 }
 
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
+			    bool punch_hole)
+{
+	struct userspace_mem_region *region;
+	uint64_t end = gpa + size - 1;
+	off_t fd_offset;
+	int mode, ret;
+
+	region = userspace_mem_region_find(vm, gpa, gpa);
+	TEST_ASSERT(region && region->region.flags & KVM_MEM_PRIVATE,
+		    "Private memory region not found for GPA 0x%lx", gpa);
+
+	TEST_ASSERT(region == userspace_mem_region_find(vm, end, end),
+		    "fallocate() for guest_memfd must act on a single memslot");
+
+	fd_offset = region->region.gmem_offset +
+		    (gpa - region->region.guest_phys_addr);
+
+	mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
+
+	ret = fallocate(region->region.gmem_fd, mode, fd_offset, size);
+	TEST_ASSERT(!ret, "fallocate() failed to %s at %lx[%lu], fd = %d, mode = %x, offset = %lx\n",
+		     punch_hole ? "punch hole" : "allocate", gpa, size,
+		     region->region.gmem_fd, mode, fd_offset);
+}
+
 /* Returns the size of a vCPU's kvm_run structure. */
 static int vcpu_mmap_sz(void)
 {
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 26/33] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86)
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (24 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 25/33] KVM: selftests: Add helpers to convert guest memory b/w private and shared Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 27/33] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type Sean Christopherson
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Vishal Annapurve <vannapurve@google.com>

Add helpers for x86 guests to invoke the KVM_HC_MAP_GPA_RANGE hypercall,
which KVM will forward to userspace and thus can be used by tests to
coordinate private<=>shared conversions between host userspace code and
guest code.

Signed-off-by: Vishal Annapurve <vannapurve@google.com>
[sean: drop shared/private helpers (let tests specify flags)]
Signed-off-by: Sean Christopherson <seanjc@google.com>
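
A guest-side sketch using the existing KVM_MAP_GPA_RANGE flags from
<linux/kvm_para.h> (gpa and size are whatever range the test is converting):

  	/* Ask the host to treat [gpa, gpa + size) as shared... */
  	kvm_hypercall_map_gpa_range(gpa, size, KVM_MAP_GPA_RANGE_DECRYPTED |
  						KVM_MAP_GPA_RANGE_PAGE_SZ_4K);

  	/* ...and later flip it back to private. */
  	kvm_hypercall_map_gpa_range(gpa, size, KVM_MAP_GPA_RANGE_ENCRYPTED |
  						KVM_MAP_GPA_RANGE_PAGE_SZ_4K);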
---
 .../selftests/kvm/include/x86_64/processor.h      | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 4fd042112526..1911c12d5bad 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -15,6 +15,7 @@
 #include <asm/msr-index.h>
 #include <asm/prctl.h>
 
+#include <linux/kvm_para.h>
 #include <linux/stringify.h>
 
 #include "../kvm_util.h"
@@ -1171,6 +1172,20 @@ uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, uint64_t a1, uint64_t a2,
 uint64_t __xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
 void xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
 
+static inline uint64_t __kvm_hypercall_map_gpa_range(uint64_t gpa,
+						     uint64_t size, uint64_t flags)
+{
+	return kvm_hypercall(KVM_HC_MAP_GPA_RANGE, gpa, size >> PAGE_SHIFT, flags, 0);
+}
+
+static inline void kvm_hypercall_map_gpa_range(uint64_t gpa, uint64_t size,
+					       uint64_t flags)
+{
+	uint64_t ret = __kvm_hypercall_map_gpa_range(gpa, size, flags);
+
+	GUEST_ASSERT(!ret);
+}
+
 void __vm_xsave_require_permission(uint64_t xfeature, const char *name);
 
 #define vm_xsave_require_permission(xfeature)	\
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 27/33] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (25 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 26/33] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86) Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 28/33] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data Sean Christopherson
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add a "vm_shape" structure to encapsulate the selftests-defined "mode",
along with the KVM-defined "type" for use when creating a new VM.  "mode"
tracks physical and virtual address properties, as well as the preferred
backing memory type, while "type" corresponds to the VM type.

Taking the VM type will allow adding tests for KVM_CREATE_GUEST_MEMFD,
a.k.a. guest private memory, without needing an entirely separate set of
helpers.  Guest private memory is effectively usable only by confidential
VM types, and it's expected that x86 will double down and require unique
VM types for TDX and SNP guests.

Signed-off-by: Sean Christopherson <seanjc@google.com>
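
E.g. an x86 test that wants a software-protected VM (using the VM type added
earlier in this series, and assuming a guest_code() entry point) can do:

  	const struct vm_shape shape = {
  		.mode = VM_MODE_DEFAULT,
  		.type = KVM_X86_SW_PROTECTED_VM,
  	};
  	struct kvm_vcpu *vcpu;
  	struct kvm_vm *vm;

  	vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);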
---
 tools/testing/selftests/kvm/dirty_log_test.c  |  2 +-
 .../selftests/kvm/include/kvm_util_base.h     | 54 +++++++++++++++----
 .../selftests/kvm/kvm_page_table_test.c       |  2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    | 43 +++++++--------
 tools/testing/selftests/kvm/lib/memstress.c   |  3 +-
 .../kvm/x86_64/ucna_injection_test.c          |  2 +-
 6 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 936f3a8d1b83..6cbecf499767 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -699,7 +699,7 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, struct kvm_vcpu **vcpu,
 
 	pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode));
 
-	vm = __vm_create(mode, 1, extra_mem_pages);
+	vm = __vm_create(VM_SHAPE(mode), 1, extra_mem_pages);
 
 	log_mode_create_vm_done(vm);
 	*vcpu = vm_vcpu_add(vm, 0, guest_code);
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index a0315503ac3e..b608fbb832d5 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -188,6 +188,23 @@ enum vm_guest_mode {
 	NUM_VM_MODES,
 };
 
+struct vm_shape {
+	enum vm_guest_mode mode;
+	unsigned int type;
+};
+
+#define VM_TYPE_DEFAULT			0
+
+#define VM_SHAPE(__mode)			\
+({						\
+	struct vm_shape shape = {		\
+		.mode = (__mode),		\
+		.type = VM_TYPE_DEFAULT		\
+	};					\
+						\
+	shape;					\
+})
+
 #if defined(__aarch64__)
 
 extern enum vm_guest_mode vm_mode_default;
@@ -220,6 +237,8 @@ extern enum vm_guest_mode vm_mode_default;
 
 #endif
 
+#define VM_SHAPE_DEFAULT	VM_SHAPE(VM_MODE_DEFAULT)
+
 #define MIN_PAGE_SIZE		(1U << MIN_PAGE_SHIFT)
 #define PTES_PER_MIN_PAGE	ptes_per_page(MIN_PAGE_SIZE)
 
@@ -784,21 +803,21 @@ vm_paddr_t vm_alloc_page_table(struct kvm_vm *vm);
  * __vm_create() does NOT create vCPUs, @nr_runnable_vcpus is used purely to
  * calculate the amount of memory needed for per-vCPU data, e.g. stacks.
  */
-struct kvm_vm *____vm_create(enum vm_guest_mode mode);
-struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
+struct kvm_vm *____vm_create(struct vm_shape shape);
+struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
 			   uint64_t nr_extra_pages);
 
 static inline struct kvm_vm *vm_create_barebones(void)
 {
-	return ____vm_create(VM_MODE_DEFAULT);
+	return ____vm_create(VM_SHAPE_DEFAULT);
 }
 
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
 {
-	return __vm_create(VM_MODE_DEFAULT, nr_runnable_vcpus, 0);
+	return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
 }
 
-struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
+struct kvm_vm *__vm_create_with_vcpus(struct vm_shape shape, uint32_t nr_vcpus,
 				      uint64_t extra_mem_pages,
 				      void *guest_code, struct kvm_vcpu *vcpus[]);
 
@@ -806,17 +825,27 @@ static inline struct kvm_vm *vm_create_with_vcpus(uint32_t nr_vcpus,
 						  void *guest_code,
 						  struct kvm_vcpu *vcpus[])
 {
-	return __vm_create_with_vcpus(VM_MODE_DEFAULT, nr_vcpus, 0,
+	return __vm_create_with_vcpus(VM_SHAPE_DEFAULT, nr_vcpus, 0,
 				      guest_code, vcpus);
 }
 
+
+struct kvm_vm *__vm_create_shape_with_one_vcpu(struct vm_shape shape,
+					       struct kvm_vcpu **vcpu,
+					       uint64_t extra_mem_pages,
+					       void *guest_code);
+
 /*
  * Create a VM with a single vCPU with reasonable defaults and @extra_mem_pages
  * additional pages of guest memory.  Returns the VM and vCPU (via out param).
  */
-struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
-					 uint64_t extra_mem_pages,
-					 void *guest_code);
+static inline struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
+						       uint64_t extra_mem_pages,
+						       void *guest_code)
+{
+	return __vm_create_shape_with_one_vcpu(VM_SHAPE_DEFAULT, vcpu,
+					       extra_mem_pages, guest_code);
+}
 
 static inline struct kvm_vm *vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
 						     void *guest_code)
@@ -824,6 +853,13 @@ static inline struct kvm_vm *vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
 	return __vm_create_with_one_vcpu(vcpu, 0, guest_code);
 }
 
+static inline struct kvm_vm *vm_create_shape_with_one_vcpu(struct vm_shape shape,
+							   struct kvm_vcpu **vcpu,
+							   void *guest_code)
+{
+	return __vm_create_shape_with_one_vcpu(shape, vcpu, 0, guest_code);
+}
+
 struct kvm_vcpu *vm_recreate_with_one_vcpu(struct kvm_vm *vm);
 
 void kvm_pin_this_task_to_pcpu(uint32_t pcpu);
diff --git a/tools/testing/selftests/kvm/kvm_page_table_test.c b/tools/testing/selftests/kvm/kvm_page_table_test.c
index 69f26d80c821..e37dc9c21888 100644
--- a/tools/testing/selftests/kvm/kvm_page_table_test.c
+++ b/tools/testing/selftests/kvm/kvm_page_table_test.c
@@ -254,7 +254,7 @@ static struct kvm_vm *pre_init_before_test(enum vm_guest_mode mode, void *arg)
 
 	/* Create a VM with enough guest pages */
 	guest_num_pages = test_mem_size / guest_page_size;
-	vm = __vm_create_with_vcpus(mode, nr_vcpus, guest_num_pages,
+	vm = __vm_create_with_vcpus(VM_SHAPE(mode), nr_vcpus, guest_num_pages,
 				    guest_code, test_args.vcpus);
 
 	/* Align down GPA of the testing memslot */
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index bf2bd5c39a96..68afea10b469 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -209,7 +209,7 @@ __weak void vm_vaddr_populate_bitmap(struct kvm_vm *vm)
 		(1ULL << (vm->va_bits - 1)) >> vm->page_shift);
 }
 
-struct kvm_vm *____vm_create(enum vm_guest_mode mode)
+struct kvm_vm *____vm_create(struct vm_shape shape)
 {
 	struct kvm_vm *vm;
 
@@ -221,13 +221,13 @@ struct kvm_vm *____vm_create(enum vm_guest_mode mode)
 	vm->regions.hva_tree = RB_ROOT;
 	hash_init(vm->regions.slot_hash);
 
-	vm->mode = mode;
-	vm->type = 0;
+	vm->mode = shape.mode;
+	vm->type = shape.type;
 
-	vm->pa_bits = vm_guest_mode_params[mode].pa_bits;
-	vm->va_bits = vm_guest_mode_params[mode].va_bits;
-	vm->page_size = vm_guest_mode_params[mode].page_size;
-	vm->page_shift = vm_guest_mode_params[mode].page_shift;
+	vm->pa_bits = vm_guest_mode_params[vm->mode].pa_bits;
+	vm->va_bits = vm_guest_mode_params[vm->mode].va_bits;
+	vm->page_size = vm_guest_mode_params[vm->mode].page_size;
+	vm->page_shift = vm_guest_mode_params[vm->mode].page_shift;
 
 	/* Setup mode specific traits. */
 	switch (vm->mode) {
@@ -265,7 +265,7 @@ struct kvm_vm *____vm_create(enum vm_guest_mode mode)
 		/*
 		 * Ignore KVM support for 5-level paging (vm->va_bits == 57),
 		 * it doesn't take effect unless a CR4.LA57 is set, which it
-		 * isn't for this VM_MODE.
+		 * isn't for this mode (48-bit virtual address space).
 		 */
 		TEST_ASSERT(vm->va_bits == 48 || vm->va_bits == 57,
 			    "Linear address width (%d bits) not supported",
@@ -285,10 +285,11 @@ struct kvm_vm *____vm_create(enum vm_guest_mode mode)
 		vm->pgtable_levels = 5;
 		break;
 	default:
-		TEST_FAIL("Unknown guest mode, mode: 0x%x", mode);
+		TEST_FAIL("Unknown guest mode: 0x%x", vm->mode);
 	}
 
 #ifdef __aarch64__
+	TEST_ASSERT(!vm->type, "ARM doesn't support test-provided types");
 	if (vm->pa_bits != 40)
 		vm->type = KVM_VM_TYPE_ARM_IPA_SIZE(vm->pa_bits);
 #endif
@@ -347,19 +348,19 @@ static uint64_t vm_nr_pages_required(enum vm_guest_mode mode,
 	return vm_adjust_num_guest_pages(mode, nr_pages);
 }
 
-struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
+struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
 			   uint64_t nr_extra_pages)
 {
-	uint64_t nr_pages = vm_nr_pages_required(mode, nr_runnable_vcpus,
+	uint64_t nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
 						 nr_extra_pages);
 	struct userspace_mem_region *slot0;
 	struct kvm_vm *vm;
 	int i;
 
-	pr_debug("%s: mode='%s' pages='%ld'\n", __func__,
-		 vm_guest_mode_string(mode), nr_pages);
+	pr_debug("%s: mode='%s' type='%d', pages='%ld'\n", __func__,
+		 vm_guest_mode_string(shape.mode), shape.type, nr_pages);
 
-	vm = ____vm_create(mode);
+	vm = ____vm_create(shape);
 
 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 0);
 	for (i = 0; i < NR_MEM_REGIONS; i++)
@@ -400,7 +401,7 @@ struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
  * extra_mem_pages is only used to calculate the maximum page table size,
  * no real memory allocation for non-slot0 memory in this function.
  */
-struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
+struct kvm_vm *__vm_create_with_vcpus(struct vm_shape shape, uint32_t nr_vcpus,
 				      uint64_t extra_mem_pages,
 				      void *guest_code, struct kvm_vcpu *vcpus[])
 {
@@ -409,7 +410,7 @@ struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus
 
 	TEST_ASSERT(!nr_vcpus || vcpus, "Must provide vCPU array");
 
-	vm = __vm_create(mode, nr_vcpus, extra_mem_pages);
+	vm = __vm_create(shape, nr_vcpus, extra_mem_pages);
 
 	for (i = 0; i < nr_vcpus; ++i)
 		vcpus[i] = vm_vcpu_add(vm, i, guest_code);
@@ -417,15 +418,15 @@ struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus
 	return vm;
 }
 
-struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
-					 uint64_t extra_mem_pages,
-					 void *guest_code)
+struct kvm_vm *__vm_create_shape_with_one_vcpu(struct vm_shape shape,
+					       struct kvm_vcpu **vcpu,
+					       uint64_t extra_mem_pages,
+					       void *guest_code)
 {
 	struct kvm_vcpu *vcpus[1];
 	struct kvm_vm *vm;
 
-	vm = __vm_create_with_vcpus(VM_MODE_DEFAULT, 1, extra_mem_pages,
-				    guest_code, vcpus);
+	vm = __vm_create_with_vcpus(shape, 1, extra_mem_pages, guest_code, vcpus);
 
 	*vcpu = vcpus[0];
 	return vm;
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index df457452d146..d05487e5a371 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -168,7 +168,8 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 	 * The memory is also added to memslot 0, but that's a benign side
 	 * effect as KVM allows aliasing HVAs in meslots.
 	 */
-	vm = __vm_create_with_vcpus(mode, nr_vcpus, slot0_pages + guest_num_pages,
+	vm = __vm_create_with_vcpus(VM_SHAPE(mode), nr_vcpus,
+				    slot0_pages + guest_num_pages,
 				    memstress_guest_code, vcpus);
 
 	args->vm = vm;
diff --git a/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c b/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c
index 85f34ca7e49e..0ed32ec903d0 100644
--- a/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c
+++ b/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c
@@ -271,7 +271,7 @@ int main(int argc, char *argv[])
 
 	kvm_check_cap(KVM_CAP_MCE);
 
-	vm = __vm_create(VM_MODE_DEFAULT, 3, 0);
+	vm = __vm_create(VM_SHAPE_DEFAULT, 3, 0);
 
 	kvm_ioctl(vm->kvm_fd, KVM_X86_GET_MCE_CAP_SUPPORTED,
 		  &supported_mcg_caps);
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 28/33] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (26 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 27/33] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 29/33] KVM: selftests: Add x86-only selftest for private memory conversions Sean Christopherson
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add GUEST_SYNC[1-6]() so that tests can pass the maximum amount of
information supported via ucall(), without needing to resort to shared
memory.

Signed-off-by: Sean Christopherson <seanjc@google.com>
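
Sketch of the intended usage; SYNC_CONVERT_TO_SHARED and handle_cmd() are
hypothetical, test-defined names:

  	/* Guest: pack a command and three operands into a single ucall. */
  	GUEST_SYNC4(SYNC_CONVERT_TO_SHARED, gpa, size, pattern);

  	/* Host: pull the same values back out of the ucall. */
  	struct ucall uc;

  	TEST_ASSERT(get_ucall(vcpu, &uc) == UCALL_SYNC, "Expected UCALL_SYNC");
  	handle_cmd(vm, uc.args[0], uc.args[1], uc.args[2], uc.args[3]);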
---
 tools/testing/selftests/kvm/include/ucall_common.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/ucall_common.h b/tools/testing/selftests/kvm/include/ucall_common.h
index 112bc1da732a..7cf40aba7add 100644
--- a/tools/testing/selftests/kvm/include/ucall_common.h
+++ b/tools/testing/selftests/kvm/include/ucall_common.h
@@ -54,6 +54,17 @@ int ucall_nr_pages_required(uint64_t page_size);
 #define GUEST_SYNC_ARGS(stage, arg1, arg2, arg3, arg4)	\
 				ucall(UCALL_SYNC, 6, "hello", stage, arg1, arg2, arg3, arg4)
 #define GUEST_SYNC(stage)	ucall(UCALL_SYNC, 2, "hello", stage)
+#define GUEST_SYNC1(arg0)	ucall(UCALL_SYNC, 1, arg0)
+#define GUEST_SYNC2(arg0, arg1)	ucall(UCALL_SYNC, 2, arg0, arg1)
+#define GUEST_SYNC3(arg0, arg1, arg2) \
+				ucall(UCALL_SYNC, 3, arg0, arg1, arg2)
+#define GUEST_SYNC4(arg0, arg1, arg2, arg3) \
+				ucall(UCALL_SYNC, 4, arg0, arg1, arg2, arg3)
+#define GUEST_SYNC5(arg0, arg1, arg2, arg3, arg4) \
+				ucall(UCALL_SYNC, 5, arg0, arg1, arg2, arg3, arg4)
+#define GUEST_SYNC6(arg0, arg1, arg2, arg3, arg4, arg5) \
+				ucall(UCALL_SYNC, 6, arg0, arg1, arg2, arg3, arg4, arg5)
+
 #define GUEST_PRINTF(_fmt, _args...) ucall_fmt(UCALL_PRINTF, _fmt, ##_args)
 #define GUEST_DONE()		ucall(UCALL_DONE, 0)
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 29/33] KVM: selftests: Add x86-only selftest for private memory conversions
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (27 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 28/33] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 30/33] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper Sean Christopherson
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Vishal Annapurve <vannapurve@google.com>

Add a selftest to exercise implicit/explicit conversion functionality
within KVM and verify:

 - Shared memory is visible to host userspace
 - Private memory is not visible to host userspace
 - Host userspace and guest can communicate over shared memory
 - Data in shared backing is preserved across conversions (test's
   host userspace doesn't free the data)
 - Private memory is bound to the lifetime of the VM

Ideally, KVM's selftests infrastructure would be reworked to allow backing
a single region of guest memory with multiple memslots for _all_ backing
types and shapes, i.e. ideally the code for using a single backing fd
across multiple memslots would work for "regular" memory as well.  But
sadly, support for KVM_CREATE_GUEST_MEMFD has languished for far too long,
and overhauling selftests' memslots infrastructure would likely open a can
of worms, i.e. delay things even further.
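
For orientation before the test itself (an editorial sketch, not part of the
patch), a single explicit conversion boils down to the round-trip below; the
helpers are the ones defined in the test that follows.

	/* Guest: request a shared => private conversion via hypercall. */
	kvm_hypercall_map_gpa_range(gpa, size, 0 /* !MAP_GPA_SHARED */);

	/*
	 * Host: KVM exits with KVM_EXIT_HYPERCALL; userspace flips the
	 * memory attributes so the range is backed privately.
	 */
	if (run->exit_reason == KVM_EXIT_HYPERCALL &&
	    run->hypercall.nr == KVM_HC_MAP_GPA_RANGE)
		vm_set_memory_attributes(vm, gpa, size,
					 KVM_MEMORY_ATTRIBUTE_PRIVATE);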

Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../kvm/x86_64/private_mem_conversions_test.c | 410 ++++++++++++++++++
 2 files changed, 411 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index a3bb36fb3cfc..b709a52d5cdb 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -81,6 +81,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
 TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
new file mode 100644
index 000000000000..50541246d6fd
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -0,0 +1,410 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <fcntl.h>
+#include <limits.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+#include <linux/kvm_para.h>
+#include <linux/memfd.h>
+#include <linux/sizes.h>
+
+#include <test_util.h>
+#include <kvm_util.h>
+#include <processor.h>
+
+#define BASE_DATA_SLOT		10
+#define BASE_DATA_GPA		((uint64_t)(1ull << 32))
+#define PER_CPU_DATA_SIZE	((uint64_t)(SZ_2M + PAGE_SIZE))
+
+/* Horrific macro so that the line info is captured accurately :-( */
+#define memcmp_g(gpa, pattern,  size)							\
+do {											\
+	uint8_t *mem = (uint8_t *)gpa;							\
+	size_t i;									\
+											\
+	for (i = 0; i < size; i++)							\
+		__GUEST_ASSERT(mem[i] == pattern,					\
+			       "Expected 0x%x at offset %lu (gpa 0x%llx), got 0x%x",	\
+			       pattern, i, gpa + i, mem[i]);				\
+} while (0)
+
+static void memcmp_h(uint8_t *mem, uint8_t pattern, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		TEST_ASSERT(mem[i] == pattern,
+			    "Expected 0x%x at offset %lu, got 0x%x",
+			    pattern, i, mem[i]);
+}
+
+/*
+ * Run memory conversion tests with explicit conversion:
+ * Execute KVM hypercall to map/unmap gpa range which will cause userspace exit
+ * to back/unback private memory. Subsequent accesses by guest to the gpa range
+ * will not cause exit to userspace.
+ *
+ * Test memory conversion scenarios with following steps:
+ * 1) Access private memory using private access and verify that memory contents
+ *   are not visible to userspace.
+ * 2) Convert memory to shared using explicit conversions and ensure that
+ *   userspace is able to access the shared regions.
+ * 3) Convert memory back to private using explicit conversions and ensure that
+ *   userspace is again not able to access converted private regions.
+ */
+
+#define GUEST_STAGE(o, s) { .offset = o, .size = s }
+
+enum ucall_syncs {
+	SYNC_SHARED,
+	SYNC_PRIVATE,
+};
+
+static void guest_sync_shared(uint64_t gpa, uint64_t size,
+			      uint8_t current_pattern, uint8_t new_pattern)
+{
+	GUEST_SYNC5(SYNC_SHARED, gpa, size, current_pattern, new_pattern);
+}
+
+static void guest_sync_private(uint64_t gpa, uint64_t size, uint8_t pattern)
+{
+	GUEST_SYNC4(SYNC_PRIVATE, gpa, size, pattern);
+}
+
+/* Arbitrary values, KVM doesn't care about the attribute flags. */
+#define MAP_GPA_SHARED		BIT(0)
+#define MAP_GPA_DO_FALLOCATE	BIT(1)
+
+static void guest_map_mem(uint64_t gpa, uint64_t size, bool map_shared,
+			  bool do_fallocate)
+{
+	uint64_t flags = 0;
+
+	if (map_shared)
+		flags |= MAP_GPA_SHARED;
+	if (do_fallocate)
+		flags |= MAP_GPA_DO_FALLOCATE;
+	kvm_hypercall_map_gpa_range(gpa, size, flags);
+}
+
+static void guest_map_shared(uint64_t gpa, uint64_t size, bool do_fallocate)
+{
+	guest_map_mem(gpa, size, true, do_fallocate);
+}
+
+static void guest_map_private(uint64_t gpa, uint64_t size, bool do_fallocate)
+{
+	guest_map_mem(gpa, size, false, do_fallocate);
+}
+
+static void guest_run_test(uint64_t base_gpa, bool do_fallocate)
+{
+	struct {
+		uint64_t offset;
+		uint64_t size;
+		uint8_t pattern;
+	} stages[] = {
+		GUEST_STAGE(0, PAGE_SIZE),
+		GUEST_STAGE(0, SZ_2M),
+		GUEST_STAGE(PAGE_SIZE, PAGE_SIZE),
+		GUEST_STAGE(PAGE_SIZE, SZ_2M),
+		GUEST_STAGE(SZ_2M, PAGE_SIZE),
+	};
+	const uint8_t init_p = 0xcc;
+	uint64_t j;
+	int i;
+
+	/* Memory should be shared by default. */
+	memset((void *)base_gpa, ~init_p, PER_CPU_DATA_SIZE);
+	guest_sync_shared(base_gpa, PER_CPU_DATA_SIZE, (uint8_t)~init_p, init_p);
+	memcmp_g(base_gpa, init_p, PER_CPU_DATA_SIZE);
+
+	for (i = 0; i < ARRAY_SIZE(stages); i++) {
+		uint64_t gpa = base_gpa + stages[i].offset;
+		uint64_t size = stages[i].size;
+		uint8_t p1 = 0x11;
+		uint8_t p2 = 0x22;
+		uint8_t p3 = 0x33;
+		uint8_t p4 = 0x44;
+
+		/*
+		 * Set the test region to pattern one to differentiate it from
+		 * the data range as a whole (contains the initial pattern).
+		 */
+		memset((void *)gpa, p1, size);
+
+		/*
+		 * Convert to private, set and verify the private data, and
+		 * then verify that the rest of the data (map shared) still
+		 * holds the initial pattern, and that the host always sees the
+		 * shared memory (initial pattern).  Unlike shared memory,
+		 * punching a hole in private memory is destructive, i.e.
+		 * previous values aren't guaranteed to be preserved.
+		 */
+		guest_map_private(gpa, size, do_fallocate);
+
+		if (size > PAGE_SIZE) {
+			memset((void *)gpa, p2, PAGE_SIZE);
+			goto skip;
+		}
+
+		memset((void *)gpa, p2, size);
+		guest_sync_private(gpa, size, p1);
+
+		/*
+		 * Verify that the private memory was set to pattern two, and
+		 * that shared memory still holds the initial pattern.
+		 */
+		memcmp_g(gpa, p2, size);
+		if (gpa > base_gpa)
+			memcmp_g(base_gpa, init_p, gpa - base_gpa);
+		if (gpa + size < base_gpa + PER_CPU_DATA_SIZE)
+			memcmp_g(gpa + size, init_p,
+				 (base_gpa + PER_CPU_DATA_SIZE) - (gpa + size));
+
+		/*
+		 * Convert odd-number page frames back to shared to verify KVM
+		 * also correctly handles holes in private ranges.
+		 */
+		for (j = 0; j < size; j += PAGE_SIZE) {
+			if ((j >> PAGE_SHIFT) & 1) {
+				guest_map_shared(gpa + j, PAGE_SIZE, do_fallocate);
+				guest_sync_shared(gpa + j, PAGE_SIZE, p1, p3);
+
+				memcmp_g(gpa + j, p3, PAGE_SIZE);
+			} else {
+				guest_sync_private(gpa + j, PAGE_SIZE, p1);
+			}
+		}
+
+skip:
+		/*
+		 * Convert the entire region back to shared, explicitly write
+		 * pattern three to fill in the even-number frames before
+		 * asking the host to verify (and write pattern four).
+		 */
+		guest_map_shared(gpa, size, do_fallocate);
+		memset((void *)gpa, p3, size);
+		guest_sync_shared(gpa, size, p3, p4);
+		memcmp_g(gpa, p4, size);
+
+		/* Reset the shared memory back to the initial pattern. */
+		memset((void *)gpa, init_p, size);
+
+		/*
+		 * Free (via PUNCH_HOLE) *all* private memory so that the next
+		 * iteration starts from a clean slate, e.g. with respect to
+		 * whether or not there are pages/folios in guest_mem.
+		 */
+		guest_map_shared(base_gpa, PER_CPU_DATA_SIZE, true);
+	}
+}
+
+static void guest_code(uint64_t base_gpa)
+{
+	/*
+	 * Run everything twice, with and without doing fallocate() on the
+	 * guest_memfd backing when converting between shared and private.
+	 */
+	guest_run_test(base_gpa, false);
+	guest_run_test(base_gpa, true);
+	GUEST_DONE();
+}
+
+static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
+{
+	struct kvm_run *run = vcpu->run;
+	uint64_t gpa = run->hypercall.args[0];
+	uint64_t size = run->hypercall.args[1] * PAGE_SIZE;
+	bool map_shared = run->hypercall.args[2] & MAP_GPA_SHARED;
+	bool do_fallocate = run->hypercall.args[2] & MAP_GPA_DO_FALLOCATE;
+	struct kvm_vm *vm = vcpu->vm;
+
+	TEST_ASSERT(run->hypercall.nr == KVM_HC_MAP_GPA_RANGE,
+		    "Wanted MAP_GPA_RANGE (%u), got '%llu'",
+		    KVM_HC_MAP_GPA_RANGE, run->hypercall.nr);
+
+	if (do_fallocate)
+		vm_guest_mem_fallocate(vm, gpa, size, map_shared);
+
+	vm_set_memory_attributes(vm, gpa, size,
+				 map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE);
+	run->hypercall.ret = 0;
+}
+
+static bool run_vcpus;
+
+static void *__test_mem_conversions(void *__vcpu)
+{
+	struct kvm_vcpu *vcpu = __vcpu;
+	struct kvm_run *run = vcpu->run;
+	struct kvm_vm *vm = vcpu->vm;
+	struct ucall uc;
+
+	while (!READ_ONCE(run_vcpus))
+		;
+
+	for ( ;; ) {
+		vcpu_run(vcpu);
+
+		if (run->exit_reason == KVM_EXIT_HYPERCALL) {
+			handle_exit_hypercall(vcpu);
+			continue;
+		}
+
+		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+			    "Wanted KVM_EXIT_IO, got exit reason: %u (%s)",
+			    run->exit_reason, exit_reason_str(run->exit_reason));
+
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+		case UCALL_SYNC: {
+			uint8_t *hva = addr_gpa2hva(vm, uc.args[1]);
+			uint64_t size = uc.args[2];
+
+			TEST_ASSERT(uc.args[0] == SYNC_SHARED ||
+				    uc.args[0] == SYNC_PRIVATE,
+				    "Unknown sync command '%ld'", uc.args[0]);
+
+			/* In all cases, the host should observe the shared data. */
+			memcmp_h(hva, uc.args[3], size);
+
+			/* For shared, write the new pattern to guest memory. */
+			if (uc.args[0] == SYNC_SHARED)
+				memset(hva, uc.args[4], size);
+			break;
+		}
+		case UCALL_DONE:
+			return NULL;
+		default:
+			TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+		}
+	}
+}
+
+static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t nr_vcpus,
+				 uint32_t nr_memslots)
+{
+	/*
+	 * Allocate enough memory so that each vCPU's chunk of memory can be
+	 * naturally aligned with respect to the size of the backing store.
+	 */
+	const size_t size = align_up(PER_CPU_DATA_SIZE, get_backing_src_pagesz(src_type));
+	const size_t memfd_size = size * nr_vcpus;
+	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
+	pthread_t threads[KVM_MAX_VCPUS];
+	uint64_t gmem_flags;
+	struct kvm_vm *vm;
+	int memfd, i, r;
+
+	const struct vm_shape shape = {
+		.mode = VM_MODE_DEFAULT,
+		.type = KVM_X86_SW_PROTECTED_VM,
+	};
+
+	vm = __vm_create_with_vcpus(shape, nr_vcpus, 0, guest_code, vcpus);
+
+	vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
+
+	if (backing_src_can_be_huge(src_type))
+		gmem_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+	else
+		gmem_flags = 0;
+	memfd = vm_create_guest_memfd(vm, memfd_size, gmem_flags);
+
+	for (i = 0; i < nr_memslots; i++)
+		vm_mem_add(vm, src_type, BASE_DATA_GPA + size * i,
+			   BASE_DATA_SLOT + i, size / vm->page_size,
+			   KVM_MEM_PRIVATE, memfd, size * i);
+
+	for (i = 0; i < nr_vcpus; i++) {
+		uint64_t gpa =  BASE_DATA_GPA + i * size;
+
+		vcpu_args_set(vcpus[i], 1, gpa);
+
+		virt_map(vm, gpa, gpa, size / vm->page_size);
+
+		pthread_create(&threads[i], NULL, __test_mem_conversions, vcpus[i]);
+	}
+
+	WRITE_ONCE(run_vcpus, true);
+
+	for (i = 0; i < nr_vcpus; i++)
+		pthread_join(threads[i], NULL);
+
+	kvm_vm_free(vm);
+
+	/*
+	 * Allocate and free memory from the guest_memfd after closing the VM
+	 * fd.  The guest_memfd is gifted a reference to its owning VM, i.e.
+	 * should prevent the VM from being fully destroyed until the last
+	 * reference to the guest_memfd is also put.
+	 */
+	r = fallocate(memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, memfd_size);
+	TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("fallocate()", r));
+
+	r = fallocate(memfd, FALLOC_FL_KEEP_SIZE, 0, memfd_size);
+	TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("fallocate()", r));
+}
+
+static void usage(const char *cmd)
+{
+	puts("");
+	printf("usage: %s [-h] [-m] [-s mem_type] [-n nr_vcpus]\n", cmd);
+	puts("");
+	backing_src_help("-s");
+	puts("");
+	puts(" -n: specify the number of vcpus (default: 1)");
+	puts("");
+	puts(" -m: use multiple memslots (default: 1)");
+	puts("");
+}
+
+int main(int argc, char *argv[])
+{
+	enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC;
+	bool use_multiple_memslots = false;
+	uint32_t nr_vcpus = 1;
+	uint32_t nr_memslots;
+	int opt;
+
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_EXIT_HYPERCALL));
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+
+	while ((opt = getopt(argc, argv, "hms:n:")) != -1) {
+		switch (opt) {
+		case 's':
+			src_type = parse_backing_src_type(optarg);
+			break;
+		case 'n':
+			nr_vcpus = atoi_positive("nr_vcpus", optarg);
+			break;
+		case 'm':
+			use_multiple_memslots = true;
+			break;
+		case 'h':
+		default:
+			usage(argv[0]);
+			exit(0);
+		}
+	}
+
+	nr_memslots = use_multiple_memslots ? nr_vcpus : 1;
+
+	test_mem_conversions(src_type, nr_vcpus, nr_memslots);
+
+	return 0;
+}
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 30/33] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (28 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 29/33] KVM: selftests: Add x86-only selftest for private memory conversions Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 31/33] KVM: selftests: Expand set_memory_region_test to validate guest_memfd() Sean Christopherson
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Add helpers to invoke KVM_SET_USER_MEMORY_REGION2 directly so that tests
can validate features that are unique to "version 2" of "set user
memory region", e.g. do negative testing on gmem_fd and gmem_offset.

Provide a raw version as well as an assert-success version to reduce
the amount of boilerplate code needed for basic usage.
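
A hedged sketch of the intended usage (the real testcases land in a later
patch; the memfd/gmem_fd variables and MEM_REGION_* constants here are
illustrative):

	/* Negative test: binding a regular memfd (not a guest_memfd) fails. */
	r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
					 MEM_REGION_GPA, MEM_REGION_SIZE,
					 NULL, memfd, 0);
	TEST_ASSERT(r == -1 && errno == EINVAL,
		    "Regular memfd() should fail with EINVAL");

	/* Assert-success variant for the expected-good path. */
	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
				   MEM_REGION_GPA, MEM_REGION_SIZE,
				   NULL, gmem_fd, 0);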

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../selftests/kvm/include/kvm_util_base.h     |  7 +++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 29 +++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index b608fbb832d5..edc0f380acc0 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -522,6 +522,13 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 			       uint64_t gpa, uint64_t size, void *hva);
 int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 				uint64_t gpa, uint64_t size, void *hva);
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot,
+				uint32_t flags, uint64_t gpa, uint64_t size,
+				void *hva, uint32_t gmem_fd, uint64_t gmem_offset);
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot,
+				 uint32_t flags, uint64_t gpa, uint64_t size,
+				 void *hva, uint32_t gmem_fd, uint64_t gmem_offset);
+
 void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	enum vm_mem_backing_src_type src_type,
 	uint64_t guest_paddr, uint32_t slot, uint64_t npages,
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 68afea10b469..8fc70c021c1c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -873,6 +873,35 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 		    errno, strerror(errno));
 }
 
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot,
+				 uint32_t flags, uint64_t gpa, uint64_t size,
+				 void *hva, uint32_t gmem_fd, uint64_t gmem_offset)
+{
+	struct kvm_userspace_memory_region2 region = {
+		.slot = slot,
+		.flags = flags,
+		.guest_phys_addr = gpa,
+		.memory_size = size,
+		.userspace_addr = (uintptr_t)hva,
+		.gmem_fd = gmem_fd,
+		.gmem_offset = gmem_offset,
+	};
+
+	return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, &region);
+}
+
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot,
+				uint32_t flags, uint64_t gpa, uint64_t size,
+				void *hva, uint32_t gmem_fd, uint64_t gmem_offset)
+{
+	int ret = __vm_set_user_memory_region2(vm, slot, flags, gpa, size, hva,
+					       gmem_fd, gmem_offset);
+
+	TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed, errno = %d (%s)",
+		    errno, strerror(errno));
+}
+
+
 /* FIXME: This thing needs to be ripped apart and rewritten. */
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 31/33] KVM: selftests: Expand set_memory_region_test to validate guest_memfd()
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (29 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 30/33] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 32/33] KVM: selftests: Add basic selftest for guest_memfd() Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 33/33] KVM: selftests: Test KVM exit behavior for private memory/access Sean Christopherson
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Expand set_memory_region_test to exercise various positive and negative
testcases for private memory.

 - Non-guest_memfd() file descriptor for private memory
 - guest_memfd() from different VM
 - Overlapping bindings
 - Unaligned bindings

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
[sean: trim the testcases to remove duplicate coverage]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/include/kvm_util_base.h     |  10 ++
 .../selftests/kvm/set_memory_region_test.c    | 100 ++++++++++++++++++
 2 files changed, 110 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index edc0f380acc0..ac9356108df6 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -819,6 +819,16 @@ static inline struct kvm_vm *vm_create_barebones(void)
 	return ____vm_create(VM_SHAPE_DEFAULT);
 }
 
+static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
+{
+	const struct vm_shape shape = {
+		.mode = VM_MODE_DEFAULT,
+		.type = KVM_X86_SW_PROTECTED_VM,
+	};
+
+	return ____vm_create(shape);
+}
+
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
 {
 	return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index b32960189f5f..ca83e3307a98 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -385,6 +385,98 @@ static void test_add_max_memory_regions(void)
 	kvm_vm_free(vm);
 }
 
+
+static void test_invalid_guest_memfd(struct kvm_vm *vm, int memfd,
+				     size_t offset, const char *msg)
+{
+	int r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+					     MEM_REGION_GPA, MEM_REGION_SIZE,
+					     0, memfd, offset);
+	TEST_ASSERT(r == -1 && errno == EINVAL, "%s", msg);
+}
+
+static void test_add_private_memory_region(void)
+{
+	struct kvm_vm *vm, *vm2;
+	int memfd, i;
+
+	pr_info("Testing ADD of KVM_MEM_PRIVATE memory regions\n");
+
+	vm = vm_create_barebones_protected_vm();
+
+	test_invalid_guest_memfd(vm, vm->kvm_fd, 0, "KVM fd should fail");
+	test_invalid_guest_memfd(vm, vm->fd, 0, "VM's fd should fail");
+
+	memfd = kvm_memfd_alloc(MEM_REGION_SIZE, false);
+	test_invalid_guest_memfd(vm, memfd, 0, "Regular memfd() should fail");
+	close(memfd);
+
+	vm2 = vm_create_barebones_protected_vm();
+	memfd = vm_create_guest_memfd(vm2, MEM_REGION_SIZE, 0);
+	test_invalid_guest_memfd(vm, memfd, 0, "Other VM's guest_memfd() should fail");
+
+	vm_set_user_memory_region2(vm2, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+				   MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+	close(memfd);
+	kvm_vm_free(vm2);
+
+	memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE, 0);
+	for (i = 1; i < PAGE_SIZE; i++)
+		test_invalid_guest_memfd(vm, memfd, i, "Unaligned offset should fail");
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+				   MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+	close(memfd);
+
+	kvm_vm_free(vm);
+}
+
+static void test_add_overlapping_private_memory_regions(void)
+{
+	struct kvm_vm *vm;
+	int memfd;
+	int r;
+
+	pr_info("Testing ADD of overlapping KVM_MEM_PRIVATE memory regions\n");
+
+	vm = vm_create_barebones_protected_vm();
+
+	memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE * 4, 0);
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+				   MEM_REGION_GPA, MEM_REGION_SIZE * 2, 0, memfd, 0);
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT + 1, KVM_MEM_PRIVATE,
+				   MEM_REGION_GPA * 2, MEM_REGION_SIZE * 2,
+				   0, memfd, MEM_REGION_SIZE * 2);
+
+	/*
+	 * Delete the first memslot, and then attempt to recreate it except
+	 * with a "bad" offset that results in overlap in the guest_memfd().
+	 */
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+				   MEM_REGION_GPA, 0, NULL, -1, 0);
+
+	/* Overlap the front half of the other slot. */
+	r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+					 MEM_REGION_GPA * 2 - MEM_REGION_SIZE,
+					 MEM_REGION_SIZE * 2,
+					 0, memfd, 0);
+	TEST_ASSERT(r == -1 && errno == EEXIST, "%s",
+		    "Overlapping guest_memfd() bindings should fail with EEXIST");
+
+	/* And now the back half of the other slot. */
+	r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+					 MEM_REGION_GPA * 2 + MEM_REGION_SIZE,
+					 MEM_REGION_SIZE * 2,
+					 0, memfd, 0);
+	TEST_ASSERT(r == -1 && errno == EEXIST, "%s",
+		    "Overlapping guest_memfd() bindings should fail with EEXIST");
+
+	close(memfd);
+	kvm_vm_free(vm);
+}
+
 int main(int argc, char *argv[])
 {
 #ifdef __x86_64__
@@ -401,6 +493,14 @@ int main(int argc, char *argv[])
 
 	test_add_max_memory_regions();
 
+	if (kvm_has_cap(KVM_CAP_GUEST_MEMFD) &&
+	    (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM))) {
+		test_add_private_memory_region();
+		test_add_overlapping_private_memory_regions();
+	} else {
+		pr_info("Skipping tests for KVM_MEM_PRIVATE memory regions\n");
+	}
+
 #ifdef __x86_64__
 	if (argc > 1)
 		loops = atoi_positive("Number of iterations", argv[1]);
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 32/33] KVM: selftests: Add basic selftest for guest_memfd()
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (30 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 31/33] KVM: selftests: Expand set_memory_region_test to validate guest_memfd() Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  2023-09-14  1:55 ` [RFC PATCH v12 33/33] KVM: selftests: Test KVM exit behavior for private memory/access Sean Christopherson
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Add a selftest to verify the basic functionality of guest_memfd():

+ file descriptor created with the guest_memfd() ioctl does not allow
  read/write/mmap operations
+ file size and block size as returned from fstat are as expected
+ fallocate(FALLOC_FL_PUNCH_HOLE) rejects offsets and lengths that are
  not page aligned
+ invalid inputs (misaligned size, invalid flags) are rejected

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 165 ++++++++++++++++++
 2 files changed, 166 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index b709a52d5cdb..2b1ef809d73a 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -124,6 +124,7 @@ TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += guest_memfd_test
 TEST_GEN_PROGS_x86_64 += guest_print_test
 TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
new file mode 100644
index 000000000000..75073645aaa1
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright Intel Corporation, 2023
+ *
+ * Author: Chao Peng <chao.p.peng@linux.intel.com>
+ */
+
+#define _GNU_SOURCE
+#include "test_util.h"
+#include "kvm_util_base.h"
+#include <linux/bitmap.h>
+#include <linux/falloc.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+#include <fcntl.h>
+
+static void test_file_read_write(int fd)
+{
+	char buf[64];
+
+	TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0,
+		    "read on a guest_mem fd should fail");
+	TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0,
+		    "write on a guest_mem fd should fail");
+	TEST_ASSERT(pread(fd, buf, sizeof(buf), 0) < 0,
+		    "pread on a guest_mem fd should fail");
+	TEST_ASSERT(pwrite(fd, buf, sizeof(buf), 0) < 0,
+		    "pwrite on a guest_mem fd should fail");
+}
+
+static void test_mmap(int fd, size_t page_size)
+{
+	char *mem;
+
+	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT_EQ(mem, MAP_FAILED);
+}
+
+static void test_file_size(int fd, size_t page_size, size_t total_size)
+{
+	struct stat sb;
+	int ret;
+
+	ret = fstat(fd, &sb);
+	TEST_ASSERT(!ret, "fstat should succeed");
+	TEST_ASSERT_EQ(sb.st_size, total_size);
+	TEST_ASSERT_EQ(sb.st_blksize, page_size);
+}
+
+static void test_fallocate(int fd, size_t page_size, size_t total_size)
+{
+	int ret;
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size);
+	TEST_ASSERT(!ret, "fallocate with aligned offset and size should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			page_size - 1, page_size);
+	TEST_ASSERT(ret, "fallocate with unaligned offset should fail");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, page_size);
+	TEST_ASSERT(ret, "fallocate beginning at total_size should fail");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, page_size);
+	TEST_ASSERT(ret, "fallocate beginning after total_size should fail");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			total_size, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			total_size + page_size, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			page_size, page_size - 1);
+	TEST_ASSERT(ret, "fallocate with unaligned size should fail");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			page_size, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) with aligned offset and size should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, page_size, page_size);
+	TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed");
+}
+
+static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
+{
+	uint64_t valid_flags = 0;
+	size_t page_size = getpagesize();
+	uint64_t flag;
+	size_t size;
+	int fd;
+
+	for (size = 1; size < page_size; size++) {
+		fd = __vm_create_guest_memfd(vm, size, 0);
+		TEST_ASSERT(fd == -1 && errno == EINVAL,
+			    "guest_memfd() with non-page-aligned page size '0x%lx' should fail with EINVAL",
+			    size);
+	}
+
+	if (thp_configured()) {
+		for (size = page_size * 2; size < get_trans_hugepagesz(); size += page_size) {
+			fd = __vm_create_guest_memfd(vm, size, KVM_GUEST_MEMFD_ALLOW_HUGEPAGE);
+			TEST_ASSERT(fd == -1 && errno == EINVAL,
+				    "guest_memfd() with non-hugepage-aligned page size '0x%lx' should fail with EINVAL",
+				    size);
+		}
+
+		valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+	}
+
+	for (flag = 1; flag; flag <<= 1) {
+		uint64_t bit;
+
+		if (flag & valid_flags)
+			continue;
+
+		fd = __vm_create_guest_memfd(vm, page_size, flag);
+		TEST_ASSERT(fd == -1 && errno == EINVAL,
+			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
+			    flag);
+
+		for_each_set_bit(bit, &valid_flags, 64) {
+			fd = __vm_create_guest_memfd(vm, page_size, flag | BIT_ULL(bit));
+			TEST_ASSERT(fd == -1 && errno == EINVAL,
+				    "guest_memfd() with flags '0x%llx' should fail with EINVAL",
+				    flag | BIT_ULL(bit));
+		}
+	}
+}
+
+
+int main(int argc, char *argv[])
+{
+	size_t page_size;
+	size_t total_size;
+	int fd;
+	struct kvm_vm *vm;
+
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
+
+	page_size = getpagesize();
+	total_size = page_size * 4;
+
+	vm = vm_create_barebones();
+
+	test_create_guest_memfd_invalid(vm);
+
+	fd = vm_create_guest_memfd(vm, total_size, 0);
+
+	test_file_read_write(fd);
+	test_mmap(fd, page_size);
+	test_file_size(fd, page_size, total_size);
+	test_fallocate(fd, page_size, total_size);
+
+	close(fd);
+}
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [RFC PATCH v12 33/33] KVM: selftests: Test KVM exit behavior for private memory/access
  2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (31 preceding siblings ...)
  2023-09-14  1:55 ` [RFC PATCH v12 32/33] KVM: selftests: Add basic selftest for guest_memfd() Sean Christopherson
@ 2023-09-14  1:55 ` Sean Christopherson
  32 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14  1:55 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Ackerley Tng <ackerleytng@google.com>

"Testing private access when memslot gets deleted" tests the behavior
of KVM when a private memslot gets deleted while the VM is using the
private memslot. When KVM looks up the deleted (slot = NULL) memslot,
KVM should exit to userspace with KVM_EXIT_MEMORY_FAULT.

In the second test, upon a private access to non-private memslot, KVM
should also exit to userspace with KVM_EXIT_MEMORY_FAULT.

sean: These testcases belong in set_memory_region_test.c, they're private
variants on existing testcases and aren't as robust, e.g. they don't ensure
the vCPU is actually running and accessing memory when converting and
deleting.
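
In both scenarios the test (sketched loosely here, with the EXITS_TEST_*
values defined in the patch below) expects the memory_fault exit to describe
the private access:

	TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
	TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
	TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
	TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);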

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../kvm/x86_64/private_mem_kvm_exits_test.c   | 121 ++++++++++++++++++
 2 files changed, 122 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 2b1ef809d73a..f7fdd8244547 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -82,6 +82,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
 TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
new file mode 100644
index 000000000000..1a61c51c2390
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#include <linux/kvm.h>
+#include <pthread.h>
+#include <stdint.h>
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+/* Arbitrarily selected to avoid overlaps with anything else */
+#define EXITS_TEST_GVA 0xc0000000
+#define EXITS_TEST_GPA EXITS_TEST_GVA
+#define EXITS_TEST_NPAGES 1
+#define EXITS_TEST_SIZE (EXITS_TEST_NPAGES * PAGE_SIZE)
+#define EXITS_TEST_SLOT 10
+
+static uint64_t guest_repeatedly_read(void)
+{
+	volatile uint64_t value;
+
+	while (true)
+		value = *((uint64_t *) EXITS_TEST_GVA);
+
+	return value;
+}
+
+static uint32_t run_vcpu_get_exit_reason(struct kvm_vcpu *vcpu)
+{
+	int r;
+
+	r = _vcpu_run(vcpu);
+	if (r) {
+		TEST_ASSERT(errno == EFAULT, KVM_IOCTL_ERROR(KVM_RUN, r));
+		TEST_ASSERT_EQ(vcpu->run->exit_reason, KVM_EXIT_MEMORY_FAULT);
+	}
+	return vcpu->run->exit_reason;
+}
+
+const struct vm_shape protected_vm_shape = {
+	.mode = VM_MODE_DEFAULT,
+	.type = KVM_X86_SW_PROTECTED_VM,
+};
+
+static void test_private_access_memslot_deleted(void)
+{
+	struct kvm_vm *vm;
+	struct kvm_vcpu *vcpu;
+	pthread_t vm_thread;
+	void *thread_return;
+	uint32_t exit_reason;
+
+	vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+					   guest_repeatedly_read);
+
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+				    EXITS_TEST_GPA, EXITS_TEST_SLOT,
+				    EXITS_TEST_NPAGES,
+				    KVM_MEM_PRIVATE);
+
+	virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
+
+	/* Request to access page privately */
+	vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+
+	pthread_create(&vm_thread, NULL,
+		       (void *(*)(void *))run_vcpu_get_exit_reason,
+		       (void *)vcpu);
+
+	vm_mem_region_delete(vm, EXITS_TEST_SLOT);
+
+	pthread_join(vm_thread, &thread_return);
+	exit_reason = (uint32_t)(uint64_t)thread_return;
+
+	TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+
+	kvm_vm_free(vm);
+}
+
+static void test_private_access_memslot_not_private(void)
+{
+	struct kvm_vm *vm;
+	struct kvm_vcpu *vcpu;
+	uint32_t exit_reason;
+
+	vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+					   guest_repeatedly_read);
+
+	/* Add a non-private memslot (flags = 0) */
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+				    EXITS_TEST_GPA, EXITS_TEST_SLOT,
+				    EXITS_TEST_NPAGES, 0);
+
+	virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
+
+	/* Request to access page privately */
+	vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+
+	exit_reason = run_vcpu_get_exit_reason(vcpu);
+
+	TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+
+	kvm_vm_free(vm);
+}
+
+int main(int argc, char *argv[])
+{
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+
+	test_private_access_memslot_deleted();
+	test_private_access_memslot_not_private();
+}
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-09-14  1:55 ` [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
@ 2023-09-14  3:07   ` Binbin Wu
  2023-09-14 14:19     ` Sean Christopherson
  2023-09-20  6:07   ` Xu Yilun
  1 sibling, 1 reply; 83+ messages in thread
From: Binbin Wu @ 2023-09-14  3:07 UTC (permalink / raw)
  To: Sean Christopherson, Chao Peng
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn,
	Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
	Isaku Yamahata, Xu Yilun, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov



On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
>
> Currently in mmu_notifier invalidate path, hva range is recorded and
> then checked against by mmu_notifier_retry_hva() in the page fault
> handling path. However, for the to be introduced private memory, a page
> fault may not have a hva associated, checking gfn(gpa) makes more sense.
>
> For existing hva based shared memory, gfn is expected to also work. The
> only downside is when aliasing multiple gfns to a single hva, the
> current algorithm of checking multiple ranges could result in a much
> larger range being rejected. Such aliasing should be uncommon, so the
> impact is expected small.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Tested-by: Fuad Tabba <tabba@google.com>
> [sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c   | 10 ++++++----
>   arch/x86/kvm/vmx/vmx.c   | 11 +++++------
>   include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
>   virt/kvm/kvm_main.c      | 40 +++++++++++++++++++++++++++++++---------
>   4 files changed, 63 insertions(+), 31 deletions(-)
>
[...]
>   
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -			      unsigned long end)
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
>   {
> +	lockdep_assert_held_write(&kvm->mmu_lock);
>   	/*
>   	 * The count increase must become visible at unlock time as no
>   	 * spte can be established without taking the mmu_lock and
>   	 * count is also read inside the mmu_lock critical section.
>   	 */
>   	kvm->mmu_invalidate_in_progress++;
> +
> +	if (likely(kvm->mmu_invalidate_in_progress == 1))
> +		kvm->mmu_invalidate_range_start = INVALID_GPA;
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	lockdep_assert_held_write(&kvm->mmu_lock);
> +
> +	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
>   	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
>   		kvm->mmu_invalidate_range_start = start;
>   		kvm->mmu_invalidate_range_end = end;
> @@ -771,6 +781,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>   	}
>   }
>   
> +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> +	return kvm_unmap_gfn_range(kvm, range);
> +}
> +
>   static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   					const struct mmu_notifier_range *range)
>   {
> @@ -778,7 +794,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   	const struct kvm_mmu_notifier_range hva_range = {
>   		.start		= range->start,
>   		.end		= range->end,
> -		.handler	= kvm_unmap_gfn_range,
> +		.handler	= kvm_mmu_unmap_gfn_range,
>   		.on_lock	= kvm_mmu_invalidate_begin,
>   		.on_unlock	= kvm_arch_guest_memory_reclaimed,
>   		.flush_on_ret	= true,
> @@ -817,8 +833,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   	return 0;
>   }
>   
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -			    unsigned long end)
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
>   {
>   	/*
>   	 * This sequence increase will notify the kvm page fault that
> @@ -833,6 +848,13 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
>   	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
>   	 */
>   	kvm->mmu_invalidate_in_progress--;
> +
> +	/*
> +	 * Assert that at least one range must be added between start() and
> +	 * end().  Not adding a range isn't fatal, but it is a KVM bug.
> +	 */
> +	WARN_ON_ONCE(kvm->mmu_invalidate_in_progress &&
> +		     kvm->mmu_invalidate_range_start == INVALID_GPA);
Should the check happen before the decrease of
kvm->mmu_invalidate_in_progress?
Otherwise, if KVM calls kvm_mmu_invalidate_begin() and then
kvm_mmu_invalidate_end(), the check will not take effect.

>   }
>   
>   static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-09-14  3:07   ` Binbin Wu
@ 2023-09-14 14:19     ` Sean Christopherson
  0 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-14 14:19 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Chao Peng, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn,
	Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
	Isaku Yamahata, Xu Yilun, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Thu, Sep 14, 2023, Binbin Wu wrote:
> 
> On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> >   {
> >   	/*
> >   	 * This sequence increase will notify the kvm page fault that
> > @@ -833,6 +848,13 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> >   	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> >   	 */
> >   	kvm->mmu_invalidate_in_progress--;
> > +
> > +	/*
> > +	 * Assert that at least one range must be added between start() and
> > +	 * end().  Not adding a range isn't fatal, but it is a KVM bug.
> > +	 */
> > +	WARN_ON_ONCE(kvm->mmu_invalidate_in_progress &&
> > +		     kvm->mmu_invalidate_range_start == INVALID_GPA);
> Should the check happen before the decrease of kvm->mmu_invalidate_in_progress?
> Otherwise, KVM calls kvm_mmu_invalidate_begin(), then kvm_mmu_invalidate_end()
> the check will not take effect.

Indeed.  I'm pretty sure I added this code, not sure what I was thinking.  There's
no reason to check mmu_invalidate_in_progress, it's not like KVM allows
mmu_invalidate_in_progress to go negative.  The comment is also a bit funky.  I'll
post a fixup patch to make it look like this (assuming I'm not forgetting a subtle
reason for guarding the check with the in-progress flag):

	/*
	 * Assert that at least one range was added between start() and end().
	 * Not adding a range isn't fatal, but it is a KVM bug.
	 */
	WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA);

Regarding kvm->mmu_invalidate_in_progress, this would be a good opportunity to
move the BUG_ON() into the common end(), e.g. as is, an end() without a start()
from something other than the generic mmu_notifier would go unnoticed.  And I
_think_ we can replace the BUG_ON() with a KVM_BUG_ON() without putting the
kernel at risk.  E.g.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dd948276e5d6..54480655bcce 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -870,6 +870,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm)
         * in conjunction with the smp_rmb in mmu_invalidate_retry().
         */
        kvm->mmu_invalidate_in_progress--;
+       KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
 
        /*
         * Assert that at least one range was added between start() and end().
@@ -905,8 +906,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
         */
        if (wake)
                rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
-
-       BUG_ON(kvm->mmu_invalidate_in_progress < 0);
 }
 
 static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory
  2023-09-14  1:55 ` [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
@ 2023-09-15  5:40   ` Yan Zhao
  2023-09-15 14:26     ` Sean Christopherson
  0 siblings, 1 reply; 83+ messages in thread
From: Yan Zhao @ 2023-09-15  5:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Sep 13, 2023 at 06:55:16PM -0700, Sean Christopherson wrote:
....
> +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +					      struct kvm_page_fault *fault)
> +{
> +	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> +				      PAGE_SIZE, fault->write, fault->exec,
> +				      fault->is_private);
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> +				   struct kvm_page_fault *fault)
> +{
> +	int max_order, r;
> +
> +	if (!kvm_slot_can_be_private(fault->slot)) {
> +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +		return -EFAULT;
> +	}
> +
> +	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> +			     &max_order);
> +	if (r) {
> +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +		return r;
> +	}
> +
> +	fault->max_level = min(kvm_max_level_for_order(max_order),
> +			       fault->max_level);
> +	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
> +
> +	return RET_PF_CONTINUE;
> +}
> +
>  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
> @@ -4293,6 +4356,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  			return RET_PF_EMULATE;
>  	}
>  
> +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
In patch 21,
fault->is_private is set as:
	".is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT)",
then, the inequality here means the memory attribute has been updated after
the last check.
So, why is an exit to user space for conversion required instead of a mere retry?

Or, is it because how .is_private is assigned in patch 21 is subject to change
in the future?

> +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +		return -EFAULT;
> +	}
> +
> +	if (fault->is_private)
> +		return kvm_faultin_pfn_private(vcpu, fault);
> +
>  	async = false;
>  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
>  					  fault->write, &fault->map_writable,
> @@ -7184,6 +7255,19 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>  }
>  
 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-09-14  1:55 ` [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
@ 2023-09-15  6:11   ` Yan Zhao
  2023-09-18 16:36   ` Michael Roth
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 83+ messages in thread
From: Yan Zhao @ 2023-09-15  6:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Sep 13, 2023 at 06:55:12PM -0700, Sean Christopherson wrote:
> +static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
> +{
> +	struct folio *folio;
> +
> +	/* TODO: Support huge pages. */
> +	folio = filemap_grab_folio(file->f_mapping, index);
> +	if (IS_ERR_OR_NULL(folio))
> +		return NULL;
> +
> +	/*
> +	 * Use the up-to-date flag to track whether or not the memory has been
> +	 * zeroed before being handed off to the guest.  There is no backing
> +	 * storage for the memory, so the folio will remain up-to-date until
> +	 * it's removed.
> +	 *
> +	 * TODO: Skip clearing pages when trusted firmware will do it when
> +	 * assigning memory to the guest.
> +	 */
> +	if (!folio_test_uptodate(folio)) {
> +		unsigned long nr_pages = folio_nr_pages(folio);
> +		unsigned long i;
> +
> +		for (i = 0; i < nr_pages; i++)
> +			clear_highpage(folio_page(folio, i));
> +
> +		folio_mark_uptodate(folio);
> +	}
> +
> +	/*
> +	 * Ignore accessed, referenced, and dirty flags.  The memory is
> +	 * unevictable and there is no storage to write back to.
> +	 */
> +	return folio;
> +}
If VFIO wants to map a private page, is it required to call this function to get the PFN?


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-09-14  1:55 ` [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes Sean Christopherson
@ 2023-09-15  6:32   ` Yan Zhao
  2023-09-20 21:00     ` Sean Christopherson
  2023-09-18  7:51   ` Binbin Wu
  2023-10-03 12:47   ` Fuad Tabba
  2 siblings, 1 reply; 83+ messages in thread
From: Yan Zhao @ 2023-09-15  6:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Sep 13, 2023 at 06:55:09PM -0700, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
> 
...
> +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +				     unsigned long attrs)
> +{
> +	XA_STATE(xas, &kvm->mem_attr_array, start);
> +	unsigned long index;
> +	bool has_attrs;
> +	void *entry;
> +
> +	rcu_read_lock();
> +
> +	if (!attrs) {
> +		has_attrs = !xas_find(&xas, end);
> +		goto out;
> +	}
> +
> +	has_attrs = true;
> +	for (index = start; index < end; index++) {
> +		do {
> +			entry = xas_next(&xas);
> +		} while (xas_retry(&xas, entry));
> +
> +		if (xas.xa_index != index || xa_to_value(entry) != attrs) {
Should "xa_to_value(entry) != attrs" be "!(xa_to_value(entry) & attrs)" ?

> +			has_attrs = false;
> +			break;
> +		}
> +	}
> +
> +out:
> +	rcu_read_unlock();
> +	return has_attrs;
> +}
> +
...
> +/* Set @attributes for the gfn range [@start, @end). */
> +static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +				     unsigned long attributes)
> +{
> +	struct kvm_mmu_notifier_range pre_set_range = {
> +		.start = start,
> +		.end = end,
> +		.handler = kvm_arch_pre_set_memory_attributes,
> +		.on_lock = kvm_mmu_invalidate_begin,
> +		.flush_on_ret = true,
> +		.may_block = true,
> +	};
> +	struct kvm_mmu_notifier_range post_set_range = {
> +		.start = start,
> +		.end = end,
> +		.arg.attributes = attributes,
> +		.handler = kvm_arch_post_set_memory_attributes,
> +		.on_lock = kvm_mmu_invalidate_end,
> +		.may_block = true,
> +	};
> +	unsigned long i;
> +	void *entry;
> +	int r = 0;
> +
> +	entry = attributes ? xa_mk_value(attributes) : NULL;
Also here, do we need to get the existing attributes of a GFN first?

> +	mutex_lock(&kvm->slots_lock);
> +
> +	/* Nothing to do if the entire range as the desired attributes. */
> +	if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
> +		goto out_unlock;
> +
> +	/*
> +	 * Reserve memory ahead of time to avoid having to deal with failures
> +	 * partway through setting the new attributes.
> +	 */
> +	for (i = start; i < end; i++) {
> +		r = xa_reserve(&kvm->mem_attr_array, i, GFP_KERNEL_ACCOUNT);
> +		if (r)
> +			goto out_unlock;
> +	}
> +
> +	kvm_handle_gfn_range(kvm, &pre_set_range);
> +
> +	for (i = start; i < end; i++) {
> +		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT));
> +		KVM_BUG_ON(r, kvm);
> +	}
> +
> +	kvm_handle_gfn_range(kvm, &post_set_range);
> +
> +out_unlock:
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	return r;
> +}
 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 01/33] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges
  2023-09-14  1:54 ` [RFC PATCH v12 01/33] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
@ 2023-09-15  6:47   ` Xiaoyao Li
  2023-09-15 21:05     ` Sean Christopherson
  0 siblings, 1 reply; 83+ messages in thread
From: Xiaoyao Li @ 2023-09-15  6:47 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 9/14/2023 9:54 AM, Sean Christopherson wrote:
> Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so
> that the structure can be used to handle notifications that operate on gfn
> context, i.e. that aren't tied to a host virtual address.
> 
> Practically speaking, this is a nop for 64-bit kernels as the only
> meaningful change is to store start+end as u64s instead of unsigned longs.
> 
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   virt/kvm/kvm_main.c | 34 +++++++++++++++++++---------------
>   1 file changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 486800a7024b..0524933856d4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -541,18 +541,22 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>   	return container_of(mn, struct kvm, mmu_notifier);
>   }
>   
> -typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> +typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);

Is it worth mentioning the rename as well in the changelog?

Anyway,

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 06/33] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  2023-09-14  1:55 ` [RFC PATCH v12 06/33] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
@ 2023-09-15  6:59   ` Xiaoyao Li
  0 siblings, 0 replies; 83+ messages in thread
From: Xiaoyao Li @ 2023-09-15  6:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
> information can be supplied without setting userspace up to fail.  The
> padding in the new kvm_userspace_memory_region2 structure will be used to
> pass a file descriptor in addition to the userspace_addr, i.e. allow
> userspace to point at a file descriptor and map memory into a guest that
> is NOT mapped into host userspace.
> 
> Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
> without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
> makes detection of bad flags a bit more robust, e.g. if the new fd field
> is guarded only by a flag and not a new ioctl(), then a userspace bug
> (setting a "bad" flag) would generate out-of-bounds access instead of an
> -EINVAL error.
> 
> Cc: Jarkko Sakkinen <jarkko@kernel.org>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/x86.c       |  2 +-
>   include/linux/kvm_host.h |  4 ++--
>   include/uapi/linux/kvm.h | 13 +++++++++++++
>   virt/kvm/kvm_main.c      | 38 ++++++++++++++++++++++++++++++--------
>   4 files changed, 46 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6c9c81e82e65..8356907079e1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12447,7 +12447,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
>   	}
>   
>   	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> -		struct kvm_userspace_memory_region m;
> +		struct kvm_userspace_memory_region2 m;
>   
>   		m.slot = id | (i << 16);
>   		m.flags = 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 5faba69403ac..4e741ff27af3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1146,9 +1146,9 @@ enum kvm_mr_change {
>   };
>   
>   int kvm_set_memory_region(struct kvm *kvm,
> -			  const struct kvm_userspace_memory_region *mem);
> +			  const struct kvm_userspace_memory_region2 *mem);
>   int __kvm_set_memory_region(struct kvm *kvm,
> -			    const struct kvm_userspace_memory_region *mem);
> +			    const struct kvm_userspace_memory_region2 *mem);
>   void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
>   void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
>   int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 13065dd96132..bd1abe067f28 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
>   	__u64 userspace_addr; /* start of the userspace allocated memory */
>   };
>   
> +/* for KVM_SET_USER_MEMORY_REGION2 */
> +struct kvm_userspace_memory_region2 {
> +	__u32 slot;
> +	__u32 flags;
> +	__u64 guest_phys_addr;
> +	__u64 memory_size;
> +	__u64 userspace_addr;
> +	__u64 pad[16];
> +};
> +
>   /*
>    * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
>    * userspace, other bits are reserved for kvm internal use which are defined
> @@ -1192,6 +1202,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_COUNTER_OFFSET 227
>   #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
>   #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
> +#define KVM_CAP_USER_MEMORY2 230
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> @@ -1473,6 +1484,8 @@ struct kvm_vfio_spapr_tce {
>   					struct kvm_userspace_memory_region)
>   #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
>   #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> +#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> +					 struct kvm_userspace_memory_region2)
>   
>   /* enable ucontrol for s390 */
>   struct kvm_s390_ucas_mapping {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8d21757cd5e9..7c0e38752526 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1571,7 +1571,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>   	}
>   }
>   
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_userspace_memory_region2 *mem)
>   {
>   	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>   
> @@ -1973,7 +1973,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>    * Must be called holding kvm->slots_lock for write.
>    */
>   int __kvm_set_memory_region(struct kvm *kvm,
> -			    const struct kvm_userspace_memory_region *mem)
> +			    const struct kvm_userspace_memory_region2 *mem)
>   {
>   	struct kvm_memory_slot *old, *new;
>   	struct kvm_memslots *slots;
> @@ -2077,7 +2077,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>   
>   int kvm_set_memory_region(struct kvm *kvm,
> -			  const struct kvm_userspace_memory_region *mem)
> +			  const struct kvm_userspace_memory_region2 *mem)
>   {
>   	int r;
>   
> @@ -2089,7 +2089,7 @@ int kvm_set_memory_region(struct kvm *kvm,
>   EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>   
>   static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> -					  struct kvm_userspace_memory_region *mem)
> +					  struct kvm_userspace_memory_region2 *mem)
>   {
>   	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
>   		return -EINVAL;
> @@ -4559,6 +4559,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>   {
>   	switch (arg) {
>   	case KVM_CAP_USER_MEMORY:
> +	case KVM_CAP_USER_MEMORY2:
>   	case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
>   	case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
>   	case KVM_CAP_INTERNAL_ERROR_DATA:
> @@ -4814,6 +4815,14 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>   	return fd;
>   }
>   
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
> +do {										\
> +	BUILD_BUG_ON(offsetof(struct kvm_userspace_memory_region, field) !=		\
> +		     offsetof(struct kvm_userspace_memory_region2, field));	\
> +	BUILD_BUG_ON(sizeof_field(struct kvm_userspace_memory_region, field) !=		\
> +		     sizeof_field(struct kvm_userspace_memory_region2, field));	\
> +} while (0)
> +
>   static long kvm_vm_ioctl(struct file *filp,
>   			   unsigned int ioctl, unsigned long arg)
>   {
> @@ -4836,15 +4845,28 @@ static long kvm_vm_ioctl(struct file *filp,
>   		r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
>   		break;
>   	}
> +	case KVM_SET_USER_MEMORY_REGION2:
>   	case KVM_SET_USER_MEMORY_REGION: {
> -		struct kvm_userspace_memory_region kvm_userspace_mem;
> +		struct kvm_userspace_memory_region2 mem;
> +		unsigned long size;
> +
> +		if (ioctl == KVM_SET_USER_MEMORY_REGION)
> +			size = sizeof(struct kvm_userspace_memory_region);
> +		else
> +			size = sizeof(struct kvm_userspace_memory_region2);
> +
> +		/* Ensure the common parts of the two structs are identical. */
> +		SANITY_CHECK_MEM_REGION_FIELD(slot);
> +		SANITY_CHECK_MEM_REGION_FIELD(flags);
> +		SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +		SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +		SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
>   
>   		r = -EFAULT;
> -		if (copy_from_user(&kvm_userspace_mem, argp,
> -						sizeof(kvm_userspace_mem)))
> +		if (copy_from_user(&mem, argp, size))
>   			goto out;
>   
> -		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>   		break;
>   	}
>   	case KVM_GET_DIRTY_LOG: {


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory
  2023-09-15  5:40   ` Yan Zhao
@ 2023-09-15 14:26     ` Sean Christopherson
  2023-09-18  0:54       ` Yan Zhao
  2023-09-21  5:51       ` Binbin Wu
  0 siblings, 2 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-15 14:26 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Sep 15, 2023, Yan Zhao wrote:
> On Wed, Sep 13, 2023 at 06:55:16PM -0700, Sean Christopherson wrote:
> ....
> > +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +					      struct kvm_page_fault *fault)
> > +{
> > +	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> > +				      PAGE_SIZE, fault->write, fault->exec,
> > +				      fault->is_private);
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > +				   struct kvm_page_fault *fault)
> > +{
> > +	int max_order, r;
> > +
> > +	if (!kvm_slot_can_be_private(fault->slot)) {
> > +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > +		return -EFAULT;
> > +	}
> > +
> > +	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> > +			     &max_order);
> > +	if (r) {
> > +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > +		return r;
> > +	}
> > +
> > +	fault->max_level = min(kvm_max_level_for_order(max_order),
> > +			       fault->max_level);
> > +	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
> > +
> > +	return RET_PF_CONTINUE;
> > +}
> > +
> >  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  {
> >  	struct kvm_memory_slot *slot = fault->slot;
> > @@ -4293,6 +4356,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  			return RET_PF_EMULATE;
> >  	}
> >  
> > +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> In patch 21,
> fault->is_private is set as:
> 	".is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT)",
> then the inequality here means the memory attribute has been updated after the
> last check.
> So, why is an exit to user space for conversion required instead of a mere retry?
> 
> Or, is it because the way .is_private is assigned in patch 21 is subject to change
> in the future?

This.  Retrying on SNP or TDX would hang the guest.  I suppose we could special
case VMs where .is_private is derived from the memory attributes, but the
SW_PROTECTED_VM type is primarily a development vehicle at this point.  I'd like to
have it mimic SNP/TDX as much as possible; performance is a secondary concern.

E.g. userspace needs to be prepared for "spurious" exits due to races on SNP and
TDX, which this can theoretically exercise.  Though the window is quite small so
I doubt that'll actually happen in practice; which of course also makes it less
important to retry instead of exiting.
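To make the userspace expectation concrete, the handling could look roughly
like the below.  Struct/field/helper names here are placeholders for
illustration only, not the actual uAPI from this series:

	#include <stdbool.h>
	#include <stdint.h>

	/* Stand-in for the exit info reported to userspace. */
	struct mem_fault_info {
		uint64_t gpa;
		uint64_t size;
		bool private;
	};

	/* Placeholder for the VMM's attribute-conversion path. */
	static void vmm_convert_range(uint64_t gpa, uint64_t size, bool private)
	{
		/* ... tell KVM to flip the range's attributes ... */
	}

	/*
	 * Convert the range to what the guest actually touched, then re-enter
	 * the vCPU.  If another vCPU raced and flipped the range back, the
	 * "same" exit may be reported again; the VMM just converts and
	 * retries, so spurious exits are harmless.
	 */
	static void handle_mem_fault(const struct mem_fault_info *mf)
	{
		vmm_convert_range(mf->gpa, mf->size, mf->private);
	}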

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 01/33] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges
  2023-09-15  6:47   ` Xiaoyao Li
@ 2023-09-15 21:05     ` Sean Christopherson
  0 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-15 21:05 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Sep 15, 2023, Xiaoyao Li wrote:
> On 9/14/2023 9:54 AM, Sean Christopherson wrote:
> > Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so
> > that the structure can be used to handle notifications that operate on gfn
> > context, i.e. that aren't tied to a host virtual address.
> > 
> > Practically speaking, this is a nop for 64-bit kernels as the only
> > meaningful change is to store start+end as u64s instead of unsigned longs.
> > 
> > Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   virt/kvm/kvm_main.c | 34 +++++++++++++++++++---------------
> >   1 file changed, 19 insertions(+), 15 deletions(-)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 486800a7024b..0524933856d4 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -541,18 +541,22 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> >   	return container_of(mn, struct kvm, mmu_notifier);
> >   }
> > -typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> > +typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> 
> Is it worth mentioning the rename as well in the changelog?

Meh, I suppose.  At some point, we do have to assume a certain level of code
literacy though :-)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory
  2023-09-15 14:26     ` Sean Christopherson
@ 2023-09-18  0:54       ` Yan Zhao
  2023-09-21 14:59         ` Sean Christopherson
  2023-09-21  5:51       ` Binbin Wu
  1 sibling, 1 reply; 83+ messages in thread
From: Yan Zhao @ 2023-09-18  0:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Sep 15, 2023 at 07:26:16AM -0700, Sean Christopherson wrote:
> On Fri, Sep 15, 2023, Yan Zhao wrote:
> > On Wed, Sep 13, 2023 at 06:55:16PM -0700, Sean Christopherson wrote:
> > ....
> > > +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > > +					      struct kvm_page_fault *fault)
> > > +{
> > > +	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> > > +				      PAGE_SIZE, fault->write, fault->exec,
> > > +				      fault->is_private);
> > > +}
> > > +
> > > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > > +				   struct kvm_page_fault *fault)
> > > +{
> > > +	int max_order, r;
> > > +
> > > +	if (!kvm_slot_can_be_private(fault->slot)) {
> > > +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > > +	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> > > +			     &max_order);
> > > +	if (r) {
> > > +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > +		return r;
> > > +	}
> > > +
> > > +	fault->max_level = min(kvm_max_level_for_order(max_order),
> > > +			       fault->max_level);
> > > +	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
> > > +
> > > +	return RET_PF_CONTINUE;
> > > +}
> > > +
> > >  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >  {
> > >  	struct kvm_memory_slot *slot = fault->slot;
> > > @@ -4293,6 +4356,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > >  			return RET_PF_EMULATE;
> > >  	}
> > >  
> > > +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > In patch 21,
> > fault->is_private is set as:
> > 	".is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT)",
> > then the inequality here means the memory attribute has been updated after the
> > last check.
> > So, why is an exit to user space for conversion required instead of a mere retry?
> > 
> > Or, is it because the way .is_private is assigned in patch 21 is subject to change
> > in the future?
> 
> This.  Retrying on SNP or TDX would hang the guest.  I suppose we could special
Is this because if the guest accesses a page in a private way (e.g. via a
private key in TDX), the returned page must be a private page?

> case VMs where .is_private is derived from the memory attributes, but the
> SW_PROTECTED_VM type is primarily a development vehicle at this point.  I'd like to
> have it mimic SNP/TDX as much as possible; performance is a secondary concern.
Ok. But this mimicking is somewhat confusing, as it may be problematic in the
scenario below, though a sane guest should ensure no one is accessing a page
before doing memory conversion.


CPU 0                           CPU 1
access GFN A in private way
fault->is_private=true
                                convert GFN A to shared
			        set memory attribute of A to shared

faultin, mismatch and exit
set memory attribute of A
to private

                                vCPU access GFN A in shared way
                                fault->is_private = true
                                faultin, match and map a private PFN B

                                vCPU accesses private PFN B in shared way

> 
> E.g. userspace needs to be prepared for "spurious" exits due to races on SNP and
> TDX, which this can theoretically exercise.  Though the window is quite small so
> I doubt that'll actually happen in practice; which of course also makes it less
> important to retry instead of exiting.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events
  2023-09-14  1:55 ` [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events Sean Christopherson
@ 2023-09-18  1:14   ` Binbin Wu
  2023-09-18 15:57     ` Sean Christopherson
  2023-09-18 18:07   ` Michael Roth
  1 sibling, 1 reply; 83+ messages in thread
From: Binbin Wu @ 2023-09-18  1:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn,
	Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
	Isaku Yamahata, Xu Yilun, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov



On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> Add flags to "struct kvm_gfn_range" to let notifier events target only
> shared and only private mappings, and write up the existing mmu_notifier
> events to be shared-only (private memory is never associated with a
> userspace virtual address, i.e. can't be reached via mmu_notifiers).
>
> Add two flags so that KVM can handle the three possibilities (shared,
> private, and shared+private) without needing something like a tri-state
> enum.

How should I understand the word "stage" in the shortlog?


>
> Link: https://lore.kernel.org/all/ZJX0hk+KpQP0KUyB@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   include/linux/kvm_host.h | 2 ++
>   virt/kvm/kvm_main.c      | 7 +++++++
>   2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d8c6ce6c8211..b5373cee2b08 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -263,6 +263,8 @@ struct kvm_gfn_range {
>   	gfn_t start;
>   	gfn_t end;
>   	union kvm_mmu_notifier_arg arg;
> +	bool only_private;
> +	bool only_shared;
>   	bool may_block;
>   };
>   bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 174de2789657..a41f8658dfe0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>   			 * the second or later invocation of the handler).
>   			 */
>   			gfn_range.arg = range->arg;
> +
> +			/*
> +			 * HVA-based notifications aren't relevant to private
> +			 * mappings as they don't have a userspace mapping.
> +			 */
> +			gfn_range.only_private = false;
> +			gfn_range.only_shared = true;
>   			gfn_range.may_block = range->may_block;
>   
>   			/*


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-09-14  1:55 ` [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes Sean Christopherson
  2023-09-15  6:32   ` Yan Zhao
@ 2023-09-18  7:51   ` Binbin Wu
  2023-09-20 21:03     ` Sean Christopherson
  2023-10-03 12:47   ` Fuad Tabba
  2 siblings, 1 reply; 83+ messages in thread
From: Binbin Wu @ 2023-09-18  7:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn,
	Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
	Isaku Yamahata, Xu Yilun, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov



On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
[...]
>   
> +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +/*
> + * Returns true if _all_ gfns in the range [@start, @end) have attributes
> + * matching @attrs.
> + */
> +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +				     unsigned long attrs)
> +{
> +	XA_STATE(xas, &kvm->mem_attr_array, start);
> +	unsigned long index;
> +	bool has_attrs;
> +	void *entry;
> +
> +	rcu_read_lock();
> +
> +	if (!attrs) {
> +		has_attrs = !xas_find(&xas, end);
IIUIC, xas_find() is inclusive of "end", so here it should be "end - 1"?


> +		goto out;
> +	}
> +
> +	has_attrs = true;
> +	for (index = start; index < end; index++) {
> +		do {
> +			entry = xas_next(&xas);
> +		} while (xas_retry(&xas, entry));
> +
> +		if (xas.xa_index != index || xa_to_value(entry) != attrs) {
> +			has_attrs = false;
> +			break;
> +		}
> +	}
> +
> +out:
> +	rcu_read_unlock();
> +	return has_attrs;
> +}
> +
>
[...]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events
  2023-09-18  1:14   ` Binbin Wu
@ 2023-09-18 15:57     ` Sean Christopherson
  0 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-18 15:57 UTC (permalink / raw)
  To: Binbin Wu
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn,
	Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
	Isaku Yamahata, Xu Yilun, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Mon, Sep 18, 2023, Binbin Wu wrote:
> 
> 
> On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> > Add flags to "struct kvm_gfn_range" to let notifier events target only
> > shared and only private mappings, and write up the existing mmu_notifier
> > events to be shared-only (private memory is never associated with a
> > userspace virtual address, i.e. can't be reached via mmu_notifiers).
> > 
> > Add two flags so that KVM can handle the three possibilities (shared,
> > private, and shared+private) without needing something like a tri-state
> > enum.
> 
> How should I understand the word "stage" in the shortlog?

Sorry, it's an idiom[*] that essentially means "to prepare for".  I'll rephrase
the shortlog to be more straightforward (I have a bad habit of using idioms).

[*] https://dictionary.cambridge.org/us/dictionary/english/set-the-stage-for

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-09-14  1:55 ` [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
  2023-09-15  6:11   ` Yan Zhao
@ 2023-09-18 16:36   ` Michael Roth
  2023-09-20 23:44     ` Sean Christopherson
  2023-09-19  9:01   ` Binbin Wu
  2023-09-21 19:10   ` Sean Christopherson
  3 siblings, 1 reply; 83+ messages in thread
From: Michael Roth @ 2023-09-18 16:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Wed, Sep 13, 2023 at 06:55:12PM -0700, Sean Christopherson wrote:
> TODO
> 
> Cc: Fuad Tabba <tabba@google.com>
> Cc: Vishal Annapurve <vannapurve@google.com>
> Cc: Ackerley Tng <ackerleytng@google.com>
> Cc: Jarkko Sakkinen <jarkko@kernel.org>
> Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Quentin Perret <qperret@google.com>
> Cc: Michael Roth <michael.roth@amd.com>
> Cc: Wang <wei.w.wang@intel.com>
> Cc: Liam Merwick <liam.merwick@oracle.com>
> Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
> Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/kvm_host.h   |  48 +++
>  include/uapi/linux/kvm.h   |  15 +-
>  include/uapi/linux/magic.h |   1 +
>  virt/kvm/Kconfig           |   4 +
>  virt/kvm/Makefile.kvm      |   1 +
>  virt/kvm/guest_mem.c       | 593 +++++++++++++++++++++++++++++++++++++
>  virt/kvm/kvm_main.c        |  61 +++-
>  virt/kvm/kvm_mm.h          |  38 +++
>  8 files changed, 756 insertions(+), 5 deletions(-)
>  create mode 100644 virt/kvm/guest_mem.c
> 

<snip>

> +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> +{
> +	struct list_head *gmem_list = &inode->i_mapping->private_list;
> +	pgoff_t start = offset >> PAGE_SHIFT;
> +	pgoff_t end = (offset + len) >> PAGE_SHIFT;
> +	struct kvm_gmem *gmem;
> +
> +	/*
> +	 * Bindings must stable across invalidation to ensure the start+end
> +	 * are balanced.
> +	 */
> +	filemap_invalidate_lock(inode->i_mapping);
> +
> +	list_for_each_entry(gmem, gmem_list, entry) {
> +		kvm_gmem_invalidate_begin(gmem, start, end);

In v11 we used to call truncate_inode_pages_range() here to drop filemap's
reference on the folio. AFAICT the folios are only getting freed upon
guest shutdown without this. Was this on purpose?
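For reference, restoring the v11-style behavior would be something along the
lines of the below (truncate_inode_pages_range() takes an inclusive end
offset); just a sketch, not necessarily the right fix:

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_begin(gmem, start, end);

	/* Drop filemap's references so the folios are freed at hole-punch. */
	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_end(gmem, start, end);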

> +		kvm_gmem_invalidate_end(gmem, start, end);
> +	}
> +
> +	filemap_invalidate_unlock(inode->i_mapping);
> +
> +	return 0;
> +}
> +
> +static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	pgoff_t start, index, end;
> +	int r;
> +
> +	/* Dedicated guest is immutable by default. */
> +	if (offset + len > i_size_read(inode))
> +		return -EINVAL;
> +
> +	filemap_invalidate_lock_shared(mapping);

We take the filemap lock here, but not for
kvm_gmem_get_pfn()->kvm_gmem_get_folio(). Is it needed there as well?

> +
> +	start = offset >> PAGE_SHIFT;
> +	end = (offset + len) >> PAGE_SHIFT;
> +
> +	r = 0;
> +	for (index = start; index < end; ) {
> +		struct folio *folio;
> +
> +		if (signal_pending(current)) {
> +			r = -EINTR;
> +			break;
> +		}
> +
> +		folio = kvm_gmem_get_folio(inode, index);
> +		if (!folio) {
> +			r = -ENOMEM;
> +			break;
> +		}
> +
> +		index = folio_next_index(folio);
> +
> +		folio_unlock(folio);
> +		folio_put(folio);
> +
> +		/* 64-bit only, wrapping the index should be impossible. */
> +		if (WARN_ON_ONCE(!index))
> +			break;
> +
> +		cond_resched();
> +	}
> +
> +	filemap_invalidate_unlock_shared(mapping);
> +
> +	return r;
> +}
> +

<snip>

> +static int __kvm_gmem_create(struct kvm *kvm, loff_t size, struct vfsmount *mnt)
> +{
> +	const char *anon_name = "[kvm-gmem]";
> +	const struct qstr qname = QSTR_INIT(anon_name, strlen(anon_name));
> +	struct kvm_gmem *gmem;
> +	struct inode *inode;
> +	struct file *file;
> +	int fd, err;
> +
> +	inode = alloc_anon_inode(mnt->mnt_sb);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);
> +
> +	err = security_inode_init_security_anon(inode, &qname, NULL);
> +	if (err)
> +		goto err_inode;
> +
> +	inode->i_private = (void *)(unsigned long)flags;

The 'flags' argument isn't added until the subsequent patch that adds THP
support.

<snip>

> +static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
> +{
> +	if (size < 0 || !PAGE_ALIGNED(size))
> +		return false;
> +
> +	return true;
> +}
> +
> +int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> +{
> +	loff_t size = args->size;
> +	u64 flags = args->flags;
> +	u64 valid_flags = 0;
> +
> +	if (flags & ~valid_flags)
> +		return -EINVAL;
> +
> +	if (!kvm_gmem_is_valid_size(size, flags))
> +		return -EINVAL;
> +
> +	return __kvm_gmem_create(kvm, size, flags, kvm_gmem_mnt);
> +}
> +
> +int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
> +		  unsigned int fd, loff_t offset)
> +{
> +	loff_t size = slot->npages << PAGE_SHIFT;
> +	unsigned long start, end, flags;
> +	struct kvm_gmem *gmem;
> +	struct inode *inode;
> +	struct file *file;
> +
> +	BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
> +
> +	file = fget(fd);
> +	if (!file)
> +		return -EBADF;
> +
> +	if (file->f_op != &kvm_gmem_fops)
> +		goto err;
> +
> +	gmem = file->private_data;
> +	if (gmem->kvm != kvm)
> +		goto err;
> +
> +	inode = file_inode(file);
> +	flags = (unsigned long)inode->i_private;
> +
> +	/*
> +	 * For simplicity, require the offset into the file and the size of the
> +	 * memslot to be aligned to the largest possible page size used to back
> +	 * the file (same as the size of the file itself).
> +	 */
> +	if (!kvm_gmem_is_valid_size(offset, flags) ||
> +	    !kvm_gmem_is_valid_size(size, flags))
> +		goto err;

I needed to relax this check for SNP. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
applies to the entire gmem inode, so it makes sense for userspace to enable
hugepages if start/end are hugepage-aligned, but QEMU will do things
like map overlapping regions for ROMs and other things on top of the
GPA range that the gmem inode was originally allocated for. For
instance:

  692500@1689108688.696338:kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fbf5be00000 ret=0 restricted_fd=19 restricted_offset=0x0
  692500@1689108688.699802:kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0x100000000 size=0x380000000 ua=0x7fbfdbe00000 ret=0 restricted_fd=19 restricted_offset=0x80000000
  692500@1689108688.795412:kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x0 gpa=0x0 size=0x0 ua=0x7fbf5be00000 ret=0 restricted_fd=19 restricted_offset=0x0
  692500@1689108688.795550:kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0xc0000 ua=0x7fbf5be00000 ret=0 restricted_fd=19 restricted_offset=0x0
  692500@1689108688.796227:kvm_set_user_memory AddrSpace#0 Slot#6 flags=0x4 gpa=0x100000 size=0x7ff00000 ua=0x7fbf5bf00000 ret=0 restricted_fd=19 restricted_offset=0x100000

Because of that, the KVM_SET_USER_MEMORY_REGIONs for non-THP-aligned GPAs
will fail. Maybe that should instead be allowed, with kvm_gmem_get_folio()
handling the alignment checks on a case-by-case basis and simply forcing 4k
for offsets corresponding to unaligned bindings?
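Concretely, something like only requiring base-page alignment at bind time,
e.g. (sketch only):

	/*
	 * Only require PAGE_SIZE alignment here; kvm_gmem_get_folio() would
	 * then need to fall back to 4k folios for bindings whose offset/size
	 * aren't hugepage-aligned.
	 */
	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(size))
		goto err;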

-Mike

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events
  2023-09-14  1:55 ` [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events Sean Christopherson
  2023-09-18  1:14   ` Binbin Wu
@ 2023-09-18 18:07   ` Michael Roth
  2023-09-19  0:08     ` Sean Christopherson
  1 sibling, 1 reply; 83+ messages in thread
From: Michael Roth @ 2023-09-18 18:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov, Kalra, Ashish

On Wed, Sep 13, 2023 at 06:55:08PM -0700, Sean Christopherson wrote:
> Add flags to "struct kvm_gfn_range" to let notifier events target only
> shared and only private mappings, and write up the existing mmu_notifier
> events to be shared-only (private memory is never associated with a
> userspace virtual address, i.e. can't be reached via mmu_notifiers).
> 
> Add two flags so that KVM can handle the three possibilities (shared,
> private, and shared+private) without needing something like a tri-state
> enum.
> 
> Link: https://lore.kernel.org/all/ZJX0hk+KpQP0KUyB@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/kvm_host.h | 2 ++
>  virt/kvm/kvm_main.c      | 7 +++++++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d8c6ce6c8211..b5373cee2b08 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -263,6 +263,8 @@ struct kvm_gfn_range {
>  	gfn_t start;
>  	gfn_t end;
>  	union kvm_mmu_notifier_arg arg;
> +	bool only_private;
> +	bool only_shared;
>  	bool may_block;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 174de2789657..a41f8658dfe0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>  			 * the second or later invocation of the handler).
>  			 */
>  			gfn_range.arg = range->arg;
> +
> +			/*
> +			 * HVA-based notifications aren't relevant to private
> +			 * mappings as they don't have a userspace mapping.
> +			 */
> +			gfn_range.only_private = false;
> +			gfn_range.only_shared = true;
>  			gfn_range.may_block = range->may_block;

Who is supposed to read only_private/only_shared? Is it supposed to be
plumbed into arch code and handled specially there?

I ask because I see elsewhere you have:

    /*
     * If one or more memslots were found and thus zapped, notify arch code
     * that guest memory has been reclaimed.  This needs to be done *after*
     * dropping mmu_lock, as x86's reclaim path is slooooow.
     */
    if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
            kvm_arch_guest_memory_reclaimed(kvm);

and if there are any MMU notifier events that touch HVAs, then
kvm_arch_guest_memory_reclaimed()->wbinvd_on_all_cpus() will get called,
which causes the performance issues for SEV and SNP that Ashish had brought
up. Technically that would only need to happen if there are GPAs in that
memslot that aren't currently backed by gmem pages (and then gmem could handle
its own wbinvd_on_all_cpus() (or maybe clflush per-page)). 

Actually, even if there are shared pages in the GPA range, the
kvm_arch_guest_memory_reclaimed()->wbinvd_on_all_cpus() can be skipped for
guests that only use gmem pages for private memory. Is that acceptable? Just
trying to figure out where this only_private/only_shared handling ties into
that (or if it's a separate thing entirely).

-Mike

>  
>  			/*
> -- 
> 2.42.0.283.g2d96d420d3-goog
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events
  2023-09-18 18:07   ` Michael Roth
@ 2023-09-19  0:08     ` Sean Christopherson
  0 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-19  0:08 UTC (permalink / raw)
  To: Michael Roth
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov, Ashish Kalra

On Mon, Sep 18, 2023, Michael Roth wrote:
> On Wed, Sep 13, 2023 at 06:55:08PM -0700, Sean Christopherson wrote:
> > Add flags to "struct kvm_gfn_range" to let notifier events target only
> > shared and only private mappings, and write up the existing mmu_notifier
> > events to be shared-only (private memory is never associated with a
> > userspace virtual address, i.e. can't be reached via mmu_notifiers).
> > 
> > Add two flags so that KVM can handle the three possibilities (shared,
> > private, and shared+private) without needing something like a tri-state
> > enum.
> > 
> > Link: https://lore.kernel.org/all/ZJX0hk+KpQP0KUyB@google.com
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  include/linux/kvm_host.h | 2 ++
> >  virt/kvm/kvm_main.c      | 7 +++++++
> >  2 files changed, 9 insertions(+)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index d8c6ce6c8211..b5373cee2b08 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -263,6 +263,8 @@ struct kvm_gfn_range {
> >  	gfn_t start;
> >  	gfn_t end;
> >  	union kvm_mmu_notifier_arg arg;
> > +	bool only_private;
> > +	bool only_shared;
> >  	bool may_block;
> >  };
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 174de2789657..a41f8658dfe0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> >  			 * the second or later invocation of the handler).
> >  			 */
> >  			gfn_range.arg = range->arg;
> > +
> > +			/*
> > +			 * HVA-based notifications aren't relevant to private
> > +			 * mappings as they don't have a userspace mapping.
> > +			 */
> > +			gfn_range.only_private = false;
> > +			gfn_range.only_shared = true;
> >  			gfn_range.may_block = range->may_block;
> 
> Who is supposed to read only_private/only_shared? Is it supposed to be
> plumbed into arch code and handled specially there?

Yeah, that's the idea.  Though I don't know that it's worth using for SNP; the
cost of checking the RMP may be higher than just eating the extra faults.

> I ask because I see elsewhere you have:
> 
>     /*
>      * If one or more memslots were found and thus zapped, notify arch code
>      * that guest memory has been reclaimed.  This needs to be done *after*
>      * dropping mmu_lock, as x86's reclaim path is slooooow.
>      */
>     if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
>             kvm_arch_guest_memory_reclaimed(kvm);
> 
> and if there are any MMU notifier events that touch HVAs, then
> kvm_arch_guest_memory_reclaimed()->wbinvd_on_all_cpus() will get called,
> which causes the performance issues for SEV and SNP that Ashish had brought
> up. Technically that would only need to happen if there are GPAs in that
> memslot that aren't currently backed by gmem pages (and then gmem could handle
> its own wbinvd_on_all_cpus() (or maybe clflush per-page)). 
> 
> Actually, even if there are shared pages in the GPA range, the
> kvm_arch_guest_memory_reclaimed()->wbinvd_on_all_cpus() can be skipped for
> guests that only use gmem pages for private memory. Is that acceptable?

Yes, that was my original plan.  I may have forgotten that exact plan at one point
or another and not communicated it well.  But the idea is definitely that if a VM
type, a.k.a. SNP guests, is required to use gmem for private memory, then there's
no need to blast WBINVD because barring a KVM bug, the mmu_notifier event can't
have freed private memory, even if it *did* zap SPTEs.
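Roughly, something like the below on top of the existing call site, with a
placeholder for however "private memory can only come from gmem" ends up
being expressed:

	if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot &&
	    !kvm_vm_private_mem_is_gmem_only(kvm))	/* placeholder */
		kvm_arch_guest_memory_reclaimed(kvm);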

For gmem, if KVM doesn't precisely zap only shared SPTEs for SNP (is that even
possible to do race-free?), then KVM needs to blast WBINVD when freeing memory
from gmem even if there are no SPTEs.  But that seems like a non-issue for a
well-behaved setup because the odds of there being *zero* SPTEs should be nil.

> Just trying to figure out where this only_private/only_shared handling ties
> into that (or if it's a separate thing entirely).

It's mostly a TDX thing.  I threw it in this series mostly to "formally" document
that the mmu_notifier path only affects shared mappings.  If the code causes
confusion without the TDX context, and won't be used by SNP, we can and should
drop it from the initial merge and have it go along with the TDX series.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-09-14  1:55 ` [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
  2023-09-15  6:11   ` Yan Zhao
  2023-09-18 16:36   ` Michael Roth
@ 2023-09-19  9:01   ` Binbin Wu
  2023-09-20 14:24     ` Sean Christopherson
  2023-09-21 19:10   ` Sean Christopherson
  3 siblings, 1 reply; 83+ messages in thread
From: Binbin Wu @ 2023-09-19  9:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn,
	Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
	Isaku Yamahata, Xu Yilun, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov



On 9/14/2023 9:55 AM, Sean Christopherson wrote:
[...]
> +
> +static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> +				      pgoff_t end)
> +{
> +	struct kvm_memory_slot *slot;
> +	struct kvm *kvm = gmem->kvm;
> +	unsigned long index;
> +	bool flush = false;
> +
> +	KVM_MMU_LOCK(kvm);
> +
> +	kvm_mmu_invalidate_begin(kvm);
> +
> +	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> +		pgoff_t pgoff = slot->gmem.pgoff;
> +
> +		struct kvm_gfn_range gfn_range = {
> +			.start = slot->base_gfn + max(pgoff, start) - pgoff,
> +			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
> +			.slot = slot,
> +			.may_block = true,
> +		};
> +
> +		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
> +	}
> +
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	KVM_MMU_UNLOCK(kvm);
> +}
> +
> +static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> +				    pgoff_t end)
> +{
> +	struct kvm *kvm = gmem->kvm;
> +
> +	KVM_MMU_LOCK(kvm);
> +	if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT))
> +		kvm_mmu_invalidate_end(kvm);
kvm_mmu_invalidate_begin() is called unconditionally in
kvm_gmem_invalidate_begin(), but kvm_mmu_invalidate_end() is not called
unconditionally here.
This makes the kvm_gmem_invalidate_{begin, end}() calls asymmetric.
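E.g., one way to make them symmetric would be to call it unconditionally as
well (sketch only, assuming an unconditional end call is safe here since
begin is unconditional):

	KVM_MMU_LOCK(kvm);
	kvm_mmu_invalidate_end(kvm);
	KVM_MMU_UNLOCK(kvm);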


> +	KVM_MMU_UNLOCK(kvm);
> +}
> +
> +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> +{
> +	struct list_head *gmem_list = &inode->i_mapping->private_list;
> +	pgoff_t start = offset >> PAGE_SHIFT;
> +	pgoff_t end = (offset + len) >> PAGE_SHIFT;
> +	struct kvm_gmem *gmem;
> +
> +	/*
> +	 * Bindings must stable across invalidation to ensure the start+end
> +	 * are balanced.
> +	 */
> +	filemap_invalidate_lock(inode->i_mapping);
> +
> +	list_for_each_entry(gmem, gmem_list, entry) {
> +		kvm_gmem_invalidate_begin(gmem, start, end);
> +		kvm_gmem_invalidate_end(gmem, start, end);
> +	}
Why loop over each gmem in gmem_list here?

IIUIC, offset is the offset within the inode, so it is only meaningful for
the inode passed in, i.e. it is only meaningful for the gmem bound to that
inode, not for the others.


> +
> +	filemap_invalidate_unlock(inode->i_mapping);
> +
> +	return 0;
> +}
> +
[...]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-09-14  1:55 ` [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
  2023-09-14  3:07   ` Binbin Wu
@ 2023-09-20  6:07   ` Xu Yilun
  2023-09-20 13:55     ` Sean Christopherson
  1 sibling, 1 reply; 83+ messages in thread
From: Xu Yilun @ 2023-09-20  6:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On 2023-09-13 at 18:55:00 -0700, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> Currently in mmu_notifier invalidate path, hva range is recorded and
> then checked against by mmu_notifier_retry_hva() in the page fault
                          ^

Now it is mmu_invalidate_retry_hva().

> handling path. However, for the to be introduced private memory, a page
> fault may not have a hva associated, checking gfn(gpa) makes more sense.
> 
> For existing hva based shared memory, gfn is expected to also work. The
> only downside is when aliasing multiple gfns to a single hva, the
> current algorithm of checking multiple ranges could result in a much
> larger range being rejected. Such aliasing should be uncommon, so the
> impact is expected small.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Tested-by: Fuad Tabba <tabba@google.com>
> [sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 10 ++++++----
>  arch/x86/kvm/vmx/vmx.c   | 11 +++++------
>  include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
>  virt/kvm/kvm_main.c      | 40 +++++++++++++++++++++++++++++++---------
>  4 files changed, 63 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e1d011c67cc6..0f0231d2b74f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3056,7 +3056,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
>   *
>   * There are several ways to safely use this helper:
>   *
> - * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
> + * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
>   *   consuming it.  In this case, mmu_lock doesn't need to be held during the
>   *   lookup, but it does need to be held while checking the MMU notifier.
>   *
> @@ -4358,7 +4358,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>  		return true;
>  
>  	return fault->slot &&
> -	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
> +	       mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
>  }
>  
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -6253,7 +6253,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	write_lock(&kvm->mmu_lock);
>  
> -	kvm_mmu_invalidate_begin(kvm, 0, -1ul);
> +	kvm_mmu_invalidate_begin(kvm);
> +
> +	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
>  
>  	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
> @@ -6266,7 +6268,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  	if (flush)
>  		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
>  
> -	kvm_mmu_invalidate_end(kvm, 0, -1ul);
> +	kvm_mmu_invalidate_end(kvm);
>  
>  	write_unlock(&kvm->mmu_lock);
>  }
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 72e3943f3693..6e502ba93141 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6757,10 +6757,10 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	/*
> -	 * Grab the memslot so that the hva lookup for the mmu_notifier retry
> -	 * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
> -	 * on the pfn lookup's validation of the memslot to ensure a valid hva
> -	 * is used for the retry check.
> +	 * Explicitly grab the memslot using KVM's internal slot ID to ensure
> +	 * KVM doesn't unintentionally grab a userspace memslot.  It _should_
> +	 * be impossible for userspace to create a memslot for the APIC when
> +	 * APICv is enabled, but paranoia won't hurt in this case.
>  	 */
>  	slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
>  	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
> @@ -6785,8 +6785,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	read_lock(&vcpu->kvm->mmu_lock);
> -	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
> -				     gfn_to_hva_memslot(slot, gfn))) {
> +	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
>  		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>  		read_unlock(&vcpu->kvm->mmu_lock);
>  		goto out;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fb6c6109fdca..11d091688346 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -787,8 +787,8 @@ struct kvm {
>  	struct mmu_notifier mmu_notifier;
>  	unsigned long mmu_invalidate_seq;
>  	long mmu_invalidate_in_progress;
> -	unsigned long mmu_invalidate_range_start;
> -	unsigned long mmu_invalidate_range_end;
> +	gfn_t mmu_invalidate_range_start;
> +	gfn_t mmu_invalidate_range_end;
>  #endif
>  	struct list_head devices;
>  	u64 manual_dirty_log_protect;
> @@ -1392,10 +1392,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  #endif
>  
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -			      unsigned long end);
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -			    unsigned long end);
> +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_mmu_invalidate_end(struct kvm *kvm);
>  
>  long kvm_arch_dev_ioctl(struct file *filp,
>  			unsigned int ioctl, unsigned long arg);
> @@ -1970,9 +1969,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
>  	return 0;
>  }
>  
> -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
>  					   unsigned long mmu_seq,
> -					   unsigned long hva)
> +					   gfn_t gfn)
>  {
>  	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
> @@ -1981,10 +1980,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  	 * that might be being invalidated. Note that it may include some false
>  	 * positives, due to shortcuts when handing concurrent invalidations.
>  	 */
> -	if (unlikely(kvm->mmu_invalidate_in_progress) &&
> -	    hva >= kvm->mmu_invalidate_range_start &&
> -	    hva < kvm->mmu_invalidate_range_end)
> -		return 1;
> +	if (unlikely(kvm->mmu_invalidate_in_progress)) {
> +		/*
> +		 * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> +		 * but before updating the range is a KVM bug.
> +		 */
> +		if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> +				 kvm->mmu_invalidate_range_end == INVALID_GPA))
> +			return 1;
> +
> +		if (gfn >= kvm->mmu_invalidate_range_start &&
> +		    gfn < kvm->mmu_invalidate_range_end)
> +			return 1;
> +	}
> +
>  	if (kvm->mmu_invalidate_seq != mmu_seq)
>  		return 1;
>  	return 0;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0524933856d4..4fad3b01dc1f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -543,9 +543,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  
>  typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>  
> -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> -			     unsigned long end);
> -
> +typedef void (*on_lock_fn_t)(struct kvm *kvm);
>  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>  
>  struct kvm_mmu_notifier_range {
> @@ -637,7 +635,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  				locked = true;
>  				KVM_MMU_LOCK(kvm);
>  				if (!IS_KVM_NULL_FN(range->on_lock))
> -					range->on_lock(kvm, range->start, range->end);
> +					range->on_lock(kvm);
> +
>  				if (IS_KVM_NULL_FN(range->handler))
>  					break;
>  			}
> @@ -742,15 +741,26 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn);
>  }
>  
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -			      unsigned long end)
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
>  {
> +	lockdep_assert_held_write(&kvm->mmu_lock);
>  	/*
>  	 * The count increase must become visible at unlock time as no
>  	 * spte can be established without taking the mmu_lock and
>  	 * count is also read inside the mmu_lock critical section.
>  	 */
>  	kvm->mmu_invalidate_in_progress++;
> +
> +	if (likely(kvm->mmu_invalidate_in_progress == 1))
> +		kvm->mmu_invalidate_range_start = INVALID_GPA;
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	lockdep_assert_held_write(&kvm->mmu_lock);
> +
> +	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
>  	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
>  		kvm->mmu_invalidate_range_start = start;
>  		kvm->mmu_invalidate_range_end = end;

IIUC, now we only add or override a part of the invalidate range in
these fields, IOW only the range from the last slot is stored when we unlock.
That may break mmu_invalidate_retry_gfn() because it can never know the
whole invalidate range.

How about we extend the mmu_invalidate_range_start/end every time so that
it records the whole invalidate range:

if (kvm->mmu_invalidate_range_start == INVALID_GPA) {
	kvm->mmu_invalidate_range_start = start;
	kvm->mmu_invalidate_range_end = end;
} else {
	kvm->mmu_invalidate_range_start =
		min(kvm->mmu_invalidate_range_start, start);
	kvm->mmu_invalidate_range_end =
		max(kvm->mmu_invalidate_range_end, end);
}

Thanks,
Yilun

> @@ -771,6 +781,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>  	}
>  }
>  
> +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> +	return kvm_unmap_gfn_range(kvm, range);
> +}
> +
>  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  					const struct mmu_notifier_range *range)
>  {
> @@ -778,7 +794,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	const struct kvm_mmu_notifier_range hva_range = {
>  		.start		= range->start,
>  		.end		= range->end,
> -		.handler	= kvm_unmap_gfn_range,
> +		.handler	= kvm_mmu_unmap_gfn_range,
>  		.on_lock	= kvm_mmu_invalidate_begin,
>  		.on_unlock	= kvm_arch_guest_memory_reclaimed,
>  		.flush_on_ret	= true,
> @@ -817,8 +833,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	return 0;
>  }
>  
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -			    unsigned long end)
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
>  {
>  	/*
>  	 * This sequence increase will notify the kvm page fault that
> @@ -833,6 +848,13 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
>  	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
>  	 */
>  	kvm->mmu_invalidate_in_progress--;
> +
> +	/*
> +	 * Assert that at least one range must be added between start() and
> +	 * end().  Not adding a range isn't fatal, but it is a KVM bug.
> +	 */
> +	WARN_ON_ONCE(kvm->mmu_invalidate_in_progress &&
> +		     kvm->mmu_invalidate_range_start == INVALID_GPA);
>  }
>  
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> -- 
> 2.42.0.283.g2d96d420d3-goog
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-09-20  6:07   ` Xu Yilun
@ 2023-09-20 13:55     ` Sean Christopherson
  2023-09-21  2:39       ` Xu Yilun
  0 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-09-20 13:55 UTC (permalink / raw)
  To: Xu Yilun
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Wed, Sep 20, 2023, Xu Yilun wrote:
> On 2023-09-13 at 18:55:00 -0700, Sean Christopherson wrote:
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +	lockdep_assert_held_write(&kvm->mmu_lock);
> > +
> > +	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > +
> >  	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> >  		kvm->mmu_invalidate_range_start = start;
> >  		kvm->mmu_invalidate_range_end = end;
> 
> IIUC, Now we only add or override a part of the invalidate range in
> these fields, IOW only the range in last slot is stored when we unlock.

Ouch.  Good catch!

> That may break mmu_invalidate_retry_gfn() cause it can never know the
> whole invalidate range.
> 
> How about we extend the mmu_invalidate_range_start/end everytime so that
> it records the whole invalidate range:
> 
> if (kvm->mmu_invalidate_range_start == INVALID_GPA) {
> 	kvm->mmu_invalidate_range_start = start;
> 	kvm->mmu_invalidate_range_end = end;
> } else {
> 	kvm->mmu_invalidate_range_start =
> 		min(kvm->mmu_invalidate_range_start, start);
> 	kvm->mmu_invalidate_range_end =
> 		max(kvm->mmu_invalidate_range_end, end);
> }

Yeah, that does seem to be the easiest solution.

I'll post a fixup patch, unless you want the honors.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-09-19  9:01   ` Binbin Wu
@ 2023-09-20 14:24     ` Sean Christopherson
  2023-09-21  5:58       ` Binbin Wu
  0 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-09-20 14:24 UTC (permalink / raw)
  To: Binbin Wu
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn,
	Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
	Isaku Yamahata, Xu Yilun, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Tue, Sep 19, 2023, Binbin Wu wrote:
> 
> 
> On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> [...]
> > +
> > +static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> > +				      pgoff_t end)
> > +{
> > +	struct kvm_memory_slot *slot;
> > +	struct kvm *kvm = gmem->kvm;
> > +	unsigned long index;
> > +	bool flush = false;
> > +
> > +	KVM_MMU_LOCK(kvm);
> > +
> > +	kvm_mmu_invalidate_begin(kvm);
> > +
> > +	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> > +		pgoff_t pgoff = slot->gmem.pgoff;
> > +
> > +		struct kvm_gfn_range gfn_range = {
> > +			.start = slot->base_gfn + max(pgoff, start) - pgoff,
> > +			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
> > +			.slot = slot,
> > +			.may_block = true,
> > +		};
> > +
> > +		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
> > +	}
> > +
> > +	if (flush)
> > +		kvm_flush_remote_tlbs(kvm);
> > +
> > +	KVM_MMU_UNLOCK(kvm);
> > +}
> > +
> > +static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> > +				    pgoff_t end)
> > +{
> > +	struct kvm *kvm = gmem->kvm;
> > +
> > +	KVM_MMU_LOCK(kvm);
> > +	if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT))
> > +		kvm_mmu_invalidate_end(kvm);
> kvm_mmu_invalidate_begin() is called unconditionally in
> kvm_gmem_invalidate_begin(),
> but kvm_mmu_invalidate_end() is not here.
> This makes the kvm_gmem_invalidate_{begin, end}() calls asymmetric.

Another ouch :-(

And there should be no need to acquire mmu_lock() unconditionally, the inode's
mutex protects the bindings, not mmu_lock.

I'll get a fix posted today.  I think KVM can also add a sanity check to detect
unresolved invalidations, e.g.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7ba1ab1832a9..2a2d18070856 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1381,8 +1381,13 @@ static void kvm_destroy_vm(struct kvm *kvm)
         * No threads can be waiting in kvm_swap_active_memslots() as the
         * last reference on KVM has been dropped, but freeing
         * memslots would deadlock without this manual intervention.
+        *
+        * If the count isn't unbalanced, i.e. KVM did NOT unregister between
+        * a start() and end(), then there shouldn't be any in-progress
+        * invalidations.
         */
        WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
+       WARN_ON(!kvm->mn_active_invalidate_count && kvm->mmu_invalidate_in_progress);
        kvm->mn_active_invalidate_count = 0;
 #else
        kvm_flush_shadow_all(kvm);


or an alternative style

	if (kvm->mn_active_invalidate_count)
		kvm->mn_active_invalidate_count = 0;
	else
		WARN_ON(kvm->mmu_invalidate_in_progress)

> > +	KVM_MMU_UNLOCK(kvm);
> > +}
> > +
> > +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> > +{
> > +	struct list_head *gmem_list = &inode->i_mapping->private_list;
> > +	pgoff_t start = offset >> PAGE_SHIFT;
> > +	pgoff_t end = (offset + len) >> PAGE_SHIFT;
> > +	struct kvm_gmem *gmem;
> > +
> > +	/*
> > +	 * Bindings must stable across invalidation to ensure the start+end
> > +	 * are balanced.
> > +	 */
> > +	filemap_invalidate_lock(inode->i_mapping);
> > +
> > +	list_for_each_entry(gmem, gmem_list, entry) {
> > +		kvm_gmem_invalidate_begin(gmem, start, end);
> > +		kvm_gmem_invalidate_end(gmem, start, end);
> > +	}
> Why to loop for each gmem in gmem_list here?
> 
> IIUIC, offset is the offset according to the inode, it is only meaningful to
> the inode passed in, i.e, it is only meaningful to the gmem binding with the
> inode, not others.

The code is structured to allow for multiple gmem instances per inode.  This isn't
actually possible in the initial code base, but it's on the horizon[*].  I included
the list-based infrastructure in this initial series to ensure that guest_memfd
can actually support multiple files per inode, and to minimize the churn when the
"link" support comes along.

[*] https://lore.kernel.org/all/cover.1691446946.git.ackerleytng@google.com
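
Roughly, the topology this opens the door to (an illustrative sketch assuming
the future "link" support; not actually possible with just this series):

  inode (owns the folios, the size, and the truncation state)
    `-- i_mapping->private_list
          |-- struct kvm_gmem A (file #1, with its own bindings xarray)
          `-- struct kvm_gmem B (file #2 created via link, its own bindings)

i.e. a hole punch on the inode has to walk every gmem instance in order to
invalidate each file's bindings.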


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-09-15  6:32   ` Yan Zhao
@ 2023-09-20 21:00     ` Sean Christopherson
  2023-09-21  1:21       ` Yan Zhao
  0 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-09-20 21:00 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Sep 15, 2023, Yan Zhao wrote:
> On Wed, Sep 13, 2023 at 06:55:09PM -0700, Sean Christopherson wrote:
> > From: Chao Peng <chao.p.peng@linux.intel.com>
> > 
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> > 
> ...
> >> +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> > +				     unsigned long attrs)
> > +{
> > +	XA_STATE(xas, &kvm->mem_attr_array, start);
> > +	unsigned long index;
> > +	bool has_attrs;
> > +	void *entry;
> > +
> > +	rcu_read_lock();
> > +
> > +	if (!attrs) {
> > +		has_attrs = !xas_find(&xas, end);
> > +		goto out;
> > +	}
> > +
> > +	has_attrs = true;
> > +	for (index = start; index < end; index++) {
> > +		do {
> > +			entry = xas_next(&xas);
> > +		} while (xas_retry(&xas, entry));
> > +
> > +		if (xas.xa_index != index || xa_to_value(entry) != attrs) {
> Should "xa_to_value(entry) != attrs" be "!(xa_to_value(entry) & attrs)" ?

No, the exact comparison is deliberate.  The intent of the API is to determine
if the entire range already has the desired attributes, not if there is overlap
between the two.

E.g. if/when RWX attributes are supported, the exact comparison is needed to
handle a RW => R conversion.
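
To illustrate with made-up values (the attribute encoding here is purely
hypothetical):

	/* assume R = 0x1, W = 0x2, so an RW entry has value 0x3 */
	xa_to_value(entry) == 0x3 (RW), attrs == 0x1 (R)

	xa_to_value(entry) != attrs  => true, has_attrs = false, conversion runs
	xa_to_value(entry) &  attrs  => nonzero, would report the range as already
	                                "R" and wrongly skip the RW => R conversion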

> > +			has_attrs = false;
> > +			break;
> > +		}
> > +	}
> > +
> > +out:
> > +	rcu_read_unlock();
> > +	return has_attrs;
> > +}
> > +
> ...
> > +/* Set @attributes for the gfn range [@start, @end). */
> > +static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> > +				     unsigned long attributes)
> > +{
> > +	struct kvm_mmu_notifier_range pre_set_range = {
> > +		.start = start,
> > +		.end = end,
> > +		.handler = kvm_arch_pre_set_memory_attributes,
> > +		.on_lock = kvm_mmu_invalidate_begin,
> > +		.flush_on_ret = true,
> > +		.may_block = true,
> > +	};
> > +	struct kvm_mmu_notifier_range post_set_range = {
> > +		.start = start,
> > +		.end = end,
> > +		.arg.attributes = attributes,
> > +		.handler = kvm_arch_post_set_memory_attributes,
> > +		.on_lock = kvm_mmu_invalidate_end,
> > +		.may_block = true,
> > +	};
> > +	unsigned long i;
> > +	void *entry;
> > +	int r = 0;
> > +
> > +	entry = attributes ? xa_mk_value(attributes) : NULL;
> Also here, do we need to get existing attributes of a GFN first ?

No?  @entry is the new value that will be set for all entries.  This line doesn't
touch the xarray in any way.  Maybe I'm just not understanding your question.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-09-18  7:51   ` Binbin Wu
@ 2023-09-20 21:03     ` Sean Christopherson
  2023-09-27  5:19       ` Binbin Wu
  0 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-09-20 21:03 UTC (permalink / raw)
  To: Binbin Wu
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn,
	Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
	Isaku Yamahata, Xu Yilun, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Mon, Sep 18, 2023, Binbin Wu wrote:
> 
> 
> On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> > From: Chao Peng <chao.p.peng@linux.intel.com>
> [...]
> > +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > +/*
> > + * Returns true if _all_ gfns in the range [@start, @end) have attributes
> > + * matching @attrs.
> > + */
> > +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> > +				     unsigned long attrs)
> > +{
> > +	XA_STATE(xas, &kvm->mem_attr_array, start);
> > +	unsigned long index;
> > +	bool has_attrs;
> > +	void *entry;
> > +
> > +	rcu_read_lock();
> > +
> > +	if (!attrs) {
> > +		has_attrs = !xas_find(&xas, end);
> IIUIC, xas_find() is inclusive for "end", so here should be "end - 1" ?

Yes, that does appear to be the case.  Inclusive vs. exclusive on gfn ranges
is the bane of my existence.
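
Presumably the fix is simply:

	has_attrs = !xas_find(&xas, end - 1);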

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-09-18 16:36   ` Michael Roth
@ 2023-09-20 23:44     ` Sean Christopherson
  0 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-20 23:44 UTC (permalink / raw)
  To: Michael Roth
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Mon, Sep 18, 2023, Michael Roth wrote:
> > +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> > +{
> > +	struct list_head *gmem_list = &inode->i_mapping->private_list;
> > +	pgoff_t start = offset >> PAGE_SHIFT;
> > +	pgoff_t end = (offset + len) >> PAGE_SHIFT;
> > +	struct kvm_gmem *gmem;
> > +
> > +	/*
> > +	 * Bindings must stable across invalidation to ensure the start+end
> > +	 * are balanced.
> > +	 */
> > +	filemap_invalidate_lock(inode->i_mapping);
> > +
> > +	list_for_each_entry(gmem, gmem_list, entry) {
> > +		kvm_gmem_invalidate_begin(gmem, start, end);
> 
> In v11 we used to call truncate_inode_pages_range() here to drop filemap's
> reference on the folio. AFAICT the folios are only getting free'd upon
> guest shutdown without this. Was this on purpose?

Nope, I just spotted this too.  And then after scratching my head for a few minutes,
wondering if I was having an -ENOCOFFEE moment, I finally read your mail.  *sigh*

Looking at my reflog history, I'm pretty sure I deleted the wrong line when
removing the truncation from kvm_gmem_error_page().
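
I.e. the hole punch presumably needs to regain the truncation between the
begin() and end() calls, something like this (sketch only):

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_begin(gmem, start, end);

	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_end(gmem, start, end);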

> > +		kvm_gmem_invalidate_end(gmem, start, end);
> > +	}
> > +
> > +	filemap_invalidate_unlock(inode->i_mapping);
> > +
> > +	return 0;
> > +}
> > +
> > +static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> > +{
> > +	struct address_space *mapping = inode->i_mapping;
> > +	pgoff_t start, index, end;
> > +	int r;
> > +
> > +	/* Dedicated guest is immutable by default. */
> > +	if (offset + len > i_size_read(inode))
> > +		return -EINVAL;
> > +
> > +	filemap_invalidate_lock_shared(mapping);
> 
> We take the filemap lock here, but not for
> kvm_gmem_get_pfn()->kvm_gmem_get_folio(). Is it needed there as well?

No, we specifically do not want to take a rwsem when faulting in guest memory.
Callers of kvm_gmem_get_pfn() *must* guard against concurrent invalidations via
mmu_invalidate_seq and friends.
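
For reference, the expected pattern is roughly the below; this is a sketch of
the guard, not the literal fault-path code:

	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	r = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &max_order);
	if (r)
		return r;

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
		write_unlock(&kvm->mmu_lock);
		/* release the pfn and retry the fault */
	}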

> > +	/*
> > +	 * For simplicity, require the offset into the file and the size of the
> > +	 * memslot to be aligned to the largest possible page size used to back
> > +	 * the file (same as the size of the file itself).
> > +	 */
> > +	if (!kvm_gmem_is_valid_size(offset, flags) ||
> > +	    !kvm_gmem_is_valid_size(size, flags))
> > +		goto err;
> 
> I needed to relax this check for SNP. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
> applies to entire gmem inode, so it makes sense for userspace to enable
> hugepages if start/end are hugepage-aligned, but QEMU will do things
> like map overlapping regions for ROMs and other things on top of the
> GPA range that the gmem inode was originally allocated for. For
> instance:
> 
>   692500@1689108688.696338:kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fbf5be00000 ret=0 restricted_fd=19 restricted_offset=0x0
>   692500@1689108688.699802:kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0x100000000 size=0x380000000 ua=0x7fbfdbe00000 ret=0 restricted_fd=19 restricted_offset=0x80000000
>   692500@1689108688.795412:kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x0 gpa=0x0 size=0x0 ua=0x7fbf5be00000 ret=0 restricted_fd=19 restricted_offset=0x0
>   692500@1689108688.795550:kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0xc0000 ua=0x7fbf5be00000 ret=0 restricted_fd=19 restricted_offset=0x0
>   692500@1689108688.796227:kvm_set_user_memory AddrSpace#0 Slot#6 flags=0x4 gpa=0x100000 size=0x7ff00000 ua=0x7fbf5bf00000 ret=0 restricted_fd=19 restricted_offset=0x100000
> 
> Because of that the KVM_SET_USER_MEMORY_REGIONs for non-THP-aligned GPAs
> will fail. Maybe instead it should be allowed, and kvm_gmem_get_folio()
> should handle the alignment checks on a case-by-case and simply force 4k
> for offsets corresponding to unaligned bindings?

Yeah, I wanted to keep the code simple, but disallowing small bindings/memslots
is probably going to be a deal-breaker.  Even though I'm skeptical that QEMU
_needs_ to play these games for SNP guests, not playing nice will make it all
but impossible to use guest_memfd for regular VMs.

And the code isn't really any more complex, so long as we punt on allowing
hugepages on interior sub-ranges.

Compile-tested only, but this?

---
 virt/kvm/guest_mem.c | 54 ++++++++++++++++++++++----------------------
 1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index a819367434e9..dc12e38211df 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -426,20 +426,6 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags,
 	return err;
 }
 
-static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
-{
-	if (size < 0 || !PAGE_ALIGNED(size))
-		return false;
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
-	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
-		return false;
-#endif
-
-	return true;
-}
-
 int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
@@ -452,9 +438,15 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
-	if (!kvm_gmem_is_valid_size(size, flags))
+	if (size < 0 || !PAGE_ALIGNED(size))
 		return -EINVAL;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
+		return false;
+#endif
+
 	return __kvm_gmem_create(kvm, size, flags, kvm_gmem_mnt);
 }
 
@@ -462,7 +454,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 		  unsigned int fd, loff_t offset)
 {
 	loff_t size = slot->npages << PAGE_SHIFT;
-	unsigned long start, end, flags;
+	unsigned long start, end;
 	struct kvm_gmem *gmem;
 	struct inode *inode;
 	struct file *file;
@@ -481,16 +473,9 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 		goto err;
 
 	inode = file_inode(file);
-	flags = (unsigned long)inode->i_private;
 
-	/*
-	 * For simplicity, require the offset into the file and the size of the
-	 * memslot to be aligned to the largest possible page size used to back
-	 * the file (same as the size of the file itself).
-	 */
-	if (!kvm_gmem_is_valid_size(offset, flags) ||
-	    !kvm_gmem_is_valid_size(size, flags))
-		goto err;
+	if (offset < 0 || !PAGE_ALIGNED(offset))
+		return -EINVAL;
 
 	if (offset + size > i_size_read(inode))
 		goto err;
@@ -591,8 +576,23 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	page = folio_file_page(folio, index);
 
 	*pfn = page_to_pfn(page);
-	if (max_order)
-		*max_order = compound_order(compound_head(page));
+	if (!max_order)
+		goto success;
+
+	*max_order = compound_order(compound_head(page));
+	if (!*max_order)
+		goto success;
+
+	/*
+	 * For simplicity, allow mapping a hugepage if and only if the entire
+	 * binding is compatible, i.e. don't bother supporting mapping interior
+	 * sub-ranges with hugepages (unless userspace comes up with a *really*
+	 * strong use case for needing hugepages within unaligned bindings).
+	 */
+	if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
+	    !IS_ALIGNED(slot->npages, 1ull << *max_order))
+		*max_order = 0;
+success:
 	r = 0;
 
 out_unlock:

base-commit: bc1a54ee393e0574ea422525cf0b2f1e768e38c5
-- 


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-09-20 21:00     ` Sean Christopherson
@ 2023-09-21  1:21       ` Yan Zhao
  2023-09-25 17:37         ` Sean Christopherson
  0 siblings, 1 reply; 83+ messages in thread
From: Yan Zhao @ 2023-09-21  1:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Sep 20, 2023 at 02:00:22PM -0700, Sean Christopherson wrote:
> On Fri, Sep 15, 2023, Yan Zhao wrote:
> > On Wed, Sep 13, 2023 at 06:55:09PM -0700, Sean Christopherson wrote:
> > > From: Chao Peng <chao.p.peng@linux.intel.com>
> > > 
> > > In confidential computing usages, whether a page is private or shared is
> > > necessary information for KVM to perform operations like page fault
> > > handling, page zapping etc. There are other potential use cases for
> > > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > > or exec-only, etc.) without having to modify memslots.
> > > 
> > ...
> > >> +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> > > +				     unsigned long attrs)
> > > +{
> > > +	XA_STATE(xas, &kvm->mem_attr_array, start);
> > > +	unsigned long index;
> > > +	bool has_attrs;
> > > +	void *entry;
> > > +
> > > +	rcu_read_lock();
> > > +
> > > +	if (!attrs) {
> > > +		has_attrs = !xas_find(&xas, end);
> > > +		goto out;
> > > +	}
> > > +
> > > +	has_attrs = true;
> > > +	for (index = start; index < end; index++) {
> > > +		do {
> > > +			entry = xas_next(&xas);
> > > +		} while (xas_retry(&xas, entry));
> > > +
> > > +		if (xas.xa_index != index || xa_to_value(entry) != attrs) {
> > Should "xa_to_value(entry) != attrs" be "!(xa_to_value(entry) & attrs)" ?
> 
> No, the exact comparsion is deliberate.  The intent of the API is to determine
> if the entire range already has the desired attributes, not if there is overlap
> between the two.
> 
> E.g. if/when RWX attributes are supported, the exact comparison is needed to
> handle a RW => R conversion.
> 
> > > +			has_attrs = false;
> > > +			break;
> > > +		}
> > > +	}
> > > +
> > > +out:
> > > +	rcu_read_unlock();
> > > +	return has_attrs;
> > > +}
> > > +
> > ...
> > > +/* Set @attributes for the gfn range [@start, @end). */
> > > +static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> > > +				     unsigned long attributes)
> > > +{
> > > +	struct kvm_mmu_notifier_range pre_set_range = {
> > > +		.start = start,
> > > +		.end = end,
> > > +		.handler = kvm_arch_pre_set_memory_attributes,
> > > +		.on_lock = kvm_mmu_invalidate_begin,
> > > +		.flush_on_ret = true,
> > > +		.may_block = true,
> > > +	};
> > > +	struct kvm_mmu_notifier_range post_set_range = {
> > > +		.start = start,
> > > +		.end = end,
> > > +		.arg.attributes = attributes,
> > > +		.handler = kvm_arch_post_set_memory_attributes,
> > > +		.on_lock = kvm_mmu_invalidate_end,
> > > +		.may_block = true,
> > > +	};
> > > +	unsigned long i;
> > > +	void *entry;
> > > +	int r = 0;
> > > +
> > > +	entry = attributes ? xa_mk_value(attributes) : NULL;
> > Also here, do we need to get existing attributes of a GFN first ?
> 
> No?  @entry is the new value that will be set for all entries.  This line doesn't
> touch the xarray in any way.  Maybe I'm just not understanding your question.
Hmm, I thought this interface was to allow users to add/remove an attribute to a GFN
rather than overwrite all attributes of a GFN. Now I think I misunderstood the intention.

But I wonder if there is a way for users to just add one attribute, as I don't find an
ioctl like KVM_GET_MEMORY_ATTRIBUTES for users to get the current attributes and then to
add/remove one based on that, e.g. maybe in the future KVM wants to add one attribute in
the kernel without being told by userspace?


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-09-20 13:55     ` Sean Christopherson
@ 2023-09-21  2:39       ` Xu Yilun
  2023-09-21 14:24         ` Sean Christopherson
  0 siblings, 1 reply; 83+ messages in thread
From: Xu Yilun @ 2023-09-21  2:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On 2023-09-20 at 06:55:05 -0700, Sean Christopherson wrote:
> On Wed, Sep 20, 2023, Xu Yilun wrote:
> > On 2023-09-13 at 18:55:00 -0700, Sean Christopherson wrote:
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +	lockdep_assert_held_write(&kvm->mmu_lock);
> > > +
> > > +	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > +
> > >  	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > >  		kvm->mmu_invalidate_range_start = start;
> > >  		kvm->mmu_invalidate_range_end = end;
> > 
> > IIUC, Now we only add or override a part of the invalidate range in
> > these fields, IOW only the range in last slot is stored when we unlock.
> 
> Ouch.  Good catch!
> 
> > That may break mmu_invalidate_retry_gfn() cause it can never know the
> > whole invalidate range.
> > 
> > How about we extend the mmu_invalidate_range_start/end everytime so that
> > it records the whole invalidate range:
> > 
> > if (kvm->mmu_invalidate_range_start == INVALID_GPA) {
> > 	kvm->mmu_invalidate_range_start = start;
> > 	kvm->mmu_invalidate_range_end = end;
> > } else {
> > 	kvm->mmu_invalidate_range_start =
> > 		min(kvm->mmu_invalidate_range_start, start);
> > 	kvm->mmu_invalidate_range_end =
> > 		max(kvm->mmu_invalidate_range_end, end);
> > }
> 
> Yeah, that does seem to be the easiest solution.
> 
> I'll post a fixup patch, unless you want the honors.

Please go ahead, because on second thought I'm wondering if this simple
range extension is reasonable.

When the invalidation crosses multiple slots, I'm not sure if the
contiguous HVA range must correspond to a contiguous GFN range. If not,
are we producing a larger range than required?

And when the invalidation crosses multiple address spaces, I'm almost
sure it is wrong to merge GFN ranges from different address spaces. But
I have no clear solution yet.

Thanks,
Yilun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory
  2023-09-15 14:26     ` Sean Christopherson
  2023-09-18  0:54       ` Yan Zhao
@ 2023-09-21  5:51       ` Binbin Wu
  1 sibling, 0 replies; 83+ messages in thread
From: Binbin Wu @ 2023-09-21  5:51 UTC (permalink / raw)
  To: Sean Christopherson, Yan Zhao
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov



On 9/15/2023 10:26 PM, Sean Christopherson wrote:
> On Fri, Sep 15, 2023, Yan Zhao wrote:
>> On Wed, Sep 13, 2023 at 06:55:16PM -0700, Sean Christopherson wrote:
>> ....
>>> +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>>> +					      struct kvm_page_fault *fault)
>>> +{
>>> +	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
>>> +				      PAGE_SIZE, fault->write, fault->exec,
>>> +				      fault->is_private);
>>> +}
>>> +
>>> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
>>> +				   struct kvm_page_fault *fault)
>>> +{
>>> +	int max_order, r;
>>> +
>>> +	if (!kvm_slot_can_be_private(fault->slot)) {
>>> +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
>>> +		return -EFAULT;
>>> +	}
>>> +
>>> +	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
>>> +			     &max_order);
>>> +	if (r) {
>>> +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
>>> +		return r;
>>> +	}
>>> +
>>> +	fault->max_level = min(kvm_max_level_for_order(max_order),
>>> +			       fault->max_level);
>>> +	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
>>> +
>>> +	return RET_PF_CONTINUE;
>>> +}
>>> +
>>>   static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>>>   {
>>>   	struct kvm_memory_slot *slot = fault->slot;
>>> @@ -4293,6 +4356,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>>>   			return RET_PF_EMULATE;
>>>   	}
>>>   
>>> +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
>> In patch 21,
>> fault->is_private is set as:
>> 	".is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT)",
>> then, the inequality here means memory attribute has been updated after
>> last check.
>> So, why an exit to user space for converting is required instead of a mere retry?
>>
>> Or, is it because how .is_private is assigned in patch 21 is subjected to change
>> in future?
> This.  Retrying on SNP or TDX would hang the guest.  I suppose we could special
> case VMs where .is_private is derived from the memory attributes, but the
> SW_PROTECTED_VM type is primary a development vehicle at this point.  I'd like to
> have it mimic SNP/TDX as much as possible; performance is a secondary concern.
So when .is_private is derived from the memory attributes, and if I didn't
miss anything, there is no explicit conversion mechanism introduced yet so
far. Does that mean that for a pure sw-protected VM (without SNP/TDX), the
page fault will be handled according to the memory attributes set up by the
host/user VMM, and no implicit conversion will be triggered, right?


>
> E.g. userspace needs to be prepared for "spurious" exits due to races on SNP and
> TDX, which this can theoretically exercise.  Though the window is quite small so
> I doubt that'll actually happen in practice; which of course also makes it less
> important to retry instead of exiting.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-09-20 14:24     ` Sean Christopherson
@ 2023-09-21  5:58       ` Binbin Wu
  0 siblings, 0 replies; 83+ messages in thread
From: Binbin Wu @ 2023-09-21  5:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn,
	Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
	Isaku Yamahata, Xu Yilun, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov



On 9/20/2023 10:24 PM, Sean Christopherson wrote:
> On Tue, Sep 19, 2023, Binbin Wu wrote:
>>
>> On 9/14/2023 9:55 AM, Sean Christopherson wrote:
>> [...]
>>> +
>>> +static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
>>> +				      pgoff_t end)
>>> +{
>>> +	struct kvm_memory_slot *slot;
>>> +	struct kvm *kvm = gmem->kvm;
>>> +	unsigned long index;
>>> +	bool flush = false;
>>> +
>>> +	KVM_MMU_LOCK(kvm);
>>> +
>>> +	kvm_mmu_invalidate_begin(kvm);
>>> +
>>> +	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
>>> +		pgoff_t pgoff = slot->gmem.pgoff;
>>> +
>>> +		struct kvm_gfn_range gfn_range = {
>>> +			.start = slot->base_gfn + max(pgoff, start) - pgoff,
>>> +			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
>>> +			.slot = slot,
>>> +			.may_block = true,
>>> +		};
>>> +
>>> +		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
>>> +	}
>>> +
>>> +	if (flush)
>>> +		kvm_flush_remote_tlbs(kvm);
>>> +
>>> +	KVM_MMU_UNLOCK(kvm);
>>> +}
>>> +
>>> +static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
>>> +				    pgoff_t end)
>>> +{
>>> +	struct kvm *kvm = gmem->kvm;
>>> +
>>> +	KVM_MMU_LOCK(kvm);
>>> +	if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT))
>>> +		kvm_mmu_invalidate_end(kvm);
>> kvm_mmu_invalidate_begin() is called unconditionally in
>> kvm_gmem_invalidate_begin(),
>> but kvm_mmu_invalidate_end() is not here.
>> This makes the kvm_gmem_invalidate_{begin, end}() calls asymmetric.
> Another ouch :-(
>
> And there should be no need to acquire mmu_lock() unconditionally, the inode's
> mutex protects the bindings, not mmu_lock.
>
> I'll get a fix posted today.  I think KVM can also add a sanity check to detect
> unresolved invalidations, e.g.
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7ba1ab1832a9..2a2d18070856 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1381,8 +1381,13 @@ static void kvm_destroy_vm(struct kvm *kvm)
>           * No threads can be waiting in kvm_swap_active_memslots() as the
>           * last reference on KVM has been dropped, but freeing
>           * memslots would deadlock without this manual intervention.
> +        *
> +        * If the count isn't unbalanced, i.e. KVM did NOT unregister between
> +        * a start() and end(), then there shouldn't be any in-progress
> +        * invalidations.
>           */
>          WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
> +       WARN_ON(!kvm->mn_active_invalidate_count && kvm->mmu_invalidate_in_progress);
>          kvm->mn_active_invalidate_count = 0;
>   #else
>          kvm_flush_shadow_all(kvm);
>
>
> or an alternative style
>
> 	if (kvm->mn_active_invalidate_count)
> 		kvm->mn_active_invalidate_count = 0;
> 	else
> 		WARN_ON(kvm->mmu_invalidate_in_progress)
>
>>> +	KVM_MMU_UNLOCK(kvm);
>>> +}
>>> +
>>> +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>>> +{
>>> +	struct list_head *gmem_list = &inode->i_mapping->private_list;
>>> +	pgoff_t start = offset >> PAGE_SHIFT;
>>> +	pgoff_t end = (offset + len) >> PAGE_SHIFT;
>>> +	struct kvm_gmem *gmem;
>>> +
>>> +	/*
>>> +	 * Bindings must stable across invalidation to ensure the start+end
>>> +	 * are balanced.
>>> +	 */
>>> +	filemap_invalidate_lock(inode->i_mapping);
>>> +
>>> +	list_for_each_entry(gmem, gmem_list, entry) {
>>> +		kvm_gmem_invalidate_begin(gmem, start, end);
>>> +		kvm_gmem_invalidate_end(gmem, start, end);
>>> +	}
>> Why to loop for each gmem in gmem_list here?
>>
>> IIUIC, offset is the offset according to the inode, it is only meaningful to
>> the inode passed in, i.e, it is only meaningful to the gmem binding with the
>> inode, not others.
> The code is structured to allow for multiple gmem instances per inode.  This isn't
> actually possible in the initial code base, but it's on the horizon[*].  I included
> the list-based infrastructure in this initial series to ensure that guest_memfd
> can actually support multiple files per inode, and to minimize the churn when the
> "link" support comes along.
>
> [*] https://lore.kernel.org/all/cover.1691446946.git.ackerleytng@google.com
Got it, thanks for the explanation!




^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-09-21  2:39       ` Xu Yilun
@ 2023-09-21 14:24         ` Sean Christopherson
  0 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-21 14:24 UTC (permalink / raw)
  To: Xu Yilun
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Thu, Sep 21, 2023, Xu Yilun wrote:
> When the invalidation acrosses multiple slots, I'm not sure if the
> contiguous HVA range must correspond to contiguous GFN range. If not,
> are we producing a larger range than required?

Multiple invalidations are all but guaranteed to yield a range that covers addresses
that aren't actually being invalidated.  This is true today.

> And when the invalidation acrosses multiple address space, I'm almost
> sure it is wrong to merge GFN ranges from different address spaces. 

It's not "wrong" in the sense that false positives do not cause functional
problems, at worst a false positive can unnecessarily stall a vCPU until the
unrelated invalidations complete.

Multiple concurrent invalidations are not common, and if they do happen, they are
likely related and will have spacial locality in both host virtual address space
and guest physical address space.  Given that, we chose for the simple (and fast!)
approach of maintaining a single all-encompassing range.
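
E.g. with the min/max merging from upthread (gfn values made up):

	invalidation A covers gfns [0x100,   0x200)
	invalidation B covers gfns [0xa0000, 0xa0100)

	mmu_invalidate_range_start = min(0x100, 0xa0000) = 0x100
	mmu_invalidate_range_end   = max(0x200, 0xa0100) = 0xa0100

	While both are in-flight, a fault on gfn 0x5000 is treated as "in range"
	and retried even though 0x5000 isn't being invalidated, i.e. a false
	positive that costs a short stall and nothing more.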

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory
  2023-09-18  0:54       ` Yan Zhao
@ 2023-09-21 14:59         ` Sean Christopherson
  0 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-21 14:59 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Mon, Sep 18, 2023, Yan Zhao wrote:
> On Fri, Sep 15, 2023 at 07:26:16AM -0700, Sean Christopherson wrote:
> > On Fri, Sep 15, 2023, Yan Zhao wrote:
> > > >  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > >  {
> > > >  	struct kvm_memory_slot *slot = fault->slot;
> > > > @@ -4293,6 +4356,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > > >  			return RET_PF_EMULATE;
> > > >  	}
> > > >  
> > > > +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > > In patch 21,
> > > fault->is_private is set as:
> > > 	".is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT)",
> > > then, the inequality here means memory attribute has been updated after
> > > last check.
> > > So, why an exit to user space for converting is required instead of a mere retry?
> > > 
> > > Or, is it because how .is_private is assigned in patch 21 is subjected to change
> > > in future? 
> > 
> > This.  Retrying on SNP or TDX would hang the guest.  I suppose we could special
> Is this because if the guest access a page in private way (e.g. via
> private key in TDX), the returned page must be a private page?

Yes, the returned page must be private, because the GHCI (TDX) and GHCB (SNP)
require that the host allow implicit conversions.  I.e. if the guest accesses
memory as private (or shared), then the host must map memory as private (or shared).
Simply resuming the guest will not change the guest access, nor will it change KVM's
memory attributes.

Ideally (IMO), implicit conversions would be disallowed, but even if implicit
conversions weren't a thing, retrying would still be wrong as KVM would either
inject an exception into the guest or exit to userspace to let userspace handle
the illegal access.

> > case VMs where .is_private is derived from the memory attributes, but the
> > SW_PROTECTED_VM type is primary a development vehicle at this point.  I'd like to
> > have it mimic SNP/TDX as much as possible; performance is a secondary concern.
> Ok. But this mimic is somewhat confusing as it may be problematic in below scenario,
> though sane guest should ensure no one is accessing a page before doing memory
> conversion.
> 
> 
> CPU 0                           CPU 1
> access GFN A in private way
> fault->is_private=true
>                                 convert GFN A to shared
> 			        set memory attribute of A to shared
> 
> faultin, mismatch and exit
> set memory attribute of A
> to private
> 
>                                 vCPU access GFN A in shared way
>                                 fault->is_private = true
>                                 faultin, match and map a private PFN B
> 
>                                 vCPU accesses private PFN B in shared way

If this is a TDX or SNP VM, then the private vs. shared information comes from
the guest itself, e.g. this sequence

                                   vCPU access GFN A in shared way
                                   fault->is_private = true

cannot happen because is_private will be false based on the error code (SNP) or
the GPA (TDX).

And when hardware doesn't generate page faults based on private vs. shared, i.e.
for non-TDX/SNP VMs, from a fault handling perspective there is no concept of the
guest accessing a GFN in a "private way" or a "shared way".  I.e. there are no
implicit conversions.
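
Put differently, how is_private gets derived for the various VM types is
conceptually (the names below are illustrative, not the actual KVM
fields/flags):

	if (vm_is_tdx)
		is_private = !(gpa & shared_gpa_bit);          /* shared bit in the GPA */
	else if (vm_is_snp)
		is_private = npf_error_code_has_enc_bit;       /* bit in the #NPF error code */
	else
		is_private = kvm_mem_is_private(kvm, gfn);     /* per-gfn memory attributes */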

For SEV and SEV-ES, the guest can access memory as private vs. shared, but the
guest and the host VMM absolutely must be in agreement and synchronized with
respect to the state of a page, otherwise guest memory will be corrupted.  But that has
nothing to do with the fault handling, e.g. creating aliases in the guest to access
a single GFN as shared and private from two CPUs will create incoherent cache
entries and/or corrupt data without any involvement from KVM.

In other words, the above isn't possible for TDX/SNP, and for all other types,
the conflict between CPU0 and CPU1 is unequivocally a guest bug.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-09-14  1:55 ` [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
                     ` (2 preceding siblings ...)
  2023-09-19  9:01   ` Binbin Wu
@ 2023-09-21 19:10   ` Sean Christopherson
  3 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-21 19:10 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Sep 13, 2023, Sean Christopherson wrote:
>  virt/kvm/guest_mem.c       | 593 +++++++++++++++++++++++++++++++++++++

Getting to the really important stuff...

Anyone object to naming the new file guest_memfd.c instead of guest_mem.c?  Just
the file, i.e. still keep the gmem namespace.

Using guest_memfd.c would make it much more obvious that the file holds more than
generic "guest memory" APIs, and would provide a stronger conceptual connection
with memfd.c.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-09-14  1:55 ` [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
@ 2023-09-22  6:03   ` Xiaoyao Li
  2023-09-22 14:30     ` Sean Christopherson
  2023-09-22 16:28     ` Sean Christopherson
  0 siblings, 2 replies; 83+ messages in thread
From: Xiaoyao Li @ 2023-09-22  6:03 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> Add a new KVM exit type to allow userspace to handle memory faults that
> KVM cannot resolve, but that userspace *may* be able to handle (without
> terminating the guest).
> 
> KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
> conversions between private and shared memory.  With guest private memory,
> there will be two kinds of memory conversions:
> 
>    - explicit conversion: happens when the guest explicitly calls into KVM
>      to map a range (as private or shared)
> 
>    - implicit conversion: happens when the guest attempts to access a gfn
>      that is configured in the "wrong" state (private vs. shared)
> 
> On x86 (first architecture to support guest private memory), explicit
> conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,

side topic.

Do we expect to integrate TDVMCALL(MAPGPA) of TDX into KVM_HC_MAP_GPA_RANGE?

> but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
> as there is (obviously) no hypercall, and there is no guarantee that the
> guest actually intends to convert between private and shared, i.e. what
> KVM thinks is an implicit conversion "request" could actually be the
> result of a guest code bug.
> 
> KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
> be implicit conversions.
> 
> Place "struct memory_fault" in a second anonymous union so that filling
> memory_fault doesn't clobber state from other yet-to-be-fulfilled exits,
> and to provide additional information if KVM does NOT ultimately exit to
> userspace with KVM_EXIT_MEMORY_FAULT, e.g. if KVM suppresses (or worse,
> loses) the exit, as KVM often suppresses exits for memory failures that
> occur when accessing paravirt data structures.  The initial usage for
> private memory will be all-or-nothing, but other features such as the
> proposed "userfault on missing mappings" support will use
> KVM_EXIT_MEMORY_FAULT for potentially _all_ guest memory accesses, i.e.
> will run afoul of KVM's various quirks.

So when exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which 
field in the first union is valid?

When exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know the info 
in the second union run.memory is valid without a run.memory.valid field?

> Use bit 3 for flagging private memory so that KVM can use bits 0-2 for
> capturing RWX behavior if/when userspace needs such information.
> 
> Note!  To allow for future possibilities where KVM reports
> KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
> fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
> perspective), not '0'!  Due to historical baggage within KVM, exiting to
> userspace with '0' from deep callstacks, e.g. in emulation paths, is
> infeasible as doing so would require a near-complete overhaul of KVM,
> whereas KVM already propagates -errno return codes to userspace even when
> the -errno originated in a low level helper.
> 
> Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
> Cc: Anish Moorthy <amoorthy@google.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   Documentation/virt/kvm/api.rst | 24 ++++++++++++++++++++++++
>   include/linux/kvm_host.h       | 15 +++++++++++++++
>   include/uapi/linux/kvm.h       | 24 ++++++++++++++++++++++++
>   3 files changed, 63 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 21a7578142a1..e28a13439a95 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6702,6 +6702,30 @@ array field represents return values. The userspace should update the return
>   values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>   spec refer, https://github.com/riscv/riscv-sbi-doc.
>   
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +
> +KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
> +could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
> +guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
> +describes properties of the faulting access that are likely pertinent:
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
> +   on a private memory access.  When clear, indicates the fault occurred on a
> +   shared access.
> +
> +Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
> +accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
> +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
> +kvm_run.exit_reason is stale/undefined for all other error numbers.
> +

Initially, this section was a copy of struct kvm_run and had comments
for each field accordingly. Unfortunately, that consistency has not been
well maintained as new fields were added.

Do we expect to fix it?

>   ::
>   
>       /* KVM_EXIT_NOTIFY */
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4e741ff27af3..d8c6ce6c8211 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2327,4 +2327,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>   /* Max number of entries allowed for each kvm dirty ring */
>   #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>   
> +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +						 gpa_t gpa, gpa_t size,
> +						 bool is_write, bool is_exec,
> +						 bool is_private)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.size = size;
> +
> +	/* RWX flags are not (yet) defined or communicated to userspace. */
> +	vcpu->run->memory_fault.flags = 0;
> +	if (is_private)
> +		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
> +}
> +
>   #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index bd1abe067f28..d2d913acf0df 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -274,6 +274,7 @@ struct kvm_xen_exit {
>   #define KVM_EXIT_RISCV_SBI        35
>   #define KVM_EXIT_RISCV_CSR        36
>   #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>   
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -541,6 +542,29 @@ struct kvm_run {
>   		struct kvm_sync_regs regs;
>   		char padding[SYNC_REGS_SIZE_BYTES];
>   	} s;
> +
> +	/*
> +	 * This second exit union holds structs for exit types which may be
> +	 * triggered after KVM has already initiated a different exit, or which
> +	 * may be ultimately dropped by KVM.
> +	 *
> +	 * For example, because of limitations in KVM's uAPI, KVM x86 can
> +	 * generate a memory fault exit after an MMIO exit is initiated (exit_reason
> +	 * and kvm_run.mmio are filled).  And conversely, KVM often disables
> +	 * paravirt features if a memory fault occurs when accessing paravirt
> +	 * data instead of reporting the error to userspace.
> +	 */
> +	union {
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
> +		/* Fix the size of the union. */
> +		char padding2[256];
> +	};
>   };
>   
>   /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-09-22  6:03   ` Xiaoyao Li
@ 2023-09-22 14:30     ` Sean Christopherson
  2023-09-22 16:28     ` Sean Christopherson
  1 sibling, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-22 14:30 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Sep 22, 2023, Xiaoyao Li wrote:
> On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> > From: Chao Peng <chao.p.peng@linux.intel.com>
> > 
> > Add a new KVM exit type to allow userspace to handle memory faults that
> > KVM cannot resolve, but that userspace *may* be able to handle (without
> > terminating the guest).
> > 
> > KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
> > conversions between private and shared memory.  With guest private memory,
> > there will be two kinds of memory conversions:
> > 
> >    - explicit conversion: happens when the guest explicitly calls into KVM
> >      to map a range (as private or shared)
> > 
> >    - implicit conversion: happens when the guest attempts to access a gfn
> >      that is configured in the "wrong" state (private vs. shared)
> > 
> > On x86 (first architecture to support guest private memory), explicit
> > conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
> 
> side topic.
> 
> Do we expect to integrate TDVMCALL(MAPGPA) of TDX into KVM_HC_MAP_GPA_RANGE?

Yes, that's my expectation.

> > but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
> > as there is (obviously) no hypercall, and there is no guarantee that the
> > guest actually intends to convert between private and shared, i.e. what
> > KVM thinks is an implicit conversion "request" could actually be the
> > result of a guest code bug.
> > 
> > KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
> > be implicit conversions.
> > 
> > Place "struct memory_fault" in a second anonymous union so that filling
> > memory_fault doesn't clobber state from other yet-to-be-fulfilled exits,
> > and to provide additional information if KVM does NOT ultimately exit to
> > userspace with KVM_EXIT_MEMORY_FAULT, e.g. if KVM suppresses (or worse,
> > loses) the exit, as KVM often suppresses exits for memory failures that
> > occur when accessing paravirt data structures.  The initial usage for
> > private memory will be all-or-nothing, but other features such as the
> > proposed "userfault on missing mappings" support will use
> > KVM_EXIT_MEMORY_FAULT for potentially _all_ guest memory accesses, i.e.
> > will run afoul of KVM's various quirks.
> 
> So when exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which field in
> the first union is valid?
> 
> When exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know the info in
> the second union run.memory is valid without a run.memory.valid field?

I'll respond to this separately with a trimmed Cc list.  I suspect this will be
a rather lengthy conversation, and it has almost nothing to do with guest_memfd.

> > +Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
> > +accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
> > +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
> > +kvm_run.exit_reason is stale/undefined for all other error numbers.
> > +
> 
> Initially, this section was a copy of struct kvm_run and had comments for
> each field accordingly. Unfortunately, that consistency has not been well
> maintained as new fields were added.
> 
> Do we expect to fix it?

AFAIK, no one is working on cleaning up this section of the docs, but as always,
patches are welcome :-)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-09-22  6:03   ` Xiaoyao Li
  2023-09-22 14:30     ` Sean Christopherson
@ 2023-09-22 16:28     ` Sean Christopherson
  2023-09-22 16:35       ` Sean Christopherson
  2023-10-02 22:33       ` Anish Moorthy
  1 sibling, 2 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-22 16:28 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, kvm, kvmarm, kvm-riscv,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata

[-- Attachment #1: Type: text/plain, Size: 10807 bytes --]

Removing non-KVM lists/people from Cc, this is going to get way off the guest_memfd
track...

On Fri, Sep 22, 2023, Xiaoyao Li wrote:
> On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> > Place "struct memory_fault" in a second anonymous union so that filling
> > memory_fault doesn't clobber state from other yet-to-be-fulfilled exits,
> > and to provide additional information if KVM does NOT ultimately exit to
> > userspace with KVM_EXIT_MEMORY_FAULT, e.g. if KVM suppresses (or worse,
> > loses) the exit, as KVM often suppresses exits for memory failures that
> > occur when accessing paravirt data structures.  The initial usage for
> > private memory will be all-or-nothing, but other features such as the
> > proposed "userfault on missing mappings" support will use
> > KVM_EXIT_MEMORY_FAULT for potentially _all_ guest memory accesses, i.e.
> > will run afoul of KVM's various quirks.
> 
> So when exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which field in
> the first union is valid?

/facepalm

At one point, I believe we had discussed a second exit reason field?  But yeah,
as is, there's no way for userspace to glean anything useful from the first union.

The more I think about this, the more I think it's a fool's errand.  Even if KVM
provides the exit_reason history, userspace can't act on the previous, unfulfilled
exit without *knowing* that it's safe/correct to process the previous exit.  I
don't see how that's remotely possible.

Practically speaking, there is one known instance of this in KVM, and it's a
rather ridiculous edge case that has existed "forever".  I'm very strongly inclined
to do nothing special, and simply treat clobbering an exit that userspace actually
cares about like any other KVM bug.

> When exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know the info in
> the second union run.memory is valid without a run.memory.valid field?

Anish's series adds a flag in kvm_run.flags to track whether or not memory_fault
has been filled.  The idea is that KVM would clear the flag early in KVM_RUN, and
then set the flag when memory_fault is first filled.

	/* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
	#define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)

I didn't propose that flag here because clobbering memory_fault from the page
fault path would be a flagrant KVM bug.
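
Roughly, the idea as described maps to something like the sketch below.  Where
exactly the clear happens, and that kvm_prepare_memory_fault_exit() would be the
place to do the set, are assumptions based purely on the description above, not
Anish's actual patches:

	/* Early in kvm_vcpu_ioctl(KVM_RUN), before running the vCPU: */
	vcpu->run->flags &= ~KVM_RUN_MEMORY_FAULT_FILLED;

	/* When memory_fault is first filled, e.g. in kvm_prepare_memory_fault_exit(): */
	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
	vcpu->run->flags |= KVM_RUN_MEMORY_FAULT_FILLED;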

Honestly, I'm becoming more and more skeptical that separating memory_fault is
worthwhile, or even desirable.  Similar to memory_fault clobbering something else,
userspace can only take action if memory_fault is clobbered if userspace somehow
knows that it's safe/correct to do so.

Even if KVM exits "immediately" after initially filling memory_fault, the fact
that KVM is exiting for a different reason (or a different memory fault) means
that KVM did *something* between filling memory_fault and actually exiting.  And
it's completely impossible for userspace to know what that "something" was.

E.g. in the splat from selftests[1], KVM reacts to a failure during Real Mode
event injection by synthesizing a triple fault

	ret = emulate_int_real(ctxt, irq);

	if (ret != X86EMUL_CONTINUE) {
		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);

There are multiple KVM bugs at play: read_emulate() and write_emulate() incorrectly
assume *all* failures should be treated like MMIO, and conversely ->read_std() and
->write_std() don't handle *any* failures as MMIO.

Circling back to my "capturing the history is pointless" assertion, by the time
userspace gets an exit, the vCPU is already in shutdown, and KVM has clobbered
memory_fault something like five times.  There is zero chance userspace can do
anything but shed a tear for the VM and move on.

The whole "let's annotate all memory faults" idea came from my desire to push KVM
towards a future where all -EFAULT exits are annotated[2].  I still think we should
point KVM in that general direction, i.e. implement something that _can_ provide
100% "coverage" in the future, even though we don't expect to get there anytime soon.

I bring that up because neither private memory nor userfault-on-missing needs to
annotate anything other than -EFAULT during guest page faults.  I.e. all of this
paranoia about clobbering memory_fault isn't actually buying us anything other
than noise and complexity.  The cases we need to work _today_ are perfectly fine,
and _if_ some future use cases needs all/more paths to be 100% accurate, then the
right thing to do is to make whatever changes are necessary for KVM to be 100%
accurate.

In other words, trying to gracefully handle memory_fault clobbering is pointless.
KVM either needs to guarantee there's no clobbering (guest page fault paths) or
treat the annotation as best effort and informational-only (everything else at
this time).  Future features may grow the set of paths that needs strong guarantees,
but that just means fixing more paths and treating any violation of the contract
like any other KVM bug.

And if we stop being unnecessarily paranoid, KVM_RUN_MEMORY_FAULT_FILLED can also
go away.  The flag came about in part because *unconditionally* sanitizing
kvm_run.exit_reason at the start of KVM_RUN would break KVM's ABI, as userspace
may rely on the exit_reason being preserved when calling back into KVM to complete
userspace I/O (or MMIO)[3].  But the goal is purely to avoid exiting with stale
memory_fault information, not to sanitize every other existing exit_reason, and
that can be achieved by simply making the reset conditional.

Somewhat of a tangent, I think we should add KVM_CAP_MEMORY_FAULT_INFO if the
KVM_EXIT_MEMORY_FAULT support comes in with guest_memfd.

Unless someone comes up with a good argument for keeping the paranoid behavior,
I'll post the below patch as fixup for the guest_memfd series, and work with Anish
to massage the attached patch (result of the below being squashed) in case his
series lands first.

[1] https://lore.kernel.org/all/202309141107.30863e9d-oliver.sang@intel.com
[2] https://lore.kernel.org/all/Y+6iX6a22+GEuH1b@google.com
[3] https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
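
For reference, userspace consumption of the exit under the contract documented in
the patch below would look roughly like the sketch here.  This is illustrative
only, not code from any series; KVM_EXIT_MEMORY_FAULT, KVM_MEMORY_EXIT_FLAG_PRIVATE,
and the memory_fault layout are the proposed uAPI from this patch and are not in
released kernel headers:

	#include <errno.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* @vcpu_fd is the vCPU fd, @run is the vCPU's mmap'd struct kvm_run. */
	static int handle_run(int vcpu_fd, struct kvm_run *run)
	{
		int ret = ioctl(vcpu_fd, KVM_RUN, 0);

		/*
		 * memory_fault is valid if and only if KVM_RUN failed with
		 * EFAULT or EHWPOISON *and* exit_reason is KVM_EXIT_MEMORY_FAULT.
		 */
		if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
		    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			bool is_private = run->memory_fault.flags &
					  KVM_MEMORY_EXIT_FLAG_PRIVATE;

			printf("memory fault: gpa=0x%llx size=0x%llx (%s)\n",
			       (unsigned long long)run->memory_fault.gpa,
			       (unsigned long long)run->memory_fault.size,
			       is_private ? "private" : "shared");
			/* Convert/populate the range as appropriate, then retry KVM_RUN. */
			return 0;
		}
		return ret;
	}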

---
 Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++
 arch/x86/kvm/x86.c             |  1 +
 include/uapi/linux/kvm.h       | 37 ++++++++++------------------------
 virt/kvm/kvm_main.c            | 10 +++++++++
 4 files changed, 43 insertions(+), 26 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 5e08f2a157ef..d5c9e46e2d12 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7829,6 +7829,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN *may* fill
+kvm_run.memory_fault in response to failed guest memory accesses in a vCPU
+context.  KVM only guarantees that errors that occur when handling guest page
+fault VM-Exits will be annotated, all other error paths are best effort.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT for more information.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 767236b4d771..e25076fdd720 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4525,6 +4525,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 65fc983af840..7f0ee6475141 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -525,6 +525,13 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -546,29 +553,6 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
-
-	/*
-	 * This second exit union holds structs for exit types which may be
-	 * triggered after KVM has already initiated a different exit, or which
-	 * may be ultimately dropped by KVM.
-	 *
-	 * For example, because of limitations in KVM's uAPI, KVM x86 can
-	 * generate a memory fault exit after an MMIO exit is initiated (exit_reason
-	 * and kvm_run.mmio are filled).  And conversely, KVM often disables
-	 * paravirt features if a memory fault occurs when accessing paravirt
-	 * data instead of reporting the error to userspace.
-	 */
-	union {
-		/* KVM_EXIT_MEMORY_FAULT */
-		struct {
-#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
-			__u64 flags;
-			__u64 gpa;
-			__u64 size;
-		} memory_fault;
-		/* Fix the size of the union. */
-		char padding2[256];
-	};
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1231,9 +1215,10 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_USER_MEMORY2 230
-#define KVM_CAP_MEMORY_ATTRIBUTES 231
-#define KVM_CAP_GUEST_MEMFD 232
-#define KVM_CAP_VM_TYPES 233
+#define KVM_CAP_MEMORY_FAULT_INFO 231
+#define KVM_CAP_MEMORY_ATTRIBUTES 232
+#define KVM_CAP_GUEST_MEMFD 233
+#define KVM_CAP_VM_TYPES 234
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 96fc609459e3..d78e97b527e5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4450,6 +4450,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
 				synchronize_rcu();
 			put_pid(oldpid);
 		}
+
+		/*
+		 * Reset the exit reason if the previous userspace exit was due
+		 * to a memory fault.  Not all -EFAULT exits are annotated, and
+		 * so leaving exit_reason set to KVM_EXIT_MEMORY_FAULT could
+		 * result in feeding userspace stale information.
+		 */
+		if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT)
+			vcpu->run->exit_reason = KVM_EXIT_UNKNOWN
+
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;

base-commit: 67aa951d727ad2715f7ad891929f18b7f2567a0f
-- 


[-- Attachment #2: 0001-KVM-Add-KVM_EXIT_MEMORY_FAULT-exit-to-report-faults-.patch --]
[-- Type: text/x-diff, Size: 9189 bytes --]

From ca887b5ed3b344562411cf2876a68a82bd0f584b Mon Sep 17 00:00:00 2001
From: Chao Peng <chao.p.peng@linux.intel.com>
Date: Wed, 13 Sep 2023 18:55:05 -0700
Subject: [PATCH] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to
 userspace

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory.  With guest private memory,
there will be two kinds of memory conversions:

  - explicit conversion: happens when the guest explicitly calls into KVM
    to map a range (as private or shared)

  - implicit conversion: happens when the guest attempts to access a gfn
    that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Use bit 3 for flagging private memory so that KVM can use bits 0-2 for
capturing RWX behavior if/when userspace needs such information.

Add a new capability, KVM_CAP_MEMORY_FAULT_INFO, to advertise support for
KVM_EXIT_MEMORY_FAULT.  There is at least one other in-flight use case for
using KVM_EXIT_MEMORY_FAULT+memory_fault to resolve faults in userspace,
providing a dedicated capability allows userspace to query KVM support for
annotating faults without having to depend on an unrelated feature, i.e.
the proposed userfault-on-missing functionality shouldn't have to depend
on private memory support.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.

Returning an errno will also allow KVM to differentiate hardware poisoned
memory errors, i.e. by returning with errno=EHWPOISON.

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
Cc: Anish Moorthy <amoorthy@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Anish Moorthy <amoorthy@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst | 45 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c             |  1 +
 include/linux/kvm_host.h       | 15 ++++++++++++
 include/uapi/linux/kvm.h       |  9 +++++++
 virt/kvm/kvm_main.c            | 10 ++++++++
 5 files changed, 80 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 21a7578142a1..63347d0add3b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6702,6 +6702,30 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+describes properties of the faulting access that are likely pertinent:
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
+   on a private memory access.  When clear, indicates the fault occurred on a
+   shared access.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
@@ -7736,6 +7760,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN *may* fill
+kvm_run.memory_fault in response to failed guest memory accesses in a vCPU
+context.  KVM only guarantees that errors that occur when handling guest page
+fault VM-Exits will be annotated, all other error paths are best effort.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT for more information.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8356907079e1..f58df6efffa4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4518,6 +4518,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4e741ff27af3..d8c6ce6c8211 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2327,4 +2327,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+						 gpa_t gpa, gpa_t size,
+						 bool is_write, bool is_exec,
+						 bool is_private)
+{
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.size = size;
+
+	/* RWX flags are not (yet) defined or communicated to userspace. */
+	vcpu->run->memory_fault.flags = 0;
+	if (is_private)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bd1abe067f28..5239d3fc1082 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -274,6 +274,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -520,6 +521,13 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1203,6 +1211,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_USER_MEMORY2 230
+#define KVM_CAP_MEMORY_FAULT_INFO 231
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7c0e38752526..d13b646188e5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4184,6 +4184,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
 				synchronize_rcu();
 			put_pid(oldpid);
 		}
+
+		/*
+		 * Reset the exit reason if the previous userspace exit was due
+		 * to a memory fault.  Not all -EFAULT exits are annotated, and
+		 * so leaving exit_reason set to KVM_EXIT_MEMORY_FAULT could
+		 * result in feeding userspace stale information.
+		 */
+		if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT)
+			vcpu->run->exit_reason = KVM_EXIT_UNKNOWN
+
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;

base-commit: 2358793cd9062b068ac25ac9c965c00d685eea92
-- 
2.42.0.515.g380fc7ccd1-goog


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-09-22 16:28     ` Sean Christopherson
@ 2023-09-22 16:35       ` Sean Christopherson
  2023-10-02 22:33       ` Anish Moorthy
  1 sibling, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-22 16:35 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, kvm, kvmarm, kvm-riscv,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata

On Fri, Sep 22, 2023, Sean Christopherson wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7c0e38752526..d13b646188e5 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4184,6 +4184,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
>  				synchronize_rcu();
>  			put_pid(oldpid);
>  		}
> +
> +		/*
> +		 * Reset the exit reason if the previous userspace exit was due
> +		 * to a memory fault.  Not all -EFAULT exits are annotated, and
> +		 * so leaving exit_reason set to KVM_EXIT_MEMORY_FAULT could
> +		 * result in feeding userspace stale information.
> +		 */
> +		if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT)
> +			vcpu->run->exit_reason = KVM_EXIT_UNKNOWN

Darn semicolons.  Doesn't look like I botched anything else though.

> +
>  		r = kvm_arch_vcpu_ioctl_run(vcpu);
>  		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
>  		break;
> 
> base-commit: 2358793cd9062b068ac25ac9c965c00d685eea92
> -- 
> 2.42.0.515.g380fc7ccd1-goog
> 


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-09-21  1:21       ` Yan Zhao
@ 2023-09-25 17:37         ` Sean Christopherson
  0 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-09-25 17:37 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Thu, Sep 21, 2023, Yan Zhao wrote:
> On Wed, Sep 20, 2023 at 02:00:22PM -0700, Sean Christopherson wrote:
> > On Fri, Sep 15, 2023, Yan Zhao wrote:
> > > On Wed, Sep 13, 2023 at 06:55:09PM -0700, Sean Christopherson wrote:
> > > > +/* Set @attributes for the gfn range [@start, @end). */
> > > > +static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> > > > +				     unsigned long attributes)
> > > > +{
> > > > +	struct kvm_mmu_notifier_range pre_set_range = {
> > > > +		.start = start,
> > > > +		.end = end,
> > > > +		.handler = kvm_arch_pre_set_memory_attributes,
> > > > +		.on_lock = kvm_mmu_invalidate_begin,
> > > > +		.flush_on_ret = true,
> > > > +		.may_block = true,
> > > > +	};
> > > > +	struct kvm_mmu_notifier_range post_set_range = {
> > > > +		.start = start,
> > > > +		.end = end,
> > > > +		.arg.attributes = attributes,
> > > > +		.handler = kvm_arch_post_set_memory_attributes,
> > > > +		.on_lock = kvm_mmu_invalidate_end,
> > > > +		.may_block = true,
> > > > +	};
> > > > +	unsigned long i;
> > > > +	void *entry;
> > > > +	int r = 0;
> > > > +
> > > > +	entry = attributes ? xa_mk_value(attributes) : NULL;
> > > Also here, do we need to get existing attributes of a GFN first ?
> > 
> > No?  @entry is the new value that will be set for all entries.  This line doesn't
> > touch the xarray in any way.  Maybe I'm just not understanding your question.
> Hmm, I thought this interface was to allow users to add/remove an attribute to a GFN
> rather than overwrite all attributes of a GFN. Now I think I misunderstood the intention.
> 
> But I wonder if there is a way for users to just add one attribute, as I don't see an
> ioctl like KVM_GET_MEMORY_ATTRIBUTES for users to get the current attributes and then
> add/remove one based on that.  E.g. maybe in the future, KVM will want to add an
> attribute in the kernel without being told by userspace?

The plan is that memory attributes will be 100% userspace driven, i.e. that KVM
will never add its own attributes.  That's why there is (currently) no
KVM_GET_MEMORY_ATTRIBUTES; the intended usage model is that userspace is fully
responsible for managing attributes, and so should never need to query information
that it already knows.  If there's a compelling case for getting attributes then
we could certainly add such an ioctl(), but I hope we never need to add a GET
because that likely means we've made mistakes along the way.

Giving userspace full control of attributes allows for a simpler uAPI, e.g. if
userspace doesn't have full control, then setting or clearing bits requires a RMW
operation, which means creating a more complex ioctl().  That's why it's a straight
SET operation and not an OR type operation.
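
As a concrete illustration of the straight-SET semantics, a userspace call would
look something like the sketch below.  KVM_SET_MEMORY_ATTRIBUTES, struct
kvm_memory_attributes, and KVM_MEMORY_ATTRIBUTE_PRIVATE are the uAPI proposed in
this series; treat the exact field names as an approximation of that proposal:

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/*
	 * Mark [gpa, gpa + size) private.  The value written *replaces* whatever
	 * attributes were previously set for the range; there is no read-modify-write
	 * and no KVM_GET_MEMORY_ATTRIBUTES, so userspace is expected to track the
	 * current state itself.
	 */
	static int set_private(int vm_fd, __u64 gpa, __u64 size)
	{
		struct kvm_memory_attributes attrs;

		memset(&attrs, 0, sizeof(attrs));
		attrs.address = gpa;
		attrs.size = size;
		attrs.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE;

		return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
	}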

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-09-20 21:03     ` Sean Christopherson
@ 2023-09-27  5:19       ` Binbin Wu
  0 siblings, 0 replies; 83+ messages in thread
From: Binbin Wu @ 2023-09-27  5:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-security-module, linux-kernel, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn,
	Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
	Isaku Yamahata, Xu Yilun, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov



On 9/21/2023 5:03 AM, Sean Christopherson wrote:
> On Mon, Sep 18, 2023, Binbin Wu wrote:
>>
>> On 9/14/2023 9:55 AM, Sean Christopherson wrote:
>>> From: Chao Peng <chao.p.peng@linux.intel.com>
>> [...]
>>> +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>>> +/*
>>> + * Returns true if _all_ gfns in the range [@start, @end) have attributes
>>> + * matching @attrs.
>>> + */
>>> +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
>>> +				     unsigned long attrs)
>>> +{
>>> +	XA_STATE(xas, &kvm->mem_attr_array, start);
>>> +	unsigned long index;
>>> +	bool has_attrs;
>>> +	void *entry;
>>> +
>>> +	rcu_read_lock();
>>> +
>>> +	if (!attrs) {
>>> +		has_attrs = !xas_find(&xas, end);
>> IIUIC, xas_find() is inclusive for "end", so here should be "end - 1" ?
> Yes, that does appear to be the case.  Inclusive vs. exclusive on gfn ranges
> is the bane of my existence.

Seems this one is not included in the "KVM: guest_memfd fixes" patch series?
https://lore.kernel.org/kvm/20230921203331.3746712-1-seanjc@google.com/




^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-09-22 16:28     ` Sean Christopherson
  2023-09-22 16:35       ` Sean Christopherson
@ 2023-10-02 22:33       ` Anish Moorthy
  2023-10-03  1:42         ` Sean Christopherson
  1 sibling, 1 reply; 83+ messages in thread
From: Anish Moorthy @ 2023-10-02 22:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiaoyao Li, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, kvm, kvmarm,
	kvm-riscv, linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Yu Zhang, Isaku Yamahata, Xu Yilun, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata

On Fri, Sep 22, 2023 at 9:28 AM Sean Christopherson <seanjc@google.com> wrote:
>
> > So when exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which field in
> > the first union is valid?
>
> /facepalm
>
> At one point, I believe we had discussed a second exit reason field?  But yeah,
> as is, there's no way for userspace to glean anything useful from the first union.

Oh, was this an objective? When I was pushing for the second union
I was just trying to make sure all the EFAULT annotations
wouldn't clobber *other* exits. But yeah, I don't/didn't see a
meaningful way to have valid information in both structs.

> The more I think about this, the more I think it's a fool's errand.  Even if KVM
> provides the exit_reason history, userspace can't act on the previous, unfulfilled
> exit without *knowing* that it's safe/correct to process the previous exit.  I
> don't see how that's remotely possible.
>
> Practically speaking, there is one known instance of this in KVM, and it's a
> rather ridiculous edge case that has existed "forever".  I'm very strongly inclined
> to do nothing special, and simply treat clobbering an exit that userspace actually
> cares about like any other KVM bug.
>
> > When exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know the info in
> > the second union run.memory is valid without a run.memory.valid field?
>
> Anish's series adds a flag in kvm_run.flags to track whether or not memory_fault
> has been filled.  The idea is that KVM would clear the flag early in KVM_RUN, and
> then set the flag when memory_fault is first filled.
>
>         /* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
>         #define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
>
> I didn't propose that flag here because clobbering memory_fault from the page
> fault path would be a flagrant KVM bug.
>
> Honestly, I'm becoming more and more skeptical that separating memory_fault is
> worthwhile, or even desirable.  Similar to memory_fault clobbering something else,
> userspace can only take action if memory_fault is clobbered if userspace somehow
> knows that it's safe/correct to do so.
>
> Even if KVM exits "immediately" after initially filling memory_fault, the fact
> that KVM is exiting for a different reason (or a different memory fault) means
> that KVM did *something* between filling memory_fault and actually exiting.  And
> it's completely impossible for userspace to know what that "something" was.

Are you describing a scenario in which memory_fault is (initially)
filled, then something else happens to fill memory_fault (thus
clobbering it), then KVM_RUN exits? I'm confused by the tension
between the "KVM exits 'immediately'" and "KVM did *something* between
filling memory_fault and actually exiting" statements here.

> E.g. in the splat from selftests[1], KVM reacts to a failure during Real Mode
> event injection by synthesizing a triple fault
>
>         ret = emulate_int_real(ctxt, irq);
>
>         if (ret != X86EMUL_CONTINUE) {
>                 kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
>
> There are multiple KVM bugs at play: read_emulate() and write_emulate() incorrectly
> assume *all* failures should be treated like MMIO, and conversely ->read_std() and
> ->write_std() don't handle *any* failures as MMIO.
>
> Circling back to my "capturing the history is pointless" assertion, by the time
> userspace gets an exit, the vCPU is already in shutdown, and KVM has clobbered
> memory_fault something like five times.  There is zero chance userspace can do
> anything but shed a tear for the VM and move on.
>
> The whole "let's annotate all memory faults" idea came from my desire to push KVM
> towards a future where all -EFAULT exits are annotated[2].  I still think we should
> point KVM in that general direction, i.e. implement something that _can_ provide
> 100% "coverage" in the future, even though we don't expect to get there anytime soon.
>
> I bring that up because neither private memory nor userfault-on-missing needs to
> annotate anything other than -EFAULT during guest page faults.  I.e. all of this
> paranoia about clobbering memory_fault isn't actually buying us anything other
> than noise and complexity.  The cases we need to work _today_ are perfectly fine,
> and _if_ some future use cases needs all/more paths to be 100% accurate, then the
> right thing to do is to make whatever changes are necessary for KVM to be 100%
> accurate.
>
> In other words, trying to gracefully handle memory_fault clobbering is pointless.
> KVM either needs to guarantee there's no clobbering (guest page fault paths) or
> treat the annotation as best effort and informational-only (everything else at
> this time).  Future features may grow the set of paths that needs strong guarantees,
> but that just means fixing more paths and treating any violation of the contract
> like any other KVM bug.

Ok, so if we're restricting the exit to just the places it's totally
accurate (page-fault paths) then, IIUC,

- There's no reason to attach it to EFAULT, ie it becomes a "normal" exit
- I should go drop the patches annotating kvm_vcpu_read/write_page
from my series
- The helper function [a] for filling the memory_fault field
(downgraded back into the current union) can drop the "has the field
already been filled?" check/WARN.
- [KVM_CAP_USERFAULT_ON_MISSING] The memslot flag check [b] needs to
be moved back from __gfn_to_pfn_memslot() into
user_mem_abort()/kvm_handle_error_pfn() since the slot flag-triggered
fast-gup failures *have* to result in the memory fault exits, and we
only want to do those in the two SLAT-failure paths (for now).

[a] https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com/
[b] https://lore.kernel.org/all/20230908222905.1321305-11-amoorthy@google.com/

> And if we stop being unnecessarily paranoid, KVM_RUN_MEMORY_FAULT_FILLED can also
> go away.  The flag came about in part because *unconditionally* sanitizing
> kvm_run.exit_reason at the start of KVM_RUN would break KVM's ABI, as userspace
> may rely on the exit_reason being preserved when calling back into KVM to complete
> userspace I/O (or MMIO)[3].  But the goal is purely to avoid exiting with stale
> memory_fault information, not to sanitize every other existing exit_reason, and
> that can be achieved by simply making the reset conditional.
>
> ...
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 96fc609459e3..d78e97b527e5 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4450,6 +4450,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
>                                 synchronize_rcu();
>                         put_pid(oldpid);
>                 }
> +
> +               /*
> +                * Reset the exit reason if the previous userspace exit was due
> +                * to a memory fault.  Not all -EFAULT exits are annotated, and
> +                * so leaving exit_reason set to KVM_EXIT_MEMORY_FAULT could
> +                * result in feeding userspace stale information.
> +                */
> +               if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT)
> +                       vcpu->run->exit_reason = KVM_EXIT_UNKNOWN
> +
>                 r = kvm_arch_vcpu_ioctl_run(vcpu);

Under my reading of the earlier block I'm not sure why we need to keep
this around. The original idea behind a canary of this type was to
avoid stomping on non-memory-fault exits in cases where something
caused an (ignored) annotated memory fault before the exit could be
completed. But if the annotations are going to be restricted in
general to just the page fault paths, then we can just eliminate the
sentinel check (and just this block) entirely, right?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-02 22:33       ` Anish Moorthy
@ 2023-10-03  1:42         ` Sean Christopherson
  2023-10-03 22:59           ` Anish Moorthy
  0 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-10-03  1:42 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Xiaoyao Li, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, kvm, kvmarm,
	kvm-riscv, linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Yu Zhang, Isaku Yamahata, Xu Yilun, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata

On Mon, Oct 02, 2023, Anish Moorthy wrote:
> On Fri, Sep 22, 2023 at 9:28 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > So when exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which field in
> > > the first union is valid?
> >
> > /facepalm
> >
> > At one point, I believe we had discussed a second exit reason field?  But yeah,
> > as is, there's no way for userspace to glean anything useful from the first union.
> 
> Oh, was this an objective? When I was pushing for the second union
> this I was just trying to make sure all the efault annotations
> wouldn't clobber *other* exits. But yeah, I don't/didn't see a
> meaningful way to have valid information in both structs.

Clobbering other exits means KVM is already broken, because simply accessing memory
in guest context after initiating an exit is a KVM bug as it would violate ordering
and maybe causality.   E.g. the only reason the preemption case (see below) isn't
completely buggy is specifically because it's host paravirt behavior.

In other words, ignoring preemption for the moment, not clobbering other exits isn't
useful because whatever buggy KVM behavior caused the clobbering already happened,
i.e. the VM is already in trouble either way.  The only realistic options are to fix
the KVM bugs, or to effectively take an errata and say "don't do that" (like we've
done for the silly PUSHD to MMIO case).

> > The more I think about this, the more I think it's a fool's errand.  Even if KVM
> > provides the exit_reason history, userspace can't act on the previous, unfulfilled
> > exit without *knowing* that it's safe/correct to process the previous exit.  I
> > don't see how that's remotely possible.
> >
> > Practically speaking, there is one known instance of this in KVM, and it's a
> > rather ridiculous edge case that has existed "forever".  I'm very strongly inclined
> > to do nothing special, and simply treat clobbering an exit that userspace actually
> > cares about like any other KVM bug.
> >
> > > When exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know the info in
> > > the second union run.memory is valid without a run.memory.valid field?
> >
> > Anish's series adds a flag in kvm_run.flags to track whether or not memory_fault
> > has been filled.  The idea is that KVM would clear the flag early in KVM_RUN, and
> > then set the flag when memory_fault is first filled.
> >
> >         /* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
> >         #define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
> >
> > I didn't propose that flag here because clobbering memory_fault from the page
> > fault path would be a flagrant KVM bug.
> >
> > Honestly, I'm becoming more and more skeptical that separating memory_fault is
> > worthwhile, or even desirable.  Similar to memory_fault clobbering something else,
> > userspace can only take action if memory_fault is clobbered if userspace somehow
> > knows that it's safe/correct to do so.
> >
> > Even if KVM exits "immediately" after initially filling memory_fault, the fact
> > that KVM is exiting for a different reason (or a different memory fault) means
> > that KVM did *something* between filling memory_fault and actually exiting.  And
> > it's completely impossible for userspace to know what that "something" was.
> 
> Are you describing a scenario in which memory_fault is (initially)
> filled, then something else happens to fill memory_fault (thus
> clobbering it), then KVM_RUN exits? I'm confused by the tension
> between the "KVM exits 'immediately'" and "KVM did *something* between
> filling memory_fault and actually existing" statements here.

Yes, I'm describing a hypothetical scenario.  Immediately was in quotes because
even if KVM returns from the *current* function straightaway, it's possible that
control is deep in a call stack, i.e. KVM may "immediately" try to exit from the
current function's perspective, but in reality it may take a while to actually
get out to userspace.

> > > E.g. in the splat from selftests[1], KVM reacts to a failure during Real Mode
> > event injection by synthesizing a triple fault
> >
> >         ret = emulate_int_real(ctxt, irq);
> >
> >         if (ret != X86EMUL_CONTINUE) {
> >                 kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> >
> > There are multiple KVM bugs at play: read_emulate() and write_emulate() incorrectly
> > assume *all* failures should be treated like MMIO, and conversely ->read_std() and
> > ->write_std() don't handle *any* failures as MMIO.
> >
> > Circling back to my "capturing the history is pointless" assertion, by the time
> > userspace gets an exit, the vCPU is already in shutdown, and KVM has clobbered
> > memory_fault something like five times.  There is zero chance userspace can do
> > anything but shed a tear for the VM and move on.
> >
> > The whole "let's annotate all memory faults" idea came from my desire to push KVM
> > towards a future where all -EFAULT exits are annotated[2].  I still think we should
> > point KVM in that general direction, i.e. implement something that _can_ provide
> > 100% "coverage" in the future, even though we don't expect to get there anytime soon.
> >
> > I bring that up because neither private memory nor userfault-on-missing needs to
> > annotate anything other than -EFAULT during guest page faults.  I.e. all of this
> > paranoia about clobbering memory_fault isn't actually buying us anything other
> > than noise and complexity.  The cases we need to work _today_ are perfectly fine,
> > and _if_ some future use cases needs all/more paths to be 100% accurate, then the
> > right thing to do is to make whatever changes are necessary for KVM to be 100%
> > accurate.
> >
> > In other words, trying to gracefully handle memory_fault clobbering is pointless.
> > KVM either needs to guarantee there's no clobbering (guest page fault paths) or
> > treat the annotation as best effort and informational-only (everything else at
> > this time).  Future features may grow the set of paths that needs strong guarantees,
> > but that just means fixing more paths and treating any violation of the contract
> > like any other KVM bug.
> 
> Ok, so if we're restricting the exit to just the places it's totally
> accurate (page-fault paths) then, IIUC,
> 
> - There's no reason to attach it to EFAULT, ie it becomes a "normal" exit

No, I still want at least partial line of sight to being able to provide useful
information to userspace on EFAULT.  Making KVM_EXIT_MEMORY_FAULT a "normal" exit
pretty much squashes any hope of that.

> - I should go drop the patches annotating kvm_vcpu_read/write_page
> from my series

Hold up on that.  I'd prefer to keep them as there's still value in giving userspace
debug information.  All I'm proposing is that we would firmly state in the
documentation that those paths must be treated as informational-only.

The whole kvm_steal_time_set_preempted() mess does give me pause though.  That
helper isn't actually problematic, but only because it uses copy_to_user_nofault()
directly :-/

But that doesn't necessarily mean we need to abandon the entire idea, e.g. it
might not be a terrible idea to explicitly differentiate accesses to guest memory
for paravirt stuff, from accesses to guest memory on behalf of the guest.

Anyways, don't do anything just yet.

> - The helper function [a] for filling the memory_fault field
> (downgraded back into the current union) can drop the "has the field
> already been filled?" check/WARN.

That would need to be dropped regardless because it's user-triggered (sadly).

> - [KVM_CAP_USERFAULT_ON_MISSING] The memslot flag check [b] needs to
> be moved back from __gfn_to_pfn_memslot() into
> user_mem_abort()/kvm_handle_error_pfn() since the slot flag-triggered
> fast-gup failures *have* to result in the memory fault exits, and we
> only want to do those in the two SLAT-failure paths (for now).

I'll look at this more closely when I review your series (slowly, slowly getting
there).  There's no right or wrong answer here, it's more a question of what's the
easiest to maintain.

> [a] https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com/
> [b] https://lore.kernel.org/all/20230908222905.1321305-11-amoorthy@google.com/
> 
> > And if we stop being unnecessarily paranoid, KVM_RUN_MEMORY_FAULT_FILLED can also
> > go away.  The flag came about in part because *unconditionally* sanitizing
> > kvm_run.exit_reason at the start of KVM_RUN would break KVM's ABI, as userspace
> > may rely on the exit_reason being preserved when calling back into KVM to complete
> > userspace I/O (or MMIO)[3].  But the goal is purely to avoid exiting with stale
> > memory_fault information, not to sanitize every other existing exit_reason, and
> > that can be achieved by simply making the reset conditional.
> >
> > ...
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 96fc609459e3..d78e97b527e5 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4450,6 +4450,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
> >                                 synchronize_rcu();
> >                         put_pid(oldpid);
> >                 }
> > +
> > +               /*
> > +                * Reset the exit reason if the previous userspace exit was due
> > +                * to a memory fault.  Not all -EFAULT exits are annotated, and
> > +                * so leaving exit_reason set to KVM_EXIT_MEMORY_FAULT could
> > +                * result in feeding userspace stale information.
> > +                */
> > +               if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT)
> > +                       vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
> > +
> >                 r = kvm_arch_vcpu_ioctl_run(vcpu);
> 
> Under my reading of the earlier block I'm not sure why we need to keep
> this around. The original idea behind a canary of this type was to
> avoid stomping on non-memory-fault exits in cases where something
> caused an (ignored) annotated memory fault before the exit could be
> completed. But if the annotations are going to be restricted in
> general to just the page fault paths, then we can just eliminate the
> sentinel check (and just this block) entirely, right?

This isn't a canary, it's to ensure KVM doesn't feed userspace garbage.  As above,
I'm not saying we throw away all of the code for the "optional" paths, I'm saying
that we only commit to 100% accuracy for the paths that the two use cases need to
work, i.e. the page fault handlers.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-09-14  1:55 ` [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes Sean Christopherson
  2023-09-15  6:32   ` Yan Zhao
  2023-09-18  7:51   ` Binbin Wu
@ 2023-10-03 12:47   ` Fuad Tabba
  2023-10-03 15:59     ` Sean Christopherson
  2 siblings, 1 reply; 83+ messages in thread
From: Fuad Tabba @ 2023-10-03 12:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Jarkko Sakkinen, Anish Moorthy,
	Yu Zhang, Isaku Yamahata, Xu Yilun, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

Hi,

> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index d2d913acf0df..f8642ff2eb9d 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1227,6 +1227,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
>  #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
>  #define KVM_CAP_USER_MEMORY2 230
> +#define KVM_CAP_MEMORY_ATTRIBUTES 231
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -2293,4 +2294,17 @@ struct kvm_s390_zpci_op {
>  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
> +#define KVM_SET_MEMORY_ATTRIBUTES              _IOW(KVMIO,  0xd3, struct kvm_memory_attributes)
> +
> +struct kvm_memory_attributes {
> +       __u64 address;
> +       __u64 size;
> +       __u64 attributes;
> +       __u64 flags;
> +};
> +
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +

In pKVM, we don't want to allow setting (or clearing) of
PRIVATE/SHARED attributes from userspace. However, we'd like to use
the attributes xarray to track the sharing state of guest pages at the
host kernel.

Moreover, we'd rather the default guest page state be PRIVATE, and
only specify which pages are shared. All pKVM guest pages start off as
private, and the majority will remain so.

I'm not sure if this is the best way to do this: One idea would be to
move the definition of KVM_MEMORY_ATTRIBUTE_PRIVATE to
arch/*/include/asm/kvm_host.h, which is where
kvm_arch_supported_attributes() lives as well. This would allow
different architectures to specify their own attributes (i.e., instead
we'd have a KVM_MEMORY_ATTRIBUTE_SHARED for pKVM). This wouldn't help
in terms of preventing userspace from clearing attributes (i.e.,
setting a 0 attribute) though.

The other thing, which we need for pKVM anyway, is to make
kvm_vm_set_mem_attributes() global, so that it can be called from
outside of kvm_main.c (already have a local patch for this that
declares it in kvm_host.h), and not gate this function by
KVM_GENERIC_MEMORY_ATTRIBUTES. This would let pKVM select only
KVM_PRIVATE_MEM (as opposed to KVM_GENERIC_PRIVATE_MEM, which selects
KVM_GENERIC_MEMORY_ATTRIBUTES), preventing userspace from setting
these attributes, while allowing pKVM to call
kvm_vm_set_mem_attributes().

What do you think?

Thanks,
/fuad

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-10-03 12:47   ` Fuad Tabba
@ 2023-10-03 15:59     ` Sean Christopherson
  2023-10-03 18:33       ` Fuad Tabba
  0 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-10-03 15:59 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Jarkko Sakkinen, Anish Moorthy,
	Yu Zhang, Isaku Yamahata, Xu Yilun, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Tue, Oct 03, 2023, Fuad Tabba wrote:
> Hi,
> 
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index d2d913acf0df..f8642ff2eb9d 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1227,6 +1227,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
> >  #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
> >  #define KVM_CAP_USER_MEMORY2 230
> > +#define KVM_CAP_MEMORY_ATTRIBUTES 231
> >
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >
> > @@ -2293,4 +2294,17 @@ struct kvm_s390_zpci_op {
> >  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
> >  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
> >
> > +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> > +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
> > +#define KVM_SET_MEMORY_ATTRIBUTES              _IOW(KVMIO,  0xd3, struct kvm_memory_attributes)
> > +
> > +struct kvm_memory_attributes {
> > +       __u64 address;
> > +       __u64 size;
> > +       __u64 attributes;
> > +       __u64 flags;
> > +};
> > +
> > +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> > +
> 
> In pKVM, we don't want to allow setting (or clearing) of PRIVATE/SHARED
> attributes from userspace.

Why not?  The whole thing falls apart if userspace doesn't *know* the state of a
page, and the only way for userspace to know the state of a page at a given moment
in time is if userspace controls the attributes.  E.g. even if KVM were to provide
a way for userspace to query attributes, the attributes exposed to userspace would
become stale the instant KVM drops slots_lock (or whatever lock protects the attributes)
since userspace couldn't prevent future changes.

Why does pKVM need to prevent userspace from stating *its* view of attributes?

If the goal is to reduce memory overhead, that can be solved by using an internal,
non-ABI attributes flag to track pKVM's view of SHARED vs. PRIVATE.  If the guest
attempts to access memory where pKVM and userspace don't agree on the state,
generate an exit to userspace.  Or kill the guest.  Or do something else entirely.
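
A very rough sketch of what I have in mind (the bit value, the helper name, and
kvm_get_memory_attributes() are purely illustrative here, not something this
series dictates):

	/* Hypothetical, non-uAPI bit for pKVM's internal view of a gfn. */
	#define KVM_MEMORY_ATTRIBUTE_PKVM_PRIVATE	BIT_ULL(63)

	static bool pkvm_attrs_agree(struct kvm *kvm, gfn_t gfn)
	{
		unsigned long attrs = kvm_get_memory_attributes(kvm, gfn);

		/* True iff pKVM's view and userspace's view of @gfn match. */
		return !!(attrs & KVM_MEMORY_ATTRIBUTE_PKVM_PRIVATE) ==
		       !!(attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE);
	}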

> However, we'd like to use the attributes xarray to track the sharing state of
> guest pages at the host kernel.
> 
> Moreover, we'd rather the default guest page state be PRIVATE, and
> only specify which pages are shared. All pKVM guest pages start off as
> private, and the majority will remain so.

I would rather optimize kvm_vm_set_mem_attributes() to generate range-based
xarray entries, at which point it shouldn't matter all that much whether PRIVATE
or SHARED is the default "empty" state.  We opted not to do that for the initial
merge purely to keep the code as simple as possible (which is obviously still not
exactly simple).

With range-based xarray entries, the cost of tagging huge chunks of memory as
PRIVATE should be a non-issue.  And if that's not enough for whatever reason, I
would rather define the polarity of PRIVATE on a per-VM basis, but only for internal
storage.
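
To illustrate (very loosely, assuming the series' mem_attr_array xarray and
CONFIG_XARRAY_MULTI, with locking and the invalidation bookkeeping omitted),
a range-based store boils down to a single multi-index entry per conversion:

	static int kvm_store_attrs_range(struct kvm *kvm, gfn_t start, gfn_t end,
					 unsigned long attrs)
	{
		/* NULL erases; a value entry tags the whole range at once. */
		void *entry = attrs ? xa_mk_value(attrs) : NULL;

		/* One entry covers [start, end), no matter how large the range. */
		return xa_err(xa_store_range(&kvm->mem_attr_array, start, end - 1,
					     entry, GFP_KERNEL_ACCOUNT));
	}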
 
> I'm not sure if this is the best way to do this: One idea would be to move
> the definition of KVM_MEMORY_ATTRIBUTE_PRIVATE to
> arch/*/include/asm/kvm_host.h, which is where kvm_arch_supported_attributes()
> lives as well. This would allow different architectures to specify their own
> attributes (i.e., instead we'd have a KVM_MEMORY_ATTRIBUTE_SHARED for pKVM).
> This wouldn't help in terms of preventing userspace from clearing attributes
> (i.e., setting a 0 attribute) though.
> 
> The other thing, which we need for pKVM anyway, is to make
> kvm_vm_set_mem_attributes() global, so that it can be called from outside of
> kvm_main.c (already have a local patch for this that declares it in
> kvm_host.h),

That's no problem, but I am definitely opposed to KVM modifying attributes that
are owned by userspace.

> and not gate this function by KVM_GENERIC_MEMORY_ATTRIBUTES.

As above, I am opposed to pKVM having a completely different ABI for managing
PRIVATE vs. SHARED.  I have no objection to pKVM using unclaimed flags in the
attributes to store extra metadata, but if KVM_SET_MEMORY_ATTRIBUTES doesn't work
for pKVM, then we've failed miserably and should revisit the uAPI.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-10-03 15:59     ` Sean Christopherson
@ 2023-10-03 18:33       ` Fuad Tabba
  2023-10-03 20:51         ` Sean Christopherson
  0 siblings, 1 reply; 83+ messages in thread
From: Fuad Tabba @ 2023-10-03 18:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Jarkko Sakkinen, Anish Moorthy,
	Yu Zhang, Isaku Yamahata, Xu Yilun, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

Hi Sean,


On Tue, Oct 3, 2023 at 4:59 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 03, 2023, Fuad Tabba wrote:
> > Hi,
> >
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index d2d913acf0df..f8642ff2eb9d 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -1227,6 +1227,7 @@ struct kvm_ppc_resize_hpt {
> > >  #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
> > >  #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
> > >  #define KVM_CAP_USER_MEMORY2 230
> > > +#define KVM_CAP_MEMORY_ATTRIBUTES 231
> > >
> > >  #ifdef KVM_CAP_IRQ_ROUTING
> > >
> > > @@ -2293,4 +2294,17 @@ struct kvm_s390_zpci_op {
> > >  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
> > >  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
> > >
> > > +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> > > +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
> > > +#define KVM_SET_MEMORY_ATTRIBUTES              _IOW(KVMIO,  0xd3, struct kvm_memory_attributes)
> > > +
> > > +struct kvm_memory_attributes {
> > > +       __u64 address;
> > > +       __u64 size;
> > > +       __u64 attributes;
> > > +       __u64 flags;
> > > +};
> > > +
> > > +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> > > +
> >
> > In pKVM, we don't want to allow setting (or clearing) of PRIVATE/SHARED
> > attributes from userspace.
>
> Why not?  The whole thing falls apart if userspace doesn't *know* the state of a
> page, and the only way for userspace to know the state of a page at a given moment
> in time is if userspace controls the attributes.  E.g. even if KVM were to provide
> a way for userspace to query attributes, the attributes exposed to userspace would
> become stale the instant KVM drops slots_lock (or whatever lock protects the attributes)
> since userspace couldn't prevent future changes.

I think I might not quite understand the purpose of the
KVM_SET_MEMORY_ATTRIBUTES ABI. In pKVM, all of a protected guest's
memory is private by default, until the guest shares it with the host
(via a hypercall), or another guest (future work). When the guest
shares it, userspace is notified via KVM_EXIT_HYPERCALL. In many use
cases, userspace doesn't need to keep track directly of all of this,
but can reactively un/map the memory being un/shared.

> Why does pKVM need to prevent userspace from stating *its* view of attributes?
>
> If the goal is to reduce memory overhead, that can be solved by using an internal,
> non-ABI attributes flag to track pKVM's view of SHARED vs. PRIVATE.  If the guest
> attempts to access memory where pKVM and userspace don't agree on the state,
> generate an exit to userspace.  Or kill the guest.  Or do something else entirely.

For the pKVM hypervisor the guest's view of the attributes doesn't
matter. The hypervisor at the end of the day is the ultimate arbiter
for what is shared and with how. For pKVM (at least in my port of
guestmem), we use the memory attributes from guestmem essentially to
control which memory can be mapped by the host.

One difference between pKVM and TDX (as I understand it), is that TDX
uses the msb of the guest's IPA to indicate whether memory is shared
or private, and that can generate a mismatch on guest memory access
between what it thinks the state is, and what the sharing state in
reality is. pKVM doesn't have that. Memory is private by default, and
can be shared in-place, both in the guest's IPA space as well as the
underlying physical page.

> > However, we'd like to use the attributes xarray to track the sharing state of
> > guest pages at the host kernel.
> >
> > Moreover, we'd rather the default guest page state be PRIVATE, and
> > only specify which pages are shared. All pKVM guest pages start off as
> > private, and the majority will remain so.
>
> I would rather optimize kvm_vm_set_mem_attributes() to generate range-based
> xarray entries, at which point it shouldn't matter all that much whether PRIVATE
> or SHARED is the default "empty" state.  We opted not to do that for the initial
> merge purely to keep the code as simple as possible (which is obviously still not
> exactly simple).
>
> With range-based xarray entries, the cost of tagging huge chunks of memory as
> PRIVATE should be a non-issue.  And if that's not enough for whatever reason, I
> would rather define the polarity of PRIVATE on a per-VM basis, but only for internal
> storage.

Sounds good.

> > I'm not sure if this is the best way to do this: One idea would be to move
> > the definition of KVM_MEMORY_ATTRIBUTE_PRIVATE to
> > arch/*/include/asm/kvm_host.h, which is where kvm_arch_supported_attributes()
> > lives as well. This would allow different architectures to specify their own
> > attributes (i.e., instead we'd have a KVM_MEMORY_ATTRIBUTE_SHARED for pKVM).
> > This wouldn't help in terms of preventing userspace from clearing attributes
> > (i.e., setting a 0 attribute) though.
> >
> > The other thing, which we need for pKVM anyway, is to make
> > kvm_vm_set_mem_attributes() global, so that it can be called from outside of
> > kvm_main.c (already have a local patch for this that declares it in
> > kvm_host.h),
>
> That's no problem, but I am definitely opposed to KVM modifying attributes that
> are owned by userspace.
>
> > and not gate this function by KVM_GENERIC_MEMORY_ATTRIBUTES.
>
> As above, I am opposed to pKVM having a completely different ABI for managing
> PRIVATE vs. SHARED.  I have no objection to pKVM using unclaimed flags in the
> attributes to store extra metadata, but if KVM_SET_MEMORY_ATTRIBUTES doesn't work
> for pKVM, then we've failed miserably and should revisit the uAPI.

Like I said, pKVM doesn't need a userspace ABI for managing
PRIVATE/SHARED, just a way of tracking in the host kernel of what is
shared (as opposed to the hypervisor, which already has the
knowledge). The solution could simply be that pKVM does not enable
KVM_GENERIC_MEMORY_ATTRIBUTES, has its own tracking of the status of
the guest pages, and only selects KVM_PRIVATE_MEM.

Thanks!
/fuad

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
  2023-10-03 18:33       ` Fuad Tabba
@ 2023-10-03 20:51         ` Sean Christopherson
  0 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-10-03 20:51 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Jarkko Sakkinen, Anish Moorthy,
	Yu Zhang, Isaku Yamahata, Xu Yilun, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Tue, Oct 03, 2023, Fuad Tabba wrote:
> On Tue, Oct 3, 2023 at 4:59 PM Sean Christopherson <seanjc@google.com> wrote:
> > On Tue, Oct 03, 2023, Fuad Tabba wrote:
> > > > +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> > > > +
> > >
> > > In pKVM, we don't want to allow setting (or clearing) of PRIVATE/SHARED
> > > attributes from userspace.
> >
> > Why not?  The whole thing falls apart if userspace doesn't *know* the state of a
> > page, and the only way for userspace to know the state of a page at a given moment
> > in time is if userspace controls the attributes.  E.g. even if KVM were to provide
> > a way for userspace to query attributes, the attributes exposed to userspace would
> > become stale the instant KVM drops slots_lock (or whatever lock protects the attributes)
> > since userspace couldn't prevent future changes.
> 
> I think I might not quite understand the purpose of the
> KVM_SET_MEMORY_ATTRIBUTES ABI. In pKVM, all of a protected guest's memory is
> private by default, until the guest shares it with the host (via a
> hypercall), or another guest (future work). When the guest shares it,
> userspace is notified via KVM_EXIT_HYPERCALL. In many use cases, userspace
> doesn't need to keep track directly of all of this, but can reactively un/map
> the memory being un/shared.

Yes, and then userspace needs to tell KVM, via KVM_SET_MEMORY_ATTRIBUTES, that
userspace has agreed to change the state of the page.  Userspace may not need/want
to explicitly track the state of pages, but userspace still needs to tell KVM what
userspace wants.

KVM is primarily an accelerator, e.g. KVM's role is to make things go fast (relative
to doing things in userspace) and provide access to resources/instructions that
require elevated privileges.  As a general rule, we try to avoid defining the vCPU
model, security policies, etc. in KVM, because hardcoding policy into KVM (and the
kernel as a whole) eventually limits the utility of KVM.

As it pertains to PRIVATE vs. SHARED, KVM's role is to define and enforce the basic
rules, but KVM shouldn't do things like define when it is (il)legal to convert
memory to/from SHARED, what pages can be converted, what happens if the guest and
userspace disagree, etc.

> > Why does pKVM need to prevent userspace from stating *its* view of attributes?
> >
> > If the goal is to reduce memory overhead, that can be solved by using an internal,
> > non-ABI attributes flag to track pKVM's view of SHARED vs. PRIVATE.  If the guest
> > attempts to access memory where pKVM and userspace don't agree on the state,
> > generate an exit to userspace.  Or kill the guest.  Or do something else entirely.
> 
> For the pKVM hypervisor the guest's view of the attributes doesn't
> matter. The hypervisor at the end of the day is the ultimate arbiter
> for what is shared and with how. For pKVM (at least in my port of
> guestmem), we use the memory attributes from guestmem essentially to
> control which memory can be mapped by the host.

The guest's view absolutely matters.  The guest's view may not be expressed at
access time, e.g. as you note below, pKVM and other software-protected VMs don't
have a dedicated shared vs. private bit like TDX and SNP.  But the view is still
there, e.g. in the pKVM model, the guest expresses its desire for shared vs.
private via hypercall, and IIRC, the guest's view is tracked by the hypervisor
in the stage-2 PTEs.  pKVM itself may track the guest's view on things, but the
view is still the guest's.

E.g. if the guest thinks a page is private, but in reality KVM and host userspace
have it as shared, then the guest may unintentionally leak data to the untrusted
world.

IIUC, you have implemented guest_memfd support in pKVM by changing the attributes
when the guest makes the hypercall.  This can work, but only so long as the guest
and userspace are well-behaved, and it will likely paint pKVM into a corner in
the long run.

E.g. if the guest makes a hypercall to convert memory to PRIVATE, but there is
no memslot or the memslot doesn't support private memory, then unless there is
policy baked into KVM, or an ABI for the guest<=>host hypercall interface that
allows unwinding the program counter, you're stuck.  Returning an error for the
hypercall straight from KVM is undesirable as that would put policy into KVM that
doesn't need to be there, e.g. that would prevent userspace from manipulating
memslots in response to (un)share requests from the guest.  It's a similar story
if KVM marks the page as PRIVATE, as that would prevent userspace from returning
an error for the hypercall, i.e. would prevent userspace from denying the request
to convert to PRIVATE.

> One difference between pKVM and TDX (as I understand it), is that TDX
> uses the msb of the guest's IPA to indicate whether memory is shared
> or private, and that can generate a mismatch on guest memory access
> between what it thinks the state is, and what the sharing state in
> reality is. pKVM doesn't have that. Memory is private by default, and
> can be shared in-place, both in the guest's IPA space as well as the
> underlying physical page.

TDX's shared bit and SNP's encryption bit are just a means of hardware enforcement.
pKVM doesn't have a hardware bit because hardware doesn't provide any enforcement.
But as above, pKVM does have an equivalent *somewhere*.

> > > The other thing, which we need for pKVM anyway, is to make
> > > kvm_vm_set_mem_attributes() global, so that it can be called from outside of
> > > kvm_main.c (already have a local patch for this that declares it in
> > > kvm_host.h),
> >
> > That's no problem, but I am definitely opposed to KVM modifying attributes that
> > are owned by userspace.
> >
> > > and not gate this function by KVM_GENERIC_MEMORY_ATTRIBUTES.
> >
> > As above, I am opposed to pKVM having a completely different ABI for managing
> > PRIVATE vs. SHARED.  I have no objection to pKVM using unclaimed flags in the
> > attributes to store extra metadata, but if KVM_SET_MEMORY_ATTRIBUTES doesn't work
> > for pKVM, then we've failed miserably and should revisit the uAPI.
> 
> Like I said, pKVM doesn't need a userspace ABI for managing PRIVATE/SHARED,
> just a way of tracking in the host kernel of what is shared (as opposed to
> the hypervisor, which already has the knowledge). The solution could simply
> be that pKVM does not enable KVM_GENERIC_MEMORY_ATTRIBUTES, has its own
> tracking of the status of the guest pages, and only selects KVM_PRIVATE_MEM.

At the risk of overstepping my bounds, I think that effectively giving the guest
full control over what is shared vs. private is a mistake.  It more or less locks
pKVM into a single model, and even within that model, dealing with errors and/or
misbehaving guests becomes unnecessarily problematic.

Using KVM_SET_MEMORY_ATTRIBUTES may not provide value *today*, e.g. the userspace
side of pKVM could simply "reflect" all conversion hypercalls, and terminate the
VM on errors.  But the cost is very minimal, e.g. a single extra ioctl() per
converion, and the upside is that pKVM won't be stuck if a use case comes along
that wants to go beyond "all conversion requests either immediately succeed or
terminate the guest".

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-03  1:42         ` Sean Christopherson
@ 2023-10-03 22:59           ` Anish Moorthy
  2023-10-03 23:46             ` Sean Christopherson
  0 siblings, 1 reply; 83+ messages in thread
From: Anish Moorthy @ 2023-10-03 22:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiaoyao Li, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, kvm, kvmarm,
	kvm-riscv, linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Yu Zhang, Isaku Yamahata, Xu Yilun, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata

On Mon, Oct 2, 2023 at 6:43 PM Sean Christopherson <seanjc@google.com> wrote:
>
> > - I should go drop the patches annotating kvm_vcpu_read/write_page
> > from my series
>
> Hold up on that.  I'd prefer to keep them as there's still value in giving userspace
> debug information.  All I'm proposing is that we would firmly state in the
> documentation that those paths must be treated as informational-only.

Userspace would then need to know whether annotations were performed
from reliable/unreliable paths though, right? That'd imply another
flag bit beyond the current R/W/E bits.

> > - The helper function [a] for filling the memory_fault field
> > (downgraded back into the current union) can drop the "has the field
> > already been filled?" check/WARN.
>
> That would need to be dropped regardless because it's user-triggered (sadly).

Well, the current v5 of the series uses a non-userspace-visible canary;
it seems like there'd still be value in that if we were to keep the
annotations in potentially unreliable spots. Although perhaps that
test failure you noticed [1] is a good counter-argument, since it
shows a known case where a current flow does multiple writes to the
memory_fault member.

[1] https://lore.kernel.org/all/202309141107.30863e9d-oliver.sang@intel.com

> Anyways, don't do anything just yet.

:salutes:

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-03 22:59           ` Anish Moorthy
@ 2023-10-03 23:46             ` Sean Christopherson
  2023-10-05 22:07               ` Anish Moorthy
  0 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-10-03 23:46 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Xiaoyao Li, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, kvm, kvmarm,
	kvm-riscv, linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Yu Zhang, Isaku Yamahata, Xu Yilun, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata

On Tue, Oct 03, 2023, Anish Moorthy wrote:
> On Mon, Oct 2, 2023 at 6:43 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > - I should go drop the patches annotating kvm_vcpu_read/write_page
> > > from my series
> >
> > Hold up on that.  I'd prefer to keep them as there's still value in giving userspace
> > debug information.  All I'm proposing is that we would firmly state in the
> > documentation that those paths must be treated as informational-only.
> 
> Userspace would then need to know whether annotations were performed
> from reliable/unreliable paths though, right? That'd imply another
> flag bit beyond the current R/W/E bits.

No, what's missing is a guarantee in KVM that every attempt to exit will actually
make it to userspace.  E.g. if a different exit, including another memory_fault
exit, clobbers an attempt to exit, the "unreliable" annotation will never be seen
by userspace.

The only way a KVM_EXIT_MEMORY_FAULT that actually reaches userspace could be
"unreliable" is if something other than a memory_fault exit clobbered the union,
but didn't signal its KVM_EXIT_* reason.  And that would be an egregious bug that
isn't unique to KVM_EXIT_MEMORY_FAULT, i.e. the same data corruption would affect
each and every other KVM_EXIT_* reason.

The "informational only" part is that userspace can't develop features that
*require* KVM to exit.

> > > - The helper function [a] for filling the memory_fault field
> > > (downgraded back into the current union) can drop the "has the field
> > > already been filled?" check/WARN.
> >
> > That would need to be dropped regardless because it's user-triggered (sadly).
> 
> Well, the current v5 of the series uses a non-userspace-visible canary;
> it seems like there'd still be value in that if we were to keep the
> annotations in potentially unreliable spots. Although perhaps that
> test failure you noticed [1] is a good counter-argument, since it
> shows a known case where a current flow does multiple writes to the
> memory_fault member.

The problem is that anything but a WARN will go unnoticed, and we can't have any
WARNs that are user-triggerable, at least not in upstream.  Internally, we can
and probably should add a canary, and an aggressive one at that, but I can't think
of a sane way to add a canary in upstream while avoiding the known offenders. :-(

> [1] https://lore.kernel.org/all/202309141107.30863e9d-oliver.sang@intel.com
> 
> > Anyways, don't do anything just yet.
> 
> :salutes:

LOL

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-03 23:46             ` Sean Christopherson
@ 2023-10-05 22:07               ` Anish Moorthy
  2023-10-05 22:46                 ` Sean Christopherson
  0 siblings, 1 reply; 83+ messages in thread
From: Anish Moorthy @ 2023-10-05 22:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiaoyao Li, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, kvm, kvmarm,
	kvm-riscv, linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Yu Zhang, Isaku Yamahata, Xu Yilun, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata

On Tue, Oct 3, 2023 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> The only way a KVM_EXIT_MEMORY_FAULT that actually reaches userspace could be
> "unreliable" is if something other than a memory_fault exit clobbered the union,
> but didn't signal its KVM_EXIT_* reason.  And that would be an egregious bug that
> isn't unique to KVM_EXIT_MEMORY_FAULT, i.e. the same data corruption would affect
> each and every other KVM_EXIT_* reason.

Keep in mind the case where an "unreliable" annotation sets up a
KVM_EXIT_MEMORY_FAULT, KVM_RUN ends up continuing, then something
unrelated comes up and causes KVM_RUN to EFAULT. Although this at
least is a case of "outdated" information rather than blatant
corruption.

IIRC the last time this came up we said that there's minimal harm in
userspace acting on the outdated info, but it seems like another good
argument for just restricting the annotations to paths we know are
reliable. What if the second EFAULT above is fatal (as I understand
all are today) and sets up subsequent KVM_RUNs to crash and burn
somehow? Seems like that'd be a safety issue.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-05 22:07               ` Anish Moorthy
@ 2023-10-05 22:46                 ` Sean Christopherson
  2023-10-10 22:21                   ` David Matlack
  0 siblings, 1 reply; 83+ messages in thread
From: Sean Christopherson @ 2023-10-05 22:46 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Xiaoyao Li, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, kvm, kvmarm,
	kvm-riscv, linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Yu Zhang, Isaku Yamahata, Xu Yilun, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata

On Thu, Oct 05, 2023, Anish Moorthy wrote:
> On Tue, Oct 3, 2023 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > The only way a KVM_EXIT_MEMORY_FAULT that actually reaches userspace could be
> > "unreliable" is if something other than a memory_fault exit clobbered the union,
> > but didn't signal its KVM_EXIT_* reason.  And that would be an egregious bug that
> > isn't unique to KVM_EXIT_MEMORY_FAULT, i.e. the same data corruption would affect
> > each and every other KVM_EXIT_* reason.
> 
> Keep in mind the case where an "unreliable" annotation sets up a
> KVM_EXIT_MEMORY_FAULT, KVM_RUN ends up continuing, then something
> unrelated comes up and causes KVM_RUN to EFAULT. Although this at
> least is a case of "outdated" information rather than blatant
> corruption.

Drat, I managed to forget about that.

> IIRC the last time this came up we said that there's minimal harm in
> userspace acting on the outdated info, but it seems like another good
> argument for just restricting the annotations to paths we know are
> reliable. What if the second EFAULT above is fatal (as I understand
> all are today) and sets up subsequent KVM_RUNs to crash and burn
> somehow? Seems like that'd be a safety issue.

For your series, let's omit 

  KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page

and just fill memory_fault for the page fault paths.  That will be easier to
document too since we can simply say that if the exit reason is KVM_EXIT_MEMORY_FAULT,
then run->memory_fault is valid and fresh.
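
I.e. the userspace side of the contract would be something as simple as the
following (the gpa/size/flags names follow this series, and handle_fault() is
just a stand-in for whatever the VMM does with the information):

	ret = ioctl(vcpu_fd, KVM_RUN, NULL);
	if (ret == -1 && errno == EFAULT &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT)
		handle_fault(run->memory_fault.gpa, run->memory_fault.size,
			     run->memory_fault.flags);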

Adding a flag or whatever to mark the data as trustworthy would be the alternative,
but that's effectively adding ABI that says "KVM is buggy, sorry".

My dream of having KVM always return useful information for -EFAULT will have to
wait for another day.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 05/33] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER
  2023-09-14  1:55 ` [RFC PATCH v12 05/33] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
@ 2023-10-09 16:42   ` Anup Patel
  0 siblings, 0 replies; 83+ messages in thread
From: Anup Patel @ 2023-10-09 16:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Matthew Wilcox (Oracle),
	Andrew Morton, Paul Moore, James Morris, Serge E. Hallyn, kvm,
	linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev, kvm-riscv,
	linux-riscv, linux-fsdevel, linux-mm, linux-security-module,
	linux-kernel, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Thu, Sep 14, 2023 at 7:25 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
> appropriate to effectively maintain existing behavior.  Using a proper
> Kconfig will simplify building more functionality on top of KVM's
> mmu_notifier infrastructure.
>
> Add a forward declaration of kvm_gfn_range to kvm_types.h so that
> including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
> generate warnings due to kvm_gfn_range being undeclared.  PPC defines
> hooks for PR vs. HV without guarding them via #ifdeffery, e.g.
>
>   bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>
> Alternatively, PPC could forward declare kvm_gfn_range, but there's no
> good reason not to define it in common KVM.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Looks good to me.

For KVM RISC-V:
Acked-by: Anup Patel <anup@brainfault.org>

Thanks,
Anup

> ---
>  arch/arm64/include/asm/kvm_host.h   |  2 --
>  arch/arm64/kvm/Kconfig              |  2 +-
>  arch/mips/include/asm/kvm_host.h    |  2 --
>  arch/mips/kvm/Kconfig               |  2 +-
>  arch/powerpc/include/asm/kvm_host.h |  2 --
>  arch/powerpc/kvm/Kconfig            |  8 ++++----
>  arch/powerpc/kvm/powerpc.c          |  4 +---
>  arch/riscv/include/asm/kvm_host.h   |  2 --
>  arch/riscv/kvm/Kconfig              |  2 +-
>  arch/x86/include/asm/kvm_host.h     |  2 --
>  arch/x86/kvm/Kconfig                |  2 +-
>  include/linux/kvm_host.h            |  6 +++---
>  include/linux/kvm_types.h           |  1 +
>  virt/kvm/Kconfig                    |  4 ++++
>  virt/kvm/kvm_main.c                 | 10 +++++-----
>  15 files changed, 22 insertions(+), 29 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index af06ccb7ee34..9e046b64847a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -921,8 +921,6 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>  int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>                               struct kvm_vcpu_events *events);
>
> -#define KVM_ARCH_WANT_MMU_NOTIFIER
> -
>  void kvm_arm_halt_guest(struct kvm *kvm);
>  void kvm_arm_resume_guest(struct kvm *kvm);
>
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 83c1e09be42e..1a777715199f 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -22,7 +22,7 @@ menuconfig KVM
>         bool "Kernel-based Virtual Machine (KVM) support"
>         depends on HAVE_KVM
>         select KVM_GENERIC_HARDWARE_ENABLING
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         select PREEMPT_NOTIFIERS
>         select HAVE_KVM_CPU_RELAX_INTERCEPT
>         select KVM_MMIO
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 54a85f1d4f2c..179f320cc231 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -810,8 +810,6 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t start_gfn, gfn_t end_gfn);
>  pgd_t *kvm_pgd_alloc(void);
>  void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
>
> -#define KVM_ARCH_WANT_MMU_NOTIFIER
> -
>  /* Emulation */
>  enum emulation_result update_pc(struct kvm_vcpu *vcpu, u32 cause);
>  int kvm_get_badinstr(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
> diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
> index a8cdba75f98d..c04987d2ed2e 100644
> --- a/arch/mips/kvm/Kconfig
> +++ b/arch/mips/kvm/Kconfig
> @@ -25,7 +25,7 @@ config KVM
>         select HAVE_KVM_EVENTFD
>         select HAVE_KVM_VCPU_ASYNC_IOCTL
>         select KVM_MMIO
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         select INTERVAL_TREE
>         select KVM_GENERIC_HARDWARE_ENABLING
>         help
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 14ee0dece853..4b5c3f2acf78 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -62,8 +62,6 @@
>
>  #include <linux/mmu_notifier.h>
>
> -#define KVM_ARCH_WANT_MMU_NOTIFIER
> -
>  #define HPTEG_CACHE_NUM                        (1 << 15)
>  #define HPTEG_HASH_BITS_PTE            13
>  #define HPTEG_HASH_BITS_PTE_LONG       12
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 902611954200..b33358ee6424 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -42,7 +42,7 @@ config KVM_BOOK3S_64_HANDLER
>  config KVM_BOOK3S_PR_POSSIBLE
>         bool
>         select KVM_MMIO
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>
>  config KVM_BOOK3S_HV_POSSIBLE
>         bool
> @@ -85,7 +85,7 @@ config KVM_BOOK3S_64_HV
>         tristate "KVM for POWER7 and later using hypervisor mode in host"
>         depends on KVM_BOOK3S_64 && PPC_POWERNV
>         select KVM_BOOK3S_HV_POSSIBLE
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         select CMA
>         help
>           Support running unmodified book3s_64 guest kernels in
> @@ -194,7 +194,7 @@ config KVM_E500V2
>         depends on !CONTEXT_TRACKING_USER
>         select KVM
>         select KVM_MMIO
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         help
>           Support running unmodified E500 guest kernels in virtual machines on
>           E500v2 host processors.
> @@ -211,7 +211,7 @@ config KVM_E500MC
>         select KVM
>         select KVM_MMIO
>         select KVM_BOOKE_HV
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         help
>           Support running unmodified E500MC/E5500/E6500 guest kernels in
>           virtual machines on E500MC/E5500/E6500 host processors.
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 8d3ec483bc2b..aac75c98a956 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -632,9 +632,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>                 break;
>  #endif
>         case KVM_CAP_SYNC_MMU:
> -#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> -               BUILD_BUG();
> -#endif
> +               BUILD_BUG_ON(!IS_ENABLED(CONFIG_KVM_GENERIC_MMU_NOTIFIER));
>                 r = 1;
>                 break;
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index 1ebf20dfbaa6..66ee9ff483e9 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -249,8 +249,6 @@ struct kvm_vcpu_arch {
>  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>
> -#define KVM_ARCH_WANT_MMU_NOTIFIER
> -
>  #define KVM_RISCV_GSTAGE_TLB_MIN_ORDER         12
>
>  void kvm_riscv_local_hfence_gvma_vmid_gpa(unsigned long vmid,
> diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
> index dfc237d7875b..ae2e05f050ec 100644
> --- a/arch/riscv/kvm/Kconfig
> +++ b/arch/riscv/kvm/Kconfig
> @@ -30,7 +30,7 @@ config KVM
>         select KVM_GENERIC_HARDWARE_ENABLING
>         select KVM_MMIO
>         select KVM_XFER_TO_GUEST_WORK
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         select PREEMPT_NOTIFIERS
>         help
>           Support hosting virtualized guest machines.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1a4def36d5bb..3a2b53483524 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2131,8 +2131,6 @@ enum {
>  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
>  #endif
>
> -#define KVM_ARCH_WANT_MMU_NOTIFIER
> -
>  int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
>  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_cpu_has_extint(struct kvm_vcpu *v);
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index ed90f148140d..091b74599c22 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -24,7 +24,7 @@ config KVM
>         depends on HIGH_RES_TIMERS
>         depends on X86_LOCAL_APIC
>         select PREEMPT_NOTIFIERS
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         select HAVE_KVM_IRQCHIP
>         select HAVE_KVM_PFNCACHE
>         select HAVE_KVM_IRQFD
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 11d091688346..5faba69403ac 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -253,7 +253,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>  union kvm_mmu_notifier_arg {
>         pte_t pte;
>  };
> @@ -783,7 +783,7 @@ struct kvm {
>         struct hlist_head irq_ack_notifier_list;
>  #endif
>
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>         struct mmu_notifier mmu_notifier;
>         unsigned long mmu_invalidate_seq;
>         long mmu_invalidate_in_progress;
> @@ -1946,7 +1946,7 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
>  extern const struct kvm_stats_header kvm_vcpu_stats_header;
>  extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
>
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>  static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
>  {
>         if (unlikely(kvm->mmu_invalidate_in_progress))
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 6f4737d5046a..9d1f7835d8c1 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -6,6 +6,7 @@
>  struct kvm;
>  struct kvm_async_pf;
>  struct kvm_device_ops;
> +struct kvm_gfn_range;
>  struct kvm_interrupt;
>  struct kvm_irq_routing_table;
>  struct kvm_memory_slot;
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 484d0873061c..ecae2914c97e 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -92,3 +92,7 @@ config HAVE_KVM_PM_NOTIFIER
>
>  config KVM_GENERIC_HARDWARE_ENABLING
>         bool
> +
> +config KVM_GENERIC_MMU_NOTIFIER
> +       select MMU_NOTIFIER
> +       bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4fad3b01dc1f..8d21757cd5e9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -535,7 +535,7 @@ void kvm_destroy_vcpus(struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  {
>         return container_of(mn, struct kvm, mmu_notifier);
> @@ -960,14 +960,14 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>         return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
>  }
>
> -#else  /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
> +#else  /* !CONFIG_KVM_GENERIC_MMU_NOTIFIER */
>
>  static int kvm_init_mmu_notifier(struct kvm *kvm)
>  {
>         return 0;
>  }
>
> -#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> +#endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */
>
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
> @@ -1287,7 +1287,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  out_err_no_debugfs:
>         kvm_coalesced_mmio_free(kvm);
>  out_no_coalesced_mmio:
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>         if (kvm->mmu_notifier.ops)
>                 mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
>  #endif
> @@ -1347,7 +1347,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>                 kvm->buses[i] = NULL;
>         }
>         kvm_coalesced_mmio_free(kvm);
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>         mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
>         /*
>          * At this point, pending calls to invalidate_range_start()
> --
> 2.42.0.283.g2d96d420d3-goog
>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-05 22:46                 ` Sean Christopherson
@ 2023-10-10 22:21                   ` David Matlack
  2023-10-13 18:45                     ` Sean Christopherson
  0 siblings, 1 reply; 83+ messages in thread
From: David Matlack @ 2023-10-10 22:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Anish Moorthy, Xiaoyao Li, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel, kvm,
	kvmarm, kvm-riscv, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata

On Thu, Oct 5, 2023 at 3:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Oct 05, 2023, Anish Moorthy wrote:
> > On Tue, Oct 3, 2023 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > The only way a KVM_EXIT_MEMORY_FAULT that actually reaches userspace could be
> > > "unreliable" is if something other than a memory_fault exit clobbered the union,
> > > but didn't signal its KVM_EXIT_* reason.  And that would be an egregious bug that
> > > isn't unique to KVM_EXIT_MEMORY_FAULT, i.e. the same data corruption would affect
> > > each and every other KVM_EXIT_* reason.
> >
> > Keep in mind the case where an "unreliable" annotation sets up a
> > KVM_EXIT_MEMORY_FAULT, KVM_RUN ends up continuing, then something
> > unrelated comes up and causes KVM_RUN to EFAULT. Although this at
> > least is a case of "outdated" information rather than blatant
> > corruption.
>
> Drat, I managed to forget about that.
>
> > IIRC the last time this came up we said that there's minimal harm in
> > userspace acting on the outdated info, but it seems like another good
> > argument for just restricting the annotations to paths we know are
> > reliable. What if the second EFAULT above is fatal (as I understand
> > all are today) and sets up subsequent KVM_RUNs to crash and burn
> > somehow? Seems like that'd be a safety issue.
>
> For your series, let's omit
>
>   KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page
>
> and just fill memory_fault for the page fault paths.  That will be easier to
> document too since we can simply say that if the exit reason is KVM_EXIT_MEMORY_FAULT,
> then run->memory_fault is valid and fresh.

+1

And from a performance perspective, I don't think we care about
kvm_vcpu_read/write_guest_page(). Our (Google) KVM Demand Paging
implementation just sends any kvm_vcpu_read/write_guest_page()
requests through the netlink socket, which is just a poor man's
userfaultfd. So I think we'll be fine sending these callsites through
uffd instead of exiting out to userspace.

And with that out of the way, is there any reason to keep tying
KVM_EXIT_MEMORY_FAULT to -EFAULT? As mentioned in the patch at the top
of this thread, -EFAULT is just a hack to allow the emulator paths to
return out to userspace. But that's no longer necessary. I just find
it odd that some KVM_EXIT_* correspond with KVM_RUN returning an error
and others don't. The exit_reason is sufficient to tell userspace
what's going on and has a firm contract, unlike -EFAULT which anything
KVM calls into can return.

>
> Adding a flag or whatever to mark the data as trustworthy would be the alternative,
> but that's effectively adding ABI that says "KVM is buggy, sorry".
>
> My dream of having KVM always return useful information for -EFAULT will have to
> wait for another day.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-10 22:21                   ` David Matlack
@ 2023-10-13 18:45                     ` Sean Christopherson
  0 siblings, 0 replies; 83+ messages in thread
From: Sean Christopherson @ 2023-10-13 18:45 UTC (permalink / raw)
  To: David Matlack
  Cc: Anish Moorthy, Xiaoyao Li, Paolo Bonzini, Marc Zyngier,
	Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel, kvm,
	kvmarm, kvm-riscv, linux-kernel, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Yu Zhang, Isaku Yamahata, Xu Yilun,
	Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata

On Tue, Oct 10, 2023, David Matlack wrote:
> On Thu, Oct 5, 2023 at 3:46 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Oct 05, 2023, Anish Moorthy wrote:
> > > On Tue, Oct 3, 2023 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > The only way a KVM_EXIT_MEMORY_FAULT that actually reaches userspace could be
> > > > "unreliable" is if something other than a memory_fault exit clobbered the union,
> > > > but didn't signal its KVM_EXIT_* reason.  And that would be an egregious bug that
> > > > isn't unique to KVM_EXIT_MEMORY_FAULT, i.e. the same data corruption would affect
> > > > each and every other KVM_EXIT_* reason.
> > >
> > > Keep in mind the case where an "unreliable" annotation sets up a
> > > KVM_EXIT_MEMORY_FAULT, KVM_RUN ends up continuing, then something
> > > unrelated comes up and causes KVM_RUN to EFAULT. Although this at
> > > least is a case of "outdated" information rather than blatant
> > > corruption.
> >
> > Drat, I managed to forget about that.
> >
> > > IIRC the last time this came up we said that there's minimal harm in
> > > userspace acting on the outdated info, but it seems like another good
> > > argument for just restricting the annotations to paths we know are
> > > reliable. What if the second EFAULT above is fatal (as I understand
> > > all are today) and sets up subsequent KVM_RUNs to crash and burn
> > > somehow? Seems like that'd be a safety issue.
> >
> > For your series, let's omit
> >
> >   KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page
> >
> > and just fill memory_fault for the page fault paths.  That will be easier to
> > document too since we can simply say that if the exit reason is KVM_EXIT_MEMORY_FAULT,
> > then run->memory_fault is valid and fresh.
> 
> +1
> 
> And from a performance perspective, I don't think we care about
> kvm_vcpu_read/write_guest_page(). Our (Google) KVM Demand Paging
> implementation just sends any kvm_vcpu_read/write_guest_page()
> requests through the netlink socket, which is just a poor man's
> userfaultfd. So I think we'll be fine sending these callsites through
> uffd instead of exiting out to userspace.
> 
> And with that out of the way, is there any reason to keep tying
> KVM_EXIT_MEMORY_FAULT to -EFAULT? As mentioned in the patch at the top
> of this thread, -EFAULT is just a hack to allow the emulator paths to
> return out to userspace. But that's no longer necessary.

Not forcing '0' makes handling other error codes simpler, e.g. if the memory is
poisoned, KVM can simply return -EHWPOISON instead of having to add a flag to
run->memory_fault[*].

KVM would also have to make returning '0' instead of -EFAULT conditional on a
capability being enabled.

And again, committing to returning '0' will make it all but impossible to extend
KVM_EXIT_MEMORY_FAULT beyond the page fault handlers.  Well, I suppose we could
have the top level kvm_arch_vcpu_ioctl_run() do

	if (r == -EFAULT && vcpu->kvm->enable_memory_fault_exits &&
	    kvm_run->exit_reason == KVM_EXIT_MEMORY_FAULT)
		r = 0;

but that's quite gross IMO.
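
For illustration, a minimal sketch of the userspace side under the
-EFAULT/-EHWPOISON scheme (again assuming the uapi proposed in this series;
handle_memory_fault() is a hypothetical helper), where the errno and
exit_reason are checked together:

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical userspace handler, shown only for illustration. */
extern int handle_memory_fault(__u64 gpa, __u64 size, __u64 flags);

static int run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, 0);
	int err = ret < 0 ? errno : 0;

	/*
	 * KVM_RUN failed, but exit_reason/memory_fault are still valid and
	 * describe the access that couldn't be handled in the kernel.
	 */
	if ((err == EFAULT || err == EHWPOISON) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT)
		return handle_memory_fault(run->memory_fault.gpa,
					   run->memory_fault.size,
					   run->memory_fault.flags);

	/* Anything else: a "normal" successful exit or a bare error. */
	return ret < 0 ? -err : 0;
}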

> I just find it odd that some KVM_EXIT_* correspond with KVM_RUN returning an
> error and others don't.

FWIW, there is already precedent for run->exit_reason being valid with a non-zero
error code.  E.g. the KVM selftests rely on run->exit_reason being preserved when
forcing an immediate exit, which returns -EINTR, not '0'.

	if (kvm_run->immediate_exit) {
		r = -EINTR;
		goto out;
	}

And pre-immediate_exit code that relies on signalling vCPUs is even more explicit
in setting exit_reason with a non-zero errno:

		if (signal_pending(current)) {
			r = -EINTR;
			kvm_run->exit_reason = KVM_EXIT_INTR;
			++vcpu->stat.signal_exits;
		}

I agree that -EFAULT with KVM_EXIT_MEMORY_FAULT *looks* a little odd, but IMO the
existing KVM behavior of returning '0' is actually what's truly odd.  E.g. returning
'0' + KVM_EXIT_MMIO if the guest accesses non-existent memory is downright weird.
KVM_RUN should arguably never return '0', because it can never actually completely
succeed.

> The exit_reason is sufficient to tell userspace what's going on and has a
> firm contract, unlike -EFAULT which anything KVM calls into can return.

Eh, I don't think it lessens the contract in a meaningful way.  KVM is still
contractually obligated to fill run->exit_reason when KVM returns '0', and
userspace will still likely terminate the VM on an undocumented EFAULT/EHWPOISON.

E.g. if KVM has a bug and doesn't return KVM_EXIT_MEMORY_FAULT when handling a
page fault, then odds are very good that the bug would result in KVM returning a
"bare" -EFAULT regardless of whether KVM_EXIT_MEMORY_FAULT is paired with '0' or
-EFAULT.

[*] https://lore.kernel.org/all/ZQHzVOIsesTTysgf@google.com

end of thread

Thread overview: 83+ messages
2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
2023-09-14  1:54 ` [RFC PATCH v12 01/33] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
2023-09-15  6:47   ` Xiaoyao Li
2023-09-15 21:05     ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
2023-09-14  3:07   ` Binbin Wu
2023-09-14 14:19     ` Sean Christopherson
2023-09-20  6:07   ` Xu Yilun
2023-09-20 13:55     ` Sean Christopherson
2023-09-21  2:39       ` Xu Yilun
2023-09-21 14:24         ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 03/33] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 04/33] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 05/33] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
2023-10-09 16:42   ` Anup Patel
2023-09-14  1:55 ` [RFC PATCH v12 06/33] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
2023-09-15  6:59   ` Xiaoyao Li
2023-09-14  1:55 ` [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
2023-09-22  6:03   ` Xiaoyao Li
2023-09-22 14:30     ` Sean Christopherson
2023-09-22 16:28     ` Sean Christopherson
2023-09-22 16:35       ` Sean Christopherson
2023-10-02 22:33       ` Anish Moorthy
2023-10-03  1:42         ` Sean Christopherson
2023-10-03 22:59           ` Anish Moorthy
2023-10-03 23:46             ` Sean Christopherson
2023-10-05 22:07               ` Anish Moorthy
2023-10-05 22:46                 ` Sean Christopherson
2023-10-10 22:21                   ` David Matlack
2023-10-13 18:45                     ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 08/33] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 09/33] KVM: Drop .on_unlock() mmu_notifier hook Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events Sean Christopherson
2023-09-18  1:14   ` Binbin Wu
2023-09-18 15:57     ` Sean Christopherson
2023-09-18 18:07   ` Michael Roth
2023-09-19  0:08     ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes Sean Christopherson
2023-09-15  6:32   ` Yan Zhao
2023-09-20 21:00     ` Sean Christopherson
2023-09-21  1:21       ` Yan Zhao
2023-09-25 17:37         ` Sean Christopherson
2023-09-18  7:51   ` Binbin Wu
2023-09-20 21:03     ` Sean Christopherson
2023-09-27  5:19       ` Binbin Wu
2023-10-03 12:47   ` Fuad Tabba
2023-10-03 15:59     ` Sean Christopherson
2023-10-03 18:33       ` Fuad Tabba
2023-10-03 20:51         ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 12/33] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 13/33] security: Export security_inode_init_security_anon() for use by KVM Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
2023-09-15  6:11   ` Yan Zhao
2023-09-18 16:36   ` Michael Roth
2023-09-20 23:44     ` Sean Christopherson
2023-09-19  9:01   ` Binbin Wu
2023-09-20 14:24     ` Sean Christopherson
2023-09-21  5:58       ` Binbin Wu
2023-09-21 19:10   ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 15/33] KVM: Add transparent hugepage support for dedicated guest memory Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 16/33] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 17/33] KVM: x86: Disallow hugepages when memory attributes are mixed Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
2023-09-15  5:40   ` Yan Zhao
2023-09-15 14:26     ` Sean Christopherson
2023-09-18  0:54       ` Yan Zhao
2023-09-21 14:59         ` Sean Christopherson
2023-09-21  5:51       ` Binbin Wu
2023-09-14  1:55 ` [RFC PATCH v12 19/33] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 20/33] KVM: Allow arch code to track number of memslot address spaces per VM Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 21/33] KVM: x86: Add support for "protected VMs" that can utilize private memory Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 22/33] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 23/33] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 24/33] KVM: selftests: Add support for creating private memslots Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 25/33] KVM: selftests: Add helpers to convert guest memory b/w private and shared Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 26/33] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86) Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 27/33] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 28/33] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 29/33] KVM: selftests: Add x86-only selftest for private memory conversions Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 30/33] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 31/33] KVM: selftests: Expand set_memory_region_test to validate guest_memfd() Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 32/33] KVM: selftests: Add basic selftest for guest_memfd() Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 33/33] KVM: selftests: Test KVM exit behavior for private memory/access Sean Christopherson
