* [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes
@ 2023-10-27 18:21 Sean Christopherson
  2023-10-27 18:21 ` [PATCH v13 01/35] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
                   ` (35 more replies)
  0 siblings, 36 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Non-KVM people, please take a gander at two small-ish patches buried in the
middle of this series:

  fs: Export anon_inode_getfile_secure() for use by KVM
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

Our plan/hope is to take this through the KVM tree for 6.8; reviews (and acks!)
would be much appreciated.  Note, adding AS_UNMOVABLE isn't strictly required as
it's "just" an optimization, but we'd prefer to have it in place straightaway.

Reviews on all the KVM changes, especially the guest_memfd.c implementation, are
also most definitely welcome.

The "what and why" at the very bottom is hopefully old news for most readers.  My
plan is to copy the blurb into a tag when this is merged (today's word of the day
is: optimism), e.g. so that the big picture and why we're doing this are captured
in the git history.

Note, the v13 changelog below captures only changes that were not already posted
and applied to the v12+ development branch.  The previously posted and applied
changes can be found in commits 46c10adeda81..74a4d3b6a284 at
 
    https://github.com/kvm-x86/linux.git tags/kvm-x86-guest_memfd-v12

This series can be found at

    https://github.com/kvm-x86/linux.git guest_memfd

kvm-x86/guest_memfd is also now being fed into kvm-x86/next, i.e. will be getting
coverage in linux-next as of the next build.

Similar to the v12 "development cycle", any changes needed will be applied on
top of v13, and squashed prior to sending v14 (if needed) or merging (optimism!).

KVM folks, ***LOOK HERE***.  v13 has several breaking userspace changes relative
to v12.  Some were "necessary" (removal of a pointless ioctl), others were
opportunistic and opinionated (renaming kvm_userspace_memory_region2 fields to
use guest_memfd instead of gmem).  I didn't post changes as I found the "issues"
very late (when writing documentation) and didn't want to delay v13.

Here's a diff of the linux/include/uapi/linux/kvm.h changes that will break
userspace developed for v12.

@@ -102,8 +102,8 @@ struct kvm_userspace_memory_region2 {
        __u64 guest_phys_addr;
        __u64 memory_size;
        __u64 userspace_addr;
-       __u64 gmem_offset;
-       __u32 gmem_fd;
+       __u64 guest_memfd_offset;
+       __u32 guest_memfd;
        __u32 pad1;
        __u64 pad2[14];
 };

@@ -1231,9 +1215,10 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_USER_MEMORY2 230
-#define KVM_CAP_MEMORY_ATTRIBUTES 231
-#define KVM_CAP_GUEST_MEMFD 232
-#define KVM_CAP_VM_TYPES 233
+#define KVM_CAP_MEMORY_FAULT_INFO 231
+#define KVM_CAP_MEMORY_ATTRIBUTES 232
+#define KVM_CAP_GUEST_MEMFD 233
+#define KVM_CAP_VM_TYPES 234
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2301,8 +2286,7 @@ struct kvm_s390_zpci_op {
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
 /* Available with KVM_CAP_MEMORY_ATTRIBUTES */
-#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
-#define KVM_SET_MEMORY_ATTRIBUTES              _IOW(KVMIO,  0xd3, struct kvm_memory_attributes)
+#define KVM_SET_MEMORY_ATTRIBUTES              _IOW(KVMIO,  0xd2, struct kvm_memory_attributes)
 
v13:
 - Drop KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, have KVM_CAP_MEMORY_ATTRIBUTES
   return the supported attributes.
 - Add KVM_CAP_MEMORY_FAULT_INFO to report support for KVM_EXIT_MEMORY_FAULT,
   and shift capability numbers accordingly.
 - Do s/gmem/guest_memfd (roughly) in userspace-facing APIs, i.e. use guest_memfd
   as the formal name.  Going off of various internal conversations, "gmem" isn't
   at all intuitive, whereas "guest_memfd" gives readers/listeners a rough idea
   of what's going on.  If you don't like the rename, then next time volunteer
   to write the documentation.  :-)
 - Rename a leftover "out_restricted" label to "out_unbind".
 - Write and clean up changelogs.
 - Write and clean up documentation.
 - Move "memory_fault" to the standard exit reasons union (requires userspace to
   rebuild, but shouldn't require code changes).
 - Fix intermediate build issues (hidden behind unselectable Kconfigs).
 - Put KVM_CAP_GUEST_MEMFD and KVM_CREATE_GUEST_MEMFD under the same #ifdefs.
 - Fix a bug in kvm_mmu_invalidate_range_add() where adding multiple ranges in a
   single invalidation would capture only the last range. [Xu Yilun]

v12: https://lore.kernel.org/all/20230914015531.1419405-1-seanjc@google.com
v11: https://lore.kernel.org/all/20230718234512.1690985-1-seanjc@google.com
v10: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com

Fodder for a merge tag:
---
Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd.  Guest-first memory allows KVM to
provide features, enhancements, and optimizations that are kludgy or outright
impossible to implement in a generic memory subsystem.

The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which, similar
to the generic memfd_create(), creates an anonymous file and returns a file
descriptor that refers to it.  Again like "regular" memfd files, guest_memfd
files live in RAM, have volatile storage, and are automatically released when
the last reference is dropped.  The key differences between guest_memfd files
and memfd files (and every other memory subsystem) are that guest_memfd files
are bound to their owning virtual machine, cannot be mapped, read, or written
by userspace, and cannot be resized (guest_memfd files do, however, support
PUNCH_HOLE).
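
To make the flow concrete, here's a rough userspace sketch (not code from this
series) of creating a guest_memfd and binding it to a memslot.  The
kvm_userspace_memory_region2 field names follow the uapi diff above; the layout
of struct kvm_create_guest_memfd and the exact name of the "private" memslot
flag (KVM_MEM_PRIVATE below) do not appear in this mail and are assumptions.

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  static int bind_guest_memfd(int vm_fd, __u64 gpa, __u64 size, void *shared_hva)
  {
  	struct kvm_create_guest_memfd gmem = {
  		.size = size,			/* assumed field layout */
  	};
  	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

  	if (gmem_fd < 0)
  		return gmem_fd;

  	struct kvm_userspace_memory_region2 region = {
  		.slot			= 0,
  		.flags			= KVM_MEM_PRIVATE,	/* assumed flag name */
  		.guest_phys_addr	= gpa,
  		.memory_size		= size,
  		/* Shared (non-private) accesses are still backed by the hva. */
  		.userspace_addr		= (__u64)(unsigned long)shared_hva,
  		.guest_memfd		= (__u32)gmem_fd,
  		.guest_memfd_offset	= 0,
  	};

  	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
  }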

A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to specify
attributes for a given page of guest memory, e.g. in the long term, it will
likely be extended to allow userspace to specify per-gfn RWX protections.
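
As a rough illustration of the attributes API (again, not code from this series,
and reusing the includes from the previous sketch), converting a range of guest
memory to private might look like the below.  Only the ioctl name comes from the
diff above; the struct kvm_memory_attributes layout and the
KVM_MEMORY_ATTRIBUTE_PRIVATE bit are assumptions for the sketch.

  static int set_private(int vm_fd, __u64 gpa, __u64 size)
  {
  	struct kvm_memory_attributes attrs = {
  		.address	= gpa,			/* assumed field names */
  		.size		= size,
  		.attributes	= KVM_MEMORY_ATTRIBUTE_PRIVATE,
  	};

  	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
  }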

The immediate and driving use case for guest_memfd is Confidential Computing
(CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.  For
KVM CoCo use cases, being able to map memory into KVM guests without requiring
said memory to be mapped into the host is a hard requirement.  While SEV+ and
TDX prevent untrusted software from reading guest private data by encrypting
guest memory, pKVM provides confidentiality and integrity *without* relying on
memory encryption.  And with SEV-SNP and TDX, accessing guest private memory
can be fatal to the host, i.e. KVM must prevent host userspace from accessing
guest memory irrespective of hardware behavior.

Long term, guest_memfd provides KVM line-of-sight to use cases beyond CoCo VMs,
e.g. KVM currently doesn't support mapping memory as writable in the guest
without it also being writable in host userspace, as KVM's ABI uses userspace
VMA protections to define the allowed guest protections (with an exception
granted to mapping guest memory executable).

Similarly, KVM currently requires the guest mapping size to be a strict subset
of the host userspace mapping size, e.g. KVM doesn't support creating a 1GiB
guest mapping unless userspace also has a 1GiB host userspace mapping.
Decoupling the mapping sizes would allow userspace to precisely map only what
is needed without impacting guest performance, e.g. to again harden against
unintentional accesses to guest memory.

A guest-first memory subsystem also provides clearer line of sight to things
like a dedicated memory pool (for slice-of-hardware VMs) and elimination of
"struct page" (for offload setups where userspace _never_ needs to mmap() guest
memory).

guest_memfd is the result of 3+ years of development and exploration; taking on
memory management responsibilities in KVM was not the first, second, or even
third choice for supporting CoCo VMs.  But after many failed attempts to avoid
KVM-specific backing memory, and looking at where things ended up, it is quite
clear that of all approaches tried, guest_memfd is the simplest, most robust,
and most extensible, and the right thing to do for KVM and the kernel at large.
---

Ackerley Tng (1):
  KVM: selftests: Test KVM exit behavior for private memory/access

Chao Peng (8):
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  KVM: Introduce per-page memory attributes
  KVM: x86: Disallow hugepages when memory attributes are mixed
  KVM: x86/mmu: Handle page fault for private memory
  KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
  KVM: selftests: Expand set_memory_region_test to validate
    guest_memfd()
  KVM: selftests: Add basic selftest for guest_memfd()

Sean Christopherson (23):
  KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn
    ranges
  KVM: Assert that mmu_invalidate_in_progress *never* goes negative
  KVM: WARN if there are dangling MMU invalidations at VM destruction
  KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
  KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU
  KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to
    CONFIG_KVM_GENERIC_MMU_NOTIFIER
  KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
  KVM: Drop .on_unlock() mmu_notifier hook
  KVM: Prepare for handling only shared mappings in mmu_notifier events
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
  fs: Export anon_inode_getfile_secure() for use by KVM
  KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing
    memory
  KVM: Add transparent hugepage support for dedicated guest memory
  KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
  KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
  KVM: Allow arch code to track number of memslot address spaces per VM
  KVM: x86: Add support for "protected VMs" that can utilize private
    memory
  KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
  KVM: selftests: Convert lib's mem regions to
    KVM_SET_USER_MEMORY_REGION2
  KVM: selftests: Add support for creating private memslots
  KVM: selftests: Introduce VM "shape" to allow tests to specify the VM
    type
  KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data

Vishal Annapurve (3):
  KVM: selftests: Add helpers to convert guest memory b/w private and
    shared
  KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls
    (x86)
  KVM: selftests: Add x86-only selftest for private memory conversions

 Documentation/virt/kvm/api.rst                | 208 ++++++
 arch/arm64/include/asm/kvm_host.h             |   2 -
 arch/arm64/kvm/Kconfig                        |   2 +-
 arch/mips/include/asm/kvm_host.h              |   2 -
 arch/mips/kvm/Kconfig                         |   2 +-
 arch/powerpc/include/asm/kvm_host.h           |   2 -
 arch/powerpc/kvm/Kconfig                      |   8 +-
 arch/powerpc/kvm/book3s_hv.c                  |   2 +-
 arch/powerpc/kvm/powerpc.c                    |   7 +-
 arch/riscv/include/asm/kvm_host.h             |   2 -
 arch/riscv/kvm/Kconfig                        |   2 +-
 arch/x86/include/asm/kvm_host.h               |  17 +-
 arch/x86/include/uapi/asm/kvm.h               |   3 +
 arch/x86/kvm/Kconfig                          |  14 +-
 arch/x86/kvm/debugfs.c                        |   2 +-
 arch/x86/kvm/mmu/mmu.c                        | 271 +++++++-
 arch/x86/kvm/mmu/mmu_internal.h               |   2 +
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/kvm/x86.c                            |  26 +-
 fs/anon_inodes.c                              |   1 +
 include/linux/kvm_host.h                      | 143 ++++-
 include/linux/kvm_types.h                     |   1 +
 include/linux/pagemap.h                       |  19 +-
 include/uapi/linux/kvm.h                      |  51 ++
 mm/compaction.c                               |  43 +-
 mm/migrate.c                                  |   2 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 tools/testing/selftests/kvm/dirty_log_test.c  |   2 +-
 .../testing/selftests/kvm/guest_memfd_test.c  | 221 +++++++
 .../selftests/kvm/include/kvm_util_base.h     | 148 ++++-
 .../testing/selftests/kvm/include/test_util.h |   5 +
 .../selftests/kvm/include/ucall_common.h      |  11 +
 .../selftests/kvm/include/x86_64/processor.h  |  15 +
 .../selftests/kvm/kvm_page_table_test.c       |   2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    | 233 ++++---
 tools/testing/selftests/kvm/lib/memstress.c   |   3 +-
 .../selftests/kvm/set_memory_region_test.c    | 100 +++
 .../kvm/x86_64/private_mem_conversions_test.c | 487 ++++++++++++++
 .../kvm/x86_64/private_mem_kvm_exits_test.c   | 120 ++++
 .../kvm/x86_64/ucna_injection_test.c          |   2 +-
 virt/kvm/Kconfig                              |  17 +
 virt/kvm/Makefile.kvm                         |   1 +
 virt/kvm/dirty_ring.c                         |   2 +-
 virt/kvm/guest_memfd.c                        | 607 ++++++++++++++++++
 virt/kvm/kvm_main.c                           | 505 ++++++++++++---
 virt/kvm/kvm_mm.h                             |  26 +
 46 files changed, 3083 insertions(+), 272 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
 create mode 100644 virt/kvm/guest_memfd.c


base-commit: 2b3f2325e71f09098723727d665e2e8003d455dc
-- 
2.42.0.820.g83a721a137-goog



* [PATCH v13 01/35] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-11-01 12:46   ` Fuad Tabba
  2023-10-27 18:21 ` [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative Sean Christopherson
                   ` (34 subsequent siblings)
  35 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so
that the structure can be used to handle notifications that operate on gfn
context, i.e. that aren't tied to a host virtual address.  Rename the
handler typedef too (arguably it should always have been gfn_handler_t).

Practically speaking, this is a nop for 64-bit kernels as the only
meaningful change is to store start+end as u64s instead of unsigned longs.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 486800a7024b..0524933856d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -541,18 +541,22 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 	return container_of(mn, struct kvm, mmu_notifier);
 }
 
-typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
 			     unsigned long end);
 
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
-struct kvm_hva_range {
-	unsigned long start;
-	unsigned long end;
+struct kvm_mmu_notifier_range {
+	/*
+	 * 64-bit addresses, as KVM notifiers can operate on host virtual
+	 * addresses (unsigned long) and guest physical addresses (64-bit).
+	 */
+	u64 start;
+	u64 end;
 	union kvm_mmu_notifier_arg arg;
-	hva_handler_t handler;
+	gfn_handler_t handler;
 	on_lock_fn_t on_lock;
 	on_unlock_fn_t on_unlock;
 	bool flush_on_ret;
@@ -581,7 +585,7 @@ static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
 	     node = interval_tree_iter_next(node, start, last))	     \
 
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
-						  const struct kvm_hva_range *range)
+						  const struct kvm_mmu_notifier_range *range)
 {
 	bool ret = false, locked = false;
 	struct kvm_gfn_range gfn_range;
@@ -608,9 +612,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			unsigned long hva_start, hva_end;
 
 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
-			hva_start = max(range->start, slot->userspace_addr);
-			hva_end = min(range->end, slot->userspace_addr +
-						  (slot->npages << PAGE_SHIFT));
+			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
+			hva_end = min_t(unsigned long, range->end,
+					slot->userspace_addr + (slot->npages << PAGE_SHIFT));
 
 			/*
 			 * To optimize for the likely case where the address
@@ -660,10 +664,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 						unsigned long start,
 						unsigned long end,
 						union kvm_mmu_notifier_arg arg,
-						hva_handler_t handler)
+						gfn_handler_t handler)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range range = {
+	const struct kvm_mmu_notifier_range range = {
 		.start		= start,
 		.end		= end,
 		.arg		= arg,
@@ -680,10 +684,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
 							 unsigned long start,
 							 unsigned long end,
-							 hva_handler_t handler)
+							 gfn_handler_t handler)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range range = {
+	const struct kvm_mmu_notifier_range range = {
 		.start		= start,
 		.end		= end,
 		.handler	= handler,
@@ -771,7 +775,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range hva_range = {
+	const struct kvm_mmu_notifier_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
 		.handler	= kvm_unmap_gfn_range,
@@ -835,7 +839,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range hva_range = {
+	const struct kvm_mmu_notifier_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
 		.handler	= (void *)kvm_null_fn,
-- 
2.42.0.820.g83a721a137-goog



* [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
  2023-10-27 18:21 ` [PATCH v13 01/35] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 16:27   ` Paolo Bonzini
  2023-11-01 12:46   ` Fuad Tabba
  2023-10-27 18:21 ` [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
                   ` (33 subsequent siblings)
  35 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.

Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel.  A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings.  But it's far from
guaranteed that an end() without a start() is fatal or even problematic to
anything other than the target VM, e.g. the underlying bug could simply be
a duplicate call to end().  And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0524933856d4..5a97e6c7d9c2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -833,6 +833,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
 	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
 	 */
 	kvm->mmu_invalidate_in_progress--;
+	KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -863,8 +864,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	 */
 	if (wake)
 		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
-
-	BUG_ON(kvm->mmu_invalidate_in_progress < 0);
 }
 
 static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
-- 
2.42.0.820.g83a721a137-goog



* [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
  2023-10-27 18:21 ` [PATCH v13 01/35] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
  2023-10-27 18:21 ` [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 16:30   ` Paolo Bonzini
                     ` (2 more replies)
  2023-10-27 18:21 ` [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction Sean Christopherson
                   ` (32 subsequent siblings)
  35 siblings, 3 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Currently, in the mmu_notifier invalidate path, the hva range is recorded and
then checked against by mmu_invalidate_retry_hva() in the page fault handling
path.  However, for the soon-to-be-introduced private memory, a page fault may
not have an associated hva; checking the gfn (gpa) makes more sense.

For existing hva-based shared memory, gfn is expected to also work.  The only
downside is that when aliasing multiple gfns to a single hva, the current
algorithm of checking multiple ranges could result in a much larger range
being rejected.  Such aliasing should be uncommon, so the impact is expected
to be small.

Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c   | 10 ++++++----
 arch/x86/kvm/vmx/vmx.c   | 11 +++++------
 include/linux/kvm_host.h | 33 ++++++++++++++++++++------------
 virt/kvm/kvm_main.c      | 41 +++++++++++++++++++++++++++++++---------
 4 files changed, 64 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f7901cb4d2fa..d33657d61d80 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3056,7 +3056,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
  *
  * There are several ways to safely use this helper:
  *
- * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
+ * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
  *   consuming it.  In this case, mmu_lock doesn't need to be held during the
  *   lookup, but it does need to be held while checking the MMU notifier.
  *
@@ -4358,7 +4358,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 		return true;
 
 	return fault->slot &&
-	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+	       mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -6245,7 +6245,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	write_lock(&kvm->mmu_lock);
 
-	kvm_mmu_invalidate_begin(kvm, 0, -1ul);
+	kvm_mmu_invalidate_begin(kvm);
+
+	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
 
 	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6255,7 +6257,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	if (flush)
 		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
 
-	kvm_mmu_invalidate_end(kvm, 0, -1ul);
+	kvm_mmu_invalidate_end(kvm);
 
 	write_unlock(&kvm->mmu_lock);
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 72e3943f3693..6e502ba93141 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6757,10 +6757,10 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 		return;
 
 	/*
-	 * Grab the memslot so that the hva lookup for the mmu_notifier retry
-	 * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
-	 * on the pfn lookup's validation of the memslot to ensure a valid hva
-	 * is used for the retry check.
+	 * Explicitly grab the memslot using KVM's internal slot ID to ensure
+	 * KVM doesn't unintentionally grab a userspace memslot.  It _should_
+	 * be impossible for userspace to create a memslot for the APIC when
+	 * APICv is enabled, but paranoia won't hurt in this case.
 	 */
 	slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
 	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
@@ -6785,8 +6785,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 		return;
 
 	read_lock(&vcpu->kvm->mmu_lock);
-	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
-				     gfn_to_hva_memslot(slot, gfn))) {
+	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
 		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
 		read_unlock(&vcpu->kvm->mmu_lock);
 		goto out;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb6c6109fdca..11d091688346 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -787,8 +787,8 @@ struct kvm {
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_invalidate_seq;
 	long mmu_invalidate_in_progress;
-	unsigned long mmu_invalidate_range_start;
-	unsigned long mmu_invalidate_range_end;
+	gfn_t mmu_invalidate_range_start;
+	gfn_t mmu_invalidate_range_end;
 #endif
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
@@ -1392,10 +1392,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm);
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -1970,9 +1969,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 	return 0;
 }
 
-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
 					   unsigned long mmu_seq,
-					   unsigned long hva)
+					   gfn_t gfn)
 {
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
@@ -1981,10 +1980,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 	 * that might be being invalidated. Note that it may include some false
 	 * positives, due to shortcuts when handing concurrent invalidations.
 	 */
-	if (unlikely(kvm->mmu_invalidate_in_progress) &&
-	    hva >= kvm->mmu_invalidate_range_start &&
-	    hva < kvm->mmu_invalidate_range_end)
-		return 1;
+	if (unlikely(kvm->mmu_invalidate_in_progress)) {
+		/*
+		 * Dropping mmu_lock after bumping mmu_invalidate_in_progress
+		 * but before updating the range is a KVM bug.
+		 */
+		if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
+				 kvm->mmu_invalidate_range_end == INVALID_GPA))
+			return 1;
+
+		if (gfn >= kvm->mmu_invalidate_range_start &&
+		    gfn < kvm->mmu_invalidate_range_end)
+			return 1;
+	}
+
 	if (kvm->mmu_invalidate_seq != mmu_seq)
 		return 1;
 	return 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5a97e6c7d9c2..1a577a25de47 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -543,9 +543,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
-			     unsigned long end);
-
+typedef void (*on_lock_fn_t)(struct kvm *kvm);
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
 struct kvm_mmu_notifier_range {
@@ -637,7 +635,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 				locked = true;
 				KVM_MMU_LOCK(kvm);
 				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm, range->start, range->end);
+					range->on_lock(kvm);
+
 				if (IS_KVM_NULL_FN(range->handler))
 					break;
 			}
@@ -742,16 +741,29 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn);
 }
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
 {
+	lockdep_assert_held_write(&kvm->mmu_lock);
 	/*
 	 * The count increase must become visible at unlock time as no
 	 * spte can be established without taking the mmu_lock and
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_invalidate_in_progress++;
+
 	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+		kvm->mmu_invalidate_range_start = INVALID_GPA;
+		kvm->mmu_invalidate_range_end = INVALID_GPA;
+	}
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
+	if (likely(kvm->mmu_invalidate_range_start == INVALID_GPA)) {
 		kvm->mmu_invalidate_range_start = start;
 		kvm->mmu_invalidate_range_end = end;
 	} else {
@@ -771,6 +783,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 	}
 }
 
+static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+	return kvm_unmap_gfn_range(kvm, range);
+}
+
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
@@ -778,7 +796,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	const struct kvm_mmu_notifier_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
-		.handler	= kvm_unmap_gfn_range,
+		.handler	= kvm_mmu_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
 		.on_unlock	= kvm_arch_guest_memory_reclaimed,
 		.flush_on_ret	= true,
@@ -817,8 +835,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	return 0;
 }
 
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm)
 {
 	/*
 	 * This sequence increase will notify the kvm page fault that
@@ -834,6 +851,12 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
 	 */
 	kvm->mmu_invalidate_in_progress--;
 	KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
+
+	/*
+	 * Assert that at least one range was added between start() and end().
+	 * Not adding a range isn't fatal, but it is a KVM bug.
+	 */
+	WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA);
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
-- 
2.42.0.820.g83a721a137-goog



* [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (2 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 16:32   ` Paolo Bonzini
  2023-11-01 12:50   ` Fuad Tabba
  2023-10-27 18:21 ` [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER Sean Christopherson
                   ` (31 subsequent siblings)
  35 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add an assertion that there are no in-progress MMU invalidations when a
VM is being destroyed, with the exception of the scenario where KVM
unregisters its MMU notifier between an .invalidate_range_start() call and
the corresponding .invalidate_range_end().

KVM can't detect unpaired calls from the mmu_notifier due to the above
exception waiver, but the assertion can detect KVM bugs, e.g. the bug that
*almost* escaped initial guest_memfd development.

Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5389@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1a577a25de47..4dba682586ee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1356,9 +1356,16 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	 * No threads can be waiting in kvm_swap_active_memslots() as the
 	 * last reference on KVM has been dropped, but freeing
 	 * memslots would deadlock without this manual intervention.
+	 *
+	 * If the count isn't unbalanced, i.e. KVM did NOT unregister its MMU
+	 * notifier between a start() and end(), then there shouldn't be any
+	 * in-progress invalidations.
 	 */
 	WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
-	kvm->mn_active_invalidate_count = 0;
+	if (kvm->mn_active_invalidate_count)
+		kvm->mn_active_invalidate_count = 0;
+	else
+		WARN_ON(kvm->mmu_invalidate_in_progress);
 #else
 	kvm_flush_shadow_all(kvm);
 #endif
-- 
2.42.0.820.g83a721a137-goog



* [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (3 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 16:34   ` Paolo Bonzini
  2023-11-01 12:51   ` Fuad Tabba
  2023-10-27 18:21 ` [PATCH v13 06/35] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU Sean Christopherson
                   ` (30 subsequent siblings)
  35 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
defined when KVM is enabled, and return '1' unconditionally for the
CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path.  All flavors of PPC support for KVM
select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
defined by arch/powerpc/include/asm/kvm_host.h.

Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
will allow combining all of the

  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)

checks into a single

  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER

without having to worry about PPC's "bare" usage of
KVM_ARCH_WANT_MMU_NOTIFIER.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/powerpc/kvm/powerpc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7197c8256668..b0a512ede764 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -632,12 +632,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 #endif
 	case KVM_CAP_SYNC_MMU:
+#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+		BUILD_BUG();
+#endif
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 		r = hv_enabled;
-#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-		r = 1;
 #else
-		r = 0;
+		r = 1;
 #endif
 		break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-- 
2.42.0.820.g83a721a137-goog



* [PATCH v13 06/35] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (4 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-27 18:21 ` [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
                   ` (29 subsequent siblings)
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Advertise that KVM's MMU is synchronized with the primary MMU for all
flavors of PPC KVM support, i.e. advertise that the MMU is synchronized
when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y but the VM is not using hypervisor
mode (a.k.a. PR VMs).  PR VMs, via kvm_unmap_gfn_range_pr(), do the right
thing for mmu_notifier invalidation events, and more tellingly, KVM
returns '1' for KVM_CAP_SYNC_MMU when CONFIG_KVM_BOOK3S_HV_POSSIBLE=n
and CONFIG_KVM_BOOK3S_PR_POSSIBLE=y, i.e. KVM already advertises a
synchronized MMU for PR VMs, just not when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/powerpc/kvm/powerpc.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b0a512ede764..8d3ec483bc2b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -635,11 +635,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 		BUILD_BUG();
 #endif
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-		r = hv_enabled;
-#else
 		r = 1;
-#endif
 		break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 	case KVM_CAP_PPC_HTAB_FD:
-- 
2.42.0.820.g83a721a137-goog



* [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (5 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 06/35] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 16:37   ` Paolo Bonzini
  2023-11-01 12:54   ` Fuad Tabba
  2023-10-27 18:21 ` [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
                   ` (28 subsequent siblings)
  35 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior.  Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared.  PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.

Acked-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/include/asm/kvm_host.h   |  2 --
 arch/arm64/kvm/Kconfig              |  2 +-
 arch/mips/include/asm/kvm_host.h    |  2 --
 arch/mips/kvm/Kconfig               |  2 +-
 arch/powerpc/include/asm/kvm_host.h |  2 --
 arch/powerpc/kvm/Kconfig            |  8 ++++----
 arch/powerpc/kvm/powerpc.c          |  4 +---
 arch/riscv/include/asm/kvm_host.h   |  2 --
 arch/riscv/kvm/Kconfig              |  2 +-
 arch/x86/include/asm/kvm_host.h     |  2 --
 arch/x86/kvm/Kconfig                |  2 +-
 include/linux/kvm_host.h            |  6 +++---
 include/linux/kvm_types.h           |  1 +
 virt/kvm/Kconfig                    |  4 ++++
 virt/kvm/kvm_main.c                 | 10 +++++-----
 15 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index af06ccb7ee34..9e046b64847a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -921,8 +921,6 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
 int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 			      struct kvm_vcpu_events *events);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 83c1e09be42e..1a777715199f 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,7 +22,7 @@ menuconfig KVM
 	bool "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select KVM_GENERIC_HARDWARE_ENABLING
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select PREEMPT_NOTIFIERS
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select KVM_MMIO
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 54a85f1d4f2c..179f320cc231 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -810,8 +810,6 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t start_gfn, gfn_t end_gfn);
 pgd_t *kvm_pgd_alloc(void);
 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 /* Emulation */
 enum emulation_result update_pc(struct kvm_vcpu *vcpu, u32 cause);
 int kvm_get_badinstr(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index a8cdba75f98d..c04987d2ed2e 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -25,7 +25,7 @@ config KVM
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_MMIO
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select INTERVAL_TREE
 	select KVM_GENERIC_HARDWARE_ENABLING
 	help
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..4b5c3f2acf78 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -62,8 +62,6 @@
 
 #include <linux/mmu_notifier.h>
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 #define HPTEG_CACHE_NUM			(1 << 15)
 #define HPTEG_HASH_BITS_PTE		13
 #define HPTEG_HASH_BITS_PTE_LONG	12
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 902611954200..b33358ee6424 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -42,7 +42,7 @@ config KVM_BOOK3S_64_HANDLER
 config KVM_BOOK3S_PR_POSSIBLE
 	bool
 	select KVM_MMIO
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 
 config KVM_BOOK3S_HV_POSSIBLE
 	bool
@@ -85,7 +85,7 @@ config KVM_BOOK3S_64_HV
 	tristate "KVM for POWER7 and later using hypervisor mode in host"
 	depends on KVM_BOOK3S_64 && PPC_POWERNV
 	select KVM_BOOK3S_HV_POSSIBLE
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select CMA
 	help
 	  Support running unmodified book3s_64 guest kernels in
@@ -194,7 +194,7 @@ config KVM_E500V2
 	depends on !CONTEXT_TRACKING_USER
 	select KVM
 	select KVM_MMIO
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	help
 	  Support running unmodified E500 guest kernels in virtual machines on
 	  E500v2 host processors.
@@ -211,7 +211,7 @@ config KVM_E500MC
 	select KVM
 	select KVM_MMIO
 	select KVM_BOOKE_HV
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	help
 	  Support running unmodified E500MC/E5500/E6500 guest kernels in
 	  virtual machines on E500MC/E5500/E6500 host processors.
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8d3ec483bc2b..aac75c98a956 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -632,9 +632,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 #endif
 	case KVM_CAP_SYNC_MMU:
-#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-		BUILD_BUG();
-#endif
+		BUILD_BUG_ON(!IS_ENABLED(CONFIG_KVM_GENERIC_MMU_NOTIFIER));
 		r = 1;
 		break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 1ebf20dfbaa6..66ee9ff483e9 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -249,8 +249,6 @@ struct kvm_vcpu_arch {
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 #define KVM_RISCV_GSTAGE_TLB_MIN_ORDER		12
 
 void kvm_riscv_local_hfence_gvma_vmid_gpa(unsigned long vmid,
diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
index dfc237d7875b..ae2e05f050ec 100644
--- a/arch/riscv/kvm/Kconfig
+++ b/arch/riscv/kvm/Kconfig
@@ -30,7 +30,7 @@ config KVM
 	select KVM_GENERIC_HARDWARE_ENABLING
 	select KVM_MMIO
 	select KVM_XFER_TO_GUEST_WORK
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select PREEMPT_NOTIFIERS
 	help
 	  Support hosting virtualized guest machines.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 70d139406bc8..31e84668014e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2129,8 +2129,6 @@ enum {
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_extint(struct kvm_vcpu *v);
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ed90f148140d..091b74599c22 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -24,7 +24,7 @@ config KVM
 	depends on HIGH_RES_TIMERS
 	depends on X86_LOCAL_APIC
 	select PREEMPT_NOTIFIERS
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_PFNCACHE
 	select HAVE_KVM_IRQFD
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 11d091688346..5faba69403ac 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -253,7 +253,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
 	pte_t pte;
 };
@@ -783,7 +783,7 @@ struct kvm {
 	struct hlist_head irq_ack_notifier_list;
 #endif
 
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_invalidate_seq;
 	long mmu_invalidate_in_progress;
@@ -1946,7 +1946,7 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
 extern const struct kvm_stats_header kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
 	if (unlikely(kvm->mmu_invalidate_in_progress))
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 6f4737d5046a..9d1f7835d8c1 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -6,6 +6,7 @@
 struct kvm;
 struct kvm_async_pf;
 struct kvm_device_ops;
+struct kvm_gfn_range;
 struct kvm_interrupt;
 struct kvm_irq_routing_table;
 struct kvm_memory_slot;
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 484d0873061c..ecae2914c97e 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -92,3 +92,7 @@ config HAVE_KVM_PM_NOTIFIER
 
 config KVM_GENERIC_HARDWARE_ENABLING
        bool
+
+config KVM_GENERIC_MMU_NOTIFIER
+       select MMU_NOTIFIER
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4dba682586ee..6e708017064d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -535,7 +535,7 @@ void kvm_destroy_vcpus(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
 
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 {
 	return container_of(mn, struct kvm, mmu_notifier);
@@ -960,14 +960,14 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
 }
 
-#else  /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
+#else  /* !CONFIG_KVM_GENERIC_MMU_NOTIFIER */
 
 static int kvm_init_mmu_notifier(struct kvm *kvm)
 {
 	return 0;
 }
 
-#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
+#endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_pm_notifier_call(struct notifier_block *bl,
@@ -1287,7 +1287,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 out_err_no_debugfs:
 	kvm_coalesced_mmio_free(kvm);
 out_no_coalesced_mmio:
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 	if (kvm->mmu_notifier.ops)
 		mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
 #endif
@@ -1347,7 +1347,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 		kvm->buses[i] = NULL;
 	}
 	kvm_coalesced_mmio_free(kvm);
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
 	/*
 	 * At this point, pending calls to invalidate_range_start()
-- 
2.42.0.820.g83a721a137-goog



* [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (6 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 16:41   ` Paolo Bonzini
                     ` (2 more replies)
  2023-10-27 18:21 ` [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
                   ` (27 subsequent siblings)
  35 siblings, 3 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
information can be supplied without setting userspace up to fail.  The
padding in the new kvm_userspace_memory_region2 structure will be used to
pass a file descriptor in addition to the userspace_addr, i.e. allow
userspace to point at a file descriptor and map memory into a guest that
is NOT mapped into host userspace.

Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
makes detection of bad flags a bit more robust, e.g. if the new fd field
is guarded only by a flag and not a new ioctl(), then a userspace bug
(setting a "bad" flag) would generate out-of-bounds access instead of an
-EINVAL error.

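Purely for illustration (not part of this patch), invoking the new ioctl()
from userspace could look something like the sketch below; vm_fd, gpa, size,
and hva are placeholders provided by the VMM:

  #include <err.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static void set_memory_region2(int vm_fd, __u32 slot, __u64 gpa,
				 __u64 size, void *hva)
  {
	struct kvm_userspace_memory_region2 region;

	/* Zero the struct so that the reserved padding is all zeroes. */
	memset(&region, 0, sizeof(region));
	region.slot = slot;
	region.flags = 0;
	region.guest_phys_addr = gpa;
	region.memory_size = size;
	region.userspace_addr = (__u64)(unsigned long)hva;

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region))
		err(1, "KVM_SET_USER_MEMORY_REGION2");
  }
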
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++
 arch/x86/kvm/x86.c             |  2 +-
 include/linux/kvm_host.h       |  4 ++--
 include/uapi/linux/kvm.h       | 13 ++++++++++++
 virt/kvm/kvm_main.c            | 38 +++++++++++++++++++++++++++-------
 5 files changed, 67 insertions(+), 11 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 21a7578142a1..ace984acc125 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6070,6 +6070,27 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG
 interface. No error will be returned, but the resulting offset will not be
 applied.
 
+4.139 KVM_SET_USER_MEMORY_REGION2
+---------------------------------
+
+:Capability: KVM_CAP_USER_MEMORY2
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_userspace_memory_region2 (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_userspace_memory_region2 {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size; /* bytes */
+	__u64 userspace_addr; /* start of the userspace allocated memory */
+  };
+
+See KVM_SET_USER_MEMORY_REGION.
+
 5. The kvm_run structure
 ========================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 41cce5031126..6409914428ca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12455,7 +12455,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_userspace_memory_region2 m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5faba69403ac..4e741ff27af3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1146,9 +1146,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_userspace_memory_region2 *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_userspace_memory_region2 *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 13065dd96132..bd1abe067f28 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+/* for KVM_SET_USER_MEMORY_REGION2 */
+struct kvm_userspace_memory_region2 {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 pad[16];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
  * userspace, other bits are reserved for kvm internal use which are defined
@@ -1192,6 +1202,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_USER_MEMORY2 230
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1473,6 +1484,8 @@ struct kvm_vfio_spapr_tce {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
+					 struct kvm_userspace_memory_region2)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6e708017064d..3f5b7c2c5327 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1578,7 +1578,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_userspace_memory_region2 *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
@@ -1980,7 +1980,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_userspace_memory_region2 *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -2084,7 +2084,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_userspace_memory_region2 *mem)
 {
 	int r;
 
@@ -2096,7 +2096,7 @@ int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+					  struct kvm_userspace_memory_region2 *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4566,6 +4566,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 {
 	switch (arg) {
 	case KVM_CAP_USER_MEMORY:
+	case KVM_CAP_USER_MEMORY2:
 	case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
 	case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
 	case KVM_CAP_INTERNAL_ERROR_DATA:
@@ -4821,6 +4822,14 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
 	return fd;
 }
 
+#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_userspace_memory_region, field) !=		\
+		     offsetof(struct kvm_userspace_memory_region2, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_userspace_memory_region, field) !=		\
+		     sizeof_field(struct kvm_userspace_memory_region2, field));	\
+} while (0)
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -4843,15 +4852,28 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
 		break;
 	}
+	case KVM_SET_USER_MEMORY_REGION2:
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_userspace_memory_region2 mem;
+		unsigned long size;
+
+		if (ioctl == KVM_SET_USER_MEMORY_REGION)
+			size = sizeof(struct kvm_userspace_memory_region);
+		else
+			size = sizeof(struct kvm_userspace_memory_region2);
+
+		/* Ensure the common parts of the two structs are identical. */
+		SANITY_CHECK_MEM_REGION_FIELD(slot);
+		SANITY_CHECK_MEM_REGION_FIELD(flags);
+		SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+		SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+		SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+		if (copy_from_user(&mem, argp, size))
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (7 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 17:22   ` Paolo Bonzini
                     ` (3 more replies)
  2023-10-27 18:21 ` [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory Sean Christopherson
                   ` (26 subsequent siblings)
  35 siblings, 4 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory.  With guest private memory,
there will be two kind of memory conversions:

  - explicit conversion: happens when the guest explicitly calls into KVM
    to map a range (as private or shared)

  - implicit conversion: happens when the guest attempts to access a gfn
    that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.

Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages.  It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.

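As a rough userspace sketch (not part of this patch), where "run" is the
mmap()'d kvm_run area and handle_memory_fault() is a hypothetical VMM
helper:

  ret = ioctl(vcpu_fd, KVM_RUN, NULL);
  if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
      run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
	/*
	 * run->memory_fault describes [gpa, gpa + size); e.g. convert the
	 * range between private and shared and retry KVM_RUN.  For any
	 * other errno, exit_reason must be treated as stale.
	 */
	handle_memory_fault(run->memory_fault.gpa,
			    run->memory_fault.size,
			    run->memory_fault.flags);
  }
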
Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
Link: https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
Cc: Anish Moorthy <amoorthy@google.com>
Cc: David Matlack <dmatlack@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst | 41 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c             |  1 +
 include/linux/kvm_host.h       | 11 +++++++++
 include/uapi/linux/kvm.h       |  8 +++++++
 4 files changed, 61 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ace984acc125..860216536810 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+describes properties of the faulting access that are likely pertinent.
+Currently, no flags are defined.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT; userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
@@ -7757,6 +7777,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT for more information.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6409914428ca..ee3cd8c3c0ef 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4518,6 +4518,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4e741ff27af3..96aa930536b1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+						 gpa_t gpa, gpa_t size)
+{
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.size = size;
+
+	/* Flags are not (yet) defined or communicated to userspace. */
+	vcpu->run->memory_fault.flags = 0;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bd1abe067f28..7ae9987b48dd 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -274,6 +274,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -520,6 +521,12 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1203,6 +1210,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_USER_MEMORY2 230
+#define KVM_CAP_MEMORY_FAULT_INFO 231
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (8 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 17:11   ` Paolo Bonzini
  2023-11-02 13:55   ` Fuad Tabba
  2023-10-27 18:21 ` [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook Sean Christopherson
                   ` (25 subsequent siblings)
  35 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.

Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found.  Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 53 +++++++++++++++++++++++++++++++--------------
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3f5b7c2c5327..2bc04c8ae1f4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -561,6 +561,19 @@ struct kvm_mmu_notifier_range {
 	bool may_block;
 };
 
+/*
+ * The inner-most helper returns a tuple containing the return value from the
+ * arch- and action-specific handler, plus a flag indicating whether or not at
+ * least one memslot was found, i.e. if the handler found guest memory.
+ *
+ * Note, most notifiers are averse to booleans, so even though KVM tracks the
+ * return from arch code as a bool, outer helpers will cast it to an int. :-(
+ */
+typedef struct kvm_mmu_notifier_return {
+	bool ret;
+	bool found_memslot;
+} kvm_mn_ret_t;
+
 /*
  * Use a dedicated stub instead of NULL to indicate that there is no callback
  * function/handler.  The compiler technically can't guarantee that a real
@@ -582,22 +595,25 @@ static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
 	     node;							     \
 	     node = interval_tree_iter_next(node, start, last))	     \
 
-static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
-						  const struct kvm_mmu_notifier_range *range)
+static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
+							   const struct kvm_mmu_notifier_range *range)
 {
-	bool ret = false, locked = false;
+	struct kvm_mmu_notifier_return r = {
+		.ret = false,
+		.found_memslot = false,
+	};
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
 	struct kvm_memslots *slots;
 	int i, idx;
 
 	if (WARN_ON_ONCE(range->end <= range->start))
-		return 0;
+		return r;
 
 	/* A null handler is allowed if and only if on_lock() is provided. */
 	if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
 			 IS_KVM_NULL_FN(range->handler)))
-		return 0;
+		return r;
 
 	idx = srcu_read_lock(&kvm->srcu);
 
@@ -631,8 +647,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
 
-			if (!locked) {
-				locked = true;
+			if (!r.found_memslot) {
+				r.found_memslot = true;
 				KVM_MMU_LOCK(kvm);
 				if (!IS_KVM_NULL_FN(range->on_lock))
 					range->on_lock(kvm);
@@ -640,14 +656,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 				if (IS_KVM_NULL_FN(range->handler))
 					break;
 			}
-			ret |= range->handler(kvm, &gfn_range);
+			r.ret |= range->handler(kvm, &gfn_range);
 		}
 	}
 
-	if (range->flush_on_ret && ret)
+	if (range->flush_on_ret && r.ret)
 		kvm_flush_remote_tlbs(kvm);
 
-	if (locked) {
+	if (r.found_memslot) {
 		KVM_MMU_UNLOCK(kvm);
 		if (!IS_KVM_NULL_FN(range->on_unlock))
 			range->on_unlock(kvm);
@@ -655,8 +671,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 
 	srcu_read_unlock(&kvm->srcu, idx);
 
-	/* The notifiers are averse to booleans. :-( */
-	return (int)ret;
+	return r;
 }
 
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
@@ -677,7 +692,7 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.may_block	= false,
 	};
 
-	return __kvm_handle_hva_range(kvm, &range);
+	return __kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
@@ -696,7 +711,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.may_block	= false,
 	};
 
-	return __kvm_handle_hva_range(kvm, &range);
+	return __kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -798,7 +813,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.end		= range->end,
 		.handler	= kvm_mmu_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
-		.on_unlock	= kvm_arch_guest_memory_reclaimed,
+		.on_unlock	= (void *)kvm_null_fn,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
@@ -830,7 +845,13 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
 					  hva_range.may_block);
 
-	__kvm_handle_hva_range(kvm, &hva_range);
+	/*
+	 * If one or more memslots were found and thus zapped, notify arch code
+	 * that guest memory has been reclaimed.  This needs to be done *after*
+	 * dropping mmu_lock, as x86's reclaim path is slooooow.
+	 */
+	if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
+		kvm_arch_guest_memory_reclaimed(kvm);
 
 	return 0;
 }
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (9 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 17:18   ` Paolo Bonzini
  2023-11-02 13:55   ` Fuad Tabba
  2023-10-27 18:21 ` [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events Sean Christopherson
                   ` (24 subsequent siblings)
  35 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Drop the .on_unlock() mmu_notifer hook now that it's no longer used for
notifying arch code that memory has been reclaimed.  Adding .on_unlock()
and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
resulted in .on_lock() and .on_unlock() having divergent and asymmetric
behavior, and set future developers up for failure, i.e. all but asked for
bugs where KVM relied on using .on_unlock() to try to run a callback while
holding mmu_lock.

Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
guard against future bugs of this nature.

Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
Link: https://lore.kernel.org/all/20230802203119.GB2021422@ls.amr.corp.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2bc04c8ae1f4..cb9376833c18 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -544,7 +544,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm);
-typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
 struct kvm_mmu_notifier_range {
 	/*
@@ -556,7 +555,6 @@ struct kvm_mmu_notifier_range {
 	union kvm_mmu_notifier_arg arg;
 	gfn_handler_t handler;
 	on_lock_fn_t on_lock;
-	on_unlock_fn_t on_unlock;
 	bool flush_on_ret;
 	bool may_block;
 };
@@ -663,11 +661,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 	if (range->flush_on_ret && r.ret)
 		kvm_flush_remote_tlbs(kvm);
 
-	if (r.found_memslot) {
+	if (r.found_memslot)
 		KVM_MMU_UNLOCK(kvm);
-		if (!IS_KVM_NULL_FN(range->on_unlock))
-			range->on_unlock(kvm);
-	}
 
 	srcu_read_unlock(&kvm->srcu, idx);
 
@@ -687,7 +682,6 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.arg		= arg,
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
-		.on_unlock	= (void *)kvm_null_fn,
 		.flush_on_ret	= true,
 		.may_block	= false,
 	};
@@ -706,7 +700,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.end		= end,
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
-		.on_unlock	= (void *)kvm_null_fn,
 		.flush_on_ret	= false,
 		.may_block	= false,
 	};
@@ -813,7 +806,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.end		= range->end,
 		.handler	= kvm_mmu_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
-		.on_unlock	= (void *)kvm_null_fn,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
@@ -858,6 +850,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 void kvm_mmu_invalidate_end(struct kvm *kvm)
 {
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
 	/*
 	 * This sequence increase will notify the kvm page fault that
 	 * the page that is going to be mapped in the spte could have
@@ -889,7 +883,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		.end		= range->end,
 		.handler	= (void *)kvm_null_fn,
 		.on_lock	= kvm_mmu_invalidate_end,
-		.on_unlock	= (void *)kvm_null_fn,
 		.flush_on_ret	= false,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (10 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 17:21   ` Paolo Bonzini
                     ` (2 more replies)
  2023-10-27 18:21 ` [PATCH v13 13/35] KVM: Introduce per-page memory attributes Sean Christopherson
                   ` (23 subsequent siblings)
  35 siblings, 3 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add flags to "struct kvm_gfn_range" to let notifier events target only
shared and only private mappings, and wire up the existing mmu_notifier
events to be shared-only (private memory is never associated with a
userspace virtual address, i.e. can't be reached via mmu_notifiers).

Add two flags so that KVM can handle the three possibilities (shared,
private, and shared+private) without needing something like a tri-state
enum.

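Purely for illustration, an arch range handler might consume the flags as in
the sketch below; the handler and the zap helpers are made-up names:

  static bool arch_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
  {
	/* Both flags false means the event targets all mappings. */
	if (!range->only_private)
		zap_shared_mappings(kvm, range->slot, range->start, range->end);
	if (!range->only_shared)
		zap_private_mappings(kvm, range->slot, range->start, range->end);
	return true;
  }
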
Link: https://lore.kernel.org/all/ZJX0hk+KpQP0KUyB@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_host.h | 2 ++
 virt/kvm/kvm_main.c      | 7 +++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 96aa930536b1..89c1a991a3b8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -263,6 +263,8 @@ struct kvm_gfn_range {
 	gfn_t start;
 	gfn_t end;
 	union kvm_mmu_notifier_arg arg;
+	bool only_private;
+	bool only_shared;
 	bool may_block;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb9376833c18..302ccb87b4c1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 			 * the second or later invocation of the handler).
 			 */
 			gfn_range.arg = range->arg;
+
+			/*
+			 * HVA-based notifications aren't relevant to private
+			 * mappings as they don't have a userspace mapping.
+			 */
+			gfn_range.only_private = false;
+			gfn_range.only_shared = true;
 			gfn_range.may_block = range->may_block;
 
 			/*
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 13/35] KVM: Introduce per-page memory attributes
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (11 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30  8:11   ` Chao Gao
                     ` (2 more replies)
  2023-10-27 18:21 ` [PATCH v13 14/35] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable Sean Christopherson
                   ` (22 subsequent siblings)
  35 siblings, 3 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
userspace to operate on the per-page memory attributes.
  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
    a guest memory range.
  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
    memory attributes.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.

Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.

To simplify the implementation, wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation.  For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.

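For illustration only (assuming the usual <sys/ioctl.h>, <linux/kvm.h>, and
<err.h> setup), converting a page-aligned range to private from userspace
could look like the sketch below; vm_fd, gpa, and size are placeholders:

  struct kvm_memory_attributes attrs = {
	.address = gpa,
	.size = size,
	.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
	.flags = 0,
  };

  if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
	err(1, "KVM_SET_MEMORY_ATTRIBUTES");

  /* Setting .attributes back to 0 for the range converts it to shared. */
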
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst |  36 +++++
 include/linux/kvm_host.h       |  18 +++
 include/uapi/linux/kvm.h       |  13 ++
 virt/kvm/Kconfig               |   4 +
 virt/kvm/kvm_main.c            | 233 +++++++++++++++++++++++++++++++++
 5 files changed, 304 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 860216536810..e2252c748fd6 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6091,6 +6091,42 @@ applied.
 
 See KVM_SET_USER_MEMORY_REGION.
 
+4.140 KVM_SET_MEMORY_ATTRIBUTES
+-------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes(in)
+:Returns: 0 on success, <0 on error
+
+KVM_SET_MEMORY_ATTRIBUTES allows userspace to set memory attributes for a range
+of guest physical memory.
+
+::
+
+  struct kvm_memory_attributes {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+  };
+
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
+
+The address and size must be page aligned.  The supported attributes can be
+retrieved via ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES.  If
+executed on a VM, KVM_CAP_MEMORY_ATTRIBUTES precisely returns the attributes
+supported by that VM.  If executed at system scope, KVM_CAP_MEMORY_ATTRIBUTES
+returns all attributes supported by KVM.  The only attribute defined at this
+time is KVM_MEMORY_ATTRIBUTE_PRIVATE, which marks the associated gfn as being
+guest private memory.
+
+Note, there is no "get" API.  Userspace is responsible for explicitly tracking
+the state of a gfn/page as needed.
+
+The "flags" field is reserved for future extensions and must be '0'.
+
 5. The kvm_run structure
 ========================
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 89c1a991a3b8..df573229651b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,6 +256,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
 	pte_t pte;
+	unsigned long attributes;
 };
 
 struct kvm_gfn_range {
@@ -808,6 +809,9 @@ struct kvm {
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
+#endif
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	struct xarray mem_attr_array;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
 };
@@ -2340,4 +2344,18 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 	vcpu->run->memory_fault.flags = 0;
 }
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
+{
+	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
+}
+
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long attrs);
+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+					struct kvm_gfn_range *range);
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+					 struct kvm_gfn_range *range);
+#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 7ae9987b48dd..547837feaa28 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1211,6 +1211,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_USER_MEMORY2 230
 #define KVM_CAP_MEMORY_FAULT_INFO 231
+#define KVM_CAP_MEMORY_ATTRIBUTES 232
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2277,4 +2278,16 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
+#define KVM_SET_MEMORY_ATTRIBUTES              _IOW(KVMIO,  0xd2, struct kvm_memory_attributes)
+
+struct kvm_memory_attributes {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+};
+
+#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index ecae2914c97e..5bd7fcaf9089 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -96,3 +96,7 @@ config KVM_GENERIC_HARDWARE_ENABLING
 config KVM_GENERIC_MMU_NOTIFIER
        select MMU_NOTIFIER
        bool
+
+config KVM_GENERIC_MEMORY_ATTRIBUTES
+       select KVM_GENERIC_MMU_NOTIFIER
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 302ccb87b4c1..78a0b09ef2a5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1218,6 +1218,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	xa_init(&kvm->mem_attr_array);
+#endif
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
@@ -1398,6 +1401,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	}
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	cleanup_srcu_struct(&kvm->srcu);
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	xa_destroy(&kvm->mem_attr_array);
+#endif
 	kvm_arch_free_vm(kvm);
 	preempt_notifier_dec();
 	hardware_disable_all();
@@ -2396,6 +2402,210 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+/*
+ * Returns true if _all_ gfns in the range [@start, @end) have attributes
+ * matching @attrs.
+ */
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long attrs)
+{
+	XA_STATE(xas, &kvm->mem_attr_array, start);
+	unsigned long index;
+	bool has_attrs;
+	void *entry;
+
+	rcu_read_lock();
+
+	if (!attrs) {
+		has_attrs = !xas_find(&xas, end - 1);
+		goto out;
+	}
+
+	has_attrs = true;
+	for (index = start; index < end; index++) {
+		do {
+			entry = xas_next(&xas);
+		} while (xas_retry(&xas, entry));
+
+		if (xas.xa_index != index || xa_to_value(entry) != attrs) {
+			has_attrs = false;
+			break;
+		}
+	}
+
+out:
+	rcu_read_unlock();
+	return has_attrs;
+}
+
+static u64 kvm_supported_mem_attributes(struct kvm *kvm)
+{
+	if (!kvm)
+		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	return 0;
+}
+
+static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
+						 struct kvm_mmu_notifier_range *range)
+{
+	struct kvm_gfn_range gfn_range;
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	struct kvm_memslot_iter iter;
+	bool found_memslot = false;
+	bool ret = false;
+	int i;
+
+	gfn_range.arg = range->arg;
+	gfn_range.may_block = range->may_block;
+
+	/*
+	 * If/when KVM supports more attributes beyond private vs. shared, this
+	 * _could_ set only_{private,shared} appropriately if the entire target
+	 * range already has the desired private vs. shared state (it's unclear
+	 * if that is a net win).  For now, KVM reaches this point if and only
+	 * if the private flag is being toggled, i.e. all mappings are in play.
+	 */
+	gfn_range.only_private = false;
+	gfn_range.only_shared = false;
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, range->start, range->end) {
+			slot = iter.slot;
+			gfn_range.slot = slot;
+
+			gfn_range.start = max(range->start, slot->base_gfn);
+			gfn_range.end = min(range->end, slot->base_gfn + slot->npages);
+			if (gfn_range.start >= gfn_range.end)
+				continue;
+
+			if (!found_memslot) {
+				found_memslot = true;
+				KVM_MMU_LOCK(kvm);
+				if (!IS_KVM_NULL_FN(range->on_lock))
+					range->on_lock(kvm);
+			}
+
+			ret |= range->handler(kvm, &gfn_range);
+		}
+	}
+
+	if (range->flush_on_ret && ret)
+		kvm_flush_remote_tlbs(kvm);
+
+	if (found_memslot)
+		KVM_MMU_UNLOCK(kvm);
+}
+
+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
+					  struct kvm_gfn_range *range)
+{
+	/*
+	 * Unconditionally add the range to the invalidation set, regardless of
+	 * whether or not the arch callback actually needs to zap SPTEs.  E.g.
+	 * if KVM supports RWX attributes in the future and the attributes are
+	 * going from R=>RW, zapping isn't strictly necessary.  Unconditionally
+	 * adding the range allows KVM to require that MMU invalidations add at
+	 * least one range between begin() and end(), e.g. allows KVM to detect
+	 * bugs where the add() is missed.  Relaxing the rule *might* be safe,
+	 * but it's not obvious that allowing new mappings while the attributes
+	 * are in flux is desirable or worth the complexity.
+	 */
+	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+
+	return kvm_arch_pre_set_memory_attributes(kvm, range);
+}
+
+/* Set @attributes for the gfn range [@start, @end). */
+static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long attributes)
+{
+	struct kvm_mmu_notifier_range pre_set_range = {
+		.start = start,
+		.end = end,
+		.handler = kvm_pre_set_memory_attributes,
+		.on_lock = kvm_mmu_invalidate_begin,
+		.flush_on_ret = true,
+		.may_block = true,
+	};
+	struct kvm_mmu_notifier_range post_set_range = {
+		.start = start,
+		.end = end,
+		.arg.attributes = attributes,
+		.handler = kvm_arch_post_set_memory_attributes,
+		.on_lock = kvm_mmu_invalidate_end,
+		.may_block = true,
+	};
+	unsigned long i;
+	void *entry;
+	int r = 0;
+
+	entry = attributes ? xa_mk_value(attributes) : NULL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	/* Nothing to do if the entire range has the desired attributes. */
+	if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
+		goto out_unlock;
+
+	/*
+	 * Reserve memory ahead of time to avoid having to deal with failures
+	 * partway through setting the new attributes.
+	 */
+	for (i = start; i < end; i++) {
+		r = xa_reserve(&kvm->mem_attr_array, i, GFP_KERNEL_ACCOUNT);
+		if (r)
+			goto out_unlock;
+	}
+
+	kvm_handle_gfn_range(kvm, &pre_set_range);
+
+	for (i = start; i < end; i++) {
+		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
+				    GFP_KERNEL_ACCOUNT));
+		KVM_BUG_ON(r, kvm);
+	}
+
+	kvm_handle_gfn_range(kvm, &post_set_range);
+
+out_unlock:
+	mutex_unlock(&kvm->slots_lock);
+
+	return r;
+}
+static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
+					   struct kvm_memory_attributes *attrs)
+{
+	gfn_t start, end;
+
+	/* flags is currently not used. */
+	if (attrs->flags)
+		return -EINVAL;
+	if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
+		return -EINVAL;
+	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
+		return -EINVAL;
+	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
+		return -EINVAL;
+
+	start = attrs->address >> PAGE_SHIFT;
+	end = (attrs->address + attrs->size) >> PAGE_SHIFT;
+
+	/*
+	 * xarray tracks data using "unsigned long", and as a result so does
+	 * KVM.  For simplicity, support generic attributes only on 64-bit
+	 * architectures.
+	 */
+	BUILD_BUG_ON(sizeof(attrs->attributes) != sizeof(unsigned long));
+
+	return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
+}
+#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
 	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
@@ -4640,6 +4850,17 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA:
 		return 1;
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	case KVM_CAP_MEMORY_ATTRIBUTES: {
+		u64 attrs = kvm_supported_mem_attributes(kvm);
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
+			goto out;
+		r = 0;
+		break;
+	}
+#endif
 	default:
 		break;
 	}
@@ -5022,6 +5243,18 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	case KVM_SET_MEMORY_ATTRIBUTES: {
+		struct kvm_memory_attributes attrs;
+
+		r = -EFAULT;
+		if (copy_from_user(&attrs, argp, sizeof(attrs)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
+		break;
+	}
+#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 	case KVM_CREATE_DEVICE: {
 		struct kvm_create_device cd;
 
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 14/35] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (12 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 13/35] KVM: Introduce per-page memory attributes Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 17:24   ` Paolo Bonzini
  2023-10-27 18:21 ` [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM Sean Christopherson
                   ` (21 subsequent siblings)
  35 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add an "unmovable" flag for mappings that cannot be migrated under any
circumstance.  KVM will use the flag for its upcoming GUEST_MEMFD support,
which will not support compaction/migration, at least not in the
foreseeable future.

Test AS_UNMOVABLE under folio lock as already done for the async
compaction/dirty folio case, as the mapping can be removed by truncation
while compaction is running.  To avoid having to lock every folio with a
mapping, assume/require that unmovable mappings are also unevictable, and
have mapping_set_unmovable() also set AS_UNEVICTABLE.

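As a hedged sketch of the intended usage (the backend below is hypothetical;
the actual consumer is KVM's guest_memfd later in this series):

  static void my_backend_init_inode(struct inode *inode)
  {
	/*
	 * Mark the mapping unmovable and, via the helper, unevictable so
	 * that the compaction scanner can skip its folios without taking
	 * the folio lock in the common case.
	 */
	mapping_set_unmovable(inode->i_mapping);
  }
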
Cc: Matthew Wilcox <willy@infradead.org>
Co-developed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/pagemap.h | 19 +++++++++++++++++-
 mm/compaction.c         | 43 +++++++++++++++++++++++++++++------------
 mm/migrate.c            |  2 ++
 3 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 351c3b7f93a1..82c9bf506b79 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -203,7 +203,8 @@ enum mapping_flags {
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
 	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
+	AS_RELEASE_ALWAYS = 7,	/* Call ->release_folio(), even if no private data */
+	AS_UNMOVABLE	= 8,	/* The mapping cannot be moved, ever */
 };
 
 /**
@@ -289,6 +290,22 @@ static inline void mapping_clear_release_always(struct address_space *mapping)
 	clear_bit(AS_RELEASE_ALWAYS, &mapping->flags);
 }
 
+static inline void mapping_set_unmovable(struct address_space *mapping)
+{
+	/*
+	 * It's expected unmovable mappings are also unevictable. Compaction
+	 * migrate scanner (isolate_migratepages_block()) relies on this to
+	 * reduce page locking.
+	 */
+	set_bit(AS_UNEVICTABLE, &mapping->flags);
+	set_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
+static inline bool mapping_unmovable(struct address_space *mapping)
+{
+	return test_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return mapping->gfp_mask;
diff --git a/mm/compaction.c b/mm/compaction.c
index 38c8d216c6a3..12b828aed7c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -883,6 +883,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 	/* Time to isolate some pages for migration */
 	for (; low_pfn < end_pfn; low_pfn++) {
+		bool is_dirty, is_unevictable;
 
 		if (skip_on_failure && low_pfn >= next_skip_pfn) {
 			/*
@@ -1080,8 +1081,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		if (!folio_test_lru(folio))
 			goto isolate_fail_put;
 
+		is_unevictable = folio_test_unevictable(folio);
+
 		/* Compaction might skip unevictable pages but CMA takes them */
-		if (!(mode & ISOLATE_UNEVICTABLE) && folio_test_unevictable(folio))
+		if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
 			goto isolate_fail_put;
 
 		/*
@@ -1093,26 +1096,42 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_writeback(folio))
 			goto isolate_fail_put;
 
-		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
-			bool migrate_dirty;
+		is_dirty = folio_test_dirty(folio);
+
+		if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) ||
+		    (mapping && is_unevictable)) {
+			bool migrate_dirty = true;
+			bool is_unmovable;
 
 			/*
 			 * Only folios without mappings or that have
-			 * a ->migrate_folio callback are possible to
-			 * migrate without blocking.  However, we may
-			 * be racing with truncation, which can free
-			 * the mapping.  Truncation holds the folio lock
-			 * until after the folio is removed from the page
-			 * cache so holding it ourselves is sufficient.
+			 * a ->migrate_folio callback are possible to migrate
+			 * without blocking.
+			 *
+			 * Folios from unmovable mappings are not migratable.
+			 *
+			 * However, we can be racing with truncation, which can
+			 * free the mapping that we need to check. Truncation
+			 * holds the folio lock until after the folio is removed
+			 * from the page cache so holding it ourselves is sufficient.
+			 *
+			 * To avoid locking the folio just to check unmovable,
+			 * assume every unmovable folio is also unevictable,
+			 * which is a cheaper test.  If our assumption goes
+			 * wrong, it's not a correctness bug, just potentially
+			 * wasted cycles.
 			 */
 			if (!folio_trylock(folio))
 				goto isolate_fail_put;
 
 			mapping = folio_mapping(folio);
-			migrate_dirty = !mapping ||
-					mapping->a_ops->migrate_folio;
+			if ((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) {
+				migrate_dirty = !mapping ||
+						mapping->a_ops->migrate_folio;
+			}
+			is_unmovable = mapping && mapping_unmovable(mapping);
 			folio_unlock(folio);
-			if (!migrate_dirty)
+			if (!migrate_dirty || is_unmovable)
 				goto isolate_fail_put;
 		}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 2053b54556ca..ed874e43ecd7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -956,6 +956,8 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
 
 		if (!mapping)
 			rc = migrate_folio(mapping, dst, src, mode);
+		else if (mapping_unmovable(mapping))
+			rc = -EOPNOTSUPP;
 		else if (mapping->a_ops->migrate_folio)
 			/*
 			 * Most folios have a mapping and most filesystems
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (13 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 14/35] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-30 17:30   ` Paolo Bonzini
  2023-11-02 16:24   ` Christian Brauner
  2023-10-27 18:21 ` [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
                   ` (20 subsequent siblings)
  35 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Export anon_inode_getfile_secure() so that it can be used by KVM to create
and manage file-based guest memory without needing a full-blown filesystem.
The "standard" anon_inode_getfd() doesn't work for KVM's use case as KVM
needs a unique inode for each file, e.g. to be able to independently
manage the size and lifecycle of a given file.

Note, KVM doesn't need a "secure" version, just unique inodes, i.e. ignore
the name.
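
For illustration, a rough sketch of the intended usage (this mirrors the
guest_memfd patch later in this series; names and error handling are
simplified):

    file = anon_inode_getfile_secure("[kvm-gmem]", &kvm_gmem_fops, gmem,
                                     O_RDWR, NULL);
    if (IS_ERR(file))
        return PTR_ERR(file);

    /* Every caller gets a fresh inode, so per-file state stays private. */
    file_inode(file)->i_size = size;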

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 fs/anon_inodes.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 24192a7667ed..4190336180ee 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -176,6 +176,7 @@ struct file *anon_inode_getfile_secure(const char *name,
 	return __anon_inode_getfile(name, fops, priv, flags,
 				    context_inode, true);
 }
+EXPORT_SYMBOL_GPL(anon_inode_getfile_secure);
 
 static int __anon_inode_getfd(const char *name,
 			      const struct file_operations *fops,
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (14 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-31  2:27   ` Xiaoyao Li
                     ` (5 more replies)
  2023-10-27 18:21 ` [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory Sean Christopherson
                   ` (19 subsequent siblings)
  35 siblings, 6 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.

A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem.  With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings.   E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allowed guest protections.  Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.

Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB
mapping.  Decoupling the mapping sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.

Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).

More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd.  While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption.  And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must prevent host
userspace from accessing guest memory irrespective of hardware behavior.

Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.

Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping.  And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.

Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory.  That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.

Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem.  I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.

Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay.  Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.
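
For reference, a bare-bones sketch of the intended userspace flow (purely
illustrative; vm_fd, shared_buf and the sizes are placeholders, and all
error handling is omitted):

    struct kvm_create_guest_memfd gmem = {
        .size = 0x40000000,                      /* 1 GiB, page aligned */
    };
    int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

    struct kvm_userspace_memory_region2 region = {
        .slot = 0,
        .flags = KVM_MEM_PRIVATE,
        .guest_phys_addr = 0,
        .memory_size = 0x40000000,
        .userspace_addr = (__u64)shared_buf,     /* shared memory */
        .guest_memfd = gmem_fd,                  /* private memory */
        .guest_memfd_offset = 0,
    };
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

All memory starts out shared; userspace then flips ranges to private via
KVM_SET_MEMORY_ATTRIBUTES (KVM_MEMORY_ATTRIBUTE_PRIVATE) as the guest
converts memory.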

Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst |  69 ++++-
 include/linux/kvm_host.h       |  48 +++
 include/uapi/linux/kvm.h       |  15 +-
 virt/kvm/Kconfig               |   4 +
 virt/kvm/Makefile.kvm          |   1 +
 virt/kvm/guest_memfd.c         | 548 +++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c            |  68 +++-
 virt/kvm/kvm_mm.h              |  26 ++
 8 files changed, 764 insertions(+), 15 deletions(-)
 create mode 100644 virt/kvm/guest_memfd.c

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e2252c748fd6..e82c69d5e755 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6079,6 +6079,15 @@ applied.
 :Parameters: struct kvm_userspace_memory_region2 (in)
 :Returns: 0 on success, -1 on error
 
+KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
+allows mapping guest_memfd memory into a guest.  All fields shared with
+KVM_SET_USER_MEMORY_REGION identically.  Userspace can set KVM_MEM_PRIVATE in
+flags to have KVM bind the memory region to a given guest_memfd range of
+[guest_memfd_offset, guest_memfd_offset + memory_size].  The target guest_memfd
+must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
+the target range must not be bound to any other memory region.  All standard
+bounds checks apply (use common sense).
+
 ::
 
   struct kvm_userspace_memory_region2 {
@@ -6087,9 +6096,24 @@ applied.
 	__u64 guest_phys_addr;
 	__u64 memory_size; /* bytes */
 	__u64 userspace_addr; /* start of the userspace allocated memory */
+  __u64 guest_memfd_offset;
+	__u32 guest_memfd;
+	__u32 pad1;
+	__u64 pad2[14];
   };
 
-See KVM_SET_USER_MEMORY_REGION.
+A KVM_MEM_PRIVATE region _must_ have a valid guest_memfd (private memory) and
+userspace_addr (shared memory).  However, "valid" for userspace_addr simply
+means that the address itself must be a legal userspace address.  The backing
+mapping for userspace_addr is not required to be valid/populated at the time of
+KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
+on-demand.
+
+When mapping a gfn into the guest, KVM selects shared vs. private, i.e. consumes
+userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
+state.  At VM creation time, all memory is shared, i.e. the PRIVATE attribute
+is '0' for all gfns.  Userspace can control whether memory is shared/private by
+toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
 
 4.140 KVM_SET_MEMORY_ATTRIBUTES
 -------------------------------
@@ -6127,6 +6151,49 @@ the state of a gfn/page as needed.
 
 The "flags" field is reserved for future extensions and must be '0'.
 
+4.141 KVM_CREATE_GUEST_MEMFD
+----------------------------
+
+:Capability: KVM_CAP_GUEST_MEMFD
+:Architectures: none
+:Type: vm ioctl
+:Parameters: struct kvm_create_guest_memfd (in)
+:Returns: 0 on success, <0 on error
+
+KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor
+that refers to it.  guest_memfd files are roughly analogous to files created
+via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage,
+and are automatically released when the last reference is dropped.  Unlike
+"regular" memfd_create() files, guest_memfd files are bound to their owning
+virtual machine (see below), cannot be mapped, read, or written by userspace,
+and cannot be resized  (guest_memfd files do however support PUNCH_HOLE).
+
+::
+
+  struct kvm_create_guest_memfd {
+	__u64 size;
+	__u64 flags;
+	__u64 reserved[6];
+  };
+
+Conceptually, the inode backing a guest_memfd file represents physical memory,
+i.e. is coupled to the virtual machine as a thing, not to a "struct kvm".  The
+file itself, which is bound to a "struct kvm", is that instance's view of the
+underlying memory, e.g. effectively provides the translation of guest addresses
+to host memory.  This allows for use cases where multiple KVM structures are
+used to manage a single virtual machine, e.g. when performing intrahost
+migration of a virtual machine.
+
+KVM currently only supports mapping guest_memfd via KVM_SET_USER_MEMORY_REGION2,
+and more specifically via the guest_memfd and guest_memfd_offset fields in
+"struct kvm_userspace_memory_region2", where guest_memfd_offset is the offset
+into the guest_memfd instance.  For a given guest_memfd file, there can be at
+most one mapping per page, i.e. binding multiple memory regions to a single
+guest_memfd range is not allowed (any number of memory regions can be bound to
+a single guest_memfd file, but the bound ranges must not overlap).
+
+See KVM_SET_USER_MEMORY_REGION2 for additional details.
+
 5. The kvm_run structure
 ========================
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index df573229651b..7de93858054d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -591,8 +591,20 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	struct {
+		struct file __rcu *file;
+		pgoff_t pgoff;
+	} gmem;
+#endif
 };
 
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -687,6 +699,17 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
+/*
+ * Arch code must define kvm_arch_has_private_mem if support for private memory
+ * is enabled.
+ */
+#if !defined(kvm_arch_has_private_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
+static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
+{
+	return false;
+}
+#endif
+
 struct kvm_memslots {
 	u64 generation;
 	atomic_long_t last_used_slot;
@@ -1401,6 +1424,7 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_invalidate_begin(struct kvm *kvm);
 void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
 void kvm_mmu_invalidate_end(struct kvm *kvm);
+bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -2356,6 +2380,30 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range);
+
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
+	       kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+#else
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+#else
+static inline int kvm_gmem_get_pfn(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, gfn_t gfn,
+				   kvm_pfn_t *pfn, int *max_order)
+{
+	KVM_BUG_ON(1, kvm);
+	return -EIO;
+}
+#endif /* CONFIG_KVM_PRIVATE_MEM */
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 547837feaa28..25caee8d1a80 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -102,7 +102,10 @@ struct kvm_userspace_memory_region2 {
 	__u64 guest_phys_addr;
 	__u64 memory_size;
 	__u64 userspace_addr;
-	__u64 pad[16];
+	__u64 guest_memfd_offset;
+	__u32 guest_memfd;
+	__u32 pad1;
+	__u64 pad2[14];
 };
 
 /*
@@ -112,6 +115,7 @@ struct kvm_userspace_memory_region2 {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1212,6 +1216,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_USER_MEMORY2 230
 #define KVM_CAP_MEMORY_FAULT_INFO 231
 #define KVM_CAP_MEMORY_ATTRIBUTES 232
+#define KVM_CAP_GUEST_MEMFD 233
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2290,4 +2295,12 @@ struct kvm_memory_attributes {
 
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
+#define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
+
+struct kvm_create_guest_memfd {
+	__u64 size;
+	__u64 flags;
+	__u64 reserved[6];
+};
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 5bd7fcaf9089..08afef022db9 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -100,3 +100,7 @@ config KVM_GENERIC_MMU_NOTIFIER
 config KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_GENERIC_MMU_NOTIFIER
        bool
+
+config KVM_PRIVATE_MEM
+       select XARRAY_MULTI
+       bool
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index 2c27d5d0c367..724c89af78af 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -12,3 +12,4 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
 kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
 kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
 kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
+kvm-$(CONFIG_KVM_PRIVATE_MEM) += $(KVM)/guest_memfd.o
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
new file mode 100644
index 000000000000..98a12da80214
--- /dev/null
+++ b/virt/kvm/guest_memfd.c
@@ -0,0 +1,548 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/backing-dev.h>
+#include <linux/falloc.h>
+#include <linux/kvm_host.h>
+#include <linux/pagemap.h>
+#include <linux/anon_inodes.h>
+
+#include "kvm_mm.h"
+
+struct kvm_gmem {
+	struct kvm *kvm;
+	struct xarray bindings;
+	struct list_head entry;
+};
+
+static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
+{
+	struct folio *folio;
+
+	/* TODO: Support huge pages. */
+	folio = filemap_grab_folio(inode->i_mapping, index);
+	if (IS_ERR_OR_NULL(folio))
+		return NULL;
+
+	/*
+	 * Use the up-to-date flag to track whether or not the memory has been
+	 * zeroed before being handed off to the guest.  There is no backing
+	 * storage for the memory, so the folio will remain up-to-date until
+	 * it's removed.
+	 *
+	 * TODO: Skip clearing pages when trusted firmware will do it when
+	 * assigning memory to the guest.
+	 */
+	if (!folio_test_uptodate(folio)) {
+		unsigned long nr_pages = folio_nr_pages(folio);
+		unsigned long i;
+
+		for (i = 0; i < nr_pages; i++)
+			clear_highpage(folio_page(folio, i));
+
+		folio_mark_uptodate(folio);
+	}
+
+	/*
+	 * Ignore accessed, referenced, and dirty flags.  The memory is
+	 * unevictable and there is no storage to write back to.
+	 */
+	return folio;
+}
+
+static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
+				      pgoff_t end)
+{
+	bool flush = false, found_memslot = false;
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = gmem->kvm;
+	unsigned long index;
+
+	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+		pgoff_t pgoff = slot->gmem.pgoff;
+
+		struct kvm_gfn_range gfn_range = {
+			.start = slot->base_gfn + max(pgoff, start) - pgoff,
+			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
+			.slot = slot,
+			.may_block = true,
+		};
+
+		if (!found_memslot) {
+			found_memslot = true;
+
+			KVM_MMU_LOCK(kvm);
+			kvm_mmu_invalidate_begin(kvm);
+		}
+
+		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+	if (found_memslot)
+		KVM_MMU_UNLOCK(kvm);
+}
+
+static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
+				    pgoff_t end)
+{
+	struct kvm *kvm = gmem->kvm;
+
+	if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
+		KVM_MMU_LOCK(kvm);
+		kvm_mmu_invalidate_end(kvm);
+		KVM_MMU_UNLOCK(kvm);
+	}
+}
+
+static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct list_head *gmem_list = &inode->i_mapping->private_list;
+	pgoff_t start = offset >> PAGE_SHIFT;
+	pgoff_t end = (offset + len) >> PAGE_SHIFT;
+	struct kvm_gmem *gmem;
+
+	/*
+	 * Bindings must be stable across invalidation to ensure the start+end
+	 * are balanced.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+
+	list_for_each_entry(gmem, gmem_list, entry)
+		kvm_gmem_invalidate_begin(gmem, start, end);
+
+	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
+
+	list_for_each_entry(gmem, gmem_list, entry)
+		kvm_gmem_invalidate_end(gmem, start, end);
+
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	return 0;
+}
+
+static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t start, index, end;
+	int r;
+
+	/* Dedicated guest is immutable by default. */
+	if (offset + len > i_size_read(inode))
+		return -EINVAL;
+
+	filemap_invalidate_lock_shared(mapping);
+
+	start = offset >> PAGE_SHIFT;
+	end = (offset + len) >> PAGE_SHIFT;
+
+	r = 0;
+	for (index = start; index < end; ) {
+		struct folio *folio;
+
+		if (signal_pending(current)) {
+			r = -EINTR;
+			break;
+		}
+
+		folio = kvm_gmem_get_folio(inode, index);
+		if (!folio) {
+			r = -ENOMEM;
+			break;
+		}
+
+		index = folio_next_index(folio);
+
+		folio_unlock(folio);
+		folio_put(folio);
+
+		/* 64-bit only, wrapping the index should be impossible. */
+		if (WARN_ON_ONCE(!index))
+			break;
+
+		cond_resched();
+	}
+
+	filemap_invalidate_unlock_shared(mapping);
+
+	return r;
+}
+
+static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
+			       loff_t len)
+{
+	int ret;
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE))
+		return -EOPNOTSUPP;
+
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+		return -EOPNOTSUPP;
+
+	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+		return -EINVAL;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE)
+		ret = kvm_gmem_punch_hole(file_inode(file), offset, len);
+	else
+		ret = kvm_gmem_allocate(file_inode(file), offset, len);
+
+	if (!ret)
+		file_modified(file);
+	return ret;
+}
+
+static int kvm_gmem_release(struct inode *inode, struct file *file)
+{
+	struct kvm_gmem *gmem = file->private_data;
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = gmem->kvm;
+	unsigned long index;
+
+	/*
+	 * Prevent concurrent attempts to *unbind* a memslot.  This is the last
+	 * reference to the file and thus no new bindings can be created, but
+	 * dereferencing the slot for existing bindings needs to be protected
+	 * against memslot updates, specifically so that unbind doesn't race
+	 * and free the memslot (kvm_gmem_get_file() will return NULL).
+	 */
+	mutex_lock(&kvm->slots_lock);
+
+	filemap_invalidate_lock(inode->i_mapping);
+
+	xa_for_each(&gmem->bindings, index, slot)
+		rcu_assign_pointer(slot->gmem.file, NULL);
+
+	synchronize_rcu();
+
+	/*
+	 * All in-flight operations are gone and no new bindings can be created.
+	 * Zap all SPTEs pointed at by this file.  Do not free the backing
+	 * memory, as its lifetime is associated with the inode, not the file.
+	 */
+	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+	kvm_gmem_invalidate_end(gmem, 0, -1ul);
+
+	list_del(&gmem->entry);
+
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	mutex_unlock(&kvm->slots_lock);
+
+	xa_destroy(&gmem->bindings);
+	kfree(gmem);
+
+	kvm_put_kvm(kvm);
+
+	return 0;
+}
+
+static struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
+{
+	struct file *file;
+
+	rcu_read_lock();
+
+	file = rcu_dereference(slot->gmem.file);
+	if (file && !get_file_rcu(file))
+		file = NULL;
+
+	rcu_read_unlock();
+
+	return file;
+}
+
+static struct file_operations kvm_gmem_fops = {
+	.open		= generic_file_open,
+	.release	= kvm_gmem_release,
+	.fallocate	= kvm_gmem_fallocate,
+};
+
+void kvm_gmem_init(struct module *module)
+{
+	kvm_gmem_fops.owner = module;
+}
+
+static int kvm_gmem_migrate_folio(struct address_space *mapping,
+				  struct folio *dst, struct folio *src,
+				  enum migrate_mode mode)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+
+static int kvm_gmem_error_page(struct address_space *mapping, struct page *page)
+{
+	struct list_head *gmem_list = &mapping->private_list;
+	struct kvm_gmem *gmem;
+	pgoff_t start, end;
+
+	filemap_invalidate_lock_shared(mapping);
+
+	start = page->index;
+	end = start + thp_nr_pages(page);
+
+	list_for_each_entry(gmem, gmem_list, entry)
+		kvm_gmem_invalidate_begin(gmem, start, end);
+
+	/*
+	 * Do not truncate the range, what action is taken in response to the
+	 * error is userspace's decision (assuming the architecture supports
+	 * gracefully handling memory errors).  If/when the guest attempts to
+	 * access a poisoned page, kvm_gmem_get_pfn() will return -EHWPOISON,
+	 * at which point KVM can either terminate the VM or propagate the
+	 * error to userspace.
+	 */
+
+	list_for_each_entry(gmem, gmem_list, entry)
+		kvm_gmem_invalidate_end(gmem, start, end);
+
+	filemap_invalidate_unlock_shared(mapping);
+
+	return MF_DELAYED;
+}
+
+static const struct address_space_operations kvm_gmem_aops = {
+	.dirty_folio = noop_dirty_folio,
+#ifdef CONFIG_MIGRATION
+	.migrate_folio	= kvm_gmem_migrate_folio,
+#endif
+	.error_remove_page = kvm_gmem_error_page,
+};
+
+static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
+			    struct kstat *stat, u32 request_mask,
+			    unsigned int query_flags)
+{
+	struct inode *inode = path->dentry->d_inode;
+
+	/* TODO */
+	generic_fillattr(idmap, request_mask, inode, stat);
+	return 0;
+}
+
+static int kvm_gmem_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+			    struct iattr *attr)
+{
+	/* TODO */
+	return -EINVAL;
+}
+static const struct inode_operations kvm_gmem_iops = {
+	.getattr	= kvm_gmem_getattr,
+	.setattr	= kvm_gmem_setattr,
+};
+
+static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
+{
+	const char *anon_name = "[kvm-gmem]";
+	struct kvm_gmem *gmem;
+	struct inode *inode;
+	struct file *file;
+	int fd, err;
+
+	fd = get_unused_fd_flags(0);
+	if (fd < 0)
+		return fd;
+
+	gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
+	if (!gmem) {
+		err = -ENOMEM;
+		goto err_fd;
+	}
+
+	/*
+	 * Use the so called "secure" variant, which creates a unique inode
+	 * instead of reusing a single inode.  Each guest_memfd instance needs
+	 * its own inode to track the size, flags, etc.
+	 */
+	file = anon_inode_getfile_secure(anon_name, &kvm_gmem_fops, gmem,
+					 O_RDWR, NULL);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_gmem;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	inode = file->f_inode;
+	WARN_ON(file->f_mapping != inode->i_mapping);
+
+	inode->i_private = (void *)(unsigned long)flags;
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_unmovable(inode->i_mapping);
+	/* Unmovable mappings are supposed to be marked unevictable as well. */
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	kvm_get_kvm(kvm);
+	gmem->kvm = kvm;
+	xa_init(&gmem->bindings);
+	list_add(&gmem->entry, &inode->i_mapping->private_list);
+
+	fd_install(fd, file);
+	return fd;
+
+err_gmem:
+	kfree(gmem);
+err_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
+{
+	loff_t size = args->size;
+	u64 flags = args->flags;
+	u64 valid_flags = 0;
+
+	if (flags & ~valid_flags)
+		return -EINVAL;
+
+	if (size < 0 || !PAGE_ALIGNED(size))
+		return -EINVAL;
+
+	return __kvm_gmem_create(kvm, size, flags);
+}
+
+int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
+		  unsigned int fd, loff_t offset)
+{
+	loff_t size = slot->npages << PAGE_SHIFT;
+	unsigned long start, end;
+	struct kvm_gmem *gmem;
+	struct inode *inode;
+	struct file *file;
+
+	BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
+
+	file = fget(fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kvm_gmem_fops)
+		goto err;
+
+	gmem = file->private_data;
+	if (gmem->kvm != kvm)
+		goto err;
+
+	inode = file_inode(file);
+
+	if (offset < 0 || !PAGE_ALIGNED(offset))
+		goto err;
+
+	if (offset + size > i_size_read(inode))
+		goto err;
+
+	filemap_invalidate_lock(inode->i_mapping);
+
+	start = offset >> PAGE_SHIFT;
+	end = start + slot->npages;
+
+	if (!xa_empty(&gmem->bindings) &&
+	    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
+		filemap_invalidate_unlock(inode->i_mapping);
+		goto err;
+	}
+
+	/*
+	 * No synchronize_rcu() needed, any in-flight readers are guaranteed to
+	 * see either a NULL file or this new file, no need for them to go
+	 * away.
+	 */
+	rcu_assign_pointer(slot->gmem.file, file);
+	slot->gmem.pgoff = start;
+
+	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	/*
+	 * Drop the reference to the file, even on success.  The file pins KVM,
+	 * not the other way 'round.  Active bindings are invalidated if the
+	 * file is closed before memslots are destroyed.
+	 */
+	fput(file);
+	return 0;
+
+err:
+	fput(file);
+	return -EINVAL;
+}
+
+void kvm_gmem_unbind(struct kvm_memory_slot *slot)
+{
+	unsigned long start = slot->gmem.pgoff;
+	unsigned long end = start + slot->npages;
+	struct kvm_gmem *gmem;
+	struct file *file;
+
+	/*
+	 * Nothing to do if the underlying file was already closed (or is being
+	 * closed right now), kvm_gmem_release() invalidates all bindings.
+	 */
+	file = kvm_gmem_get_file(slot);
+	if (!file)
+		return;
+
+	gmem = file->private_data;
+
+	filemap_invalidate_lock(file->f_mapping);
+	xa_store_range(&gmem->bindings, start, end - 1, NULL, GFP_KERNEL);
+	rcu_assign_pointer(slot->gmem.file, NULL);
+	synchronize_rcu();
+	filemap_invalidate_unlock(file->f_mapping);
+
+	fput(file);
+}
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+	struct kvm_gmem *gmem;
+	struct folio *folio;
+	struct page *page;
+	struct file *file;
+	int r;
+
+	file = kvm_gmem_get_file(slot);
+	if (!file)
+		return -EFAULT;
+
+	gmem = file->private_data;
+
+	if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
+		r = -EIO;
+		goto out_fput;
+	}
+
+	folio = kvm_gmem_get_folio(file_inode(file), index);
+	if (!folio) {
+		r = -ENOMEM;
+		goto out_fput;
+	}
+
+	if (folio_test_hwpoison(folio)) {
+		r = -EHWPOISON;
+		goto out_unlock;
+	}
+
+	page = folio_file_page(folio, index);
+
+	*pfn = page_to_pfn(page);
+	if (max_order)
+		*max_order = 0;
+
+	r = 0;
+
+out_unlock:
+	folio_unlock(folio);
+out_fput:
+	fput(file);
+
+	return r;
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 78a0b09ef2a5..5d1a2f1b4e94 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -798,7 +798,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
 	}
 }
 
-static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
 	return kvm_unmap_gfn_range(kvm, range);
@@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 /* This does not remove the slot from struct kvm_memslots data structures */
 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
+	if (slot->flags & KVM_MEM_PRIVATE)
+		kvm_gmem_unbind(slot);
+
 	kvm_destroy_dirty_bitmap(slot);
 
 	kvm_arch_free_memslot(kvm, slot);
@@ -1605,10 +1608,18 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region2 *mem)
+static int check_memory_region_flags(struct kvm *kvm,
+				     const struct kvm_userspace_memory_region2 *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
+	if (kvm_arch_has_private_mem(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+
+	/* Dirty logging private memory is not currently supported. */
+	if (mem->flags & KVM_MEM_PRIVATE)
+		valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
+
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
@@ -2017,7 +2028,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
@@ -2036,6 +2047,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size))
 		return -EINVAL;
+	if (mem->flags & KVM_MEM_PRIVATE &&
+	    (mem->guest_memfd_offset & (PAGE_SIZE - 1) ||
+	     mem->guest_memfd_offset + mem->memory_size < mem->guest_memfd_offset))
+		return -EINVAL;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -2074,6 +2089,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
+		/* Private memslots are immutable, they can only be deleted. */
+		if (mem->flags & KVM_MEM_PRIVATE)
+			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2102,10 +2120,23 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
+	if (mem->flags & KVM_MEM_PRIVATE) {
+		r = kvm_gmem_bind(kvm, new, mem->guest_memfd, mem->guest_memfd_offset);
+		if (r)
+			goto out;
+	}
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
-		kfree(new);
+		goto out_unbind;
+
+	return 0;
+
+out_unbind:
+	if (mem->flags & KVM_MEM_PRIVATE)
+		kvm_gmem_unbind(new);
+out:
+	kfree(new);
 	return r;
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
@@ -2441,7 +2472,7 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
-	if (!kvm)
+	if (!kvm || kvm_arch_has_private_mem(kvm))
 		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
 
 	return 0;
@@ -4852,14 +4883,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return 1;
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 	case KVM_CAP_MEMORY_ATTRIBUTES:
-		u64 attrs = kvm_supported_mem_attributes(kvm);
-
-		r = -EFAULT;
-		if (copy_to_user(argp, &attrs, sizeof(attrs)))
-			goto out;
-		r = 0;
-		break;
-	}
+		return kvm_supported_mem_attributes(kvm);
+#endif
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	case KVM_CAP_GUEST_MEMFD:
+		return !kvm || kvm_arch_has_private_mem(kvm);
 #endif
 	default:
 		break;
@@ -5282,6 +5310,18 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_GET_STATS_FD:
 		r = kvm_vm_ioctl_get_stats_fd(kvm);
 		break;
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	case KVM_CREATE_GUEST_MEMFD: {
+		struct kvm_create_guest_memfd guest_memfd;
+
+		r = -EFAULT;
+		if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd)))
+			goto out;
+
+		r = kvm_gmem_create(kvm, &guest_memfd);
+		break;
+	}
+#endif
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
@@ -6414,6 +6454,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	if (WARN_ON_ONCE(r))
 		goto err_vfio;
 
+	kvm_gmem_init(module);
+
 	/*
 	 * Registration _must_ be the very last thing done, as this exposes
 	 * /dev/kvm to userspace, i.e. all infrastructure must be setup!
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 180f1a09e6ba..ecefc7ec51af 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -37,4 +37,30 @@ static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 }
 #endif /* HAVE_KVM_PFNCACHE */
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+void kvm_gmem_init(struct module *module);
+int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
+int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
+		  unsigned int fd, loff_t offset);
+void kvm_gmem_unbind(struct kvm_memory_slot *slot);
+#else
+static inline void kvm_gmem_init(struct module *module)
+{
+
+}
+
+static inline int kvm_gmem_bind(struct kvm *kvm,
+					 struct kvm_memory_slot *slot,
+					 unsigned int fd, loff_t offset)
+{
+	WARN_ON_ONCE(1);
+	return -EIO;
+}
+
+static inline void kvm_gmem_unbind(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+}
+#endif /* CONFIG_KVM_PRIVATE_MEM */
+
 #endif /* __KVM_MM_H__ */
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (15 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
@ 2023-10-27 18:21 ` Sean Christopherson
  2023-10-31  8:35   ` Xiaoyao Li
  2023-10-27 18:22 ` [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN Sean Christopherson
                   ` (18 subsequent siblings)
  35 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:21 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Extend guest_memfd to allow backing guest memory with transparent
hugepages.  Require userspace to opt in via a flag even though there's no
known/anticipated use case for forcing small pages, as THP is optional,
i.e. to avoid ending up in a situation where userspace is unaware that
KVM can't provide hugepages.

For simplicity, require the guest_memfd size to be a multiple of the
hugepage size, e.g. so that KVM doesn't need to do bounds checking when
deciding whether or not to allocate a huge folio.

When KVM gets a pfn from guest_memfd and reports the max order, force
order-0 pages if the hugepage is not fully contained by the memslot
binding, e.g. if userspace requested hugepages but punches a hole in the
memslot bindings in order to emulate x86's VGA hole.
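
A rough userspace sketch (sizes are placeholders; on x86 the relevant
alignment is the PMD hugepage size, typically 2MiB):

    struct kvm_create_guest_memfd gmem = {
        .size  = 512ull << 20,                    /* multiple of 2MiB */
        .flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
    };
    int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
    /* fails with -EINVAL if size isn't aligned to the max THP size */

Hugepages remain best effort at allocation time, and KVM reports order-0
whenever the hugepage would spill past the memslot's guest_memfd binding,
e.g. around a punched VGA hole.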

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst |  7 ++++
 include/uapi/linux/kvm.h       |  2 +
 virt/kvm/guest_memfd.c         | 73 ++++++++++++++++++++++++++++++----
 3 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e82c69d5e755..7f00c310c24a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6176,6 +6176,8 @@ and cannot be resized  (guest_memfd files do however support PUNCH_HOLE).
 	__u64 reserved[6];
   };
 
+  #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE         (1ULL << 0)
+
 Conceptually, the inode backing a guest_memfd file represents physical memory,
 i.e. is coupled to the virtual machine as a thing, not to a "struct kvm".  The
 file itself, which is bound to a "struct kvm", is that instance's view of the
@@ -6192,6 +6194,11 @@ most one mapping per page, i.e. binding multiple memory regions to a single
 guest_memfd range is not allowed (any number of memory regions can be bound to
 a single guest_memfd file, but the bound ranges must not overlap).
 
+If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate
+and map hugepages for the guest_memfd file.  This is currently best effort.  If
+KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, the size must be aligned to the maximum
+transparent hugepage size supported by the kernel.
+
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
 5. The kvm_run structure
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 25caee8d1a80..33d542de0a61 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2303,4 +2303,6 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE		(1ULL << 0)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 98a12da80214..94bc478c26f3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -13,14 +13,47 @@ struct kvm_gmem {
 	struct list_head entry;
 };
 
+static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long huge_index = round_down(index, HPAGE_PMD_NR);
+	unsigned long flags = (unsigned long)inode->i_private;
+	struct address_space *mapping  = inode->i_mapping;
+	gfp_t gfp = mapping_gfp_mask(mapping);
+	struct folio *folio;
+
+	if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE))
+		return NULL;
+
+	if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT,
+				   (huge_index + HPAGE_PMD_NR - 1) << PAGE_SHIFT))
+		return NULL;
+
+	folio = filemap_alloc_folio(gfp, HPAGE_PMD_ORDER);
+	if (!folio)
+		return NULL;
+
+	if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
+		folio_put(folio);
+		return NULL;
+	}
+
+	return folio;
+#else
+	return NULL;
+#endif
+}
+
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 {
 	struct folio *folio;
 
-	/* TODO: Support huge pages. */
-	folio = filemap_grab_folio(inode->i_mapping, index);
-	if (IS_ERR_OR_NULL(folio))
-		return NULL;
+	folio = kvm_gmem_get_huge_folio(inode, index);
+	if (!folio) {
+		folio = filemap_grab_folio(inode->i_mapping, index);
+		if (IS_ERR_OR_NULL(folio))
+			return NULL;
+	}
 
 	/*
 	 * Use the up-to-date flag to track whether or not the memory has been
@@ -373,6 +406,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	inode->i_mode |= S_IFREG;
 	inode->i_size = size;
 	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_large_folios(inode->i_mapping);
 	mapping_set_unmovable(inode->i_mapping);
 	/* Unmovable mappings are supposed to be marked unevictable as well. */
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
@@ -398,12 +432,21 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	u64 flags = args->flags;
 	u64 valid_flags = 0;
 
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
 	if (size < 0 || !PAGE_ALIGNED(size))
 		return -EINVAL;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
+		return -EINVAL;
+#endif
+
 	return __kvm_gmem_create(kvm, size, flags);
 }
 
@@ -501,7 +544,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
 {
-	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+	pgoff_t index, huge_index;
 	struct kvm_gmem *gmem;
 	struct folio *folio;
 	struct page *page;
@@ -514,6 +557,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 
 	gmem = file->private_data;
 
+	index = gfn - slot->base_gfn + slot->gmem.pgoff;
 	if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
 		r = -EIO;
 		goto out_fput;
@@ -533,9 +577,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	page = folio_file_page(folio, index);
 
 	*pfn = page_to_pfn(page);
-	if (max_order)
+	if (!max_order)
+		goto success;
+
+	*max_order = compound_order(compound_head(page));
+	if (!*max_order)
+		goto success;
+
+	/*
+	 * The folio can be mapped with a hugepage if and only if the folio is
+	 * fully contained by the range the memslot is bound to.  Note, the
+	 * caller is responsible for handling gfn alignment, this only deals
+	 * with the file binding.
+	 */
+	huge_index = ALIGN(index, 1ull << *max_order);
+	if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
+	    huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages)
 		*max_order = 0;
-
+success:
 	r = 0;
 
 out_unlock:
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (16 preceding siblings ...)
  2023-10-27 18:21 ` [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-30 17:31   ` Paolo Bonzini
  2023-11-02 14:16   ` Fuad Tabba
  2023-10-27 18:22 ` [PATCH v13 19/35] KVM: x86: Disallow hugepages when memory attributes are mixed Sean Christopherson
                   ` (17 subsequent siblings)
  35 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
the probability of exiting to userspace with a stale run->exit_reason that
*appears* to be valid.

To support fd-based guest memory (guest memory without a corresponding
userspace virtual address), KVM will exit to userspace for various memory
related errors, which userspace *may* be able to resolve, instead of using
e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
utilize the same functionality to let userspace "intercept" and handle
memory faults when the userspace mapping is missing, i.e. when fast gup()
fails.

Because many of KVM's internal APIs related to guest memory use '0' to
indicate "success, continue on" and not "exit to userspace", reporting
memory faults/errors to userspace will set run->exit_reason and
corresponding fields in the run structure in conjunction with a non-zero,
negative return code, e.g. -EFAULT or -EHWPOISON.  And because
KVM already returns -EFAULT in many paths, there's a relatively high
probability that KVM could return -EFAULT without setting run->exit_reason,
in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
whatever exit reason happened to be in the run structure.

Note, KVM must wait until after run->immediate_exit is serviced to
sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
preserved across KVM_RUN when run->immediate_exit is true.
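
A hypothetical VMM-side check, purely to illustrate the ABI (vcpu_fd, run,
and the handlers are placeholders, not taken from any real VMM):

    ret = ioctl(vcpu_fd, KVM_RUN, NULL);
    if (ret < 0) {
        if (run->exit_reason == KVM_EXIT_UNKNOWN)
            die("KVM_RUN failed with no exit info: %s", strerror(errno));
        else
            handle_error_exit(run);   /* e.g. a memory fault exit */
    }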

Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoorthy@google.com
Link: https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ee3cd8c3c0ef..f41dbb1465a0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10963,6 +10963,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 {
 	int r;
 
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
 	vcpu->arch.l1tf_flush_l1d = true;
 
 	for (;;) {
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 19/35] KVM: x86: Disallow hugepages when memory attributes are mixed
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (17 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-27 18:22 ` [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
                   ` (16 subsequent siblings)
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Disallow creating hugepages with mixed memory attributes, e.g. shared
versus private, as mapping a hugepage in this case would allow the guest
to access memory with the wrong attributes, e.g. overlaying private memory
with a shared hugepage.

Track whether or not attributes are mixed via the existing
disallow_lpage field, but use the most significant bit in 'disallow_lpage'
to indicate a hugepage has mixed attributes instead of using the normal
refcounting.  Whether or not attributes are mixed is binary; either they
are or they aren't.  Attempting to squeeze that info into the refcount is
unnecessarily complex as it would require knowing the previous state of
the mixed count when updating attributes.  Using a flag means KVM just
needs to ensure the current status is reflected in the memslots.
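
Conceptually, the encoding looks like this (a sketch of the scheme the
patch below implements via KVM_LPAGE_MIXED_FLAG):

    /*
     * disallow_lpage layout:
     *   bit  31    - memory attributes are mixed for this hugepage
     *   bits 30:0  - refcount of other reasons the hugepage is disallowed
     *
     * A hugepage is usable only if the entire word is zero.
     */
    bool hugepage_ok = !linfo->disallow_lpage;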

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/kvm/mmu/mmu.c          | 154 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   4 +
 3 files changed, 159 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 31e84668014e..8d60e4745e8b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1836,6 +1836,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu);
 void kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 
+void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
+					    struct kvm_memory_slot *slot);
+
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu);
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d33657d61d80..4167d557c577 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -795,16 +795,26 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
 	return &slot->arch.lpage_info[level - 2][idx];
 }
 
+/*
+ * The most significant bit in disallow_lpage tracks whether or not memory
+ * attributes are mixed, i.e. not identical for all gfns at the current level.
+ * The lower order bits are used to refcount other cases where a hugepage is
+ * disallowed, e.g. if KVM is shadowing a page table at the gfn.
+ */
+#define KVM_LPAGE_MIXED_FLAG	BIT(31)
+
 static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
 					    gfn_t gfn, int count)
 {
 	struct kvm_lpage_info *linfo;
-	int i;
+	int old, i;
 
 	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
 		linfo = lpage_info_slot(gfn, slot, i);
+
+		old = linfo->disallow_lpage;
 		linfo->disallow_lpage += count;
-		WARN_ON_ONCE(linfo->disallow_lpage < 0);
+		WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
 	}
 }
 
@@ -7161,3 +7171,143 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 	if (kvm->arch.nx_huge_page_recovery_thread)
 		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
+
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+				int level)
+{
+	return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_clear_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+				 int level)
+{
+	lpage_info_slot(gfn, slot, level)->disallow_lpage &= ~KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+			       int level)
+{
+	lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG;
+}
+
+static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
+			       gfn_t gfn, int level, unsigned long attrs)
+{
+	const unsigned long start = gfn;
+	const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
+
+	if (level == PG_LEVEL_2M)
+		return kvm_range_has_memory_attributes(kvm, start, end, attrs);
+
+	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
+		if (hugepage_test_mixed(slot, gfn, level - 1) ||
+		    attrs != kvm_get_memory_attributes(kvm, gfn))
+			return false;
+	}
+	return true;
+}
+
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+					 struct kvm_gfn_range *range)
+{
+	unsigned long attrs = range->arg.attributes;
+	struct kvm_memory_slot *slot = range->slot;
+	int level;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+	lockdep_assert_held(&kvm->slots_lock);
+
+	/*
+	 * Calculate which ranges can be mapped with hugepages even if the slot
+	 * can't map memory PRIVATE.  KVM mustn't create a SHARED hugepage over
+	 * a range that has PRIVATE GFNs, and conversely converting a range to
+	 * SHARED may now allow hugepages.
+	 */
+	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+		return false;
+
+	/*
+	 * The sequence matters here: upper levels consume the result of lower
+	 * level's scanning.
+	 */
+	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+		gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level);
+		gfn_t gfn = gfn_round_for_level(range->start, level);
+
+		/* Process the head page if it straddles the range. */
+		if (gfn != range->start || gfn + nr_pages > range->end) {
+			/*
+			 * Skip mixed tracking if the aligned gfn isn't covered
+			 * by the memslot, KVM can't use a hugepage due to the
+			 * misaligned address regardless of memory attributes.
+			 */
+			if (gfn >= slot->base_gfn) {
+				if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+					hugepage_clear_mixed(slot, gfn, level);
+				else
+					hugepage_set_mixed(slot, gfn, level);
+			}
+			gfn += nr_pages;
+		}
+
+		/*
+		 * Pages entirely covered by the range are guaranteed to have
+		 * only the attributes which were just set.
+		 */
+		for ( ; gfn + nr_pages <= range->end; gfn += nr_pages)
+			hugepage_clear_mixed(slot, gfn, level);
+
+		/*
+		 * Process the last tail page if it straddles the range and is
+		 * contained by the memslot.  Like the head page, KVM can't
+		 * create a hugepage if the slot size is misaligned.
+		 */
+		if (gfn < range->end &&
+		    (gfn + nr_pages) <= (slot->base_gfn + slot->npages)) {
+			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+				hugepage_clear_mixed(slot, gfn, level);
+			else
+				hugepage_set_mixed(slot, gfn, level);
+		}
+	}
+	return false;
+}
+
+void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
+					    struct kvm_memory_slot *slot)
+{
+	int level;
+
+	if (!kvm_arch_has_private_mem(kvm))
+		return;
+
+	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+		/*
+		 * Don't bother tracking mixed attributes for pages that can't
+		 * be huge due to alignment, i.e. process only pages that are
+		 * entirely contained by the memslot.
+		 */
+		gfn_t end = gfn_round_for_level(slot->base_gfn + slot->npages, level);
+		gfn_t start = gfn_round_for_level(slot->base_gfn, level);
+		gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level);
+		gfn_t gfn;
+
+		if (start < slot->base_gfn)
+			start += nr_pages;
+
+		/*
+		 * Unlike setting attributes, every potential hugepage needs to
+		 * be manually checked as the attributes may already be mixed.
+		 */
+		for (gfn = start; gfn < end; gfn += nr_pages) {
+			unsigned long attrs = kvm_get_memory_attributes(kvm, gfn);
+
+			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+				hugepage_clear_mixed(slot, gfn, level);
+			else
+				hugepage_set_mixed(slot, gfn, level);
+		}
+	}
+}
+#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f41dbb1465a0..824b58b44382 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12607,6 +12607,10 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 		}
 	}
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+	kvm_mmu_init_memslot_memory_attributes(kvm, slot);
+#endif
+
 	if (kvm_page_track_create_memslot(kvm, slot, npages))
 		goto out_free;
 
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (18 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 19/35] KVM: x86: Disallow hugepages when memory attributes are mixed Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-11-02 14:34   ` Fuad Tabba
  2023-11-05 13:02   ` Xu Yilun
  2023-10-27 18:22 ` [PATCH v13 21/35] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro Sean Christopherson
                   ` (15 subsequent siblings)
  35 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory.  For such VMs,
KVM_MEM_PRIVATE memslots can include both fd-based private memory and
hva-based shared memory, and KVM needs to map in the "correct" variant,
i.e. KVM needs to map the gfn shared/private as appropriate based on the
current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.

For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
shared vs. private via a bit in the guest page tables, i.e. what the guest
wants may conflict with the current memory attributes.  To support such
"implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
to forward the request to userspace.  Add a new flag for memory faults,
KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
map memory as shared vs. private.

Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
needs such information, e.g. a likely use of KVM_EXIT_MEMORY_FAULT is to
exit on missing mappings when handling guest page fault VM-Exits.  In
that case, userspace will want to know RWX information in order to
correctly/precisely resolve the fault.

Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
always come from the host userspace page tables, and private mappings
always come from a guest_memfd instance.

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
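For context, and not part of this patch: a minimal sketch of how a VMM might
consume the new exit and flag.  It assumes the KVM_SET_MEMORY_ATTRIBUTES ioctl
and struct kvm_memory_attributes introduced earlier in this series, and would
run in the error path of the KVM_RUN loop (the exit comes with -1/EFAULT).

    #include <err.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Flip the faulting range's attributes to match the guest's request. */
    static void handle_memory_fault_exit(int vm_fd, struct kvm_run *run)
    {
            struct kvm_memory_attributes attrs = {
                    .address = run->memory_fault.gpa,
                    .size = run->memory_fault.size,
                    .attributes = (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
                                  KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
            };

            if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
                    err(1, "KVM_SET_MEMORY_ATTRIBUTES");
    }

After updating the attributes, the VMM re-enters the vCPU and KVM retries the
fault with the new shared/private state.
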
 Documentation/virt/kvm/api.rst  |   8 ++-
 arch/x86/kvm/mmu/mmu.c          | 101 ++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/mmu_internal.h |   1 +
 include/linux/kvm_host.h        |   8 ++-
 include/uapi/linux/kvm.h        |   1 +
 5 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7f00c310c24a..38dc1fda4f45 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6837,6 +6837,7 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
 
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
 			__u64 flags;
 			__u64 gpa;
 			__u64 size;
@@ -6845,8 +6846,11 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
 KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
 could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
 guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
-describes properties of the faulting access that are likely pertinent.
-Currently, no flags are defined.
+describes properties of the faulting access that are likely pertinent:
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
+   on a private memory access.  When clear, indicates the fault occurred on a
+   shared access.
 
 Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
 accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4167d557c577..c4e758f0aebb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3147,9 +3147,9 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
 	return level;
 }
 
-int kvm_mmu_max_mapping_level(struct kvm *kvm,
-			      const struct kvm_memory_slot *slot, gfn_t gfn,
-			      int max_level)
+static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
+				       const struct kvm_memory_slot *slot,
+				       gfn_t gfn, int max_level, bool is_private)
 {
 	struct kvm_lpage_info *linfo;
 	int host_level;
@@ -3161,6 +3161,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			break;
 	}
 
+	if (is_private)
+		return max_level;
+
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
@@ -3168,6 +3171,16 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 	return min(host_level, max_level);
 }
 
+int kvm_mmu_max_mapping_level(struct kvm *kvm,
+			      const struct kvm_memory_slot *slot, gfn_t gfn,
+			      int max_level)
+{
+	bool is_private = kvm_slot_can_be_private(slot) &&
+			  kvm_mem_is_private(kvm, gfn);
+
+	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
+}
+
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
@@ -3188,8 +3201,9 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 * Enforce the iTLB multihit workaround after capturing the requested
 	 * level, which will be used to do precise, accurate accounting.
 	 */
-	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
-						     fault->gfn, fault->max_level);
+	fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
+						       fault->gfn, fault->max_level,
+						       fault->is_private);
 	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
 		return;
 
@@ -4261,6 +4275,55 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
 }
 
+static inline u8 kvm_max_level_for_order(int order)
+{
+	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+	KVM_MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) &&
+			order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) &&
+			order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K));
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+		return PG_LEVEL_1G;
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+		return PG_LEVEL_2M;
+
+	return PG_LEVEL_4K;
+}
+
+static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+					      struct kvm_page_fault *fault)
+{
+	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
+				      PAGE_SIZE, fault->write, fault->exec,
+				      fault->is_private);
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+				   struct kvm_page_fault *fault)
+{
+	int max_order, r;
+
+	if (!kvm_slot_can_be_private(fault->slot)) {
+		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+		return -EFAULT;
+	}
+
+	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
+			     &max_order);
+	if (r) {
+		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+		return r;
+	}
+
+	fault->max_level = min(kvm_max_level_for_order(max_order),
+			       fault->max_level);
+	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
+
+	return RET_PF_CONTINUE;
+}
+
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
@@ -4293,6 +4356,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			return RET_PF_EMULATE;
 	}
 
+	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+		return -EFAULT;
+	}
+
+	if (fault->is_private)
+		return kvm_faultin_pfn_private(vcpu, fault);
+
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
 					  fault->write, &fault->map_writable,
@@ -7173,6 +7244,26 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 }
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+					struct kvm_gfn_range *range)
+{
+	/*
+	 * Zap SPTEs even if the slot can't be mapped PRIVATE.  KVM x86 only
+	 * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
+	 * can simply ignore such slots.  But if userspace is making memory
+	 * PRIVATE, then KVM must prevent the guest from accessing the memory
+	 * as shared.  And if userspace is making memory SHARED and this point
+	 * is reached, then at least one page within the range was previously
+	 * PRIVATE, i.e. the slot's possible hugepage ranges are changing.
+	 * Zapping SPTEs in this case ensures KVM will reassess whether or not
+	 * a hugepage can be used for affected ranges.
+	 */
+	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+		return false;
+
+	return kvm_unmap_gfn_range(kvm, range);
+}
+
 static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
 				int level)
 {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index decc1f153669..86c7cb692786 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -201,6 +201,7 @@ struct kvm_page_fault {
 
 	/* Derived from mmu and global state.  */
 	const bool is_tdp;
+	const bool is_private;
 	const bool nx_huge_page_workaround_enabled;
 
 	/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7de93858054d..e3223cafd7db 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2358,14 +2358,18 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
 static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
-						 gpa_t gpa, gpa_t size)
+						 gpa_t gpa, gpa_t size,
+						 bool is_write, bool is_exec,
+						 bool is_private)
 {
 	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
 	vcpu->run->memory_fault.gpa = gpa;
 	vcpu->run->memory_fault.size = size;
 
-	/* Flags are not (yet) defined or communicated to userspace. */
+	/* RWX flags are not (yet) defined or communicated to userspace. */
 	vcpu->run->memory_fault.flags = 0;
+	if (is_private)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 33d542de0a61..29e9eb51dec9 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -527,6 +527,7 @@ struct kvm_run {
 		} notify;
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
 			__u64 flags;
 			__u64 gpa;
 			__u64 size;
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 21/35] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (19 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-11-02 14:35   ` Fuad Tabba
  2023-10-27 18:22 ` [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM Sean Christopherson
                   ` (14 subsequent siblings)
  35 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
KVM_ADDRESS_SPACE_NUM.

No functional change intended.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 1 -
 include/linux/kvm_host.h        | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8d60e4745e8b..6702f795c862 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2124,7 +2124,6 @@ enum {
 #define HF_SMM_MASK		(1 << 1)
 #define HF_SMM_INSIDE_NMI_MASK	(1 << 2)
 
-# define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
 # define KVM_ADDRESS_SPACE_NUM 2
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e3223cafd7db..c3cfe08b1300 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -692,7 +692,7 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
 #define KVM_MEM_SLOTS_NUM SHRT_MAX
 #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
 
-#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
+#if KVM_ADDRESS_SPACE_NUM == 1
 static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 {
 	return 0;
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (20 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 21/35] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-30 17:34   ` Paolo Bonzini
  2023-11-02 14:52   ` Fuad Tabba
  2023-10-27 18:22 ` [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory Sean Christopherson
                   ` (13 subsequent siblings)
  35 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs.  Confidential VMs are
fundamentally incompatible with emulating SMM, which, as the name suggests,
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
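As a userspace-visible illustration (not part of the diff): with this change,
querying KVM_CAP_MULTI_ADDRESS_SPACE on a VM fd reports the address space
count for that specific VM, while the system-wide query still reports the
architectural maximum.  A later patch in the series drops the per-VM count to
1 for VMs with private memory.

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static void report_address_spaces(int kvm_fd, int vm_fd)
    {
            /* System ioctl: architectural maximum (2 on x86, for SMM). */
            int max_as = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MULTI_ADDRESS_SPACE);
            /* VM ioctl: the count for this particular VM. */
            int vm_as = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MULTI_ADDRESS_SPACE);

            printf("max address spaces: %d, this VM: %d\n", max_as, vm_as);
    }
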
 arch/powerpc/kvm/book3s_hv.c    |  2 +-
 arch/x86/include/asm/kvm_host.h |  8 +++++++-
 arch/x86/kvm/debugfs.c          |  2 +-
 arch/x86/kvm/mmu/mmu.c          |  6 +++---
 arch/x86/kvm/x86.c              |  2 +-
 include/linux/kvm_host.h        | 17 +++++++++++------
 virt/kvm/dirty_ring.c           |  2 +-
 virt/kvm/kvm_main.c             | 26 ++++++++++++++------------
 8 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..9b0eaa17275a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6084,7 +6084,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
 	}
 
 	srcu_idx = srcu_read_lock(&kvm->srcu);
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		struct kvm_memory_slot *memslot;
 		struct kvm_memslots *slots = __kvm_memslots(kvm, i);
 		int bkt;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6702f795c862..f9e8d5642069 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2124,9 +2124,15 @@ enum {
 #define HF_SMM_MASK		(1 << 1)
 #define HF_SMM_INSIDE_NMI_MASK	(1 << 2)
 
-# define KVM_ADDRESS_SPACE_NUM 2
+# define KVM_MAX_NR_ADDRESS_SPACES	2
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
+
+static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
+{
+	return KVM_MAX_NR_ADDRESS_SPACES;
+}
+
 #else
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index ee8c4c3496ed..42026b3f3ff3 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -111,7 +111,7 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
 	mutex_lock(&kvm->slots_lock);
 	write_lock(&kvm->mmu_lock);
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		int bkt;
 
 		slots = __kvm_memslots(kvm, i);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c4e758f0aebb..baeba8fc1c38 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3755,7 +3755,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 	    kvm_page_track_write_tracking_enabled(kvm))
 		goto out_success;
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		slots = __kvm_memslots(kvm, i);
 		kvm_for_each_memslot(slot, bkt, slots) {
 			/*
@@ -6294,7 +6294,7 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
 	if (!kvm_memslots_have_rmaps(kvm))
 		return flush;
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		slots = __kvm_memslots(kvm, i);
 
 		kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
@@ -6791,7 +6791,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
 	 * modifier prior to checking for a wrap of the MMIO generation so
 	 * that a wrap in any address space is detected.
 	 */
-	gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1);
+	gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1);
 
 	/*
 	 * The very rare case: if the MMIO generation number has wrapped,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 824b58b44382..c4d17727b199 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12456,7 +12456,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 		hva = slot->userspace_addr;
 	}
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		struct kvm_userspace_memory_region2 m;
 
 		m.slot = id | (i << 16);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c3cfe08b1300..687589ce9f63 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -80,8 +80,8 @@
 /* Two fragments for cross MMIO pages. */
 #define KVM_MAX_MMIO_FRAGMENTS	2
 
-#ifndef KVM_ADDRESS_SPACE_NUM
-#define KVM_ADDRESS_SPACE_NUM	1
+#ifndef KVM_MAX_NR_ADDRESS_SPACES
+#define KVM_MAX_NR_ADDRESS_SPACES	1
 #endif
 
 /*
@@ -692,7 +692,12 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
 #define KVM_MEM_SLOTS_NUM SHRT_MAX
 #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
 
-#if KVM_ADDRESS_SPACE_NUM == 1
+#if KVM_MAX_NR_ADDRESS_SPACES == 1
+static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
+{
+	return KVM_MAX_NR_ADDRESS_SPACES;
+}
+
 static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 {
 	return 0;
@@ -747,9 +752,9 @@ struct kvm {
 	struct mm_struct *mm; /* userspace tied to this vm */
 	unsigned long nr_memslot_pages;
 	/* The two memslot sets - active and inactive (per address space) */
-	struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2];
+	struct kvm_memslots __memslots[KVM_MAX_NR_ADDRESS_SPACES][2];
 	/* The current active memslot set for each address space */
-	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
+	struct kvm_memslots __rcu *memslots[KVM_MAX_NR_ADDRESS_SPACES];
 	struct xarray vcpu_array;
 	/*
 	 * Protected by slots_lock, but can be read outside if an
@@ -1018,7 +1023,7 @@ void kvm_put_kvm_no_destroy(struct kvm *kvm);
 
 static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
 {
-	as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
+	as_id = array_index_nospec(as_id, KVM_MAX_NR_ADDRESS_SPACES);
 	return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
 			lockdep_is_held(&kvm->slots_lock) ||
 			!refcount_read(&kvm->users_count));
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index c1cd7dfe4a90..86d267db87bb 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -58,7 +58,7 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
 	as_id = slot >> 16;
 	id = (u16)slot;
 
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
 		return;
 
 	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5d1a2f1b4e94..23633984142f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -615,7 +615,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 
 	idx = srcu_read_lock(&kvm->srcu);
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		struct interval_tree_node *node;
 
 		slots = __kvm_memslots(kvm, i);
@@ -1248,7 +1248,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 		goto out_err_no_irq_srcu;
 
 	refcount_set(&kvm->users_count, 1);
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		for (j = 0; j < 2; j++) {
 			slots = &kvm->__memslots[i][j];
 
@@ -1398,7 +1398,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 #endif
 	kvm_arch_destroy_vm(kvm);
 	kvm_destroy_devices(kvm);
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
 		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
 	}
@@ -1681,7 +1681,7 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
 	 * space 0 will use generations 0, 2, 4, ... while address space 1 will
 	 * use generations 1, 3, 5, ...
 	 */
-	gen += KVM_ADDRESS_SPACE_NUM;
+	gen += kvm_arch_nr_memslot_as_ids(kvm);
 
 	kvm_arch_memslots_updated(kvm, gen);
 
@@ -2051,7 +2051,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	    (mem->guest_memfd_offset & (PAGE_SIZE - 1) ||
 	     mem->guest_memfd_offset + mem->memory_size < mem->guest_memfd_offset))
 		return -EINVAL;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
 		return -EINVAL;
@@ -2187,7 +2187,7 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
 
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
 
 	slots = __kvm_memslots(kvm, as_id);
@@ -2249,7 +2249,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
 
 	slots = __kvm_memslots(kvm, as_id);
@@ -2361,7 +2361,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
-	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
 
 	if (log->first_page & 63)
@@ -2502,7 +2502,7 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 	gfn_range.only_private = false;
 	gfn_range.only_shared = false;
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		slots = __kvm_memslots(kvm, i);
 
 		kvm_for_each_memslot_in_gfn_range(&iter, slots, range->start, range->end) {
@@ -4857,9 +4857,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_IRQ_ROUTING:
 		return KVM_MAX_IRQ_ROUTES;
 #endif
-#if KVM_ADDRESS_SPACE_NUM > 1
+#if KVM_MAX_NR_ADDRESS_SPACES > 1
 	case KVM_CAP_MULTI_ADDRESS_SPACE:
-		return KVM_ADDRESS_SPACE_NUM;
+		if (kvm)
+			return kvm_arch_nr_memslot_as_ids(kvm);
+		return KVM_MAX_NR_ADDRESS_SPACES;
 #endif
 	case KVM_CAP_NR_MEMSLOTS:
 		return KVM_USER_MEM_SLOTS;
@@ -4967,7 +4969,7 @@ bool kvm_are_all_memslots_empty(struct kvm *kvm)
 
 	lockdep_assert_held(&kvm->slots_lock);
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		if (!kvm_memslots_empty(__kvm_memslots(kvm, i)))
 			return false;
 	}
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (21 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-30 17:36   ` Paolo Bonzini
  2023-11-06 11:00   ` Fuad Tabba
  2023-10-27 18:22 ` [PATCH v13 24/35] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper Sean Christopherson
                   ` (12 subsequent siblings)
  35 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that isn't universally accessible.  I.e. relying on SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
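A rough userspace sketch (not part of the patch) of how the new capability and
VM type are intended to be consumed; the bit layout follows the
KVM_CAP_VM_TYPES documentation added below.

    #include <err.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int create_sw_protected_vm(void)
    {
            int kvm_fd = open("/dev/kvm", O_RDWR);
            int types, vm_fd;

            if (kvm_fd < 0)
                    err(1, "/dev/kvm");

            /* Bit N set => VM type N is supported. */
            types = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);
            if (!(types & (1 << KVM_X86_SW_PROTECTED_VM)))
                    errx(1, "KVM_X86_SW_PROTECTED_VM not supported");

            vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_SW_PROTECTED_VM);
            if (vm_fd < 0)
                    err(1, "KVM_CREATE_VM");

            return vm_fd;
    }
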
 Documentation/virt/kvm/api.rst  | 32 ++++++++++++++++++++++++++++++++
 arch/x86/include/asm/kvm_host.h | 15 +++++++++------
 arch/x86/include/uapi/asm/kvm.h |  3 +++
 arch/x86/kvm/Kconfig            | 12 ++++++++++++
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 arch/x86/kvm/x86.c              | 16 +++++++++++++++-
 include/uapi/linux/kvm.h        |  1 +
 virt/kvm/Kconfig                |  5 +++++
 8 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 38dc1fda4f45..00029436ac5b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,10 +147,29 @@ described as 'basic' will be available.
 The new VM has no virtual cpus and no memory.
 You probably want to use 0 as machine type.
 
+X86:
+^^^^
+
+Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
+
+S390:
+^^^^^
+
 In order to create user controlled virtual machines on S390, check
 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
 privileged user (CAP_SYS_ADMIN).
 
+MIPS:
+^^^^^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^^^^^
+
 On arm64, the physical address size for a VM (IPA Size limit) is limited
 to 40bits by default. The limit can be configured if the host supports the
 extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -8650,6 +8669,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
 64-bit bitmap (each bit describing a block size). The default value is
 0, to disable the eager page splitting.
 
+8.41 KVM_CAP_VM_TYPES
+---------------------
+
+:Capability: KVM_CAP_VM_TYPES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of supported VM types.  The 1-setting of bit @n
+means the VM type with value @n is supported.  Possible values of @n are::
+
+  #define KVM_X86_DEFAULT_VM	0
+  #define KVM_X86_SW_PROTECTED_VM	1
+
 9. Known KVM API problems
 =========================
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f9e8d5642069..dff10051e9b6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1244,6 +1244,7 @@ enum kvm_apicv_inhibit {
 };
 
 struct kvm_arch {
+	unsigned long vm_type;
 	unsigned long n_used_mmu_pages;
 	unsigned long n_requested_mmu_pages;
 	unsigned long n_max_mmu_pages;
@@ -2077,6 +2078,12 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
 void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 		       int tdp_max_root_level, int tdp_huge_page_level);
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != KVM_X86_DEFAULT_VM)
+#else
+#define kvm_arch_has_private_mem(kvm) false
+#endif
+
 static inline u16 kvm_read_ldt(void)
 {
 	u16 ldt;
@@ -2125,14 +2132,10 @@ enum {
 #define HF_SMM_INSIDE_NMI_MASK	(1 << 2)
 
 # define KVM_MAX_NR_ADDRESS_SPACES	2
+/* SMM is currently unsupported for guests with private memory. */
+# define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 2)
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
-
-static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
-{
-	return KVM_MAX_NR_ADDRESS_SPACES;
-}
-
 #else
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 1a6a1f987949..a448d0964fc0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -562,4 +562,7 @@ struct kvm_pmu_event_filter {
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE	BIT(0)
 
+#define KVM_X86_DEFAULT_VM	0
+#define KVM_X86_SW_PROTECTED_VM	1
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 091b74599c22..8452ed0228cb 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -77,6 +77,18 @@ config KVM_WERROR
 
 	  If in doubt, say "N".
 
+config KVM_SW_PROTECTED_VM
+	bool "Enable support for KVM software-protected VMs"
+	depends on EXPERT
+	depends on X86_64
+	select KVM_GENERIC_PRIVATE_MEM
+	help
+	  Enable support for KVM software-protected VMs.  Currently "protected"
+	  means the VM can be backed with memory provided by
+	  KVM_CREATE_GUEST_MEMFD.
+
+	  If unsure, say "N".
+
 config KVM_INTEL
 	tristate "KVM for Intel (and compatible) processors support"
 	depends on KVM && IA32_FEAT_CTL
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 86c7cb692786..b66a7d47e0e4 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -297,6 +297,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
 		.req_level = PG_LEVEL_4K,
 		.goal_level = PG_LEVEL_4K,
+		.is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
 	};
 	int r;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c4d17727b199..e3eb608b6692 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4441,6 +4441,13 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+static bool kvm_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_DEFAULT_VM ||
+	       (type == KVM_X86_SW_PROTECTED_VM &&
+		IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
+}
+
 int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 {
 	int r = 0;
@@ -4632,6 +4639,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_NOTIFY_VMEXIT:
 		r = kvm_caps.has_notify_vmexit;
 		break;
+	case KVM_CAP_VM_TYPES:
+		r = BIT(KVM_X86_DEFAULT_VM);
+		if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
+			r |= BIT(KVM_X86_SW_PROTECTED_VM);
+		break;
 	default:
 		break;
 	}
@@ -12314,9 +12326,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	int ret;
 	unsigned long flags;
 
-	if (type)
+	if (!kvm_is_vm_type_supported(type))
 		return -EINVAL;
 
+	kvm->arch.vm_type = type;
+
 	ret = kvm_page_track_init(kvm);
 	if (ret)
 		goto out;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 29e9eb51dec9..5b5820d19e71 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1218,6 +1218,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_MEMORY_FAULT_INFO 231
 #define KVM_CAP_MEMORY_ATTRIBUTES 232
 #define KVM_CAP_GUEST_MEMFD 233
+#define KVM_CAP_VM_TYPES 234
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 08afef022db9..2c964586aa14 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -104,3 +104,8 @@ config KVM_GENERIC_MEMORY_ATTRIBUTES
 config KVM_PRIVATE_MEM
        select XARRAY_MULTI
        bool
+
+config KVM_GENERIC_PRIVATE_MEM
+       select KVM_GENERIC_MEMORY_ATTRIBUTES
+       select KVM_PRIVATE_MEM
+       bool
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 24/35] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (22 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-27 18:22 ` [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
                   ` (11 subsequent siblings)
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Drop kvm_userspace_memory_region_find(); it's unused and a terrible API
(probably why it's unused).  If anything outside of kvm_util.c needs to
get at the memslot, userspace_mem_region_find() can be exposed to give
others full access to all memory region/slot information.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/include/kvm_util_base.h     |  4 ---
 tools/testing/selftests/kvm/lib/kvm_util.c    | 29 -------------------
 2 files changed, 33 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index a18db6a7b3cf..967eaaeacd75 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -776,10 +776,6 @@ vm_adjust_num_guest_pages(enum vm_guest_mode mode, unsigned int num_guest_pages)
 	return n;
 }
 
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-				 uint64_t end);
-
 #define sync_global_to_guest(vm, g) ({				\
 	typeof(g) *_p = addr_gva2hva(vm, (vm_vaddr_t)&(g));	\
 	memcpy(_p, &(g), sizeof(g));				\
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 7a8af1821f5d..f09295d56c23 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -590,35 +590,6 @@ userspace_mem_region_find(struct kvm_vm *vm, uint64_t start, uint64_t end)
 	return NULL;
 }
 
-/*
- * KVM Userspace Memory Region Find
- *
- * Input Args:
- *   vm - Virtual Machine
- *   start - Starting VM physical address
- *   end - Ending VM physical address, inclusive.
- *
- * Output Args: None
- *
- * Return:
- *   Pointer to overlapping region, NULL if no such region.
- *
- * Public interface to userspace_mem_region_find. Allows tests to look up
- * the memslot datastructure for a given range of guest physical memory.
- */
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-				 uint64_t end)
-{
-	struct userspace_mem_region *region;
-
-	region = userspace_mem_region_find(vm, start, end);
-	if (!region)
-		return NULL;
-
-	return &region->region;
-}
-
 __weak void vcpu_arch_free(struct kvm_vcpu *vcpu)
 {
 
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (23 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 24/35] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2024-04-25 14:12   ` Dan Carpenter
  2023-10-27 18:22 ` [PATCH v13 26/35] KVM: selftests: Add support for creating private memslots Sean Christopherson
                   ` (10 subsequent siblings)
  35 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
support for guest private memory can be added without needing an entirely
separate set of helpers.

Note, this obviously makes selftests backwards-incompatible with older KVM
versions from this point forward.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
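For reference (not part of the patch), the converted ioctl calls reduce to the
sketch below.  The guest_memfd fields are left zeroed and should be ignored by
KVM for memslots that don't set KVM_MEM_PRIVATE, which is why existing tests
keep working unchanged on kernels that provide the new ioctl.

    #include <err.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static void set_region2(int vm_fd, __u32 slot, __u64 gpa, __u64 size, void *hva)
    {
            struct kvm_userspace_memory_region2 region = {
                    .slot = slot,
                    .guest_phys_addr = gpa,
                    .memory_size = size,
                    .userspace_addr = (__u64)(unsigned long)hva,
                    /* guest_memfd/guest_memfd_offset stay zero without KVM_MEM_PRIVATE. */
            };

            if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region))
                    err(1, "KVM_SET_USER_MEMORY_REGION2");
    }
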
 .../selftests/kvm/include/kvm_util_base.h     |  2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    | 19 ++++++++++---------
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 967eaaeacd75..9f144841c2ee 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -44,7 +44,7 @@ typedef uint64_t vm_paddr_t; /* Virtual Machine (Guest) physical address */
 typedef uint64_t vm_vaddr_t; /* Virtual Machine (Guest) virtual address */
 
 struct userspace_mem_region {
-	struct kvm_userspace_memory_region region;
+	struct kvm_userspace_memory_region2 region;
 	struct sparsebit *unused_phy_pages;
 	int fd;
 	off_t offset;
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index f09295d56c23..3676b37bea38 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -453,8 +453,9 @@ void kvm_vm_restart(struct kvm_vm *vmp)
 		vm_create_irqchip(vmp);
 
 	hash_for_each(vmp->regions.slot_hash, ctr, region, slot_node) {
-		int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION, &region->region);
-		TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+		int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION2, &region->region);
+
+		TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
 			    "  rc: %i errno: %i\n"
 			    "  slot: %u flags: 0x%x\n"
 			    "  guest_phys_addr: 0x%llx size: 0x%llx",
@@ -657,7 +658,7 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
 	}
 
 	region->region.memory_size = 0;
-	vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+	vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
 	sparsebit_free(&region->unused_phy_pages);
 	ret = munmap(region->mmap_start, region->mmap_size);
@@ -1014,8 +1015,8 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	region->region.guest_phys_addr = guest_paddr;
 	region->region.memory_size = npages * vm->page_size;
 	region->region.userspace_addr = (uintptr_t) region->host_mem;
-	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
-	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
+	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
 		"  rc: %i errno: %i\n"
 		"  slot: %u flags: 0x%x\n"
 		"  guest_phys_addr: 0x%lx size: 0x%lx",
@@ -1097,9 +1098,9 @@ void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags)
 
 	region->region.flags = flags;
 
-	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
-	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
 		"  rc: %i errno: %i slot: %u flags: 0x%x",
 		ret, errno, slot, flags);
 }
@@ -1127,9 +1128,9 @@ void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa)
 
 	region->region.guest_phys_addr = new_gpa;
 
-	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
-	TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION failed\n"
+	TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed\n"
 		    "ret: %i errno: %i slot: %u new_gpa: 0x%lx",
 		    ret, errno, slot, new_gpa);
 }
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 26/35] KVM: selftests: Add support for creating private memslots
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (24 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-27 18:22 ` [PATCH v13 27/35] KVM: selftests: Add helpers to convert guest memory b/w private and shared Sean Christopherson
                   ` (9 subsequent siblings)
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add support for creating "private" memslots via KVM_CREATE_GUEST_MEMFD and
KVM_SET_USER_MEMORY_REGION2.  Make vm_userspace_mem_region_add() a wrapper
to its effective replacement, vm_mem_add(), so that private memslots are
fully opt-in, i.e. don't require updating all tests that add memory regions.

Pivot on the KVM_MEM_PRIVATE flag instead of the validity of the "gmem"
file descriptor so that simple tests can let vm_mem_add() do the heavy
lifting of creating the guest memfd, but also allow the caller to pass in
an explicit fd+offset so that fancier tests can do things like back
multiple memslots with a single file.  If the caller passes in a fd, dup()
the fd so that (a) __vm_mem_region_delete() can close the fd associated
with the memory region without needing yet another flag, and (b) so that
the caller can safely close its copy of the fd without having to first
destroy memslots.

Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
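A hypothetical usage sketch of the fd+offset path described above (the vm,
gpa0/gpa1, slot0/slot1 and npages variables are assumed to be defined by the
test): two private memslots backed by a single guest_memfd.  Because
vm_mem_add() dup()s the passed-in fd, the caller can close its copy right away.

    uint64_t size = npages * vm->page_size;
    int gmem = vm_create_guest_memfd(vm, 2 * size, 0);

    vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, gpa0, slot0, npages,
               KVM_MEM_PRIVATE, gmem, 0);
    vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, gpa1, slot1, npages,
               KVM_MEM_PRIVATE, gmem, size);

    /* Each memslot holds its own dup()'d fd, so this is safe. */
    close(gmem);
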
 .../selftests/kvm/include/kvm_util_base.h     | 23 +++++
 .../testing/selftests/kvm/include/test_util.h |  5 ++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 85 ++++++++++++-------
 3 files changed, 82 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 9f144841c2ee..9f861182c02a 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -431,6 +431,26 @@ static inline uint64_t vm_get_stat(struct kvm_vm *vm, const char *stat_name)
 
 void vm_create_irqchip(struct kvm_vm *vm);
 
+static inline int __vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+					uint64_t flags)
+{
+	struct kvm_create_guest_memfd guest_memfd = {
+		.size = size,
+		.flags = flags,
+	};
+
+	return __vm_ioctl(vm, KVM_CREATE_GUEST_MEMFD, &guest_memfd);
+}
+
+static inline int vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+					uint64_t flags)
+{
+	int fd = __vm_create_guest_memfd(vm, size, flags);
+
+	TEST_ASSERT(fd >= 0, KVM_IOCTL_ERROR(KVM_CREATE_GUEST_MEMFD, fd));
+	return fd;
+}
+
 void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 			       uint64_t gpa, uint64_t size, void *hva);
 int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
@@ -439,6 +459,9 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	enum vm_mem_backing_src_type src_type,
 	uint64_t guest_paddr, uint32_t slot, uint64_t npages,
 	uint32_t flags);
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+		uint32_t flags, int guest_memfd_fd, uint64_t guest_memfd_offset);
 
 void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags);
 void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa);
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 7e614adc6cf4..7257f2243ab9 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -142,6 +142,11 @@ static inline bool backing_src_is_shared(enum vm_mem_backing_src_type t)
 	return vm_mem_backing_src_alias(t)->flag & MAP_SHARED;
 }
 
+static inline bool backing_src_can_be_huge(enum vm_mem_backing_src_type t)
+{
+	return t != VM_MEM_SRC_ANONYMOUS && t != VM_MEM_SRC_SHMEM;
+}
+
 /* Aligns x up to the next multiple of size. Size must be a power of 2. */
 static inline uint64_t align_up(uint64_t x, uint64_t size)
 {
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 3676b37bea38..45050f54701a 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -669,6 +669,8 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
 		TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret));
 		close(region->fd);
 	}
+	if (region->region.guest_memfd >= 0)
+		close(region->region.guest_memfd);
 
 	free(region);
 }
@@ -870,36 +872,15 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 		    errno, strerror(errno));
 }
 
-/*
- * VM Userspace Memory Region Add
- *
- * Input Args:
- *   vm - Virtual Machine
- *   src_type - Storage source for this region.
- *              NULL to use anonymous memory.
- *   guest_paddr - Starting guest physical address
- *   slot - KVM region slot
- *   npages - Number of physical pages
- *   flags - KVM memory region flags (e.g. KVM_MEM_LOG_DIRTY_PAGES)
- *
- * Output Args: None
- *
- * Return: None
- *
- * Allocates a memory area of the number of pages specified by npages
- * and maps it to the VM specified by vm, at a starting physical address
- * given by guest_paddr.  The region is created with a KVM region slot
- * given by slot, which must be unique and < KVM_MEM_SLOTS_NUM.  The
- * region is created with the flags given by flags.
- */
-void vm_userspace_mem_region_add(struct kvm_vm *vm,
-	enum vm_mem_backing_src_type src_type,
-	uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-	uint32_t flags)
+/* FIXME: This thing needs to be ripped apart and rewritten. */
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+		uint32_t flags, int guest_memfd, uint64_t guest_memfd_offset)
 {
 	int ret;
 	struct userspace_mem_region *region;
 	size_t backing_src_pagesz = get_backing_src_pagesz(src_type);
+	size_t mem_size = npages * vm->page_size;
 	size_t alignment;
 
 	TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
@@ -952,7 +933,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	/* Allocate and initialize new mem region structure. */
 	region = calloc(1, sizeof(*region));
 	TEST_ASSERT(region != NULL, "Insufficient Memory");
-	region->mmap_size = npages * vm->page_size;
+	region->mmap_size = mem_size;
 
 #ifdef __s390x__
 	/* On s390x, the host address must be aligned to 1M (due to PGSTEs) */
@@ -999,14 +980,47 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	/* As needed perform madvise */
 	if ((src_type == VM_MEM_SRC_ANONYMOUS ||
 	     src_type == VM_MEM_SRC_ANONYMOUS_THP) && thp_configured()) {
-		ret = madvise(region->host_mem, npages * vm->page_size,
+		ret = madvise(region->host_mem, mem_size,
 			      src_type == VM_MEM_SRC_ANONYMOUS ? MADV_NOHUGEPAGE : MADV_HUGEPAGE);
 		TEST_ASSERT(ret == 0, "madvise failed, addr: %p length: 0x%lx src_type: %s",
-			    region->host_mem, npages * vm->page_size,
+			    region->host_mem, mem_size,
 			    vm_mem_backing_src_alias(src_type)->name);
 	}
 
 	region->backing_src_type = src_type;
+
+	if (flags & KVM_MEM_PRIVATE) {
+		if (guest_memfd < 0) {
+			uint32_t guest_memfd_flags = 0;
+
+			/*
+			 * Allow hugepages for the guest memfd backing if the
+			 * "normal" backing is allowed/required to be huge.
+			 */
+			if (src_type != VM_MEM_SRC_ANONYMOUS &&
+			    src_type != VM_MEM_SRC_SHMEM)
+				guest_memfd_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+
+			TEST_ASSERT(!guest_memfd_offset,
+				    "Offset must be zero when creating new guest_memfd");
+			guest_memfd = vm_create_guest_memfd(vm, mem_size, guest_memfd_flags);
+		} else {
+			/*
+			 * Install a unique fd for each memslot so that the fd
+			 * can be closed when the region is deleted without
+			 * needing to track if the fd is owned by the framework
+			 * or by the caller.
+			 */
+			guest_memfd = dup(guest_memfd);
+			TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd));
+		}
+
+		region->region.guest_memfd = guest_memfd;
+		region->region.guest_memfd_offset = guest_memfd_offset;
+	} else {
+		region->region.guest_memfd = -1;
+	}
+
 	region->unused_phy_pages = sparsebit_alloc();
 	sparsebit_set_num(region->unused_phy_pages,
 		guest_paddr >> vm->page_shift, npages);
@@ -1019,9 +1033,10 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
 		"  rc: %i errno: %i\n"
 		"  slot: %u flags: 0x%x\n"
-		"  guest_phys_addr: 0x%lx size: 0x%lx",
+		"  guest_phys_addr: 0x%lx size: 0x%lx guest_memfd: %d\n",
 		ret, errno, slot, flags,
-		guest_paddr, (uint64_t) region->region.memory_size);
+		guest_paddr, (uint64_t) region->region.memory_size,
+		region->region.guest_memfd);
 
 	/* Add to quick lookup data structures */
 	vm_userspace_mem_region_gpa_insert(&vm->regions.gpa_tree, region);
@@ -1042,6 +1057,14 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	}
 }
 
+void vm_userspace_mem_region_add(struct kvm_vm *vm,
+				 enum vm_mem_backing_src_type src_type,
+				 uint64_t guest_paddr, uint32_t slot,
+				 uint64_t npages, uint32_t flags)
+{
+	vm_mem_add(vm, src_type, guest_paddr, slot, npages, flags, -1, 0);
+}
+
 /*
  * Memslot to region
  *
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 27/35] KVM: selftests: Add helpers to convert guest memory b/w private and shared
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (25 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 26/35] KVM: selftests: Add support for creating private memslots Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-11-06 11:26   ` Fuad Tabba
  2023-10-27 18:22 ` [PATCH v13 28/35] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86) Sean Christopherson
                   ` (8 subsequent siblings)
  35 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Vishal Annapurve <vannapurve@google.com>

Add helpers to convert memory between private and shared via KVM's
memory attributes, as well as helpers to free/allocate guest_memfd memory
via fallocate().  Userspace, i.e. tests, is NOT required to do fallocate()
when converting memory, as the attributes are the single source of truth.
Provide allocate() helpers so that tests can mimic a userspace that frees
private memory on conversion, e.g. to prioritize memory usage over
performance.
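
As a rough sketch of intended usage (a hypothetical test that already has a
struct kvm_vm *vm with a KVM_MEM_PRIVATE memslot covering the range; the GPA
and size below are illustrative, not taken from this patch):

	uint64_t gpa = 1ull << 32;
	uint64_t size = SZ_2M;

	/* Flip the range to private; the attributes alone are sufficient. */
	vm_mem_set_private(vm, gpa, size);

	/*
	 * Convert back to shared and, optionally, mimic a userspace that
	 * frees the now-unused private backing via PUNCH_HOLE.
	 */
	vm_mem_set_shared(vm, gpa, size);
	vm_guest_mem_punch_hole(vm, gpa, size);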

Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/include/kvm_util_base.h     | 48 +++++++++++++++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 28 +++++++++++
 2 files changed, 76 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 9f861182c02a..1441fca6c273 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -333,6 +333,54 @@ static inline void vm_enable_cap(struct kvm_vm *vm, uint32_t cap, uint64_t arg0)
 	vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
 }
 
+static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
+					    uint64_t size, uint64_t attributes)
+{
+	struct kvm_memory_attributes attr = {
+		.attributes = attributes,
+		.address = gpa,
+		.size = size,
+		.flags = 0,
+	};
+
+	/*
+	 * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes.  These flows
+	 * need significant enhancements to support multiple attributes.
+	 */
+	TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
+		    "Update me to support multiple attributes!");
+
+	vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
+}
+
+
+static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa,
+				      uint64_t size)
+{
+	vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
+				     uint64_t size)
+{
+	vm_set_memory_attributes(vm, gpa, size, 0);
+}
+
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
+			    bool punch_hole);
+
+static inline void vm_guest_mem_punch_hole(struct kvm_vm *vm, uint64_t gpa,
+					   uint64_t size)
+{
+	vm_guest_mem_fallocate(vm, gpa, size, true);
+}
+
+static inline void vm_guest_mem_allocate(struct kvm_vm *vm, uint64_t gpa,
+					 uint64_t size)
+{
+	vm_guest_mem_fallocate(vm, gpa, size, false);
+}
+
 void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
 const char *vm_guest_mode_string(uint32_t i);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 45050f54701a..a140aee8d0f5 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1176,6 +1176,34 @@ void vm_mem_region_delete(struct kvm_vm *vm, uint32_t slot)
 	__vm_mem_region_delete(vm, memslot2region(vm, slot), true);
 }
 
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
+			    bool punch_hole)
+{
+	const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
+	struct userspace_mem_region *region;
+	uint64_t end = base + size;
+	uint64_t gpa, len;
+	off_t fd_offset;
+	int ret;
+
+	for (gpa = base; gpa < end; gpa += len) {
+		uint64_t offset;
+
+		region = userspace_mem_region_find(vm, gpa, gpa);
+		TEST_ASSERT(region && region->region.flags & KVM_MEM_PRIVATE,
+			    "Private memory region not found for GPA 0x%lx", gpa);
+
+		offset = (gpa - region->region.guest_phys_addr);
+		fd_offset = region->region.guest_memfd_offset + offset;
+		len = min_t(uint64_t, end - gpa, region->region.memory_size - offset);
+
+		ret = fallocate(region->region.guest_memfd, mode, fd_offset, len);
+		TEST_ASSERT(!ret, "fallocate() failed to %s at %lx (len = %lu), fd = %d, mode = %x, offset = %lx\n",
+			    punch_hole ? "punch hole" : "allocate", gpa, len,
+			    region->region.guest_memfd, mode, fd_offset);
+	}
+}
+
 /* Returns the size of a vCPU's kvm_run structure. */
 static int vcpu_mmap_sz(void)
 {
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 28/35] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86)
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (26 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 27/35] KVM: selftests: Add helpers to convert guest memory b/w private and shared Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-27 18:22 ` [PATCH v13 29/35] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type Sean Christopherson
                   ` (7 subsequent siblings)
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Vishal Annapurve <vannapurve@google.com>

Add helpers for x86 guests to invoke the KVM_HC_MAP_GPA_RANGE hypercall,
which KVM will forward to userspace and thus can be used by tests to
coordinate private<=>shared conversions between host userspace code and
guest code.
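
A minimal guest-side sketch (the GPA, size, and flags are illustrative; the
flags value is whatever protocol a test agrees on with its host-side handler,
since KVM forwards the hypercall to userspace verbatim):

	uint64_t gpa = 1ull << 32;
	uint64_t size = SZ_2M;
	uint64_t flags = 0;	/* test-defined, e.g. "convert to shared" */

	/* Asserting wrapper: the guest fails the test on a non-zero return. */
	kvm_hypercall_map_gpa_range(gpa, size, flags);

	/* Raw variant for when the guest wants to check the return itself. */
	GUEST_ASSERT(!__kvm_hypercall_map_gpa_range(gpa, size, flags));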

Signed-off-by: Vishal Annapurve <vannapurve@google.com>
[sean: drop shared/private helpers (let tests specify flags)]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/include/x86_64/processor.h      | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 25bc61dac5fb..a84863503fcb 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -15,6 +15,7 @@
 #include <asm/msr-index.h>
 #include <asm/prctl.h>
 
+#include <linux/kvm_para.h>
 #include <linux/stringify.h>
 
 #include "../kvm_util.h"
@@ -1194,6 +1195,20 @@ uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, uint64_t a1, uint64_t a2,
 uint64_t __xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
 void xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
 
+static inline uint64_t __kvm_hypercall_map_gpa_range(uint64_t gpa,
+						     uint64_t size, uint64_t flags)
+{
+	return kvm_hypercall(KVM_HC_MAP_GPA_RANGE, gpa, size >> PAGE_SHIFT, flags, 0);
+}
+
+static inline void kvm_hypercall_map_gpa_range(uint64_t gpa, uint64_t size,
+					       uint64_t flags)
+{
+	uint64_t ret = __kvm_hypercall_map_gpa_range(gpa, size, flags);
+
+	GUEST_ASSERT(!ret);
+}
+
 void __vm_xsave_require_permission(uint64_t xfeature, const char *name);
 
 #define vm_xsave_require_permission(xfeature)	\
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 29/35] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (27 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 28/35] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86) Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-27 18:22 ` [PATCH v13 30/35] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data Sean Christopherson
                   ` (6 subsequent siblings)
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add a "vm_shape" structure to encapsulate the selftests-defined "mode",
along with the KVM-defined "type" for use when creating a new VM.  "mode"
tracks physical and virtual address properties, as well as the preferred
backing memory type, while "type" corresponds to the VM type.

Taking the VM type will allow adding tests for KVM_CREATE_GUEST_MEMFD,
a.k.a. guest private memory, without needing an entirely separate set of
helpers.  Guest private memory is effectively usable only by confidential
VM types, and it's expected that x86 will double down and require unique
VM types for TDX and SNP guests.
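
A sketch of the resulting usage (guest_code stands in for a test's guest entry
point; KVM_X86_SW_PROTECTED_VM is the x86 type added earlier in this series):

	struct kvm_vcpu *vcpu;
	struct kvm_vm *vm;

	/* Existing callers are unchanged, e.g. wrap the mode in VM_SHAPE(). */
	vm = __vm_create(VM_SHAPE(VM_MODE_DEFAULT), 1, 0);
	kvm_vm_free(vm);

	/* New tests can also specify the KVM VM type. */
	struct vm_shape shape = {
		.mode = VM_MODE_DEFAULT,
		.type = KVM_X86_SW_PROTECTED_VM,
	};

	vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);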

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c  |  2 +-
 .../selftests/kvm/include/kvm_util_base.h     | 54 +++++++++++++++----
 .../selftests/kvm/kvm_page_table_test.c       |  2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    | 43 +++++++--------
 tools/testing/selftests/kvm/lib/memstress.c   |  3 +-
 .../kvm/x86_64/ucna_injection_test.c          |  2 +-
 6 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 936f3a8d1b83..6cbecf499767 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -699,7 +699,7 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, struct kvm_vcpu **vcpu,
 
 	pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode));
 
-	vm = __vm_create(mode, 1, extra_mem_pages);
+	vm = __vm_create(VM_SHAPE(mode), 1, extra_mem_pages);
 
 	log_mode_create_vm_done(vm);
 	*vcpu = vm_vcpu_add(vm, 0, guest_code);
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 1441fca6c273..157508c071f3 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -188,6 +188,23 @@ enum vm_guest_mode {
 	NUM_VM_MODES,
 };
 
+struct vm_shape {
+	enum vm_guest_mode mode;
+	unsigned int type;
+};
+
+#define VM_TYPE_DEFAULT			0
+
+#define VM_SHAPE(__mode)			\
+({						\
+	struct vm_shape shape = {		\
+		.mode = (__mode),		\
+		.type = VM_TYPE_DEFAULT		\
+	};					\
+						\
+	shape;					\
+})
+
 #if defined(__aarch64__)
 
 extern enum vm_guest_mode vm_mode_default;
@@ -220,6 +237,8 @@ extern enum vm_guest_mode vm_mode_default;
 
 #endif
 
+#define VM_SHAPE_DEFAULT	VM_SHAPE(VM_MODE_DEFAULT)
+
 #define MIN_PAGE_SIZE		(1U << MIN_PAGE_SHIFT)
 #define PTES_PER_MIN_PAGE	ptes_per_page(MIN_PAGE_SIZE)
 
@@ -784,21 +803,21 @@ vm_paddr_t vm_alloc_page_table(struct kvm_vm *vm);
  * __vm_create() does NOT create vCPUs, @nr_runnable_vcpus is used purely to
  * calculate the amount of memory needed for per-vCPU data, e.g. stacks.
  */
-struct kvm_vm *____vm_create(enum vm_guest_mode mode);
-struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
+struct kvm_vm *____vm_create(struct vm_shape shape);
+struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
 			   uint64_t nr_extra_pages);
 
 static inline struct kvm_vm *vm_create_barebones(void)
 {
-	return ____vm_create(VM_MODE_DEFAULT);
+	return ____vm_create(VM_SHAPE_DEFAULT);
 }
 
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
 {
-	return __vm_create(VM_MODE_DEFAULT, nr_runnable_vcpus, 0);
+	return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
 }
 
-struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
+struct kvm_vm *__vm_create_with_vcpus(struct vm_shape shape, uint32_t nr_vcpus,
 				      uint64_t extra_mem_pages,
 				      void *guest_code, struct kvm_vcpu *vcpus[]);
 
@@ -806,17 +825,27 @@ static inline struct kvm_vm *vm_create_with_vcpus(uint32_t nr_vcpus,
 						  void *guest_code,
 						  struct kvm_vcpu *vcpus[])
 {
-	return __vm_create_with_vcpus(VM_MODE_DEFAULT, nr_vcpus, 0,
+	return __vm_create_with_vcpus(VM_SHAPE_DEFAULT, nr_vcpus, 0,
 				      guest_code, vcpus);
 }
 
+
+struct kvm_vm *__vm_create_shape_with_one_vcpu(struct vm_shape shape,
+					       struct kvm_vcpu **vcpu,
+					       uint64_t extra_mem_pages,
+					       void *guest_code);
+
 /*
  * Create a VM with a single vCPU with reasonable defaults and @extra_mem_pages
  * additional pages of guest memory.  Returns the VM and vCPU (via out param).
  */
-struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
-					 uint64_t extra_mem_pages,
-					 void *guest_code);
+static inline struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
+						       uint64_t extra_mem_pages,
+						       void *guest_code)
+{
+	return __vm_create_shape_with_one_vcpu(VM_SHAPE_DEFAULT, vcpu,
+					       extra_mem_pages, guest_code);
+}
 
 static inline struct kvm_vm *vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
 						     void *guest_code)
@@ -824,6 +853,13 @@ static inline struct kvm_vm *vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
 	return __vm_create_with_one_vcpu(vcpu, 0, guest_code);
 }
 
+static inline struct kvm_vm *vm_create_shape_with_one_vcpu(struct vm_shape shape,
+							   struct kvm_vcpu **vcpu,
+							   void *guest_code)
+{
+	return __vm_create_shape_with_one_vcpu(shape, vcpu, 0, guest_code);
+}
+
 struct kvm_vcpu *vm_recreate_with_one_vcpu(struct kvm_vm *vm);
 
 void kvm_pin_this_task_to_pcpu(uint32_t pcpu);
diff --git a/tools/testing/selftests/kvm/kvm_page_table_test.c b/tools/testing/selftests/kvm/kvm_page_table_test.c
index 69f26d80c821..e37dc9c21888 100644
--- a/tools/testing/selftests/kvm/kvm_page_table_test.c
+++ b/tools/testing/selftests/kvm/kvm_page_table_test.c
@@ -254,7 +254,7 @@ static struct kvm_vm *pre_init_before_test(enum vm_guest_mode mode, void *arg)
 
 	/* Create a VM with enough guest pages */
 	guest_num_pages = test_mem_size / guest_page_size;
-	vm = __vm_create_with_vcpus(mode, nr_vcpus, guest_num_pages,
+	vm = __vm_create_with_vcpus(VM_SHAPE(mode), nr_vcpus, guest_num_pages,
 				    guest_code, test_args.vcpus);
 
 	/* Align down GPA of the testing memslot */
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index a140aee8d0f5..52b131e3aca5 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -209,7 +209,7 @@ __weak void vm_vaddr_populate_bitmap(struct kvm_vm *vm)
 		(1ULL << (vm->va_bits - 1)) >> vm->page_shift);
 }
 
-struct kvm_vm *____vm_create(enum vm_guest_mode mode)
+struct kvm_vm *____vm_create(struct vm_shape shape)
 {
 	struct kvm_vm *vm;
 
@@ -221,13 +221,13 @@ struct kvm_vm *____vm_create(enum vm_guest_mode mode)
 	vm->regions.hva_tree = RB_ROOT;
 	hash_init(vm->regions.slot_hash);
 
-	vm->mode = mode;
-	vm->type = 0;
+	vm->mode = shape.mode;
+	vm->type = shape.type;
 
-	vm->pa_bits = vm_guest_mode_params[mode].pa_bits;
-	vm->va_bits = vm_guest_mode_params[mode].va_bits;
-	vm->page_size = vm_guest_mode_params[mode].page_size;
-	vm->page_shift = vm_guest_mode_params[mode].page_shift;
+	vm->pa_bits = vm_guest_mode_params[vm->mode].pa_bits;
+	vm->va_bits = vm_guest_mode_params[vm->mode].va_bits;
+	vm->page_size = vm_guest_mode_params[vm->mode].page_size;
+	vm->page_shift = vm_guest_mode_params[vm->mode].page_shift;
 
 	/* Setup mode specific traits. */
 	switch (vm->mode) {
@@ -265,7 +265,7 @@ struct kvm_vm *____vm_create(enum vm_guest_mode mode)
 		/*
 		 * Ignore KVM support for 5-level paging (vm->va_bits == 57),
 		 * it doesn't take effect unless a CR4.LA57 is set, which it
-		 * isn't for this VM_MODE.
+		 * isn't for this mode (48-bit virtual address space).
 		 */
 		TEST_ASSERT(vm->va_bits == 48 || vm->va_bits == 57,
 			    "Linear address width (%d bits) not supported",
@@ -285,10 +285,11 @@ struct kvm_vm *____vm_create(enum vm_guest_mode mode)
 		vm->pgtable_levels = 5;
 		break;
 	default:
-		TEST_FAIL("Unknown guest mode, mode: 0x%x", mode);
+		TEST_FAIL("Unknown guest mode: 0x%x", vm->mode);
 	}
 
 #ifdef __aarch64__
+	TEST_ASSERT(!vm->type, "ARM doesn't support test-provided types");
 	if (vm->pa_bits != 40)
 		vm->type = KVM_VM_TYPE_ARM_IPA_SIZE(vm->pa_bits);
 #endif
@@ -347,19 +348,19 @@ static uint64_t vm_nr_pages_required(enum vm_guest_mode mode,
 	return vm_adjust_num_guest_pages(mode, nr_pages);
 }
 
-struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
+struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
 			   uint64_t nr_extra_pages)
 {
-	uint64_t nr_pages = vm_nr_pages_required(mode, nr_runnable_vcpus,
+	uint64_t nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
 						 nr_extra_pages);
 	struct userspace_mem_region *slot0;
 	struct kvm_vm *vm;
 	int i;
 
-	pr_debug("%s: mode='%s' pages='%ld'\n", __func__,
-		 vm_guest_mode_string(mode), nr_pages);
+	pr_debug("%s: mode='%s' type='%d', pages='%ld'\n", __func__,
+		 vm_guest_mode_string(shape.mode), shape.type, nr_pages);
 
-	vm = ____vm_create(mode);
+	vm = ____vm_create(shape);
 
 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 0);
 	for (i = 0; i < NR_MEM_REGIONS; i++)
@@ -400,7 +401,7 @@ struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
  * extra_mem_pages is only used to calculate the maximum page table size,
  * no real memory allocation for non-slot0 memory in this function.
  */
-struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
+struct kvm_vm *__vm_create_with_vcpus(struct vm_shape shape, uint32_t nr_vcpus,
 				      uint64_t extra_mem_pages,
 				      void *guest_code, struct kvm_vcpu *vcpus[])
 {
@@ -409,7 +410,7 @@ struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus
 
 	TEST_ASSERT(!nr_vcpus || vcpus, "Must provide vCPU array");
 
-	vm = __vm_create(mode, nr_vcpus, extra_mem_pages);
+	vm = __vm_create(shape, nr_vcpus, extra_mem_pages);
 
 	for (i = 0; i < nr_vcpus; ++i)
 		vcpus[i] = vm_vcpu_add(vm, i, guest_code);
@@ -417,15 +418,15 @@ struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus
 	return vm;
 }
 
-struct kvm_vm *__vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
-					 uint64_t extra_mem_pages,
-					 void *guest_code)
+struct kvm_vm *__vm_create_shape_with_one_vcpu(struct vm_shape shape,
+					       struct kvm_vcpu **vcpu,
+					       uint64_t extra_mem_pages,
+					       void *guest_code)
 {
 	struct kvm_vcpu *vcpus[1];
 	struct kvm_vm *vm;
 
-	vm = __vm_create_with_vcpus(VM_MODE_DEFAULT, 1, extra_mem_pages,
-				    guest_code, vcpus);
+	vm = __vm_create_with_vcpus(shape, 1, extra_mem_pages, guest_code, vcpus);
 
 	*vcpu = vcpus[0];
 	return vm;
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index df457452d146..d05487e5a371 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -168,7 +168,8 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 	 * The memory is also added to memslot 0, but that's a benign side
 	 * effect as KVM allows aliasing HVAs in memslots.
 	 */
-	vm = __vm_create_with_vcpus(mode, nr_vcpus, slot0_pages + guest_num_pages,
+	vm = __vm_create_with_vcpus(VM_SHAPE(mode), nr_vcpus,
+				    slot0_pages + guest_num_pages,
 				    memstress_guest_code, vcpus);
 
 	args->vm = vm;
diff --git a/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c b/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c
index 85f34ca7e49e..0ed32ec903d0 100644
--- a/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c
+++ b/tools/testing/selftests/kvm/x86_64/ucna_injection_test.c
@@ -271,7 +271,7 @@ int main(int argc, char *argv[])
 
 	kvm_check_cap(KVM_CAP_MCE);
 
-	vm = __vm_create(VM_MODE_DEFAULT, 3, 0);
+	vm = __vm_create(VM_SHAPE_DEFAULT, 3, 0);
 
 	kvm_ioctl(vm->kvm_fd, KVM_X86_GET_MCE_CAP_SUPPORTED,
 		  &supported_mcg_caps);
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 30/35] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (28 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 29/35] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-27 18:22 ` [PATCH v13 31/35] KVM: selftests: Add x86-only selftest for private memory conversions Sean Christopherson
                   ` (5 subsequent siblings)
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Add GUEST_SYNC[1-6]() so that tests can pass the maximum amount of
information supported via ucall(), without needing to resort to shared
memory.
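
For example, a sketch of a guest/host pair (SYNC_PRIVATE, gpa, size, and
pattern are test-defined names; the private memory conversion test later in
this series uses exactly this pattern):

	/* Guest: pass a command plus three payload values in a single ucall. */
	GUEST_SYNC4(SYNC_PRIVATE, gpa, size, pattern);

	/* Host: pull the values back out via get_ucall(). */
	struct ucall uc;

	if (get_ucall(vcpu, &uc) == UCALL_SYNC) {
		uint64_t cmd = uc.args[0];
		uint64_t gpa = uc.args[1];
		uint64_t size = uc.args[2];
		uint8_t pattern = uc.args[3];

		/* ... act on the test-defined command ... */
	}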

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/include/ucall_common.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/ucall_common.h b/tools/testing/selftests/kvm/include/ucall_common.h
index ce33d306c2cb..0fb472a5a058 100644
--- a/tools/testing/selftests/kvm/include/ucall_common.h
+++ b/tools/testing/selftests/kvm/include/ucall_common.h
@@ -52,6 +52,17 @@ int ucall_nr_pages_required(uint64_t page_size);
 #define GUEST_SYNC_ARGS(stage, arg1, arg2, arg3, arg4)	\
 				ucall(UCALL_SYNC, 6, "hello", stage, arg1, arg2, arg3, arg4)
 #define GUEST_SYNC(stage)	ucall(UCALL_SYNC, 2, "hello", stage)
+#define GUEST_SYNC1(arg0)	ucall(UCALL_SYNC, 1, arg0)
+#define GUEST_SYNC2(arg0, arg1)	ucall(UCALL_SYNC, 2, arg0, arg1)
+#define GUEST_SYNC3(arg0, arg1, arg2) \
+				ucall(UCALL_SYNC, 3, arg0, arg1, arg2)
+#define GUEST_SYNC4(arg0, arg1, arg2, arg3) \
+				ucall(UCALL_SYNC, 4, arg0, arg1, arg2, arg3)
+#define GUEST_SYNC5(arg0, arg1, arg2, arg3, arg4) \
+				ucall(UCALL_SYNC, 5, arg0, arg1, arg2, arg3, arg4)
+#define GUEST_SYNC6(arg0, arg1, arg2, arg3, arg4, arg5) \
+				ucall(UCALL_SYNC, 6, arg0, arg1, arg2, arg3, arg4, arg5)
+
 #define GUEST_PRINTF(_fmt, _args...) ucall_fmt(UCALL_PRINTF, _fmt, ##_args)
 #define GUEST_DONE()		ucall(UCALL_DONE, 0)
 
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 31/35] KVM: selftests: Add x86-only selftest for private memory conversions
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (29 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 30/35] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-27 18:22 ` [PATCH v13 32/35] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper Sean Christopherson
                   ` (4 subsequent siblings)
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Vishal Annapurve <vannapurve@google.com>

Add a selftest to exercise implicit/explicit conversion functionality
within KVM and verify:

 - Shared memory is visible to host userspace
 - Private memory is not visible to host userspace
 - Host userspace and guest can communicate over shared memory
 - Data in shared backing is preserved across conversions (test's
   host userspace doesn't free the data)
 - Private memory is bound to the lifetime of the VM

Ideally, KVM's selftests infrastructure would be reworked to allow backing
a single region of guest memory with multiple memslots for _all_ backing
types and shapes, i.e. ideally the code for using a single backing fd
across multiple memslots would work for "regular" memory as well.  But
sadly, support for KVM_CREATE_GUEST_MEMFD has languished for far too long,
and overhauling selftests' memslots infrastructure would likely open a can
of worms, i.e. delay things even further.

In addition to the more obvious tests, verify that PUNCH_HOLE actually
frees memory.  Directly verifying that KVM frees memory is impractical, if
it's even possible, so instead indirectly verify memory is freed by
asserting that the guest reads zeroes after a PUNCH_HOLE.  E.g. if KVM
zaps SPTEs but doesn't actually punch a hole in the inode, the subsequent
read will still see the previous value.  And obviously punching a hole
shouldn't cause explosions.

Let the user specify the number of memslots in the private mem conversion
test, i.e. don't require the number of memslots to be '1' or "nr_vcpus".
Creating more memslots than vCPUs is particularly interesting, e.g. it can
result in a single KVM_SET_MEMORY_ATTRIBUTES spanning multiple memslots.
To keep the math reasonable, align each vCPU's chunk to at least 2MiB (the
size is 2MiB+4KiB), and require the total size to be cleanly divisible by
the number of memslots.  The goal is to be able to validate that KVM plays
nice with multiple memslots; being able to create a truly arbitrary number
of memslots doesn't add meaningful value, i.e. isn't worth the cost.
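
For example (illustrative numbers): with 2 vCPUs, 4 memslots, and 4KiB backing
pages, each vCPU's chunk is align_up(2MiB + 4KiB, 2MiB) = 4MiB, the guest_memfd
spans 2 * 4MiB = 8MiB, and each memslot covers 8MiB / 4 = 2MiB, which divides
cleanly as required.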

Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.

Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../kvm/x86_64/private_mem_conversions_test.c | 487 ++++++++++++++++++
 2 files changed, 488 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index a3bb36fb3cfc..b709a52d5cdb 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -81,6 +81,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
 TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
new file mode 100644
index 000000000000..be311944e90a
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -0,0 +1,487 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <fcntl.h>
+#include <limits.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+#include <linux/kvm_para.h>
+#include <linux/memfd.h>
+#include <linux/sizes.h>
+
+#include <test_util.h>
+#include <kvm_util.h>
+#include <processor.h>
+
+#define BASE_DATA_SLOT		10
+#define BASE_DATA_GPA		((uint64_t)(1ull << 32))
+#define PER_CPU_DATA_SIZE	((uint64_t)(SZ_2M + PAGE_SIZE))
+
+/* Horrific macro so that the line info is captured accurately :-( */
+#define memcmp_g(gpa, pattern,  size)								\
+do {												\
+	uint8_t *mem = (uint8_t *)gpa;								\
+	size_t i;										\
+												\
+	for (i = 0; i < size; i++)								\
+		__GUEST_ASSERT(mem[i] == pattern,						\
+			       "Guest expected 0x%x at offset %lu (gpa 0x%llx), got 0x%x",	\
+			       pattern, i, gpa + i, mem[i]);					\
+} while (0)
+
+static void memcmp_h(uint8_t *mem, uint64_t gpa, uint8_t pattern, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		TEST_ASSERT(mem[i] == pattern,
+			    "Host expected 0x%x at gpa 0x%lx, got 0x%x",
+			    pattern, gpa + i, mem[i]);
+}
+
+/*
+ * Run memory conversion tests with explicit conversion:
+ * Execute KVM hypercall to map/unmap gpa range which will cause userspace exit
+ * to back/unback private memory. Subsequent accesses by guest to the gpa range
+ * will not cause exit to userspace.
+ *
+ * Test memory conversion scenarios with following steps:
+ * 1) Access private memory using private access and verify that memory contents
+ *   are not visible to userspace.
+ * 2) Convert memory to shared using explicit conversions and ensure that
+ *   userspace is able to access the shared regions.
+ * 3) Convert memory back to private using explicit conversions and ensure that
+ *   userspace is again not able to access converted private regions.
+ */
+
+#define GUEST_STAGE(o, s) { .offset = o, .size = s }
+
+enum ucall_syncs {
+	SYNC_SHARED,
+	SYNC_PRIVATE,
+};
+
+static void guest_sync_shared(uint64_t gpa, uint64_t size,
+			      uint8_t current_pattern, uint8_t new_pattern)
+{
+	GUEST_SYNC5(SYNC_SHARED, gpa, size, current_pattern, new_pattern);
+}
+
+static void guest_sync_private(uint64_t gpa, uint64_t size, uint8_t pattern)
+{
+	GUEST_SYNC4(SYNC_PRIVATE, gpa, size, pattern);
+}
+
+/* Arbitrary values, KVM doesn't care about the attribute flags. */
+#define MAP_GPA_SET_ATTRIBUTES	BIT(0)
+#define MAP_GPA_SHARED		BIT(1)
+#define MAP_GPA_DO_FALLOCATE	BIT(2)
+
+static void guest_map_mem(uint64_t gpa, uint64_t size, bool map_shared,
+			  bool do_fallocate)
+{
+	uint64_t flags = MAP_GPA_SET_ATTRIBUTES;
+
+	if (map_shared)
+		flags |= MAP_GPA_SHARED;
+	if (do_fallocate)
+		flags |= MAP_GPA_DO_FALLOCATE;
+	kvm_hypercall_map_gpa_range(gpa, size, flags);
+}
+
+static void guest_map_shared(uint64_t gpa, uint64_t size, bool do_fallocate)
+{
+	guest_map_mem(gpa, size, true, do_fallocate);
+}
+
+static void guest_map_private(uint64_t gpa, uint64_t size, bool do_fallocate)
+{
+	guest_map_mem(gpa, size, false, do_fallocate);
+}
+
+struct {
+	uint64_t offset;
+	uint64_t size;
+} static const test_ranges[] = {
+	GUEST_STAGE(0, PAGE_SIZE),
+	GUEST_STAGE(0, SZ_2M),
+	GUEST_STAGE(PAGE_SIZE, PAGE_SIZE),
+	GUEST_STAGE(PAGE_SIZE, SZ_2M),
+	GUEST_STAGE(SZ_2M, PAGE_SIZE),
+};
+
+static void guest_test_explicit_conversion(uint64_t base_gpa, bool do_fallocate)
+{
+	const uint8_t def_p = 0xaa;
+	const uint8_t init_p = 0xcc;
+	uint64_t j;
+	int i;
+
+	/* Memory should be shared by default. */
+	memset((void *)base_gpa, def_p, PER_CPU_DATA_SIZE);
+	memcmp_g(base_gpa, def_p, PER_CPU_DATA_SIZE);
+	guest_sync_shared(base_gpa, PER_CPU_DATA_SIZE, def_p, init_p);
+
+	memcmp_g(base_gpa, init_p, PER_CPU_DATA_SIZE);
+
+	for (i = 0; i < ARRAY_SIZE(test_ranges); i++) {
+		uint64_t gpa = base_gpa + test_ranges[i].offset;
+		uint64_t size = test_ranges[i].size;
+		uint8_t p1 = 0x11;
+		uint8_t p2 = 0x22;
+		uint8_t p3 = 0x33;
+		uint8_t p4 = 0x44;
+
+		/*
+		 * Set the test region to pattern one to differentiate it from
+		 * the data range as a whole (contains the initial pattern).
+		 */
+		memset((void *)gpa, p1, size);
+
+		/*
+		 * Convert to private, set and verify the private data, and
+		 * then verify that the rest of the data (map shared) still
+		 * holds the initial pattern, and that the host always sees the
+		 * shared memory (initial pattern).  Unlike shared memory,
+		 * punching a hole in private memory is destructive, i.e.
+		 * previous values aren't guaranteed to be preserved.
+		 */
+		guest_map_private(gpa, size, do_fallocate);
+
+		if (size > PAGE_SIZE) {
+			memset((void *)gpa, p2, PAGE_SIZE);
+			goto skip;
+		}
+
+		memset((void *)gpa, p2, size);
+		guest_sync_private(gpa, size, p1);
+
+		/*
+		 * Verify that the private memory was set to pattern two, and
+		 * that shared memory still holds the initial pattern.
+		 */
+		memcmp_g(gpa, p2, size);
+		if (gpa > base_gpa)
+			memcmp_g(base_gpa, init_p, gpa - base_gpa);
+		if (gpa + size < base_gpa + PER_CPU_DATA_SIZE)
+			memcmp_g(gpa + size, init_p,
+				 (base_gpa + PER_CPU_DATA_SIZE) - (gpa + size));
+
+		/*
+		 * Convert odd-number page frames back to shared to verify KVM
+		 * also correctly handles holes in private ranges.
+		 */
+		for (j = 0; j < size; j += PAGE_SIZE) {
+			if ((j >> PAGE_SHIFT) & 1) {
+				guest_map_shared(gpa + j, PAGE_SIZE, do_fallocate);
+				guest_sync_shared(gpa + j, PAGE_SIZE, p1, p3);
+
+				memcmp_g(gpa + j, p3, PAGE_SIZE);
+			} else {
+				guest_sync_private(gpa + j, PAGE_SIZE, p1);
+			}
+		}
+
+skip:
+		/*
+		 * Convert the entire region back to shared, explicitly write
+		 * pattern three to fill in the even-number frames before
+		 * asking the host to verify (and write pattern four).
+		 */
+		guest_map_shared(gpa, size, do_fallocate);
+		memset((void *)gpa, p3, size);
+		guest_sync_shared(gpa, size, p3, p4);
+		memcmp_g(gpa, p4, size);
+
+		/* Reset the shared memory back to the initial pattern. */
+		memset((void *)gpa, init_p, size);
+
+		/*
+		 * Free (via PUNCH_HOLE) *all* private memory so that the next
+		 * iteration starts from a clean slate, e.g. with respect to
+		 * whether or not there are pages/folios in guest_mem.
+		 */
+		guest_map_shared(base_gpa, PER_CPU_DATA_SIZE, true);
+	}
+}
+
+static void guest_punch_hole(uint64_t gpa, uint64_t size)
+{
+	/* "Mapping" memory shared via fallocate() is done via PUNCH_HOLE. */
+	uint64_t flags = MAP_GPA_SHARED | MAP_GPA_DO_FALLOCATE;
+
+	kvm_hypercall_map_gpa_range(gpa, size, flags);
+}
+
+/*
+ * Test that PUNCH_HOLE actually frees memory by punching holes without doing a
+ * proper conversion.  Freeing (PUNCH_HOLE) should zap SPTEs, and reallocating
+ * (subsequent fault) should zero memory.
+ */
+static void guest_test_punch_hole(uint64_t base_gpa, bool precise)
+{
+	const uint8_t init_p = 0xcc;
+	int i;
+
+	/*
+	 * Convert the entire range to private, this testcase is all about
+	 * punching holes in guest_memfd, i.e. shared mappings aren't needed.
+	 */
+	guest_map_private(base_gpa, PER_CPU_DATA_SIZE, false);
+
+	for (i = 0; i < ARRAY_SIZE(test_ranges); i++) {
+		uint64_t gpa = base_gpa + test_ranges[i].offset;
+		uint64_t size = test_ranges[i].size;
+
+		/*
+		 * Free all memory before each iteration, even for the !precise
+		 * case where the memory will be faulted back in.  Freeing and
+		 * reallocating should obviously work, and freeing all memory
+		 * minimizes the probability of cross-testcase influence.
+		 */
+		guest_punch_hole(base_gpa, PER_CPU_DATA_SIZE);
+
+		/* Fault-in and initialize memory, and verify the pattern. */
+		if (precise) {
+			memset((void *)gpa, init_p, size);
+			memcmp_g(gpa, init_p, size);
+		} else {
+			memset((void *)base_gpa, init_p, PER_CPU_DATA_SIZE);
+			memcmp_g(base_gpa, init_p, PER_CPU_DATA_SIZE);
+		}
+
+		/*
+		 * Punch a hole at the target range and verify that reads from
+		 * the guest succeed and return zeroes.
+		 */
+		guest_punch_hole(gpa, size);
+		memcmp_g(gpa, 0, size);
+	}
+}
+
+static void guest_code(uint64_t base_gpa)
+{
+	/*
+	 * Run the conversion test twice, with and without doing fallocate() on
+	 * the guest_memfd backing when converting between shared and private.
+	 */
+	guest_test_explicit_conversion(base_gpa, false);
+	guest_test_explicit_conversion(base_gpa, true);
+
+	/*
+	 * Run the PUNCH_HOLE test twice too, once with the entire guest_memfd
+	 * faulted in, once with only the target range faulted in.
+	 */
+	guest_test_punch_hole(base_gpa, false);
+	guest_test_punch_hole(base_gpa, true);
+	GUEST_DONE();
+}
+
+static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
+{
+	struct kvm_run *run = vcpu->run;
+	uint64_t gpa = run->hypercall.args[0];
+	uint64_t size = run->hypercall.args[1] * PAGE_SIZE;
+	bool set_attributes = run->hypercall.args[2] & MAP_GPA_SET_ATTRIBUTES;
+	bool map_shared = run->hypercall.args[2] & MAP_GPA_SHARED;
+	bool do_fallocate = run->hypercall.args[2] & MAP_GPA_DO_FALLOCATE;
+	struct kvm_vm *vm = vcpu->vm;
+
+	TEST_ASSERT(run->hypercall.nr == KVM_HC_MAP_GPA_RANGE,
+		    "Wanted MAP_GPA_RANGE (%u), got '%llu'",
+		    KVM_HC_MAP_GPA_RANGE, run->hypercall.nr);
+
+	if (do_fallocate)
+		vm_guest_mem_fallocate(vm, gpa, size, map_shared);
+
+	if (set_attributes)
+		vm_set_memory_attributes(vm, gpa, size,
+					 map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE);
+	run->hypercall.ret = 0;
+}
+
+static bool run_vcpus;
+
+static void *__test_mem_conversions(void *__vcpu)
+{
+	struct kvm_vcpu *vcpu = __vcpu;
+	struct kvm_run *run = vcpu->run;
+	struct kvm_vm *vm = vcpu->vm;
+	struct ucall uc;
+
+	while (!READ_ONCE(run_vcpus))
+		;
+
+	for ( ;; ) {
+		vcpu_run(vcpu);
+
+		if (run->exit_reason == KVM_EXIT_HYPERCALL) {
+			handle_exit_hypercall(vcpu);
+			continue;
+		}
+
+		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+			    "Wanted KVM_EXIT_IO, got exit reason: %u (%s)",
+			    run->exit_reason, exit_reason_str(run->exit_reason));
+
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+		case UCALL_SYNC: {
+			uint64_t gpa  = uc.args[1];
+			size_t size = uc.args[2];
+			size_t i;
+
+			TEST_ASSERT(uc.args[0] == SYNC_SHARED ||
+				    uc.args[0] == SYNC_PRIVATE,
+				    "Unknown sync command '%ld'", uc.args[0]);
+
+			for (i = 0; i < size; i += vm->page_size) {
+				size_t nr_bytes = min_t(size_t, vm->page_size, size - i);
+				uint8_t *hva = addr_gpa2hva(vm, gpa + i);
+
+				/* In all cases, the host should observe the shared data. */
+				memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
+
+				/* For shared, write the new pattern to guest memory. */
+				if (uc.args[0] == SYNC_SHARED)
+					memset(hva, uc.args[4], nr_bytes);
+			}
+			break;
+		}
+		case UCALL_DONE:
+			return NULL;
+		default:
+			TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+		}
+	}
+}
+
+static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t nr_vcpus,
+				 uint32_t nr_memslots)
+{
+	/*
+	 * Allocate enough memory so that each vCPU's chunk of memory can be
+	 * naturally aligned with respect to the size of the backing store.
+	 */
+	const size_t alignment = max_t(size_t, SZ_2M, get_backing_src_pagesz(src_type));
+	const size_t per_cpu_size = align_up(PER_CPU_DATA_SIZE, alignment);
+	const size_t memfd_size = per_cpu_size * nr_vcpus;
+	const size_t slot_size = memfd_size / nr_memslots;
+	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
+	pthread_t threads[KVM_MAX_VCPUS];
+	uint64_t memfd_flags;
+	struct kvm_vm *vm;
+	int memfd, i, r;
+
+	const struct vm_shape shape = {
+		.mode = VM_MODE_DEFAULT,
+		.type = KVM_X86_SW_PROTECTED_VM,
+	};
+
+	TEST_ASSERT(slot_size * nr_memslots == memfd_size,
+		    "The memfd size (0x%lx) needs to be cleanly divisible by the number of memslots (%u)",
+		    memfd_size, nr_memslots);
+	vm = __vm_create_with_vcpus(shape, nr_vcpus, 0, guest_code, vcpus);
+
+	vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
+
+	if (backing_src_can_be_huge(src_type))
+		memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+	else
+		memfd_flags = 0;
+	memfd = vm_create_guest_memfd(vm, memfd_size, memfd_flags);
+
+	for (i = 0; i < nr_memslots; i++)
+		vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
+			   BASE_DATA_SLOT + i, slot_size / vm->page_size,
+			   KVM_MEM_PRIVATE, memfd, slot_size * i);
+
+	for (i = 0; i < nr_vcpus; i++) {
+		uint64_t gpa =  BASE_DATA_GPA + i * per_cpu_size;
+
+		vcpu_args_set(vcpus[i], 1, gpa);
+
+		/*
+		 * Map only what is needed so that an out-of-bounds access
+		 * results in a #PF => SHUTDOWN instead of data corruption.
+		 */
+		virt_map(vm, gpa, gpa, PER_CPU_DATA_SIZE / vm->page_size);
+
+		pthread_create(&threads[i], NULL, __test_mem_conversions, vcpus[i]);
+	}
+
+	WRITE_ONCE(run_vcpus, true);
+
+	for (i = 0; i < nr_vcpus; i++)
+		pthread_join(threads[i], NULL);
+
+	kvm_vm_free(vm);
+
+	/*
+	 * Allocate and free memory from the guest_memfd after closing the VM
+	 * fd.  The guest_memfd is gifted a reference to its owning VM, i.e.
+	 * should prevent the VM from being fully destroyed until the last
+	 * reference to the guest_memfd is also put.
+	 */
+	r = fallocate(memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, memfd_size);
+	TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("fallocate()", r));
+
+	r = fallocate(memfd, FALLOC_FL_KEEP_SIZE, 0, memfd_size);
+	TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("fallocate()", r));
+}
+
+static void usage(const char *cmd)
+{
+	puts("");
+	printf("usage: %s [-h] [-m nr_memslots] [-s mem_type] [-n nr_vcpus]\n", cmd);
+	puts("");
+	backing_src_help("-s");
+	puts("");
+	puts(" -n: specify the number of vcpus (default: 1)");
+	puts("");
+	puts(" -m: specify the number of memslots (default: 1)");
+	puts("");
+}
+
+int main(int argc, char *argv[])
+{
+	enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC;
+	uint32_t nr_memslots = 1;
+	uint32_t nr_vcpus = 1;
+	int opt;
+
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+
+	while ((opt = getopt(argc, argv, "hm:s:n:")) != -1) {
+		switch (opt) {
+		case 's':
+			src_type = parse_backing_src_type(optarg);
+			break;
+		case 'n':
+			nr_vcpus = atoi_positive("nr_vcpus", optarg);
+			break;
+		case 'm':
+			nr_memslots = atoi_positive("nr_memslots", optarg);
+			break;
+		case 'h':
+		default:
+			usage(argv[0]);
+			exit(0);
+		}
+	}
+
+	test_mem_conversions(src_type, nr_vcpus, nr_memslots);
+
+	return 0;
+}
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 32/35] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (30 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 31/35] KVM: selftests: Add x86-only selftest for private memory conversions Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-27 18:22 ` [PATCH v13 33/35] KVM: selftests: Expand set_memory_region_test to validate guest_memfd() Sean Christopherson
                   ` (3 subsequent siblings)
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Add helpers to invoke KVM_SET_USER_MEMORY_REGION2 directly so that tests
can validate features that are unique to "version 2" of "set user memory
region", e.g. do negative testing on guest_memfd and guest_memfd_offset.

Provide a raw version as well as an assert-success version to reduce
the amount of boilerplate code needed for basic usage.
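
A sketch of the two flavors side by side (the slot/GPA/size constants and the
guest_memfd fd are placeholders assumed to be set up by the test):

	int r;

	/* Negative testing: the raw variant returns -1 with errno set. */
	r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
					 MEM_REGION_GPA, MEM_REGION_SIZE, NULL,
					 guest_memfd, 1 /* unaligned offset */);
	TEST_ASSERT(r == -1 && errno == EINVAL, "Unaligned offset should fail");

	/* Happy path: the assert-success wrapper fails the test on any error. */
	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
				   MEM_REGION_GPA, MEM_REGION_SIZE, NULL,
				   guest_memfd, 0);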

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/include/kvm_util_base.h     |  7 +++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 29 +++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 157508c071f3..8ec122f5fcc8 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -522,6 +522,13 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 			       uint64_t gpa, uint64_t size, void *hva);
 int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 				uint64_t gpa, uint64_t size, void *hva);
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+				uint64_t gpa, uint64_t size, void *hva,
+				uint32_t guest_memfd, uint64_t guest_memfd_offset);
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+				 uint64_t gpa, uint64_t size, void *hva,
+				 uint32_t guest_memfd, uint64_t guest_memfd_offset);
+
 void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	enum vm_mem_backing_src_type src_type,
 	uint64_t guest_paddr, uint32_t slot, uint64_t npages,
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 52b131e3aca5..1620452c1cf7 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -873,6 +873,35 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 		    errno, strerror(errno));
 }
 
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+				 uint64_t gpa, uint64_t size, void *hva,
+				 uint32_t guest_memfd, uint64_t guest_memfd_offset)
+{
+	struct kvm_userspace_memory_region2 region = {
+		.slot = slot,
+		.flags = flags,
+		.guest_phys_addr = gpa,
+		.memory_size = size,
+		.userspace_addr = (uintptr_t)hva,
+		.guest_memfd = guest_memfd,
+		.guest_memfd_offset = guest_memfd_offset,
+	};
+
+	return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, &region);
+}
+
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
+				uint64_t gpa, uint64_t size, void *hva,
+				uint32_t guest_memfd, uint64_t guest_memfd_offset)
+{
+	int ret = __vm_set_user_memory_region2(vm, slot, flags, gpa, size, hva,
+					       guest_memfd, guest_memfd_offset);
+
+	TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed, errno = %d (%s)",
+		    errno, strerror(errno));
+}
+
+
 /* FIXME: This thing needs to be ripped apart and rewritten. */
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-- 
2.42.0.820.g83a721a137-goog


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v13 33/35] KVM: selftests: Expand set_memory_region_test to validate guest_memfd()
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (31 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 32/35] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-27 18:22 ` [PATCH v13 34/35] KVM: selftests: Add basic selftest for guest_memfd() Sean Christopherson
                   ` (2 subsequent siblings)
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Expand set_memory_region_test to exercise various positive and negative
testcases for private memory.

 - Non-guest_memfd() file descriptor for private memory
 - guest_memfd() from different VM
 - Overlapping bindings
 - Unaligned bindings

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
[sean: trim the testcases to remove duplicate coverage]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/include/kvm_util_base.h     |  10 ++
 .../selftests/kvm/set_memory_region_test.c    | 100 ++++++++++++++++++
 2 files changed, 110 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 8ec122f5fcc8..e4d2cd9218b2 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -819,6 +819,16 @@ static inline struct kvm_vm *vm_create_barebones(void)
 	return ____vm_create(VM_SHAPE_DEFAULT);
 }
 
+static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
+{
+	const struct vm_shape shape = {
+		.mode = VM_MODE_DEFAULT,
+		.type = KVM_X86_SW_PROTECTED_VM,
+	};
+
+	return ____vm_create(shape);
+}
+
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
 {
 	return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index b32960189f5f..ca83e3307a98 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -385,6 +385,98 @@ static void test_add_max_memory_regions(void)
 	kvm_vm_free(vm);
 }
 
+
+static void test_invalid_guest_memfd(struct kvm_vm *vm, int memfd,
+				     size_t offset, const char *msg)
+{
+	int r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+					     MEM_REGION_GPA, MEM_REGION_SIZE,
+					     0, memfd, offset);
+	TEST_ASSERT(r == -1 && errno == EINVAL, "%s", msg);
+}
+
+static void test_add_private_memory_region(void)
+{
+	struct kvm_vm *vm, *vm2;
+	int memfd, i;
+
+	pr_info("Testing ADD of KVM_MEM_PRIVATE memory regions\n");
+
+	vm = vm_create_barebones_protected_vm();
+
+	test_invalid_guest_memfd(vm, vm->kvm_fd, 0, "KVM fd should fail");
+	test_invalid_guest_memfd(vm, vm->fd, 0, "VM's fd should fail");
+
+	memfd = kvm_memfd_alloc(MEM_REGION_SIZE, false);
+	test_invalid_guest_memfd(vm, memfd, 0, "Regular memfd() should fail");
+	close(memfd);
+
+	vm2 = vm_create_barebones_protected_vm();
+	memfd = vm_create_guest_memfd(vm2, MEM_REGION_SIZE, 0);
+	test_invalid_guest_memfd(vm, memfd, 0, "Other VM's guest_memfd() should fail");
+
+	vm_set_user_memory_region2(vm2, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+				   MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+	close(memfd);
+	kvm_vm_free(vm2);
+
+	memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE, 0);
+	for (i = 1; i < PAGE_SIZE; i++)
+		test_invalid_guest_memfd(vm, memfd, i, "Unaligned offset should fail");
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+				   MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+	close(memfd);
+
+	kvm_vm_free(vm);
+}
+
+static void test_add_overlapping_private_memory_regions(void)
+{
+	struct kvm_vm *vm;
+	int memfd;
+	int r;
+
+	pr_info("Testing ADD of overlapping KVM_MEM_PRIVATE memory regions\n");
+
+	vm = vm_create_barebones_protected_vm();
+
+	memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE * 4, 0);
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+				   MEM_REGION_GPA, MEM_REGION_SIZE * 2, 0, memfd, 0);
+
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT + 1, KVM_MEM_PRIVATE,
+				   MEM_REGION_GPA * 2, MEM_REGION_SIZE * 2,
+				   0, memfd, MEM_REGION_SIZE * 2);
+
+	/*
+	 * Delete the first memslot, and then attempt to recreate it except
+	 * with a "bad" offset that results in overlap in the guest_memfd().
+	 */
+	vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+				   MEM_REGION_GPA, 0, NULL, -1, 0);
+
+	/* Overlap the front half of the other slot. */
+	r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+					 MEM_REGION_GPA * 2 - MEM_REGION_SIZE,
+					 MEM_REGION_SIZE * 2,
+					 0, memfd, 0);
+	TEST_ASSERT(r == -1 && errno == EEXIST, "%s",
+		    "Overlapping guest_memfd() bindings should fail with EEXIST");
+
+	/* And now the back half of the other slot. */
+	r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+					 MEM_REGION_GPA * 2 + MEM_REGION_SIZE,
+					 MEM_REGION_SIZE * 2,
+					 0, memfd, 0);
+	TEST_ASSERT(r == -1 && errno == EEXIST, "%s",
+		    "Overlapping guest_memfd() bindings should fail with EEXIST");
+
+	close(memfd);
+	kvm_vm_free(vm);
+}
+
 int main(int argc, char *argv[])
 {
 #ifdef __x86_64__
@@ -401,6 +493,14 @@ int main(int argc, char *argv[])
 
 	test_add_max_memory_regions();
 
+	if (kvm_has_cap(KVM_CAP_GUEST_MEMFD) &&
+	    (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM))) {
+		test_add_private_memory_region();
+		test_add_overlapping_private_memory_regions();
+	} else {
+		pr_info("Skipping tests for KVM_MEM_PRIVATE memory regions\n");
+	}
+
 #ifdef __x86_64__
 	if (argc > 1)
 		loops = atoi_positive("Number of iterations", argv[1]);
-- 
2.42.0.820.g83a721a137-goog



* [PATCH v13 34/35] KVM: selftests: Add basic selftest for guest_memfd()
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (32 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 33/35] KVM: selftests: Expand set_memory_region_test to validate guest_memfd() Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-27 18:22 ` [PATCH v13 35/35] KVM: selftests: Test KVM exit behavior for private memory/access Sean Christopherson
  2023-10-30 17:39 ` [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Paolo Bonzini
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Chao Peng <chao.p.peng@linux.intel.com>

Add a selftest to verify the basic functionality of guest_memfd():

+ file descriptor created with the guest_memfd() ioctl does not allow
  read/write/mmap operations
+ file size and block size as returned from fstat are as expected
+ fallocate(FALLOC_FL_PUNCH_HOLE) on the fd rejects offsets and lengths
  that are not page aligned
+ invalid inputs (misaligned size, invalid flags) are rejected
+ file size and inode are unique (the innocuous-sounding
  anon_inode_getfile() backs all files with a single inode...)
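
For reference, a rough editorial sketch of the raw vm ioctl that the
selftest helpers vm_create_guest_memfd() and __vm_create_guest_memfd()
wrap is below; the struct layout follows the uAPI added earlier in the
series, while the helper name and error handling here are illustrative
only:

/*
 * Minimal sketch, assuming <linux/kvm.h> from a kernel with this series
 * applied.  KVM_CREATE_GUEST_MEMFD is a vm ioctl that returns a new file
 * descriptor on success.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int create_guest_memfd(int vm_fd, uint64_t size, uint64_t flags)
{
	struct kvm_create_guest_memfd args = {
		.size  = size,	/* must be page-aligned, else EINVAL */
		.flags = flags,	/* e.g. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE */
	};

	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
}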

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 221 ++++++++++++++++++
 2 files changed, 222 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index b709a52d5cdb..2b1ef809d73a 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -124,6 +124,7 @@ TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += guest_memfd_test
 TEST_GEN_PROGS_x86_64 += guest_print_test
 TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
new file mode 100644
index 000000000000..c15de9852316
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright Intel Corporation, 2023
+ *
+ * Author: Chao Peng <chao.p.peng@linux.intel.com>
+ */
+
+#define _GNU_SOURCE
+#include "test_util.h"
+#include "kvm_util_base.h"
+#include <linux/bitmap.h>
+#include <linux/falloc.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+#include <fcntl.h>
+
+static void test_file_read_write(int fd)
+{
+	char buf[64];
+
+	TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0,
+		    "read on a guest_mem fd should fail");
+	TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0,
+		    "write on a guest_mem fd should fail");
+	TEST_ASSERT(pread(fd, buf, sizeof(buf), 0) < 0,
+		    "pread on a guest_mem fd should fail");
+	TEST_ASSERT(pwrite(fd, buf, sizeof(buf), 0) < 0,
+		    "pwrite on a guest_mem fd should fail");
+}
+
+static void test_mmap(int fd, size_t page_size)
+{
+	char *mem;
+
+	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT_EQ(mem, MAP_FAILED);
+}
+
+static void test_file_size(int fd, size_t page_size, size_t total_size)
+{
+	struct stat sb;
+	int ret;
+
+	ret = fstat(fd, &sb);
+	TEST_ASSERT(!ret, "fstat should succeed");
+	TEST_ASSERT_EQ(sb.st_size, total_size);
+	TEST_ASSERT_EQ(sb.st_blksize, page_size);
+}
+
+static void test_fallocate(int fd, size_t page_size, size_t total_size)
+{
+	int ret;
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size);
+	TEST_ASSERT(!ret, "fallocate with aligned offset and size should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			page_size - 1, page_size);
+	TEST_ASSERT(ret, "fallocate with unaligned offset should fail");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, page_size);
+	TEST_ASSERT(ret, "fallocate beginning at total_size should fail");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, page_size);
+	TEST_ASSERT(ret, "fallocate beginning after total_size should fail");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			total_size, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			total_size + page_size, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			page_size, page_size - 1);
+	TEST_ASSERT(ret, "fallocate with unaligned size should fail");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			page_size, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) with aligned offset and size should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, page_size, page_size);
+	TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed");
+}
+
+static void test_invalid_punch_hole(int fd, size_t page_size, size_t total_size)
+{
+	struct {
+		off_t offset;
+		off_t len;
+	} testcases[] = {
+		{0, 1},
+		{0, page_size - 1},
+		{0, page_size + 1},
+
+		{1, 1},
+		{1, page_size - 1},
+		{1, page_size},
+		{1, page_size + 1},
+
+		{page_size, 1},
+		{page_size, page_size - 1},
+		{page_size, page_size + 1},
+	};
+	int ret, i;
+
+	for (i = 0; i < ARRAY_SIZE(testcases); i++) {
+		ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+				testcases[i].offset, testcases[i].len);
+		TEST_ASSERT(ret == -1 && errno == EINVAL,
+			    "PUNCH_HOLE with !PAGE_SIZE offset (%lx) and/or length (%lx) should fail",
+			    testcases[i].offset, testcases[i].len);
+	}
+}
+
+static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
+{
+	uint64_t valid_flags = 0;
+	size_t page_size = getpagesize();
+	uint64_t flag;
+	size_t size;
+	int fd;
+
+	for (size = 1; size < page_size; size++) {
+		fd = __vm_create_guest_memfd(vm, size, 0);
+		TEST_ASSERT(fd == -1 && errno == EINVAL,
+			    "guest_memfd() with non-page-aligned page size '0x%lx' should fail with EINVAL",
+			    size);
+	}
+
+	if (thp_configured()) {
+		for (size = page_size * 2; size < get_trans_hugepagesz(); size += page_size) {
+			fd = __vm_create_guest_memfd(vm, size, KVM_GUEST_MEMFD_ALLOW_HUGEPAGE);
+			TEST_ASSERT(fd == -1 && errno == EINVAL,
+				    "guest_memfd() with non-hugepage-aligned page size '0x%lx' should fail with EINVAL",
+				    size);
+		}
+
+		valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+	}
+
+	for (flag = 1; flag; flag <<= 1) {
+		uint64_t bit;
+
+		if (flag & valid_flags)
+			continue;
+
+		fd = __vm_create_guest_memfd(vm, page_size, flag);
+		TEST_ASSERT(fd == -1 && errno == EINVAL,
+			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
+			    flag);
+
+		for_each_set_bit(bit, &valid_flags, 64) {
+			fd = __vm_create_guest_memfd(vm, page_size, flag | BIT_ULL(bit));
+			TEST_ASSERT(fd == -1 && errno == EINVAL,
+				    "guest_memfd() with flags '0x%llx' should fail with EINVAL",
+				    flag | BIT_ULL(bit));
+		}
+	}
+}
+
+static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
+{
+	int fd1, fd2, ret;
+	struct stat st1, st2;
+
+	fd1 = __vm_create_guest_memfd(vm, 4096, 0);
+	TEST_ASSERT(fd1 != -1, "memfd creation should succeed");
+
+	ret = fstat(fd1, &st1);
+	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
+	TEST_ASSERT(st1.st_size == 4096, "memfd st_size should match requested size");
+
+	fd2 = __vm_create_guest_memfd(vm, 8192, 0);
+	TEST_ASSERT(fd2 != -1, "memfd creation should succeed");
+
+	ret = fstat(fd2, &st2);
+	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
+	TEST_ASSERT(st2.st_size == 8192, "second memfd st_size should match requested size");
+
+	ret = fstat(fd1, &st1);
+	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
+	TEST_ASSERT(st1.st_size == 4096, "first memfd st_size should still match requested size");
+	TEST_ASSERT(st1.st_ino != st2.st_ino, "different memfd should have different inode numbers");
+}
+
+int main(int argc, char *argv[])
+{
+	size_t page_size;
+	size_t total_size;
+	int fd;
+	struct kvm_vm *vm;
+
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
+
+	page_size = getpagesize();
+	total_size = page_size * 4;
+
+	vm = vm_create_barebones();
+
+	test_create_guest_memfd_invalid(vm);
+	test_create_guest_memfd_multiple(vm);
+
+	fd = vm_create_guest_memfd(vm, total_size, 0);
+
+	test_file_read_write(fd);
+	test_mmap(fd, page_size);
+	test_file_size(fd, page_size, total_size);
+	test_fallocate(fd, page_size, total_size);
+	test_invalid_punch_hole(fd, page_size, total_size);
+
+	close(fd);
+}
-- 
2.42.0.820.g83a721a137-goog



* [PATCH v13 35/35] KVM: selftests: Test KVM exit behavior for private memory/access
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (33 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 34/35] KVM: selftests: Add basic selftest for guest_memfd() Sean Christopherson
@ 2023-10-27 18:22 ` Sean Christopherson
  2023-10-30 17:39 ` [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Paolo Bonzini
  35 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-27 18:22 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Sean Christopherson, Alexander Viro,
	Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

From: Ackerley Tng <ackerleytng@google.com>

"Testing private access when memslot gets deleted" tests the behavior
of KVM when a private memslot gets deleted while the VM is using the
private memslot. When KVM looks up the deleted (slot = NULL) memslot,
KVM should exit to userspace with KVM_EXIT_MEMORY_FAULT.

In the second test, upon a private access to a non-private memslot, KVM
should also exit to userspace with KVM_EXIT_MEMORY_FAULT.

Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.
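
For context, a hedged editorial sketch of the userspace side of these
exits follows (not part of the patch; it assumes <linux/kvm.h> from a
kernel with this series applied, and the reporting below is illustrative
only):

/*
 * Memory-fault exits are delivered with KVM_RUN returning -1/EFAULT and
 * exit_reason set to KVM_EXIT_MEMORY_FAULT; the faulting range and its
 * private/shared nature are described in run->memory_fault.
 */
#include <errno.h>
#include <stdio.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
	int r = ioctl(vcpu_fd, KVM_RUN, 0);

	if (r == -1 && errno == EFAULT &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		bool is_private = run->memory_fault.flags &
				  KVM_MEMORY_EXIT_FLAG_PRIVATE;

		/*
		 * A real VMM would convert [gpa, gpa + size) to the desired
		 * attribute (e.g. via KVM_SET_MEMORY_ATTRIBUTES) or treat
		 * the access as fatal; here the fault is only reported.
		 */
		fprintf(stderr, "memory fault: gpa=0x%llx size=0x%llx private=%d\n",
			(unsigned long long)run->memory_fault.gpa,
			(unsigned long long)run->memory_fault.size, is_private);
		return 0;
	}

	return r;
}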

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
[sean: call out the similarities with set_memory_region_test]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../kvm/x86_64/private_mem_kvm_exits_test.c   | 120 ++++++++++++++++++
 2 files changed, 121 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 2b1ef809d73a..f7fdd8244547 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -82,6 +82,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
 TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
new file mode 100644
index 000000000000..7f7ca4475745
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#include <linux/kvm.h>
+#include <pthread.h>
+#include <stdint.h>
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+/* Arbitrarily selected to avoid overlaps with anything else */
+#define EXITS_TEST_GVA 0xc0000000
+#define EXITS_TEST_GPA EXITS_TEST_GVA
+#define EXITS_TEST_NPAGES 1
+#define EXITS_TEST_SIZE (EXITS_TEST_NPAGES * PAGE_SIZE)
+#define EXITS_TEST_SLOT 10
+
+static uint64_t guest_repeatedly_read(void)
+{
+	volatile uint64_t value;
+
+	while (true)
+		value = *((uint64_t *) EXITS_TEST_GVA);
+
+	return value;
+}
+
+static uint32_t run_vcpu_get_exit_reason(struct kvm_vcpu *vcpu)
+{
+	int r;
+
+	r = _vcpu_run(vcpu);
+	if (r) {
+		TEST_ASSERT(errno == EFAULT, KVM_IOCTL_ERROR(KVM_RUN, r));
+		TEST_ASSERT_EQ(vcpu->run->exit_reason, KVM_EXIT_MEMORY_FAULT);
+	}
+	return vcpu->run->exit_reason;
+}
+
+const struct vm_shape protected_vm_shape = {
+	.mode = VM_MODE_DEFAULT,
+	.type = KVM_X86_SW_PROTECTED_VM,
+};
+
+static void test_private_access_memslot_deleted(void)
+{
+	struct kvm_vm *vm;
+	struct kvm_vcpu *vcpu;
+	pthread_t vm_thread;
+	void *thread_return;
+	uint32_t exit_reason;
+
+	vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+					   guest_repeatedly_read);
+
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+				    EXITS_TEST_GPA, EXITS_TEST_SLOT,
+				    EXITS_TEST_NPAGES,
+				    KVM_MEM_PRIVATE);
+
+	virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
+
+	/* Request to access page privately */
+	vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+
+	pthread_create(&vm_thread, NULL,
+		       (void *(*)(void *))run_vcpu_get_exit_reason,
+		       (void *)vcpu);
+
+	vm_mem_region_delete(vm, EXITS_TEST_SLOT);
+
+	pthread_join(vm_thread, &thread_return);
+	exit_reason = (uint32_t)(uint64_t)thread_return;
+
+	TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+
+	kvm_vm_free(vm);
+}
+
+static void test_private_access_memslot_not_private(void)
+{
+	struct kvm_vm *vm;
+	struct kvm_vcpu *vcpu;
+	uint32_t exit_reason;
+
+	vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+					   guest_repeatedly_read);
+
+	/* Add a non-private memslot (flags = 0) */
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+				    EXITS_TEST_GPA, EXITS_TEST_SLOT,
+				    EXITS_TEST_NPAGES, 0);
+
+	virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
+
+	/* Request to access page privately */
+	vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+
+	exit_reason = run_vcpu_get_exit_reason(vcpu);
+
+	TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+	TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+
+	kvm_vm_free(vm);
+}
+
+int main(int argc, char *argv[])
+{
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+
+	test_private_access_memslot_deleted();
+	test_private_access_memslot_not_private();
+}
-- 
2.42.0.820.g83a721a137-goog



* Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
  2023-10-27 18:21 ` [PATCH v13 13/35] KVM: Introduce per-page memory attributes Sean Christopherson
@ 2023-10-30  8:11   ` Chao Gao
  2023-10-30 16:10     ` Sean Christopherson
  2023-10-31 16:43   ` David Matlack
  2023-11-02  3:01   ` Huang, Kai
  2 siblings, 1 reply; 148+ messages in thread
From: Chao Gao @ 2023-10-30  8:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Fri, Oct 27, 2023 at 11:21:55AM -0700, Sean Christopherson wrote:
>From: Chao Peng <chao.p.peng@linux.intel.com>
>
>In confidential computing usages, whether a page is private or shared is
>necessary information for KVM to perform operations like page fault
>handling, page zapping etc. There are other potential use cases for
>per-page memory attributes, e.g. to make memory read-only (or no-exec,
>or exec-only, etc.) without having to modify memslots.
>
>Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
>userspace to operate on the per-page memory attributes.
>  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>    a guest memory range.

>  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>    memory attributes.

This ioctl() is already removed. So, the changelog is out-of-date and needs
an update.

>
>+
>+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
>+:Architectures: x86
>+:Type: vm ioctl
>+:Parameters: struct kvm_memory_attributes(in)

					   ^ add one space here?


>+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
>+					  struct kvm_gfn_range *range)
>+{
>+	/*
>+	 * Unconditionally add the range to the invalidation set, regardless of
>+	 * whether or not the arch callback actually needs to zap SPTEs.  E.g.
>+	 * if KVM supports RWX attributes in the future and the attributes are
>+	 * going from R=>RW, zapping isn't strictly necessary.  Unconditionally
>+	 * adding the range allows KVM to require that MMU invalidations add at
>+	 * least one range between begin() and end(), e.g. allows KVM to detect
>+	 * bugs where the add() is missed.  Rexlaing the rule *might* be safe,

					    ^^^^^^^^ Relaxing

>@@ -4640,6 +4850,17 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> 	case KVM_CAP_BINARY_STATS_FD:
> 	case KVM_CAP_SYSTEM_EVENT_DATA:
> 		return 1;
>+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>+	case KVM_CAP_MEMORY_ATTRIBUTES:
>+		u64 attrs = kvm_supported_mem_attributes(kvm);
>+
>+		r = -EFAULT;
>+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
>+			goto out;
>+		r = 0;
>+		break;

This cannot work, e.g., there is no @argp in this function; it is fixed by a later commit:

	fcbef1e5e5d2 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")


* Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
  2023-10-30  8:11   ` Chao Gao
@ 2023-10-30 16:10     ` Sean Christopherson
  2023-10-30 22:05       ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-10-30 16:10 UTC (permalink / raw)
  To: Chao Gao
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Mon, Oct 30, 2023, Chao Gao wrote:
> On Fri, Oct 27, 2023 at 11:21:55AM -0700, Sean Christopherson wrote:
> >From: Chao Peng <chao.p.peng@linux.intel.com>
> >
> >In confidential computing usages, whether a page is private or shared is
> >necessary information for KVM to perform operations like page fault
> >handling, page zapping etc. There are other potential use cases for
> >per-page memory attributes, e.g. to make memory read-only (or no-exec,
> >or exec-only, etc.) without having to modify memslots.
> >
> >Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> >userspace to operate on the per-page memory attributes.
> >  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >    a guest memory range.
> 
> >  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >    memory attributes.
> 
> This ioctl() is already removed. So, the changelog is out-of-date and needs
> an update.

Doh, I lost track of this and the fixup for KVM_CAP_MEMORY_ATTRIBUTES below.

> >+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> >+:Architectures: x86
> >+:Type: vm ioctl
> >+:Parameters: struct kvm_memory_attributes(in)
> 
> 					   ^ add one space here?

Ah, yeah, that does appear to be the standard.
> 
> 
> >+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
> >+					  struct kvm_gfn_range *range)
> >+{
> >+	/*
> >+	 * Unconditionally add the range to the invalidation set, regardless of
> >+	 * whether or not the arch callback actually needs to zap SPTEs.  E.g.
> >+	 * if KVM supports RWX attributes in the future and the attributes are
> >+	 * going from R=>RW, zapping isn't strictly necessary.  Unconditionally
> >+	 * adding the range allows KVM to require that MMU invalidations add at
> >+	 * least one range between begin() and end(), e.g. allows KVM to detect
> >+	 * bugs where the add() is missed.  Rexlaing the rule *might* be safe,
> 
> 					    ^^^^^^^^ Relaxing
> 
> >@@ -4640,6 +4850,17 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > 	case KVM_CAP_BINARY_STATS_FD:
> > 	case KVM_CAP_SYSTEM_EVENT_DATA:
> > 		return 1;
> >+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> >+	case KVM_CAP_MEMORY_ATTRIBUTES:
> >+		u64 attrs = kvm_supported_mem_attributes(kvm);
> >+
> >+		r = -EFAULT;
> >+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> >+			goto out;
> >+		r = 0;
> >+		break;
> 
> This cannot work, e.g., there is no @argp in this function; it is fixed by a later commit:
> 
> 	fcbef1e5e5d2 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")

I'll post a fixup patch for all of these, thanks much!


* Re: [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative
  2023-10-27 18:21 ` [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative Sean Christopherson
@ 2023-10-30 16:27   ` Paolo Bonzini
  2023-11-01 12:46   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 16:27 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> Move the assertion on the in-progress invalidation count from the primary
> MMU's notifier path to KVM's common notification path, i.e. assert that
> the count doesn't go negative even when the invalidation is coming from
> KVM itself.
> 
> Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
> the affected VM, not the entire kernel.  A corrupted count is fatal to the
> VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
> to block any and all attempts to install new mappings.  But it's far from
> guaranteed that an end() without a start() is fatal or even problematic to
> anything other than the target VM, e.g. the underlying bug could simply be
> a duplicate call to end().  And it's much more likely that a missed
> invalidation, i.e. a potential use-after-free, would manifest as no
> notification whatsoever, not an end() without a start().

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   virt/kvm/kvm_main.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0524933856d4..5a97e6c7d9c2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -833,6 +833,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
>   	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
>   	 */
>   	kvm->mmu_invalidate_in_progress--;
> +	KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
>   }
>   
>   static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> @@ -863,8 +864,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>   	 */
>   	if (wake)
>   		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
> -
> -	BUG_ON(kvm->mmu_invalidate_in_progress < 0);
>   }
>   
>   static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,



* Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-10-27 18:21 ` [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
@ 2023-10-30 16:30   ` Paolo Bonzini
  2023-10-30 16:53   ` David Matlack
  2023-11-01 15:31   ` Xu Yilun
  2 siblings, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 16:30 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com> Currently in mmu_notifier 
> invalidate path, hva range is recorded and then checked against by 
> mmu_notifier_retry_hva() in the page fault handling path. However, for 
> the to be introduced private memory, a page fault may not have a hva 
> associated, checking gfn(gpa) makes more sense. For existing hva based 
> shared memory, gfn is expected to also work. The only downside is when 
> aliasing multiple gfns to a single hva, the current algorithm of 
> checking multiple ranges could result in a much larger range being 
> rejected. Such aliasing should be uncommon, so the impact is expected 
> small.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo



* Re: [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction
  2023-10-27 18:21 ` [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction Sean Christopherson
@ 2023-10-30 16:32   ` Paolo Bonzini
  2023-11-01 12:50   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 16:32 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> Add an assertion that there are no in-progress MMU invalidations when a 
> VM is being destroyed, with the exception of the scenario where KVM 
> unregisters its MMU notifier between an .invalidate_range_start() call 
> and the corresponding .invalidate_range_end(). KVM can't detect unpaired 
> calls from the mmu_notifier due to the above exception waiver, but the 
> assertion can detect KVM bugs, e.g. such as the bug that *almost* 
> escaped initial guest_memfd development.
>
> Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5389@linux.intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo



* Re: [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
  2023-10-27 18:21 ` [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER Sean Christopherson
@ 2023-10-30 16:34   ` Paolo Bonzini
  2023-11-01 12:51   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 16:34 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
> defined when KVM is enabled, and return '1' unconditionally for the
> CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path.  All flavors of PPC support for KVM
> select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
> defined by arch/powerpc/include/asm/kvm_host.h.
> 
> Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
> future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
> will allow combining all of the
> 
>    #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> 
> checks into a single
> 
>    #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> 
> without having to worry about PPC's "bare" usage of
> KVM_ARCH_WANT_MMU_NOTIFIER.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/powerpc/kvm/powerpc.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 7197c8256668..b0a512ede764 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -632,12 +632,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   		break;
>   #endif
>   	case KVM_CAP_SYNC_MMU:
> +#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +		BUILD_BUG();
> +#endif
>   #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>   		r = hv_enabled;
> -#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> -		r = 1;
>   #else
> -		r = 0;
> +		r = 1;
>   #endif
>   		break;
>   #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>



* Re: [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER
  2023-10-27 18:21 ` [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
@ 2023-10-30 16:37   ` Paolo Bonzini
  2023-11-01 12:54   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 16:37 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
> appropriate to effectively maintain existing behavior.  Using a proper
> Kconfig will simplify building more functionality on top of KVM's
> mmu_notifier infrastructure.
> 
> Add a forward declaration of kvm_gfn_range to kvm_types.h so that
> including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
> generate warnings due to kvm_gfn_range being undeclared.  PPC defines
> hooks for PR vs. HV without guarding them via #ifdeffery, e.g.
> 
>   bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> 
> Alternatively, PPC could forward declare kvm_gfn_range, but there's no
> good reason not to define it in common KVM.

The new #define should also imply KVM_CAP_SYNC_MMU, or even: 
KVM_CAP_SYNC_MMU should just be enabled by all architectures at this 
point.  You don't need to care about it, I have a larger series for caps 
that are enabled by all architectures and I'll post it for 6.8.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo



* Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  2023-10-27 18:21 ` [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
@ 2023-10-30 16:41   ` Paolo Bonzini
  2023-10-30 20:25     ` Sean Christopherson
  2023-10-31  2:26   ` Xiaoyao Li
  2023-11-01 14:19   ` Fuad Tabba
  2 siblings, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 16:41 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> 
> +		if (ioctl == KVM_SET_USER_MEMORY_REGION)
> +			size = sizeof(struct kvm_userspace_memory_region);

This also needs a memset(&mem, 0, sizeof(mem)), otherwise the 
out-of-bounds access of the commit message becomes a kernel stack read.

Probably worth adding a check on valid flags here.

Paolo

> +		else
> +			size = sizeof(struct kvm_userspace_memory_region2);
> +
> +		/* Ensure the common parts of the two structs are identical. */
> +		SANITY_CHECK_MEM_REGION_FIELD(slot);
> +		SANITY_CHECK_MEM_REGION_FIELD(flags);
> +		SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +		SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +		SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
>  




* Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-10-27 18:21 ` [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
  2023-10-30 16:30   ` Paolo Bonzini
@ 2023-10-30 16:53   ` David Matlack
  2023-10-30 17:00     ` Paolo Bonzini
  2023-10-30 18:19     ` David Matlack
  2023-11-01 15:31   ` Xu Yilun
  2 siblings, 2 replies; 148+ messages in thread
From: David Matlack @ 2023-10-30 16:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> Currently in mmu_notifier invalidate path, hva range is recorded and
> then checked against by mmu_notifier_retry_hva() in the page fault
> handling path. However, for the to be introduced private memory, a page
                          ^^^^^^^^^^^^^^^^^^^^^^^^

Is there a missing word here?

> fault may not have a hva associated, checking gfn(gpa) makes more sense.
> 
> For existing hva based shared memory, gfn is expected to also work. The
> only downside is when aliasing multiple gfns to a single hva, the
> current algorithm of checking multiple ranges could result in a much
> larger range being rejected. Such aliasing should be uncommon, so the
> impact is expected small.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Cc: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Tested-by: Fuad Tabba <tabba@google.com>
> [sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 10 ++++++----
>  arch/x86/kvm/vmx/vmx.c   | 11 +++++------
>  include/linux/kvm_host.h | 33 ++++++++++++++++++++------------
>  virt/kvm/kvm_main.c      | 41 +++++++++++++++++++++++++++++++---------
>  4 files changed, 64 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f7901cb4d2fa..d33657d61d80 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3056,7 +3056,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
>   *
>   * There are several ways to safely use this helper:
>   *
> - * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
> + * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
>   *   consuming it.  In this case, mmu_lock doesn't need to be held during the
>   *   lookup, but it does need to be held while checking the MMU notifier.
>   *
> @@ -4358,7 +4358,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>  		return true;
>  
>  	return fault->slot &&
> -	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
> +	       mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
>  }
>  
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -6245,7 +6245,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	write_lock(&kvm->mmu_lock);
>  
> -	kvm_mmu_invalidate_begin(kvm, 0, -1ul);
> +	kvm_mmu_invalidate_begin(kvm);
> +
> +	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
>  
>  	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
> @@ -6255,7 +6257,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  	if (flush)
>  		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
>  
> -	kvm_mmu_invalidate_end(kvm, 0, -1ul);
> +	kvm_mmu_invalidate_end(kvm);
>  
>  	write_unlock(&kvm->mmu_lock);
>  }
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 72e3943f3693..6e502ba93141 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6757,10 +6757,10 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	/*
> -	 * Grab the memslot so that the hva lookup for the mmu_notifier retry
> -	 * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
> -	 * on the pfn lookup's validation of the memslot to ensure a valid hva
> -	 * is used for the retry check.
> +	 * Explicitly grab the memslot using KVM's internal slot ID to ensure
> +	 * KVM doesn't unintentionally grab a userspace memslot.  It _should_
> +	 * be impossible for userspace to create a memslot for the APIC when
> +	 * APICv is enabled, but paranoia won't hurt in this case.
>  	 */
>  	slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
>  	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
> @@ -6785,8 +6785,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	read_lock(&vcpu->kvm->mmu_lock);
> -	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
> -				     gfn_to_hva_memslot(slot, gfn))) {
> +	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
>  		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>  		read_unlock(&vcpu->kvm->mmu_lock);
>  		goto out;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fb6c6109fdca..11d091688346 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -787,8 +787,8 @@ struct kvm {
>  	struct mmu_notifier mmu_notifier;
>  	unsigned long mmu_invalidate_seq;
>  	long mmu_invalidate_in_progress;
> -	unsigned long mmu_invalidate_range_start;
> -	unsigned long mmu_invalidate_range_end;
> +	gfn_t mmu_invalidate_range_start;
> +	gfn_t mmu_invalidate_range_end;
>  #endif
>  	struct list_head devices;
>  	u64 manual_dirty_log_protect;
> @@ -1392,10 +1392,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  #endif
>  
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -			      unsigned long end);
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -			    unsigned long end);
> +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);

What is the reason to separate range_add() from begin()?

> +void kvm_mmu_invalidate_end(struct kvm *kvm);
>  
>  long kvm_arch_dev_ioctl(struct file *filp,
>  			unsigned int ioctl, unsigned long arg);
> @@ -1970,9 +1969,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
>  	return 0;
>  }
>  
> -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
>  					   unsigned long mmu_seq,
> -					   unsigned long hva)
> +					   gfn_t gfn)
>  {
>  	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
> @@ -1981,10 +1980,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  	 * that might be being invalidated. Note that it may include some false
>  	 * positives, due to shortcuts when handing concurrent invalidations.
>  	 */
> -	if (unlikely(kvm->mmu_invalidate_in_progress) &&
> -	    hva >= kvm->mmu_invalidate_range_start &&
> -	    hva < kvm->mmu_invalidate_range_end)
> -		return 1;
> +	if (unlikely(kvm->mmu_invalidate_in_progress)) {
> +		/*
> +		 * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> +		 * but before updating the range is a KVM bug.
> +		 */
> +		if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> +				 kvm->mmu_invalidate_range_end == INVALID_GPA))
> +			return 1;
> +
> +		if (gfn >= kvm->mmu_invalidate_range_start &&
> +		    gfn < kvm->mmu_invalidate_range_end)
> +			return 1;
> +	}
> +
>  	if (kvm->mmu_invalidate_seq != mmu_seq)
>  		return 1;
>  	return 0;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5a97e6c7d9c2..1a577a25de47 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -543,9 +543,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  
>  typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>  
> -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> -			     unsigned long end);
> -
> +typedef void (*on_lock_fn_t)(struct kvm *kvm);
>  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>  
>  struct kvm_mmu_notifier_range {
> @@ -637,7 +635,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  				locked = true;
>  				KVM_MMU_LOCK(kvm);
>  				if (!IS_KVM_NULL_FN(range->on_lock))
> -					range->on_lock(kvm, range->start, range->end);
> +					range->on_lock(kvm);
> +
>  				if (IS_KVM_NULL_FN(range->handler))
>  					break;
>  			}
> @@ -742,16 +741,29 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn);
>  }
>  
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -			      unsigned long end)
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
>  {
> +	lockdep_assert_held_write(&kvm->mmu_lock);
>  	/*
>  	 * The count increase must become visible at unlock time as no
>  	 * spte can be established without taking the mmu_lock and
>  	 * count is also read inside the mmu_lock critical section.
>  	 */
>  	kvm->mmu_invalidate_in_progress++;
> +
>  	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> +		kvm->mmu_invalidate_range_start = INVALID_GPA;
> +		kvm->mmu_invalidate_range_end = INVALID_GPA;

I don't think this is incorrect, but I was a little surprised to see this
here rather than in end() when mmu_invalidate_in_progress decrements to
0.

> +	}
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	lockdep_assert_held_write(&kvm->mmu_lock);

Does this compile/function on KVM architectures with
!KVM_HAVE_MMU_RWLOCK? I assumed we would get an email from the buildbot
if it didn't compile but I don't know if buildbot builds with lockdep
enabled.

On this topic, I wonder if we should just bite the bullet and convert all
architectures to a rwlock_t.

> +
> +	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
> +	if (likely(kvm->mmu_invalidate_range_start == INVALID_GPA)) {
>  		kvm->mmu_invalidate_range_start = start;
>  		kvm->mmu_invalidate_range_end = end;
>  	} else {
> @@ -771,6 +783,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>  	}
>  }
>  
> +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> +	return kvm_unmap_gfn_range(kvm, range);
> +}
> +
>  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  					const struct mmu_notifier_range *range)
>  {
> @@ -778,7 +796,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	const struct kvm_mmu_notifier_range hva_range = {
>  		.start		= range->start,
>  		.end		= range->end,
> -		.handler	= kvm_unmap_gfn_range,
> +		.handler	= kvm_mmu_unmap_gfn_range,
>  		.on_lock	= kvm_mmu_invalidate_begin,
>  		.on_unlock	= kvm_arch_guest_memory_reclaimed,
>  		.flush_on_ret	= true,
> @@ -817,8 +835,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	return 0;
>  }
>  
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -			    unsigned long end)
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
>  {
>  	/*
>  	 * This sequence increase will notify the kvm page fault that
> @@ -834,6 +851,12 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,

Let's add a lockdep_assert_held_write(&kvm->mmu_lock) here too while
we're at it?

>  	 */
>  	kvm->mmu_invalidate_in_progress--;
>  	KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
> +
> +	/*
> +	 * Assert that at least one range was added between start() and end().
> +	 * Not adding a range isn't fatal, but it is a KVM bug.
> +	 */
> +	WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA);
>  }
>  
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> -- 
> 2.42.0.820.g83a721a137-goog
> 


* Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-10-30 16:53   ` David Matlack
@ 2023-10-30 17:00     ` Paolo Bonzini
  2023-10-30 18:21       ` David Matlack
  2023-10-30 18:19     ` David Matlack
  1 sibling, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 17:00 UTC (permalink / raw)
  To: David Matlack
  Cc: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Mon, Oct 30, 2023 at 5:53 PM David Matlack <dmatlack@google.com> wrote:
>
> On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > From: Chao Peng <chao.p.peng@linux.intel.com>
> >
> > Currently in mmu_notifier invalidate path, hva range is recorded and
> > then checked against by mmu_notifier_retry_hva() in the page fault
> > handling path. However, for the to be introduced private memory, a page
>                           ^^^^^^^^^^^^^^^^^^^^^^^^
>
> Is there a missing word here?

No, but there could be missing hyphens ("for the to-be-introduced
private memory"); possibly adding a "soon" would help parsing, and
perhaps that is what you were referring to?

> >       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > +             kvm->mmu_invalidate_range_start = INVALID_GPA;
> > +             kvm->mmu_invalidate_range_end = INVALID_GPA;
>
> I don't think this is incorrect, but I was a little surprised to see this
> here rather than in end() when mmu_invalidate_in_progress decrements to
> 0.

I think that would be incorrect on the very first start?

> > +     }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     lockdep_assert_held_write(&kvm->mmu_lock);
>
> Does this compile/function on KVM architectures with
> !KVM_HAVE_MMU_RWLOCK?

Yes:

#define lockdep_assert_held_write(l)    \
        lockdep_assert(lockdep_is_held_type(l, 0))

where 0 is the lock-held type used by lock_acquire_exclusive, which in
turn is what you get for a spinlock or mutex, in addition to a rwlock or
rwsem that is taken for write.

Instead, lockdep_assert_held() asserts that the lock is taken without
asserting a particular lock-held type.
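
To illustrate the difference (editorial snippet, not from the thread;
it assumes an architecture where mmu_lock is an rwlock, i.e.
KVM_HAVE_MMU_RWLOCK, and CONFIG_LOCKDEP=y):

/*
 * Illustrative only: with an rwlock held for read, lockdep_assert_held()
 * is satisfied but lockdep_assert_held_write() warns; with a spinlock,
 * both are satisfied because a spinlock hold is always exclusive.
 */
static void mmu_lock_assert_demo(struct kvm *kvm)
{
	read_lock(&kvm->mmu_lock);
	lockdep_assert_held(&kvm->mmu_lock);		/* ok */
	lockdep_assert_held_write(&kvm->mmu_lock);	/* warns: held for read */
	read_unlock(&kvm->mmu_lock);

	write_lock(&kvm->mmu_lock);
	lockdep_assert_held_write(&kvm->mmu_lock);	/* ok */
	write_unlock(&kvm->mmu_lock);
}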

> > @@ -834,6 +851,12 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
>
> Let's add a lockdep_assert_held_write(&kvm->mmu_lock) here too while
> we're at it?

Yes, good idea.

Paolo



* Re: [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
  2023-10-27 18:21 ` [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory Sean Christopherson
@ 2023-10-30 17:11   ` Paolo Bonzini
  2023-11-02 13:55   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 17:11 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
> __kvm_handle_hva_range() return whether or not an overlapping memslot
> was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
> works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
> mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.
> 
> Use a small struct to return the tuple of the notifier-specific return,
> plus whether or not overlap was found.  Because the iteration helpers are
> __always_inlined, practically speaking, the struct will never actually be
> returned from a function call (not to mention the size of the struct will
> be two bytes in practice).

Could have been split in two patches, but it's fine anyway.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo



* Re: [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook
  2023-10-27 18:21 ` [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook Sean Christopherson
@ 2023-10-30 17:18   ` Paolo Bonzini
  2023-11-02 13:55   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 17:18 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> Drop the .on_unlock() mmu_notifer hook now that it's no longer used for
> notifying arch code that memory has been reclaimed.  Adding .on_unlock()
> and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
> resulted in .on_lock() and .on_unlock() having divergent and asymmetric
> behavior, and set future developers up for failure, i.e. all but asked for
> bugs where KVM relied on using .on_unlock() to try to run a callback while
> holding mmu_lock.
> 
> Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
> guard against future bugs of this nature.

This is what David suggested to do in patch 3, FWIW.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo

> Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Link: https://lore.kernel.org/all/20230802203119.GB2021422@ls.amr.corp.intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
  2023-10-27 18:21 ` [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events Sean Christopherson
@ 2023-10-30 17:21   ` Paolo Bonzini
  2023-10-30 22:07     ` Sean Christopherson
  2023-11-02  5:59   ` Binbin Wu
  2023-11-02 14:01   ` Fuad Tabba
  2 siblings, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 17:21 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>  			 * the second or later invocation of the handler).
>  			 */
>  			gfn_range.arg = range->arg;
> +
> +			/*
> +			 * HVA-based notifications aren't relevant to private
> +			 * mappings as they don't have a userspace mapping.

It's confusing who "they" is.  Maybe

	 * HVA-based notifications provide a userspace address,
	 * and as such are only relevant for shared mappings.

Paolo

> +			 */
> +			gfn_range.only_private = false;
> +			gfn_range.only_shared = true;
>  			gfn_range.may_block = range->may_block;
>  
>  			/*



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-27 18:21 ` [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
@ 2023-10-30 17:22   ` Paolo Bonzini
  2023-11-01  7:30   ` Binbin Wu
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 17:22 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> Add a new KVM exit type to allow userspace to handle memory faults that
> KVM cannot resolve, but that userspace *may* be able to handle (without
> terminating the guest).
> 
> KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
> conversions between private and shared memory.  With guest private memory,
> there will be two kind of memory conversions:
> 
>    - explicit conversion: happens when the guest explicitly calls into KVM
>      to map a range (as private or shared)
> 
>    - implicit conversion: happens when the guest attempts to access a gfn
>      that is configured in the "wrong" state (private vs. shared)
> 
> On x86 (first architecture to support guest private memory), explicit
> conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
> but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
> as there is (obviously) no hypercall, and there is no guarantee that the
> guest actually intends to convert between private and shared, i.e. what
> KVM thinks is an implicit conversion "request" could actually be the
> result of a guest code bug.
> 
> KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
> be implicit conversions.
> 
> Note!  To allow for future possibilities where KVM reports
> KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
> fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
> perspective), not '0'!  Due to historical baggage within KVM, exiting to
> userspace with '0' from deep callstacks, e.g. in emulation paths, is
> infeasible as doing so would require a near-complete overhaul of KVM,
> whereas KVM already propagates -errno return codes to userspace even when
> the -errno originated in a low level helper.
> 
> Report the gpa+size instead of a single gfn even though the initial usage
> is expected to always report single pages.  It's entirely possible, likely
> even, that KVM will someday support sub-page granularity faults, e.g.
> Intel's sub-page protection feature allows for additional protections at
> 128-byte granularity.
> 
> Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
> Link: https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
> Cc: Anish Moorthy <amoorthy@google.com>
> Cc: David Matlack <dmatlack@google.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

> ---
>   Documentation/virt/kvm/api.rst | 41 ++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/x86.c             |  1 +
>   include/linux/kvm_host.h       | 11 +++++++++
>   include/uapi/linux/kvm.h       |  8 +++++++
>   4 files changed, 61 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index ace984acc125..860216536810 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
>   values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>   spec refer, https://github.com/riscv/riscv-sbi-doc.
>   
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +
> +KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
> +could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
> +guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
> +describes properties of the faulting access that are likely pertinent.
> +Currently, no flags are defined.
> +
> +Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
> +accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
> +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
> +kvm_run.exit_reason is stale/undefined for all other error numbers.
> +
>   ::
>   
>       /* KVM_EXIT_NOTIFY */
> @@ -7757,6 +7777,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
>   cause CPU stuck (due to event windows don't open up) and make the CPU
>   unavailable to host or other VMs.
>   
> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86
> +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> +
> +The presence of this capability indicates that KVM_RUN will fill
> +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> +there is a valid memslot but no backing VMA for the corresponding host virtual
> +address.
> +
> +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> +to KVM_EXIT_MEMORY_FAULT.
> +
> +Note: Userspaces which attempt to resolve memory faults so that they can retry
> +KVM_RUN are encouraged to guard against repeatedly receiving the same
> +error/annotated fault.
> +
> +See KVM_EXIT_MEMORY_FAULT for more information.
> +
>   8. Other capabilities.
>   ======================
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6409914428ca..ee3cd8c3c0ef 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4518,6 +4518,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_ENABLE_CAP:
>   	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
>   	case KVM_CAP_IRQFD_RESAMPLE:
> +	case KVM_CAP_MEMORY_FAULT_INFO:
>   		r = 1;
>   		break;
>   	case KVM_CAP_EXIT_HYPERCALL:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4e741ff27af3..96aa930536b1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>   /* Max number of entries allowed for each kvm dirty ring */
>   #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>   
> +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +						 gpa_t gpa, gpa_t size)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.size = size;
> +
> +	/* Flags are not (yet) defined or communicated to userspace. */
> +	vcpu->run->memory_fault.flags = 0;
> +}
> +
>   #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index bd1abe067f28..7ae9987b48dd 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -274,6 +274,7 @@ struct kvm_xen_exit {
>   #define KVM_EXIT_RISCV_SBI        35
>   #define KVM_EXIT_RISCV_CSR        36
>   #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>   
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -520,6 +521,12 @@ struct kvm_run {
>   #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>   			__u32 flags;
>   		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
>   		/* Fix the size of the union. */
>   		char padding[256];
>   	};
> @@ -1203,6 +1210,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
>   #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
>   #define KVM_CAP_USER_MEMORY2 230
> +#define KVM_CAP_MEMORY_FAULT_INFO 231
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
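
For illustration only, a bare-bones sketch of how userspace might consume
this exit (assuming the usual vcpu fd and mmap'd kvm_run setup, both of
which are omitted here):

	#include <errno.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int run_vcpu_once(int vcpu_fd, struct kvm_run *run)
	{
		int ret = ioctl(vcpu_fd, KVM_RUN, 0);

		if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
		    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			/*
			 * E.g. convert [gpa, gpa + size) between shared and
			 * private and retry KVM_RUN, taking care not to spin
			 * on the same annotated fault forever.
			 */
			fprintf(stderr, "unresolved fault: gpa 0x%llx, size 0x%llx\n",
				(unsigned long long)run->memory_fault.gpa,
				(unsigned long long)run->memory_fault.size);
		}

		return ret;
	}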


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 14/35] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
  2023-10-27 18:21 ` [PATCH v13 14/35] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable Sean Christopherson
@ 2023-10-30 17:24   ` Paolo Bonzini
  0 siblings, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 17:24 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> Add an "unmovable" flag for mappings that cannot be migrated under any
> circumstance.  KVM will use the flag for its upcoming GUEST_MEMFD support,
> which will not support compaction/migration, at least not in the
> foreseeable future.
> 
> Test AS_UNMOVABLE under folio lock as already done for the async
> compaction/dirty folio case, as the mapping can be removed by truncation
> while compaction is running.  To avoid having to lock every folio with a
> mapping, assume/require that unmovable mappings are also unevictable, and
> have mapping_set_unmovable() also set AS_UNEVICTABLE.
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Co-developed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

I think this could even be "From: Vlastimil", but no biggie.

Paolo

> ---
>   include/linux/pagemap.h | 19 +++++++++++++++++-
>   mm/compaction.c         | 43 +++++++++++++++++++++++++++++------------
>   mm/migrate.c            |  2 ++
>   3 files changed, 51 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 351c3b7f93a1..82c9bf506b79 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -203,7 +203,8 @@ enum mapping_flags {
>   	/* writeback related tags are not used */
>   	AS_NO_WRITEBACK_TAGS = 5,
>   	AS_LARGE_FOLIO_SUPPORT = 6,
> -	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
> +	AS_RELEASE_ALWAYS = 7,	/* Call ->release_folio(), even if no private data */
> +	AS_UNMOVABLE	= 8,	/* The mapping cannot be moved, ever */
>   };
>   
>   /**
> @@ -289,6 +290,22 @@ static inline void mapping_clear_release_always(struct address_space *mapping)
>   	clear_bit(AS_RELEASE_ALWAYS, &mapping->flags);
>   }
>   
> +static inline void mapping_set_unmovable(struct address_space *mapping)
> +{
> +	/*
> +	 * It's expected unmovable mappings are also unevictable. Compaction
> +	 * migrate scanner (isolate_migratepages_block()) relies on this to
> +	 * reduce page locking.
> +	 */
> +	set_bit(AS_UNEVICTABLE, &mapping->flags);
> +	set_bit(AS_UNMOVABLE, &mapping->flags);
> +}
> +
> +static inline bool mapping_unmovable(struct address_space *mapping)
> +{
> +	return test_bit(AS_UNMOVABLE, &mapping->flags);
> +}
> +
>   static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
>   {
>   	return mapping->gfp_mask;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 38c8d216c6a3..12b828aed7c8 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -883,6 +883,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   
>   	/* Time to isolate some pages for migration */
>   	for (; low_pfn < end_pfn; low_pfn++) {
> +		bool is_dirty, is_unevictable;
>   
>   		if (skip_on_failure && low_pfn >= next_skip_pfn) {
>   			/*
> @@ -1080,8 +1081,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   		if (!folio_test_lru(folio))
>   			goto isolate_fail_put;
>   
> +		is_unevictable = folio_test_unevictable(folio);
> +
>   		/* Compaction might skip unevictable pages but CMA takes them */
> -		if (!(mode & ISOLATE_UNEVICTABLE) && folio_test_unevictable(folio))
> +		if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
>   			goto isolate_fail_put;
>   
>   		/*
> @@ -1093,26 +1096,42 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_writeback(folio))
>   			goto isolate_fail_put;
>   
> -		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
> -			bool migrate_dirty;
> +		is_dirty = folio_test_dirty(folio);
> +
> +		if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) ||
> +		    (mapping && is_unevictable)) {
> +			bool migrate_dirty = true;
> +			bool is_unmovable;
>   
>   			/*
>   			 * Only folios without mappings or that have
> -			 * a ->migrate_folio callback are possible to
> -			 * migrate without blocking.  However, we may
> -			 * be racing with truncation, which can free
> -			 * the mapping.  Truncation holds the folio lock
> -			 * until after the folio is removed from the page
> -			 * cache so holding it ourselves is sufficient.
> +			 * a ->migrate_folio callback are possible to migrate
> +			 * without blocking.
> +			 *
> +			 * Folios from unmovable mappings are not migratable.
> +			 *
> +			 * However, we can be racing with truncation, which can
> +			 * free the mapping that we need to check. Truncation
> +			 * holds the folio lock until after the folio is removed
> +			 * from the page so holding it ourselves is sufficient.
> +			 *
> +			 * To avoid locking the folio just to check unmovable,
> +			 * assume every unmovable folio is also unevictable,
> +			 * which is a cheaper test.  If our assumption goes
> +			 * wrong, it's not a correctness bug, just potentially
> +			 * wasted cycles.
>   			 */
>   			if (!folio_trylock(folio))
>   				goto isolate_fail_put;
>   
>   			mapping = folio_mapping(folio);
> -			migrate_dirty = !mapping ||
> -					mapping->a_ops->migrate_folio;
> +			if ((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) {
> +				migrate_dirty = !mapping ||
> +						mapping->a_ops->migrate_folio;
> +			}
> +			is_unmovable = mapping && mapping_unmovable(mapping);
>   			folio_unlock(folio);
> -			if (!migrate_dirty)
> +			if (!migrate_dirty || is_unmovable)
>   				goto isolate_fail_put;
>   		}
>   
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2053b54556ca..ed874e43ecd7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -956,6 +956,8 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
>   
>   		if (!mapping)
>   			rc = migrate_folio(mapping, dst, src, mode);
> +		else if (mapping_unmovable(mapping))
> +			rc = -EOPNOTSUPP;
>   		else if (mapping->a_ops->migrate_folio)
>   			/*
>   			 * Most folios have a mapping and most filesystems
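
As a usage sketch (hypothetical caller, not part of the patch), an owner of
a mapping that must never be migrated would simply do this at creation time:

	static void demo_mark_unmovable(struct file *file)
	{
		struct address_space *mapping = file_inode(file)->i_mapping;

		/* Also sets AS_UNEVICTABLE, per the helper's contract above. */
		mapping_set_unmovable(mapping);
	}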


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM
  2023-10-27 18:21 ` [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM Sean Christopherson
@ 2023-10-30 17:30   ` Paolo Bonzini
  2023-11-02 16:24   ` Christian Brauner
  1 sibling, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 17:30 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> Export anon_inode_getfile_secure() so that it can be used by KVM to 
> create and manage file-based guest memory without need a fullblow 

without introducing a full-blown

Otherwise,

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo

> filesystem. The "standard" anon_inode_getfd() doesn't work for KVM's use 
> case as KVM needs a unique inode for each file, e.g. to be able to 
> independently manage the size and lifecycle of a given file.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
  2023-10-27 18:22 ` [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN Sean Christopherson
@ 2023-10-30 17:31   ` Paolo Bonzini
  2023-11-02 14:16   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 17:31 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:22, Sean Christopherson wrote:
> Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
> the probability of exiting to userspace with a stale run->exit_reason that
> *appears* to be valid.
> 
> To support fd-based guest memory (guest memory without a corresponding
> userspace virtual address), KVM will exit to userspace for various memory
> related errors, which userspace *may* be able to resolve, instead of using
> e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
> utilize the same functionality to let userspace "intercept" and handle
> memory faults when the userspace mapping is missing, i.e. when fast gup()
> fails.
> 
> Because many of KVM's internal APIs related to guest memory use '0' to
> indicate "success, continue on" and not "exit to userspace", reporting
> memory faults/errors to userspace will set run->exit_reason and
> corresponding fields in the run structure in conjunction with a non-zero,
> negative return code, e.g. -EFAULT or -EHWPOISON.  And because KVM already
> returns -EFAULT in many paths, there's a relatively high
> probability that KVM could return -EFAULT without setting run->exit_reason,
> in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
> whatever exit reason happened to be in the run structure.
> 
> Note, KVM must wait until after run->immediate_exit is serviced to
> sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
> preserved across KVM_RUN when run->immediate_exit is true.
> 
> Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoorthy@google.com
> Link: https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/x86.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ee3cd8c3c0ef..f41dbb1465a0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10963,6 +10963,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>   {
>   	int r;
>   
> +	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
>   	vcpu->arch.l1tf_flush_l1d = true;
>   
>   	for (;;) {

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM
  2023-10-27 18:22 ` [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM Sean Christopherson
@ 2023-10-30 17:34   ` Paolo Bonzini
  2023-11-02 14:52   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 17:34 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:22, Sean Christopherson wrote:
> Let x86 track the number of address spaces on a per-VM basis so that KVM
> can disallow SMM memslots for confidential VMs.  Confidential VMs are
> fundamentally incompatible with emulating SMM, which as the name suggests
> requires being able to read and write guest memory and register state.
> 
> Disallowing SMM will simplify support for guest private memory, as KVM
> will not need to worry about tracking memory attributes for multiple
> address spaces (SMM is the only "non-default" address space across all
> architectures).

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory
  2023-10-27 18:22 ` [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory Sean Christopherson
@ 2023-10-30 17:36   ` Paolo Bonzini
  2023-11-06 11:00   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 17:36 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:22, Sean Christopherson wrote:
> Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
> and testing vehicle for Confidential (CoCo) VMs, and potentially to even
> become a "real" product in the distant future, e.g. a la pKVM.
> 
> The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
> Intel's TDX, but those technologies are extremely complex (understatement),
> difficult to debug, don't support running as nested guests, and require
> hardware that isn't universally accessible.  I.e. relying on SEV-SNP or TDX
> for maintaining guest private memory isn't a realistic option.
> 
> At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
> selftests for guest_memfd and private memory support without requiring
> unique hardware.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

with one nit:

> +---------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: system ioctl
> +
> +This capability returns a bitmap of support VM types.  The 1-setting of bit @n

s/support/supported/

Paolo
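
A bare-bones sketch (illustrative only, error handling omitted) of creating
such a VM from userspace:

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int create_sw_protected_vm(void)
	{
		int kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);

		/* The VM type is passed as the KVM_CREATE_VM argument. */
		return ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_SW_PROTECTED_VM);
	}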


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes
  2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
                   ` (34 preceding siblings ...)
  2023-10-27 18:22 ` [PATCH v13 35/35] KVM: selftests: Test KVM exit behavior for private memory/access Sean Christopherson
@ 2023-10-30 17:39 ` Paolo Bonzini
  35 siblings, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 17:39 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/27/23 20:21, Sean Christopherson wrote:
> Non-KVM people, please take a gander at two small-ish patches buried in the
> middle of this series:
> 
>    fs: Export anon_inode_getfile_secure() for use by KVM
>    mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
> 
> Our plan/hope is to take this through the KVM tree for 6.8, reviews (and acks!)
> would be much appreciated.  Note, adding AS_UNMOVABLE isn't strictly required as
> it's "just" an optimization, but we'd prefer to have it in place straightaway.

Reporting what I wrote in the other thread, for wider distribution:

I'm going to wait a couple days more for reviews to come in, post a v14
myself, and apply the series to kvm/next as soon as Linus merges the 6.7
changes.  The series will be based on the 6.7 tags/for-linus, and when
6.7-rc1 comes up, I'll do this to straighten the history:

	git checkout kvm/next
	git tag -s -f kvm-gmem HEAD
	git reset --hard v6.7-rc1
	git merge tags/kvm-gmem
	# fix conflict with Christian Brauner's VFS series
	git commit
	git push kvm

6.8 is not going to be out for four months, and I'm pretty sure that
anything that would be discovered within "a few weeks" can also be
applied on top, and the heaviness of a 35-patch series will outweigh any
imperfections by a long margin.

(Full disclosure: this is _also_ because I want to apply this series to
the RHEL kernel, and Red Hat has a high level of disdain for
non-upstream patches.  But it's mostly because I want all dependencies
to be able to move on and be developed on top of stock kvm/next).

Paolo


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-10-30 16:53   ` David Matlack
  2023-10-30 17:00     ` Paolo Bonzini
@ 2023-10-30 18:19     ` David Matlack
  1 sibling, 0 replies; 148+ messages in thread
From: David Matlack @ 2023-10-30 18:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Mon, Oct 30, 2023 at 9:53 AM David Matlack <dmatlack@google.com> wrote:
>
> On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > From: Chao Peng <chao.p.peng@linux.intel.com>
> >
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
>
> What is the reason to separate range_add() from begin()?

Nevermind, I see how it's needed in kvm_mmu_unmap_gfn_range().
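
For reference, a condensed sketch of that caller (paraphrasing the patch,
with the arch hook left as-is) shows why the add is a separate step: a
single begin()/end() pair can cover multiple discontiguous gfn ranges.

	bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
	{
		kvm_mmu_invalidate_range_add(kvm, range->start, range->end);

		return kvm_unmap_gfn_range(kvm, range);
	}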

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-10-30 17:00     ` Paolo Bonzini
@ 2023-10-30 18:21       ` David Matlack
  0 siblings, 0 replies; 148+ messages in thread
From: David Matlack @ 2023-10-30 18:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Mon, Oct 30, 2023 at 10:01 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On Mon, Oct 30, 2023 at 5:53 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > > From: Chao Peng <chao.p.peng@linux.intel.com>
> > >
> > > Currently in mmu_notifier invalidate path, hva range is recorded and
> > > then checked against by mmu_notifier_retry_hva() in the page fault
> > > handling path. However, for the to be introduced private memory, a page
> >                           ^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Is there a missing word here?
>
> No, but there could be missing hyphens ("for the to-be-introduced
> private memory"); possibly a "soon" could help parsing. Is that
> what you were talking about?

Ah that explains it :)

>
> > >       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > +             kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > +             kvm->mmu_invalidate_range_end = INVALID_GPA;
> >
> > I don't think this is incorrect, but I was a little surprised to see this
> > here rather than in end() when mmu_invalidate_in_progress decrements to
> > 0.
>
> I think that would be incorrect on the very first start?

Good point. KVM could initialize start/end before registering
notifiers, but that's extra code.
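
For reference, a condensed sketch of the begin() path (paraphrasing the
quoted patch, details trimmed) illustrating why the reset lives there:

	void kvm_mmu_invalidate_begin(struct kvm *kvm)
	{
		lockdep_assert_held_write(&kvm->mmu_lock);

		kvm->mmu_invalidate_in_progress++;

		/*
		 * Only the outermost begin() resets the range; nested
		 * invalidations accumulate ranges via range_add().  Doing the
		 * reset in end() instead would leave the very first
		 * invalidation looking at the zero-initialized start/end, and
		 * 0 is a perfectly valid GPA.
		 */
		if (likely(kvm->mmu_invalidate_in_progress == 1)) {
			kvm->mmu_invalidate_range_start = INVALID_GPA;
			kvm->mmu_invalidate_range_end = INVALID_GPA;
		}
	}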

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  2023-10-30 16:41   ` Paolo Bonzini
@ 2023-10-30 20:25     ` Sean Christopherson
  2023-10-30 22:12       ` Sean Christopherson
  2023-10-30 23:22       ` Paolo Bonzini
  0 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-30 20:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> On 10/27/23 20:21, Sean Christopherson wrote:
> > 
> > +		if (ioctl == KVM_SET_USER_MEMORY_REGION)
> > +			size = sizeof(struct kvm_userspace_memory_region);
> 
> This also needs a memset(&mem, 0, sizeof(mem)), otherwise the out-of-bounds
> access of the commit message becomes a kernel stack read.

Ouch.  There's some irony.  Might be worth doing memset(&mem, -1, sizeof(mem))
though as '0' is a valid file descriptor and a valid file offset.

> Probably worth adding a check on valid flags here.

Definitely needed.  There's a very real bug here.  But rather than duplicate flags
checking or plumb @ioctl all the way to __kvm_set_memory_region(), now that we
have the fancy guard(mutex) and there are no internal calls to kvm_set_memory_region(),
what if we:

  1. Acquire/release slots_lock in __kvm_set_memory_region()
  2. Call kvm_set_memory_region() from x86 code for the internal memslots
  3. Disallow *any* flags for internal memslots
  4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()
  5. Pass @ioctl to kvm_vm_ioctl_set_memory_region() and allow KVM_MEM_PRIVATE
     only for KVM_SET_USER_MEMORY_REGION2

E.g. this over ~5 patches

---
 arch/x86/kvm/x86.c       |  2 +-
 include/linux/kvm_host.h |  4 +--
 virt/kvm/kvm_main.c      | 65 +++++++++++++++++-----------------------
 3 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e3eb608b6692..dd3e2017366c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12478,7 +12478,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 		m.guest_phys_addr = gpa;
 		m.userspace_addr = hva;
 		m.memory_size = size;
-		r = __kvm_set_memory_region(kvm, &m);
+		r = kvm_set_memory_region(kvm, &m);
 		if (r < 0)
 			return ERR_PTR_USR(r);
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 687589ce9f63..fbb98efe8200 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1170,7 +1170,7 @@ static inline bool kvm_memslot_iter_is_valid(struct kvm_memslot_iter *iter, gfn_
  *   -- just change its flags
  *
  * Since flags can be changed by some of these operations, the following
- * differentiation is the best we can do for __kvm_set_memory_region():
+ * differentiation is the best we can do for __kvm_set_memory_region().
  */
 enum kvm_mr_change {
 	KVM_MR_CREATE,
@@ -1181,8 +1181,6 @@ enum kvm_mr_change {
 
 int kvm_set_memory_region(struct kvm *kvm,
 			  const struct kvm_userspace_memory_region2 *mem);
-int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region2 *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 23633984142f..39ceee2f67f2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,28 +1608,6 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(struct kvm *kvm,
-				     const struct kvm_userspace_memory_region2 *mem)
-{
-	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
-
-	if (kvm_arch_has_private_mem(kvm))
-		valid_flags |= KVM_MEM_PRIVATE;
-
-	/* Dirty logging private memory is not currently supported. */
-	if (mem->flags & KVM_MEM_PRIVATE)
-		valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
-
-#ifdef __KVM_HAVE_READONLY_MEM
-	valid_flags |= KVM_MEM_READONLY;
-#endif
-
-	if (mem->flags & ~valid_flags)
-		return -EINVAL;
-
-	return 0;
-}
-
 static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
 {
 	struct kvm_memslots *slots = kvm_get_inactive_memslots(kvm, as_id);
@@ -2014,11 +1992,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * space.
  *
  * Discontiguous memory is allowed, mostly for framebuffers.
- *
- * Must be called holding kvm->slots_lock for write.
  */
-int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region2 *mem)
+static int __kvm_set_memory_region(struct kvm *kvm,
+				   const struct kvm_userspace_memory_region2 *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -2028,9 +2004,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(kvm, mem);
-	if (r)
-		return r;
+	guard(mutex)(&kvm->slots_lock);
 
 	as_id = mem->slot >> 16;
 	id = (u16)mem->slot;
@@ -2139,27 +2113,42 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	kfree(new);
 	return r;
 }
-EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
 			  const struct kvm_userspace_memory_region2 *mem)
 {
-	int r;
+	/* Flags aren't supported for KVM-internal memslots. */
+	if (WARN_ON_ONCE(mem->flags))
+		return -EINVAL;
 
-	mutex_lock(&kvm->slots_lock);
-	r = __kvm_set_memory_region(kvm, mem);
-	mutex_unlock(&kvm->slots_lock);
-	return r;
+	return __kvm_set_memory_region(kvm, mem);
 }
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
-static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
+static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm, unsigned int ioctl,
 					  struct kvm_userspace_memory_region2 *mem)
 {
+	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+
+	if (ioctl == KVM_SET_USER_MEMORY_REGION2 &&
+	    kvm_arch_has_private_mem(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+
+	/* Dirty logging private memory is not currently supported. */
+	if (mem->flags & KVM_MEM_PRIVATE)
+		valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
+
+#ifdef __KVM_HAVE_READONLY_MEM
+	valid_flags |= KVM_MEM_READONLY;
+#endif
+
+	if (mem->flags & ~valid_flags)
+		return -EINVAL;
+
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
 
-	return kvm_set_memory_region(kvm, mem);
+	return __kvm_set_memory_region(kvm, mem);
 }
 
 #ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
@@ -5145,7 +5134,7 @@ static long kvm_vm_ioctl(struct file *filp,
 		if (copy_from_user(&mem, argp, size))
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, ioctl, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {

base-commit: 881375a408c0f4ea451ff14545b59216d2923881
-- 


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
  2023-10-30 16:10     ` Sean Christopherson
@ 2023-10-30 22:05       ` Sean Christopherson
  0 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-30 22:05 UTC (permalink / raw)
  To: Chao Gao
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Mon, Oct 30, 2023, Sean Christopherson wrote:
> On Mon, Oct 30, 2023, Chao Gao wrote:
> > On Fri, Oct 27, 2023 at 11:21:55AM -0700, Sean Christopherson wrote:
> > >From: Chao Peng <chao.p.peng@linux.intel.com>
> > >
> > >In confidential computing usages, whether a page is private or shared is
> > >necessary information for KVM to perform operations like page fault
> > >handling, page zapping etc. There are other potential use cases for
> > >per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > >or exec-only, etc.) without having to modify memslots.
> > >
> > >Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > >userspace to operate on the per-page memory attributes.
> > >  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > >    a guest memory range.
> > 
> > >  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > >    memory attributes.
> > 
> > This ioctl() is already removed. So, the changelog is out-of-date and needs
> > an update.
> 
> Doh, I lost track of this and the fixup for KVM_CAP_MEMORY_ATTRIBUTES below.
> 
> > >+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > >+:Architectures: x86
> > >+:Type: vm ioctl
> > >+:Parameters: struct kvm_memory_attributes(in)
> > 
> > 					   ^ add one space here?
> 
> Ah, yeah, that does appear to be the standard.
> > 
> > 
> > >+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
> > >+					  struct kvm_gfn_range *range)
> > >+{
> > >+	/*
> > >+	 * Unconditionally add the range to the invalidation set, regardless of
> > >+	 * whether or not the arch callback actually needs to zap SPTEs.  E.g.
> > >+	 * if KVM supports RWX attributes in the future and the attributes are
> > >+	 * going from R=>RW, zapping isn't strictly necessary.  Unconditionally
> > >+	 * adding the range allows KVM to require that MMU invalidations add at
> > >+	 * least one range between begin() and end(), e.g. allows KVM to detect
> > >+	 * bugs where the add() is missed.  Rexlaing the rule *might* be safe,
> > 
> > 					    ^^^^^^^^ Relaxing
> > 
> > >@@ -4640,6 +4850,17 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > > 	case KVM_CAP_BINARY_STATS_FD:
> > > 	case KVM_CAP_SYSTEM_EVENT_DATA:
> > > 		return 1;
> > >+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > >+	case KVM_CAP_MEMORY_ATTRIBUTES:
> > >+		u64 attrs = kvm_supported_mem_attributes(kvm);
> > >+
> > >+		r = -EFAULT;
> > >+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> > >+			goto out;
> > >+		r = 0;
> > >+		break;
> > 
> > This cannot work, e.g. there is no @argp in this function; it is fixed by a later commit:
> > 
> > 	fcbef1e5e5d2 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
> 
> I'll post a fixup patch for all of these, thanks much!

Heh, that was an -ENOCOFFEE.  Fixup patches for a changelog goof and an ephemeral
bug are going to be hard to post.

Paolo, do you want to take care of all of these fixups and typos, or would you
prefer that I start a v14 branch and then hand it off to you at some point?

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
  2023-10-30 17:21   ` Paolo Bonzini
@ 2023-10-30 22:07     ` Sean Christopherson
  0 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-30 22:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> On 10/27/23 20:21, Sean Christopherson wrote:
> > @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> >  			 * the second or later invocation of the handler).
> >  			 */
> >  			gfn_range.arg = range->arg;
> > +
> > +			/*
> > +			 * HVA-based notifications aren't relevant to private
> > +			 * mappings as they don't have a userspace mapping.
> 
> It's confusing who "they" is.  Maybe
> 
> 	 * HVA-based notifications provide a userspace address,
> 	 * and as such are only relevant for shared mappings.

Works for me.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  2023-10-30 20:25     ` Sean Christopherson
@ 2023-10-30 22:12       ` Sean Christopherson
  2023-10-30 23:22       ` Paolo Bonzini
  1 sibling, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-30 22:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Mon, Oct 30, 2023, Sean Christopherson wrote:
> On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> > On 10/27/23 20:21, Sean Christopherson wrote:
> > > 
> > > +		if (ioctl == KVM_SET_USER_MEMORY_REGION)
> > > +			size = sizeof(struct kvm_userspace_memory_region);
> > 
> > This also needs a memset(&mem, 0, sizeof(mem)), otherwise the out-of-bounds
> > access of the commit message becomes a kernel stack read.
> 
> Ouch.  There's some irony.  Might be worth doing memset(&mem, -1, sizeof(mem))
> though as '0' is a valid file descriptor and a valid file offset.
> 
> > Probably worth adding a check on valid flags here.
> 
> Definitely needed.  There's a very real bug here.  But rather than duplicate flags
> checking or plumb @ioctl all the way to __kvm_set_memory_region(), now that we
> have the fancy guard(mutex) and there are no internal calls to kvm_set_memory_region(),
> what if we:
> 
>   1. Acquire/release slots_lock in __kvm_set_memory_region()

Gah, this won't work with x86's usage, which acquires slots_lock quite far away
from __kvm_set_memory_region() in several places.  The rest of the idea still
works, kvm_vm_ioctl_set_memory_region() will just need to take slots_lock manually.

>   2. Call kvm_set_memory_region() from x86 code for the internal memslots
>   3. Disallow *any* flags for internal memslots
>   4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()
>   5. Pass @ioctl to kvm_vm_ioctl_set_memory_region() and allow KVM_MEM_PRIVATE
>      only for KVM_SET_USER_MEMORY_REGION2

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  2023-10-30 20:25     ` Sean Christopherson
  2023-10-30 22:12       ` Sean Christopherson
@ 2023-10-30 23:22       ` Paolo Bonzini
  2023-10-31  0:18         ` Sean Christopherson
  1 sibling, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-30 23:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On 10/30/23 21:25, Sean Christopherson wrote:
> On Mon, Oct 30, 2023, Paolo Bonzini wrote:
>> On 10/27/23 20:21, Sean Christopherson wrote:
>>>
>>> +		if (ioctl == KVM_SET_USER_MEMORY_REGION)
>>> +			size = sizeof(struct kvm_userspace_memory_region);
>>
>> This also needs a memset(&mem, 0, sizeof(mem)), otherwise the out-of-bounds
>> access of the commit message becomes a kernel stack read.
> 
> Ouch.  There's some irony.  Might be worth doing memset(&mem, -1, sizeof(mem))
> though as '0' is a valid file descriptor and a valid file offset.

Either is okay, because unless the flags check is screwed up it should
not matter.  The memset is actually unnecessary, though it may be a good
idea anyway to keep it, aka belt-and-suspenders.

>> Probably worth adding a check on valid flags here.
> 
> Definitely needed.  There's a very real bug here.  But rather than duplicate flags
> checking or plumb @ioctl all the way to __kvm_set_memory_region(), now that we
> have the fancy guard(mutex) and there are no internal calls to kvm_set_memory_region(),
> what if we:
> 
>    1. Acquire/release slots_lock in __kvm_set_memory_region()
>    2. Call kvm_set_memory_region() from x86 code for the internal memslots
>    3. Disallow *any* flags for internal memslots
>    4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()

I dislike this step, there is a clear point where all paths meet
(ioctl/internal, locked/unlocked) and that's __kvm_set_memory_region().
I think that's the place where flags should be checked.  (I don't mind
the restriction on internal memslots; it's just that to me it's not a
particularly natural way to structure the checks).

On the other hand, the place where to protect from out-of-bounds
accesses, is the place where you stop caring about struct
kvm_userspace_memory_region vs kvm_userspace_memory_region2 (and
your code gets it right, by dropping "ioctl" as soon as possible).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 87f45aa91ced..fe5a2af14fff 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1635,6 +1635,14 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
  	return true;
  }
  
+/*
+ * Flags that do not access any of the extra space of struct
+ * kvm_userspace_memory_region2.  KVM_SET_USER_MEMORY_REGION_FLAGS
+ * only allows these.
+ */
+#define KVM_SET_USER_MEMORY_REGION_FLAGS \
+	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+
  static int check_memory_region_flags(struct kvm *kvm,
  				     const struct kvm_userspace_memory_region2 *mem)
  {
@@ -5149,10 +5149,16 @@ static long kvm_vm_ioctl(struct file *filp,
  		struct kvm_userspace_memory_region2 mem;
  		unsigned long size;
  
-		if (ioctl == KVM_SET_USER_MEMORY_REGION)
+		if (ioctl == KVM_SET_USER_MEMORY_REGION) {
+			/*
+			 * Fields beyond struct kvm_userspace_memory_region shouldn't be
+			 * accessed, but avoid leaking kernel memory in case of a bug.
+			 */
+			memset(&mem, 0, sizeof(mem));
  			size = sizeof(struct kvm_userspace_memory_region);
-		else
+		} else {
  			size = sizeof(struct kvm_userspace_memory_region2);
+		}
  
  		/* Ensure the common parts of the two structs are identical. */
  		SANITY_CHECK_MEM_REGION_FIELD(slot);
@@ -5165,6 +5167,11 @@ static long kvm_vm_ioctl(struct file *filp,
  		if (copy_from_user(&mem, argp, size))
  			goto out;
  
+		r = -EINVAL;
+		if (ioctl == KVM_SET_USER_MEMORY_REGION &&
+		    (mem->flags & ~KVM_SET_USER_MEMORY_REGION_FLAGS))
+			goto out;
+
  		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
  		break;
  	}


That's a kind of patch that you can't really get wrong (though I have
the brown paper bag ready).

Maintenance-wise it's fine, since flags are being added at a pace of
roughly one every five years, and anyway it's also future-proof: I placed
the #define near check_memory_region_flags so that in five years we remember
to keep it up to date.  But worst case, the new flags will only be allowed
by KVM_SET_USER_MEMORY_REGION2 unnecessarily; there are no security issues
waiting to bite us.

In sum, this is exactly the only kind of fix that should be in the v13->v14
delta.

Paolo


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  2023-10-30 23:22       ` Paolo Bonzini
@ 2023-10-31  0:18         ` Sean Christopherson
  0 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-31  0:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Tue, Oct 31, 2023, Paolo Bonzini wrote:
> On 10/30/23 21:25, Sean Christopherson wrote:
> > > Probably worth adding a check on valid flags here.
> > 
> > Definitely needed.  There's a very real bug here.  But rather than duplicate flags
> > checking or plumb @ioctl all the way to __kvm_set_memory_region(), now that we
> > have the fancy guard(mutex) and there are no internal calls to kvm_set_memory_region(),
> > what if we:
> > 
> >    1. Acquire/release slots_lock in __kvm_set_memory_region()
> >    2. Call kvm_set_memory_region() from x86 code for the internal memslots
> >    3. Disallow *any* flags for internal memslots
> >    4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()
> 
> I dislike this step, there is a clear point where all paths meet
> (ioctl/internal, locked/unlocked) and that's __kvm_set_memory_region().
> I think that's the place where flags should be checked.  (I don't mind
> the restriction on internal memslots; it's just that to me it's not a
> particularly natural way to structure the checks).

Yeah, I just don't like the discrepancy it causes where some flags are explicitly
checked and allowed, while others are allowed here and then later disallowed.

> On the other hand, the place where to protect from out-of-bounds
> accesses, is the place where you stop caring about struct
> kvm_userspace_memory_region vs kvm_userspace_memory_region2 (and
> your code gets it right, by dropping "ioctl" as soon as possible).
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 87f45aa91ced..fe5a2af14fff 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1635,6 +1635,14 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
>  	return true;
>  }
> +/*
> + * Flags that do not access any of the extra space of struct
> + * kvm_userspace_memory_region2.  KVM_SET_USER_MEMORY_REGION_FLAGS
> + * only allows these.
> + */
> +#define KVM_SET_USER_MEMORY_REGION_FLAGS \

Can we name this KVM_SET_USER_MEMORY_REGION_LEGACY_FLAGS, or something equally
horrific?  As is, this sounds way too much like a generic "allowed flags for any
memory region".

Or maybe invert the macro?  I.e. something to make it more obvious that it's
effectively a versioning check, not a generic "what's supported?" check.

#define KVM_SET_USER_MEMORY_FLAGS_V2_ONLY \
	(~(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY))
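
I.e. the check in kvm_vm_ioctl() would then be something like the below (sketch
only, using the hypothetical name above; same placement as your diff):

	r = -EINVAL;
	if (ioctl == KVM_SET_USER_MEMORY_REGION &&
	    (mem.flags & KVM_SET_USER_MEMORY_FLAGS_V2_ONLY))
		goto out;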


> +	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
> +
>  static int check_memory_region_flags(struct kvm *kvm,
>  				     const struct kvm_userspace_memory_region2 *mem)
>  {
> @@ -5149,10 +5149,16 @@ static long kvm_vm_ioctl(struct file *filp,
>  		struct kvm_userspace_memory_region2 mem;
>  		unsigned long size;
> -		if (ioctl == KVM_SET_USER_MEMORY_REGION)
> +		if (ioctl == KVM_SET_USER_MEMORY_REGION) {
> +			/*
> +			 * Fields beyond struct kvm_userspace_memory_region shouldn't be
> +			 * accessed, but avoid leaking kernel memory in case of a bug.
> +			 */
> +			memset(&mem, 0, sizeof(mem));
>  			size = sizeof(struct kvm_userspace_memory_region);
> -		else
> +		} else {
>  			size = sizeof(struct kvm_userspace_memory_region2);
> +		}
>  		/* Ensure the common parts of the two structs are identical. */
>  		SANITY_CHECK_MEM_REGION_FIELD(slot);
> @@ -5165,6 +5167,11 @@ static long kvm_vm_ioctl(struct file *filp,
>  		if (copy_from_user(&mem, argp, size))
>  			goto out;
> +		r = -EINVAL;
> +		if (ioctl == KVM_SET_USER_MEMORY_REGION &&
> +		    (mem.flags & ~KVM_SET_USER_MEMORY_REGION_FLAGS))
> +			goto out;
> +
>  		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>  		break;
>  	}
> 
> 
> That's a kind of patch that you can't really get wrong (though I have
> the brown paper bag ready).
> 
> Maintenance-wise it's fine, since flags are being added at a pace of
> roughly one every five years,

Heh, true.

> and anyway it's also future proof: I placed the #define near
> check_memory_region_flags so that in five years we remember to keep it up to
> date.  But worst case, the new flags will only be allowed by
> KVM_SET_USER_MEMORY_REGION2 unnecessarily; there are no security issues
> waiting to bite us.
>
> In sum, this is exactly the only kind of fix that should be in the v13->v14
> delta.

Boiling the ocean can be fun too ;-)

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  2023-10-27 18:21 ` [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
  2023-10-30 16:41   ` Paolo Bonzini
@ 2023-10-31  2:26   ` Xiaoyao Li
  2023-10-31 14:04     ` Sean Christopherson
  2023-11-01 14:19   ` Fuad Tabba
  2 siblings, 1 reply; 148+ messages in thread
From: Xiaoyao Li @ 2023-10-31  2:26 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy,
	David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
> information can be supplied without setting userspace up to fail.  The
> padding in the new kvm_userspace_memory_region2 structure will be used to
> pass a file descriptor in addition to the userspace_addr, i.e. allow
> userspace to point at a file descriptor and map memory into a guest that
> is NOT mapped into host userspace.
> 
> Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
> without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
> makes detection of bad flags a bit more robust, e.g. if the new fd field
> is guarded only by a flag and not a new ioctl(), then a userspace bug
> (setting a "bad" flag) would generate out-of-bounds access instead of an
> -EINVAL error.
> 
> Cc: Jarkko Sakkinen <jarkko@kernel.org>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++
>   arch/x86/kvm/x86.c             |  2 +-
>   include/linux/kvm_host.h       |  4 ++--
>   include/uapi/linux/kvm.h       | 13 ++++++++++++
>   virt/kvm/kvm_main.c            | 38 +++++++++++++++++++++++++++-------
>   5 files changed, 67 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 21a7578142a1..ace984acc125 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6070,6 +6070,27 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG
>   interface. No error will be returned, but the resulting offset will not be
>   applied.
>   
> +4.139 KVM_SET_USER_MEMORY_REGION2
> +---------------------------------
> +
> +:Capability: KVM_CAP_USER_MEMORY2
> +:Architectures: all
> +:Type: vm ioctl
> +:Parameters: struct kvm_userspace_memory_region2 (in)
> +:Returns: 0 on success, -1 on error
> +
> +::
> +
> +  struct kvm_userspace_memory_region2 {
> +	__u32 slot;
> +	__u32 flags;
> +	__u64 guest_phys_addr;
> +	__u64 memory_size; /* bytes */
> +	__u64 userspace_addr; /* start of the userspace allocated memory */

missing

	__u64 pad[16];

> +  };
> +
> +See KVM_SET_USER_MEMORY_REGION.
> +
>   5. The kvm_run structure
>   ========================
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 41cce5031126..6409914428ca 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12455,7 +12455,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
>   	}
>   
>   	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> -		struct kvm_userspace_memory_region m;
> +		struct kvm_userspace_memory_region2 m;
>   
>   		m.slot = id | (i << 16);
>   		m.flags = 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 5faba69403ac..4e741ff27af3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1146,9 +1146,9 @@ enum kvm_mr_change {
>   };
>   
>   int kvm_set_memory_region(struct kvm *kvm,
> -			  const struct kvm_userspace_memory_region *mem);
> +			  const struct kvm_userspace_memory_region2 *mem);
>   int __kvm_set_memory_region(struct kvm *kvm,
> -			    const struct kvm_userspace_memory_region *mem);
> +			    const struct kvm_userspace_memory_region2 *mem);
>   void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
>   void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
>   int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 13065dd96132..bd1abe067f28 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
>   	__u64 userspace_addr; /* start of the userspace allocated memory */
>   };
>   
> +/* for KVM_SET_USER_MEMORY_REGION2 */
> +struct kvm_userspace_memory_region2 {
> +	__u32 slot;
> +	__u32 flags;
> +	__u64 guest_phys_addr;
> +	__u64 memory_size;
> +	__u64 userspace_addr;
> +	__u64 pad[16];
> +};
> +
>   /*
>    * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
>    * userspace, other bits are reserved for kvm internal use which are defined
> @@ -1192,6 +1202,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_COUNTER_OFFSET 227
>   #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
>   #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
> +#define KVM_CAP_USER_MEMORY2 230
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> @@ -1473,6 +1484,8 @@ struct kvm_vfio_spapr_tce {
>   					struct kvm_userspace_memory_region)
>   #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
>   #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> +#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> +					 struct kvm_userspace_memory_region2)
>   
>   /* enable ucontrol for s390 */
>   struct kvm_s390_ucas_mapping {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6e708017064d..3f5b7c2c5327 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1578,7 +1578,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>   	}
>   }
>   
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_userspace_memory_region2 *mem)
>   {
>   	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>   
> @@ -1980,7 +1980,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>    * Must be called holding kvm->slots_lock for write.
>    */
>   int __kvm_set_memory_region(struct kvm *kvm,
> -			    const struct kvm_userspace_memory_region *mem)
> +			    const struct kvm_userspace_memory_region2 *mem)
>   {
>   	struct kvm_memory_slot *old, *new;
>   	struct kvm_memslots *slots;
> @@ -2084,7 +2084,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>   
>   int kvm_set_memory_region(struct kvm *kvm,
> -			  const struct kvm_userspace_memory_region *mem)
> +			  const struct kvm_userspace_memory_region2 *mem)
>   {
>   	int r;
>   
> @@ -2096,7 +2096,7 @@ int kvm_set_memory_region(struct kvm *kvm,
>   EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>   
>   static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> -					  struct kvm_userspace_memory_region *mem)
> +					  struct kvm_userspace_memory_region2 *mem)
>   {
>   	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
>   		return -EINVAL;
> @@ -4566,6 +4566,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>   {
>   	switch (arg) {
>   	case KVM_CAP_USER_MEMORY:
> +	case KVM_CAP_USER_MEMORY2:
>   	case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
>   	case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
>   	case KVM_CAP_INTERNAL_ERROR_DATA:
> @@ -4821,6 +4822,14 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>   	return fd;
>   }
>   
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
> +do {										\
> +	BUILD_BUG_ON(offsetof(struct kvm_userspace_memory_region, field) !=		\
> +		     offsetof(struct kvm_userspace_memory_region2, field));	\
> +	BUILD_BUG_ON(sizeof_field(struct kvm_userspace_memory_region, field) !=		\
> +		     sizeof_field(struct kvm_userspace_memory_region2, field));	\
> +} while (0)
> +
>   static long kvm_vm_ioctl(struct file *filp,
>   			   unsigned int ioctl, unsigned long arg)
>   {
> @@ -4843,15 +4852,28 @@ static long kvm_vm_ioctl(struct file *filp,
>   		r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
>   		break;
>   	}
> +	case KVM_SET_USER_MEMORY_REGION2:
>   	case KVM_SET_USER_MEMORY_REGION: {
> -		struct kvm_userspace_memory_region kvm_userspace_mem;
> +		struct kvm_userspace_memory_region2 mem;
> +		unsigned long size;
> +
> +		if (ioctl == KVM_SET_USER_MEMORY_REGION)
> +			size = sizeof(struct kvm_userspace_memory_region);
> +		else
> +			size = sizeof(struct kvm_userspace_memory_region2);
> +
> +		/* Ensure the common parts of the two structs are identical. */
> +		SANITY_CHECK_MEM_REGION_FIELD(slot);
> +		SANITY_CHECK_MEM_REGION_FIELD(flags);
> +		SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +		SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +		SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
>   
>   		r = -EFAULT;
> -		if (copy_from_user(&kvm_userspace_mem, argp,
> -						sizeof(kvm_userspace_mem)))
> +		if (copy_from_user(&mem, argp, size))
>   			goto out;
>   
> -		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>   		break;
>   	}
>   	case KVM_GET_DIRTY_LOG: {


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-27 18:21 ` [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
@ 2023-10-31  2:27   ` Xiaoyao Li
  2023-10-31  6:30   ` Chao Gao
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 148+ messages in thread
From: Xiaoyao Li @ 2023-10-31  2:27 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy,
	David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/28/2023 2:21 AM, Sean Christopherson wrote:
...
> +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
> +allows mapping guest_memfd memory into a guest.  All fields shared with
> +KVM_SET_USER_MEMORY_REGION identically.  Userspace can set KVM_MEM_PRIVATE in
> +flags to have KVM bind the memory region to a given guest_memfd range of
> +[guest_memfd_offset, guest_memfd_offset + memory_size].  The target guest_memfd
> +must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
> +the target range must not be bound to any other memory region.  All standard
> +bounds checks apply (use common sense).
> +
>   ::
>   
>     struct kvm_userspace_memory_region2 {
> @@ -6087,9 +6096,24 @@ applied.
>   	__u64 guest_phys_addr;
>   	__u64 memory_size; /* bytes */
>   	__u64 userspace_addr; /* start of the userspace allocated memory */
> +  __u64 guest_memfd_offset;

missing a tab

> +	__u32 guest_memfd;
> +	__u32 pad1;
> +	__u64 pad2[14];
>     };
>   


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-27 18:21 ` [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
  2023-10-31  2:27   ` Xiaoyao Li
@ 2023-10-31  6:30   ` Chao Gao
  2023-10-31 14:10     ` Sean Christopherson
  2023-10-31 15:05   ` Fuad Tabba
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 148+ messages in thread
From: Chao Gao @ 2023-10-31  6:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

>+int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>+{
>+	loff_t size = args->size;
>+	u64 flags = args->flags;
>+	u64 valid_flags = 0;
>+
>+	if (flags & ~valid_flags)
>+		return -EINVAL;
>+
>+	if (size < 0 || !PAGE_ALIGNED(size))
>+		return -EINVAL;

is size == 0 a valid case?

>+int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>+		  unsigned int fd, loff_t offset)
>+{
>+	loff_t size = slot->npages << PAGE_SHIFT;
>+	unsigned long start, end;
>+	struct kvm_gmem *gmem;
>+	struct inode *inode;
>+	struct file *file;
>+
>+	BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
>+
>+	file = fget(fd);
>+	if (!file)
>+		return -EBADF;
>+
>+	if (file->f_op != &kvm_gmem_fops)
>+		goto err;
>+
>+	gmem = file->private_data;
>+	if (gmem->kvm != kvm)
>+		goto err;
>+
>+	inode = file_inode(file);
>+
>+	if (offset < 0 || !PAGE_ALIGNED(offset))
>+		return -EINVAL;

		goto err;

or hoist the check above fget().
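
i.e. (sketch):

	if (offset < 0 || !PAGE_ALIGNED(offset))
		return -EINVAL;

	file = fget(fd);
	if (!file)
		return -EBADF;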

>+
>+	if (offset + size > i_size_read(inode))
>+		goto err;
>+
>+	filemap_invalidate_lock(inode->i_mapping);
>+
>+	start = offset >> PAGE_SHIFT;
>+	end = start + slot->npages;
>+
>+	if (!xa_empty(&gmem->bindings) &&
>+	    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
>+		filemap_invalidate_unlock(inode->i_mapping);
>+		goto err;
>+	}
>+
>+	/*
>+	 * No synchronize_rcu() needed, any in-flight readers are guaranteed to
>+	 * see either a NULL file or this new file, no need for them to go
>+	 * away.
>+	 */
>+	rcu_assign_pointer(slot->gmem.file, file);
>+	slot->gmem.pgoff = start;
>+
>+	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
>+	filemap_invalidate_unlock(inode->i_mapping);
>+
>+	/*
>+	 * Drop the reference to the file, even on success.  The file pins KVM,
>+	 * not the other way 'round.  Active bindings are invalidated if the
>+	 * file is closed before memslots are destroyed.
>+	 */
>+	fput(file);
>+	return 0;
>+
>+err:
>+	fput(file);
>+	return -EINVAL;

The cleanup, i.e., filemap_invalidate_unlock() and fput(), is common. So, I think it
may be slightly better to consolidate the common part e.g.,

	int ret = -EINVAL;

	...

	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
	ret = 0;
unlock:
	filemap_invalidate_unlock(inode->i_mapping);
err:
	/*
	 * Drop the reference to the file, even on success.  The file pins KVM,
	 * not the other way 'round.  Active bindings are invalidated if the
	 * file is closed before memslots are destroyed.
	 */
	fput(file);
	return ret;

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-10-27 18:21 ` [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory Sean Christopherson
@ 2023-10-31  8:35   ` Xiaoyao Li
  2023-10-31 14:16     ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Xiaoyao Li @ 2023-10-31  8:35 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton
  Cc: kvm, linux-arm-kernel, kvmarm, linux-mips, linuxppc-dev,
	kvm-riscv, linux-riscv, linux-fsdevel, linux-mm, linux-kernel,
	Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy,
	David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> Extended guest_memfd to allow backing guest memory with transparent 
> hugepages. Require userspace to opt-in via a flag even though there's no 
> known/anticipated use case for forcing small pages as THP is optional, 
> i.e. to avoid ending up in a situation where userspace is unaware that 
> KVM can't provide hugepages.

Personally, it seems not so "transparent" if it requires userspace to opt in.

People need to 1) check whether the kernel was built with TRANSPARENT_HUGEPAGE
support, or check whether the transparent hugepage sysfs entry exists; 2) get
the maximum supported hugepage size; 3) ensure the size satisfies the
alignment; all before opting in.

Even simpler, userspace can blindly try to create the guest memfd with the
transparent hugepage flag and, on error, fall back to creating it without
the flag.

However, it doesn't look transparent to me.
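
To be concrete, the blind try-then-fallback would be roughly the below in
userspace (sketch; vm_fd and size are placeholders, and the struct/flag names
are the ones I believe patches 16 and 17 add):

	struct kvm_create_guest_memfd args = {
		.size = size,
		.flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
	};
	int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);

	if (fd < 0) {
		/* No THP support (or unaligned size), retry without the flag. */
		args.flags = 0;
		fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
	}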

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  2023-10-31  2:26   ` Xiaoyao Li
@ 2023-10-31 14:04     ` Sean Christopherson
  0 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-31 14:04 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
> > information can be supplied without setting userspace up to fail.  The
> > padding in the new kvm_userspace_memory_region2 structure will be used to
> > pass a file descriptor in addition to the userspace_addr, i.e. allow
> > userspace to point at a file descriptor and map memory into a guest that
> > is NOT mapped into host userspace.
> > 
> > Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
> > without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
> > makes detection of bad flags a bit more robust, e.g. if the new fd field
> > is guarded only by a flag and not a new ioctl(), then a userspace bug
> > (setting a "bad" flag) would generate out-of-bounds access instead of an
> > -EINVAL error.
> > 
> > Cc: Jarkko Sakkinen <jarkko@kernel.org>
> > Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> > Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++
> >   arch/x86/kvm/x86.c             |  2 +-
> >   include/linux/kvm_host.h       |  4 ++--
> >   include/uapi/linux/kvm.h       | 13 ++++++++++++
> >   virt/kvm/kvm_main.c            | 38 +++++++++++++++++++++++++++-------
> >   5 files changed, 67 insertions(+), 11 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 21a7578142a1..ace984acc125 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6070,6 +6070,27 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG
> >   interface. No error will be returned, but the resulting offset will not be
> >   applied.
> > +4.139 KVM_SET_USER_MEMORY_REGION2
> > +---------------------------------
> > +
> > +:Capability: KVM_CAP_USER_MEMORY2
> > +:Architectures: all
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_userspace_memory_region2 (in)
> > +:Returns: 0 on success, -1 on error
> > +
> > +::
> > +
> > +  struct kvm_userspace_memory_region2 {
> > +	__u32 slot;
> > +	__u32 flags;
> > +	__u64 guest_phys_addr;
> > +	__u64 memory_size; /* bytes */
> > +	__u64 userspace_addr; /* start of the userspace allocated memory */
> 
> missing
> 
> 	__u64 pad[16];

I can't even copy+paste correctly :-(

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-31  6:30   ` Chao Gao
@ 2023-10-31 14:10     ` Sean Christopherson
  0 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-31 14:10 UTC (permalink / raw)
  To: Chao Gao
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Tue, Oct 31, 2023, Chao Gao wrote:
> >+int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> >+{
> >+	loff_t size = args->size;
> >+	u64 flags = args->flags;
> >+	u64 valid_flags = 0;
> >+
> >+	if (flags & ~valid_flags)
> >+		return -EINVAL;
> >+
> >+	if (size < 0 || !PAGE_ALIGNED(size))
> >+		return -EINVAL;
> 
> is size == 0 a valid case?

Nope, this is a bug.
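
Presumably the fix is just the below (untested); size is an loff_t, so the
'<= 0' also covers the negative case:

	if (size <= 0 || !PAGE_ALIGNED(size))
		return -EINVAL;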

> >+	if (!xa_empty(&gmem->bindings) &&
> >+	    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
> >+		filemap_invalidate_unlock(inode->i_mapping);
> >+		goto err;
> >+	}
> >+
> >+	/*
> >+	 * No synchronize_rcu() needed, any in-flight readers are guaranteed to
> >+	 * be see either a NULL file or this new file, no need for them to go
> >+	 * see either a NULL file or this new file, no need for them to go
> >+	 */
> >+	rcu_assign_pointer(slot->gmem.file, file);
> >+	slot->gmem.pgoff = start;
> >+
> >+	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
> >+	filemap_invalidate_unlock(inode->i_mapping);
> >+
> >+	/*
> >+	 * Drop the reference to the file, even on success.  The file pins KVM,
> >+	 * not the other way 'round.  Active bindings are invalidated if the
> >+	 * file is closed before memslots are destroyed.
> >+	 */
> >+	fput(file);
> >+	return 0;
> >+
> >+err:
> >+	fput(file);
> >+	return -EINVAL;
> 
> The cleanup, i.e., filemap_invalidate_unlock() and fput(), is common. So, I think it
> may be slightly better to consolidate the common part e.g.,

I would prefer to keep this as-is.  Only one goto needs the unlock, and I find it easier
to understand the success vs. error paths with explicit returns.  But it's not a
super strong preference.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-10-31  8:35   ` Xiaoyao Li
@ 2023-10-31 14:16     ` Sean Christopherson
  2023-11-01  7:25       ` Xiaoyao Li
  0 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-10-31 14:16 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > Extended guest_memfd to allow backing guest memory with transparent
> > hugepages. Require userspace to opt-in via a flag even though there's no
> > known/anticipated use case for forcing small pages as THP is optional,
> > i.e. to avoid ending up in a situation where userspace is unaware that
> > KVM can't provide hugepages.
> 
> Personally, it seems not so "transparent" if it requires userspace to opt in.
> 
> People need to 1) check whether the kernel was built with TRANSPARENT_HUGEPAGE
> support, or check whether the transparent hugepage sysfs entry exists; 2) get
> the maximum supported hugepage size; 3) ensure the size satisfies the
> alignment; all before opting in.
> 
> Even simpler, userspace can blindly try to create the guest memfd with the
> transparent hugepage flag and, on error, fall back to creating it without
> the flag.
> 
> However, it doesn't look transparent to me.

The "transparent" part is referring to the underlying kernel mechanism, it's not
saying anything about the API.  The "transparent" part of THP is that the kernel
doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
(mostly) transparent to userspace.

Paolo also isn't the biggest fan[*], but there are also downsides to always
allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
and there's precedent in the form of MADV_HUGEPAGE.

[*] https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a189@redhat.com
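
E.g. the existing opt-in precedent for regular userspace mappings is simply:

	/* THP for a normal mapping is likewise opt-in, per madvise(2). */
	madvise(addr, len, MADV_HUGEPAGE);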

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-27 18:21 ` [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
  2023-10-31  2:27   ` Xiaoyao Li
  2023-10-31  6:30   ` Chao Gao
@ 2023-10-31 15:05   ` Fuad Tabba
  2023-10-31 22:13     ` Sean Christopherson
  2023-10-31 18:24   ` David Matlack
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 148+ messages in thread
From: Fuad Tabba @ 2023-10-31 15:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Hi,

On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <seanjc@google.com> wrote:

...

> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index e2252c748fd6..e82c69d5e755 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6079,6 +6079,15 @@ applied.
>  :Parameters: struct kvm_userspace_memory_region2 (in)
>  :Returns: 0 on success, -1 on error
>
> +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
> +allows mapping guest_memfd memory into a guest.  All fields shared with
> +KVM_SET_USER_MEMORY_REGION identically.  Userspace can set KVM_MEM_PRIVATE in
> +flags to have KVM bind the memory region to a given guest_memfd range of
> +[guest_memfd_offset, guest_memfd_offset + memory_size].  The target guest_memfd
> +must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
> +the target range must not be bound to any other memory region.  All standard
> +bounds checks apply (use common sense).
> +

Bikeshedding here: Not sure if KVM_MEM_PRIVATE is the best name for
this. It gets confusing with KVM_MEMORY_ATTRIBUTE_PRIVATE, i.e., that
a region marked as KVM_MEM_PRIVATE is only potentially private. It did
confuse the rest of the team when I walked them through a previous
version of this code once. Would something like KVM_MEM_GUESTMEM make
more sense?

>  ::
>
>    struct kvm_userspace_memory_region2 {
> @@ -6087,9 +6096,24 @@ applied.
>         __u64 guest_phys_addr;
>         __u64 memory_size; /* bytes */
>         __u64 userspace_addr; /* start of the userspace allocated memory */
> +  __u64 guest_memfd_offset;
> +       __u32 guest_memfd;
> +       __u32 pad1;
> +       __u64 pad2[14];
>    };
>
> -See KVM_SET_USER_MEMORY_REGION.
> +A KVM_MEM_PRIVATE region _must_ have a valid guest_memfd (private memory) and
> +userspace_addr (shared memory).  However, "valid" for userspace_addr simply
> +means that the address itself must be a legal userspace address.  The backing
> +mapping for userspace_addr is not required to be valid/populated at the time of
> +KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
> +on-demand.

Regarding requiring that a private region have both a valid
guest_memfd and a userspace_addr, should this be
implementation-specific? In pKVM at least, all regions for protected
VMs are private, and KVM doesn't care about the host userspace address
for those regions even when part of the memory is shared.

> +When mapping a gfn into the guest, KVM selects shared vs. private, i.e consumes
> +userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
> +state.  At VM creation time, all memory is shared, i.e. the PRIVATE attribute
> +is '0' for all gfns.  Userspace can control whether memory is shared/private by
> +toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.

In pKVM, guest memory is private by default, and most of it will
remain so for the lifetime of the VM. Userspace could explicitly mark
all the guest's memory as private at initialization, but defaulting to private
would save a slight amount of work. That said, I understand that it might be
better to be consistent across implementations.

...

> --- /dev/null
> +++ b/virt/kvm/guest_memfd.c
> @@ -0,0 +1,548 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/backing-dev.h>
> +#include <linux/falloc.h>
> +#include <linux/kvm_host.h>
> +#include <linux/pagemap.h>
> +#include <linux/anon_inodes.h>

nit: should this include be first (to maintain alphabetical ordering
of the includes)?

> +
> +#include "kvm_mm.h"
> +
> +struct kvm_gmem {
> +       struct kvm *kvm;
> +       struct xarray bindings;
> +       struct list_head entry;
> +};
> +
> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> +{
> +       struct folio *folio;
> +
> +       /* TODO: Support huge pages. */
> +       folio = filemap_grab_folio(inode->i_mapping, index);
> +       if (IS_ERR_OR_NULL(folio))
> +               return NULL;
> +
> +       /*
> +        * Use the up-to-date flag to track whether or not the memory has been
> +        * zeroed before being handed off to the guest.  There is no backing
> +        * storage for the memory, so the folio will remain up-to-date until
> +        * it's removed.
> +        *
> +        * TODO: Skip clearing pages when trusted firmware will do it when
> +        * assigning memory to the guest.
> +        */
> +       if (!folio_test_uptodate(folio)) {
> +               unsigned long nr_pages = folio_nr_pages(folio);
> +               unsigned long i;
> +
> +               for (i = 0; i < nr_pages; i++)
> +                       clear_highpage(folio_page(folio, i));
> +
> +               folio_mark_uptodate(folio);
> +       }
> +
> +       /*
> +        * Ignore accessed, referenced, and dirty flags.  The memory is
> +        * unevictable and there is no storage to write back to.
> +        */
> +       return folio;
> +}
> +
> +static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> +                                     pgoff_t end)
> +{
> +       bool flush = false, found_memslot = false;
> +       struct kvm_memory_slot *slot;
> +       struct kvm *kvm = gmem->kvm;
> +       unsigned long index;
> +
> +       xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> +               pgoff_t pgoff = slot->gmem.pgoff;
> +
> +               struct kvm_gfn_range gfn_range = {
> +                       .start = slot->base_gfn + max(pgoff, start) - pgoff,
> +                       .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
> +                       .slot = slot,
> +                       .may_block = true,
> +               };
> +
> +               if (!found_memslot) {
> +                       found_memslot = true;
> +
> +                       KVM_MMU_LOCK(kvm);
> +                       kvm_mmu_invalidate_begin(kvm);
> +               }
> +
> +               flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
> +       }
> +
> +       if (flush)
> +               kvm_flush_remote_tlbs(kvm);
> +
> +       if (found_memslot)
> +               KVM_MMU_UNLOCK(kvm);
> +}
> +
> +static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> +                                   pgoff_t end)
> +{
> +       struct kvm *kvm = gmem->kvm;
> +
> +       if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
> +               KVM_MMU_LOCK(kvm);
> +               kvm_mmu_invalidate_end(kvm);
> +               KVM_MMU_UNLOCK(kvm);
> +       }
> +}
> +
> +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> +{
> +       struct list_head *gmem_list = &inode->i_mapping->private_list;
> +       pgoff_t start = offset >> PAGE_SHIFT;
> +       pgoff_t end = (offset + len) >> PAGE_SHIFT;
> +       struct kvm_gmem *gmem;
> +
> +       /*
> +        * Bindings must stable across invalidation to ensure the start+end

nit: Bindings must _be/stay?_ stable

...

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 78a0b09ef2a5..5d1a2f1b4e94 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -798,7 +798,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
>         }
>  }
>
> -static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>         kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
>         return kvm_unmap_gfn_range(kvm, range);
> @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>  /* This does not remove the slot from struct kvm_memslots data structures */
>  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
> +       if (slot->flags & KVM_MEM_PRIVATE)
> +               kvm_gmem_unbind(slot);
> +

Should this be called after kvm_arch_free_memslot()? Arch-specific code
might need some of the data before the unbinding, something I thought
might be necessary at one point for the pKVM port when deleting a
memslot, but realized later that kvm_invalidate_memslot() ->
kvm_arch_guest_memory_reclaimed() was the more logical place for it.
Also, since that seems to be the pattern for arch-specific handlers in
KVM.

>         kvm_destroy_dirty_bitmap(slot);
>
>         kvm_arch_free_memslot(kvm, slot);

...

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
  2023-10-27 18:21 ` [PATCH v13 13/35] KVM: Introduce per-page memory attributes Sean Christopherson
  2023-10-30  8:11   ` Chao Gao
@ 2023-10-31 16:43   ` David Matlack
  2023-11-02  3:01   ` Huang, Kai
  2 siblings, 0 replies; 148+ messages in thread
From: David Matlack @ 2023-10-31 16:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 89c1a991a3b8..df573229651b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -808,6 +809,9 @@ struct kvm {
>  
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  	struct notifier_block pm_notifier;
> +#endif
> +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +	struct xarray mem_attr_array;

Please document how access to mem_attr_array is synchronized. If I'm
reading the code correctly I think it's...

  /* Protected by slots_lock (for writes) and RCU (for reads) */
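
i.e. the usage pattern would be something like the below (sketch of my reading
of the code, not a literal excerpt from the series):

	/* Readers are safe via RCU, e.g. through xa_load(): */
	attrs = xa_to_value(xa_load(&kvm->mem_attr_array, gfn));

	/* Writers serialize on slots_lock: */
	mutex_lock(&kvm->slots_lock);
	xa_store(&kvm->mem_attr_array, gfn, xa_mk_value(attrs), GFP_KERNEL_ACCOUNT);
	mutex_unlock(&kvm->slots_lock);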

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-27 18:21 ` [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
                     ` (2 preceding siblings ...)
  2023-10-31 15:05   ` Fuad Tabba
@ 2023-10-31 18:24   ` David Matlack
  2023-10-31 21:36     ` Sean Christopherson
  2023-11-03  9:42   ` Fuad Tabba
  2023-11-04 10:26   ` Xu Yilun
  5 siblings, 1 reply; 148+ messages in thread
From: David Matlack @ 2023-10-31 18:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
> memory that is tied to a specific KVM virtual machine and whose primary
> purpose is to serve guest memory.
> 
> A guest-first memory subsystem allows for optimizations and enhancements
> that are kludgy or outright infeasible to implement/support in a generic
> memory subsystem.  With guest_memfd, guest protections and mapping sizes
> are fully decoupled from host userspace mappings.   E.g. KVM currently
> doesn't support mapping memory as writable in the guest without it also
> being writable in host userspace, as KVM's ABI uses VMA protections to
> define the allowed guest protections.  Userspace can fudge this by
> establishing two mappings, a writable mapping for the guest and readable
> one for itself, but that’s suboptimal on multiple fronts.
> 
> Similarly, KVM currently requires the guest mapping size to be a strict
> subset of the host userspace mapping size, e.g. KVM doesn’t support
> creating a 1GiB guest mapping unless userspace also has a 1GiB guest
> mapping.  Decoupling the mappings sizes would allow userspace to precisely
> map only what is needed without impacting guest performance, e.g. to
> harden against unintentional accesses to guest memory.
> 
> Decoupling guest and userspace mappings may also allow for a cleaner
> alternative to high-granularity mappings for HugeTLB, which has reached a
> bit of an impasse and is unlikely to ever be merged.
> 
> A guest-first memory subsystem also provides clearer line of sight to
> things like a dedicated memory pool (for slice-of-hardware VMs) and
> elimination of "struct page" (for offload setups where userspace _never_
> needs to mmap() guest memory).

All of these use-cases involve using guest_memfd for shared pages, but
this entire series sets up KVM to only use guest_memfd for private
pages.

For example, the per-page attributes are a property of a KVM VM, not the
underlying guest_memfd. So that implies we will need separate
guest_memfds for private and shared pages. But a given memslot can have
a mix of private and shared pages. So that implies a memslot will need
to support 2 guest_memfds? But the UAPI only allows 1 and uses the HVA
for shared mappings.

My initial reaction after reading through this series is that the
per-page private/shared should be a property of the guest_memfd, not the
VM. Maybe it would even be cleaner in the long-run to make all memory
attributes a property of the guest_memfd. That way we can scope the
support to only guest_memfds and not have to worry about making per-page
attributes work with "legacy" HVA-based memslots.

Maybe can you sketch out how you see this proposal being extensible to
using guest_memfd for shared mappings?

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-31 18:24   ` David Matlack
@ 2023-10-31 21:36     ` Sean Christopherson
  2023-10-31 22:39       ` David Matlack
  0 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-10-31 21:36 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Tue, Oct 31, 2023, David Matlack wrote:
> On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
> > memory that is tied to a specific KVM virtual machine and whose primary
> > purpose is to serve guest memory.
> > 
> > A guest-first memory subsystem allows for optimizations and enhancements
> > that are kludgy or outright infeasible to implement/support in a generic
> > memory subsystem.  With guest_memfd, guest protections and mapping sizes
> > are fully decoupled from host userspace mappings.   E.g. KVM currently
> > doesn't support mapping memory as writable in the guest without it also
> > being writable in host userspace, as KVM's ABI uses VMA protections to
> > define the allowed guest protections.  Userspace can fudge this by
> > establishing two mappings, a writable mapping for the guest and readable
> > one for itself, but that’s suboptimal on multiple fronts.
> > 
> > Similarly, KVM currently requires the guest mapping size to be a strict
> > subset of the host userspace mapping size, e.g. KVM doesn’t support
> > creating a 1GiB guest mapping unless userspace also has a 1GiB guest
> > mapping.  Decoupling the mappings sizes would allow userspace to precisely
> > map only what is needed without impacting guest performance, e.g. to
> > harden against unintentional accesses to guest memory.
> > 
> > Decoupling guest and userspace mappings may also allow for a cleaner
> > alternative to high-granularity mappings for HugeTLB, which has reached a
> > bit of an impasse and is unlikely to ever be merged.
> > 
> > A guest-first memory subsystem also provides clearer line of sight to
> > things like a dedicated memory pool (for slice-of-hardware VMs) and
> > elimination of "struct page" (for offload setups where userspace _never_
> > needs to mmap() guest memory).
> 
> All of these use-cases involve using guest_memfd for shared pages, but
> this entire series sets up KVM to only use guest_memfd for private
> pages.
> 
> For example, the per-page attributes are a property of a KVM VM, not the
> underlying guest_memfd. So that implies we will need separate
> guest_memfds for private and shared pages. But a given memslot can have
> a mix of private and shared pages. So that implies a memslot will need
> to support 2 guest_memfds?

Yes, someday this may be true.  Allowing guest_memfd (it was probably called
something else at that point) for "regular" memory was discussed in I think v10?
We made a conscious decision to defer supporting 2 guest_memfds because it isn't strictly
necessary to support the TDX/SNP use cases for which all of this was initially
designed, and adding a second guest_memfd and the infrastructure needed to let
userspace map a guest_memfd can be done on top with minimal overhead.

> But the UAPI only allows 1 and uses the HVA for shared mappings.
> 
> My initial reaction after reading through this series is that the
> per-page private/shared should be a property of the guest_memfd, not the
> VM. Maybe it would even be cleaner in the long-run to make all memory
> attributes a property of the guest_memfd. That way we can scope the
> support to only guest_memfds and not have to worry about making per-page
> attributes work with "legacy" HVA-based memslots.

Making the private vs. shared state a property of the guest_memfd doesn't work
for TDX and SNP.  We (upstream x86 and KVM maintainers) have taken a hard stance
that in-place conversion will not be allowed for TDX/SNP due to the ease with
which a misbehaving userspace and/or guest can crash the host.

We'd also be betting that there would *never* be a use case for per-gfn attributes
for non-standard memory, e.g. virtio-gpu buffers, any kind of device memory, etc.

We'd also effectively be signing up to either support swap and page migration in
guest_memfd, or make those mutually exclusive with per-gfn attributes too.

guest_memfd is only intended for guest DRAM, and if I get my way, will never support
swap (page migration is less scary).  I.e. guest_memfd isn't intended to be a
one-size-fits-all solution, nor is it intended to wholesale replace memslots,
which is effectively what we'd be doing by deprecating hva-based guest memory.

And ignoring all that, the ABI would end up being rather bizarre due to the way guest_memfd
interacts with memslots.  guest_memfd itself has no real notion of gfns, i.e. the
shared vs. private state would be tied to a file offset, not a gfn.  That's a solvable
problem, e.g. we could make a gfn:offset binding "sticky", but that would add extra
complexity to the ABI, and AFAICT wouldn't buy us that much, if anything.

> Maybe can you sketch out how you see this proposal being extensible to
> using guest_memfd for shared mappings?

For in-place conversions, e.g. pKVM, no additional guest_memfd is needed.  What's
missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM needs to
ensure there are no outstanding references when converting back to private.

For TDX/SNP, assuming we don't find a performant and robust way to do in-place
conversions, a second fd+offset pair would be needed.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-31 15:05   ` Fuad Tabba
@ 2023-10-31 22:13     ` Sean Christopherson
  2023-10-31 22:18       ` Paolo Bonzini
  2023-11-01 10:51       ` Fuad Tabba
  0 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-10-31 22:13 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Tue, Oct 31, 2023, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <seanjc@google.com> wrote:
> 
> ...
> 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index e2252c748fd6..e82c69d5e755 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6079,6 +6079,15 @@ applied.
> >  :Parameters: struct kvm_userspace_memory_region2 (in)
> >  :Returns: 0 on success, -1 on error
> >
> > +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
> > +allows mapping guest_memfd memory into a guest.  All fields shared with
> > +KVM_SET_USER_MEMORY_REGION identically.  Userspace can set KVM_MEM_PRIVATE in
> > +flags to have KVM bind the memory region to a given guest_memfd range of
> > +[guest_memfd_offset, guest_memfd_offset + memory_size].  The target guest_memfd
> > +must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
> > +the target range must not be bound to any other memory region.  All standard
> > +bounds checks apply (use common sense).
> > +
> 
> Bikeshedding here: Not sure if KVM_MEM_PRIVATE is the best name for
> this. It gets confusing with KVM_MEMORY_ATTRIBUTE_PRIVATE, i.e., that
> a region marked as KVM_MEM_PRIVATE is only potentially private. It did
> confuse the rest of the team when I walked them through a previous
> version of this code once. Would something like KVM_MEM_GUESTMEM make
> more sense?

Heh, deja vu.  We discussed this back in v7[*], and I came to the conclusion that
choosing a name that wasn't explicitly tied to private memory wasn't justified.
But that was before a KVM-owned guest_memfd was even an idea, and thus before we
had anything close to a real use case.

Since we now know that at least pKVM will use guest_memfd for shared memory, and
odds are quite good that "regular" VMs will also do the same, i.e. will want
guest_memfd without the concept of private memory, I agree that we should avoid
PRIVATE.

Though I vote for KVM_MEM_GUEST_MEMFD (or KVM_MEM_GUEST_MEMFD_VALID or
KVM_MEM_USE_GUEST_MEMFD).  I.e. do our best to avoid ambiguity between referring
to "guest memory" at-large and guest_memfd.

Copying a few relevant points from v7 to save a click or three.

 : I don't have a concrete use case (this is a recent idea on my end), but since we're
 : already adding fd-based memory, I can't think of a good reason not to make it more generic
 : for not much extra cost.  And there are definitely classes of VMs for which fd-based
 : memory would Just Work, e.g. large VMs that are never oversubscribed on memory don't
 : need to support reclaim, so the fact that fd-based memslots won't support page aging
 : (among other things) right away is a non-issue.

...

 : Hrm, but basing private memory on top of a generic FD_VALID would effectively require
 : shared memory to use hva-based memslots for confidential VMs.  That'd yield a very
 : weird API, e.g. non-confidential VMs could be backed entirely by fd-based memslots,
 : but confidential VMs would be forced to use hva-based memslots.
 : 
 : Ignore this idea for now.  If there's an actual use case for generic fd-based memory
 : then we'll want a separate flag, fd, and offset, i.e. that support could be added
 : independent of KVM_MEM_PRIVATE.

...

 : One alternative would be to call it KVM_MEM_PROTECTED.  That shouldn't cause
 : problems for the known use of "private" (TDX and SNP), and it gives us a little
 : wiggle room, e.g. if we ever get a use case where VMs can share memory that is
 : otherwise protected.
 : 
 : That's a pretty big "if" though, and odds are good we'd need more memslot flags and
 : fd+offset pairs to allow differentiating "private" vs. "protected-shared" without
 : forcing userspace to punch holes in memslots, so I don't know that hedging now will
 : buy us anything.
 : 
 : So I'd say that if people think KVM_MEM_PRIVATE brings additional and meaningful
 : clarity over KVM_MEM_PROTECTED, then let's go with PRIVATE.  But if PROTECTED is
 : just as good, go with PROTECTED as it gives us a wee bit of wiggle room for the
 : future.

[*] https://lore.kernel.org/all/Yuh0ikhoh+tCK6VW@google.com
 
> > -See KVM_SET_USER_MEMORY_REGION.
> > +A KVM_MEM_PRIVATE region _must_ have a valid guest_memfd (private memory) and
> > +userspace_addr (shared memory).  However, "valid" for userspace_addr simply
> > +means that the address itself must be a legal userspace address.  The backing
> > +mapping for userspace_addr is not required to be valid/populated at the time of
> > +KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
> > +on-demand.
> 
> Regarding requiring that a private region have both a valid
> guest_memfd and a userspace_addr, should this be
> implementation-specific? In pKVM at least, all regions for protected
> VMs are private, and KVM doesn't care about the host userspace address
> for those regions even when part of the memory is shared.

Hmm, as of this patch, no, because the pKVM usage doesn't exist, and this
literally documents the current ABI.  When the pKVM usage lands, the
documentation can be updated to match.

> > +When mapping a gfn into the guest, KVM selects shared vs. private, i.e consumes
> > +userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
> > +state.  At VM creation time, all memory is shared, i.e. the PRIVATE attribute
> > +is '0' for all gfns.  Userspace can control whether memory is shared/private by
> > +toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
> 
> In pKVM, guest memory is private by default, and most of it will
> remain so for the lifetime of the VM. Userspace could explicitly mark
> all the guest's memory as private at initialization, but it would save
> a slight amount of work. That said, I understand that it might be
> better to be consistent across implementations.

Yeah, we discussed this in v12[*].  The default really doesn't matter for memory
overheads or performance once KVM supports range-based xarray entries, and if that
isn't sufficient, KVM can internally invert the polarity of PRIVATE.

But for the ABI, I think we put a stake in the ground and say that all memory is
shared by default.  That way CoCo VMs and regular VMs (i.e VMs without the concept
of private memory) all have the same ABI.  Practically speaking, the cost to pKVM
(and likely every other CoCo VM type) is a single ioctl() during VM creation to
"convert" all memory to private.

[*] https://lore.kernel.org/all/ZRw6X2BptZnRPNK7@google.com
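
For illustration, a rough sketch of that single VM-creation-time ioctl() from
userspace, using the v13 uapi names (which aren't set in stone):

#include <err.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void convert_all_memory_to_private(int vm_fd, uint64_t gpa, uint64_t size)
{
	struct kvm_memory_attributes attrs = {
		.address	= gpa,
		.size		= size,
		.attributes	= KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};

	/* One call covers the entire range, no per-page ioctls needed. */
	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
		err(1, "KVM_SET_MEMORY_ATTRIBUTES");
}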

> > --- /dev/null
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -0,0 +1,548 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/backing-dev.h>
> > +#include <linux/falloc.h>
> > +#include <linux/kvm_host.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/anon_inodes.h>
> 
> nit: should this include be first (to maintain alphabetical ordering
> of the includes)?

Heh, yeah.  I would argue this isn't a nit though ;-)

> > +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> > +{
> > +       struct list_head *gmem_list = &inode->i_mapping->private_list;
> > +       pgoff_t start = offset >> PAGE_SHIFT;
> > +       pgoff_t end = (offset + len) >> PAGE_SHIFT;
> > +       struct kvm_gmem *gmem;
> > +
> > +       /*
> > +        * Bindings must stable across invalidation to ensure the start+end
> 
> nit: Bindings must _be/stay?_ stable

"be" is what's intended.

> ...
> 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 78a0b09ef2a5..5d1a2f1b4e94 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -798,7 +798,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> >         }
> >  }
> >
> > -static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > +bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >         kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> >         return kvm_unmap_gfn_range(kvm, range);
> > @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
> >  /* This does not remove the slot from struct kvm_memslots data structures */
> >  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> >  {
> > +       if (slot->flags & KVM_MEM_PRIVATE)
> > +               kvm_gmem_unbind(slot);
> > +
> 
> Should this be called after kvm_arch_free_memslot()? Arch-specific code
> might need some of the data before the unbinding, something I thought
> might be necessary at one point for the pKVM port when deleting a
> memslot, but realized later that kvm_invalidate_memslot() ->
> kvm_arch_guest_memory_reclaimed() was the more logical place for it.
> Also, since that seems to be the pattern for arch-specific handlers in
> KVM.

Maybe?  But only if we care about symmetry between the allocation and free paths.
I really don't think kvm_arch_free_memslot() should be doing anything beyond a
"pure" free.  E.g. kvm_arch_free_memslot() is also called after moving a memslot,
which hopefully we never actually have to allow for guest_memfd, but any code in
kvm_arch_free_memslot() would bring about "what if" questions regarding memslot
movement.  I.e. the API is intended to be a "free arch metadata associated with
the memslot".

Out of curiosity, what does pKVM need to do at kvm_arch_guest_memory_reclaimed()?

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-31 22:13     ` Sean Christopherson
@ 2023-10-31 22:18       ` Paolo Bonzini
  2023-11-01 10:51       ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-10-31 22:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Fuad Tabba, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Tue, Oct 31, 2023 at 11:13 PM Sean Christopherson <seanjc@google.com> wrote:
> On Tue, Oct 31, 2023, Fuad Tabba wrote:
> > On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <seanjc@google.com> wrote:
> Since we now know that at least pKVM will use guest_memfd for shared memory, and
> odds are quite good that "regular" VMs will also do the same, i.e. will want
> guest_memfd without the concept of private memory, I agree that we should avoid
> PRIVATE.
>
> Though I vote for KVM_MEM_GUEST_MEMFD (or KVM_MEM_GUEST_MEMFD_VALID or
> KVM_MEM_USE_GUEST_MEMFD).  I.e. do our best to avoid ambiguity between referring
> to "guest memory" at-large and guest_memfd.

I was going to propose KVM_MEM_HAS_GUESTMEMFD.  Any option
is okay for me, so if no one complains I'll go for KVM_MEM_GUESTMEMFD
(no underscore because I found the repeated "_MEM" distracting).

Paolo


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-31 21:36     ` Sean Christopherson
@ 2023-10-31 22:39       ` David Matlack
  2023-11-02 15:48         ` Paolo Bonzini
  0 siblings, 1 reply; 148+ messages in thread
From: David Matlack @ 2023-10-31 22:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Tue, Oct 31, 2023 at 2:36 PM Sean Christopherson <seanjc@google.com> wrote:
> On Tue, Oct 31, 2023, David Matlack wrote:
> > On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > > Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
> > > memory that is tied to a specific KVM virtual machine and whose primary
> > > purpose is to serve guest memory.
>
> > Maybe can you sketch out how you see this proposal being extensible to
> > using guest_memfd for shared mappings?
>
> For in-place conversions, e.g. pKVM, no additional guest_memfd is needed.  What's
> missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM needs to
> ensure there are no outstanding references when converting back to private.
>
> For TDX/SNP, assuming we don't find a performant and robust way to do in-place
> conversions, a second fd+offset pair would be needed.

Is there a way to support non-in-place conversions within a single guest_memfd?

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-10-31 14:16     ` Sean Christopherson
@ 2023-11-01  7:25       ` Xiaoyao Li
  2023-11-01 13:41         ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Xiaoyao Li @ 2023-11-01  7:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Xiaoyao Li wrote:
>> On 10/28/2023 2:21 AM, Sean Christopherson wrote:
>>> Extended guest_memfd to allow backing guest memory with transparent
>>> hugepages. Require userspace to opt-in via a flag even though there's no
>>> known/anticipated use case for forcing small pages as THP is optional,
>>> i.e. to avoid ending up in a situation where userspace is unaware that
>>> KVM can't provide hugepages.
>>
>> Personally, it doesn't seem so "transparent" if userspace is required to opt in.
>>
>> Before opting in, people need to 1) check whether the kernel was built with
>> TRANSPARENT_HUGEPAGE support, or whether the transparent hugepage sysfs
>> entries exist; 2) get the maximum supported hugepage size; and 3) ensure the
>> size satisfies the alignment.
>>
>> Even simpler, userspace can blindly try to create the guest memfd with the
>> transparent hugepage flag and, if that returns an error, fall back to creating
>> it without the flag.
>>
>> However, it doesn't look transparent to me.
> 
> The "transparent" part is referring to the underlying kernel mechanism, it's not
> saying anything about the API.  The "transparent" part of THP is that the kernel
> doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
> (mostly) transparent to userspace.
> 
> Paolo also isn't the biggest fan[*], but there are also downsides to always
> allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
> and there's precedent in the form of MADV_HUGEPAGE.
> 
> [*] https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a189@redhat.com

But it's different from MADV_HUGEPAGE, in a way. Per my understanding, 
the failure of MADV_HUGEPAGE is not fatal; user space can ignore it and 
continue.

However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, since it 
leads to failure of guest memfd creation.

For the current implementation, I think maybe 
KVM_GUEST_MEMFD_DESIRE_HUGEPAGE fits better than 
KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? Or maybe *PREFER*?
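
To illustrate, the "blindly try, then fall back" flow quoted above would look
roughly like this, assuming the v13 names (both the ioctl and the flag could
still change):

#include <err.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int create_gmem_prefer_hugepages(int vm_fd, uint64_t size)
{
	struct kvm_create_guest_memfd gmem = {
		.size	= size,
		.flags	= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
	};
	int fd;

	/* Try with hugepages first, then retry without the flag on error. */
	fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	if (fd < 0) {
		gmem.flags = 0;
		fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	}
	if (fd < 0)
		err(1, "KVM_CREATE_GUEST_MEMFD");

	return fd;
}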

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-27 18:21 ` [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
  2023-10-30 17:22   ` Paolo Bonzini
@ 2023-11-01  7:30   ` Binbin Wu
  2023-11-01 10:52   ` Huang, Kai
  2023-11-03  4:09   ` Xu Yilun
  3 siblings, 0 replies; 148+ messages in thread
From: Binbin Wu @ 2023-11-01  7:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov



On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
>
> Add a new KVM exit type to allow userspace to handle memory faults that
> KVM cannot resolve, but that userspace *may* be able to handle (without
> terminating the guest).
>
> KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
> conversions between private and shared memory.  With guest private memory,
> there will be two kind of memory conversions:
>
>    - explicit conversion: happens when the guest explicitly calls into KVM
>      to map a range (as private or shared)
>
>    - implicit conversion: happens when the guest attempts to access a gfn
>      that is configured in the "wrong" state (private vs. shared)
>
> On x86 (first architecture to support guest private memory), explicit
> conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
> but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
> as there is (obviously) no hypercall, and there is no guarantee that the
> guest actually intends to convert between private and shared, i.e. what
> KVM thinks is an implicit conversion "request" could actually be the
> result of a guest code bug.
>
> KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
> be implicit conversions.
>
> Note!  To allow for future possibilities where KVM reports
> KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
> fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
> perspective), not '0'!
Is "-EHWPOISON" case not considered unresolved, so it is not mentioned here?

> Due to historical baggage within KVM, exiting to
> userspace with '0' from deep callstacks, e.g. in emulation paths, is
> infeasible as doing so would require a near-complete overhaul of KVM,
> whereas KVM already propagates -errno return codes to userspace even when
> the -errno originated in a low level helper.
>
> Report the gpa+size instead of a single gfn even though the initial usage
> is expected to always report single pages.  It's entirely possible, likely
> even, that KVM will someday support sub-page granularity faults, e.g.
> Intel's sub-page protection feature allows for additional protections at
> 128-byte granularity.
>
> Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
> Link: https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
> Cc: Anish Moorthy <amoorthy@google.com>
> Cc: David Matlack <dmatlack@google.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   Documentation/virt/kvm/api.rst | 41 ++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/x86.c             |  1 +
>   include/linux/kvm_host.h       | 11 +++++++++
>   include/uapi/linux/kvm.h       |  8 +++++++
>   4 files changed, 61 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index ace984acc125..860216536810 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
>   values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>   spec refer, https://github.com/riscv/riscv-sbi-doc.
>   
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +
> +KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
> +could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
> +guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
> +describes properties of the faulting access that are likely pertinent.
> +Currently, no flags are defined.
> +
> +Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
> +accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
> +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
> +kvm_run.exit_reason is stale/undefined for all other error numbers.
> +
>   ::
>   
>       /* KVM_EXIT_NOTIFY */
> @@ -7757,6 +7777,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
>   cause CPU stuck (due to event windows don't open up) and make the CPU
>   unavailable to host or other VMs.
>   
> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86
> +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> +
> +The presence of this capability indicates that KVM_RUN will fill
> +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> +there is a valid memslot but no backing VMA for the corresponding host virtual
> +address.
> +
> +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> +to KVM_EXIT_MEMORY_FAULT.
> +
> +Note: Userspaces which attempt to resolve memory faults so that they can retry
> +KVM_RUN are encouraged to guard against repeatedly receiving the same
> +error/annotated fault.
> +
> +See KVM_EXIT_MEMORY_FAULT for more information.
> +
>   8. Other capabilities.
>   ======================
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6409914428ca..ee3cd8c3c0ef 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4518,6 +4518,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_ENABLE_CAP:
>   	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
>   	case KVM_CAP_IRQFD_RESAMPLE:
> +	case KVM_CAP_MEMORY_FAULT_INFO:
>   		r = 1;
>   		break;
>   	case KVM_CAP_EXIT_HYPERCALL:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4e741ff27af3..96aa930536b1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>   /* Max number of entries allowed for each kvm dirty ring */
>   #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>   
> +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +						 gpa_t gpa, gpa_t size)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.size = size;
> +
> +	/* Flags are not (yet) defined or communicated to userspace. */
> +	vcpu->run->memory_fault.flags = 0;
> +}
> +
>   #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index bd1abe067f28..7ae9987b48dd 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -274,6 +274,7 @@ struct kvm_xen_exit {
>   #define KVM_EXIT_RISCV_SBI        35
>   #define KVM_EXIT_RISCV_CSR        36
>   #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>   
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -520,6 +521,12 @@ struct kvm_run {
>   #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>   			__u32 flags;
>   		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
>   		/* Fix the size of the union. */
>   		char padding[256];
>   	};
> @@ -1203,6 +1210,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
>   #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
>   #define KVM_CAP_USER_MEMORY2 230
> +#define KVM_CAP_MEMORY_FAULT_INFO 231
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-31 22:13     ` Sean Christopherson
  2023-10-31 22:18       ` Paolo Bonzini
@ 2023-11-01 10:51       ` Fuad Tabba
  2023-11-01 21:55         ` Sean Christopherson
  1 sibling, 1 reply; 148+ messages in thread
From: Fuad Tabba @ 2023-11-01 10:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Hi Sean,

On Tue, Oct 31, 2023 at 10:13 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 31, 2023, Fuad Tabba wrote:
> > Hi,
> >
> > On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > ...
> >
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index e2252c748fd6..e82c69d5e755 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -6079,6 +6079,15 @@ applied.
> > >  :Parameters: struct kvm_userspace_memory_region2 (in)
> > >  :Returns: 0 on success, -1 on error
> > >
> > > +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
> > > +allows mapping guest_memfd memory into a guest.  All fields shared with
> > > +KVM_SET_USER_MEMORY_REGION behave identically.  Userspace can set KVM_MEM_PRIVATE in
> > > +flags to have KVM bind the memory region to a given guest_memfd range of
> > > +[guest_memfd_offset, guest_memfd_offset + memory_size].  The target guest_memfd
> > > +must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
> > > +the target range must not be bound to any other memory region.  All standard
> > > +bounds checks apply (use common sense).
> > > +
> >
> > Bikeshedding here: Not sure if KVM_MEM_PRIVATE is the best name for
> > this. It gets confusing with KVM_MEMORY_ATTRIBUTE_PRIVATE, i.e., that
> > a region marked as KVM_MEM_PRIVATE is only potentially private. It did
> > confuse the rest of the team when I walked them through a previous
> > version of this code once. Would something like KVM_MEM_GUESTMEM make
> > more sense?
>
> Heh, deja vu.  We discussed this back in v7[*], and I came to the conclusion that
> choosing a name that wasn't explicitly tied to private memory wasn't justified.
> But that was before a KVM-owned guest_memfd was even an idea, and thus before we
> had anything close to a real use case.
>
> Since we now know that at least pKVM will use guest_memfd for shared memory, and
> odds are quite good that "regular" VMs will also do the same, i.e. will want
> > guest_memfd without the concept of private memory, I agree that we should avoid
> PRIVATE.
>
> Though I vote for KVM_MEM_GUEST_MEMFD (or KVM_MEM_GUEST_MEMFD_VALID or
> KVM_MEM_USE_GUEST_MEMFD).  I.e. do our best to avoid ambiguity between referring
> to "guest memory" at-large and guest_memfd.

Yeah I remember that discussion, but I didn't realize how confusing it
was until I was presenting my work so far to the rest of the team, and
it confused them quite a bit. I'm happy with any of the other
alternatives that you suggest, as long as the distinction is clear.

> > > -See KVM_SET_USER_MEMORY_REGION.
> > > +A KVM_MEM_PRIVATE region _must_ have a valid guest_memfd (private memory) and
> > > +userspace_addr (shared memory).  However, "valid" for userspace_addr simply
> > > +means that the address itself must be a legal userspace address.  The backing
> > > +mapping for userspace_addr is not required to be valid/populated at the time of
> > > +KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
> > > +on-demand.
> >
> > Regarding requiring that a private region have both a valid
> > guest_memfd and a userspace_addr, should this be
> > implementation-specific? In pKVM at least, all regions for protected
> > VMs are private, and KVM doesn't care about the host userspace address
> > for those regions even when part of the memory is shared.
>
> Hmm, as of this patch, no, because the pKVM usage doesn't exist, and this
> literally documents the current ABI.  When the pKVM usage lands, the
> documentation can be updated to match.

Ack.

> > > +When mapping a gfn into the guest, KVM selects shared vs. private, i.e consumes
> > > +userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
> > > +state.  At VM creation time, all memory is shared, i.e. the PRIVATE attribute
> > > +is '0' for all gfns.  Userspace can control whether memory is shared/private by
> > > +toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
> >
> > In pKVM, guest memory is private by default, and most of it will
> > remain so for the lifetime of the VM. Userspace could explicitly mark
> > all the guest's memory as private at initialization, but it would save
> > a slight amount of work. That said, I understand that it might be
> > better to be consistent across implementations.
>
> Yeah, we discussed this in v12[*].  The default really doesn't matter for memory
> overheads or performance once KVM supports range-based xarray entries, and if that
> isn't sufficient, KVM can internally invert the polarity of PRIVATE.
>
> But for the ABI, I think we put a stake in the ground and say that all memory is
> shared by default.  That way CoCo VMs and regular VMs (i.e VMs without the concept
> of private memory) all have the same ABI.  Practically speaking, the cost to pKVM
> (and likely every other CoCo VM type) is a single ioctl() during VM creation to
> "convert" all memory to private.
>
> [*] https://lore.kernel.org/all/ZRw6X2BptZnRPNK7@google.com

Ack.

...

> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 78a0b09ef2a5..5d1a2f1b4e94 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -798,7 +798,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > >         }
> > >  }
> > >
> > > -static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > +bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > >  {
> > >         kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > >         return kvm_unmap_gfn_range(kvm, range);
> > > @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
> > >  /* This does not remove the slot from struct kvm_memslots data structures */
> > >  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> > >  {
> > > +       if (slot->flags & KVM_MEM_PRIVATE)
> > > +               kvm_gmem_unbind(slot);
> > > +
> >
> > Should this be called after kvm_arch_free_memslot()? Arch-specific code
> > might need some of the data before the unbinding, something I thought
> > might be necessary at one point for the pKVM port when deleting a
> > memslot, but realized later that kvm_invalidate_memslot() ->
> > kvm_arch_guest_memory_reclaimed() was the more logical place for it.
> > Also, since that seems to be the pattern for arch-specific handlers in
> > KVM.
>
> Maybe?  But only if we care about symmetry between the allocation and free paths.
> I really don't think kvm_arch_free_memslot() should be doing anything beyond a
> "pure" free.  E.g. kvm_arch_free_memslot() is also called after moving a memslot,
> which hopefully we never actually have to allow for guest_memfd, but any code in
> kvm_arch_free_memslot() would bring about "what if" questions regarding memslot
> movement.  I.e. the API is intended to be a "free arch metadata associated with
> the memslot".
>
> Out of curiosity, what does pKVM need to do at kvm_arch_guest_memory_reclaimed()?

It's about the host reclaiming ownership of guest memory when tearing
down a protected guest. In pKVM, we currently tear down the guest and
reclaim its memory when kvm_arch_destroy_vm() is called. The problem
with guestmem is that kvm_gmem_unbind() could get called before that
happens, after which the host might try to access the unbound guest
memory. Since the host hasn't reclaimed ownership of the guest memory
from hyp, hilarity ensues (it crashes).

Initially, I hooked guest memory reclaim into kvm_free_memslot(), but
then I needed to move the unbind later in the function. I realized
later that kvm_arch_guest_memory_reclaimed() gets called earlier (at
the right time), and is more aptly named.

Thanks,
/fuad

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-27 18:21 ` [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
  2023-10-30 17:22   ` Paolo Bonzini
  2023-11-01  7:30   ` Binbin Wu
@ 2023-11-01 10:52   ` Huang, Kai
  2023-11-01 17:36     ` Sean Christopherson
  2023-11-03  4:09   ` Xu Yilun
  3 siblings, 1 reply; 148+ messages in thread
From: Huang, Kai @ 2023-11-01 10:52 UTC (permalink / raw)
  To: viro, aou, Christopherson,,
	Sean, brauner, oliver.upton, chenhuacai, paul.walmsley, palmer,
	maz, pbonzini, mpe, willy, anup, akpm
  Cc: Li, Xiaoyao, kvm-riscv, mic, liam.merwick, kvm, Yamahata, Isaku,
	kirill.shutemov, david, tabba, amoorthy, linuxppc-dev,
	michael.roth, kvmarm, linux-kernel, linux-fsdevel, linux-riscv,
	chao.p.peng, linux-mips, Annapurve, Vishal, vbabka, mail,
	yu.c.zhang, qperret, dmatlack, Xu, Yilun, isaku.yamahata,
	ackerleytng, jarkko, linux-arm-kernel, linux-mm, Wang, Wei W


> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86
> +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> +
> +The presence of this capability indicates that KVM_RUN will fill
> +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> +there is a valid memslot but no backing VMA for the corresponding host virtual
> +address.
> +
> +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> +to KVM_EXIT_MEMORY_FAULT.

IIUC, returning -EFAULT or whatever -errno is sort of a KVM internal
implementation detail.  Would it be better to relax the rule so that
kvm_run.memory_fault is considered valid whenever KVM_RUN returns any -errno?
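
For context, the userspace-side check implied by the documentation as written
would look roughly like this (sketch only, assuming uapi headers with this
series applied):

#include <errno.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

	if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		/* memory_fault.{gpa,size,flags} are valid only on this path. */
		printf("memory fault: gpa 0x%llx size 0x%llx\n",
		       (unsigned long long)run->memory_fault.gpa,
		       (unsigned long long)run->memory_fault.size);
	}

	return ret;
}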

[...]


> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +						 gpa_t gpa, gpa_t size)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.size = size;
> +
> +	/* Flags are not (yet) defined or communicated to userspace. */
> +	vcpu->run->memory_fault.flags = 0;
> +}
> +

KVM_CAP_MEMORY_FAULT_INFO is x86-only, so would it be better to put this
function in <asm/kvm_host.h>?


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 01/35] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges
  2023-10-27 18:21 ` [PATCH v13 01/35] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
@ 2023-11-01 12:46   ` Fuad Tabba
  0 siblings, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-01 12:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so
> that the structure can be used to handle notifications that operate on gfn
> context, i.e. that aren't tied to a host virtual address.  Rename the
> handler typedef too (arguably it should always have been gfn_handler_t).
>
> Practically speaking, this is a nop for 64-bit kernels as the only
> meaningful change is to store start+end as u64s instead of unsigned longs.
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad


>  virt/kvm/kvm_main.c | 34 +++++++++++++++++++---------------
>  1 file changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 486800a7024b..0524933856d4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -541,18 +541,22 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>         return container_of(mn, struct kvm, mmu_notifier);
>  }
>
> -typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> +typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>
>  typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
>                              unsigned long end);
>
>  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>
> -struct kvm_hva_range {
> -       unsigned long start;
> -       unsigned long end;
> +struct kvm_mmu_notifier_range {
> +       /*
> +        * 64-bit addresses, as KVM notifiers can operate on host virtual
> +        * addresses (unsigned long) and guest physical addresses (64-bit).
> +        */
> +       u64 start;
> +       u64 end;
>         union kvm_mmu_notifier_arg arg;
> -       hva_handler_t handler;
> +       gfn_handler_t handler;
>         on_lock_fn_t on_lock;
>         on_unlock_fn_t on_unlock;
>         bool flush_on_ret;
> @@ -581,7 +585,7 @@ static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
>              node = interval_tree_iter_next(node, start, last))      \
>
>  static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> -                                                 const struct kvm_hva_range *range)
> +                                                 const struct kvm_mmu_notifier_range *range)
>  {
>         bool ret = false, locked = false;
>         struct kvm_gfn_range gfn_range;
> @@ -608,9 +612,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>                         unsigned long hva_start, hva_end;
>
>                         slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> -                       hva_start = max(range->start, slot->userspace_addr);
> -                       hva_end = min(range->end, slot->userspace_addr +
> -                                                 (slot->npages << PAGE_SHIFT));
> +                       hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
> +                       hva_end = min_t(unsigned long, range->end,
> +                                       slot->userspace_addr + (slot->npages << PAGE_SHIFT));
>
>                         /*
>                          * To optimize for the likely case where the address
> @@ -660,10 +664,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
>                                                 unsigned long start,
>                                                 unsigned long end,
>                                                 union kvm_mmu_notifier_arg arg,
> -                                               hva_handler_t handler)
> +                                               gfn_handler_t handler)
>  {
>         struct kvm *kvm = mmu_notifier_to_kvm(mn);
> -       const struct kvm_hva_range range = {
> +       const struct kvm_mmu_notifier_range range = {
>                 .start          = start,
>                 .end            = end,
>                 .arg            = arg,
> @@ -680,10 +684,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
>  static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
>                                                          unsigned long start,
>                                                          unsigned long end,
> -                                                        hva_handler_t handler)
> +                                                        gfn_handler_t handler)
>  {
>         struct kvm *kvm = mmu_notifier_to_kvm(mn);
> -       const struct kvm_hva_range range = {
> +       const struct kvm_mmu_notifier_range range = {
>                 .start          = start,
>                 .end            = end,
>                 .handler        = handler,
> @@ -771,7 +775,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                                         const struct mmu_notifier_range *range)
>  {
>         struct kvm *kvm = mmu_notifier_to_kvm(mn);
> -       const struct kvm_hva_range hva_range = {
> +       const struct kvm_mmu_notifier_range hva_range = {
>                 .start          = range->start,
>                 .end            = range->end,
>                 .handler        = kvm_unmap_gfn_range,
> @@ -835,7 +839,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>                                         const struct mmu_notifier_range *range)
>  {
>         struct kvm *kvm = mmu_notifier_to_kvm(mn);
> -       const struct kvm_hva_range hva_range = {
> +       const struct kvm_mmu_notifier_range hva_range = {
>                 .start          = range->start,
>                 .end            = range->end,
>                 .handler        = (void *)kvm_null_fn,
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative
  2023-10-27 18:21 ` [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative Sean Christopherson
  2023-10-30 16:27   ` Paolo Bonzini
@ 2023-11-01 12:46   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-01 12:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Move the assertion on the in-progress invalidation count from the primary
> MMU's notifier path to KVM's common notification path, i.e. assert that
> the count doesn't go negative even when the invalidation is coming from
> KVM itself.
>
> Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
> the affected VM, not the entire kernel.  A corrupted count is fatal to the
> VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
> to block any and all attempts to install new mappings.  But it's far from
> guaranteed that an end() without a start() is fatal or even problematic to
> anything other than the target VM, e.g. the underlying bug could simply be
> a duplicate call to end().  And it's much more likely that a missed
> invalidation, i.e. a potential use-after-free, would manifest as no
> notification whatsoever, not an end() without a start().
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

>  virt/kvm/kvm_main.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0524933856d4..5a97e6c7d9c2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -833,6 +833,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
>          * in conjunction with the smp_rmb in mmu_invalidate_retry().
>          */
>         kvm->mmu_invalidate_in_progress--;
> +       KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
>  }
>
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> @@ -863,8 +864,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>          */
>         if (wake)
>                 rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
> -
> -       BUG_ON(kvm->mmu_invalidate_in_progress < 0);
>  }
>
>  static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction
  2023-10-27 18:21 ` [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction Sean Christopherson
  2023-10-30 16:32   ` Paolo Bonzini
@ 2023-11-01 12:50   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-01 12:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Add an assertion that there are no in-progress MMU invalidations when a
> VM is being destroyed, with the exception of the scenario where KVM
> unregisters its MMU notifier between an .invalidate_range_start() call and
> the corresponding .invalidate_range_end().
>
> KVM can't detect unpaired calls from the mmu_notifier due to the above
> exception waiver, but the assertion can detect KVM bugs, e.g. such as the
> bug that *almost* escaped initial guest_memfd development.
>
> Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5389@linux.intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

>  virt/kvm/kvm_main.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1a577a25de47..4dba682586ee 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1356,9 +1356,16 @@ static void kvm_destroy_vm(struct kvm *kvm)
>          * No threads can be waiting in kvm_swap_active_memslots() as the
>          * last reference on KVM has been dropped, but freeing
>          * memslots would deadlock without this manual intervention.
> +        *
> +        * If the count isn't unbalanced, i.e. KVM did NOT unregister its MMU
> +        * notifier between a start() and end(), then there shouldn't be any
> +        * in-progress invalidations.
>          */
>         WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
> -       kvm->mn_active_invalidate_count = 0;
> +       if (kvm->mn_active_invalidate_count)
> +               kvm->mn_active_invalidate_count = 0;
> +       else
> +               WARN_ON(kvm->mmu_invalidate_in_progress);
>  #else
>         kvm_flush_shadow_all(kvm);
>  #endif
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
  2023-10-27 18:21 ` [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER Sean Christopherson
  2023-10-30 16:34   ` Paolo Bonzini
@ 2023-11-01 12:51   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-01 12:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
> defined when KVM is enabled, and return '1' unconditionally for the
> CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path.  All flavors of PPC support for KVM
> select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
> defined by arch/powerpc/include/asm/kvm_host.h.
>
> Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
> future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
> will allow combining all of the
>
>   #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>
> checks into a single
>
>   #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>
> without having to worry about PPC's "bare" usage of
> KVM_ARCH_WANT_MMU_NOTIFIER.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

>  arch/powerpc/kvm/powerpc.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 7197c8256668..b0a512ede764 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -632,12 +632,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>                 break;
>  #endif
>         case KVM_CAP_SYNC_MMU:
> +#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +               BUILD_BUG();
> +#endif
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>                 r = hv_enabled;
> -#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> -               r = 1;
>  #else
> -               r = 0;
> +               r = 1;
>  #endif
>                 break;
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER
  2023-10-27 18:21 ` [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
  2023-10-30 16:37   ` Paolo Bonzini
@ 2023-11-01 12:54   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-01 12:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
> appropriate to effectively maintain existing behavior.  Using a proper
> Kconfig will simplify building more functionality on top of KVM's
> mmu_notifier infrastructure.
>
> Add a forward declaration of kvm_gfn_range to kvm_types.h so that
> including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
> generate warnings due to kvm_gfn_range being undeclared.  PPC defines
> hooks for PR vs. HV without guarding them via #ifdeffery, e.g.
>
>   bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>
> Alternatively, PPC could forward declare kvm_gfn_range, but there's no
> good reason not to define it in common KVM.
>
> Acked-by: Anup Patel <anup@brainfault.org>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
(Tested x86 and arm64 on qemu)

Cheers,
/fuad

>  arch/arm64/include/asm/kvm_host.h   |  2 --
>  arch/arm64/kvm/Kconfig              |  2 +-
>  arch/mips/include/asm/kvm_host.h    |  2 --
>  arch/mips/kvm/Kconfig               |  2 +-
>  arch/powerpc/include/asm/kvm_host.h |  2 --
>  arch/powerpc/kvm/Kconfig            |  8 ++++----
>  arch/powerpc/kvm/powerpc.c          |  4 +---
>  arch/riscv/include/asm/kvm_host.h   |  2 --
>  arch/riscv/kvm/Kconfig              |  2 +-
>  arch/x86/include/asm/kvm_host.h     |  2 --
>  arch/x86/kvm/Kconfig                |  2 +-
>  include/linux/kvm_host.h            |  6 +++---
>  include/linux/kvm_types.h           |  1 +
>  virt/kvm/Kconfig                    |  4 ++++
>  virt/kvm/kvm_main.c                 | 10 +++++-----
>  15 files changed, 22 insertions(+), 29 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index af06ccb7ee34..9e046b64847a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -921,8 +921,6 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>  int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>                               struct kvm_vcpu_events *events);
>
> -#define KVM_ARCH_WANT_MMU_NOTIFIER
> -
>  void kvm_arm_halt_guest(struct kvm *kvm);
>  void kvm_arm_resume_guest(struct kvm *kvm);
>
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 83c1e09be42e..1a777715199f 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -22,7 +22,7 @@ menuconfig KVM
>         bool "Kernel-based Virtual Machine (KVM) support"
>         depends on HAVE_KVM
>         select KVM_GENERIC_HARDWARE_ENABLING
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         select PREEMPT_NOTIFIERS
>         select HAVE_KVM_CPU_RELAX_INTERCEPT
>         select KVM_MMIO
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 54a85f1d4f2c..179f320cc231 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -810,8 +810,6 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t start_gfn, gfn_t end_gfn);
>  pgd_t *kvm_pgd_alloc(void);
>  void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
>
> -#define KVM_ARCH_WANT_MMU_NOTIFIER
> -
>  /* Emulation */
>  enum emulation_result update_pc(struct kvm_vcpu *vcpu, u32 cause);
>  int kvm_get_badinstr(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
> diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
> index a8cdba75f98d..c04987d2ed2e 100644
> --- a/arch/mips/kvm/Kconfig
> +++ b/arch/mips/kvm/Kconfig
> @@ -25,7 +25,7 @@ config KVM
>         select HAVE_KVM_EVENTFD
>         select HAVE_KVM_VCPU_ASYNC_IOCTL
>         select KVM_MMIO
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         select INTERVAL_TREE
>         select KVM_GENERIC_HARDWARE_ENABLING
>         help
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 14ee0dece853..4b5c3f2acf78 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -62,8 +62,6 @@
>
>  #include <linux/mmu_notifier.h>
>
> -#define KVM_ARCH_WANT_MMU_NOTIFIER
> -
>  #define HPTEG_CACHE_NUM                        (1 << 15)
>  #define HPTEG_HASH_BITS_PTE            13
>  #define HPTEG_HASH_BITS_PTE_LONG       12
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 902611954200..b33358ee6424 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -42,7 +42,7 @@ config KVM_BOOK3S_64_HANDLER
>  config KVM_BOOK3S_PR_POSSIBLE
>         bool
>         select KVM_MMIO
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>
>  config KVM_BOOK3S_HV_POSSIBLE
>         bool
> @@ -85,7 +85,7 @@ config KVM_BOOK3S_64_HV
>         tristate "KVM for POWER7 and later using hypervisor mode in host"
>         depends on KVM_BOOK3S_64 && PPC_POWERNV
>         select KVM_BOOK3S_HV_POSSIBLE
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         select CMA
>         help
>           Support running unmodified book3s_64 guest kernels in
> @@ -194,7 +194,7 @@ config KVM_E500V2
>         depends on !CONTEXT_TRACKING_USER
>         select KVM
>         select KVM_MMIO
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         help
>           Support running unmodified E500 guest kernels in virtual machines on
>           E500v2 host processors.
> @@ -211,7 +211,7 @@ config KVM_E500MC
>         select KVM
>         select KVM_MMIO
>         select KVM_BOOKE_HV
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         help
>           Support running unmodified E500MC/E5500/E6500 guest kernels in
>           virtual machines on E500MC/E5500/E6500 host processors.
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 8d3ec483bc2b..aac75c98a956 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -632,9 +632,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>                 break;
>  #endif
>         case KVM_CAP_SYNC_MMU:
> -#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> -               BUILD_BUG();
> -#endif
> +               BUILD_BUG_ON(!IS_ENABLED(CONFIG_KVM_GENERIC_MMU_NOTIFIER));
>                 r = 1;
>                 break;
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index 1ebf20dfbaa6..66ee9ff483e9 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -249,8 +249,6 @@ struct kvm_vcpu_arch {
>  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>
> -#define KVM_ARCH_WANT_MMU_NOTIFIER
> -
>  #define KVM_RISCV_GSTAGE_TLB_MIN_ORDER         12
>
>  void kvm_riscv_local_hfence_gvma_vmid_gpa(unsigned long vmid,
> diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
> index dfc237d7875b..ae2e05f050ec 100644
> --- a/arch/riscv/kvm/Kconfig
> +++ b/arch/riscv/kvm/Kconfig
> @@ -30,7 +30,7 @@ config KVM
>         select KVM_GENERIC_HARDWARE_ENABLING
>         select KVM_MMIO
>         select KVM_XFER_TO_GUEST_WORK
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         select PREEMPT_NOTIFIERS
>         help
>           Support hosting virtualized guest machines.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 70d139406bc8..31e84668014e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2129,8 +2129,6 @@ enum {
>  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
>  #endif
>
> -#define KVM_ARCH_WANT_MMU_NOTIFIER
> -
>  int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
>  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_cpu_has_extint(struct kvm_vcpu *v);
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index ed90f148140d..091b74599c22 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -24,7 +24,7 @@ config KVM
>         depends on HIGH_RES_TIMERS
>         depends on X86_LOCAL_APIC
>         select PREEMPT_NOTIFIERS
> -       select MMU_NOTIFIER
> +       select KVM_GENERIC_MMU_NOTIFIER
>         select HAVE_KVM_IRQCHIP
>         select HAVE_KVM_PFNCACHE
>         select HAVE_KVM_IRQFD
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 11d091688346..5faba69403ac 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -253,7 +253,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>  union kvm_mmu_notifier_arg {
>         pte_t pte;
>  };
> @@ -783,7 +783,7 @@ struct kvm {
>         struct hlist_head irq_ack_notifier_list;
>  #endif
>
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>         struct mmu_notifier mmu_notifier;
>         unsigned long mmu_invalidate_seq;
>         long mmu_invalidate_in_progress;
> @@ -1946,7 +1946,7 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
>  extern const struct kvm_stats_header kvm_vcpu_stats_header;
>  extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
>
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>  static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
>  {
>         if (unlikely(kvm->mmu_invalidate_in_progress))
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 6f4737d5046a..9d1f7835d8c1 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -6,6 +6,7 @@
>  struct kvm;
>  struct kvm_async_pf;
>  struct kvm_device_ops;
> +struct kvm_gfn_range;
>  struct kvm_interrupt;
>  struct kvm_irq_routing_table;
>  struct kvm_memory_slot;
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 484d0873061c..ecae2914c97e 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -92,3 +92,7 @@ config HAVE_KVM_PM_NOTIFIER
>
>  config KVM_GENERIC_HARDWARE_ENABLING
>         bool
> +
> +config KVM_GENERIC_MMU_NOTIFIER
> +       select MMU_NOTIFIER
> +       bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4dba682586ee..6e708017064d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -535,7 +535,7 @@ void kvm_destroy_vcpus(struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  {
>         return container_of(mn, struct kvm, mmu_notifier);
> @@ -960,14 +960,14 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>         return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
>  }
>
> -#else  /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
> +#else  /* !CONFIG_KVM_GENERIC_MMU_NOTIFIER */
>
>  static int kvm_init_mmu_notifier(struct kvm *kvm)
>  {
>         return 0;
>  }
>
> -#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> +#endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */
>
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
> @@ -1287,7 +1287,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  out_err_no_debugfs:
>         kvm_coalesced_mmio_free(kvm);
>  out_no_coalesced_mmio:
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>         if (kvm->mmu_notifier.ops)
>                 mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
>  #endif
> @@ -1347,7 +1347,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>                 kvm->buses[i] = NULL;
>         }
>         kvm_coalesced_mmio_free(kvm);
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>         mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
>         /*
>          * At this point, pending calls to invalidate_range_start()
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-11-01  7:25       ` Xiaoyao Li
@ 2023-11-01 13:41         ` Sean Christopherson
  2023-11-01 13:49           ` Paolo Bonzini
  0 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-11-01 13:41 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Nov 01, 2023, Xiaoyao Li wrote:
> On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> > > On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > > > Extended guest_memfd to allow backing guest memory with transparent
> > > > hugepages. Require userspace to opt-in via a flag even though there's no
> > > > known/anticipated use case for forcing small pages as THP is optional,
> > > > i.e. to avoid ending up in a situation where userspace is unaware that
> > > > KVM can't provide hugepages.
> > > 
> > > Personally, it seems not so "transparent" if requiring userspace to opt-in.
> > > 
> > > People need to 1) check if the kernel is built with TRANSPARENT_HUGEPAGE
> > > support, or check if the sysfs entry for transparent hugepages exists; 2) get
> > > the maximum supported hugepage size; 3) ensure the size satisfies the
> > > alignment; all before opting in.
> > > 
> > > Even simpler, userspace can blindly try to create the guest memfd with the
> > > transparent hugepage flag, and if that gets an error, fall back to creating it
> > > without the flag.
> > > 
> > > However, it doesn't look transparent to me.
> > 
> > The "transparent" part is referring to the underlying kernel mechanism, it's not
> > saying anything about the API.  The "transparent" part of THP is that the kernel
> > doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
> > (mostly) transparent to userspace.
> > 
> > Paolo also isn't the biggest fan[*], but there are also downsides to always
> > allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
> > and there's precedent in the form of MADV_HUGEPAGE.
> > 
> > [*] https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a189@redhat.com
> 
> But it's different than MADV_HUGEPAGE, in a way. Per my understanding, the
> failure of MADV_HUGEPAGE is not fatal, user space can ignore it and
> continue.
>
> However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads
> to failure of guest memfd creation.

Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different
action from userspace, i.e. instead of ignoring the error, userspace could redo
KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0.
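
(Illustration only, not part of the posted series: a minimal userspace sketch of
that retry, assuming the v13 uAPI names -- KVM_CREATE_GUEST_MEMFD,
struct kvm_create_guest_memfd, KVM_GUEST_MEMFD_ALLOW_HUGEPAGE -- with error
reporting elided and the helper name purely hypothetical.)

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Hypothetical helper: try hugepages first, then fall back. */
  static int create_gmem_with_fallback(int vm_fd, __u64 size)
  {
          struct kvm_create_guest_memfd gmem = {
                  .size  = size,
                  .flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
          };
          int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

          if (fd < 0) {
                  /* E.g. no THP support, or size not hugepage-aligned. */
                  gmem.flags = 0;
                  fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
          }
          return fd;
  }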

We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we could
extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let
userspace provide advice/hints after creating a guest_memfd.  But I suspect that
guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation hint
is actually less desirable.

KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a
compatible size or the kernel doesn't support THP.  An incompatible size is likely
a userspace bug, and for most setups that want to utilize guest_memfd, lack of THP
support is likely a configuration bug.  I.e. many/most uses *want* failures due to
KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal.

> For current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
> fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? or maybe *PREFER*?

Why?  Verbs like "prefer" and "desire" aren't a good fit IMO because they suggest
the flag is a hint, and hints are usually best effort only, i.e. are ignored if
there is a fundamental incompatibility.

"Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
(yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
a hint, but weaker than a requirement.  And if/when KVM supports a dedicated memory
pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-11-01 13:41         ` Sean Christopherson
@ 2023-11-01 13:49           ` Paolo Bonzini
  2023-11-01 16:36             ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-01 13:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiaoyao Li, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Nov 1, 2023 at 2:41 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Nov 01, 2023, Xiaoyao Li wrote:
> > On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> > > On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> > > > On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > > > > Extended guest_memfd to allow backing guest memory with transparent
> > > > > hugepages. Require userspace to opt-in via a flag even though there's no
> > > > > known/anticipated use case for forcing small pages as THP is optional,
> > > > > i.e. to avoid ending up in a situation where userspace is unaware that
> > > > > KVM can't provide hugepages.
> > > >
> > > > Personally, it seems not so "transparent" if requiring userspace to opt-in.
> > > >
> > > > People need to 1) check if the kernel is built with TRANSPARENT_HUGEPAGE
> > > > support, or check if the sysfs entry for transparent hugepages exists; 2) get
> > > > the maximum supported hugepage size; 3) ensure the size satisfies the
> > > > alignment; all before opting in.
> > > >
> > > > Even simpler, userspace can blindly try to create the guest memfd with the
> > > > transparent hugepage flag, and if that gets an error, fall back to creating it
> > > > without the flag.
> > > >
> > > > However, it doesn't look transparent to me.
> > >
> > > The "transparent" part is referring to the underlying kernel mechanism, it's not
> > > saying anything about the API.  The "transparent" part of THP is that the kernel
> > > doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
> > > (mostly) transparent to userspace.
> > >
> > > Paolo also isn't the biggest fan[*], but there are also downsides to always
> > > allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
> > > and there's precedent in the form of MADV_HUGEPAGE.
> > >
> > > [*] https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a189@redhat.com
> >
> > But it's different than MADV_HUGEPAGE, in a way. Per my understanding, the
> > failure of MADV_HUGEPAGE is not fatal, user space can ignore it and
> > continue.
> >
> > However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads
> > to failure of guest memfd creation.
>
> Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different
> action from userspace, i.e. instead of ignoring the error, userspace could redo
> KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0.
>
> We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we could
> extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let
> userspace provide advice/hints after creating a guest_memfd.  But I suspect that
> guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation hint
> is actually less desirable.
>
> KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a
> compatible size or the kernel doesn't support THP.  An incompatible size is likely
> a userspace bug, and for most setups that want to utilize guest_memfd, lack of THP
> support is likely a configuration bug.  I.e. many/most uses *want* failures due to
> KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal.
>
> > For current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
> > fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? or maybe *PREFER*?
>
> Why?  Verbs like "prefer" and "desire" aren't a good fit IMO because they suggest
> the flag is a hint, and hints are usually best effort only, i.e. are ignored if
> there is a fundamental incompatibility.
>
> "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
> or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
> (yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
> a hint, but weaker than a requirement.  And if/when KVM supports a dedicated memory
> pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.

I think that the current patch is fine, but I will adjust it to always allow
the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
If hugepages are not guaranteed, and (theoretically) you could have no
hugepage at all in the result, it's okay to get this result even if THP is not
available in the kernel.

Paolo


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  2023-10-27 18:21 ` [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
  2023-10-30 16:41   ` Paolo Bonzini
  2023-10-31  2:26   ` Xiaoyao Li
@ 2023-11-01 14:19   ` Fuad Tabba
  2 siblings, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-01 14:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
> information can be supplied without setting userspace up to fail.  The
> padding in the new kvm_userspace_memory_region2 structure will be used to
> pass a file descriptor in addition to the userspace_addr, i.e. allow
> userspace to point at a file descriptor and map memory into a guest that
> is NOT mapped into host userspace.
>
> Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
> without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
> makes detection of bad flags a bit more robust, e.g. if the new fd field
> is guarded only by a flag and not a new ioctl(), then a userspace bug
> (setting a "bad" flag) would generate out-of-bounds access instead of an
> -EINVAL error.
>
> Cc: Jarkko Sakkinen <jarkko@kernel.org>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

With the missing pad in api.rst fixed:
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

>  Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++
>  arch/x86/kvm/x86.c             |  2 +-
>  include/linux/kvm_host.h       |  4 ++--
>  include/uapi/linux/kvm.h       | 13 ++++++++++++
>  virt/kvm/kvm_main.c            | 38 +++++++++++++++++++++++++++-------
>  5 files changed, 67 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 21a7578142a1..ace984acc125 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6070,6 +6070,27 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG
>  interface. No error will be returned, but the resulting offset will not be
>  applied.
>
> +4.139 KVM_SET_USER_MEMORY_REGION2
> +---------------------------------
> +
> +:Capability: KVM_CAP_USER_MEMORY2
> +:Architectures: all
> +:Type: vm ioctl
> +:Parameters: struct kvm_userspace_memory_region2 (in)
> +:Returns: 0 on success, -1 on error
> +
> +::
> +
> +  struct kvm_userspace_memory_region2 {
> +       __u32 slot;
> +       __u32 flags;
> +       __u64 guest_phys_addr;
> +       __u64 memory_size; /* bytes */
> +       __u64 userspace_addr; /* start of the userspace allocated memory */
> +  };
> +
> +See KVM_SET_USER_MEMORY_REGION.
> +
>  5. The kvm_run structure
>  ========================
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 41cce5031126..6409914428ca 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12455,7 +12455,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
>         }
>
>         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> -               struct kvm_userspace_memory_region m;
> +               struct kvm_userspace_memory_region2 m;
>
>                 m.slot = id | (i << 16);
>                 m.flags = 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 5faba69403ac..4e741ff27af3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1146,9 +1146,9 @@ enum kvm_mr_change {
>  };
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem);
> +                         const struct kvm_userspace_memory_region2 *mem);
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem);
> +                           const struct kvm_userspace_memory_region2 *mem);
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
>  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
>  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 13065dd96132..bd1abe067f28 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
>         __u64 userspace_addr; /* start of the userspace allocated memory */
>  };
>
> +/* for KVM_SET_USER_MEMORY_REGION2 */
> +struct kvm_userspace_memory_region2 {
> +       __u32 slot;
> +       __u32 flags;
> +       __u64 guest_phys_addr;
> +       __u64 memory_size;
> +       __u64 userspace_addr;
> +       __u64 pad[16];
> +};
> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
>   * userspace, other bits are reserved for kvm internal use which are defined
> @@ -1192,6 +1202,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_COUNTER_OFFSET 227
>  #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
>  #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
> +#define KVM_CAP_USER_MEMORY2 230
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -1473,6 +1484,8 @@ struct kvm_vfio_spapr_tce {
>                                         struct kvm_userspace_memory_region)
>  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
>  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> +#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> +                                        struct kvm_userspace_memory_region2)
>
>  /* enable ucontrol for s390 */
>  struct kvm_s390_ucas_mapping {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6e708017064d..3f5b7c2c5327 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1578,7 +1578,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
>
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_userspace_memory_region2 *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> @@ -1980,7 +1980,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>   * Must be called holding kvm->slots_lock for write.
>   */
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem)
> +                           const struct kvm_userspace_memory_region2 *mem)
>  {
>         struct kvm_memory_slot *old, *new;
>         struct kvm_memslots *slots;
> @@ -2084,7 +2084,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem)
> +                         const struct kvm_userspace_memory_region2 *mem)
>  {
>         int r;
>
> @@ -2096,7 +2096,7 @@ int kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>
>  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> -                                         struct kvm_userspace_memory_region *mem)
> +                                         struct kvm_userspace_memory_region2 *mem)
>  {
>         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
>                 return -EINVAL;
> @@ -4566,6 +4566,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  {
>         switch (arg) {
>         case KVM_CAP_USER_MEMORY:
> +       case KVM_CAP_USER_MEMORY2:
>         case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
>         case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
>         case KVM_CAP_INTERNAL_ERROR_DATA:
> @@ -4821,6 +4822,14 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>         return fd;
>  }
>
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> +do {                                                                           \
> +       BUILD_BUG_ON(offsetof(struct kvm_userspace_memory_region, field) !=             \
> +                    offsetof(struct kvm_userspace_memory_region2, field));     \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_userspace_memory_region, field) !=         \
> +                    sizeof_field(struct kvm_userspace_memory_region2, field)); \
> +} while (0)
> +
>  static long kvm_vm_ioctl(struct file *filp,
>                            unsigned int ioctl, unsigned long arg)
>  {
> @@ -4843,15 +4852,28 @@ static long kvm_vm_ioctl(struct file *filp,
>                 r = kvm_vm_ioctl_enable_cap_generic(kvm, &cap);
>                 break;
>         }
> +       case KVM_SET_USER_MEMORY_REGION2:
>         case KVM_SET_USER_MEMORY_REGION: {
> -               struct kvm_userspace_memory_region kvm_userspace_mem;
> +               struct kvm_userspace_memory_region2 mem;
> +               unsigned long size;
> +
> +               if (ioctl == KVM_SET_USER_MEMORY_REGION)
> +                       size = sizeof(struct kvm_userspace_memory_region);
> +               else
> +                       size = sizeof(struct kvm_userspace_memory_region2);
> +
> +               /* Ensure the common parts of the two structs are identical. */
> +               SANITY_CHECK_MEM_REGION_FIELD(slot);
> +               SANITY_CHECK_MEM_REGION_FIELD(flags);
> +               SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +               SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +               SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
>
>                 r = -EFAULT;
> -               if (copy_from_user(&kvm_userspace_mem, argp,
> -                                               sizeof(kvm_userspace_mem)))
> +               if (copy_from_user(&mem, argp, size))
>                         goto out;
>
> -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>                 break;
>         }
>         case KVM_GET_DIRTY_LOG: {
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
  2023-10-27 18:21 ` [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
  2023-10-30 16:30   ` Paolo Bonzini
  2023-10-30 16:53   ` David Matlack
@ 2023-11-01 15:31   ` Xu Yilun
  2 siblings, 0 replies; 148+ messages in thread
From: Xu Yilun @ 2023-11-01 15:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Fri, Oct 27, 2023 at 11:21:45AM -0700, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> Currently in mmu_notifier invalidate path, hva range is recorded and
> then checked against by mmu_notifier_retry_hva() in the page fault
                          ^

should be mmu_invalidate_retry_hva().

Besides this, Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>

Thanks

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-11-01 13:49           ` Paolo Bonzini
@ 2023-11-01 16:36             ` Sean Christopherson
  2023-11-01 22:28               ` Paolo Bonzini
  0 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-11-01 16:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Xiaoyao Li, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> On Wed, Nov 1, 2023 at 2:41 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Nov 01, 2023, Xiaoyao Li wrote:
> > > On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> > > > On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> > > > > On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > > But it's different than MADV_HUGEPAGE, in a way. Per my understanding, the
> > > failure of MADV_HUGEPAGE is not fatal, user space can ignore it and
> > > continue.
> > >
> > > However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads
> > > to failure of guest memfd creation.
> >
> > Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different
> > action from userspace, i.e. instead of ignoring the error, userspace could redo
> > KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0.
> >
> > We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we could
> > extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let
> > userspace provide advice/hints after creating a guest_memfd.  But I suspect that
> > guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation hint
> > is actually less desirable.
> >
> > KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a
> > compatible size or the kernel doesn't support THP.  An incompatible size is likely
> > a userspace bug, and for most setups that want to utilize guest_memfd, lack of THP
> > support is likely a configuration bug.  I.e. many/most uses *want* failures due to
> > KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal.
> >
> > > For current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
> > > fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? or maybe *PREFER*?
> >
> > Why?  Verbs like "prefer" and "desire" aren't a good fit IMO because they suggest
> > the flag is a hint, and hints are usually best effort only, i.e. are ignored if
> > there is a fundamental incompatibility.
> >
> > "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
> > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
> > (yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
> > a hint, but weaker than a requirement.  And if/when KVM supports a dedicated memory
> > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
> 
> I think that the current patch is fine, but I will adjust it to always
> allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
> If hugepages are not guaranteed, and (theoretically) you could have no
> hugepage at all in the result, it's okay to get this result even if THP is not
> available in the kernel.

Can you post a fixup patch?  It's not clear to me exactly what behavior you intend
to end up with.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-11-01 10:52   ` Huang, Kai
@ 2023-11-01 17:36     ` Sean Christopherson
  2023-11-02  2:19       ` Xiaoyao Li
                         ` (2 more replies)
  0 siblings, 3 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-11-01 17:36 UTC (permalink / raw)
  To: Kai Huang
  Cc: viro, aou, brauner, oliver.upton, chenhuacai, paul.walmsley,
	palmer, maz, pbonzini, mpe, willy, anup, akpm, Xiaoyao Li,
	kvm-riscv, mic, liam.merwick, kvm, Isaku Yamahata,
	kirill.shutemov, david, tabba, amoorthy, linuxppc-dev,
	michael.roth, kvmarm, linux-kernel, linux-fsdevel, linux-riscv,
	chao.p.peng, linux-mips, Vishal Annapurve, vbabka, mail,
	yu.c.zhang, qperret, dmatlack, Yilun Xu, isaku.yamahata,
	ackerleytng, jarkko, linux-arm-kernel, linux-mm, Wei W Wang

On Wed, Nov 01, 2023, Kai Huang wrote:
> 
> > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > +------------------------------
> > +
> > +:Architectures: x86
> > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > +
> > +The presence of this capability indicates that KVM_RUN will fill
> > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > +address.
> > +
> > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > +to KVM_EXIT_MEMORY_FAULT.
> 
> IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> implementation.

The errno that is returned to userspace is ABI.  In KVM, it's a _very_ poorly
defined ABI for the vast majority of ioctls(), but it's still technically ABI.
KVM gets away with being cavalier with errno because the vast majority of errors
are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
care about the exact errno.

A good example is KVM_RUN with -EINTR; if KVM were to return something other than
-EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.

> Is it better to relax the validity of kvm_run.memory_fault when
> KVM_RUN returns any -errno?

Not unless there's a need to do so, and if there is then we can update the
documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is valid
for any errno, then KVM would need to purge kvm_run.exit_reason super early in
KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super early is
subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
pending signal).

https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
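
(Illustration only, not from the series: a sketch of how userspace might consume
the documented contract, assuming the v13 uAPI names; handle_memory_fault() is a
hypothetical userspace callback.)

  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  extern int handle_memory_fault(__u64 gpa, __u64 size, __u64 flags);

  static int vcpu_run_once(int vcpu_fd, struct kvm_run *run)
  {
          int ret = ioctl(vcpu_fd, KVM_RUN, 0);

          /* memory_fault is valid only on -EFAULT/-EHWPOISON *and* this exit reason. */
          if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
              run->exit_reason == KVM_EXIT_MEMORY_FAULT)
                  return handle_memory_fault(run->memory_fault.gpa,
                                             run->memory_fault.size,
                                             run->memory_fault.flags);
          return ret;
  }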

> [...]
> 
> 
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> >  /* Max number of entries allowed for each kvm dirty ring */
> >  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> >  
> > +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +						 gpa_t gpa, gpa_t size)
> > +{
> > +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +	vcpu->run->memory_fault.gpa = gpa;
> > +	vcpu->run->memory_fault.size = size;
> > +
> > +	/* Flags are not (yet) defined or communicated to userspace. */
> > +	vcpu->run->memory_fault.flags = 0;
> > +}
> > +
> 
> KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
> <asm/kvm_host.h>?

I'd prefer to keep it in generic code, as it's highly likely to end up there
sooner than later.  There's a known use case for ARM (exit to userspace on missing
userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize this API.

[*] https://lore.kernel.org/all/20230908222905.1321305-8-amoorthy@google.com

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-11-01 10:51       ` Fuad Tabba
@ 2023-11-01 21:55         ` Sean Christopherson
  2023-11-02 13:52           ` Fuad Tabba
  0 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-11-01 21:55 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Nov 01, 2023, Fuad Tabba wrote:
> > > > @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
> > > >  /* This does not remove the slot from struct kvm_memslots data structures */
> > > >  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> > > >  {
> > > > +       if (slot->flags & KVM_MEM_PRIVATE)
> > > > +               kvm_gmem_unbind(slot);
> > > > +
> > >
> > > Should this be called after kvm_arch_free_memslot()? Arch-specific code
> > > might need some of the data before the unbinding, something I thought
> > > might be necessary at one point for the pKVM port when deleting a
> > > memslot, but realized later that kvm_invalidate_memslot() ->
> > > kvm_arch_guest_memory_reclaimed() was the more logical place for it.
> > > Also, since that seems to be the pattern for arch-specific handlers in
> > > KVM.
> >
> > Maybe?  But only if we care about symmetry between the allocation and free paths.
> > I really don't think kvm_arch_free_memslot() should be doing anything beyond a
> > "pure" free.  E.g. kvm_arch_free_memslot() is also called after moving a memslot,
> > which hopefully we never actually have to allow for guest_memfd, but any code in
> > kvm_arch_free_memslot() would bring about "what if" questions regarding memslot
> > movement.  I.e. the API is intended to be a "free arch metadata associated with
> > the memslot".
> >
> > Out of curiosity, what does pKVM need to do at kvm_arch_guest_memory_reclaimed()?
> 
> It's about the host reclaiming ownership of guest memory when tearing
> down a protected guest. In pKVM, we currently teardown the guest and
> reclaim its memory when kvm_arch_destroy_vm() is called. The problem
> with guestmem is that kvm_gmem_unbind() could get called before that
> happens, after which the host might try to access the unbound guest
> memory. Since the host hasn't reclaimed ownership of the guest memory
> from hyp, hilarity ensues (it crashes).
> 
> Initially, I hooked reclaim guest memory to kvm_free_memslot(), but
> then I needed to move the unbind later in the function. I realized
> later that kvm_arch_guest_memory_reclaimed() gets called earlier (at
> the right time), and is more aptly named.

Aha!  I suspected that might be the case.

TDX and SNP also need to solve the same problem of "reclaiming" memory before it
can be safely accessed by the host.  The plan is to add an arch hook (or two?)
into guest_memfd that is invoked when memory is freed from guest_memfd.

Hooking kvm_arch_guest_memory_reclaimed() isn't completely correct as deleting a
memslot doesn't *guarantee* that guest memory is actually reclaimed (which reminds
me, we need to figure out a better name for that thing before introducing
kvm_arch_gmem_invalidate()).

The effective false positives aren't fatal for the current usage because the hook
is used only for x86 SEV guests to flush caches.  An unnecessary flush can cause
performance issues, but it doesn't affect correctness. For TDX and SNP, and IIUC
pKVM, false positives are fatal because KVM could assign memory back to the host
that is still owned by guest_memfd.

E.g. a misbehaving userspace could prematurely delete a memslot.  And the more
fun example is intrahost migration, where the plan is to allow pointing multiple
guest_memfd files at a single guest_memfd inode:
https://lore.kernel.org/all/cover.1691446946.git.ackerleytng@google.com

There was a lot of discussion for this, but it's scattered all over the place.
The TL;DR is is that the inode will represent physical memory, and a file will
represent a given "struct kvm" instance's view of that memory.  And so the memory
isn't reclaimed until the inode is truncated/punched.

I _think_ this reflects the most recent plan from the guest_memfd side:
https://lore.kernel.org/all/1233d749211c08d51f9ca5d427938d47f008af1f.1689893403.git.isaku.yamahata@intel.com

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-11-01 16:36             ` Sean Christopherson
@ 2023-11-01 22:28               ` Paolo Bonzini
  2023-11-01 22:34                 ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-01 22:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiaoyao Li, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 11/1/23 17:36, Sean Christopherson wrote:
>>> "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
>>> or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
>>> (yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
>>> a hint, but weaker than a requirement.  And if/when KVM supports a dedicated memory
>>> pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
>> I think that the current patch is fine, but I will adjust it to always
>> allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
>> If hugepages are not guaranteed, and (theoretically) you could have no
>> hugepage at all in the result, it's okay to get this result even if THP is not
>> available in the kernel.
> Can you post a fixup patch?  It's not clear to me exactly what behavior you intend
> to end up with.

Sure, just this:

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 7d1a33c2ad42..34fd070e03d9 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
  {
  	loff_t size = args->size;
  	u64 flags = args->flags;
-	u64 valid_flags = 0;
-
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
-		valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+	u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
  
  	if (flags & ~valid_flags)
  		return -EINVAL;
@@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
  	if (size < 0 || !PAGE_ALIGNED(size))
  		return -EINVAL;
  
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
  	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
  	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
  		return -EINVAL;
-#endif
  
  	return __kvm_gmem_create(kvm, size, flags);
  }

Paolo


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-11-01 22:28               ` Paolo Bonzini
@ 2023-11-01 22:34                 ` Sean Christopherson
  2023-11-01 23:17                   ` Paolo Bonzini
  0 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-11-01 22:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Xiaoyao Li, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> On 11/1/23 17:36, Sean Christopherson wrote:
> > > > "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
> > > > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
> > > > (yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
> > > > a hint, but weaker than a requirement.  And if/when KVM supports a dedicated memory
> > > > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
> > > I think that the current patch is fine, but I will adjust it to always
> > > allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
> > > If hugepages are not guaranteed, and (theoretically) you could have no
> > > hugepage at all in the result, it's okay to get this result even if THP is not
> > > available in the kernel.
> > Can you post a fixup patch?  It's not clear to me exactly what behavior you intend
> > to end up with.
> 
> Sure, just this:
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 7d1a33c2ad42..34fd070e03d9 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  {
>  	loff_t size = args->size;
>  	u64 flags = args->flags;
> -	u64 valid_flags = 0;
> -
> -	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> -		valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> +	u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
>  	if (flags & ~valid_flags)
>  		return -EINVAL;
> @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  	if (size < 0 || !PAGE_ALIGNED(size))
>  		return -EINVAL;
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
>  	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
>  		return -EINVAL;
> -#endif

That won't work, HPAGE_PMD_SIZE is valid only for CONFIG_TRANSPARENT_HUGEPAGE=y.

#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

...

>  	return __kvm_gmem_create(kvm, size, flags);
>  }
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-11-01 22:34                 ` Sean Christopherson
@ 2023-11-01 23:17                   ` Paolo Bonzini
  2023-11-02 15:38                     ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-01 23:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiaoyao Li, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Nov 1, 2023 at 11:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> > On 11/1/23 17:36, Sean Christopherson wrote:
> > > > > "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
> > > > > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
> > > > > (yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
> > > > > a hint, but weaker than a requirement.  And if/when KVM supports a dedicated memory
> > > > > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
> > > > I think that the current patch is fine, but I will adjust it to always
> > > > allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
> > > > If hugepages are not guaranteed, and (theoretically) you could have no
> > > > hugepage at all in the result, it's okay to get this result even if THP is not
> > > > available in the kernel.
> > > Can you post a fixup patch?  It's not clear to me exactly what behavior you intend
> > > to end up with.
> >
> > Sure, just this:
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 7d1a33c2ad42..34fd070e03d9 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> >  {
> >       loff_t size = args->size;
> >       u64 flags = args->flags;
> > -     u64 valid_flags = 0;
> > -
> > -     if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > -             valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > +     u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> >       if (flags & ~valid_flags)
> >               return -EINVAL;
> > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> >       if (size < 0 || !PAGE_ALIGNED(size))
> >               return -EINVAL;
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >       if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> >           !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> >               return -EINVAL;
> > -#endif
>
> That won't work, HPAGE_PMD_SIZE is valid only for CONFIG_TRANSPARENT_HUGEPAGE=y.
>
> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

Would have caught it when actually testing it, I guess. :) It has to
be PMD_SIZE, possibly with

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE);
#endif

for extra safety.
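
(Sketch only, combining the two fragments above into what the adjusted
kvm_gmem_create() might look like; this is not a posted fixup, and it assumes
the v13 helpers such as __kvm_gmem_create().)

int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
{
	loff_t size = args->size;
	u64 flags = args->flags;
	u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;

	if (flags & ~valid_flags)
		return -EINVAL;

	if (size < 0 || !PAGE_ALIGNED(size))
		return -EINVAL;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE);
#endif
	/* Use PMD_SIZE so the check compiles even without THP. */
	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
	    !IS_ALIGNED(size, PMD_SIZE))
		return -EINVAL;

	return __kvm_gmem_create(kvm, size, flags);
}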

Paolo


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-11-01 17:36     ` Sean Christopherson
@ 2023-11-02  2:19       ` Xiaoyao Li
  2023-11-02 15:51         ` Sean Christopherson
  2023-11-02  3:17       ` Huang, Kai
  2023-11-02 11:01       ` Paolo Bonzini
  2 siblings, 1 reply; 148+ messages in thread
From: Xiaoyao Li @ 2023-11-02  2:19 UTC (permalink / raw)
  To: Sean Christopherson, Kai Huang
  Cc: viro, aou, brauner, oliver.upton, chenhuacai, paul.walmsley,
	palmer, maz, pbonzini, mpe, willy, anup, akpm, kvm-riscv, mic,
	liam.merwick, kvm, Isaku Yamahata, kirill.shutemov, david, tabba,
	amoorthy, linuxppc-dev, michael.roth, kvmarm, linux-kernel,
	linux-fsdevel, linux-riscv, chao.p.peng, linux-mips,
	Vishal Annapurve, vbabka, mail, yu.c.zhang, qperret, dmatlack,
	Yilun Xu, isaku.yamahata, ackerleytng, jarkko, linux-arm-kernel,
	linux-mm, Wei W Wang

On 11/2/2023 1:36 AM, Sean Christopherson wrote:
>> KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
>> <asm/kvm_host.h>?
> I'd prefer to keep it in generic code, as it's highly likely to end up there
> sooner than later.  There's a known use case for ARM (exit to userspace on missing
> userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize this API.
> 
> [*]https://lore.kernel.org/all/20230908222905.1321305-8-amoorthy@google.com

I wonder how this CAP is supposed to be checked in userspace for the guest
memfd case. Something like this?

	if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT)
		abort("unexpected KVM_EXIT_MEMORY_FAULT");

In my implementation of the QEMU patches, I find it's unnecessary: when
userspace gets an exit with KVM_EXIT_MEMORY_FAULT, that already implies
KVM_CAP_MEMORY_FAULT_INFO.

So I don't see how it is necessary in this series. Whether it's 
necessary or not for [*], I don't have the answer but we can leave the 
discussion to that patch series.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
  2023-10-27 18:21 ` [PATCH v13 13/35] KVM: Introduce per-page memory attributes Sean Christopherson
  2023-10-30  8:11   ` Chao Gao
  2023-10-31 16:43   ` David Matlack
@ 2023-11-02  3:01   ` Huang, Kai
  2023-11-02 10:32     ` Paolo Bonzini
  2 siblings, 1 reply; 148+ messages in thread
From: Huang, Kai @ 2023-11-02  3:01 UTC (permalink / raw)
  To: viro, aou, Christopherson,,
	Sean, brauner, oliver.upton, chenhuacai, paul.walmsley, palmer,
	maz, pbonzini, mpe, willy, anup, akpm
  Cc: Li, Xiaoyao, kvm-riscv, mic, liam.merwick, kvm, Yamahata, Isaku,
	kirill.shutemov, david, tabba, amoorthy, linuxppc-dev,
	michael.roth, kvmarm, linux-kernel, linux-fsdevel, linux-riscv,
	chao.p.peng, linux-mips, Annapurve, Vishal, vbabka, mail,
	yu.c.zhang, qperret, dmatlack, Xu, Yilun, isaku.yamahata,
	ackerleytng, jarkko, linux-arm-kernel, linux-mm, Wang, Wei W

On Fri, 2023-10-27 at 11:21 -0700, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
> 
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
>   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>     a guest memory range.
>   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>     memory attributes.
> 
> Use an xarray to store the per-page attributes internally, with a naive,
> not fully optimized implementation, i.e. prioritize correctness over
> performance for the initial implementation.
> 
> Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
> attributes/protections in the future, e.g. to give userspace fine-grained
> control over read, write, and execute protections for guest memory.
> 
> Provide arch hooks for handling attribute changes before and after common
> code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
> relevant mappings, and the "post" hook to track whether or not hugepages
> can be used to map the range.
> 
> To simplify the implementation wrap the entire sequence with
> kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
> guaranteed to be an invalidation.  For the initial use case, x86 *will*
> always invalidate memory, and preventing arch code from creating new
> mappings while the attributes are in flux makes it much easier to reason
> about the correctness of consuming attributes.
> 
> It's possible that future usages may not require an invalidation, e.g.
> if KVM ends up supporting RWX protections and userspace grants _more_
> protections, but again opt for simplicity and punt optimizations to
> if/when they are needed.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
> Cc: Fuad Tabba <tabba@google.com>
> Cc: Xu Yilun <yilun.xu@intel.com>
> Cc: Mickaël Salaün <mic@digikod.net>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> 

[...]

> +Note, there is no "get" API.  Userspace is responsible for explicitly tracking
> +the state of a gfn/page as needed.
> +
> 

[...]

>  
> +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +{
> +	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> +}

Only call xa_to_value() when xa_load() returns !NULL?

> +
> +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +				     unsigned long attrs);

It's not immediately clear why this function is needed in this patch,
especially when you said there is no "get" API above.  Add some material to
the changelog?
 

> +bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> +					struct kvm_gfn_range *range);
> +bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> +					 struct kvm_gfn_range *range);

It looks like, if this Kconfig is on, the above two arch hooks won't have an
implementation.

Would it be better to have two empty __weak versions here in this patch?

Anyway, from the changelog it seems it's not mandatory for an arch to provide
the above two hooks in order to turn this on, i.e., the two hooks can be empty
and the arch could just use the __weak versions.
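
Something like this is what I have in mind, just as a sketch (the empty
bodies and return values are assumptions on my part, not code from the
series):

	/* Hypothetical __weak fallbacks, for illustration only. */
	bool __weak kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
						       struct kvm_gfn_range *range)
	{
		return false;	/* assume: nothing to zap/flush */
	}

	bool __weak kvm_arch_post_set_memory_attributes(struct kvm *kvm,
							struct kvm_gfn_range *range)
	{
		return false;
	}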


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-11-01 17:36     ` Sean Christopherson
  2023-11-02  2:19       ` Xiaoyao Li
@ 2023-11-02  3:17       ` Huang, Kai
  2023-11-02  9:35         ` Huang, Kai
  2023-11-02 15:56         ` Sean Christopherson
  2023-11-02 11:01       ` Paolo Bonzini
  2 siblings, 2 replies; 148+ messages in thread
From: Huang, Kai @ 2023-11-02  3:17 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: Li, Xiaoyao, kvm-riscv, mic, liam.merwick, Yamahata, Isaku, kvm,
	pbonzini, kirill.shutemov, david, linux-fsdevel, amoorthy,
	linuxppc-dev, tabba, kvmarm, linux-kernel, michael.roth, viro,
	oliver.upton, chao.p.peng, palmer, chenhuacai, aou, linux-mips,
	mpe, Annapurve, Vishal, vbabka, mail, linux-riscv, maz, willy,
	dmatlack, anup, yu.c.zhang, Xu, Yilun, qperret, brauner,
	isaku.yamahata, ackerleytng, jarkko, paul.walmsley,
	linux-arm-kernel, linux-mm, Wang, Wei W, akpm

On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> On Wed, Nov 01, 2023, Kai Huang wrote:
> > 
> > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > +------------------------------
> > > +
> > > +:Architectures: x86
> > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > +
> > > +The presence of this capability indicates that KVM_RUN will fill
> > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > > +address.
> > > +
> > > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > > +to KVM_EXIT_MEMORY_FAULT.
> > 
> > IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> > implementation.
> 
> The errno that is returned to userspace is ABI.  In KVM, it's a _very_ poorly
> defined ABI for the vast majority of ioctls(), but it's still technically ABI.
> KVM gets away with being cavalier with errno because the vast majority of errors
> are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
> care about the exact errno.
> 
> A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.
> 
> > Is it better to relax the validity of kvm_run.memory_fault when
> > KVM_RUN returns any -errno?
> 
> Not unless there's a need to do so, and if there is then we can update the
> documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is valid
> for any errno, then KVM would need to purge kvm_run.exit_reason super early in
> KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super early is
> subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
> reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> pending signal).
> 
> https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
> 
> 

Agreed on not relaxing this to any errno.  However, using -EFAULT as part of
the ABI definition seems a little bit dangerous, e.g., someone could
accidentally or mistakenly return -EFAULT early in KVM_RUN and/or in a
completely different code path, etc.  -EINTR has a well defined meaning, but
-EFAULT (which is "Bad address") seems not to, though I am not sure either. :-)

One example: for a backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when the SGX
vepc fault handler fails to allocate EPC) and kvm_handle_error_pfn() will just
return -EFAULT.  If kvm_run.exit_reason isn't purged early, is it possible
to have some issue here?



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
  2023-10-27 18:21 ` [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events Sean Christopherson
  2023-10-30 17:21   ` Paolo Bonzini
@ 2023-11-02  5:59   ` Binbin Wu
  2023-11-02 11:14     ` Paolo Bonzini
  2023-11-02 14:01   ` Fuad Tabba
  2 siblings, 1 reply; 148+ messages in thread
From: Binbin Wu @ 2023-11-02  5:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov



On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> Add flags to "struct kvm_gfn_range" to let notifier events target only
> shared and only private mappings, and wire up the existing mmu_notifier
> events to be shared-only (private memory is never associated with a
> userspace virtual address, i.e. can't be reached via mmu_notifiers).
>
> Add two flags so that KVM can handle the three possibilities (shared,
> private, and shared+private) without needing something like a tri-state
> enum.
I see the two flags are set/cleared in __kvm_handle_hva_range() in this
patch and in kvm_handle_gfn_range() from the later patch 13/35, but I didn't
see them used/read anywhere in this patch series, unless I missed something.
How are they supposed to be used in KVM?


>
> Link: https://lore.kernel.org/all/ZJX0hk+KpQP0KUyB@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   include/linux/kvm_host.h | 2 ++
>   virt/kvm/kvm_main.c      | 7 +++++++
>   2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 96aa930536b1..89c1a991a3b8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -263,6 +263,8 @@ struct kvm_gfn_range {
>   	gfn_t start;
>   	gfn_t end;
>   	union kvm_mmu_notifier_arg arg;
> +	bool only_private;
> +	bool only_shared;
>   	bool may_block;
>   };
>   bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cb9376833c18..302ccb87b4c1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>   			 * the second or later invocation of the handler).
>   			 */
>   			gfn_range.arg = range->arg;
> +
> +			/*
> +			 * HVA-based notifications aren't relevant to private
> +			 * mappings as they don't have a userspace mapping.
> +			 */
> +			gfn_range.only_private = false;
> +			gfn_range.only_shared = true;
>   			gfn_range.may_block = range->may_block;
>   
>   			/*


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-11-02  3:17       ` Huang, Kai
@ 2023-11-02  9:35         ` Huang, Kai
  2023-11-02 11:03           ` Paolo Bonzini
  2023-11-02 15:56         ` Sean Christopherson
  1 sibling, 1 reply; 148+ messages in thread
From: Huang, Kai @ 2023-11-02  9:35 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: Li, Xiaoyao, kvm-riscv, mic, liam.merwick, kvm, Yamahata, Isaku,
	viro, kirill.shutemov, david, linux-mips, amoorthy, linuxppc-dev,
	tabba, kvmarm, linux-kernel, linux-fsdevel, oliver.upton,
	michael.roth, chao.p.peng, palmer, pbonzini, chenhuacai, aou,
	mpe, Annapurve, Vishal, vbabka, mail, linux-riscv, maz, willy,
	dmatlack, anup, yu.c.zhang, Xu, Yilun, qperret, brauner,
	isaku.yamahata, ackerleytng, jarkko, paul.walmsley,
	linux-arm-kernel, linux-mm, Wang, Wei W, akpm

On Thu, 2023-11-02 at 03:17 +0000, Huang, Kai wrote:
> On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> > On Wed, Nov 01, 2023, Kai Huang wrote:
> > > 
> > > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > > +------------------------------
> > > > +
> > > > +:Architectures: x86
> > > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > > +
> > > > +The presence of this capability indicates that KVM_RUN will fill
> > > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > > > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > > > +address.
> > > > +
> > > > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > > > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > > > +to KVM_EXIT_MEMORY_FAULT.
> > > 
> > > IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> > > implementation.
> > 
> > The errno that is returned to userspace is ABI.  In KVM, it's a _very_ poorly
> > defined ABI for the vast majority of ioctls(), but it's still technically ABI.
> > KVM gets away with being cavalier with errno because the vast majority of errors
> > are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
> > care about the exact errno.
> > 
> > A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> > -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.
> > 
> > > Is it better to relax the validity of kvm_run.memory_fault when
> > > KVM_RUN returns any -errno?
> > 
> > Not unless there's a need to do so, and if there is then we can update the
> > documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is valid
> > for any errno, then KVM would need to purge kvm_run.exit_reason super early in
> > KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> > misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super early is
> > subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
> > reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> > pending signal).
> > 
> > https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
> > 
> > 
> 
> Agreed with not to relax to any errno.  However using -EFAULT as part of ABI
> definition seems a little bit dangerous, e.g., someone could accidentally or
> mistakenly return -EFAULT in KVM_RUN at early time and/or in a completely
> different code path, etc.  -EINTR has well defined meaning, but -EFAULT (which
> is "Bad address") seems doesn't but I am not sure either. :-)
> 
> One example is, for backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
> KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when SGX vepc
> fault handler failed to allocate EPC) and kvm_handle_error_pfn() will just
> return -EFAULT.  If kvm_run.exit_reason isn't purged early then is it possible
> to have some issue here?
> 

Also, regardless of whether -EFAULT is too ambiguous to be part of the ABI,
could you elaborate on the EHWPOISON part?  IIUC KVM can already handle the
case of a poisoned page by sending a signal to the user app:

	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
					struct kvm_page_fault *fault)
	{
		...

		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
			kvm_send_hwpoison_signal(fault->slot, fault->gfn);
			return RET_PF_RETRY;
		}
	}

And (sorry to hijack) I am wondering whether "SGX vepc unable to allocate
EPC" could also use this memory_fault mechanism.  Currently, as mentioned
above, when the vepc fault handler cannot allocate an EPC page, KVM returns
-EFAULT to QEMU, and QEMU prints ...

	...: Bad address
	<dump guest cpu registers>

... which is nonsense.

If we could use memory_fault.flags (or is 'fault_reason' a better name?) to
carry a specific value for EPC, QEMU would know what happened and could then
do something more reasonable.
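
To make that concrete, I am imagining something along these lines (the flag
name, bit value and QEMU-side handling are all made up, just a sketch):

	/* Hypothetical flag, not defined by this series. */
	#define KVM_MEMORY_EXIT_FLAG_SGX_EPC	(1ULL << 4)

	/* Hypothetical QEMU-side handling: */
	case KVM_EXIT_MEMORY_FAULT:
		if (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_SGX_EPC)
			error_report("guest ran out of SGX EPC");
		break;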


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
  2023-11-02  3:01   ` Huang, Kai
@ 2023-11-02 10:32     ` Paolo Bonzini
  2023-11-02 10:55       ` Huang, Kai
  0 siblings, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-02 10:32 UTC (permalink / raw)
  To: Huang, Kai, viro, aou, Christopherson,,
	Sean, brauner, oliver.upton, chenhuacai, paul.walmsley, palmer,
	maz, mpe, willy, anup, akpm
  Cc: Li, Xiaoyao, kvm-riscv, mic, liam.merwick, kvm, Yamahata, Isaku,
	kirill.shutemov, david, tabba, amoorthy, linuxppc-dev,
	michael.roth, kvmarm, linux-kernel, linux-fsdevel, linux-riscv,
	chao.p.peng, linux-mips, Annapurve, Vishal, vbabka, mail,
	yu.c.zhang, qperret, dmatlack, Xu, Yilun, isaku.yamahata,
	ackerleytng, jarkko, linux-arm-kernel, linux-mm, Wang, Wei W

On 11/2/23 04:01, Huang, Kai wrote:
> On Fri, 2023-10-27 at 11:21 -0700, Sean Christopherson wrote:
>> From: Chao Peng <chao.p.peng@linux.intel.com>
>>
>> In confidential computing usages, whether a page is private or shared is
>> necessary information for KVM to perform operations like page fault
>> handling, page zapping etc. There are other potential use cases for
>> per-page memory attributes, e.g. to make memory read-only (or no-exec,
>> or exec-only, etc.) without having to modify memslots.
>>
>> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
>> userspace to operate on the per-page memory attributes.
>>    - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>>      a guest memory range.
>>    - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>>      memory attributes.
>>
>> Use an xarray to store the per-page attributes internally, with a naive,
>> not fully optimized implementation, i.e. prioritize correctness over
>> performance for the initial implementation.
>>
>> Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
>> attributes/protections in the future, e.g. to give userspace fine-grained
>> control over read, write, and execute protections for guest memory.
>>
>> Provide arch hooks for handling attribute changes before and after common
>> code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
>> relevant mappings, and the "post" hook to track whether or not hugepages
>> can be used to map the range.
>>
>> To simplify the implementation wrap the entire sequence with
>> kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
>> guaranteed to be an invalidation.  For the initial use case, x86 *will*
>> always invalidate memory, and preventing arch code from creating new
>> mappings while the attributes are in flux makes it much easier to reason
>> about the correctness of consuming attributes.
>>
>> It's possible that future usages may not require an invalidation, e.g.
>> if KVM ends up supporting RWX protections and userspace grants _more_
>> protections, but again opt for simplicity and punt optimizations to
>> if/when they are needed.
>>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
>> Cc: Fuad Tabba <tabba@google.com>
>> Cc: Xu Yilun <yilun.xu@intel.com>
>> Cc: Mickaël Salaün <mic@digikod.net>
>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> Co-developed-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>
> 
> [...]
> 
>> +Note, there is no "get" API.  Userspace is responsible for explicitly tracking
>> +the state of a gfn/page as needed.
>> +
>>
> 
> [...]
> 
>>   
>> +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>> +static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>> +{
>> +	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
>> +}
> 
> Only call xa_to_value() when xa_load() returns !NULL?

This xarray does not store a pointer, therefore xa_load() actually 
returns an integer that is tagged with 1 in the low bit:

static inline unsigned long xa_to_value(const void *entry)
{
         return (unsigned long)entry >> 1;
}

Returning zero for an empty entry is okay, so the result of xa_load() 
can be used directly.
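
In other words (a standalone illustration of the xarray value API, not KVM
code):

	/* Value entries round-trip; an empty slot reads back as 0. */
	xa_store(&xa, gfn, xa_mk_value(attrs), GFP_KERNEL);
	attrs = xa_to_value(xa_load(&xa, gfn));		/* == attrs */
	none  = xa_to_value(xa_load(&xa, other_gfn));	/* NULL >> 1 == 0 */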


>> +
>> +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
>> +				     unsigned long attrs);
> 
> Seems it's not immediately clear why this function is needed in this patch,
> especially when you said there is no "get" API above.  Add some material to
> changelog?

It's used by later patches; even without a "get" API, it's a pretty 
fundamental functionality.
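
E.g. a later caller can use it to ask whether a whole range has a uniform
attribute; roughly (a sketch of a hypothetical caller, not the actual code
from the series):

	if (kvm_range_has_memory_attributes(kvm, start, end,
					    KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
		/* the entire [start, end) range is private */
	}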

>> +bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>> +					struct kvm_gfn_range *range);
>> +bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>> +					 struct kvm_gfn_range *range);
> 
> Looks if this Kconfig is on, the above two arch hooks won't have implementation.
> 
> Is it better to have two __weak empty versions here in this patch?
> 
> Anyway, from the changelog it seems it's not mandatory for some ARCH to provide
> the above two if one wants to turn this on, i.e., the two hooks can be empty and
> the ARCH can just use the __weak version.

I think this can be added by the first arch that needs memory attributes 
and also doesn't need one of these hooks.  Or perhaps the x86 
kvm_arch_pre_set_memory_attributes() could be made generic and thus that 
would be the __weak version.  It's too early to tell, so it's better to 
leave the implementation to the architectures for now.

Paolo


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
  2023-11-02 10:32     ` Paolo Bonzini
@ 2023-11-02 10:55       ` Huang, Kai
  0 siblings, 0 replies; 148+ messages in thread
From: Huang, Kai @ 2023-11-02 10:55 UTC (permalink / raw)
  To: willy, Christopherson,,
	Sean, brauner, oliver.upton, paul.walmsley, aou, palmer,
	chenhuacai, viro, pbonzini, maz, anup, mpe, akpm
  Cc: Li, Xiaoyao, kvm-riscv, mic, liam.merwick, kvm, Yamahata, Isaku,
	kirill.shutemov, david, tabba, amoorthy, linuxppc-dev,
	michael.roth, kvmarm, linux-kernel, linux-fsdevel, linux-riscv,
	linux-mips, chao.p.peng, Annapurve, Vishal, vbabka, mail,
	yu.c.zhang, qperret, dmatlack, Xu, Yilun, isaku.yamahata,
	ackerleytng, jarkko, linux-arm-kernel, linux-mm, Wang, Wei W

On Thu, 2023-11-02 at 11:32 +0100, Paolo Bonzini wrote:
> > > +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > > +static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > > +{
> > > +	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> > > +}
> > 
> > Only call xa_to_value() when xa_load() returns !NULL?
> 
> This xarray does not store a pointer, therefore xa_load() actually 
> returns an integer that is tagged with 1 in the low bit:
> 
> static inline unsigned long xa_to_value(const void *entry)
> {
>          return (unsigned long)entry >> 1;
> }
> 
> Returning zero for an empty entry is okay, so the result of xa_load() 
> can be used directly.

Thanks for explaining.  I was thinking perhaps it's better to do:

	void *entry = xa_load(...);
	
	return xa_is_value(entry) ? xa_to_value(entry) : 0;

But "NULL (0) >> 1" is still 0, so yes we can use directly.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-11-01 17:36     ` Sean Christopherson
  2023-11-02  2:19       ` Xiaoyao Li
  2023-11-02  3:17       ` Huang, Kai
@ 2023-11-02 11:01       ` Paolo Bonzini
  2 siblings, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-02 11:01 UTC (permalink / raw)
  To: Sean Christopherson, Kai Huang
  Cc: viro, aou, brauner, oliver.upton, chenhuacai, paul.walmsley,
	palmer, maz, mpe, willy, anup, akpm, Xiaoyao Li, kvm-riscv, mic,
	liam.merwick, kvm, Isaku Yamahata, kirill.shutemov, david, tabba,
	amoorthy, linuxppc-dev, michael.roth, kvmarm, linux-kernel,
	linux-fsdevel, linux-riscv, chao.p.peng, linux-mips,
	Vishal Annapurve, vbabka, mail, yu.c.zhang, qperret, dmatlack,
	Yilun Xu, isaku.yamahata, ackerleytng, jarkko, linux-arm-kernel,
	linux-mm, Wei W Wang

On 11/1/23 18:36, Sean Christopherson wrote:
> A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.

And dually if KVM were to return KVM_EXIT_INTR together with something 
other than -EINTR.

> And purging exit_reason super early is subtly tricky because KVM's 
> (again, poorly documented) ABI is that *some* exit reasons are preserved 
> across KVM_RUN with vcpu->run->immediate_exit (or with a pending 
> signal). https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com

vcpu->run->immediate_exit preserves all exit reasons, but it's not a 
good idea that immediate_exit behaves differently from a pending signal on
entry to KVM_RUN (remember that immediate_exit was meant to be a better 
performing alternative to KVM_SET_SIGNAL_MASK).

In principle, vcpu->run->immediate_exit could return KVM_EXIT_INTR 
(perhaps even _should_, except that breaks selftests so at this point it 
is ABI).

Paolo


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-11-02  9:35         ` Huang, Kai
@ 2023-11-02 11:03           ` Paolo Bonzini
  2023-11-02 15:44             ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-02 11:03 UTC (permalink / raw)
  To: Huang, Kai, Christopherson,, Sean
  Cc: Li, Xiaoyao, kvm-riscv, mic, liam.merwick, kvm, Yamahata, Isaku,
	viro, kirill.shutemov, david, linux-mips, amoorthy, linuxppc-dev,
	tabba, kvmarm, linux-kernel, linux-fsdevel, oliver.upton,
	michael.roth, chao.p.peng, palmer, chenhuacai, aou, mpe,
	Annapurve, Vishal, vbabka, mail, linux-riscv, maz, willy,
	dmatlack, anup, yu.c.zhang, Xu, Yilun, qperret, brauner,
	isaku.yamahata, ackerleytng, jarkko, paul.walmsley,
	linux-arm-kernel, linux-mm, Wang, Wei W, akpm

On 11/2/23 10:35, Huang, Kai wrote:
> IIUC KVM can already handle the case of poisoned
> page by sending signal to user app: 
> 
> 	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, 
> 			struct kvm_page_fault *fault)                                               
> 	{       
> 		...
> 
>        		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
>               		kvm_send_hwpoison_signal(fault->slot, fault->gfn);
>                 	return RET_PF_RETRY;                                          
>         	}
> 	}

EHWPOISON is not implemented by this series, so it should be left out of 
the documentation.


> Currently as mentioned above when
> vepc fault handler cannot allocate EPC page KVM returns -EFAULT to Qemu, and
> Qemu prints ...
> 
> 	...: Bad address
> 	<dump guest cpu registers>
> 
> ... which is nonsense.
> 
> If we can use memory_fault.flags (or is 'fault_reason' a better name?) to carry
> a specific value for EPC to let Qemu know and Qemu can then do more reasonable
> things.

Yes, that's a good idea that can be implemented on top.

Paolo


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
  2023-11-02  5:59   ` Binbin Wu
@ 2023-11-02 11:14     ` Paolo Bonzini
  0 siblings, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-02 11:14 UTC (permalink / raw)
  To: Binbin Wu, Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On 11/2/23 06:59, Binbin Wu wrote:
> 
>> Add flags to "struct kvm_gfn_range" to let notifier events target
>> only shared and only private mappings, and wire up the existing
>> mmu_notifier events to be shared-only (private memory is never
>> associated with a userspace virtual address, i.e. can't be reached
>> via mmu_notifiers).
>> 
>> Add two flags so that KVM can handle the three possibilities
>> (shared, private, and shared+private) without needing something
>> like a tri-state enum.
>
> I see the two flags are set/cleared in __kvm_handle_hva_range() in
> this patch and kvm_handle_gfn_range() from the later patch 13/35, but
> I didn't see they are used/read in this patch series if I didn't miss
> anything.  How are they supposed to be used in KVM?

They are going to be used by SNP/TDX patches.
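
A consumer would presumably look something like this (a hypothetical sketch;
the zap helpers are made-up names, not code from those series):

	/* Hypothetical arch handler, for illustration only. */
	static bool example_unmap_gfn_range(struct kvm *kvm,
					    struct kvm_gfn_range *range)
	{
		bool flush = false;

		if (!range->only_private)	/* shared mappings in scope */
			flush |= zap_shared_mappings(kvm, range);
		if (!range->only_shared)	/* private mappings in scope */
			flush |= zap_private_mappings(kvm, range);

		return flush;
	}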

Paolo


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-11-01 21:55         ` Sean Christopherson
@ 2023-11-02 13:52           ` Fuad Tabba
  2023-11-03 23:17             ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Fuad Tabba @ 2023-11-02 13:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Wed, Nov 1, 2023 at 9:55 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Nov 01, 2023, Fuad Tabba wrote:
> > > > > @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
> > > > >  /* This does not remove the slot from struct kvm_memslots data structures */
> > > > >  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> > > > >  {
> > > > > +       if (slot->flags & KVM_MEM_PRIVATE)
> > > > > +               kvm_gmem_unbind(slot);
> > > > > +
> > > >
> > > > Should this be called after kvm_arch_free_memslot()? Arch-specific ode
> > > > might need some of the data before the unbinding, something I thought
> > > > might be necessary at one point for the pKVM port when deleting a
> > > > memslot, but realized later that kvm_invalidate_memslot() ->
> > > > kvm_arch_guest_memory_reclaimed() was the more logical place for it.
> > > > Also, since that seems to be the pattern for arch-specific handlers in
> > > > KVM.
> > >
> > > Maybe?  But only if we care about symmetry between the allocation and free paths.
> > > I really don't think kvm_arch_free_memslot() should be doing anything beyond a
> > > "pure" free.  E.g. kvm_arch_free_memslot() is also called after moving a memslot,
> > > which hopefully we never actually have to allow for guest_memfd, but any code in
> > > kvm_arch_free_memslot() would bring about "what if" questions regarding memslot
> > > movement.  I.e. the API is intended to be a "free arch metadata associated with
> > > the memslot".
> > >
> > > Out of curiosity, what does pKVM need to do at kvm_arch_guest_memory_reclaimed()?
> >
> > It's about the host reclaiming ownership of guest memory when tearing
> > down a protected guest. In pKVM, we currently teardown the guest and
> > reclaim its memory when kvm_arch_destroy_vm() is called. The problem
> > with guestmem is that kvm_gmem_unbind() could get called before that
> > happens, after which the host might try to access the unbound guest
> > memory. Since the host hasn't reclaimed ownership of the guest memory
> > from hyp, hilarity ensues (it crashes).
> >
> > Initially, I hooked reclaim guest memory to kvm_free_memslot(), but
> > then I needed to move the unbind later in the function. I realized
> > later that kvm_arch_guest_memory_reclaimed() gets called earlier (at
> > the right time), and is more aptly named.
>
> Aha!  I suspected that might be the case.
>
> TDX and SNP also need to solve the same problem of "reclaiming" memory before it
> can be safely accessed by the host.  The plan is to add an arch hook (or two?)
> into guest_memfd that is invoked when memory is freed from guest_memfd.
>
> Hooking kvm_arch_guest_memory_reclaimed() isn't completely correct as deleting a
> memslot doesn't *guarantee* that guest memory is actually reclaimed (which reminds
> me, we need to figure out a better name for that thing before introducing
> kvm_arch_gmem_invalidate()).

I see. I'd assumed that was what you were using. I agree that it's not
completely correct, so for the moment I assume that if that happens we have
a misbehaving host, and tear down the guest and reclaim its memory.

> The effective false positives aren't fatal for the current usage because the hook
> is used only for x86 SEV guests to flush caches.  An unnecessary flush can cause
> performance issues, but it doesn't affect correctness. For TDX and SNP, and IIUC
> pKVM, false positives are fatal because KVM could assign memory back to the host
> that is still owned by guest_memfd.

Yup.

> E.g. a misbehaving userspace could prematurely delete a memslot.  And the more
> fun example is intrahost migration, where the plan is to allow pointing multiple
> guest_memfd files at a single guest_memfd inode:
> https://lore.kernel.org/all/cover.1691446946.git.ackerleytng@google.com
>
> There was a lot of discussion for this, but it's scattered all over the place.
> The TL;DR is is that the inode will represent physical memory, and a file will
> represent a given "struct kvm" instance's view of that memory.  And so the memory
> isn't reclaimed until the inode is truncated/punched.
>
> I _think_ this reflects the most recent plan from the guest_memfd side:
> https://lore.kernel.org/all/1233d749211c08d51f9ca5d427938d47f008af1f.1689893403.git.isaku.yamahata@intel.com

Thanks for pointing that out. I think this might be the way to go.
I'll have a closer look at this and see how to get it to work with
pKVM.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
  2023-10-27 18:21 ` [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory Sean Christopherson
  2023-10-30 17:11   ` Paolo Bonzini
@ 2023-11-02 13:55   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-02 13:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
> __kvm_handle_hva_range() return whether or not an overlapping memslot
> was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
> works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
> mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.
>
> Use a small struct to return the tuple of the notifier-specific return,
> plus whether or not overlap was found.  Because the iteration helpers are
> __always_inlined, practically speaking, the struct will never actually be
> returned from a function call (not to mention the size of the struct will
> be two bytes in practice).
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad


>  virt/kvm/kvm_main.c | 53 +++++++++++++++++++++++++++++++--------------
>  1 file changed, 37 insertions(+), 16 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3f5b7c2c5327..2bc04c8ae1f4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -561,6 +561,19 @@ struct kvm_mmu_notifier_range {
>         bool may_block;
>  };
>
> +/*
> + * The inner-most helper returns a tuple containing the return value from the
> + * arch- and action-specific handler, plus a flag indicating whether or not at
> + * least one memslot was found, i.e. if the handler found guest memory.
> + *
> + * Note, most notifiers are averse to booleans, so even though KVM tracks the
> + * return from arch code as a bool, outer helpers will cast it to an int. :-(
> + */
> +typedef struct kvm_mmu_notifier_return {
> +       bool ret;
> +       bool found_memslot;
> +} kvm_mn_ret_t;
> +
>  /*
>   * Use a dedicated stub instead of NULL to indicate that there is no callback
>   * function/handler.  The compiler technically can't guarantee that a real
> @@ -582,22 +595,25 @@ static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
>              node;                                                           \
>              node = interval_tree_iter_next(node, start, last))      \
>
> -static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> -                                                 const struct kvm_mmu_notifier_range *range)
> +static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> +                                                          const struct kvm_mmu_notifier_range *range)
>  {
> -       bool ret = false, locked = false;
> +       struct kvm_mmu_notifier_return r = {
> +               .ret = false,
> +               .found_memslot = false,
> +       };
>         struct kvm_gfn_range gfn_range;
>         struct kvm_memory_slot *slot;
>         struct kvm_memslots *slots;
>         int i, idx;
>
>         if (WARN_ON_ONCE(range->end <= range->start))
> -               return 0;
> +               return r;
>
>         /* A null handler is allowed if and only if on_lock() is provided. */
>         if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
>                          IS_KVM_NULL_FN(range->handler)))
> -               return 0;
> +               return r;
>
>         idx = srcu_read_lock(&kvm->srcu);
>
> @@ -631,8 +647,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>                         gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
>                         gfn_range.slot = slot;
>
> -                       if (!locked) {
> -                               locked = true;
> +                       if (!r.found_memslot) {
> +                               r.found_memslot = true;
>                                 KVM_MMU_LOCK(kvm);
>                                 if (!IS_KVM_NULL_FN(range->on_lock))
>                                         range->on_lock(kvm);
> @@ -640,14 +656,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>                                 if (IS_KVM_NULL_FN(range->handler))
>                                         break;
>                         }
> -                       ret |= range->handler(kvm, &gfn_range);
> +                       r.ret |= range->handler(kvm, &gfn_range);
>                 }
>         }
>
> -       if (range->flush_on_ret && ret)
> +       if (range->flush_on_ret && r.ret)
>                 kvm_flush_remote_tlbs(kvm);
>
> -       if (locked) {
> +       if (r.found_memslot) {
>                 KVM_MMU_UNLOCK(kvm);
>                 if (!IS_KVM_NULL_FN(range->on_unlock))
>                         range->on_unlock(kvm);
> @@ -655,8 +671,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>
>         srcu_read_unlock(&kvm->srcu, idx);
>
> -       /* The notifiers are averse to booleans. :-( */
> -       return (int)ret;
> +       return r;
>  }
>
>  static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> @@ -677,7 +692,7 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
>                 .may_block      = false,
>         };
>
> -       return __kvm_handle_hva_range(kvm, &range);
> +       return __kvm_handle_hva_range(kvm, &range).ret;
>  }
>
>  static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
> @@ -696,7 +711,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
>                 .may_block      = false,
>         };
>
> -       return __kvm_handle_hva_range(kvm, &range);
> +       return __kvm_handle_hva_range(kvm, &range).ret;
>  }
>
>  static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> @@ -798,7 +813,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                 .end            = range->end,
>                 .handler        = kvm_mmu_unmap_gfn_range,
>                 .on_lock        = kvm_mmu_invalidate_begin,
> -               .on_unlock      = kvm_arch_guest_memory_reclaimed,
> +               .on_unlock      = (void *)kvm_null_fn,
>                 .flush_on_ret   = true,
>                 .may_block      = mmu_notifier_range_blockable(range),
>         };
> @@ -830,7 +845,13 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>         gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
>                                           hva_range.may_block);
>
> -       __kvm_handle_hva_range(kvm, &hva_range);
> +       /*
> +        * If one or more memslots were found and thus zapped, notify arch code
> +        * that guest memory has been reclaimed.  This needs to be done *after*
> +        * dropping mmu_lock, as x86's reclaim path is slooooow.
> +        */
> +       if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
> +               kvm_arch_guest_memory_reclaimed(kvm);
>
>         return 0;
>  }
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook
  2023-10-27 18:21 ` [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook Sean Christopherson
  2023-10-30 17:18   ` Paolo Bonzini
@ 2023-11-02 13:55   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-02 13:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Drop the .on_unlock() mmu_notifer hook now that it's no longer used for
> notifying arch code that memory has been reclaimed.  Adding .on_unlock()
> and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
> resulted in .on_lock() and .on_unlock() having divergent and asymmetric
> behavior, and set future developers up for failure, i.e. all but asked for
> bugs where KVM relied on using .on_unlock() to try to run a callback while
> holding mmu_lock.
>
> Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
> guard against future bugs of this nature.
>
> Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Link: https://lore.kernel.org/all/20230802203119.GB2021422@ls.amr.corp.intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

>  virt/kvm/kvm_main.c | 13 +++----------
>  1 file changed, 3 insertions(+), 10 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2bc04c8ae1f4..cb9376833c18 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -544,7 +544,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>
>  typedef void (*on_lock_fn_t)(struct kvm *kvm);
> -typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>
>  struct kvm_mmu_notifier_range {
>         /*
> @@ -556,7 +555,6 @@ struct kvm_mmu_notifier_range {
>         union kvm_mmu_notifier_arg arg;
>         gfn_handler_t handler;
>         on_lock_fn_t on_lock;
> -       on_unlock_fn_t on_unlock;
>         bool flush_on_ret;
>         bool may_block;
>  };
> @@ -663,11 +661,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>         if (range->flush_on_ret && r.ret)
>                 kvm_flush_remote_tlbs(kvm);
>
> -       if (r.found_memslot) {
> +       if (r.found_memslot)
>                 KVM_MMU_UNLOCK(kvm);
> -               if (!IS_KVM_NULL_FN(range->on_unlock))
> -                       range->on_unlock(kvm);
> -       }
>
>         srcu_read_unlock(&kvm->srcu, idx);
>
> @@ -687,7 +682,6 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
>                 .arg            = arg,
>                 .handler        = handler,
>                 .on_lock        = (void *)kvm_null_fn,
> -               .on_unlock      = (void *)kvm_null_fn,
>                 .flush_on_ret   = true,
>                 .may_block      = false,
>         };
> @@ -706,7 +700,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
>                 .end            = end,
>                 .handler        = handler,
>                 .on_lock        = (void *)kvm_null_fn,
> -               .on_unlock      = (void *)kvm_null_fn,
>                 .flush_on_ret   = false,
>                 .may_block      = false,
>         };
> @@ -813,7 +806,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                 .end            = range->end,
>                 .handler        = kvm_mmu_unmap_gfn_range,
>                 .on_lock        = kvm_mmu_invalidate_begin,
> -               .on_unlock      = (void *)kvm_null_fn,
>                 .flush_on_ret   = true,
>                 .may_block      = mmu_notifier_range_blockable(range),
>         };
> @@ -858,6 +850,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>
>  void kvm_mmu_invalidate_end(struct kvm *kvm)
>  {
> +       lockdep_assert_held_write(&kvm->mmu_lock);
> +
>         /*
>          * This sequence increase will notify the kvm page fault that
>          * the page that is going to be mapped in the spte could have
> @@ -889,7 +883,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>                 .end            = range->end,
>                 .handler        = (void *)kvm_null_fn,
>                 .on_lock        = kvm_mmu_invalidate_end,
> -               .on_unlock      = (void *)kvm_null_fn,
>                 .flush_on_ret   = false,
>                 .may_block      = mmu_notifier_range_blockable(range),
>         };
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
  2023-10-27 18:21 ` [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events Sean Christopherson
  2023-10-30 17:21   ` Paolo Bonzini
  2023-11-02  5:59   ` Binbin Wu
@ 2023-11-02 14:01   ` Fuad Tabba
  2023-11-02 14:41     ` Sean Christopherson
  2 siblings, 1 reply; 148+ messages in thread
From: Fuad Tabba @ 2023-11-02 14:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Hi,

On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Add flags to "struct kvm_gfn_range" to let notifier events target only
> shared and only private mappings, and wire up the existing mmu_notifier
> events to be shared-only (private memory is never associated with a
> userspace virtual address, i.e. can't be reached via mmu_notifiers).
>
> Add two flags so that KVM can handle the three possibilities (shared,
> private, and shared+private) without needing something like a tri-state
> enum.
>
> Link: https://lore.kernel.org/all/ZJX0hk+KpQP0KUyB@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/kvm_host.h | 2 ++
>  virt/kvm/kvm_main.c      | 7 +++++++
>  2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 96aa930536b1..89c1a991a3b8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -263,6 +263,8 @@ struct kvm_gfn_range {
>         gfn_t start;
>         gfn_t end;
>         union kvm_mmu_notifier_arg arg;
> +       bool only_private;
> +       bool only_shared;

If these flags aren't used in this patch series, should this patch be
moved to the other series?

Also, if shared+private is a possibility, doesn't the prefix "only_"
confuse things a bit? I.e., what denotes shared+private: both flags being
0, or both being 1? I assume it's the former (both are 0), but it could be
made clearer.

Cheers,
/fuad

>         bool may_block;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cb9376833c18..302ccb87b4c1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>                          * the second or later invocation of the handler).
>                          */
>                         gfn_range.arg = range->arg;
> +
> +                       /*
> +                        * HVA-based notifications aren't relevant to private
> +                        * mappings as they don't have a userspace mapping.
> +                        */
> +                       gfn_range.only_private = false;
> +                       gfn_range.only_shared = true;
>                         gfn_range.may_block = range->may_block;
>
>                         /*
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
  2023-10-27 18:22 ` [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN Sean Christopherson
  2023-10-30 17:31   ` Paolo Bonzini
@ 2023-11-02 14:16   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-02 14:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
> the probability of exiting to userspace with a stale run->exit_reason that
> *appears* to be valid.
>
> To support fd-based guest memory (guest memory without a corresponding
> userspace virtual address), KVM will exit to userspace for various memory
> related errors, which userspace *may* be able to resolve, instead of using
> e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
> utilize the same functionality to let userspace "intercept" and handle
> memory faults when the userspace mapping is missing, i.e. when fast gup()
> fails.
>
> Because many of KVM's internal APIs related to guest memory use '0' to
> indicate "success, continue on" and not "exit to userspace", reporting
> memory faults/errors to userspace will set run->exit_reason and
> corresponding fields in the run structure fields in conjunction with a
> a non-zero, negative return code, e.g. -EFAULT or -EHWPOISON.  And because
> KVM already returns  -EFAULT in many paths, there's a relatively high
> probability that KVM could return -EFAULT without setting run->exit_reason,
> in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
> whatever exit reason happened to be in the run structure.
>
> Note, KVM must wait until after run->immediate_exit is serviced to
> sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
> preserved across KVM_RUN when run->immediate_exit is true.
>
> Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoorthy@google.com
> Link: https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

>  arch/x86/kvm/x86.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ee3cd8c3c0ef..f41dbb1465a0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10963,6 +10963,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  {
>         int r;
>
> +       vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
>         vcpu->arch.l1tf_flush_l1d = true;
>
>         for (;;) {
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory
  2023-10-27 18:22 ` [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
@ 2023-11-02 14:34   ` Fuad Tabba
  2023-11-05 13:02   ` Xu Yilun
  1 sibling, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-02 14:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <seanjc@google.com> wrote:
>
> From: Chao Peng <chao.p.peng@linux.intel.com>
>
> Add support for resolving page faults on guest private memory for VMs
> that differentiate between "shared" and "private" memory.  For such VMs,
> KVM_MEM_PRIVATE memslots can include both fd-based private memory and
> hva-based shared memory, and KVM needs to map in the "correct" variant,
> i.e. KVM needs to map the gfn shared/private as appropriate based on the
> current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.
>
> For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
> shared vs. private via a bit in the guest page tables, i.e. what the guest
> wants may conflict with the current memory attributes.  To support such
> "implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
> to forward the request to userspace.  Add a new flag for memory faults,
> KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
> map memory as shared vs. private.
>
> Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
> so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
> needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
> exit on missing mappings when handling guest page fault VM-Exits.  In
> that case, userspace will want to know RWX information in order to
> correctly/precisely resolve the fault.
>
> Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
> always come from the host userspace page tables, and private mappings
> always come from a guest_memfd instance.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

With my limited understanding of kvm-x86 mmu code:
Reviewed-by: Fuad Tabba <tabba@google.com>

Tested the x86 code (as part of this patch series) on qemu. The x86
fault handling code was used as a guide to how it should be done for
pKVM/arm64 (with similar code added there):
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

>  Documentation/virt/kvm/api.rst  |   8 ++-
>  arch/x86/kvm/mmu/mmu.c          | 101 ++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/mmu_internal.h |   1 +
>  include/linux/kvm_host.h        |   8 ++-
>  include/uapi/linux/kvm.h        |   1 +
>  5 files changed, 110 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 7f00c310c24a..38dc1fda4f45 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6837,6 +6837,7 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
>
>                 /* KVM_EXIT_MEMORY_FAULT */
>                 struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
>                         __u64 flags;
>                         __u64 gpa;
>                         __u64 size;
> @@ -6845,8 +6846,11 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
>  KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
>  could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
>  guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
> -describes properties of the faulting access that are likely pertinent.
> -Currently, no flags are defined.
> +describes properties of the faulting access that are likely pertinent:
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
> +   on a private memory access.  When clear, indicates the fault occurred on a
> +   shared access.
>
>  Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
>  accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4167d557c577..c4e758f0aebb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3147,9 +3147,9 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
>         return level;
>  }
>
> -int kvm_mmu_max_mapping_level(struct kvm *kvm,
> -                             const struct kvm_memory_slot *slot, gfn_t gfn,
> -                             int max_level)
> +static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
> +                                      const struct kvm_memory_slot *slot,
> +                                      gfn_t gfn, int max_level, bool is_private)
>  {
>         struct kvm_lpage_info *linfo;
>         int host_level;
> @@ -3161,6 +3161,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>                         break;
>         }
>
> +       if (is_private)
> +               return max_level;
> +
>         if (max_level == PG_LEVEL_4K)
>                 return PG_LEVEL_4K;
>
> @@ -3168,6 +3171,16 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>         return min(host_level, max_level);
>  }
>
> +int kvm_mmu_max_mapping_level(struct kvm *kvm,
> +                             const struct kvm_memory_slot *slot, gfn_t gfn,
> +                             int max_level)
> +{
> +       bool is_private = kvm_slot_can_be_private(slot) &&
> +                         kvm_mem_is_private(kvm, gfn);
> +
> +       return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
> +}
> +
>  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>         struct kvm_memory_slot *slot = fault->slot;
> @@ -3188,8 +3201,9 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>          * Enforce the iTLB multihit workaround after capturing the requested
>          * level, which will be used to do precise, accurate accounting.
>          */
> -       fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> -                                                    fault->gfn, fault->max_level);
> +       fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> +                                                      fault->gfn, fault->max_level,
> +                                                      fault->is_private);
>         if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
>                 return;
>
> @@ -4261,6 +4275,55 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>         kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
>  }
>
> +static inline u8 kvm_max_level_for_order(int order)
> +{
> +       BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> +       KVM_MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) &&
> +                       order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) &&
> +                       order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K));
> +
> +       if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> +               return PG_LEVEL_1G;
> +
> +       if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> +               return PG_LEVEL_2M;
> +
> +       return PG_LEVEL_4K;
> +}
> +
> +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +                                             struct kvm_page_fault *fault)
> +{
> +       kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> +                                     PAGE_SIZE, fault->write, fault->exec,
> +                                     fault->is_private);
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> +                                  struct kvm_page_fault *fault)
> +{
> +       int max_order, r;
> +
> +       if (!kvm_slot_can_be_private(fault->slot)) {
> +               kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +               return -EFAULT;
> +       }
> +
> +       r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> +                            &max_order);
> +       if (r) {
> +               kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +               return r;
> +       }
> +
> +       fault->max_level = min(kvm_max_level_for_order(max_order),
> +                              fault->max_level);
> +       fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
> +
> +       return RET_PF_CONTINUE;
> +}
> +
>  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>         struct kvm_memory_slot *slot = fault->slot;
> @@ -4293,6 +4356,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>                         return RET_PF_EMULATE;
>         }
>
> +       if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> +               kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +               return -EFAULT;
> +       }
> +
> +       if (fault->is_private)
> +               return kvm_faultin_pfn_private(vcpu, fault);
> +
>         async = false;
>         fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
>                                           fault->write, &fault->map_writable,
> @@ -7173,6 +7244,26 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>  }
>
>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> +                                       struct kvm_gfn_range *range)
> +{
> +       /*
> +        * Zap SPTEs even if the slot can't be mapped PRIVATE.  KVM x86 only
> +        * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
> +        * can simply ignore such slots.  But if userspace is making memory
> +        * PRIVATE, then KVM must prevent the guest from accessing the memory
> +        * as shared.  And if userspace is making memory SHARED and this point
> +        * is reached, then at least one page within the range was previously
> +        * PRIVATE, i.e. the slot's possible hugepage ranges are changing.
> +        * Zapping SPTEs in this case ensures KVM will reassess whether or not
> +        * a hugepage can be used for affected ranges.
> +        */
> +       if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
> +               return false;
> +
> +       return kvm_unmap_gfn_range(kvm, range);
> +}
> +
>  static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
>                                 int level)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index decc1f153669..86c7cb692786 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -201,6 +201,7 @@ struct kvm_page_fault {
>
>         /* Derived from mmu and global state.  */
>         const bool is_tdp;
> +       const bool is_private;
>         const bool nx_huge_page_workaround_enabled;
>
>         /*
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7de93858054d..e3223cafd7db 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2358,14 +2358,18 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>
>  static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> -                                                gpa_t gpa, gpa_t size)
> +                                                gpa_t gpa, gpa_t size,
> +                                                bool is_write, bool is_exec,
> +                                                bool is_private)
>  {
>         vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
>         vcpu->run->memory_fault.gpa = gpa;
>         vcpu->run->memory_fault.size = size;
>
> -       /* Flags are not (yet) defined or communicated to userspace. */
> +       /* RWX flags are not (yet) defined or communicated to userspace. */
>         vcpu->run->memory_fault.flags = 0;
> +       if (is_private)
> +               vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
>  }
>
>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 33d542de0a61..29e9eb51dec9 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -527,6 +527,7 @@ struct kvm_run {
>                 } notify;
>                 /* KVM_EXIT_MEMORY_FAULT */
>                 struct {
> +#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1ULL << 3)
>                         __u64 flags;
>                         __u64 gpa;
>                         __u64 size;
> --
> 2.42.0.820.g83a721a137-goog
>
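For illustration, a rough sketch of how a userspace VMM might consume this exit to
service an "implicit" conversion request.  The exit layout and
KVM_MEMORY_EXIT_FLAG_PRIVATE come from the uapi above; the handler name and the
always-honor-the-guest policy are assumptions, not part of the series:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Hypothetical handler, called when KVM_RUN returns -1 with
	 * run->exit_reason == KVM_EXIT_MEMORY_FAULT. */
	static int handle_memory_fault_exit(int vm_fd, struct kvm_run *run)
	{
		struct kvm_memory_attributes attrs = {
			.address = run->memory_fault.gpa,
			.size    = run->memory_fault.size,
			/* Flip the range to whatever the guest asked for. */
			.attributes = (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
				      KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
		};

		return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
	}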

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 21/35] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
  2023-10-27 18:22 ` [PATCH v13 21/35] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro Sean Christopherson
@ 2023-11-02 14:35   ` Fuad Tabba
  0 siblings, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-02 14:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
> KVM_ADDRESS_SPACE_NUM.
>
> No functional change intended.
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

>  arch/x86/include/asm/kvm_host.h | 1 -
>  include/linux/kvm_host.h        | 2 +-
>  2 files changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8d60e4745e8b..6702f795c862 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2124,7 +2124,6 @@ enum {
>  #define HF_SMM_MASK            (1 << 1)
>  #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
>
> -# define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
>  # define KVM_ADDRESS_SPACE_NUM 2
>  # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
>  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e3223cafd7db..c3cfe08b1300 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -692,7 +692,7 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
>  #define KVM_MEM_SLOTS_NUM SHRT_MAX
>  #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
>
> -#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
> +#if KVM_ADDRESS_SPACE_NUM == 1
>  static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  {
>         return 0;
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
  2023-11-02 14:01   ` Fuad Tabba
@ 2023-11-02 14:41     ` Sean Christopherson
  2023-11-02 14:57       ` Fuad Tabba
  0 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-11-02 14:41 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Thu, Nov 02, 2023, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Add flags to "struct kvm_gfn_range" to let notifier events target only
> > shared and only private mappings, and wire up the existing mmu_notifier
> > events to be shared-only (private memory is never associated with a
> > userspace virtual address, i.e. can't be reached via mmu_notifiers).
> >
> > Add two flags so that KVM can handle the three possibilities (shared,
> > private, and shared+private) without needing something like a tri-state
> > enum.
> >
> > Link: https://lore.kernel.org/all/ZJX0hk+KpQP0KUyB@google.com
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  include/linux/kvm_host.h | 2 ++
> >  virt/kvm/kvm_main.c      | 7 +++++++
> >  2 files changed, 9 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 96aa930536b1..89c1a991a3b8 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -263,6 +263,8 @@ struct kvm_gfn_range {
> >         gfn_t start;
> >         gfn_t end;
> >         union kvm_mmu_notifier_arg arg;
> > +       bool only_private;
> > +       bool only_shared;
> 
> If these flags aren't used in this patch series, should this patch be
> moved to the other series?

If *both* TDX and SNP need this patch, then I think it's probably worth applying
it now to make their lives easier.  But if only one needs the support, then I
completely agree this should be punted to whichever series needs it (this also
came up in v11, but we didn't force the issue).

Mike, Isaku?

> Also, if shared+private is a possibility, doesn't the prefix "only_"
> confuse things a bit? I.e., what is shared+private, is it when both
> are 0 or when both are 1? I assume it's the former (both are 0), but
> it might be clearer.

Heh, I was hoping that "only_private && only_shared" would be obviously nonsensical.

The only alternative I can think would be to add an enum, e.g.

	enum {
		PROCESS_PRIVATE_AND_SHARED,
		PROCESS_ONLY_PRIVATE,
		PROCESS_ONLY_SHARED,
	};

because every other way of expressing the flags either results in more confusion
or an unsafe default.  I.e. I want zapping only private or only shared to require
the caller to explicitly set a non-zero value, which is how I ended up with
"only_{private,shared}" as opposed to "process_{private,shared}".

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM
  2023-10-27 18:22 ` [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM Sean Christopherson
  2023-10-30 17:34   ` Paolo Bonzini
@ 2023-11-02 14:52   ` Fuad Tabba
  1 sibling, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-02 14:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Let x86 track the number of address spaces on a per-VM basis so that KVM
> can disallow SMM memslots for confidential VMs.  Confidential VMs are
> fundamentally incompatible with emulating SMM, which as the name suggests
> requires being able to read and write guest memory and register state.
>
> Disallowing SMM will simplify support for guest private memory, as KVM
> will not need to worry about tracking memory attributes for multiple
> address spaces (SMM is the only "non-default" address space across all
> architectures).
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

>  arch/powerpc/kvm/book3s_hv.c    |  2 +-
>  arch/x86/include/asm/kvm_host.h |  8 +++++++-
>  arch/x86/kvm/debugfs.c          |  2 +-
>  arch/x86/kvm/mmu/mmu.c          |  6 +++---
>  arch/x86/kvm/x86.c              |  2 +-
>  include/linux/kvm_host.h        | 17 +++++++++++------
>  virt/kvm/dirty_ring.c           |  2 +-
>  virt/kvm/kvm_main.c             | 26 ++++++++++++++------------
>  8 files changed, 39 insertions(+), 26 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 130bafdb1430..9b0eaa17275a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -6084,7 +6084,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
>         }
>
>         srcu_idx = srcu_read_lock(&kvm->srcu);
> -       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +       for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
>                 struct kvm_memory_slot *memslot;
>                 struct kvm_memslots *slots = __kvm_memslots(kvm, i);
>                 int bkt;
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6702f795c862..f9e8d5642069 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2124,9 +2124,15 @@ enum {
>  #define HF_SMM_MASK            (1 << 1)
>  #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
>
> -# define KVM_ADDRESS_SPACE_NUM 2
> +# define KVM_MAX_NR_ADDRESS_SPACES     2
>  # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
>  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> +
> +static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
> +{
> +       return KVM_MAX_NR_ADDRESS_SPACES;
> +}
> +
>  #else
>  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
>  #endif
> diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
> index ee8c4c3496ed..42026b3f3ff3 100644
> --- a/arch/x86/kvm/debugfs.c
> +++ b/arch/x86/kvm/debugfs.c
> @@ -111,7 +111,7 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
>         mutex_lock(&kvm->slots_lock);
>         write_lock(&kvm->mmu_lock);
>
> -       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +       for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
>                 int bkt;
>
>                 slots = __kvm_memslots(kvm, i);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c4e758f0aebb..baeba8fc1c38 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3755,7 +3755,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
>             kvm_page_track_write_tracking_enabled(kvm))
>                 goto out_success;
>
> -       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +       for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
>                 slots = __kvm_memslots(kvm, i);
>                 kvm_for_each_memslot(slot, bkt, slots) {
>                         /*
> @@ -6294,7 +6294,7 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
>         if (!kvm_memslots_have_rmaps(kvm))
>                 return flush;
>
> -       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +       for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
>                 slots = __kvm_memslots(kvm, i);
>
>                 kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
> @@ -6791,7 +6791,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
>          * modifier prior to checking for a wrap of the MMIO generation so
>          * that a wrap in any address space is detected.
>          */
> -       gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1);
> +       gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1);
>
>         /*
>          * The very rare case: if the MMIO generation number has wrapped,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 824b58b44382..c4d17727b199 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12456,7 +12456,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
>                 hva = slot->userspace_addr;
>         }
>
> -       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +       for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
>                 struct kvm_userspace_memory_region2 m;
>
>                 m.slot = id | (i << 16);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c3cfe08b1300..687589ce9f63 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -80,8 +80,8 @@
>  /* Two fragments for cross MMIO pages. */
>  #define KVM_MAX_MMIO_FRAGMENTS 2
>
> -#ifndef KVM_ADDRESS_SPACE_NUM
> -#define KVM_ADDRESS_SPACE_NUM  1
> +#ifndef KVM_MAX_NR_ADDRESS_SPACES
> +#define KVM_MAX_NR_ADDRESS_SPACES      1
>  #endif
>
>  /*
> @@ -692,7 +692,12 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
>  #define KVM_MEM_SLOTS_NUM SHRT_MAX
>  #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
>
> -#if KVM_ADDRESS_SPACE_NUM == 1
> +#if KVM_MAX_NR_ADDRESS_SPACES == 1
> +static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
> +{
> +       return KVM_MAX_NR_ADDRESS_SPACES;
> +}
> +
>  static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  {
>         return 0;
> @@ -747,9 +752,9 @@ struct kvm {
>         struct mm_struct *mm; /* userspace tied to this vm */
>         unsigned long nr_memslot_pages;
>         /* The two memslot sets - active and inactive (per address space) */
> -       struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2];
> +       struct kvm_memslots __memslots[KVM_MAX_NR_ADDRESS_SPACES][2];
>         /* The current active memslot set for each address space */
> -       struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
> +       struct kvm_memslots __rcu *memslots[KVM_MAX_NR_ADDRESS_SPACES];
>         struct xarray vcpu_array;
>         /*
>          * Protected by slots_lock, but can be read outside if an
> @@ -1018,7 +1023,7 @@ void kvm_put_kvm_no_destroy(struct kvm *kvm);
>
>  static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
>  {
> -       as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
> +       as_id = array_index_nospec(as_id, KVM_MAX_NR_ADDRESS_SPACES);
>         return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
>                         lockdep_is_held(&kvm->slots_lock) ||
>                         !refcount_read(&kvm->users_count));
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index c1cd7dfe4a90..86d267db87bb 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -58,7 +58,7 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
>         as_id = slot >> 16;
>         id = (u16)slot;
>
> -       if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> +       if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
>                 return;
>
>         memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5d1a2f1b4e94..23633984142f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -615,7 +615,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>
>         idx = srcu_read_lock(&kvm->srcu);
>
> -       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +       for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
>                 struct interval_tree_node *node;
>
>                 slots = __kvm_memslots(kvm, i);
> @@ -1248,7 +1248,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>                 goto out_err_no_irq_srcu;
>
>         refcount_set(&kvm->users_count, 1);
> -       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +       for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
>                 for (j = 0; j < 2; j++) {
>                         slots = &kvm->__memslots[i][j];
>
> @@ -1398,7 +1398,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  #endif
>         kvm_arch_destroy_vm(kvm);
>         kvm_destroy_devices(kvm);
> -       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +       for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>         }
> @@ -1681,7 +1681,7 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
>          * space 0 will use generations 0, 2, 4, ... while address space 1 will
>          * use generations 1, 3, 5, ...
>          */
> -       gen += KVM_ADDRESS_SPACE_NUM;
> +       gen += kvm_arch_nr_memslot_as_ids(kvm);
>
>         kvm_arch_memslots_updated(kvm, gen);
>
> @@ -2051,7 +2051,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>             (mem->guest_memfd_offset & (PAGE_SIZE - 1) ||
>              mem->guest_memfd_offset + mem->memory_size < mem->guest_memfd_offset))
>                 return -EINVAL;
> -       if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> +       if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
>                 return -EINVAL;
>         if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
>                 return -EINVAL;
> @@ -2187,7 +2187,7 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
>
>         as_id = log->slot >> 16;
>         id = (u16)log->slot;
> -       if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> +       if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
>                 return -EINVAL;
>
>         slots = __kvm_memslots(kvm, as_id);
> @@ -2249,7 +2249,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
>
>         as_id = log->slot >> 16;
>         id = (u16)log->slot;
> -       if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> +       if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
>                 return -EINVAL;
>
>         slots = __kvm_memslots(kvm, as_id);
> @@ -2361,7 +2361,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
>
>         as_id = log->slot >> 16;
>         id = (u16)log->slot;
> -       if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> +       if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_USER_MEM_SLOTS)
>                 return -EINVAL;
>
>         if (log->first_page & 63)
> @@ -2502,7 +2502,7 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
>         gfn_range.only_private = false;
>         gfn_range.only_shared = false;
>
> -       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +       for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
>                 slots = __kvm_memslots(kvm, i);
>
>                 kvm_for_each_memslot_in_gfn_range(&iter, slots, range->start, range->end) {
> @@ -4857,9 +4857,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>         case KVM_CAP_IRQ_ROUTING:
>                 return KVM_MAX_IRQ_ROUTES;
>  #endif
> -#if KVM_ADDRESS_SPACE_NUM > 1
> +#if KVM_MAX_NR_ADDRESS_SPACES > 1
>         case KVM_CAP_MULTI_ADDRESS_SPACE:
> -               return KVM_ADDRESS_SPACE_NUM;
> +               if (kvm)
> +                       return kvm_arch_nr_memslot_as_ids(kvm);
> +               return KVM_MAX_NR_ADDRESS_SPACES;
>  #endif
>         case KVM_CAP_NR_MEMSLOTS:
>                 return KVM_USER_MEM_SLOTS;
> @@ -4967,7 +4969,7 @@ bool kvm_are_all_memslots_empty(struct kvm *kvm)
>
>         lockdep_assert_held(&kvm->slots_lock);
>
> -       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +       for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
>                 if (!kvm_memslots_empty(__kvm_memslots(kvm, i)))
>                         return false;
>         }
> --
> 2.42.0.820.g83a721a137-goog
>
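The x86 hook added above still returns the compile-time maximum; the payoff comes
once the count can vary per VM.  A hedged sketch of the intended follow-up (the
kvm_arch_has_private_mem() check is an assumption, not part of this patch):

	/* Hypothetical x86 follow-up: VMs with private memory get a single
	 * memslot address space, i.e. no SMM address space. */
	static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
	{
		return kvm_arch_has_private_mem(kvm) ? 1 : KVM_MAX_NR_ADDRESS_SPACES;
	}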

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
  2023-11-02 14:41     ` Sean Christopherson
@ 2023-11-02 14:57       ` Fuad Tabba
  0 siblings, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-02 14:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Thu, Nov 2, 2023 at 2:41 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Nov 02, 2023, Fuad Tabba wrote:
> > Hi,
> >
> > On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Add flags to "struct kvm_gfn_range" to let notifier events target only
> > > shared and only private mappings, and wire up the existing mmu_notifier
> > > events to be shared-only (private memory is never associated with a
> > > userspace virtual address, i.e. can't be reached via mmu_notifiers).
> > >
> > > Add two flags so that KVM can handle the three possibilities (shared,
> > > private, and shared+private) without needing something like a tri-state
> > > enum.
> > >
> > > Link: https://lore.kernel.org/all/ZJX0hk+KpQP0KUyB@google.com
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  include/linux/kvm_host.h | 2 ++
> > >  virt/kvm/kvm_main.c      | 7 +++++++
> > >  2 files changed, 9 insertions(+)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 96aa930536b1..89c1a991a3b8 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -263,6 +263,8 @@ struct kvm_gfn_range {
> > >         gfn_t start;
> > >         gfn_t end;
> > >         union kvm_mmu_notifier_arg arg;
> > > +       bool only_private;
> > > +       bool only_shared;
> >
> > If these flags aren't used in this patch series, should this patch be
> > moved to the other series?
>
> If *both* TDX and SNP need this patch, then I think it's probably worth applying
> it now to make their lives easier.  But if only one needs the support, then I
> completely agree this should be punted to whichever series needs it (this also
> came up in v11, but we didn't force the issue).
>
> Mike, Isaku?
>
> > Also, if shared+private is a possibility, doesn't the prefix "only_"
> > confuse things a bit? I.e., what is shared+private, is it when both
> > are 0 or when both are 1? I assume it's the former (both are 0), but
> > it might be clearer.
>
> Heh, I was hoping that "only_private && only_shared" would be obviously nonsensical.
>
> The only alternative I can think would be to add an enum, e.g.
>
>         enum {
>                 PROCESS_PRIVATE_AND_SHARED,
>                 PROCESS_ONLY_PRIVATE,
>                 PROCESS_ONLY_SHARED,
>         };
>
> because every other way of expressing the flags either results in more confusion
> or an unsafe default.  I.e. I want zapping only private or only shared to require
> the caller to explicitly set a non-zero value, which is how I ended up with
> "only_{private,shared}" as opposed to "process_{private,shared}".

I don't have a strong opinion about this. Having an enum looks good to me.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-11-01 23:17                   ` Paolo Bonzini
@ 2023-11-02 15:38                     ` Sean Christopherson
  2023-11-02 15:46                       ` Paolo Bonzini
  0 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-11-02 15:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Xiaoyao Li, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On Wed, Nov 1, 2023 at 11:35 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> > > On 11/1/23 17:36, Sean Christopherson wrote:
> > > > Can you post a fixup patch?  It's not clear to me exactly what behavior you intend
> > > > to end up with.
> > >
> > > Sure, just this:
> > >
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index 7d1a33c2ad42..34fd070e03d9 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> > >  {
> > >       loff_t size = args->size;
> > >       u64 flags = args->flags;
> > > -     u64 valid_flags = 0;
> > > -
> > > -     if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > > -             valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > > +     u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > >       if (flags & ~valid_flags)
> > >               return -EINVAL;
> > > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> > >       if (size < 0 || !PAGE_ALIGNED(size))
> > >               return -EINVAL;
> > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >       if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> > >           !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> > >               return -EINVAL;
> > > -#endif
> >
> > That won't work, HPAGE_PMD_SIZE is valid only for CONFIG_TRANSPARENT_HUGEPAGE=y.
> >
> > #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> > #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> > #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> > #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
> 
> Would have caught it when actually testing it, I guess. :) It has to
> be PMD_SIZE, possibly with
> 
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE);
> #endif

Yeah, that works for me.

Actually, looking at this again, there's not actually a hard dependency on THP.
A THP-enabled kernel _probably_  gives a higher probability of using hugepages,
but mostly because THP selects COMPACTION, and I suppose because using THP for
other allocations reduces overall fragmentation.

So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
we should do the below (I verified KVM can create hugepages with THP=n).  We'll
need another capability, but (a) we probably should have that anyways and (b) it
provides a cleaner path to adding PUD-sized hugepage support in the future.

And then adjust the tests like so:

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index c15de9852316..c9f449718fce 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -201,6 +201,10 @@ int main(int argc, char *argv[])
 
        TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
 
+       if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && thp_configured())
+               TEST_ASSERT_EQ(get_trans_hugepagesz(),
+                              kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE));
+
        page_size = getpagesize();
        total_size = page_size * 4;
 
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
index be311944e90a..245901587ed2 100644
--- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -396,7 +396,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 
        vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
 
-       if (backing_src_can_be_huge(src_type))
+       if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE))
                memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
        else
                memfd_flags = 0;

--
From: Sean Christopherson <seanjc@google.com>
Date: Wed, 25 Oct 2023 16:26:41 -0700
Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest
 memory

Extend guest_memfd to allow backing guest memory with hugepages.  For now,
make hugepage utilization best-effort, i.e. fall back to non-huge mappings
if a hugepage can't be allocated.  Guaranteeing hugepages would require a
dedicated memory pool and significantly more complexity and churn.

Require userspace to opt-in via a flag even though it's unlikely real use
cases will ever want to use order-0 pages, e.g. to give userspace a safety
valve in case hugepage support is buggy, and to allow for easier testing
of both paths.

Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling
primarily deals with userspace page tables, which are explicitly not in
play for guest_memfd.  Selecting THP does make obtaining hugepages more
likely, but only because THP selects CONFIG_COMPACTION.  Don't select
CONFIG_COMPACTION either, because again it's not a hard dependency.

For simplicity, require the guest_memfd size to be a multiple of the
hugepage size, e.g. so that KVM doesn't need to do bounds checking when
deciding whether or not to allocate a huge folio.

When reporting the max order when KVM gets a pfn from guest_memfd, force
order-0 pages if the hugepage is not fully contained by the memslot
binding, e.g. if userspace requested hugepages but punches a hole in the
memslot bindings in order to emulate x86's VGA hole.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst | 17 +++++++++
 include/uapi/linux/kvm.h       |  3 ++
 virt/kvm/guest_memfd.c         | 69 +++++++++++++++++++++++++++++-----
 virt/kvm/kvm_main.c            |  4 ++
 4 files changed, 84 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e82c69d5e755..ccdd5413920d 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6176,6 +6176,8 @@ and cannot be resized  (guest_memfd files do however support PUNCH_HOLE).
 	__u64 reserved[6];
   };
 
+  #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE         (1ULL << 0)
+
 Conceptually, the inode backing a guest_memfd file represents physical memory,
 i.e. is coupled to the virtual machine as a thing, not to a "struct kvm".  The
 file itself, which is bound to a "struct kvm", is that instance's view of the
@@ -6192,6 +6194,12 @@ most one mapping per page, i.e. binding multiple memory regions to a single
 guest_memfd range is not allowed (any number of memory regions can be bound to
 a single guest_memfd file, but the bound ranges must not overlap).
 
+If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate
+and map PMD-size hugepages for the guest_memfd file.  This is currently best
+effort.  If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, size must be aligned to at
+least the size reported by KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE (which also
+enumerates support for KVM_GUEST_MEMFD_ALLOW_HUGEPAGE).
+
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
 5. The kvm_run structure
@@ -8639,6 +8647,15 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
 64-bit bitmap (each bit describing a block size). The default value is
 0, to disable the eager page splitting.
 
+
+8.41 KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE
+------------------------------------------
+
+This is an information-only capability that returns guest_memfd's hugepage size
+for PMD hugepages.  Returns '0' if guest_memfd is not supported, or if KVM
+doesn't support creating hugepages for guest_memfd.  Note, guest_memfd doesn't
+currently support PUD-sized hugepages.
+
 9. Known KVM API problems
 =========================
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 25caee8d1a80..b78d0e3bf22a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1217,6 +1217,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_MEMORY_FAULT_INFO 231
 #define KVM_CAP_MEMORY_ATTRIBUTES 232
 #define KVM_CAP_GUEST_MEMFD 233
+#define KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE 234
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2303,4 +2304,6 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE		(1ULL << 0)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 98a12da80214..31b5e94d461a 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -13,14 +13,44 @@ struct kvm_gmem {
 	struct list_head entry;
 };
 
+#define NR_PAGES_PER_PMD (1 << PMD_ORDER)
+
+static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
+{
+	unsigned long huge_index = round_down(index, NR_PAGES_PER_PMD);
+	unsigned long flags = (unsigned long)inode->i_private;
+	struct address_space *mapping  = inode->i_mapping;
+	gfp_t gfp = mapping_gfp_mask(mapping);
+	struct folio *folio;
+
+	if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE))
+		return NULL;
+
+	if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT,
+				   (huge_index + NR_PAGES_PER_PMD - 1) << PAGE_SHIFT))
+		return NULL;
+
+	folio = filemap_alloc_folio(gfp, PMD_ORDER);
+	if (!folio)
+		return NULL;
+
+	if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
+		folio_put(folio);
+		return NULL;
+	}
+	return folio;
+}
+
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 {
 	struct folio *folio;
 
-	/* TODO: Support huge pages. */
-	folio = filemap_grab_folio(inode->i_mapping, index);
-	if (IS_ERR_OR_NULL(folio))
-		return NULL;
+	folio = kvm_gmem_get_huge_folio(inode, index);
+	if (!folio) {
+		folio = filemap_grab_folio(inode->i_mapping, index);
+		if (IS_ERR_OR_NULL(folio))
+			return NULL;
+	}
 
 	/*
 	 * Use the up-to-date flag to track whether or not the memory has been
@@ -373,6 +403,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	inode->i_mode |= S_IFREG;
 	inode->i_size = size;
 	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_large_folios(inode->i_mapping);
 	mapping_set_unmovable(inode->i_mapping);
 	/* Unmovable mappings are supposed to be marked unevictable as well. */
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
@@ -394,14 +425,18 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 
 int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
+	u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
 	loff_t size = args->size;
 	u64 flags = args->flags;
-	u64 valid_flags = 0;
 
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
-	if (size < 0 || !PAGE_ALIGNED(size))
+	if (size <= 0 || !PAGE_ALIGNED(size))
+		return -EINVAL;
+
+	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+	    !IS_ALIGNED(size, PMD_SIZE))
 		return -EINVAL;
 
 	return __kvm_gmem_create(kvm, size, flags);
@@ -501,7 +536,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
 {
-	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+	pgoff_t index, huge_index;
 	struct kvm_gmem *gmem;
 	struct folio *folio;
 	struct page *page;
@@ -514,6 +549,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 
 	gmem = file->private_data;
 
+	index = gfn - slot->base_gfn + slot->gmem.pgoff;
 	if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
 		r = -EIO;
 		goto out_fput;
@@ -533,9 +569,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	page = folio_file_page(folio, index);
 
 	*pfn = page_to_pfn(page);
-	if (max_order)
+	if (!max_order)
+		goto success;
+
+	*max_order = compound_order(compound_head(page));
+	if (!*max_order)
+		goto success;
+
+	/*
+	 * The folio can be mapped with a hugepage if and only if the folio is
+	 * fully contained by the range the memslot is bound to.  Note, the
+	 * caller is responsible for handling gfn alignment, this only deals
+	 * with the file binding.
+	 */
+	huge_index = ALIGN(index, 1ull << *max_order);
+	if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
+	    huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages)
 		*max_order = 0;
-
+success:
 	r = 0;
 
 out_unlock:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5d1a2f1b4e94..0711f2c75667 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4888,6 +4888,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #ifdef CONFIG_KVM_PRIVATE_MEM
 	case KVM_CAP_GUEST_MEMFD:
 		return !kvm || kvm_arch_has_private_mem(kvm);
+	case KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE:
+		if (kvm && !kvm_arch_has_private_mem(kvm))
+			return 0;
+		return PMD_SIZE;
 #endif
 	default:
 		break;

base-commit: fcbef1e5e5d2a60dacac0d16c06ac00bedaefc0f
--
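For completeness, a sketch of how userspace would consume the proposed capability.
KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE only exists in the patch above, so treat the
whole snippet as an assumption:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Hypothetical usage of the proposed capability and flag. */
	long pmd_size = ioctl(vm_fd, KVM_CHECK_EXTENSION,
			      KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE);
	struct kvm_create_guest_memfd gmem = {
		/* 1GiB, a multiple of any plausible PMD size. */
		.size  = 1ull << 30,
		.flags = pmd_size > 0 ? KVM_GUEST_MEMFD_ALLOW_HUGEPAGE : 0,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);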

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-11-02 11:03           ` Paolo Bonzini
@ 2023-11-02 15:44             ` Sean Christopherson
  2023-11-02 18:35               ` Huang, Kai
  0 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-11-02 15:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kai Huang, Xiaoyao Li, kvm-riscv, mic, liam.merwick, kvm,
	Isaku Yamahata, viro, kirill.shutemov, david, linux-mips,
	amoorthy, linuxppc-dev, tabba, kvmarm, linux-kernel,
	linux-fsdevel, oliver.upton, michael.roth, chao.p.peng, palmer,
	chenhuacai, aou, mpe, Vishal Annapurve, vbabka, mail,
	linux-riscv, maz, willy, dmatlack, anup, yu.c.zhang, Yilun Xu,
	qperret, brauner, isaku.yamahata, ackerleytng, jarkko,
	paul.walmsley, linux-arm-kernel, linux-mm, Wei W Wang, akpm

On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On 11/2/23 10:35, Huang, Kai wrote:
> > IIUC KVM can already handle the case of poisoned
> > page by sending signal to user app:
> > 
> > 	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
> > 					struct kvm_page_fault *fault)
> > 	{
> > 		...
> >
> > 		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> > 			kvm_send_hwpoison_signal(fault->slot, fault->gfn);

No, this doesn't work, because that signals the host virtual address

	unsigned long hva = gfn_to_hva_memslot(slot, gfn);

	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);

which is the *shared* page.

> > 			return RET_PF_RETRY;
> > 		}
> > 	}
> 
> EHWPOISON is not implemented by this series, so it should be left out of the
> documentation.

EHWPOISON *is* implemented.  kvm_gmem_get_pfn() returns -EHWPOISON as appropriate,
and kvm_faultin_pfn() returns that directly without going through kvm_handle_error_pfn().

  kvm_faultin_pfn_private()
  |
  |-> kvm_gmem_get_pfn()
      |
      |-> if (folio_test_hwpoison(folio)) {
		r = -EHWPOISON;
		goto out_unlock;
	  }

          |
          |-> 	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
			     &max_order);
		if (r) {
			kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
			return r;
		}

		|
		|-> ret = __kvm_faultin_pfn(vcpu, fault);
		    if (ret != RET_PF_CONTINUE)
			    return ret;

		    if (unlikely(is_error_pfn(fault->pfn)))
			    return kvm_handle_error_pfn(vcpu, fault);
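
Put differently, userspace sees the poison case as KVM_RUN failing with
errno == EHWPOISON while the run struct still carries the memory-fault info.  A
rough sketch; the two handlers are placeholders for VMM policy, not a real API:

	#include <errno.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Placeholders for VMM policy; not a real API. */
	void handle_poisoned_gpa(unsigned long long gpa);
	void handle_conversion_request(struct kvm_run *run);

	/* Hypothetical: one KVM_RUN iteration distinguishing a poisoned private
	 * page from an implicit conversion request. */
	static void vcpu_run_once(int vcpu_fd, struct kvm_run *run)
	{
		if (ioctl(vcpu_fd, KVM_RUN, 0) >= 0 ||
		    run->exit_reason != KVM_EXIT_MEMORY_FAULT)
			return;

		if (errno == EHWPOISON)
			handle_poisoned_gpa(run->memory_fault.gpa);
		else if (errno == EFAULT)
			handle_conversion_request(run);
	}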

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-11-02 15:38                     ` Sean Christopherson
@ 2023-11-02 15:46                       ` Paolo Bonzini
  2023-11-27 11:13                         ` Vlastimil Babka
  0 siblings, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-02 15:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiaoyao Li, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson <seanjc@google.com> wrote:
> Actually, looking that this again, there's not actually a hard dependency on THP.
> A THP-enabled kernel _probably_  gives a higher probability of using hugepages,
> but mostly because THP selects COMPACTION, and I suppose because using THP for
> other allocations reduces overall fragmentation.

Yes, that's why I didn't even bother enabling it unless THP is
enabled, but it makes even more sense to just try.

> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
> we should do the below (I verified KVM can create hugepages with THP=n).  We'll
> need another capability, but (a) we probably should have that anyways and (b) it
> provides a cleaner path to adding PUD-sized hugepage support in the future.

I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This
should be a generic kernel API and in fact the sizes are available in
a not-so-friendly format in /sys/kernel/mm/hugepages.

We should just add /sys/kernel/mm/hugepages/sizes that contains
"2097152 1073741824" on x86 (only the former if 1G pages are not
supported).

Plus: is this the best API if we need something else for 1G pages?

Let's drop *this* patch and proceed incrementally. (Again, this is
what I want to do with this final review: identify places that are
still sticky, and don't let them block the rest).

Coincidentally we have an open spot next week at plumbers. Let's
extend Fuad's section to cover more guestmem work.

Paolo
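
If the proposed sysfs file did materialize, the selftest side could read it with
something like the below.  This is entirely hypothetical; no such file exists
today and the format is only the one suggested above:

	#include <stdio.h>

	/* Hypothetical: return the smallest hugepage size advertised by the
	 * proposed /sys/kernel/mm/hugepages/sizes, or 0 if absent. */
	static unsigned long smallest_hugepage_size(void)
	{
		unsigned long size = 0;
		FILE *f = fopen("/sys/kernel/mm/hugepages/sizes", "r");

		if (f) {
			if (fscanf(f, "%lu", &size) != 1)
				size = 0;
			fclose(f);
		}
		return size;
	}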

> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
> index c15de9852316..c9f449718fce 100644
> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
> @@ -201,6 +201,10 @@ int main(int argc, char *argv[])
>
>         TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
>
> +       if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && thp_configured())
> +               TEST_ASSERT_EQ(get_trans_hugepagesz(),
> +                              kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE));
> +
>         page_size = getpagesize();
>         total_size = page_size * 4;
>
> diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> index be311944e90a..245901587ed2 100644
> --- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> @@ -396,7 +396,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
>
>         vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
>
> -       if (backing_src_can_be_huge(src_type))
> +       if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE))
>                 memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
>         else
>                 memfd_flags = 0;
>
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Wed, 25 Oct 2023 16:26:41 -0700
> Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest
>  memory
>
> Extend guest_memfd to allow backing guest memory with hugepages.  For now,
> make hugepage utilization best-effort, i.e. fall back to non-huge mappings
> if a hugepage can't be allocated.  Guaranteeing hugepages would require a
> dedicated memory pool and significantly more complexity and churn.
>
> Require userspace to opt-in via a flag even though it's unlikely real use
> cases will ever want to use order-0 pages, e.g. to give userspace a safety
> valve in case hugepage support is buggy, and to allow for easier testing
> of both paths.
>
> Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling
> primarily deals with userspace page tables, which are explicitly not in
> play for guest_memfd.  Selecting THP does make obtaining hugepages more
> likely, but only because THP selects CONFIG_COMPACTION.  Don't select
> CONFIG_COMPACTION either, because again it's not a hard dependency.
>
> For simplicity, require the guest_memfd size to be a multiple of the
> hugepage size, e.g. so that KVM doesn't need to do bounds checking when
> deciding whether or not to allocate a huge folio.
>
> When reporting the max order when KVM gets a pfn from guest_memfd, force
> order-0 pages if the hugepage is not fully contained by the memslot
> binding, e.g. if userspace requested hugepages but punches a hole in the
> memslot bindings in order to emulate x86's VGA hole.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 17 +++++++++
>  include/uapi/linux/kvm.h       |  3 ++
>  virt/kvm/guest_memfd.c         | 69 +++++++++++++++++++++++++++++-----
>  virt/kvm/kvm_main.c            |  4 ++
>  4 files changed, 84 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index e82c69d5e755..ccdd5413920d 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6176,6 +6176,8 @@ and cannot be resized  (guest_memfd files do however support PUNCH_HOLE).
>         __u64 reserved[6];
>    };
>
> +  #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE         (1ULL << 0)
> +
>  Conceptually, the inode backing a guest_memfd file represents physical memory,
>  i.e. is coupled to the virtual machine as a thing, not to a "struct kvm".  The
>  file itself, which is bound to a "struct kvm", is that instance's view of the
> @@ -6192,6 +6194,12 @@ most one mapping per page, i.e. binding multiple memory regions to a single
>  guest_memfd range is not allowed (any number of memory regions can be bound to
>  a single guest_memfd file, but the bound ranges must not overlap).
>
> +If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate
> +and map PMD-size hugepages for the guest_memfd file.  This is currently best
> +effort.  If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, size must be aligned to at
> +least the size reported by KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE (which also
> +enumerates support for KVM_GUEST_MEMFD_ALLOW_HUGEPAGE).
> +
>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>
>  5. The kvm_run structure
> @@ -8639,6 +8647,15 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
>  64-bit bitmap (each bit describing a block size). The default value is
>  0, to disable the eager page splitting.
>
> +
> +8.41 KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE
> +------------------------------------------
> +
> +This is an information-only capability that returns guest_memfd's hugepage size
> +for PMD hugepages.  Returns '0' if guest_memfd is not supported, or if KVM
> +doesn't support creating hugepages for guest_memfd.  Note, guest_memfd doesn't
> +currently support PUD-sized hugepages.
> +
>  9. Known KVM API problems
>  =========================
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 25caee8d1a80..b78d0e3bf22a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1217,6 +1217,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_MEMORY_FAULT_INFO 231
>  #define KVM_CAP_MEMORY_ATTRIBUTES 232
>  #define KVM_CAP_GUEST_MEMFD 233
> +#define KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE 234
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -2303,4 +2304,6 @@ struct kvm_create_guest_memfd {
>         __u64 reserved[6];
>  };
>
> +#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE         (1ULL << 0)
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 98a12da80214..31b5e94d461a 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -13,14 +13,44 @@ struct kvm_gmem {
>         struct list_head entry;
>  };
>
> +#define NR_PAGES_PER_PMD (1 << PMD_ORDER)
> +
> +static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
> +{
> +       unsigned long huge_index = round_down(index, NR_PAGES_PER_PMD);
> +       unsigned long flags = (unsigned long)inode->i_private;
> +       struct address_space *mapping  = inode->i_mapping;
> +       gfp_t gfp = mapping_gfp_mask(mapping);
> +       struct folio *folio;
> +
> +       if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE))
> +               return NULL;
> +
> +       if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT,
> +                                  (huge_index + NR_PAGES_PER_PMD - 1) << PAGE_SHIFT))
> +               return NULL;
> +
> +       folio = filemap_alloc_folio(gfp, PMD_ORDER);
> +       if (!folio)
> +               return NULL;
> +
> +       if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
> +               folio_put(folio);
> +               return NULL;
> +       }
> +       return folio;
> +}
> +
>  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  {
>         struct folio *folio;
>
> -       /* TODO: Support huge pages. */
> -       folio = filemap_grab_folio(inode->i_mapping, index);
> -       if (IS_ERR_OR_NULL(folio))
> -               return NULL;
> +       folio = kvm_gmem_get_huge_folio(inode, index);
> +       if (!folio) {
> +               folio = filemap_grab_folio(inode->i_mapping, index);
> +               if (IS_ERR_OR_NULL(folio))
> +                       return NULL;
> +       }
>
>         /*
>          * Use the up-to-date flag to track whether or not the memory has been
> @@ -373,6 +403,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>         inode->i_mode |= S_IFREG;
>         inode->i_size = size;
>         mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> +       mapping_set_large_folios(inode->i_mapping);
>         mapping_set_unmovable(inode->i_mapping);
>         /* Unmovable mappings are supposed to be marked unevictable as well. */
>         WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> @@ -394,14 +425,18 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>
>  int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  {
> +       u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
>         loff_t size = args->size;
>         u64 flags = args->flags;
> -       u64 valid_flags = 0;
>
>         if (flags & ~valid_flags)
>                 return -EINVAL;
>
> -       if (size < 0 || !PAGE_ALIGNED(size))
> +       if (size <= 0 || !PAGE_ALIGNED(size))
> +               return -EINVAL;
> +
> +       if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> +           !IS_ALIGNED(size, PMD_SIZE))
>                 return -EINVAL;
>
>         return __kvm_gmem_create(kvm, size, flags);
> @@ -501,7 +536,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
>  int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>                      gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
>  {
> -       pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
> +       pgoff_t index, huge_index;
>         struct kvm_gmem *gmem;
>         struct folio *folio;
>         struct page *page;
> @@ -514,6 +549,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>
>         gmem = file->private_data;
>
> +       index = gfn - slot->base_gfn + slot->gmem.pgoff;
>         if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
>                 r = -EIO;
>                 goto out_fput;
> @@ -533,9 +569,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>         page = folio_file_page(folio, index);
>
>         *pfn = page_to_pfn(page);
> -       if (max_order)
> +       if (!max_order)
> +               goto success;
> +
> +       *max_order = compound_order(compound_head(page));
> +       if (!*max_order)
> +               goto success;
> +
> +       /*
> +        * The folio can be mapped with a hugepage if and only if the folio is
> +        * fully contained by the range the memslot is bound to.  Note, the
> +        * caller is responsible for handling gfn alignment, this only deals
> +        * with the file binding.
> +        */
> +       huge_index = ALIGN(index, 1ull << *max_order);
> +       if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
> +           huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages)
>                 *max_order = 0;
> -
> +success:
>         r = 0;
>
>  out_unlock:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5d1a2f1b4e94..0711f2c75667 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4888,6 +4888,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #ifdef CONFIG_KVM_PRIVATE_MEM
>         case KVM_CAP_GUEST_MEMFD:
>                 return !kvm || kvm_arch_has_private_mem(kvm);
> +       case KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE:
> +               if (kvm && !kvm_arch_has_private_mem(kvm))
> +                       return 0;
> +               return PMD_SIZE;
>  #endif
>         default:
>                 break;
>
> base-commit: fcbef1e5e5d2a60dacac0d16c06ac00bedaefc0f
> --
>
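
As a rough illustration of the uAPI this patch adds, here is a userspace sketch
(not part of the patch).  It assumes the KVM_GUEST_MEMFD_ALLOW_HUGEPAGE and
KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE definitions from the diff above are
available via <linux/kvm.h>, and vm_fd is a placeholder for an already-created
VM file descriptor:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int create_gmem_best_effort_huge(int vm_fd, __u64 size)
	{
		struct kvm_create_guest_memfd args = {
			.size  = size,
			.flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
		};
		/* Returns PMD_SIZE, or 0 if hugepage support is absent. */
		long pmd_size = ioctl(vm_fd, KVM_CHECK_EXTENSION,
				      KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE);
		int fd = -1;

		/* The flag requires the size to be PMD-size aligned. */
		if (pmd_size > 0 && !(size & (pmd_size - 1)))
			fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);

		if (fd < 0) {
			/* Hugepages are best-effort anyway; retry with order-0 pages. */
			args.flags = 0;
			fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
		}
		return fd;
	}

The fallback mirrors the "safety valve" rationale in the changelog: the flag is
purely an opt-in, so userspace can always drop it and run with order-0 pages.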


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-31 22:39       ` David Matlack
@ 2023-11-02 15:48         ` Paolo Bonzini
  2023-11-02 16:03           ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-02 15:48 UTC (permalink / raw)
  To: David Matlack, Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 10/31/23 23:39, David Matlack wrote:
>>> Maybe can you sketch out how you see this proposal being extensible to
>>> using guest_memfd for shared mappings?
>> For in-place conversions, e.g. pKVM, no additional guest_memfd is needed.  What's
>> missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM needs to
>> ensure there are no outstanding references when converting back to private.
>>
>> For TDX/SNP, assuming we don't find a performant and robust way to do in-place
>> conversions, a second fd+offset pair would be needed.
> Is there a way to support non-in-place conversions within a single guest_memfd?

For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to 
guest memory.  The hook would invalidate now-private parts if they have 
a VMA, causing a SIGSEGV/EFAULT if the host touches them.

It would forbid mappings from multiple gfns to a single offset of the 
guest_memfd, because then the shared vs. private attribute would be tied 
to the offset.  This should not be a problem; for example, in the case 
of SNP, the RMP already requires a single mapping from host physical 
address to guest physical address.

Paolo
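
For readers unfamiliar with the ioctl being discussed, this is roughly how
userspace drives a conversion (a sketch only; it assumes the
kvm_memory_attributes layout and KVM_MEMORY_ATTRIBUTE_PRIVATE definition from
this series, with vm_fd/gpa/len as placeholders).  The hook described above
would run inside KVM when this ioctl flips a range to private:

	#include <err.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static void convert_to_private(int vm_fd, __u64 gpa, __u64 len)
	{
		struct kvm_memory_attributes attrs = {
			.address    = gpa,
			.size       = len,	/* must be page aligned */
			.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		};

		/* Clearing .attributes for the same range converts back to shared. */
		if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
			err(1, "KVM_SET_MEMORY_ATTRIBUTES");
	}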


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-11-02  2:19       ` Xiaoyao Li
@ 2023-11-02 15:51         ` Sean Christopherson
  0 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-11-02 15:51 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Kai Huang, viro, aou, brauner, oliver.upton, chenhuacai,
	paul.walmsley, palmer, maz, pbonzini, mpe, willy, anup, akpm,
	kvm-riscv, mic, liam.merwick, kvm, Isaku Yamahata,
	kirill.shutemov, david, tabba, amoorthy, linuxppc-dev,
	michael.roth, kvmarm, linux-kernel, linux-fsdevel, linux-riscv,
	chao.p.peng, linux-mips, Vishal Annapurve, vbabka, mail,
	yu.c.zhang, qperret, dmatlack, Yilun Xu, isaku.yamahata,
	ackerleytng, jarkko, linux-arm-kernel, linux-mm, Wei W Wang

On Thu, Nov 02, 2023, Xiaoyao Li wrote:
> On 11/2/2023 1:36 AM, Sean Christopherson wrote:
> > > KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
> > > <asm/kvm_host.h>?
> > I'd prefer to keep it in generic code, as it's highly likely to end up there
> > sooner than later.  There's a known use case for ARM (exit to userspace on missing
> > userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize this API.
> > 
> > [*]https://lore.kernel.org/all/20230908222905.1321305-8-amoorthy@google.com
> 
> I wonder how this CAP is supposed to be checked in userspace, for the
> guest_memfd case?

It's basically useless for guest_memfd.

> 	if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO) &&
> 	    run->exit_reason == KVM_EXIT_MEMORY_FAULT)
> 		abort("unexpected KVM_EXIT_MEMORY_FAULT");
> 
> In my implementation of QEMU patches, I find it's unnecessary. When
> userspace gets an exit with KVM_EXIT_MEMORY_FAULT, it implies
> "KVM_CAP_MEMORY_FAULT_INFO".
> 
> So I don't see how it is necessary in this series. Whether it's necessary or
> not for [*], I don't have the answer but we can leave the discussion to that
> patch series.

It's not strictly necessary there either.

However, Oliver felt (and presumably still feels) quite strongly, and I agree,
that reporting extra information shouldn't be tightly coupled to
KVM_CAP_EXIT_ON_MISSING or KVM_CAP_GUEST_MEMFD.

E.g. if userspace develops a "standalone" use case for KVM_CAP_MEMORY_FAULT_INFO,
userspace should be able to check for support without having to take a dependency
on KVM_CAP_GUEST_MEMFD, especially since KVM_CAP_GUEST_MEMFD may not be
supported, i.e. userspace should be able to do:

	if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO))
		abort("KVM_CAP_MEMORY_FAULT_INFO required for fancy feature XYZ");



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-11-02  3:17       ` Huang, Kai
  2023-11-02  9:35         ` Huang, Kai
@ 2023-11-02 15:56         ` Sean Christopherson
  1 sibling, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-11-02 15:56 UTC (permalink / raw)
  To: Kai Huang
  Cc: Xiaoyao Li, kvm-riscv, mic, liam.merwick, Isaku Yamahata, kvm,
	pbonzini, kirill.shutemov, david, linux-fsdevel, amoorthy,
	linuxppc-dev, tabba, kvmarm, linux-kernel, michael.roth, viro,
	oliver.upton, chao.p.peng, palmer, chenhuacai, aou, linux-mips,
	mpe, Vishal Annapurve, vbabka, mail, linux-riscv, maz, willy,
	dmatlack, anup, yu.c.zhang, Yilun Xu, qperret, brauner,
	isaku.yamahata, ackerleytng, jarkko, paul.walmsley,
	linux-arm-kernel, linux-mm, Wei W Wang, akpm

On Thu, Nov 02, 2023, Kai Huang wrote:
> On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> > On Wed, Nov 01, 2023, Kai Huang wrote:
> > > 
> > > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > > +------------------------------
> > > > +
> > > > +:Architectures: x86
> > > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > > +
> > > > +The presence of this capability indicates that KVM_RUN will fill
> > > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > > > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > > > +address.
> > > > +
> > > > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > > > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > > > +to KVM_EXIT_MEMORY_FAULT.
> > > 
> > > IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> > > implementation.
> > 
> > The errno that is returned to userspace is ABI.  In KVM, it's a _very_ poorly
> > defined ABI for the vast majority of ioctls(), but it's still technically ABI.
> > KVM gets away with being cavalier with errno because the vast majority of errors
> > are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
> > care about the exact errno.
> > 
> > A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> > -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.
> > 
> > > Is it better to relax the validity of kvm_run.memory_fault when
> > > KVM_RUN returns any -errno?
> > 
> > Not unless there's a need to do so, and if there is then we can update the
> > documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is valid
> > for any errno, then KVM would need to purge kvm_run.exit_reason super early in
> > KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> > misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super early is
> > subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
> > reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> > pending signal).
> > 
> > https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
> > 
> > 
> 
> Agreed with not relaxing to any errno.  However, using -EFAULT as part of the ABI
> definition seems a little bit dangerous, e.g. someone could accidentally or
> mistakenly return -EFAULT from KVM_RUN at an early point and/or in a completely
> different code path, etc.  -EINTR has a well-defined meaning, but -EFAULT (which
> is "Bad address") seemingly doesn't, though I am not sure either. :-)

KVM has returned -EFAULT since forever, i.e. it's effectively already part of the
ABI.  I doubt there's a userspace that relies precisely on -EFAULT, but userspace
definitely will be confused if KVM returns '0' where KVM used to return -EFAULT.
And so if we want to return '0', it needs to be opt-in, which means forcing
userspace to enable a capability *and* requires code in KVM to conditionally return
'0' instead of -EFAULT/-EHWPOISON.

> One example: for a backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
> KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when the SGX vepc
> fault handler fails to allocate EPC) and kvm_handle_error_pfn() will just
> return -EFAULT.  If kvm_run.exit_reason isn't purged early, is it possible
> to have an issue here?

Well, yeah, but that's exactly why this series has a patch to reset exit_reason.
The solution to "if KVM is buggy then bad things happen" is to not have KVM bugs :-)
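
To make the errno + exit_reason contract concrete, a minimal userspace sketch
(assuming the KVM_EXIT_MEMORY_FAULT uAPI added earlier in this series; vcpu_fd,
run and the handle_memory_fault() helper are placeholders, not real KVM or QEMU
code):

	#include <errno.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Hypothetical helper, defined elsewhere by the VMM. */
	extern void handle_memory_fault(__u64 gpa, __u64 size, __u64 flags);

	static void vcpu_run_once(int vcpu_fd, struct kvm_run *run)
	{
		int ret = ioctl(vcpu_fd, KVM_RUN, 0);

		if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
		    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			/* memory_fault is valid only for this errno/exit_reason pair. */
			handle_memory_fault(run->memory_fault.gpa,
					    run->memory_fault.size,
					    run->memory_fault.flags);
		} else if (ret < 0 && errno != EINTR) {
			abort();	/* other errnos stay "fatal" in this sketch */
		}
	}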

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-11-02 15:48         ` Paolo Bonzini
@ 2023-11-02 16:03           ` Sean Christopherson
  2023-11-02 16:28             ` David Matlack
  0 siblings, 1 reply; 148+ messages in thread
From: Sean Christopherson @ 2023-11-02 16:03 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Matlack, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On 10/31/23 23:39, David Matlack wrote:
> > > > Maybe can you sketch out how you see this proposal being extensible to
> > > > using guest_memfd for shared mappings?
> > > For in-place conversions, e.g. pKVM, no additional guest_memfd is needed.  What's
> > > missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM needs to
> > > ensure there are no outstanding references when converting back to private.
> > > 
> > > For TDX/SNP, assuming we don't find a performant and robust way to do in-place
> > > conversions, a second fd+offset pair would be needed.
> > Is there a way to support non-in-place conversions within a single guest_memfd?
> 
> For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest
> memory.  The hook would invalidate now-private parts if they have a VMA,
> causing a SIGSEGV/EFAULT if the host touches them.
> 
> It would forbid mappings from multiple gfns to a single offset of the
> guest_memfd, because then the shared vs. private attribute would be tied to
> the offset.  This should not be a problem; for example, in the case of SNP,
> the RMP already requires a single mapping from host physical address to
> guest physical address.

I don't see how this can work.  It's not an M:1 scenario (where M is multiple gfns),
it's a 1:N scenario (where N is multiple offsets).  The *gfn* doesn't change on
a conversion, what needs to change to do non-in-place conversion is the pfn, which
is effectively the guest_memfd+offset pair.

So yes, we *could* support non-in-place conversions within a single guest_memfd,
but it would require a second offset, at which point it makes sense to add a
second file descriptor as well.  Userspace could still use a single guest_memfd
instance, i.e. pass in the same file descriptor but different offsets.
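
To make the fd+offset pairing concrete, this is how a single private binding is
established with the v13 uAPI (a sketch; vm_fd, gmem_fd, shared_hva and the
sizes are placeholders).  A hypothetical second pair for shared memory would
just be another guest_memfd/offset combination supplied alongside it:

	struct kvm_userspace_memory_region2 region = {
		.slot               = 0,
		.flags              = KVM_MEM_PRIVATE,
		.guest_phys_addr    = 0x100000000ull,
		.memory_size        = 0x20000000ull,		/* 512 MiB */
		.userspace_addr     = (__u64)shared_hva,	/* shared pages */
		.guest_memfd        = gmem_fd,			/* private pages */
		.guest_memfd_offset = 0,
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region))
		err(1, "KVM_SET_USER_MEMORY_REGION2");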

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM
  2023-10-27 18:21 ` [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM Sean Christopherson
  2023-10-30 17:30   ` Paolo Bonzini
@ 2023-11-02 16:24   ` Christian Brauner
  2023-11-03 10:40     ` Paolo Bonzini
  1 sibling, 1 reply; 148+ messages in thread
From: Christian Brauner @ 2023-11-02 16:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Fri, Oct 27, 2023 at 11:21:57AM -0700, Sean Christopherson wrote:
> Export anon_inode_getfile_secure() so that it can be used by KVM to create
> and manage file-based guest memory without needing a full-blown filesystem.
> The "standard" anon_inode_getfd() doesn't work for KVM's use case as KVM
> needs a unique inode for each file, e.g. to be able to independently
> manage the size and lifecycle of a given file.
> 
> Note, KVM doesn't need a "secure" version, just unique inodes, i.e. ignore
> the name.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Before we enshrine this misleading name let's rename this to:

create_anon_inode_getfile()

I don't claim it's a great name but it's better than *_secure() which is
very confusing. So just:

struct file *create_anon_inode_getfile(const char *name,
                                       const struct file_operations *fops,
                                       void *priv, int flags)

May also just remove that context_inode argument from the exported
function. The only other caller is io_uring. And neither it nor this
patchset needs the context_inode thing afaict. Merge conflict risk is
extremely low so carrying that as part of this patchset is fine and
shouldn't cause huge issues for you.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-11-02 16:03           ` Sean Christopherson
@ 2023-11-02 16:28             ` David Matlack
  2023-11-02 17:37               ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: David Matlack @ 2023-11-02 16:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Thu, Nov 2, 2023 at 9:03 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> > On 10/31/23 23:39, David Matlack wrote:
> > > > > Maybe can you sketch out how you see this proposal being extensible to
> > > > > using guest_memfd for shared mappings?
> > > > For in-place conversions, e.g. pKVM, no additional guest_memfd is needed.  What's
> > > > missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM needs to
> > > > ensure there are no outstanding references when converting back to private.
> > > >
> > > > For TDX/SNP, assuming we don't find a performant and robust way to do in-place
> > > > conversions, a second fd+offset pair would be needed.
> > > Is there a way to support non-in-place conversions within a single guest_memfd?
> >
> > For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest
> > memory.  The hook would invalidate now-private parts if they have a VMA,
> > causing a SIGSEGV/EFAULT if the host touches them.
> >
> > It would forbid mappings from multiple gfns to a single offset of the
> > guest_memfd, because then the shared vs. private attribute would be tied to
> > the offset.  This should not be a problem; for example, in the case of SNP,
> > the RMP already requires a single mapping from host physical address to
> > guest physical address.
>
> I don't see how this can work.  It's not an M:1 scenario (where M is multiple gfns),
> it's a 1:N scenario (where N is multiple offsets).  The *gfn* doesn't change on
> a conversion, what needs to change to do non-in-place conversion is the pfn, which
> is effectively the guest_memfd+offset pair.
>
> So yes, we *could* support non-in-place conversions within a single guest_memfd,
> but it would require a second offset,

Why can't KVM free the existing page at guest_memfd+offset and
allocate a new one when doing non-in-place conversions?

> at which point it makes sense to add a
> second file descriptor as well.  Userspace could still use a single guest_memfd
> instance, i.e. pass in the same file descriptor but different offsets.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-11-02 16:28             ` David Matlack
@ 2023-11-02 17:37               ` Sean Christopherson
  0 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-11-02 17:37 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Thu, Nov 02, 2023, David Matlack wrote:
> On Thu, Nov 2, 2023 at 9:03 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> > > On 10/31/23 23:39, David Matlack wrote:
> > > > > > Maybe can you sketch out how you see this proposal being extensible to
> > > > > > using guest_memfd for shared mappings?
> > > > > For in-place conversions, e.g. pKVM, no additional guest_memfd is needed.  What's
> > > > > missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM needs to
> > > > > ensure there are no outstanding references when converting back to private.
> > > > >
> > > > > For TDX/SNP, assuming we don't find a performant and robust way to do in-place
> > > > > conversions, a second fd+offset pair would be needed.
> > > > Is there a way to support non-in-place conversions within a single guest_memfd?
> > >
> > > For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest
> > > memory.  The hook would invalidate now-private parts if they have a VMA,
> > > causing a SIGSEGV/EFAULT if the host touches them.
> > >
> > > It would forbid mappings from multiple gfns to a single offset of the
> > > guest_memfd, because then the shared vs. private attribute would be tied to
> > > the offset.  This should not be a problem; for example, in the case of SNP,
> > > the RMP already requires a single mapping from host physical address to
> > > guest physical address.
> >
> > I don't see how this can work.  It's not an M:1 scenario (where M is multiple gfns),
> > it's a 1:N scenario (where N is multiple offsets).  The *gfn* doesn't change on
> > a conversion, what needs to change to do non-in-place conversion is the pfn, which
> > is effectively the guest_memfd+offset pair.
> >
> > So yes, we *could* support non-in-place conversions within a single guest_memfd,
> > but it would require a second offset,
> 
> Why can't KVM free the existing page at guest_memfd+offset and
> allocate a new one when doing non-in-place conversions?

Oh, I see what you're suggesting.  Eww.

It's certainly possible, but it would largely defeat the purpose of adding
guest_memfd in the first place.

For TDX and SNP, the goal is to provide a simple, robust mechanism for isolating
guest private memory so that it's all but impossible for the host to access private
memory.  As things stand, memory for a given guest_memfd is either private or shared
(assuming we support a second guest_memfd per memslot).  I.e. there's no need to
track whether a given page/folio in the guest_memfd is private vs. shared.

We could use memory attributes, but that further complicates things when intrahost
migration (and potentially other multi-user scenarios) comes along, i.e. when KVM
supports linking multiple guest_memfd files to a single inode.  We'd have to ensure
that all "struct kvm" instances have identical PRIVATE attributes for a given
*offset* in the inode.  I'm not even sure how feasible that is for intrahost
migration, and that's the *easy* case, because IIRC it's already a hard requirement
that the source and destination have identical gfn=>guest_memfd bindings, i.e. KVM
can somewhat easily reason about gfn attributes.

But even then, that only helps with the actual migration of the VM, e.g. we'd still
have to figure out how to deal with .mmap() and other shared vs. private actions
when linking a new guest_memfd file against an existing inode.

I haven't seen the pKVM patches for supporting .mmap(), so maybe this is already
a solved problem, but I'd honestly be quite surprised if it all works correctly
if/when KVM supports multiple files per inode.

And I don't see what value non-in-place conversions would add.  The value added
by in-place conversions, aside from the obvious preservation of data, which isn't
relevant to TDX/SNP, is that it doesn't require freeing and reallocating memory
to avoid double-allocating for private vs. shared.  That's especially nice
when hugepages are being used because reconstituting a hugepage "only" requires
zapping SPTEs.

But if KVM is freeing the private page, it's the same as punching a hole, probably
quite literally, when mapping the gfn as shared.  In every way I can think of, it's
worse.  E.g. it's more complex for KVM, and the PUNCH_HOLE => allocation operations
must be serialized.
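
(For reference, the hole punch alluded to here is the normal fallocate()
interface that guest_memfd already supports; a sketch, with gmem_fd, offset and
len as placeholders:

	if (fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, len))
		err(1, "PUNCH_HOLE on guest_memfd");

i.e. reclaiming the private page on a shared conversion would amount to doing
this implicitly, plus a reallocation on the next private fault.)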

Regarding double-allocating, I really, really think we should solve that in the
guest.  I.e. teach Linux-as-a-guest to aggressively convert at 2MiB granularity
and avoid 4KiB conversions.  4KiB conversions aren't just a memory utilization
problem, they're also a performance problem, e.g. shatters hugepages (which KVM
doesn't yet support recovering) and increases TLB pressure for both stage-1 and
stage-2 mappings.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-11-02 15:44             ` Sean Christopherson
@ 2023-11-02 18:35               ` Huang, Kai
  0 siblings, 0 replies; 148+ messages in thread
From: Huang, Kai @ 2023-11-02 18:35 UTC (permalink / raw)
  To: pbonzini, Christopherson,, Sean
  Cc: Li, Xiaoyao, kvm-riscv, mic, liam.merwick, kvm, Yamahata, Isaku,
	viro, kirill.shutemov, david, linux-mips, amoorthy, linuxppc-dev,
	tabba, kvmarm, linux-kernel, linux-fsdevel, oliver.upton,
	michael.roth, chao.p.peng, palmer, chenhuacai, aou, mpe,
	Annapurve, Vishal, vbabka, mail, linux-riscv, maz, willy,
	dmatlack, anup, yu.c.zhang, Xu, Yilun, qperret, brauner,
	isaku.yamahata, ackerleytng, jarkko, paul.walmsley,
	linux-arm-kernel, linux-mm, Wang, Wei W, akpm

On Thu, 2023-11-02 at 08:44 -0700, Sean Christopherson wrote:
> On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> > On 11/2/23 10:35, Huang, Kai wrote:
> > > IIUC KVM can already handle the case of poisoned
> > > page by sending signal to user app:
> > > 
> > >  	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, 			struct
> > > kvm_page_fault *fault)                                               	{
> > >  		...
> > > 
> > >         		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> > >                		kvm_send_hwpoison_signal(fault->slot, fault->gfn);
> 
> No, this doesn't work, because that signals the host virtual address

Ah, right :-)

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  2023-10-27 18:21 ` [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
                     ` (2 preceding siblings ...)
  2023-11-01 10:52   ` Huang, Kai
@ 2023-11-03  4:09   ` Xu Yilun
  3 siblings, 0 replies; 148+ messages in thread
From: Xu Yilun @ 2023-11-03  4:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Fri, Oct 27, 2023 at 11:21:51AM -0700, Sean Christopherson wrote:
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
                  ^

Should update to "memory_fault" to align with other places.

[...]

> @@ -520,6 +521,12 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};

Thanks,
Yilun

> 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-27 18:21 ` [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
                     ` (3 preceding siblings ...)
  2023-10-31 18:24   ` David Matlack
@ 2023-11-03  9:42   ` Fuad Tabba
  2023-11-04 10:26   ` Xu Yilun
  5 siblings, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-03  9:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Hi,

...

> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index e2252c748fd6..e82c69d5e755 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst

...

> +4.141 KVM_CREATE_GUEST_MEMFD
> +----------------------------
> +
> +:Capability: KVM_CAP_GUEST_MEMFD
> +:Architectures: none
> +:Type: vm ioctl
> +:Parameters: struct struct kvm_create_guest_memfd(in)

One struct too many.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM
  2023-11-02 16:24   ` Christian Brauner
@ 2023-11-03 10:40     ` Paolo Bonzini
  0 siblings, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-03 10:40 UTC (permalink / raw)
  To: Christian Brauner, Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On 11/2/23 17:24, Christian Brauner wrote:
> On Fri, Oct 27, 2023 at 11:21:57AM -0700, Sean Christopherson wrote:
>> Export anon_inode_getfile_secure() so that it can be used by KVM to create
>> and manage file-based guest memory without needing a full-blown filesystem.
>> The "standard" anon_inode_getfd() doesn't work for KVM's use case as KVM
>> needs a unique inode for each file, e.g. to be able to independently
>> manage the size and lifecycle of a given file.
>>
>> Note, KVM doesn't need a "secure" version, just unique inodes, i.e. ignore
>> the name.
>>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> ---
> 
> Before we enshrine this misleading name let's rename this to:
> 
> create_anon_inode_getfile()
> 
> I don't claim it's a great name but it's better than *_secure() which is
> very confusing. So just:
> 
> struct file *create_anon_inode_getfile(const char *name,
>                                         const struct file_operations *fops,
>                                         void *priv, int flags)

I slightly prefer anon_inode_create_getfile(); grepping include/linux 
for '\<create_' vs '_create_' shows that this is much more common.

Neither userfaultfd (which uses anon_inode_getfd_secure()) nor io_uring 
strictly speaking need separate inodes; they do want the call to 
inode_init_security_anon().  But I agree that the new name is better and 
I will adjust the comments so that it is clear why you'd use this 
function instead of anon_inode_get{file,fd}().

> May also just remove that context_inode argument from the exported
> function. The only other caller is io_uring. And neither it nor this
> patchset need the context_inode thing afaict.

True, OTOH we might as well rename anon_inode_getfd_secure() to 
anon_inode_create_getfd(), and that one does need context_inode.

I'll Cc you on v14 and will carry the patch in my tree.

Paolo

> Merge conflict risk is
> extremely low so carrying that as part of this patchset is fine and
> shouldn't cause huge issues for you.
> 


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-11-02 13:52           ` Fuad Tabba
@ 2023-11-03 23:17             ` Sean Christopherson
  0 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-11-03 23:17 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Thu, Nov 02, 2023, Fuad Tabba wrote:
> On Wed, Nov 1, 2023 at 9:55 PM Sean Christopherson <seanjc@google.com> wrote:
> > E.g. a misbehaving userspace could prematurely delete a memslot.  And the more
> > fun example is intrahost migration, where the plan is to allow pointing multiple
> > guest_memfd files at a single guest_memfd inode:
> > https://lore.kernel.org/all/cover.1691446946.git.ackerleytng@google.com
> >
> > There was a lot of discussion for this, but it's scattered all over the place.
> > The TL;DR is is that the inode will represent physical memory, and a file will
> > represent a given "struct kvm" instance's view of that memory.  And so the memory
> > isn't reclaimed until the inode is truncated/punched.
> >
> > I _think_ this reflects the most recent plan from the guest_memfd side:
> > https://lore.kernel.org/all/1233d749211c08d51f9ca5d427938d47f008af1f.1689893403.git.isaku.yamahata@intel.com

Doh, sitting in my TODO folder...

https://lore.kernel.org/all/20231016115028.996656-1-michael.roth@amd.com

> Thanks for pointing that out. I think this might be the way to go.
> I'll have a closer look at this and see how to get it to work with
> pKVM.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-10-27 18:21 ` [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
                     ` (4 preceding siblings ...)
  2023-11-03  9:42   ` Fuad Tabba
@ 2023-11-04 10:26   ` Xu Yilun
  2023-11-06 15:43     ` Sean Christopherson
  5 siblings, 1 reply; 148+ messages in thread
From: Xu Yilun @ 2023-11-04 10:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

> +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
> +allows mapping guest_memfd memory into a guest.  All fields shared with
> +KVM_SET_USER_MEMORY_REGION identically.  Userspace can set KVM_MEM_PRIVATE in
> +flags to have KVM bind the memory region to a given guest_memfd range of
> +[guest_memfd_offset, guest_memfd_offset + memory_size].  The target guest_memfd
                                                        ^
The range end should be exclusive, shouldn't it?

> +must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
> +the target range must not be bound to any other memory region.  All standard
> +bounds checks apply (use common sense).
> +
>  ::
>  
>    struct kvm_userspace_memory_region2 {
> @@ -6087,9 +6096,24 @@ applied.
>  	__u64 guest_phys_addr;
>  	__u64 memory_size; /* bytes */
>  	__u64 userspace_addr; /* start of the userspace allocated memory */
> +  __u64 guest_memfd_offset;
> +	__u32 guest_memfd;
> +	__u32 pad1;
> +	__u64 pad2[14];
>    };
>  

[...]

> +static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> +{
> +	const char *anon_name = "[kvm-gmem]";
> +	struct kvm_gmem *gmem;
> +	struct inode *inode;
> +	struct file *file;
> +	int fd, err;
> +
> +	fd = get_unused_fd_flags(0);
> +	if (fd < 0)
> +		return fd;
> +
> +	gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
> +	if (!gmem) {
> +		err = -ENOMEM;
> +		goto err_fd;
> +	}
> +
> +	/*
> +	 * Use the so called "secure" variant, which creates a unique inode
> +	 * instead of reusing a single inode.  Each guest_memfd instance needs
> +	 * its own inode to track the size, flags, etc.
> +	 */
> +	file = anon_inode_getfile_secure(anon_name, &kvm_gmem_fops, gmem,
> +					 O_RDWR, NULL);
> +	if (IS_ERR(file)) {
> +		err = PTR_ERR(file);
> +		goto err_gmem;
> +	}
> +
> +	file->f_flags |= O_LARGEFILE;
> +
> +	inode = file->f_inode;
> +	WARN_ON(file->f_mapping != inode->i_mapping);

Just curious, why should we check the mapping fields, which are guaranteed by
another subsystem?

> +
> +	inode->i_private = (void *)(unsigned long)flags;
> +	inode->i_op = &kvm_gmem_iops;
> +	inode->i_mapping->a_ops = &kvm_gmem_aops;
> +	inode->i_mode |= S_IFREG;
> +	inode->i_size = size;
> +	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> +	mapping_set_unmovable(inode->i_mapping);
> +	/* Unmovable mappings are supposed to be marked unevictable as well. */
> +	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> +
> +	kvm_get_kvm(kvm);
> +	gmem->kvm = kvm;
> +	xa_init(&gmem->bindings);
> +	list_add(&gmem->entry, &inode->i_mapping->private_list);
> +
> +	fd_install(fd, file);
> +	return fd;
> +
> +err_gmem:
> +	kfree(gmem);
> +err_fd:
> +	put_unused_fd(fd);
> +	return err;
> +}

[...]

> +int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
> +		  unsigned int fd, loff_t offset)
> +{
> +	loff_t size = slot->npages << PAGE_SHIFT;
> +	unsigned long start, end;
> +	struct kvm_gmem *gmem;
> +	struct inode *inode;
> +	struct file *file;
> +
> +	BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
> +
> +	file = fget(fd);
> +	if (!file)
> +		return -EBADF;
> +
> +	if (file->f_op != &kvm_gmem_fops)
> +		goto err;
> +
> +	gmem = file->private_data;
> +	if (gmem->kvm != kvm)
> +		goto err;
> +
> +	inode = file_inode(file);
> +
> +	if (offset < 0 || !PAGE_ALIGNED(offset))
> +		return -EINVAL;

Should also "goto err" here.

> +
> +	if (offset + size > i_size_read(inode))
> +		goto err;
> +
> +	filemap_invalidate_lock(inode->i_mapping);
> +
> +	start = offset >> PAGE_SHIFT;
> +	end = start + slot->npages;
> +
> +	if (!xa_empty(&gmem->bindings) &&
> +	    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
> +		filemap_invalidate_unlock(inode->i_mapping);
> +		goto err;
> +	}
> +
> +	/*
> +	 * No synchronize_rcu() needed, any in-flight readers are guaranteed to
> +	 * be see either a NULL file or this new file, no need for them to go
> +	 * away.
> +	 */
> +	rcu_assign_pointer(slot->gmem.file, file);
> +	slot->gmem.pgoff = start;
> +
> +	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
> +	filemap_invalidate_unlock(inode->i_mapping);
> +
> +	/*
> +	 * Drop the reference to the file, even on success.  The file pins KVM,
> +	 * not the other way 'round.  Active bindings are invalidated if the
                             ^
around?

Thanks,
Yilun

> +	 * file is closed before memslots are destroyed.
> +	 */
> +	fput(file);
> +	return 0;
> +
> +err:
> +	fput(file);
> +	return -EINVAL;
> +}

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory
  2023-10-27 18:22 ` [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
  2023-11-02 14:34   ` Fuad Tabba
@ 2023-11-05 13:02   ` Xu Yilun
  2023-11-05 16:19     ` Paolo Bonzini
  1 sibling, 1 reply; 148+ messages in thread
From: Xu Yilun @ 2023-11-05 13:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

> +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +					      struct kvm_page_fault *fault)
> +{
> +	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> +				      PAGE_SIZE, fault->write, fault->exec,
> +				      fault->is_private);
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> +				   struct kvm_page_fault *fault)
> +{
> +	int max_order, r;
> +
> +	if (!kvm_slot_can_be_private(fault->slot)) {
> +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +		return -EFAULT;
> +	}
> +
> +	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> +			     &max_order);
> +	if (r) {
> +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +		return r;

Why report KVM_EXIT_MEMORY_FAULT here, even with a ret != -EFAULT?  This is
different from the description where KVM_EXIT_MEMORY_FAULT is introduced:

  KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
  be implicit conversions.

  To allow for future possibilities where KVM reports KVM_EXIT_MEMORY_FAULT
  and fills run->memory_fault on _any_ unresolved fault, KVM returns
  "-EFAULT"

Thanks,
Yilun

> +	}
> +
> +	fault->max_level = min(kvm_max_level_for_order(max_order),
> +			       fault->max_level);
> +	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
> +
> +	return RET_PF_CONTINUE;
> +}

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory
  2023-11-05 13:02   ` Xu Yilun
@ 2023-11-05 16:19     ` Paolo Bonzini
  2023-11-06 13:29       ` Xu Yilun
  0 siblings, 1 reply; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-05 16:19 UTC (permalink / raw)
  To: Xu Yilun
  Cc: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Sun, Nov 5, 2023 at 2:04 PM Xu Yilun <yilun.xu@linux.intel.com> wrote:
>
> > +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +                                           struct kvm_page_fault *fault)
> > +{
> > +     kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> > +                                   PAGE_SIZE, fault->write, fault->exec,
> > +                                   fault->is_private);
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > +                                struct kvm_page_fault *fault)
> > +{
> > +     int max_order, r;
> > +
> > +     if (!kvm_slot_can_be_private(fault->slot)) {
> > +             kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > +             return -EFAULT;
> > +     }
> > +
> > +     r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> > +                          &max_order);
> > +     if (r) {
> > +             kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > +             return r;
>
> Why report KVM_EXIT_MEMORY_FAULT here, even with a ret != -EFAULT?

The cases are EFAULT, EHWPOISON (which can report
KVM_EXIT_MEMORY_FAULT) and ENOMEM.  I think it's fine
that even -ENOMEM can return KVM_EXIT_MEMORY_FAULT,
and it doesn't violate the documentation.  The docs tell you "what
can you do if the error is EFAULT or EHWPOISON?"; they don't
exclude that other errnos can result in KVM_EXIT_MEMORY_FAULT,
it's just that you're not supposed to look at it in those cases.

Paolo


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory
  2023-10-27 18:22 ` [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory Sean Christopherson
  2023-10-30 17:36   ` Paolo Bonzini
@ 2023-11-06 11:00   ` Fuad Tabba
  2023-11-06 11:03     ` Paolo Bonzini
  1 sibling, 1 reply; 148+ messages in thread
From: Fuad Tabba @ 2023-11-06 11:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

Hi,


On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
> and testing vehicle for Confidential (CoCo) VMs, and potentially to even
> become a "real" product in the distant future, e.g. a la pKVM.
>
> The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
> Intel's TDX, but those technologies are extremely complex (understatement),
> difficult to debug, don't support running as nested guests, and require
> hardware that's isn't universally accessible.  I.e. relying SEV-SNP or TDX

nit: "that isn't"

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

> for maintaining guest private memory isn't a realistic option.
>
> At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
> selftests for guest_memfd and private memory support without requiring
> unique hardware.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  Documentation/virt/kvm/api.rst  | 32 ++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/kvm_host.h | 15 +++++++++------
>  arch/x86/include/uapi/asm/kvm.h |  3 +++
>  arch/x86/kvm/Kconfig            | 12 ++++++++++++
>  arch/x86/kvm/mmu/mmu_internal.h |  1 +
>  arch/x86/kvm/x86.c              | 16 +++++++++++++++-
>  include/uapi/linux/kvm.h        |  1 +
>  virt/kvm/Kconfig                |  5 +++++
>  8 files changed, 78 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 38dc1fda4f45..00029436ac5b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -147,10 +147,29 @@ described as 'basic' will be available.
>  The new VM has no virtual cpus and no memory.
>  You probably want to use 0 as machine type.
>
> +X86:
> +^^^^
> +
> +Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
> +
> +S390:
> +^^^^^
> +
>  In order to create user controlled virtual machines on S390, check
>  KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
>  privileged user (CAP_SYS_ADMIN).
>
> +MIPS:
> +^^^^^
> +
> +To use hardware assisted virtualization on MIPS (VZ ASE) rather than
> +the default trap & emulate implementation (which changes the virtual
> +memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
> +flag KVM_VM_MIPS_VZ.
> +
> +ARM64:
> +^^^^^^
> +
>  On arm64, the physical address size for a VM (IPA Size limit) is limited
>  to 40bits by default. The limit can be configured if the host supports the
>  extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
> @@ -8650,6 +8669,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
>  64-bit bitmap (each bit describing a block size). The default value is
>  0, to disable the eager page splitting.
>
> +8.41 KVM_CAP_VM_TYPES
> +---------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: system ioctl
> +
> +This capability returns a bitmap of support VM types.  The 1-setting of bit @n
> +means the VM type with value @n is supported.  Possible values of @n are::
> +
> +  #define KVM_X86_DEFAULT_VM   0
> +  #define KVM_X86_SW_PROTECTED_VM      1
> +
>  9. Known KVM API problems
>  =========================
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f9e8d5642069..dff10051e9b6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1244,6 +1244,7 @@ enum kvm_apicv_inhibit {
>  };
>
>  struct kvm_arch {
> +       unsigned long vm_type;
>         unsigned long n_used_mmu_pages;
>         unsigned long n_requested_mmu_pages;
>         unsigned long n_max_mmu_pages;
> @@ -2077,6 +2078,12 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>                        int tdp_max_root_level, int tdp_huge_page_level);
>
> +#ifdef CONFIG_KVM_PRIVATE_MEM
> +#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != KVM_X86_DEFAULT_VM)
> +#else
> +#define kvm_arch_has_private_mem(kvm) false
> +#endif
> +
>  static inline u16 kvm_read_ldt(void)
>  {
>         u16 ldt;
> @@ -2125,14 +2132,10 @@ enum {
>  #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
>
>  # define KVM_MAX_NR_ADDRESS_SPACES     2
> +/* SMM is currently unsupported for guests with private memory. */
> +# define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 2)
>  # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
>  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> -
> -static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
> -{
> -       return KVM_MAX_NR_ADDRESS_SPACES;
> -}
> -
>  #else
>  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
>  #endif
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 1a6a1f987949..a448d0964fc0 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -562,4 +562,7 @@ struct kvm_pmu_event_filter {
>  /* x86-specific KVM_EXIT_HYPERCALL flags. */
>  #define KVM_EXIT_HYPERCALL_LONG_MODE   BIT(0)
>
> +#define KVM_X86_DEFAULT_VM     0
> +#define KVM_X86_SW_PROTECTED_VM        1
> +
>  #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 091b74599c22..8452ed0228cb 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -77,6 +77,18 @@ config KVM_WERROR
>
>           If in doubt, say "N".
>
> +config KVM_SW_PROTECTED_VM
> +       bool "Enable support for KVM software-protected VMs"
> +       depends on EXPERT
> +       depends on X86_64
> +       select KVM_GENERIC_PRIVATE_MEM
> +       help
> +         Enable support for KVM software-protected VMs.  Currently "protected"
> +         means the VM can be backed with memory provided by
> +         KVM_CREATE_GUEST_MEMFD.
> +
> +         If unsure, say "N".
> +
>  config KVM_INTEL
>         tristate "KVM for Intel (and compatible) processors support"
>         depends on KVM && IA32_FEAT_CTL
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 86c7cb692786..b66a7d47e0e4 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -297,6 +297,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>                 .max_level = KVM_MAX_HUGEPAGE_LEVEL,
>                 .req_level = PG_LEVEL_4K,
>                 .goal_level = PG_LEVEL_4K,
> +               .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
>         };
>         int r;
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c4d17727b199..e3eb608b6692 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4441,6 +4441,13 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
>         return 0;
>  }
>
> +static bool kvm_is_vm_type_supported(unsigned long type)
> +{
> +       return type == KVM_X86_DEFAULT_VM ||
> +              (type == KVM_X86_SW_PROTECTED_VM &&
> +               IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
> +}
> +
>  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  {
>         int r = 0;
> @@ -4632,6 +4639,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>         case KVM_CAP_X86_NOTIFY_VMEXIT:
>                 r = kvm_caps.has_notify_vmexit;
>                 break;
> +       case KVM_CAP_VM_TYPES:
> +               r = BIT(KVM_X86_DEFAULT_VM);
> +               if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
> +                       r |= BIT(KVM_X86_SW_PROTECTED_VM);
> +               break;
>         default:
>                 break;
>         }
> @@ -12314,9 +12326,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>         int ret;
>         unsigned long flags;
>
> -       if (type)
> +       if (!kvm_is_vm_type_supported(type))
>                 return -EINVAL;
>
> +       kvm->arch.vm_type = type;
> +
>         ret = kvm_page_track_init(kvm);
>         if (ret)
>                 goto out;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 29e9eb51dec9..5b5820d19e71 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1218,6 +1218,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_MEMORY_FAULT_INFO 231
>  #define KVM_CAP_MEMORY_ATTRIBUTES 232
>  #define KVM_CAP_GUEST_MEMFD 233
> +#define KVM_CAP_VM_TYPES 234
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 08afef022db9..2c964586aa14 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -104,3 +104,8 @@ config KVM_GENERIC_MEMORY_ATTRIBUTES
>  config KVM_PRIVATE_MEM
>         select XARRAY_MULTI
>         bool
> +
> +config KVM_GENERIC_PRIVATE_MEM
> +       select KVM_GENERIC_MEMORY_ATTRIBUTES
> +       select KVM_PRIVATE_MEM
> +       bool
> --
> 2.42.0.820.g83a721a137-goog
>
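
For illustration, a minimal userspace sketch of how the VM type plumbing above
might be exercised: query the KVM_CAP_VM_TYPES bitmap and, if the
KVM_X86_SW_PROTECTED_VM bit is set, pass that type to KVM_CREATE_VM.  This
assumes uapi headers that carry the defines added by this patch; the error
handling is a placeholder.

/* Sketch only: query supported VM types and create a software-protected VM. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	if (kvm < 0)
		return 1;

	/* KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES) returns the bitmap of supported types. */
	int types = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);
	if (types <= 0 || !(types & (1 << KVM_X86_SW_PROTECTED_VM))) {
		fprintf(stderr, "KVM_X86_SW_PROTECTED_VM not supported\n");
		return 1;
	}

	/* The VM type goes in the KVM_CREATE_VM argument, i.e. the "machine type". */
	int vm = ioctl(kvm, KVM_CREATE_VM, KVM_X86_SW_PROTECTED_VM);
	return vm < 0 ? 1 : 0;
}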

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory
  2023-11-06 11:00   ` Fuad Tabba
@ 2023-11-06 11:03     ` Paolo Bonzini
  0 siblings, 0 replies; 148+ messages in thread
From: Paolo Bonzini @ 2023-11-06 11:03 UTC (permalink / raw)
  To: Fuad Tabba, Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 11/6/23 12:00, Fuad Tabba wrote:
> Hi,
> 
> 
> On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <seanjc@google.com> wrote:
>>
>> Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
>> and testing vehicle for Confidential (CoCo) VMs, and potentially to even
>> become a "real" product in the distant future, e.g. a la pKVM.
>>
>> The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
>> Intel's TDX, but those technologies are extremely complex (understatement),
>> difficult to debug, don't support running as nested guests, and require
>> hardware that's isn't universally accessible.  I.e. relying SEV-SNP or TDX
> 
> nit: "that isn't"
> 
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Tested-by: Fuad Tabba <tabba@google.com>

Hi Fuad,

thanks for your reviews and tests of the gmem patches!  Can you please 
continue replying to v14?

Thanks,

Paolo

> Cheers,
> /fuad
> 
>> for maintaining guest private memory isn't a realistic option.
>>
>> At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
>> selftests for guest_memfd and private memory support without requiring
>> unique hardware.
>>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> ---
>>   Documentation/virt/kvm/api.rst  | 32 ++++++++++++++++++++++++++++++++
>>   arch/x86/include/asm/kvm_host.h | 15 +++++++++------
>>   arch/x86/include/uapi/asm/kvm.h |  3 +++
>>   arch/x86/kvm/Kconfig            | 12 ++++++++++++
>>   arch/x86/kvm/mmu/mmu_internal.h |  1 +
>>   arch/x86/kvm/x86.c              | 16 +++++++++++++++-
>>   include/uapi/linux/kvm.h        |  1 +
>>   virt/kvm/Kconfig                |  5 +++++
>>   8 files changed, 78 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 38dc1fda4f45..00029436ac5b 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -147,10 +147,29 @@ described as 'basic' will be available.
>>   The new VM has no virtual cpus and no memory.
>>   You probably want to use 0 as machine type.
>>
>> +X86:
>> +^^^^
>> +
>> +Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
>> +
>> +S390:
>> +^^^^^
>> +
>>   In order to create user controlled virtual machines on S390, check
>>   KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
>>   privileged user (CAP_SYS_ADMIN).
>>
>> +MIPS:
>> +^^^^^
>> +
>> +To use hardware assisted virtualization on MIPS (VZ ASE) rather than
>> +the default trap & emulate implementation (which changes the virtual
>> +memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
>> +flag KVM_VM_MIPS_VZ.
>> +
>> +ARM64:
>> +^^^^^^
>> +
>>   On arm64, the physical address size for a VM (IPA Size limit) is limited
>>   to 40bits by default. The limit can be configured if the host supports the
>>   extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
>> @@ -8650,6 +8669,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
>>   64-bit bitmap (each bit describing a block size). The default value is
>>   0, to disable the eager page splitting.
>>
>> +8.41 KVM_CAP_VM_TYPES
>> +---------------------
>> +
>> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
>> +:Architectures: x86
>> +:Type: system ioctl
>> +
>> +This capability returns a bitmap of support VM types.  The 1-setting of bit @n
>> +means the VM type with value @n is supported.  Possible values of @n are::
>> +
>> +  #define KVM_X86_DEFAULT_VM   0
>> +  #define KVM_X86_SW_PROTECTED_VM      1
>> +
>>   9. Known KVM API problems
>>   =========================
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index f9e8d5642069..dff10051e9b6 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1244,6 +1244,7 @@ enum kvm_apicv_inhibit {
>>   };
>>
>>   struct kvm_arch {
>> +       unsigned long vm_type;
>>          unsigned long n_used_mmu_pages;
>>          unsigned long n_requested_mmu_pages;
>>          unsigned long n_max_mmu_pages;
>> @@ -2077,6 +2078,12 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
>>   void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>>                         int tdp_max_root_level, int tdp_huge_page_level);
>>
>> +#ifdef CONFIG_KVM_PRIVATE_MEM
>> +#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != KVM_X86_DEFAULT_VM)
>> +#else
>> +#define kvm_arch_has_private_mem(kvm) false
>> +#endif
>> +
>>   static inline u16 kvm_read_ldt(void)
>>   {
>>          u16 ldt;
>> @@ -2125,14 +2132,10 @@ enum {
>>   #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
>>
>>   # define KVM_MAX_NR_ADDRESS_SPACES     2
>> +/* SMM is currently unsupported for guests with private memory. */
>> +# define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 2)
>>   # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
>>   # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
>> -
>> -static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
>> -{
>> -       return KVM_MAX_NR_ADDRESS_SPACES;
>> -}
>> -
>>   #else
>>   # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
>>   #endif
>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>> index 1a6a1f987949..a448d0964fc0 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -562,4 +562,7 @@ struct kvm_pmu_event_filter {
>>   /* x86-specific KVM_EXIT_HYPERCALL flags. */
>>   #define KVM_EXIT_HYPERCALL_LONG_MODE   BIT(0)
>>
>> +#define KVM_X86_DEFAULT_VM     0
>> +#define KVM_X86_SW_PROTECTED_VM        1
>> +
>>   #endif /* _ASM_X86_KVM_H */
>> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
>> index 091b74599c22..8452ed0228cb 100644
>> --- a/arch/x86/kvm/Kconfig
>> +++ b/arch/x86/kvm/Kconfig
>> @@ -77,6 +77,18 @@ config KVM_WERROR
>>
>>            If in doubt, say "N".
>>
>> +config KVM_SW_PROTECTED_VM
>> +       bool "Enable support for KVM software-protected VMs"
>> +       depends on EXPERT
>> +       depends on X86_64
>> +       select KVM_GENERIC_PRIVATE_MEM
>> +       help
>> +         Enable support for KVM software-protected VMs.  Currently "protected"
>> +         means the VM can be backed with memory provided by
>> +         KVM_CREATE_GUEST_MEMFD.
>> +
>> +         If unsure, say "N".
>> +
>>   config KVM_INTEL
>>          tristate "KVM for Intel (and compatible) processors support"
>>          depends on KVM && IA32_FEAT_CTL
>> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
>> index 86c7cb692786..b66a7d47e0e4 100644
>> --- a/arch/x86/kvm/mmu/mmu_internal.h
>> +++ b/arch/x86/kvm/mmu/mmu_internal.h
>> @@ -297,6 +297,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>>                  .max_level = KVM_MAX_HUGEPAGE_LEVEL,
>>                  .req_level = PG_LEVEL_4K,
>>                  .goal_level = PG_LEVEL_4K,
>> +               .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
>>          };
>>          int r;
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index c4d17727b199..e3eb608b6692 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4441,6 +4441,13 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
>>          return 0;
>>   }
>>
>> +static bool kvm_is_vm_type_supported(unsigned long type)
>> +{
>> +       return type == KVM_X86_DEFAULT_VM ||
>> +              (type == KVM_X86_SW_PROTECTED_VM &&
>> +               IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
>> +}
>> +
>>   int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>   {
>>          int r = 0;
>> @@ -4632,6 +4639,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>          case KVM_CAP_X86_NOTIFY_VMEXIT:
>>                  r = kvm_caps.has_notify_vmexit;
>>                  break;
>> +       case KVM_CAP_VM_TYPES:
>> +               r = BIT(KVM_X86_DEFAULT_VM);
>> +               if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
>> +                       r |= BIT(KVM_X86_SW_PROTECTED_VM);
>> +               break;
>>          default:
>>                  break;
>>          }
>> @@ -12314,9 +12326,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>          int ret;
>>          unsigned long flags;
>>
>> -       if (type)
>> +       if (!kvm_is_vm_type_supported(type))
>>                  return -EINVAL;
>>
>> +       kvm->arch.vm_type = type;
>> +
>>          ret = kvm_page_track_init(kvm);
>>          if (ret)
>>                  goto out;
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 29e9eb51dec9..5b5820d19e71 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1218,6 +1218,7 @@ struct kvm_ppc_resize_hpt {
>>   #define KVM_CAP_MEMORY_FAULT_INFO 231
>>   #define KVM_CAP_MEMORY_ATTRIBUTES 232
>>   #define KVM_CAP_GUEST_MEMFD 233
>> +#define KVM_CAP_VM_TYPES 234
>>
>>   #ifdef KVM_CAP_IRQ_ROUTING
>>
>> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
>> index 08afef022db9..2c964586aa14 100644
>> --- a/virt/kvm/Kconfig
>> +++ b/virt/kvm/Kconfig
>> @@ -104,3 +104,8 @@ config KVM_GENERIC_MEMORY_ATTRIBUTES
>>   config KVM_PRIVATE_MEM
>>          select XARRAY_MULTI
>>          bool
>> +
>> +config KVM_GENERIC_PRIVATE_MEM
>> +       select KVM_GENERIC_MEMORY_ATTRIBUTES
>> +       select KVM_PRIVATE_MEM
>> +       bool
>> --
>> 2.42.0.820.g83a721a137-goog
>>
> 


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 27/35] KVM: selftests: Add helpers to convert guest memory b/w private and shared
  2023-10-27 18:22 ` [PATCH v13 27/35] KVM: selftests: Add helpers to convert guest memory b/w private and shared Sean Christopherson
@ 2023-11-06 11:26   ` Fuad Tabba
  0 siblings, 0 replies; 148+ messages in thread
From: Fuad Tabba @ 2023-11-06 11:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson <seanjc@google.com> wrote:
>
> From: Vishal Annapurve <vannapurve@google.com>
>
> Add helpers to convert memory between private and shared via KVM's
> memory attributes, as well as helpers to free/allocate guest_memfd memory
> via fallocate().  Userspace, i.e. tests, is NOT required to do fallocate()
> when converting memory, as the attributes are the single source of true.

true->truth

> Provide allocate() helpers so that tests can mimic a userspace that frees
> private memory on conversion, e.g. to prioritize memory usage over
> performance.
>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  .../selftests/kvm/include/kvm_util_base.h     | 48 +++++++++++++++++++
>  tools/testing/selftests/kvm/lib/kvm_util.c    | 28 +++++++++++
>  2 files changed, 76 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
> index 9f861182c02a..1441fca6c273 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util_base.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
> @@ -333,6 +333,54 @@ static inline void vm_enable_cap(struct kvm_vm *vm, uint32_t cap, uint64_t arg0)
>         vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
>  }
>
> +static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
> +                                           uint64_t size, uint64_t attributes)
> +{
> +       struct kvm_memory_attributes attr = {
> +               .attributes = attributes,
> +               .address = gpa,
> +               .size = size,
> +               .flags = 0,
> +       };
> +
> +       /*
> +        * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes.  These flows
> +        * need significant enhancements to support multiple attributes.
> +        */
> +       TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
> +                   "Update me to support multiple attributes!");
> +
> +       vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
> +}
> +
> +
> +static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa,
> +                                     uint64_t size)
> +{
> +       vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +}
> +
> +static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
> +                                    uint64_t size)
> +{
> +       vm_set_memory_attributes(vm, gpa, size, 0);
> +}
> +
> +void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
> +                           bool punch_hole);
> +
> +static inline void vm_guest_mem_punch_hole(struct kvm_vm *vm, uint64_t gpa,
> +                                          uint64_t size)
> +{
> +       vm_guest_mem_fallocate(vm, gpa, size, true);
> +}
> +
> +static inline void vm_guest_mem_allocate(struct kvm_vm *vm, uint64_t gpa,
> +                                        uint64_t size)
> +{
> +       vm_guest_mem_fallocate(vm, gpa, size, false);
> +}
> +
>  void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
>  const char *vm_guest_mode_string(uint32_t i);
>
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 45050f54701a..a140aee8d0f5 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -1176,6 +1176,34 @@ void vm_mem_region_delete(struct kvm_vm *vm, uint32_t slot)
>         __vm_mem_region_delete(vm, memslot2region(vm, slot), true);
>  }
>
> +void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
> +                           bool punch_hole)
> +{
> +       const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
> +       struct userspace_mem_region *region;
> +       uint64_t end = base + size;
> +       uint64_t gpa, len;
> +       off_t fd_offset;
> +       int ret;
> +
> +       for (gpa = base; gpa < end; gpa += len) {
> +               uint64_t offset;
> +
> +               region = userspace_mem_region_find(vm, gpa, gpa);
> +               TEST_ASSERT(region && region->region.flags & KVM_MEM_PRIVATE,
> +                           "Private memory region not found for GPA 0x%lx", gpa);
> +
> +               offset = (gpa - region->region.guest_phys_addr);

nit: why the parentheses?

> +               fd_offset = region->region.guest_memfd_offset + offset;
> +               len = min_t(uint64_t, end - gpa, region->region.memory_size - offset);
> +
> +               ret = fallocate(region->region.guest_memfd, mode, fd_offset, len);
> +               TEST_ASSERT(!ret, "fallocate() failed to %s at %lx (len = %lu), fd = %d, mode = %x, offset = %lx\n",
> +                           punch_hole ? "punch hole" : "allocate", gpa, len,
> +                           region->region.guest_memfd, mode, fd_offset);
> +       }
> +}
> +

Nits aside:

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad


>  /* Returns the size of a vCPU's kvm_run structure. */
>  static int vcpu_mmap_sz(void)
>  {
> --
> 2.42.0.820.g83a721a137-goog
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory
  2023-11-05 16:19     ` Paolo Bonzini
@ 2023-11-06 13:29       ` Xu Yilun
  2023-11-06 15:56         ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Xu Yilun @ 2023-11-06 13:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Sun, Nov 05, 2023 at 05:19:36PM +0100, Paolo Bonzini wrote:
> On Sun, Nov 5, 2023 at 2:04 PM Xu Yilun <yilun.xu@linux.intel.com> wrote:
> >
> > > +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > > +                                           struct kvm_page_fault *fault)
> > > +{
> > > +     kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> > > +                                   PAGE_SIZE, fault->write, fault->exec,
> > > +                                   fault->is_private);
> > > +}
> > > +
> > > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > > +                                struct kvm_page_fault *fault)
> > > +{
> > > +     int max_order, r;
> > > +
> > > +     if (!kvm_slot_can_be_private(fault->slot)) {
> > > +             kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > +             return -EFAULT;
> > > +     }
> > > +
> > > +     r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> > > +                          &max_order);
> > > +     if (r) {
> > > +             kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > +             return r;
> >
> > Why report KVM_EXIT_MEMORY_FAULT here? even with a ret != -EFAULT?
> 
> The cases are EFAULT, EHWPOISON (which can report
> KVM_EXIT_MEMORY_FAULT) and ENOMEM. I think it's fine
> that even -ENOMEM can return KVM_EXIT_MEMORY_FAULT,
> and it doesn't violate the documentation.  The docs tell you "what
> can you do if the error is EFAULT or EHWPOISON?"; they don't
> exclude that other errnos result in KVM_EXIT_MEMORY_FAULT,
> it's just that you're not supposed to look at it

Thanks, it's OK for ENOMEM + KVM_EXIT_MEMORY_FAULT.

Another concern is, now 3 places to report EFAULT + KVM_EXIT_MEMORY_FAULT:

  if (!kvm_slot_can_be_private(fault->slot)) {
	kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
	return -EFAULT;
  }

  file = kvm_gmem_get_file(slot);
  if (!file)
	return -EFAULT;

  if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
	kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
	return -EFAULT;
  }

They are different cases, and it seems userspace should handle them
differently, but there is not enough information to distinguish them.

Thanks,
Yilun

> 
> Paolo
> 
> 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  2023-11-04 10:26   ` Xu Yilun
@ 2023-11-06 15:43     ` Sean Christopherson
  0 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-11-06 15:43 UTC (permalink / raw)
  To: Xu Yilun
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Sat, Nov 04, 2023, Xu Yilun wrote:
> > +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
> > +allows mapping guest_memfd memory into a guest.  All fields shared with
> > +KVM_SET_USER_MEMORY_REGION identically.  Userspace can set KVM_MEM_PRIVATE in
> > +flags to have KVM bind the memory region to a given guest_memfd range of
> > +[guest_memfd_offset, guest_memfd_offset + memory_size].  The target guest_memfd
>                                                         ^
> The range end should be exclusive, is it?

Yes, that should be a ')', not a ']'.

> > +static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> > +{
> > +	const char *anon_name = "[kvm-gmem]";
> > +	struct kvm_gmem *gmem;
> > +	struct inode *inode;
> > +	struct file *file;
> > +	int fd, err;
> > +
> > +	fd = get_unused_fd_flags(0);
> > +	if (fd < 0)
> > +		return fd;
> > +
> > +	gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
> > +	if (!gmem) {
> > +		err = -ENOMEM;
> > +		goto err_fd;
> > +	}
> > +
> > +	/*
> > +	 * Use the so called "secure" variant, which creates a unique inode
> > +	 * instead of reusing a single inode.  Each guest_memfd instance needs
> > +	 * its own inode to track the size, flags, etc.
> > +	 */
> > +	file = anon_inode_getfile_secure(anon_name, &kvm_gmem_fops, gmem,
> > +					 O_RDWR, NULL);
> > +	if (IS_ERR(file)) {
> > +		err = PTR_ERR(file);
> > +		goto err_gmem;
> > +	}
> > +
> > +	file->f_flags |= O_LARGEFILE;
> > +
> > +	inode = file->f_inode;
> > +	WARN_ON(file->f_mapping != inode->i_mapping);
> 
> Just curious, why should we check the mapping fields, which are guaranteed by
> another subsystem?

Mostly to document the behavior.  The vast majority of folks that read this code
will be KVM developers, not file systems developers, and will likely have no clue
about the relationship between f_mapping and i_mapping.  And in the extremely
unlikely scenario that anon_inode_getfile_secure() no longer sets f_mapping, a
WARN detects the issue whereas a comment does not.
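
Tying the above together, a rough userspace sketch of creating a guest_memfd
and binding [guest_memfd_offset, guest_memfd_offset + memory_size) of it to a
memslot with KVM_SET_USER_MEMORY_REGION2.  The slot number, GPA, and shared
mapping are placeholders, and the uapi struct layouts are assumed to match
this series' headers.

/* Sketch only: allocate guest_memfd and bind [0, size) of it to a private-capable slot. */
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int bind_private_slot(int vm_fd, __u64 gpa, __u64 size, void *shared_hva)
{
	struct kvm_create_guest_memfd gmem = {
		.size	= size,
		.flags	= 0,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	if (gmem_fd < 0)
		return gmem_fd;

	struct kvm_userspace_memory_region2 region = {
		.slot			= 0,	/* placeholder slot number */
		.flags			= KVM_MEM_PRIVATE,
		.guest_phys_addr	= gpa,
		.memory_size		= size,
		.userspace_addr		= (__u64)(unsigned long)shared_hva,	/* backs shared accesses */
		.guest_memfd		= gmem_fd,
		.guest_memfd_offset	= 0,
	};
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}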

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory
  2023-11-06 13:29       ` Xu Yilun
@ 2023-11-06 15:56         ` Sean Christopherson
  0 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-11-06 15:56 UTC (permalink / raw)
  To: Xu Yilun
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov

On Mon, Nov 06, 2023, Xu Yilun wrote:
> On Sun, Nov 05, 2023 at 05:19:36PM +0100, Paolo Bonzini wrote:
> > On Sun, Nov 5, 2023 at 2:04 PM Xu Yilun <yilun.xu@linux.intel.com> wrote:
> > >
> > > > +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > > > +                                           struct kvm_page_fault *fault)
> > > > +{
> > > > +     kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> > > > +                                   PAGE_SIZE, fault->write, fault->exec,
> > > > +                                   fault->is_private);
> > > > +}
> > > > +
> > > > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > > > +                                struct kvm_page_fault *fault)
> > > > +{
> > > > +     int max_order, r;
> > > > +
> > > > +     if (!kvm_slot_can_be_private(fault->slot)) {
> > > > +             kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > > +             return -EFAULT;
> > > > +     }
> > > > +
> > > > +     r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> > > > +                          &max_order);
> > > > +     if (r) {
> > > > +             kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > > +             return r;
> > >
> > > Why report KVM_EXIT_MEMORY_FAULT here? even with a ret != -EFAULT?
> > 
> > The cases are EFAULT, EHWPOISON (which can report
> > KVM_EXIT_MEMORY_FAULT) and ENOMEM. I think it's fine
> > that even -ENOMEM can return KVM_EXIT_MEMORY_FAULT,
> > and it doesn't violate the documentation.  The docs tell you "what
> > can you do if the error is EFAULT or EHWPOISON?"; they don't
> > exclude that other errnos result in KVM_EXIT_MEMORY_FAULT,
> > it's just that you're not supposed to look at it
> 
> Thanks, it's OK for ENOMEM + KVM_EXIT_MEMORY_FAULT.
> 
> Another concern is, now 3 places to report EFAULT + KVM_EXIT_MEMORY_FAULT:
> 
>   if (!kvm_slot_can_be_private(fault->slot)) {
> 	kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> 	return -EFAULT;
>   }
> 
>   file = kvm_gmem_get_file(slot);
>   if (!file)
> 	return -EFAULT;
> 
>   if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> 	kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> 	return -EFAULT;
>   }
> 
> They are different cases, and it seems userspace should handle them
> differently, but there is not enough information to distinguish them.

For the first, the memory_fault exit will inform userspace that the guest wants
to map memory as private, and userspace will see that the memslot isn't configured
to support private mappings.  Userspace may not even need to query memslots, e.g.
if the gfn in question has been enumerated to the guest as something that can only
be mapped shared.

For the second (no valid guest_memfd file), userspace put the last reference to
the guest_memfd file without informing the guest or creating a memslot.  That's
firmly a userspace bug.

For the third and last, userspace will see that the guest is requesting a private
mapping but the gfn is configured for shared mappings.

In all cases, userspace has the necessary information to resolve the issue, where
"resolving the issue" may mean terminating the guest.  If userspace isn't tracking
memslots or the private attribute, then userspace has far bigger problems.
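
To make that concrete, a rough sketch of how a VMM's run loop might act on the
exit.  The memory_fault field names and KVM_MEMORY_EXIT_FLAG_PRIVATE are
assumed to match patch 09 of this series, and gpa_expected_private() and
vmm_fatal() are made-up VMM-side helpers; the fragment also assumes the usual
<linux/kvm.h> and ioctl() includes.

/* Sketch only; gpa_expected_private() and vmm_fatal() are made-up VMM helpers. */
static void handle_memory_fault_exit(int vm_fd, struct kvm_run *run)
{
	__u64 gpa  = run->memory_fault.gpa;
	__u64 size = run->memory_fault.size;
	int to_private = !!(run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE);

	/* Guest asked for a mapping the VMM never intends to allow => terminate. */
	if (to_private && !gpa_expected_private(gpa))
		vmm_fatal("guest mapped shared-only GPA 0x%llx as private", gpa);

	/* Otherwise convert the range as requested and re-enter the guest. */
	struct kvm_memory_attributes attr = {
		.address	= gpa,
		.size		= size,
		.attributes	= to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
	};
	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr))
		vmm_fatal("KVM_SET_MEMORY_ATTRIBUTES failed");
}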

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-11-02 15:46                       ` Paolo Bonzini
@ 2023-11-27 11:13                         ` Vlastimil Babka
  2023-11-29 22:40                           ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Vlastimil Babka @ 2023-11-27 11:13 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Xiaoyao Li, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On 11/2/23 16:46, Paolo Bonzini wrote:
> On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson <seanjc@google.com> wrote:
>> Actually, looking that this again, there's not actually a hard dependency on THP.
>> A THP-enabled kernel _probably_  gives a higher probability of using hugepages,
>> but mostly because THP selects COMPACTION, and I suppose because using THP for
>> other allocations reduces overall fragmentation.
> 
> Yes, that's why I didn't even bother enabling it unless THP is
> enabled, but it makes even more sense to just try.
> 
>> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
>> we should do the below (I verified KVM can create hugepages with THP=n).  We'll
>> need another capability, but (a) we probably should have that anyways and (b) it
>> provides a cleaner path to adding PUD-sized hugepage support in the future.
> 
> I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This
> should be a generic kernel API and in fact the sizes are available in
> a not-so-friendly format in /sys/kernel/mm/hugepages.
> 
> We should just add /sys/kernel/mm/hugepages/sizes that contains
> "2097152 1073741824" on x86 (only the former if 1G pages are not
> supported).
> 
> Plus: is this the best API if we need something else for 1G pages?
> 
> Let's drop *this* patch and proceed incrementally. (Again, this is
> what I want to do with this final review: identify places that are
> stil sticky, and don't let them block the rest).
> 
> Coincidentially we have an open spot next week at plumbers. Let's
> extend Fuad's section to cover more guestmem work.

Hi,

was there any outcome wrt this one? Based on my experience with THP's it
would be best if userspace didn't have to opt-in, nor care about the
supported size. If the given size is unaligned, provide a mix of large pages
up to an aligned size, and for the rest fall back to base pages, which should
be better than -EINVAL on creation (is it possible with the current
implementation? I'd hope so?). A way to opt-out from huge pages could be
useful although there's always the risk of some initial troubles resulting
in various online sources cargo-cult recommending to opt-out forever.

Vlastimil

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
  2023-11-27 11:13                         ` Vlastimil Babka
@ 2023-11-29 22:40                           ` Sean Christopherson
  0 siblings, 0 replies; 148+ messages in thread
From: Sean Christopherson @ 2023-11-29 22:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Paolo Bonzini, Xiaoyao Li, Marc Zyngier, Oliver Upton,
	Huacai Chen, Michael Ellerman, Anup Patel, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vishal Annapurve, Ackerley Tng,
	Maciej Szmigiero, David Hildenbrand, Quentin Perret,
	Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov

On Mon, Nov 27, 2023, Vlastimil Babka wrote:
> On 11/2/23 16:46, Paolo Bonzini wrote:
> > On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson <seanjc@google.com> wrote:
> >> Actually, looking that this again, there's not actually a hard dependency on THP.
> >> A THP-enabled kernel _probably_  gives a higher probability of using hugepages,
> >> but mostly because THP selects COMPACTION, and I suppose because using THP for
> >> other allocations reduces overall fragmentation.
> > 
> > Yes, that's why I didn't even bother enabling it unless THP is
> > enabled, but it makes even more sense to just try.
> > 
> >> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
> >> we should do the below (I verified KVM can create hugepages with THP=n).  We'll
> >> need another capability, but (a) we probably should have that anyways and (b) it
> >> provides a cleaner path to adding PUD-sized hugepage support in the future.
> > 
> > I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This
> > should be a generic kernel API and in fact the sizes are available in
> > a not-so-friendly format in /sys/kernel/mm/hugepages.
> > 
> > We should just add /sys/kernel/mm/hugepages/sizes that contains
> > "2097152 1073741824" on x86 (only the former if 1G pages are not
> > supported).
> > 
> > Plus: is this the best API if we need something else for 1G pages?
> > 
> > Let's drop *this* patch and proceed incrementally. (Again, this is
> > what I want to do with this final review: identify places that are
> > stil sticky, and don't let them block the rest).
> > 
> > Coincidentially we have an open spot next week at plumbers. Let's
> > extend Fuad's section to cover more guestmem work.
> 
> Hi,
> 
> was there any outcome wrt this one?

No, we punted on hugepage support for the initial guest_memfd merge.  We definitely
plan on adding hugepage support sooner rather than later, but we haven't yet agreed on
exactly what that will look like.

> Based on my experience with THP's it would be best if userspace didn't have
> to opt-in, nor care about the supported size. If the given size is unaligned,
> provide a mix of large pages up to an aligned size, and for the rest fall back
> to base pages, which should be better than -EINVAL on creation (is it
> possible with the current implementation? I'd hope so?).

guest_memfd serves a different use case than THP.  For modern VMs, and especially
for slice-of-hardware VMs that are one of the main targets for guest_memfd, if not
_the_ main target, guest memory should _always_ be backed by hugepages in the
physical domain.  The actual guest mappings might not be huge, e.g. x86 needs to
do partial mappings to skip over (legacy) memory holes, but KVM already gracefully
handles that.

In other words, for most guest_memfd use cases, if userspace wants hugepages but
KVM can't provide hugepages, then it is much more desirable to return an error
than to silently fall back to small pages.

I 100% agree that having to opt-in is suboptimal, but IMO providing "error on an
incompatible configuration" semantics without requiring userspace to opt-in is an
even worse experience for userspace.

> A way to opt-out from huge pages could be useful although there's always the
> risk of some initial troubles resulting in various online sources cargo-cult
> recommending to opt-out forever.
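
For concreteness, the opt-in-or-error behavior argued for above, sketched with
the KVM_GUEST_MEMFD_ALLOW_HUGEPAGE flag from the dropped patch 17.  The flag
name, the EINVAL condition, and die() are illustrative only and may not match
whatever hugepage support eventually lands.

/* Sketch only; the hugepage flag is from the dropped patch, die() is a placeholder. */
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int create_gmem_hugepages(int vm_fd, __u64 size)
{
	struct kvm_create_guest_memfd gmem = {
		.size	= size,		/* the VMM is expected to pass an aligned size */
		.flags	= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
	};
	int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/* Fail loudly rather than silently falling back to 4KiB pages. */
	if (fd < 0 && errno == EINVAL)
		die("guest_memfd hugepage request rejected");
	return fd;
}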

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2
  2023-10-27 18:22 ` [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
@ 2024-04-25 14:12   ` Dan Carpenter
  2024-04-25 14:45     ` Shuah Khan
  0 siblings, 1 reply; 148+ messages in thread
From: Dan Carpenter @ 2024-04-25 14:12 UTC (permalink / raw)
  To: Sean Christopherson, Shuah Khan, Greg Kroah-Hartman
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov,
	Naresh Kamboju, Anders Roxell, Benjamin Copeland

On Fri, Oct 27, 2023 at 11:22:07AM -0700, Sean Christopherson wrote:
> Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
> support for guest private memory can be added without needing an entirely
> separate set of helpers.
> 
> Note, this obviously makes selftests backwards-incompatible with older KVM
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> versions from this point forward.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Is there a way we could disable the tests on older kernels instead of
making them fail?  Check uname or something?  There is probably a
standard way to do this...  It's these tests which fail.

 kvm_aarch32_id_regs
 kvm_access_tracking_perf_test
 kvm_arch_timer
 kvm_debug-exceptions
 kvm_demand_paging_test
 kvm_dirty_log_perf_test
 kvm_dirty_log_test
 kvm_guest_print_test
 kvm_hypercalls
 kvm_kvm_page_table_test
 kvm_memslot_modification_stress_test
 kvm_memslot_perf_test
 kvm_page_fault_test
 kvm_psci_test
 kvm_rseq_test
 kvm_smccc_filter
 kvm_steal_time
 kvm_vgic_init
 kvm_vgic_irq
 kvm_vpmu_counter_access

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2
  2024-04-25 14:12   ` Dan Carpenter
@ 2024-04-25 14:45     ` Shuah Khan
  2024-04-25 15:09       ` Sean Christopherson
  0 siblings, 1 reply; 148+ messages in thread
From: Shuah Khan @ 2024-04-25 14:45 UTC (permalink / raw)
  To: Dan Carpenter, Sean Christopherson, Shuah Khan, Greg Kroah-Hartman
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
	Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Viro, Christian Brauner,
	Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov,
	Naresh Kamboju, Anders Roxell, Benjamin Copeland, Shuah Khan

On 4/25/24 08:12, Dan Carpenter wrote:
> On Fri, Oct 27, 2023 at 11:22:07AM -0700, Sean Christopherson wrote:
>> Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
>> support for guest private memory can be added without needing an entirely
>> separate set of helpers.
>>
>> Note, this obviously makes selftests backwards-incompatible with older KVM
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> versions from this point forward.
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Is there a way we could disable the tests on older kernels instead of
> making them fail?  Check uname or something?  There is probably a
> standard way to do this...  It's these tests which fail.

They shouldn't fail - the tests should be skipped on older kernels.
If it is absolutely necessary to use uname to check the kernel version,
refer to zram/zram_lib.sh for an example.

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2
  2024-04-25 14:45     ` Shuah Khan
@ 2024-04-25 15:09       ` Sean Christopherson
  2024-04-25 16:22         ` Shuah Khan
  2024-04-26  7:33         ` Jarkko Sakkinen
  0 siblings, 2 replies; 148+ messages in thread
From: Sean Christopherson @ 2024-04-25 15:09 UTC (permalink / raw)
  To: Shuah Khan
  Cc: Dan Carpenter, Shuah Khan, Greg Kroah-Hartman, Paolo Bonzini,
	Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov,
	Naresh Kamboju, Anders Roxell, Benjamin Copeland

On Thu, Apr 25, 2024, Shuah Khan wrote:
> On 4/25/24 08:12, Dan Carpenter wrote:
> > On Fri, Oct 27, 2023 at 11:22:07AM -0700, Sean Christopherson wrote:
> > > Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
> > > support for guest private memory can be added without needing an entirely
> > > separate set of helpers.
> > > 
> > > Note, this obviously makes selftests backwards-incompatible with older KVM
> >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > versions from this point forward.
> >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > Is there a way we could disable the tests on older kernels instead of
> > making them fail?  Check uname or something?  There is probably a
> > standard way to do this...  It's these tests which fail.
> 
> They shouldn't fail - the tests should be skipped on older kernels.

Ah, that makes sense.  Except for a few outliers that aren't all that interesting,
all KVM selftests create memslots, so I'm tempted to just make it a hard requirement
to spare us headache, e.g.

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index b2262b5fad9e..4b2038b1f11f 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2306,6 +2306,9 @@ void __attribute((constructor)) kvm_selftest_init(void)
        /* Tell stdout not to buffer its content. */
        setbuf(stdout, NULL);
 
+       __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
+                      "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
+
        kvm_selftest_arch_init();
 }

--

but it's also easy enough to be more precise and skip only those that actually
create memslots.

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index b2262b5fad9e..b21152adf448 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -944,6 +944,9 @@ int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flag
                .guest_memfd_offset = guest_memfd_offset,
        };
 
+       __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
+                      "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
+
        return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, &region);
 }
 
@@ -970,6 +973,9 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
        size_t mem_size = npages * vm->page_size;
        size_t alignment;
 
+       __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
+                      "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
+
        TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
                "Number of guest pages is not compatible with the host. "
                "Try npages=%d", vm_adjust_num_guest_pages(vm->mode, npages));
--

^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2
  2024-04-25 15:09       ` Sean Christopherson
@ 2024-04-25 16:22         ` Shuah Khan
  2024-04-26  7:33         ` Jarkko Sakkinen
  1 sibling, 0 replies; 148+ messages in thread
From: Shuah Khan @ 2024-04-25 16:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dan Carpenter, Shuah Khan, Greg Kroah-Hartman, Paolo Bonzini,
	Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
	Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
	Vishal Annapurve, Ackerley Tng, Maciej Szmigiero,
	David Hildenbrand, Quentin Perret, Michael Roth, Wang,
	Liam Merwick, Isaku Yamahata, Kirill A . Shutemov,
	Naresh Kamboju, Anders Roxell, Benjamin Copeland, Shuah Khan

On 4/25/24 09:09, Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Shuah Khan wrote:
>> On 4/25/24 08:12, Dan Carpenter wrote:
>>> On Fri, Oct 27, 2023 at 11:22:07AM -0700, Sean Christopherson wrote:
>>>> Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
>>>> support for guest private memory can be added without needing an entirely
>>>> separate set of helpers.
>>>>
>>>> Note, this obviously makes selftests backwards-incompatible with older KVM
>>>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> versions from this point forward.
>>>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> Is there a way we could disable the tests on older kernels instead of
>>> making them fail?  Check uname or something?  There is probably a
>>> standard way to do this...  It's these tests which fail.
>>
>> They shouldn't fail - the tests should be skipped on older kernels.
> 
> Ah, that makes sense.  Except for a few outliers that aren't all that interesting,
> all KVM selftests create memslots, so I'm tempted to just make it a hard requirement
> to spare us headache, e.g.
> 
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index b2262b5fad9e..4b2038b1f11f 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -2306,6 +2306,9 @@ void __attribute((constructor)) kvm_selftest_init(void)
>          /* Tell stdout not to buffer its content. */
>          setbuf(stdout, NULL);
>   
> +       __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
> +                      "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
> +
>          kvm_selftest_arch_init();
>   }
> 
> --
> 
> but it's also easy enough to be more precise and skip only those that actually
> create memslots.

This approach is what is recommended in the kselftest documentation. Run as many tests
as possible and skip the ones that can't be run due to unmet dependencies.

> 
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index b2262b5fad9e..b21152adf448 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -944,6 +944,9 @@ int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flag
>                  .guest_memfd_offset = guest_memfd_offset,
>          };
>   
> +       __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
> +                      "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
> +
>          return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, &region);
>   }
>   
> @@ -970,6 +973,9 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
>          size_t mem_size = npages * vm->page_size;
>          size_t alignment;
>   
> +       __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
> +                      "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");
> +
>          TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
>                  "Number of guest pages is not compatible with the host. "
>                  "Try npages=%d", vm_adjust_num_guest_pages(vm->mode, npages));
> --

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2
  2024-04-25 15:09       ` Sean Christopherson
  2024-04-25 16:22         ` Shuah Khan
@ 2024-04-26  7:33         ` Jarkko Sakkinen
  1 sibling, 0 replies; 148+ messages in thread
From: Jarkko Sakkinen @ 2024-04-26  7:33 UTC (permalink / raw)
  To: Sean Christopherson, Shuah Khan
  Cc: Dan Carpenter, Shuah Khan, Greg Kroah-Hartman, Paolo Bonzini,
	Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
	Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Viro, Christian Brauner, Matthew Wilcox (Oracle),
	Andrew Morton, kvm, linux-arm-kernel, kvmarm, linux-mips,
	linuxppc-dev, kvm-riscv, linux-riscv, linux-fsdevel, linux-mm,
	linux-kernel, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba,
	Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
	Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
	Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
	Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
	Kirill A . Shutemov, Naresh Kamboju, Anders Roxell,
	Benjamin Copeland

On Thu Apr 25, 2024 at 6:09 PM EEST, Sean Christopherson wrote:
> +       __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
> +                      "KVM selftests from v6.8+ require KVM_SET_USER_MEMORY_REGION2");

This would also work for a casual (but not seasoned) visitor to the KVM code
as additional documentation.

BR, Jarkko

^ permalink raw reply	[flat|nested] 148+ messages in thread

end of thread, other threads:[~2024-04-26  7:33 UTC | newest]

Thread overview: 148+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-27 18:21 [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Sean Christopherson
2023-10-27 18:21 ` [PATCH v13 01/35] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
2023-11-01 12:46   ` Fuad Tabba
2023-10-27 18:21 ` [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative Sean Christopherson
2023-10-30 16:27   ` Paolo Bonzini
2023-11-01 12:46   ` Fuad Tabba
2023-10-27 18:21 ` [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
2023-10-30 16:30   ` Paolo Bonzini
2023-10-30 16:53   ` David Matlack
2023-10-30 17:00     ` Paolo Bonzini
2023-10-30 18:21       ` David Matlack
2023-10-30 18:19     ` David Matlack
2023-11-01 15:31   ` Xu Yilun
2023-10-27 18:21 ` [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction Sean Christopherson
2023-10-30 16:32   ` Paolo Bonzini
2023-11-01 12:50   ` Fuad Tabba
2023-10-27 18:21 ` [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER Sean Christopherson
2023-10-30 16:34   ` Paolo Bonzini
2023-11-01 12:51   ` Fuad Tabba
2023-10-27 18:21 ` [PATCH v13 06/35] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU Sean Christopherson
2023-10-27 18:21 ` [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
2023-10-30 16:37   ` Paolo Bonzini
2023-11-01 12:54   ` Fuad Tabba
2023-10-27 18:21 ` [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
2023-10-30 16:41   ` Paolo Bonzini
2023-10-30 20:25     ` Sean Christopherson
2023-10-30 22:12       ` Sean Christopherson
2023-10-30 23:22       ` Paolo Bonzini
2023-10-31  0:18         ` Sean Christopherson
2023-10-31  2:26   ` Xiaoyao Li
2023-10-31 14:04     ` Sean Christopherson
2023-11-01 14:19   ` Fuad Tabba
2023-10-27 18:21 ` [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
2023-10-30 17:22   ` Paolo Bonzini
2023-11-01  7:30   ` Binbin Wu
2023-11-01 10:52   ` Huang, Kai
2023-11-01 17:36     ` Sean Christopherson
2023-11-02  2:19       ` Xiaoyao Li
2023-11-02 15:51         ` Sean Christopherson
2023-11-02  3:17       ` Huang, Kai
2023-11-02  9:35         ` Huang, Kai
2023-11-02 11:03           ` Paolo Bonzini
2023-11-02 15:44             ` Sean Christopherson
2023-11-02 18:35               ` Huang, Kai
2023-11-02 15:56         ` Sean Christopherson
2023-11-02 11:01       ` Paolo Bonzini
2023-11-03  4:09   ` Xu Yilun
2023-10-27 18:21 ` [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory Sean Christopherson
2023-10-30 17:11   ` Paolo Bonzini
2023-11-02 13:55   ` Fuad Tabba
2023-10-27 18:21 ` [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook Sean Christopherson
2023-10-30 17:18   ` Paolo Bonzini
2023-11-02 13:55   ` Fuad Tabba
2023-10-27 18:21 ` [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events Sean Christopherson
2023-10-30 17:21   ` Paolo Bonzini
2023-10-30 22:07     ` Sean Christopherson
2023-11-02  5:59   ` Binbin Wu
2023-11-02 11:14     ` Paolo Bonzini
2023-11-02 14:01   ` Fuad Tabba
2023-11-02 14:41     ` Sean Christopherson
2023-11-02 14:57       ` Fuad Tabba
2023-10-27 18:21 ` [PATCH v13 13/35] KVM: Introduce per-page memory attributes Sean Christopherson
2023-10-30  8:11   ` Chao Gao
2023-10-30 16:10     ` Sean Christopherson
2023-10-30 22:05       ` Sean Christopherson
2023-10-31 16:43   ` David Matlack
2023-11-02  3:01   ` Huang, Kai
2023-11-02 10:32     ` Paolo Bonzini
2023-11-02 10:55       ` Huang, Kai
2023-10-27 18:21 ` [PATCH v13 14/35] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable Sean Christopherson
2023-10-30 17:24   ` Paolo Bonzini
2023-10-27 18:21 ` [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM Sean Christopherson
2023-10-30 17:30   ` Paolo Bonzini
2023-11-02 16:24   ` Christian Brauner
2023-11-03 10:40     ` Paolo Bonzini
2023-10-27 18:21 ` [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
2023-10-31  2:27   ` Xiaoyao Li
2023-10-31  6:30   ` Chao Gao
2023-10-31 14:10     ` Sean Christopherson
2023-10-31 15:05   ` Fuad Tabba
2023-10-31 22:13     ` Sean Christopherson
2023-10-31 22:18       ` Paolo Bonzini
2023-11-01 10:51       ` Fuad Tabba
2023-11-01 21:55         ` Sean Christopherson
2023-11-02 13:52           ` Fuad Tabba
2023-11-03 23:17             ` Sean Christopherson
2023-10-31 18:24   ` David Matlack
2023-10-31 21:36     ` Sean Christopherson
2023-10-31 22:39       ` David Matlack
2023-11-02 15:48         ` Paolo Bonzini
2023-11-02 16:03           ` Sean Christopherson
2023-11-02 16:28             ` David Matlack
2023-11-02 17:37               ` Sean Christopherson
2023-11-03  9:42   ` Fuad Tabba
2023-11-04 10:26   ` Xu Yilun
2023-11-06 15:43     ` Sean Christopherson
2023-10-27 18:21 ` [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory Sean Christopherson
2023-10-31  8:35   ` Xiaoyao Li
2023-10-31 14:16     ` Sean Christopherson
2023-11-01  7:25       ` Xiaoyao Li
2023-11-01 13:41         ` Sean Christopherson
2023-11-01 13:49           ` Paolo Bonzini
2023-11-01 16:36             ` Sean Christopherson
2023-11-01 22:28               ` Paolo Bonzini
2023-11-01 22:34                 ` Sean Christopherson
2023-11-01 23:17                   ` Paolo Bonzini
2023-11-02 15:38                     ` Sean Christopherson
2023-11-02 15:46                       ` Paolo Bonzini
2023-11-27 11:13                         ` Vlastimil Babka
2023-11-29 22:40                           ` Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN Sean Christopherson
2023-10-30 17:31   ` Paolo Bonzini
2023-11-02 14:16   ` Fuad Tabba
2023-10-27 18:22 ` [PATCH v13 19/35] KVM: x86: Disallow hugepages when memory attributes are mixed Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
2023-11-02 14:34   ` Fuad Tabba
2023-11-05 13:02   ` Xu Yilun
2023-11-05 16:19     ` Paolo Bonzini
2023-11-06 13:29       ` Xu Yilun
2023-11-06 15:56         ` Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 21/35] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro Sean Christopherson
2023-11-02 14:35   ` Fuad Tabba
2023-10-27 18:22 ` [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM Sean Christopherson
2023-10-30 17:34   ` Paolo Bonzini
2023-11-02 14:52   ` Fuad Tabba
2023-10-27 18:22 ` [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory Sean Christopherson
2023-10-30 17:36   ` Paolo Bonzini
2023-11-06 11:00   ` Fuad Tabba
2023-11-06 11:03     ` Paolo Bonzini
2023-10-27 18:22 ` [PATCH v13 24/35] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
2024-04-25 14:12   ` Dan Carpenter
2024-04-25 14:45     ` Shuah Khan
2024-04-25 15:09       ` Sean Christopherson
2024-04-25 16:22         ` Shuah Khan
2024-04-26  7:33         ` Jarkko Sakkinen
2023-10-27 18:22 ` [PATCH v13 26/35] KVM: selftests: Add support for creating private memslots Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 27/35] KVM: selftests: Add helpers to convert guest memory b/w private and shared Sean Christopherson
2023-11-06 11:26   ` Fuad Tabba
2023-10-27 18:22 ` [PATCH v13 28/35] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86) Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 29/35] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 30/35] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 31/35] KVM: selftests: Add x86-only selftest for private memory conversions Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 32/35] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 33/35] KVM: selftests: Expand set_memory_region_test to validate guest_memfd() Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 34/35] KVM: selftests: Add basic selftest for guest_memfd() Sean Christopherson
2023-10-27 18:22 ` [PATCH v13 35/35] KVM: selftests: Test KVM exit behavior for private memory/access Sean Christopherson
2023-10-30 17:39 ` [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes Paolo Bonzini
