All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults.
@ 2023-09-08 22:28 Anish Moorthy
  2023-09-08 22:28 ` [PATCH v5 01/17] KVM: Clarify documentation of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
                   ` (16 more replies)
  0 siblings, 17 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

This series improves the scalability of KVM and userfaultfd migration by
actually providing a mechanism to *bypass* userfaultfd while handling
stage 2 page table violations. A more complete description of the
problem is provided in the cover letter for v4 (linked below).

Major changes in v5
~~~~~~~~~~~~~~~~~~~
* A new union is added to the run struct to store memory fault
  annotations. This resolves the concerns I brought up (see the
  "Important Note") in the cover letter of v4.
* Querying the new memslot flag is moved out of x86/arm64 code and into
  __gfn_to_pfn_memslot() itself.

Outstanding Issues/Details
~~~~~~~~~~~~~~~~~~~~~~~~~~
The two commits for introducing/implementing
KVM_CAP_USERFAULT_ON_MISSING will/should probably generate the most
discussion on this version.

Also, a question: in the user_mem_abort() annotation, should the
annotations be aligned to PAGE_SIZE rather than vma_pagesize?

Base Commit
~~~~~~~~~~~
This series is based off of kvm/next (d011151616e7)

Links/Notes
~~~~~~~~~~~
[1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
[2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
    A quick rundown of the new flags (also detailed in later commits)
        -a registers all of guest memory to a single uffd.
        -r species the number of reader threads for polling the uffd.
        -w is what actually enables the new capabilities.
    All data was collected after applying the entire series
[3] https://lore.kernel.org/kvm/ZBTgnjXJvR8jtc4i@google.com/
[4] https://lore.kernel.org/kvm/ZHkfDCItA8HUxOG1@linux.dev/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
[5] https://lore.kernel.org/kvm/168556721574.515120.10821482819846567909.b4-ty@google.com/T/#t

---

v5
  - Rename APIs (again) [Sean]
  - Initialize hardware_exit_reason along w/ exit_reason on x86 [Isaku]
  - Reword hva_to_pfn_fast() change commit message [Sean]
  - Correct style on terminal if statements [Sean]
  - Switch to kconfig to signal KVM_CAP_USERFAULT_ON_MISSING [Sean]
  - Add read fault flag for annotated faults [Sean]
  - read/write_guest_page() changes
      - Move the annotations into vcpu wrapper fns [Sean]
      - Reorder parameters [Robert]
  - Rename kvm_populate_efault_info() to
    kvm_handle_guest_uaccess_fault() [Sean]
  - Remove unnecessary EINVAL on trying to enable memory fault info cap [Sean]
  - Correct description of the faults which hva_to_pfn_fast() can now
    resolve [Sean]
  - Eliminate unnecessary parameter added to __kvm_faultin_pfn() [Sean]
  - Magnanimously accept Sean's rewrite of the handle_error_pfn()
    annotation [Anish]
  - Remove vcpu null check from kvm_handle_guest_uaccess_fault [Sean]

v4: https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@google.com/T/#t
  - Fix excessive indentation [Robert, Oliver]
  - Calculate final stats when uffd handler fn returns an error [Robert]
  - Remove redundant info from uffd_desc [Robert]
  - Fix various commit message typos [Robert]
  - Add comment about suppressed EEXISTs in selftest [Robert]
  - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
  - Fix some include/logic issues in self test [Robert]
  - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
  - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
  - Drop most of the annotations from v3: see
    https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
  - Remove WARN on bare efaults [Sean, Oliver]
  - Eliminate unnecessary UFFDIO_WAKE call from self test [James]

v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
  - Rework the implementation to be based on two orthogonal
    capabilities (KVM_CAP_MEMORY_FAULT_INFO and
    KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
  - Change return code of kvm_populate_efault_info [Isaku]
  - Use kvm_populate_efault_info from arm code [Oliver]

v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/

    This was a bit of a misfire, as I sent my WIP series on the mailing
    list but was just targeting Sean for some feedback. Oliver Upton and
    Isaku Yamahata ended up discovering the series and giving me some
    feedback anyways, so thanks to them :) In the end, there was enough
    discussion to justify retroactively labeling it as v2, even with the
    limited cc list.

  - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
  - API changes:
        - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
          KVM_CAP_x86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
          requirement).
        - Switched to memslot flag
  - Take Oliver's simplification to the "allow fast gup for readable
    faults" logic.
  - Slightly redefine the return code of user_mem_abort.
  - Fix documentation errors brought up by Marc
  - Reword commit messages in imperative mood

v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/

Anish Moorthy (17):
  KVM: Clarify documentation of hva_to_pfn()'s 'atomic' parameter
  KVM: Add docstrings to __kvm_read/write_guest_page()
  KVM: Simplify error handling in __gfn_to_pfn_memslot()
  KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page()
  KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  KVM: arm64: Annotate -EFAULT from user_mem_abort()
  KVM: Allow hva_pfn_fast() to resolve read faults
  KVM: Introduce KVM_CAP_USERFAULT_ON_MISSING without implementation
  KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing
    __gfn_to_pfn_memslot() calls
  KVM: x86: Enable KVM_CAP_USERFAULT_ON_MISSING
  KVM: arm64: Enable KVM_CAP_USERFAULT_ON_MISSING
  KVM: selftests: Report per-vcpu demand paging rate from demand paging
    test
  KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
    paging test
  KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
    signal errors via TEST_ASSERT
  KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  KVM: selftests: Handle memory fault exits in demand_paging_test

 Documentation/virt/kvm/api.rst                |  77 ++++-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/mmu.c                          |  17 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c           |   3 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c        |   3 +-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        |  21 +-
 include/linux/kvm_host.h                      |  58 +++-
 include/uapi/linux/kvm.h                      |  36 +++
 tools/include/uapi/linux/kvm.h                |  25 ++
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/access_tracking_perf_test.c |   2 +-
 .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
 .../testing/selftests/kvm/include/memstress.h |   2 +-
 .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
 tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
 .../kvm/memslot_modification_stress_test.c    |   2 +-
 .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
 virt/kvm/Kconfig                              |   3 +
 virt/kvm/kvm_main.c                           |  83 +++--
 22 files changed, 633 insertions(+), 184 deletions(-)

-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v5 01/17] KVM: Clarify documentation of hva_to_pfn()'s 'atomic' parameter
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-09-08 22:28 ` [PATCH v5 02/17] KVM: Add docstrings to __kvm_read/write_guest_page() Anish Moorthy
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

The current docstring can be read as "atomic -> allowed to sleep," when
in fact the intended statement is "atomic -> NOT allowed to sleep." Make
that clearer in the docstring.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d63cf1c4f5a7..84e90ed3a134 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2669,7 +2669,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 /*
  * Pin guest page in memory and return its pfn.
  * @addr: host virtual address which maps memory to the guest
- * @atomic: whether this function can sleep
+ * @atomic: whether this function is forbidden from sleeping
  * @interruptible: whether the process can be interrupted by non-fatal signals
  * @async: whether this function need to wait IO complete if the
  *         host page is not in the memory
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 02/17] KVM: Add docstrings to __kvm_read/write_guest_page()
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
  2023-09-08 22:28 ` [PATCH v5 01/17] KVM: Clarify documentation of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-10-05  1:18   ` Sean Christopherson
  2023-09-08 22:28 ` [PATCH v5 03/17] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

The (gfn, data, offset, len) order of parameters is a little strange,
since "offset" actually applies to "gfn" rather than to "data". Add
docstrings to make things perfectly clear.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 84e90ed3a134..12837416ce8a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3014,6 +3014,9 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
+/*
+ * Copy 'len' bytes from guest memory at '(gfn * PAGE_SIZE) + offset' to 'data'
+ */
 static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 				 void *data, int offset, int len)
 {
@@ -3115,6 +3118,9 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
+/*
+ * Copy 'len' bytes from 'data' into guest memory at '(gfn * PAGE_SIZE) + offset'
+ */
 static int __kvm_write_guest_page(struct kvm *kvm,
 				  struct kvm_memory_slot *memslot, gfn_t gfn,
 			          const void *data, int offset, int len)
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 03/17] KVM: Simplify error handling in __gfn_to_pfn_memslot()
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
  2023-09-08 22:28 ` [PATCH v5 01/17] KVM: Clarify documentation of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
  2023-09-08 22:28 ` [PATCH v5 02/17] KVM: Add docstrings to __kvm_read/write_guest_page() Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-09-08 22:28 ` [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

KVM_HVA_ERR_RO_BAD satisfies kvm_is_error_hva(), so there's no need to
duplicate the "if (writable)" block. Fix this by bringing all
kvm_is_error_hva() cases under one conditional.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 12837416ce8a..8b2d5aab32bf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2741,15 +2741,13 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 	if (hva)
 		*hva = addr;
 
-	if (addr == KVM_HVA_ERR_RO_BAD) {
-		if (writable)
-			*writable = false;
-		return KVM_PFN_ERR_RO_FAULT;
-	}
-
 	if (kvm_is_error_hva(addr)) {
 		if (writable)
 			*writable = false;
+
+		if (addr == KVM_HVA_ERR_RO_BAD)
+			return KVM_PFN_ERR_RO_FAULT;
+
 		return KVM_PFN_NOSLOT;
 	}
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (2 preceding siblings ...)
  2023-09-08 22:28 ` [PATCH v5 03/17] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-10-05  1:14   ` Sean Christopherson
  2023-10-10 22:58   ` David Matlack
  2023-09-08 22:28 ` [PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page() Anish Moorthy
                   ` (12 subsequent siblings)
  16 siblings, 2 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
besides a return value of -1 and errno of EFAULT when a vCPU fails an
access to guest memory which may be resolvable by userspace.

Add documentation, updates to the KVM headers, and a helper function
(kvm_handle_guest_uaccess_fault()) for implementing the capability.

Mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86, even
though EFAULT annotation are currently totally absent. Picking a point
to declare the implementation "done" is difficult because

  1. Annotations will be performed incrementally in subsequent commits
     across both core and arch-specific KVM.
  2. The initial series will very likely miss some cases which need
     annotation. Although these omissions are to be fixed in the future,
     userspace thus still needs to expect and be able to handle
     unannotated EFAULTs.

Given these qualifications, just marking it available here seems the
least arbitrary thing to do.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 49 ++++++++++++++++++++++++++++++++--
 include/linux/kvm_host.h       | 35 ++++++++++++++++++++++++
 include/uapi/linux/kvm.h       | 34 +++++++++++++++++++++++
 tools/include/uapi/linux/kvm.h | 24 +++++++++++++++++
 virt/kvm/kvm_main.c            |  3 +++
 5 files changed, 143 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 660d9ca7a251..92fd3faa6bab 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6130,8 +6130,10 @@ local APIC is not used.
 
 	__u16 flags;
 
-More architecture-specific flags detailing state of the VCPU that may
-affect the device's behavior. Current defined flags::
+Flags detailing state of the VCPU. The lower/upper bytes encode archiecture
+specific/agnostic bytes respectively. Current defined flags
+
+::
 
   /* x86, set if the VCPU is in system management mode */
   #define KVM_RUN_X86_SMM     (1 << 0)
@@ -6140,6 +6142,9 @@ affect the device's behavior. Current defined flags::
   /* arm64, set for KVM_EXIT_DEBUG */
   #define KVM_DEBUG_ARCH_HSR_HIGH_VALID  (1 << 0)
 
+  /* Arch-agnostic, see KVM_CAP_MEMORY_FAULT_INFO */
+  #define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
+
 ::
 
 	/* in (pre_kvm_run), out (post_kvm_run) */
@@ -6750,6 +6755,18 @@ kvm_valid_regs for specific bits. These bits are architecture specific
 and usually define the validity of a groups of registers. (e.g. one bit
 for general purpose registers)
 
+::
+	union {
+		/* KVM_SPEC_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len; /* in bytes */
+		} memory_fault;
+
+Indicates a memory fault on the guest physical address range
+[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
+
 Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
@@ -7736,6 +7753,34 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86, arm64
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN may fill
+kvm_run.memory_fault in response to failed guest memory accesses in a vCPU
+context.
+
+When KVM_RUN returns an error with errno=EFAULT, (kvm_run.flags |
+KVM_RUN_MEMORY_FAULT_FILLED) indicates that the kvm_run.memory_fault field is
+valid. This capability is only partially implemented in that not all EFAULTs
+from KVM_RUN may be annotated in this way: these "bare" EFAULTs should be
+considered bugs and reported to the maintainers.
+
+The 'gpa' and 'len' (in bytes) fields of kvm_run.memory_fault describe the range
+of guest physical memory to which access failed, i.e. [gpa, gpa + len). 'flags'
+is a bitfield indicating the nature of the access: valid masks are
+
+  - KVM_MEMORY_FAULT_FLAG_READ:      The failed access was a read.
+  - KVM_MEMORY_FAULT_FLAG_WRITE:     The failed access was a write.
+  - KVM_MEMORY_FAULT_FLAG_EXEC:      The failed access was an exec.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
 8. Other capabilities.
 ======================
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb6c6109fdca..9206ac944d31 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -392,6 +392,12 @@ struct kvm_vcpu {
 	 */
 	struct kvm_memory_slot *last_used_slot;
 	u64 last_used_slot_gen;
+
+	/*
+	 * KVM_RUN initializes this value to KVM_SPEC_EXIT_UNUSED on entry and
+	 * sets it to something else when it fills the speculative_exit struct.
+	 */
+	u8 speculative_exit_canary;
 };
 
 /*
@@ -2318,4 +2324,33 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+/*
+ * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
+ * populate the memory_fault field with the given information.
+ *
+ * WARNs and does nothing if the speculative exit canary has already been set
+ * or if 'vcpu' is not the current running vcpu.
+ */
+static inline void kvm_handle_guest_uaccess_fault(struct kvm_vcpu *vcpu,
+						  uint64_t gpa, uint64_t len, uint64_t flags)
+{
+	/*
+	 * Ensure that an unloaded vCPU's run struct isn't being modified
+	 */
+	if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
+		return;
+
+	/*
+	 * Warn when overwriting an already-populated run struct.
+	 */
+	WARN_ON_ONCE(vcpu->speculative_exit_canary != KVM_SPEC_EXIT_UNUSED);
+
+	vcpu->speculative_exit_canary = KVM_SPEC_EXIT_MEMORY_FAULT;
+
+	vcpu->run->flags |= KVM_RUN_MEMORY_FAULT_FILLED;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.len = len;
+	vcpu->run->memory_fault.flags = flags;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f089ab290978..b2e4ac83b5a8 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -265,6 +265,9 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
 
+#define KVM_SPEC_EXIT_UNUSED           0
+#define KVM_SPEC_EXIT_MEMORY_FAULT     1
+
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
 #define KVM_INTERNAL_ERROR_EMULATION	1
@@ -278,6 +281,9 @@ struct kvm_xen_exit {
 /* Flags that describe what fields in emulation_failure hold valid data. */
 #define KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES (1ULL << 0)
 
+/* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
+#define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
+
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
@@ -531,6 +537,27 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	/*
+	 * This second exit union holds structs for exits which may be triggered
+	 * after KVM has already initiated a different exit, and/or may be
+	 * filled speculatively by KVM.
+	 *
+	 * For instance, because of limitations in KVM's uAPI, a memory fault
+	 * may be encounterd after an MMIO exit is initiated and exit_reason and
+	 * kvm_run.mmio are filled: isolating the speculative exits here ensures
+	 * that KVM won't clobber information for the original exit.
+	 */
+	union {
+		/* KVM_SPEC_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len;
+		} memory_fault;
+		/* Fix the size of the union. */
+		char speculative_exit_padding[256];
+	};
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1192,6 +1219,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_MEMORY_FAULT_INFO 230
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2249,4 +2277,10 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* flags for KVM_CAP_MEMORY_FAULT_INFO */
+
+#define KVM_MEMORY_FAULT_FLAG_READ     (1 << 0)
+#define KVM_MEMORY_FAULT_FLAG_WRITE    (1 << 1)
+#define KVM_MEMORY_FAULT_FLAG_EXEC     (1 << 2)
+
 #endif /* __LINUX_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index f089ab290978..d19aa7965392 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -278,6 +278,9 @@ struct kvm_xen_exit {
 /* Flags that describe what fields in emulation_failure hold valid data. */
 #define KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES (1ULL << 0)
 
+/* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
+#define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
+
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
@@ -531,6 +534,27 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	/*
+	 * This second exit union holds structs for exits which may be triggered
+	 * after KVM has already initiated a different exit, and/or may be
+	 * filled speculatively by KVM.
+	 *
+	 * For instance, because of limitations in KVM's uAPI, a memory fault
+	 * may be encounterd after an MMIO exit is initiated and exit_reason and
+	 * kvm_run.mmio are filled: isolating the speculative exits here ensures
+	 * that KVM won't clobber information for the original exit.
+	 */
+	union {
+		/* KVM_RUN_MEMORY_FAULT_FILLED + EFAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len;
+		} memory_fault;
+		/* Fix the size of the union. */
+		char speculative_exit_padding[256];
+	};
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b2d5aab32bf..e31435179764 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4151,6 +4151,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
 				synchronize_rcu();
 			put_pid(oldpid);
 		}
+		vcpu->speculative_exit_canary = KVM_SPEC_EXIT_UNUSED;
+		vcpu->run->flags &= ~KVM_RUN_MEMORY_FAULT_FILLED;
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;
@@ -4539,6 +4541,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_CHECK_EXTENSION_VM:
 	case KVM_CAP_ENABLE_CAP_VM:
 	case KVM_CAP_HALT_POLL:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		return 1;
 #ifdef CONFIG_KVM_MMIO
 	case KVM_CAP_COALESCED_MMIO:
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page()
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (3 preceding siblings ...)
  2023-09-08 22:28 ` [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-09-14  8:04   ` kernel test robot
                     ` (2 more replies)
  2023-09-08 22:28 ` [PATCH v5 06/17] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
                   ` (11 subsequent siblings)
  16 siblings, 3 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

Implement KVM_CAP_MEMORY_FAULT_INFO for uaccess failures in
kvm_vcpu_read/write_guest_page()

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e31435179764..13aa2ed11d0d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3043,8 +3043,12 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
 			     int offset, int len)
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+	int r = __kvm_read_guest_page(slot, gfn, data, offset, len);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	if (r)
+		kvm_handle_guest_uaccess_fault(vcpu, gfn * PAGE_SIZE + offset,
+					       len, KVM_MEMORY_FAULT_FLAG_READ);
+	return r;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
 
@@ -3149,8 +3153,12 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 			      const void *data, int offset, int len)
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+	int r = __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
 
-	return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
+	if (r)
+		kvm_handle_guest_uaccess_fault(vcpu, gfn * PAGE_SIZE + offset,
+					       len, KVM_MEMORY_FAULT_FLAG_WRITE);
+	return r;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 06/17] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (4 preceding siblings ...)
  2023-09-08 22:28 ` [PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page() Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-10-05  1:26   ` Sean Christopherson
  2023-09-08 22:28 ` [PATCH v5 07/17] KVM: arm64: Annotate -EFAULT from user_mem_abort() Anish Moorthy
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_handle_error_pfn().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e1d011c67cc6..deae8ac74d9a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3267,6 +3267,8 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
 
 static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
+	u64 fault_flags;
+
 	if (is_sigpending_pfn(fault->pfn)) {
 		kvm_handle_signal_exit(vcpu);
 		return -EINTR;
@@ -3285,6 +3287,17 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
 		return RET_PF_RETRY;
 	}
 
+	WARN_ON_ONCE(fault->goal_level != PG_LEVEL_4K);
+
+	fault_flags = 0;
+	if (fault->write)
+		fault_flags = KVM_MEMORY_FAULT_FLAG_WRITE;
+	else if (fault->exec)
+		fault_flags = KVM_MEMORY_FAULT_FLAG_EXEC;
+	else
+		fault_flags = KVM_MEMORY_FAULT_FLAG_READ;
+	kvm_handle_guest_uaccess_fault(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
+				       fault_flags);
 	return -EFAULT;
 }
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 07/17] KVM: arm64: Annotate -EFAULT from user_mem_abort()
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (5 preceding siblings ...)
  2023-09-08 22:28 ` [PATCH v5 06/17] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-09-28 21:42   ` Anish Moorthy
                     ` (2 more replies)
  2023-09-08 22:28 ` [PATCH v5 08/17] KVM: Allow hva_pfn_fast() to resolve read faults Anish Moorthy
                   ` (9 subsequent siblings)
  16 siblings, 3 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

Implement KVM_CAP_MEMORY_FAULT_INFO for guest access failure in
user_mem_abort().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/arm64/kvm/mmu.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 587a104f66c3..8ede6c5edc5f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1408,6 +1408,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
+	uint64_t memory_fault_flags;
 
 	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
 	write_fault = kvm_is_write_fault(vcpu);
@@ -1507,8 +1508,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
 	}
-	if (is_error_noslot_pfn(pfn))
+	if (is_error_noslot_pfn(pfn)) {
+		memory_fault_flags = 0;
+		if (write_fault)
+			memory_fault_flags = KVM_MEMORY_FAULT_FLAG_EXEC;
+		else if (exec_fault)
+			memory_fault_flags = KVM_MEMORY_FAULT_FLAG_EXEC;
+		else
+			memory_fault_flags = KVM_MEMORY_FAULT_FLAG_READ;
+		kvm_handle_guest_uaccess_fault(vcpu, round_down(gfn * PAGE_SIZE, vma_pagesize),
+					       vma_pagesize, memory_fault_flags);
 		return -EFAULT;
+	}
 
 	if (kvm_is_device_pfn(pfn)) {
 		/*
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 08/17] KVM: Allow hva_pfn_fast() to resolve read faults
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (6 preceding siblings ...)
  2023-09-08 22:28 ` [PATCH v5 07/17] KVM: arm64: Annotate -EFAULT from user_mem_abort() Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-09-08 22:28 ` [PATCH v5 09/17] KVM: Introduce KVM_CAP_USERFAULT_ON_MISSING without implementation Anish Moorthy
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

hva_to_pfn_fast() currently just fails for faults where establishing
writable mappings is forbidden, which is unnecessary. Instead, try
getting the page without passing FOLL_WRITE. This allows the
aforementioned faults to (potentially) be resolved without falling back
to slow GUP.

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13aa2ed11d0d..a7e6320dd7f0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2508,7 +2508,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 }
 
 /*
- * The fast path to get the writable pfn which will be stored in @pfn,
+ * The fast path to get the pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
  * only part that runs if we can in atomic context.
  */
@@ -2522,10 +2522,9 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 	 * or the caller allows to map a writable pfn for a read fault
 	 * request.
 	 */
-	if (!(write_fault || writable))
-		return false;
+	unsigned int gup_flags = (write_fault || writable) ? FOLL_WRITE : 0;
 
-	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+	if (get_user_page_fast_only(addr, gup_flags, page)) {
 		*pfn = page_to_pfn(page[0]);
 
 		if (writable)
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 09/17] KVM: Introduce KVM_CAP_USERFAULT_ON_MISSING without implementation
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (7 preceding siblings ...)
  2023-09-08 22:28 ` [PATCH v5 08/17] KVM: Allow hva_pfn_fast() to resolve read faults Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-10-10 23:16   ` David Matlack
  2023-09-08 22:28 ` [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls Anish Moorthy
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

Add documentation, memslot flags, useful helper functions, and the
definition of the capability. Implementation is provided in a subsequent
commit.

Memory fault exits on absent mappings are particularly useful for
userfaultfd-based postcopy live migration, where contention within uffd
can lead to slowness When many vCPUs fault on a single uffd/vma.
Bypassing the uffd entirely by returning information directly to the
vCPU via an exit avoids contention and can greatly improves the fault
rate.

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 28 +++++++++++++++++++++++++---
 include/linux/kvm_host.h       |  9 +++++++++
 include/uapi/linux/kvm.h       |  2 ++
 tools/include/uapi/linux/kvm.h |  1 +
 virt/kvm/Kconfig               |  3 +++
 virt/kvm/kvm_main.c            |  5 +++++
 6 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 92fd3faa6bab..c2eaacb6dc63 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
   /* for kvm_userspace_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_USERFAULT_ON_MISSING  (1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1342,12 +1343,15 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
+The flags field supports three flags
+
+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
 writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
+use it.
+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
 to make a new slot read-only.  In this case, writes to this memory will be
 posted to userspace as KVM_EXIT_MMIO exits.
+3.  KVM_MEM_USERFAULT_ON_MISSING: see KVM_CAP_USERFAULT_ON_MISSING for details.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
@@ -7781,6 +7785,24 @@ Note: Userspaces which attempt to resolve memory faults so that they can retry
 KVM_RUN are encouraged to guard against repeatedly receiving the same
 error/annotated fault.
 
+7.35 KVM_CAP_USERFAULT_ON_MISSING
+---------------------------------
+
+:Architectures: None
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that userspace may set the
+KVM_MEM_USERFAULT_ON_MISSING on memslots (via KVM_SET_USER_MEMORY_REGION). Said
+flag will cause KVM_RUN to fail (-EFAULT) in response to guest-context memory
+accesses which would require KVM to page fault on the userspace mapping.
+
+The range of guest physical memory causing the fault is advertised to userspace
+through KVM_CAP_MEMORY_FAULT_INFO. Userspace should determine how best to make
+the mapping present, take appropriate action, then return to KVM_RUN to retry
+the access.
+
+Attempts to enable this capability directly will fail.
+
 8. Other capabilities.
 ======================
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9206ac944d31..db5c3eae58fe 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2353,4 +2353,13 @@ static inline void kvm_handle_guest_uaccess_fault(struct kvm_vcpu *vcpu,
 	vcpu->run->memory_fault.flags = flags;
 }
 
+/*
+ * Whether non-atomic accesses to the userspace mapping of the memslot should
+ * be upgraded when possible.
+ */
+static inline bool kvm_is_slot_userfault_on_missing(const struct kvm_memory_slot *slot)
+{
+	return slot && slot->flags & KVM_MEM_USERFAULT_ON_MISSING;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b2e4ac83b5a8..a21921e4ee2a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_USERFAULT_ON_MISSING	(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1220,6 +1221,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_MEMORY_FAULT_INFO 230
+#define KVM_CAP_USERFAULT_ON_MISSING 231
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index d19aa7965392..188be8549070 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_USERFAULT_ON_MISSING (1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 484d0873061c..906878438687 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -92,3 +92,6 @@ config HAVE_KVM_PM_NOTIFIER
 
 config KVM_GENERIC_HARDWARE_ENABLING
        bool
+
+config HAVE_KVM_USERFAULT_ON_MISSING
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a7e6320dd7f0..aa81e41b1488 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1553,6 +1553,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
 	valid_flags |= KVM_MEM_READONLY;
 #endif
 
+	if (IS_ENABLED(CONFIG_HAVE_KVM_USERFAULT_ON_MISSING))
+		valid_flags |= KVM_MEM_USERFAULT_ON_MISSING;
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
@@ -4588,6 +4591,8 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA:
 		return 1;
+	case KVM_CAP_USERFAULT_ON_MISSING:
+		return IS_ENABLED(CONFIG_HAVE_KVM_USERFAULT_ON_MISSING);
 	default:
 		break;
 	}
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (8 preceding siblings ...)
  2023-09-08 22:28 ` [PATCH v5 09/17] KVM: Introduce KVM_CAP_USERFAULT_ON_MISSING without implementation Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-10-05  1:44   ` Sean Christopherson
  2023-09-08 22:28 ` [PATCH v5 11/17] KVM: x86: Enable KVM_CAP_USERFAULT_ON_MISSING Anish Moorthy
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

Change the "atomic" parameter of __gfn_to_pfn_memslot() to an enum which
reflects the possibility of allowig non-atomic accesses (GUP calls)
being "upgraded" to atomic, and mark locations where such upgrading is
allowed.

Concerning gfn_to_pfn_prot(): this function is unused on x86, and the
only usage on arm64 is from a codepath where upgrading gup calls to
atomic based on the memslot is undesirable. Therefore, punt on adding
any plumbing to expose the 'atomicity' parameter.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/arm64/kvm/mmu.c                   |  4 +--
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  3 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  3 +-
 arch/x86/kvm/mmu/mmu.c                 |  8 +++---
 include/linux/kvm_host.h               | 14 +++++++++-
 virt/kvm/kvm_main.c                    | 38 ++++++++++++++++++++------
 6 files changed, 53 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8ede6c5edc5f..ac77ae5b5d2b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1502,8 +1502,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	mmap_read_unlock(current->mm);
 
-	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-				   write_fault, &writable, NULL);
+	pfn = __gfn_to_pfn_memslot(memslot, gfn, MEMSLOT_ACCESS_NONATOMIC_MAY_UPGRADE,
+				   false, NULL, write_fault, &writable, NULL);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 7f765d5ad436..ab7caa86aa16 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -612,7 +612,8 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
 		write_ok = true;
 	} else {
 		/* Call KVM generic code to do the slow-path check */
-		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
+		pfn = __gfn_to_pfn_memslot(memslot, gfn, MEMSLOT_ACCESS_FORCE_ALLOW_NONATOMIC,
+					   false, NULL,
 					   writing, &write_ok, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 572707858d65..3fa05c8e96b0 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -846,7 +846,8 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 		unsigned long pfn;
 
 		/* Call KVM generic code to do the slow-path check */
-		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
+		pfn = __gfn_to_pfn_memslot(memslot, gfn, MEMSLOT_ACCESS_FORCE_ALLOW_NONATOMIC,
+					   false, NULL,
 					   writing, upgrade_p, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index deae8ac74d9a..43516eb50e06 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4297,8 +4297,8 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	}
 
 	async = false;
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
-					  fault->write, &fault->map_writable,
+	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, MEMSLOT_ACCESS_NONATOMIC_MAY_UPGRADE,
+					  false, &async, fault->write, &fault->map_writable,
 					  &fault->hva);
 	if (!async)
 		return RET_PF_CONTINUE; /* *pfn has correct page already */
@@ -4319,8 +4319,8 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 * to wait for IO.  Note, gup always bails if it is unable to quickly
 	 * get a page and a fatal signal, i.e. SIGKILL, is pending.
 	 */
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
-					  fault->write, &fault->map_writable,
+	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, MEMSLOT_ACCESS_NONATOMIC_MAY_UPGRADE,
+					  true, NULL, fault->write, &fault->map_writable,
 					  &fault->hva);
 	return RET_PF_CONTINUE;
 }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index db5c3eae58fe..fdd386e1d3c0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1188,8 +1188,20 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable);
 kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn);
+enum memslot_access_atomicity {
+	/* Force atomic access */
+	MEMSLOT_ACCESS_ATOMIC,
+	/*
+	 * Ask for non-atomic access, but allow upgrading to atomic depending
+	 * on the memslot
+	 */
+	MEMSLOT_ACCESS_NONATOMIC_MAY_UPGRADE,
+	/* Force non-atomic access */
+	MEMSLOT_ACCESS_FORCE_ALLOW_NONATOMIC
+};
 kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
-			       bool atomic, bool interruptible, bool *async,
+			       enum memslot_access_atomicity atomicity,
+			       bool interruptible, bool *async,
 			       bool write_fault, bool *writable, hva_t *hva);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index aa81e41b1488..d4f4ccb29e6d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2735,9 +2735,11 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
 }
 
 kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
-			       bool atomic, bool interruptible, bool *async,
+			       enum memslot_access_atomicity atomicity,
+			       bool interruptible, bool *async,
 			       bool write_fault, bool *writable, hva_t *hva)
 {
+	bool atomic;
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
 	if (hva)
@@ -2759,6 +2761,23 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 		writable = NULL;
 	}
 
+	if (atomicity == MEMSLOT_ACCESS_ATOMIC) {
+		atomic = true;
+	} else if (atomicity == MEMSLOT_ACCESS_NONATOMIC_MAY_UPGRADE) {
+		atomic = false;
+		if (kvm_is_slot_userfault_on_missing(slot)) {
+			atomic = true;
+			if (async) {
+				*async = false;
+				async = NULL;
+			}
+		}
+	} else if (atomicity == MEMSLOT_ACCESS_FORCE_ALLOW_NONATOMIC) {
+		atomic = false;
+	} else {
+		BUG();
+	}
+
 	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
 			  writable);
 }
@@ -2767,22 +2786,23 @@ EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
-	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
-				    NULL, write_fault, writable, NULL);
+	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn,
+				    MEMSLOT_ACCESS_FORCE_ALLOW_NONATOMIC,
+				    false, NULL, write_fault, writable, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
 kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
-				    NULL, NULL);
+	return __gfn_to_pfn_memslot(slot, gfn, MEMSLOT_ACCESS_NONATOMIC_MAY_UPGRADE,
+				    false, NULL, true, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
-				    NULL, NULL);
+	return __gfn_to_pfn_memslot(slot, gfn, MEMSLOT_ACCESS_ATOMIC,
+				    false, NULL, true, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
@@ -2862,7 +2882,9 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
 	if (!map)
 		return -EINVAL;
 
-	pfn = gfn_to_pfn(vcpu->kvm, gfn);
+	pfn = __gfn_to_pfn_memslot(gfn_to_memslot(vcpu->kvm, gfn), gfn,
+				   MEMSLOT_ACCESS_FORCE_ALLOW_NONATOMIC,
+				   false, NULL, true, NULL, NULL);
 	if (is_error_noslot_pfn(pfn))
 		return -EINVAL;
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 11/17] KVM: x86: Enable KVM_CAP_USERFAULT_ON_MISSING
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (9 preceding siblings ...)
  2023-09-08 22:28 ` [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-10-05  1:52   ` Sean Christopherson
  2023-09-08 22:28 ` [PATCH v5 12/17] KVM: arm64: " Anish Moorthy
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

The relevant __gfn_to_pfn_memslot() calls in __kvm_faultin_pfn()
already use MEMSLOT_ACCESS_NONATOMIC_MAY_UPGRADE.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 2 +-
 arch/x86/kvm/Kconfig           | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c2eaacb6dc63..a74d721a18f6 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7788,7 +7788,7 @@ error/annotated fault.
 7.35 KVM_CAP_USERFAULT_ON_MISSING
 ---------------------------------
 
-:Architectures: None
+:Architectures: x86
 :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
 
 The presence of this capability indicates that userspace may set the
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ed90f148140d..11d956f17a9d 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -49,6 +49,7 @@ config KVM
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
+        select HAVE_KVM_USERFAULT_ON_MISSING
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 12/17] KVM: arm64: Enable KVM_CAP_USERFAULT_ON_MISSING
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (10 preceding siblings ...)
  2023-09-08 22:28 ` [PATCH v5 11/17] KVM: x86: Enable KVM_CAP_USERFAULT_ON_MISSING Anish Moorthy
@ 2023-09-08 22:28 ` Anish Moorthy
  2023-09-08 22:29 ` [PATCH v5 13/17] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:28 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

The relevant __gfn_to_pfn_memslot() call in user_mem_abort() already
uses MEMSLOT_ACCESS_NONATOMIC_MAY_UPGRADE.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 2 +-
 arch/arm64/kvm/Kconfig         | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index a74d721a18f6..b0b1124277e8 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7788,7 +7788,7 @@ error/annotated fault.
 7.35 KVM_CAP_USERFAULT_ON_MISSING
 ---------------------------------
 
-:Architectures: x86
+:Architectures: x86, arm64
 :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
 
 The presence of this capability indicates that userspace may set the
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 83c1e09be42e..d966a955d876 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -43,6 +43,7 @@ menuconfig KVM
 	select GUEST_PERF_EVENTS if PERF_EVENTS
 	select INTERVAL_TREE
 	select XARRAY_MULTI
+        select HAVE_KVM_USERFAULT_ON_MISSING
 	help
 	  Support hosting virtualized guest machines.
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 13/17] KVM: selftests: Report per-vcpu demand paging rate from demand paging test
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (11 preceding siblings ...)
  2023-09-08 22:28 ` [PATCH v5 12/17] KVM: arm64: " Anish Moorthy
@ 2023-09-08 22:29 ` Anish Moorthy
  2023-09-08 22:29 ` [PATCH v5 14/17] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:29 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

Using the overall demand paging rate to measure performance can be
slightly misleading when vCPU accesses are not overlapped. Adding more
vCPUs will (usually) increase the overall demand paging rate even
if performance remains constant or even degrades on a per-vcpu basis. As
such, it makes sense to report both the total and per-vcpu paging rates.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 tools/testing/selftests/kvm/demand_paging_test.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 09c116a82a84..6dc823fa933a 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -135,6 +135,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
 	int i;
+	double vcpu_paging_rate;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -191,11 +192,17 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			uffd_stop_demand_paging(uffd_descs[i]);
 	}
 
-	pr_info("Total guest execution time: %ld.%.9lds\n",
+	pr_info("Total guest execution time:\t%ld.%.9lds\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
-	pr_info("Overall demand paging rate: %f pgs/sec\n",
-		memstress_args.vcpu_args[0].pages * nr_vcpus /
-		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / NSEC_PER_SEC));
+
+	vcpu_paging_rate =
+		memstress_args.vcpu_args[0].pages
+		/ ((double)ts_diff.tv_sec
+			+ (double)ts_diff.tv_nsec / NSEC_PER_SEC);
+	pr_info("Per-vcpu demand paging rate:\t%f pgs/sec/vcpu\n",
+		vcpu_paging_rate);
+	pr_info("Overall demand paging rate:\t%f pgs/sec\n",
+		vcpu_paging_rate * nr_vcpus);
 
 	memstress_destroy_vm(vm);
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 14/17] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (12 preceding siblings ...)
  2023-09-08 22:29 ` [PATCH v5 13/17] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
@ 2023-09-08 22:29 ` Anish Moorthy
  2023-09-08 22:29 ` [PATCH v5 15/17] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:29 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

At the moment, demand_paging_test does not support profiling/testing
multiple vCPU threads concurrently faulting on a single uffd because

    (a) "-u" (run test in userfaultfd mode) creates a uffd for each vCPU's
        region, so that each uffd services a single vCPU thread.
    (b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
        simply doesn't work: the test tries to register the same memory
        to multiple uffds, causing an error.

Add support for many vcpus per uffd by
    (1) Keeping "-u" behavior unchanged.
    (2) Making "-u -a" create a single uffd for all of guest memory.
    (3) Making "-u -o" implicitly pass "-a", solving the problem in (b).
In cases (2) and (3) all vCPU threads fault on a single uffd.

With potentially multiple vCPUs per UFFD, it makes sense to allow
configuring the number of reader threads per UFFD as well: add the "-r"
flag to do so.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/aarch64/page_fault_test.c   |  4 +-
 .../selftests/kvm/demand_paging_test.c        | 76 +++++++++++++---
 .../selftests/kvm/include/userfaultfd_util.h  | 17 +++-
 .../selftests/kvm/lib/userfaultfd_util.c      | 87 +++++++++++++------
 4 files changed, 137 insertions(+), 47 deletions(-)

diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index 47bb914ab2fa..e5ca4b4be903 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -375,14 +375,14 @@ static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
 		*pt_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						    pt_args.hva,
 						    pt_args.paging_size,
-						    test->uffd_pt_handler);
+						    1, test->uffd_pt_handler);
 
 	*data_uffd = NULL;
 	if (test->uffd_data_handler)
 		*data_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						      data_args.hva,
 						      data_args.paging_size,
-						      test->uffd_data_handler);
+						      1, test->uffd_data_handler);
 }
 
 static void free_uffd(struct test_desc *test, struct uffd_desc *pt_uffd,
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 6dc823fa933a..f7897a951f90 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -77,8 +77,20 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		copy.mode = 0;
 
 		r = ioctl(uffd, UFFDIO_COPY, &copy);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
+		/*
+		 * With multiple vCPU threads fault on a single page and there are
+		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
+		 * will fail with EEXIST: handle that case without signaling an
+		 * error.
+		 *
+		 * Note that this also suppress any EEXISTs occurring from,
+		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
+		 * happens here, but a realistic VMM might potentially maintain
+		 * some external state to correctly surface EEXISTs to userspace
+		 * (or prevent duplicate COPY/CONTINUEs in the first place).
+		 */
+		if (r == -1 && errno != EEXIST) {
+			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -89,8 +101,20 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		cont.range.len = demand_paging_size;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
+		/*
+		 * With multiple vCPU threads fault on a single page and there are
+		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
+		 * will fail with EEXIST: handle that case without signaling an
+		 * error.
+		 *
+		 * Note that this also suppress any EEXISTs occurring from,
+		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
+		 * happens here, but a realistic VMM might potentially maintain
+		 * some external state to correctly surface EEXISTs to userspace
+		 * (or prevent duplicate COPY/CONTINUEs in the first place).
+		 */
+		if (r == -1 && errno != EEXIST) {
+			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -110,7 +134,9 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 
 struct test_params {
 	int uffd_mode;
+	bool single_uffd;
 	useconds_t uffd_delay;
+	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
 };
@@ -134,8 +160,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i;
+	int i, num_uffds = 0;
 	double vcpu_paging_rate;
+	uint64_t uffd_region_size;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -148,7 +175,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
 	if (p->uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
-		for (i = 0; i < nr_vcpus; i++) {
+		num_uffds = p->single_uffd ? 1 : nr_vcpus;
+		for (i = 0; i < num_uffds; i++) {
 			vcpu_args = &memstress_args.vcpu_args[i];
 			prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
 				     vcpu_args->pages * memstress_args.guest_page_size);
@@ -156,9 +184,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	}
 
 	if (p->uffd_mode) {
-		uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *));
+		num_uffds = p->single_uffd ? 1 : nr_vcpus;
+		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
+
+		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
 		TEST_ASSERT(uffd_descs, "Memory allocation failed");
-		for (i = 0; i < nr_vcpus; i++) {
+		for (i = 0; i < num_uffds; i++) {
+			struct memstress_vcpu_args *vcpu_args;
 			void *vcpu_hva;
 
 			vcpu_args = &memstress_args.vcpu_args[i];
@@ -171,7 +203,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
 				p->uffd_mode, p->uffd_delay, vcpu_hva,
-				vcpu_args->pages * memstress_args.guest_page_size,
+				uffd_region_size,
+				p->readers_per_uffd,
 				&handle_uffd_page_request);
 		}
 	}
@@ -188,7 +221,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	if (p->uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
-		for (i = 0; i < nr_vcpus; i++)
+		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
 	}
 
@@ -214,15 +247,20 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-d uffd_delay_usec]\n"
-	       "          [-b memory] [-s type] [-v vcpus] [-c cpu_list] [-o]\n", name);
+	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
+		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
+		   "          [-s type] [-v vcpus] [-c cpu_list] [-o]\n", name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
 	kvm_print_vcpu_pinning_help();
+	printf(" -a: Use a single userfaultfd for all of guest memory, instead of\n"
+	       "     creating one for each region paged by a unique vCPU\n"
+	       "     Set implicitly with -o, and no effect without -u.\n");
 	printf(" -d: add a delay in usec to the User Fault\n"
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
+	printf(" -r: Set the number of reader threads per uffd.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -241,12 +279,14 @@ int main(int argc, char *argv[])
 	struct test_params p = {
 		.src_type = DEFAULT_VM_MEM_SRC,
 		.partition_vcpu_memory_access = true,
+		.readers_per_uffd = 1,
+		.single_uffd = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:c:o")) != -1) {
+	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:c:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -258,6 +298,9 @@ int main(int argc, char *argv[])
 				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
 			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
+		case 'a':
+			p.single_uffd = true;
+			break;
 		case 'd':
 			p.uffd_delay = strtoul(optarg, NULL, 0);
 			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
@@ -278,6 +321,13 @@ int main(int argc, char *argv[])
 			break;
 		case 'o':
 			p.partition_vcpu_memory_access = false;
+			p.single_uffd = true;
+			break;
+		case 'r':
+			p.readers_per_uffd = atoi(optarg);
+			TEST_ASSERT(p.readers_per_uffd >= 1,
+				    "Invalid number of readers per uffd %d: must be >=1",
+				    p.readers_per_uffd);
 			break;
 		case 'h':
 		default:
diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 877449c34592..af83a437e74a 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -17,18 +17,27 @@
 
 typedef int (*uffd_handler_t)(int uffd_mode, int uffd, struct uffd_msg *msg);
 
-struct uffd_desc {
+struct uffd_reader_args {
 	int uffd_mode;
 	int uffd;
-	int pipefds[2];
 	useconds_t delay;
 	uffd_handler_t handler;
-	pthread_t thread;
+	/* Holds the read end of the pipe for killing the reader. */
+	int pipe;
+};
+
+struct uffd_desc {
+	int uffd;
+	uint64_t num_readers;
+	/* Holds the write ends of the pipes for killing the readers. */
+	int *pipefds;
+	pthread_t *readers;
+	struct uffd_reader_args *reader_args;
 };
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler);
+					   uint64_t num_readers, uffd_handler_t handler);
 
 void uffd_stop_demand_paging(struct uffd_desc *uffd);
 
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 271f63891581..6f220aa4fb08 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -27,10 +27,8 @@
 
 static void *uffd_handler_thread_fn(void *arg)
 {
-	struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
-	int uffd = uffd_desc->uffd;
-	int pipefd = uffd_desc->pipefds[0];
-	useconds_t delay = uffd_desc->delay;
+	struct uffd_reader_args *reader_args = (struct uffd_reader_args *)arg;
+	int uffd = reader_args->uffd;
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
@@ -44,7 +42,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 		pollfd[0].fd = uffd;
 		pollfd[0].events = POLLIN;
-		pollfd[1].fd = pipefd;
+		pollfd[1].fd = reader_args->pipe;
 		pollfd[1].events = POLLIN;
 
 		r = poll(pollfd, 2, -1);
@@ -92,9 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
 
-		if (delay)
-			usleep(delay);
-		r = uffd_desc->handler(uffd_desc->uffd_mode, uffd, &msg);
+		if (reader_args->delay)
+			usleep(reader_args->delay);
+		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
 		if (r < 0)
 			return NULL;
 		pages++;
@@ -110,7 +108,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler)
+					   uint64_t num_readers, uffd_handler_t handler)
 {
 	struct uffd_desc *uffd_desc;
 	bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
@@ -118,14 +116,26 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	struct uffdio_api uffdio_api;
 	struct uffdio_register uffdio_register;
 	uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY;
-	int ret;
+	int ret, i;
 
 	PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n",
 		       is_minor ? "MINOR" : "MISSING",
 		       is_minor ? "UFFDIO_CONINUE" : "UFFDIO_COPY");
 
 	uffd_desc = malloc(sizeof(struct uffd_desc));
-	TEST_ASSERT(uffd_desc, "malloc failed");
+	TEST_ASSERT(uffd_desc, "Failed to malloc uffd descriptor");
+
+	uffd_desc->pipefds = malloc(sizeof(int) * num_readers);
+	TEST_ASSERT(uffd_desc->pipefds, "Failed to malloc pipes");
+
+	uffd_desc->readers = malloc(sizeof(pthread_t) * num_readers);
+	TEST_ASSERT(uffd_desc->readers, "Failed to malloc reader threads");
+
+	uffd_desc->reader_args = malloc(
+		sizeof(struct uffd_reader_args) * num_readers);
+	TEST_ASSERT(uffd_desc->reader_args, "Failed to malloc reader_args");
+
+	uffd_desc->num_readers = num_readers;
 
 	/* In order to get minor faults, prefault via the alias. */
 	if (is_minor)
@@ -148,18 +158,28 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
 		    expected_ioctls, "missing userfaultfd ioctls");
 
-	ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
-	TEST_ASSERT(!ret, "Failed to set up pipefd");
-
-	uffd_desc->uffd_mode = uffd_mode;
 	uffd_desc->uffd = uffd;
-	uffd_desc->delay = delay;
-	uffd_desc->handler = handler;
-	pthread_create(&uffd_desc->thread, NULL, uffd_handler_thread_fn,
-		       uffd_desc);
+	for (i = 0; i < uffd_desc->num_readers; ++i) {
+		int pipes[2];
+
+		ret = pipe2((int *) &pipes, O_CLOEXEC | O_NONBLOCK);
+		TEST_ASSERT(!ret, "Failed to set up pipefd %i for uffd_desc %p",
+			    i, uffd_desc);
+
+		uffd_desc->pipefds[i] = pipes[1];
 
-	PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n",
-		       hva, hva + len);
+		uffd_desc->reader_args[i].uffd_mode = uffd_mode;
+		uffd_desc->reader_args[i].uffd = uffd;
+		uffd_desc->reader_args[i].delay = delay;
+		uffd_desc->reader_args[i].handler = handler;
+		uffd_desc->reader_args[i].pipe = pipes[0];
+
+		pthread_create(&uffd_desc->readers[i], NULL, uffd_handler_thread_fn,
+			       &uffd_desc->reader_args[i]);
+
+		PER_VCPU_DEBUG("Created uffd thread %i for HVA range [%p, %p)\n",
+			       i, hva, hva + len);
+	}
 
 	return uffd_desc;
 }
@@ -167,19 +187,30 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 void uffd_stop_demand_paging(struct uffd_desc *uffd)
 {
 	char c = 0;
-	int ret;
+	int i, ret;
 
-	ret = write(uffd->pipefds[1], &c, 1);
-	TEST_ASSERT(ret == 1, "Unable to write to pipefd");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = write(uffd->pipefds[i], &c, 1);
+		TEST_ASSERT(
+			ret == 1, "Unable to write to pipefd %i for uffd_desc %p", i, uffd);
+	}
 
-	ret = pthread_join(uffd->thread, NULL);
-	TEST_ASSERT(ret == 0, "Pthread_join failed.");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = pthread_join(uffd->readers[i], NULL);
+		TEST_ASSERT(
+			ret == 0, "Pthread_join failed on reader %i for uffd_desc %p", i, uffd);
+	}
 
 	close(uffd->uffd);
 
-	close(uffd->pipefds[1]);
-	close(uffd->pipefds[0]);
+	for (i = 0; i < uffd->num_readers; ++i) {
+		close(uffd->pipefds[i]);
+		close(uffd->reader_args[i].pipe);
+	}
 
+	free(uffd->pipefds);
+	free(uffd->readers);
+	free(uffd->reader_args);
 	free(uffd);
 }
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 15/17] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (13 preceding siblings ...)
  2023-09-08 22:29 ` [PATCH v5 14/17] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
@ 2023-09-08 22:29 ` Anish Moorthy
  2023-09-08 22:29 ` [PATCH v5 16/17] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
  2023-09-08 22:29 ` [PATCH v5 17/17] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
  16 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:29 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

With multiple reader threads POLLing a single UFFD, the test suffers
from the thundering herd problem: performance degrades as the number of
reader threads is increased. Solve this issue [1] by switching the
the polling mechanism to EPOLL + EPOLLEXCLUSIVE.

Also, change the error-handling convention of uffd_handler_thread_fn.
Instead of just printing errors and returning early from the polling
loop, check for them via TEST_ASSERT. "return NULL" is reserved for a
successful exit from uffd_handler_thread_fn, ie one triggered by a
write to the exit pipe.

Performance samples generated by the command in [2] are given below.

Num Reader Threads, Paging Rate (POLL), Paging Rate (EPOLL)
1      249k      185k
2      201k      235k
4      186k      155k
16     150k      217k
32     89k       198k

[1] Single-vCPU performance does suffer somewhat.
[2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        |  1 -
 .../selftests/kvm/lib/userfaultfd_util.c      | 74 +++++++++----------
 2 files changed, 35 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index f7897a951f90..0455347f932a 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -13,7 +13,6 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
-#include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
 #include <sys/syscall.h>
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 6f220aa4fb08..2a179133645a 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -16,6 +16,7 @@
 #include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <sys/epoll.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -32,60 +33,55 @@ static void *uffd_handler_thread_fn(void *arg)
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
+	int epollfd;
+	struct epoll_event evt;
+
+	epollfd = epoll_create(1);
+	TEST_ASSERT(epollfd >= 0, "Failed to create epollfd.");
+
+	evt.events = EPOLLIN | EPOLLEXCLUSIVE;
+	evt.data.u32 = 0;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, uffd, &evt) == 0,
+		    "Failed to add uffd to epollfd");
+
+	evt.events = EPOLLIN;
+	evt.data.u32 = 1;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, reader_args->pipe, &evt) == 0,
+		    "Failed to add pipe to epollfd");
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
 	while (1) {
 		struct uffd_msg msg;
-		struct pollfd pollfd[2];
-		char tmp_chr;
 		int r;
 
-		pollfd[0].fd = uffd;
-		pollfd[0].events = POLLIN;
-		pollfd[1].fd = reader_args->pipe;
-		pollfd[1].events = POLLIN;
-
-		r = poll(pollfd, 2, -1);
-		switch (r) {
-		case -1:
-			pr_info("poll err");
-			continue;
-		case 0:
-			continue;
-		case 1:
-			break;
-		default:
-			pr_info("Polling uffd returned %d", r);
-			return NULL;
-		}
+		r = epoll_wait(epollfd, &evt, 1, -1);
+		TEST_ASSERT(r == 1,
+			    "Unexpected number of events (%d) from epoll, errno = %d",
+			    r, errno);
 
-		if (pollfd[0].revents & POLLERR) {
-			pr_info("uffd revents has POLLERR");
-			return NULL;
-		}
+		if (evt.data.u32 == 1) {
+			char tmp_chr;
 
-		if (pollfd[1].revents & POLLIN) {
-			r = read(pollfd[1].fd, &tmp_chr, 1);
+			TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+				    "Reader thread received EPOLLERR or EPOLLHUP on pipe.");
+			r = read(reader_args->pipe, &tmp_chr, 1);
 			TEST_ASSERT(r == 1,
-				    "Error reading pipefd in UFFD thread\n");
+				    "Error reading pipefd in uffd reader thread");
 			break;
 		}
 
-		if (!(pollfd[0].revents & POLLIN))
-			continue;
+		TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+			    "Reader thread received EPOLLERR or EPOLLHUP on uffd.");
 
 		r = read(uffd, &msg, sizeof(msg));
 		if (r == -1) {
-			if (errno == EAGAIN)
-				continue;
-			pr_info("Read of uffd got errno %d\n", errno);
-			return NULL;
+			TEST_ASSERT(errno == EAGAIN,
+				    "Error reading from UFFD: errno = %d", errno);
+			continue;
 		}
 
-		if (r != sizeof(msg)) {
-			pr_info("Read on uffd returned unexpected size: %d bytes", r);
-			return NULL;
-		}
+		TEST_ASSERT(r == sizeof(msg),
+			    "Read on uffd returned unexpected number of bytes (%d)", r);
 
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
@@ -93,8 +89,8 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (reader_args->delay)
 			usleep(reader_args->delay);
 		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
-		if (r < 0)
-			return NULL;
+		TEST_ASSERT(r >= 0,
+			    "Reader thread handler fn returned negative value %d", r);
 		pages++;
 	}
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 16/17] KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (14 preceding siblings ...)
  2023-09-08 22:29 ` [PATCH v5 15/17] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
@ 2023-09-08 22:29 ` Anish Moorthy
  2023-09-08 22:29 ` [PATCH v5 17/17] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
  16 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:29 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

Memslot flags aren't currently exposed to the tests, and are just always
set to 0. Add a parameter to allow tests to manually set those flags.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 tools/testing/selftests/kvm/access_tracking_perf_test.c       | 2 +-
 tools/testing/selftests/kvm/demand_paging_test.c              | 2 +-
 tools/testing/selftests/kvm/dirty_log_perf_test.c             | 2 +-
 tools/testing/selftests/kvm/include/memstress.h               | 2 +-
 tools/testing/selftests/kvm/lib/memstress.c                   | 4 ++--
 .../testing/selftests/kvm/memslot_modification_stress_test.c  | 2 +-
 .../selftests/kvm/x86_64/dirty_log_page_splitting_test.c      | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..b51656b408b8 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -306,7 +306,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	int nr_vcpus = params->nr_vcpus;
 
-	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1, 0,
 				 params->backing_src, !overlap_memory_access);
 
 	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 0455347f932a..61bb2e23bef0 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -163,7 +163,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	double vcpu_paging_rate;
 	uint64_t uffd_region_size;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
 				 p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index d374dbcf9a53..8b1a84a4db3b 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -153,7 +153,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	int i;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
-				 p->slots, p->backing_src,
+				 p->slots, 0, p->backing_src,
 				 p->partition_vcpu_memory_access);
 
 	pr_info("Random seed: %u\n", p->random_seed);
diff --git a/tools/testing/selftests/kvm/include/memstress.h b/tools/testing/selftests/kvm/include/memstress.h
index ce4e603050ea..8be9609d3ca0 100644
--- a/tools/testing/selftests/kvm/include/memstress.h
+++ b/tools/testing/selftests/kvm/include/memstress.h
@@ -56,7 +56,7 @@ struct memstress_args {
 extern struct memstress_args memstress_args;
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access);
 void memstress_destroy_vm(struct kvm_vm *vm);
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index df457452d146..dc145952d19c 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -123,7 +123,7 @@ void memstress_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
 }
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access)
 {
@@ -211,7 +211,7 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 
 		vm_userspace_mem_region_add(vm, backing_src, region_start,
 					    MEMSTRESS_MEM_SLOT_INDEX + i,
-					    region_pages, 0);
+					    region_pages, slot_flags);
 	}
 
 	/* Do mapping for the demand paging memory slot */
diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
index 9855c41ca811..0b19ec3ecc9c 100644
--- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
+++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
@@ -95,7 +95,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct test_params *p = arg;
 	struct kvm_vm *vm;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
 				 VM_MEM_SRC_ANONYMOUS,
 				 p->partition_vcpu_memory_access);
 
diff --git a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
index 634c6bfcd572..a770d7fa469a 100644
--- a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
+++ b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
@@ -100,7 +100,7 @@ static void run_test(enum vm_guest_mode mode, void *unused)
 	struct kvm_page_stats stats_dirty_logging_disabled;
 	struct kvm_page_stats stats_repopulated;
 
-	vm = memstress_create_vm(mode, VCPUS, guest_percpu_mem_size,
+	vm = memstress_create_vm(mode, VCPUS, guest_percpu_mem_size, 0,
 				 SLOTS, backing_src, false);
 
 	guest_num_pages = (VCPUS * guest_percpu_mem_size) >> vm->page_shift;
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 17/17] KVM: selftests: Handle memory fault exits in demand_paging_test
  2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (15 preceding siblings ...)
  2023-09-08 22:29 ` [PATCH v5 16/17] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
@ 2023-09-08 22:29 ` Anish Moorthy
  16 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-08 22:29 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

Demonstrate a (very basic) scheme for supporting memory fault exits.

From the vCPU threads:
1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
   with the purpose of establishing the absent mappings. Do so with
   wake_waiters=false to avoid serializing on the userfaultfd wait queue
   locks.

2. When the UFFDIO_COPY/CONTINUE in (1) fails with EEXIST,
   assume that the mapping was already established but is currently
   absent [A] and attempt to populate it using MADV_POPULATE_WRITE.

Issue UFFDIO_COPY/CONTINUEs from the reader threads as well, but with
wake_waiters=true to ensure that any threads sleeping on the uffd are
eventually woken up.

A real VMM would track whether it had already COPY/CONTINUEd pages (eg,
via a bitmap) to avoid calls destined to EEXIST. However, even the
naive approach is enough to demonstrate the performance advantages of
KVM_EXIT_MEMORY_FAULT.

[A] In reality it is much likelier that the vCPU thread simply lost a
    race to establish the mapping for the page.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        | 245 +++++++++++++-----
 1 file changed, 173 insertions(+), 72 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 61bb2e23bef0..ded5cdf6dde9 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -15,6 +15,7 @@
 #include <time.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <linux/mman.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -31,36 +32,102 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
 static size_t demand_paging_size;
 static char *guest_data_prototype;
 
+static int num_uffds;
+static size_t uffd_region_size;
+static struct uffd_desc **uffd_descs;
+/*
+ * Delay when demand paging is performed through userfaultfd or directly by
+ * vcpu_worker in the case of an annotated memory fault.
+ */
+static useconds_t uffd_delay;
+static int uffd_mode;
+
+
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+				    bool is_vcpu);
+
+static void madv_write_or_err(uint64_t gpa)
+{
+	int r;
+	void *hva = addr_gpa2hva(memstress_args.vm, gpa);
+
+	r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
+	TEST_ASSERT(r == 0,
+		    "MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) fail, errno %i\n",
+		    (uintptr_t) hva, gpa, errno);
+}
+
+static void ready_page(uint64_t gpa)
+{
+	int r, uffd;
+
+	/*
+	 * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
+	 * the registered ranges should fault in the physical pages through
+	 * MADV_POPULATE_WRITE.
+	 */
+	if ((gpa < memstress_args.gpa)
+		|| (gpa >= memstress_args.gpa + memstress_args.size)) {
+		madv_write_or_err(gpa);
+	} else {
+		if (uffd_delay)
+			usleep(uffd_delay);
+
+		uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
+
+		r = handle_uffd_page_request(uffd_mode, uffd,
+					     (uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
+
+		if (r == EEXIST)
+			madv_write_or_err(gpa);
+	}
+}
+
 static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 {
 	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
 	int vcpu_idx = vcpu_args->vcpu_idx;
 	struct kvm_run *run = vcpu->run;
-	struct timespec start;
-	struct timespec ts_diff;
+	struct timespec last_start;
+	struct timespec total_runtime = {};
 	int ret;
-
-	clock_gettime(CLOCK_MONOTONIC, &start);
-
-	/* Let the guest access its memory */
-	ret = _vcpu_run(vcpu);
-	TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-	if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
-		TEST_ASSERT(false,
-			    "Invalid guest sync status: exit_reason=%s\n",
-			    exit_reason_str(run->exit_reason));
+	u64 num_memory_fault_exits = 0;
+	bool annotated_memory_fault = false;
+
+	while (true) {
+		clock_gettime(CLOCK_MONOTONIC, &last_start);
+		/* Let the guest access its memory */
+		ret = _vcpu_run(vcpu);
+		annotated_memory_fault = errno == EFAULT
+					 && run->flags | KVM_RUN_MEMORY_FAULT_FILLED;
+		TEST_ASSERT(ret == 0 || annotated_memory_fault,
+			    "vcpu_run failed: %d\n", ret);
+
+		total_runtime = timespec_add(total_runtime,
+					     timespec_elapsed(last_start));
+		if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
+
+			if (annotated_memory_fault) {
+				++num_memory_fault_exits;
+				ready_page(run->memory_fault.gpa);
+				continue;
+			}
+
+			TEST_ASSERT(false,
+				    "Invalid guest sync status: exit_reason=%s\n",
+				    exit_reason_str(run->exit_reason));
+		}
+		break;
 	}
-
-	ts_diff = timespec_elapsed(start);
-	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
-		       ts_diff.tv_sec, ts_diff.tv_nsec);
+	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds, %d memory fault exits\n",
+		       vcpu_idx, total_runtime.tv_sec, total_runtime.tv_nsec,
+		       num_memory_fault_exits);
 }
 
-static int handle_uffd_page_request(int uffd_mode, int uffd,
-		struct uffd_msg *msg)
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+				    bool is_vcpu)
 {
 	pid_t tid = syscall(__NR_gettid);
-	uint64_t addr = msg->arg.pagefault.address;
 	struct timespec start;
 	struct timespec ts_diff;
 	int r;
@@ -71,16 +138,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		struct uffdio_copy copy;
 
 		copy.src = (uint64_t)guest_data_prototype;
-		copy.dst = addr;
+		copy.dst = hva;
 		copy.len = demand_paging_size;
-		copy.mode = 0;
+		copy.mode = is_vcpu ? UFFDIO_COPY_MODE_DONTWAKE : 0;
 
-		r = ioctl(uffd, UFFDIO_COPY, &copy);
 		/*
-		 * With multiple vCPU threads fault on a single page and there are
-		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
-		 * will fail with EEXIST: handle that case without signaling an
-		 * error.
+		 * With multiple vCPU threads and at least one of multiple reader threads
+		 * or vCPU memory faults, multiple vCPUs accessing an absent page will
+		 * almost certainly cause some thread doing the UFFDIO_COPY here to get
+		 * EEXIST: make sure to allow that case.
 		 *
 		 * Note that this also suppress any EEXISTs occurring from,
 		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
@@ -88,23 +154,24 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		 * some external state to correctly surface EEXISTs to userspace
 		 * (or prevent duplicate COPY/CONTINUEs in the first place).
 		 */
-		if (r == -1 && errno != EEXIST) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		r = ioctl(uffd, UFFDIO_COPY, &copy);
+		TEST_ASSERT(r == 0 || errno == EEXIST,
+			    "Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
 	} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+		/* The comments in the UFFDIO_COPY branch also apply here. */
 		struct uffdio_continue cont = {0};
 
-		cont.range.start = addr;
+		cont.range.start = hva;
 		cont.range.len = demand_paging_size;
+		cont.mode = is_vcpu ? UFFDIO_CONTINUE_MODE_DONTWAKE : 0;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
 		/*
-		 * With multiple vCPU threads fault on a single page and there are
-		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
-		 * will fail with EEXIST: handle that case without signaling an
-		 * error.
+		 * With multiple vCPU threads and at least one of multiple reader threads
+		 * or vCPU memory faults, multiple vCPUs accessing an absent page will
+		 * almost certainly cause some thread doing the UFFDIO_COPY here to get
+		 * EEXIST: make sure to allow that case.
 		 *
 		 * Note that this also suppress any EEXISTs occurring from,
 		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
@@ -112,32 +179,54 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		 * some external state to correctly surface EEXISTs to userspace
 		 * (or prevent duplicate COPY/CONTINUEs in the first place).
 		 */
-		if (r == -1 && errno != EEXIST) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		TEST_ASSERT(r == 0 || errno == EEXIST,
+			    "Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
 	} else {
 		TEST_FAIL("Invalid uffd mode %d", uffd_mode);
 	}
 
+	/*
+	 * If the above UFFDIO_COPY/CONTINUE failed with EEXIST, waiting threads
+	 * will not have been woken: wake them here.
+	 */
+	if (!is_vcpu && r != 0) {
+		struct uffdio_range range = {
+			.start = hva,
+			.len = demand_paging_size
+		};
+		r = ioctl(uffd, UFFDIO_WAKE, &range);
+		TEST_ASSERT(r == 0,
+			    "Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
+	}
+
 	ts_diff = timespec_elapsed(start);
 
 	PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
 		       timespec_to_ns(ts_diff));
 	PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
-		       demand_paging_size, addr, tid);
+		       demand_paging_size, hva, tid);
 
 	return 0;
 }
 
+static int handle_uffd_page_request_from_uffd(int uffd_mode, int uffd,
+					      struct uffd_msg *msg)
+{
+	TEST_ASSERT(msg->event == UFFD_EVENT_PAGEFAULT,
+		    "Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
+		    msg->event);
+	return handle_uffd_page_request(uffd_mode, uffd,
+					msg->arg.pagefault.address, false);
+}
+
 struct test_params {
-	int uffd_mode;
 	bool single_uffd;
-	useconds_t uffd_delay;
 	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
+	bool memfault_exits;
 };
 
 static void prefault_mem(void *alias, uint64_t len)
@@ -155,16 +244,22 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct memstress_vcpu_args *vcpu_args;
 	struct test_params *p = arg;
-	struct uffd_desc **uffd_descs = NULL;
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i, num_uffds = 0;
+	int i;
 	double vcpu_paging_rate;
-	uint64_t uffd_region_size;
+	uint32_t slot_flags = 0;
+	bool uffd_memfault_exits = uffd_mode && p->memfault_exits;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
-				 p->src_type, p->partition_vcpu_memory_access);
+	if (uffd_memfault_exits) {
+		TEST_ASSERT(kvm_has_cap(KVM_CAP_USERFAULT_ON_MISSING) > 0,
+					"KVM does not have KVM_CAP_USERFAULT_ON_MISSING");
+		slot_flags = KVM_MEM_USERFAULT_ON_MISSING;
+	}
+
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
+				 1, slot_flags, p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
 
@@ -173,21 +268,21 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		    "Failed to allocate buffer for guest data pattern");
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
-	if (p->uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
-		num_uffds = p->single_uffd ? 1 : nr_vcpus;
-		for (i = 0; i < num_uffds; i++) {
-			vcpu_args = &memstress_args.vcpu_args[i];
-			prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
-				     vcpu_args->pages * memstress_args.guest_page_size);
-		}
-	}
-
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		num_uffds = p->single_uffd ? 1 : nr_vcpus;
 		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
 
+		if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+			for (i = 0; i < num_uffds; i++) {
+				vcpu_args = &memstress_args.vcpu_args[i];
+				prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
+					     uffd_region_size);
+			}
+		}
+
 		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
-		TEST_ASSERT(uffd_descs, "Memory allocation failed");
+		TEST_ASSERT(uffd_descs, "Failed to allocate uffd descriptors");
+
 		for (i = 0; i < num_uffds; i++) {
 			struct memstress_vcpu_args *vcpu_args;
 			void *vcpu_hva;
@@ -201,10 +296,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 * requests.
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
-				p->uffd_mode, p->uffd_delay, vcpu_hva,
+				uffd_mode, uffd_delay, vcpu_hva,
 				uffd_region_size,
 				p->readers_per_uffd,
-				&handle_uffd_page_request);
+				&handle_uffd_page_request_from_uffd);
 		}
 	}
 
@@ -218,7 +313,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	ts_diff = timespec_elapsed(start);
 	pr_info("All vCPU threads joined\n");
 
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
 		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
@@ -239,7 +334,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memstress_destroy_vm(vm);
 
 	free(guest_data_prototype);
-	if (p->uffd_mode)
+	if (uffd_mode)
 		free(uffd_descs);
 }
 
@@ -248,7 +343,8 @@ static void help(char *name)
 	puts("");
 	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
 		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
-		   "          [-s type] [-v vcpus] [-c cpu_list] [-o]\n", name);
+		   "          [-s type] [-v vcpus] [-c cpu_list] [-o] [-w] \n",
+	       name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
@@ -260,6 +356,7 @@ static void help(char *name)
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
 	printf(" -r: Set the number of reader threads per uffd.\n");
+	printf(" -w: Enable kvm cap for memory fault exits.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -280,29 +377,30 @@ int main(int argc, char *argv[])
 		.partition_vcpu_memory_access = true,
 		.readers_per_uffd = 1,
 		.single_uffd = false,
+		.memfault_exits = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:c:r:")) != -1) {
+	while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:c:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
 			break;
 		case 'u':
 			if (!strcmp("MISSING", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
+				uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
 			else if (!strcmp("MINOR", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
-			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
+				uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
+			TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
 		case 'a':
 			p.single_uffd = true;
 			break;
 		case 'd':
-			p.uffd_delay = strtoul(optarg, NULL, 0);
-			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
+			uffd_delay = strtoul(optarg, NULL, 0);
+			TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
 			break;
 		case 'b':
 			guest_percpu_mem_size = parse_size(optarg);
@@ -328,6 +426,9 @@ int main(int argc, char *argv[])
 				    "Invalid number of readers per uffd %d: must be >=1",
 				    p.readers_per_uffd);
 			break;
+		case 'w':
+			p.memfault_exits = true;
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
@@ -335,7 +436,7 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
+	if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
 	    !backing_src_is_shared(p.src_type)) {
 		TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
 	}
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page()
  2023-09-08 22:28 ` [PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page() Anish Moorthy
@ 2023-09-14  8:04   ` kernel test robot
  2023-10-05  1:53   ` Sean Christopherson
  2023-10-05 23:03   ` Anish Moorthy
  2 siblings, 0 replies; 53+ messages in thread
From: kernel test robot @ 2023-09-14  8:04 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oe-lkp, lkp, kvm, seanjc, oliver.upton, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, amoorthy, ricarkol, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul, oliver.sang



Hello,

kernel test robot noticed "WARNING:at_include/linux/kvm_host.h:#kvm_vcpu_write_guest_page[kvm]" on:

commit: 00aaa25de7f10dfd5ac7afec09d6b4d72c379451 ("[PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page()")
url: https://github.com/intel-lab-lkp/linux/commits/Anish-Moorthy/KVM-Clarify-documentation-of-hva_to_pfn-s-atomic-parameter/20230909-063310
base: https://git.kernel.org/cgit/virt/kvm/kvm.git queue
patch link: https://lore.kernel.org/all/20230908222905.1321305-6-amoorthy@google.com/
patch subject: [PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page()

in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:

	group: kvm



compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) with 256G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202309141107.30863e9d-oliver.sang@intel.com


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230914/202309141107.30863e9d-oliver.sang@intel.com


[  216.317580][ T6089] ------------[ cut here ]------------
[  216.324543][ T6089] WARNING: CPU: 117 PID: 6089 at include/linux/kvm_host.h:2346 kvm_vcpu_write_guest_page+0x23b/0x2a0 [kvm]
[  216.338385][ T6089] Modules linked in: openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 intel_rapl_msr intel_rapl_common btrfs x86_pkg_temp_thermal blake2b_generic intel_powerclamp xor coretemp raid6_pq kvm_intel zstd_compress libcrc32c kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nvme sha512_ssse3 nvme_core rapl t10_pi intel_cstate mei_me ast dax_hmem crc64_rocksoft_generic crc64_rocksoft drm_shmem_helper i2c_i801 crc64 i2c_ismt mei i2c_smbus drm_kms_helper wmi ipmi_ssif acpi_ipmi joydev ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad binfmt_misc fuse drm ip_tables
[  216.406963][ T6089] CPU: 117 PID: 6089 Comm: mmio_warning_te Not tainted 6.5.0-00313-g00aaa25de7f1 #1
[  216.418660][ T6089] RIP: 0010:kvm_vcpu_write_guest_page+0x23b/0x2a0 [kvm]
[  216.427008][ T6089] Code: c1 8b 04 24 e9 d0 fe ff ff 89 04 24 e8 3e 3c 09 c1 8b 04 24 e9 1f ff ff ff 0f 1f 44 00 00 e9 5b fe ff ff 0f 0b e9 24 fe ff ff <0f> 0b e9 89 fe ff ff 48 89 df 48 89 34 24 e8 52 3c 09 c1 48 8b 34
[  216.450579][ T6089] RSP: 0018:ffa000001ad0f638 EFLAGS: 00010202
[  216.457880][ T6089] RAX: 00000000fffffff2 RBX: ff1100019e5a8040 RCX: 1fe2200033cb53c9
[  216.467703][ T6089] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffa00000177e1000
[  216.477457][ T6089] RBP: 000000000000fffc R08: 0000000000000ffc R09: 0000000000000002
[  216.487445][ T6089] R10: ffa00000177eafd3 R11: 0000000000000001 R12: ff1100019e5a9e48
[  216.497466][ T6089] R13: 0000000000000002 R14: ff11000500a72cd0 R15: 000000000000000f
[  216.507485][ T6089] FS:  00007fd4160036c0(0000) GS:ff110017fe680000(0000) knlGS:0000000000000000
[  216.518653][ T6089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  216.526450][ T6089] CR2: 00007fd416002f78 CR3: 0000000154bb0003 CR4: 0000000000f73ee0
[  216.536508][ T6089] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  216.546584][ T6089] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  216.556682][ T6089] PKRU: 55555554
[  216.561031][ T6089] Call Trace:
[  216.565102][ T6089]  <TASK>
[  216.568851][ T6089]  ? __warn+0xcd/0x2b0
[  216.573882][ T6089]  ? kvm_vcpu_write_guest_page+0x23b/0x2a0 [kvm]
[  216.581406][ T6089]  ? report_bug+0x267/0x2d0
[  216.586914][ T6089]  ? handle_bug+0x3c/0x70
[  216.592099][ T6089]  ? exc_invalid_op+0x17/0x40
[  216.597749][ T6089]  ? asm_exc_invalid_op+0x1a/0x20
[  216.603751][ T6089]  ? kvm_vcpu_write_guest_page+0x23b/0x2a0 [kvm]
[  216.611219][ T6089]  ? kvm_vcpu_write_guest_page+0x5b/0x2a0 [kvm]
[  216.618594][ T6089]  kvm_vcpu_write_guest+0x4b/0x80 [kvm]
[  216.625453][ T6089]  write_emulate+0x23/0x50 [kvm]
[  216.631477][ T6089]  emulator_read_write_onepage+0x2ff/0x4a0 [kvm]
[  216.638944][ T6089]  ? vcpu_mmio_gva_to_gpa+0x730/0x730 [kvm]
[  216.645902][ T6089]  ? em_clflushopt+0x10/0x10 [kvm]
[  216.651976][ T6089]  emulator_read_write+0x149/0x510 [kvm]
[  216.658642][ T6089]  segmented_write+0xce/0x120 [kvm]
[  216.665492][ T6089]  ? em_sgdt+0x70/0x70 [kvm]
[  216.670959][ T6089]  ? vmx_read_guest_seg_selector+0x2c/0x290 [kvm_intel]
[  216.679008][ T6089]  push+0x316/0x5f0 [kvm]
[  216.684164][ T6089]  ? emulator_get_segment+0xbe/0x410 [kvm]
[  216.690978][ T6089]  ? load_state_from_tss16+0x940/0x940 [kvm]
[  216.697979][ T6089]  __emulate_int_real+0x306/0x690 [kvm]
[  216.704485][ T6089]  ? vmx_read_guest_seg_ar+0x2f/0x2b0 [kvm_intel]
[  216.711940][ T6089]  ? em_call+0x120/0x120 [kvm]
[  216.717583][ T6089]  ? kvm_guest_time_update+0x420/0xae0 [kvm]
[  216.724596][ T6089]  ? trace_event_raw_event_kvm_exit+0x2d0/0x2d0 [kvm]
[  216.732449][ T6089]  ? validate_chain+0x151/0xfe0
[  216.738101][ T6089]  ? slab_free_freelist_hook+0x11e/0x1e0
[  216.744690][ T6089]  emulate_int_real+0x79/0xc0 [kvm]
[  216.750867][ T6089]  kvm_inject_realmode_interrupt+0x102/0x260 [kvm]
[  216.758433][ T6089]  kvm_check_and_inject_events+0x805/0x1090 [kvm]
[  216.765924][ T6089]  vcpu_enter_guest+0xbd3/0x3780 [kvm]
[  216.773487][ T6089]  ? kvm_check_and_inject_events+0x1090/0x1090 [kvm]
[  216.781257][ T6089]  ? lock_acquire+0x193/0x4b0
[  216.786793][ T6089]  ? kvm_arch_vcpu_ioctl_run+0x12d/0x1630 [kvm]
[  216.794064][ T6089]  ? lock_sync+0x170/0x170
[  216.799259][ T6089]  ? mark_held_locks+0x9e/0xe0
[  216.804829][ T6089]  ? vcpu_run+0xb2/0xa00 [kvm]
[  216.810443][ T6089]  vcpu_run+0xb2/0xa00 [kvm]
[  216.815883][ T6089]  ? __local_bh_enable_ip+0xa6/0x110
[  216.822064][ T6089]  kvm_arch_vcpu_ioctl_run+0x39f/0x1630 [kvm]
[  216.829166][ T6089]  kvm_vcpu_ioctl+0x51c/0xcb0 [kvm]
[  216.835258][ T6089]  ? kvm_vcpu_kick+0x320/0x320 [kvm]
[  216.841460][ T6089]  ? find_held_lock+0x2d/0x110
[  216.847022][ T6089]  ? __lock_release+0x111/0x440
[  216.853383][ T6089]  ? __fget_files+0x1c5/0x380
[  216.858863][ T6089]  ? reacquire_held_locks+0x4e0/0x4e0
[  216.865134][ T6089]  ? __fget_files+0x1c5/0x380
[  216.870640][ T6089]  ? lock_release+0xe3/0x200
[  216.876005][ T6089]  ? __fget_files+0x1dd/0x380
[  216.881470][ T6089]  __x64_sys_ioctl+0x130/0x1a0
[  216.887004][ T6089]  do_syscall_64+0x59/0x80
[  216.892164][ T6089]  entry_SYSCALL_64_after_hwframe+0x5e/0xc8
[  216.898974][ T6089] RIP: 0033:0x7fd416905bab
[  216.904137][ T6089] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[  216.927031][ T6089] RSP: 002b:00007fd416002e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  216.937130][ T6089] RAX: ffffffffffffffda RBX: 00007fd4169ef000 RCX: 00007fd416905bab
[  216.946672][ T6089] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000007
[  216.956171][ T6089] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007fff966831e7
[  216.965766][ T6089] R10: 0000000000000008 R11: 0000000000000246 R12: ffffffffffffff80
[  216.975342][ T6089] R13: 0000000000000000 R14: 00007fff966830f0 R15: 00007fd415803000
[  216.985071][ T6089]  </TASK>
[  216.988860][ T6089] irq event stamp: 1547
[  216.993931][ T6089] hardirqs last  enabled at (1561): [<ffffffff81385452>] __up_console_sem+0x52/0x60
[  217.005280][ T6089] hardirqs last disabled at (1580): [<ffffffff81385437>] __up_console_sem+0x37/0x60
[  217.016487][ T6089] softirqs last  enabled at (1574): [<ffffffff83a997a5>] __do_softirq+0x545/0x814
[  217.027532][ T6089] softirqs last disabled at (1569): [<ffffffff811eb372>] __irq_exit_rcu+0x132/0x180
[  217.038927][ T6089] ---[ end trace 0000000000000000 ]---
[  217.045467][ T6089] ------------[ cut here ]------------
[  217.051961][ T6089] WARNING: CPU: 117 PID: 6089 at include/linux/kvm_host.h:2346 kvm_vcpu_read_guest_page+0x21f/0x270 [kvm]
[  217.065501][ T6089] Modules linked in: openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 intel_rapl_msr intel_rapl_common btrfs x86_pkg_temp_thermal blake2b_generic intel_powerclamp xor coretemp raid6_pq kvm_intel zstd_compress libcrc32c kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nvme sha512_ssse3 nvme_core rapl t10_pi intel_cstate mei_me ast dax_hmem crc64_rocksoft_generic crc64_rocksoft drm_shmem_helper i2c_i801 crc64 i2c_ismt mei i2c_smbus drm_kms_helper wmi ipmi_ssif acpi_ipmi joydev ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad binfmt_misc fuse drm ip_tables
[  217.134216][ T6089] CPU: 117 PID: 6089 Comm: mmio_warning_te Tainted: G        W          6.5.0-00313-g00aaa25de7f1 #1
[  217.147426][ T6089] RIP: 0010:kvm_vcpu_read_guest_page+0x21f/0x270 [kvm]
[  217.155531][ T6089] Code: 24 04 e9 d0 fe ff ff 89 44 24 04 e8 db 38 09 c1 8b 44 24 04 e9 1d ff ff ff 0f 1f 44 00 00 e9 59 fe ff ff 0f 0b e9 22 fe ff ff <0f> 0b e9 87 fe ff ff 89 44 24 04 e8 91 39 09 c1 8b 44 24 04 e9 25
[  217.179210][ T6089] RSP: 0018:ffa000001ad0f8b8 EFLAGS: 00010202
[  217.186390][ T6089] RAX: 00000000fffffff2 RBX: ff1100019e5a8040 RCX: 1fe2200033cb53c9
[  217.196363][ T6089] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  217.206324][ T6089] RBP: 000000000000001a R08: 0000000000000002 R09: fff3fc0002efd5fa
[  217.216274][ T6089] R10: ffa00000177eafd3 R11: 0000000000000001 R12: ff1100019e5a9e48
[  217.226228][ T6089] R13: 0000000000000002 R14: ffa000001ad0f9a0 R15: ffa000001ad0f9a0
[  217.236138][ T6089] FS:  00007fd4160036c0(0000) GS:ff110017fe680000(0000) knlGS:0000000000000000
[  217.247194][ T6089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  217.254995][ T6089] CR2: 00007fd416002f78 CR3: 0000000154bb0003 CR4: 0000000000f73ee0
[  217.264906][ T6089] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  217.274788][ T6089] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  217.284600][ T6089] PKRU: 55555554
[  217.289083][ T6089] Call Trace:
[  217.293161][ T6089]  <TASK>
[  217.296927][ T6089]  ? __warn+0xcd/0x2b0
[  217.301964][ T6089]  ? kvm_vcpu_read_guest_page+0x21f/0x270 [kvm]
[  217.309368][ T6089]  ? report_bug+0x267/0x2d0
[  217.314868][ T6089]  ? handle_bug+0x3c/0x70
[  217.320085][ T6089]  ? exc_invalid_op+0x17/0x40
[  217.325731][ T6089]  ? asm_exc_invalid_op+0x1a/0x20
[  217.331757][ T6089]  ? kvm_vcpu_read_guest_page+0x21f/0x270 [kvm]
[  217.339112][ T6089]  ? kvm_vcpu_read_guest_page+0x3d/0x270 [kvm]
[  217.346393][ T6089]  kvm_read_guest_virt_helper+0x97/0x150 [kvm]
[  217.353707][ T6089]  __emulate_int_real+0x478/0x690 [kvm]
[  217.360268][ T6089]  ? vmx_read_guest_seg_ar+0x2f/0x2b0 [kvm_intel]
[  217.367857][ T6089]  ? em_call+0x120/0x120 [kvm]
[  217.373544][ T6089]  ? kvm_guest_time_update+0x420/0xae0 [kvm]
[  217.380580][ T6089]  ? trace_event_raw_event_kvm_exit+0x2d0/0x2d0 [kvm]
[  217.388496][ T6089]  ? validate_chain+0x151/0xfe0
[  217.394204][ T6089]  ? slab_free_freelist_hook+0x11e/0x1e0
[  217.400874][ T6089]  emulate_int_real+0x79/0xc0 [kvm]
[  217.407014][ T6089]  kvm_inject_realmode_interrupt+0x102/0x260 [kvm]
[  217.414639][ T6089]  kvm_check_and_inject_events+0x805/0x1090 [kvm]
[  217.422140][ T6089]  vcpu_enter_guest+0xbd3/0x3780 [kvm]
[  217.429792][ T6089]  ? kvm_check_and_inject_events+0x1090/0x1090 [kvm]
[  217.437587][ T6089]  ? lock_acquire+0x193/0x4b0
[  217.443085][ T6089]  ? kvm_arch_vcpu_ioctl_run+0x12d/0x1630 [kvm]
[  217.450366][ T6089]  ? lock_sync+0x170/0x170
[  217.455537][ T6089]  ? mark_held_locks+0x9e/0xe0
[  217.461108][ T6089]  ? vcpu_run+0xb2/0xa00 [kvm]
[  217.466753][ T6089]  vcpu_run+0xb2/0xa00 [kvm]
[  217.472164][ T6089]  ? __local_bh_enable_ip+0xa6/0x110
[  217.478296][ T6089]  kvm_arch_vcpu_ioctl_run+0x39f/0x1630 [kvm]
[  217.485359][ T6089]  kvm_vcpu_ioctl+0x51c/0xcb0 [kvm]
[  217.491433][ T6089]  ? kvm_vcpu_kick+0x320/0x320 [kvm]
[  217.497653][ T6089]  ? find_held_lock+0x2d/0x110
[  217.503209][ T6089]  ? __lock_release+0x111/0x440
[  217.509571][ T6089]  ? __fget_files+0x1c5/0x380
[  217.515022][ T6089]  ? reacquire_held_locks+0x4e0/0x4e0
[  217.521289][ T6089]  ? __fget_files+0x1c5/0x380
[  217.526854][ T6089]  ? lock_release+0xe3/0x200
[  217.532235][ T6089]  ? __fget_files+0x1dd/0x380
[  217.537777][ T6089]  __x64_sys_ioctl+0x130/0x1a0
[  217.543329][ T6089]  do_syscall_64+0x59/0x80
[  217.548489][ T6089]  entry_SYSCALL_64_after_hwframe+0x5e/0xc8
[  217.555297][ T6089] RIP: 0033:0x7fd416905bab
[  217.560474][ T6089] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[  217.583524][ T6089] RSP: 002b:00007fd416002e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  217.593465][ T6089] RAX: ffffffffffffffda RBX: 00007fd4169ef000 RCX: 00007fd416905bab
[  217.603087][ T6089] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000007
[  217.612606][ T6089] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007fff966831e7
[  217.622354][ T6089] R10: 0000000000000008 R11: 0000000000000246 R12: ffffffffffffff80
[  217.632027][ T6089] R13: 0000000000000000 R14: 00007fff966830f0 R15: 00007fd415803000
[  217.641658][ T6089]  </TASK>
[  217.645355][ T6089] irq event stamp: 2949
[  217.650299][ T6089] hardirqs last  enabled at (2961): [<ffffffff81385452>] __up_console_sem+0x52/0x60
[  217.661481][ T6089] hardirqs last disabled at (2978): [<ffffffff81385437>] __up_console_sem+0x37/0x60
[  217.672743][ T6089] softirqs last  enabled at (2974): [<ffffffff83a997a5>] __do_softirq+0x545/0x814
[  217.683899][ T6089] softirqs last disabled at (2969): [<ffffffff811eb372>] __irq_exit_rcu+0x132/0x180


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 07/17] KVM: arm64: Annotate -EFAULT from user_mem_abort()
  2023-09-08 22:28 ` [PATCH v5 07/17] KVM: arm64: Annotate -EFAULT from user_mem_abort() Anish Moorthy
@ 2023-09-28 21:42   ` Anish Moorthy
  2023-10-05  1:26   ` Sean Christopherson
  2023-10-10 23:01   ` David Matlack
  2 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-09-28 21:42 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

> +               if (write_fault)
> +                       memory_fault_flags = KVM_MEMORY_FAULT_FLAG_EXEC;
> +               else if (exec_fault)
> +                       memory_fault_flags = KVM_MEMORY_FAULT_FLAG_EXEC;

Ugh I could have sworn I already fixed this, thanks Oliver for bringing it up

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-09-08 22:28 ` [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
@ 2023-10-05  1:14   ` Sean Christopherson
  2023-10-05 18:45     ` Anish Moorthy
  2023-10-10 22:58   ` David Matlack
  1 sibling, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2023-10-05  1:14 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Fri, Sep 08, 2023, Anish Moorthy wrote:
> @@ -2318,4 +2324,33 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +/*
> + * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
> + * populate the memory_fault field with the given information.
> + *
> + * WARNs and does nothing if the speculative exit canary has already been set
> + * or if 'vcpu' is not the current running vcpu.
> + */
> +static inline void kvm_handle_guest_uaccess_fault(struct kvm_vcpu *vcpu,
> +						  uint64_t gpa, uint64_t len, uint64_t flags)

After a lot of fiddling and leading you on a wild goose chase, I think the least
awful name is kvm_prepare_memory_fault_exit().  Like kvm_prepare_emulation_failure_exit(),
this doesn't actually "handle" anything, it just preps for the exit.

If it actually returned something then maybe kvm_handle_guest_uaccess_fault()
would be an ok name (IIRC, that was my original intent, but we wandered in a
different direction).

And peeking at future patches, pass in the RWX flags as bools, that way this
helper can deal with the bools=>flags conversion.  Oh, and fill the flags with
bitwise ORs, that way future conflicts with private memory will be trivial to
resolve.

E.g.

static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
						 gpa_t gpa, gpa_t size,
						 bool is_write, bool is_exec)
{
	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
	vcpu->run->memory_fault.gpa = gpa;
	vcpu->run->memory_fault.size = size;

	vcpu->run->memory_fault.flags = 0;
	if (is_write)
		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
	else if (is_exec)
		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
	else
		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_READ;
}

> +{
> +	/*
> +	 * Ensure that an unloaded vCPU's run struct isn't being modified

"unloaded" isn't accurate, e.g. the vCPU could be loaded, just not on this vCPU.
I'd just drop the comment entirely, this one is fairly self-explanatory.

> +	 */
> +	if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
> +		return;
> +
> +	/*
> +	 * Warn when overwriting an already-populated run struct.
> +	 */

For future reference, use this style

	/*
	 *
	 */

only if the comment spans multiple lines.  For single line comments, just:

	/* Warn when overwriting an already-populated run struct. */

> +	WARN_ON_ONCE(vcpu->speculative_exit_canary != KVM_SPEC_EXIT_UNUSED);

As mentioned in the guest_memfd thread[1], this WARN can be triggered by userspace,
e.g. by getting KVM to fill the union but not exit, which is sadly not too hard
because emulator_write_phys() incorrectly treats all failures as MMIO.

I'm not even sure how to fix that in a race-free, sane way.  E.g. rechecking the
memslots doesn't work because a memslot could appear between __kvm_write_guest_page()
failing and rechecking in emulator_read_write_onepage().

Hmm, maybe we could get away with returning a different errno, e.g. -ENXIO?  And
then emulator_write_phys() and emulator_read_write_onepage() can be taught to
handle different errors accordingly.

Anyways, I highly recommend just dropping the canary for now, trying to clean up
the emulator and get this fully functional probably won't be a smooth process.

> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index f089ab290978..d19aa7965392 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -278,6 +278,9 @@ struct kvm_xen_exit {
>  /* Flags that describe what fields in emulation_failure hold valid data. */
>  #define KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES (1ULL << 0)
>  
> +/* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
> +#define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
> +
>  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
>  struct kvm_run {
>  	/* in */
> @@ -531,6 +534,27 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +
> +	/*
> +	 * This second exit union holds structs for exits which may be triggered
> +	 * after KVM has already initiated a different exit, and/or may be
> +	 * filled speculatively by KVM.
> +	 *
> +	 * For instance, because of limitations in KVM's uAPI, a memory fault
> +	 * may be encounterd after an MMIO exit is initiated and exit_reason and
> +	 * kvm_run.mmio are filled: isolating the speculative exits here ensures
> +	 * that KVM won't clobber information for the original exit.
> +	 */
> +	union {
> +		/* KVM_RUN_MEMORY_FAULT_FILLED + EFAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len;
> +		} memory_fault;
> +		/* Fix the size of the union. */
> +		char speculative_exit_padding[256];
> +	};
>  };

As proposed in the guest_memfd thread[2], I think we should scrap the second union
and just commit to achieving 100% accuracy only for page fault paths in the
initial merge.

I'll send you a clean-ish patch to use as a starting point sometime next week.

[1] https://lore.kernel.org/all/ZRtxoaJdVF1C2Mvy@google.com
[2] https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 02/17] KVM: Add docstrings to __kvm_read/write_guest_page()
  2023-09-08 22:28 ` [PATCH v5 02/17] KVM: Add docstrings to __kvm_read/write_guest_page() Anish Moorthy
@ 2023-10-05  1:18   ` Sean Christopherson
  0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2023-10-05  1:18 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Fri, Sep 08, 2023, Anish Moorthy wrote:
> The (gfn, data, offset, len) order of parameters is a little strange,
> since "offset" actually applies to "gfn" rather than to "data". Add
> docstrings to make things perfectly clear.
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  virt/kvm/kvm_main.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 84e90ed3a134..12837416ce8a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3014,6 +3014,9 @@ static int next_segment(unsigned long len, int offset)
>  		return len;
>  }
>  
> +/*
> + * Copy 'len' bytes from guest memory at '(gfn * PAGE_SIZE) + offset' to 'data'

A preceding @ is usually how kernel comments refer to paramaters, e.g. an
alternative would be:

  /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */

Though I think I find your version to be more readable.

> + */

No need for a multi-line comment.

>  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>  				 void *data, int offset, int len)
>  {
> @@ -3115,6 +3118,9 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>  
> +/*
> + * Copy 'len' bytes from 'data' into guest memory at '(gfn * PAGE_SIZE) + offset'
> + */
>  static int __kvm_write_guest_page(struct kvm *kvm,
>  				  struct kvm_memory_slot *memslot, gfn_t gfn,
>  			          const void *data, int offset, int len)
> -- 
> 2.42.0.283.g2d96d420d3-goog
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 06/17] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  2023-09-08 22:28 ` [PATCH v5 06/17] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
@ 2023-10-05  1:26   ` Sean Christopherson
  2023-10-05 23:57     ` Anish Moorthy
  0 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2023-10-05  1:26 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Fri, Sep 08, 2023, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
> kvm_handle_error_pfn().

Rewrite with --verbose please.  And avoid function names if possible.  Sometimes
it's better/useful/necessary to reference a function by name, but in this case,
just saying kvm_handle_error_pfn() isn't helpful because it doesn't provide any
insight into the actual impact of the change, i.e. requires the reader to already
know exactly how and when kvm_handle_error_pfn() is used.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 07/17] KVM: arm64: Annotate -EFAULT from user_mem_abort()
  2023-09-08 22:28 ` [PATCH v5 07/17] KVM: arm64: Annotate -EFAULT from user_mem_abort() Anish Moorthy
  2023-09-28 21:42   ` Anish Moorthy
@ 2023-10-05  1:26   ` Sean Christopherson
  2023-10-10 23:01   ` David Matlack
  2 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2023-10-05  1:26 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Fri, Sep 08, 2023, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for guest access failure in
> user_mem_abort().

Same comments as the x86 patch, this is way too terse.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-09-08 22:28 ` [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls Anish Moorthy
@ 2023-10-05  1:44   ` Sean Christopherson
  2023-10-05 18:58     ` Anish Moorthy
  2023-11-01 21:53     ` Anish Moorthy
  0 siblings, 2 replies; 53+ messages in thread
From: Sean Christopherson @ 2023-10-05  1:44 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Eh, the shortlog basically says "do work" with a lot of fancy words.  It really
just boils down to:

  KVM: Let callers of __gfn_to_pfn_memslot() opt-out of USERFAULT_ON_MISSING

On Fri, Sep 08, 2023, Anish Moorthy wrote:
> Change the "atomic" parameter of __gfn_to_pfn_memslot() to an enum which

I've pushed back on more booleans multiple times, but IMO this is even worse.
E.g. what does an "upgrade" to atomic even mean?

Since we have line of sight to getting out of boolean hell via David's series,
just bite the bullet for now.  Deciphering the callers will suck, but not really
anymore than it already sucks.

kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
			       bool atomic, bool interruptible, bool *async,
			       bool write_fault, bool *writable,
			       bool can_do_userfault, hva_t *hva)
{
	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);

	if (hva)
		*hva = addr;

	if (kvm_is_error_hva(addr)) {
		if (writable)
			*writable = false;

		if (addr == KVM_HVA_ERR_RO_BAD)
			return KVM_PFN_ERR_RO_FAULT;

		return KVM_PFN_NOSLOT;
	}

	if (!atomic && can_do_userfault &&
	    kvm_is_slot_userfault_on_missing(slot)) {
		atomic = true;
		if (async) {
			*async = false;
			async = NULL;
		}
	}

	/* Do not map writable pfn in the readonly memslot. */
	if (writable && memslot_is_readonly(slot)) {
		*writable = false;
		writable = NULL;
	}

	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
			  writable);
}

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 11/17] KVM: x86: Enable KVM_CAP_USERFAULT_ON_MISSING
  2023-09-08 22:28 ` [PATCH v5 11/17] KVM: x86: Enable KVM_CAP_USERFAULT_ON_MISSING Anish Moorthy
@ 2023-10-05  1:52   ` Sean Christopherson
  2023-11-01 22:55     ` Anish Moorthy
  0 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2023-10-05  1:52 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Fri, Sep 08, 2023, Anish Moorthy wrote:
> The relevant __gfn_to_pfn_memslot() calls in __kvm_faultin_pfn()
> already use MEMSLOT_ACCESS_NONATOMIC_MAY_UPGRADE.

--verbose

> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 2 +-
>  arch/x86/kvm/Kconfig           | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index c2eaacb6dc63..a74d721a18f6 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -7788,7 +7788,7 @@ error/annotated fault.
>  7.35 KVM_CAP_USERFAULT_ON_MISSING
>  ---------------------------------
>  
> -:Architectures: None
> +:Architectures: x86
>  :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
>  
>  The presence of this capability indicates that userspace may set the
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index ed90f148140d..11d956f17a9d 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
>  	select INTERVAL_TREE
>  	select HAVE_KVM_PM_NOTIFIER if PM
>  	select KVM_GENERIC_HARDWARE_ENABLING
> +        select HAVE_KVM_USERFAULT_ON_MISSING


Hmm, I vote to squash this patch with

  KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()

or rather, squash that code into this patch.  Ditto for the ARM patches.

Since we're making KVM_EXIT_MEMORY_FAULT informational-only for flows that KVM
isn't committing to supporting, I think it makes sense to enable the arch specific
paths that *need* to return KVM_EXIT_MEMORY_FAULT at the same time as the feature
that adds the requirement.

E.g. without HAVE_KVM_USERFAULT_ON_MISSING support, per the contract we are creating,
it would be totally fine for KVM to not use KVM_EXIT_MEMORY_FAULT for the page
fault paths.  And that obviously has to be the case since KVM_CAP_MEMORY_FAULT_INFO
is introduced before the arch code is enabled.

But as of this path, KVM *must* return KVM_EXIT_MEMORY_FAULT, so we should make
it impossible to do a straight revert of that dependency.  That should also help
with the changelogs, e.g. it'll give you a prompt for why only kvm_handle_error_pfn()
is getting treatment.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page()
  2023-09-08 22:28 ` [PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page() Anish Moorthy
  2023-09-14  8:04   ` kernel test robot
@ 2023-10-05  1:53   ` Sean Christopherson
  2023-10-05 23:03   ` Anish Moorthy
  2 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2023-10-05  1:53 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Fri, Sep 08, 2023, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for uaccess failures in
> kvm_vcpu_read/write_guest_page()

Why?  (rhetorical question)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-10-05  1:14   ` Sean Christopherson
@ 2023-10-05 18:45     ` Anish Moorthy
  2023-10-05 22:13       ` Sean Christopherson
  0 siblings, 1 reply; 53+ messages in thread
From: Anish Moorthy @ 2023-10-05 18:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Oct 4, 2023 at 6:14 PM Sean Christopherson <seanjc@google.com> wrote:
>
> And peeking at future patches, pass in the RWX flags as bools, that way this
> helper can deal with the bools=>flags conversion.  Oh, and fill the flags with
> bitwise ORs, that way future conflicts with private memory will be trivial to
> resolve.
>
> E.g.
>
> static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>                                                  gpa_t gpa, gpa_t size,
>                                                  bool is_write, bool is_exec)
> {
>         vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
>         vcpu->run->memory_fault.gpa = gpa;
>         vcpu->run->memory_fault.size = size;
>
>         vcpu->run->memory_fault.flags = 0;
>         if (is_write)
>                 vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
>         else if (is_exec)
>                 vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
>         else
>                 vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_READ;
> }

Is a BUG/VM_BUG_ON() warranted in the (is_write && is_exec) case do
you think? I see that user_mem_abort already VM_BUG_ON()s for this
case, but if there's one in the x86 page fault path I don't
immediately see it. Also this helper could be called from other paths,
so maybe there's some value in it.

> I'll send you a clean-ish patch to use as a starting point sometime next week.

You mean something containing the next spin of the guest memfd stuff?
Or parts of David's series as well?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-10-05  1:44   ` Sean Christopherson
@ 2023-10-05 18:58     ` Anish Moorthy
  2023-10-06  0:17       ` Sean Christopherson
  2023-11-01 21:53     ` Anish Moorthy
  1 sibling, 1 reply; 53+ messages in thread
From: Anish Moorthy @ 2023-10-05 18:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Oct 4, 2023 at 6:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Eh, the shortlog basically says "do work" with a lot of fancy words.  It really
> just boils down to:
>
>   KVM: Let callers of __gfn_to_pfn_memslot() opt-out of USERFAULT_ON_MISSING
>
> On Fri, Sep 08, 2023, Anish Moorthy wrote:
> > Change the "atomic" parameter of __gfn_to_pfn_memslot() to an enum which
>
> I've pushed back on more booleans multiple times, but IMO this is even worse.
> E.g. what does an "upgrade" to atomic even mean?

Oh, that bad huh? Based on what you mentioned earlier about some
callers of __gfn_to_pfn_memslot() needing to opt out of the memslot
flag, it seemed like there were basically three ways (wrt to @atomic)
that function needed to be called

1. With atomic = true
2. With atomic = false, and some way to make sure that's respected
whatever the memslot flag says
3. With atomic = false, but respecting the memslot flag (ie, changing
to atomic = true when it's set).

An "upgrade" in this context was meant to describe case 3 (when the
memslot flag is set). Anyways despite terminology issues, the idea of
an enum encapsulating those three cases seems like, essentially, the
right thing to do. Though of course, let me know if you disagree :)

> Since we have line of sight to getting out of boolean hell via David's series,
> just bite the bullet for now.  Deciphering the callers will suck, but not really
> anymore than it already sucks.

Sorry, what exactly are you suggesting via "bite the bullet" here?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-10-05 18:45     ` Anish Moorthy
@ 2023-10-05 22:13       ` Sean Christopherson
  0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2023-10-05 22:13 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Oct 05, 2023, Anish Moorthy wrote:
> On Wed, Oct 4, 2023 at 6:14 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > And peeking at future patches, pass in the RWX flags as bools, that way this
> > helper can deal with the bools=>flags conversion.  Oh, and fill the flags with
> > bitwise ORs, that way future conflicts with private memory will be trivial to
> > resolve.
> >
> > E.g.
> >
> > static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> >                                                  gpa_t gpa, gpa_t size,
> >                                                  bool is_write, bool is_exec)
> > {
> >         vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> >         vcpu->run->memory_fault.gpa = gpa;
> >         vcpu->run->memory_fault.size = size;
> >
> >         vcpu->run->memory_fault.flags = 0;
> >         if (is_write)
> >                 vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
> >         else if (is_exec)
> >                 vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
> >         else
> >                 vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_READ;
> > }
> 
> Is a BUG/VM_BUG_ON() warranted in the (is_write && is_exec) case do
> you think?

Nah, not here.  If we really wanted to add a sanity check, a KVM_MMU_WARN_ON()
in kvm_mmu_page_fault() would be the way to go.

> I see that user_mem_abort already VM_BUG_ON()s for this case, but if there's
> one in the x86 page fault path I don't immediately see it. Also this helper
> could be called from other paths, so maybe there's some value in it.
> 
> > I'll send you a clean-ish patch to use as a starting point sometime next week.
> 
> You mean something containing the next spin of the guest memfd stuff?
> Or parts of David's series as well?

guest_memfd stuff.  Effectively this patch, just massaged to splice together all
of the fixups.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page()
  2023-09-08 22:28 ` [PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page() Anish Moorthy
  2023-09-14  8:04   ` kernel test robot
  2023-10-05  1:53   ` Sean Christopherson
@ 2023-10-05 23:03   ` Anish Moorthy
  2 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-10-05 23:03 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

Dropping as per
https://lore.kernel.org/kvm/ZR88w9W62qsZDro-@google.com/. Take that,
kernel test robot!

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 06/17] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  2023-10-05  1:26   ` Sean Christopherson
@ 2023-10-05 23:57     ` Anish Moorthy
  2023-10-06  0:36       ` Sean Christopherson
  0 siblings, 1 reply; 53+ messages in thread
From: Anish Moorthy @ 2023-10-05 23:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Oct 4, 2023 at 6:26 PM Sean Christopherson <seanjc@google.com> wrote:
>
> ...isn't helpful because it doesn't provide any
> insight into the actual impact of the change, i.e. requires the reader to already
> know exactly how and when kvm_handle_error_pfn() is used.

Ah, true. Hopefully the following is better?

> KVM: x86: Annotate hva->hpa translation failures in stage-2 fault path
>
> Set up a KVM_EXIT_MEMORY_FAULT in cases where the stage-2 fault exit
> handler successfully translates a GPA to an HVA, but fails to translate
> the HVA to an HPA via GUP. This provides information to userspace which it
> could potentially use, for instance to resolve the failure.

Maybe the "via GUP" isn't really doing anything there

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-10-05 18:58     ` Anish Moorthy
@ 2023-10-06  0:17       ` Sean Christopherson
  2023-10-11 22:04         ` Anish Moorthy
  0 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2023-10-06  0:17 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Oct 05, 2023, Anish Moorthy wrote:
> On Wed, Oct 4, 2023 at 6:44 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Eh, the shortlog basically says "do work" with a lot of fancy words.  It really
> > just boils down to:
> >
> >   KVM: Let callers of __gfn_to_pfn_memslot() opt-out of USERFAULT_ON_MISSING
> >
> > On Fri, Sep 08, 2023, Anish Moorthy wrote:
> > > Change the "atomic" parameter of __gfn_to_pfn_memslot() to an enum which
> >
> > I've pushed back on more booleans multiple times, but IMO this is even worse.
> > E.g. what does an "upgrade" to atomic even mean?
> 
> Oh, that bad huh? Based on what you mentioned earlier about some
> callers of __gfn_to_pfn_memslot() needing to opt out of the memslot
> flag, it seemed like there were basically three ways (wrt to @atomic)
> that function needed to be called
> 
> 1. With atomic = true
> 2. With atomic = false, and some way to make sure that's respected
> whatever the memslot flag says
> 3. With atomic = false, but respecting the memslot flag (ie, changing
> to atomic = true when it's set).
> 
> An "upgrade" in this context was meant to describe case 3 (when the
> memslot flag is set). Anyways despite terminology issues, the idea of
> an enum encapsulating those three cases seems like, essentially, the
> right thing to do. Though of course, let me know if you disagree :)

The problem is that the three possibilities aren't directly related.  The existing
use of atomic truly means "this call can't sleep/block".  The userfault-on-missing
case has nothing to do with sleeping being illegal, the behavior of @atomic just
happens to align exactly with what is needed *today*.

E.g. if there was a flavor of gup() that could fault in memory without sleeping,
that could be used for the @atomic case but not the userfault-on-missing case.
I doubt such a variation will ever exist, but "that probably won't happen" isn't
a good reason to conflate the two things.

> > Since we have line of sight to getting out of boolean hell via David's series,
> > just bite the bullet for now.  Deciphering the callers will suck, but not really
> > anymore than it already sucks.
> 
> Sorry, what exactly are you suggesting via "bite the bullet" here?

Add another boolean, the "bool can_do_userfault" from the diff that got snipped.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 06/17] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  2023-10-05 23:57     ` Anish Moorthy
@ 2023-10-06  0:36       ` Sean Christopherson
  0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2023-10-06  0:36 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Oct 05, 2023, Anish Moorthy wrote:
> On Wed, Oct 4, 2023 at 6:26 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > ...isn't helpful because it doesn't provide any
> > insight into the actual impact of the change, i.e. requires the reader to already
> > know exactly how and when kvm_handle_error_pfn() is used.
> 
> Ah, true. Hopefully the following is better?
> 
> > KVM: x86: Annotate hva->hpa translation failures in stage-2 fault path
> >
> > Set up a KVM_EXIT_MEMORY_FAULT in cases where the stage-2 fault exit
> > handler successfully translates a GPA to an HVA, but fails to translate
> > the HVA to an HPA via GUP. This provides information to userspace which it
> > could potentially use, for instance to resolve the failure.
> 
> Maybe the "via GUP" isn't really doing anything there

Yeah, drop GUP.  Not all pfns are resolved via gup(), e.g. see hva_to_pfn_remapped().

Actually, I would hedge on the "HVA" part too.  It's an accurate statement for
both ARM and x86, but only because writes to readonly slots get treated as MMIO
(stupid option ROMs).

And what part of the translation fails isn't important, what's important is that
KVM is exiting to userspace with an error.  Using "can't resolve the pfn" is fine
and desirable, but the changelog should first focus on the important outcome,
which is that KVM is now providing more information when throwing an -EFAULT at
userspace.

Something like this?

  When handling a stage-2 fault, set up a KVM_EXIT_MEMORY_FAULT exit if KVM
  returns -EFAULT to userspace, e.g. if resolving the pfn fails for whatever
  reason.  Providing information about the unhandled fault will allow
  userspace to potentially fix the source of the error and avoid terminating
  the guest. 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-09-08 22:28 ` [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
  2023-10-05  1:14   ` Sean Christopherson
@ 2023-10-10 22:58   ` David Matlack
  2023-10-10 23:40     ` Sean Christopherson
  1 sibling, 1 reply; 53+ messages in thread
From: David Matlack @ 2023-10-10 22:58 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: seanjc, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata, kconsul

On Fri, Sep 8, 2023 at 3:30 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
> besides a return value of -1 and errno of EFAULT when a vCPU fails an
> access to guest memory which may be resolvable by userspace.
>
> Add documentation, updates to the KVM headers, and a helper function
> (kvm_handle_guest_uaccess_fault()) for implementing the capability.
>
> Mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86, even
> though EFAULT annotation are currently totally absent. Picking a point
> to declare the implementation "done" is difficult because
>
>   1. Annotations will be performed incrementally in subsequent commits
>      across both core and arch-specific KVM.
>   2. The initial series will very likely miss some cases which need
>      annotation. Although these omissions are to be fixed in the future,
>      userspace thus still needs to expect and be able to handle
>      unannotated EFAULTs.
>
> Given these qualifications, just marking it available here seems the
> least arbitrary thing to do.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
[...]
> +::
> +       union {
> +               /* KVM_SPEC_EXIT_MEMORY_FAULT */
> +               struct {
> +                       __u64 flags;
> +                       __u64 gpa;
> +                       __u64 len; /* in bytes */

I wonder if `gpa` and `len` should just be replaced with `gfn`.

- We don't seem to care about returning an exact `gpa` out to
userspace since this series just returns gpa = gfn * PAGE_SIZE out to
userspace.
- The len we return seems kind of arbitrary. PAGE_SIZE on x86 and
vma_pagesize on ARM64. But at the end of the day we're not asking the
kernel to fault in any specific length of mapping. We're just asking
for gfn-to-pfn for a specific gfn.
- I'm not sure userspace will want to do anything with this information.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 07/17] KVM: arm64: Annotate -EFAULT from user_mem_abort()
  2023-09-08 22:28 ` [PATCH v5 07/17] KVM: arm64: Annotate -EFAULT from user_mem_abort() Anish Moorthy
  2023-09-28 21:42   ` Anish Moorthy
  2023-10-05  1:26   ` Sean Christopherson
@ 2023-10-10 23:01   ` David Matlack
  2 siblings, 0 replies; 53+ messages in thread
From: David Matlack @ 2023-10-10 23:01 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: seanjc, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata, kconsul

On Fri, Sep 8, 2023 at 3:30 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> Implement KVM_CAP_MEMORY_FAULT_INFO for guest access failure in
> user_mem_abort().
>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  arch/arm64/kvm/mmu.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 587a104f66c3..8ede6c5edc5f 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1408,6 +1408,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         long vma_pagesize, fault_granule;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>         struct kvm_pgtable *pgt;
> +       uint64_t memory_fault_flags;
>
>         fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
>         write_fault = kvm_is_write_fault(vcpu);
> @@ -1507,8 +1508,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                 kvm_send_hwpoison_signal(hva, vma_shift);
>                 return 0;
>         }
> -       if (is_error_noslot_pfn(pfn))
> +       if (is_error_noslot_pfn(pfn)) {
> +               memory_fault_flags = 0;
> +               if (write_fault)
> +                       memory_fault_flags = KVM_MEMORY_FAULT_FLAG_EXEC;
> +               else if (exec_fault)
> +                       memory_fault_flags = KVM_MEMORY_FAULT_FLAG_EXEC;
> +               else
> +                       memory_fault_flags = KVM_MEMORY_FAULT_FLAG_READ;
> +               kvm_handle_guest_uaccess_fault(vcpu, round_down(gfn * PAGE_SIZE, vma_pagesize),

I think gfn * PAGE_SIZE is already rounded down to vma_pagesize. See
earlier in this function:

1484         vma_pagesize = 1UL << vma_shift;
1485         if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
1486                 fault_ipa &= ~(vma_pagesize - 1);
1487
1488         gfn = fault_ipa >> PAGE_SHIFT;


> +                                              vma_pagesize, memory_fault_flags);
>                 return -EFAULT;
> +       }
>
>         if (kvm_is_device_pfn(pfn)) {
>                 /*
> --
> 2.42.0.283.g2d96d420d3-goog
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 09/17] KVM: Introduce KVM_CAP_USERFAULT_ON_MISSING without implementation
  2023-09-08 22:28 ` [PATCH v5 09/17] KVM: Introduce KVM_CAP_USERFAULT_ON_MISSING without implementation Anish Moorthy
@ 2023-10-10 23:16   ` David Matlack
  2023-10-11 17:54     ` Anish Moorthy
  0 siblings, 1 reply; 53+ messages in thread
From: David Matlack @ 2023-10-10 23:16 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: seanjc, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata, kconsul

On Fri, Sep 8, 2023 at 3:30 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> Add documentation, memslot flags, useful helper functions, and the
> definition of the capability. Implementation is provided in a subsequent
> commit.
>
> Memory fault exits on absent mappings are particularly useful for
> userfaultfd-based postcopy live migration, where contention within uffd
> can lead to slowness When many vCPUs fault on a single uffd/vma.
> Bypassing the uffd entirely by returning information directly to the
> vCPU via an exit avoids contention and can greatly improves the fault
> rate.
>
> Suggested-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 28 +++++++++++++++++++++++++---
>  include/linux/kvm_host.h       |  9 +++++++++
>  include/uapi/linux/kvm.h       |  2 ++
>  tools/include/uapi/linux/kvm.h |  1 +
>  virt/kvm/Kconfig               |  3 +++
>  virt/kvm/kvm_main.c            |  5 +++++
>  6 files changed, 45 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 92fd3faa6bab..c2eaacb6dc63 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
>    /* for kvm_userspace_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
>    #define KVM_MEM_READONLY     (1UL << 1)
> +  #define KVM_MEM_USERFAULT_ON_MISSING  (1UL << 2)

Bike Shedding! Maybe KVM_MEM_EXIT_ON_MISSING? "Exiting" has concrete
meaning in the KVM UAPI whereas "userfault" doesn't and could suggest
going through userfaultfd, which is the opposite of what this
capability is doing.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-10-10 22:58   ` David Matlack
@ 2023-10-10 23:40     ` Sean Christopherson
  2023-10-16 17:07       ` David Matlack
  0 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2023-10-10 23:40 UTC (permalink / raw)
  To: David Matlack
  Cc: Anish Moorthy, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata, kconsul

On Tue, Oct 10, 2023, David Matlack wrote:
> On Fri, Sep 8, 2023 at 3:30 PM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
> > besides a return value of -1 and errno of EFAULT when a vCPU fails an
> > access to guest memory which may be resolvable by userspace.
> >
> > Add documentation, updates to the KVM headers, and a helper function
> > (kvm_handle_guest_uaccess_fault()) for implementing the capability.
> >
> > Mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86, even
> > though EFAULT annotation are currently totally absent. Picking a point
> > to declare the implementation "done" is difficult because
> >
> >   1. Annotations will be performed incrementally in subsequent commits
> >      across both core and arch-specific KVM.
> >   2. The initial series will very likely miss some cases which need
> >      annotation. Although these omissions are to be fixed in the future,
> >      userspace thus still needs to expect and be able to handle
> >      unannotated EFAULTs.
> >
> > Given these qualifications, just marking it available here seems the
> > least arbitrary thing to do.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > ---
> [...]
> > +::
> > +       union {
> > +               /* KVM_SPEC_EXIT_MEMORY_FAULT */
> > +               struct {
> > +                       __u64 flags;
> > +                       __u64 gpa;
> > +                       __u64 len; /* in bytes */
> 
> I wonder if `gpa` and `len` should just be replaced with `gfn`.
> 
> - We don't seem to care about returning an exact `gpa` out to
> userspace since this series just returns gpa = gfn * PAGE_SIZE out to
> userspace.
> - The len we return seems kind of arbitrary. PAGE_SIZE on x86 and
> vma_pagesize on ARM64. But at the end of the day we're not asking the
> kernel to fault in any specific length of mapping. We're just asking
> for gfn-to-pfn for a specific gfn.
> - I'm not sure userspace will want to do anything with this information.

Extending ABI is tricky.  E.g. if a use case comes along that needs/wants to
return a range, then we'd need to add a flag and also update userspace to actually
do the right thing.

The page fault path doesn't need such information because hardware gives a very
precise faulting address.  But if we ever get to a point where KVM provides info
for uaccess failures, then we'll likely want to provide the range.  E.g. if a
uaccess splits a page, on x86, we'd either need to register our own exception
fixup and use custom uaccess macros (eww), or convice the world that extending
ex_handler_uaccess() and all of the uaccess macros that they need to provide the
exact address that failed.

And for SNP and TDX, I believe the range will be used when the guest uses a
hardware-vendor-defined hypercall to request conversions between private and
shared.  Or maybe the plan is to funnel those into KVM_HC_MAP_GPA_RANGE?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 09/17] KVM: Introduce KVM_CAP_USERFAULT_ON_MISSING without implementation
  2023-10-10 23:16   ` David Matlack
@ 2023-10-11 17:54     ` Anish Moorthy
  2023-10-16 19:38       ` Sean Christopherson
  0 siblings, 1 reply; 53+ messages in thread
From: Anish Moorthy @ 2023-10-11 17:54 UTC (permalink / raw)
  To: David Matlack
  Cc: seanjc, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata, kconsul

> Bike Shedding! Maybe KVM_MEM_EXIT_ON_MISSING? "Exiting" has concrete
> meaning in the KVM UAPI whereas "userfault" doesn't and could suggest
> going through userfaultfd, which is the opposite of what this
> capability is doing.

You know, in the three or four names this thing has had, I'm not sure
if "exit" has ever appeared :D

It is accurate, which is a definite plus. But since the exit in
question is special due to accompanying EFAULT, I think we've been
trying to reflect that in the nomenclature ("memory faults" or
"userfault")- maybe that's not worth doing though.

Wrt the current name, I agree w/ you on the potential for userfaultfd
confusion but I sort of see Sean's argument as well [1]. I see you've
re-raised the question of the exit accompanying EFAULT in [2] though,
so we should probably resolve that first.

[1] https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@google.com/T/#t
[2] https://lore.kernel.org/kvm/CALzav=csPcd3f5CYc=6Fa4JnsYP8UTVeSex0-7LvUBnTDpHxLQ@mail.gmail.com/

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-10-06  0:17       ` Sean Christopherson
@ 2023-10-11 22:04         ` Anish Moorthy
  0 siblings, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-10-11 22:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Oct 5, 2023 at 5:17 PM Sean Christopherson <seanjc@google.com> wrote:
>
> > Sorry, what exactly are you suggesting via "bite the bullet" here?
>
> Add another boolean, the "bool can_do_userfault" from the diff that got snipped.

Whoops, somehow mistook that for my patch being quoted back to me :/

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-10-10 23:40     ` Sean Christopherson
@ 2023-10-16 17:07       ` David Matlack
  2023-10-16 19:14         ` Sean Christopherson
  0 siblings, 1 reply; 53+ messages in thread
From: David Matlack @ 2023-10-16 17:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Anish Moorthy, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata, kconsul

On Tue, Oct 10, 2023 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 10, 2023, David Matlack wrote:
> > On Fri, Sep 8, 2023 at 3:30 PM Anish Moorthy <amoorthy@google.com> wrote:
> > >
> > > +::
> > > +       union {
> > > +               /* KVM_SPEC_EXIT_MEMORY_FAULT */
> > > +               struct {
> > > +                       __u64 flags;
> > > +                       __u64 gpa;
> > > +                       __u64 len; /* in bytes */
> >
> > I wonder if `gpa` and `len` should just be replaced with `gfn`.
> >
> > - We don't seem to care about returning an exact `gpa` out to
> > userspace since this series just returns gpa = gfn * PAGE_SIZE out to
> > userspace.
> > - The len we return seems kind of arbitrary. PAGE_SIZE on x86 and
> > vma_pagesize on ARM64. But at the end of the day we're not asking the
> > kernel to fault in any specific length of mapping. We're just asking
> > for gfn-to-pfn for a specific gfn.
> > - I'm not sure userspace will want to do anything with this information.
>
> Extending ABI is tricky.  E.g. if a use case comes along that needs/wants to
> return a range, then we'd need to add a flag and also update userspace to actually
> do the right thing.
>
> The page fault path doesn't need such information because hardware gives a very
> precise faulting address.  But if we ever get to a point where KVM provides info
> for uaccess failures, then we'll likely want to provide the range.  E.g. if a
> uaccess splits a page, on x86, we'd either need to register our own exception
> fixup and use custom uaccess macros (eww), or convice the world that extending
> ex_handler_uaccess() and all of the uaccess macros that they need to provide the
> exact address that failed.

I wonder if userspace might need a precise fault address in some
situations? e.g. If KVM returns -HWPOISON for an access that spans a
page boundary, userspace won't know which is poisoned. Maybe SNP/TDX
need precise fault addresses as well? I don't know enough about how
SNP and TDX plan to use this UAPI.


>
> And for SNP and TDX, I believe the range will be used when the guest uses a
> hardware-vendor-defined hypercall to request conversions between private and
> shared.  Or maybe the plan is to funnel those into KVM_HC_MAP_GPA_RANGE?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-10-16 17:07       ` David Matlack
@ 2023-10-16 19:14         ` Sean Christopherson
  0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2023-10-16 19:14 UTC (permalink / raw)
  To: David Matlack
  Cc: Anish Moorthy, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata, kconsul

On Mon, Oct 16, 2023, David Matlack wrote:
> On Tue, Oct 10, 2023 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Oct 10, 2023, David Matlack wrote:
> > > On Fri, Sep 8, 2023 at 3:30 PM Anish Moorthy <amoorthy@google.com> wrote:
> > > >
> > > > +::
> > > > +       union {
> > > > +               /* KVM_SPEC_EXIT_MEMORY_FAULT */
> > > > +               struct {
> > > > +                       __u64 flags;
> > > > +                       __u64 gpa;
> > > > +                       __u64 len; /* in bytes */
> > >
> > > I wonder if `gpa` and `len` should just be replaced with `gfn`.
> > >
> > > - We don't seem to care about returning an exact `gpa` out to
> > > userspace since this series just returns gpa = gfn * PAGE_SIZE out to
> > > userspace.
> > > - The len we return seems kind of arbitrary. PAGE_SIZE on x86 and
> > > vma_pagesize on ARM64. But at the end of the day we're not asking the
> > > kernel to fault in any specific length of mapping. We're just asking
> > > for gfn-to-pfn for a specific gfn.
> > > - I'm not sure userspace will want to do anything with this information.
> >
> > Extending ABI is tricky.  E.g. if a use case comes along that needs/wants to
> > return a range, then we'd need to add a flag and also update userspace to actually
> > do the right thing.
> >
> > The page fault path doesn't need such information because hardware gives a very
> > precise faulting address.  But if we ever get to a point where KVM provides info
> > for uaccess failures, then we'll likely want to provide the range.  E.g. if a
> > uaccess splits a page, on x86, we'd either need to register our own exception
> > fixup and use custom uaccess macros (eww), or convice the world that extending
> > ex_handler_uaccess() and all of the uaccess macros that they need to provide the
> > exact address that failed.
> 
> I wonder if userspace might need a precise fault address in some
> situations? e.g. If KVM returns -HWPOISON for an access that spans a
> page boundary, userspace won't know which is poisoned.

As things currently stand, the -EHWPOISON case is guaranteed to be precise because
uaccess failures only ever return -EFAULT.  The resulting BUS_MCEERR_AR from the
kernel's #MC handler will provide the necessary precision to userspace.

Though even if -EHWPOISON were imprecise, userspace should be able to figure out
which page is poisoned, e.g. by probing each possible page (gross, but doable).

Ah, and a much more concrete reason to report gpa+len is that it's possible that
KVM may someday support faults at sub-page granularity, e.g. if something like
HEKI[*] wants to use Intel's Sub-Page Write Permissions to make a minimal amount
of guest code writable when the guest kernel is doing code patching.

> Maybe SNP/TDX need precise fault addresses as well? I don't know enough about
> how SNP and TDX plan to use this UAPI.

FWIW, SNP and TDX usage are limited to the KVM page fault path, i.e. always do
precise, single-page reporting.

[*] https://lore.kernel.org/all/20230505152046.6575-1-mic@digikod.net

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 09/17] KVM: Introduce KVM_CAP_USERFAULT_ON_MISSING without implementation
  2023-10-11 17:54     ` Anish Moorthy
@ 2023-10-16 19:38       ` Sean Christopherson
  0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2023-10-16 19:38 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: David Matlack, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata, kconsul

On Wed, Oct 11, 2023, Anish Moorthy wrote:
> > Bike Shedding! Maybe KVM_MEM_EXIT_ON_MISSING? "Exiting" has concrete
> > meaning in the KVM UAPI whereas "userfault" doesn't and could suggest
> > going through userfaultfd, which is the opposite of what this
> > capability is doing.
> 
> You know, in the three or four names this thing has had, I'm not sure
> if "exit" has ever appeared :D
> 
> It is accurate, which is a definite plus. But since the exit in
> question is special due to accompanying EFAULT, I think we've been
> trying to reflect that in the nomenclature ("memory faults" or
> "userfault")- maybe that's not worth doing though.

Heh, KVM's uAPI surface is so large that there's almost always both an example
and a counter-example.  E.g. KVM_CAP_EXIT_ON_EMULATION_FAILURE forces emulation
failures to exit to userspace, whereas KVM_CAP_X86_DISABLE_EXITS disables exits
from guest=>KVM.  The latter is why I've shied away from "EXIT", but I think that
I'm looking at the name too much through the lens of a KVM developer, and not
considering how it will be read by KVM users.

So yeah, I agree that KVM_MEM_EXIT_ON_MISSING and KVM_CAP_EXIT_ON_MISSING are
better.  Let's go with that.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-10-05  1:44   ` Sean Christopherson
  2023-10-05 18:58     ` Anish Moorthy
@ 2023-11-01 21:53     ` Anish Moorthy
  2023-11-01 22:03       ` Sean Christopherson
  2023-11-02 19:14       ` Anish Moorthy
  1 sibling, 2 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-11-01 21:53 UTC (permalink / raw)
  To: Sean Christopherson, David Matlack
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Wed, Oct 4, 2023 at 6:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Eh, the shortlog basically says "do work" with a lot of fancy words.  It really
> just boils down to:
>
>   KVM: Let callers of __gfn_to_pfn_memslot() opt-out of USERFAULT_ON_MISSING

Proposed commit message for v6:

> KVM: Implement KVM_CAP EXIT_ON_MISSING by checking memslot flag in __gfn_to_pfn_memslot()
>
> When the slot flag is enabled, forbid __gfn_to_pfn_memslot() from
> faulting in pages for which mappings are absent. However, some callers of
> __gfn_to_pfn_memslot() (such as kvm_vcpu_map()) must be able to opt out
> of this behavior: allow doing so via the new can_exit_on_missing
> parameter.

Although separately, I don't think the parameter should be named
can_exit_on_missing (or, as you suggested, can_do_userfault)-
__gfn_to_pfn_memslot() shouldn't know or care how its callers are
setting up KVM exits, after all.

I think it makes sense to rename the new parameter and, for the same
reasoning, the memslot flag to "forbid_fault_on_missing" and
KVM_MEM_FORBID_FAULT_ON_MISSING respectively. Objections?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-11-01 21:53     ` Anish Moorthy
@ 2023-11-01 22:03       ` Sean Christopherson
  2023-11-01 22:25         ` Anish Moorthy
  2023-11-02 19:14       ` Anish Moorthy
  1 sibling, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2023-11-01 22:03 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: David Matlack, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Nov 01, 2023, Anish Moorthy wrote:
> On Wed, Oct 4, 2023 at 6:44 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Eh, the shortlog basically says "do work" with a lot of fancy words.  It really
> > just boils down to:
> >
> >   KVM: Let callers of __gfn_to_pfn_memslot() opt-out of USERFAULT_ON_MISSING
> 
> Proposed commit message for v6:
> 
> > KVM: Implement KVM_CAP EXIT_ON_MISSING by checking memslot flag in __gfn_to_pfn_memslot()
> >
> > When the slot flag is enabled, forbid __gfn_to_pfn_memslot() from
> > faulting in pages for which mappings are absent. However, some callers of
> > __gfn_to_pfn_memslot() (such as kvm_vcpu_map()) must be able to opt out
> > of this behavior: allow doing so via the new can_exit_on_missing
> > parameter.
> 
> Although separately, I don't think the parameter should be named
> can_exit_on_missing (or, as you suggested, can_do_userfault)-
> __gfn_to_pfn_memslot() shouldn't know or care how its callers are
> setting up KVM exits, after all.

Why not?  __gfn_to_pfn_memslot() gets passed all kinds of constraints, I don't
see how "I can't handle exits to userspace" is any different.

> I think it makes sense to rename the new parameter and, for the same
> reasoning, the memslot flag to "forbid_fault_on_missing" and
> KVM_MEM_FORBID_FAULT_ON_MISSING respectively. Objections?

Yes.  I very strongly prefer KVM_MEM_EXIT_ON_MISSING.  As David pointed out, KVM
already has established nomenclature around exit, i.e. "exit on" should be quite
intuitive for most KVM developers.

"Forbid fault" is rather nonsensical because a fault has already happened.  The
confusion between "page fault VM-Exit" and "faulting in memory in the host" is
the main reason we wandered away from anything with "fault" in the name.

That said, I do agree that can_do_userfault doesn't work with KVM_MEM_EXIT_ON_MISSING.
Maybe just a boring can_exit_on_missing?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-11-01 22:03       ` Sean Christopherson
@ 2023-11-01 22:25         ` Anish Moorthy
  2023-11-01 22:39           ` David Matlack
  2023-11-01 22:42           ` Sean Christopherson
  0 siblings, 2 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-11-01 22:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Matlack, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Nov 1, 2023 at 3:03 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Nov 01, 2023, Anish Moorthy wrote:
> > On Wed, Oct 4, 2023 at 6:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Eh, the shortlog basically says "do work" with a lot of fancy words.  It really
> > > just boils down to:
> > >
> > >   KVM: Let callers of __gfn_to_pfn_memslot() opt-out of USERFAULT_ON_MISSING
> >
> > Proposed commit message for v6:
> >
> > > KVM: Implement KVM_CAP EXIT_ON_MISSING by checking memslot flag in __gfn_to_pfn_memslot()
> > >
> > > When the slot flag is enabled, forbid __gfn_to_pfn_memslot() from
> > > faulting in pages for which mappings are absent. However, some callers of
> > > __gfn_to_pfn_memslot() (such as kvm_vcpu_map()) must be able to opt out
> > > of this behavior: allow doing so via the new can_exit_on_missing
> > > parameter.
> >
> > Although separately, I don't think the parameter should be named
> > can_exit_on_missing (or, as you suggested, can_do_userfault)-
> > __gfn_to_pfn_memslot() shouldn't know or care how its callers are
> > setting up KVM exits, after all.
>
> Why not?  __gfn_to_pfn_memslot() gets passed all kinds of constraints, I don't
> see how "I can't handle exits to userspace" is any different.

Well the thing is that __gfn_to_pfn_memslot() is orthogonal to KVM
exits. Its callers are just using it to try resolving a pfn, and what
they do with the results is up to them.

Put more concretely, __gfn_to_pfn_memslot() has many callers of which
only two (the stage-2 fault handlers) actually use it to set up a KVM
exit- how does a parameter named "can_exit_on_missing" make sense to
its callers in general? If it were __gfn_to_pfn_memslot() itself that
was populating the run struct in response to absent mappings then I
would agree that the name was appropriate- but that's not what's going
on here.

(side note, I'll assume that aside from the current naming discussion
the commit message I proposed is fine)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-11-01 22:25         ` Anish Moorthy
@ 2023-11-01 22:39           ` David Matlack
  2023-11-01 22:42           ` Sean Christopherson
  1 sibling, 0 replies; 53+ messages in thread
From: David Matlack @ 2023-11-01 22:39 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Sean Christopherson, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Nov 1, 2023 at 3:26 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Wed, Nov 1, 2023 at 3:03 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Nov 01, 2023, Anish Moorthy wrote:
> > > On Wed, Oct 4, 2023 at 6:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > Eh, the shortlog basically says "do work" with a lot of fancy words.  It really
> > > > just boils down to:
> > > >
> > > >   KVM: Let callers of __gfn_to_pfn_memslot() opt-out of USERFAULT_ON_MISSING
> > >
> > > Proposed commit message for v6:
> > >
> > > > KVM: Implement KVM_CAP EXIT_ON_MISSING by checking memslot flag in __gfn_to_pfn_memslot()
> > > >
> > > > When the slot flag is enabled, forbid __gfn_to_pfn_memslot() from
> > > > faulting in pages for which mappings are absent. However, some callers of
> > > > __gfn_to_pfn_memslot() (such as kvm_vcpu_map()) must be able to opt out
> > > > of this behavior: allow doing so via the new can_exit_on_missing
> > > > parameter.
> > >
> > > Although separately, I don't think the parameter should be named
> > > can_exit_on_missing (or, as you suggested, can_do_userfault)-
> > > __gfn_to_pfn_memslot() shouldn't know or care how its callers are
> > > setting up KVM exits, after all.
> >
> > Why not?  __gfn_to_pfn_memslot() gets passed all kinds of constraints, I don't
> > see how "I can't handle exits to userspace" is any different.
>
> Well the thing is that __gfn_to_pfn_memslot() is orthogonal to KVM
> exits. Its callers are just using it to try resolving a pfn, and what
> they do with the results is up to them.
>
> Put more concretely, __gfn_to_pfn_memslot() has many callers of which
> only two (the stage-2 fault handlers) actually use it to set up a KVM
> exit- how does a parameter named "can_exit_on_missing" make sense to
> its callers in general?

"exit" is shorthand for "return_to_userspace" and all KVM code runs in
the context of ioctls (which eventually return to userspace). So I
think it makes sense in general.

And FWIW I found the MEMSLOT_ACCESS_ enum hard to read and agree a
boolean would be simpler.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-11-01 22:25         ` Anish Moorthy
  2023-11-01 22:39           ` David Matlack
@ 2023-11-01 22:42           ` Sean Christopherson
  1 sibling, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2023-11-01 22:42 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: David Matlack, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Nov 01, 2023, Anish Moorthy wrote:
> On Wed, Nov 1, 2023 at 3:03 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Nov 01, 2023, Anish Moorthy wrote:
> > > On Wed, Oct 4, 2023 at 6:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > Eh, the shortlog basically says "do work" with a lot of fancy words.  It really
> > > > just boils down to:
> > > >
> > > >   KVM: Let callers of __gfn_to_pfn_memslot() opt-out of USERFAULT_ON_MISSING
> > >
> > > Proposed commit message for v6:
> > >
> > > > KVM: Implement KVM_CAP EXIT_ON_MISSING by checking memslot flag in __gfn_to_pfn_memslot()
> > > >
> > > > When the slot flag is enabled, forbid __gfn_to_pfn_memslot() from
> > > > faulting in pages for which mappings are absent. However, some callers of
> > > > __gfn_to_pfn_memslot() (such as kvm_vcpu_map()) must be able to opt out
> > > > of this behavior: allow doing so via the new can_exit_on_missing
> > > > parameter.
> > >
> > > Although separately, I don't think the parameter should be named
> > > can_exit_on_missing (or, as you suggested, can_do_userfault)-
> > > __gfn_to_pfn_memslot() shouldn't know or care how its callers are
> > > setting up KVM exits, after all.
> >
> > Why not?  __gfn_to_pfn_memslot() gets passed all kinds of constraints, I don't
> > see how "I can't handle exits to userspace" is any different.
> 
> Well the thing is that __gfn_to_pfn_memslot() is orthogonal to KVM
> exits. Its callers are just using it to try resolving a pfn, and what
> they do with the results is up to them.

But "how" the pfn is resolved is the business of the caller and of __gfn_to_pfn_memslot().
This already exits in the form of @async and @atomic, which respectively say
"don't wait on I/O" and "can't sleep, period".  The @async name is confusing,
but David Steven's series is planning on replacing that with the much more literal
FOLL_NOWAIT (IIRC).

> Put more concretely, __gfn_to_pfn_memslot() has many callers of which
> only two (the stage-2 fault handlers) actually use it to set up a KVM
> exit- how does a parameter named "can_exit_on_missing" make sense to
> its callers in general? 

It's a flag that says "I can't exit right now, please ignore exit_on_missing".

> If it were __gfn_to_pfn_memslot() itself that was populating the run struct
> in response to absent mappings then I would agree that the name was
> appropriate- but that's not what's going on here.
> 
> (side note, I'll assume that aside from the current naming discussion
> the commit message I proposed is fine)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 11/17] KVM: x86: Enable KVM_CAP_USERFAULT_ON_MISSING
  2023-10-05  1:52   ` Sean Christopherson
@ 2023-11-01 22:55     ` Anish Moorthy
  2023-11-02 14:31       ` Sean Christopherson
  0 siblings, 1 reply; 53+ messages in thread
From: Anish Moorthy @ 2023-11-01 22:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Oct 4, 2023 at 6:52 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Sep 08, 2023, Anish Moorthy wrote:
> > The relevant __gfn_to_pfn_memslot() calls in __kvm_faultin_pfn()
> > already use MEMSLOT_ACCESS_NONATOMIC_MAY_UPGRADE.
>
> --verbose

Alright, how about the following?

    KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING

    __gfn_to_pfn_memslot(), used by the stage-2 fault handler, already
    checks the memslot flag to avoid faulting in missing pages as required.
    Therefore, enabling the capability is just a matter of selecting the kconfig
    to allow the flag to actually be set.

> Hmm, I vote to squash this patch with
>
>   KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
>
> or rather, squash that code into this patch.  Ditto for the ARM patches.
>
> Since we're making KVM_EXIT_MEMORY_FAULT informational-only for flows that KVM
> isn't committing to supporting, I think it makes sense to enable the arch specific
> paths that *need* to return KVM_EXIT_MEMORY_FAULT at the same time as the feature
> that adds the requirement.
>
> E.g. without HAVE_KVM_USERFAULT_ON_MISSING support, per the contract we are creating,
> it would be totally fine for KVM to not use KVM_EXIT_MEMORY_FAULT for the page
> fault paths.  And that obviously has to be the case since KVM_CAP_MEMORY_FAULT_INFO
> is introduced before the arch code is enabled.
>
> But as of this path, KVM *must* return KVM_EXIT_MEMORY_FAULT, so we should make
> it impossible to do a straight revert of that dependency.

Should we really be concerned about people reverting the
KVM_CAP_MEMORY_FAULT_INFO commit(s) in this way? I see what you're
saying- but it also seems to me that KVM could add other things akin
to KVM_CAP_EXIT_ON_MISSING in the future, and then we end up in the
exact same situation. Sure the squash might make sense for the
specific commits in the series, but there's a general issue that isn't
really being solved.

Maybe I'm just letting the better be the enemy of the best, but I do
like the current separation/single-focus of the commits. That said,
the squash is nbd and I can go ahead if you're not convinced.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 11/17] KVM: x86: Enable KVM_CAP_USERFAULT_ON_MISSING
  2023-11-01 22:55     ` Anish Moorthy
@ 2023-11-02 14:31       ` Sean Christopherson
  0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2023-11-02 14:31 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Wed, Nov 01, 2023, Anish Moorthy wrote:
> On Wed, Oct 4, 2023 at 6:52 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Sep 08, 2023, Anish Moorthy wrote:
> > > The relevant __gfn_to_pfn_memslot() calls in __kvm_faultin_pfn()
> > > already use MEMSLOT_ACCESS_NONATOMIC_MAY_UPGRADE.
> >
> > --verbose
> 
> Alright, how about the following?
> 
>     KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING
> 
>     __gfn_to_pfn_memslot(), used by the stage-2 fault handler, already
>     checks the memslot flag to avoid faulting in missing pages as required.
>     Therefore, enabling the capability is just a matter of selecting the kconfig
>     to allow the flag to actually be set.
> 
> > Hmm, I vote to squash this patch with
> >
> >   KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
> >
> > or rather, squash that code into this patch.  Ditto for the ARM patches.
> >
> > Since we're making KVM_EXIT_MEMORY_FAULT informational-only for flows that KVM
> > isn't committing to supporting, I think it makes sense to enable the arch specific
> > paths that *need* to return KVM_EXIT_MEMORY_FAULT at the same time as the feature
> > that adds the requirement.
> >
> > E.g. without HAVE_KVM_USERFAULT_ON_MISSING support, per the contract we are creating,
> > it would be totally fine for KVM to not use KVM_EXIT_MEMORY_FAULT for the page
> > fault paths.  And that obviously has to be the case since KVM_CAP_MEMORY_FAULT_INFO
> > is introduced before the arch code is enabled.
> >
> > But as of this path, KVM *must* return KVM_EXIT_MEMORY_FAULT, so we should make
> > it impossible to do a straight revert of that dependency.
> 
> Should we really be concerned about people reverting the
> KVM_CAP_MEMORY_FAULT_INFO commit(s) in this way?

Yes.  A revert is highly unlikely, but possible.  Keep in mind that with upstream
KVM, there are a *lot* of end users that don't know the inner workings of KVM
(or the kernel).  When things break, the standard approach for such users is to
bisect to find where things first broke, and then to try reverting the suspected
commit to see if that fixes the problem.

In the (again, highly unlikely) case that filling run->memory_fault breaks something,
an unsuspecting user will bisect to that commit and revert.  Then they're suddenly
in a situation where they've unknowing broken KVM, and likely in a way that won't
immediately fail.

*Nothing* in either changelog gives any clue that there is a hard dependency.
Even the slightly more verbose version above provides almost no indication of the
real dependency, as it primarily talks about __gfn_to_pfn_memslot() and
KVM_CAP_EXIT_ON_MISSING.
 
And now that we've punted on trying annotate "everything", there's no sane way
for the "Annotate -EFAULTs from kvm_handle_error_pfn()" changelog to communicate
that it will have a hard dependency in the future.  If the changelog says "this
will be used by blah blah blah", then it raises the question of "well why isn't
this added there?".

And the patches are tiny.  Having a final "enable and advertise XYZ" patch is
relatively common for new features, but that's often because the enabling of the
new feature is spread across multiple patches.  E.g. see the LAM support sitting
in "https://github.com/kvm-x86/linux.git lam".  And in those cases, the hard
dependency is always very obvious, e.g. if someone complained that reverting
"Virtualize LAM for user pointer" but not "Advertise and enable LAM (user and
supervisor)" caused problems, we'd be like, "well, yeah".

> I see what you're> saying- but it also seems to me that KVM could add other
> things akin to KVM_CAP_EXIT_ON_MISSING in the future, and then we end up in
> the exact same situation.

No, it's not the exact same situation.

First and foremost, we *can't* solve that problem, because some future feature
doesn't exist yet.

Second, merging features into two different kernel releases creates a very different
situation.  Let's say this goes into kernel 6.9, and then some future feature
lands in kernel 6.11 and has a hard dependency on the annotations.  The odds of
needing to revert a patch from 6.9 while upgrading to 6.11 are significnatly lower
than the odds of needing to revert a patch from 6.9-rc1 between when rc1 is released
and a user upgrades to 6.9.  And users aren't stupid; they might not know the inner
workings of KVM, but bisecting to a patch from 6.9 when upgrading to 6.11 would
give them pause.

Third, with the patches squashed, to revert the annotations, a person would also
be reverting _this_ patch (because they'd be one and the same).  At that point,
they're no longer reverting a seemingly innocous "give userspace more info" commit,
they're reverting something that clearly advertises a feature userspace, which
again provides a clue that a straight revert might not be super safe.

> Sure the squash might make sense for the specific commits in the series, but
> there's a general issue that isn't really being solved.

Patches within a series and future series are two very different things.

> Maybe I'm just letting the better be the enemy of the best,

The "best" isn't even possible, unless we never ship anything, ever.

> but I do like the current separation/single-focus of the commits.

Because you already know what the entire series does.  Someone looking at this
patch without already understanding the big picture is going to have no idea that
these two patches are tightly coupled.

And again, now that we've punted on annotating everything, I see zero reason to
split the patches.  Maybe you could argue it provides a new bisection point, but
again the patches are _tiny_, and that same bisection point is effectively achieved
by running with an "older" userspace, i.e. a userspace that doesn't utilize the
KVM_MEM_EXIT_ON_MISSING, which is literally every userspace in existence right now.

> That said, the squash is nbd and I can go ahead if you're not convinced.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-11-01 21:53     ` Anish Moorthy
  2023-11-01 22:03       ` Sean Christopherson
@ 2023-11-02 19:14       ` Anish Moorthy
  2023-11-02 20:25         ` Anish Moorthy
  2023-11-03 20:05         ` Sean Christopherson
  1 sibling, 2 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-11-02 19:14 UTC (permalink / raw)
  To: Sean Christopherson, David Matlack
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

I'm going to squash this into patch 9: the fact that it's separated is
a holdover from when the memslot flag check lived in arch-specific
code.

Proposed commit message for the squashed commit

> KVM: Add KVM_CAP_EXIT_ON_MISSING which forbids page faults in stage-2 fault handlers

> Faulting in pages via the stage-2 fault handlers can be undesirable in
> the context of userfaultfd-based postcopy: contention for userfaultfd's
> internal wait queues can cause significant slowness when delivering
> faults to userspace. A performant alternative is to allow the stage-2
> fault handlers to, where they would currently page-fault on the
> userspace mappings, exit from KVM_RUN and deliver relevant information
> to userspace via KVM_CAP_MEMORY_FAULT_INFO. This approach avoids
> contention-related issues.

> Add, and mostly implement, a capability through which userspace can
> indicate to KVM (through the new KVM_MEM_EXIT_ON_MISSING memslot flag)
> that it should not fault in pages from the stage-2 fault handlers.

> The basic implementation strategy is to check the memslot flag from
> within __gfn_to_pfn_memslot() and override the "atomic" parameter
> accordingly. Some callers (such as kvm_vcpu_map()) must be able to opt
> out of this behavior, and do so by passing the new can_exit_on_missing
> parameter as false.

> No functional change intended: nothing sets KVM_MEM_EXIT_ON_MISSING.

One comment/question though- as I have the (sqaushed) patches/new
commit message written, I think it could mislead readers into thinking
that callers that pass can_exit_on_missing=false to
__gfn_to_pfn_memslot() *must* do so. But at least some of these cases,
such as the ones in the powerpc mmu, I think we're just punting on.

I see a few options here

1. Make all callers pass can_exit_on_missing=false, and leave the
=true update to the x86/arm64 specific "enable/annotate
KVM_EXIT_ON_MISSING" commits [my preference]

2. Make the powerpc callers pass can_exit_on_missing=true as well,
even though we're not adding KVM_CAP_EXIT_ON_MISSING for them

3. Add a disclaimer in the commit message that some cases where
can_exit_on_missing=false could become =true in the future

4. Things are fine as-is and I'm making a mountain out of a molehill

Any thoughts?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-11-02 19:14       ` Anish Moorthy
@ 2023-11-02 20:25         ` Anish Moorthy
  2023-11-03 20:05         ` Sean Christopherson
  1 sibling, 0 replies; 53+ messages in thread
From: Anish Moorthy @ 2023-11-02 20:25 UTC (permalink / raw)
  To: Sean Christopherson, David Matlack
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Thu, Nov 2, 2023 at 12:14 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> Proposed commit message for the squashed commit
>
> > KVM: Add KVM_CAP_EXIT_ON_MISSING which forbids page faults in stage-2 fault handlers

Erm, I seem to have flubbed it already. IIRC this should be
"vcpu-context accesses," not "stage-2 fault handlers"

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls
  2023-11-02 19:14       ` Anish Moorthy
  2023-11-02 20:25         ` Anish Moorthy
@ 2023-11-03 20:05         ` Sean Christopherson
  1 sibling, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2023-11-03 20:05 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: David Matlack, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Nov 02, 2023, Anish Moorthy wrote:
> I'm going to squash this into patch 9: the fact that it's separated is
> a holdover from when the memslot flag check lived in arch-specific
> code.
> 
> Proposed commit message for the squashed commit
> 
> > KVM: Add KVM_CAP_EXIT_ON_MISSING which forbids page faults in stage-2 fault handlers

Please don't use "forbids page faults".  As I said earlier:

 : "Forbid fault" is rather nonsensical because a fault has already happened.  The
 : confusion between "page fault VM-Exit" and "faulting in memory in the host" is
 : the main reason we wandered away from anything with "fault" in the name.

The part that is being "forbidden" is not a "page fault", it's the act of faulting
in a page.  And "forbids" doesn't add any clarity to KVM_CAP_EXIT_ON_MISSING; if
anything, it muddies the waters by making it sound like page faults somehow become
fatal.

Regarding the capability, stating the capability name is both unnecessary and
uninteresting.  The fact that there's a new capability is an implementation detail,
it's in no way needed to understand the basic gist of the patch, which is basically
*the* role of the shortlog.

The key things to cover in the shortlog:

 * What does the feature do?   Exits when an hva isn't fully mapped
 * Who controls the behavior?  Userspace
 * How is the feature enabled? Memslot flag

I would go with something like:

  KVM: Add memslot flag to let userspace force an exit on missing hva mappings

> > Faulting in pages via the stage-2 fault handlers can be undesirable in
> > the context of userfaultfd-based postcopy: contention for userfaultfd's
> > internal wait queues can cause significant slowness when delivering
> > faults to userspace. A performant alternative is to allow the stage-2
> > fault handlers to, where they would currently page-fault on the
> > userspace mappings, exit from KVM_RUN and deliver relevant information
> > to userspace via KVM_CAP_MEMORY_FAULT_INFO. This approach avoids
> > contention-related issues.
> 
> > Add, and mostly implement, a capability through which userspace can
> > indicate to KVM (through the new KVM_MEM_EXIT_ON_MISSING memslot flag)
> > that it should not fault in pages from the stage-2 fault handlers.
> 
> > The basic implementation strategy is to check the memslot flag from
> > within __gfn_to_pfn_memslot() and override the "atomic" parameter
> > accordingly. Some callers (such as kvm_vcpu_map()) must be able to opt
> > out of this behavior, and do so by passing the new can_exit_on_missing
> > parameter as false.
> 
> > No functional change intended: nothing sets KVM_MEM_EXIT_ON_MISSING.
> 
> One comment/question though- as I have the (sqaushed) patches/new
> commit message written, I think it could mislead readers into thinking
> that callers that pass can_exit_on_missing=false to
> __gfn_to_pfn_memslot() *must* do so. But at least some of these cases,
> such as the ones in the powerpc mmu, I think we're just punting on.
> 
> I see a few options here
> 
> 1. Make all callers pass can_exit_on_missing=false, and leave the
> =true update to the x86/arm64 specific "enable/annotate
> KVM_EXIT_ON_MISSING" commits [my preference]

This one.  Nothing should pass %true until the caller fully supports
KVM_MEM_EXIT_ON_MISSING.

> 2. Make the powerpc callers pass can_exit_on_missing=true as well,
> even though we're not adding KVM_CAP_EXIT_ON_MISSING for them
> 
> 3. Add a disclaimer in the commit message that some cases where
> can_exit_on_missing=false could become =true in the future
> 
> 4. Things are fine as-is and I'm making a mountain out of a molehill
> 
> Any thoughts?

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2023-11-03 20:05 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-08 22:28 [PATCH v5 00/17] Improve KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
2023-09-08 22:28 ` [PATCH v5 01/17] KVM: Clarify documentation of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
2023-09-08 22:28 ` [PATCH v5 02/17] KVM: Add docstrings to __kvm_read/write_guest_page() Anish Moorthy
2023-10-05  1:18   ` Sean Christopherson
2023-09-08 22:28 ` [PATCH v5 03/17] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
2023-09-08 22:28 ` [PATCH v5 04/17] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
2023-10-05  1:14   ` Sean Christopherson
2023-10-05 18:45     ` Anish Moorthy
2023-10-05 22:13       ` Sean Christopherson
2023-10-10 22:58   ` David Matlack
2023-10-10 23:40     ` Sean Christopherson
2023-10-16 17:07       ` David Matlack
2023-10-16 19:14         ` Sean Christopherson
2023-09-08 22:28 ` [PATCH v5 05/17] KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page() Anish Moorthy
2023-09-14  8:04   ` kernel test robot
2023-10-05  1:53   ` Sean Christopherson
2023-10-05 23:03   ` Anish Moorthy
2023-09-08 22:28 ` [PATCH v5 06/17] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
2023-10-05  1:26   ` Sean Christopherson
2023-10-05 23:57     ` Anish Moorthy
2023-10-06  0:36       ` Sean Christopherson
2023-09-08 22:28 ` [PATCH v5 07/17] KVM: arm64: Annotate -EFAULT from user_mem_abort() Anish Moorthy
2023-09-28 21:42   ` Anish Moorthy
2023-10-05  1:26   ` Sean Christopherson
2023-10-10 23:01   ` David Matlack
2023-09-08 22:28 ` [PATCH v5 08/17] KVM: Allow hva_pfn_fast() to resolve read faults Anish Moorthy
2023-09-08 22:28 ` [PATCH v5 09/17] KVM: Introduce KVM_CAP_USERFAULT_ON_MISSING without implementation Anish Moorthy
2023-10-10 23:16   ` David Matlack
2023-10-11 17:54     ` Anish Moorthy
2023-10-16 19:38       ` Sean Christopherson
2023-09-08 22:28 ` [PATCH v5 10/17] KVM: Implement KVM_CAP_USERFAULT_ON_MISSING by atomizing __gfn_to_pfn_memslot() calls Anish Moorthy
2023-10-05  1:44   ` Sean Christopherson
2023-10-05 18:58     ` Anish Moorthy
2023-10-06  0:17       ` Sean Christopherson
2023-10-11 22:04         ` Anish Moorthy
2023-11-01 21:53     ` Anish Moorthy
2023-11-01 22:03       ` Sean Christopherson
2023-11-01 22:25         ` Anish Moorthy
2023-11-01 22:39           ` David Matlack
2023-11-01 22:42           ` Sean Christopherson
2023-11-02 19:14       ` Anish Moorthy
2023-11-02 20:25         ` Anish Moorthy
2023-11-03 20:05         ` Sean Christopherson
2023-09-08 22:28 ` [PATCH v5 11/17] KVM: x86: Enable KVM_CAP_USERFAULT_ON_MISSING Anish Moorthy
2023-10-05  1:52   ` Sean Christopherson
2023-11-01 22:55     ` Anish Moorthy
2023-11-02 14:31       ` Sean Christopherson
2023-09-08 22:28 ` [PATCH v5 12/17] KVM: arm64: " Anish Moorthy
2023-09-08 22:29 ` [PATCH v5 13/17] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
2023-09-08 22:29 ` [PATCH v5 14/17] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
2023-09-08 22:29 ` [PATCH v5 15/17] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
2023-09-08 22:29 ` [PATCH v5 16/17] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
2023-09-08 22:29 ` [PATCH v5 17/17] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.