* [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
@ 2023-06-02 16:19 Anish Moorthy
  2023-06-02 16:19 ` [PATCH v4 01/16] KVM: Allow hva_to_pfn_fast() to resolve read-only faults Anish Moorthy
                   ` (15 more replies)
  0 siblings, 16 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

Outstanding Issues
~~~~~~~~~~~~~~~~~~
- The list of annotation sites still needs feedback. My latest
  assessment of which sites to annotate is in [4]
- The WARNs introduced in the kvm_populate_efault_info() helper need
  to be sorted out (gate behind kconfig?)
- Probably more (but hopefully not too much more :)

Cover Letter
~~~~~~~~~~~~
Due to serialization on internal wait queues, userfaultfd can be quite
slow at delivering faults to userspace when many vCPUs fault on the same
VMA/uffd: this problem only worsens as the number of vCPUs increases.
This series allows page faults encountered in KVM_RUN to bypass
userfaultfd (KVM_CAP_NOWAIT_ON_FAULT) and be delivered directly via
VM exit to the faulting vCPU (KVM_CAP_MEMORY_FAULT_INFO), allowing much
higher page-in rates during uffd-based postcopy.

KVM_CAP_MEMORY_FAULT_INFO comes first. This capability delivers
information to userspace on vCPU guest memory access failures. KVM_RUN
currently just returns -1 and sets errno=EFAULT upon these failures.

Upon receiving an annotated EFAULT, userspace may diagnose and act to
resolve the failed access. This might involve a MADV_POPULATE_READ|WRITE
or, in the context of uffd-based live migration postcopy, a
UFFDIO_COPY|CONTINUE.
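
As a rough sketch of the intended usage (not part of this series:
resolve_fault() and the run-loop shape below are made up for
illustration), a VMM vCPU thread might consume the annotation like so:

    #include <errno.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /*
     * Hypothetical VMM helper: make the host mapping of [gpa, gpa + len)
     * accessible, e.g. via UFFDIO_COPY/CONTINUE or MADV_POPULATE_READ|WRITE.
     */
    void resolve_fault(__u64 gpa, __u64 len, __u64 flags);

    /* One iteration of a vCPU thread's run loop (most exits elided). */
    static void vcpu_run_once(int vcpu_fd, struct kvm_run *run)
    {
            if (ioctl(vcpu_fd, KVM_RUN, 0) == 0)
                    return; /* normal exit: dispatch on run->exit_reason */

            if (errno == EFAULT && run->exit_reason == KVM_EXIT_MEMORY_FAULT)
                    /* Resolve the fault, then simply call KVM_RUN again. */
                    resolve_fault(run->memory_fault.gpa,
                                  run->memory_fault.len,
                                  run->memory_fault.flags);
            /* else: a "bare" EFAULT or some other error */
    }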

~~~~~~~~~~~~~~ IMPORTANT NOTE ~~~~~~~~~~~~~~
The implementation strategy for KVM_CAP_MEMORY_FAULT_INFO has risks: for
example, if there are any existing paths in KVM_RUN which cause a vCPU
to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
access but ignore the failure and then (3) complete the exit to
userspace set up in (1), then the contents of the kvm_run struct written
in (1) will be corrupted.

Another example: if KVM_RUN fails a guest memory access for which the
EFAULT is annotated but does not return that EFAULT to userspace, and
then later returns an *un*annotated EFAULT, userspace will receive
incorrect information.

These are pathological and (hopefully) hypothetical cases, but awareness
is important.

The rationale for taking this approach over the alternative strategy of
filling the efault info only for verified return paths from KVM_RUN to
userspace is discussed in [3].
~~~~~~~~~~~~~~ END IMPORTANT NOTE ~~~~~~~~~~~~~~

KVM_CAP_NOWAIT_ON_FAULT (originally proposed by James Houghton in [1])
comes next. This capability causes KVM_RUN to error with errno=EFAULT
when it encounters a page fault which would require the vCPU thread to
sleep. During uffd-based postcopy this allows delivery of page faults
directly to vCPU threads, bypassing the uffd wait queue and contention.

Side note: KVM_CAP_NOWAIT_ON_FAULT prevents async page faults, so
userspace will likely want to limit its use to uffd-based postcopy.
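
The nowait behavior is opted into per-memslot. A minimal, hypothetical
sketch of enabling it, where vm_fd comes from KVM_CREATE_VM and
backing_mem is assumed to be a uffd-registered host mapping (slot
number and sizes are made up):

    #include <err.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    struct kvm_userspace_memory_region region = {
            .slot = 0,
            .flags = KVM_MEM_NOWAIT_ON_FAULT,
            .guest_phys_addr = 0,
            .memory_size = 512ULL << 20,
            .userspace_addr = (__u64)backing_mem,
    };

    /*
     * Setting the flag fails (-EINVAL) on kernels without
     * KVM_CAP_NOWAIT_ON_FAULT.
     */
    if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region))
            err(1, "KVM_SET_USER_MEMORY_REGION");

Since the flag suppresses async page faults, a VMM would presumably set
it only while postcopy is actually in progress.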

KVM's demand paging self test is extended to demonstrate the benefits
of the new caps to uffd-based postcopy. The performance samples below
(rates in thousands of pages/s, avg of 5 runs) were gathered using [2]
on an x86 machine with 256 cores.

vCPUs   Avg paging rate (w/o new caps)   Avg paging rate (w/ new caps)
1       150                              340
2       191                              477
4       210                              809
8       155                              1239
16      130                              1595
32      108                              2299
64      86                               3482
128     62                               4134
256     36                               4012

Base Commit
~~~~~~~~~~~
This series is based on kvm/next (39428f6ea9ea) with [5] applied.

Links/Notes
~~~~~~~~~~~
[1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
[2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
    A quick rundown of the new flags (also detailed in later commits)
        -a registers all of guest memory to a single uffd.
        -r specifies the number of reader threads for polling the uffd.
        -w is what actually enables the new capabilities.
    All data was collected after applying the entire series
[3] https://lore.kernel.org/kvm/ZBTgnjXJvR8jtc4i@google.com/
[4] https://lore.kernel.org/kvm/ZHkfDCItA8HUxOG1@linux.dev/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
[5] https://lore.kernel.org/kvm/168556721574.515120.10821482819846567909.b4-ty@google.com/T/#t

---

v4
  - Fix excessive indentation [Robert, Oliver]
  - Calculate final stats when uffd handler fn returns an error [Robert]
  - Remove redundant info from uffd_desc [Robert]
  - Fix various commit message typos [Robert]
  - Add comment about suppressed EEXISTs in selftest [Robert]
  - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
  - Fix some include/logic issues in self test [Robert]
  - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
  - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
  - Drop most of the annotations from v3: see
    https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
  - Remove WARN on bare efaults [Sean, Oliver]
  - Eliminate unnecessary UFFDIO_WAKE call from self test [James]

v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
  - Rework the implementation to be based on two orthogonal
    capabilities (KVM_CAP_MEMORY_FAULT_INFO and
    KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
  - Change return code of kvm_populate_efault_info [Isaku]
  - Use kvm_populate_efault_info from arm code [Oliver]

v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/

    This was a bit of a misfire, as I sent my WIP series on the mailing
    list but was just targeting Sean for some feedback. Oliver Upton and
    Isaku Yamahata ended up discovering the series and giving me some
    feedback anyways, so thanks to them :) In the end, there was enough
    discussion to justify retroactively labeling it as v2, even with the
    limited cc list.

  - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
  - API changes:
        - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
          KVM_CAP_X86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
          requirement).
        - Switched to memslot flag
  - Take Oliver's simplification to the "allow fast gup for readable
    faults" logic.
  - Slightly redefine the return code of user_mem_abort.
  - Fix documentation errors brought up by Marc
  - Reword commit messages in imperative mood

v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/

Anish Moorthy (16):
  KVM: Allow hva_to_pfn_fast() to resolve read-only faults.
  KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of
    KVM_RUN
  KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  KVM: Add docstrings to __kvm_write_guest_page() and
    __kvm_read_guest_page()
  KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
  KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page()
  KVM: Simplify error handling in __gfn_to_pfn_memslot()
  KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  KVM: x86: Implement KVM_CAP_NOWAIT_ON_FAULT
  KVM: arm64: Implement KVM_CAP_NOWAIT_ON_FAULT
  KVM: selftests: Report per-vcpu demand paging rate from demand paging
    test
  KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
    paging test
  KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
    signal errors via TEST_ASSERT
  KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  KVM: selftests: Handle memory fault exits in demand_paging_test

 Documentation/virt/kvm/api.rst                |  74 ++++-
 arch/arm64/kvm/arm.c                          |   2 +
 arch/arm64/kvm/mmu.c                          |  16 +-
 arch/x86/kvm/mmu/mmu.c                        |  30 +-
 arch/x86/kvm/x86.c                            |   3 +
 include/linux/kvm_host.h                      |  15 +
 include/uapi/linux/kvm.h                      |  15 +
 tools/include/uapi/linux/kvm.h                |   8 +
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/access_tracking_perf_test.c |   2 +-
 .../selftests/kvm/demand_paging_test.c        | 285 ++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
 .../testing/selftests/kvm/include/memstress.h |   2 +-
 .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    |   1 +
 tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
 .../kvm/memslot_modification_stress_test.c    |   2 +-
 virt/kvm/kvm_main.c                           |  97 ++++--
 19 files changed, 564 insertions(+), 174 deletions(-)

-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 01/16] KVM: Allow hva_to_pfn_fast() to resolve read-only faults.
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-14 14:39   ` Sean Christopherson
  2023-06-02 16:19 ` [PATCH v4 02/16] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN Anish Moorthy
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

hva_to_pfn_fast() currently just fails for read-only faults, which is
unnecessary. Instead, try pinning the page without passing FOLL_WRITE.
This allows read-only faults to (potentially) be resolved without
falling back to slow GUP.

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb5c13eee193..fd80a560378c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2482,7 +2482,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 }
 
 /*
- * The fast path to get the writable pfn which will be stored in @pfn,
+ * The fast path to get the pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
  * only part that runs if we can in atomic context.
  */
@@ -2496,10 +2496,9 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 	 * or the caller allows to map a writable pfn for a read fault
 	 * request.
 	 */
-	if (!(write_fault || writable))
-		return false;
+	unsigned int gup_flags = (write_fault || writable) ? FOLL_WRITE : 0;
 
-	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+	if (get_user_page_fast_only(addr, gup_flags, page)) {
 		*pfn = page_to_pfn(page[0]);
 
 		if (writable)
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 02/16] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
  2023-06-02 16:19 ` [PATCH v4 01/16] KVM: Allow hva_to_pfn_fast() to resolve read-only faults Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-02 20:30   ` Isaku Yamahata
  2023-06-02 16:19 ` [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

Give kvm_run.exit_reason a defined initial value on entry into KVM_RUN:
other architectures (riscv, arm64) already use KVM_EXIT_UNKNOWN for this
purpose, so copy that convention.

This gives vCPUs trying to fill the run struct a mechanism to avoid
overwriting already-populated data, albeit an imperfect one. Being able
to detect an already-populated KVM run struct will prevent at least some
bugs in the upcoming implementation of KVM_CAP_MEMORY_FAULT_INFO, which
will attempt to fill the run struct whenever a vCPU fails a guest memory
access.

Without the already-populated check, KVM_CAP_MEMORY_FAULT_INFO could
change kvm_run in any code paths which

1. Populate kvm_run for some exit and prepare to return to userspace
2. Access guest memory for some reason (but without returning -EFAULTs
    to userspace)
3. Finish the return to userspace set up in (1), now with the contents
    of kvm_run changed to contain efault info.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ceb7c5e9cf9e..a7725d41570a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11163,6 +11163,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	if (r <= 0)
 		goto out;
 
+	kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
 	r = vcpu_run(vcpu);
 
 out:
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
  2023-06-02 16:19 ` [PATCH v4 01/16] KVM: Allow hva_to_pfn_fast() to resolve read-only faults Anish Moorthy
  2023-06-02 16:19 ` [PATCH v4 02/16] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-03 16:58   ` Isaku Yamahata
                     ` (3 more replies)
  2023-06-02 16:19 ` [PATCH v4 04/16] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page() Anish Moorthy
                   ` (12 subsequent siblings)
  15 siblings, 4 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
besides a return value of -1 and errno of EFAULT when a vCPU fails a
guest memory access that userspace may be able to resolve.

Add documentation, updates to the KVM headers, and a helper function
(kvm_populate_efault_info()) for implementing the capability.

Besides simply filling the run struct, kvm_populate_efault_info() takes
two safety measures

  a. It tries to prevent concurrent fills on a single vCPU run struct
     by checking that the run struct being modified corresponds to the
     currently loaded vCPU.
  b. It tries to avoid filling an already-populated run struct by
     checking whether the exit reason has been modified since entry
     into KVM_RUN.

Finally, mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86,
even though EFAULT annotations are currently entirely absent. Picking a
point to declare the implementation "done" is difficult because

  1. Annotations will be performed incrementally in subsequent commits
     across both core and arch-specific KVM.
  2. The initial series will very likely miss some cases which need
     annotation. Although these omissions are to be fixed in the future,
     userspace thus still needs to expect and be able to handle
     unannotated EFAULTs.

Given these qualifications, just marking it available here seems the
least arbitrary thing to do.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst             | 42 ++++++++++++++++++++++
 arch/arm64/kvm/arm.c                       |  1 +
 arch/x86/kvm/x86.c                         |  1 +
 include/linux/kvm_host.h                   |  9 +++++
 include/uapi/linux/kvm.h                   | 13 +++++++
 tools/include/uapi/linux/kvm.h             |  7 ++++
 tools/testing/selftests/kvm/lib/kvm_util.c |  1 +
 virt/kvm/kvm_main.c                        | 35 ++++++++++++++++++
 8 files changed, 109 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index add067793b90..5b24059143b3 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6700,6 +6700,18 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len; /* in bytes */
+		} memory_fault;
+
+Indicates a vCPU memory fault on the guest physical address range
+[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
@@ -7734,6 +7746,36 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86, arm64
+:Returns: -EINVAL.
+
+The presence of this capability indicates that KVM may annotate EFAULTs
+returned by KVM_RUN in response to failed vCPU guest memory accesses which
+userspace may be able to resolve.
+
+The annotation is returned via the run struct. When KVM_RUN returns an error
+with errno=EFAULT, userspace may check the exit reason: if it is
+KVM_EXIT_MEMORY_FAULT, userspace is then permitted to read the run struct's
+'memory_fault' field.
+
+This capability is informational only: attempts to KVM_ENABLE_CAP it directly
+will fail.
+
+The 'gpa' and 'len' (in bytes) fields describe the range of guest
+physical memory to which access failed, i.e. [gpa, gpa + len). 'flags' is a
+bitfield indicating the nature of the access: valid masks are
+
+  - KVM_MEMORY_FAULT_FLAG_WRITE:     The failed access was a write.
+  - KVM_MEMORY_FAULT_FLAG_EXEC:      The failed access was an exec.
+
+NOTE: The implementation of this capability is incomplete. Even when it is
+present, userspace may receive "bare" EFAULTs (i.e. exit reason != KVM_EXIT_MEMORY_FAULT)
+from KVM_RUN for failures which may be resolvable. These should be considered
+bugs and reported to the maintainers so that annotations can be added.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 14391826241c..b34cf0cedffa 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -234,6 +234,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a7725d41570a..d15bacb3f634 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4497,6 +4497,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e571e973bc2..69a221f71914 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2288,4 +2288,13 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+/*
+ * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
+ * populate the memory_fault field with the given information.
+ *
+ * WARNs and does nothing if the exit reason is not KVM_EXIT_UNKNOWN, or if
+ * 'vcpu' is not the current running vcpu.
+ */
+inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
+				     uint64_t gpa, uint64_t len, uint64_t flags);
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 737318b1c1d9..143abb334f56 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -510,6 +511,12 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1190,6 +1197,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
 #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
 #define KVM_CAP_COUNTER_OFFSET 227
+#define KVM_CAP_MEMORY_FAULT_INFO 228
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2245,4 +2253,9 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* flags for KVM_CAP_MEMORY_FAULT_INFO */
+
+#define KVM_MEMORY_FAULT_FLAG_WRITE    (1 << 0)
+#define KVM_MEMORY_FAULT_FLAG_EXEC     (1 << 1)
+
 #endif /* __LINUX_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 4003a166328c..5476fe169921 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -505,6 +506,12 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 298c4372fb1a..7d7e9f893fd5 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1868,6 +1868,7 @@ static struct exit_reason {
 #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
 	KVM_EXIT_STRING(MEMORY_NOT_PRESENT),
 #endif
+	KVM_EXIT_STRING(MEMORY_FAULT),
 };
 
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fd80a560378c..09d4d85691e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4674,6 +4674,9 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 
 		return r;
 	}
+	case KVM_CAP_MEMORY_FAULT_INFO: {
+		return -EINVAL;
+	}
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
@@ -6173,3 +6176,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
 
 	return init_context.err;
 }
+
+inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
+				     uint64_t gpa, uint64_t len, uint64_t flags)
+{
+	if (WARN_ON_ONCE(!vcpu))
+		return;
+
+	preempt_disable();
+	/*
+	 * Ensure that this vCPU isn't modifying another vCPU's run struct, which
+	 * would open the door for races between concurrent calls to this
+	 * function.
+	 */
+	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
+		goto out;
+	/*
+	 * Try not to overwrite an already-populated run struct.
+	 * This isn't a perfect solution, as there's no guarantee that the exit
+	 * reason is set before the run struct is populated, but it should prevent
+	 * at least some bugs.
+	 */
+	else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
+		goto out;
+
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.len = len;
+	vcpu->run->memory_fault.flags = flags;
+
+out:
+	preempt_enable();
+}
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 04/16] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page()
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (2 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-15  2:41   ` Robert Hoo
  2023-06-02 16:19 ` [PATCH v4 05/16] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page() Anish Moorthy
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

The order of parameters in these function signatures is a little strange,
with "offset" actually applying to "gfn" rather than to "data". Add
short comments to make things perfectly clear.
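
For instance (a hypothetical caller, not part of this patch), reading
guest bytes at guest physical address 'gpa' decomposes as:

    gfn_t gfn = gpa >> PAGE_SHIFT;       /* which guest page */
    int offset = gpa & (PAGE_SIZE - 1);  /* offset within that page */
    u64 data;
    /* Reads sizeof(data) bytes starting at gfn * PAGE_SIZE + offset. */
    int r = kvm_read_guest_page(kvm, gfn, &data, offset, sizeof(data));

i.e. "offset" qualifies "gfn", not "data".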

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 09d4d85691e1..d9c0fa7c907f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2984,6 +2984,9 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
+/*
+ * Copy 'len' bytes from guest memory at '(gfn * PAGE_SIZE) + offset' to 'data'
+ */
 static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 				 void *data, int offset, int len)
 {
@@ -3085,6 +3088,9 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
+/*
+ * Copy 'len' bytes from 'data' into guest memory at '(gfn * PAGE_SIZE) + offset'
+ */
 static int __kvm_write_guest_page(struct kvm *kvm,
 				  struct kvm_memory_slot *memslot, gfn_t gfn,
 			          const void *data, int offset, int len)
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 05/16] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (3 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 04/16] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page() Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-14 19:10   ` Sean Christopherson
  2023-06-02 16:19 ` [PATCH v4 06/16] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page() Anish Moorthy
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

Implement KVM_CAP_MEMORY_FAULT_INFO for uaccess failures in
kvm_vcpu_write_guest_page()

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9c0fa7c907f..ea27a8178f1a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3090,8 +3090,10 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
 /*
  * Copy 'len' bytes from 'data' into guest memory at '(gfn * PAGE_SIZE) + offset'
+ * If 'vcpu' is non-null, then its run struct may be filled for a
+ * KVM_EXIT_MEMORY_FAULT on uaccess failure.
  */
-static int __kvm_write_guest_page(struct kvm *kvm,
+static int __kvm_write_guest_page(struct kvm *kvm, struct kvm_vcpu *vcpu,
 				  struct kvm_memory_slot *memslot, gfn_t gfn,
 			          const void *data, int offset, int len)
 {
@@ -3102,8 +3104,13 @@ static int __kvm_write_guest_page(struct kvm *kvm,
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
 	r = __copy_to_user((void __user *)addr + offset, data, len);
-	if (r)
+	if (r) {
+		if (vcpu)
+			kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset,
+						 len,
+						 KVM_MEMORY_FAULT_FLAG_WRITE);
 		return -EFAULT;
+	}
 	mark_page_dirty_in_slot(kvm, memslot, gfn);
 	return 0;
 }
@@ -3113,7 +3120,7 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_write_guest_page(kvm, slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(kvm, NULL, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_page);
 
@@ -3121,8 +3128,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 			      const void *data, int offset, int len)
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-
-	return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
+	return __kvm_write_guest_page(vcpu->kvm, vcpu, slot, gfn, data,
+				      offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
 
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 06/16] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page()
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (4 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 05/16] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page() Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-14 19:22   ` Sean Christopherson
  2023-06-02 16:19 ` [PATCH v4 07/16] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

Implement KVM_CAP_MEMORY_FAULT_INFO for uaccess failures within
kvm_vcpu_read_guest_page().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ea27a8178f1a..b9d2606f9251 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2986,9 +2986,12 @@ static int next_segment(unsigned long len, int offset)
 
 /*
  * Copy 'len' bytes from guest memory at '(gfn * PAGE_SIZE) + offset' to 'data'
+ * If 'vcpu' is non-null, then its run struct may be filled for a
+ * KVM_EXIT_MEMORY_FAULT on uaccess failure.
  */
-static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
-				 void *data, int offset, int len)
+static int __kvm_read_guest_page(struct kvm_memory_slot *slot,
+				 struct kvm_vcpu *vcpu,
+				 gfn_t gfn, void *data, int offset, int len)
 {
 	int r;
 	unsigned long addr;
@@ -2997,8 +3000,12 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
 	r = __copy_from_user(data, (void __user *)addr + offset, len);
-	if (r)
+	if (r) {
+		if (vcpu)
+			kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset,
+						 len, 0);
 		return -EFAULT;
+	}
 	return 0;
 }
 
@@ -3007,7 +3014,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(slot, NULL, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_page);
 
@@ -3016,7 +3023,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(slot, vcpu, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
 
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 07/16] KVM: Simplify error handling in __gfn_to_pfn_memslot()
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (5 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 06/16] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page() Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-14 19:26   ` Sean Christopherson
  2023-06-02 16:19 ` [PATCH v4 08/16] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

KVM_HVA_ERR_RO_BAD satisfies kvm_is_error_hva(), so there's no need to
duplicate the "if (writable)" block. Fix this by bringing all
kvm_is_error_hva() cases under one conditional.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b9d2606f9251..05d6e7e3994d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2711,16 +2711,14 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 	if (hva)
 		*hva = addr;
 
-	if (addr == KVM_HVA_ERR_RO_BAD) {
-		if (writable)
-			*writable = false;
-		return KVM_PFN_ERR_RO_FAULT;
-	}
-
 	if (kvm_is_error_hva(addr)) {
 		if (writable)
 			*writable = false;
-		return KVM_PFN_NOSLOT;
+
+		if (addr == KVM_HVA_ERR_RO_BAD)
+			return KVM_PFN_ERR_RO_FAULT;
+		else
+			return KVM_PFN_NOSLOT;
 	}
 
 	/* Do not map writable pfn in the readonly memslot. */
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 08/16] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (6 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 07/16] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-14 20:03   ` Sean Christopherson
  2023-06-15  2:43   ` Robert Hoo
  2023-06-02 16:19 ` [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation Anish Moorthy
                   ` (7 subsequent siblings)
  15 siblings, 2 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_handle_error_pfn().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8961f45e3b1..cb71aae9aaec 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3291,6 +3291,10 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
 
 static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
+	uint64_t rounded_gpa;
+	uint64_t fault_size;
+	uint64_t fault_flags;
+
 	if (is_sigpending_pfn(fault->pfn)) {
 		kvm_handle_signal_exit(vcpu);
 		return -EINTR;
@@ -3309,6 +3313,15 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
 		return RET_PF_RETRY;
 	}
 
+	fault_size = KVM_HPAGE_SIZE(fault->goal_level);
+	rounded_gpa = round_down(fault->gfn * PAGE_SIZE, fault_size);
+
+	fault_flags = 0;
+	if (fault->write)
+		fault_flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
+	if (fault->exec)
+		fault_flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
+	kvm_populate_efault_info(vcpu, rounded_gpa, fault_size, fault_flags);
 	return -EFAULT;
 }
 
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (7 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 08/16] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-14 20:11   ` Sean Christopherson
  2023-06-14 21:20   ` Sean Christopherson
  2023-06-02 16:19 ` [PATCH v4 10/16] KVM: x86: Implement KVM_CAP_NOWAIT_ON_FAULT Anish Moorthy
                   ` (6 subsequent siblings)
  15 siblings, 2 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

Add documentation, memslot flags, useful helper functions, and the
actual new capability itself.

Memory fault exits on absent mappings are particularly useful for
userfaultfd-based postcopy live migration. When many vCPUs fault on a
single userfaultfd the faults can take a while to surface to userspace
due to having to contend for uffd wait queue locks. Bypassing the uffd
entirely by returning information directly to the vCPU exit avoids this
contention and improves the fault rate.

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 32 +++++++++++++++++++++++++++++---
 include/linux/kvm_host.h       |  6 ++++++
 include/uapi/linux/kvm.h       |  2 ++
 tools/include/uapi/linux/kvm.h |  1 +
 virt/kvm/kvm_main.c            |  3 +++
 5 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 5b24059143b3..9daadbe2c7ed 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
   /* for kvm_userspace_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_NOWAIT_ON_FAULT (1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1342,12 +1343,15 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
+The flags field supports three flags
+
+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
 writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
+use it.
+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
 to make a new slot read-only.  In this case, writes to this memory will be
 posted to userspace as KVM_EXIT_MMIO exits.
+3.  KVM_MEM_NOWAIT_ON_FAULT: see KVM_CAP_NOWAIT_ON_FAULT for details.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
@@ -7776,6 +7780,28 @@ userspace may receive "bare" EFAULTs (i.e. exit reason != KVM_EXIT_MEMORY_FAULT)
 from KVM_RUN for failures which may be resolvable. These should be considered
 bugs and reported to the maintainers so that annotations can be added.
 
+7.35 KVM_CAP_NOWAIT_ON_FAULT
+----------------------------
+
+:Architectures: None
+:Returns: -EINVAL.
+
+The presence of this capability indicates that userspace may pass the
+KVM_MEM_NOWAIT_ON_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
+to fail (-EFAULT) in response to page faults for which resolution would require
+the faulting thread to sleep.
+
+The range of guest physical memory causing the fault is advertised to userspace
+through KVM_CAP_MEMORY_FAULT_INFO.
+
+Userspace should determine how best to make the mapping present, then take
+appropriate action. For instance establishing the mapping could involve a
+MADV_POPULATE_READ|WRITE, in the context of userfaultfd a UFFDIO_COPY|CONTINUE
+could be appropriate, etc. After establishing the mapping, userspace can return
+to KVM to retry the memory access.
+
+Attempts to enable this capability directly will fail.
+
 8. Other capabilities.
 ======================
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 69a221f71914..abbc5dd72292 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2297,4 +2297,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
  */
 inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
 				     uint64_t gpa, uint64_t len, uint64_t flags);
+
+static inline bool kvm_slot_nowait_on_fault(
+	const struct kvm_memory_slot *slot)
+{
+	return slot->flags & KVM_MEM_NOWAIT_ON_FAULT;
+}
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 143abb334f56..595c3d7d36aa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_NOWAIT_ON_FAULT	(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1198,6 +1199,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_MEMORY_FAULT_INFO 228
+#define KVM_CAP_NOWAIT_ON_FAULT 229
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 5476fe169921..f64845cd599f 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_NOWAIT_ON_FAULT (1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 05d6e7e3994d..2c276d4d0821 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1527,6 +1527,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
 	valid_flags |= KVM_MEM_READONLY;
 #endif
 
+	if (kvm_vm_ioctl_check_extension(NULL, KVM_CAP_NOWAIT_ON_FAULT))
+		valid_flags |= KVM_MEM_NOWAIT_ON_FAULT;
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 10/16] KVM: x86: Implement KVM_CAP_NOWAIT_ON_FAULT
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (8 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-14 20:25   ` Sean Christopherson
  2023-06-02 16:19 ` [PATCH v4 11/16] KVM: arm64: " Anish Moorthy
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

When the memslot flag is enabled, fail guest memory accesses for which
fast GUP fails (i.e., where resolving the page fault would require
putting the faulting thread to sleep).

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst |  2 +-
 arch/x86/kvm/mmu/mmu.c         | 17 ++++++++++++-----
 arch/x86/kvm/x86.c             |  1 +
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9daadbe2c7ed..aa7b4024fd41 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7783,7 +7783,7 @@ bugs and reported to the maintainers so that annotations can be added.
 7.35 KVM_CAP_NOWAIT_ON_FAULT
 ----------------------------
 
-:Architectures: None
+:Architectures: x86
 :Returns: -EINVAL.
 
 The presence of this capability indicates that userspace may pass the
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cb71aae9aaec..288008a64e5c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4299,7 +4299,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
 }
 
-static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu,
+			     struct kvm_page_fault *fault,
+			     bool nowait)
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
@@ -4332,9 +4334,12 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	}
 
 	async = false;
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
-					  fault->write, &fault->map_writable,
-					  &fault->hva);
+
+	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn,
+					  nowait, false,
+					  nowait ? NULL : &async,
+					  fault->write, &fault->map_writable, &fault->hva);
+
 	if (!async)
 		return RET_PF_CONTINUE; /* *pfn has correct page already */
 
@@ -4368,7 +4373,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	smp_rmb();
 
-	ret = __kvm_faultin_pfn(vcpu, fault);
+	ret = __kvm_faultin_pfn(vcpu, fault,
+				likely(fault->slot) &&
+				kvm_slot_nowait_on_fault(fault->slot));
 	if (ret != RET_PF_CONTINUE)
 		return ret;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d15bacb3f634..4fbe9c811cc7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4498,6 +4498,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_MEMORY_FAULT_INFO:
+	case KVM_CAP_NOWAIT_ON_FAULT:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 11/16] KVM: arm64: Implement KVM_CAP_NOWAIT_ON_FAULT
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (9 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 10/16] KVM: x86: Implement KVM_CAP_NOWAIT_ON_FAULT Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-02 16:19 ` [PATCH v4 12/16] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

Return -EFAULT from user_mem_abort when the memslot flag is enabled and
fast GUP fails to find a present mapping for the page.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst |  2 +-
 arch/arm64/kvm/arm.c           |  1 +
 arch/arm64/kvm/mmu.c           | 16 +++++++++++++++-
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index aa7b4024fd41..8a1205f7c271 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7783,7 +7783,7 @@ bugs and reported to the maintainers so that annotations can be added.
 7.35 KVM_CAP_NOWAIT_ON_FAULT
 ----------------------------
 
-:Architectures: x86
+:Architectures: x86, arm64
 :Returns: -EINVAL.
 
 The presence of this capability indicates that userspace may pass the
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index b34cf0cedffa..46a09c4db901 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -235,6 +235,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
 	case KVM_CAP_MEMORY_FAULT_INFO:
+	case KVM_CAP_NOWAIT_ON_FAULT:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3b9d4d24c361..5451b712b0ac 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1232,6 +1232,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
+	bool exit_on_memory_fault = kvm_slot_nowait_on_fault(memslot);
+	uint64_t memory_fault_flags;
 
 	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
 	write_fault = kvm_is_write_fault(vcpu);
@@ -1325,8 +1327,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	mmap_read_unlock(current->mm);
 
-	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
+	pfn = __gfn_to_pfn_memslot(memslot, gfn, exit_on_memory_fault, false, NULL,
 				   write_fault, &writable, NULL);
+
+	if (exit_on_memory_fault && pfn == KVM_PFN_ERR_FAULT) {
+		memory_fault_flags = 0;
+		if (write_fault)
+			memory_fault_flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
+		if (exec_fault)
+			memory_fault_flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
+		kvm_populate_efault_info(vcpu,
+					 round_down(gfn * PAGE_SIZE, vma_pagesize), vma_pagesize,
+					 memory_fault_flags);
+		return -EFAULT;
+	}
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 12/16] KVM: selftests: Report per-vcpu demand paging rate from demand paging test
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (10 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 11/16] KVM: arm64: " Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-02 16:19 ` [PATCH v4 13/16] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

Using the overall demand paging rate to measure performance can be
slightly misleading when vCPU accesses are not overlapped. Adding more
vCPUs will (usually) increase the overall demand paging rate even
if performance remains constant or even degrades on a per-vcpu basis. As
such, it makes sense to report both the total and per-vcpu paging rates.
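
For example, if each of 8 vCPUs demand-pages 100k pages over 10
seconds, the test now reports a per-vcpu rate of 10k pgs/sec/vcpu
alongside the overall rate of 80k pgs/sec.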

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 tools/testing/selftests/kvm/demand_paging_test.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 9c18686b4f63..5e8bda388814 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -135,6 +135,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
 	int i;
+	double vcpu_paging_rate;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -191,11 +192,17 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			uffd_stop_demand_paging(uffd_descs[i]);
 	}
 
-	pr_info("Total guest execution time: %ld.%.9lds\n",
+	pr_info("Total guest execution time:\t%ld.%.9lds\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
-	pr_info("Overall demand paging rate: %f pgs/sec\n",
-		memstress_args.vcpu_args[0].pages * nr_vcpus /
-		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / NSEC_PER_SEC));
+
+	vcpu_paging_rate =
+		memstress_args.vcpu_args[0].pages
+		/ ((double)ts_diff.tv_sec
+			+ (double)ts_diff.tv_nsec / NSEC_PER_SEC);
+	pr_info("Per-vcpu demand paging rate:\t%f pgs/sec/vcpu\n",
+		vcpu_paging_rate);
+	pr_info("Overall demand paging rate:\t%f pgs/sec\n",
+		vcpu_paging_rate * nr_vcpus);
 
 	memstress_destroy_vm(vm);
 
-- 
2.41.0.rc0.172.g3f132b7071-goog


* [PATCH v4 13/16] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (11 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 12/16] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-02 16:19 ` [PATCH v4 14/16] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

At the moment, demand_paging_test does not support profiling/testing
multiple vCPU threads concurrently faulting on a single uffd because

    (a) "-u" (run test in userfaultfd mode) creates a uffd for each vCPU's
        region, so that each uffd services a single vCPU thread.
    (b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
        simply doesn't work: the test tries to register the same memory
        to multiple uffds, causing an error.

Add support for many vcpus per uffd by
    (1) Keeping "-u" behavior unchanged.
    (2) Making "-u -a" create a single uffd for all of guest memory.
    (3) Making "-u -o" implicitly pass "-a", solving the problem in (b).
In cases (2) and (3) all vCPU threads fault on a single uffd.

With potentially multiple vCPUs per UFFD, it makes sense to allow
configuring the number of reader threads per UFFD as well: add the "-r"
flag to do so.
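
As an example invocation (flag values made up; see also [2] in the
cover letter):

    ./demand_paging_test -u MINOR -s shmem -a -v 64 -r 16

runs 64 vCPU threads faulting on a single uffd serviced by 16 reader
threads.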

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/aarch64/page_fault_test.c   |  4 +-
 .../selftests/kvm/demand_paging_test.c        | 76 +++++++++++++---
 .../selftests/kvm/include/userfaultfd_util.h  | 17 +++-
 .../selftests/kvm/lib/userfaultfd_util.c      | 87 +++++++++++++------
 4 files changed, 137 insertions(+), 47 deletions(-)

diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index df10f1ffa20d..3b6d228a9340 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -376,14 +376,14 @@ static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
 		*pt_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						    pt_args.hva,
 						    pt_args.paging_size,
-						    test->uffd_pt_handler);
+						    1, test->uffd_pt_handler);
 
 	*data_uffd = NULL;
 	if (test->uffd_data_handler)
 		*data_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						      data_args.hva,
 						      data_args.paging_size,
-						      test->uffd_data_handler);
+						      1, test->uffd_data_handler);
 }
 
 static void free_uffd(struct test_desc *test, struct uffd_desc *pt_uffd,
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 5e8bda388814..35935874c690 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -77,8 +77,20 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		copy.mode = 0;
 
 		r = ioctl(uffd, UFFDIO_COPY, &copy);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
+		/*
+		 * When multiple vCPU threads fault on a single page and there are
+		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
+		 * will fail with EEXIST: handle that case without signaling an
+		 * error.
+		 *
+		 * Note that this also suppresses any EEXIST occurring from,
+		 * e.g., the first UFFDIO_COPY/CONTINUE on a page. That never
+		 * happens here, but a realistic VMM might potentially maintain
+		 * some external state to correctly surface EEXISTs to userspace
+		 * (or prevent duplicate COPY/CONTINUEs in the first place).
+		 */
+		if (r == -1 && errno != EEXIST) {
+			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -89,8 +101,20 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		cont.range.len = demand_paging_size;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
+		/*
+		 * When multiple vCPU threads fault on a single page and there are
+		 * multiple readers for the UFFD, at least one of the UFFDIO_CONTINUEs
+		 * will fail with EEXIST: handle that case without signaling an
+		 * error.
+		 *
+		 * Note that this also suppresses any EEXIST occurring from,
+		 * e.g., the first UFFDIO_COPY/CONTINUE on a page. That never
+		 * happens here, but a realistic VMM might potentially maintain
+		 * some external state to correctly surface EEXISTs to userspace
+		 * (or prevent duplicate COPY/CONTINUEs in the first place).
+		 */
+		if (r == -1 && errno != EEXIST) {
+			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -110,7 +134,9 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 
 struct test_params {
 	int uffd_mode;
+	bool single_uffd;
 	useconds_t uffd_delay;
+	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
 };
@@ -134,8 +160,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i;
+	int i, num_uffds = 0;
 	double vcpu_paging_rate;
+	uint64_t uffd_region_size;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -148,7 +175,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
 	if (p->uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
-		for (i = 0; i < nr_vcpus; i++) {
+		num_uffds = p->single_uffd ? 1 : nr_vcpus;
+		for (i = 0; i < num_uffds; i++) {
 			vcpu_args = &memstress_args.vcpu_args[i];
 			prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
 				     vcpu_args->pages * memstress_args.guest_page_size);
@@ -156,9 +184,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	}
 
 	if (p->uffd_mode) {
-		uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *));
+		num_uffds = p->single_uffd ? 1 : nr_vcpus;
+		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
+
+		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
 		TEST_ASSERT(uffd_descs, "Memory allocation failed");
-		for (i = 0; i < nr_vcpus; i++) {
+		for (i = 0; i < num_uffds; i++) {
+			struct memstress_vcpu_args *vcpu_args;
 			void *vcpu_hva;
 
 			vcpu_args = &memstress_args.vcpu_args[i];
@@ -171,7 +203,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
 				p->uffd_mode, p->uffd_delay, vcpu_hva,
-				vcpu_args->pages * memstress_args.guest_page_size,
+				uffd_region_size,
+				p->readers_per_uffd,
 				&handle_uffd_page_request);
 		}
 	}
@@ -188,7 +221,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	if (p->uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
-		for (i = 0; i < nr_vcpus; i++)
+		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
 	}
 
@@ -214,14 +247,19 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-d uffd_delay_usec]\n"
-	       "          [-b memory] [-s type] [-v vcpus] [-o]\n", name);
+	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
+		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
+		   "          [-s type] [-v vcpus] [-o]\n", name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
+	printf(" -a: Use a single userfaultfd for all of guest memory, instead of\n"
+	       "     creating one for each region paged by a unique vCPU.\n"
+	       "     Set implicitly by -o; has no effect without -u.\n");
 	printf(" -d: add a delay in usec to the User Fault\n"
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
+	printf(" -r: Set the number of reader threads per uffd.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -239,12 +277,14 @@ int main(int argc, char *argv[])
 	struct test_params p = {
 		.src_type = DEFAULT_VM_MEM_SRC,
 		.partition_vcpu_memory_access = true,
+		.readers_per_uffd = 1,
+		.single_uffd = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:o")) != -1) {
+	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -256,6 +296,9 @@ int main(int argc, char *argv[])
 				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
 			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
+		case 'a':
+			p.single_uffd = true;
+			break;
 		case 'd':
 			p.uffd_delay = strtoul(optarg, NULL, 0);
 			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
@@ -273,6 +316,13 @@ int main(int argc, char *argv[])
 			break;
 		case 'o':
 			p.partition_vcpu_memory_access = false;
+			p.single_uffd = true;
+			break;
+		case 'r':
+			p.readers_per_uffd = atoi(optarg);
+			TEST_ASSERT(p.readers_per_uffd >= 1,
+				    "Invalid number of readers per uffd %d: must be >=1",
+				    p.readers_per_uffd);
 			break;
 		case 'h':
 		default:
diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 877449c34592..af83a437e74a 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -17,18 +17,27 @@
 
 typedef int (*uffd_handler_t)(int uffd_mode, int uffd, struct uffd_msg *msg);
 
-struct uffd_desc {
+struct uffd_reader_args {
 	int uffd_mode;
 	int uffd;
-	int pipefds[2];
 	useconds_t delay;
 	uffd_handler_t handler;
-	pthread_t thread;
+	/* Holds the read end of the pipe for killing the reader. */
+	int pipe;
+};
+
+struct uffd_desc {
+	int uffd;
+	uint64_t num_readers;
+	/* Holds the write ends of the pipes for killing the readers. */
+	int *pipefds;
+	pthread_t *readers;
+	struct uffd_reader_args *reader_args;
 };
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler);
+					   uint64_t num_readers, uffd_handler_t handler);
 
 void uffd_stop_demand_paging(struct uffd_desc *uffd);
 
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 271f63891581..6f220aa4fb08 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -27,10 +27,8 @@
 
 static void *uffd_handler_thread_fn(void *arg)
 {
-	struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
-	int uffd = uffd_desc->uffd;
-	int pipefd = uffd_desc->pipefds[0];
-	useconds_t delay = uffd_desc->delay;
+	struct uffd_reader_args *reader_args = (struct uffd_reader_args *)arg;
+	int uffd = reader_args->uffd;
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
@@ -44,7 +42,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 		pollfd[0].fd = uffd;
 		pollfd[0].events = POLLIN;
-		pollfd[1].fd = pipefd;
+		pollfd[1].fd = reader_args->pipe;
 		pollfd[1].events = POLLIN;
 
 		r = poll(pollfd, 2, -1);
@@ -92,9 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
 
-		if (delay)
-			usleep(delay);
-		r = uffd_desc->handler(uffd_desc->uffd_mode, uffd, &msg);
+		if (reader_args->delay)
+			usleep(reader_args->delay);
+		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
 		if (r < 0)
 			return NULL;
 		pages++;
@@ -110,7 +108,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler)
+					   uint64_t num_readers, uffd_handler_t handler)
 {
 	struct uffd_desc *uffd_desc;
 	bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
@@ -118,14 +116,26 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	struct uffdio_api uffdio_api;
 	struct uffdio_register uffdio_register;
 	uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY;
-	int ret;
+	int ret, i;
 
 	PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n",
 		       is_minor ? "MINOR" : "MISSING",
 		       is_minor ? "UFFDIO_CONINUE" : "UFFDIO_COPY");
 
 	uffd_desc = malloc(sizeof(struct uffd_desc));
-	TEST_ASSERT(uffd_desc, "malloc failed");
+	TEST_ASSERT(uffd_desc, "Failed to malloc uffd descriptor");
+
+	uffd_desc->pipefds = malloc(sizeof(int) * num_readers);
+	TEST_ASSERT(uffd_desc->pipefds, "Failed to malloc pipes");
+
+	uffd_desc->readers = malloc(sizeof(pthread_t) * num_readers);
+	TEST_ASSERT(uffd_desc->readers, "Failed to malloc reader threads");
+
+	uffd_desc->reader_args = malloc(
+		sizeof(struct uffd_reader_args) * num_readers);
+	TEST_ASSERT(uffd_desc->reader_args, "Failed to malloc reader_args");
+
+	uffd_desc->num_readers = num_readers;
 
 	/* In order to get minor faults, prefault via the alias. */
 	if (is_minor)
@@ -148,18 +158,28 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
 		    expected_ioctls, "missing userfaultfd ioctls");
 
-	ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
-	TEST_ASSERT(!ret, "Failed to set up pipefd");
-
-	uffd_desc->uffd_mode = uffd_mode;
 	uffd_desc->uffd = uffd;
-	uffd_desc->delay = delay;
-	uffd_desc->handler = handler;
-	pthread_create(&uffd_desc->thread, NULL, uffd_handler_thread_fn,
-		       uffd_desc);
+	for (i = 0; i < uffd_desc->num_readers; ++i) {
+		int pipes[2];
+
+		ret = pipe2((int *) &pipes, O_CLOEXEC | O_NONBLOCK);
+		TEST_ASSERT(!ret, "Failed to set up pipefd %i for uffd_desc %p",
+			    i, uffd_desc);
+
+		uffd_desc->pipefds[i] = pipes[1];
 
-	PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n",
-		       hva, hva + len);
+		uffd_desc->reader_args[i].uffd_mode = uffd_mode;
+		uffd_desc->reader_args[i].uffd = uffd;
+		uffd_desc->reader_args[i].delay = delay;
+		uffd_desc->reader_args[i].handler = handler;
+		uffd_desc->reader_args[i].pipe = pipes[0];
+
+		pthread_create(&uffd_desc->readers[i], NULL, uffd_handler_thread_fn,
+			       &uffd_desc->reader_args[i]);
+
+		PER_VCPU_DEBUG("Created uffd thread %i for HVA range [%p, %p)\n",
+			       i, hva, hva + len);
+	}
 
 	return uffd_desc;
 }
@@ -167,19 +187,30 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 void uffd_stop_demand_paging(struct uffd_desc *uffd)
 {
 	char c = 0;
-	int ret;
+	int i, ret;
 
-	ret = write(uffd->pipefds[1], &c, 1);
-	TEST_ASSERT(ret == 1, "Unable to write to pipefd");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = write(uffd->pipefds[i], &c, 1);
+		TEST_ASSERT(
+			ret == 1, "Unable to write to pipefd %i for uffd_desc %p", i, uffd);
+	}
 
-	ret = pthread_join(uffd->thread, NULL);
-	TEST_ASSERT(ret == 0, "Pthread_join failed.");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = pthread_join(uffd->readers[i], NULL);
+		TEST_ASSERT(
+			ret == 0, "Pthread_join failed on reader %i for uffd_desc %p", i, uffd);
+	}
 
 	close(uffd->uffd);
 
-	close(uffd->pipefds[1]);
-	close(uffd->pipefds[0]);
+	for (i = 0; i < uffd->num_readers; ++i) {
+		close(uffd->pipefds[i]);
+		close(uffd->reader_args[i].pipe);
+	}
 
+	free(uffd->pipefds);
+	free(uffd->readers);
+	free(uffd->reader_args);
 	free(uffd);
 }
 
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v4 14/16] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (12 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 13/16] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-02 16:19 ` [PATCH v4 15/16] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
  2023-06-02 16:19 ` [PATCH v4 16/16] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
  15 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

With multiple reader threads POLLing a single UFFD, the test suffers
from the thundering herd problem: performance degrades as the number of
reader threads is increased. Solve this issue [1] by switching the
polling mechanism to EPOLL + EPOLLEXCLUSIVE.

Also, change the error-handling convention of uffd_handler_thread_fn.
Instead of just printing errors and returning early from the polling
loop, check for them via TEST_ASSERT. "return NULL" is reserved for a
successful exit from uffd_handler_thread_fn, i.e. one triggered by a
write to the exit pipe.

Performance samples generated by the command in [2] are given below.

Num Reader Threads   Paging Rate (POLL)   Paging Rate (EPOLL)
 1                   249k                 185k
 2                   201k                 235k
 4                   186k                 155k
16                   150k                 217k
32                    89k                 198k

[1] Single-vCPU performance does suffer somewhat.
[2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
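
For reference, a minimal sketch of the mechanism (illustrative code only,
not lifted from the patch): with EPOLLEXCLUSIVE, a fault on the shared
uffd wakes just one of the blocked reader threads, whereas poll() wakes
all of them.

    #include <sys/epoll.h>

    int epollfd = epoll_create(1);
    struct epoll_event evt = {
        .events = EPOLLIN | EPOLLEXCLUSIVE, /* wake a single waiter */
    };
    epoll_ctl(epollfd, EPOLL_CTL_ADD, uffd, &evt);
    /* N reader threads can now block here without herding: */
    epoll_wait(epollfd, &evt, 1, -1);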

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        |  1 -
 .../selftests/kvm/lib/userfaultfd_util.c      | 74 +++++++++----------
 2 files changed, 35 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 35935874c690..115194e19ad9 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -13,7 +13,6 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
-#include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
 #include <sys/syscall.h>
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 6f220aa4fb08..2a179133645a 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -16,6 +16,7 @@
 #include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <sys/epoll.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -32,60 +33,55 @@ static void *uffd_handler_thread_fn(void *arg)
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
+	int epollfd;
+	struct epoll_event evt;
+
+	epollfd = epoll_create(1);
+	TEST_ASSERT(epollfd >= 0, "Failed to create epollfd.");
+
+	evt.events = EPOLLIN | EPOLLEXCLUSIVE;
+	evt.data.u32 = 0;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, uffd, &evt) == 0,
+		    "Failed to add uffd to epollfd");
+
+	evt.events = EPOLLIN;
+	evt.data.u32 = 1;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, reader_args->pipe, &evt) == 0,
+		    "Failed to add pipe to epollfd");
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
 	while (1) {
 		struct uffd_msg msg;
-		struct pollfd pollfd[2];
-		char tmp_chr;
 		int r;
 
-		pollfd[0].fd = uffd;
-		pollfd[0].events = POLLIN;
-		pollfd[1].fd = reader_args->pipe;
-		pollfd[1].events = POLLIN;
-
-		r = poll(pollfd, 2, -1);
-		switch (r) {
-		case -1:
-			pr_info("poll err");
-			continue;
-		case 0:
-			continue;
-		case 1:
-			break;
-		default:
-			pr_info("Polling uffd returned %d", r);
-			return NULL;
-		}
+		r = epoll_wait(epollfd, &evt, 1, -1);
+		TEST_ASSERT(r == 1,
+			    "Unexpected number of events (%d) from epoll, errno = %d",
+			    r, errno);
 
-		if (pollfd[0].revents & POLLERR) {
-			pr_info("uffd revents has POLLERR");
-			return NULL;
-		}
+		if (evt.data.u32 == 1) {
+			char tmp_chr;
 
-		if (pollfd[1].revents & POLLIN) {
-			r = read(pollfd[1].fd, &tmp_chr, 1);
+			TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+				    "Reader thread received EPOLLERR or EPOLLHUP on pipe.");
+			r = read(reader_args->pipe, &tmp_chr, 1);
 			TEST_ASSERT(r == 1,
-				    "Error reading pipefd in UFFD thread\n");
+				    "Error reading pipefd in uffd reader thread");
 			break;
 		}
 
-		if (!(pollfd[0].revents & POLLIN))
-			continue;
+		TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+			    "Reader thread received EPOLLERR or EPOLLHUP on uffd.");
 
 		r = read(uffd, &msg, sizeof(msg));
 		if (r == -1) {
-			if (errno == EAGAIN)
-				continue;
-			pr_info("Read of uffd got errno %d\n", errno);
-			return NULL;
+			TEST_ASSERT(errno == EAGAIN,
+				    "Error reading from UFFD: errno = %d", errno);
+			continue;
 		}
 
-		if (r != sizeof(msg)) {
-			pr_info("Read on uffd returned unexpected size: %d bytes", r);
-			return NULL;
-		}
+		TEST_ASSERT(r == sizeof(msg),
+			    "Read on uffd returned unexpected number of bytes (%d)", r);
 
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
@@ -93,8 +89,8 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (reader_args->delay)
 			usleep(reader_args->delay);
 		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
-		if (r < 0)
-			return NULL;
+		TEST_ASSERT(r >= 0,
+			    "Reader thread handler fn returned negative value %d", r);
 		pages++;
 	}
 
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v4 15/16] KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (13 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 14/16] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-02 16:19 ` [PATCH v4 16/16] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
  15 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

Memslot flags aren't currently exposed to the tests, and are just always
set to 0. Add a parameter to allow tests to manually set those flags.
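
As an illustration of the updated signature (hypothetical call; patch 16
of this series uses it to pass KVM_MEM_NOWAIT_ON_FAULT):

    vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
                             1 /* slots */, KVM_MEM_NOWAIT_ON_FAULT,
                             src_type, partition_vcpu_memory_access);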

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 tools/testing/selftests/kvm/access_tracking_perf_test.c       | 2 +-
 tools/testing/selftests/kvm/demand_paging_test.c              | 2 +-
 tools/testing/selftests/kvm/dirty_log_perf_test.c             | 2 +-
 tools/testing/selftests/kvm/include/memstress.h               | 2 +-
 tools/testing/selftests/kvm/lib/memstress.c                   | 4 ++--
 .../testing/selftests/kvm/memslot_modification_stress_test.c  | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..b51656b408b8 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -306,7 +306,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	int nr_vcpus = params->nr_vcpus;
 
-	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1, 0,
 				 params->backing_src, !overlap_memory_access);
 
 	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 115194e19ad9..ffbc89300c46 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -163,7 +163,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	double vcpu_paging_rate;
 	uint64_t uffd_region_size;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
 				 p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index e9d6d1aecf89..6c8749193cfa 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -224,7 +224,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	int i;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
-				 p->slots, p->backing_src,
+				 p->slots, 0, p->backing_src,
 				 p->partition_vcpu_memory_access);
 
 	pr_info("Random seed: %u\n", p->random_seed);
diff --git a/tools/testing/selftests/kvm/include/memstress.h b/tools/testing/selftests/kvm/include/memstress.h
index 72e3e358ef7b..1cba965d2d33 100644
--- a/tools/testing/selftests/kvm/include/memstress.h
+++ b/tools/testing/selftests/kvm/include/memstress.h
@@ -56,7 +56,7 @@ struct memstress_args {
 extern struct memstress_args memstress_args;
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access);
 void memstress_destroy_vm(struct kvm_vm *vm);
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index 5f1d3173c238..7589b8cef691 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -119,7 +119,7 @@ void memstress_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
 }
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access)
 {
@@ -207,7 +207,7 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 
 		vm_userspace_mem_region_add(vm, backing_src, region_start,
 					    MEMSTRESS_MEM_SLOT_INDEX + i,
-					    region_pages, 0);
+					    region_pages, slot_flags);
 	}
 
 	/* Do mapping for the demand paging memory slot */
diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
index 9855c41ca811..0b19ec3ecc9c 100644
--- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
+++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
@@ -95,7 +95,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct test_params *p = arg;
 	struct kvm_vm *vm;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
 				 VM_MEM_SRC_ANONYMOUS,
 				 p->partition_vcpu_memory_access);
 
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH v4 16/16] KVM: selftests: Handle memory fault exits in demand_paging_test
  2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (14 preceding siblings ...)
  2023-06-02 16:19 ` [PATCH v4 15/16] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
@ 2023-06-02 16:19 ` Anish Moorthy
  2023-06-20  2:44   ` Robert Hoo
  15 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-06-02 16:19 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, amoorthy, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

Demonstrate a (very basic) scheme for supporting memory fault exits.

From the vCPU threads:
1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
   with the purpose of establishing the absent mappings. Do so with
   wake_waiters=false to avoid serializing on the userfaultfd wait queue
   locks.

2. When the UFFDIO_COPY/CONTINUE in (1) fails with EEXIST,
   assume that the mapping was already established but is currently
   absent [A] and attempt to populate it using MADV_POPULATE_WRITE.

Issue UFFDIO_COPY/CONTINUEs from the reader threads as well, but with
wake_waiters=true to ensure that any threads sleeping on the uffd are
eventually woken up.

A real VMM would track whether it had already COPY/CONTINUEd pages (e.g.,
via a bitmap) to avoid calls destined to fail with EEXIST. However, even the
naive approach is enough to demonstrate the performance advantages of
KVM_EXIT_MEMORY_FAULT.

[A] In reality it is much likelier that the vCPU thread simply lost a
    race to establish the mapping for the page.
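
A sketch of such tracking (hypothetical VMM-side code, not part of this
patch; assumes a base_gpa for the tracked region and one bit per
demand_paging_size-byte chunk):

    static unsigned long *populated; /* allocated/zeroed at setup */

    /*
     * Returns true iff the caller is the first thread to claim the chunk,
     * i.e. the only one which should issue the UFFDIO_COPY/CONTINUE.
     */
    static bool try_claim_chunk(uint64_t gpa)
    {
        uint64_t idx = (gpa - base_gpa) / demand_paging_size;
        size_t word = idx / (8 * sizeof(unsigned long));
        unsigned long mask = 1UL << (idx % (8 * sizeof(unsigned long)));

        return !(__atomic_fetch_or(&populated[word], mask,
                                   __ATOMIC_ACQ_REL) & mask);
    }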

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        | 235 +++++++++++++-----
 1 file changed, 166 insertions(+), 69 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index ffbc89300c46..4b79c88cb22d 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -15,6 +15,7 @@
 #include <time.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <linux/mman.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -31,36 +32,99 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
 static size_t demand_paging_size;
 static char *guest_data_prototype;
 
+static int num_uffds;
+static size_t uffd_region_size;
+static struct uffd_desc **uffd_descs;
+/*
+ * Delay when demand paging is performed through userfaultfd or directly by
+ * vcpu_worker in the case of a KVM_EXIT_MEMORY_FAULT.
+ */
+static useconds_t uffd_delay;
+static int uffd_mode;
+
+
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+				    bool is_vcpu);
+
+static void madv_write_or_err(uint64_t gpa)
+{
+	int r;
+	void *hva = addr_gpa2hva(memstress_args.vm, gpa);
+
+	r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
+	TEST_ASSERT(r == 0,
+		    "MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) failed, errno %i\n",
+		    (uintptr_t) hva, gpa, errno);
+}
+
+static void ready_page(uint64_t gpa)
+{
+	int r, uffd;
+
+	/*
+	 * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
+	 * the registered ranges should fault in the physical pages through
+	 * MADV_POPULATE_WRITE.
+	 */
+	if ((gpa < memstress_args.gpa)
+		|| (gpa >= memstress_args.gpa + memstress_args.size)) {
+		madv_write_or_err(gpa);
+	} else {
+		if (uffd_delay)
+			usleep(uffd_delay);
+
+		uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
+
+		r = handle_uffd_page_request(uffd_mode, uffd,
+					     (uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
+
+		if (r == EEXIST)
+			madv_write_or_err(gpa);
+	}
+}
+
 static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 {
 	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
 	int vcpu_idx = vcpu_args->vcpu_idx;
 	struct kvm_run *run = vcpu->run;
-	struct timespec start;
-	struct timespec ts_diff;
+	struct timespec last_start;
+	struct timespec total_runtime = {};
 	int ret;
 
-	clock_gettime(CLOCK_MONOTONIC, &start);
 
-	/* Let the guest access its memory */
-	ret = _vcpu_run(vcpu);
-	TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-	if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
-		TEST_ASSERT(false,
-			    "Invalid guest sync status: exit_reason=%s\n",
-			    exit_reason_str(run->exit_reason));
-	}
+	while (true) {
+		clock_gettime(CLOCK_MONOTONIC, &last_start);
+		/* Let the guest access its memory */
+		ret = _vcpu_run(vcpu);
+		TEST_ASSERT(ret == 0
+			    || (errno == EFAULT
+				&& run->exit_reason == KVM_EXIT_MEMORY_FAULT),
+			    "vcpu_run failed: %d\n", ret);
 
-	ts_diff = timespec_elapsed(start);
+		total_runtime = timespec_add(total_runtime,
+					     timespec_elapsed(last_start));
+		if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
+
+			if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
+				ready_page(run->memory_fault.gpa);
+				continue;
+			}
+
+			TEST_ASSERT(false,
+				    "Invalid guest sync status: exit_reason=%s\n",
+				    exit_reason_str(run->exit_reason));
+		}
+		break;
+	}
 	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
-		       ts_diff.tv_sec, ts_diff.tv_nsec);
+			total_runtime.tv_sec, total_runtime.tv_nsec);
 }
 
-static int handle_uffd_page_request(int uffd_mode, int uffd,
-		struct uffd_msg *msg)
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+				    bool is_vcpu)
 {
 	pid_t tid = syscall(__NR_gettid);
-	uint64_t addr = msg->arg.pagefault.address;
 	struct timespec start;
 	struct timespec ts_diff;
 	int r;
@@ -71,16 +135,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		struct uffdio_copy copy;
 
 		copy.src = (uint64_t)guest_data_prototype;
-		copy.dst = addr;
+		copy.dst = hva;
 		copy.len = demand_paging_size;
-		copy.mode = 0;
+		copy.mode = is_vcpu ? UFFDIO_COPY_MODE_DONTWAKE : 0;
 
-		r = ioctl(uffd, UFFDIO_COPY, &copy);
 		/*
-		 * When multiple vCPU threads fault on a single page and there are
-		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
-		 * will fail with EEXIST: handle that case without signaling an
-		 * error.
+		 * With multiple vCPU threads, and with either multiple reader threads
+		 * or vCPU-handled memory faults, multiple vCPUs accessing an absent
+		 * page will almost certainly cause some thread issuing the UFFDIO_COPY
+		 * here to get EEXIST: make sure to allow that case.
 		 *
 		 * Note that this also suppresses any EEXIST occurring from,
 		 * e.g., the first UFFDIO_COPY/CONTINUE on a page. That never
@@ -88,23 +151,24 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		 * some external state to correctly surface EEXISTs to userspace
 		 * (or prevent duplicate COPY/CONTINUEs in the first place).
 		 */
-		if (r == -1 && errno != EEXIST) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		r = ioctl(uffd, UFFDIO_COPY, &copy);
+		TEST_ASSERT(r == 0 || errno == EEXIST,
+			    "Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
 	} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+		/* The comments in the UFFDIO_COPY branch also apply here. */
 		struct uffdio_continue cont = {0};
 
-		cont.range.start = addr;
+		cont.range.start = hva;
 		cont.range.len = demand_paging_size;
+		cont.mode = is_vcpu ? UFFDIO_CONTINUE_MODE_DONTWAKE : 0;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
 		/*
-		 * When multiple vCPU threads fault on a single page and there are
-		 * multiple readers for the UFFD, at least one of the UFFDIO_CONTINUEs
-		 * will fail with EEXIST: handle that case without signaling an
-		 * error.
+		 * With multiple vCPU threads, and with either multiple reader threads
+		 * or vCPU-handled memory faults, multiple vCPUs accessing an absent
+		 * page will almost certainly cause some thread issuing the
+		 * UFFDIO_CONTINUE here to get EEXIST: make sure to allow that case.
 		 *
 		 * Note that this also suppresses any EEXIST occurring from,
 		 * e.g., the first UFFDIO_COPY/CONTINUE on a page. That never
@@ -112,32 +176,54 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		 * some external state to correctly surface EEXISTs to userspace
 		 * (or prevent duplicate COPY/CONTINUEs in the first place).
 		 */
-		if (r == -1 && errno != EEXIST) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		TEST_ASSERT(r == 0 || errno == EEXIST,
+			    "Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
 	} else {
 		TEST_FAIL("Invalid uffd mode %d", uffd_mode);
 	}
 
+	/*
+	 * If the above UFFDIO_COPY/CONTINUE failed with EEXIST, waiting threads
+	 * will not have been woken: wake them here.
+	 */
+	if (!is_vcpu && r != 0) {
+		struct uffdio_range range = {
+			.start = hva,
+			.len = demand_paging_size
+		};
+		r = ioctl(uffd, UFFDIO_WAKE, &range);
+		TEST_ASSERT(r == 0,
+			    "Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
+	}
+
 	ts_diff = timespec_elapsed(start);
 
 	PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
 		       timespec_to_ns(ts_diff));
 	PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
-		       demand_paging_size, addr, tid);
+		       demand_paging_size, hva, tid);
 
 	return 0;
 }
 
+static int handle_uffd_page_request_from_uffd(int uffd_mode, int uffd,
+					      struct uffd_msg *msg)
+{
+	TEST_ASSERT(msg->event == UFFD_EVENT_PAGEFAULT,
+		    "Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
+		    msg->event);
+	return handle_uffd_page_request(uffd_mode, uffd,
+					msg->arg.pagefault.address, false);
+}
+
 struct test_params {
-	int uffd_mode;
 	bool single_uffd;
-	useconds_t uffd_delay;
 	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
+	bool memfault_exits;
 };
 
 static void prefault_mem(void *alias, uint64_t len)
@@ -155,16 +241,22 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct memstress_vcpu_args *vcpu_args;
 	struct test_params *p = arg;
-	struct uffd_desc **uffd_descs = NULL;
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i, num_uffds = 0;
+	int i;
 	double vcpu_paging_rate;
-	uint64_t uffd_region_size;
+	uint32_t slot_flags = 0;
+	bool uffd_memfault_exits = uffd_mode && p->memfault_exits;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
-				 p->src_type, p->partition_vcpu_memory_access);
+	if (uffd_memfault_exits) {
+		TEST_ASSERT(kvm_has_cap(KVM_CAP_NOWAIT_ON_FAULT) > 0,
+					"KVM does not have KVM_CAP_NOWAIT_ON_FAULT");
+		slot_flags = KVM_MEM_NOWAIT_ON_FAULT;
+	}
+
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
+				 1, slot_flags, p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
 
@@ -173,21 +265,21 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		    "Failed to allocate buffer for guest data pattern");
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
-	if (p->uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
-		num_uffds = p->single_uffd ? 1 : nr_vcpus;
-		for (i = 0; i < num_uffds; i++) {
-			vcpu_args = &memstress_args.vcpu_args[i];
-			prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
-				     vcpu_args->pages * memstress_args.guest_page_size);
-		}
-	}
-
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		num_uffds = p->single_uffd ? 1 : nr_vcpus;
 		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
 
+		if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+			for (i = 0; i < num_uffds; i++) {
+				vcpu_args = &memstress_args.vcpu_args[i];
+				prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
+					     uffd_region_size);
+			}
+		}
+
 		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
-		TEST_ASSERT(uffd_descs, "Memory allocation failed");
+		TEST_ASSERT(uffd_descs, "Failed to allocate uffd descriptors");
+
 		for (i = 0; i < num_uffds; i++) {
 			struct memstress_vcpu_args *vcpu_args;
 			void *vcpu_hva;
@@ -201,10 +293,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 * requests.
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
-				p->uffd_mode, p->uffd_delay, vcpu_hva,
+				uffd_mode, uffd_delay, vcpu_hva,
 				uffd_region_size,
 				p->readers_per_uffd,
-				&handle_uffd_page_request);
+				&handle_uffd_page_request_from_uffd);
 		}
 	}
 
@@ -218,7 +310,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	ts_diff = timespec_elapsed(start);
 	pr_info("All vCPU threads joined\n");
 
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
 		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
@@ -239,7 +331,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memstress_destroy_vm(vm);
 
 	free(guest_data_prototype);
-	if (p->uffd_mode)
+	if (uffd_mode)
 		free(uffd_descs);
 }
 
@@ -248,7 +340,7 @@ static void help(char *name)
 	puts("");
 	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
 		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
-		   "          [-s type] [-v vcpus] [-o]\n", name);
+		   "          [-w] [-s type] [-v vcpus] [-o]\n", name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
@@ -259,6 +351,7 @@ static void help(char *name)
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
 	printf(" -r: Set the number of reader threads per uffd.\n");
+	printf(" -w: Enable kvm cap for memory fault exits.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -278,29 +371,30 @@ int main(int argc, char *argv[])
 		.partition_vcpu_memory_access = true,
 		.readers_per_uffd = 1,
 		.single_uffd = false,
+		.memfault_exits = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
+	while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
 			break;
 		case 'u':
 			if (!strcmp("MISSING", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
+				uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
 			else if (!strcmp("MINOR", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
-			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
+				uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
+			TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
 		case 'a':
 			p.single_uffd = true;
 			break;
 		case 'd':
-			p.uffd_delay = strtoul(optarg, NULL, 0);
-			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
+			uffd_delay = strtoul(optarg, NULL, 0);
+			TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
 			break;
 		case 'b':
 			guest_percpu_mem_size = parse_size(optarg);
@@ -323,6 +417,9 @@ int main(int argc, char *argv[])
 				    "Invalid number of readers per uffd %d: must be >=1",
 				    p.readers_per_uffd);
 			break;
+		case 'w':
+			p.memfault_exits = true;
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
@@ -330,7 +427,7 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
+	if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
 	    !backing_src_is_shared(p.src_type)) {
 		TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
 	}
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 02/16] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
  2023-06-02 16:19 ` [PATCH v4 02/16] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN Anish Moorthy
@ 2023-06-02 20:30   ` Isaku Yamahata
  2023-06-05 16:41     ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Isaku Yamahata @ 2023-06-02 20:30 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: seanjc, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

On Fri, Jun 02, 2023 at 04:19:07PM +0000,
Anish Moorthy <amoorthy@google.com> wrote:

> Give kvm_run.exit_reason a defined initial value on entry into KVM_RUN:
> other architectures (riscv, arm64) already use KVM_EXIT_UNKNOWN for this
> purpose, so copy that convention.
> 
> This gives vCPUs trying to fill the run struct a mechanism to avoid
> overwriting already-populated data, albeit an imperfect one. Being able
> to detect an already-populated KVM run struct will prevent at least some
> bugs in the upcoming implementation of KVM_CAP_MEMORY_FAULT_INFO, which
> will attempt to fill the run struct whenever a vCPU fails a guest memory
> access.
> 
> Without the already-populated check, KVM_CAP_MEMORY_FAULT_INFO could
> change kvm_run in any code paths which
> 
> 1. Populate kvm_run for some exit and prepare to return to userspace
> 2. Access guest memory for some reason (but without returning -EFAULTs
>     to userspace)
> 3. Finish the return to userspace set up in (1), now with the contents
>     of kvm_run changed to contain efault info.
> 

As the VMX code uses KVM_EXIT_UNKNOWN with hardware_exit_reason set to the
hardware exit reason, can we initialize hardware_exit_reason to -1 as well,
just in case?


> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  arch/x86/kvm/x86.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ceb7c5e9cf9e..a7725d41570a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11163,6 +11163,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  	if (r <= 0)
>  		goto out;
>  
> +	kvm_run->exit_reason = KVM_EXIT_UNKNOWN;

+	kvm_run->hardware_exit_reason = -1;     /* unused exit reason value */

>  	r = vcpu_run(vcpu);
>  
>  out:
> -- 
> 2.41.0.rc0.172.g3f132b7071-goog
> 

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-02 16:19 ` [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
@ 2023-06-03 16:58   ` Isaku Yamahata
  2023-06-05 16:37     ` Anish Moorthy
  2023-06-05 17:46   ` Anish Moorthy
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 79+ messages in thread
From: Isaku Yamahata @ 2023-06-03 16:58 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: seanjc, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

On Fri, Jun 02, 2023 at 04:19:08PM +0000,
Anish Moorthy <amoorthy@google.com> wrote:

> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index add067793b90..5b24059143b3 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6700,6 +6700,18 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len; /* in bytes */


UPM or gmem uses size instead of len. Personally I don't have any strong
preference.  It's better to converge. (or use union to accept both?)

I checked the existing ones:
KVM_EXIT_IO: size.
KVM_EXIT_MMIO: len.
KVM_INTERNAL_ERROR_EMULATION: insn_size
struct kvm_coalesced_mmio_zone: size
struct kvm_coalesced_mmio: len
struct kvm_ioeventfd: len
struct kvm_enc_region: size
struct kvm_sev_*: len
struct kvm_memory_attributes: size
struct kvm_create_guest_memfd: size


> +		} memory_fault;
> +
> +Indicates a vCPU memory fault on the guest physical address range
> +[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
> +
>  ::
>  
>      /* KVM_EXIT_NOTIFY */


-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-03 16:58   ` Isaku Yamahata
@ 2023-06-05 16:37     ` Anish Moorthy
  2023-06-14 14:55       ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-06-05 16:37 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: seanjc, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit

On Sat, Jun 3, 2023 at 9:58 AM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
>
> UPM or gmem uses size instead of len. Personally I don't have any strong
> preference.  It's better to converge. (or use union to accept both?)

I like "len" because to me it implies a contiguous range, whereas
"size" does not: but it's a minor thing. Converging does seem good
though.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 02/16] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
  2023-06-02 20:30   ` Isaku Yamahata
@ 2023-06-05 16:41     ` Anish Moorthy
  0 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-05 16:41 UTC (permalink / raw)
  To: Isaku Yamahata, Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit

On Fri, Jun 2, 2023 at 1:30 PM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
>
> As the VMX code uses KVM_EXIT_UNKNOWN with hardware_exit_reason set to the
> hardware exit reason, can we initialize hardware_exit_reason to -1 as well,
> just in case?

Will do

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-02 16:19 ` [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
  2023-06-03 16:58   ` Isaku Yamahata
@ 2023-06-05 17:46   ` Anish Moorthy
  2023-06-14 17:35   ` Sean Christopherson
  2023-07-05  8:21   ` Kautuk Consul
  3 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-05 17:46 UTC (permalink / raw)
  To: seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, robert.hoo.linux, jthoughton, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, nadav.amit, isaku.yamahata

By the way, I'd like to solicit opinions on the checks that
kvm_populate_efault_info is performing: specifically, the
exit-reason-unset check.

On Fri, Jun 2, 2023 at 9:19 AM Anish Moorthy <amoorthy@google.com> wrote:
>
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> +                                    uint64_t gpa, uint64_t len, uint64_t flags)
> +{
>  ...
> +       else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
> +               goto out;

What I intended this check to guard against was the first problematic
case (A) I called out in the cover letter

> The implementation strategy for KVM_CAP_MEMORY_FAULT_INFO has risks: for
> example, if there are any existing paths in KVM_RUN which cause a vCPU
> to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
> access but ignore the failure and then (3) complete the exit to
> userspace set up in (1), then the contents of the kvm_run struct written
> in (1) will be corrupted.

What I wrote was actually incorrect, as you may see: if in (1) the
exit reason != KVM_EXIT_UNKNOWN then the exit-reason-unset check will
prevent writing to the run struct. Now, if for some reason this flow
involved populating most of the run struct in (1) but only setting the
exit reason in (3) then we'd still have a problem: but it's not
feasible to anticipate everything after all :)

I also mentioned a different error case (B)

> Another example: if KVM_RUN fails a guest memory access for which the
> EFAULT is annotated but does not return the EFAULT to userspace, then
> later returns an *un*annotated EFAULT to userspace, then userspace will
> receive incorrect information.

When the second EFAULT is un-annotated, the presence/absence of the
exit-reason-unset check is irrelevant: userspace will observe an
annotated EFAULT in place of an un-annotated one either way.

There's also a third interesting case (C) which I didn't mention: an
annotated EFAULT which is ignored/suppressed followed by one which is
propagated to userspace. Here the exit-reason-unset check will prevent
the second annotation from being written, so userspace sees an
annotation with bad contents, If we believe that case (A) is a weird
sequence of events that shouldn't be happening in the first place,
then ensuring correctness in case (C) seems the more important goal. But I
don't know how often (A) actually happens in KVM, which is why I want
others' opinions.

So, should we drop the exit-reason-unset check (and the accompanying
patch 4) and treat existing occurrences of case (A) as bugs, or should
we maintain the check at the cost of incorrect behavior in case (C)?
Or is there another option here entirely?

Sean, I remember you calling out that some of the emulated mmio code
follows the pattern in (A), but it's been a while and my memory is
fuzzy. What's your opinion here?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 01/16] KVM: Allow hva_pfn_fast() to resolve read-only faults.
  2023-06-02 16:19 ` [PATCH v4 01/16] KVM: Allow hva_pfn_fast() to resolve read-only faults Anish Moorthy
@ 2023-06-14 14:39   ` Sean Christopherson
  2023-06-14 16:57     ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-06-14 14:39 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

Don't put trailing punctation in shortlogs, i.e. drop the period.

On Fri, Jun 02, 2023, Anish Moorthy wrote:
> hva_to_pfn_fast() currently just fails for read-only faults, which is
> unnecessary. Instead, try pinning the page without passing FOLL_WRITE.

s/pinning/getting (or maybe grabbing?), because "pinning" is already way too
overloaded in the context of gup(), e.g. FOLL_PIN vs. FOLL_GET.

> This allows read-only faults to (potentially) be resolved without

"read-only faults" is somewhat confusing, because every architecture passes a
non-NULL @writable for read faults.  If it weren't for KVM_ARM_MTE_COPY_TAGS,
this could be "faults to read-only memslots".  Not sure how to concisely and
accurately describe this.  :-/

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-05 16:37     ` Anish Moorthy
@ 2023-06-14 14:55       ` Sean Christopherson
  0 siblings, 0 replies; 79+ messages in thread
From: Sean Christopherson @ 2023-06-14 14:55 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Isaku Yamahata, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit

On Mon, Jun 05, 2023, Anish Moorthy wrote:
> On Sat, Jun 3, 2023 at 9:58 AM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
> >
> > UPM or gmem uses size instead of len. Personally I don't have any strong
> > preference.  It's better to converge. (or use union to accept both?)
> 
> I like "len" because to me it implies a contiguous range, whereas
> "size" does not: but it's a minor thing. Converging does seem good
> though.

Eh, I don't think we need to converge the two.  "size" is far more common when
describing the properties of a file (the gmem case), whereas "length" is often
used when describing the number of bytes being accessed by a read/write.  I.e.
they're two different things, so using different words to describe them isn't a
bad thing.

Though I suspect by "UPM or gmem" Isaku really meant "struct kvm_memory_attributes".
I don't think we need to converge that one either, though I do agree that "size"
isn't the greatest name.  I vote to rename kvm_memory_attributes's "size" to either
"nr_bytes" or "len".

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 01/16] KVM: Allow hva_pfn_fast() to resolve read-only faults.
  2023-06-14 14:39   ` Sean Christopherson
@ 2023-06-14 16:57     ` Anish Moorthy
  2023-08-10 19:54       ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-06-14 16:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Jun 14, 2023 at 7:39 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Don't put trailing punctation in shortlogs, i.e. drop the period.
>
> s/pinning/getting (or maybe grabbing?), because "pinning" is already way too
> overloaded in the context of gup(), e.g. FOLL_PIN vs. FOLL_GET.

Done

> "read-only faults" is somewhat confusing, because every architecture passes a
> non-NULL @writable for read faults.  If it weren't for KVM_ARM_MTE_COPY_TAGS,
> this could be "faults to read-only memslots".  Not sure how to concisely and
> accurately describe this.  :-/

"Read faults when establishing writable mappings is forbidden" maybe?
That should be accurate, although it's certainly not concise.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-02 16:19 ` [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
  2023-06-03 16:58   ` Isaku Yamahata
  2023-06-05 17:46   ` Anish Moorthy
@ 2023-06-14 17:35   ` Sean Christopherson
  2023-06-20 21:13     ` Anish Moorthy
                       ` (3 more replies)
  2023-07-05  8:21   ` Kautuk Consul
  3 siblings, 4 replies; 79+ messages in thread
From: Sean Christopherson @ 2023-06-14 17:35 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Jun 02, 2023, Anish Moorthy wrote:
> +The 'gpa' and 'len' (in bytes) fields describe the range of guest
> +physical memory to which access failed, i.e. [gpa, gpa + len). 'flags' is a
> +bitfield indicating the nature of the access: valid masks are
> +
> +  - KVM_MEMORY_FAULT_FLAG_WRITE:     The failed access was a write.
> +  - KVM_MEMORY_FAULT_FLAG_EXEC:      The failed access was an exec.

We should also define a READ flag, even though it's not strictly necessary.  That
gives userspace another way to detect "bad" data (flags should never be zero), and
it will allow us to use the same RWX bits that KVM_SET_MEMORY_ATTRIBUTES uses,
e.g. R = BIT(0), W = BIT(1), X = BIT(2) (which just so happens to be the same
RWX definitions EPT uses).
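
For instance, following that layout (illustrative definitions only, not
taken from any posted patch):

    #define KVM_MEMORY_FAULT_FLAG_READ   (1ULL << 0)
    #define KVM_MEMORY_FAULT_FLAG_WRITE  (1ULL << 1)
    #define KVM_MEMORY_FAULT_FLAG_EXEC   (1ULL << 2)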

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0e571e973bc2..69a221f71914 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2288,4 +2288,13 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +/*
> + * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
> + * populate the memory_fault field with the given information.
> + *
> + * WARNs and does nothing if the exit reason is not KVM_EXIT_UNKNOWN, or if
> + * 'vcpu' is not the current running vcpu.
> + */
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,

Tagging a globally visible, non-static function as "inline" is odd, to say the
least.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fd80a560378c..09d4d85691e1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4674,6 +4674,9 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  
>  		return r;
>  	}
> +	case KVM_CAP_MEMORY_FAULT_INFO: {

No need for curly braces.  But that's moot because there's no need for this at
all, just let it fall through to the default handling, which KVM already does for
a number of other informational capabilities.

> +		return -EINVAL;
> +	}
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -6173,3 +6176,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  
>  	return init_context.err;
>  }
> +
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,

I strongly prefer to avoid "populate" and "efault".  Avoid "populate" because
that verb will become stale the instant we do anything else in the helper.
Avoid "efault" because it's imprecise, i.e. this isn't to be used for just any
old -EFAULT scenario.  Something like kvm_handle_guest_uaccess_fault()? Definitely
open to other names (especially less verbose names).

> +				     uint64_t gpa, uint64_t len, uint64_t flags)
> +{
> +	if (WARN_ON_ONCE(!vcpu))
> +		return;

Drop this and instead invoke the helper if and only if vCPU is guaranteed to be
valid, e.g. in a future patch, don't add a conditional call to __kvm_write_guest_page(),
just handle the -EFAULT in kvm_vcpu_write_guest_page().  If the concern is that
callers would need to manually check "r == -EFAULT", this helper could take in the
error code, same as we do for kvm_handle_memory_failure(), e.g. do 

	if (r != -EFAULT)
		return;

here.  This is also an argument for using a less prescriptive name for the helper.
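
Putting those pieces together, one possible shape for the helper (the name,
signature, and body are all a sketch, not a final design):

	/*
	 * Sketch: annotate an -EFAULT from a guest memory access.  Takes the
	 * return code so that callers don't need to check for -EFAULT
	 * themselves; any other error is left untouched.
	 */
	static void kvm_handle_guest_uaccess_fault(struct kvm_vcpu *vcpu, int r,
						   u64 gpa, u64 len, u64 flags)
	{
		if (r != -EFAULT)
			return;

		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
		vcpu->run->memory_fault.gpa = gpa;
		vcpu->run->memory_fault.len = len;
		vcpu->run->memory_fault.flags = flags;
	}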

> +
> +	preempt_disable();
> +	/*
> +	 * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> +	 * would open the door for races between concurrent calls to this
> +	 * function.
> +	 */
> +	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> +		goto out;

Meh, this is overkill IMO.  The check in mark_page_dirty_in_slot() is an
abomination that I wish didn't exist, not a pattern that should be copied.  If
we do keep this sanity check, it can simply be

	if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
		return;

because as the comment for kvm_get_running_vcpu() explains, the returned vCPU
pointer won't change even if this task gets migrated to a different pCPU.  If
this code were doing something with vcpu->cpu then preemption would need to be
disabled throughout, but that's not the case.

> +	/*
> +	 * Try not to overwrite an already-populated run struct.
> +	 * This isn't a perfect solution, as there's no guarantee that the exit
> +	 * reason is set before the run struct is populated, but it should prevent
> +	 * at least some bugs.
> +	 */
> +	else if

Kernel style is to not use if-elif-elif if any of the preceding checks are terminal,
i.e. return or goto.  There's a good reason for that style/rule too, as it allows
avoiding weirdness like this where there's a big block comment in the middle of
an if-elif sequence.

> (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))

As I've stated multiple times, this can't WARN in "normal" builds because userspace
can modify kvm_run fields at will.  I do want a WARN as it will allow fuzzers to
find bugs for us, but it needs to be guarded with a Kconfig (or maybe a module
param).  One idea would be to make the proposed CONFIG_KVM_PROVE_MMU[*] a generic
Kconfig and use that.

And this should not be a terminal condition, i.e. KVM should WARN but continue on.
I am like 99% confident there are existing cases where KVM fills exit_reason
without actually exiting, i.e. bailing will immediately "break" KVM.  On the other
hand, clobbering what came before *might* break KVM, but it might work too.  More
thoughts below.

[*] https://lkml.kernel.org/r/20230511235917.639770-8-seanjc%40google.com

> +		goto out;

Folding in your other reply, as I wanted the full original context.

> What I intended this check to guard against was the first problematic
> case (A) I called out in the cover letter
>
> > The implementation strategy for KVM_CAP_MEMORY_FAULT_INFO has risks: for
> > example, if there are any existing paths in KVM_RUN which cause a vCPU
> > to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
> > access but ignore the failure and then (3) complete the exit to
> > userspace set up in (1), then the contents of the kvm_run struct written
> > in (1) will be corrupted.
>
> What I wrote was actually incorrect, as you may see: if in (1) the
> exit reason != KVM_EXIT_UNKNOWN then the exit-reason-unset check will
> prevent writing to the run struct. Now, if for some reason this flow
> involved populating most of the run struct in (1) but only setting the
> exit reason in (3) then we'd still have a problem: but it's not
> feasible to anticipate everything after all :)
>
> I also mentioned a different error case (B)
>
> > Another example: if KVM_RUN fails a guest memory access for which the
> > EFAULT is annotated but does not return the EFAULT to userspace, then
> > later returns an *un*annotated EFAULT to userspace, then userspace will
> > receive incorrect information.
>
> When the second EFAULT is un-annotated the presence/absence of the
> exit-reason-unset check is irrelevant: userspace will observe an
> annotated EFAULT in place of an un-annotated one either way.
>
> There's also a third interesting case (C) which I didn't mention: an
> annotated EFAULT which is ignored/suppressed followed by one which is
> propagated to userspace. Here the exit-reason-unset check will prevent
> the second annotation from being written, so userspace sees an
> annotation with bad contents. If we believe that case (A) is a weird
> sequence of events that shouldn't be happening in the first place,
> then case (C) seems more important to ensure correctness in. But I
> don't know anything about how often (A) happens in KVM, which is why I
> want others' opinions.
>
> So, should we drop the exit-reason-unset check (and the accompanying
> patch 4) and treat existing occurrences of case (A) as bugs, or should
> we maintain the check at the cost of incorrect behavior in case (C)?
> Or is there another option here entirely?
>
> Sean, I remember you calling out that some of the emulated mmio code
> follows the pattern in (A), but it's been a while and my memory is
> fuzzy. What's your opinion here?

I got a bit (ok, way more than a bit) lost in all of the (A) (B) (C) madness.  I
think this is what you intended for each case?

  (A) if there are any existing paths in KVM_RUN which cause a vCPU
      to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
      access but ignore the failure and then (3) complete the exit to
      userspace set up in (1), then the contents of the kvm_run struct written
      in (1) will be corrupted.

  (B) if KVM_RUN fails a guest memory access for which the EFAULT is annotated
      but does not return the EFAULT to userspace, then later returns an *un*annotated
      EFAULT to userspace, then userspace will receive incorrect information.

  (C) an annotated EFAULT which is ignored/suppressed followed by one which is
      propagated to userspace. Here the exit-reason-unset check will prevent the
      second annotation from being written, so userspace sees an annotation with
      bad contents. If we believe that case (A) is a weird sequence of events
      that shouldn't be happening in the first place, then case (C) seems more
      important to ensure correctness in. But I don't know anything about how often
      (A) happens in KVM, which is why I want others' opinions.

(A) does sadly happen.  I wouldn't call it a "pattern" though, it's an unfortunate
side effect of deficiencies in KVM's uAPI.

(B) is the trickiest to defend against in the kernel, but as I mentioned in earlier
versions of this series, userspace needs to guard against a vCPU getting stuck in
an infinite fault anyways, so I'm not _that_ concerned with figuring out a way to
address this in the kernel.  KVM's documentation should strongly encourage userspace
to take action if KVM repeatedly exits with the same info over and over, but beyond
that I think anything else is nice to have, not mandatory.
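
To make that concrete, here's a rough userspace-side sketch of such a guard
(the memory_fault fields are this series' proposed uAPI; MAX_SAME_FAULT and
resolve_fault() are made up for illustration):

	#include <errno.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	#define MAX_SAME_FAULT 64

	/* Hypothetical helper that pages in [gpa, gpa + len), e.g. via UFFDIO_COPY. */
	extern void resolve_fault(uint64_t gpa, uint64_t len);

	static int run_vcpu_with_fault_guard(int vcpu_fd, struct kvm_run *run)
	{
		uint64_t last_gpa = UINT64_MAX;
		int repeats = 0;

		for (;;) {
			int ret = ioctl(vcpu_fd, KVM_RUN, 0);

			if (ret < 0 && errno == EFAULT &&
			    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
				/* Same gpa over and over => vCPU is likely stuck. */
				if (run->memory_fault.gpa == last_gpa) {
					if (++repeats >= MAX_SAME_FAULT)
						return -1;
				} else {
					last_gpa = run->memory_fault.gpa;
					repeats = 0;
				}
				resolve_fault(run->memory_fault.gpa,
					      run->memory_fault.len);
				continue;
			}
			return ret;
		}
	}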

(C) should simply not be possible.  (A) is very much a "this shouldn't happen,
but it does" case.  KVM provides no meaningful guarantees if (A) does happen, so
yes, prioritizing correctness for (C) is far more important.

That said, prioritizing (C) doesn't mean we can't also do our best to play nice
with (A).  None of the existing exits use anywhere near the exit info union's 256
bytes, i.e. there is tons of space to play with.  So rather than put memory_fault
in with all the others, what if we split the union in two, and place memory_fault
in the high half (doesn't have to literally be half, but you get the idea).  It'd
kinda be similar to x86's contributory vs. benign faults; exits that can't be
"nested" or "speculative" go in the low half, and things like memory_fault go in
the high half.

That way, if (A) does occur, the original information will be preserved when KVM
fills memory_fault.  And my suggestion to WARN-and-continue limits the problematic
scenarios to just fields in the second union, i.e. just memory_fault for now.
At the very least, not clobbering would likely make it easier for us to debug when
things go sideways.

And rather than use kvm_run.exit_reason as the canary, we should carve out a
kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
canary in common KVM code, which can't be done for kvm_run.exit_reason because
some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
in KVM_RUN.

E.g. something like this (the #ifdefs are heinous, it might be better to let
userspace see the exit_canary, but make it abundantly clear that it's not ABI).

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 143abb334f56..233702124e0a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -511,16 +511,43 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID     (1 << 0)
                        __u32 flags;
                } notify;
+               /* Fix the size of the union. */
+               char padding[128];
+       };
+
+       /*
+        * This second KVM_EXIT_* union holds structures for exits that may be
+        * triggered after KVM has already initiated a different exit, and/or
+        * may be filled speculatively by KVM.  E.g. because of limitations in
+        * KVM's uAPI, a memory fault can be encountered after an MMIO exit is
+        * initiated and kvm_run.mmio is filled.  Isolating these structures
+        * from the primary KVM_EXIT_* union ensures that KVM won't clobber
+        * information for the original exit.
+        */
+       union {
                /* KVM_EXIT_MEMORY_FAULT */
                struct {
                        __u64 flags;
                        __u64 gpa;
                        __u64 len;
                } memory_fault;
-               /* Fix the size of the union. */
-               char padding[256];
+               /* Fix the size of this union too. */
+#ifndef __KERNEL__
+               char padding2[128];
+#else
+               char padding2[120];
+#endif
        };
 
+#ifdef __KERNEL__
+       /*
+        * Non-ABI, kernel-only field that KVM uses to detect bugs related to
+        * filling exit_reason and the exit unions, e.g. to guard against
+        * clobbering a previous exit.
+        */
+       __u64 exit_canary;
+#endif
+
        /* 2048 is the size of the char array used to bound/pad the size
         * of the union that holds sync regs.
         */
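
The "set" side isn't shown above, but presumably the common exit paths would do
something like this (hypothetical):

	/* When KVM initiates an exit to userspace: */
	WARN_ON_ONCE(vcpu->run->exit_canary);	/* a previous exit is pending? */
	vcpu->run->exit_canary = 1;

with KVM_RUN clearing exit_canary on reentry.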


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 05/16] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
  2023-06-02 16:19 ` [PATCH v4 05/16] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page() Anish Moorthy
@ 2023-06-14 19:10   ` Sean Christopherson
  2023-07-06 22:51     ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-06-14 19:10 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Jun 02, 2023, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for uaccess failures in
> kvm_vcpu_write_guest_page()
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  virt/kvm/kvm_main.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d9c0fa7c907f..ea27a8178f1a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3090,8 +3090,10 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>  
>  /*
>   * Copy 'len' bytes from 'data' into guest memory at '(gfn * PAGE_SIZE) + offset'
> + * If 'vcpu' is non-null, then this function may fill its run struct for a
> + * KVM_EXIT_MEMORY_FAULT on uaccess failure.
>   */
> -static int __kvm_write_guest_page(struct kvm *kvm,
> +static int __kvm_write_guest_page(struct kvm *kvm, struct kvm_vcpu *vcpu,
>  				  struct kvm_memory_slot *memslot, gfn_t gfn,
>  			          const void *data, int offset, int len)
>  {
> @@ -3102,8 +3104,13 @@ static int __kvm_write_guest_page(struct kvm *kvm,
>  	if (kvm_is_error_hva(addr))
>  		return -EFAULT;
>  	r = __copy_to_user((void __user *)addr + offset, data, len);
> -	if (r)
> +	if (r) {
> +		if (vcpu)

As mentioned in a previous mail, put this in the (one) caller.  If more callers
come along, which is highly unlikely, we can revisit that decision.  Right now,
it just adds noise, both here and in the helper.

> +			kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset,
> +						 len,
> +						 KVM_MEMORY_FAULT_FLAG_WRITE);

For future reference, the 80 char limit is a soft limit, and with a lot of
subjectivity, can be breached.  In this case, this

			kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset,
						 len, KVM_MEMORY_FAULT_FLAG_WRITE);

is subjectively more readable than

			kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset,
						 len,
						 KVM_MEMORY_FAULT_FLAG_WRITE);
>  		return -EFAULT;
> +	}
>  	mark_page_dirty_in_slot(kvm, memslot, gfn);
>  	return 0;
>  }
> @@ -3113,7 +3120,7 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
>  {
>  	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>  
> -	return __kvm_write_guest_page(kvm, slot, gfn, data, offset, len);
> +	return __kvm_write_guest_page(kvm, NULL, slot, gfn, data, offset, len);
>  }
>  EXPORT_SYMBOL_GPL(kvm_write_guest_page);
>  
> @@ -3121,8 +3128,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  			      const void *data, int offset, int len)
>  {
>  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> -

Newline after variable declarations.  Double demerits for breaking what was
originally correct :-)

> -	return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
> +	return __kvm_write_guest_page(vcpu->kvm, vcpu, slot, gfn, data,
> +				      offset, len);

With my various suggestions, something like

	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
	int r;

	r = __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
	if (r)
		kvm_handle_guest_uaccess_fault(...);
	return r;

Side topic, "uaccess", and thus any "userfault" variants, is probably a bad name
for the API, because that will fall apart when guest private memory comes along.  

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 06/16] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page()
  2023-06-02 16:19 ` [PATCH v4 06/16] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page() Anish Moorthy
@ 2023-06-14 19:22   ` Sean Christopherson
  2023-07-07 17:35     ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-06-14 19:22 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Jun 02, 2023, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for uaccess failures within
> kvm_vcpu_read_guest_page().

Same comments as the "write" patch.  And while I often advocate for tiny patches,
I see no reason to split the read and write changes into separate patches, they're
thematically identical enough to count as a "single logical change".

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 07/16] KVM: Simplify error handling in __gfn_to_pfn_memslot()
  2023-06-02 16:19 ` [PATCH v4 07/16] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
@ 2023-06-14 19:26   ` Sean Christopherson
  2023-07-07 17:33     ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-06-14 19:26 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Jun 02, 2023, Anish Moorthy wrote:
> KVM_HVA_ERR_RO_BAD satisfies kvm_is_error_hva(), so there's no need to
> duplicate the "if (writable)" block. Fix this by bringing all
> kvm_is_error_hva() cases under one conditional.
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  virt/kvm/kvm_main.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b9d2606f9251..05d6e7e3994d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2711,16 +2711,14 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>  	if (hva)
>  		*hva = addr;
>  
> -	if (addr == KVM_HVA_ERR_RO_BAD) {
> -		if (writable)
> -			*writable = false;
> -		return KVM_PFN_ERR_RO_FAULT;
> -	}
> -
>  	if (kvm_is_error_hva(addr)) {
>  		if (writable)
>  			*writable = false;
> -		return KVM_PFN_NOSLOT;
> +
> +		if (addr == KVM_HVA_ERR_RO_BAD)
> +			return KVM_PFN_ERR_RO_FAULT;
> +		else
> +			return KVM_PFN_NOSLOT;

Similar to an earlier patch, preferred style for terminal if-statements is

		if (addr == KVM_HVA_ERR_RO_BAD)
			return KVM_PFN_ERR_RO_FAULT;

		return KVM_PFN_NOSLOT;

Again, there are reasons for the style/rule.  In this case, it will yield a smaller
diff (obviously not a huge deal, but helpful), and it makes it more obvious that
the taken path of "if (kvm_is_error_hva(addr))" is itself terminal.

Alternatively, a ternary operator is often used for these types of things, though
in this case I much prefer the above, as I find the below hard to read.

		return addr == KVM_HVA_ERR_RO_BAD ? KVM_PFN_ERR_RO_FAULT :
						    KVM_PFN_NOSLOT;

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 08/16] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  2023-06-02 16:19 ` [PATCH v4 08/16] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
@ 2023-06-14 20:03   ` Sean Christopherson
  2023-07-07 18:05     ` Anish Moorthy
  2023-06-15  2:43   ` Robert Hoo
  1 sibling, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-06-14 20:03 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Jun 02, 2023, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
> kvm_handle_error_pfn().
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8961f45e3b1..cb71aae9aaec 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3291,6 +3291,10 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
>  
>  static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
> +	uint64_t rounded_gfn;

gfn_t, and probably no need to specify "rounded", let the code do the talking.

> +	uint64_t fault_size;
> +	uint64_t fault_flags;

With a few exceptions that we should kill off, KVM uses "u64", not "uint64_t".
Though arguably the "size" should be gfn_t too.

And these can all go on a single line, e.g.

	u64 fault_size, fault_flags;

Though since the kvm_run.memory_fault field and the param to the helper are "len",
a simple "len" here is better IMO.

And since this is not remotely performance sensitive, I wouldn't bother deferring
the initialization, e.g.

	gfn_t gfn = gfn_round_for_level(fault->gfn, fault->goal_level);
	gfn_t len = KVM_HPAGE_SIZE(fault->goal_level);
	u64 fault_flags;

All that said, consuming fault->goal_level is unnecessary, and not by coincidence.
The *only* time KVM should bail to userspace is if KVM failed to faultin a 4KiB
page.  Providing a hugepage is done opportunistically; it's never a hard requirement.
So throw away all of the above and see below.

> +
>  	if (is_sigpending_pfn(fault->pfn)) {
>  		kvm_handle_signal_exit(vcpu);
>  		return -EINTR;
> @@ -3309,6 +3313,15 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
>  		return RET_PF_RETRY;
>  	}
>  
> +	fault_size = KVM_HPAGE_SIZE(fault->goal_level);
> +	rounded_gfn = round_down(fault->gfn * PAGE_SIZE, fault_size);

We have a helper for this too, gfn_round_for_level().  Ooh, but this isn't storing
a gfn, it's storing a gpa.  Naughty, naughty.
	
> +
> +	fault_flags = 0;
> +	if (fault->write)
> +		fault_flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
> +	if (fault->exec)

exec and write are mutually exclusive.  There's even documented precedent for
this in page_fault_can_be_fast().

So with a READ flag, this can be

	if (fault->write)
		fault_flags = KVM_MEMORY_FAULT_FLAG_WRITE;
	else if (fault->exec)
		fault_flags = KVM_MEMORY_FAULT_FLAG_EXEC;
	else
		fault_flags = KVM_MEMORY_FAULT_FLAG_READ;

Or as Paolo would probably write it ;-)

	fault_flags = (fault->write & 1) << KVM_MEMORY_FAULT_FLAG_WRITE_SHIFT |
		      (fault->exec & 1) << KVM_MEMORY_FAULT_FLAG_EXEC_SHIFT |
		      (!fault->write && !fault->exec) << KVM_MEMORY_FAULT_FLAG_READ_SHIFT;

(that was a joke, don't actually do that)

> +		fault_flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
> +	kvm_populate_efault_info(vcpu, rounded_gfn, fault_size, fault_flags);

This is where passing a "gfn" variable as a "gpa" looks broken.

>  	return -EFAULT;

All in all, something like this?

	u64 fault_flags;

	<other error handling>

	/* comment goes here */
	WARN_ON_ONCE(fault->goal_level != PG_LEVEL_4K);

	if (fault->write)
		fault_flags = KVM_MEMORY_FAULT_FLAG_WRITE;
	else if (fault->exec)
		fault_flags = KVM_MEMORY_FAULT_FLAG_EXEC;
	else
		fault_flags = KVM_MEMORY_FAULT_FLAG_READ;

	kvm_handle_blahblahblah_fault(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
				      fault_flags);

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-06-02 16:19 ` [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation Anish Moorthy
@ 2023-06-14 20:11   ` Sean Christopherson
  2023-07-06 19:04     ` Anish Moorthy
  2023-06-14 21:20   ` Sean Christopherson
  1 sibling, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-06-14 20:11 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Jun 02, 2023, Anish Moorthy wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 05d6e7e3994d..2c276d4d0821 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1527,6 +1527,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
>  	valid_flags |= KVM_MEM_READONLY;
>  #endif
>  
> +	if (kvm_vm_ioctl_check_extension(NULL, KVM_CAP_NOWAIT_ON_FAULT))

Rather than force architectures to add the extension, probably better to use a

	config HAVE_KVM_NOWAIT_ON_FAULT
	       bool

and select that from arch Kconfigs.  That way the enumeration can be done in
common code, and then this can be computed at compile time instead of needing a
rather weird invocation of kvm_dev_ioctl() with KVM_CHECK_EXTENSION.

FWIW, you should be able to do 

	if (IS_ENABLED(CONFIG_HAVE_KVM_NOWAIT_ON_FAULT))

and avoid more #ifdefs.
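
Putting the pieces together, the arch opt-in and common-code check might look
like this (sketch only; Kconfig placement and exact names are assumptions):

	# arch/x86/kvm/Kconfig, assuming x86 opts in
	config KVM
		select HAVE_KVM_NOWAIT_ON_FAULT

	/* virt/kvm/kvm_main.c, in check_memory_region_flags() */
	if (IS_ENABLED(CONFIG_HAVE_KVM_NOWAIT_ON_FAULT))
		valid_flags |= KVM_MEM_NOWAIT_ON_FAULT;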

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 10/16] KVM: x86: Implement KVM_CAP_NOWAIT_ON_FAULT
  2023-06-02 16:19 ` [PATCH v4 10/16] KVM: x86: Implement KVM_CAP_NOWAIT_ON_FAULT Anish Moorthy
@ 2023-06-14 20:25   ` Sean Christopherson
  2023-07-07 17:41     ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-06-14 20:25 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Jun 02, 2023, Anish Moorthy wrote:
> When the memslot flag is enabled, fail guest memory accesses for which
> fast-gup fails (ie, where resolving the page fault would require putting
> the faulting thread to sleep).
> 
> Suggested-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  Documentation/virt/kvm/api.rst |  2 +-
>  arch/x86/kvm/mmu/mmu.c         | 17 ++++++++++++-----
>  arch/x86/kvm/x86.c             |  1 +
>  3 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9daadbe2c7ed..aa7b4024fd41 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -7783,7 +7783,7 @@ bugs and reported to the maintainers so that annotations can be added.
>  7.35 KVM_CAP_NOWAIT_ON_FAULT
>  ----------------------------
>  
> -:Architectures: None
> +:Architectures: x86
>  :Returns: -EINVAL.
>  
>  The presence of this capability indicates that userspace may pass the
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cb71aae9aaec..288008a64e5c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4299,7 +4299,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
>  }
>  
> -static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> +static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu,
> +			     struct kvm_page_fault *fault,
> +			     bool nowait)

More booleans!?  Just say no!  And in this case, there's no reason to pass in a
flag, just handle this entirely in __kvm_faultin_pfn().

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7b6eab6f84e8..ebf21f1f43ce 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4302,6 +4302,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
        struct kvm_memory_slot *slot = fault->slot;
+       bool nowait = kvm_is_slot_nowait_on_fault(slot);
        bool async;
 
        /*
@@ -4332,9 +4333,12 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
        }
 
        async = false;
-       fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
-                                         fault->write, &fault->map_writable,
-                                         &fault->hva);
+
+       fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn,
+                                         nowait, false,
+                                         nowait ? NULL : &async,
+                                         fault->write, &fault->map_writable, &fault->hva);
+
        if (!async)
                return RET_PF_CONTINUE; /* *pfn has correct page already */


On a related topic, I would *love* for someone to overhaul gfn_to_pfn() to replace
the "booleans for everything" approach and instead have KVM pass FOLL_* flags
internally.  Rough sketch here: https://lkml.kernel.org/r/ZGvUsf7lMkrNDHuE%40google.com
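
For flavor, with such an overhaul the call above might collapse to something
like this (purely illustrative, not the actual proposal):

	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn,
					  FOLL_GET | (fault->write ? FOLL_WRITE : 0),
					  &fault->map_writable, &fault->hva);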

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-06-02 16:19 ` [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation Anish Moorthy
  2023-06-14 20:11   ` Sean Christopherson
@ 2023-06-14 21:20   ` Sean Christopherson
  2023-06-14 21:23     ` Sean Christopherson
                       ` (2 more replies)
  1 sibling, 3 replies; 79+ messages in thread
From: Sean Christopherson @ 2023-06-14 21:20 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Jun 02, 2023, Anish Moorthy wrote:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 69a221f71914..abbc5dd72292 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2297,4 +2297,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>   */
>  inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
>  				     uint64_t gpa, uint64_t len, uint64_t flags);
> +
> +static inline bool kvm_slot_nowait_on_fault(
> +	const struct kvm_memory_slot *slot)

Just when I was starting to think that we had beat all of the Google3 out of you :-)

And predicate helpers in KVM typically have "is" or "has" in the name, so that it's
clear the helper queries, versus e.g. sets the flag or something. 

> +{
> +	return slot->flags & KVM_MEM_NOWAIT_ON_FAULT;

KVM is anything but consistent, but if the helper is likely to be called without
a known good memslot, it probably makes sense to have the helper check for NULL,
e.g.

static inline bool kvm_is_slot_nowait_on_fault(const struct kvm_memory_slot *slot)
{
	return slot && slot->flags & KVM_MEM_NOWAIT_ON_FAULT;
}

However, do we actually need to force vendor code to query nowait?  At a glance,
the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are
in flows that play nice with nowait or that don't support it at all.  So I *think*
we can do this?

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be06b09e9104..78024318286d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2403,6 +2403,11 @@ static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
        return slot->flags & KVM_MEM_READONLY;
 }
 
+static bool memslot_is_nowait_on_fault(const struct kvm_memory_slot *slot)
+{
+       return slot->flags & KVM_MEM_NOWAIT_ON_FAULT;
+}
+
 static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
                                       gfn_t *nr_pages, bool write)
 {
@@ -2730,6 +2735,11 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
                writable = NULL;
        }
 
+       if (async && memslot_is_nowait_on_fault(slot)) {
+               *async = false;
+               async = NULL;
+       }
+
        return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
                          writable);
 }

Ah, crud.  The above highlights something I missed in v3.  The memslot NOWAIT
flag isn't tied to FOLL_NOWAIT, it's really truly a "fast-only" flag.  And even
more confusingly, KVM does set FOLL_NOWAIT, but for the async #PF case, which will
get even more confusing if/when KVM uses FOLL_NOWAIT internally.

Drat.  I really like the NOWAIT name, but unfortunately it doesn't do what the
name says.

I still don't love "fast-only" as that bleeds kernel internals to userspace.
Anyone have ideas?  Maybe something about not installing new mappings?

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-06-14 21:20   ` Sean Christopherson
@ 2023-06-14 21:23     ` Sean Christopherson
  2023-08-23 21:17       ` Anish Moorthy
  2023-06-15  3:55     ` Wang, Wei W
  2023-07-07 18:13     ` Anish Moorthy
  2 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-06-14 21:23 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Jun 14, 2023, Sean Christopherson wrote:
> On Fri, Jun 02, 2023, Anish Moorthy wrote:
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 69a221f71914..abbc5dd72292 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2297,4 +2297,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> >   */
> >  inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> >  				     uint64_t gpa, uint64_t len, uint64_t flags);
> > +
> > +static inline bool kvm_slot_nowait_on_fault(
> > +	const struct kvm_memory_slot *slot)
> 
> Just when I was starting to think that we had beat all of the Google3 out of you :-)
> 
> And predicate helpers in KVM typically have "is" or "has" in the name, so that it's
> clear the helper queries, versus e.g. sets the flag or something. 
> 
> > +{
> > +	return slot->flags & KVM_MEM_NOWAIT_ON_FAULT;
> 
> KVM is anything but consistent, but if the helper is likely to be called without
> a known good memslot, it probably makes sense to have the helper check for NULL,
> e.g.
> 
> static inline bool kvm_is_slot_nowait_on_fault(const struct kvm_memory_slot *slot)
> {
> 	return slot && slot->flags & KVM_MEM_NOWAIT_ON_FAULT;
> }
> 
> However, do we actually need to force vendor code to query nowait?  At a glance,
> the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are
> in flows that play nice with nowait or that don't support it at all.  So I *think*
> we can do this?
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index be06b09e9104..78024318286d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2403,6 +2403,11 @@ static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
>         return slot->flags & KVM_MEM_READONLY;
>  }
>  
> +static bool memslot_is_nowait_on_fault(const struct kvm_memory_slot *slot)
> +{
> +       return slot->flags & KVM_MEM_NOWAIT_ON_FAULT;
> +}
> +
>  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
>                                        gfn_t *nr_pages, bool write)
>  {
> @@ -2730,6 +2735,11 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>                 writable = NULL;
>         }
>  
> +       if (async && memslot_is_nowait_on_fault(slot)) {
> +               *async = false;
> +               async = NULL;
> +       }

Gah, got turned around and forgot to account for @atomic.  So this?

	if (!atomic && memslot_is_nowait_on_fault(slot)) {
		atomic = true;
		if (async) {
			*async = false;
			async = NULL;
		}
	}
	
> +
>         return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
>                           writable);
>  }
> 
> Ah, crud.  The above highlights something I missed in v3.  The memslot NOWAIT
> flag isn't tied to FOLL_NOWAIT, it's really truly a "fast-only" flag.  And even
> more confusingly, KVM does set FOLL_NOWAIT, but for the async #PF case, which will
> get even more confusing if/when KVM uses FOLL_NOWAIT internally.
> 
> Drat.  I really like the NOWAIT name, but unfortunately it doesn't do what the
> name says.
> 
> I still don't love "fast-only" as that bleeds kernel internals to userspace.
> Anyone have ideas?  Maybe something about not installing new mappings?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 04/16] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page()
  2023-06-02 16:19 ` [PATCH v4 04/16] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page() Anish Moorthy
@ 2023-06-15  2:41   ` Robert Hoo
  2023-08-14 22:51     ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Robert Hoo @ 2023-06-15  2:41 UTC (permalink / raw)
  To: Anish Moorthy, seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

On 6/3/2023 12:19 AM, Anish Moorthy wrote:
> The order of parameters in these function signatures is a little strange,
> with "offset" actually applying to "gfn" rather than to "data". Add
> short comments to make things perfectly clear.
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>   virt/kvm/kvm_main.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 09d4d85691e1..d9c0fa7c907f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2984,6 +2984,9 @@ static int next_segment(unsigned long len, int offset)
>   		return len;
>   }
>   
> +/*
> + * Copy 'len' bytes from guest memory at '(gfn * PAGE_SIZE) + offset' to 'data'
> + */
>   static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>   				 void *data, int offset, int len)
>   {
> @@ -3085,6 +3088,9 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
>   }
>   EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>   
> +/*
> + * Copy 'len' bytes from 'data' into guest memory at '(gfn * PAGE_SIZE) + offset'
> + */
>   static int __kvm_write_guest_page(struct kvm *kvm,
>   				  struct kvm_memory_slot *memslot, gfn_t gfn,
>   			          const void *data, int offset, int len)

Agree, and how about going one step further, i.e. adjusting the parameter order?

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2c276d4d0821..db2bc5d3e2c2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2992,7 +2992,7 @@ static int next_segment(unsigned long len, int offset)
   */
  static int __kvm_read_guest_page(struct kvm_memory_slot *slot,
                                  struct kvm_vcpu *vcpu,
-                                gfn_t gfn, void *data, int offset, int len)
+                                gfn_t gfn, int offset, void *data, int len)
  {
         int r;
         unsigned long addr;
@@ -3015,7 +3015,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
  {
         struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

-       return __kvm_read_guest_page(slot, NULL, gfn, data, offset, len);
+       return __kvm_read_guest_page(slot, NULL, gfn, offset, data, len);
  }
  EXPORT_SYMBOL_GPL(kvm_read_guest_page);

@@ -3024,7 +3024,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
  {
         struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);

-       return __kvm_read_guest_page(slot, vcpu, gfn, data, offset, len);
+       return __kvm_read_guest_page(slot, vcpu, gfn, offset, data, len);
  }
  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);

@@ -3103,7 +3103,7 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
   */
  static int __kvm_write_guest_page(struct kvm *kvm, struct kvm_vcpu *vcpu,
                                   struct kvm_memory_slot *memslot, gfn_t gfn,
-                                 const void *data, int offset, int len)
+                                 int offset, const void *data, int len)
  {
         int r;
         unsigned long addr;
@@ -3128,7 +3128,7 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
  {
         struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

-       return __kvm_write_guest_page(kvm, NULL, slot, gfn, data, offset, len);
+       return __kvm_write_guest_page(kvm, NULL, slot, gfn, offset, data, len);
  }
  EXPORT_SYMBOL_GPL(kvm_write_guest_page);

@@ -3136,8 +3136,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
                               const void *data, int offset, int len)
  {
         struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-       return __kvm_write_guest_page(vcpu->kvm, vcpu, slot, gfn, data,
-                                     offset, len);
+       return __kvm_write_guest_page(vcpu->kvm, vcpu, slot, gfn, offset,
+                                     data, len);
  }
  EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 08/16] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  2023-06-02 16:19 ` [PATCH v4 08/16] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
  2023-06-14 20:03   ` Sean Christopherson
@ 2023-06-15  2:43   ` Robert Hoo
  2023-06-15 14:40     ` Sean Christopherson
  1 sibling, 1 reply; 79+ messages in thread
From: Robert Hoo @ 2023-06-15  2:43 UTC (permalink / raw)
  To: Anish Moorthy, seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

On 6/3/2023 12:19 AM, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
> kvm_handle_error_pfn().
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8961f45e3b1..cb71aae9aaec 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3291,6 +3291,10 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
>   
>   static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   {
> +	uint64_t rounded_gfn;
> +	uint64_t fault_size;
> +	uint64_t fault_flags;
> +
>   	if (is_sigpending_pfn(fault->pfn)) {
>   		kvm_handle_signal_exit(vcpu);
>   		return -EINTR;
> @@ -3309,6 +3313,15 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
>   		return RET_PF_RETRY;
>   	}
>   
> +	fault_size = KVM_HPAGE_SIZE(fault->goal_level);

IIUC, here fault->goal_level is always PG_LEVEL_4K.
goal_level could be adjusted later, in kvm_tdp_mmu_map() -->
kvm_mmu_hugepage_adjust(), but only if kvm_faultin_pfn() doesn't fail; that is
to say, that code path doesn't go through here.

I wonder if you would like to put (a kind of) kvm_mmu_hugepage_adjust() here as
well, reporting to user space the maximum map size it could use, OR just
report the 4K size and let user space detect/decide the maximum possible size
itself (though right now I have no idea how it would do that).

> +	rounded_gfn = round_down(fault->gfn * PAGE_SIZE, fault_size);
> +
> +	fault_flags = 0;
> +	if (fault->write)
> +		fault_flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
> +	if (fault->exec)
> +		fault_flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
> +	kvm_populate_efault_info(vcpu, rounded_gfn, fault_size, fault_flags);
>   	return -EFAULT;
>   }
>   


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-06-14 21:20   ` Sean Christopherson
  2023-06-14 21:23     ` Sean Christopherson
@ 2023-06-15  3:55     ` Wang, Wei W
  2023-06-15 14:56       ` Sean Christopherson
  2023-07-07 18:13     ` Anish Moorthy
  2 siblings, 1 reply; 79+ messages in thread
From: Wang, Wei W @ 2023-06-15  3:55 UTC (permalink / raw)
  To: Christopherson,, Sean, Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Thursday, June 15, 2023 5:21 AM, Sean Christopherson wrote:
> On Fri, Jun 02, 2023, Anish Moorthy wrote:
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 69a221f71914..abbc5dd72292 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2297,4 +2297,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> >   */
> >  inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> >  				     uint64_t gpa, uint64_t len, uint64_t flags);
> > +
> > +static inline bool kvm_slot_nowait_on_fault(
> > +	const struct kvm_memory_slot *slot)
> 
> Just when I was starting to think that we had beat all of the Google3 out of
> you :-)
> 
> And predicate helpers in KVM typically have "is" or "has" in the name, so that it's
> clear the helper queries, versus e.g. sets the flag or something.
> 
> > +{
> > +	return slot->flags & KVM_MEM_NOWAIT_ON_FAULT;
> 
> KVM is anything but consistent, but if the helper is likely to be called without a
> known good memslot, it probably makes sense to have the helper check for
> NULL, e.g.
> 
> static inline bool kvm_is_slot_nowait_on_fault(const struct kvm_memory_slot *slot)
> {
> 	return slot && slot->flags & KVM_MEM_NOWAIT_ON_FAULT;
> }
> 
> However, do we actually need to force vendor code to query nowait?  At a
> glance, the only external (relative to kvm_main.c) users of
> __gfn_to_pfn_memslot() are in flows that play nice with nowait or that don't
> support it at all.  So I *think* we can do this?
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index be06b09e9104..78024318286d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2403,6 +2403,11 @@ static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
>         return slot->flags & KVM_MEM_READONLY;
>  }
> 
> +static bool memslot_is_nowait_on_fault(const struct kvm_memory_slot *slot)
> +{
> +       return slot->flags & KVM_MEM_NOWAIT_ON_FAULT;
> +}
> +
>  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
>                                        gfn_t *nr_pages, bool write)
>  {
> @@ -2730,6 +2735,11 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>                 writable = NULL;
>         }
> 
> +       if (async && memslot_is_nowait_on_fault(slot)) {
> +               *async = false;
> +               async = NULL;
> +       }
> +
>         return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
>                           writable);
>  }
> 
> Ah, crud.  The above highlights something I missed in v3.  The memslot NOWAIT
> flag isn't tied to FOLL_NOWAIT, it's really truly a "fast-only" flag.  And even
> more confusingly, KVM does set FOLL_NOWAIT, but for the async #PF case,
> which will get even more confusing if/when KVM uses FOLL_NOWAIT internally.
> 
> Drat.  I really like the NOWAIT name, but unfortunately it doesn't do what the
> name says.
> 
> I still don't love "fast-only" as that bleeds kernel internals to userspace.
> Anyone have ideas?  Maybe something about not installing new mappings?

Yes, "NOWAIT" sounds a bit confusing here. If this is a patch applied to userfaultfd
to solve the "wait" issue on queuing/handling faults, then it would make sense.
But this is a KVM specific solution, which is not directly related to userfaultfd, and
it's not related to FOLL_NOWAIT. There seems to be nothing to wait on in the
KVM context here.

Why not just name the cap after what it does (i.e. something to indicate that
the fault is exited to userspace to be handled), e.g. KVM_CAP_EXIT_ON_FAULT
or KVM_CAP_USERSPACE_FAULT?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 08/16] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  2023-06-15  2:43   ` Robert Hoo
@ 2023-06-15 14:40     ` Sean Christopherson
  0 siblings, 0 replies; 79+ messages in thread
From: Sean Christopherson @ 2023-06-15 14:40 UTC (permalink / raw)
  To: Robert Hoo
  Cc: Anish Moorthy, oliver.upton, kvm, kvmarm, pbonzini, maz,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Thu, Jun 15, 2023, Robert Hoo wrote:
> On 6/3/2023 12:19 AM, Anish Moorthy wrote:
> > Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
> > kvm_handle_error_pfn().
> > 
> > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > ---
> >   arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
> >   1 file changed, 13 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c8961f45e3b1..cb71aae9aaec 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3291,6 +3291,10 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
> >   static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >   {
> > +	uint64_t rounded_gfn;
> > +	uint64_t fault_size;
> > +	uint64_t fault_flags;
> > +
> >   	if (is_sigpending_pfn(fault->pfn)) {
> >   		kvm_handle_signal_exit(vcpu);
> >   		return -EINTR;
> > @@ -3309,6 +3313,15 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
> >   		return RET_PF_RETRY;
> >   	}
> > +	fault_size = KVM_HPAGE_SIZE(fault->goal_level);
> 
> IIUC, here fault->goal_level is always PG_LEVEL_4K.
> goal_level could be adjusted in later kvm_tdp_mmu_map() -->
> kvm_mmu_hugepage_adjust(), if kvm_faultin_pfn() doesn't fail, that is to
> say, code path doesn't go through here.
> 
> I wonder, if you would like put (kind of) kvm_mmu_hugepage_adjust() here as
> well, reporting to user space the maximum map size it could do with, OR,
> just report 4K size, let user space itself to detect/decide max possible
> size (but now I've no idea how to).

No, that's nonsensical because KVM uses the host mapping to compute the max
mapping level.  If there's no valid mapping, then there's no defined level.  And
as I said in my reply, KVM should never kick out to userspace if KVM can establish
a 4KiB mapping, i.e. 4KiB is always the effective scope, and reporting anything
else would just be wild speculation.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-06-15  3:55     ` Wang, Wei W
@ 2023-06-15 14:56       ` Sean Christopherson
  2023-06-16 12:08         ` Wang, Wei W
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-06-15 14:56 UTC (permalink / raw)
  To: Wei W Wang
  Cc: Anish Moorthy, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

On Thu, Jun 15, 2023, Wei W Wang wrote:
> On Thursday, June 15, 2023 5:21 AM, Sean Christopherson wrote:
> > Ah, crud.  The above highlights something I missed in v3.  The memslot NOWAIT
> > flag isn't tied to FOLL_NOWAIT, it's really truly a "fast-only" flag.  And even
> > more confusingly, KVM does set FOLL_NOWAIT, but for the async #PF case,
> > which will get even more confusing if/when KVM uses FOLL_NOWAIT internally.
> > 
> > Drat.  I really like the NOWAIT name, but unfortunately it doesn't do what as the
> > name says.
> > 
> > I still don't love "fast-only" as that bleeds kernel internals to userspace.
> > Anyone have ideas?  Maybe something about not installing new mappings?
> 
> Yes, "NOWAIT" sounds a bit confusing here. If this is a patch applied to userfaultfd
> to solve the "wait" issue on queuing/handling faults, then it would make sense.
> But this is a KVM specific solution, which is not directly related to userfaultfd, and
> it's not related to FOLL_NOWAIT. There seems nothing to wait in the KVM context
> here.
> 
> Why not just name the cap as what it does (i.e. something to indicate the cap of
> having the fault exited to userspace to handle), e.g. KVM_CAP_EXIT_ON_FAULT
> or KVM_CAP_USERSPACE_FAULT.

Because that's even further away from the truth when accounting for the fact that
the flag controls behavior when handling *guest* faults.  The memslot flag
doesn't cause KVM to exit on every guest fault.  And USERSPACE_FAULT is far too
vague; KVM constantly faults in userspace mappings, the flag needs to communicate
that KVM *won't* do that for guest accesses.

Something like KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS?  Ridiculously verbose, but
I think it captures the KVM behavior, and "guest access" instead of "guest fault"
gives KVM some wiggle room, e.g. the name won't become stale if we figure out a
way to apply the behavior to KVM emulation of guest accesses in the future.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-06-15 14:56       ` Sean Christopherson
@ 2023-06-16 12:08         ` Wang, Wei W
  0 siblings, 0 replies; 79+ messages in thread
From: Wang, Wei W @ 2023-06-16 12:08 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: Anish Moorthy, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

On Thursday, June 15, 2023 10:56 PM, Sean Christopherson wrote:
> Because that's even further away from the truth when accounting for the fact
> that the flag controls behavior when handling *guest* faults.  The

What do you mean by guest faults here?
I think, more precisely, it's a host page fault triggered by a guest access
(through host GUP), isn't it? When the flag is set, we want this fault to be
handled by userspace?

> memslot flag doesn't cause KVM to exit on every guest fault.  And
> USERSPACE_FAULT is far too vague; KVM constantly faults in userspace
> mappings, the flag needs to communicate that KVM *won't* do that for guest
> accesses.

I actually meant USERSPACE_FAULT_HANDLING.

> 
> Something like KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS?  Ridiculously

Yeah, it's kind of verbose. Was your intention for "NO_USERFAULT" to mean
bypassing the userfaultfd mechanism?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 16/16] KVM: selftests: Handle memory fault exits in demand_paging_test
  2023-06-02 16:19 ` [PATCH v4 16/16] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
@ 2023-06-20  2:44   ` Robert Hoo
  0 siblings, 0 replies; 79+ messages in thread
From: Robert Hoo @ 2023-06-20  2:44 UTC (permalink / raw)
  To: Anish Moorthy, seanjc, oliver.upton, kvm, kvmarm
  Cc: pbonzini, maz, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

On 6/3/2023 12:19 AM, Anish Moorthy wrote:
[...]
>   static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
>   {
>   	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
>   	int vcpu_idx = vcpu_args->vcpu_idx;
>   	struct kvm_run *run = vcpu->run;
> -	struct timespec start;
> -	struct timespec ts_diff;
> +	struct timespec last_start;
> +	struct timespec total_runtime = {};
>   	int ret;
>   
> -	clock_gettime(CLOCK_MONOTONIC, &start);
>   
> -	/* Let the guest access its memory */
> -	ret = _vcpu_run(vcpu);
> -	TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> -	if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
> -		TEST_ASSERT(false,
> -			    "Invalid guest sync status: exit_reason=%s\n",
> -			    exit_reason_str(run->exit_reason));
> -	}
> +	while (true) {
> +		clock_gettime(CLOCK_MONOTONIC, &last_start);
> +		/* Let the guest access its memory */
> +		ret = _vcpu_run(vcpu);
> +		TEST_ASSERT(ret == 0
> +			    || (errno == EFAULT
> +				&& run->exit_reason == KVM_EXIT_MEMORY_FAULT),
> +			    "vcpu_run failed: %d\n", ret);
>   
> -	ts_diff = timespec_elapsed(start);
> +		total_runtime = timespec_add(total_runtime,
> +					     timespec_elapsed(last_start));
> +		if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
> +
> +			if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
> +				ready_page(run->memory_fault.gpa);
> +				continue;
> +			}
> +
> +			TEST_ASSERT(false,
> +				    "Invalid guest sync status: exit_reason=%s\n",
> +				    exit_reason_str(run->exit_reason));
> +		}
> +		break;
> +	}
>   	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
> -		       ts_diff.tv_sec, ts_diff.tv_nsec);
> +			total_runtime.tv_sec, total_runtime.tv_nsec);
>   }

We could also include the number of page faults handled by the vCPU worker in
the PER_VCPU_DEBUG output, e.g.:

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 4b79c88cb22d..8841150b0e2b 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -91,7 +91,7 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
         struct timespec last_start;
         struct timespec total_runtime = {};
         int ret;
-
+       int pages = 0;

         while (true) {
                 clock_gettime(CLOCK_MONOTONIC, &last_start);
@@ -108,6 +108,7 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)

                         if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                                 ready_page(run->memory_fault.gpa);
+                               pages++;
                                 continue;
                         }

@@ -117,8 +118,8 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
                 }
                 break;
         }
-       PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
-                       total_runtime.tv_sec, total_runtime.tv_nsec);
+       PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds, %d page faults handled\n",
+                       vcpu_idx, total_runtime.tv_sec, total_runtime.tv_nsec, pages);
  }



^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-14 17:35   ` Sean Christopherson
@ 2023-06-20 21:13     ` Anish Moorthy
  2023-07-07 11:50     ` Kautuk Consul
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-06-20 21:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

Thanks for all the review Sean: I'm busy with other work at the
moment, so I can't address this all atm. But I should have a chance to
take the feedback and send up a new version before too long :)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-02 16:19 ` [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
                     ` (2 preceding siblings ...)
  2023-06-14 17:35   ` Sean Christopherson
@ 2023-07-05  8:21   ` Kautuk Consul
  3 siblings, 0 replies; 79+ messages in thread
From: Kautuk Consul @ 2023-07-05  8:21 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: seanjc, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

Hi,

> +
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> +				     uint64_t gpa, uint64_t len, uint64_t flags)
> +{
> +	if (WARN_ON_ONCE(!vcpu))
> +		return;
> +
> +	preempt_disable();
> +	/*
> +	 * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> +	 * would open the door for races between concurrent calls to this
> +	 * function.
> +	 */
> +	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> +		goto out;
Sorry, I also wrote to the v3 discussion for this patch. Re-iterating what I
said there:

Why use WARN_ON_ONCE when there is a clear possibility of preemption kicking
in (with the possibility of vcpu_load()/vcpu_put() being called in the new
task) before preempt_disable() is called in this function? I think you
should use WARN_ON_ONCE only where some impossible or unhandled situation is
happening, not where the situation can clearly happen per the kernel code.
This WARN_ON_ONCE could make sense if kvm_populate_efault_info() were called
from atomic context, but not when the function disables preemption itself.
Basically, I don't think there is any way to guarantee that preemption
DOESN'T kick in before the preempt_disable(), so this if-check isn't
something that deserves a kernel WARN_ON_ONCE() warning.

Can we get rid of this WARN_ON_ONCE and jump straight to the out label when
"(vcpu != __this_cpu_read(kvm_running_vcpu))" is true? Or please correct me
if I am wrong about something.
> +	/*
> +	 * Try not to overwrite an already-populated run struct.
> +	 * This isn't a perfect solution, as there's no guarantee that the exit
> +	 * reason is set before the run struct is populated, but it should prevent
> +	 * at least some bugs.
> +	 */
> +	else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
> +		goto out;
> +
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.len = len;
> +	vcpu->run->memory_fault.flags = flags;
> +
> +out:
> +	preempt_enable();
> +}
> -- 
> 2.41.0.rc0.172.g3f132b7071-goog
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-06-14 20:11   ` Sean Christopherson
@ 2023-07-06 19:04     ` Anish Moorthy
  0 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-07-06 19:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

> Rather than force architectures to add the extension, probably better to use a
>
>         config HAVE_KVM_NOWAIT_ON_FAULT
>                bool
>
> and select that from arch Kconfigs.  That way the enumeration can be done in
> common code, and then this can be computed at compile time doesn't need to do a
> rather weird invocation of kvm_dev_ioctl() with KVM_CHECK_EXTENSION.

Done. If I'm reading this correctly you also want me to move the logic
for checking the cap out of the arch-specific
kvm_vm_ioctl_check_extension() and into
kvm_vm_ioctl_check_extension_generic(), which I've done too.
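
(A sketch of that generic check, building on the Kconfig snippet quoted
above; the cap and config names follow this series, the rest is
illustrative:)

/* virt/kvm/kvm_main.c -- sketch only */
static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
{
	switch (arg) {
	case KVM_CAP_NOWAIT_ON_FAULT:
		/* Answered at compile time; no arch callback needed. */
		return IS_ENABLED(CONFIG_HAVE_KVM_NOWAIT_ON_FAULT);
	default:
		/* Fall back to the arch-specific handler. */
		return kvm_vm_ioctl_check_extension(kvm, arg);
	}
}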

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 05/16] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
  2023-06-14 19:10   ` Sean Christopherson
@ 2023-07-06 22:51     ` Anish Moorthy
  2023-07-12 14:08       ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-07-06 22:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Jun 14, 2023 at 12:10 PM Sean Christopherson <seanjc@google.com> wrote:
>
> For future reference, the 80 char limit is a soft limit, and with a lot of
> subjectivity, can be breached.  In this case, this...

Oh neat, I thought it looked pretty ugly too: done.

> Newline after variable declarations.  Double demerits for breaking what was
> originally correct :-)

:(

>
> As mentioned in a previous mail, put this in the (one) caller.  If more callers
> come along, which is highly unlikely, we can revisit that decision.  Right now,
> it just adds noise, both here and in the helper.
>
> ...
>
> With my various suggestions, something like
>
>         struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>         int r;
>
>         r = __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
>         if (r)
>                 kvm_handle_guest_uaccess_fault(...);
>         return r;
>

So, the reason I had the logic within the helper was that it also
returns -EFAULT on gfn_to_hva_memslot() failures.

> static int __kvm_write_guest_page(struct kvm *kvm,
>     struct kvm_memory_slot *memslot, gfn_t gfn,
>     const void *data, int offset, int len)
> {
>     int r;
>     unsigned long addr;
>
>     addr = gfn_to_hva_memslot(memslot, gfn);
>     if (kvm_is_error_hva(addr))
>         return -EFAULT;
> ...

Is it OK to catch these with an annotated EFAULT? Strictly speaking,
they don't seem to arise from a failed user access (perhaps my
definition is wrong: I'm looking at uaccess.h), so I'm not sure that
annotating them is valid.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-14 17:35   ` Sean Christopherson
  2023-06-20 21:13     ` Anish Moorthy
@ 2023-07-07 11:50     ` Kautuk Consul
  2023-07-10 15:00       ` Anish Moorthy
  2023-08-11 22:12     ` Anish Moorthy
  2023-08-17 22:55     ` Anish Moorthy
  3 siblings, 1 reply; 79+ messages in thread
From: Kautuk Consul @ 2023-07-07 11:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Anish Moorthy, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

Hi,
> 
> > +
> > +	preempt_disable();
> > +	/*
> > +	 * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> > +	 * would open the door for races between concurrent calls to this
> > +	 * function.
> > +	 */
> > +	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > +		goto out;
> 
> Meh, this is overkill IMO.  The check in mark_page_dirty_in_slot() is an
> abomination that I wish didn't exist, not a pattern that should be copied.  If
> we do keep this sanity check, it can simply be
> 
> 	if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
> 		return;
> 
> because as the comment for kvm_get_running_vcpu() explains, the returned vCPU
> pointer won't change even if this task gets migrated to a different pCPU.  If
> this code were doing something with vcpu->cpu then preemption would need to be
> disabled throughout, but that's not the case.
> 
I think that this check is needed, but without the WARN_ON_ONCE, as per my
other comment. The reason is that we really need to insulate the code
against preemption kicking in before the call to preempt_disable(), as the
logic seems to need this check to avoid concurrency problems.
(I don't think Anish simply copied this if-check from mark_page_dirty_in_slot().)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 07/16] KVM: Simplify error handling in __gfn_to_pfn_memslot()
  2023-06-14 19:26   ` Sean Christopherson
@ 2023-07-07 17:33     ` Anish Moorthy
  2023-07-10 17:40       ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-07-07 17:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

Done

(somebody please let me know if these short "ack"/"done" messages are
frowned upon btw. Nobody's complained about it so far, but I'm not
sure if people consider it spam)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 06/16] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page()
  2023-06-14 19:22   ` Sean Christopherson
@ 2023-07-07 17:35     ` Anish Moorthy
  0 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-07-07 17:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

Done, and same question/comment as the "write" patch (though I'm sure
we'll just keep all the discussion there henceforth)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 10/16] KVM: x86: Implement KVM_CAP_NOWAIT_ON_FAULT
  2023-06-14 20:25   ` Sean Christopherson
@ 2023-07-07 17:41     ` Anish Moorthy
  0 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-07-07 17:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Jun 14, 2023 at 1:25 PM Sean Christopherson <seanjc@google.com> wrote:
> > -static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > +static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu,
> > +                          struct kvm_page_fault *fault,
> > +                          bool nowait)
>
> More booleans!?  Just say no!  And in this case, there's no reason to pass in a
> flag, just handle this entirely in __kvm_faultin_pfn().

Ah, thanks: that extra parameter is a holdover from forever ago where
"nowait" was a special thing that was read by handle_error_pfn(). Done.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 08/16] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  2023-06-14 20:03   ` Sean Christopherson
@ 2023-07-07 18:05     ` Anish Moorthy
  0 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-07-07 18:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Jun 14, 2023 at 1:03 PM Sean Christopherson <seanjc@google.com> wrote:
>
> We have a helper for this too, gfn_round_for_level().  Ooh, but this isn't storing
> a gfn, it's storing a gpa.  Naughty, naughty.

Caught in mischief I didn't even realize I was committing :O Anyways,
I've taken all the feedback you provided here. Although,

> All that said, consuming fault->goal_level is unnecessary, and not be coincidence.
> The *only* time KVM should bail to userspace is if KVM failed to faultin a 4KiB
> page.  Providing a hugepage is done opportunistically, it's never a hard requirement.
> So throw away all of the above and see below.

I wonder if my arm64 patch commits the same error. I'll send an email
over there asking Oliver about it.
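
(For readers following the thread: the "mischief" is rounding a gpa as if it
were a gfn. With Sean's suggestions applied, the annotation ends up at 4KiB
granularity, roughly the shape of the mmu.c call quoted later in this
thread; the access flags are elided here:)

	/* Sketch: hugepages are opportunistic, so annotate a 4KiB fault. */
	kvm_handle_guest_uaccess_fault(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
				       /* access flags elided */ 0);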

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-06-14 21:20   ` Sean Christopherson
  2023-06-14 21:23     ` Sean Christopherson
  2023-06-15  3:55     ` Wang, Wei W
@ 2023-07-07 18:13     ` Anish Moorthy
  2023-07-07 20:07       ` Anish Moorthy
  2 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-07-07 18:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Jun 14, 2023 at 2:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> > +static inline bool kvm_slot_nowait_on_fault(
> > +     const struct kvm_memory_slot *slot)
>
> Just when I was starting to think that we had beat all of the Google3 out of you :-)

I was trying to stay under the line limit here :( But you've commented
on that elsewhere. Fixed (hopefully :)

> And predicate helpers in KVM typically have "is" or "has" in the name, so that it's
> clear the helper queries, versus e.g. sets the flag or something.

Done

> KVM is anything but consistent, but if the helper is likely to be called without
> a known good memslot, it probably makes sense to have the helper check for NULL,
> e.g.

Done: I was doing the null checks in other ways in the arch-specific
code, but yeah it's easier to centralize that here.

> However, do we actually need to force vendor code to query nowait?  At a glance,
> the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are
> in flows that play nice with nowait or that don't support it at all.  So I *think*
> we can do this?
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index be06b09e9104..78024318286d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2403,6 +2403,11 @@ static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
>         return slot->flags & KVM_MEM_READONLY;
>  }
>
> +static bool memslot_is_nowait_on_fault(const struct kvm_memory_slot *slot)
> +{
> +       return slot->flags & KVM_MEM_NOWAIT_ON_FAULT;
> +}
> +
>  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
>                                        gfn_t *nr_pages, bool write)
>  {
> @@ -2730,6 +2735,11 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>                 writable = NULL;
>         }
>
> +       if (async && memslot_is_nowait_on_fault(slot)) {
> +               *async = false;
> +               async = NULL;
> +       }
> +
>         return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
>                           writable);
>  }

Hmm, well not having to modify the vendor code would be nice... but
I'll have to look more at __gfn_to_pfn_memslot()'s callers (and
probably send more questions your way :). Hopefully it works out more
like what you suggest.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-07-07 18:13     ` Anish Moorthy
@ 2023-07-07 20:07       ` Anish Moorthy
  2023-07-11 15:29         ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-07-07 20:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

> > However, do we actually need to force vendor code to query nowait?  At a glance,
> > the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are
> > in flows that play nice with nowait or that don't support it at all.  So I *think*
> > we can do this?
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index be06b09e9104..78024318286d 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2403,6 +2403,11 @@ static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> >         return slot->flags & KVM_MEM_READONLY;
> >  }
> >
> > +static bool memslot_is_nowait_on_fault(const struct kvm_memory_slot *slot)
> > +{
> > +       return slot->flags & KVM_MEM_NOWAIT_ON_FAULT;
> > +}
> > +
> >  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
> >                                        gfn_t *nr_pages, bool write)
> >  {
> > @@ -2730,6 +2735,11 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> >                 writable = NULL;
> >         }
> >
> > +       if (async && memslot_is_nowait_on_fault(slot)) {
> > +               *async = false;
> > +               async = NULL;
> > +       }
> > +
> >         return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> >                           writable);
> >  }
>
> Hmm, well not having to modify the vendor code would be nice... but
> I'll have to look more at __gfn_to_pfn_memslot()'s callers (and
> probably send more questions your way :). Hopefully it works out more
> like what you suggest.

I took a look of my own, and I don't think moving the nowait query
into __gfn_to_pfn_memslot() would work. At issue is the actual
behavior of KVM_CAP_NOWAIT_ON_FAULT, which I documented as follows:

> The presence of this capability indicates that userspace may pass the
> KVM_MEM_NOWAIT_ON_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> to fail (-EFAULT) in response to page faults for which resolution would require
> the faulting thread to sleep.

 Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort()
and into __gfn_to_pfn_memslot() means that, obviously, other callers
will start to see behavior changes. Some of that is probably actually
necessary for that documentation to be accurate (since any usages of
__gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag),
but I think there are consumers of __gfn_to_pfn_memslot() from outside
KVM_RUN.

Anyways, after some searching on my end: I think the only caller of
__gfn_to_pfn_memslot() in core kvm/x86/arm64 where moving the "nowait"
check into the function actually changes anything is gfn_to_pfn(). But
that function gets called from vmx_vcpu_create() (through
kvm_alloc_apic_access_page()), and *that* certainly doesn't look like
something KVM_RUN does or would ever call.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-07-07 11:50     ` Kautuk Consul
@ 2023-07-10 15:00       ` Anish Moorthy
  2023-07-11  3:54         ` Kautuk Consul
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-07-10 15:00 UTC (permalink / raw)
  To: Kautuk Consul
  Cc: Sean Christopherson, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

> > > +   preempt_disable();
> > > +   /*
> > > +    * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> > > +    * would open the door for races between concurrent calls to this
> > > +    * function.
> > > +    */
> > > +   if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > > +           goto out;
> >
> > Meh, this is overkill IMO.  The check in mark_page_dirty_in_slot() is an
> > abomination that I wish didn't exist, not a pattern that should be copied.  If
> > we do keep this sanity check, it can simply be
> >
> >       if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
> >               return;
> >
> > because as the comment for kvm_get_running_vcpu() explains, the returned vCPU
> > pointer won't change even if this task gets migrated to a different pCPU.  If
> > this code were doing something with vcpu->cpu then preemption would need to be
> > disabled throughout, but that's not the case.
> >
> I think that this check is needed but without the WARN_ON_ONCE as per my
> other comment.
> Reason is that we really need to insulate the code against preemption
> kicking in before the call to preempt_disable() as the logic seems to
> need this check to avoid concurrency problems.
> (I don't think Anish simply copied this if-check from mark_page_dirty_in_slot)

Heh, you're giving me too much credit here. I did copy-paste this
check, not from mark_page_dirty_in_slot() but from one of Sean's
old responses [1]

> That said, I agree that there's a risk that KVM could clobber vcpu->run by
> hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> without the target vCPU being loaded:
>
>     int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
>     {
>         preempt_disable();
>         if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
>             goto out;
>
>         vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
>         ...
>     out:
>         preempt_enable();
>         return -EFAULT;
>     }

Ancient history aside, let's figure out what's really needed here.

> Why use WARN_ON_ONCE when there is a clear possibility of preemption
> kicking in (with the possibility of vcpu_load/vcpu_put being called
> in the new task) before preempt_disable() is called in this function?
> I think you should use WARN_ON_ONCE only where there is some impossible
> or unhandled situation happening, not when there is a possibility of that
> situation clearly happening as per the kernel code.

I did some mucking around to try and understand the kvm_running_vcpu
variable, and I don't think preemption/rescheduling actually trips the
WARN here? From my (limited) understanding, it seems that the
thread being preempted will cause a vcpu_put() via kvm_sched_out().
But when the thread is eventually scheduled back in onto whatever
core, it'll vcpu_load() via kvm_sched_in(), and the docstring for
kvm_get_running_vcpu() seems to imply the thing that vcpu_load()
stores into the per-cpu "kvm_running_vcpu" variable will be the same
thing which would have been observed before preemption.

All that's to say: I wouldn't expect the value of
"__this_cpu_read(kvm_running_vcpu)" to change in any given thread. If
that's true, then the things I would expect this WARN to catch are (a)
bugs where somehow the thread gets scheduled without calling
vcpu_load() or (b) bizarre situations (probably all bugs?) where some
vcpu thread has a hold of some _other_ kvm_vcpu* and is trying to do
something with it.
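
(A sketch of the mechanism described above, i.e. the preempt notifier hooks
that keep the per-CPU pointer in sync; this is simplified relative to the
real vcpu_load()/vcpu_put() paths:)

/* When the vCPU task is scheduled back in, the pointer is re-established
 * on the new pCPU, so kvm_get_running_vcpu() returns the same vCPU the
 * task saw before being preempted.
 */
static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	__this_cpu_write(kvm_running_vcpu, vcpu);	/* via vcpu_load() */
}

static void kvm_sched_out(struct preempt_notifier *pn,
			  struct task_struct *next)
{
	__this_cpu_write(kvm_running_vcpu, NULL);	/* via vcpu_put() */
}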


On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Meh, this is overkill IMO.  The check in mark_page_dirty_in_slot() is an
> abomination that I wish didn't exist, not a pattern that should be copied.  If
> we do keep this sanity check, it can simply be
>
>         if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
>                 return;
>
> because as the comment for kvm_get_running_vcpu() explains, the returned vCPU
> pointer won't change even if this task gets migrated to a different pCPU.  If
> this code were doing something with vcpu->cpu then preemption would need to be
> disabled throughout, but that's not the case.

Oh, looks like Sean said the same thing. Guess I probably should have
read that reply more closely first :/


[1] https://lore.kernel.org/kvm/ZBnLaidtZEM20jMp@google.com/

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 07/16] KVM: Simplify error handling in __gfn_to_pfn_memslot()
  2023-07-07 17:33     ` Anish Moorthy
@ 2023-07-10 17:40       ` Sean Christopherson
  0 siblings, 0 replies; 79+ messages in thread
From: Sean Christopherson @ 2023-07-10 17:40 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Jul 07, 2023, Anish Moorthy wrote:
> Done
> 
> (somebody please let me know if these short "ack"/"done" messages are
> frowned upon btw. Nobody's complained about it so far, but I'm not
> sure if people consider it spam)

I personally think that ack/done messages that don't add anything else to the
conversation are useless.   The bar for "anything else" can be very low, e.g. a
simple "gotcha" can be valuable if it wraps up a conversation, but "accepting"
every piece of feedback is a waste of everyone's time IMO as the expectation is
that all review feedback will be addressed, either by a follow-up conversation or
by modifying the patch in the next version, i.e. by *not* pushing back you are
implicitly accepting feedback.

And an "ack/done" isn't binding, i.e. doesn't magically morph into code and guarantee
that the next version of the patch will actually contain the requested changes.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-07-10 15:00       ` Anish Moorthy
@ 2023-07-11  3:54         ` Kautuk Consul
  2023-07-11 14:25           ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Kautuk Consul @ 2023-07-11  3:54 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Sean Christopherson, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

> > >
> > I think that this check is needed but without the WARN_ON_ONCE as per my
> > other comment.
> > Reason is that we really need to insulate the code against preemption
> > kicking in before the call to preempt_disable() as the logic seems to
> > need this check to avoid concurrency problems.
> > (I don't think Anish simply copied this if-check from mark_page_dirty_in_slot)
> 
> Heh, you're giving me too much credit here. I did copy-paste this
> check, not from from mark_page_dirty_in_slot() but from one of Sean's
> old responses [1]
Oh, I see.
> 
> > That said, I agree that there's a risk that KVM could clobber vcpu->run_run by
> > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> > the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> > without the target vCPU being loaded:
> >
> >     int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> >     {
> >         preempt_disable();
> >         if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> >             goto out;
> >
> >         vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> >         ...
> >     out:
> >         preempt_enable();
> >         return -EFAULT;
> >     }
> 
> Ancient history aside, let's figure out what's really needed here.
> 
> > Why use WARN_ON_ONCE when there is a clear possibility of preemption
> > kicking in (with the possibility of vcpu_load/vcpu_put being called
> > in the new task) before preempt_disable() is called in this function?
> > I think you should use WARN_ON_ONCE only where there is some impossible
> > or unhandled situation happening, not when there is a possibility of that
> > situation clearly happening as per the kernel code.
> 
> I did some mucking around to try and understand the kvm_running_vcpu
> variable, and I don't think preemption/rescheduling actually trips the
> WARN here? From my (limited) understanding, it seems that the
> thread being preempted will cause a vcpu_put() via kvm_sched_out().
> But when the thread is eventually scheduled back in onto whatever
> core, it'll vcpu_load() via kvm_sched_in(), and the docstring for
> kvm_get_running_vcpu() seems to imply the thing that vcpu_load()
> stores into the per-cpu "kvm_running_vcpu" variable will be the same
> thing which would have been observed before preemption.
> 
> All that's to say: I wouldn't expect the value of
> "__this_cpu_read(kvm_running_vcpu)" to change in any given thread. If
> that's true, then the things I would expect this WARN to catch are (a)
> bugs where somehow the thread gets scheduled without calling
> vcpu_load() or (b) bizarre situations (probably all bugs?) where some
> vcpu thread has a hold of some _other_ kvm_vcpu* and is trying to do
> something with it.
Oh I completely missed the scheduling path for KVM.
But since vcpu_put and vcpu_load are exported symbols, I wonder what'll
happen when there are calls to these functions from places other
than kvm_sched_in() and kvm_sched_out()? Just thinking out loud.
> 
> 
> On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Meh, this is overkill IMO.  The check in mark_page_dirty_in_slot() is an
> > abomination that I wish didn't exist, not a pattern that should be copied.  If
> > we do keep this sanity check, it can simply be
> >
> >         if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
> >                 return;
> >
> > because as the comment for kvm_get_running_vcpu() explains, the returned vCPU
> > pointer won't change even if this task gets migrated to a different pCPU.  If
> > this code were doing something with vcpu->cpu then preemption would need to be
> > disabled throughout, but that's not the case.
> 
> Oh, looks like Sean said the same thing. Guess I probably should have
> read that reply more closely first :/
I too rarely get time to completely read through emails and
associated code between my work assignments. :-)
> 
> 
> [1] https://lore.kernel.org/kvm/ZBnLaidtZEM20jMp@google.com/

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-07-11  3:54         ` Kautuk Consul
@ 2023-07-11 14:25           ` Sean Christopherson
  0 siblings, 0 replies; 79+ messages in thread
From: Sean Christopherson @ 2023-07-11 14:25 UTC (permalink / raw)
  To: Kautuk Consul
  Cc: Anish Moorthy, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

On Tue, Jul 11, 2023, Kautuk Consul wrote:
> > > That said, I agree that there's a risk that KVM could clobber vcpu->run by
> > > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> > > the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> > > without the target vCPU being loaded:
> > >
> > >     int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> > >     {
> > >         preempt_disable();
> > >         if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > >             goto out;
> > >
> > >         vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > >         ...
> > >     out:
> > >         preempt_enable();
> > >         return -EFAULT;
> > >     }
> > 
> > Ancient history aside, let's figure out what's really needed here.
> > 
> > > Why use WARN_ON_ONCE when there is a clear possibility of preemption
> > > kicking in (with the possibility of vcpu_load/vcpu_put being called
> > > in the new task) before preempt_disable() is called in this function?
> > > I think you should use WARN_ON_ONCE only where there is some impossible
> > > or unhandled situation happening, not when there is a possibility of that
> > > situation clearly happening as per the kernel code.
> > 
> > I did some mucking around to try and understand the kvm_running_vcpu
> > variable, and I don't think preemption/rescheduling actually trips the
> > WARN here? From my (limited) understanding, it seems that the
> > thread being preempted will cause a vcpu_put() via kvm_sched_out().
> > But when the thread is eventually scheduled back in onto whatever
> > core, it'll vcpu_load() via kvm_sched_in(), and the docstring for
> > kvm_get_running_vcpu() seems to imply the thing that vcpu_load()
> > stores into the per-cpu "kvm_running_vcpu" variable will be the same
> > thing which would have been observed before preemption.
> > 
> > All that's to say: I wouldn't expect the value of
> > "__this_cpu_read(kvm_running_vcpu)" to change in any given thread. If
> > that's true, then the things I would expect this WARN to catch are (a)
> > bugs where somehow the thread gets scheduled without calling
> > vcpu_load() or (b) bizarre situations (probably all bugs?) where some
> > vcpu thread has a hold of some _other_ kvm_vcpu* and is trying to do
> > something with it.
> Oh I completely missed the scheduling path for KVM.
> But since vcpu_put and vcpu_load are exported symbols, I wonder what'll
> happen when there are calls to these functions from places other
> than kvm_sched_in() and kvm_sched_out() ? Just thinking out loud.

Invoking this helper without the target vCPU loaded on the current task would be
considered a bug.  kvm.ko exports a rather disgusting number of symbols purely for
use by vendor modules, e.g. kvm-intel.ko and kvm-amd.ko on x86.  The exports are
not at all intended to be used by non-KVM code, i.e. any such misuse would also be
considered a bug.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-07-07 20:07       ` Anish Moorthy
@ 2023-07-11 15:29         ` Sean Christopherson
  2023-08-25  0:15           ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-07-11 15:29 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Jul 07, 2023, Anish Moorthy wrote:
> > Hmm, well not having to modify the vendor code would be nice... but
> > I'll have to look more at __gfn_to_pfn_memslot()'s callers (and
> > probably send more questions your way :). Hopefully it works out more
> > like what you suggest.
> 
> I took a look of my own, and I don't think moving the nowait query
> into __gfn_to_pfn_memslot() would work. At issue is the actual
> behavior of KVM_CAP_NOWAIT_ON_FAULT, which I documented as follows:
> 
> > The presence of this capability indicates that userspace may pass the
> > KVM_MEM_NOWAIT_ON_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> > to fail (-EFAULT) in response to page faults for which resolution would require
> > the faulting thread to sleep.

Well, that description is wrong for other reasons.  As mentioned in my reply
(got snipped), the behavior is not tied to sleeping or waiting on I/O.

>  Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort()
> and into __gfn_to_pfn_memslot() means that, obviously, other callers
> will start to see behavior changes. Some of that is probably actually
> necessary for that documentation to be accurate (since any usages of
> __gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag),
> but I think there are consumers of __gfn_to_pfn_memslot() from outside
> KVM_RUN.

Yeah, replace "in response to page faults" with something along the lines of "if
an access in guest context ..."

> Anyways, after some searching on my end: I think the only caller of
> __gfn_to_pfn_memslot() in core kvm/x86/arm64 where moving the "nowait"
> check into the function actually changes anything is gfn_to_pfn(). But
> that function gets called from vmx_vcpu_create() (through
> kvm_alloc_apic_access_page()), and *that* certainly doesn't look like
> something KVM_RUN does or would ever call.

Correct, but that particular gfn_to_pfn() works on a KVM-internal memslot, i.e.
will never have the "fast-only" flag set.

	hva = __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, <===
				      APIC_DEFAULT_PHYS_BASE, PAGE_SIZE);
	if (IS_ERR(hva)) {
		ret = PTR_ERR(hva);
		goto out;
	}

	page = gfn_to_page(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
	if (is_error_page(page)) {
		ret = -EFAULT;
		goto out;
	} 

On x86, there should not be any other usages of user memslots outside of KVM_RUN.
arm64 is unfortunately a different story (see this thread[*]), but we may be able
to solve that with a documentation update.  I *think* the accesses are limited to
the sub-ioctl KVM_DEV_ARM_VGIC_GRP_CTRL, and more precisely the sub-sub-ioctls
KVM_DEV_ARM_ITS_{SAVE,RESTORE}_TABLES and KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES.

[*] https://lore.kernel.org/all/Y1ghIKrAsRFwSFsO@google.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 05/16] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
  2023-07-06 22:51     ` Anish Moorthy
@ 2023-07-12 14:08       ` Sean Christopherson
  0 siblings, 0 replies; 79+ messages in thread
From: Sean Christopherson @ 2023-07-12 14:08 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Thu, Jul 06, 2023, Anish Moorthy wrote:
> On Wed, Jun 14, 2023 at 12:10 PM Sean Christopherson <seanjc@google.com> wrote:
> > static int __kvm_write_guest_page(struct kvm *kvm,
> >     struct kvm_memory_slot *memslot, gfn_t gfn,
> >     const void *data, int offset, int len)
> > {
> >     int r;
> >     unsigned long addr;
> >
> >     addr = gfn_to_hva_memslot(memslot, gfn);
> >     if (kvm_is_error_hva(addr))
> >         return -EFAULT;
> > ...
> 
> Is it OK to catch these with an annotated EFAULT? Strictly speaking,
> they don't seem to arise from a failed user access (perhaps my
> definition is wrong: I'm looking at uaccess.h), so I'm not sure that
> annotating them is valid.

IMO it's ok, and even desirable, to annotate them.  This is a judgment call we
need to make as gfn=>hva translations are a KVM concept, i.e. aren't covered by
uaccess or anything else in the kernel.  Userspace would need to be aware that
an annotated -EFAULT may have failed due to the memslot lookup, but I don't see
that as being problematic, e.g. userspace will likely need to do its own memslot
lookup anyways.

In an ideal world, KVM would flag such "faults" as failing the hva lookup, and
provide the hva when it's a uaccess (or gup()) that fails, i.e. provide userspace
with precise details on the unresolved fault.  But I don't think I want to take
that on in KVM unless it proves to be absolutely necessary.  Userspace *should*
be able to derive the same information, and I'm concerned that providing precise
*and accurate* reporting in KVM would be a maintenance burden.
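
(To make the division of labor concrete, a sketch of a userspace run loop
that consumes such annotated EFAULTs; handle_exit() and resolve_fault() are
assumed helpers, not part of this series:)

	#include <errno.h>
	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	/* Hypothetical fragment: resolve annotated faults and retry KVM_RUN. */
	static int run_vcpu(int vcpu_fd, struct kvm_run *run)
	{
		for (;;) {
			int ret = ioctl(vcpu_fd, KVM_RUN, NULL);

			if (ret == 0)
				return handle_exit(run);	/* assumed helper */

			if (errno == EFAULT &&
			    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
				/*
				 * Do the memslot lookup in userspace, then
				 * page the range in, e.g. via
				 * UFFDIO_COPY/CONTINUE or
				 * MADV_POPULATE_READ|WRITE.
				 */
				resolve_fault(run->memory_fault.gpa,	/* assumed */
					      run->memory_fault.len);
				continue;
			}
			return -1;	/* unannotated EFAULT or other error */
		}
	}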

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 01/16] KVM: Allow hva_pfn_fast() to resolve read-only faults.
  2023-06-14 16:57     ` Anish Moorthy
@ 2023-08-10 19:54       ` Anish Moorthy
  2023-08-10 23:48         ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-08-10 19:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

I figured I'd start double checking my documentation changes before
sending out the next version, since those have been a persistent
issue. So, here's what I've currently got for the commit message here

> hva_to_pfn_fast() currently just fails for read faults where
> establishing writable mappings is forbidden, which is unnecessary.
> Instead, try getting the page without passing FOLL_WRITE. This allows
> the aforementioned faults to (potentially) be resolved without falling
> back to slow GUP.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 01/16] KVM: Allow hva_pfn_fast() to resolve read-only faults.
  2023-08-10 19:54       ` Anish Moorthy
@ 2023-08-10 23:48         ` Sean Christopherson
  0 siblings, 0 replies; 79+ messages in thread
From: Sean Christopherson @ 2023-08-10 23:48 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Thu, Aug 10, 2023, Anish Moorthy wrote:
> I figured I'd start double checking my documentation changes before
> sending out the next version, since those have been a persistent
> issue. So, here's what I've currently got for the commit message here
> 
> > hva_to_pfn_fast() currently just fails for read faults where
> > establishing writable mappings is forbidden, which is unnecessary.
> > Instead, try getting the page without passing FOLL_WRITE. This allows
> > the aforementioned faults to (potentially) be resolved without falling
> > back to slow GUP.

Looks good!  One nit, I would drop the "read" part of "read faults".  This behavior
also applies to executable faults.  You captured the key part well (writable mappings
forbidden), so I don't think there's any need to further clarify what types of
faults this applies to.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-14 17:35   ` Sean Christopherson
  2023-06-20 21:13     ` Anish Moorthy
  2023-07-07 11:50     ` Kautuk Consul
@ 2023-08-11 22:12     ` Anish Moorthy
  2023-08-14 18:01       ` Sean Christopherson
  2023-08-17 22:55     ` Anish Moorthy
  3 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-08-11 22:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
>
> Tagging a globally visible, non-static function as "inline" is odd, to say the
> least.

I think my eyes glaze over whenever I read the words "translation
unit" (my brain certainly does) so I'll have to take your word for it.
IIRC last time I tried to mark this function "static" the compiler
yelled at me, so removing the "inline" it is.

> > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
>
> I strongly prefer to avoid "populate" and "efault".  Avoid "populate" because
> that verb will become stale the instance we do anything else in the helper.
> Avoid "efault" because it's imprecise, i.e. this isn't to be used for just any
> old -EFAULT scenario.  Something like kvm_handle_guest_uaccess_fault()? Definitely
> open to other names (especially less verbose names).

I've taken the kvm_handle_guest_uaccess_fault() name for now, though I
remember you saying something about "uaccess" names being bad because
they'll become inaccurate once GPM rolls around? I'll circle back on
the names before sending v5 out.

> > (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
>
> As I've stated multiple times, this can't WARN in "normal" builds because userspace
> can modify kvm_run fields at will.  I do want a WARN as it will allow fuzzers to
> find bugs for us, but it needs to be guarded with a Kconfig (or maybe a module
> param).  One idea would be to make the proposed CONFIG_KVM_PROVE_MMU[*] a generic
> Kconfig and use that.

For now I've added a KVM_WARN_MEMORY_FAULT_ANNOTATE_ALREADY_POPULATED
Kconfig: open to suggestions on the name.

> I got a bit (ok, way more than a bit) lost in all of the (A) (B) (C) madness.  I
> think this what you intended for each case?
>
>   (A) if there are any existing paths in KVM_RUN which cause a vCPU
>       to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
>       access but ignore the failure and then (3) complete the exit to
>       userspace set up in (1), then the contents of the kvm_run struct written
>       in (1) will be corrupted.
>
>   (B) if KVM_RUN fails a guest memory access for which the EFAULT is annotated
>       but does not return the EFAULT to userspace, then later returns an *un*annotated
>       EFAULT to userspace, then userspace will receive incorrect information.
>
>   (C) an annotated EFAULT which is ignored/suppressed followed by one which is
>       propagated to userspace. Here the exit-reason-unset check will prevent the
>       second annotation from being written, so userspace sees an annotation with
>       bad contents, If we believe that case (A) is a weird sequence of events
>       that shouldn't be happening in the first place, then case (C) seems more
>       important to ensure correctness in. But I don't know anything about how often
>       (A) happens in KVM, which is why I want others' opinions.

Yeah, I got lost in the weeds: you've gotten the important points though

> (A) does sadly happen.  I wouldn't call it a "pattern" though, it's an unfortunate
> side effect of deficiencies in KVM's uAPI.
>
> (B) is the trickiest to defend against in the kernel, but as I mentioned in earlier
> versions of this series, userspace needs to guard against a vCPU getting stuck in
> an infinite fault anyways, so I'm not _that_ concerned with figuring out a way to
> address this in the kernel.  KVM's documentation should strongly encourage userspace
> to take action if KVM repeatedly exits with the same info over and over, but beyond
> that I think anything else is nice to have, not mandatory.
>
> (C) should simply not be possible.  (A) is very much a "this shouldn't happen,
> but it does" case.  KVM provides no meaningful guarantees if (A) does happen, so
> yes, prioritizing correctness for (C) is far more important.
>
> That said, prioritizing (C) doesn't mean we can't also do our best to play nice
> with (A).  None of the existing exits use anywhere near the exit info union's 256
> bytes, i.e. there is tons of space to play with.  So rather than put memory_fault
> in with all the others, what if we split the union in two, and place memory_fault
> in the high half (doesn't have to literally be half, but you get the idea).  It'd
> kinda be similar to x86's contributory vs. benign faults; exits that can't be
> "nested" or "speculative" go in the low half, and things like memory_fault go in
> the high half.
>
> That way, if (A) does occur, the original information will be preserved when KVM
> fills memory_fault.  And my suggestion to WARN-and-continue limits the problematic
> scenarios to just fields in the second union, i.e. just memory_fault for now.
> At the very least, not clobbering would likely make it easier for us to debug when
> things go sideways.
>
> And rather than use kvm_run.exit_reason as the canary, we should carve out a
> kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
> canary in common KVM code, which can't be done for kvm_run.exit_reason because
> some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
> in KVM_RUN.

I think this is a good idea :D I was going to suggest something
similar a while back, but I thought it would be off the table, whoops.

My one concern is that if/when other features eventually also use the
"speculative" portion, then they're going to run into the same issues
as we're trying to avoid here. But fixing *that* (probably by
propagating these exits through return values/the call stack) would be
a really big refactor, and C doesn't really have the type system for
it in the first place :(
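
(A hypothetical layout sketch of the split-union idea, just to make the
proposal concrete; the sizes and canary placement are illustrative, not
uAPI:)

	/* Illustrative sketch only -- not the actual kvm_run layout. */
	struct kvm_run {
		/* ... */
		union {
			/* existing exit structs; must never be clobbered */
			char padding[192];
		};
		union {
			/* "speculative" exits that KVM may overwrite, e.g.: */
			struct {
				__u64 flags;
				__u64 gpa;
				__u64 len;
			} memory_fault;
			/* carve the kernel-only canary out of this space */
			char padding2[64];
		};
		/* ... */
	};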

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-08-11 22:12     ` Anish Moorthy
@ 2023-08-14 18:01       ` Sean Christopherson
  2023-08-15  0:06         ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-08-14 18:01 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Aug 11, 2023, Anish Moorthy wrote:
> On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> >
> > Tagging a globally visible, non-static function as "inline" is odd, to say the
> > least.
> 
> I think my eyes glaze over whenever I read the words "translation
> unit" (my brain certainly does) so I'll have to take your word for it.
> IIRC last time I tried to mark this function "static" the compiler
> yelled at me, so removing the "inline" it is.

What is/was the error?  It's probably worth digging into; "static inline" should
work just fine, so there might be something funky elsewhere that you're papering
over.

> > I got a bit (ok, way more than a bit) lost in all of the (A) (B) (C) madness.  I
> > think this what you intended for each case?
> >
> >   (A) if there are any existing paths in KVM_RUN which cause a vCPU
> >       to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
> >       access but ignore the failure and then (3) complete the exit to
> >       userspace set up in (1), then the contents of the kvm_run struct written
> >       in (1) will be corrupted.
> >
> >   (B) if KVM_RUN fails a guest memory access for which the EFAULT is annotated
> >       but does not return the EFAULT to userspace, then later returns an *un*annotated
> >       EFAULT to userspace, then userspace will receive incorrect information.
> >
> >   (C) an annotated EFAULT which is ignored/suppressed followed by one which is
> >       propagated to userspace. Here the exit-reason-unset check will prevent the
> >       second annotation from being written, so userspace sees an annotation with
> >       bad contents, If we believe that case (A) is a weird sequence of events
> >       that shouldn't be happening in the first place, then case (C) seems more
> >       important to ensure correctness in. But I don't know anything about how often
> >       (A) happens in KVM, which is why I want others' opinions.
> 
> Yeah, I got lost in the weeds: you've gotten the important points though
> 
> > (A) does sadly happen.  I wouldn't call it a "pattern" though, it's an unfortunate
> > side effect of deficiencies in KVM's uAPI.
> >
> > (B) is the trickiest to defend against in the kernel, but as I mentioned in earlier
> > versions of this series, userspace needs to guard against a vCPU getting stuck in
> > an infinite fault anyways, so I'm not _that_ concerned with figuring out a way to
> > address this in the kernel.  KVM's documentation should strongly encourage userspace
> > to take action if KVM repeatedly exits with the same info over and over, but beyond
> > that I think anything else is nice to have, not mandatory.
> >
> > (C) should simply not be possible.  (A) is very much a "this shouldn't happen,
> > but it does" case.  KVM provides no meaningful guarantees if (A) does happen, so
> > yes, prioritizing correctness for (C) is far more important.
> >
> > That said, prioritizing (C) doesn't mean we can't also do our best to play nice
> > with (A).  None of the existing exits use anywhere near the exit info union's 256
> > bytes, i.e. there is tons of space to play with.  So rather than put memory_fault
> > in with all the others, what if we split the union in two, and place memory_fault
> > in the high half (doesn't have to literally be half, but you get the idea).  It'd
> > kinda be similar to x86's contributory vs. benign faults; exits that can't be
> > "nested" or "speculative" go in the low half, and things like memory_fault go in
> > the high half.
> >
> > That way, if (A) does occur, the original information will be preserved when KVM
> > fills memory_fault.  And my suggestion to WARN-and-continue limits the problematic
> > scenarios to just fields in the second union, i.e. just memory_fault for now.
> > At the very least, not clobbering would likely make it easier for us to debug when
> > things go sideways.
> >
> > And rather than use kvm_run.exit_reason as the canary, we should carve out a
> > kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
> > canary in common KVM code, which can't be done for kvm_run.exit_reason because
> > some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
> > in KVM_RUN.
> 
> I think this is a good idea :D I was going to suggest something
> similar a while back, but I thought it would be off the table- whoops.
> 
> My one concern is that if/when other features eventually also use the
> "speculative" portion, then they're going to run into the same issues
> as we're trying to avoid here.

I think it's worth the risk.  We could mitigate potential future problems to some
degree by maintaining the last N "speculative" user exits since KVM_RUN, e.g. with
a ring buffer, but (a) that's more than a bit crazy and (b) I don't think the
extra data would be actionable for userspace unless userspace somehow had a priori
knowledge of the "failing" sequence.

> But fixing *that* (probably by propagating these exits through return
> values/the call stack) would be a really big refactor, and C doesn't really
> have the type system for it in the first place :(

Yeah, lack of a clean and easy way to return a tuple makes it all but impossible
to handle this without resorting to evil shenanigans.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 04/16] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page()
  2023-06-15  2:41   ` Robert Hoo
@ 2023-08-14 22:51     ` Anish Moorthy
  0 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-08-14 22:51 UTC (permalink / raw)
  To: Robert Hoo
  Cc: seanjc, oliver.upton, kvm, kvmarm, pbonzini, maz, jthoughton,
	bgardon, dmatlack, ricarkol, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata

On Wed, Jun 14, 2023 at 7:41 PM Robert Hoo <robert.hoo.linux@gmail.com> wrote:
>
> Agree, and how about one step further, i.e. adjust the param's sequence.
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2c276d4d0821..db2bc5d3e2c2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2992,7 +2992,7 @@ static int next_segment(unsigned long len, int offset)
>    */
>   static int __kvm_read_guest_page(struct kvm_memory_slot *slot,
>                                   struct kvm_vcpu *vcpu,
> -                                gfn_t gfn, void *data, int offset, int len)
> +                                gfn_t gfn, int offset, void *data, int len)

There are a lot of functions/callsites in kvm_main.c which use the existing
"data, offset, len" convention. I'd prefer to switch them all at the same
time for consistency, but I think that's too large of a change to splice in
here, so I'll leave it as is for now.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-08-14 18:01       ` Sean Christopherson
@ 2023-08-15  0:06         ` Anish Moorthy
  2023-08-15  0:43           ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-08-15  0:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Mon, Aug 14, 2023 at 11:01 AM Sean Christopherson <seanjc@google.com> wrote:
>
> What is/was the error?  It's probably worth digging into; "static inline" should
> work just fine, so there might be something funky elsewhere that you're papering
> over.

What I get is

> ./include/linux/kvm_host.h:2298:20: error: function 'kvm_handle_guest_uaccess_fault' has internal linkage but is not defined [-Werror,-Wundefined-internal]
> static inline void kvm_handle_guest_uaccess_fault(struct kvm_vcpu *vcpu,
>                    ^
> arch/x86/kvm/mmu/mmu.c:3323:2: note: used here
>         kvm_handle_guest_uaccess_fault(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
>         ^
> 1 error generated.

(mmu.c:3323 is in kvm_handle_error_pfn()). I tried shoving the
definition of the function from kvm_main.c to kvm_host.h so that I
could make it "static inline": but then the same "internal linkage"
error pops up in the kvm_vcpu_read/write_guest_page() functions.
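
(For reference, a minimal sketch of this failure mode, with hypothetical
names rather than the actual series code: clang's -Wundefined-internal
fires whenever a function with internal linkage is used in a translation
unit that never sees its definition.

	/* header.h */
	static inline void helper(void);	/* declaration only: internal linkage */

	/* user.c */
	#include "header.h"
	void caller(void)
	{
		helper();	/* error: 'helper' has internal linkage but is not defined */
	}

	/* impl.c */
	#include "header.h"
	static inline void helper(void) {}	/* definition is invisible to user.c */

Putting the full body in the header, or dropping "static", makes the error
go away.)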

Btw, do you actually know the size of the union in the run struct? I
started checking it but stopped when I realized that it includes
arch-dependent structs.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-08-15  0:06         ` Anish Moorthy
@ 2023-08-15  0:43           ` Sean Christopherson
  2023-08-15 17:01             ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-08-15  0:43 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Mon, Aug 14, 2023, Anish Moorthy wrote:
> On Mon, Aug 14, 2023 at 11:01 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > What is/was the error?  It's probably worth digging into; "static inline" should
> > work just fine, so there might be something funky elsewhere that you're papering
> > over.
> 
> What I get is
> 
> > ./include/linux/kvm_host.h:2298:20: error: function 'kvm_handle_guest_uaccess_fault' has internal linkage but is not defined [-Werror,-Wundefined-internal]
> > static inline void kvm_handle_guest_uaccess_fault(struct kvm_vcpu *vcpu,
> >                    ^
> > arch/x86/kvm/mmu/mmu.c:3323:2: note: used here
> >         kvm_handle_guest_uaccess_fault(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
> >         ^
> > 1 error generated.
> 
> (mmu.c:3323 is in kvm_handle_error_pfn()). I tried shoving the
> definition of the function from kvm_main.c to kvm_host.h so that I
> could make it "static inline": but then the same "internal linkage"
> error pops up in the kvm_vcpu_read/write_guest_page() functions.

Can you point me at your branch?  That should be easy to resolve, but it's all
but impossible to figure out what's going wrong without being able to see the
full code.

> Btw, do you actually know the size of the union in the run struct? I
> started checking it but stopped when I realized that it includes
> arch-dependent structs.

256 bytes, though how much of that is actually free for the "speculative" idea...

		/* Fix the size of the union. */
		char padding[256];

Well fudge.  PPC's KVM_EXIT_OSI actually uses all 256 bytes.  And KVM_EXIT_SYSTEM_EVENT
is closer to the limit than I'd like.

On the other hand, despite burning 2048 bytes for kvm_sync_regs, all of kvm_run
is only 2352 bytes, i.e. we have plenty of room in the 4KiB page.  So we could
throw the "speculative" exits in a completely different union.  But that would
be cumbersome for userspace.

Hrm.  The best option would probably be to have a "nested" or "previous" exit union,
and copy the existing exit information to that field prior to filling a new exit
reason.  But that would require an absolutely insane amount of refactoring because
everything just writes the fields directly. *sigh*

I suppose we could copy the information into two places for "speculative" exits,
the actual exit union and a separate "speculative" field.  I might be grasping at
straws though, not sure that ugliness would be worth making it slightly easier to
deal with the (A) scenario from earlier.

FWIW, my trick for quickly finding the real size is to feed the size+1 into an alignment.
Unless you get really unlucky, that alignment will be illegal and the compiler
will tell you the size, e.g. 

arch/x86/kvm/x86.c:13405:9: error: requested alignment ‘2353’ is not a positive power of 2
13405 |         unsigned int len __aligned(sizeof(*run) + 1);
      |         ^~~~~~~~


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-08-15  0:43           ` Sean Christopherson
@ 2023-08-15 17:01             ` Anish Moorthy
  2023-08-16 15:58               ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-08-15 17:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Aug 11, 2023 at 3:12 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> >
> > Tagging a globally visible, non-static function as "inline" is odd, to say the
> > least.
>
> I think my eyes glaze over whenever I read the words "translation
> unit" (my brain certainly does) so I'll have to take your word for it.
> IIRC last time I tried to mark this function "static" the compiler
> yelled at me, so removing the "inline" it is.
>
>...
>
On Mon, Aug 14, 2023 at 5:43 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Can you point me at your branch?  That should be easy to resolve, but it's all
> but impossible to figure out what's going wrong without being able to see the
> full code.

Sure: https://github.com/anlsh/linux/tree/suffd-kvm-staticinline.
Don't worry about this unless you're bored though: I only called out
my change because I wanted to make sure the final signature was fine.
If you say it should be static inline then I can take a more concerted
stab at learning/figuring out what's going on here.

> > Btw, do you actually know the size of the union in the run struct? I
> > started checking it but stopped when I realized that it includes
> > arch-dependent structs.
>
> 256 bytes, though how much of that is actually free for the "speculative" idea...
>
>                 /* Fix the size of the union. */
>                 char padding[256];
>
> Well fudge.  PPC's KVM_EXIT_OSI actually uses all 256 bytes.  And KVM_EXIT_SYSTEM_EVENT
> is closer to the limit than I'd like
>
> On the other hand, despite burning 2048 bytes for kvm_sync_regs, all of kvm_run
> is only 2352 bytes, i.e. we have plenty of room in the 4KiB page.  So we could
> throw the "speculative" exits in a completely different union.  But that would
> be cumbersome for userspace.

Haha, well it's a good thing we checked. What about an extra union
would be cumbersome for userspace though? From an API perspective it
doesn't seem like splitting the current struct or adding an extra one
would be all too different- is it something about needing to recompile
things due to the struct size change?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-08-15 17:01             ` Anish Moorthy
@ 2023-08-16 15:58               ` Sean Christopherson
  2023-08-16 21:28                 ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-08-16 15:58 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Tue, Aug 15, 2023, Anish Moorthy wrote:
> On Fri, Aug 11, 2023 at 3:12 PM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> > >
> > > Tagging a globally visible, non-static function as "inline" is odd, to say the
> > > least.
> >
> > I think my eyes glaze over whenever I read the words "translation
> > unit" (my brain certainly does) so I'll have to take your word for it.
> > IIRC last time I tried to mark this function "static" the compiler
> > yelled at me, so removing the "inline" it is.
> >
> >...
> >
> On Mon, Aug 14, 2023 at 5:43 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Can you point me at your branch?  That should be easy to resolve, but it's all
> > but impossible to figure out what's going wrong without being able to see the
> > full code.
> 
> Sure: https://github.com/anlsh/linux/tree/suffd-kvm-staticinline.
> Don't worry about this unless you're bored though: I only called out
> my change because I wanted to make sure the final signature was fine.
> If you say it should be static inline then I can take a more concerted
> stab at learning/figuring out what's going on here.

That branch builds (and looks) just fine on gcc-12 and clang-14.  Maybe you have
stale objects in your build directory?  Or maybe PEBKAC?
 
> > > Btw, do you actually know the size of the union in the run struct? I
> > > started checking it but stopped when I realized that it includes
> > > arch-dependent structs.
> >
> > 256 bytes, though how much of that is actually free for the "speculative" idea...
> >
> >                 /* Fix the size of the union. */
> >                 char padding[256];
> >
> > Well fudge.  PPC's KVM_EXIT_OSI actually uses all 256 bytes.  And KVM_EXIT_SYSTEM_EVENT
> > is closer to the limit than I'd like
> >
> > On the other hand, despite burning 2048 bytes for kvm_sync_regs, all of kvm_run
> > is only 2352 bytes, i.e. we have plenty of room in the 4KiB page.  So we could
> > throw the "speculative" exits in a completely different union.  But that would
> > be cumbersome for userspace.
> 
> Haha, well it's a good thing we checked. What about an extra union
> would be cumbersome for userspace though? From an API perspective it
> doesn't seem like splitting the current struct or adding an extra one
> would be all too different- is it something about needing to recompile
> things due to the struct size change?

I was thinking that we couldn't have two anonymous unions, and so userspace (and
KVM) would need to do something like

	run->exit2.memory_fault.gpa

instead of 

	run->memory_fault.gpa

but the names just need to be unique, e.g. the below compiles just fine.  So unless
someone has a better idea, using a separate union for exits that might be clobbered
seems like the way to go.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5bdda75bfd10..fc3701d835d6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3289,6 +3289,9 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
                return RET_PF_RETRY;
        }
 
+       vcpu->run->memory_fault.flags = 0;
+       vcpu->run->memory_fault.gpa = fault->gfn << PAGE_SHIFT;
+       vcpu->run->memory_fault.len = PAGE_SIZE;
        return -EFAULT;
 }
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f089ab290978..1a8ccd5f949a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -531,6 +531,18 @@ struct kvm_run {
                struct kvm_sync_regs regs;
                char padding[SYNC_REGS_SIZE_BYTES];
        } s;
+
+       /* Anonymous union for exits #2. */
+       union {
+               /* KVM_EXIT_MEMORY_FAULT */
+               struct {
+                       __u64 flags;
+                       __u64 gpa;
+                       __u64 len; /* in bytes */
+               } memory_fault;
+
+               char padding2[256];
+       };
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
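
For completeness, a sketch of the userspace side (resolve_fault() is a
placeholder; the mmap'd run struct and the -EFAULT convention are as
discussed earlier in the thread).  Because the union is anonymous, its
members are addressed directly off kvm_run:

	if (ioctl(vcpu_fd, KVM_RUN, 0) == -1 && errno == EFAULT) {
		/* e.g. MADV_POPULATE_WRITE or UFFDIO_CONTINUE on the range */
		resolve_fault(run->memory_fault.gpa, run->memory_fault.len);
	}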

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-08-16 15:58               ` Sean Christopherson
@ 2023-08-16 21:28                 ` Anish Moorthy
  2023-08-17 23:58                   ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-08-16 21:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Aug 16, 2023 at 8:58 AM Sean Christopherson <seanjc@google.com> wrote:
>
> That branch builds (and looks) just fine on gcc-12 and clang-14.  Maybe you have
> stale objects in your build directory?  Or maybe PEBKAC?

Hmm, so it does- PEBKAC indeed...

> I was thinking that we couldn't have two anonymous unions, and so userspace (and
> KVM) would need to do something like
>
>         run->exit2.memory_fault.gpa
>
> instead of
>
>         run->memory_fault.gpa
>
> but the names just need to be unique, e.g. the below compiles just fine.  So unless
> someone has a better idea, using a separate union for exits that might be clobbered
> seems like the way to go.

Agreed. By the way, what was the reason why you wanted to avoid the
exit reason canary being ABI?

On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> And rather than use kvm_run.exit_reason as the canary, we should carve out a
> kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
> canary in common KVM code, which can't be done for kvm_run.exit_reason because
> some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
> in KVM_RUN.
>
> E.g. something like this (the #ifdefs are heinous, it might be better to let
> userspace see the exit_canary, but make it abundantly clear that it's not ABI).
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 143abb334f56..233702124e0a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -511,16 +511,43 @@ struct kvm_run {
> +       /*
> +        * This second KVM_EXIT_* union holds structures for exits that may be
> +        * triggered after KVM has already initiated a different exit, and/or
> +        * may be filled speculatively by KVM.  E.g. because of limitations in
> +        * KVM's uAPI, a memory fault can be encountered after an MMIO exit is
> +        * initiated and kvm_run.mmio is filled.  Isolating these structures
> +        * from the primary KVM_EXIT_* union ensures that KVM won't clobber
> +        * information for the original exit.
> +        */
> +       union {
>                 /* KVM_EXIT_MEMORY_FAULT */
>                 blahblahblah
> +#endif
>         };
>
> +#ifdef __KERNEL__
> +       /*
> +        * Non-ABI, kernel-only field that KVM uses to detect bugs related to
> +        * filling exit_reason and the exit unions, e.g. to guard against
> +        * clobbering a previous exit.
> +        */
> +       __u64 exit_canary;
> +#endif
> +

We can't set exit_reason in the kvm_handle_guest_uaccess_fault()
helper if we're to handle case A (set up vCPU exit -> fail guest
memory access -> return to userspace) correctly. But then userspace
needs some other way to check whether an efault is annotated, and
might as well check the canary, so something like

> +       __u32 speculative_exit_reason;
> +       union {
> +               /* KVM_SPEC_EXIT_MEMORY_FAULT */
> +               struct {
> +                       __u64 flags;
> +                       __u64 gpa;
> +                       __u64 len;
> +               } memory_fault;
> +               /* Fix the size of the union. */
> +               char speculative_padding[256];
> +       };

With the condition for an annotated efault being "if kvm_run returns
-EFAULT and speculative_exit_reason is..."

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-14 17:35   ` Sean Christopherson
                       ` (2 preceding siblings ...)
  2023-08-11 22:12     ` Anish Moorthy
@ 2023-08-17 22:55     ` Anish Moorthy
  3 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-08-17 22:55 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, kvmarm

Documentation update for KVM_CAP_MEMORY_FAULT_INFO

> KVM_CAP_MEMORY_FAULT_INFO
> -------------------------------------------

Old:
> -The presence of this capability indicates that KVM_RUN may annotate EFAULTs
> -returned by KVM_RUN in response to failed vCPU guest memory accesses which
> -userspace may be able to resolve.

New:
> +The presence of this capability indicates that KVM_RUN may fill
> +kvm_run.memory_fault in response to failed guest memory accesses in a vCPU
> +context.

On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> (B) is the trickiest to defend against in the kernel, but as I mentioned in earlier
> versions of this series, userspace needs to guard against a vCPU getting stuck in
> an infinite fault anyways, so I'm not _that_ concerned with figuring out a way to
> address this in the kernel.  KVM's documentation should strongly encourage userspace
> to take action if KVM repeatedly exits with the same info over and over, but beyond
> that I think anything else is nice to have, not mandatory.

In response to that, I've added the following bit as well:

+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-08-16 21:28                 ` Anish Moorthy
@ 2023-08-17 23:58                   ` Sean Christopherson
  2023-08-18 17:32                     ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-08-17 23:58 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Aug 16, 2023, Anish Moorthy wrote:
> > but the names just need to be unique, e.g. the below compiles just fine.  So unless
> > someone has a better idea, using a separate union for exits that might be clobbered
> > seems like the way to go.
> 
> Agreed. By the way, what was the reason why you wanted to avoid the
> exit reason canary being ABI?

Because it doesn't need to be exposed to userspace, and it would be quite
unfortunate if we ever need/want to drop the canary, but can't because it's exposed
to userspace.

Though I have no idea why I suggested it be placed in kvm_run, the canary can simply
go in kvm_vcpu.  I'm guessing I was going for code locality, but abusing an
#ifdef to achieve that is silly.

> On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > And rather than use kvm_run.exit_reason as the canary, we should carve out a
> > kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
> > canary in common KVM code, which can't be done for kvm_run.exit_reason because
> > some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
> > in KVM_RUN.
> >
> > E.g. something like this (the #ifdefs are heinous, it might be better to let
> > userspace see the exit_canary, but make it abundantly clear that it's not ABI).
> >
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 143abb334f56..233702124e0a 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -511,16 +511,43 @@ struct kvm_run {
> > +       /*
> > +        * This second KVM_EXIT_* union holds structures for exits that may be
> > +        * triggered after KVM has already initiated a different exit, and/or
> > +        * may be filled speculatively by KVM.  E.g. because of limitations in
> > +        * KVM's uAPI, a memory fault can be encountered after an MMIO exit is
> > +        * initiated and kvm_run.mmio is filled.  Isolating these structures
> > +        * from the primary KVM_EXIT_* union ensures that KVM won't clobber
> > +        * information for the original exit.
> > +        */
> > +       union {
> >                 /* KVM_EXIT_MEMORY_FAULT */
> >                 blahblahblah
> > +#endif
> >         };
> >
> > +#ifdef __KERNEL__
> > +       /*
> > +        * Non-ABI, kernel-only field that KVM uses to detect bugs related to
> > +        * filling exit_reason and the exit unions, e.g. to guard against
> > +        * clobbering a previous exit.
> > +        */
> > +       __u64 exit_canary;
> > +#endif
> > +
> 
> We can't set exit_reason in the kvm_handle_guest_uaccess_fault()
> helper if we're to handle case A (the setup vcpu exit -> fail guest
> memory access -> return to userspace) correctly. But then userspace
> needs some other way to check whether an efault is annotated, and
> might as well check the canary, so something like
> 
> > +       __u32 speculative_exit_reason;

No need for a full 32-bit value, or even a separate field, we can use kvm_run.flags.
Ugh, but of course flags' usage is arch specific.  *sigh*

We can either define a flags2 (blech), or grab the upper byte of flags for
arch agnostic flags (slightly less blech).
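
Roughly this, i.e. something like (a sketch only; the bit assignments are
hypothetical):

	/* bits 0-7 of the __u16 kvm_run.flags stay arch-specific, e.g. KVM_RUN_X86_SMM */
	#define KVM_RUN_ARCH_FLAGS_MASK		0x00ff

	/* bits 8-15 become arch-agnostic */
	#define KVM_RUN_MEMORY_FAULT_FILLED	(1 << 8)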

Regarding the canary, if we want to use it for WARN_ON_ONCE(), it can't be
exposed to userspace.  Either that or we need to guard the WARN in some way.

> > +       union {
> > +               /* KVM_SPEC_EXIT_MEMORY_FAULT */

Definitely just KVM_EXIT_MEMORY_FAULT, the vast, vast majority of exits to
userspace will not be speculative in any way.

> > +               struct {
> > +                       __u64 flags;
> > +                       __u64 gpa;
> > +                       __u64 len;
> > +               } memory_fault;
> > +               /* Fix the size of the union. */
> > +               char speculative_padding[256];
> > +       };
> 
> With the condition for an annotated efault being "if kvm_run returns
> -EFAULT and speculative_exit_reason is..."

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-08-17 23:58                   ` Sean Christopherson
@ 2023-08-18 17:32                     ` Anish Moorthy
  2023-08-23 22:20                       ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-08-18 17:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Thu, Aug 17, 2023 at 4:58 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Aug 16, 2023, Anish Moorthy wrote:
> > > but the names just need to be unique, e.g. the below compiles just fine.  So unless
> > > someone has a better idea, using a separate union for exits that might be clobbered
> > > seems like the way to go.
> >
> > Agreed. By the way, what was the reason why you wanted to avoid the
> > exit reason canary being ABI?
>
> Because it doesn't need to be exposed to usersepace, and it would be quite
> unfortunate if we ever need/want to drop the canary, but can't because it's exposed
> to userspace.

> No need for a full 32-bit value, or even a separate field, we can use kvm_run.flags.
> Ugh, but of course flags' usage is arch specific.  *sigh*

Ah, I realise now you're thinking of separating the canary from
whatever userspace reads to check for an annotated memory fault. I
think that even if one variable in kvm_run did double-duty for now,
we'd always be able to switch to using another as the canary without
changing the ABI. But I'm on board with separating them anyways.

> Regarding the canary, if we want to use it for WARN_ON_ONCE(), it can't be
> exposed to userspace.  Either that or we need to guard the WARN in some way.

It's guarded behind a kconfig atm, although if we decide to drop the
userspace-visible canary then I'll drop that bit.

> > On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> > > +       __u32 speculative_exit_reason;
> ...
> We can either define a flags2 (blech), or grab the upper byte of flags for
> arch agnostic flags (slightly less blech).

Grabbing the upper byte seems reasonable, but do you anticipate KVM
ever supporting more than eight of these speculative exits? Because if
so then it seems like it'd be less trouble to use a separate u32 or
u16 (or even u8, judging by the number of KVM exits). Not sure how
much future-proofing is appropriate here :)

>
> > > +       union {
> > > +               /* KVM_SPEC_EXIT_MEMORY_FAULT */
>
> Definitely just KVM_EXIT_MEMORY_FAULT, the vast, vast majority of exits to
> userspace will not be speculative in any way.

Speaking of future-proofing, this was me trying to anticipate future
uses of the speculative exit struct: I figured that some case might
come along where KVM_RUN returns 0 *and* the contents of the speculative
exit struct might be useful- it'd be weird to look for KVM_EXIT_*s in
two different fields.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-06-14 21:23     ` Sean Christopherson
@ 2023-08-23 21:17       ` Anish Moorthy
  0 siblings, 0 replies; 79+ messages in thread
From: Anish Moorthy @ 2023-08-23 21:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Jun 14, 2023 at 2:23 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Gah, got turned around and forgot to account for @atomic.  So this?
>
>         if (!atomic && memslot_is_nowait_on_fault(slot)) {
>                 atomic = true;
>                 if (async) {
>                         *async = false;
>                         async = NULL;
>                 }
>         }
>
> > +
> >         return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> >                           writable);
> >  }

Makes sense to me, although I think the documentation for hva_to_pfn(),
where those async/atomic parameters eventually feed into, is slightly
off

> /*
>  * Pin guest page in memory and return its pfn.
> * @addr: host virtual address which maps memory to the guest
> * @atomic: whether this function can sleep
> ...
> */
> kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
 >     bool *async, bool write_fault, bool *writable)

I initially read this as "atomic == true if function can sleep," but I
think it actually means to say "atomic == true if function can *not*
sleep". So I'll add a patch to change the line to

> @atomic: whether this function is disallowed from sleeping

I'm pretty sure I have things straight: if I don't though, then we
can't upgrade the __gfn_to_pfn_memslot() calls to "atomic=true" like
you suggested above.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-08-18 17:32                     ` Anish Moorthy
@ 2023-08-23 22:20                       ` Sean Christopherson
  2023-08-23 23:38                         ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-08-23 22:20 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Fri, Aug 18, 2023, Anish Moorthy wrote:
> On Thu, Aug 17, 2023 at 4:58 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Aug 16, 2023, Anish Moorthy wrote:
> > > > but the names just need to be unique, e.g. the below compiles just fine.  So unless
> > > > someone has a better idea, using a separate union for exits that might be clobbered
> > > > seems like the way to go.
> > >
> > > Agreed. By the way, what was the reason why you wanted to avoid the
> > > exit reason canary being ABI?
> >
> > Because it doesn't need to be exposed to userspace, and it would be quite
> > unfortunate if we ever need/want to drop the canary, but can't because it's exposed
> > to userspace.
> 
> > No need for a full 32-bit value, or even a separate field, we can use kvm_run.flags.
> > Ugh, but of course flags' usage is arch specific.  *sigh*
> 
> Ah, I realise now you're thinking of separating the canary and
> whatever userspace reads to check for an annotated memory fault. I
> think that even if one variable in kvm_run did double-duty for now,
> we'd always be able to switch to using another as the canary without
> changing the ABI. But I'm on board with separating them anyways.
> 
> > Regarding the canary, if we want to use it for WARN_ON_ONCE(), it can't be
> > exposed to userspace.  Either that or we need to guard the WARN in some way.
> 
> It's guarded behind a kconfig atm, although if we decide to drop the
> userspace-visible canary then I'll drop that bit.

Yeah, burning a kconfig for this is probably overkill.

> > > On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > +       __u32 speculative_exit_reason;
> > ...
> > We can either defines a flags2 (blech), or grab the upper byte of flags for
> > arch agnostic flags (slightly less blech).
> 
> Grabbing the upper byte seems reasonable: but do you anticipate KVM
> ever supporting more than eight of these speculative exits? Because if
> so then it seems like it'd be less trouble to use a separate u32 or
> u16 (or even u8, judging by the number of KVM exits). Not sure how
> much future-proofing is appropriate here :)

I don't anticipate anything beyond the memory fault case.  We essentially already
treat incomplete exits to userspace as KVM bugs.   MMIO is the only other case I
can think of where KVM doesn't complete an exit to userspace, but that one is
essentially getting grandfathered in because of x86's flawed MMIO handling.
Userspace memory faults also get grandfathered in because of paravirt ABIs, i.e.
KVM is effectively required to ignore some faults due to external forces.

In other words, the bar for adding "speculative" exit to userspace is very high.

> > > > +       union {
> > > > +               /* KVM_SPEC_EXIT_MEMORY_FAULT */
> >
> > Definitely just KVM_EXIT_MEMORY_FAULT, the vast, vast majority of exits to
> > userspace will not be speculative in any way.
> 
> Speaking of future-proofing, this was me trying to anticipate future
> uses of the speculative exit struct: I figured that some case might
> come along where KVM_RUN returns 0 *and* the contents of the speculative
> exit struct might be useful- it'd be weird to look for KVM_EXIT_*s in
> two different fields.

That can be handled with a comment about the new flag, e.g.

/*
 * Set if KVM filled the memory_fault field since the start of KVM_RUN.  Note,
 * memory_fault is guaranteed to be fresh if and only if KVM_RUN returns -EFAULT.
 * For all other return values, memory_fault may be stale and should be
 * considered informational only, e.g. can be captured to aid debug, but shouldn't
 * be relied on for correctness.
 */
#define 	KVM_RUN_MEMORY_FAULT_FILLED
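
In userspace that check might look like (sketch; handle_memory_fault() is a
placeholder, and the flag is assumed to occupy one of the arch-agnostic
kvm_run.flags bits discussed earlier):

	if (ioctl(vcpu_fd, KVM_RUN, 0) == -1 && errno == EFAULT &&
	    (run->flags & KVM_RUN_MEMORY_FAULT_FILLED)) {
		/* memory_fault is guaranteed fresh on this path */
		handle_memory_fault(run->memory_fault.gpa, run->memory_fault.len);
	}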

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-08-23 22:20                       ` Sean Christopherson
@ 2023-08-23 23:38                         ` Anish Moorthy
  2023-08-24 17:24                           ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-08-23 23:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Aug 23, 2023 at 3:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> I don't anticipate anything beyond the memory fault case.  We essentially already
> treat incomplete exits to userspace as KVM bugs.   MMIO is the only other case I
> can think of where KVM doesn't complete an exit to userspace, but that one is
> essentially getting grandfathered in because of x86's flawed MMIO handling.
> Userspace memory faults also get grandfathered in because of paravirt ABIs, i.e.
> KVM is effectively required to ignore some faults due to external forces.

Well that's good to hear. Are you sure that we don't want to add even
just a dedicated u8 to indicate the speculative exit reason though?
I'm just thinking that the different structs in speculative_exit will
be mutually exclusive, whereas flags/bitfields usually indicate
non-mutually exclusive conditions.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-08-23 23:38                         ` Anish Moorthy
@ 2023-08-24 17:24                           ` Sean Christopherson
  0 siblings, 0 replies; 79+ messages in thread
From: Sean Christopherson @ 2023-08-24 17:24 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Wed, Aug 23, 2023, Anish Moorthy wrote:
> On Wed, Aug 23, 2023 at 3:20 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > I don't anticipate anything beyond the memory fault case.  We essentially already
> > treat incomplete exits to userspace as KVM bugs.   MMIO is the only other case I
> > can think of where KVM doesn't complete an exit to usersapce, but that one is
> > essentially getting grandfathered in because of x86's flawed MMIO handling.
> > Userspace memory faults also get grandfathered in because of paravirt ABIs, i.e.
> > KVM is effectively required to ignore some faults due to external forces.
> 
> Well that's good to hear. Are you sure that we don't want to add even
> just a dedicated u8 to indicate the speculative exit reason though?

Pretty sure.

> I'm just thinking that the different structs in speculative_exit will
> be mutually exclusive,

Given that we have no idea what the next "speculative" exit might be, I don't
think it's safe to assume that the next one will be mutually exclusive with
memory_fault.  I'm essentially betting that we'll never have more than 8
"speculative" exit types, which IMO is a pretty safe bet.

> whereas flags/bitfields usually indicate non-mutually exclusive conditions.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-07-11 15:29         ` Sean Christopherson
@ 2023-08-25  0:15           ` Anish Moorthy
  2023-08-29 22:41             ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-08-25  0:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Tue, Jul 11, 2023 at 8:29 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Well, that description is wrong for other reasons.  As mentioned in my reply
> (got snipped), the behavior is not tied to sleeping or waiting on I/O.
>
> >  Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort()
> > and into __gfn_to_pfn_memslot() means that, obviously, other callers
> > will start to see behavior changes. Some of that is probably actually
> > necessary for that documentation to be accurate (since any usages of
> > __gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag),
> > but I think there are consumers of __gfn_to_pfn_memslot() from outside
> > KVM_RUN.
>
> Yeah, replace "in response to page faults" with something along the lines of "if
> an access in guest context ..."

Alright, how about

+ KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS
+ The presence of this capability indicates that userspace may pass the
+ KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS flag to
+ KVM_SET_USER_MEMORY_REGION. Said flag will cause KVM_RUN to fail (-EFAULT)
+ in response to guest-context memory accesses which would require KVM
+ to page fault on the userspace mapping.

Although, as Wang mentioned, USERFAULT seems to suggest something
related to userfaultfd which is a liiiiitle too specific. Perhaps we
should use USERSPACE_FAULT (*cries*) instead?

On Wed, Jun 14, 2023 at 2:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> However, do we actually need to force vendor code to query nowait?  At a glance,
> the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are
> in flows that play nice with nowait or that don't support it at all.  So I *think*
> we can do this?

On Wed, Jun 14, 2023 at 2:23 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Gah, got turned around and forgot to account for @atomic.  So this?
>
>         if (!atomic && memslot_is_nowait_on_fault(slot)) {
>                 atomic = true;
>                 if (async) {
>                         *async = false;
>                         async = NULL;
>                 }
>         }
>
> > +
> >         return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> >                           writable);
> >  }

Took another look at this and I *think* it works too (I added my notes
at the end here so if anyone wants to double-check they can). But
there are a couple of quirks

1. The memslot flag can cause new __gfn_to_pfn_memslot() failures in
kvm_vcpu_map() (good thing!) but those failures result in an EINVAL
(bad!). It kinda looks like the current is_error_noslot_pfn() check in
that function should be returning EFAULT anyways though; any opinions?

2. kvm_vm_ioctl_mte_copy_tags() will see new failures. This function
has come up before (a) and it doesn't seem like an access in a guest
context. Is this something to just be documented away?

3. I don't think I've caught parts of the who-calls tree that are in
drivers/. The one part I know about is the gfn_to_pfn() call in
is_2MB_gtt_possible() (drivers/gpu/drm/i915/gvt/gtt.c), and I'm not
sure what to do about it. Again, doesn't look like a guest-context
access.

(a) https://lore.kernel.org/kvm/ZIoiLGotFsDDvRN3@google.com/T/#u

---------------------------------------------------
Notes: Tracing the usages of __gfn_to_pfn_memslot()
"shove" = "moving the nowait check inside of __gfn_to_pfn_memslot

* [x86] __gfn_to_pfn_memslot() has 5 callers
** DONE kvm_faultin_pfn() accounts for two calls, shove will cause
bail (as intended) after first
** DONE __gfn_to_pfn_prot(): No usages on x86
** DONE __gfn_to_pfn_memslot_atomic: already requires atomic access :)
** gfn_to_pfn_memslot() has two callers
*** DONE kvm_vcpu_gfn_to_pfn(): No callers
*** gfn_to_pfn() has two callers
**** TODO kvm_vcpu_map() When memslot flag trips will get
KVM_PFN_ERR_FAULT, error is handled
HOWEVER it will return -EINVAL which is kinda... not right
**** gfn_to_page() has two callers, but both operate on
APIC_DEFAULT_PHYS_BASE addr
** Ok so that's "done," as long as my LSP is reliable

* [arm64] __gfn_to_pfn_memslot() has 4 callers
** DONE user_mem_abort() has one, accounted for by the subsequent
is_error_noslot_pfn()
** DONE __gfn_to_pfn_memslot_atomic() fine as above
** TODO gfn_to_pfn_prot() One caller in kvm_vm_ioctl_mte_copy_tags()
There's a is_error_noslot_pfn() to catch the failure, but there's no vCPU
floating around to annotate a fault in!
** gfn_to_pfn_memslot() two callers
*** DONE kvm_vcpu_gfn_to_pfn() no callers
*** gfn_to_pfn() two callers
**** kvm_vcpu_map() as above
**** DONE gfn_to_page() no callers

* TODO Weird driver code reference I discovered only via ripgrep?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-08-25  0:15           ` Anish Moorthy
@ 2023-08-29 22:41             ` Sean Christopherson
  2023-08-30 16:21               ` Anish Moorthy
  0 siblings, 1 reply; 79+ messages in thread
From: Sean Christopherson @ 2023-08-29 22:41 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Thu, Aug 24, 2023, Anish Moorthy wrote:
> On Tue, Jul 11, 2023 at 8:29 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Well, that description is wrong for other reasons.  As mentioned in my reply
> > (got snipped), the behavior is not tied to sleeping or waiting on I/O.
> >
> > >  Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort()
> > > and into __gfn_to_pfn_memslot() means that, obviously, other callers
> > > will start to see behavior changes. Some of that is probably actually
> > > necessary for that documentation to be accurate (since any usages of
> > > __gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag),
> > > but I think there are consumers of __gfn_to_pfn_memslot() from outside
> > > KVM_RUN.
> >
> > Yeah, replace "in response to page faults" with something along the lines of "if
> > an access in guest context ..."
> 
> Alright, how about
> 
> + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS
> + The presence of this capability indicates that userspace may pass the
> + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS flag to
> + KVM_SET_USER_MEMORY_REGION. Said flag will cause KVM_RUN to fail (-EFAULT)
> + in response to guest-context memory accesses which would require KVM
> + to page fault on the userspace mapping.
> 
> Although, as Wang mentioned, USERFAULT seems to suggest something
> related to userfaultfd which is a liiiiitle too specific. Perhaps we
> should use USERSPACE_FAULT (*cries*) instead?

Heh, it's not strictly on guest accesses though.

At this point, I'm tempted to come up with some completely arbitrary name for the
feature and give up on trying to describe its effects in the name itself.

> On Wed, Jun 14, 2023 at 2:20 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > However, do we actually need to force vendor code to query nowait?  At a glance,
> > the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are
> > in flows that play nice with nowait or that don't support it at all.  So I *think*
> > we can do this?
> 
> On Wed, Jun 14, 2023 at 2:23 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Gah, got turned around and forgot to account for @atomic.  So this?
> >
> >         if (!atomic && memslot_is_nowait_on_fault(slot)) {
> >                 atomic = true;
> >                 if (async) {
> >                         *async = false;
> >                         async = NULL;
> >                 }
> >         }
> >
> > > +
> > >         return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> > >                           writable);
> > >  }
> 
> Took another look at this and I *think* it works too (I added my notes
> at the end here so if anyone wants to double-check they can). But
> there are a couple of quirks
> 
> 1. The memslot flag can cause new __gfn_to_pfn_memslot() failures in
> kvm_vcpu_map() (good thing!) but those failures result in an EINVAL
> (bad!). It kinda looks like the current is_error_noslot_pfn() check in
> that function should be returning EFAULT anyways though, any opinions?

Argh, it "needs" to return -EINVAL because KVM "needs" to inject a #GP if the guest
accesses a non-existent PFN in various nested SVM flows.  It's somewhat of a moot
point though, because kvm_vcpu_map() can't fail; KVM just isn't equipped to report
such failures out to userspace.

> 2. kvm_vm_ioctl_mte_copy_tags() will see new failures. This function
> has come up before (a) and it doesn't seem like an access in a guest
> context. Is this something to just be documented away?

We'll need a way for KVM to opt-out for kvm_vcpu_map(), at which point it
makes sense to opt-out for kvm_vm_ioctl_mte_copy_tags() as well.

> 3. I don't think I've caught parts of the who-calls tree that are in
> drivers/. The one part I know about is the gfn_to_pfn() call in
> is_2MB_gtt_possible() (drivers/gpu/drm/i915/gvt/gtt.c), and I'm not
> sure what to do about it. Again, doesn't look like a guest-context
> access.

On x86, that _was_ the only one.  You're welcome ;-)

https://lore.kernel.org/all/20230729013535.1070024-9-seanjc@google.com

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-08-29 22:41             ` Sean Christopherson
@ 2023-08-30 16:21               ` Anish Moorthy
  2023-09-07 21:17                 ` Sean Christopherson
  0 siblings, 1 reply; 79+ messages in thread
From: Anish Moorthy @ 2023-08-30 16:21 UTC (permalink / raw)
  To: Sean Christopherson, stevensd
  Cc: oliver.upton, kvm, kvmarm, pbonzini, maz, robert.hoo.linux,
	jthoughton, bgardon, dmatlack, ricarkol, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata

On Tue, Aug 29, 2023 at 3:42 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Aug 24, 2023, Anish Moorthy wrote:
> > On Tue, Jul 11, 2023 at 8:29 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Well, that description is wrong for other reasons.  As mentioned in my reply
> > > (got snipped), the behavior is not tied to sleeping or waiting on I/O.
> > >
> > > >  Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort()
> > > > and into __gfn_to_pfn_memslot() means that, obviously, other callers
> > > > will start to see behavior changes. Some of that is probably actually
> > > > necessary for that documentation to be accurate (since any usages of
> > > > __gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag),
> > > > but I think there are consumers of __gfn_to_pfn_memslot() from outside
> > > > KVM_RUN.
> > >
> > > Yeah, replace "in response to page faults" with something along the lines of "if
> > > an access in guest context ..."
> >
> > Alright, how about
> >
> > + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS
> > + The presence of this capability indicates that userspace may pass the
> > + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS flag to
> > + KVM_SET_USER_MEMORY_REGION. Said flag will cause KVM_RUN to fail (-EFAULT)
> > + in response to guest-context memory accesses which would require KVM
> > + to page fault on the userspace mapping.
> >
> > Although, as Wang mentioned, USERFAULT seems to suggest something
> > related to userfaultfd which is a liiiiitle too specific. Perhaps we
> > should use USERSPACE_FAULT (*cries*) instead?
>
> Heh, it's not strictly on guest accesses though.

Is the inaccuracy just because of the KVM_DEV_ARM_VGIC_GRP_CTRL
disclaimer, or something else? I thought that "guest-context accesses"
would capture the flag affecting memory accesses that KVM emulates for
the guest as well, in addition to the "normal" EPT-violation -> page
fault path. But if that's still not totally accurate then you should
probably just spell this out for me.

> At this point, I'm tempted to come up with some completely arbitrary name for the
> feature and give up on trying to describe its effects in the name itself.

You know, Oliver may have made an inspired suggestion a while back...

On Mon, Mar 20, 2023 at 3:22 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> I couldn't care less about what the user-facing portion of this thing is
> called, TBH. We could just refer to it as KVM_MEM_BIT_2 /s

> > On Wed, Jun 14, 2023 at 2:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> We'll need a way for KVM to opt-out for kvm_vcpu_map(), at which point it
> makes sense to opt-out for kvm_vm_ioctl_mte_copy_tags() as well.

Uh oh, I sense another parameter to __gfn_to_pfn_memslot(). Although I
did see that David Stevens has been proposing cleanups to that code
[1]. Is proper practice here to take a dependency on his series, do we
just resolve the conflicts when the series are merged, or something
else?

[1] https://lore.kernel.org/kvm/20230824080408.2933205-1-stevensd@google.com/

> > 3. I don't think I've caught parts of the who-calls tree that are in
> > drivers/. The one part I know about is the gfn_to_pfn() call in
> > is_2MB_gtt_possible() (drivers/gpu/drm/i915/gvt/gtt.c), and I'm not
> > sure what to do about it. Again, doesn't look like a guest-context
> > access.
>
> On x86, that _was_ the only one.  You're welcome ;-)
>
> https://lore.kernel.org/all/20230729013535.1070024-9-seanjc@google.com

Much obliged :D

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation
  2023-08-30 16:21               ` Anish Moorthy
@ 2023-09-07 21:17                 ` Sean Christopherson
  0 siblings, 0 replies; 79+ messages in thread
From: Sean Christopherson @ 2023-09-07 21:17 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: stevensd, oliver.upton, kvm, kvmarm, pbonzini, maz,
	robert.hoo.linux, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata

On Wed, Aug 30, 2023, Anish Moorthy wrote:
> On Tue, Aug 29, 2023 at 3:42 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Aug 24, 2023, Anish Moorthy wrote:
> > > On Tue, Jul 11, 2023 at 8:29 AM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > Well, that description is wrong for other reasons.  As mentioned in my reply
> > > > (got snipped), the behavior is not tied to sleeping or waiting on I/O.
> > > >
> > > > >  Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort()
> > > > > and into __gfn_to_pfn_memslot() means that, obviously, other callers
> > > > > will start to see behavior changes. Some of that is probably actually
> > > > > necessary for that documentation to be accurate (since any usages of
> > > > > __gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag),
> > > > > but I think there are consumers of __gfn_to_pfn_memslot() from outside
> > > > > KVM_RUN.
> > > >
> > > > Yeah, replace "in response to page faults" with something along the lines of "if
> > > > an access in guest context ..."
> > >
> > > Alright, how about
> > >
> > > + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS
> > > + The presence of this capability indicates that userspace may pass the
> > > + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS flag to
> > > + KVM_SET_USER_MEMORY_REGION. Said flag will cause KVM_RUN to fail (-EFAULT)
> > > + in response to guest-context memory accesses which would require KVM
> > > + to page fault on the userspace mapping.
> > >
> > > Although, as Wang mentioned, USERFAULT seems to suggest something
> > > related to userfaultfd which is a liiiiitle too specific. Perhaps we
> > > should use USERSPACE_FAULT (*cries*) instead?
> >
> > Heh, it's not strictly on guest accesses though.
> 
> Is the inaccuracy just because of the KVM_DEV_ARM_VGIC_GRP_CTRL
> disclaimer, or something else? I thought that "guest-context accesses"
> would capture the flag affecting memory accesses that KVM emulates for
> the guest as well, in addition to the "normal" EPT-violation -> page
> fault path. But if that's still not totally accurate then you should
> probably just spell this out for me.

A pedantic interpretation of "on guest access" could be that the flag would only
apply to accesses from the guest itself, i.e. not any emulated accesses.

In general, I think we should avoid having the name describe when KVM honors the
flag, because it'll limit our ability to extend KVM functionality, and I doubt
we'll ever be 100% accurate, e.g. guest emulation that "needs" kvm_vcpu_map() will
ignore the flag.

Regarding USERFAULT, why not lean into that instead of trying to avoid it?  The
behavior *is* related to userfaultfd; not in code, but certainly in its purpose.
I don't think it's a stretch to say that userfault doesn't _just_ mean the fault
is induced by userspace, it also means that the fault is relayed to userspace.
And we can even borrow some amount of UFFD nomenclature to make it easier for
userspace to understand the purpose.

For initial support, I'm thinking

  KVM_MEM_USERFAULT_ON_MISSING

i.e. generate a "user fault" when the mapping is missing.  That would give us
leeway for future expansion, e.g. if someday there's a use case for generating a
userfault exit on major faults but not on missing mappings or minor faults, we
could add KVM_MEM_USERFAULT_ON_MAJOR.
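
Opting a memslot in would then be something like (sketch; the flag's value
is hypothetical):

	struct kvm_userspace_memory_region region = {
		.slot = slot_id,
		.flags = KVM_MEM_USERFAULT_ON_MISSING,
		.guest_phys_addr = gpa,
		.memory_size = size,
		.userspace_addr = (__u64)hva,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);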
 
> > > On Wed, Jun 14, 2023 at 2:20 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > We'll need a way to way for KVM to opt-out for kvm_vcpu_map(), at which point it
> > makes sense to opt-out for kvm_vm_ioctl_mte_copy_tags() as well.
> 
> Uh oh, I sense another parameter to __gfn_to_pfn_memslot(). Although I
> did see that David Stevens has been proposing cleanups to that code
> [1]. Is proper practice here to take a dependency on his series, do we
> just resolve the conflicts when the series are merged, or something
> else?

No, don't take a dependency.  At this point, it's a coin toss as to which series
will be ready first, and taking a dependency could unnecessarily slow this series down
and/or generate pointless work.  Whoever "loses" is likely going to have a somewhat
painful rebase to deal with, but I can help on that front if/when the time comes.

As for what is "proper practice", it's always a case-by-case basis, but a good rule
of thumb is to default to letting the maintainer handle conflicts (though definitely
call out any known conflicts to make life easier for everyone), and if you suspect
that your series will have non-trivial conflicts, ask for guidance (like you just
did).

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-02 16:19 [PATCH v4 00/16] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
2023-06-02 16:19 ` [PATCH v4 01/16] KVM: Allow hva_pfn_fast() to resolve read-only faults Anish Moorthy
2023-06-14 14:39   ` Sean Christopherson
2023-06-14 16:57     ` Anish Moorthy
2023-08-10 19:54       ` Anish Moorthy
2023-08-10 23:48         ` Sean Christopherson
2023-06-02 16:19 ` [PATCH v4 02/16] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN Anish Moorthy
2023-06-02 20:30   ` Isaku Yamahata
2023-06-05 16:41     ` Anish Moorthy
2023-06-02 16:19 ` [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
2023-06-03 16:58   ` Isaku Yamahata
2023-06-05 16:37     ` Anish Moorthy
2023-06-14 14:55       ` Sean Christopherson
2023-06-05 17:46   ` Anish Moorthy
2023-06-14 17:35   ` Sean Christopherson
2023-06-20 21:13     ` Anish Moorthy
2023-07-07 11:50     ` Kautuk Consul
2023-07-10 15:00       ` Anish Moorthy
2023-07-11  3:54         ` Kautuk Consul
2023-07-11 14:25           ` Sean Christopherson
2023-08-11 22:12     ` Anish Moorthy
2023-08-14 18:01       ` Sean Christopherson
2023-08-15  0:06         ` Anish Moorthy
2023-08-15  0:43           ` Sean Christopherson
2023-08-15 17:01             ` Anish Moorthy
2023-08-16 15:58               ` Sean Christopherson
2023-08-16 21:28                 ` Anish Moorthy
2023-08-17 23:58                   ` Sean Christopherson
2023-08-18 17:32                     ` Anish Moorthy
2023-08-23 22:20                       ` Sean Christopherson
2023-08-23 23:38                         ` Anish Moorthy
2023-08-24 17:24                           ` Sean Christopherson
2023-08-17 22:55     ` Anish Moorthy
2023-07-05  8:21   ` Kautuk Consul
2023-06-02 16:19 ` [PATCH v4 04/16] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page() Anish Moorthy
2023-06-15  2:41   ` Robert Hoo
2023-08-14 22:51     ` Anish Moorthy
2023-06-02 16:19 ` [PATCH v4 05/16] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page() Anish Moorthy
2023-06-14 19:10   ` Sean Christopherson
2023-07-06 22:51     ` Anish Moorthy
2023-07-12 14:08       ` Sean Christopherson
2023-06-02 16:19 ` [PATCH v4 06/16] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page() Anish Moorthy
2023-06-14 19:22   ` Sean Christopherson
2023-07-07 17:35     ` Anish Moorthy
2023-06-02 16:19 ` [PATCH v4 07/16] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
2023-06-14 19:26   ` Sean Christopherson
2023-07-07 17:33     ` Anish Moorthy
2023-07-10 17:40       ` Sean Christopherson
2023-06-02 16:19 ` [PATCH v4 08/16] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
2023-06-14 20:03   ` Sean Christopherson
2023-07-07 18:05     ` Anish Moorthy
2023-06-15  2:43   ` Robert Hoo
2023-06-15 14:40     ` Sean Christopherson
2023-06-02 16:19 ` [PATCH v4 09/16] KVM: Introduce KVM_CAP_NOWAIT_ON_FAULT without implementation Anish Moorthy
2023-06-14 20:11   ` Sean Christopherson
2023-07-06 19:04     ` Anish Moorthy
2023-06-14 21:20   ` Sean Christopherson
2023-06-14 21:23     ` Sean Christopherson
2023-08-23 21:17       ` Anish Moorthy
2023-06-15  3:55     ` Wang, Wei W
2023-06-15 14:56       ` Sean Christopherson
2023-06-16 12:08         ` Wang, Wei W
2023-07-07 18:13     ` Anish Moorthy
2023-07-07 20:07       ` Anish Moorthy
2023-07-11 15:29         ` Sean Christopherson
2023-08-25  0:15           ` Anish Moorthy
2023-08-29 22:41             ` Sean Christopherson
2023-08-30 16:21               ` Anish Moorthy
2023-09-07 21:17                 ` Sean Christopherson
2023-06-02 16:19 ` [PATCH v4 10/16] KVM: x86: Implement KVM_CAP_NOWAIT_ON_FAULT Anish Moorthy
2023-06-14 20:25   ` Sean Christopherson
2023-07-07 17:41     ` Anish Moorthy
2023-06-02 16:19 ` [PATCH v4 11/16] KVM: arm64: " Anish Moorthy
2023-06-02 16:19 ` [PATCH v4 12/16] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
2023-06-02 16:19 ` [PATCH v4 13/16] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
2023-06-02 16:19 ` [PATCH v4 14/16] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
2023-06-02 16:19 ` [PATCH v4 15/16] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
2023-06-02 16:19 ` [PATCH v4 16/16] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
2023-06-20  2:44   ` Robert Hoo
