kvm.vger.kernel.org archive mirror
* [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
@ 2023-04-12 21:34 Anish Moorthy
  2023-04-12 21:34 ` [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
                   ` (23 more replies)
  0 siblings, 24 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Due to serialization on internal wait queues, userfaultfd can be quite
slow at delivering faults to userspace when many vCPUs fault on the same
VMA/uffd: this problem only worsens as the number of vCPUs increases.
This series allows page faults encountered in KVM_RUN to bypass
userfaultfd (KVM_CAP_ABSENT_MAPPING_FAULT) and be delivered directly via
VM exit to the faulting vCPU (KVM_CAP_MEMORY_FAULT_INFO), allowing much
higher page-in rates during uffd-based postcopy.

As a first step, KVM_CAP_MEMORY_FAULT_INFO is introduced. This
capability is meant to deliver useful information to userspace (i.e. the
problematic range of guest physical memory) when a vCPU fails a guest
memory access. KVM_RUN currently just returns -1 and sets errno=EFAULT
in response to these failed accesses: the new capability will cause it
to also fill the kvm_run struct with an exit reason of
KVM_EXIT_MEMORY_FAULT and populate the memory_fault field with the
faulting range.

Upon receiving an annotated EFAULT, userspace may take appropriate
action to resolve the failed access. For instance, this might involve a
UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
migration postcopy.
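
For illustration, the userspace side of that flow might look roughly
like the sketch below. It assumes kernel headers with this series
applied, that 'run' is the mmap()ed kvm_run struct for vcpu_fd, and
that gpa_to_hva() is a hypothetical helper consulting the VMM's memslot
layout; UFFDIO_CONTINUE (MINOR mode) is just one way to resolve the
fault.

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/userfaultfd.h>

extern void *gpa_to_hva(__u64 gpa);	/* hypothetical VMM helper */

int run_vcpu_once(int vcpu_fd, struct kvm_run *run, int uffd)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

	if (ret == -1 && errno == EFAULT &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		/*
		 * Resolve the fault, here with UFFDIO_CONTINUE as in
		 * MINOR-mode postcopy. The page-fault annotations later in
		 * this series round the gpa down to a page boundary.
		 */
		struct uffdio_continue cont = {
			.range.start = (__u64)gpa_to_hva(run->memory_fault.gpa),
			.range.len = run->memory_fault.len,
		};

		if (ioctl(uffd, UFFDIO_CONTINUE, &cont) && errno != EEXIST)
			return -1;
		return 0;	/* fault resolved: re-enter KVM_RUN */
	}
	return ret;
}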

KVM_CAP_MEMORY_FAULT_INFO comes with two important caveats, one public
and another internal.

1. Its implementation is incomplete: userspace may still receive
   un-annotated EFAULTs (exit reason != KVM_EXIT_MEMORY_FAULT) and must
   be able to handle these, although these cases are to be fixed as the
   maintainers learn of/identify them.

2. The implementation strategy given in this series, which is to fill
   the kvm_run.memory_fault field whenever a vCPU fails a guest memory
   access (even if the resulting EFAULT might not be returned to
   userspace from KVM_RUN), is not without risk: some safety measures
   are taken, but they do not ensure total correctness.

   For example, if there are any existing paths in KVM_RUN which cause
   a vCPU to (1) populate the kvm_run struct in preparation for an
   exit to userspace then (2) try and fail to access guest memory for
   some reason, but ignore the result of the access and then (3)
   complete the exit to userspace, then the contents of the kvm_run
   struct written in (1) could be lost.

   Another example: if KVM_RUN fails a guest memory access for which the
   EFAULT is annotated but does not return the EFAULT to userspace,
   then later encounters another *un*annotated EFAULT which *is*
   returned to userspace, then the kvm_run.memory_fault field read by
   userspace will correspond to the first EFAULT, not the second.

   The discussion on this topic and of the alternative (filling the
   efault info only for those cases where KVM_RUN immediately returns to
   userspace) occurs in [3].

KVM_CAP_ABSENT_MAPPING_FAULT is introduced next (and is, I should note,
an idea proposed by James Houghton in [1] :). This capability causes
KVM_RUN to error with errno=EFAULT when it encounters a page fault for
which the userspace page tables do not contain present mappings. When
combined with KVM_CAP_MEMORY_FAULT_INFO, this capability allows KVM to
deliver information on page faults directly to the involved vCPU thread,
thus bypassing the userfaultfd wait queue and its related contention.

As a side note, KVM_CAP_ABSENT_MAPPING_FAULT prevents KVM from
generating async page faults. For this reason, hypervisors using it to
improve postcopy performance will likely want to disable it at the end
of postcopy.
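
To make that concrete, a VMM might toggle the capability with
KVM_ENABLE_CAP on the VM fd, something like the sketch below. This is
illustrative only: the actual enable/disable interface and arguments
for KVM_CAP_ABSENT_MAPPING_FAULT are defined by later patches and are
not quoted in this cover letter, so the args value here is a
placeholder.

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_absent_mapping_fault(int vm_fd, bool enable)
{
	/* Placeholder arguments: see the relevant patches for the real API. */
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_ABSENT_MAPPING_FAULT,
		.args = { enable ? 1 : 0 },
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

i.e. set_absent_mapping_fault(vm_fd, true) when postcopy begins, and
set_absent_mapping_fault(vm_fd, false) once all pages have arrived so
that async page faults are generated again.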

KVM's demand paging self test is extended to demonstrate the performance
benefits of using the two new capabilities to bypass the userfaultfd
wait queue. The performance samples below (rates in thousands of
pages/s, n = 5) were generated using [2] on an x86 machine with 256
cores.

vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
1       150     340
2       191     477
4       210     809
8       155     1239
16      130     1595
32      108     2299
64      86      3482
128     62      4134
256     36      4012

[1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
[2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
    A quick rundown of the new flags (also detailed in later commits):
        -a registers all of guest memory to a single uffd.
        -r specifies the number of reader threads for polling the uffd.
        -w is what actually enables the new capabilities.
    All data was collected after applying both the entire series and
    the following bugfix:
    https://lore.kernel.org/kvm/20230223001805.2971237-1-amoorthy@google.com/#r
[3] https://lore.kernel.org/kvm/ZBTgnjXJvR8jtc4i@google.com/

---

v3
  - Rework the implementation to be based on two orthogonal
    capabilities (KVM_CAP_MEMORY_FAULT_INFO and
    KVM_CAP_ABSENT_MAPPING_FAULT) [Sean, Oliver]
  - Change return code of kvm_populate_efault_info [Isaku]
  - Use kvm_populate_efault_info from arm code [Oliver]

v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/

    This was a bit of a misfire, as I sent my WIP series on the mailing
    list but was just targeting Sean for some feedback. Oliver Upton and
    Isaku Yamahata ended up discovering the series and giving me some
    feedback anyway, so thanks to them :) In the end, there was enough
    discussion to justify retroactively labeling it as v2, even with the
    limited cc list.

  - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
  - API changes:
        - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
          KVM_CAP_X86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
          requirement).
        - Switched to memslot flag
  - Take Oliver's simplification to the "allow fast gup for readable
    faults" logic.
  - Slightly redefine the return code of user_mem_abort.
  - Fix documentation errors brought up by Marc
  - Reword commit messages in imperative mood

v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/

Anish Moorthy (22):
  KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
    paging test
  KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
    signal errors via TEST_ASSERT
  KVM: Allow hva_pfn_fast() to resolve read-only faults.
  KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN  at the start of
    KVM_RUN
  KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  KVM: Add docstrings to __kvm_write_guest_page() and
    __kvm_read_guest_page()
  KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
  KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page()
  KVM: Annotate -EFAULTs from kvm_vcpu_map()
  KVM: x86: Annotate -EFAULTs from kvm_mmu_page_fault()
  KVM: x86: Annotate -EFAULTs from setup_vmgexit_scratch()
  KVM: x86: Annotate -EFAULTs from kvm_handle_page_fault()
  KVM: x86: Annotate -EFAULTs from kvm_hv_get_assist_page()
  KVM: x86: Annotate -EFAULTs from kvm_pv_clock_pairing()
  KVM: x86: Annotate -EFAULTs from direct_map()
  KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
  KVM: x86: Implement KVM_CAP_ABSENT_MAPPING_FAULT
  KVM: arm64: Annotate (some) -EFAULTs from user_mem_abort()
  KVM: arm64: Implement KVM_CAP_ABSENT_MAPPING_FAULT
  KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  KVM: selftests: Handle memory fault exits in demand_paging_test

 Documentation/virt/kvm/api.rst                |  66 ++++-
 arch/arm64/kvm/arm.c                          |   2 +
 arch/arm64/kvm/mmu.c                          |  18 +-
 arch/x86/kvm/hyperv.c                         |  14 +-
 arch/x86/kvm/mmu/mmu.c                        |  39 ++-
 arch/x86/kvm/svm/sev.c                        |   1 +
 arch/x86/kvm/x86.c                            |   7 +-
 include/linux/kvm_host.h                      |  19 ++
 include/uapi/linux/kvm.h                      |  18 ++
 tools/include/uapi/linux/kvm.h                |  12 +
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/access_tracking_perf_test.c |   2 +-
 .../selftests/kvm/demand_paging_test.c        | 242 ++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
 .../testing/selftests/kvm/include/memstress.h |   2 +-
 .../selftests/kvm/include/userfaultfd_util.h  |  18 +-
 tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 158 +++++++-----
 .../kvm/memslot_modification_stress_test.c    |   2 +-
 virt/kvm/kvm_main.c                           |  75 +++++-
 20 files changed, 551 insertions(+), 154 deletions(-)


base-commit: d8708b80fa0e6e21bc0c9e7276ad0bccef73b6e7
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
  2023-04-19 13:51   ` Hoo Robert
  2023-04-12 21:34 ` [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
                   ` (22 subsequent siblings)
  23 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

At the moment, demand_paging_test does not support profiling/testing
multiple vCPU threads concurrently faulting on a single uffd because

    (a) "-u" (run test in userfaultfd mode) creates a uffd for each vCPU's
        region, so that each uffd services a single vCPU thread.
    (b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
        simply doesn't work: the test tries to register the same memory
        to multiple uffds, causing an error.

Add support for many vcpus per uffd by
    (1) Keeping "-u" behavior unchanged.
    (2) Making "-u -a" create a single uffd for all of guest memory.
    (3) Making "-u -o" implicitly pass "-a", solving the problem in (b).
In cases (2) and (3) all vCPU threads fault on a single uffd.

With potentially multiple vCPUs per UFFD, it makes sense to allow
configuring the number of reader threads per UFFD as well: add the
"-r" flag to do so.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/aarch64/page_fault_test.c   |  4 +-
 .../selftests/kvm/demand_paging_test.c        | 62 +++++++++----
 .../selftests/kvm/include/userfaultfd_util.h  | 18 +++-
 .../selftests/kvm/lib/userfaultfd_util.c      | 86 +++++++++++++------
 4 files changed, 124 insertions(+), 46 deletions(-)

diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index df10f1ffa20d9..3b6d228a9340d 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -376,14 +376,14 @@ static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
 		*pt_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						    pt_args.hva,
 						    pt_args.paging_size,
-						    test->uffd_pt_handler);
+						    1, test->uffd_pt_handler);
 
 	*data_uffd = NULL;
 	if (test->uffd_data_handler)
 		*data_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						      data_args.hva,
 						      data_args.paging_size,
-						      test->uffd_data_handler);
+						      1, test->uffd_data_handler);
 }
 
 static void free_uffd(struct test_desc *test, struct uffd_desc *pt_uffd,
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index b0e1fc4de9e29..6c2253f4a64ef 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -77,9 +77,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		copy.mode = 0;
 
 		r = ioctl(uffd, UFFDIO_COPY, &copy);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
-				addr, tid, errno);
+		/*
+		 * When multiple vCPU threads fault on a single page and there are
+		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
+		 * will fail with EEXIST: handle that case without signaling an
+		 * error.
+		 */
+		if (r == -1 && errno != EEXIST) {
+			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
+					addr, tid, errno);
 			return r;
 		}
 	} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
@@ -89,9 +95,10 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		cont.range.len = demand_paging_size;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
-				addr, tid, errno);
+		/* See the note about EEXISTs in the UFFDIO_COPY branch. */
+		if (r == -1 && errno != EEXIST) {
+			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
+					addr, tid, errno);
 			return r;
 		}
 	} else {
@@ -110,7 +117,9 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 
 struct test_params {
 	int uffd_mode;
+	bool single_uffd;
 	useconds_t uffd_delay;
+	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
 };
@@ -133,7 +142,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i;
+	int i, num_uffds = 0;
+	uint64_t uffd_region_size;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -146,10 +156,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
 	if (p->uffd_mode) {
-		uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *));
+		num_uffds = p->single_uffd ? 1 : nr_vcpus;
+		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
+
+		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
 		TEST_ASSERT(uffd_descs, "Memory allocation failed");
 
-		for (i = 0; i < nr_vcpus; i++) {
+		for (i = 0; i < num_uffds; i++) {
 			struct memstress_vcpu_args *vcpu_args;
 			void *vcpu_hva;
 			void *vcpu_alias;
@@ -160,8 +173,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			vcpu_hva = addr_gpa2hva(vm, vcpu_args->gpa);
 			vcpu_alias = addr_gpa2alias(vm, vcpu_args->gpa);
 
-			prefault_mem(vcpu_alias,
-				vcpu_args->pages * memstress_args.guest_page_size);
+			prefault_mem(vcpu_alias, uffd_region_size);
 
 			/*
 			 * Set up user fault fd to handle demand paging
@@ -169,7 +181,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
 				p->uffd_mode, p->uffd_delay, vcpu_hva,
-				vcpu_args->pages * memstress_args.guest_page_size,
+				uffd_region_size,
+				p->readers_per_uffd,
 				&handle_uffd_page_request);
 		}
 	}
@@ -186,7 +199,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	if (p->uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
-		for (i = 0; i < nr_vcpus; i++)
+		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
 	}
 
@@ -206,14 +219,19 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-d uffd_delay_usec]\n"
-	       "          [-b memory] [-s type] [-v vcpus] [-o]\n", name);
+	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
+		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
+		   "          [-s type] [-v vcpus] [-o]\n", name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
+	printf(" -a: Use a single userfaultfd for all of guest memory, instead of\n"
+		   "     creating one for each region paged by a unique vCPU\n"
+		   "     Set implicitly with -o, and no effect without -u.\n");
 	printf(" -d: add a delay in usec to the User Fault\n"
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
+	printf(" -r: Set the number of reader threads per uffd.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -231,12 +249,14 @@ int main(int argc, char *argv[])
 	struct test_params p = {
 		.src_type = DEFAULT_VM_MEM_SRC,
 		.partition_vcpu_memory_access = true,
+		.readers_per_uffd = 1,
+		.single_uffd = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:o")) != -1) {
+	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -248,6 +268,9 @@ int main(int argc, char *argv[])
 				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
 			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
+		case 'a':
+			p.single_uffd = true;
+			break;
 		case 'd':
 			p.uffd_delay = strtoul(optarg, NULL, 0);
 			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
@@ -265,6 +288,13 @@ int main(int argc, char *argv[])
 			break;
 		case 'o':
 			p.partition_vcpu_memory_access = false;
+			p.single_uffd = true;
+			break;
+		case 'r':
+			p.readers_per_uffd = atoi(optarg);
+			TEST_ASSERT(p.readers_per_uffd >= 1,
+						"Invalid number of readers per uffd %d: must be >=1",
+						p.readers_per_uffd);
 			break;
 		case 'h':
 		default:
diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 877449c345928..92cc1f9ec0686 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -17,18 +17,30 @@
 
 typedef int (*uffd_handler_t)(int uffd_mode, int uffd, struct uffd_msg *msg);
 
+struct uffd_reader_args {
+	int uffd_mode;
+	int uffd;
+	useconds_t delay;
+	uffd_handler_t handler;
+	/* Holds the read end of the pipe for killing the reader. */
+	int pipe;
+};
+
 struct uffd_desc {
 	int uffd_mode;
 	int uffd;
-	int pipefds[2];
 	useconds_t delay;
 	uffd_handler_t handler;
-	pthread_t thread;
+	uint64_t num_readers;
+	/* Holds the write ends of the pipes for killing the readers. */
+	int *pipefds;
+	pthread_t *readers;
+	struct uffd_reader_args *reader_args;
 };
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler);
+					   uint64_t num_readers, uffd_handler_t handler);
 
 void uffd_stop_demand_paging(struct uffd_desc *uffd);
 
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 92cef20902f1f..2723ee1e3e1b2 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -27,10 +27,8 @@
 
 static void *uffd_handler_thread_fn(void *arg)
 {
-	struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
-	int uffd = uffd_desc->uffd;
-	int pipefd = uffd_desc->pipefds[0];
-	useconds_t delay = uffd_desc->delay;
+	struct uffd_reader_args *reader_args = (struct uffd_reader_args *)arg;
+	int uffd = reader_args->uffd;
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
@@ -44,7 +42,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 		pollfd[0].fd = uffd;
 		pollfd[0].events = POLLIN;
-		pollfd[1].fd = pipefd;
+		pollfd[1].fd = reader_args->pipe;
 		pollfd[1].events = POLLIN;
 
 		r = poll(pollfd, 2, -1);
@@ -92,9 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
 
-		if (delay)
-			usleep(delay);
-		r = uffd_desc->handler(uffd_desc->uffd_mode, uffd, &msg);
+		if (reader_args->delay)
+			usleep(reader_args->delay);
+		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
 		if (r < 0)
 			return NULL;
 		pages++;
@@ -110,7 +108,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler)
+					   uint64_t num_readers, uffd_handler_t handler)
 {
 	struct uffd_desc *uffd_desc;
 	bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
@@ -118,14 +116,26 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	struct uffdio_api uffdio_api;
 	struct uffdio_register uffdio_register;
 	uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY;
-	int ret;
+	int ret, i;
 
 	PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n",
 		       is_minor ? "MINOR" : "MISSING",
 		       is_minor ? "UFFDIO_CONINUE" : "UFFDIO_COPY");
 
 	uffd_desc = malloc(sizeof(struct uffd_desc));
-	TEST_ASSERT(uffd_desc, "malloc failed");
+	TEST_ASSERT(uffd_desc, "Failed to malloc uffd descriptor");
+
+	uffd_desc->pipefds = malloc(sizeof(int) * num_readers);
+	TEST_ASSERT(uffd_desc->pipefds, "Failed to malloc pipes");
+
+	uffd_desc->readers = malloc(sizeof(pthread_t) * num_readers);
+	TEST_ASSERT(uffd_desc->readers, "Failed to malloc reader threads");
+
+	uffd_desc->reader_args = malloc(
+		sizeof(struct uffd_reader_args) * num_readers);
+	TEST_ASSERT(uffd_desc->reader_args, "Failed to malloc reader_args");
+
+	uffd_desc->num_readers = num_readers;
 
 	/* In order to get minor faults, prefault via the alias. */
 	if (is_minor)
@@ -148,18 +158,32 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
 		    expected_ioctls, "missing userfaultfd ioctls");
 
-	ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
-	TEST_ASSERT(!ret, "Failed to set up pipefd");
-
 	uffd_desc->uffd_mode = uffd_mode;
 	uffd_desc->uffd = uffd;
 	uffd_desc->delay = delay;
 	uffd_desc->handler = handler;
-	pthread_create(&uffd_desc->thread, NULL, uffd_handler_thread_fn,
-		       uffd_desc);
 
-	PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n",
-		       hva, hva + len);
+	for (i = 0; i < uffd_desc->num_readers; ++i) {
+		int pipes[2];
+
+		ret = pipe2((int *) &pipes, O_CLOEXEC | O_NONBLOCK);
+		TEST_ASSERT(!ret, "Failed to set up pipefd %i for uffd_desc %p",
+					i, uffd_desc);
+
+		uffd_desc->pipefds[i] = pipes[1];
+
+		uffd_desc->reader_args[i].uffd_mode = uffd_mode;
+		uffd_desc->reader_args[i].uffd = uffd;
+		uffd_desc->reader_args[i].delay = delay;
+		uffd_desc->reader_args[i].handler = handler;
+		uffd_desc->reader_args[i].pipe = pipes[0];
+
+		pthread_create(&uffd_desc->readers[i], NULL, uffd_handler_thread_fn,
+					   &uffd_desc->reader_args[i]);
+
+		PER_VCPU_DEBUG("Created uffd thread %i for HVA range [%p, %p)\n",
+					   i, hva, hva + len);
+	}
 
 	return uffd_desc;
 }
@@ -167,19 +191,31 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 void uffd_stop_demand_paging(struct uffd_desc *uffd)
 {
 	char c = 0;
-	int ret;
+	int i, ret;
 
-	ret = write(uffd->pipefds[1], &c, 1);
-	TEST_ASSERT(ret == 1, "Unable to write to pipefd");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = write(uffd->pipefds[i], &c, 1);
+		TEST_ASSERT(
+			ret == 1, "Unable to write to pipefd %i for uffd_desc %p", i, uffd);
+	}
 
-	ret = pthread_join(uffd->thread, NULL);
-	TEST_ASSERT(ret == 0, "Pthread_join failed.");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = pthread_join(uffd->readers[i], NULL);
+		TEST_ASSERT(
+			ret == 0,
+			"Pthread_join failed on reader thread %i for uffd_desc %p", i, uffd);
+	}
 
 	close(uffd->uffd);
 
-	close(uffd->pipefds[1]);
-	close(uffd->pipefds[0]);
+	for (i = 0; i < uffd->num_readers; ++i) {
+		close(uffd->pipefds[i]);
+		close(uffd->reader_args[i].pipe);
+	}
 
+	free(uffd->pipefds);
+	free(uffd->readers);
+	free(uffd->reader_args);
 	free(uffd);
 }
 
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
  2023-04-12 21:34 ` [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
  2023-04-19 13:36   ` Hoo Robert
  2023-04-12 21:34 ` [PATCH v3 03/22] KVM: Allow hva_pfn_fast() to resolve read-only faults Anish Moorthy
                   ` (21 subsequent siblings)
  23 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

With multiple reader threads POLLing a single UFFD, the test suffers
from the thundering herd problem: performance degrades as the number of
reader threads is increased. Solve this issue [1] by switching the
polling mechanism to EPOLL + EPOLLEXCLUSIVE.

Also, change the error-handling convention of uffd_handler_thread_fn.
Instead of just printing errors and returning early from the polling
loop, check for them via TEST_ASSERT. "return NULL" is reserved for a
successful exit from uffd_handler_thread_fn, i.e. one triggered by a
write to the exit pipe.

Performance samples generated by the command in [2] are given below.

Num Reader Threads, Paging Rate (POLL), Paging Rate (EPOLL)
1      249k      185k
2      201k      235k
4      186k      155k
16     150k      217k
32     89k       198k

[1] Single-vCPU performance does suffer somewhat.
[2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        |  1 -
 .../selftests/kvm/lib/userfaultfd_util.c      | 74 +++++++++----------
 2 files changed, 35 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 6c2253f4a64ef..c729cee4c2055 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -13,7 +13,6 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
-#include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
 #include <sys/syscall.h>
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 2723ee1e3e1b2..909ad69c1cb04 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -16,6 +16,7 @@
 #include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <sys/epoll.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -32,60 +33,55 @@ static void *uffd_handler_thread_fn(void *arg)
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
+	int epollfd;
+	struct epoll_event evt;
+
+	epollfd = epoll_create(1);
+	TEST_ASSERT(epollfd >= 0, "Failed to create epollfd.");
+
+	evt.events = EPOLLIN | EPOLLEXCLUSIVE;
+	evt.data.u32 = 0;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, uffd, &evt) == 0,
+				"Failed to add uffd to epollfd");
+
+	evt.events = EPOLLIN;
+	evt.data.u32 = 1;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, reader_args->pipe, &evt) == 0,
+				"Failed to add pipe to epollfd");
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
 	while (1) {
 		struct uffd_msg msg;
-		struct pollfd pollfd[2];
-		char tmp_chr;
 		int r;
 
-		pollfd[0].fd = uffd;
-		pollfd[0].events = POLLIN;
-		pollfd[1].fd = reader_args->pipe;
-		pollfd[1].events = POLLIN;
-
-		r = poll(pollfd, 2, -1);
-		switch (r) {
-		case -1:
-			pr_info("poll err");
-			continue;
-		case 0:
-			continue;
-		case 1:
-			break;
-		default:
-			pr_info("Polling uffd returned %d", r);
-			return NULL;
-		}
+		r = epoll_wait(epollfd, &evt, 1, -1);
+		TEST_ASSERT(r == 1,
+					"Unexpected number of events (%d) from epoll, errno = %d",
+					r, errno);
 
-		if (pollfd[0].revents & POLLERR) {
-			pr_info("uffd revents has POLLERR");
-			return NULL;
-		}
+		if (evt.data.u32 == 1) {
+			char tmp_chr;
 
-		if (pollfd[1].revents & POLLIN) {
-			r = read(pollfd[1].fd, &tmp_chr, 1);
+			TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+						"Reader thread received EPOLLERR or EPOLLHUP on pipe.");
+			r = read(reader_args->pipe, &tmp_chr, 1);
 			TEST_ASSERT(r == 1,
-				    "Error reading pipefd in UFFD thread\n");
+						"Error reading pipefd in uffd reader thread");
 			return NULL;
 		}
 
-		if (!(pollfd[0].revents & POLLIN))
-			continue;
+		TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+					"Reader thread received EPOLLERR or EPOLLHUP on uffd.");
 
 		r = read(uffd, &msg, sizeof(msg));
 		if (r == -1) {
-			if (errno == EAGAIN)
-				continue;
-			pr_info("Read of uffd got errno %d\n", errno);
-			return NULL;
+			TEST_ASSERT(errno == EAGAIN,
+						"Error reading from UFFD: errno = %d", errno);
+			continue;
 		}
 
-		if (r != sizeof(msg)) {
-			pr_info("Read on uffd returned unexpected size: %d bytes", r);
-			return NULL;
-		}
+		TEST_ASSERT(r == sizeof(msg),
+					"Read on uffd returned unexpected number of bytes (%d)", r);
 
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
@@ -93,8 +89,8 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (reader_args->delay)
 			usleep(reader_args->delay);
 		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
-		if (r < 0)
-			return NULL;
+		TEST_ASSERT(r >= 0,
+					"Reader thread handler fn returned negative value %d", r);
 		pages++;
 	}
 
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v3 03/22] KVM: Allow hva_pfn_fast() to resolve read-only faults.
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
  2023-04-12 21:34 ` [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
  2023-04-12 21:34 ` [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
  2023-04-12 21:34 ` [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN Anish Moorthy
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

hva_to_pfn_fast() currently just fails for read-only faults, which is
unnecessary. Instead, try pinning the page without passing FOLL_WRITE.
This allows read-only faults to (potentially) be resolved without
falling back to slow GUP.

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f40b72eb0e7bf..cf7d3de6f3689 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2479,7 +2479,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 }
 
 /*
- * The fast path to get the writable pfn which will be stored in @pfn,
+ * The fast path to get the pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
  * only part that runs if we can in atomic context.
  */
@@ -2493,10 +2493,9 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 	 * or the caller allows to map a writable pfn for a read fault
 	 * request.
 	 */
-	if (!(write_fault || writable))
-		return false;
+	unsigned int gup_flags = (write_fault || writable) ? FOLL_WRITE : 0;
 
-	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+	if (get_user_page_fast_only(addr, gup_flags, page)) {
 		*pfn = page_to_pfn(page[0]);
 
 		if (writable)
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (2 preceding siblings ...)
  2023-04-12 21:34 ` [PATCH v3 03/22] KVM: Allow hva_pfn_fast() to resolve read-only faults Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
  2023-05-02 17:17   ` Anish Moorthy
  2023-04-12 21:34 ` [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
                   ` (19 subsequent siblings)
  23 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Give kvm_run.exit_reason a defined initial value on entry into KVM_RUN:
other architectures (riscv, arm64) already use KVM_EXIT_UNKNOWN for this
purpose, so copy that convention.

This gives vCPUs trying to fill the run struct a mechanism to avoid
overwriting already-populated data, albeit an imperfect one. Being able
to detect an already-populated KVM run struct will prevent at least some
bugs in the upcoming implementation of KVM_CAP_MEMORY_FAULT_INFO, which
will attempt to fill the run struct whenever a vCPU fails a guest memory
access.

Without the already-populated check, KVM_CAP_MEMORY_FAULT_INFO could
change kvm_run in any code paths which

1. Populate kvm_run for some exit and prepare to return to userspace
2. Access guest memory for some reason (but without returning -EFAULTs
    to userspace)
3. Finish the return to userspace set up in (1), now with the contents
    of kvm_run changed to contain efault info.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 237c483b12301..ca73eb066af81 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10965,6 +10965,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	kvm_run->flags = 0;
 	kvm_load_guest_fpu(vcpu);
 
+	kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
 	kvm_vcpu_srcu_read_lock(vcpu);
 	if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
 		if (kvm_run->immediate_exit) {
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (3 preceding siblings ...)
  2023-04-12 21:34 ` [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
  2023-04-19 13:57   ` Hoo Robert
                     ` (2 more replies)
  2023-04-12 21:34 ` [PATCH v3 06/22] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page() Anish Moorthy
                   ` (18 subsequent siblings)
  23 siblings, 3 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

KVM_CAP_MEMORY_FAULT_INFO allows KVM_RUN to return useful information
besides a return value of -1 and errno of EFAULT when a vCPU fails an
access to guest memory.

Add documentation, updates to the KVM headers, and a helper function
(kvm_populate_efault_info) for implementing the capability.

Besides simply filling the run struct, kvm_populate_efault_info takes
two safety measures

  a. It tries to prevent concurrent fills on a single vCPU run struct
     by checking that the run struct being modified corresponds to the
     currently loaded vCPU.
  b. It tries to avoid filling an already-populated run struct by
     checking whether the exit reason has been modified since entry
     into KVM_RUN.

Finally, mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86,
even though EFAULT annotations are currently totally absent. Picking a
point to declare the implementation "done" is difficult because

  1. Annotations will be performed incrementally in subsequent commits
     across both core and arch-specific KVM.
  2. The initial series will very likely miss some cases which need
     annotation. Although these omissions are to be fixed in the future,
     userspace thus still needs to expect and be able to handle
     unannotated EFAULTs.

Given these qualifications, just marking it available here seems the
least arbitrary thing to do.
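
As a sketch of the intended userspace usage (based on the
KVM_ENABLE_CAP arguments documented in this patch; the vm_fd plumbing
and error handling are assumptions):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Opt a VM in to annotated EFAULTs from KVM_RUN. */
static int enable_memory_fault_info(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_MEMORY_FAULT_INFO,
		.args = { KVM_MEMORY_FAULT_INFO_ENABLE },
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

A return of -1 with errno == EINVAL indicates that the capability is
unsupported or that args[0] was invalid, per the documentation added
below.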

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 35 +++++++++++++++++++++++++++
 arch/arm64/kvm/arm.c           |  1 +
 arch/x86/kvm/x86.c             |  1 +
 include/linux/kvm_host.h       | 12 ++++++++++
 include/uapi/linux/kvm.h       | 16 +++++++++++++
 tools/include/uapi/linux/kvm.h | 11 +++++++++
 virt/kvm/kvm_main.c            | 44 ++++++++++++++++++++++++++++++++++
 7 files changed, 120 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 48fad65568227..f174f43c38d45 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6637,6 +6637,18 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len; /* in bytes */
+		} memory_fault;
+
+Indicates a vCPU memory fault on the guest physical address range
+[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
@@ -7670,6 +7682,29 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86, arm64
+:Parameters: args[0] - KVM_MEMORY_FAULT_INFO_ENABLE|DISABLE to enable/disable
+             the capability.
+:Returns: 0 on success, or -EINVAL if unsupported or invalid args[0].
+
+When enabled, EFAULTs "returned" by KVM_RUN in response to failed vCPU guest
+memory accesses may be annotated with additional information. When KVM_RUN
+returns an error with errno=EFAULT, userspace may check the exit reason: if it
+is KVM_EXIT_MEMORY_FAULT, userspace is then permitted to read the 'memory_fault'
+member of the run struct.
+
+The 'gpa' and 'len' (in bytes) fields describe the range of guest
+physical memory to which access failed, i.e. [gpa, gpa + len). 'flags' is
+currently always zero.
+
+NOTE: The implementation of this capability is incomplete. Even with it enabled,
+userspace may receive "bare" EFAULTs (i.e. exit reason !=
+KVM_EXIT_MEMORY_FAULT) from KVM_RUN. These should be considered bugs and
+reported to the maintainers.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a43e1cb3b7e97..a932346b59f61 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -220,6 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VCPU_ATTRIBUTES:
 	case KVM_CAP_PTP_KVM:
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ca73eb066af81..0925678e741de 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4432,6 +4432,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VAPIC:
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 90edc16d37e59..776f9713f3921 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -805,6 +805,8 @@ struct kvm {
 	struct notifier_block pm_notifier;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
+
+	bool fill_efault_info;
 };
 
 #define kvm_err(fmt, ...) \
@@ -2277,4 +2279,14 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+/*
+ * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
+ * populate the memory_fault field with the given information.
+ *
+ * Does nothing if KVM_CAP_MEMORY_FAULT_INFO is not enabled. WARNs and does
+ * nothing if the exit reason is not KVM_EXIT_UNKNOWN, or if 'vcpu' is not
+ * the current running vcpu.
+ */
+inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
+					uint64_t gpa, uint64_t len);
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4003a166328cc..bc73e8381a2bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -505,6 +506,16 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			/*
+			 * Indicates a memory fault on the guest physical address range
+			 * [gpa, gpa + len). flags is always zero for now.
+			 */
+			__u64 flags;
+			__u64 gpa;
+			__u64 len; /* in bytes */
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1184,6 +1195,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
 #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
+#define KVM_CAP_MEMORY_FAULT_INFO 227
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2237,4 +2249,8 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* flags for KVM_CAP_MEMORY_FAULT_INFO */
+#define KVM_MEMORY_FAULT_INFO_DISABLE  0
+#define KVM_MEMORY_FAULT_INFO_ENABLE   1
+
 #endif /* __LINUX_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 4003a166328cc..5c57796364d65 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -505,6 +506,16 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			/*
+			 * Indicates a memory fault on the guest physical address range
+			 * [gpa, gpa + len). flags is always zero for now.
+			 */
+			__u64 flags;
+			__u64 gpa;
+			__u64 len; /* in bytes */
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf7d3de6f3689..f3effc93cbef3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
+	kvm->fill_efault_info = false;
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
@@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
 			put_pid(oldpid);
 		}
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
+		WARN_ON_ONCE(r == -EFAULT &&
+					 vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;
 	}
@@ -4672,6 +4675,15 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 
 		return r;
 	}
+	case KVM_CAP_MEMORY_FAULT_INFO: {
+		if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap)
+			|| (cap->args[0] != KVM_MEMORY_FAULT_INFO_ENABLE
+				&& cap->args[0] != KVM_MEMORY_FAULT_INFO_DISABLE)) {
+			return -EINVAL;
+		}
+		kvm->fill_efault_info = cap->args[0] == KVM_MEMORY_FAULT_INFO_ENABLE;
+		return 0;
+	}
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
@@ -6173,3 +6185,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
 
 	return init_context.err;
 }
+
+inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
+					uint64_t gpa, uint64_t len)
+{
+	if (!vcpu->kvm->fill_efault_info)
+		return;
+
+	preempt_disable();
+	/*
+	 * Ensure that this vCPU isn't modifying another vCPU's run struct, which
+	 * would open the door for races between concurrent calls to this
+	 * function.
+	 */
+	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
+		goto out;
+	/*
+	 * Try not to overwrite an already-populated run struct.
+	 * This isn't a perfect solution, as there's no guarantee that the exit
+	 * reason is set before the run struct is populated, but it should prevent
+	 * at least some bugs.
+	 */
+	else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
+		goto out;
+
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.len = len;
+	vcpu->run->memory_fault.flags = 0;
+
+out:
+	preempt_enable();
+}
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v3 06/22] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (4 preceding siblings ...)
  2023-04-12 21:34 ` [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
  2023-04-12 21:34 ` [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page() Anish Moorthy
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

The order of parameters in these function signatures is a little strange,
with "offset" actually applying to "gfn" rather than to "data". Add
short comments to make things perfectly clear.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f3effc93cbef3..63b4285d858d1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2982,6 +2982,9 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
+/*
+ * Copy 'len' bytes from guest memory at '(gfn * PAGE_SIZE) + offset' to 'data'
+ */
 static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 				 void *data, int offset, int len)
 {
@@ -3083,6 +3086,9 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
+/*
+ * Copy 'len' bytes from 'data' into guest memory at '(gfn * PAGE_SIZE) + offset'
+ */
 static int __kvm_write_guest_page(struct kvm *kvm,
 				  struct kvm_memory_slot *memslot, gfn_t gfn,
 			          const void *data, int offset, int len)
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (5 preceding siblings ...)
  2023-04-12 21:34 ` [PATCH v3 06/22] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page() Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
  2023-04-20 20:52   ` Peter Xu
  2023-04-12 21:34 ` [PATCH v3 08/22] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page() Anish Moorthy
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Implement KVM_CAP_MEMORY_FAULT_INFO for efaults from
kvm_vcpu_write_guest_page()

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 63b4285d858d1..b29a38af543f0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3119,8 +3119,11 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 			      const void *data, int offset, int len)
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+	int ret = __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
 
-	return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
+	if (ret == -EFAULT)
+		kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset, len);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
 
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v3 08/22] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (6 preceding siblings ...)
  2023-04-12 21:34 ` [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page() Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
  2023-04-12 21:34 ` [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map() Anish Moorthy
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_vcpu_read_guest_page().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b29a38af543f0..572adba9ad8ed 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3014,7 +3014,11 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	int ret = __kvm_read_guest_page(slot, gfn, data, offset, len);
+
+	if (ret == -EFAULT)
+		kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset, len);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
 
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (7 preceding siblings ...)
  2023-04-12 21:34 ` [PATCH v3 08/22] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page() Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
  2023-04-20 20:53   ` Peter Xu
  2023-04-12 21:34 ` [PATCH v3 10/22] KVM: x86: Annotate -EFAULTs from kvm_mmu_page_fault() Anish Moorthy
                   ` (14 subsequent siblings)
  23 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_vcpu_map().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 572adba9ad8ed..f3be5aa49829a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2843,8 +2843,10 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
 #endif
 	}
 
-	if (!hva)
+	if (!hva) {
+		kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE, PAGE_SIZE);
 		return -EFAULT;
+	}
 
 	map->page = page;
 	map->hva = hva;
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v3 10/22] KVM: x86: Annotate -EFAULTs from kvm_mmu_page_fault()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (8 preceding siblings ...)
  2023-04-12 21:34 ` [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map() Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
  2023-04-12 21:34 ` [PATCH v3 11/22] KVM: x86: Annotate -EFAULTs from setup_vmgexit_scratch() Anish Moorthy
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_mmu_page_fault().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 144c5a01cd778..7391d1f75149d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5670,6 +5670,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 			return -EIO;
 	}
 
+	if (r == -EFAULT)
+		kvm_populate_efault_info(vcpu, round_down(cr2_or_gpa, PAGE_SIZE),
+								 PAGE_SIZE);
 	if (r < 0)
 		return r;
 	if (r != RET_PF_EMULATE)
-- 
2.40.0.577.gac1e443424-goog



* [PATCH v3 11/22] KVM: x86: Annotate -EFAULTs from setup_vmgexit_scratch()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (9 preceding siblings ...)
  2023-04-12 21:34 ` [PATCH v3 10/22] KVM: x86: Annotate -EFAULTs from kvm_mmu_page_fault() Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
  2023-04-12 21:35 ` [PATCH v3 12/22] KVM: x86: Annotate -EFAULTs from kvm_handle_page_fault() Anish Moorthy
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
setup_vmgexit_scratch().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/svm/sev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c25aeb550cd97..9ef121f71dc26 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2683,6 +2683,7 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len)
 			pr_err("vmgexit: kvm_read_guest for scratch area failed\n");
 
 			kvfree(scratch_va);
+			kvm_populate_efault_info(&svm->vcpu, scratch_gpa_beg, len);
 			return -EFAULT;
 		}
 
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 12/22] KVM: x86: Annotate -EFAULTs from kvm_handle_page_fault()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (10 preceding siblings ...)
  2023-04-12 21:34 ` [PATCH v3 11/22] KVM: x86: Annotate -EFAULTs from setup_vmgexit_scratch() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
  2023-04-12 21:35 ` [PATCH v3 13/22] KVM: x86: Annotate -EFAULTs from kvm_hv_get_assist_page() Anish Moorthy
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Implement KVM_CAP_MEMORY_FAULT_INFO for -EFAULTs caused by
kvm_handle_page_fault().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7391d1f75149d..937329bee654e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4371,8 +4371,11 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 
 #ifndef CONFIG_X86_64
 	/* A 64-bit CR2 should be impossible on 32-bit KVM. */
-	if (WARN_ON_ONCE(fault_address >> 32))
+	if (WARN_ON_ONCE(fault_address >> 32)) {
+		kvm_populate_efault_info(vcpu, round_down(fault_address, PAGE_SIZE),
+								 PAGE_SIZE);
 		return -EFAULT;
+	}
 #endif
 
 	vcpu->arch.l1tf_flush_l1d = true;
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 13/22] KVM: x86: Annotate -EFAULTs from kvm_hv_get_assist_page()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (11 preceding siblings ...)
  2023-04-12 21:35 ` [PATCH v3 12/22] KVM: x86: Annotate -EFAULTs from kvm_handle_page_fault() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
  2023-04-12 21:35 ` [PATCH v3 14/22] KVM: x86: Annotate -EFAULTs from kvm_pv_clock_pairing() Anish Moorthy
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_hv_get_assist_page().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/hyperv.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index b28fd020066f6..467fff271bc88 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -921,13 +921,21 @@ EXPORT_SYMBOL_GPL(kvm_hv_assist_page_enabled);
 
 int kvm_hv_get_assist_page(struct kvm_vcpu *vcpu)
 {
+	int ret = -EFAULT;
 	struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
 
 	if (!hv_vcpu || !kvm_hv_assist_page_enabled(vcpu))
-		return -EFAULT;
+		goto out;
+
+	ret = kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data,
+								&hv_vcpu->vp_assist_page,
+								sizeof(struct hv_vp_assist_page));
 
-	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data,
-				     &hv_vcpu->vp_assist_page, sizeof(struct hv_vp_assist_page));
+out:
+	if (ret == -EFAULT)
+		kvm_populate_efault_info(vcpu, vcpu->arch.pv_eoi.data.gpa,
+								 vcpu->arch.pv_eoi.data.len);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_hv_get_assist_page);
 
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 14/22] KVM: x86: Annotate -EFAULTs from kvm_pv_clock_pairing()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (12 preceding siblings ...)
  2023-04-12 21:35 ` [PATCH v3 13/22] KVM: x86: Annotate -EFAULTs from kvm_hv_get_assist_page() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
  2023-04-12 21:35 ` [PATCH v3 15/22] KVM: x86: Annotate -EFAULTs from direct_map() Anish Moorthy
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_pv_clock_pairing().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/x86.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0925678e741de..3e9deab31e1c8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9589,8 +9589,10 @@ static int kvm_pv_clock_pairing(struct kvm_vcpu *vcpu, gpa_t paddr,
 
 	ret = 0;
 	if (kvm_write_guest(vcpu->kvm, paddr, &clock_pairing,
-			    sizeof(struct kvm_clock_pairing)))
+						sizeof(struct kvm_clock_pairing))) {
+		kvm_populate_efault_info(vcpu, paddr, sizeof(struct kvm_clock_pairing));
 		ret = -KVM_EFAULT;
+	}
 
 	return ret;
 }
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 15/22] KVM: x86: Annotate -EFAULTs from direct_map()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (13 preceding siblings ...)
  2023-04-12 21:35 ` [PATCH v3 14/22] KVM: x86: Annotate -EFAULTs from kvm_pv_clock_pairing() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
  2023-04-12 21:35 ` [PATCH v3 16/22] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
direct_map().

Since direct_map() traverses multiple levels of the shadow page table, it
seems like there are actually two correct guest physical address ranges
which could be provided.

1. A smaller, more specific range, which potentially corresponds to only
   part of what could not be mapped.
   start = gfn_round_for_level(fault->gfn, fault->goal_level)
   length = KVM_PAGES_PER_HPAGE(fault->goal_level)

2. The entire range which could not be mapped
   start = gfn_round_for_level(fault->gfn, fault->goal_level)
   length = KVM_PAGES_PER_HPAGE(fault->goal_level)

Take the first approach, although it's possible the second is actually
preferable.
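
For concreteness, a hypothetical worked example of the chosen range
(assuming 4KiB base pages and fault->goal_level == PG_LEVEL_2M, neither
of which is implied by this patch):

    fault->gfn = 0x12345
    start  = gfn_round_for_level(0x12345, PG_LEVEL_2M) = 0x12200
    length = KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)          = 512 pages

i.e. the annotated range would cover gfns [0x12200, 0x12400).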

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 937329bee654e..a965c048edde8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3192,8 +3192,13 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 					     fault->req_level >= it.level);
 	}
 
-	if (WARN_ON_ONCE(it.level != fault->goal_level))
+	if (WARN_ON_ONCE(it.level != fault->goal_level)) {
+		gfn_t rounded_gfn = gfn_round_for_level(fault->gfn, fault->goal_level);
+		uint64_t len = KVM_PAGES_PER_HPAGE(fault->goal_level);
+
+		kvm_populate_efault_info(vcpu, rounded_gfn, len);
 		return -EFAULT;
+	}
 
 	ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL,
 			   base_gfn, fault->pfn, fault);
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 16/22] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (14 preceding siblings ...)
  2023-04-12 21:35 ` [PATCH v3 15/22] KVM: x86: Annotate -EFAULTs from direct_map() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
  2023-04-12 21:35 ` [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation Anish Moorthy
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_handle_error_pfn().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a965c048edde8..d83a3e1e3eff9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3218,6 +3218,9 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
 
 static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
+	uint64_t rounded_gfn;
+	uint64_t fault_size;
+
 	if (is_sigpending_pfn(fault->pfn)) {
 		kvm_handle_signal_exit(vcpu);
 		return -EINTR;
@@ -3236,6 +3239,10 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
 		return RET_PF_RETRY;
 	}
 
+	fault_size = KVM_HPAGE_SIZE(fault->goal_level);
+	rounded_gfn = round_down(fault->gfn * PAGE_SIZE, fault_size);
+
+	kvm_populate_efault_info(vcpu, rounded_gfn, fault_size);
 	return -EFAULT;
 }
 
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (15 preceding siblings ...)
  2023-04-12 21:35 ` [PATCH v3 16/22] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
  2023-04-19 14:00   ` Hoo Robert
                     ` (2 more replies)
  2023-04-12 21:35 ` [PATCH v3 18/22] KVM: x86: Implement KVM_CAP_ABSENT_MAPPING_FAULT Anish Moorthy
                   ` (6 subsequent siblings)
  23 siblings, 3 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Add documentation, memslot flags, useful helper functions, and the
actual new capability itself.

Memory fault exits on absent mappings are particularly useful for
userfaultfd-based postcopy live migration. When many vCPUs fault on a
single userfaultfd, the faults can take a while to surface to userspace
because the handlers must contend for the uffd wait queue locks.
Bypassing the uffd entirely and delivering the fault information
directly via the vCPU exit avoids this contention and improves the rate
at which faults can be resolved.
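
To make the intended userspace flow concrete, here is a rough VMM-side
sketch (illustrative only: vm_fd, vcpu_fd, and the mmap()ed kvm_run
struct are assumed to be set up already; gpa_base, hva_base, and
region_size are placeholders; and populate_range() stands in for
whatever paging mechanism the VMM uses, e.g. UFFDIO_COPY/CONTINUE or
MADV_POPULATE_WRITE):

	struct kvm_userspace_memory_region region = {
		.slot = 1,
		.flags = KVM_MEM_ABSENT_MAPPING_FAULT,
		.guest_phys_addr = gpa_base,
		.memory_size = region_size,
		.userspace_addr = (__u64)hva_base,
	};
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_MEMORY_FAULT_INFO,
		.args = { KVM_MEMORY_FAULT_INFO_ENABLE },
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

	for (;;) {
		int r = ioctl(vcpu_fd, KVM_RUN, 0);

		if (r == -1 && errno == EFAULT &&
		    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			/* Make the mapping present, then re-enter the guest. */
			populate_range(run->memory_fault.gpa);
			continue;
		}

		/* Handle other exit reasons and errors as usual. */
		break;
	}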

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 31 ++++++++++++++++++++++++++++---
 include/linux/kvm_host.h       |  7 +++++++
 include/uapi/linux/kvm.h       |  2 ++
 tools/include/uapi/linux/kvm.h |  1 +
 virt/kvm/kvm_main.c            |  3 +++
 5 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f174f43c38d45..7967b9909e28b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
   /* for kvm_userspace_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1342,12 +1343,15 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
+The flags field supports three flags
+
+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
 writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
+use it.
+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
 to make a new slot read-only.  In this case, writes to this memory will be
 posted to userspace as KVM_EXIT_MMIO exits.
+3.  KVM_MEM_ABSENT_MAPPING_FAULT: see KVM_CAP_ABSENT_MAPPING_FAULT for details.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
@@ -7705,6 +7709,27 @@ userspace may receive "bare" EFAULTs (i.e. exit reason !=
 KVM_EXIT_MEMORY_FAULT) from KVM_RUN. These should be considered bugs and
 reported to the maintainers.
 
+7.35 KVM_CAP_ABSENT_MAPPING_FAULT
+---------------------------------
+
+:Architectures: None
+:Returns: -EINVAL.
+
+The presence of this capability indicates that userspace may pass the
+KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
+to fail (-EFAULT) in response to page faults for which the userspace page tables
+do not contain present mappings. Attempting to enable the capability directly
+will fail.
+
+The range of guest physical memory causing the fault is advertised to userspace
+through KVM_CAP_MEMORY_FAULT_INFO (if it is enabled).
+
+Userspace should determine how best to make the mapping present, then take
+appropriate action. For instance, in the case of absent mappings this might
+involve establishing the mapping for the first time via UFFDIO_COPY/CONTINUE or
+faulting the mapping in using MADV_POPULATE_READ/WRITE. After establishing the
+mapping, userspace can return to KVM to retry the previous memory access.
+
 8. Other capabilities.
 ======================
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 776f9713f3921..2407fc1e52ab8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2289,4 +2289,11 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
  */
 inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
 					uint64_t gpa, uint64_t len);
+
+static inline bool kvm_slot_fault_on_absent_mapping(
+							const struct kvm_memory_slot *slot)
+{
+	return slot->flags & KVM_MEM_ABSENT_MAPPING_FAULT;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bc73e8381a2bb..21df449e74648 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_ABSENT_MAPPING_FAULT	(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1196,6 +1197,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
 #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
 #define KVM_CAP_MEMORY_FAULT_INFO 227
+#define KVM_CAP_ABSENT_MAPPING_FAULT 228
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 5c57796364d65..59219da95634c 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f3be5aa49829a..7cd0ad94726df 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1525,6 +1525,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
 	valid_flags |= KVM_MEM_READONLY;
 #endif
 
+	if (kvm_vm_ioctl_check_extension(NULL, KVM_CAP_ABSENT_MAPPING_FAULT))
+		valid_flags |= KVM_MEM_ABSENT_MAPPING_FAULT;
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 18/22] KVM: x86: Implement KVM_CAP_ABSENT_MAPPING_FAULT
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (16 preceding siblings ...)
  2023-04-12 21:35 ` [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
  2023-04-12 21:35 ` [PATCH v3 19/22] KVM: arm64: Annotate (some) -EFAULTs from user_mem_abort() Anish Moorthy
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

When the memslot flag is enabled, fail guest memory accesses for which
fast GUP fails (i.e., for which the mappings are not present).

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst |  2 +-
 arch/x86/kvm/mmu/mmu.c         | 17 ++++++++++++-----
 arch/x86/kvm/x86.c             |  1 +
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7967b9909e28b..452bbca800b15 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7712,7 +7712,7 @@ reported to the maintainers.
 7.35 KVM_CAP_ABSENT_MAPPING_FAULT
 ---------------------------------
 
-:Architectures: None
+:Architectures: x86
 :Returns: -EINVAL.
 
 The presence of this capability indicates that userspace may pass the
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d83a3e1e3eff9..4aef79b97c985 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4218,7 +4218,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
 }
 
-static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu,
+				struct kvm_page_fault *fault,
+				bool fault_on_absent_mapping)
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
@@ -4251,9 +4253,12 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	}
 
 	async = false;
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
-					  fault->write, &fault->map_writable,
-					  &fault->hva);
+
+	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn,
+						fault_on_absent_mapping, false,
+						fault_on_absent_mapping ? NULL : &async,
+						fault->write, &fault->map_writable, &fault->hva);
+
 	if (!async)
 		return RET_PF_CONTINUE; /* *pfn has correct page already */
 
@@ -4287,7 +4292,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	smp_rmb();
 
-	ret = __kvm_faultin_pfn(vcpu, fault);
+	ret = __kvm_faultin_pfn(vcpu, fault,
+				likely(fault->slot)
+					&& kvm_slot_fault_on_absent_mapping(fault->slot));
 	if (ret != RET_PF_CONTINUE)
 		return ret;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3e9deab31e1c8..bc465cde7acf6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4433,6 +4433,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_MEMORY_FAULT_INFO:
+	case KVM_CAP_ABSENT_MAPPING_FAULT:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 19/22] KVM: arm64: Annotate (some) -EFAULTs from user_mem_abort()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (17 preceding siblings ...)
  2023-04-12 21:35 ` [PATCH v3 18/22] KVM: x86: Implement KVM_CAP_ABSENT_MAPPING_FAULT Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
  2023-04-12 21:35 ` [PATCH v3 20/22] KVM: arm64: Implement KVM_CAP_ABSENT_MAPPING_FAULT Anish Moorthy
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Implement KVM_CAP_MEMORY_FAULT_INFO for at least some -EFAULTs returned
by user_mem_abort(). The other -EFAULTs returned by this function occur
before the guest physical address of the fault has been calculated:
leave those unannotated.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/arm64/kvm/mmu.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7113587222ffe..d5ae636c26d62 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1307,8 +1307,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
 	}
-	if (is_error_noslot_pfn(pfn))
+	if (is_error_noslot_pfn(pfn)) {
+		kvm_populate_efault_info(vcpu, round_down(gfn * PAGE_SIZE, vma_pagesize),
+				vma_pagesize);
 		return -EFAULT;
+	}
 
 	if (kvm_is_device_pfn(pfn)) {
 		/*
@@ -1357,6 +1360,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		if (kvm_vma_mte_allowed(vma)) {
 			sanitise_mte_tags(kvm, pfn, vma_pagesize);
 		} else {
+			kvm_populate_efault_info(vcpu,
+					round_down(gfn * PAGE_SIZE, vma_pagesize), vma_pagesize);
 			ret = -EFAULT;
 			goto out_unlock;
 		}
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 20/22] KVM: arm64: Implement KVM_CAP_ABSENT_MAPPING_FAULT
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (18 preceding siblings ...)
  2023-04-12 21:35 ` [PATCH v3 19/22] KVM: arm64: Annotate (some) -EFAULTs from user_mem_abort() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
  2023-04-12 21:35 ` [PATCH v3 21/22] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Return -EFAULT from user_mem_abort() when the memslot flag is enabled and
fast GUP fails to find a present mapping for the page.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst |  2 +-
 arch/arm64/kvm/arm.c           |  1 +
 arch/arm64/kvm/mmu.c           | 11 +++++++++--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 452bbca800b15..47f728701aca4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7712,7 +7712,7 @@ reported to the maintainers.
 7.35 KVM_CAP_ABSENT_MAPPING_FAULT
 ---------------------------------
 
-:Architectures: x86
+:Architectures: x86, arm64
 :Returns: -EINVAL.
 
 The presence of this capability indicates that userspace may pass the
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a932346b59f61..c9666d7c6c4ff 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -221,6 +221,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_PTP_KVM:
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
 	case KVM_CAP_MEMORY_FAULT_INFO:
+	case KVM_CAP_ABSENT_MAPPING_FAULT:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d5ae636c26d62..26b9485557056 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1206,6 +1206,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	unsigned long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
+	bool exit_on_memory_fault = kvm_slot_fault_on_absent_mapping(memslot);
 
 	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
 	write_fault = kvm_is_write_fault(vcpu);
@@ -1301,8 +1302,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 */
 	smp_rmb();
 
-	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-				   write_fault, &writable, NULL);
+	pfn = __gfn_to_pfn_memslot(memslot, gfn, exit_on_memory_fault, false, NULL,
+					write_fault, &writable, NULL);
+
+	if (exit_on_memory_fault && pfn == KVM_PFN_ERR_FAULT) {
+		kvm_populate_efault_info(vcpu,
+				round_down(gfn * PAGE_SIZE, vma_pagesize), vma_pagesize);
+		return -EFAULT;
+	}
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 21/22] KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (19 preceding siblings ...)
  2023-04-12 21:35 ` [PATCH v3 20/22] KVM: arm64: Implement KVM_CAP_ABSENT_MAPPING_FAULT Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
  2023-04-12 21:35 ` [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Memslot flags aren't currently exposed to the tests, and are just always
set to 0. Add a parameter to allow tests to manually set those flags.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 tools/testing/selftests/kvm/access_tracking_perf_test.c       | 2 +-
 tools/testing/selftests/kvm/demand_paging_test.c              | 4 ++--
 tools/testing/selftests/kvm/dirty_log_perf_test.c             | 2 +-
 tools/testing/selftests/kvm/include/memstress.h               | 2 +-
 tools/testing/selftests/kvm/lib/memstress.c                   | 4 ++--
 .../testing/selftests/kvm/memslot_modification_stress_test.c  | 2 +-
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f567..b51656b408b83 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -306,7 +306,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	int nr_vcpus = params->nr_vcpus;
 
-	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1, 0,
 				 params->backing_src, !overlap_memory_access);
 
 	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index c729cee4c2055..e84dde345edbc 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -144,8 +144,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	int i, num_uffds = 0;
 	uint64_t uffd_region_size;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
-				 p->src_type, p->partition_vcpu_memory_access);
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
+				1, 0, p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
 
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index e9d6d1aecf89c..6c8749193cfa4 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -224,7 +224,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	int i;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
-				 p->slots, p->backing_src,
+				 p->slots, 0, p->backing_src,
 				 p->partition_vcpu_memory_access);
 
 	pr_info("Random seed: %u\n", p->random_seed);
diff --git a/tools/testing/selftests/kvm/include/memstress.h b/tools/testing/selftests/kvm/include/memstress.h
index 72e3e358ef7bd..1cba965d2d331 100644
--- a/tools/testing/selftests/kvm/include/memstress.h
+++ b/tools/testing/selftests/kvm/include/memstress.h
@@ -56,7 +56,7 @@ struct memstress_args {
 extern struct memstress_args memstress_args;
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access);
 void memstress_destroy_vm(struct kvm_vm *vm);
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index 5f1d3173c238c..7589b8cef6911 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -119,7 +119,7 @@ void memstress_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
 }
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access)
 {
@@ -207,7 +207,7 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 
 		vm_userspace_mem_region_add(vm, backing_src, region_start,
 					    MEMSTRESS_MEM_SLOT_INDEX + i,
-					    region_pages, 0);
+					    region_pages, slot_flags);
 	}
 
 	/* Do mapping for the demand paging memory slot */
diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
index 9855c41ca811f..0b19ec3ecc9cc 100644
--- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
+++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
@@ -95,7 +95,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct test_params *p = arg;
 	struct kvm_vm *vm;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
 				 VM_MEM_SRC_ANONYMOUS,
 				 p->partition_vcpu_memory_access);
 
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (20 preceding siblings ...)
  2023-04-12 21:35 ` [PATCH v3 21/22] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
  2023-04-19 14:09   ` Hoo Robert
  2023-04-27 15:48   ` James Houghton
  2023-04-19 19:55 ` [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Peter Xu
  2023-05-09 22:19 ` David Matlack
  23 siblings, 2 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Demonstrate a (very basic) scheme for supporting memory fault exits.

From the vCPU threads:
1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
   with the purpose of establishing the absent mappings. Do so with
   wake_waiters=false to avoid serializing on the userfaultfd wait queue
   locks.

2. When the UFFDIO_COPY/CONTINUE in (1) fails with EEXIST,
   assume that the mapping was already established but is currently
   absent [A] and attempt to populate it using MADV_POPULATE_WRITE.

Issue UFFDIO_COPY/CONTINUEs from the reader threads as well, but with
wake_waiters=true to ensure that any threads sleeping on the uffd are
eventually woken up.

A real VMM would track whether it had already COPY/CONTINUEd pages (e.g.,
via a bitmap) to avoid calls destined to fail with EEXIST. However, even the
naive approach is enough to demonstrate the performance advantages of
KVM_EXIT_MEMORY_FAULT.

[A] In reality it is much likelier that the vCPU thread simply lost a
    race to establish the mapping for the page.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        | 209 +++++++++++++-----
 1 file changed, 155 insertions(+), 54 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index e84dde345edbc..668bd63d944e7 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -15,6 +15,7 @@
 #include <time.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <sys/mman.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -31,6 +32,57 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
 static size_t demand_paging_size;
 static char *guest_data_prototype;
 
+static int num_uffds;
+static size_t uffd_region_size;
+static struct uffd_desc **uffd_descs;
+/*
+ * Delay when demand paging is performed through userfaultfd or directly by
+ * vcpu_worker in the case of a KVM_EXIT_MEMORY_FAULT.
+ */
+static useconds_t uffd_delay;
+static int uffd_mode;
+
+
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+									bool is_vcpu);
+
+static void madv_write_or_err(uint64_t gpa)
+{
+	int r;
+	void *hva = addr_gpa2hva(memstress_args.vm, gpa);
+
+	r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
+	TEST_ASSERT(r == 0,
+				"MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) fail, errno %i\n",
+				(uintptr_t) hva, gpa, errno);
+}
+
+static void ready_page(uint64_t gpa)
+{
+	int r, uffd;
+
+	/*
+	 * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
+	 * the registered ranges should fault in the physical pages through
+	 * MADV_POPULATE_WRITE.
+	 */
+	if ((gpa < memstress_args.gpa)
+		|| (gpa >= memstress_args.gpa + memstress_args.size)) {
+		madv_write_or_err(gpa);
+	} else {
+		if (uffd_delay)
+			usleep(uffd_delay);
+
+		uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
+
+		r = handle_uffd_page_request(uffd_mode, uffd,
+					(uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
+
+		if (r == EEXIST)
+			madv_write_or_err(gpa);
+	}
+}
+
 static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 {
 	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
@@ -42,25 +94,36 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
 
-	/* Let the guest access its memory */
-	ret = _vcpu_run(vcpu);
-	TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-	if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
-		TEST_ASSERT(false,
-			    "Invalid guest sync status: exit_reason=%s\n",
-			    exit_reason_str(run->exit_reason));
-	}
+	while (true) {
+		/* Let the guest access its memory */
+		ret = _vcpu_run(vcpu);
+		TEST_ASSERT(ret == 0
+					|| (errno == EFAULT
+						&& run->exit_reason == KVM_EXIT_MEMORY_FAULT),
+					"vcpu_run failed: %d\n", ret);
+		if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
+
+			if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
+				ready_page(run->memory_fault.gpa);
+				continue;
+			}
+
+			TEST_ASSERT(false,
+						"Invalid guest sync status: exit_reason=%s\n",
+						exit_reason_str(run->exit_reason));
+		}
 
-	ts_diff = timespec_elapsed(start);
-	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
-		       ts_diff.tv_sec, ts_diff.tv_nsec);
+		ts_diff = timespec_elapsed(start);
+		PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
+					   ts_diff.tv_sec, ts_diff.tv_nsec);
+		break;
+	}
 }
 
-static int handle_uffd_page_request(int uffd_mode, int uffd,
-		struct uffd_msg *msg)
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+									bool is_vcpu)
 {
 	pid_t tid = syscall(__NR_gettid);
-	uint64_t addr = msg->arg.pagefault.address;
 	struct timespec start;
 	struct timespec ts_diff;
 	int r;
@@ -71,56 +134,78 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		struct uffdio_copy copy;
 
 		copy.src = (uint64_t)guest_data_prototype;
-		copy.dst = addr;
+		copy.dst = hva;
 		copy.len = demand_paging_size;
-		copy.mode = 0;
+		copy.mode = UFFDIO_COPY_MODE_DONTWAKE;
 
-		r = ioctl(uffd, UFFDIO_COPY, &copy);
 		/*
-		 * With multiple vCPU threads fault on a single page and there are
-		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
-		 * will fail with EEXIST: handle that case without signaling an
-		 * error.
+		 * With multiple vCPU threads and at least one of multiple reader threads
+		 * or vCPU memory faults, multiple vCPUs accessing an absent page will
+		 * almost certainly cause some thread doing the UFFDIO_COPY here to get
+		 * EEXIST: make sure to allow that case.
 		 */
-		if (r == -1 && errno != EEXIST) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
-					addr, tid, errno);
-			return r;
-		}
+		r = ioctl(uffd, UFFDIO_COPY, &copy);
+		TEST_ASSERT(r == 0 || errno == EEXIST,
+			"Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
+			gettid(), hva, errno);
 	} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+		/* The comments in the UFFDIO_COPY branch also apply here. */
 		struct uffdio_continue cont = {0};
 
-		cont.range.start = addr;
+		cont.range.start = hva;
 		cont.range.len = demand_paging_size;
+		cont.mode = UFFDIO_CONTINUE_MODE_DONTWAKE;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
-		/* See the note about EEXISTs in the UFFDIO_COPY branch. */
-		if (r == -1 && errno != EEXIST) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
-					addr, tid, errno);
-			return r;
-		}
+		TEST_ASSERT(r == 0 || errno == EEXIST,
+			"Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
+			gettid(), hva, errno);
 	} else {
 		TEST_FAIL("Invalid uffd mode %d", uffd_mode);
 	}
 
+	/*
+	 * If the above UFFDIO_COPY/CONTINUE fails with EEXIST, it will do so without
+	 * waking threads waiting on the UFFD: make sure that happens here.
+	 */
+	if (!is_vcpu) {
+		struct uffdio_range range = {
+			.start = hva,
+			.len = demand_paging_size
+		};
+		r = ioctl(uffd, UFFDIO_WAKE, &range);
+		TEST_ASSERT(
+			r == 0,
+			"Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
+			gettid(), hva, errno);
+	}
+
 	ts_diff = timespec_elapsed(start);
 
 	PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
 		       timespec_to_ns(ts_diff));
 	PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
-		       demand_paging_size, addr, tid);
+		       demand_paging_size, hva, tid);
 
 	return 0;
 }
 
+static int handle_uffd_page_request_from_uffd(int uffd_mode, int uffd,
+				struct uffd_msg *msg)
+{
+	TEST_ASSERT(msg->event == UFFD_EVENT_PAGEFAULT,
+		"Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
+		msg->event);
+	return handle_uffd_page_request(uffd_mode, uffd,
+					msg->arg.pagefault.address, false);
+}
+
 struct test_params {
-	int uffd_mode;
 	bool single_uffd;
-	useconds_t uffd_delay;
 	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
+	bool memfault_exits;
 };
 
 static void prefault_mem(void *alias, uint64_t len)
@@ -137,15 +222,26 @@ static void prefault_mem(void *alias, uint64_t len)
 static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct test_params *p = arg;
-	struct uffd_desc **uffd_descs = NULL;
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i, num_uffds = 0;
-	uint64_t uffd_region_size;
+	int i;
+	uint32_t slot_flags = 0;
+	bool uffd_memfault_exits = uffd_mode && p->memfault_exits;
+
+	if (uffd_memfault_exits) {
+		TEST_ASSERT(kvm_has_cap(KVM_CAP_ABSENT_MAPPING_FAULT) > 0,
+					"KVM does not have KVM_CAP_ABSENT_MAPPING_FAULT");
+		slot_flags = KVM_MEM_ABSENT_MAPPING_FAULT;
+	}
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
-				1, 0, p->src_type, p->partition_vcpu_memory_access);
+				1, slot_flags, p->src_type, p->partition_vcpu_memory_access);
+
+	if (uffd_memfault_exits) {
+		vm_enable_cap(vm,
+					  KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_FAULT_INFO_ENABLE);
+	}
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
 
@@ -154,12 +250,12 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		    "Failed to allocate buffer for guest data pattern");
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		num_uffds = p->single_uffd ? 1 : nr_vcpus;
 		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
 
 		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
-		TEST_ASSERT(uffd_descs, "Memory allocation failed");
+		TEST_ASSERT(uffd_descs, "Failed to allocate memory of uffd descriptors");
 
 		for (i = 0; i < num_uffds; i++) {
 			struct memstress_vcpu_args *vcpu_args;
@@ -179,10 +275,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 * requests.
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
-				p->uffd_mode, p->uffd_delay, vcpu_hva,
+				uffd_mode, uffd_delay, vcpu_hva,
 				uffd_region_size,
 				p->readers_per_uffd,
-				&handle_uffd_page_request);
+				&handle_uffd_page_request_from_uffd);
 		}
 	}
 
@@ -196,7 +292,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	ts_diff = timespec_elapsed(start);
 	pr_info("All vCPU threads joined\n");
 
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
 		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
@@ -211,7 +307,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memstress_destroy_vm(vm);
 
 	free(guest_data_prototype);
-	if (p->uffd_mode)
+	if (uffd_mode)
 		free(uffd_descs);
 }
 
@@ -220,7 +316,7 @@ static void help(char *name)
 	puts("");
 	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
 		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
-		   "          [-s type] [-v vcpus] [-o]\n", name);
+		   "          [-w] [-s type] [-v vcpus] [-o]\n", name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
@@ -231,6 +327,7 @@ static void help(char *name)
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
 	printf(" -r: Set the number of reader threads per uffd.\n");
+	printf(" -w: Enable kvm cap for memory fault exits.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -250,29 +347,30 @@ int main(int argc, char *argv[])
 		.partition_vcpu_memory_access = true,
 		.readers_per_uffd = 1,
 		.single_uffd = false,
+		.memfault_exits = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
+	while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
 			break;
 		case 'u':
 			if (!strcmp("MISSING", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
+				uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
 			else if (!strcmp("MINOR", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
-			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
+				uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
+			TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
 		case 'a':
 			p.single_uffd = true;
 			break;
 		case 'd':
-			p.uffd_delay = strtoul(optarg, NULL, 0);
-			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
+			uffd_delay = strtoul(optarg, NULL, 0);
+			TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
 			break;
 		case 'b':
 			guest_percpu_mem_size = parse_size(optarg);
@@ -295,6 +393,9 @@ int main(int argc, char *argv[])
 						"Invalid number of readers per uffd %d: must be >=1",
 						p.readers_per_uffd);
 			break;
+		case 'w':
+			p.memfault_exits = true;
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
@@ -302,7 +403,7 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
+	if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
 	    !backing_src_is_shared(p.src_type)) {
 		TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
 	}
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
  2023-04-12 21:34 ` [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
@ 2023-04-19 13:36   ` Hoo Robert
  2023-04-19 23:26     ` Anish Moorthy
  0 siblings, 1 reply; 103+ messages in thread
From: Hoo Robert @ 2023-04-19 13:36 UTC (permalink / raw)
  To: Anish Moorthy, pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

On 4/13/2023 5:34 AM, Anish Moorthy wrote:
> With multiple reader threads POLLing a single UFFD, the test suffers
> from the thundering herd problem: performance degrades as the number of
> reader threads is increased. Solve this issue [1] by switching the
> the polling mechanism to EPOLL + EPOLLEXCLUSIVE.
> 
> Also, change the error-handling convention of uffd_handler_thread_fn.
> Instead of just printing errors and returning early from the polling
> loop, check for them via TEST_ASSERT. "return NULL" is reserved for a
> successful exit from uffd_handler_thread_fn, ie one triggered by a
> write to the exit pipe.
> 
> Performance samples generated by the command in [2] are given below.
> 
> Num Reader Threads, Paging Rate (POLL), Paging Rate (EPOLL)
> 1      249k      185k
> 2      201k      235k
> 4      186k      155k
> 16     150k      217k
> 32     89k       198k
> 
> [1] Single-vCPU performance does suffer somewhat.
> [2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
>   .../selftests/kvm/demand_paging_test.c        |  1 -
>   .../selftests/kvm/lib/userfaultfd_util.c      | 74 +++++++++----------
>   2 files changed, 35 insertions(+), 40 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> index 6c2253f4a64ef..c729cee4c2055 100644
> --- a/tools/testing/selftests/kvm/demand_paging_test.c
> +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> @@ -13,7 +13,6 @@
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <time.h>
> -#include <poll.h>
>   #include <pthread.h>
>   #include <linux/userfaultfd.h>
>   #include <sys/syscall.h>
> diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> index 2723ee1e3e1b2..909ad69c1cb04 100644
> --- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> +++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> @@ -16,6 +16,7 @@
>   #include <poll.h>
>   #include <pthread.h>
>   #include <linux/userfaultfd.h>
> +#include <sys/epoll.h>
>   #include <sys/syscall.h>
>   
>   #include "kvm_util.h"
> @@ -32,60 +33,55 @@ static void *uffd_handler_thread_fn(void *arg)
>   	int64_t pages = 0;
>   	struct timespec start;
>   	struct timespec ts_diff;
> +	int epollfd;
> +	struct epoll_event evt;
> +
> +	epollfd = epoll_create(1);
> +	TEST_ASSERT(epollfd >= 0, "Failed to create epollfd.");
> +
> +	evt.events = EPOLLIN | EPOLLEXCLUSIVE;
> +	evt.data.u32 = 0;
> +	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, uffd, &evt) == 0,
> +				"Failed to add uffd to epollfd");
> +
> +	evt.events = EPOLLIN;
> +	evt.data.u32 = 1;
> +	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, reader_args->pipe, &evt) == 0,
> +				"Failed to add pipe to epollfd");
>   
>   	clock_gettime(CLOCK_MONOTONIC, &start);
>   	while (1) {
>   		struct uffd_msg msg;
> -		struct pollfd pollfd[2];
> -		char tmp_chr;
>   		int r;
>   
> -		pollfd[0].fd = uffd;
> -		pollfd[0].events = POLLIN;
> -		pollfd[1].fd = reader_args->pipe;
> -		pollfd[1].events = POLLIN;
> -
> -		r = poll(pollfd, 2, -1);
> -		switch (r) {
> -		case -1:
> -			pr_info("poll err");
> -			continue;
> -		case 0:
> -			continue;
> -		case 1:
> -			break;
> -		default:
> -			pr_info("Polling uffd returned %d", r);
> -			return NULL;
> -		}
> +		r = epoll_wait(epollfd, &evt, 1, -1);
> +		TEST_ASSERT(r == 1,
> +					"Unexpected number of events (%d) from epoll, errno = %d",
> +					r, errno);
>   
too much indentation, also seen elsewhere.

> -		if (pollfd[0].revents & POLLERR) {
> -			pr_info("uffd revents has POLLERR");
> -			return NULL;
> -		}
> +		if (evt.data.u32 == 1) {
> +			char tmp_chr;
>   
> -		if (pollfd[1].revents & POLLIN) {
> -			r = read(pollfd[1].fd, &tmp_chr, 1);
> +			TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
> +						"Reader thread received EPOLLERR or EPOLLHUP on pipe.");
> +			r = read(reader_args->pipe, &tmp_chr, 1);
>   			TEST_ASSERT(r == 1,
> -				    "Error reading pipefd in UFFD thread\n");
> +						"Error reading pipefd in uffd reader thread");
>   			return NULL;

How about a goto to
	ts_diff = timespec_elapsed(start);
instead? Otherwise the final stats never get calculated.
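
Something along these lines (rough, untested sketch):

			r = read(reader_args->pipe, &tmp_chr, 1);
			TEST_ASSERT(r == 1,
				    "Error reading pipefd in uffd reader thread");
			goto done;
		}
		/* ... rest of the loop unchanged ... */
	}
done:
	ts_diff = timespec_elapsed(start);
	/* existing stats accounting, then */
	return NULL;
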

>   		}
>   
> -		if (!(pollfd[0].revents & POLLIN))
> -			continue;
> +		TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
> +					"Reader thread received EPOLLERR or EPOLLHUP on uffd.");
>   
>   		r = read(uffd, &msg, sizeof(msg));
>   		if (r == -1) {
> -			if (errno == EAGAIN)
> -				continue;
> -			pr_info("Read of uffd got errno %d\n", errno);
> -			return NULL;
> +			TEST_ASSERT(errno == EAGAIN,
> +						"Error reading from UFFD: errno = %d", errno);
> +			continue;
>   		}
>   
> -		if (r != sizeof(msg)) {
> -			pr_info("Read on uffd returned unexpected size: %d bytes", r);
> -			return NULL;
> -		}
> +		TEST_ASSERT(r == sizeof(msg),
> +					"Read on uffd returned unexpected number of bytes (%d)", r);
>   
>   		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
>   			continue;
> @@ -93,8 +89,8 @@ static void *uffd_handler_thread_fn(void *arg)
>   		if (reader_args->delay)
>   			usleep(reader_args->delay);
>   		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
> -		if (r < 0)
> -			return NULL;
> +		TEST_ASSERT(r >= 0,
> +					"Reader thread handler fn returned negative value %d", r);
>   		pages++;
>   	}
>   


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
  2023-04-12 21:34 ` [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
@ 2023-04-19 13:51   ` Hoo Robert
  2023-04-20 17:55     ` Anish Moorthy
  0 siblings, 1 reply; 103+ messages in thread
From: Hoo Robert @ 2023-04-19 13:51 UTC (permalink / raw)
  To: Anish Moorthy, pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

On 4/13/2023 5:34 AM, Anish Moorthy wrote:
> At the moment, demand_paging_test does not support profiling/testing
> multiple vCPU threads concurrently faulting on a single uffd because
> 
>      (a) "-u" (run test in userfaultfd mode) creates a uffd for each vCPU's
>          region, so that each uffd services a single vCPU thread.
>      (b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
>          simply doesn't work: the test tries to register the same memory
>          to multiple uffds, causing an error.
> 
> Add support for many vcpus per uffd by
>      (1) Keeping "-u" behavior unchanged.
>      (2) Making "-u -a" create a single uffd for all of guest memory.
>      (3) Making "-u -o" implicitly pass "-a", solving the problem in (b).
> In cases (2) and (3) all vCPU threads fault on a single uffd.
> 
> With multiple potentially multiple vCPU per UFFD, it makes sense to
        ^^^^^^^^
redundant "multiple"?

> allow configuring the number reader threads per UFFD as well: add the
> "-r" flag to do so.
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
>   .../selftests/kvm/aarch64/page_fault_test.c   |  4 +-
>   .../selftests/kvm/demand_paging_test.c        | 62 +++++++++----
>   .../selftests/kvm/include/userfaultfd_util.h  | 18 +++-
>   .../selftests/kvm/lib/userfaultfd_util.c      | 86 +++++++++++++------
>   4 files changed, 124 insertions(+), 46 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
> index df10f1ffa20d9..3b6d228a9340d 100644
> --- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
> +++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
> @@ -376,14 +376,14 @@ static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
>   		*pt_uffd = uffd_setup_demand_paging(uffd_mode, 0,
>   						    pt_args.hva,
>   						    pt_args.paging_size,
> -						    test->uffd_pt_handler);
> +						    1, test->uffd_pt_handler);
>   
>   	*data_uffd = NULL;
>   	if (test->uffd_data_handler)
>   		*data_uffd = uffd_setup_demand_paging(uffd_mode, 0,
>   						      data_args.hva,
>   						      data_args.paging_size,
> -						      test->uffd_data_handler);
> +						      1, test->uffd_data_handler);
>   }
>   
>   static void free_uffd(struct test_desc *test, struct uffd_desc *pt_uffd,
> diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> index b0e1fc4de9e29..6c2253f4a64ef 100644
> --- a/tools/testing/selftests/kvm/demand_paging_test.c
> +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> @@ -77,9 +77,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
>   		copy.mode = 0;
>   
>   		r = ioctl(uffd, UFFDIO_COPY, &copy);
> -		if (r == -1) {
> -			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
> -				addr, tid, errno);
> +		/*
> +		 * With multiple vCPU threads fault on a single page and there are
> +		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
> +		 * will fail with EEXIST: handle that case without signaling an
> +		 * error.
> +		 */

But this code path is also gone through in other cases, isn't it? In
those cases, is it still safe to ignore EEXIST?

> +		if (r == -1 && errno != EEXIST) {
> +			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
> +					addr, tid, errno);

unintended indent changes I think.

>   			return r;
>   		}
>   	} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
> @@ -89,9 +95,10 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
>   		cont.range.len = demand_paging_size;
>   
>   		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
> -		if (r == -1) {
> -			pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
> -				addr, tid, errno);
> +		/* See the note about EEXISTs in the UFFDIO_COPY branch. */

Personally I would suggest copying the comment here: what if some day
the code/comment above gets changed or deleted?

> +		if (r == -1 && errno != EEXIST) {
> +			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
> +					addr, tid, errno);

Ditto

>   			return r;
>   		}
>   	} else {
> @@ -110,7 +117,9 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
>   
>   struct test_params {
>   	int uffd_mode;
> +	bool single_uffd;
>   	useconds_t uffd_delay;
> +	int readers_per_uffd;
>   	enum vm_mem_backing_src_type src_type;
>   	bool partition_vcpu_memory_access;
>   };
> @@ -133,7 +142,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   	struct timespec start;
>   	struct timespec ts_diff;
>   	struct kvm_vm *vm;
> -	int i;
> +	int i, num_uffds = 0;
> +	uint64_t uffd_region_size;
>   
>   	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
>   				 p->src_type, p->partition_vcpu_memory_access);
> @@ -146,10 +156,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   	memset(guest_data_prototype, 0xAB, demand_paging_size);
>   
>   	if (p->uffd_mode) {
> -		uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *));
> +		num_uffds = p->single_uffd ? 1 : nr_vcpus;
> +		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
> +
> +		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
>   		TEST_ASSERT(uffd_descs, "Memory allocation failed");
>   
> -		for (i = 0; i < nr_vcpus; i++) {
> +		for (i = 0; i < num_uffds; i++) {
>   			struct memstress_vcpu_args *vcpu_args;
>   			void *vcpu_hva;
>   			void *vcpu_alias;
> @@ -160,8 +173,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   			vcpu_hva = addr_gpa2hva(vm, vcpu_args->gpa);
>   			vcpu_alias = addr_gpa2alias(vm, vcpu_args->gpa);
>   
> -			prefault_mem(vcpu_alias,
> -				vcpu_args->pages * memstress_args.guest_page_size);
> +			prefault_mem(vcpu_alias, uffd_region_size);
>   
>   			/*
>   			 * Set up user fault fd to handle demand paging
> @@ -169,7 +181,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   			 */
>   			uffd_descs[i] = uffd_setup_demand_paging(
>   				p->uffd_mode, p->uffd_delay, vcpu_hva,
> -				vcpu_args->pages * memstress_args.guest_page_size,
> +				uffd_region_size,
> +				p->readers_per_uffd,
>   				&handle_uffd_page_request);
>   		}
>   	}
> @@ -186,7 +199,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   
>   	if (p->uffd_mode) {
>   		/* Tell the user fault fd handler threads to quit */
> -		for (i = 0; i < nr_vcpus; i++)
> +		for (i = 0; i < num_uffds; i++)
>   			uffd_stop_demand_paging(uffd_descs[i]);
>   	}
>   
> @@ -206,14 +219,19 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   static void help(char *name)
>   {
>   	puts("");
> -	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-d uffd_delay_usec]\n"
> -	       "          [-b memory] [-s type] [-v vcpus] [-o]\n", name);
> +	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
> +		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
> +		   "          [-s type] [-v vcpus] [-o]\n", name);

Ditto

>   	guest_modes_help();
>   	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
>   	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
> +	printf(" -a: Use a single userfaultfd for all of guest memory, instead of\n"
> +		   "     creating one for each region paged by a unique vCPU\n"
> +		   "     Set implicitly with -o, and no effect without -u.\n");

Ditto

>   	printf(" -d: add a delay in usec to the User Fault\n"
>   	       "     FD handler to simulate demand paging\n"
>   	       "     overheads. Ignored without -u.\n");
> +	printf(" -r: Set the number of reader threads per uffd.\n");
>   	printf(" -b: specify the size of the memory region which should be\n"
>   	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
>   	       "     Default: 1G\n");
> @@ -231,12 +249,14 @@ int main(int argc, char *argv[])
>   	struct test_params p = {
>   		.src_type = DEFAULT_VM_MEM_SRC,
>   		.partition_vcpu_memory_access = true,
> +		.readers_per_uffd = 1,
> +		.single_uffd = false,
>   	};
>   	int opt;
>   
>   	guest_modes_append_default();
>   
> -	while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:o")) != -1) {
> +	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
>   		switch (opt) {
>   		case 'm':
>   			guest_modes_cmdline(optarg);
> @@ -248,6 +268,9 @@ int main(int argc, char *argv[])
>   				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
>   			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
>   			break;
> +		case 'a':
> +			p.single_uffd = true;
> +			break;
>   		case 'd':
>   			p.uffd_delay = strtoul(optarg, NULL, 0);
>   			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
> @@ -265,6 +288,13 @@ int main(int argc, char *argv[])
>   			break;
>   		case 'o':
>   			p.partition_vcpu_memory_access = false;
> +			p.single_uffd = true;
> +			break;
> +		case 'r':
> +			p.readers_per_uffd = atoi(optarg);
> +			TEST_ASSERT(p.readers_per_uffd >= 1,
> +						"Invalid number of readers per uffd %d: must be >=1",
> +						p.readers_per_uffd);
>   			break;
>   		case 'h':
>   		default:
> diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
> index 877449c345928..92cc1f9ec0686 100644
> --- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
> +++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
> @@ -17,18 +17,30 @@
>   
>   typedef int (*uffd_handler_t)(int uffd_mode, int uffd, struct uffd_msg *msg);
>   
> +struct uffd_reader_args {
> +	int uffd_mode;
> +	int uffd;
> +	useconds_t delay;
> +	uffd_handler_t handler;
> +	/* Holds the read end of the pipe for killing the reader. */
> +	int pipe;
> +};
> +
>   struct uffd_desc {
>   	int uffd_mode;
>   	int uffd;
> -	int pipefds[2];
>   	useconds_t delay;
>   	uffd_handler_t handler;
> -	pthread_t thread;
> +	uint64_t num_readers;
> +	/* Holds the write ends of the pipes for killing the readers. */
> +	int *pipefds;
> +	pthread_t *readers;
> +	struct uffd_reader_args *reader_args;
>   };
>   
>   struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
>   					   void *hva, uint64_t len,
> -					   uffd_handler_t handler);
> +					   uint64_t num_readers, uffd_handler_t handler);
>   
>   void uffd_stop_demand_paging(struct uffd_desc *uffd);
>   
> diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> index 92cef20902f1f..2723ee1e3e1b2 100644
> --- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> +++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> @@ -27,10 +27,8 @@
>   
>   static void *uffd_handler_thread_fn(void *arg)
>   {
> -	struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
> -	int uffd = uffd_desc->uffd;
> -	int pipefd = uffd_desc->pipefds[0];
> -	useconds_t delay = uffd_desc->delay;
> +	struct uffd_reader_args *reader_args = (struct uffd_reader_args *)arg;
> +	int uffd = reader_args->uffd;
>   	int64_t pages = 0;
>   	struct timespec start;
>   	struct timespec ts_diff;
> @@ -44,7 +42,7 @@ static void *uffd_handler_thread_fn(void *arg)
>   
>   		pollfd[0].fd = uffd;
>   		pollfd[0].events = POLLIN;
> -		pollfd[1].fd = pipefd;
> +		pollfd[1].fd = reader_args->pipe;
>   		pollfd[1].events = POLLIN;
>   
>   		r = poll(pollfd, 2, -1);
> @@ -92,9 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
>   		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
>   			continue;
>   
> -		if (delay)
> -			usleep(delay);
> -		r = uffd_desc->handler(uffd_desc->uffd_mode, uffd, &msg);
> +		if (reader_args->delay)
> +			usleep(reader_args->delay);
> +		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
>   		if (r < 0)
>   			return NULL;
>   		pages++;
> @@ -110,7 +108,7 @@ static void *uffd_handler_thread_fn(void *arg)
>   
>   struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
>   					   void *hva, uint64_t len,
> -					   uffd_handler_t handler)
> +					   uint64_t num_readers, uffd_handler_t handler)
>   {
>   	struct uffd_desc *uffd_desc;
>   	bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
> @@ -118,14 +116,26 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
>   	struct uffdio_api uffdio_api;
>   	struct uffdio_register uffdio_register;
>   	uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY;
> -	int ret;
> +	int ret, i;
>   
>   	PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n",
>   		       is_minor ? "MINOR" : "MISSING",
>   		       is_minor ? "UFFDIO_CONINUE" : "UFFDIO_COPY");
>   
>   	uffd_desc = malloc(sizeof(struct uffd_desc));
> -	TEST_ASSERT(uffd_desc, "malloc failed");
> +	TEST_ASSERT(uffd_desc, "Failed to malloc uffd descriptor");
> +
> +	uffd_desc->pipefds = malloc(sizeof(int) * num_readers);
> +	TEST_ASSERT(uffd_desc->pipefds, "Failed to malloc pipes");
> +
> +	uffd_desc->readers = malloc(sizeof(pthread_t) * num_readers);
> +	TEST_ASSERT(uffd_desc->readers, "Failed to malloc reader threads");
> +
> +	uffd_desc->reader_args = malloc(
> +		sizeof(struct uffd_reader_args) * num_readers);
> +	TEST_ASSERT(uffd_desc->reader_args, "Failed to malloc reader_args");
> +
> +	uffd_desc->num_readers = num_readers;
>   
>   	/* In order to get minor faults, prefault via the alias. */
>   	if (is_minor)
> @@ -148,18 +158,32 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
>   	TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
>   		    expected_ioctls, "missing userfaultfd ioctls");
>   
> -	ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
> -	TEST_ASSERT(!ret, "Failed to set up pipefd");
> -
>   	uffd_desc->uffd_mode = uffd_mode;
>   	uffd_desc->uffd = uffd;
>   	uffd_desc->delay = delay;
>   	uffd_desc->handler = handler;

Now that this info is encapsulated into the reader args below, it
looks unnecessary to keep it in uffd_desc here.
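e.g. (rough sketch of the idea, not compile-tested) the descriptor
could shrink to roughly:

	struct uffd_desc {
		int uffd;
		uint64_t num_readers;
		/* Holds the write ends of the pipes for killing the readers. */
		int *pipefds;
		pthread_t *readers;
		struct uffd_reader_args *reader_args;
	};

with uffd_mode, delay and handler kept only in the per-reader
uffd_reader_args.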

> -	pthread_create(&uffd_desc->thread, NULL, uffd_handler_thread_fn,
> -		       uffd_desc);
>   
> -	PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n",
> -		       hva, hva + len);
> +	for (i = 0; i < uffd_desc->num_readers; ++i) {
> +		int pipes[2];
> +
> +		ret = pipe2((int *) &pipes, O_CLOEXEC | O_NONBLOCK);
> +		TEST_ASSERT(!ret, "Failed to set up pipefd %i for uffd_desc %p",
> +					i, uffd_desc);
> +
> +		uffd_desc->pipefds[i] = pipes[1];
> +
> +		uffd_desc->reader_args[i].uffd_mode = uffd_mode;
> +		uffd_desc->reader_args[i].uffd = uffd;
> +		uffd_desc->reader_args[i].delay = delay;
> +		uffd_desc->reader_args[i].handler = handler;
> +		uffd_desc->reader_args[i].pipe = pipes[0];
> +
> +		pthread_create(&uffd_desc->readers[i], NULL, uffd_handler_thread_fn,
> +					   &uffd_desc->reader_args[i]);
> +
> +		PER_VCPU_DEBUG("Created uffd thread %i for HVA range [%p, %p)\n",
> +					   i, hva, hva + len);
> +	}
>   
>   	return uffd_desc;
>   }
> @@ -167,19 +191,31 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
>   void uffd_stop_demand_paging(struct uffd_desc *uffd)
>   {
>   	char c = 0;
> -	int ret;
> +	int i, ret;
>   
> -	ret = write(uffd->pipefds[1], &c, 1);
> -	TEST_ASSERT(ret == 1, "Unable to write to pipefd");
> +	for (i = 0; i < uffd->num_readers; ++i) {
> +		ret = write(uffd->pipefds[i], &c, 1);
> +		TEST_ASSERT(
> +			ret == 1, "Unable to write to pipefd %i for uffd_desc %p", i, uffd);
> +	}
>   
> -	ret = pthread_join(uffd->thread, NULL);
> -	TEST_ASSERT(ret == 0, "Pthread_join failed.");
> +	for (i = 0; i < uffd->num_readers; ++i) {
> +		ret = pthread_join(uffd->readers[i], NULL);
> +		TEST_ASSERT(
> +			ret == 0,
> +			"Pthread_join failed on reader thread %i for uffd_desc %p", i, uffd);
> +	}
>   
>   	close(uffd->uffd);
>   
> -	close(uffd->pipefds[1]);
> -	close(uffd->pipefds[0]);
> +	for (i = 0; i < uffd->num_readers; ++i) {
> +		close(uffd->pipefds[i]);
> +		close(uffd->reader_args[i].pipe);
> +	}
>   
> +	free(uffd->pipefds);
> +	free(uffd->readers);
> +	free(uffd->reader_args);
>   	free(uffd);
>   }
>   


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-04-12 21:34 ` [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
@ 2023-04-19 13:57   ` Hoo Robert
  2023-04-20 18:09     ` Anish Moorthy
  2023-06-01 19:52   ` Oliver Upton
  2023-07-04 10:10   ` Kautuk Consul
  2 siblings, 1 reply; 103+ messages in thread
From: Hoo Robert @ 2023-04-19 13:57 UTC (permalink / raw)
  To: Anish Moorthy, pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

On 4/13/2023 5:34 AM, Anish Moorthy wrote:
> KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
> besides a return value of -1 and errno of EFAULT when a vCPU fails an
> access to guest memory.
> 
> Add documentation, updates to the KVM headers, and a helper function
> (kvm_populate_efault_info) for implementing the capability.

kvm_populate_efault_info(): since this is a function name, adding the
"()" would read better.
> 
> Besides simply filling the run struct, kvm_populate_efault_info takes

Ditto

> two safety measures
> 
>    a. It tries to prevent concurrent fills on a single vCPU run struct
>       by checking that the run struct being modified corresponds to the
>       currently loaded vCPU.
>    b. It tries to avoid filling an already-populated run struct by
>       checking whether the exit reason has been modified since entry
>       into KVM_RUN.
> 
> Finally, mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86,
> even though EFAULT annotation are currently totally absent. Picking a
> point to declare the implementation "done" is difficult because
> 
>    1. Annotations will be performed incrementally in subsequent commits
>       across both core and arch-specific KVM.
>    2. The initial series will very likely miss some cases which need
>       annotation. Although these omissions are to be fixed in the future,
>       userspace thus still needs to expect and be able to handle
>       unannotated EFAULTs.
> 
> Given these qualifications, just marking it available here seems the
> least arbitrary thing to do.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>   Documentation/virt/kvm/api.rst | 35 +++++++++++++++++++++++++++
>   arch/arm64/kvm/arm.c           |  1 +
>   arch/x86/kvm/x86.c             |  1 +
>   include/linux/kvm_host.h       | 12 ++++++++++
>   include/uapi/linux/kvm.h       | 16 +++++++++++++
>   tools/include/uapi/linux/kvm.h | 11 +++++++++
>   virt/kvm/kvm_main.c            | 44 ++++++++++++++++++++++++++++++++++
>   7 files changed, 120 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 48fad65568227..f174f43c38d45 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6637,6 +6637,18 @@ array field represents return values. The userspace should update the return
>   values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>   spec refer, https://github.com/riscv/riscv-sbi-doc.
>   
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len; /* in bytes */
> +		} memory_fault;
> +
> +Indicates a vCPU memory fault on the guest physical address range
> +[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
> +
>   ::
>   
>       /* KVM_EXIT_NOTIFY */
> @@ -7670,6 +7682,29 @@ This capability is aimed to mitigate the threat that malicious VMs can
>   cause CPU stuck (due to event windows don't open up) and make the CPU
>   unavailable to host or other VMs.
>   
> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86, arm64
> +:Parameters: args[0] - KVM_MEMORY_FAULT_INFO_ENABLE|DISABLE to enable/disable
> +             the capability.
> +:Returns: 0 on success, or -EINVAL if unsupported or invalid args[0].
> +
> +When enabled, EFAULTs "returned" by KVM_RUN in response to failed vCPU guest
> +memory accesses may be annotated with additional information. When KVM_RUN
> +returns an error with errno=EFAULT, userspace may check the exit reason: if it
> +is KVM_EXIT_MEMORY_FAULT, userspace is then permitted to read the 'memory_fault'
> +member of the run struct.
> +
> +The 'gpa' and 'len' (in bytes) fields describe the range of guest
> +physical memory to which access failed, i.e. [gpa, gpa + len). 'flags' is
> +currently always zero.
> +
> +NOTE: The implementation of this capability is incomplete. Even with it enabled,
> +userspace may receive "bare" EFAULTs (i.e. exit reason !=
> +KVM_EXIT_MEMORY_FAULT) from KVM_RUN. These should be considered bugs and
> +reported to the maintainers.
> +
>   8. Other capabilities.
>   ======================
>   
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index a43e1cb3b7e97..a932346b59f61 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -220,6 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_VCPU_ATTRIBUTES:
>   	case KVM_CAP_PTP_KVM:
>   	case KVM_CAP_ARM_SYSTEM_SUSPEND:
> +	case KVM_CAP_MEMORY_FAULT_INFO:
>   		r = 1;
>   		break;
>   	case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ca73eb066af81..0925678e741de 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4432,6 +4432,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_VAPIC:
>   	case KVM_CAP_ENABLE_CAP:
>   	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
> +	case KVM_CAP_MEMORY_FAULT_INFO:
>   		r = 1;
>   		break;
>   	case KVM_CAP_EXIT_HYPERCALL:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 90edc16d37e59..776f9713f3921 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -805,6 +805,8 @@ struct kvm {
>   	struct notifier_block pm_notifier;
>   #endif
>   	char stats_id[KVM_STATS_NAME_SIZE];
> +
> +	bool fill_efault_info;
>   };
>   
>   #define kvm_err(fmt, ...) \
> @@ -2277,4 +2279,14 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>   /* Max number of entries allowed for each kvm dirty ring */
>   #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>   
> +/*
> + * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
> + * populate the memory_fault field with the given information.
> + *
> + * Does nothing if KVM_CAP_MEMORY_FAULT_INFO is not enabled. WARNs and does
> + * nothing if the exit reason is not KVM_EXIT_UNKNOWN, or if 'vcpu' is not
> + * the current running vcpu.
> + */
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> +					uint64_t gpa, uint64_t len);
>   #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 4003a166328cc..bc73e8381a2bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>   #define KVM_EXIT_RISCV_SBI        35
>   #define KVM_EXIT_RISCV_CSR        36
>   #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38

A struct exit_reason[] string for KVM_EXIT_MEMORY_FAULT can be added as
well.
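e.g., assuming this refers to the exit_reasons_known[] table used by
exit_reason_str() in the selftests, the addition might look roughly
like (untested):

	} exit_reasons_known[] = {
		{KVM_EXIT_UNKNOWN, "UNKNOWN"},
		/* ... */
		{KVM_EXIT_NOTIFY, "NOTIFY"},
		{KVM_EXIT_MEMORY_FAULT, "MEMORY_FAULT"},
	};

so the new exit reason gets a readable name in test output.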
>   
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -505,6 +506,16 @@ struct kvm_run {
>   #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>   			__u32 flags;
>   		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			/*
> +			 * Indicates a memory fault on the guest physical address range
> +			 * [gpa, gpa + len). flags is always zero for now.
> +			 */
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len; /* in bytes */
> +		} memory_fault;
>   		/* Fix the size of the union. */
>   		char padding[256];
>   	};
> @@ -1184,6 +1195,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
>   #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
>   #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
> +#define KVM_CAP_MEMORY_FAULT_INFO 227
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> @@ -2237,4 +2249,8 @@ struct kvm_s390_zpci_op {
>   /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>   #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>   
> +/* flags for KVM_CAP_MEMORY_FAULT_INFO */
> +#define KVM_MEMORY_FAULT_INFO_DISABLE  0
> +#define KVM_MEMORY_FAULT_INFO_ENABLE   1
> +
>   #endif /* __LINUX_KVM_H */
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 4003a166328cc..5c57796364d65 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>   #define KVM_EXIT_RISCV_SBI        35
>   #define KVM_EXIT_RISCV_CSR        36
>   #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>   
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -505,6 +506,16 @@ struct kvm_run {
>   #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>   			__u32 flags;
>   		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			/*
> +			 * Indicates a memory fault on the guest physical address range
> +			 * [gpa, gpa + len). flags is always zero for now.
> +			 */
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len; /* in bytes */
> +		} memory_fault;
>   		/* Fix the size of the union. */
>   		char padding[256];
>   	};
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cf7d3de6f3689..f3effc93cbef3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>   	spin_lock_init(&kvm->mn_invalidate_lock);
>   	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>   	xa_init(&kvm->vcpu_array);
> +	kvm->fill_efault_info = false;
>   
>   	INIT_LIST_HEAD(&kvm->gpc_list);
>   	spin_lock_init(&kvm->gpc_lock);
> @@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
>   			put_pid(oldpid);
>   		}
>   		r = kvm_arch_vcpu_ioctl_run(vcpu);
> +		WARN_ON_ONCE(r == -EFAULT &&
> +					 vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
>   		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
>   		break;
>   	}
> @@ -4672,6 +4675,15 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>   
>   		return r;
>   	}
> +	case KVM_CAP_MEMORY_FAULT_INFO: {
> +		if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap)
> +			|| (cap->args[0] != KVM_MEMORY_FAULT_INFO_ENABLE
> +				&& cap->args[0] != KVM_MEMORY_FAULT_INFO_DISABLE)) {
> +			return -EINVAL;
> +		}
> +		kvm->fill_efault_info = cap->args[0] == KVM_MEMORY_FAULT_INFO_ENABLE;
> +		return 0;
> +	}
>   	default:
>   		return kvm_vm_ioctl_enable_cap(kvm, cap);
>   	}
> @@ -6173,3 +6185,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>   
>   	return init_context.err;
>   }
> +
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> +					uint64_t gpa, uint64_t len)
> +{
> +	if (!vcpu->kvm->fill_efault_info)
> +		return;
> +
> +	preempt_disable();
> +	/*
> +	 * Ensure the this vCPU isn't modifying another vCPU's run struct, which
> +	 * would open the door for races between concurrent calls to this
> +	 * function.
> +	 */
> +	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> +		goto out;
> +	/*
> +	 * Try not to overwrite an already-populated run struct.
> +	 * This isn't a perfect solution, as there's no guarantee that the exit
> +	 * reason is set before the run struct is populated, but it should prevent
> +	 * at least some bugs.
> +	 */
> +	else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
> +		goto out;
> +
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.len = len;
> +	vcpu->run->memory_fault.flags = 0;
> +
> +out:
> +	preempt_enable();
> +}


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
  2023-04-12 21:35 ` [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation Anish Moorthy
@ 2023-04-19 14:00   ` Hoo Robert
  2023-04-20 18:23     ` Anish Moorthy
  2023-04-24 21:02   ` Sean Christopherson
  2023-06-01 18:19   ` Oliver Upton
  2 siblings, 1 reply; 103+ messages in thread
From: Hoo Robert @ 2023-04-19 14:00 UTC (permalink / raw)
  To: Anish Moorthy, pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

On 4/13/2023 5:35 AM, Anish Moorthy wrote:
> Add documentation, memslot flags, useful helper functions, and the
> actual new capability itself.
> 
> Memory fault exits on absent mappings are particularly useful for
> userfaultfd-based postcopy live migration. When many vCPUs fault on a
> single userfaultfd the faults can take a while to surface to userspace
> due to having to contend for uffd wait queue locks. Bypassing the uffd
> entirely by returning information directly to the vCPU exit avoids this
> contention and improves the fault rate.
> 
> Suggested-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>   Documentation/virt/kvm/api.rst | 31 ++++++++++++++++++++++++++++---
>   include/linux/kvm_host.h       |  7 +++++++
>   include/uapi/linux/kvm.h       |  2 ++
>   tools/include/uapi/linux/kvm.h |  1 +
>   virt/kvm/kvm_main.c            |  3 +++
>   5 files changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f174f43c38d45..7967b9909e28b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
>     /* for kvm_userspace_memory_region::flags */
>     #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>     #define KVM_MEM_READONLY	(1UL << 1)
> +  #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
>   
>   This ioctl allows the user to create, modify or delete a guest physical
>   memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1342,12 +1343,15 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>   be identical.  This allows large pages in the guest to be backed by large
>   pages in the host.
>   
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> +The flags field supports three flags
> +
> +1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
>   writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> +use it.
> +2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
>   to make a new slot read-only.  In this case, writes to this memory will be
>   posted to userspace as KVM_EXIT_MMIO exits.
> +3.  KVM_MEM_ABSENT_MAPPING_FAULT: see KVM_CAP_ABSENT_MAPPING_FAULT for details.
>   
>   When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>   the memory region are automatically reflected into the guest.  For example, an
> @@ -7705,6 +7709,27 @@ userspace may receive "bare" EFAULTs (i.e. exit reason !=
>   KVM_EXIT_MEMORY_FAULT) from KVM_RUN. These should be considered bugs and
>   reported to the maintainers.
>   
> +7.35 KVM_CAP_ABSENT_MAPPING_FAULT
> +---------------------------------
> +
> +:Architectures: None
> +:Returns: -EINVAL.
> +
> +The presence of this capability indicates that userspace may pass the
> +KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> +to fail (-EFAULT) in response to page faults for which the userspace page tables
> +do not contain present mappings. Attempting to enable the capability directly
> +will fail.
> +
> +The range of guest physical memory causing the fault is advertised to userspace
> +through KVM_CAP_MEMORY_FAULT_INFO (if it is enabled).
> +
> +Userspace should determine how best to make the mapping present, then take
> +appropriate action. For instance, in the case of absent mappings this might
> +involve establishing the mapping for the first time via UFFDIO_COPY/CONTINUE or
> +faulting the mapping in using MADV_POPULATE_READ/WRITE. After establishing the
> +mapping, userspace can return to KVM to retry the previous memory access.
> +
>   8. Other capabilities.
>   ======================
>   
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 776f9713f3921..2407fc1e52ab8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2289,4 +2289,11 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>    */
>   inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
>   					uint64_t gpa, uint64_t len);
> +
> +static inline bool kvm_slot_fault_on_absent_mapping(
> +							const struct kvm_memory_slot *slot)

Strange line break.

> +{
> +	return slot->flags & KVM_MEM_ABSENT_MAPPING_FAULT;
> +}
> +
>   #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index bc73e8381a2bb..21df449e74648 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
>    */
>   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>   #define KVM_MEM_READONLY	(1UL << 1)
> +#define KVM_MEM_ABSENT_MAPPING_FAULT	(1UL << 2)
>   
>   /* for KVM_IRQ_LINE */
>   struct kvm_irq_level {
> @@ -1196,6 +1197,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
>   #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
>   #define KVM_CAP_MEMORY_FAULT_INFO 227
> +#define KVM_CAP_ABSENT_MAPPING_FAULT 228
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 5c57796364d65..59219da95634c 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
>    */
>   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>   #define KVM_MEM_READONLY	(1UL << 1)
> +#define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
>   
>   /* for KVM_IRQ_LINE */
>   struct kvm_irq_level {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f3be5aa49829a..7cd0ad94726df 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1525,6 +1525,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
>   	valid_flags |= KVM_MEM_READONLY;

Is it better to also go via kvm_vm_ioctl_check_extension() rather than
#ifdef __KVM_HAVE_READONLY_MEM?
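i.e. something along these lines (untested sketch, assuming
KVM_CAP_READONLY_MEM reports the same thing as
__KVM_HAVE_READONLY_MEM), mirroring what the new flag does:

	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

	if (kvm_vm_ioctl_check_extension(NULL, KVM_CAP_READONLY_MEM))
		valid_flags |= KVM_MEM_READONLY;

	if (kvm_vm_ioctl_check_extension(NULL, KVM_CAP_ABSENT_MAPPING_FAULT))
		valid_flags |= KVM_MEM_ABSENT_MAPPING_FAULT;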

>   #endif
>   
> +	if (kvm_vm_ioctl_check_extension(NULL, KVM_CAP_ABSENT_MAPPING_FAULT))
> +		valid_flags |= KVM_MEM_ABSENT_MAPPING_FAULT;
> +
>   	if (mem->flags & ~valid_flags)
>   		return -EINVAL;
>   


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
  2023-04-12 21:35 ` [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
@ 2023-04-19 14:09   ` Hoo Robert
  2023-04-19 16:40     ` Anish Moorthy
  2023-04-20 22:47     ` Anish Moorthy
  2023-04-27 15:48   ` James Houghton
  1 sibling, 2 replies; 103+ messages in thread
From: Hoo Robert @ 2023-04-19 14:09 UTC (permalink / raw)
  To: Anish Moorthy, pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

On 4/13/2023 5:35 AM, Anish Moorthy wrote:
> Demonstrate a (very basic) scheme for supporting memory fault exits.
> 
> From the vCPU threads:
> 1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
>     with the purpose of establishing the absent mappings. Do so with
>     wake_waiters=false to avoid serializing on the userfaultfd wait queue
>     locks.
> 
> 2. When the UFFDIO_COPY/CONTINUE in (1) fails with EEXIST,
>     assume that the mapping was already established but is currently
>     absent [A] and attempt to populate it using MADV_POPULATE_WRITE.
> 
> Issue UFFDIO_COPY/CONTINUEs from the reader threads as well, but with
> wake_waiters=true to ensure that any threads sleeping on the uffd are
> eventually woken up.
> 
> A real VMM would track whether it had already COPY/CONTINUEd pages (eg,
> via a bitmap) to avoid calls destined to EEXIST. However, even the
> naive approach is enough to demonstrate the performance advantages of
> KVM_EXIT_MEMORY_FAULT.
> 
> [A] In reality it is much likelier that the vCPU thread simply lost a
>      race to establish the mapping for the page.
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
>   .../selftests/kvm/demand_paging_test.c        | 209 +++++++++++++-----
>   1 file changed, 155 insertions(+), 54 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> index e84dde345edbc..668bd63d944e7 100644
> --- a/tools/testing/selftests/kvm/demand_paging_test.c
> +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> @@ -15,6 +15,7 @@
>   #include <time.h>
>   #include <pthread.h>
>   #include <linux/userfaultfd.h>
> +#include <sys/mman.h>

+#include <linux/mman.h> is needed for the MADV_POPULATE_WRITE definition.

>   #include <sys/syscall.h>
>   
>   #include "kvm_util.h"
> @@ -31,6 +32,57 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
>   static size_t demand_paging_size;
>   static char *guest_data_prototype;
>   
> +static int num_uffds;
> +static size_t uffd_region_size;
> +static struct uffd_desc **uffd_descs;
> +/*
> + * Delay when demand paging is performed through userfaultfd or directly by
> + * vcpu_worker in the case of a KVM_EXIT_MEMORY_FAULT.
> + */
> +static useconds_t uffd_delay;
> +static int uffd_mode;
> +
> +
> +static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
> +									bool is_vcpu);
> +
> +static void madv_write_or_err(uint64_t gpa)
> +{
> +	int r;
> +	void *hva = addr_gpa2hva(memstress_args.vm, gpa);
> +
> +	r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
> +	TEST_ASSERT(r == 0,
> +				"MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) fail, errno %i\n",
> +				(uintptr_t) hva, gpa, errno);

There are quite a few strange line breaks/indentations across this
patch set, editor's issue?:-)

> +}
> +
> +static void ready_page(uint64_t gpa)
> +{
> +	int r, uffd;
> +
> +	/*
> +	 * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
> +	 * the registered ranges should fault in the physical pages through
> +	 * MADV_POPULATE_WRITE.
> +	 */
> +	if ((gpa < memstress_args.gpa)
> +		|| (gpa >= memstress_args.gpa + memstress_args.size)) {
> +		madv_write_or_err(gpa);
> +	} else {
> +		if (uffd_delay)
> +			usleep(uffd_delay);
> +
> +		uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
> +
> +		r = handle_uffd_page_request(uffd_mode, uffd,
> +					(uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
> +
> +		if (r == EEXIST)
> +			madv_write_or_err(gpa);
> +	}
> +}
> +
>   static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
>   {
>   	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
> @@ -42,25 +94,36 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
>   
>   	clock_gettime(CLOCK_MONOTONIC, &start);
>   
> -	/* Let the guest access its memory */
> -	ret = _vcpu_run(vcpu);
> -	TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> -	if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
> -		TEST_ASSERT(false,
> -			    "Invalid guest sync status: exit_reason=%s\n",
> -			    exit_reason_str(run->exit_reason));
> -	}
> +	while (true) {
> +		/* Let the guest access its memory */
> +		ret = _vcpu_run(vcpu);
> +		TEST_ASSERT(ret == 0
> +					|| (errno == EFAULT
> +						&& run->exit_reason == KVM_EXIT_MEMORY_FAULT),
> +					"vcpu_run failed: %d\n", ret);
> +		if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
> +
> +			if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
> +				ready_page(run->memory_fault.gpa);
> +				continue;
> +			}
> +
> +			TEST_ASSERT(false,

TEST_ASSERT(false, ...) == TEST_FAIL()
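i.e. this could simply be:

	TEST_FAIL("Invalid guest sync status: exit_reason=%s\n",
		  exit_reason_str(run->exit_reason));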

> +						"Invalid guest sync status: exit_reason=%s\n",
> +						exit_reason_str(run->exit_reason));
> +		}
>   
> -	ts_diff = timespec_elapsed(start);
> -	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
> -		       ts_diff.tv_sec, ts_diff.tv_nsec);
> +		ts_diff = timespec_elapsed(start);
> +		PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
> +					   ts_diff.tv_sec, ts_diff.tv_nsec);

I think this vCPU exec time calculation should be outside the while() {} block.
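i.e. roughly (sketch based on the code above, also using TEST_FAIL()
per the earlier comment):

	while (true) {
		/* Let the guest access its memory */
		ret = _vcpu_run(vcpu);
		/* ... existing assertions ... */
		if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
			if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
				ready_page(run->memory_fault.gpa);
				continue;
			}
			TEST_FAIL("Invalid guest sync status: exit_reason=%s\n",
				  exit_reason_str(run->exit_reason));
		}
		break;
	}

	ts_diff = timespec_elapsed(start);
	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
		       ts_diff.tv_sec, ts_diff.tv_nsec);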

> +		break;
> +	}
>   }
>   
> -static int handle_uffd_page_request(int uffd_mode, int uffd,
> -		struct uffd_msg *msg)
> +static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
> +									bool is_vcpu)
>   {
>   	pid_t tid = syscall(__NR_gettid);
> -	uint64_t addr = msg->arg.pagefault.address;
>   	struct timespec start;
>   	struct timespec ts_diff;
>   	int r;
> @@ -71,56 +134,78 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
>   		struct uffdio_copy copy;
>   
>   		copy.src = (uint64_t)guest_data_prototype;
> -		copy.dst = addr;
> +		copy.dst = hva;
>   		copy.len = demand_paging_size;
> -		copy.mode = 0;
> +		copy.mode = UFFDIO_COPY_MODE_DONTWAKE;
>   
> -		r = ioctl(uffd, UFFDIO_COPY, &copy);
>   		/*
> -		 * With multiple vCPU threads fault on a single page and there are
> -		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
> -		 * will fail with EEXIST: handle that case without signaling an
> -		 * error.
> +		 * With multiple vCPU threads and at least one of multiple reader threads
> +		 * or vCPU memory faults, multiple vCPUs accessing an absent page will
> +		 * almost certainly cause some thread doing the UFFDIO_COPY here to get
> +		 * EEXIST: make sure to allow that case.
>   		 */
> -		if (r == -1 && errno != EEXIST) {
> -			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
> -					addr, tid, errno);
> -			return r;
> -		}
> +		r = ioctl(uffd, UFFDIO_COPY, &copy);
> +		TEST_ASSERT(r == 0 || errno == EEXIST,
> +			"Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
> +			gettid(), hva, errno);

Can this gettid() be substituted with the tid above? Or #include the
header file for its prototype; otherwise there will be a build
warning/error.
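e.g. reusing the tid already computed at the top of the function
(sketch):

	TEST_ASSERT(r == 0 || errno == EEXIST,
		    "Thread %d failed UFFDIO_COPY on hva 0x%lx, errno = %d",
		    tid, hva, errno);

Otherwise gettid() needs its declaration (IIRC _GNU_SOURCE plus
<unistd.h> on newer glibc, or a syscall(__NR_gettid) wrapper).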

>   	} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
> +		/* The comments in the UFFDIO_COPY branch also apply here. */
>   		struct uffdio_continue cont = {0};
>   
> -		cont.range.start = addr;
> +		cont.range.start = hva;
>   		cont.range.len = demand_paging_size;
> +		cont.mode = UFFDIO_CONTINUE_MODE_DONTWAKE;
>   
>   		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
> -		/* See the note about EEXISTs in the UFFDIO_COPY branch. */
> -		if (r == -1 && errno != EEXIST) {
> -			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
> -					addr, tid, errno);
> -			return r;
> -		}
> +		TEST_ASSERT(r == 0 || errno == EEXIST,
> +			"Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
> +			gettid(), hva, errno);

Ditto

>   	} else {
>   		TEST_FAIL("Invalid uffd mode %d", uffd_mode);
>   	}
>   
> +	/*
> +	 * If the above UFFDIO_COPY/CONTINUE fails with EEXIST, it will do so without
> +	 * waking threads waiting on the UFFD: make sure that happens here.
> +	 */
> +	if (!is_vcpu) {
> +		struct uffdio_range range = {
> +			.start = hva,
> +			.len = demand_paging_size
> +		};
> +		r = ioctl(uffd, UFFDIO_WAKE, &range);
> +		TEST_ASSERT(
> +			r == 0,
> +			"Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
> +			gettid(), hva, errno);

Ditto

> +	}
> +
>   	ts_diff = timespec_elapsed(start);
>   
>   	PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
>   		       timespec_to_ns(ts_diff));
>   	PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
> -		       demand_paging_size, addr, tid);
> +		       demand_paging_size, hva, tid);
>   
>   	return 0;
>   }
>   
> +static int handle_uffd_page_request_from_uffd(int uffd_mode, int uffd,
> +				struct uffd_msg *msg)
> +{
> +	TEST_ASSERT(msg->event == UFFD_EVENT_PAGEFAULT,
> +		"Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
> +		msg->event);
> +	return handle_uffd_page_request(uffd_mode, uffd,
> +					msg->arg.pagefault.address, false);
> +}
> +
>   struct test_params {
> -	int uffd_mode;
>   	bool single_uffd;
> -	useconds_t uffd_delay;
>   	int readers_per_uffd;
>   	enum vm_mem_backing_src_type src_type;
>   	bool partition_vcpu_memory_access;
> +	bool memfault_exits;
>   };
>   
>   static void prefault_mem(void *alias, uint64_t len)
> @@ -137,15 +222,26 @@ static void prefault_mem(void *alias, uint64_t len)
>   static void run_test(enum vm_guest_mode mode, void *arg)
>   {
>   	struct test_params *p = arg;
> -	struct uffd_desc **uffd_descs = NULL;
>   	struct timespec start;
>   	struct timespec ts_diff;
>   	struct kvm_vm *vm;
> -	int i, num_uffds = 0;
> -	uint64_t uffd_region_size;
> +	int i;
> +	uint32_t slot_flags = 0;
> +	bool uffd_memfault_exits = uffd_mode && p->memfault_exits;
> +
> +	if (uffd_memfault_exits) {
> +		TEST_ASSERT(kvm_has_cap(KVM_CAP_ABSENT_MAPPING_FAULT) > 0,
> +					"KVM does not have KVM_CAP_ABSENT_MAPPING_FAULT");
> +		slot_flags = KVM_MEM_ABSENT_MAPPING_FAULT;
> +	}
>   
>   	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
> -				1, 0, p->src_type, p->partition_vcpu_memory_access);
> +				1, slot_flags, p->src_type, p->partition_vcpu_memory_access);
> +
> +	if (uffd_memfault_exits) {
> +		vm_enable_cap(vm,
> +					  KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_FAULT_INFO_ENABLE);
> +	}
>   
>   	demand_paging_size = get_backing_src_pagesz(p->src_type);
>   
> @@ -154,12 +250,12 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   		    "Failed to allocate buffer for guest data pattern");
>   	memset(guest_data_prototype, 0xAB, demand_paging_size);
>   
> -	if (p->uffd_mode) {
> +	if (uffd_mode) {
>   		num_uffds = p->single_uffd ? 1 : nr_vcpus;
>   		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
>   
>   		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
> -		TEST_ASSERT(uffd_descs, "Memory allocation failed");
> +		TEST_ASSERT(uffd_descs, "Failed to allocate memory of uffd descriptors");
>   
>   		for (i = 0; i < num_uffds; i++) {
>   			struct memstress_vcpu_args *vcpu_args;
> @@ -179,10 +275,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   			 * requests.
>   			 */
>   			uffd_descs[i] = uffd_setup_demand_paging(
> -				p->uffd_mode, p->uffd_delay, vcpu_hva,
> +				uffd_mode, uffd_delay, vcpu_hva,
>   				uffd_region_size,
>   				p->readers_per_uffd,
> -				&handle_uffd_page_request);
> +				&handle_uffd_page_request_from_uffd);
>   		}
>   	}
>   
> @@ -196,7 +292,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   	ts_diff = timespec_elapsed(start);
>   	pr_info("All vCPU threads joined\n");
>   
> -	if (p->uffd_mode) {
> +	if (uffd_mode) {
>   		/* Tell the user fault fd handler threads to quit */
>   		for (i = 0; i < num_uffds; i++)
>   			uffd_stop_demand_paging(uffd_descs[i]);
> @@ -211,7 +307,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   	memstress_destroy_vm(vm);
>   
>   	free(guest_data_prototype);
> -	if (p->uffd_mode)
> +	if (uffd_mode)
>   		free(uffd_descs);
>   }
>   
> @@ -220,7 +316,7 @@ static void help(char *name)
>   	puts("");
>   	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
>   		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
> -		   "          [-s type] [-v vcpus] [-o]\n", name);
> +		   "          [-w] [-s type] [-v vcpus] [-o]\n", name);
>   	guest_modes_help();
>   	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
>   	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
> @@ -231,6 +327,7 @@ static void help(char *name)
>   	       "     FD handler to simulate demand paging\n"
>   	       "     overheads. Ignored without -u.\n");
>   	printf(" -r: Set the number of reader threads per uffd.\n");
> +	printf(" -w: Enable kvm cap for memory fault exits.\n");
>   	printf(" -b: specify the size of the memory region which should be\n"
>   	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
>   	       "     Default: 1G\n");
> @@ -250,29 +347,30 @@ int main(int argc, char *argv[])
>   		.partition_vcpu_memory_access = true,
>   		.readers_per_uffd = 1,
>   		.single_uffd = false,
> +		.memfault_exits = false,
>   	};
>   	int opt;
>   
>   	guest_modes_append_default();
>   
> -	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
> +	while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:r:")) != -1) {
>   		switch (opt) {
>   		case 'm':
>   			guest_modes_cmdline(optarg);
>   			break;
>   		case 'u':
>   			if (!strcmp("MISSING", optarg))
> -				p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
> +				uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
>   			else if (!strcmp("MINOR", optarg))
> -				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
> -			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
> +				uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
> +			TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
>   			break;
>   		case 'a':
>   			p.single_uffd = true;
>   			break;
>   		case 'd':
> -			p.uffd_delay = strtoul(optarg, NULL, 0);
> -			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
> +			uffd_delay = strtoul(optarg, NULL, 0);
> +			TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
>   			break;
>   		case 'b':
>   			guest_percpu_mem_size = parse_size(optarg);
> @@ -295,6 +393,9 @@ int main(int argc, char *argv[])
>   						"Invalid number of readers per uffd %d: must be >=1",
>   						p.readers_per_uffd);
>   			break;
> +		case 'w':
> +			p.memfault_exits = true;
> +			break;
>   		case 'h':
>   		default:
>   			help(argv[0]);
> @@ -302,7 +403,7 @@ int main(int argc, char *argv[])
>   		}
>   	}
>   
> -	if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
> +	if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
>   	    !backing_src_is_shared(p.src_type)) {
>   		TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
>   	}


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
  2023-04-19 14:09   ` Hoo Robert
@ 2023-04-19 16:40     ` Anish Moorthy
  2023-04-20 22:47     ` Anish Moorthy
  1 sibling, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-19 16:40 UTC (permalink / raw)
  To: Hoo Robert
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Wed, Apr 19, 2023 at 7:10 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
>
> There are quite a few strange line breaks/indentations across this
> patch set, editor's issue?:-)

A combination of editor issues and inconsistency on my part, I think;
that's been a bit of a theme :/ Thanks for pointing out so many
places. I'll figure out what's going wrong (and look at your
non-style-related feedback as well :)

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (21 preceding siblings ...)
  2023-04-12 21:35 ` [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
@ 2023-04-19 19:55 ` Peter Xu
  2023-04-19 20:15   ` Axel Rasmussen
  2023-05-09 22:19 ` David Matlack
  23 siblings, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-04-19 19:55 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, kvm, kvmarm

Hi, Anish,

On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> KVM's demand paging self test is extended to demonstrate the performance
> benefits of using the two new capabilities to bypass the userfaultfd
> wait queue. The performance samples below (rates in thousands of
> pages/s, n = 5), were generated using [2] on an x86 machine with 256
> cores.
> 
> vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> 1       150     340
> 2       191     477
> 4       210     809
> 8       155     1239
> 16      130     1595
> 32      108     2299
> 64      86      3482
> 128     62      4134
> 256     36      4012

The number looks very promising.  Though..

> 
> [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
>     A quick rundown of the new flags (also detailed in later commits)
>         -a registers all of guest memory to a single uffd.

... this is the worst case scenario.  I'd say it's slightly unfair to
compare by first introducing a bottleneck then compare with it. :)

Jokes aside: I'd think it'll make more sense if such a performance solution
will be measured on real systems showing real benefits, because so far it's
still not convincing enough if it's only with the test especially with only
one uffd.

I don't remember whether I used to discuss this with James before, but..

I know that having multiple uffds in productions also means scattered guest
memory and scattered VMAs all over the place.  However split the guest
large mem into at least a few (or even tens of) VMAs may still be something
worth trying?  Do you think that'll already solve some of the contentions
on userfaultfd, either on the queue or else?

With a bunch of VMAs and userfaultfds (paired with uffd fault handler
threads, totally separate uffd queues), I'd expect to some extent other
things can pop up already, e.g., the network bandwidth, without teaching
each vcpu thread to report uffd faults themselves.

These are my pure imaginations though, I think that's also why it'll be
great if such a solution can be tested more or less on a real migration
scenario to show its real benefits.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-19 19:55 ` [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Peter Xu
@ 2023-04-19 20:15   ` Axel Rasmussen
  2023-04-19 21:05     ` Peter Xu
  0 siblings, 1 reply; 103+ messages in thread
From: Axel Rasmussen @ 2023-04-19 20:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Anish Moorthy, pbonzini, maz, oliver.upton, seanjc, jthoughton,
	bgardon, dmatlack, ricarkol, kvm, kvmarm

On Wed, Apr 19, 2023 at 12:56 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Anish,
>
> On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > KVM's demand paging self test is extended to demonstrate the performance
> > benefits of using the two new capabilities to bypass the userfaultfd
> > wait queue. The performance samples below (rates in thousands of
> > pages/s, n = 5), were generated using [2] on an x86 machine with 256
> > cores.
> >
> > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > 1       150     340
> > 2       191     477
> > 4       210     809
> > 8       155     1239
> > 16      130     1595
> > 32      108     2299
> > 64      86      3482
> > 128     62      4134
> > 256     36      4012
>
> The number looks very promising.  Though..
>
> >
> > [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> >     A quick rundown of the new flags (also detailed in later commits)
> >         -a registers all of guest memory to a single uffd.
>
> ... this is the worst case scenario.  I'd say it's slightly unfair to
> compare by first introducing a bottleneck then compare with it. :)
>
> Jokes aside: I'd think it'll make more sense if such a performance solution
> will be measured on real systems showing real benefits, because so far it's
> still not convincing enough if it's only with the test especially with only
> one uffd.
>
> I don't remember whether I used to discuss this with James before, but..
>
> I know that having multiple uffds in productions also means scattered guest
> memory and scattered VMAs all over the place.  However split the guest
> large mem into at least a few (or even tens of) VMAs may still be something
> worth trying?  Do you think that'll already solve some of the contentions
> on userfaultfd, either on the queue or else?

We considered sharding into several UFFDs. I do think it helps, but
also I think there are two main problems with it:

- One is, I think there's a limit to how much you'd want to do that.
E.g. splitting guest memory in 1/2, or in 1/10, could be reasonable,
but 1/100 or 1/1000 might become ridiculous in terms of the
"scattering" of VMAs and so on like you mentioned. Especially for very
large VMs (e.g. consider Google offers VMs with ~11T of RAM [1]) I'm
not sure splitting just "slightly" is enough to get good performance.

- Another is, sharding UFFDs sort of assumes accesses are randomly
distributed across the guest physical address space. I'm not sure this
is guaranteed for all possible VMs / customer workloads. In other
words, even if we shard across several UFFDs, we may end up with a
small number of them being "hot".

A benefit to Anish's series is that it solves the problem more
fundamentally, and allows demand paging with no "global" locking. So,
it will scale better regardless of VM size, or access pattern.

[1]: https://cloud.google.com/compute/docs/memory-optimized-machines

>
> With a bunch of VMAs and userfaultfds (paired with uffd fault handler
> threads, totally separate uffd queues), I'd expect to some extent other
> things can pop up already, e.g., the network bandwidth, without teaching
> each vcpu thread to report uffd faults themselves.
>
> These are my pure imaginations though, I think that's also why it'll be
> great if such a solution can be tested more or less on a real migration
> scenario to show its real benefits.

I wonder, is there an existing open source QEMU/KVM based live
migration stress test?

I think we could share numbers from some of our internal benchmarks,
or at the very least give relative numbers (e.g. +50% increase), but
since a lot of the software stack is proprietary (e.g. we don't use
QEMU), it may not be that useful or reproducible for folks.

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-19 20:15   ` Axel Rasmussen
@ 2023-04-19 21:05     ` Peter Xu
       [not found]       ` <CAF7b7mo68VLNp=QynfT7QKgdq=d1YYGv1SEVEDxF9UwHzF6YDw@mail.gmail.com>
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-04-19 21:05 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Anish Moorthy, pbonzini, maz, oliver.upton, seanjc, jthoughton,
	bgardon, dmatlack, ricarkol, kvm, kvmarm

On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> On Wed, Apr 19, 2023 at 12:56 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Anish,
> >
> > On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > > KVM's demand paging self test is extended to demonstrate the performance
> > > benefits of using the two new capabilities to bypass the userfaultfd
> > > wait queue. The performance samples below (rates in thousands of
> > > pages/s, n = 5), were generated using [2] on an x86 machine with 256
> > > cores.
> > >
> > > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > > 1       150     340
> > > 2       191     477
> > > 4       210     809
> > > 8       155     1239
> > > 16      130     1595
> > > 32      108     2299
> > > 64      86      3482
> > > 128     62      4134
> > > 256     36      4012
> >
> > The number looks very promising.  Though..
> >
> > >
> > > [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> > > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> > >     A quick rundown of the new flags (also detailed in later commits)
> > >         -a registers all of guest memory to a single uffd.
> >
> > ... this is the worst case scenario.  I'd say it's slightly unfair to
> > compare by first introducing a bottleneck then compare with it. :)
> >
> > Jokes aside: I'd think it'll make more sense if such a performance solution
> > will be measured on real systems showing real benefits, because so far it's
> > still not convincing enough if it's only with the test especially with only
> > one uffd.
> >
> > I don't remember whether I used to discuss this with James before, but..
> >
> > I know that having multiple uffds in productions also means scattered guest
> > memory and scattered VMAs all over the place.  However split the guest
> > large mem into at least a few (or even tens of) VMAs may still be something
> > worth trying?  Do you think that'll already solve some of the contentions
> > on userfaultfd, either on the queue or else?
> 
> We considered sharding into several UFFDs. I do think it helps, but
> also I think there are two main problems with it:
> 
> - One is, I think there's a limit to how much you'd want to do that.
> E.g. splitting guest memory in 1/2, or in 1/10, could be reasonable,
> but 1/100 or 1/1000 might become ridiculous in terms of the
> "scattering" of VMAs and so on like you mentioned. Especially for very
> large VMs (e.g. consider Google offers VMs with ~11T of RAM [1]) I'm
> not sure splitting just "slightly" is enough to get good performance.
> 
> - Another is, sharding UFFDs sort of assumes accesses are randomly
> distributed across the guest physical address space. I'm not sure this
> is guaranteed for all possible VMs / customer workloads. In other
> words, even if we shard across several UFFDs, we may end up with a
> small number of them being "hot".

I never tried to monitor this, but I had a feeling that it's actually
harder to maintain physical contiguity of the pages being used and accessed,
at least on Linux.

The more likely case to me is that system pages become scattered very easily
a few hours after boot, unless special care is taken, e.g., by using hugetlb
pages or reservations for a specific purpose.

I also think that's normally optimal for the system, e.g., NUMA balancing
will keep nodes / cpus using local memory, which helps spread the memory
consumption, hence each core can access different pages that are local to
it.

But I agree I can never justify that it'll always work.  If you or Anish
could provide some data points to further support this issue that would be
very interesting and helpful, IMHO, not required though.

> 
> A benefit to Anish's series is that it solves the problem more
> fundamentally, and allows demand paging with no "global" locking. So,
> it will scale better regardless of VM size, or access pattern.
> 
> [1]: https://cloud.google.com/compute/docs/memory-optimized-machines
> 
> >
> > With a bunch of VMAs and userfaultfds (paired with uffd fault handler
> > threads, totally separate uffd queues), I'd expect to some extent other
> > things can pop up already, e.g., the network bandwidth, without teaching
> > each vcpu thread to report uffd faults themselves.
> >
> > These are my pure imaginations though, I think that's also why it'll be
> > great if such a solution can be tested more or less on a real migration
> > scenario to show its real benefits.
> 
> I wonder, is there an existing open source QEMU/KVM based live
> migration stress test?

I am not aware of any.

> 
> I think we could share numbers from some of our internal benchmarks,
> or at the very least give relative numbers (e.g. +50% increase), but
> since a lot of the software stack is proprietary (e.g. we don't use
> QEMU), it may not be that useful or reproducible for folks.

Those numbers can still be helpful.  I was not asking for reproducibility,
but some test to better justify this feature.

IMHO the demand paging test (at least the current one) may or may not be a
good test to show the value of this specific feature.  With a single uffd, it
obviously bottlenecks on that one uffd, so it doesn't explain whether
scaling the number of uffds could help.

But it's not friendly to multi-uffd either, because it'll be the other
extreme case where all mem accesses are spread across the cores, so the
feature probably won't show a result proving it's worthwhile.

From another aspect, if a kernel feature is proposed it'll always be nice
(and sometimes mandatory) to have at least one user of it (besides the unit
tests).  I think that should also include proprietary software.  It
doesn't need to be used already in production, but some POC would
definitely be very helpful to move a feature forward towards community
acceptance.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
  2023-04-19 13:36   ` Hoo Robert
@ 2023-04-19 23:26     ` Anish Moorthy
  0 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-19 23:26 UTC (permalink / raw)
  To: Hoo Robert
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Wed, Apr 19, 2023 at 6:36 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
>
> How about goto
>         ts_diff = timespec_elapsed(start);
> Otherwise last stats won't get chances to be calc'ed.

Good idea, done.

> > +             TEST_ASSERT(r == 1,
> > +                                     "Unexpected number of events (%d) from epoll, errno = %d",
> > +                                     r, errno);
> >
> too much indentation, also seen elsewhere.

Augh, my editor has been set to a tab width of 4 this entire time.
That... explains a lot >:(

> >               }
> >
> > -             if (!(pollfd[0].revents & POLLIN))
> > -                     continue;
> > +             TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
> > +                                     "Reader thread received EPOLLERR or EPOLLHUP on uffd.");
> >
> >               r = read(uffd, &msg, sizeof(msg));
> >               if (r == -1) {
> > -                     if (errno == EAGAIN)
> > -                             continue;
> > -                     pr_info("Read of uffd got errno %d\n", errno);
> > -                     return NULL;
> > +                     TEST_ASSERT(errno == EAGAIN,
> > +                                             "Error reading from UFFD: errno = %d", errno);
> > +                     continue;
> >               }
> >
> > -             if (r != sizeof(msg)) {
> > -                     pr_info("Read on uffd returned unexpected size: %d bytes", r);
> > -                     return NULL;
> > -             }
> > +             TEST_ASSERT(r == sizeof(msg),
> > +                                     "Read on uffd returned unexpected number of bytes (%d)", r);
> >
> >               if (!(msg.event & UFFD_EVENT_PAGEFAULT))
> >                       continue;
> > @@ -93,8 +89,8 @@ static void *uffd_handler_thread_fn(void *arg)
> >               if (reader_args->delay)
> >                       usleep(reader_args->delay);
> >               r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
> > -             if (r < 0)
> > -                     return NULL;
> > +             TEST_ASSERT(r >= 0,
> > +                                     "Reader thread handler fn returned negative value %d", r);
> >               pages++;
> >       }
> >
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
  2023-04-19 13:51   ` Hoo Robert
@ 2023-04-20 17:55     ` Anish Moorthy
  2023-04-21 12:15       ` Robert Hoo
  0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 17:55 UTC (permalink / raw)
  To: Hoo Robert
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Wed, Apr 19, 2023 at 6:51 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
>
> On 4/13/2023 5:34 AM, Anish Moorthy wrote:
> > At the moment, demand_paging_test does not support profiling/testing
> > multiple vCPU threads concurrently faulting on a single uffd because
> >
> >      (a) "-u" (run test in userfaultfd mode) creates a uffd for each vCPU's
> >          region, so that each uffd services a single vCPU thread.
> >      (b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
> >          simply doesn't work: the test tries to register the same memory
> >          to multiple uffds, causing an error.
> >
> > Add support for many vcpus per uffd by
> >      (1) Keeping "-u" behavior unchanged.
> >      (2) Making "-u -a" create a single uffd for all of guest memory.
> >      (3) Making "-u -o" implicitly pass "-a", solving the problem in (b).
> > In cases (2) and (3) all vCPU threads fault on a single uffd.
> >
> > With multiple potentially multiple vCPU per UFFD, it makes sense to
>         ^^^^^^^^
> redundant "multiple"?

Thanks, fixed

> > --- a/tools/testing/selftests/kvm/demand_paging_test.c
> > +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> > @@ -77,9 +77,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
> >               copy.mode = 0;
> >
> >               r = ioctl(uffd, UFFDIO_COPY, &copy);
> > -             if (r == -1) {
> > -                     pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
> > -                             addr, tid, errno);
> > +             /*
> > +              * With multiple vCPU threads fault on a single page and there are
> > +              * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
> > +              * will fail with EEXIST: handle that case without signaling an
> > +              * error.
> > +              */
>
> But this code path is also gone through in other cases, isn't it? In
> those cases, is it still safe to ignore EEXIST?

Good point: the answer is no, it's not always safe to ignore EEXISTs
here. For instance the first UFFDIO_CONTINUE for a page shouldn't be
allowed to EEXIST, and that's swept under the rug here. I've added the
following to the comment

+ * Note that this does sweep under the rug any EEXISTs occurring
+ * from, e.g., the first UFFDIO_COPY/CONTINUEs on a page. A
+ * realistic VMM would maintain some other state to correctly
+ * surface EEXISTs to userspace or prevent duplicate
+ * COPY/CONTINUEs from happening in the first place.

I could add that extra state to the self test (via for instance, an
atomic bitmap that threads "or" into before issuing any
COPY/CONTINUEs) but it's a bit of an extra complication without any
real payoff. Let me know if you think the comment's inadequate though.
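
For reference, the kind of extra state I have in mind is roughly the sketch
below (just an illustration, not something I'm proposing to add to the
test; claim_page() is a made-up helper):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per demand-paged page; allocated zeroed, nr_pages/64 words. */
static _Atomic uint64_t *page_claimed;

/* Returns true only for the first thread to claim the page. */
static bool claim_page(uint64_t pg)
{
	uint64_t mask = 1ULL << (pg % 64);

	return !(atomic_fetch_or(&page_claimed[pg / 64], mask) & mask);
}

A reader thread would call claim_page() before issuing the first
UFFDIO_COPY/CONTINUE for a page, so any EEXIST it sees afterwards really is
an error rather than a lost race.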

> > +             if (r == -1 && errno != EEXIST) {
> > +                     pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
> > +                                     addr, tid, errno);
>
> unintended indent changes I think.
>
> >                       return r;
> >               }
> >       } else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
> > @@ -89,9 +95,10 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
> >               cont.range.len = demand_paging_size;
> >
> >               r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
> > -             if (r == -1) {
> > -                     pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
> > -                             addr, tid, errno);
> > +             /* See the note about EEXISTs in the UFFDIO_COPY branch. */
>
> Personally I would suggest copy the comments here. what if some day above
> code/comment was changed/deleted?

You might be right: on the other hand, if the comment ever gets
updated then it would have to be done in two places. Anyone to break
the tie? :)

> > @@ -148,18 +158,32 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
> >       TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
> >                   expected_ioctls, "missing userfaultfd ioctls");
> >
> > -     ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
> > -     TEST_ASSERT(!ret, "Failed to set up pipefd");
> > -
> >       uffd_desc->uffd_mode = uffd_mode;
> >       uffd_desc->uffd = uffd;
> >       uffd_desc->delay = delay;
> >       uffd_desc->handler = handler;
>
> Now that these info are encapsulated into reader args below, looks
> unnecessary to have them in uffd_desc here.

Good point. I've removed uffd_mode, delay, and handler from uffd_desc.
I left the "uffd" field in because that's a shared resource, and
close()ing it as "close(desc->uffd)" makes more sense than, say,
"close(desc->reader_args[0].uffd)"

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-04-19 13:57   ` Hoo Robert
@ 2023-04-20 18:09     ` Anish Moorthy
  2023-04-21 12:28       ` Robert Hoo
  0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 18:09 UTC (permalink / raw)
  To: Hoo Robert
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Wed, Apr 19, 2023 at 6:57 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
>
> kvm_populate_efault_info(), function name.
> ...
> Ditto

Done

> struct exit_reason[] string for KVM_EXIT_MEMORY_FAULT can be added as
> well.

Done, assuming you mean the exit_reasons_known definition in kvm_util.c

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
  2023-04-19 14:00   ` Hoo Robert
@ 2023-04-20 18:23     ` Anish Moorthy
  0 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 18:23 UTC (permalink / raw)
  To: Hoo Robert
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Wed, Apr 19, 2023 at 7:00 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
> > +static inline bool kvm_slot_fault_on_absent_mapping(
> > +                                                     const struct kvm_memory_slot *slot)
>
> Strange line break.

Fixed: there's now a single indent on the second line.

> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index f3be5aa49829a..7cd0ad94726df 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1525,6 +1525,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
> >       valid_flags |= KVM_MEM_READONLY;
>
> Is it better to also via kvm_vm_ioctl_check_extension() rather than
> #ifdef __KVM_HAVE_READONLY_MEM?

Probably, that's unrelated though so I won't change it here

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
  2023-04-12 21:34 ` [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page() Anish Moorthy
@ 2023-04-20 20:52   ` Peter Xu
  2023-04-20 23:29     ` Anish Moorthy
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-04-20 20:52 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, kvm, kvmarm

On Wed, Apr 12, 2023 at 09:34:55PM +0000, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for efaults from
> kvm_vcpu_write_guest_page()
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  virt/kvm/kvm_main.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 63b4285d858d1..b29a38af543f0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3119,8 +3119,11 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  			      const void *data, int offset, int len)
>  {
>  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> +	int ret = __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
>  
> -	return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
> +	if (ret == -EFAULT)
> +		kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset, len);
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);

Why need to trap this?  Is this -EFAULT part of the "scalable userfault"
plan or not?

My previous memory was one can still leave things like copy_to_user() to go
via the userfaults channels which should work in parallel with the new vcpu
MEMORY_FAULT exit.  But maybe the plan changed?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map()
  2023-04-12 21:34 ` [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map() Anish Moorthy
@ 2023-04-20 20:53   ` Peter Xu
  2023-04-20 23:34     ` Anish Moorthy
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-04-20 20:53 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, kvm, kvmarm

On Wed, Apr 12, 2023 at 09:34:57PM +0000, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
> kvm_vcpu_map().
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  virt/kvm/kvm_main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 572adba9ad8ed..f3be5aa49829a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2843,8 +2843,10 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
>  #endif
>  	}
>  
> -	if (!hva)
> +	if (!hva) {
> +		kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE, PAGE_SIZE);
>  		return -EFAULT;
> +	}
>  
>  	map->page = page;
>  	map->hva = hva;

Totally not familiar with nested, just a pure question on whether all the
kvm_vcpu_map() callers will be prepared to receive this -EFAULT yet?

I quickly went over the later patches but I didn't find a full solution
yet, but maybe I missed something.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
       [not found]       ` <CAF7b7mo68VLNp=QynfT7QKgdq=d1YYGv1SEVEDxF9UwHzF6YDw@mail.gmail.com>
@ 2023-04-20 21:29         ` Peter Xu
  2023-04-21 16:58           ` Anish Moorthy
  2023-04-21 17:39           ` Nadav Amit
  2023-04-20 23:42         ` Anish Moorthy
  1 sibling, 2 replies; 103+ messages in thread
From: Peter Xu @ 2023-04-20 21:29 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Axel Rasmussen, pbonzini, maz, oliver.upton, seanjc, jthoughton,
	bgardon, dmatlack, ricarkol, kvm, kvmarm, Nadav Amit

Hi, Anish,

[Copied Nadav Amit for the last few paragraphs on userfaultfd, because
 Nadav worked on a few userfaultfd performance problems; so maybe he'll
 also have some ideas around]

On Wed, Apr 19, 2023 at 02:53:46PM -0700, Anish Moorthy wrote:
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> > > We considered sharding into several UFFDs. I do think it helps, but
> > > also I think there are two main problems with it...
> >
> > But I agree I can never justify that it'll always work.  If you or Anish
> > could provide some data points to further support this issue that would be
> > very interesting and helpful, IMHO, not required though.
> 
> Axel covered the reasons for not pursuing the sharding approach nicely
> (thanks!). It's not something we ever prototyped, so I don't have any
> further numbers there.
> 
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> >
> > > I think we could share numbers from some of our internal benchmarks,
> > > or at the very least give relative numbers (e.g. +50% increase), but
> > > since a lot of the software stack is proprietary (e.g. we don't use
> > > QEMU), it may not be that useful or reproducible for folks.
> >
> > Those numbers can still be helpful.  I was not asking for reproducibility,
> > but some test to better justify this feature.
> 
> I do have some internal benchmarking numbers on this front, although
> it's been a while since I've collected them so the details might be a
> little sparse.

Thanks for sharing these data points.  I don't understand most of them yet,
but I think it's better than the unit test numbers provided.

> 
> I've confirmed performance gains with "scalable userfaultfd" using two
> workloads besides the self-test:
> 
> The first, cycler, spins up a VM and launches a binary which (a) maps
> a large amount of memory and then (b) loops over it issuing writes as
> fast as possible. It's not a very realistic guest but it at least
> involves an actual migrating VM, and we often use it to
> stress/performance test migration changes. The write rate which cycler
> achieves during userfaultfd-based postcopy (without scalable uffd
> enabled) is about 25% of what it achieves under KVM Demand Paging (the
> internal KVM feature GCE currently uses for postcopy). With
> userfaultfd-based postcopy and scalable uffd enabled that rate jumps
> nearly 3x, so about 75% of what KVM Demand Paging achieves. The
> attached "Cycler.png" illustrates this effect (though due to some
> other details, faster demand paging actually makes the migrations
> worse: the point is that scalable uffd performs more similarly to kvm
> demand paging :)

Yes I don't understand why vanilla uffd is so different, nor am I sure
what the graph means, though. :)

Is the first drop caused by starting migration/precopy?

Is the 2nd (huge) drop (mostly to zero) caused by frequently accessing new
pages during postcopy?

Is the workload doing busy writes in a single thread, or in NCPU threads?

Can the 25%-75% comparison you mentioned be seen on the graph?
Or maybe that's part of the period where all three are very close to 0?

> 
> The second is the redis memtier benchmark [1], a more realistic
> workflow where we migrate a VM running the redis server. With scalable
> userfaultfd, the client VM observes significantly higher transaction
> rates during uffd-based postcopy (see "Memtier.png"). I can pull the
> exact numbers if needed, but just from eyeballing the graph you can
> see that the improvement is something like 5-10x (at least) for
> several seconds. There's still a noticeable gap with KVM demand paging
> based-postcopy, but the improvement is definitely significant.
> 
> [1] https://github.com/RedisLabs/memtier_benchmark

Does the "5-10x" difference lie in the "15s valley" you pointed out in the
graph?

Is it reproducible that the blue line always has a totally different
"valley" compared to yellow/red?

Personally I still really want to know what happens if we just split the
vma and see how it goes with a standard workload, but maybe I'm asking too
much so don't yet worry.  The solution here proposed still makes sense to
me and I agree if this can be done well it can resolve the bottleneck over
1-userfaultfd.

But after I read some of the patches I'm not sure whether it's possible it
can be implemented in a complete way.  You mentioned here and there that
things can be missing, probably due to random places accessing guest pages
all over kvm.  Relying solely on -EFAULT so far doesn't look very reliable
to me, but it could be because I didn't yet really understand how it works.

Is above a concern to the current solution?

Have any of you tried to investigate the other approach to scale
userfaultfd?

It seems userfaultfd does one thing great, which is to have the trapping at
a unified place (when the page fault happens), hence it doesn't need to
worry about random code spots all over the KVM module reading/writing guest
pages.  The question is whether it'll be easy to do so.

Splitting the vma is definitely still a way to scale userfaultfd, but probably
not in a good enough way because it's scaling on the memory axis, not cores.  If
tens of cores are accessing a small region that falls into the same VMA, then
it stops working.

However maybe it can be scaled in another form?  So far my understanding is
that "read" upon uffd for messages is still not a problem - the read can be done
in chunks, and each message will be converted into a request to be sent
later.

If the real problem lies in a bunch of threads queuing, is it possible
that we can provide just more queues for the events?  The readers will just
need to go over all the queues.

How to decide "which thread uses which queue" can be another problem; what
comes up quickly to me is a "hash(tid) % n_queues" but maybe it can be
better.  Each vcpu thread will have a different tid, then they can hopefully
scale on the queues.

There's at least one issue that I know with such an idea, that after we
have >1 uffd queues it means the message order will be uncertain.  It may
matter for some uffd users (e.g. cooperative userfaultfd, see
UFFD_FEATURE_FORK|REMOVE|etc.)  because I believe order of messages matter
for them (mostly CRIU).  But I think that's not a blocker either because we
can forbid those features with multi queues.

That's a wild idea that I'm just thinking about, which I have totally no
idea whether it'll work or not.  It's more or less of a generic question on
"whether there's chance to scale on uffd side just in case it might be a
cleaner approach", when above concern is a real concern.
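
To make the wild idea slightly more concrete, the dispatch side could look
roughly like the sketch below.  This is purely illustrative pseudo-kernel
code with made-up names (uffd_queue, uffd_pick_queue, etc.), not actual
userfaultfd code:

#define UFFD_NR_QUEUES	16

struct uffd_queue {
	spinlock_t		lock;
	struct list_head	fault_list;	/* pending userfault msgs */
	wait_queue_head_t	readers;	/* handler threads reading here */
};

struct userfaultfd_ctx_multi {
	struct uffd_queue	queues[UFFD_NR_QUEUES];
};

/* "hash(tid) % n_queues": each vcpu thread hashes onto its own queue. */
static struct uffd_queue *uffd_pick_queue(struct userfaultfd_ctx_multi *ctx)
{
	return &ctx->queues[current->pid % UFFD_NR_QUEUES];
}

Readers would then read/poll across all the queues (or each be dedicated to
one queue), so a burst of faulting vcpu threads would no longer serialize on
a single waitqueue lock.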

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
  2023-04-19 14:09   ` Hoo Robert
  2023-04-19 16:40     ` Anish Moorthy
@ 2023-04-20 22:47     ` Anish Moorthy
  1 sibling, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 22:47 UTC (permalink / raw)
  To: Hoo Robert
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Wed, Apr 19, 2023 at 7:10 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
>
> I think this vcpu exec time calc should be outside while() {} block.

Ah, you're right: fixed.

> can this gettid() be substituted by tid above? or #include header file
> for its prototype, otherwise build warning/error.

Huh, not sure how I missed the warning. Thanks, and done.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
  2023-04-20 20:52   ` Peter Xu
@ 2023-04-20 23:29     ` Anish Moorthy
  2023-04-21 15:00       ` Peter Xu
  0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 23:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, kvm, kvmarm

On Thu, Apr 20, 2023 at 1:52 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Apr 12, 2023 at 09:34:55PM +0000, Anish Moorthy wrote:
> > Implement KVM_CAP_MEMORY_FAULT_INFO for efaults from
> > kvm_vcpu_write_guest_page()
> >
> > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > ---
> >  virt/kvm/kvm_main.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 63b4285d858d1..b29a38af543f0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -3119,8 +3119,11 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> >                             const void *data, int offset, int len)
> >  {
> >       struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > +     int ret = __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
> >
> > -     return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
> > +     if (ret == -EFAULT)
> > +             kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset, len);
> > +     return ret;
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
>
> Why need to trap this?  Is this -EFAULT part of the "scalable userfault"
> plan or not?
>
> My previous memory was one can still leave things like copy_to_user() to go
> via the userfaults channels which should work in parallel with the new vcpu
> MEMORY_FAULT exit.  But maybe the plan changed?

This commit isn't really part of the "scalable uffd" changes, which
basically correspond to KVM_CAP_ABSENT_MAPPING_FAULT. There should be
more details in the cover letter, but basically my v1 just included
KVM_CAP_ABSENT_MAPPING_FAULT: Sean argued that the API there ("return
to userspace whenever KVM fails a guest memory access due to a page
fault") was problematic, and so I reworked the series to include a
general capability for reporting extra information for failed guest
memory accesses (KVM_CAP_MEMORY_FAULT_INFO) and
KVM_CAP_ABSENT_MAPPING_FAULT (which is meant to be used in combination
with the other cap) for the "scalable userfaultfd" changes.

As such most of the commits in this series are unrelated to
KVM_CAP_ABSENT_MAPPING_FAULT, and this is one of those commits. It
doesn't affect page faults generated by copy_to_user (which should
still be delivered via uffd).

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map()
  2023-04-20 20:53   ` Peter Xu
@ 2023-04-20 23:34     ` Anish Moorthy
  2023-04-21 14:58       ` Peter Xu
  0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 23:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, kvm, kvmarm

On Thu, Apr 20, 2023 at 1:53 PM Peter Xu <peterx@redhat.com> wrote:
>
> Totally not familiar with nested, just a pure question on whether all the
> kvm_vcpu_map() callers will be prepared to receive this -EFAULT yet?

The return values of this function aren't being changed: I'm just
setting some extra state in the kvm_run_struct in the case where this
function already returns -EFAULT.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
       [not found]       ` <CAF7b7mo68VLNp=QynfT7QKgdq=d1YYGv1SEVEDxF9UwHzF6YDw@mail.gmail.com>
  2023-04-20 21:29         ` Peter Xu
@ 2023-04-20 23:42         ` Anish Moorthy
  1 sibling, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 23:42 UTC (permalink / raw)
  To: kvm, kvmarm

My reply to Peter earlier bounced from the mailing list due to the
attached images (sorry!). I've copied it below to get a record
on-list.

Just for completeness, the message ID of the bounced mail was
<CAF7b7mo68VLNp=QynfT7QKgdq=d1YYGv1SEVEDxF9UwHzF6YDw@mail.gmail.com>

On Wed, Apr 19, 2023 at 2:53 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> > > We considered sharding into several UFFDs. I do think it helps, but
> > > also I think there are two main problems with it...
> >
> > But I agree I can never justify that it'll always work.  If you or Anish
> > could provide some data points to further support this issue that would be
> > very interesting and helpful, IMHO, not required though.
>
> Axel covered the reasons for not pursuing the sharding approach nicely
> (thanks!). It's not something we ever prototyped, so I don't have any
> further numbers there.
>
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> >
> > > I think we could share numbers from some of our internal benchmarks,
> > > or at the very least give relative numbers (e.g. +50% increase), but
> > > since a lot of the software stack is proprietary (e.g. we don't use
> > > QEMU), it may not be that useful or reproducible for folks.
> >
> > Those numbers can still be helpful.  I was not asking for reproducibility,
> > but some test to better justify this feature.
>
> I do have some internal benchmarking numbers on this front, although
> it's been a while since I've collected them so the details might be a
> little sparse.
>
> I've confirmed performance gains with "scalable userfaultfd" using two
> workloads besides the self-test:
>
> The first, cycler, spins up a VM and launches a binary which (a) maps
> a large amount of memory and then (b) loops over it issuing writes as
> fast as possible. It's not a very realistic guest but it at least
> involves an actual migrating VM, and we often use it to
> stress/performance test migration changes. The write rate which cycler
> achieves during userfaultfd-based postcopy (without scalable uffd
> enabled) is about 25% of what it achieves under KVM Demand Paging (the
> internal KVM feature GCE currently uses for postcopy). With
> userfaultfd-based postcopy and scalable uffd enabled that rate jumps
> nearly 3x, so about 75% of what KVM Demand Paging achieves. The
> attached "Cycler.png" illustrates this effect (though due to some
> other details, faster demand paging actually makes the migrations
> worse: the point is that scalable uffd performs more similarly to kvm
> demand paging :)
>
> The second is the redis memtier benchmark [1], a more realistic
> workflow where we migrate a VM running the redis server. With scalable
> userfaultfd, the client VM observes significantly higher transaction
> rates during uffd-based postcopy (see "Memtier.png"). I can pull the
> exact numbers if needed, but just from eyeballing the graph you can
> see that the improvement is something like 5-10x (at least) for
> several seconds. There's still a noticeable gap with KVM demand paging
> based-postcopy, but the improvement is definitely significant.
>
> [1] https://github.com/RedisLabs/memtier_benchmark

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
  2023-04-20 17:55     ` Anish Moorthy
@ 2023-04-21 12:15       ` Robert Hoo
  2023-04-21 16:21         ` Anish Moorthy
  0 siblings, 1 reply; 103+ messages in thread
From: Robert Hoo @ 2023-04-21 12:15 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Fri, Apr 21, 2023 at 1:56 AM Anish Moorthy <amoorthy@google.com> wrote:
>
> > > +             /*
> > > +              * With multiple vCPU threads fault on a single page and there are
> > > +              * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
> > > +              * will fail with EEXIST: handle that case without signaling an
> > > +              * error.
> > > +              */
> >
> > But this code path is also gone through in other cases, isn't it? In
> > those cases, is it still safe to ignore EEXIST?
>
> Good point: the answer is no, it's not always safe to ignore EEXISTs
> here. For instance the first UFFDIO_CONTINUE for a page shouldn't be
> allowed to EEXIST, and that's swept under the rug here. I've added the
> following to the comment
>
> + * Note that this does sweep under the rug any EEXISTs occurring
> + * from, e.g., the first UFFDIO_COPY/CONTINUEs on a page. A
> + * realistic VMM would maintain some other state to correctly
> + * surface EEXISTs to userspace or prevent duplicate
> + * COPY/CONTINUEs from happening in the first place.
>
> I could add that extra state to the self test (via for instance, an
> atomic bitmap that threads "or" into before issuing any
> COPY/CONTINUEs) but it's a bit of an extra complication without any
> real payoff. Let me know if you think the comment's inadequate though.
>
IIUC, you could say: in this on demand paging test case, even
duplicate copy/continue doesn't do harm anyway. Am I right?

> > > +             /* See the note about EEXISTs in the UFFDIO_COPY branch. */
> >
> > Personally I would suggest copy the comments here. what if some day above
> > code/comment was changed/deleted?
>
> You might be right: on the other hand, if the comment ever gets
> updated then it would have to be done in two places. Anyone to break
> the tie? :)

The one who updates the place is responsible for the comments. make sense?:)
>
> > > @@ -148,18 +158,32 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
> > >       TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
> > >                   expected_ioctls, "missing userfaultfd ioctls");
> > >
> > > -     ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
> > > -     TEST_ASSERT(!ret, "Failed to set up pipefd");
> > > -
> > >       uffd_desc->uffd_mode = uffd_mode;
> > >       uffd_desc->uffd = uffd;
> > >       uffd_desc->delay = delay;
> > >       uffd_desc->handler = handler;
> >
> > Now that these info are encapsulated into reader args below, looks
> > unnecessary to have them in uffd_desc here.
>
> Good point. I've removed uffd_mode, delay, and handler from uffd_desc.
> I left the "uffd" field in because that's a shared resource, and
> close()ing it as "close(desc->uffd)" makes more sense than, say,
> "close(desc->reader_args[0].uffd)"

Sure, that's also what I originally changed on my side. sorry didn't
mention it earlier.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-04-20 18:09     ` Anish Moorthy
@ 2023-04-21 12:28       ` Robert Hoo
  0 siblings, 0 replies; 103+ messages in thread
From: Robert Hoo @ 2023-04-21 12:28 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Fri, Apr 21, 2023 at 2:10 AM Anish Moorthy <amoorthy@google.com> wrote:
> > struct exit_reason[] string for KVM_EXIT_MEMORY_FAULT can be added as
> > well.
>
> Done, assuming you mean the exit_reasons_known definition in kvm_util.c

Yes.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map()
  2023-04-20 23:34     ` Anish Moorthy
@ 2023-04-21 14:58       ` Peter Xu
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Xu @ 2023-04-21 14:58 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, kvm, kvmarm

On Thu, Apr 20, 2023 at 04:34:39PM -0700, Anish Moorthy wrote:
> On Thu, Apr 20, 2023 at 1:53 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Totally not familiar with nested, just a pure question on whether all the
> > kvm_vcpu_map() callers will be prepared to receive this -EFAULT yet?
> 
> The return values of this function aren't being changed: I'm just
> setting some extra state in the kvm_run_struct in the case where this
> function already returns -EFAULT.

Ah, I was wrongly assuming there'll be more -EFAULTs after you enable the
new memslot flag KVM_MEM_ABSENT_MAPPING_FAULT.  But then when I re-read
your patch below I see that the new flag only affects __kvm_faultin_pfn().

Then I assume that's fine, thanks.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
  2023-04-20 23:29     ` Anish Moorthy
@ 2023-04-21 15:00       ` Peter Xu
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Xu @ 2023-04-21 15:00 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, kvm, kvmarm

On Thu, Apr 20, 2023 at 04:29:38PM -0700, Anish Moorthy wrote:
> On Thu, Apr 20, 2023 at 1:52 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Apr 12, 2023 at 09:34:55PM +0000, Anish Moorthy wrote:
> > > Implement KVM_CAP_MEMORY_FAULT_INFO for efaults from
> > > kvm_vcpu_write_guest_page()
> > >
> > > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > > ---
> > >  virt/kvm/kvm_main.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 63b4285d858d1..b29a38af543f0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -3119,8 +3119,11 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > >                             const void *data, int offset, int len)
> > >  {
> > >       struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > > +     int ret = __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
> > >
> > > -     return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
> > > +     if (ret == -EFAULT)
> > > +             kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset, len);
> > > +     return ret;
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
> >
> > Why need to trap this?  Is this -EFAULT part of the "scalable userfault"
> > plan or not?
> >
> > My previous memory was one can still leave things like copy_to_user() to go
> > via the userfaults channels which should work in parallel with the new vcpu
> > MEMORY_FAULT exit.  But maybe the plan changed?
> 
> This commit isn't really part of the "scalable uffd" changes, which
> basically correspond to KVM_CAP_ABSENT_MAPPING_FAULT. There should be
> more details in the cover letter, but basically my v1 just included
> KVM_CAP_ABSENT_MAPPING_FAULT: Sean argued that the API there ("return
> to userspace whenever KVM fails a guest memory access due to a page
> fault") was problematic, and so I reworked the series to include a
> general capability for reporting extra information for failed guest
> memory accesses (KVM_CAP_MEMORY_FAULT_INFO) and
> KVM_CAP_ABSENT_MAPPING_FAULT (which is meant to be used in combination
> with the other cap) for the "scalable userfaultfd" changes.
> 
> As such most of the commits in this series are unrelated to
> KVM_CAP_ABSENT_MAPPING_FAULT, and this is one of those commits. It
> doesn't affect page faults generated by copy_to_user (which should
> still be delivered via uffd).

Indeed it'll be an improvement itself to report more details for such an
error already.  Makes sense to me, thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
  2023-04-21 12:15       ` Robert Hoo
@ 2023-04-21 16:21         ` Anish Moorthy
  0 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-21 16:21 UTC (permalink / raw)
  To: Robert Hoo
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Fri, Apr 21, 2023 at 5:15 AM Robert Hoo <robert.hoo.linux@gmail.com> wrote:
>
> IIUC, you could say: in this on demand paging test case, even
> duplicate copy/continue doesn't do harm anyway. Am I right?

It's probably more accurate to say that it never happens in the first
place. I've added a sentence here,

> > > > +             /* See the note about EEXISTs in the UFFDIO_COPY branch. */
> > >
> > > Personally I would suggest copy the comments here. what if some day above
> > > code/comment was changed/deleted?
> >
> > You might be right: on the other hand, if the comment ever gets
> > updated then it would have to be done in two places. Anyone to break
> > the tie? :)
>
> The one who updates the place is responsible for the comments. make sense?:)

Fair enough, done.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-20 21:29         ` Peter Xu
@ 2023-04-21 16:58           ` Anish Moorthy
  2023-04-21 17:39           ` Nadav Amit
  1 sibling, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-21 16:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Axel Rasmussen, pbonzini, maz, oliver.upton, seanjc, jthoughton,
	bgardon, dmatlack, ricarkol, kvm, kvmarm, Nadav Amit

On Thu, Apr 20, 2023 at 2:29 PM Peter Xu <peterx@redhat.com> wrote:
>
> Yes I don't understand why vanilla uffd is so different, nor am I sure
> what the graph means, though. :)
>
> Is the first drop caused by starting migration/precopy?
>
> Is the 2nd (huge) drop (mostly to zero) caused by frequently accessing new
> pages during postcopy?

Right on both counts. By the way, for anyone who notices that the
userfaultfd (red/yellow) lines never recover to the initial level of
performance, whereas the blue line does: that's a separate issue,
please ignore :)

> Is the workload doing busy writes in a single thread, or in NCPU threads?

One thread per vCPU.

> Can the 25%-75% comparison you mentioned be seen on the graph?
> Or maybe that's part of the period where all three are very close to 0?

Yes, unfortunately the absolute size of the improvement is still
pretty small (we go from ~50 writes/s to ~150), so all looks like zero
with this scale.

> > The second is the redis memtier benchmark [1], a more realistic
> > workflow where we migrate a VM running the redis server. With scalable
> > userfaultfd, the client VM observes significantly higher transaction
> > rates during uffd-based postcopy (see "Memtier.png"). I can pull the
> > exact numbers if needed, but just from eyeballing the graph you can
> > see that the improvement is something like 5-10x (at least) for
> > several seconds. There's still a noticeable gap with KVM demand paging
> > based-postcopy, but the improvement is definitely significant.
> >
> > [1] https://github.com/RedisLabs/memtier_benchmark
>
> Does the "5-10x" difference lie in the "15s valley" you pointed out in the
> graph?

Not quite sure what you mean: I meant to point out that the ~15s
valley is where we observe improvements due to scalable userfaultfd.
For most of that valley, the speedup of scalable uffd is 5-10x (or
something, I admit to eyeballing those numbers :)

> Is it reproducible that the blue line always has a totally different
> "valley" compared to yellow/red?

Yes, but the offset of that valley is just precopy taking longer for
some reason on that configuration. Honestly it's probably just better
to ignore the blue line, since that's a google-specific stack.

> Personally I still really want to know what happens if we just split the
> vma and see how it goes with a standard workload, but maybe I'm asking too
> much so don't yet worry.  The solution here proposed still makes sense to
> me and I agree if this can be done well it can resolve the bottleneck over
> 1-userfaultfd.
>
> But after I read some of the patches I'm not sure whether it's possible it
> can be implemented in a complete way.  You mentioned here and there that
> things can be missing, probably due to random places accessing guest pages
> all over kvm.  Relying solely on -EFAULT so far doesn't look very reliable
> to me, but it could be because I didn't yet really understand how it works.
>
> Is above a concern to the current solution?

Based on your comment in [1], I think your impression of this series
is that it tries to (a) catch all of the cases where userfaultfd would
be triggered and (b) bypass userfaultfd by surfacing the page faults
via vCPU exit. That's only happening in two places (the
KVM_ABSENT_MAPPING_FAULT changes) corresponding to the EPT violation
handler on x86 and the arm64 equivalent. Bypassing the queuing of
faults onto a uffd in those two cases, and instead delivering those
faults via vCPU exit, is what provides the performance gains I'm
demonstrating.

However, all of the other changes (KVM_MEMORY_FAULT_INFO, the bulk of
this series) are totally unrelated to if/how faults are queued onto
userfaultfd. Page faults from copy_to_user/copy_from_user, etc will
continue to be delivered via uffd (if one is registered, obviously),
and changing that is *not* a goal. All that KVM_MEMORY_FAULT_INFO does
is deliver some extra information to userspace in cases where KVM_RUN
currently just returns -EFAULT.
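
In case a concrete picture helps, the intended usage on the userspace side
is roughly the loop below.  This is only a sketch against this series'
proposed uAPI: the memory_fault field names (gpa/len) and
handle_absent_range() are assumptions/made-up, not final:

#include <errno.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Hypothetical VMM helper: UFFDIO_CONTINUE / MADV_POPULATE_WRITE the range. */
void handle_absent_range(unsigned long gpa, unsigned long len);

void vcpu_thread_loop(int vcpu_fd, struct kvm_run *run)
{
	for (;;) {
		int ret = ioctl(vcpu_fd, KVM_RUN, 0);

		if (ret == -1 && errno == EFAULT &&
		    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			/*
			 * Resolve the annotated fault, then re-enter the guest
			 * on this same vcpu thread.  The .gpa/.len names are
			 * assumptions about the series' uAPI.
			 */
			handle_absent_range(run->memory_fault.gpa,
					    run->memory_fault.len);
			continue;
		}
		if (ret == -1) {
			/* un-annotated EFAULT or other error: existing handling */
			break;
		}
		/* ...dispatch other run->exit_reason values as usual... */
	}
}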

Hopefully this, and my response to [1], clears things up. If not, let
me know and I'll be glad to discuss further.

[1] https://lore.kernel.org/kvm/ZEGuogfbtxPNUq7t@x1n/T/#m76f940846ecc94ea85efa80ffbe42366c2352636

> Have any of you tried to investigate the other approach to scale
> userfaultfd?

As Axel mentioned we considered sharding VMAs but didn't pursue it for
a few different reasons.

> It seems userfaultfd does one thing great, which is to have the trapping at
> a unified place (when the page fault happens), hence it doesn't need to
> worry about random code spots all over the KVM module reading/writing guest
> pages.  The question is whether it'll be easy to do so.

See a couple of notes above.

> Splitting the vma is definitely still a way to scale userfaultfd, but probably
> not in a good enough way because it's scaling on the memory axis, not cores.  If
> tens of cores are accessing a small region that falls into the same VMA, then
> it stops working.
>
> However maybe it can be scaled in another form?  So far my understanding is
> that "read" upon uffd for messages is still not a problem - the read can be done
> in chunks, and each message will be converted into a request to be sent
> later.
>
> If the real problem lies in a bunch of threads queuing, is it possible
> that we can provide just more queues for the events?  The readers will just
> need to go over all the queues.
>
> How to decide "which thread uses which queue" can be another problem; what
> comes up quickly to me is a "hash(tid) % n_queues" but maybe it can be
> better.  Each vcpu thread will have a different tid, then they can hopefully
> scale on the queues.
>
> There's at least one issue that I know with such an idea, that after we
> have >1 uffd queues it means the message order will be uncertain.  It may
> matter for some uffd users (e.g. cooperative userfaultfd, see
> UFFD_FEATURE_FORK|REMOVE|etc.)  because I believe order of messages matter
> for them (mostly CRIU).  But I think that's not a blocker either because we
> can forbid those features with multi queues.
>
> That's a wild idea that I'm just thinking about, which I have totally no
> idea whether it'll work or not.  It's more or less of a generic question on
> "whether there's chance to scale on uffd side just in case it might be a
> cleaner approach", when above concern is a real concern.

You bring up a good point, which is that this series only deals with
uffd's performance in the context of KVM. I had another idea in this
vein, which was to allow dedicating queues to certain threads: I even
threw together a prototype, though there was some bug in it which
stopped me from ever getting a real signal :(

I think there's still potential to make uffd itself faster but, as you
point out, that might get messy from an API perspective (I know my
prototype did :) and is going to require more investigation and
prototyping. The advantage of this approach is that it's simple, makes
a lot of conceptual sense IMO (in that the previously-stalled vCPU
threads can now participate in the work of demand fetching), and
solves a very important (probably *the* most important) bottleneck
when it comes to KVM + uffd-based postcopy.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-20 21:29         ` Peter Xu
  2023-04-21 16:58           ` Anish Moorthy
@ 2023-04-21 17:39           ` Nadav Amit
  2023-04-24 17:54             ` Anish Moorthy
  1 sibling, 1 reply; 103+ messages in thread
From: Nadav Amit @ 2023-04-21 17:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Anish Moorthy, Axel Rasmussen, Paolo Bonzini, maz, oliver.upton,
	Sean Christopherson, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm


> On Apr 20, 2023, at 2:29 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> Hi, Anish,
> 
> [Copied Nadav Amit for the last few paragraphs on userfaultfd, because
> Nadav worked on a few userfaultfd performance problems; so maybe he'll
> also have some ideas around]

Sorry for not following this thread and previous ones.

I skimmed through and I hope my answers would be helpful and relevant…

Anyhow, I also encountered to some extent the cost of locking (not the
contention). In my case, I only did a small change to reduce the overhead of
acquiring the locks by "combining" the locks of fault_pending_wqh and
fault_wqh, which are (almost?) always acquired together. I only acquired
fault_pending_wqh and then introduced "fake_spin_lock()", which just got
lockdep to understand that fault_wqh is already protected.

But as I said, this solution was only intended to reduce the cost of locking,
and it does not solve the contention.

If I understand the problem correctly, it sounds as if the proper solution
should be some kind of a range-locks. If it is too heavy or the interface can
be changed/extended to wake a single address (instead of a range),
simpler hashed-locks can be used.
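
To illustrate the hashed-locks variant, a minimal sketch (names are made up,
this is not a patch):

#include <linux/hash.h>
#include <linux/wait.h>

#define UFFD_WQH_HASH_BITS	6	/* 64 wait queues */

struct uffd_hashed_wqh {
	wait_queue_head_t	wqh[1 << UFFD_WQH_HASH_BITS];
};

/* Both the faulting thread and the waker pick the queue by address. */
static wait_queue_head_t *uffd_wqh_for(struct uffd_hashed_wqh *h,
				       unsigned long address)
{
	return &h->wqh[hash_long(address >> PAGE_SHIFT, UFFD_WQH_HASH_BITS)];
}

A UFFDIO_COPY/CONTINUE that resolves a single address would then only take
the lock of (and wake) the queue that address hashes to, instead of
serializing all waiters behind the single fault_pending_wqh/fault_wqh pair.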

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-21 17:39           ` Nadav Amit
@ 2023-04-24 17:54             ` Anish Moorthy
  2023-04-24 19:44               ` Nadav Amit
  0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-24 17:54 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Axel Rasmussen, Paolo Bonzini, maz, oliver.upton,
	Sean Christopherson, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm

On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> If I understand the problem correctly, it sounds as if the proper solution
> should be some kind of a range-locks. If it is too heavy or the interface can
> be changed/extended to wake a single address (instead of a range),
> simpler hashed-locks can be used.

Some sort of range-based locking system does seem relevant, although I
don't see how that would necessarily speed up the delivery of faults
to UFFD readers: I'll have to think about it more.

On the KVM side though, I think there's value in merging
KVM_CAP_ABSENT_MAPPING_FAULT and allowing performance improvements to
UFFD itself proceed separately. It's likely that returning faults
directly via the vCPU threads will be faster than even an improved
UFFD, since the former approach basically removes one level of
indirection. That seems important, given the criticality of the
EPT-violation path during postcopy. Furthermore, if future performance
improvements to UFFD involve changes/restrictions to its API, then
KVM_CAP_ABSENT_MAPPING_FAULT could well be useful anyways.

Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
so I'm guessing (some version of) that will (eventually :) be merged
in any case.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-24 17:54             ` Anish Moorthy
@ 2023-04-24 19:44               ` Nadav Amit
  2023-04-24 20:35                 ` Sean Christopherson
  2023-04-25  0:15                 ` Anish Moorthy
  0 siblings, 2 replies; 103+ messages in thread
From: Nadav Amit @ 2023-04-24 19:44 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Peter Xu, Axel Rasmussen, Paolo Bonzini, maz, oliver.upton,
	Sean Christopherson, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm



> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> 
> On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> If I understand the problem correctly, it sounds as if the proper solution
>> should be some kind of a range-locks. If it is too heavy or the interface can
>> be changed/extended to wake a single address (instead of a range),
>> simpler hashed-locks can be used.
> 
> Some sort of range-based locking system does seem relevant, although I
> don't see how that would necessarily speed up the delivery of faults
> to UFFD readers: I'll have to think about it more.

Perhaps I misread your issue. Based on the scalability issues you raised,
I assumed that the problem you encountered is related to lock contention.
I do not know whether you profiled it, but some information would be
useful.

Anyhow, if the issue you encounter is mostly about the general overhead
of delivering userfaultfd, I encountered this one too. The solution I tried
(and you can find some old patches) is in delivering and resolving userfaultfd
using IO-uring. The main advantage is that this solution is generic and
performance is pretty good. The disadvantage is that you do need to allocate
a polling thread/core to handle the userfaultfd.
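
(For context, the baseline delivery path being discussed is the one where
dedicated reader threads pull events off the descriptor, roughly as the
selftest's reader threads do. A minimal sketch, with error handling
elided and resolve_hva() as a hypothetical helper that pages in and wakes
the faulting address:

  #include <poll.h>
  #include <unistd.h>
  #include <linux/userfaultfd.h>

  static void reader_loop(int uffd)
  {
          struct uffd_msg msg;
          struct pollfd pfd = { .fd = uffd, .events = POLLIN };

          for (;;) {
                  poll(&pfd, 1, -1);
                  if (read(uffd, &msg, sizeof(msg)) <= 0)
                          continue;
                  if (msg.event == UFFD_EVENT_PAGEFAULT)
                          resolve_hva(msg.arg.pagefault.address);
          }
  }

The io_uring approach mentioned above would replace this read loop with
completions posted to a ring.)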

If you want, I can send you privately the last iteration of my patches for
you to give it a spin.

> 
> On the KVM side though, I think there's value in merging
> KVM_CAP_ABSENT_MAPPING_FAULT and allowing performance improvements to
> UFFD itself proceed separately. It's likely that returning faults
> directly via the vCPU threads will be faster than even an improved
> UFFD, since the former approach basically removes one level of
> indirection. That seems important, given the criticality of the
> EPT-violation path during postcopy. Furthermore, if future performance
> improvements to UFFD involve changes/restrictions to its API, then
> KVM_CAP_ABSENT_MAPPING_FAULT could well be useful anyways.
> 
> Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
> so I'm guessing (some version of) that will (eventually :) be merged
> in any case.

It is certainly not my call. But if you ask me, introducing a solution for
a concrete use-case that requires API changes/enhancements is not
guaranteed to be the best approach. It may be better to first fully
understand the existing overheads and agree that there is no cleaner and
more general alternative with similar performance.

Considering the mess that KVM async-PF introduced, I would be very careful
before introducing such API changes. I did not look too closely at the
details, but some things anyhow look slightly strange (which might be
because I am out of touch with KVM). For instance, returning -EFAULT from
KVM_RUN? I would have assumed -EAGAIN would be more appropriate since the
invocation did succeed.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-24 19:44               ` Nadav Amit
@ 2023-04-24 20:35                 ` Sean Christopherson
  2023-04-24 23:47                   ` Nadav Amit
  2023-04-25  0:15                 ` Anish Moorthy
  1 sibling, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2023-04-24 20:35 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Anish Moorthy, Peter Xu, Axel Rasmussen, Paolo Bonzini, maz,
	oliver.upton, James Houghton, bgardon, dmatlack, ricarkol, kvm,
	kvmarm

On Mon, Apr 24, 2023, Nadav Amit wrote:
> 
> > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> > Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
> > so I'm guessing (some version of) that will (eventually :) be merged
> > in any case.
> 
> It certainly not my call. But if you ask me, introducing a solution for
> a concrete use-case that requires API changes/enhancements is not
> guaranteed to be the best solution. It may be better first to fully
> understand the existing overheads and agree that there is no alternative
> cleaner and more general solution with similar performance.

KVM already returns -EFAULT for these situations, the change I really want to land
is to have KVM report detailed information about why the -EFAULT occurred.  I'll be
happy to carry the code in KVM even if userspace never does anything beyond dumping
the extra information on failures.
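
For illustration, here is a minimal sketch of how a VMM run loop might
consume those annotated EFAULTs, assuming KVM_CAP_MEMORY_FAULT_INFO has
been enabled. It is not taken from the patches; the names
(KVM_EXIT_MEMORY_FAULT, kvm_run.memory_fault.gpa) follow the v3 proposal
and may change, and resolve_gpa() is a hypothetical helper that pages in
the reported range (e.g. via UFFDIO_CONTINUE or MADV_POPULATE_WRITE):

  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
  {
          for (;;) {
                  int ret = ioctl(vcpu_fd, KVM_RUN, NULL);

                  if (ret == 0) {
                          /* Normal exit handling (MMIO, halt, ...). */
                          continue;
                  }
                  if (errno == EFAULT &&
                      run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                          /* Resolve the fault and re-enter the guest. */
                          resolve_gpa(run->memory_fault.gpa);
                          continue;
                  }
                  /* Un-annotated EFAULTs and other errors need a fallback. */
                  break;
          }
  }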

> Considering the mess that KVM async-PF introduced, I would be very careful
> before introducing such API changes. I did not look too much on the details,
> but some things anyhow look slightly strange (which might be since I am
> out-of-touch with KVM). For instance, returning -EFAULT on from KVM_RUN? I
> would have assumed -EAGAIN would be more appropriate since the invocation did
> succeed.

Yeah, returning -EFAULT is somewhat odd, but as above, that's pre-existing
behavior that's been around for many years.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
  2023-04-12 21:35 ` [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation Anish Moorthy
  2023-04-19 14:00   ` Hoo Robert
@ 2023-04-24 21:02   ` Sean Christopherson
  2023-06-01 16:04     ` Oliver Upton
  2023-06-01 18:19   ` Oliver Upton
  2 siblings, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2023-04-24 21:02 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Wed, Apr 12, 2023, Anish Moorthy wrote:
> Add documentation, memslot flags, useful helper functions, and the
> actual new capability itself.
> 
> Memory fault exits on absent mappings are particularly useful for
> userfaultfd-based postcopy live migration. When many vCPUs fault on a
> single userfaultfd the faults can take a while to surface to userspace
> due to having to contend for uffd wait queue locks. Bypassing the uffd
> entirely by returning information directly to the vCPU exit avoids this
> contention and improves the fault rate.
> 
> Suggested-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 31 ++++++++++++++++++++++++++++---
>  include/linux/kvm_host.h       |  7 +++++++
>  include/uapi/linux/kvm.h       |  2 ++
>  tools/include/uapi/linux/kvm.h |  1 +
>  virt/kvm/kvm_main.c            |  3 +++
>  5 files changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f174f43c38d45..7967b9909e28b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
>    /* for kvm_userspace_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>    #define KVM_MEM_READONLY	(1UL << 1)
> +  #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)

This name is both too specific and too vague.  It's too specific because it affects
more than just "absent" mappings; it will affect any page fault that can't be
resolved by fast GUP, i.e. I'm objecting for all the same reasons I objected to
the exit reason being named KVM_MEMFAULT_REASON_ABSENT_MAPPING.  It's too vague
because it doesn't describe what behavior the flag actually enables in any way.

I liked the "nowait" verbiage from the RFC.  "fast_only" is an ok alternative,
but that's much more of a kernel-internal name.

Oliver, you had concerns with using "fault" in the name, is something like
KVM_MEM_NOWAIT_ON_PAGE_FAULT or KVM_MEM_NOWAIT_ON_FAULT palatable?  IMO, "fault"
is perfectly ok, we just need to ensure it's unlikely to be ambiguous for userspace.
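
(For reference, with the v3 naming a VMM would opt a memslot into this
behavior roughly as follows; a sketch, not taken from the patches, and
gpa_base, slot_size, and hva_base are placeholders:

  struct kvm_userspace_memory_region region = {
          .slot            = 1,
          .flags           = KVM_MEM_ABSENT_MAPPING_FAULT,
          .guest_phys_addr = gpa_base,
          .memory_size     = slot_size,
          .userspace_addr  = (__u64)hva_base,
  };

  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

Whatever name is settled on would simply replace the flag above.)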

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-24 20:35                 ` Sean Christopherson
@ 2023-04-24 23:47                   ` Nadav Amit
  2023-04-25  0:26                     ` Sean Christopherson
  0 siblings, 1 reply; 103+ messages in thread
From: Nadav Amit @ 2023-04-24 23:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Anish Moorthy, Peter Xu, Axel Rasmussen, Paolo Bonzini, maz,
	oliver.upton, James Houghton, bgardon, dmatlack, ricarkol, kvm,
	kvmarm

Feel free to tell me to shut up, as it is none of my business, and I might be
missing a lot of context. Yet, it never stopped me before. :)

> On Apr 24, 2023, at 1:35 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Mon, Apr 24, 2023, Nadav Amit wrote:
>> 
>>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
>>> Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
>>> so I'm guessing (some version of) that will (eventually :) be merged
>>> in any case.
>> 
>> It certainly not my call. But if you ask me, introducing a solution for
>> a concrete use-case that requires API changes/enhancements is not
>> guaranteed to be the best solution. It may be better first to fully
>> understand the existing overheads and agree that there is no alternative
>> cleaner and more general solution with similar performance.
> 
> KVM already returns -EFAULT for these situations, the change I really want to land
> is to have KVM report detailed information about why the -EFAULT occurred.  I'll be
> happy to carry the code in KVM even if userspace never does anything beyond dumping
> the extra information on failures.

I thought that the change is to report page faults through a new interface
instead of the existing uffd-file-based one. There is already another
interface (signals), and I prototyped (but did not upstream) an io_uring
one. You now suggest yet another.

I am not sure it is very clean. IIUC, it means that userspace would still
need to monitor the uffd, as qemu (or whatever KVM userspace counterpart
you use) might also trigger a page fault. So userspace becomes more
complicated, and the ordering of the reported events/page-faults becomes
even more broken.

At the same time, you also break various assumptions of userfaultfd. I
don't immediately see functionality that would break, but I am not very
confident about it either.

> 
>> Considering the mess that KVM async-PF introduced, I would be very careful
>> before introducing such API changes. I did not look too much on the details,
>> but some things anyhow look slightly strange (which might be since I am
>> out-of-touch with KVM). For instance, returning -EFAULT on from KVM_RUN? I
>> would have assumed -EAGAIN would be more appropriate since the invocation did
>> succeed.
> 
> Yeah, returning -EFAULT is somewhat odd, but as above, that's pre-existing
> behavior that's been around for many years.

Again, it is none of my business, but don't you want to gradually try to
fix the interface so that maybe one day you would be able to deprecate it?

IOW, if you are already introducing a new interface which is enabled with
a new flag, and which would require a userspace change anyway, then you
can return the more appropriate error code.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-24 19:44               ` Nadav Amit
  2023-04-24 20:35                 ` Sean Christopherson
@ 2023-04-25  0:15                 ` Anish Moorthy
  2023-04-25  0:54                   ` Nadav Amit
  2023-04-27 20:26                   ` Peter Xu
  1 sibling, 2 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-25  0:15 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Axel Rasmussen, Paolo Bonzini, maz, oliver.upton,
	Sean Christopherson, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm

On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
>
>
> > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>
> >> If I understand the problem correctly, it sounds as if the proper solution
> >> should be some kind of a range-locks. If it is too heavy or the interface can
> >> be changed/extended to wake a single address (instead of a range),
> >> simpler hashed-locks can be used.
> >
> > Some sort of range-based locking system does seem relevant, although I
> > don't see how that would necessarily speed up the delivery of faults
> > to UFFD readers: I'll have to think about it more.
>
> Perhaps I misread your issue. Based on the scalability issues you raised,
> I assumed that the problem you encountered is related to lock contention.
> I do not know whether your profiled it, but some information would be
> useful.

No, you had it right: the issue at hand is contention on the uffd wait
queues. I'm just not sure what the range-based locking would really be
doing. Events would still have to be delivered to userspace in an
ordered manner, so it seems to me that each uffd would still need to
maintain a queue (and the associated contention).

With respect to the "sharding" idea, I collected some more runs of the
self test (full command in [1]). This time I omitted the "-a" flag, so
that every vCPU accesses a different range of guest memory with its
own UFFD, and set the number of reader threads per UFFD to 1.

vCPUs   Avg paging rate (w/o new caps)   Avg paging rate (w/ new caps)
    1             180                              307
    2              85                              220
    4              80                              206
    8              39                              163
   16              18                              104
   32               8                               73
   64               4                               57
  128               1                               37
  256               1                               16

I'm reporting the paging rate on a per-vCPU rather than total basis, which
is why the numbers look so different from the ones in the cover
letter. I'm actually not sure why the demand paging rate falls off
with the number of vCPUs (maybe a prioritization issue on my side?),
but even when UFFDs aren't being contended for, it's clear that demand
paging via memory fault exits is significantly faster.

I'll try to get some perf traces as well: that will take a little bit
of time though, as doing it for cycler will involve patching our VMM
first.

[1] ./demand_paging_test -b 64M -u MINOR -s shmem -v <n> -r 1 [-w]

> It certainly not my call. But if you ask me, introducing a solution for
> a concrete use-case that requires API changes/enhancements is not
> guaranteed to be the best solution. It may be better first to fully
> understand the existing overheads and agree that there is no alternative
> cleaner and more general solution with similar performance.
>
> Considering the mess that KVM async-PF introduced, I
> would be very careful before introducing such API changes. I did not look
> too much on the details, but some things anyhow look slightly strange
> (which might be since I am out-of-touch with KVM). For instance, returning
> -EFAULT on from KVM_RUN? I would have assumed -EAGAIN would be more
> appropriate since the invocation did succeed.

I'm not quite sure whether you're focusing on
KVM_CAP_MEMORY_FAULT_INFO or KVM_CAP_ABSENT_MAPPING_FAULT here. But to
my knowledge, none of the KVM folks have objections to either:
hopefully it stays that way, but we'll have to see :)

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-24 23:47                   ` Nadav Amit
@ 2023-04-25  0:26                     ` Sean Christopherson
  2023-04-25  0:37                       ` Nadav Amit
  0 siblings, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2023-04-25  0:26 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Anish Moorthy, Peter Xu, Axel Rasmussen, Paolo Bonzini, maz,
	oliver.upton, James Houghton, bgardon, dmatlack, ricarkol, kvm,
	kvmarm

On Mon, Apr 24, 2023, Nadav Amit wrote:
> Feel free to tell me to shut up, as it is none of my business, and I might be
> missing a lot of context. Yet, it never stopped me before. :)
> 
> > On Apr 24, 2023, at 1:35 PM, Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Mon, Apr 24, 2023, Nadav Amit wrote:
> >> 
> >>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> >>> Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
> >>> so I'm guessing (some version of) that will (eventually :) be merged
> >>> in any case.
> >> 
> >> It certainly not my call. But if you ask me, introducing a solution for
> >> a concrete use-case that requires API changes/enhancements is not
> >> guaranteed to be the best solution. It may be better first to fully
> >> understand the existing overheads and agree that there is no alternative
> >> cleaner and more general solution with similar performance.
> > 
> > KVM already returns -EFAULT for these situations, the change I really want to land
> > is to have KVM report detailed information about why the -EFAULT occurred.  I'll be
> > happy to carry the code in KVM even if userspace never does anything beyond dumping
> > the extra information on failures.
> 
> I thought that the change is to inform on page-faults through a new interface
> instead of the existing uffd-file-based one. There is already another interface
> (signals) and I thought (but did not upstream) io-uring. You now suggest yet
> another one.

There are two capabilities proposed in this series: one to provide more information
when KVM encounters a fault it can't resolve, and another to tell KVM to kick out
to userspace instead of attempting to resolve a fault when a given address couldn't
be resolved with fast gup.  I'm talking purely about the first one: providing more
information when KVM exits.

As for the second, my plan is to try and stay out of the way and let people that
actually deal with the userspace side of things settle on an approach.  From the
KVM side, supporting the "don't wait to resolve faults" flag is quite simple so
long as KVM punts the heavy lifting to userspace, e.g. identifying _why_ the address
isn't mapped with the appropriate permissions.

> I am not sure it is very clean. IIUC, it means that you would still need in
> userspace to monitor uffd, as qemu (or whatever-kvm-userspace-counterpart-you-
> use) might also trigger a page-fault. So userspace becomes more complicated,
> and the ordering of different events/page-faults reporting becomes even more
> broken.
> 
> At the same time, you also break various assumptions of userfaultfd. I don’t
> immediately find some functionality that would break, but I am not very
> confident about it either.
> 
> > 
> >> Considering the mess that KVM async-PF introduced, I would be very careful
> >> before introducing such API changes. I did not look too much on the details,
> >> but some things anyhow look slightly strange (which might be since I am
> >> out-of-touch with KVM). For instance, returning -EFAULT on from KVM_RUN? I
> >> would have assumed -EAGAIN would be more appropriate since the invocation did
> >> succeed.
> > 
> > Yeah, returning -EFAULT is somewhat odd, but as above, that's pre-existing
> > behavior that's been around for many years.
> 
> Again, it is none of my business, but don’t you want to gradually try to fix
> the interface so maybe on day you would be able to deprecate it?
>
> IOW, if you already introduce a new interface which is enabled with a new
> flag, which would require userspace change, then you can return the more
> appropriate error-code.

In a perfect world, yes.  But unfortunately, the relevant plumbing in KVM is quite
brittle (understatement) with respect to returning "0", and AFAICT, returning
-EFAULT instead of 0 is nothing more than an oddity.  E.g. at worst, it could be
surprising to users writing a new VMM from scratch.

But I hadn't thought about returning a _different_ error code.  -EAGAIN isn't
obviously better though, e.g. my understanding is that -EAGAIN typically means that
retrying will succeed, but in pretty much every case where KVM returns -EFAULT,
simply trying again will never succeed.  It's not even guaranteed to be appropriate
in the "don't wait to resolve faults" case, because KVM won't actually know why
an address isn't accessible, e.g. it could be because the page needs to be faulted
in, but it could also be due to a guest bug that resulted in a permission
violation.

All in all, returning -EFAULT doesn't seem egregious.  I can't recall a single complaint
about returning -EFAULT instead of -XYZ, just complaints about KVM not providing any
useful information.  So absent a concrete need for returning something other than
-EFAULT, I'm definitely inclined to maintain the status quo, even though it's imperfect.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-25  0:26                     ` Sean Christopherson
@ 2023-04-25  0:37                       ` Nadav Amit
  0 siblings, 0 replies; 103+ messages in thread
From: Nadav Amit @ 2023-04-25  0:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Anish Moorthy, Peter Xu, Axel Rasmussen, Paolo Bonzini, maz,
	oliver.upton, James Houghton, bgardon, dmatlack, ricarkol, kvm,
	kvmarm



> On Apr 24, 2023, at 5:26 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Mon, Apr 24, 2023, Nadav Amit wrote:
>> Feel free to tell me to shut up, as it is none of my business, and I might be
>> missing a lot of context. Yet, it never stopped me before. :)
>> 
>>> On Apr 24, 2023, at 1:35 PM, Sean Christopherson <seanjc@google.com> wrote:
>>> 
>>> On Mon, Apr 24, 2023, Nadav Amit wrote:
>>>> 
>>>>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
>>>>> Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general,
>>>>> so I'm guessing (some version of) that will (eventually :) be merged
>>>>> in any case.
>>>> 
>>>> It certainly not my call. But if you ask me, introducing a solution for
>>>> a concrete use-case that requires API changes/enhancements is not
>>>> guaranteed to be the best solution. It may be better first to fully
>>>> understand the existing overheads and agree that there is no alternative
>>>> cleaner and more general solution with similar performance.
>>> 
>>> KVM already returns -EFAULT for these situations, the change I really want to land
>>> is to have KVM report detailed information about why the -EFAULT occurred.  I'll be
>>> happy to carry the code in KVM even if userspace never does anything beyond dumping
>>> the extra information on failures.
>> 
>> I thought that the change is to inform on page-faults through a new interface
>> instead of the existing uffd-file-based one. There is already another interface
>> (signals) and I thought (but did not upstream) io-uring. You now suggest yet
>> another one.
> 
> There are two capabilities proposed in this series: one to provide more information
> when KVM encounters a fault it can't resolve, and another to tell KVM to kick out
> to userspace instead of attempting to resolve a fault when a given address couldn't
> be resolved with fast gup.  I'm talking purely about the first one: providing more
> information when KVM exits.
> 
> As for the second, my plan is to try and stay out of the way and let people that
> actually deal with the userspace side of things settle on an approach.  From the
> KVM side, supporting the "don't wait to resolve faults" flag is quite simple so
> long as KVM punts the heavy lifting to userspace, e.g. identifying _why_ the address
> isn't mapped with the appropriate permissions.

Thanks for your kind and detailed reply. I removed the parts that I understand.

As you might guess, I focus on the second part, which you leave for others. The
interaction between two page-fault reporting mechanisms is not trivial, and it
is already not fully correct.

I understand that the approach Anish prefers is to do something that is tailored
for KVM, but I am not sure it is the best option.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-25  0:15                 ` Anish Moorthy
@ 2023-04-25  0:54                   ` Nadav Amit
  2023-04-27 16:38                     ` James Houghton
  2023-04-27 20:26                   ` Peter Xu
  1 sibling, 1 reply; 103+ messages in thread
From: Nadav Amit @ 2023-04-25  0:54 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Peter Xu, Axel Rasmussen, Paolo Bonzini, maz, oliver.upton,
	Sean Christopherson, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm



> On Apr 24, 2023, at 5:15 PM, Anish Moorthy <amoorthy@google.com> wrote:
> 
> On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> 
>> 
>>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
>>> 
>>> On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>>>> 
>>>> If I understand the problem correctly, it sounds as if the proper solution
>>>> should be some kind of a range-locks. If it is too heavy or the interface can
>>>> be changed/extended to wake a single address (instead of a range),
>>>> simpler hashed-locks can be used.
>>> 
>>> Some sort of range-based locking system does seem relevant, although I
>>> don't see how that would necessarily speed up the delivery of faults
>>> to UFFD readers: I'll have to think about it more.
>> 
>> Perhaps I misread your issue. Based on the scalability issues you raised,
>> I assumed that the problem you encountered is related to lock contention.
>> I do not know whether your profiled it, but some information would be
>> useful.
> 
> No, you had it right: the issue at hand is contention on the uffd wait
> queues. I'm just not sure what the range-based locking would really be
> doing. Events would still have to be delivered to userspace in an
> ordered manner, so it seems to me that each uffd would still need to
> maintain a queue (and the associated contention).

There are 2 queues: one for the pending faults that have not yet been
reported to userspace, and one for the faults that we might need to wake
up. The second one can use range locks.

Perhaps some hybrid approach would be best: do not block on page faults that
KVM runs into, which would remove the need to enqueue on fault_wqh.

But I do not know whether reporting through KVM instead of a
userfaultfd-based mechanism is very clean. I think that an io_uring-based
solution, such as the one I proposed before, would be more generic. Actually,
now that I understand your use-case better, you do not need a core to poll;
you would just be able to read the page-fault information from the io_uring.

Then, you can report whether the page fault blocked or not in a flag.

> 
> With respect to the "sharding" idea, I collected some more runs of the
> self test (full command in [1]). This time I omitted the "-a" flag, so
> that every vCPU accesses a different range of guest memory with its
> own UFFD, and set the number of reader threads per UFFD to 1.

Just wondering, did you run the benchmark with DONTWAKE? Sounds as if the
wake is not needed.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
  2023-04-12 21:35 ` [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
  2023-04-19 14:09   ` Hoo Robert
@ 2023-04-27 15:48   ` James Houghton
  2023-05-01 18:01     ` Anish Moorthy
  1 sibling, 1 reply; 103+ messages in thread
From: James Houghton @ 2023-04-27 15:48 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, seanjc, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

On Wed, Apr 12, 2023 at 2:35 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> Demonstrate a (very basic) scheme for supporting memory fault exits.
>
> From the vCPU threads:
> 1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
>    with the purpose of establishing the absent mappings. Do so with
>    wake_waiters=false to avoid serializing on the userfaultfd wait queue
>    locks.
>
> 2. When the UFFDIO_COPY/CONTINUE in (1) fails with EEXIST,
>    assume that the mapping was already established but is currently
>    absent [A] and attempt to populate it using MADV_POPULATE_WRITE.
>
> Issue UFFDIO_COPY/CONTINUEs from the reader threads as well, but with
> wake_waiters=true to ensure that any threads sleeping on the uffd are
> eventually woken up.
>
> A real VMM would track whether it had already COPY/CONTINUEd pages (eg,
> via a bitmap) to avoid calls destined to EEXIST. However, even the
> naive approach is enough to demonstrate the performance advantages of
> KVM_EXIT_MEMORY_FAULT.
>
> [A] In reality it is much likelier that the vCPU thread simply lost a
>     race to establish the mapping for the page.
>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
>  .../selftests/kvm/demand_paging_test.c        | 209 +++++++++++++-----
>  1 file changed, 155 insertions(+), 54 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> index e84dde345edbc..668bd63d944e7 100644
> --- a/tools/testing/selftests/kvm/demand_paging_test.c
> +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> @@ -15,6 +15,7 @@
>  #include <time.h>
>  #include <pthread.h>
>  #include <linux/userfaultfd.h>
> +#include <sys/mman.h>
>  #include <sys/syscall.h>
>
>  #include "kvm_util.h"
> @@ -31,6 +32,57 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
>  static size_t demand_paging_size;
>  static char *guest_data_prototype;
>
> +static int num_uffds;
> +static size_t uffd_region_size;
> +static struct uffd_desc **uffd_descs;
> +/*
> + * Delay when demand paging is performed through userfaultfd or directly by
> + * vcpu_worker in the case of a KVM_EXIT_MEMORY_FAULT.
> + */
> +static useconds_t uffd_delay;
> +static int uffd_mode;
> +
> +
> +static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
> +                                                                       bool is_vcpu);
> +
> +static void madv_write_or_err(uint64_t gpa)
> +{
> +       int r;
> +       void *hva = addr_gpa2hva(memstress_args.vm, gpa);
> +
> +       r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
> +       TEST_ASSERT(r == 0,
> +                               "MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) fail, errno %i\n",
> +                               (uintptr_t) hva, gpa, errno);
> +}
> +
> +static void ready_page(uint64_t gpa)
> +{
> +       int r, uffd;
> +
> +       /*
> +        * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
> +        * the registered ranges should fault in the physical pages through
> +        * MADV_POPULATE_WRITE.
> +        */
> +       if ((gpa < memstress_args.gpa)
> +               || (gpa >= memstress_args.gpa + memstress_args.size)) {
> +               madv_write_or_err(gpa);
> +       } else {
> +               if (uffd_delay)
> +                       usleep(uffd_delay);
> +
> +               uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
> +
> +               r = handle_uffd_page_request(uffd_mode, uffd,
> +                                       (uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
> +
> +               if (r == EEXIST)
> +                       madv_write_or_err(gpa);
> +       }
> +}
> +
>  static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
>  {
>         struct kvm_vcpu *vcpu = vcpu_args->vcpu;
> @@ -42,25 +94,36 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
>
>         clock_gettime(CLOCK_MONOTONIC, &start);
>
> -       /* Let the guest access its memory */
> -       ret = _vcpu_run(vcpu);
> -       TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> -       if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
> -               TEST_ASSERT(false,
> -                           "Invalid guest sync status: exit_reason=%s\n",
> -                           exit_reason_str(run->exit_reason));
> -       }
> +       while (true) {
> +               /* Let the guest access its memory */
> +               ret = _vcpu_run(vcpu);
> +               TEST_ASSERT(ret == 0
> +                                       || (errno == EFAULT
> +                                               && run->exit_reason == KVM_EXIT_MEMORY_FAULT),
> +                                       "vcpu_run failed: %d\n", ret);
> +               if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
> +
> +                       if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
> +                               ready_page(run->memory_fault.gpa);
> +                               continue;
> +                       }
> +
> +                       TEST_ASSERT(false,
> +                                               "Invalid guest sync status: exit_reason=%s\n",
> +                                               exit_reason_str(run->exit_reason));
> +               }
>
> -       ts_diff = timespec_elapsed(start);
> -       PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
> -                      ts_diff.tv_sec, ts_diff.tv_nsec);
> +               ts_diff = timespec_elapsed(start);
> +               PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
> +                                          ts_diff.tv_sec, ts_diff.tv_nsec);
> +               break;
> +       }
>  }
>
> -static int handle_uffd_page_request(int uffd_mode, int uffd,
> -               struct uffd_msg *msg)
> +static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
> +                                                                       bool is_vcpu)
>  {
>         pid_t tid = syscall(__NR_gettid);
> -       uint64_t addr = msg->arg.pagefault.address;
>         struct timespec start;
>         struct timespec ts_diff;
>         int r;
> @@ -71,56 +134,78 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
>                 struct uffdio_copy copy;
>
>                 copy.src = (uint64_t)guest_data_prototype;
> -               copy.dst = addr;
> +               copy.dst = hva;
>                 copy.len = demand_paging_size;
> -               copy.mode = 0;
> +               copy.mode = UFFDIO_COPY_MODE_DONTWAKE;
>
> -               r = ioctl(uffd, UFFDIO_COPY, &copy);
>                 /*
> -                * With multiple vCPU threads fault on a single page and there are
> -                * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
> -                * will fail with EEXIST: handle that case without signaling an
> -                * error.
> +                * With multiple vCPU threads and at least one of multiple reader threads
> +                * or vCPU memory faults, multiple vCPUs accessing an absent page will
> +                * almost certainly cause some thread doing the UFFDIO_COPY here to get
> +                * EEXIST: make sure to allow that case.
>                  */
> -               if (r == -1 && errno != EEXIST) {
> -                       pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
> -                                       addr, tid, errno);
> -                       return r;
> -               }
> +               r = ioctl(uffd, UFFDIO_COPY, &copy);
> +               TEST_ASSERT(r == 0 || errno == EEXIST,
> +                       "Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
> +                       gettid(), hva, errno);
>         } else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
> +               /* The comments in the UFFDIO_COPY branch also apply here. */
>                 struct uffdio_continue cont = {0};
>
> -               cont.range.start = addr;
> +               cont.range.start = hva;
>                 cont.range.len = demand_paging_size;
> +               cont.mode = UFFDIO_CONTINUE_MODE_DONTWAKE;
>
>                 r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
> -               /* See the note about EEXISTs in the UFFDIO_COPY branch. */
> -               if (r == -1 && errno != EEXIST) {
> -                       pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
> -                                       addr, tid, errno);
> -                       return r;
> -               }
> +               TEST_ASSERT(r == 0 || errno == EEXIST,
> +                       "Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
> +                       gettid(), hva, errno);
>         } else {
>                 TEST_FAIL("Invalid uffd mode %d", uffd_mode);
>         }
>
> +       /*
> +        * If the above UFFDIO_COPY/CONTINUE fails with EEXIST, it will do so without
> +        * waking threads waiting on the UFFD: make sure that happens here.
> +        */

This comment sounds a little bit strange because we're always passing
MODE_DONTWAKE to UFFDIO_COPY/CONTINUE.

You *could* update the comment to reflect what this test is really
doing, but I think you actually probably want the test to do what the
comment suggests. That is, I think the code you write should:
1. DONTWAKE if is_vcpu
2. UFFDIO_WAKE if !is_vcpu && UFFDIO_COPY/CONTINUE failed (with
EEXIST, but we would have already crashed if it weren't); see the
sketch below.

This way, we can save a syscall with almost no added complexity, and
the existing userfaultfd tests remain basically untouched (i.e., they
no longer always need an explicit UFFDIO_WAKE).
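
For concreteness, a rough, untested sketch of that flow for the
UFFDIO_COPY branch, reusing the selftest's existing locals (uffd, hva,
is_vcpu, demand_paging_size); the UFFDIO_CONTINUE branch would mirror it:

  copy.mode = is_vcpu ? UFFDIO_COPY_MODE_DONTWAKE : 0;
  r = ioctl(uffd, UFFDIO_COPY, &copy);
  TEST_ASSERT(r == 0 || errno == EEXIST,
              "UFFDIO_COPY on hva 0x%lx failed, errno = %d", hva, errno);

  /*
   * A successful COPY without DONTWAKE already wakes any waiters, so only
   * the reader threads' EEXIST case needs an explicit wake.
   */
  if (!is_vcpu && r != 0) {
          struct uffdio_range range = {
                  .start = hva,
                  .len = demand_paging_size,
          };
          TEST_ASSERT(ioctl(uffd, UFFDIO_WAKE, &range) == 0,
                      "UFFDIO_WAKE on hva 0x%lx failed, errno = %d",
                      hva, errno);
  }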

Thanks!

> +       if (!is_vcpu) {
> +               struct uffdio_range range = {
> +                       .start = hva,
> +                       .len = demand_paging_size
> +               };
> +               r = ioctl(uffd, UFFDIO_WAKE, &range);
> +               TEST_ASSERT(
> +                       r == 0,
> +                       "Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
> +                       gettid(), hva, errno);
> +       }
> +
>         ts_diff = timespec_elapsed(start);
>
>         PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
>                        timespec_to_ns(ts_diff));
>         PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
> -                      demand_paging_size, addr, tid);
> +                      demand_paging_size, hva, tid);
>
>         return 0;
>  }
>
> +static int handle_uffd_page_request_from_uffd(int uffd_mode, int uffd,
> +                               struct uffd_msg *msg)
> +{
> +       TEST_ASSERT(msg->event == UFFD_EVENT_PAGEFAULT,
> +               "Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
> +               msg->event);
> +       return handle_uffd_page_request(uffd_mode, uffd,
> +                                       msg->arg.pagefault.address, false);
> +}
> +
>  struct test_params {
> -       int uffd_mode;
>         bool single_uffd;
> -       useconds_t uffd_delay;
>         int readers_per_uffd;
>         enum vm_mem_backing_src_type src_type;
>         bool partition_vcpu_memory_access;
> +       bool memfault_exits;
>  };
>
>  static void prefault_mem(void *alias, uint64_t len)
> @@ -137,15 +222,26 @@ static void prefault_mem(void *alias, uint64_t len)
>  static void run_test(enum vm_guest_mode mode, void *arg)
>  {
>         struct test_params *p = arg;
> -       struct uffd_desc **uffd_descs = NULL;
>         struct timespec start;
>         struct timespec ts_diff;
>         struct kvm_vm *vm;
> -       int i, num_uffds = 0;
> -       uint64_t uffd_region_size;
> +       int i;
> +       uint32_t slot_flags = 0;
> +       bool uffd_memfault_exits = uffd_mode && p->memfault_exits;
> +
> +       if (uffd_memfault_exits) {
> +               TEST_ASSERT(kvm_has_cap(KVM_CAP_ABSENT_MAPPING_FAULT) > 0,
> +                                       "KVM does not have KVM_CAP_ABSENT_MAPPING_FAULT");
> +               slot_flags = KVM_MEM_ABSENT_MAPPING_FAULT;
> +       }
>
>         vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
> -                               1, 0, p->src_type, p->partition_vcpu_memory_access);
> +                               1, slot_flags, p->src_type, p->partition_vcpu_memory_access);
> +
> +       if (uffd_memfault_exits) {
> +               vm_enable_cap(vm,
> +                                         KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_FAULT_INFO_ENABLE);
> +       }
>
>         demand_paging_size = get_backing_src_pagesz(p->src_type);
>
> @@ -154,12 +250,12 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>                     "Failed to allocate buffer for guest data pattern");
>         memset(guest_data_prototype, 0xAB, demand_paging_size);
>
> -       if (p->uffd_mode) {
> +       if (uffd_mode) {
>                 num_uffds = p->single_uffd ? 1 : nr_vcpus;
>                 uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
>
>                 uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
> -               TEST_ASSERT(uffd_descs, "Memory allocation failed");
> +               TEST_ASSERT(uffd_descs, "Failed to allocate memory of uffd descriptors");
>
>                 for (i = 0; i < num_uffds; i++) {
>                         struct memstress_vcpu_args *vcpu_args;
> @@ -179,10 +275,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>                          * requests.
>                          */
>                         uffd_descs[i] = uffd_setup_demand_paging(
> -                               p->uffd_mode, p->uffd_delay, vcpu_hva,
> +                               uffd_mode, uffd_delay, vcpu_hva,
>                                 uffd_region_size,
>                                 p->readers_per_uffd,
> -                               &handle_uffd_page_request);
> +                               &handle_uffd_page_request_from_uffd);
>                 }
>         }
>
> @@ -196,7 +292,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>         ts_diff = timespec_elapsed(start);
>         pr_info("All vCPU threads joined\n");
>
> -       if (p->uffd_mode) {
> +       if (uffd_mode) {
>                 /* Tell the user fault fd handler threads to quit */
>                 for (i = 0; i < num_uffds; i++)
>                         uffd_stop_demand_paging(uffd_descs[i]);
> @@ -211,7 +307,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>         memstress_destroy_vm(vm);
>
>         free(guest_data_prototype);
> -       if (p->uffd_mode)
> +       if (uffd_mode)
>                 free(uffd_descs);
>  }
>
> @@ -220,7 +316,7 @@ static void help(char *name)
>         puts("");
>         printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
>                    "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
> -                  "          [-s type] [-v vcpus] [-o]\n", name);
> +                  "          [-w] [-s type] [-v vcpus] [-o]\n", name);
>         guest_modes_help();
>         printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
>                "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
> @@ -231,6 +327,7 @@ static void help(char *name)
>                "     FD handler to simulate demand paging\n"
>                "     overheads. Ignored without -u.\n");
>         printf(" -r: Set the number of reader threads per uffd.\n");
> +       printf(" -w: Enable kvm cap for memory fault exits.\n");
>         printf(" -b: specify the size of the memory region which should be\n"
>                "     demand paged by each vCPU. e.g. 10M or 3G.\n"
>                "     Default: 1G\n");
> @@ -250,29 +347,30 @@ int main(int argc, char *argv[])
>                 .partition_vcpu_memory_access = true,
>                 .readers_per_uffd = 1,
>                 .single_uffd = false,
> +               .memfault_exits = false,
>         };
>         int opt;
>
>         guest_modes_append_default();
>
> -       while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
> +       while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:r:")) != -1) {
>                 switch (opt) {
>                 case 'm':
>                         guest_modes_cmdline(optarg);
>                         break;
>                 case 'u':
>                         if (!strcmp("MISSING", optarg))
> -                               p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
> +                               uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
>                         else if (!strcmp("MINOR", optarg))
> -                               p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
> -                       TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
> +                               uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
> +                       TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
>                         break;
>                 case 'a':
>                         p.single_uffd = true;
>                         break;
>                 case 'd':
> -                       p.uffd_delay = strtoul(optarg, NULL, 0);
> -                       TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
> +                       uffd_delay = strtoul(optarg, NULL, 0);
> +                       TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
>                         break;
>                 case 'b':
>                         guest_percpu_mem_size = parse_size(optarg);
> @@ -295,6 +393,9 @@ int main(int argc, char *argv[])
>                                                 "Invalid number of readers per uffd %d: must be >=1",
>                                                 p.readers_per_uffd);
>                         break;
> +               case 'w':
> +                       p.memfault_exits = true;
> +                       break;
>                 case 'h':
>                 default:
>                         help(argv[0]);
> @@ -302,7 +403,7 @@ int main(int argc, char *argv[])
>                 }
>         }
>
> -       if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
> +       if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
>             !backing_src_is_shared(p.src_type)) {
>                 TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
>         }
> --
> 2.40.0.577.gac1e443424-goog
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-25  0:54                   ` Nadav Amit
@ 2023-04-27 16:38                     ` James Houghton
  0 siblings, 0 replies; 103+ messages in thread
From: James Houghton @ 2023-04-27 16:38 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Anish Moorthy, Peter Xu, Axel Rasmussen, Paolo Bonzini, maz,
	oliver.upton, Sean Christopherson, bgardon, dmatlack, ricarkol,
	kvm, kvmarm

On Mon, Apr 24, 2023 at 5:54 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
>
>
> > On Apr 24, 2023, at 5:15 PM, Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>
> >>
> >>
> >>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> >>>
> >>> On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>>>
> >>>> If I understand the problem correctly, it sounds as if the proper solution
> >>>> should be some kind of a range-locks. If it is too heavy or the interface can
> >>>> be changed/extended to wake a single address (instead of a range),
> >>>> simpler hashed-locks can be used.
> >>>
> >>> Some sort of range-based locking system does seem relevant, although I
> >>> don't see how that would necessarily speed up the delivery of faults
> >>> to UFFD readers: I'll have to think about it more.
> >>
> >> Perhaps I misread your issue. Based on the scalability issues you raised,
> >> I assumed that the problem you encountered is related to lock contention.
> >> I do not know whether your profiled it, but some information would be
> >> useful.
> >
> > No, you had it right: the issue at hand is contention on the uffd wait
> > queues. I'm just not sure what the range-based locking would really be
> > doing. Events would still have to be delivered to userspace in an
> > ordered manner, so it seems to me that each uffd would still need to
> > maintain a queue (and the associated contention).
>
> There are 2 queues. One for the pending faults that were still not reported
> to userspace, and one for the faults that we might need to wake up. The second
> one can have range locks.
>
> Perhaps some hybrid approach would be best: do not block on page-faults that
> KVM runs into, which would prevent you from the need to enqueue on fault_wqh.

Hi Nadav,

If we don't block on the page faults that KVM runs into, what are you
suggesting that these threads do?

1. If you're saying that we should kick the threads out to userspace
and then read the page fault event, then I would say that it's just
unnecessary complexity. (Seems like this is what you mean from what
you said below.)
2. If you're saying they should busy-wait, then unfortunately we can't
afford that.
3. If it's neither of those, could you clarify?

>
> But I do not know whether the reporting through KVM instead of
> userfaultfd-based mechanism is very clean. I think that an IO-uring based
> solution, such as the one I proposed before, would be more generic. Actually,
> now that I understand better your use-case, you do not need a core to poll
> and you would just be able to read the page-fault information from the IO-uring.
>
> Then, you can report whether the page-fault blocked or not in a flag.

This is a fine idea, but I don't think the required complexity is
worth it. The memory fault info reporting piece of this series is
relatively uncontentious, so let's assume we have it at our disposal.

Now, the complexity to make KVM only attempt fast GUP (and EFAULT if
it fails) is really minimal. We automatically know that we don't need
to WAKE and which address to make ready.  Userspace is also able to
resolve the fault: UFFDIO_CONTINUE if we haven't already, then
MADV_POPULATE_WRITE if we have (forces userspace page tables to be
populated if they haven't been, potentially going through userfaultfd
to do so, i.e., if UFFDIO_CONTINUE wasn't already done).
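
For concreteness, that resolution path for a MINOR-registered region might
look roughly like the following (a sketch mirroring the selftest above;
uffd, hva, and page_size are assumed to be tracked by the VMM, and real
code would also check for errors other than EEXIST):

  struct uffdio_continue cont = {
          .range = { .start = hva, .len = page_size },
          .mode  = UFFDIO_CONTINUE_MODE_DONTWAKE,
  };

  if (ioctl(uffd, UFFDIO_CONTINUE, &cont) && errno == EEXIST) {
          /*
           * Another thread already CONTINUEd this page; just force the
           * userspace page tables to be populated.
           */
          madvise((void *)hva, page_size, MADV_POPULATE_WRITE);
  }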

It sounds like what you're suggesting is something like:
1. KVM attempts fast GUP then slow GUP.
2. In slow GUP, queue a "non-blocking" userfault, but don't go to
sleep (return with VM_FAULT_SIGBUS or something).
3. The vCPU thread gets kicked out to userspace with EFAULT (+ fault
info if we've enabled it).
4. Read a fault from the userfaultfd or io_uring.
5. Make the page ready, and if it were non-blocking, then don't WAKE.

I have some questions/thoughts with this approach:
1. Is io_uring the only way to make reading from a userfaultfd scale?
Maybe it's possible to avoid using a wait_queue for "non-blocking"
faults, but then we'd need a special read() API specifically to
*avoid* the standard fault_pending_wqh queue. Either approach will be
quite complex.
2. We'll still need to annotate KVM in the same-ish place to tell
userfaultfd that the fault should be non-blocking, but we'll probably
*also* need something like GUP_USERFAULT_NONBLOCK and/or
FAULT_FLAG_USERFAULT_NOBLOCK. (UFFD_FEATURE_SIGBUS does not exactly
solve this problem either.)
3. If the vCPU thread is getting kicked out to userspace, it seems
like there is no way for it to find/read the #pf it generated. This
seems problematic.

>
> >
> > With respect to the "sharding" idea, I collected some more runs of the
> > self test (full command in [1]). This time I omitted the "-a" flag, so
> > that every vCPU accesses a different range of guest memory with its
> > own UFFD, and set the number of reader threads per UFFD to 1.
>
> Just wondering, did you run the benchmark with DONTWAKE? Sounds as if the
> wake is not needed.
>

Anish's selftest only WAKEs when it's necessary[1]. IOW, we only WAKE
when we actually read the #pf from the userfaultfd. If we were to WAKE
for each fault, we wouldn't get much of a scalability improvement at
all (we would still be contending on the wait_queue locks, just not
quite as much as before).

[1]: https://lore.kernel.org/kvm/20230412213510.1220557-23-amoorthy@google.com/

Thanks for your insights/suggestions, Nadav.

- James

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-25  0:15                 ` Anish Moorthy
  2023-04-25  0:54                   ` Nadav Amit
@ 2023-04-27 20:26                   ` Peter Xu
  2023-05-03 19:45                     ` Anish Moorthy
  1 sibling, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-04-27 20:26 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Nadav Amit, Axel Rasmussen, Paolo Bonzini, maz, oliver.upton,
	Sean Christopherson, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm

Hi, Anish,

On Mon, Apr 24, 2023 at 05:15:49PM -0700, Anish Moorthy wrote:
> On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >
> >
> >
> > > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote:
> > >
> > > On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> > >>
> > >> If I understand the problem correctly, it sounds as if the proper solution
> > >> should be some kind of a range-locks. If it is too heavy or the interface can
> > >> be changed/extended to wake a single address (instead of a range),
> > >> simpler hashed-locks can be used.
> > >
> > > Some sort of range-based locking system does seem relevant, although I
> > > don't see how that would necessarily speed up the delivery of faults
> > > to UFFD readers: I'll have to think about it more.
> >
> > Perhaps I misread your issue. Based on the scalability issues you raised,
> > I assumed that the problem you encountered is related to lock contention.
> > I do not know whether your profiled it, but some information would be
> > useful.
> 
> No, you had it right: the issue at hand is contention on the uffd wait
> queues. I'm just not sure what the range-based locking would really be
> doing. Events would still have to be delivered to userspace in an
> ordered manner, so it seems to me that each uffd would still need to
> maintain a queue (and the associated contention).
> 
> With respect to the "sharding" idea, I collected some more runs of the
> self test (full command in [1]). This time I omitted the "-a" flag, so
> that every vCPU accesses a different range of guest memory with its
> own UFFD, and set the number of reader threads per UFFD to 1.
> 
> vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> 1      180     307
> 2       85      220
> 4       80      206
> 8       39     163
> 16     18     104
> 32      8      73
> 64      4      57
> 128    1      37
> 256    1      16
> 
> I'm reporting paging rate on a per-vcpu rather than total basis, which
> is why the numbers look so different than the ones in the cover
> letter. I'm actually not sure why the demand paging rate falls off
> with the number of vCPUs (maybe a prioritization issue on my side?),
> but even when UFFDs aren't being contended for it's clear that demand
> paging via memory fault exits is significantly faster.
> 
> I'll try to get some perf traces as well: that will take a little bit
> of time though, as to do it for cycler will involve patching our VMM
> first.
> 
> [1] ./demand_paging_test -b 64M -u MINOR -s shmem -v <n> -r 1 [-w]

Thanks (for doing this test, and also to Nadav for all his inputs), and
sorry for the late response.

These numbers caught my eye, and I'm very curious why even 2 vcpus can
scale that badly.

I gave it a shot on a test machine and I got something slightly different:

  Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)
  $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N
  |-------+----------+--------|
  | n_thr | per-vcpu | total  |
  |-------+----------+--------|
  |     1 | 39.5K    | 39.5K  |
  |     2 | 33.8K    | 67.6K  |
  |     4 | 31.8K    | 127.2K |
  |     8 | 30.8K    | 246.1K |
  |    16 | 21.9K    | 351.0K |
  |-------+----------+--------|

I used larger ram due to having fewer cores.  I didn't try 32+ vcpus, to
make sure I don't already have two threads contending on a core/thread
since I only have 40 hardware threads there, but we can still compare with
your lower half.

When I was testing I noticed bad numbers and another bug from not using
NSEC_PER_SEC properly, so I applied this before the test:

https://lore.kernel.org/all/20230427201112.2164776-1-peterx@redhat.com/

I think it means it still doesn't scale that well, however not so badly
either - no obvious 1/2 drop when using 2 vcpus.  There are still a bunch
of paths triggered in the test so I also don't expect it to scale fully
linearly.  From my numbers I just didn't see a drop as drastic as yours.
I'm not sure whether it's simply a broken test number, parameter
differences (e.g. you used only 64M per-vcpu), or hardware differences.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
  2023-04-27 15:48   ` James Houghton
@ 2023-05-01 18:01     ` Anish Moorthy
  0 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-05-01 18:01 UTC (permalink / raw)
  To: James Houghton
  Cc: pbonzini, maz, oliver.upton, seanjc, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

On Thu, Apr 27, 2023 at 8:48 AM James Houghton <jthoughton@google.com> wrote:
>
> This comment sounds a little bit strange because we're always passing
> MODE_DONTWAKE to UFFDIO_COPY/CONTINUE.
>
> You *could* update the comment to reflect what this test is really
> doing, but I think you actually probably want the test to do what the
> comment suggests. That is, I think the code you should write should:
> 1. DONTWAKE if is_vcpu
> 2. UFFDIO_WAKE if !is_vcpu && UFFDIO_COPY/CONTINUE failed (with
> EEXIST, but we would have already crashed if it weren't).
>
> This way, we can save a syscall with almost no added complexity, and
> the existing userfaultfd tests remain basically untouched (i.e., no
> longer always need an explicit UFFDIO_WAKE).
>
> Thanks!

Good points, and taken: though in practice I suspect that every fault
read from the uffd will EEXIST and necessitate the wake anyways.
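
For reference, a minimal sketch of the flow being described (assuming the
handler already has the uffd, fault address, page size and an is_vcpu flag
from the surrounding test code - a sketch only, not the actual selftest
change):

	#include <errno.h>
	#include <stdbool.h>
	#include <sys/ioctl.h>
	#include <linux/userfaultfd.h>

	static void resolve_minor_fault(int uffd, unsigned long addr,
					unsigned long page_size, bool is_vcpu)
	{
		struct uffdio_continue cont = {
			.range = { .start = addr, .len = page_size },
			/* vCPU faults get retried via KVM_RUN, so never wake them. */
			.mode = is_vcpu ? UFFDIO_CONTINUE_MODE_DONTWAKE : 0,
		};

		if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == 0 || is_vcpu)
			return;

		/*
		 * A non-vCPU fault whose CONTINUE failed with EEXIST (someone
		 * else resolved it first) was never woken by the ioctl, so
		 * wake the blocked thread explicitly.
		 */
		if (errno == EEXIST) {
			struct uffdio_range range = {
				.start = addr, .len = page_size,
			};
			ioctl(uffd, UFFDIO_WAKE, &range);
		}
	}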

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
  2023-04-12 21:34 ` [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN Anish Moorthy
@ 2023-05-02 17:17   ` Anish Moorthy
  2023-05-02 18:51     ` Sean Christopherson
  0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-05-02 17:17 UTC (permalink / raw)
  To: pbonzini, maz
  Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

During some testing yesterday I realized that this patch actually
breaks the self test, causing an error which the later self test
changes cover up.

Running "./demand_paging_test -b 512M -u MINOR -s shmem -v 1" from
kvm/next (b3c98052d469) with just this patch applied gives the
following output:

> # ./demand_paging_test -b 512M -u MINOR -s shmem -v 1
> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> guest physical test memory: [0x7fcdfffe000, 0x7fcffffe000)
> Finished creating vCPUs and starting uffd threads
> Started all vCPUs
> ==== Test Assertion Failure ====
>  demand_paging_test.c:50: false
>  pid=13293 tid=13297 errno=4 - Interrupted system call
>  // Some stack trace stuff
>  Invalid guest sync status: exit_reason=UNKNOWN, ucall=0

The problem is the get_ucall() part of the following block in the self
test's vcpu_worker()

> ret = _vcpu_run(vcpu);
> TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
>    TEST_ASSERT(false,
>                               "Invalid guest sync status: exit_reason=%s\n",
>                               exit_reason_str(run->exit_reason));
> }

I took a look and, while get_ucall() does depend on the value of
exit_reason, the error's root cause isn't clear to me yet.

Moving the "exit_reason = kvm_exit_unknown" line to later in the
function, right above the vcpu_run() call "fixes" the problem. I've
done that for now and will bisect later to investigate: if anyone
has any clues please let me know.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
  2023-05-02 17:17   ` Anish Moorthy
@ 2023-05-02 18:51     ` Sean Christopherson
  2023-05-02 19:49       ` Anish Moorthy
  0 siblings, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2023-05-02 18:51 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Tue, May 02, 2023, Anish Moorthy wrote:
> During some testing yesterday I realized that this patch actually
> breaks the self test, causing an error which the later self test
> changes cover up.
> 
> Running "./demand_paging_test -b 512M -u MINOR -s shmem -v 1" from
> kvm/next (b3c98052d469) with just this patch applies gives the
> following output
> 
> > # ./demand_paging_test -b 512M -u MINOR -s shmem -v 1
> > Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> > guest physical test memory: [0x7fcdfffe000, 0x7fcffffe000)
> > Finished creating vCPUs and starting uffd threads
> > Started all vCPUs
> > ==== Test Assertion Failure ====
> >  demand_paging_test.c:50: false
> >  pid=13293 tid=13297 errno=4 - Interrupted system call
> >  // Some stack trace stuff
> >  Invalid guest sync status: exit_reason=UNKNOWN, ucall=0
> 
> The problem is the get_ucall() part of the following block in the self
> test's vcpu_worker()
> 
> > ret = _vcpu_run(vcpu);
> > TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> > if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
> >    TEST_ASSERT(false,
> >                               "Invalid guest sync status: exit_reason=%s\n",
> >                               exit_reason_str(run->exit_reason));
> > }
> 
> I took a look and, while get_ucall() does depend on the value of
> exit_reason, the error's root cause isn't clear to me yet.

Stating what you likely already know... On x86, the UCALL is performed via port
I/O, and so the selftests framework zeros out the ucall struct if the userspace
exit reason isn't KVM_EXIT_IO.

> Moving the "exit_reason = kvm_exit_unknown" line to later in the
> function, right above the vcpu_run() call "fixes" the problem. I've
> done that for now and will bisect later to investigate: if anyone
> has any clues please let me know.

Clobbering vcpu->run->exit_reason before this code block is a bug:

	if (unlikely(vcpu->arch.complete_userspace_io)) {
		int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
		vcpu->arch.complete_userspace_io = NULL;
		r = cui(vcpu);
		if (r <= 0)
			goto out;
	} else {
		WARN_ON_ONCE(vcpu->arch.pio.count);
		WARN_ON_ONCE(vcpu->mmio_needed);
	}

	if (kvm_run->immediate_exit) {
		r = -EINTR;
		goto out;
	}

For userspace I/O and MMIO, KVM requires userspace to "complete" the instruction
that triggered the exit to userspace, e.g. write memory/registers and skip the
instruction as needed.  The immediate_exit flag is set by userspace when userspace
wants to retain control and is doing KVM_RUN purely to placate KVM.  In selftests,
this is done by vcpu_run_complete_io().
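
For reference, vcpu_run_complete_io() in the selftest library is roughly the
following (from memory, so details may differ slightly):

	void vcpu_run_complete_io(struct kvm_vcpu *vcpu)
	{
		int ret;

		/* Let KVM complete the pending I/O, but don't enter the guest. */
		vcpu->run->immediate_exit = 1;
		ret = __vcpu_run(vcpu);
		vcpu->run->immediate_exit = 0;

		TEST_ASSERT(ret == -1 && errno == EINTR,
			    "KVM_RUN didn't exit immediately, rc: %i, errno: %i",
			    ret, errno);
	}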

The one part I'm a bit surprised by is that this caused ucall problems.  The ucall
framework invokes vcpu_run_complete_io() _after_ it grabs the information. 

	addr = ucall_arch_get_ucall(vcpu);
	if (addr) {
		TEST_ASSERT(addr != (void *)GUEST_UCALL_FAILED,
			    "Guest failed to allocate ucall struct");

		memcpy(uc, addr, sizeof(*uc));
		vcpu_run_complete_io(vcpu);
	} else {
		memset(uc, 0, sizeof(*uc));
	}

Making multiple calls to get_ucall() after a single guest ucall would explain
everything as only the first get_ucall() would succeed, but AFAICT the test doesn't
invoke get_ucall() multiple times.

Aha!  Found it.  _vcpu_run() invokes assert_on_unhandled_exception(), which does

	if (get_ucall(vcpu, &uc) == UCALL_UNHANDLED) {
		uint64_t vector = uc.args[0];

		TEST_FAIL("Unexpected vectored event in guest (vector:0x%lx)",
			  vector);
	}

and thus triggers vcpu_run_complete_io() before demand_paging_test's vcpu_worker()
gets control and does _its_ get_ucall().

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
  2023-05-02 18:51     ` Sean Christopherson
@ 2023-05-02 19:49       ` Anish Moorthy
  2023-05-02 20:41         ` Sean Christopherson
  0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-05-02 19:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Thanks for nailing this down for me! One more question: should we be
concerned about any guest memory accesses occurring in the preamble to
that vcpu_run() call in kvm_arch_vcpu_ioctl_run()?

I only see two spots from which an EFAULT could make it to userspace,
those being the sync_regs() and cui() calls. The former looks clean
but I'm not sure about the latter. As written it's not an issue per se
if the cui() call tries a vCPU memory access- the
kvm_populate_efault_info() helper will just not populate the run
struct and WARN_ON_ONCE(). But it would be good to know about.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
  2023-05-02 19:49       ` Anish Moorthy
@ 2023-05-02 20:41         ` Sean Christopherson
  2023-05-02 21:46           ` Anish Moorthy
  0 siblings, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2023-05-02 20:41 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Tue, May 02, 2023, Anish Moorthy wrote:
> Thanks for nailing this down for me! One more question: should we be
> concerned about any guest memory accesses occurring in the preamble to
> that vcpu_run() call in kvm_arch_vcpu_ioctl_run()?
> 
> I only see two spots from which an EFAULT could make it to userspace,
> those being the sync_regs() and cui() calls. The former looks clean

Ya, sync_regs() is a non-issue, that doesn't touch guest memory unless userspace
is doing something truly bizarre.

> but I'm not sure about the latter. As written it's not an issue per se
> if the cui() call tries a vCPU memory access- the
> kvm_populate_efault_info() helper will just not populate the run
> struct and WARN_ON_ONCE(). But it would be good to know about.

If KVM triggers a WARN_ON_ONCE(), then that's an issue.  Though looking at the
code, the cui() aspect is a moot point.  As I stated in the previous discussion,
the WARN_ON_ONCE() in question needs to be off-by-default.

 : Hmm, one idea would be to have the initial -EFAULT detection fill kvm_run.memory_fault,
 : but set kvm_run.exit_reason to some magic number, e.g. zero it out.  Then KVM could
 : WARN if something tries to overwrite kvm_run.exit_reason.  The WARN would need to
 : be buried by a Kconfig or something since kvm_run can be modified by userspace,
 : but other than that I think it would work.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
  2023-05-02 20:41         ` Sean Christopherson
@ 2023-05-02 21:46           ` Anish Moorthy
  2023-05-02 22:31             ` Sean Christopherson
  0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-05-02 21:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Tue, May 2, 2023 at 1:41 PM Sean Christopherson <seanjc@google.com> wrote:
>
> If KVM triggers a WARN_ON_ONCE(), then that's an issue.  Though looking at the
> code, the cui() aspect is a moot point.  As I stated in the previous discussion,
> the WARN_ON_ONCE() in question needs to be off-by-default.
>
>  : Hmm, one idea would be to have the initial -EFAULT detection fill kvm_run.memory_fault,
>  : but set kvm_run.exit_reason to some magic number, e.g. zero it out.  Then KVM could
>  : WARN if something tries to overwrite kvm_run.exit_reason.  The WARN would need to
>  : be buried by a Kconfig or something since kvm_run can be modified by userspace,
>  : but other than that I think it would work.

Ah, ok: I thought using WARN_ON_ONCE instead of WARN might have
obviated the Kconfig. I'll go add one.
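
For what it's worth, I'm picturing something along these lines (the Kconfig
name is a placeholder, and the memory_fault field names may not match the
series exactly):

	static void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
					     u64 gpa, u64 len)
	{
		struct kvm_run *run = vcpu->run;

		/*
		 * Another exit is already staged in kvm_run: don't clobber it.
		 * Since kvm_run is writable by userspace this isn't necessarily
		 * a KVM bug, so keep the WARN behind an off-by-default Kconfig.
		 */
		if (run->exit_reason != KVM_EXIT_UNKNOWN) {
			if (IS_ENABLED(CONFIG_KVM_WARN_MEMORY_FAULT_INFO))
				WARN_ON_ONCE(1);
			return;
		}

		run->exit_reason = KVM_EXIT_MEMORY_FAULT;
		run->memory_fault.gpa = gpa;
		run->memory_fault.len = len;
	}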

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
  2023-05-02 21:46           ` Anish Moorthy
@ 2023-05-02 22:31             ` Sean Christopherson
  0 siblings, 0 replies; 103+ messages in thread
From: Sean Christopherson @ 2023-05-02 22:31 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Tue, May 02, 2023, Anish Moorthy wrote:
> On Tue, May 2, 2023 at 1:41 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > If KVM triggers a WARN_ON_ONCE(), then that's an issue.  Though looking at the
> > code, the cui() aspect is a moot point.  As I stated in the previous discussion,
> > the WARN_ON_ONCE() in question needs to be off-by-default.
> >
> >  : Hmm, one idea would be to have the initial -EFAULT detection fill kvm_run.memory_fault,
> >  : but set kvm_run.exit_reason to some magic number, e.g. zero it out.  Then KVM could
> >  : WARN if something tries to overwrite kvm_run.exit_reason.  The WARN would need to
> >  : be buried by a Kconfig or something since kvm_run can be modified by userspace,
> >  : but other than that I think it would work.
> 
> Ah, ok: I thought using WARN_ON_ONCE instead of WARN might have
> obviated the Kconfig. I'll go add one.

Don't put too much effort into anything at this point.  I'm not entirely convinced
that it's worth carrying a Kconfig for this one-off case (my "suggestion" was mostly
just me spitballing), and at a quick glance through the rest of the series, I'll
definitely have more comments when I do a full review, i.e. things may change too.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-27 20:26                   ` Peter Xu
@ 2023-05-03 19:45                     ` Anish Moorthy
  2023-05-03 20:09                       ` Sean Christopherson
       [not found]                       ` <ZFLPlRReglM/Vgfu@x1n>
  0 siblings, 2 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-05-03 19:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nadav Amit, Axel Rasmussen, Paolo Bonzini, maz, oliver.upton,
	Sean Christopherson, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm

[-- Attachment #1: Type: text/plain, Size: 3310 bytes --]

On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@redhat.com> wrote:
>
> Thanks (for doing this test, and also to Nadav for all his inputs), and
> sorry for a late response.

No need to apologize: anyways, I've got you comfortably beat on being
late at this point :)

> These numbers caught my eye, and I'm very curious why even 2 vcpus can
> scale that bad.
>
> I gave it a shot on a test machine and I got something slightly different:
>
>   Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)
>   $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N
>   |-------+----------+--------|
>   | n_thr | per-vcpu | total  |
>   |-------+----------+--------|
>   |     1 | 39.5K    | 39.5K  |
>   |     2 | 33.8K    | 67.6K  |
>   |     4 | 31.8K    | 127.2K |
>   |     8 | 30.8K    | 246.1K |
>   |    16 | 21.9K    | 351.0K |
>   |-------+----------+--------|
>
> I used larger ram due to less cores.  I didn't try 32+ vcpus to make sure I
> don't have two threads content on a core/thread already since I only got 40
> hardware threads there, but still we can compare with your lower half.
>
> When I was testing I noticed bad numbers and another bug on not using
> NSEC_PER_SEC properly, so I did this before the test:
>
> https://lore.kernel.org/all/20230427201112.2164776-1-peterx@redhat.com/
>
> I think it means it still doesn't scale that good, however not so bad
> either - no obvious 1/2 drop on using 2vcpus.  There're still a bunch of
> paths triggered in the test so I also don't expect it to fully scale
> linearly.  From my numbers I just didn't see as drastic as yours. I'm not
> sure whether it's simply broken test number, parameter differences
> (e.g. you used 64M only per-vcpu), or hardware differences.

Hmm, I suspect we're dealing with hardware differences here. I
rebased my changes onto those two patches you sent up, taking care not
to clobber them, but even with the repro command you provided my
results look very different than yours (at least on 1-4 vcpus) on the
machine I've been testing on (4x AMD EPYC 7B13 64-Core, 2.2GHz).

(n=20)
n_thr   per_vcpu   total
1       154K       154K
2       92K        184K
4       71K        285K
8       36K        291K
16      19K        310K

Out of interest I tested on another machine (Intel(R) Xeon(R)
Platinum 8273CL CPU @ 2.20GHz) as well, and the results are a bit
different again:

(n=20)
n_thr   per_vcpu   total
1       115K       115K
2       103K       206K
4       65K        262K
8       39K        319K
16      19K        398K

It is interesting how all three sets of numbers start off different
but seem to converge around 16 vCPUs. I did check to make sure the
memory fault exits sped things up in all cases, and that at least
stays true.

By the way, I've got a little helper script that I've been using to
run/average the selftest results (which can vary quite a bit). I've
attached it below - hopefully it doesn't bounce from the mailing list.
Just for reference, the invocation to test the command you provided is:

> python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M

[-- Attachment #2: dp_runner.py --]
[-- Type: text/x-python, Size: 2561 bytes --]

import subprocess
import argparse
import re

def get_command(percpu_mem, cores, single_uffd, use_memfaults, overlap_vcpus):
   if overlap_vcpus and not single_uffd:
       raise RuntimeError("Overlapping vcpus but not using single uffd, very strange")
   return "./demand_paging_test -s shmem -u MINOR " \
        + " -b " + percpu_mem \
        + (" -a " if single_uffd or overlap_vcpus else "") \
        + (" -o " if overlap_vcpus else "") \
        + " -v " + str(cores) \
        + " -r " + (str(cores) if single_uffd or overlap_vcpus else "1") \
        + (" -w" if use_memfaults else "") \
        + "; exit 0"

def run_command(cmd):
    # text=True decodes stdout to str so the string regexes below can match
    # under Python 3 (check_output otherwise returns bytes).
    output = subprocess.check_output(cmd, shell=True, text=True)
    v_paging_rate_re = r"Per-vcpu demand paging rate:\s*(.*) pgs/sec"
    t_paging_rate_re = r"Overall demand paging rate:\s*(.*) pgs/sec"
    v_match = re.search(v_paging_rate_re, output, re.MULTILINE)
    t_match = re.search(t_paging_rate_re, output, re.MULTILINE)
    # Returns (per-vcpu rate, overall rate) as floats, in pgs/sec.
    return float(v_match.group(1)), float(t_match.group(1))

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--num_runs", type=int, dest='num_runs', required=True)
    ap.add_argument("--max_cores", type=int, dest='max_cores', required=True)
    ap.add_argument("--percpu_mem", type=str, dest='percpu_mem', required=True)
    ap.add_argument("--oneuffd", type=bool, dest='oneuffd')
    ap.add_argument("--overlap", type=bool, dest='overlap')
    ap.add_argument("--memfaults", type=bool, dest='memfaults')

    args = ap.parse_args()

    print("Testing configuration: " + str(args))
    print("")

    cores = 1
    cores_arr = []
    results = []
    while cores <= args.max_cores:
        cmd = get_command(args.percpu_mem, cores, args.oneuffd, args.memfaults, args.overlap)
        if cores == 1 or cores == 2:
            print("cmd = " + cmd)

        print("Testing cores = " + str(cores))
        full_results = [run_command(cmd) for _ in range(args.num_runs)]
        v_rates = [f[0] for f in full_results]
        t_rates = [f[1] for f in full_results]

        def print_rates(tag, rates):
            average = sum(rates) / len(rates)
            # Print the average rate in thousands of pages/sec, truncated to
            # two decimal places.
            print(tag + ":\t\t" + str(int(average / 10) / 100))

        print_rates("Vcpu demand paging rate", v_rates)
        print_rates("Total demand paging rate", t_rates)

        cores_arr.append(cores)
        results.append((cores, v_rates, t_rates))
        cores *= 2

    for c, v_rates, t_rates in results:
        print("Full results on core " + str(c) + " :\n" + str(v_rates) + "\n" + str(t_rates))


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-03 19:45                     ` Anish Moorthy
@ 2023-05-03 20:09                       ` Sean Christopherson
       [not found]                       ` <ZFLPlRReglM/Vgfu@x1n>
  1 sibling, 0 replies; 103+ messages in thread
From: Sean Christopherson @ 2023-05-03 20:09 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Peter Xu, Nadav Amit, Axel Rasmussen, Paolo Bonzini, maz,
	oliver.upton, James Houghton, bgardon, dmatlack, ricarkol, kvm,
	kvmarm

On Wed, May 03, 2023, Anish Moorthy wrote:
> On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Thanks (for doing this test, and also to Nadav for all his inputs), and
> > sorry for a late response.
> 
> No need to apologize: anyways, I've got you comfortably beat on being
> late at this point :)

LOL, hold my beer and let me show you the true meaning of "late response". :-)

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
       [not found]                       ` <ZFLPlRReglM/Vgfu@x1n>
@ 2023-05-03 21:27                         ` Peter Xu
  2023-05-03 21:42                           ` Sean Christopherson
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-05-03 21:27 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Nadav Amit, Axel Rasmussen, Paolo Bonzini, maz, oliver.upton,
	Sean Christopherson, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm

Oops, bounced back from the list..

Forward with no attachment this time - I assume the information is still
enough in the paragraphs even without the flamegraphs.  Sorry for the
noise.

On Wed, May 03, 2023 at 05:18:13PM -0400, Peter Xu wrote:
> On Wed, May 03, 2023 at 12:45:07PM -0700, Anish Moorthy wrote:
> > On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Thanks (for doing this test, and also to Nadav for all his inputs), and
> > > sorry for a late response.
> > 
> > No need to apologize: anyways, I've got you comfortably beat on being
> > late at this point :)
> > 
> > > These numbers caught my eye, and I'm very curious why even 2 vcpus can
> > > scale that bad.
> > >
> > > I gave it a shot on a test machine and I got something slightly different:
> > >
> > >   Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)
> > >   $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N
> > >   |-------+----------+--------|
> > >   | n_thr | per-vcpu | total  |
> > >   |-------+----------+--------|
> > >   |     1 | 39.5K    | 39.5K  |
> > >   |     2 | 33.8K    | 67.6K  |
> > >   |     4 | 31.8K    | 127.2K |
> > >   |     8 | 30.8K    | 246.1K |
> > >   |    16 | 21.9K    | 351.0K |
> > >   |-------+----------+--------|
> > >
> > > I used larger ram due to less cores.  I didn't try 32+ vcpus to make sure I
> > > don't have two threads content on a core/thread already since I only got 40
> > > hardware threads there, but still we can compare with your lower half.
> > >
> > > When I was testing I noticed bad numbers and another bug on not using
> > > NSEC_PER_SEC properly, so I did this before the test:
> > >
> > > https://lore.kernel.org/all/20230427201112.2164776-1-peterx@redhat.com/
> > >
> > > I think it means it still doesn't scale that good, however not so bad
> > > either - no obvious 1/2 drop on using 2vcpus.  There're still a bunch of
> > > paths triggered in the test so I also don't expect it to fully scale
> > > linearly.  From my numbers I just didn't see as drastic as yours. I'm not
> > > sure whether it's simply broken test number, parameter differences
> > > (e.g. you used 64M only per-vcpu), or hardware differences.
> > 
> > Hmm, I suspect we're dealing with  hardware differences here. I
> > rebased my changes onto those two patches you sent up, taking care not
> > to clobber them, but even with the repro command you provided my
> > results look very different than yours (at least on 1-4 vcpus) on the
> > machine I've been testing on (4x AMD EPYC 7B13 64-Core, 2.2GHz).
> > 
> > (n=20)
> > n_thr      per_vcpu       total
> > 1            154K              154K
> > 2             92k                184K
> > 4             71K                285K
> > 8             36K                291K
> > 16           19K                310K
> > 
> > Out of interested I tested on another machine (Intel(R) Xeon(R)
> > Platinum 8273CL CPU @ 2.20GHz) as well, and results are a bit
> > different again
> > 
> > (n=20)
> > n_thr      per_vcpu       total
> > 1            115K              115K
> > 2             103k              206K
> > 4             65K                262K
> > 8             39K                319K
> > 16           19K                398K
> 
> Interesting.
> 
> > 
> > It is interesting how all three sets of numbers start off different
> > but seem to converge around 16 vCPUs. I did check to make sure the
> > memory fault exits sped things up in all cases, and that at least
> > stays true.
> > 
> > By the way, I've got a little helper script that I've been using to
> > run/average the selftest results (which can vary quite a bit). I've
> > attached it below- hopefully it doesn't bounce from the mailing list.
> > Just for reference, the invocation to test the command you provided is
> > 
> > > python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M
> 
> I found that indeed I shouldn't have stopped at 16 vcpus since that's
> exactly where it starts to bottleneck. :)
> 
> So out of my curiosity I tried to profile 32 vcpus case on my system with
> this test case, meanwhile I tried it both with:
> 
>   - 1 uffd + 8 readers
>   - 32 uffds (so 32 readers)
> 
> I've got the flamegraphs attached for both.
> 
> It seems that when using >1 uffds the bottleneck is not the spinlock
> anymore but something else.
> 
> From what I got there, vmx_vcpu_load() gets more highlights than the
> spinlocks. I think that's the tlb flush broadcast.
> 
> While OTOH indeed when using 1 uffd we can see obviously the overhead of
> spinlock contention on either the fault() path or read()/poll() as you and
> James rightfully pointed out.
> 
> I'm not sure whether my number is caused by special setup, though. After
> all I only had 40 threads and I started 32 vcpus + 8 readers and there'll
> be contention already between the workloads.
> 
> IMHO this means that there's still chance to provide a more generic
> userfaultfd scaling solution as long as we can remove the single spinlock
> contention on the fault/fault_pending queues.  I'll see whether I can still
> explore a bit on the possibility of this and keep you guys updated.  The
> general idea here to me is still to make multi-queue out of 1 uffd.
> 
> I _think_ this might also be a positive result for your work, because if the
> bottleneck is not userfaultfd (as we scale it by creating multiple uffds;
> ignoring the split vma effect), then it cannot be resolved by scaling
> userfaultfd alone anyway.  So a general solution, even if it existed,
> may not work here for kvm, because we'll already get stuck somewhere else.
> 
> -- 
> Peter Xu




-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-03 21:27                         ` Peter Xu
@ 2023-05-03 21:42                           ` Sean Christopherson
  2023-05-03 23:45                             ` Peter Xu
  0 siblings, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2023-05-03 21:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: Anish Moorthy, Nadav Amit, Axel Rasmussen, Paolo Bonzini, maz,
	oliver.upton, James Houghton, bgardon, dmatlack, ricarkol, kvm,
	kvmarm

On Wed, May 03, 2023, Peter Xu wrote:
> Oops, bounced back from the list..
> 
> Forward with no attachment this time - I assume the information is still
> enough in the paragraphs even without the flamegraphs.

The flamegraphs are definitely useful beyond what is captured here.  Not sure
how to get them accepted on the list though.

> > From what I got there, vmx_vcpu_load() gets more highlights than the
> > spinlocks. I think that's the tlb flush broadcast.

No, it's KVM dealing with the vCPU being migrated to a different pCPU.  The
smp_call_function_single() that shows up is from loaded_vmcs_clear() and is
triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the
VMCS caches not being coherent).

Task migration can also trigger IBPB (if mitigations are enabled), and also does
an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU.

Can you try 1:1 pinning of vCPUs to pCPUs?  That _should_ eliminate the
vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely representative of a real
world configuration.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-03 21:42                           ` Sean Christopherson
@ 2023-05-03 23:45                             ` Peter Xu
  2023-05-04 19:09                               ` Peter Xu
  2023-05-05 20:05                               ` Nadav Amit
  0 siblings, 2 replies; 103+ messages in thread
From: Peter Xu @ 2023-05-03 23:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Anish Moorthy, Nadav Amit, Axel Rasmussen, Paolo Bonzini, maz,
	oliver.upton, James Houghton, bgardon, dmatlack, ricarkol, kvm,
	kvmarm

On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote:
> On Wed, May 03, 2023, Peter Xu wrote:
> > Oops, bounced back from the list..
> > 
> > Forward with no attachment this time - I assume the information is still
> > enough in the paragraphs even without the flamegraphs.
> 
> The flamegraphs are definitely useful beyond what is captured here.  Not sure
> how to get them accepted on the list though.

Trying again with google drive:

single uffd:
https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view

32 uffds:
https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view

> 
> > > From what I got there, vmx_vcpu_load() gets more highlights than the
> > > spinlocks. I think that's the tlb flush broadcast.
> 
> No, it's KVM dealing with the vCPU being migrated to a different pCPU.  The
> smp_call_function_single() that shows up is from loaded_vmcs_clear() and is
> triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the
> VMCS caches not being coherent).
> 
> Task migration can also trigger IBPB (if mitigations are enabled), and also does
> an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU.
> 
> Can you trying 1:1 pinning of vCPUs to pCPUs?  That _should_ eliminate the
> vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real
> world configuration.

Yes, it does go away:

https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view

With pinning vcpu threads only (again, over 40 hard cores/threads):

./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32

It seems to me for some reason the scheduler ate more than I expected..
Maybe tomorrow I can try two more things:

  - Do cpu isolations, and
  - pin reader threads too (or just leave the readers on housekeeping cores)

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-03 23:45                             ` Peter Xu
@ 2023-05-04 19:09                               ` Peter Xu
  2023-05-05 18:32                                 ` Anish Moorthy
  2023-05-05 20:05                               ` Nadav Amit
  1 sibling, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-05-04 19:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Anish Moorthy, Nadav Amit, Axel Rasmussen, Paolo Bonzini, maz,
	oliver.upton, James Houghton, bgardon, dmatlack, ricarkol, kvm,
	kvmarm

On Wed, May 03, 2023 at 07:45:28PM -0400, Peter Xu wrote:
> On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote:
> > On Wed, May 03, 2023, Peter Xu wrote:
> > > Oops, bounced back from the list..
> > > 
> > > Forward with no attachment this time - I assume the information is still
> > > enough in the paragraphs even without the flamegraphs.
> > 
> > The flamegraphs are definitely useful beyond what is captured here.  Not sure
> > how to get them accepted on the list though.
> 
> Trying again with google drive:
> 
> single uffd:
> https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view
> 
> 32 uffds:
> https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view
> 
> > 
> > > > From what I got there, vmx_vcpu_load() gets more highlights than the
> > > > spinlocks. I think that's the tlb flush broadcast.
> > 
> > No, it's KVM dealing with the vCPU being migrated to a different pCPU.  The
> > smp_call_function_single() that shows up is from loaded_vmcs_clear() and is
> > triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the
> > VMCS caches not being coherent).
> > 
> > Task migration can also trigger IBPB (if mitigations are enabled), and also does
> > an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU.
> > 
> > Can you trying 1:1 pinning of vCPUs to pCPUs?  That _should_ eliminate the
> > vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real
> > world configuration.
> 
> Yes it does went away:
> 
> https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view
> 
> With pinning vcpu threads only (again, over 40 hard cores/threads):
> 
> ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
> 
> It seems to me for some reason the scheduler ate more than I expected..
> Maybe tomorrow I can try two more things:
> 
>   - Do cpu isolations, and
>   - pin reader threads too (or just leave the readers on housekeeping cores)

I gave it a shot by isolating 32 cores and splitting them into two groups, 16 for
uffd threads and 16 for vcpu threads.  I got similar results and I don't
see much change.

I think it's possible it's just reaching the limit of my host, since it only
has 40 cores anyway.  Throughput never goes over 350K faults/sec overall.

I assume this might not be the case for Anish if he has a much larger host,
so we can have a similar test carried out there and see how that goes.  I think the
idea is to make sure the vcpu load overhead during sched-in is ruled out, then
see whether it can keep scaling with more cores.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-04 19:09                               ` Peter Xu
@ 2023-05-05 18:32                                 ` Anish Moorthy
  2023-05-08  1:23                                   ` Peter Xu
  0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-05-05 18:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, Nadav Amit, Axel Rasmussen, Paolo Bonzini,
	maz, oliver.upton, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm

On Thu, May 4, 2023 at 12:09 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, May 03, 2023 at 07:45:28PM -0400, Peter Xu wrote:
> > On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote:
> > > On Wed, May 03, 2023, Peter Xu wrote:
> > > > Oops, bounced back from the list..
> > > >
> > > > Forward with no attachment this time - I assume the information is still
> > > > enough in the paragraphs even without the flamegraphs.
> > >
> > > The flamegraphs are definitely useful beyond what is captured here.  Not sure
> > > how to get them accepted on the list though.
> >
> > Trying again with google drive:
> >
> > single uffd:
> > https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view
> >
> > 32 uffds:
> > https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view
> >
> > >
> > > > > From what I got there, vmx_vcpu_load() gets more highlights than the
> > > > > spinlocks. I think that's the tlb flush broadcast.
> > >
> > > No, it's KVM dealing with the vCPU being migrated to a different pCPU.  The
> > > smp_call_function_single() that shows up is from loaded_vmcs_clear() and is
> > > triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the
> > > VMCS caches not being coherent).
> > >
> > > Task migration can also trigger IBPB (if mitigations are enabled), and also does
> > > an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU.
> > >
> > > Can you trying 1:1 pinning of vCPUs to pCPUs?  That _should_ eliminate the
> > > vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real
> > > world configuration.
> >
> > Yes it does went away:
> >
> > https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view
> >
> > With pinning vcpu threads only (again, over 40 hard cores/threads):
> >
> > ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
> >
> > It seems to me for some reason the scheduler ate more than I expected..
> > Maybe tomorrow I can try two more things:

I pulled in your patch adding the -c flag, and confirmed that it
doesn't seem to make a huge difference to the self test's
numbers/scalability. The per-vcpu paging rate actually seems a bit
lower, going 117-103-77-55-18-9K for 1-32 vcpus.

> >   - Do cpu isolations, and
> >   - pin reader threads too (or just leave the readers on housekeeping cores)
>
> I gave it a shot by isolating 32 cores and split into two groups, 16 for
> uffd threads and 16 for vcpu threads.  I got similiar results and I don't
> see much changed.
>
> I think it's possible it's just reaching the limit of my host since it only
> got 40 cores anyway.  Throughput never hits over 350K faults/sec overall.
>
> I assume this might not be the case for Anish if he has a much larger host.
> So we can have similar test carried out and see how that goes.  I think the
> idea is making sure vcpu load overhead during sched-in is ruled out, then
> see whether it can keep scaling with more cores.

Peter, I'm afraid that isolating cores and splitting them into groups
is new to me. Do you mind explaining exactly what you did here?

Also, I finally got some of my own perf traces for the self test: [1]
shows what happens with 32 vCPUs faulting on a single uffd with 32
reader threads, with the contention clearly being a huge issue, and
[2] shows the effect of demand paging through memory faults on that
configuration. Unfortunately the export-to-svg functionality on our
internal tool seems broken, so I could only grab pngs :(

[1] https://drive.google.com/file/d/1YWiZTjb2FPmqj0tkbk4cuH0Oq8l65nsU/view?usp=drivesdk
[2] https://drive.google.com/file/d/1P76_6SSAHpLxNgDAErSwRmXBLkuDeFoA/view?usp=drivesdk

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-03 23:45                             ` Peter Xu
  2023-05-04 19:09                               ` Peter Xu
@ 2023-05-05 20:05                               ` Nadav Amit
  2023-05-08  1:12                                 ` Peter Xu
  1 sibling, 1 reply; 103+ messages in thread
From: Nadav Amit @ 2023-05-05 20:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, Anish Moorthy, Axel Rasmussen,
	Paolo Bonzini, maz, oliver.upton, James Houghton, bgardon,
	dmatlack, ricarkol, kvm, kvmarm



> On May 3, 2023, at 4:45 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote:
>> On Wed, May 03, 2023, Peter Xu wrote:
>>> Oops, bounced back from the list..
>>> 
>>> Forward with no attachment this time - I assume the information is still
>>> enough in the paragraphs even without the flamegraphs.
>> 
>> The flamegraphs are definitely useful beyond what is captured here.  Not sure
>> how to get them accepted on the list though.
> 
> Trying again with google drive:
> 
> single uffd:
> https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view
> 
> 32 uffds:
> https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view
> 
>> 
>>>> From what I got there, vmx_vcpu_load() gets more highlights than the
>>>> spinlocks. I think that's the tlb flush broadcast.
>> 
>> No, it's KVM dealing with the vCPU being migrated to a different pCPU.  The
>> smp_call_function_single() that shows up is from loaded_vmcs_clear() and is
>> triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the
>> VMCS caches not being coherent).
>> 
>> Task migration can also trigger IBPB (if mitigations are enabled), and also does
>> an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU.
>> 
>> Can you trying 1:1 pinning of vCPUs to pCPUs?  That _should_ eliminate the
>> vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real
>> world configuration.
> 
> Yes it does went away:
> 
> https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view
> 
> With pinning vcpu threads only (again, over 40 hard cores/threads):
> 
> ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
> 
> It seems to me for some reason the scheduler ate more than I expected..
> Maybe tomorrow I can try two more things:
> 
>  - Do cpu isolations, and
>  - pin reader threads too (or just leave the readers on housekeeping cores)

For the record (and I hope I do not repeat myself): these scheduler overheads
are something that I have encountered before.

The two main solutions I tried were:

1. Optional polling on the faulting thread to avoid context switch on the
   faulting thread.

(something like https://lore.kernel.org/linux-mm/20201129004548.1619714-6-namit@vmware.com/ )

and 

2. IO-uring to avoid context switch on the handler thread.

In addition, as I mentioned before, the queue locks are something that can be
simplified.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-05 20:05                               ` Nadav Amit
@ 2023-05-08  1:12                                 ` Peter Xu
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Xu @ 2023-05-08  1:12 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Sean Christopherson, Anish Moorthy, Axel Rasmussen,
	Paolo Bonzini, maz, oliver.upton, James Houghton, bgardon,
	dmatlack, ricarkol, kvm, kvmarm

Hi, Nadav,

On Fri, May 05, 2023 at 01:05:02PM -0700, Nadav Amit wrote:
> > ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
> > 
> > It seems to me for some reason the scheduler ate more than I expected..
> > Maybe tomorrow I can try two more things:
> > 
> >  - Do cpu isolations, and
> >  - pin reader threads too (or just leave the readers on housekeeping cores)
> 
> For the record (and I hope I do not repeat myself): these scheduler overheads
> is something that I have encountered before.
> 
> The two main solutions I tried were:
> 
> 1. Optional polling on the faulting thread to avoid context switch on the
>    faulting thread.
> 
> (something like https://lore.kernel.org/linux-mm/20201129004548.1619714-6-namit@vmware.com/ )
> 
> and 
> 
> 2. IO-uring to avoid context switch on the handler thread.
> 
> In addition, as I mentioned before, the queue locks is something that can be
> simplified.

Right, thanks for double checking on that.  Though do you think these are
two separate issues to be looked into?

One is reducing the context switch overhead with a static configuration,
which I think can be resolved by what you mentioned above and the io_uring
series.

The other is the possibility of scaling userfaultfd by splitting guest
memory into a few chunks (literally the demand paging test with no -a).
Logically I think it should scale with pcpu pinning of vcpu threads to
avoid the kvm bottlenecks.

Side note: IIUC none of the above will resolve the problem right now if we
assume we can only have 1 uffd registered over the guest mem.

However I'm curious about testing multi-uffd because I want to make sure
there's nothing else that stops the whole system from scaling with threads;
hence I'd expect to get a higher overall faults/sec if we increase the cores
we use in the test.

If it already cannot scale for whatever reason then it means a generic
solution may not be possible, at least for the kvm use case.  OTOH if
multi-uffd can scale well, then there's a chance of a general solution as
long as we can remove the single-queue contention over the whole guest mem.

PS: Nadav, I think you have mentioned twice avoiding taking two locks for the
fault queue, which sounds reasonable.  Do you have a plan to post a patch?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-05 18:32                                 ` Anish Moorthy
@ 2023-05-08  1:23                                   ` Peter Xu
  2023-05-09 20:52                                     ` Anish Moorthy
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-05-08  1:23 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Sean Christopherson, Nadav Amit, Axel Rasmussen, Paolo Bonzini,
	maz, oliver.upton, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm

On Fri, May 05, 2023 at 11:32:11AM -0700, Anish Moorthy wrote:
> Peter, I'm afraid that isolating cores and splitting them into groups
> is new to me. Do you mind explaining exactly what you did here?

So far I think the most important pinning is the vcpu thread pinning; we
should always test with that in this case, so that the vcpu load overhead
doesn't keep things from scaling with cores/vcpus.

What I did was (1) isolate cores (using isolcpus=xxx), then (2) manually
pin the userfault threads to some of the other isolated cores.  But maybe this
is not needed.
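
As a concrete example (hypothetical core numbers, on a 40-thread host), it
would look something like:

  # kernel cmdline: isolcpus=1-32
  ./demand_paging_test -b 512M -u MINOR -s shmem -v 16 -r 1 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
  # then restrict each uffd reader thread to the other group of isolated cores:
  taskset -pc 17-32 <reader tid>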

> 
> Also, I finally got some of my own perf traces for the self test: [1]
> shows what happens with 32 vCPUs faulting on a single uffd with 32
> reader threads, with the contention clearly being a huge issue, and
> [2] shows the effect of demand paging through memory faults on that
> configuration. Unfortunately the export-to-svg functionality on our
> internal tool seems broken, so I could only grab pngs :(
> 
> [1] https://drive.google.com/file/d/1YWiZTjb2FPmqj0tkbk4cuH0Oq8l65nsU/view?usp=drivesdk
> [2] https://drive.google.com/file/d/1P76_6SSAHpLxNgDAErSwRmXBLkuDeFoA/view?usp=drivesdk

Understood.  What I tested was without -a so it's using >1 uffds.

I explained why I think it could be useful to test this in my reply to
Nadav, do you think it makes sense to you?  E.g. compare (1) 32 vcpus + 32
uffd threads and (2) 64 vcpus + 64 uffd threads; again, we need to make sure
the vcpu threads are pinned using -c this time.  It'd be nice to pin the uffd
threads too but I'm not sure whether it'll make a huge difference.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-08  1:23                                   ` Peter Xu
@ 2023-05-09 20:52                                     ` Anish Moorthy
  2023-05-10 21:50                                       ` Peter Xu
  0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-05-09 20:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, Nadav Amit, Axel Rasmussen, Paolo Bonzini,
	maz, oliver.upton, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm

On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> I explained why I think it could be useful to test this in my reply to
> Nadav, do you think it makes sense to you?

Ah, I actually missed your reply to Nadav: didn't realize you had sent
*two* emails.

> While OTOH if multi-uffd can scale well, then there's a chance of
> general solution as long as we can remove the single-queue
> contention over the whole guest mem.

I don't quite understand your statement here: if we pursue multi-uffd,
then it seems to me that by definition we've removed the single
queue(s) for all of guest memory, and thus the associated contention.
And we'd still have the issue of multiple vCPUs contending for a
single UFFD.

But I do share some of your curiosity about multi-uffd performance,
especially since some of my earlier numbers indicated that multi-uffd
doesn't scale linearly, even when each vCPU corresponds to a single
UFFD.

So, I grabbed some more profiles for 32 and 64 vcpus using the following command
./demand_paging_test -b 512M -u MINOR -s shmem -v <n> -r 1 -c <1,...,n>

The 32-vcpu config achieves a per-vcpu paging rate of 8.8k. That rate
goes down to 3.9k (!) with 64 vCPUs. I don't immediately see the issue
from the traces, but safe to say it's definitely not scaling. Since I
applied your fixes from earlier, the prefaulting isn't being counted
against the demand paging rate either.

32-vcpu profile:
https://drive.google.com/file/d/19ZZDxZArhSsbW_5u5VcmLT48osHlO9TG/view?usp=drivesdk
64-vcpu profile:
https://drive.google.com/file/d/1dyLOLVHRNdkUoFFr7gxqtoSZGn1_GqmS/view?usp=drivesdk

Do let me know if you need svg files instead and I'll try and figure that out.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
                   ` (22 preceding siblings ...)
  2023-04-19 19:55 ` [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Peter Xu
@ 2023-05-09 22:19 ` David Matlack
  2023-05-10 16:35   ` Anish Moorthy
  2023-05-10 22:35   ` Sean Christopherson
  23 siblings, 2 replies; 103+ messages in thread
From: David Matlack @ 2023-05-09 22:19 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> Upon receiving an annotated EFAULT, userspace may take appropriate
> action to resolve the failed access. For instance, this might involve a
> UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
> migration postcopy.

As implemented, I think it will be prohibitively expensive if not
impossible for userspace to determine why KVM is returning EFAULT when
KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't
decide the correct action to take (try to resolve or bail).

Consider the direct_map() case in PATCH 15. The only way to hit
that condition is a logic bug in KVM or data corruption. There isn't
really anything userspace can do to handle this situation, and it has no
way to distinguish that from faults due to absent mappings.

We could end up hitting cases where userspace loops forever doing
KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT...

Maybe we should just change direct_map() to use KVM_BUG() and return
something other than EFAULT. But the general problem still exists and
even if we have confidence in all the current EFAULT sites, we don't have
much protection against someone adding an EFAULT in the future that
userspace can't handle.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-09 22:19 ` David Matlack
@ 2023-05-10 16:35   ` Anish Moorthy
  2023-05-10 22:35   ` Sean Christopherson
  1 sibling, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-05-10 16:35 UTC (permalink / raw)
  To: David Matlack
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Tue, May 9, 2023 at 3:19 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > Upon receiving an annotated EFAULT, userspace may take appropriate
> > action to resolve the failed access. For instance, this might involve a
> > UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
> > migration postcopy.
>
> As implemented, I think it will be prohibitively expensive if not
> impossible for userspace to determine why KVM is returning EFAULT when
> KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't
> decide the correct action to take (try to resolve or bail).
>
> Consider the direct_map() case in patch in PATCH 15. The only way to hit
> that condition is a logic bug in KVM or data corruption. There isn't
> really anything userspace can do to handle this situation, and it has no
> way to distinguish that from faults to due absent mappings.
>
> We could end up hitting cases where userspace loops forever doing
> KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT...
>
> Maybe we should just change direct_map() to use KVM_BUG() and return
> something other than EFAULT. But the general problem still exists and
> even if we have confidence in all the current EFAULT sites, we don't have
> much protection against someone adding an EFAULT in the future that
> userspace can't handle.

Hmm, I had been operating under the assumption that userspace would
always be able to make the memory access succeed somehow - I
(naively) didn't count on some guest memory access errors being
unrecoverable.

If that's the case, then we're back to needing some way to distinguish
the new faults/exits emitted by user_mem_abort()/kvm_faultin_pfn() with
the ABSENT_MAPPING_FAULT cap enabled :/ Let me paste in a bit of what
Sean said to refute the idea of a special page-fault-failure flag set in
those spots.

(from https://lore.kernel.org/kvm/ZBoIzo8FGxSyUJ2I@google.com/)
On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Setting a flag that essentially says "failure when handling a guest page fault"
> is problematic on multiple fronts.  Tying the ABI to KVM's internal implementation
> is not an option, i.e. the ABI would need to be defined as "on page faults from
> the guest".  And then the resulting behavior would be non-deterministic, e.g.
> userspace would see different behavior if KVM accessed a "bad" gfn via emulation
> instead of in response to a guest page fault.  And because of hardware TLBs, it
> would even be possible for the behavior to be non-deterministic on the same
> platform running the same guest code (though this would be exteremly unliklely
> in practice).
>
> And even if userspace is ok with only handling guest page faults_today_, I highly
> doubt that will hold forever.  I.e. at some point there will be a use case that
> wants to react to uaccess failures on fast-only memslots.
>
> Ignoring all of those issues, simplify flagging "this -EFAULT occurred when
> handling a guest page fault" isn't precise enough for userspace to blindly resolve
> the failure.  Even if KVM went through the trouble of setting information if and
> only if get_user_page_fast_only() failed while handling a guest page fault,
> userspace would still need/want a way to verify that the failure was expected and
> can be resolved, e.g. to guard against userspace bugs due to wrongly unmapping
> or mprotecting a page.

I wonder how much of this problem comes down to my description/name
(I suggested MEMFAULT_REASON_PAGE_FAULT_FAILURE) for the flag. I see
Sean's concerns about the behavior issues when fast-only pages are
accessed via guest mode or via emulation/uaccess. What if the
description of the fast-only fault cap were tightened to something like
"generates vcpu faults/exits in response to *EPT/SLAT violations*
which cannot be mapped by present userspace page table entries"? I
think that would eliminate the emulation/uaccess issues (though I may
be wrong, so please let me know).

Of course, by the time we get to kvm_faultin_pfn we don't know that
we're faulting pages in response to an EPT violation... but if the
idea makes sense then that might justify some plumbing code.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-09 20:52                                     ` Anish Moorthy
@ 2023-05-10 21:50                                       ` Peter Xu
  2023-05-11 17:17                                         ` David Matlack
  2023-05-15 17:16                                         ` Anish Moorthy
  0 siblings, 2 replies; 103+ messages in thread
From: Peter Xu @ 2023-05-10 21:50 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Sean Christopherson, Nadav Amit, Axel Rasmussen, Paolo Bonzini,
	maz, oliver.upton, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm

Hi, Anish,

On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > I explained why I think it could be useful to test this in my reply to
> > Nadav, do you think it makes sense to you?
> 
> Ah, I actually missed your reply to Nadav: didn't realize you had sent
> *two* emails.
> 
> > While OTOH if multi-uffd can scale well, then there's a chance of
> > general solution as long as we can remove the single-queue
> > contention over the whole guest mem.
> 
> I don't quite understand your statement here: if we pursue multi-uffd,
> then it seems to me that by definition we've removed the single
> queue(s) for all of guest memory, and thus the associated contention.
> And we'd still have the issue of multiple vCPUs contending for a
> single UFFD.

Yes, as I mentioned, it's purely something I was curious about, and it
also shows the best result we could get with a more generic solution; it
doesn't really solve the issue immediately.

> 
> But I do share some of your curiosity about multi-uffd performance,
> especially since some of my earlier numbers indicated that multi-uffd
> doesn't scale linearly, even when each vCPU corresponds to a single
> UFFD.
> 
> So, I grabbed some more profiles for 32 and 64 vcpus using the following command
> ./demand_paging_test -b 512M -u MINOR -s shmem -v <n> -r 1 -c <1,...,n>
> 
> The 32-vcpu config achieves a per-vcpu paging rate of 8.8k. That rate
> goes down to 3.9k (!) with 64 vCPUs. I don't immediately see the issue
> from the traces, but safe to say it's definitely not scaling. Since I
> applied your fixes from earlier, the prefaulting isn't being counted
> against the demand paging rate either.
> 
> 32-vcpu profile:
> https://drive.google.com/file/d/19ZZDxZArhSsbW_5u5VcmLT48osHlO9TG/view?usp=drivesdk
> 64-vcpu profile:
> https://drive.google.com/file/d/1dyLOLVHRNdkUoFFr7gxqtoSZGn1_GqmS/view?usp=drivesdk
> 
> Do let me know if you need svg files instead and I'll try and figure that out.

Thanks for trying all these out, and sorry if I caused confusion in my
reply.

What I wanted to do is to understand whether there's still a chance to
provide a generic solution.  I don't know why you have a bunch of pmu
stacks showing in the graph; perhaps you forgot to disable some of the
perf events when doing the test?  Let me know if you figure out why it
happened like that (so far I haven't seen why), but I feel guilty about
continuing to overload you with such questions.

The major problem I had with this series is that it's definitely not a
clean approach.  Say, even if you rely entirely on the userspace app,
you'll still need to rely on userfaultfd for kernel traps in corner cases
or it just won't work.  IIUC that's also Nadav's concern.

But I also agree it seems to resolve every bottleneck in the kernel, no
matter whether it's in the scheduler or in vcpu loading.  After all, you
throw everything into userspace... :)

Considering that most of the changes are for -EFAULT traps and the 2nd
part of the change is very self-contained and maintainable, no objection
here to having it.  I'll leave that to the maintainers to decide.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-09 22:19 ` David Matlack
  2023-05-10 16:35   ` Anish Moorthy
@ 2023-05-10 22:35   ` Sean Christopherson
  2023-05-10 23:44     ` Anish Moorthy
  2023-05-23 17:49     ` Anish Moorthy
  1 sibling, 2 replies; 103+ messages in thread
From: Sean Christopherson @ 2023-05-10 22:35 UTC (permalink / raw)
  To: David Matlack
  Cc: Anish Moorthy, pbonzini, maz, oliver.upton, jthoughton, bgardon,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Tue, May 09, 2023, David Matlack wrote:
> On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > Upon receiving an annotated EFAULT, userspace may take appropriate
> > action to resolve the failed access. For instance, this might involve a
> > UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
> > migration postcopy.
> 
> As implemented, I think it will be prohibitively expensive if not
> impossible for userspace to determine why KVM is returning EFAULT when
> KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't
> decide the correct action to take (try to resolve or bail).
> 
> Consider the direct_map() case in PATCH 15. The only way to hit
> that condition is a logic bug in KVM or data corruption. There isn't
> really anything userspace can do to handle this situation, and it has no
> way to distinguish that from faults due to absent mappings.
> 
> We could end up hitting cases where userspace loops forever doing
> KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT...
> 
> Maybe we should just change direct_map() to use KVM_BUG() and return
> something other than EFAULT. But the general problem still exists and
> even if we have confidence in all the current EFAULT sites, we don't have
> much protection against someone adding an EFAULT in the future that
> userspace can't handle.

Yeah, when I speed read the series, several of the conversions stood out as being
"wrong".  My (potentially unstated) idea was that KVM would only signal
KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
i.e. when the fault _might_ be resolvable by userspace.

If we want to populate KVM_EXIT_MEMORY_FAULT even on kernel bugs, and anything
else that userspace can't possibly resolve, then the easiest thing would be to
add a flag to signal that the fault is fatal, i.e. that userspace shouldn't retry.
Adding a flag may be more robust in the long term as it will force developers to
think about whether or not a fault is fatal, versus relying on documentation to
say "don't signal KVM_EXIT_MEMORY_FAULT for fatal EFAULT conditions".

Side topic, KVM x86 really should have a version of KVM_SYNC_X86_REGS that stores
registers for userspace, but doesn't load registers.  That would allow userspace
to detect many infinite loops with minimal overhead, e.g. (1) set KVM_STORE_X86_REGS
during demand paging, (2) check RIP on every exit to see if the vCPU is making
forward progress, (3) escalate to checking all registers if RIP hasn't changed for
N exits, and finally (4) take action if the guest is well and truly stuck after
N more exits.  KVM could even store RIP on every exit if userspace wanted to avoid
the overhead of storing registers until userspace actually wants all registers.
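
A rough userspace sketch of that progress check, using the existing
KVM_GET_REGS ioctl for simplicity; the proposed store-only sync-regs
flag does not exist yet, and with it the extra ioctl below could be
dropped in favor of reading RIP straight out of kvm_run:

#include <sys/ioctl.h>
#include <linux/kvm.h>

#define STALL_LIMIT	1000	/* arbitrary threshold for this sketch */

/* Returns 1 if RIP hasn't moved for STALL_LIMIT consecutive exits. */
static int vcpu_seems_stuck(int vcpu_fd, __u64 *last_rip, int *stalls)
{
	struct kvm_regs regs;

	if (ioctl(vcpu_fd, KVM_GET_REGS, &regs) < 0)
		return 0;	/* can't tell; assume progress */

	if (regs.rip == *last_rip) {
		if (++(*stalls) >= STALL_LIMIT)
			return 1;	/* escalate: dump all regs, take action */
	} else {
		*last_rip = regs.rip;
		*stalls = 0;
	}
	return 0;
}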

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-10 22:35   ` Sean Christopherson
@ 2023-05-10 23:44     ` Anish Moorthy
  2023-05-23 17:49     ` Anish Moorthy
  1 sibling, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-05-10 23:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Matlack, pbonzini, maz, oliver.upton, jthoughton, bgardon,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Yeah, when I speed read the series, several of the conversions stood out as being
> "wrong".  My (potentially unstated) idea was that KVM would only signal
> KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
> i.e. when the fault _might_ be resolvable by userspace.

Well, you definitely tried to get the idea across somehow; even in my
cover letter here, I state:

> As a first step, KVM_CAP_MEMORY_FAULT_INFO is introduced. This
> capability is meant to deliver useful information to userspace (i.e. the
> problematic range of guest physical memory) when a vCPU fails a guest
> memory access.

So the fact that I'm doing something more here is unintentional and
stems from unfamiliarity with all of the ways in which KVM does (or
does not) perform user accesses.

Sean, besides direct_map, which other patches did you notice as needing
to be dropped/marked as unrecoverable errors?

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-10 21:50                                       ` Peter Xu
@ 2023-05-11 17:17                                         ` David Matlack
  2023-05-11 17:33                                           ` Axel Rasmussen
  2023-05-15 17:16                                         ` Anish Moorthy
  1 sibling, 1 reply; 103+ messages in thread
From: David Matlack @ 2023-05-11 17:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Anish Moorthy, Sean Christopherson, Nadav Amit, Axel Rasmussen,
	Paolo Bonzini, maz, oliver.upton, James Houghton, bgardon,
	ricarkol, kvm, kvmarm

On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> What I wanted to do is to understand whether there's still chance to
> provide a generic solution.  I don't know why you have had a bunch of pmu
> stack showing in the graph, perhaps you forgot to disable some of the perf
> events when doing the test?  Let me know if you figure out why it happened
> like that (so far I didn't see), but I feel guilty to keep overloading you
> with such questions.
>
> The major problem I had with this series is it's definitely not a clean
> approach.  Say, even if you'll all rely on userapp you'll still need to
> rely on userfaultfd for kernel traps on corner cases or it just won't work.
> IIUC that's also the concern from Nadav.

This is a long thread, so apologies if the following has already been discussed.

Would per-tid userfaultfd support be a generic solution? i.e. Allow
userspace to create a userfaultfd that is tied to a specific task. Any
userfaults encountered by that task use that fd, rather than the
process-wide fd. I'm making the assumption here that each of these fds
would have independent signaling mechanisms/queues and so this would
solve the scaling problem.

A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
vCPU for handling userfault requests. This seems like it'd have
roughly the same scalability characteristics as the KVM -EFAULT
approach.
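
To make the shape of that concrete, here is a sketch of the per-vCPU
resolver thread such a model implies. The read/UFFDIO_CONTINUE path is
existing userfaultfd API; the "uffd bound to a single task" part is the
new, currently nonexistent piece:

#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

#define PAGE_SIZE	4096UL

/* One of these threads per vCPU, each reading its own (per-task) uffd. */
static void *vcpu_fault_resolver(void *arg)
{
	int uffd = *(int *)arg;	/* hypothetically bound to one vCPU's tid */
	struct uffd_msg msg;

	while (read(uffd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg)) {
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		struct uffdio_continue cont = {
			.range = {
				.start = msg.arg.pagefault.address & ~(PAGE_SIZE - 1),
				.len = PAGE_SIZE,
			},
		};
		ioctl(uffd, UFFDIO_CONTINUE, &cont);	/* minor-fault resolution */
	}
	return NULL;
}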

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-11 17:17                                         ` David Matlack
@ 2023-05-11 17:33                                           ` Axel Rasmussen
  2023-05-11 19:05                                             ` David Matlack
  2023-05-15 15:05                                             ` Peter Xu
  0 siblings, 2 replies; 103+ messages in thread
From: Axel Rasmussen @ 2023-05-11 17:33 UTC (permalink / raw)
  To: David Matlack
  Cc: Peter Xu, Anish Moorthy, Sean Christopherson, Nadav Amit,
	Paolo Bonzini, maz, oliver.upton, James Houghton, bgardon,
	ricarkol, kvm, kvmarm

On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > What I wanted to do is to understand whether there's still chance to
> > provide a generic solution.  I don't know why you have had a bunch of pmu
> > stack showing in the graph, perhaps you forgot to disable some of the perf
> > events when doing the test?  Let me know if you figure out why it happened
> > like that (so far I didn't see), but I feel guilty to keep overloading you
> > with such questions.
> >
> > The major problem I had with this series is it's definitely not a clean
> > approach.  Say, even if you'll all rely on userapp you'll still need to
> > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > IIUC that's also the concern from Nadav.
>
> This is a long thread, so apologies if the following has already been discussed.
>
> Would per-tid userfaultfd support be a generic solution? i.e. Allow
> userspace to create a userfaultfd that is tied to a specific task. Any
> userfaults encountered by that task use that fd, rather than the
> process-wide fd. I'm making the assumption here that each of these fds
> would have independent signaling mechanisms/queues and so this would
> solve the scaling problem.
>
> A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> vCPU for handling userfault requests. This seems like it'd have
> roughly the same scalability characteristics as the KVM -EFAULT
> approach.

I think this would work in principle, but it's significantly different
from what exists today.

The splitting of userfaultfds Peter is describing is splitting up the
HVA address space, not splitting per-thread.

I think for this design, we'd need to change UFFD registration so
multiple UFFDs can register the same VMA, but can be filtered so they
only receive fault events caused by some particular tid(s).

This might also incur some (small?) overhead, because in the fault
path we now need to maintain some data structure so we can look up
which UFFD to notify based on a combination of the address and our
tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial.

I think it's worth keeping in mind that a selling point of Anish's
approach is that it's a very small change. It's plausible we can come
up with some alternative way to scale, but it seems to me everything
suggested so far is likely to require a lot more code, complexity, and
effort vs. Anish's approach.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-11 17:33                                           ` Axel Rasmussen
@ 2023-05-11 19:05                                             ` David Matlack
  2023-05-11 19:45                                               ` Axel Rasmussen
  2023-05-15 15:05                                             ` Peter Xu
  1 sibling, 1 reply; 103+ messages in thread
From: David Matlack @ 2023-05-11 19:05 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Peter Xu, Anish Moorthy, Sean Christopherson, Nadav Amit,
	Paolo Bonzini, maz, oliver.upton, James Houghton, bgardon,
	ricarkol, kvm, kvmarm

On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > What I wanted to do is to understand whether there's still chance to
> > > provide a generic solution.  I don't know why you have had a bunch of pmu
> > > stack showing in the graph, perhaps you forgot to disable some of the perf
> > > events when doing the test?  Let me know if you figure out why it happened
> > > like that (so far I didn't see), but I feel guilty to keep overloading you
> > > with such questions.
> > >
> > > The major problem I had with this series is it's definitely not a clean
> > > approach.  Say, even if you'll all rely on userapp you'll still need to
> > > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > > IIUC that's also the concern from Nadav.
> >
> > This is a long thread, so apologies if the following has already been discussed.
> >
> > Would per-tid userfaultfd support be a generic solution? i.e. Allow
> > userspace to create a userfaultfd that is tied to a specific task. Any
> > userfaults encountered by that task use that fd, rather than the
> > process-wide fd. I'm making the assumption here that each of these fds
> > would have independent signaling mechanisms/queues and so this would
> > solve the scaling problem.
> >
> > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> > vCPU for handling userfault requests. This seems like it'd have
> > roughly the same scalability characteristics as the KVM -EFAULT
> > approach.
> 
> I think this would work in principle, but it's significantly different
> from what exists today.
> 
> The splitting of userfaultfds Peter is describing is splitting up the
> HVA address space, not splitting per-thread.
> 
> I think for this design, we'd need to change UFFD registration so
> multiple UFFDs can register the same VMA, but can be filtered so they
> only receive fault events caused by some particular tid(s).
> 
> This might also incur some (small?) overhead, because in the fault
> path we now need to maintain some data structure so we can lookup
> which UFFD to notify based on a combination of the address and our
> tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial.

I was (perhaps naively) assuming the lookup would be as simple as:

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 44d1ee429eb0..e9856e2ba9ef 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
         */
        mmap_assert_locked(mm);

-       ctx = vma->vm_userfaultfd_ctx.ctx;
+       if (current->userfaultfd_ctx)
+               ctx = current->userfaultfd_ctx;
+       else
+               ctx = vma->vm_userfaultfd_ctx.ctx;
        if (!ctx)
                goto out;

> 
> I think it's worth keeping in mind that a selling point of Anish's
> approach is that it's a very small change. It's plausible we can come
> up with some alternative way to scale, but it seems to me everything
> suggested so far is likely to require a lot more code, complexity, and
> effort vs. Anish's approach.

Agreed.

Mostly I think the per-thread UFFD approach would add complexity on the
userspace side of things. With Anish's approach userspace is able to
trivially re-use the vCPU thread (and its associated pCPU if pinned) to
handle the request. That gets more complicated when juggling the extra
paired threads.

The per-thread approach would require a new userfault UAPI change which
I think is a higher bar than the KVM UAPI change proposed here.

The per-thread approach would require KVM to call into slow GUP and take
the mmap_lock before contacting userspace. I'm not 100% convinced that's
a bad thing long term (e.g. it avoids the false-positive -EFAULT exits
in Anish's proposal), but it could have performance implications.

Lastly, inter-thread communication is likely slower than returning to
userspace from KVM_RUN. So the per-thread approach might increase the
end-to-end latency of demand fetches.

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-11 19:05                                             ` David Matlack
@ 2023-05-11 19:45                                               ` Axel Rasmussen
  2023-05-15 15:16                                                 ` Peter Xu
  0 siblings, 1 reply; 103+ messages in thread
From: Axel Rasmussen @ 2023-05-11 19:45 UTC (permalink / raw)
  To: David Matlack
  Cc: Peter Xu, Anish Moorthy, Sean Christopherson, Nadav Amit,
	Paolo Bonzini, maz, oliver.upton, James Houghton, bgardon,
	ricarkol, kvm, kvmarm

On Thu, May 11, 2023 at 12:05 PM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> > On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
> > >
> > > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > What I wanted to do is to understand whether there's still chance to
> > > > provide a generic solution.  I don't know why you have had a bunch of pmu
> > > > stack showing in the graph, perhaps you forgot to disable some of the perf
> > > > events when doing the test?  Let me know if you figure out why it happened
> > > > like that (so far I didn't see), but I feel guilty to keep overloading you
> > > > with such questions.
> > > >
> > > > The major problem I had with this series is it's definitely not a clean
> > > > approach.  Say, even if you'll all rely on userapp you'll still need to
> > > > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > > > IIUC that's also the concern from Nadav.
> > >
> > > This is a long thread, so apologies if the following has already been discussed.
> > >
> > > Would per-tid userfaultfd support be a generic solution? i.e. Allow
> > > userspace to create a userfaultfd that is tied to a specific task. Any
> > > userfaults encountered by that task use that fd, rather than the
> > > process-wide fd. I'm making the assumption here that each of these fds
> > > would have independent signaling mechanisms/queues and so this would
> > > solve the scaling problem.
> > >
> > > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> > > vCPU for handling userfault requests. This seems like it'd have
> > > roughly the same scalability characteristics as the KVM -EFAULT
> > > approach.
> >
> > I think this would work in principle, but it's significantly different
> > from what exists today.
> >
> > The splitting of userfaultfds Peter is describing is splitting up the
> > HVA address space, not splitting per-thread.
> >
> > I think for this design, we'd need to change UFFD registration so
> > multiple UFFDs can register the same VMA, but can be filtered so they
> > only receive fault events caused by some particular tid(s).
> >
> > This might also incur some (small?) overhead, because in the fault
> > path we now need to maintain some data structure so we can lookup
> > which UFFD to notify based on a combination of the address and our
> > tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial.
>
> I was (perhaps naively) assuming the lookup would be as simple as:
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 44d1ee429eb0..e9856e2ba9ef 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>          */
>         mmap_assert_locked(mm);
>
> -       ctx = vma->vm_userfaultfd_ctx.ctx;
> +       if (current->userfaultfd_ctx)
> +               ctx = current->userfaultfd_ctx;
> +       else
> +               ctx = vma->vm_userfaultfd_ctx.ctx;
>         if (!ctx)
>                 goto out;

Hmm, perhaps. It might have to be more complicated if we want to allow
a single task to have both per-TID UFFDs for some addresses, and
"global" UFFDs for others.



Actually, while thinking about this, another wrinkle:

Imagine we have per-thread UFFDs. Thread X faults on some address, and
goes to sleep waiting for its paired resolver thread to resolve the
fault.

In the meantime, thread Y also faults on the same address, before the
resolution happens.

In the existing model, there is a single UFFD context per VMA, and
therefore a single wait queue for all threads to wait on. In the
per-TID-UFFD design, now each thread has its own context, and
ostensibly its own wait queue (since the wait queue locks are where
Anish saw the contention, I think this is exactly what we want to
split up). When we have this "multiple threads waiting on the same
address" situation, how do we ensure the fault is resolved exactly
once? And how do we wake up all of the sleeping threads when it is
resolved?

I'm sure it's solvable, but especially doing it without any locks /
contention seems like it could be a bit complicated.

>
> >
> > I think it's worth keeping in mind that a selling point of Anish's
> > approach is that it's a very small change. It's plausible we can come
> > up with some alternative way to scale, but it seems to me everything
> > suggested so far is likely to require a lot more code, complexity, and
> > effort vs. Anish's approach.
>
> Agreed.
>
> Mostly I think the per-thread UFFD approach would add complexity on the
> userspace side of things. With Anish's approach userspace is able to
> trivially re-use the vCPU thread (and it's associated pCPU if pinned) to
> handle the request. That gets more complicated when juggling the extra
> paired threads.
>
> The per-thread approach would requires a new userfault UAPI change which
> I think is a higher bar than the KVM UAPI change proposed here.
>
> The per-thread approach would require KVM call into slow GUP and take
> the mmap_lock before contacting userspace. I'm not 100% convinced that's
> a bad thing long term (e.g. it avoids the false-positive -EFAULT exits
> in Anish's proposal), but could have performance implications.
>
> Lastly, inter-thread communication is likely slower than returning to
> userspace from KVM_RUN. So the per-thread approach might increase the
> end-to-end latency of demand fetches.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-11 17:33                                           ` Axel Rasmussen
  2023-05-11 19:05                                             ` David Matlack
@ 2023-05-15 15:05                                             ` Peter Xu
  1 sibling, 0 replies; 103+ messages in thread
From: Peter Xu @ 2023-05-15 15:05 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: David Matlack, Anish Moorthy, Sean Christopherson, Nadav Amit,
	Paolo Bonzini, maz, oliver.upton, James Houghton, bgardon,
	ricarkol, kvm, kvmarm

On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > What I wanted to do is to understand whether there's still chance to
> > > provide a generic solution.  I don't know why you have had a bunch of pmu
> > > stack showing in the graph, perhaps you forgot to disable some of the perf
> > > events when doing the test?  Let me know if you figure out why it happened
> > > like that (so far I didn't see), but I feel guilty to keep overloading you
> > > with such questions.
> > >
> > > The major problem I had with this series is it's definitely not a clean
> > > approach.  Say, even if you'll all rely on userapp you'll still need to
> > > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > > IIUC that's also the concern from Nadav.
> >
> > This is a long thread, so apologies if the following has already been discussed.
> >
> > Would per-tid userfaultfd support be a generic solution? i.e. Allow
> > userspace to create a userfaultfd that is tied to a specific task. Any
> > userfaults encountered by that task use that fd, rather than the
> > process-wide fd. I'm making the assumption here that each of these fds
> > would have independent signaling mechanisms/queues and so this would
> > solve the scaling problem.
> >
> > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> > vCPU for handling userfault requests. This seems like it'd have
> > roughly the same scalability characteristics as the KVM -EFAULT
> > approach.
> 
> I think this would work in principle, but it's significantly different
> from what exists today.
> 
> The splitting of userfaultfds Peter is describing is splitting up the
> HVA address space, not splitting per-thread.

[sorry mostly travel last week]

No, my idea was actually to split per-thread, but since there's currently
no way to split per thread, I was thinking we should start testing with a
split per vma so it "emulates" the best we can get out of a per-thread split.

> 
> I think for this design, we'd need to change UFFD registration so
> multiple UFFDs can register the same VMA, but can be filtered so they
> only receive fault events caused by some particular tid(s).

Having multiple real uffds per vma is challenging: as you mentioned,
enqueuing may take more effort, and meanwhile it's hard to know what the
attributes of the uffd over this vma are, because each uffd has its own
feature list.

What we may need here is only the "logical queue" of the uffd.  So I was
considering supporting multi-queue for a _single_ userfaultfd.

I actually mentioned some of it in the very initial reply to Anish:

https://lore.kernel.org/all/ZEGuogfbtxPNUq7t@x1n/

        If the real problem lies in a bunch of threads queuing, is it
        possible that we can provide just more queues for the events?  The
        readers will just need to go over all the queues.

        Way to decide "which thread uses which queue" can be another
        problem, what comes up quickly to me is a "hash(tid) % n_queues"
        but maybe it can be better.  Each vcpu thread will have different
        tids, then they can hopefully scale on the queues.

The queues may also need to be created as sub-uffds, each supporting only
a subset of the uffd interfaces (read/poll, COPY/CONTINUE/ZEROPAGE) but
not all of them (e.g. UFFDIO_API shouldn't be supported there).
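
A very rough kernel-side sketch of the queue-selection idea, purely to
illustrate the hash(tid) % n_queues point; none of the multi-queue types
or fields below exist in today's userfaultfd code:

#include <linux/hash.h>
#include <linux/sched.h>

/* Hypothetical multi-queue layout; not part of the current uffd context. */
struct uffd_fault_queue;

struct uffd_multiqueue_ctx {
	unsigned int n_queues;
	struct uffd_fault_queue *queues;
};

/* Spread faulting threads across queues instead of one shared waitqueue. */
static struct uffd_fault_queue *
uffd_select_queue(struct uffd_multiqueue_ctx *ctx)
{
	return &ctx->queues[hash_32(current->pid, 32) % ctx->n_queues];
}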

> This might also incur some (small?) overhead, because in the fault path
> we now need to maintain some data structure so we can lookup which UFFD
> to notify based on a combination of the address and our tid. Today, since
> VMAs and UFFDs are 1:1 this lookup is trivial.  I think it's worth
> keeping in mind that a selling point of Anish's approach is that it's a
> very small change. It's plausible we can come up with some alternative
> way to scale, but it seems to me everything suggested so far is likely to
> require a lot more code, complexity, and effort vs. Anish's approach.

Yes, I think that's also the reason why I thought I'd overloaded this
work too much.  If Anish eagerly wants that and can make it useful, then
I'm totally fine, because maintaining the 2nd cap seems trivial assuming
the maintainers would already accept the 1st cap.

I just hope it'll be thoroughly tested, even with Google's private
userspace hypervisor, so that the kernel interface is solid (even if not
straightforward for a new user seeing this) and serves the goal of the
problem Anish is tackling.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-11 19:45                                               ` Axel Rasmussen
@ 2023-05-15 15:16                                                 ` Peter Xu
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Xu @ 2023-05-15 15:16 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: David Matlack, Anish Moorthy, Sean Christopherson, Nadav Amit,
	Paolo Bonzini, maz, oliver.upton, James Houghton, bgardon,
	ricarkol, kvm, kvmarm

On Thu, May 11, 2023 at 12:45:32PM -0700, Axel Rasmussen wrote:
> On Thu, May 11, 2023 at 12:05 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> > > On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
> > > >
> > > > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > What I wanted to do is to understand whether there's still chance to
> > > > > provide a generic solution.  I don't know why you have had a bunch of pmu
> > > > > stack showing in the graph, perhaps you forgot to disable some of the perf
> > > > > events when doing the test?  Let me know if you figure out why it happened
> > > > > like that (so far I didn't see), but I feel guilty to keep overloading you
> > > > > with such questions.
> > > > >
> > > > > The major problem I had with this series is it's definitely not a clean
> > > > > approach.  Say, even if you'll all rely on userapp you'll still need to
> > > > > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > > > > IIUC that's also the concern from Nadav.
> > > >
> > > > This is a long thread, so apologies if the following has already been discussed.
> > > >
> > > > Would per-tid userfaultfd support be a generic solution? i.e. Allow
> > > > userspace to create a userfaultfd that is tied to a specific task. Any
> > > > userfaults encountered by that task use that fd, rather than the
> > > > process-wide fd. I'm making the assumption here that each of these fds
> > > > would have independent signaling mechanisms/queues and so this would
> > > > solve the scaling problem.
> > > >
> > > > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> > > > vCPU for handling userfault requests. This seems like it'd have
> > > > roughly the same scalability characteristics as the KVM -EFAULT
> > > > approach.
> > >
> > > I think this would work in principle, but it's significantly different
> > > from what exists today.
> > >
> > > The splitting of userfaultfds Peter is describing is splitting up the
> > > HVA address space, not splitting per-thread.
> > >
> > > I think for this design, we'd need to change UFFD registration so
> > > multiple UFFDs can register the same VMA, but can be filtered so they
> > > only receive fault events caused by some particular tid(s).
> > >
> > > This might also incur some (small?) overhead, because in the fault
> > > path we now need to maintain some data structure so we can lookup
> > > which UFFD to notify based on a combination of the address and our
> > > tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial.
> >
> > I was (perhaps naively) assuming the lookup would be as simple as:
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 44d1ee429eb0..e9856e2ba9ef 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >          */
> >         mmap_assert_locked(mm);
> >
> > -       ctx = vma->vm_userfaultfd_ctx.ctx;
> > +       if (current->userfaultfd_ctx)
> > +               ctx = current->userfaultfd_ctx;
> > +       else
> > +               ctx = vma->vm_userfaultfd_ctx.ctx;
> >         if (!ctx)
> >                 goto out;

This is an interesting idea, but I'll just double check whether that's
the only solution before growing task_struct. :)

I'd start with hash(tid) or even hash(pcpu) to choose a queue.  In a
pinned use case hash(pcpu) should probably reach a similar goal here (and
I'd guess hash(tid) would too, if vcpus are mostly always created in one
shot; it's just slightly trickier).

> 
> Hmm, perhaps. It might have to be more complicated if we want to allow
> a single task to have both per-TID UFFDs for some addresses, and
> "global" UFFDs for others.
> 
> 
> 
> Actually, while thinking about this, another wrinkle:
> 
> Imagine we have per-thread UFFDs. Thread X faults on some address, and
> goes to sleep waiting for its paired resolver thread to resolve the
> fault.
> 
> In the meantime, thread Y also faults on the same address, before the
> resolution happens.
> 
> In the existing model, there is a single UFFD context per VMA, and
> therefore a single wait queue for all threads to wait on. In the
> per-TID-UFFD design, now each thread has its own context, and
> ostensibly its own wait queue (since the wait queue locks are where
> Anish saw the contention, I think this is exactly what we want to
> split up). When we have this "multiple threads waiting on the same
> address" situation, how do we ensure the fault is resolved exactly
> once? And how do we wake up all of the sleeping threads when it is
> resolved?

We probably need to wake them one by one in that case.  The 2nd-Nth
UFFDIO_COPY/CONTINUE will fail with -EEXIST anyway; then the userspace
app will need a UFFDIO_WAKE, I assume.
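
A sketch of what that resolver-side handling might look like. The
UFFDIO_CONTINUE, UFFDIO_WAKE, and -EEXIST semantics are existing
userfaultfd API; the page-size handling is simplified:

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Install the page if we win the race; otherwise just wake the sleepers. */
static int resolve_minor_fault(int uffd, __u64 addr, __u64 page_size)
{
	struct uffdio_continue cont = {
		.range = { .start = addr, .len = page_size },
	};

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == 0)
		return 0;	/* resolved, waiter already woken */

	if (errno != EEXIST)
		return -1;

	/* Someone else already installed the PTE: wake the waiter(s). */
	struct uffdio_range range = { .start = addr, .len = page_size };

	return ioctl(uffd, UFFDIO_WAKE, &range);
}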

> 
> I'm sure it's solvable, but especially doing it without any locks /
> contention seems like it could be a bit complicated.

IMHO no complicated locking is needed.  Here the "complicated locking"
is done with the pgtable lock, and it should be reflected back to the
userspace app as -EEXISTs.

> 
> >
> > >
> > > I think it's worth keeping in mind that a selling point of Anish's
> > > approach is that it's a very small change. It's plausible we can come
> > > up with some alternative way to scale, but it seems to me everything
> > > suggested so far is likely to require a lot more code, complexity, and
> > > effort vs. Anish's approach.
> >
> > Agreed.
> >
> > Mostly I think the per-thread UFFD approach would add complexity on the
> > userspace side of things. With Anish's approach userspace is able to
> > trivially re-use the vCPU thread (and it's associated pCPU if pinned) to
> > handle the request. That gets more complicated when juggling the extra
> > paired threads.
> >
> > The per-thread approach would requires a new userfault UAPI change which
> > I think is a higher bar than the KVM UAPI change proposed here.
> >
> > The per-thread approach would require KVM call into slow GUP and take
> > the mmap_lock before contacting userspace. I'm not 100% convinced that's
> > a bad thing long term (e.g. it avoids the false-positive -EFAULT exits
> > in Anish's proposal), but could have performance implications.
> >
> > Lastly, inter-thread communication is likely slower than returning to
> > userspace from KVM_RUN. So the per-thread approach might increase the
> > end-to-end latency of demand fetches.

Right.  The overhead here is (IMHO):

  - KVM solution: vcpu exit -> enter again; whatever happens in the
    exit/enter procedure will count.

  - mm solution: at least the scheduling overhead, and meanwhile let's
    hope we can scale everywhere else first (I'm not sure whether there'll
    be other issues, e.g., even if we can split the uffd queues, hopefully
    there's nothing I overlooked that still needs the shared uffd context).

I'm not sure which one will be higher; maybe it depends (e.g., some
specific cases where vcpu KVM_RUN has higher overhead when loading?).

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-10 21:50                                       ` Peter Xu
  2023-05-11 17:17                                         ` David Matlack
@ 2023-05-15 17:16                                         ` Anish Moorthy
  1 sibling, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-05-15 17:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, Nadav Amit, Axel Rasmussen, Paolo Bonzini,
	maz, oliver.upton, James Houghton, bgardon, dmatlack, ricarkol,
	kvm, kvmarm

On Wed, May 10, 2023 at 2:51 PM Peter Xu <peterx@redhat.com> wrote:
>
> What I wanted to do is to understand whether there's still chance to
> provide a generic solution.  I don't know why you have had a bunch of pmu
> stack showing in the graph, perhaps you forgot to disable some of the perf
> events when doing the test?  Let me know if you figure out why it happened
> like that (so far I didn't see), but I feel guilty to keep overloading you
> with such questions.

Not at all, I'm happy to help try and answer as many questions as I
can. It helps me learn as well.

I'll see about revisiting these traces, but I'll be busy for the next
few days with other things, so I doubt they'll come soon. I'll jump
back into the mailing list sometime on Thursday/Friday.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-10 22:35   ` Sean Christopherson
  2023-05-10 23:44     ` Anish Moorthy
@ 2023-05-23 17:49     ` Anish Moorthy
  2023-06-01 22:43       ` Oliver Upton
  1 sibling, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-05-23 17:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Matlack, pbonzini, maz, oliver.upton, jthoughton, bgardon,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Wed, May 10, 2023 at 4:44 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Yeah, when I speed read the series, several of the conversions stood out as being
> > "wrong".  My (potentially unstated) idea was that KVM would only signal
> > KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
> > i.e. when the fault _might_ be resolvable by userspace.
>
> Sean, besides direct_map which other patches did you notice as needing
> to be dropped/marked as unrecoverable errors?

I went through on my own to try and identify the incorrect
annotations; here's my read.

Correct (or can easily be corrected)
-----------------------------------------------
- user_mem_abort
  Incorrect as-is: the annotations in patch 19 are incorrect, as they
cover an error-on-no-slot case and one more I don't fully understand;
the one in patch 20 should be good, though.

- kvm_vcpu_read/write_guest_page:
  Incorrect as-is, but can be fixed: the current annotations cover
gpa_to_hva_memslot(_prot) failures, which can happen when "gpa" is not
covered by a memslot. However, we can leave these as bare efaults and
just annotate the copy_to/from_user failures, which userspace should
be able to resolve by checking/changing the slot permissions.

- kvm_handle_error_pfn
  Correct: at the annotation point, the fault must be either (a) a
read/write to a writable memslot or (b) a read from a readable one.
hva_to_pfn must have returned KVM_PFN_ERR_FAULT, which userspace can
attempt to resolve using one of the MADV_* operations.

Flatly Incorrect (will drop in next version)
-----------------------------------------------
- kvm_handle_page_fault
  efault corresponds to a kernel bug not resolvable by userspace

- direct_map
  Same as above

- kvm_mmu_page_fault
  Not a "leaf" return of efault, Also, the
check-for-efault-and-annotate here catches efaults which userspace can
do nothing about: such as the one from direct_map [1]

Unsure (Switch kvm_read/write_guest to kvm_vcpu_read/write_guest?)
-----------------------------------------------

- setup_vmgexit_scratch and kvm_pv_clock_pairing
  These efault on errors from kvm_read/write_guest, and theoretically
it does seem to make sense to annotate them. However, the annotations
are incorrect as-is for the same reason that the
kvm_vcpu_read/write_guest_page annotations need to be corrected.

In fact, the kvm_read/write_guest calls are of the form
"kvm_read_guest(vcpu->kvm, ...)": if we switched these calls to
kvm_vcpu_read/write_guest instead, then it seems like we'd get correct
annotations for free. Would it be correct to make this switch? If not,
then perhaps an optional kvm_vcpu* parameter for the "non-vcpu"
read/write functions strictly for annotation purposes? That seems
rather ugly though...
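
For illustration, this is the kind of mechanical change being asked
about (a hypothetical call site, not a diff from the series; both
helpers already exist in KVM with these signatures):

#include <linux/kvm_host.h>

/* Hypothetical call site: read guest memory on behalf of a vCPU. */
static int read_guest_scratch(struct kvm_vcpu *vcpu, gpa_t gpa,
			      void *buf, unsigned long len)
{
	/* Before: kvm_read_guest(vcpu->kvm, gpa, buf, len); */
	return kvm_vcpu_read_guest(vcpu, gpa, buf, len);
}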

Unsure (Similar-ish to above)
-----------------------------------------------

- kvm_hv_get_assist_page
  Incorrect as-is. The existing annotation would cover some efaults
which userspace doesn't seem likely to be able to resolve [2]. Right
after those, though, there's a copy_from_user which it could make sense
to annotate.

The efault here comes from failures of
kvm_read_guest_cached/kvm_read_guest_offset_cached, for which all of
the calls are again of the form "f(vcpu->kvm, ...)". Again, we'll need
either an (optional) vcpu parameter or to refactor these to just take
a "kvm_vcpu" instead if we want to annotate just the failing
uaccesses.

PS: I plan to add a couple of flags to the memory fault exit to
identify whether the failed access was a read/write/exec


[1] https://github.com/torvalds/linux/blob/v6.3/arch/x86/kvm/mmu/mmu.c#L3196
[2] https://github.com/torvalds/linux/blob/v6.3/virt/kvm/kvm_main.c#L3261-L3270

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
  2023-04-24 21:02   ` Sean Christopherson
@ 2023-06-01 16:04     ` Oliver Upton
  0 siblings, 0 replies; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 16:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Anish Moorthy, pbonzini, maz, jthoughton, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

Better late than never, right? :)

On Mon, Apr 24, 2023 at 02:02:49PM -0700, Sean Christopherson wrote:
> On Wed, Apr 12, 2023, Anish Moorthy wrote:
> > Add documentation, memslot flags, useful helper functions, and the
> > actual new capability itself.
> > 
> > Memory fault exits on absent mappings are particularly useful for
> > userfaultfd-based postcopy live migration. When many vCPUs fault on a
> > single userfaultfd the faults can take a while to surface to userspace
> > due to having to contend for uffd wait queue locks. Bypassing the uffd
> > entirely by returning information directly to the vCPU exit avoids this
> > contention and improves the fault rate.
> > 
> > Suggested-by: James Houghton <jthoughton@google.com>
> > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 31 ++++++++++++++++++++++++++++---
> >  include/linux/kvm_host.h       |  7 +++++++
> >  include/uapi/linux/kvm.h       |  2 ++
> >  tools/include/uapi/linux/kvm.h |  1 +
> >  virt/kvm/kvm_main.c            |  3 +++
> >  5 files changed, 41 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f174f43c38d45..7967b9909e28b 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
> >    /* for kvm_userspace_memory_region::flags */
> >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >    #define KVM_MEM_READONLY	(1UL << 1)
> > +  #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
> 
> This name is both too specific and too vague.  It's too specific because it affects
> more than just "absent" mappings, it will affect any page fault that can't be
> resolved by fast GUP, i.e. I'm objecting for all the same reasons I objected to
> the exit reason being name KVM_MEMFAULT_REASON_ABSENT_MAPPING.  It's too vague
> because it doesn't describe what behavior the flag actually enables in any way.
> 
> I liked the "nowait" verbiage from the RFC.  "fast_only" is an ok alternative,
> but that's much more of a kernel-internal name.
> 
> Oliver, you had concerns with using "fault" in the name, is something like
> KVM_MEM_NOWAIT_ON_PAGE_FAULT or KVM_MEM_NOWAIT_ON_FAULT palatable?  IMO, "fault"
> is perfectly ok, we just need to ensure it's unlikely to be ambiguous for userspace.

Yeah, I can get over it. Slight preference towards KVM_MEM_NOWAIT_ON_FAULT,
fewer characters and still gets the point across.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
  2023-04-12 21:35 ` [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation Anish Moorthy
  2023-04-19 14:00   ` Hoo Robert
  2023-04-24 21:02   ` Sean Christopherson
@ 2023-06-01 18:19   ` Oliver Upton
  2023-06-01 18:59     ` Sean Christopherson
  2 siblings, 1 reply; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 18:19 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

Anish,

On Wed, Apr 12, 2023 at 09:35:05PM +0000, Anish Moorthy wrote:
> +7.35 KVM_CAP_ABSENT_MAPPING_FAULT
> +---------------------------------
> +
> +:Architectures: None
> +:Returns: -EINVAL.
> +
> +The presence of this capability indicates that userspace may pass the
> +KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> +to fail (-EFAULT) in response to page faults for which the userspace page tables
> +do not contain present mappings. Attempting to enable the capability directly
> +will fail.
> +
> +The range of guest physical memory causing the fault is advertised to userspace
> +through KVM_CAP_MEMORY_FAULT_INFO (if it is enabled).

Maybe third time is the charm. I *really* do not like the
interdependence between NOWAIT exits and the completely orthogonal
annotation of existing EFAULT exits.

How do we support a userspace that only cares about NOWAIT exits but
doesn't want other EFAULT exits to be annotated? It is very likely that
userspace will only know how to resolve NOWAIT exits anyway. Since we do
not provide a precise description of the conditions that caused an exit,
there's no way for userspace to differentiate between NOWAIT exits and
other exits it couldn't care less about.

NOWAIT exits w/o annotation (i.e. a 'bare' EFAULT) make even less sense
since userspace cannot even tell what address needs fixing at that
point.

This is why I had been suggesting we separate the two capabilities and
make annotated exits an unconditional property of NOWAIT exits. It
aligns with the practical use you're proposing for the series, and still
> puts userspace in the driver's seat for other issues it may or may not
care about.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
  2023-06-01 18:19   ` Oliver Upton
@ 2023-06-01 18:59     ` Sean Christopherson
  2023-06-01 19:29       ` Oliver Upton
  0 siblings, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2023-06-01 18:59 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Anish Moorthy, pbonzini, maz, jthoughton, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Thu, Jun 01, 2023, Oliver Upton wrote:
> Anish,
> 
> On Wed, Apr 12, 2023 at 09:35:05PM +0000, Anish Moorthy wrote:
> > +7.35 KVM_CAP_ABSENT_MAPPING_FAULT
> > +---------------------------------
> > +
> > +:Architectures: None
> > +:Returns: -EINVAL.
> > +
> > +The presence of this capability indicates that userspace may pass the
> > +KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> > +to fail (-EFAULT) in response to page faults for which the userspace page tables
> > +do not contain present mappings. Attempting to enable the capability directly
> > +will fail.
> > +
> > +The range of guest physical memory causing the fault is advertised to userspace
> > +through KVM_CAP_MEMORY_FAULT_INFO (if it is enabled).
> 
> Maybe third time is the charm. I *really* do not like the
> interdependence between NOWAIT exits and the completely orthogonal
> annotation of existing EFAULT exits.

They're not completely orthogonal, because the touchpoints for NOWAIT are themselves
existing EFAULT exits.

> How do we support a userspace that only cares about NOWAIT exits but
> doesn't want other EFAULT exits to be annotated?

We don't.  The proposed approach is to not change the return value, and the
vcpu->run union currently holds random garbage on -EFAULT, so I don't see any reason
to require userspace to opt-in, or to let userspace opt-out.  I.e. fill
vcpu->run->memory_fault unconditionally (for the paths that are converted) and
advertise to userspace that vcpu->run->memory_fault *may* contain useful info on
-EFAULT when KVM_CAP_MEMORY_FAULT_INFO is supported.  And then we define KVM's
ABI such that vcpu->run->memory_fault is guaranteed to be valid if an -EFAULT occurs
when faulting in guest memory (on supported architectures).

> It is very likely that userspace will only know how to resolve NOWAIT exits
> anyway. Since we do not provide a precise description of the conditions that
> caused an exit, there's no way for userspace to differentiate between NOWAIT
> exits and other exits it couldn't care less about.
> 
> NOWAIT exits w/o annotation (i.e. a 'bare' EFAULT) make even less sense
> since userspace cannot even tell what address needs fixing at that
> point.
> 
> This is why I had been suggesting we separate the two capabilities and
> make annotated exits an unconditional property of NOWAIT exits.

No, because as I've been stating ad nauseam, KVM cannot differentiate between a
NOWAIT -EFAULT and an -EFAULT that would have occurred regardless of the NOWAIT
behavior.  Defining the ABI to be that KVM fills memory_fault if and only if the
slot has NOWAIT will create a mess, e.g. if an -EFAULT occurs while userspace
is doing a KVM_SET_USER_MEMORY_REGION to set NOWAIT, userspace may or may not see
valid memory_fault information depending on when the vCPU grabbed its memslot
snapshot.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
  2023-06-01 18:59     ` Sean Christopherson
@ 2023-06-01 19:29       ` Oliver Upton
  2023-06-01 19:34         ` Sean Christopherson
  0 siblings, 1 reply; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 19:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Anish Moorthy, pbonzini, maz, jthoughton, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Thu, Jun 01, 2023 at 11:59:29AM -0700, Sean Christopherson wrote:
> On Thu, Jun 01, 2023, Oliver Upton wrote:
> > How do we support a userspace that only cares about NOWAIT exits but
> > doesn't want other EFAULT exits to be annotated?
> 
> We don't.  The proposed approach is to not change the return value, and the
> vcpu->run union currently holds random garbage on -EFAULT, so I don't see any reason
> to require userspace to opt-in, or to let userspace opt-out.  I.e. fill
> vcpu->run->memory_fault unconditionally (for the paths that are converted) and
> advertise to userspace that vcpu->run->memory_fault *may* contain useful info on
> -EFAULT when KVM_CAP_MEMORY_FAULT_INFO is supported.  And then we define KVM's
> > ABI such that vcpu->run->memory_fault is guaranteed to be valid if an -EFAULT occurs
> when faulting in guest memory (on supported architectures).

Sure, but the series currently gives userspace an explicit opt-in for
existing EFAULT paths. Hold your breath, I'll reply over there so we
don't mix context.

> > It is very likely that userspace will only know how to resolve NOWAIT exits
> > anyway. Since we do not provide a precise description of the conditions that
> > caused an exit, there's no way for userspace to differentiate between NOWAIT
> > exits and other exits it couldn't care less about.
> > 
> > NOWAIT exits w/o annotation (i.e. a 'bare' EFAULT) make even less sense
> > since userspace cannot even tell what address needs fixing at that
> > point.
> > 
> > This is why I had been suggesting we separate the two capabilities and
> > make annotated exits an unconditional property of NOWAIT exits.
> 
> No, because as I've been stating ad nauseum, KVM cannot differentiate between a
> NOWAIT -EFAULT and an -EFAULT that would have occurred regardless of the NOWAIT
> behavior.

IOW: "If you engage brain for more than a second, you'll actually see
the point"

Ok, I'm on board now and sorry for the noise.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
  2023-06-01 19:29       ` Oliver Upton
@ 2023-06-01 19:34         ` Sean Christopherson
  0 siblings, 0 replies; 103+ messages in thread
From: Sean Christopherson @ 2023-06-01 19:34 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Anish Moorthy, pbonzini, maz, jthoughton, bgardon, dmatlack,
	ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Thu, Jun 01, 2023, Oliver Upton wrote:
> On Thu, Jun 01, 2023 at 11:59:29AM -0700, Sean Christopherson wrote:
> > On Thu, Jun 01, 2023, Oliver Upton wrote:
> > > How do we support a userspace that only cares about NOWAIT exits but
> > > doesn't want other EFAULT exits to be annotated?
> > 
> > We don't.  The proposed approach is to not change the return value, and the
> > vcpu->run union currently holds random garbage on -EFAULT, so I don't see any reason
> > to require userspace to opt-in, or to let userspace opt-out.  I.e. fill
> > vcpu->run->memory_fault unconditionally (for the paths that are converted) and
> > advertise to userspace that vcpu->run->memory_fault *may* contain useful info on
> > -EFAULT when KVM_CAP_MEMORY_FAULT_INFO is supported.  And then we define KVM's
> > ABI such that vcpu->run->memory_fault is guaranteed to be valid if an -EFAULT occurs
> > when faulting in guest memory (on supported architectures).
> 
> Sure, but the series currently gives userspace an explicit opt-in for
> existing EFAULT paths. 

Yeah, that's one of the things I am/was going to provide feedback on; I've been
really slow getting into reviews for this cycle :-/

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-04-12 21:34 ` [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
  2023-04-19 13:57   ` Hoo Robert
@ 2023-06-01 19:52   ` Oliver Upton
  2023-06-01 20:30     ` Anish Moorthy
  2023-07-04 10:10   ` Kautuk Consul
  2 siblings, 1 reply; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 19:52 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

On Wed, Apr 12, 2023 at 09:34:53PM +0000, Anish Moorthy wrote:

[...]

> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86, arm64
> +:Parameters: args[0] - KVM_MEMORY_FAULT_INFO_ENABLE|DISABLE to enable/disable
> +             the capability.
> +:Returns: 0 on success, or -EINVAL if unsupported or invalid args[0].
> +
> +When enabled, EFAULTs "returned" by KVM_RUN in response to failed vCPU guest
> +memory accesses may be annotated with additional information. When KVM_RUN
> +returns an error with errno=EFAULT, userspace may check the exit reason: if it
> +is KVM_EXIT_MEMORY_FAULT, userspace is then permitted to read the 'memory_fault'
> +member of the run struct.

So the other angle of my concern w.r.t. NOWAIT exits is the fact that
userspace gets to decide whether or not we annotate such an exit. We all
agree that a NOWAIT exit w/o context isn't actionable, right?

Sean is suggesting that we abuse the fact that kvm_run already contains
junk for EFAULT exits and populate kvm_run::memory_fault unconditionally
[*]. I agree with him, and it eliminates the odd quirk of 'bare' NOWAIT
exits too. Old userspace will still see 'garbage' in kvm_run struct,
but one man's trash is another man's treasure after all :)

So, based on that, could you:

 - Unconditionally prepare MEMORY_FAULT exits everywhere you're
   converting here

 - Redefine KVM_CAP_MEMORY_FAULT_INFO as an informational cap, and do
   not accept an attempt to enable it. Instead, have calls to
   KVM_CHECK_EXTENSION return a set of flags describing the supported
   feature set.

   Eventually, you can stuff a bit in there to advertise that all
   EFAULTs are reliable.

[*] https://lore.kernel.org/kvmarm/ZHjqkdEOVUiazj5d@google.com/
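
Roughly what I'm picturing for the userspace side, where vm_fd is the
usual KVM VM fd (the "reliable" bit below is a strawman name, not a
proposal for the actual UAPI):

	#define KVM_MEMORY_FAULT_INFO_RELIABLE	(1 << 0)	/* strawman */

	int flags = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MEMORY_FAULT_INFO);

	if (flags <= 0) {
		/* Cap absent: every EFAULT from KVM_RUN is opaque. */
	} else if (flags & KVM_MEMORY_FAULT_INFO_RELIABLE) {
		/* Every EFAULT from KVM_RUN carries valid memory_fault info. */
	} else {
		/* Only converted paths annotate; check exit_reason first. */
	}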

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cf7d3de6f3689..f3effc93cbef3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  	spin_lock_init(&kvm->mn_invalidate_lock);
>  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>  	xa_init(&kvm->vcpu_array);
> +	kvm->fill_efault_info = false;
>  
>  	INIT_LIST_HEAD(&kvm->gpc_list);
>  	spin_lock_init(&kvm->gpc_lock);
> @@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
>  			put_pid(oldpid);
>  		}
>  		r = kvm_arch_vcpu_ioctl_run(vcpu);
> +		WARN_ON_ONCE(r == -EFAULT &&
> +					 vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);

This might be a bit overkill, as it will definitely fire on unsupported
architectures. Instead you may want to condition this on an architecture
actually selecting support for MEMORY_FAULT_INFO.
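
Something like the following is what I have in mind (untested, and
CONFIG_HAVE_KVM_MEMORY_FAULT_INFO is just a placeholder name for
whatever the per-arch opt-in ends up being):

		r = kvm_arch_vcpu_ioctl_run(vcpu);
		/* Only arches that actually fill the info select the symbol. */
		if (IS_ENABLED(CONFIG_HAVE_KVM_MEMORY_FAULT_INFO))
			WARN_ON_ONCE(r == -EFAULT &&
				     vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);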

>  		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
>  		break;
>  	}
> @@ -4672,6 +4675,15 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  
>  		return r;
>  	}
> +	case KVM_CAP_MEMORY_FAULT_INFO: {
> +		if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap)
> +			|| (cap->args[0] != KVM_MEMORY_FAULT_INFO_ENABLE
> +				&& cap->args[0] != KVM_MEMORY_FAULT_INFO_DISABLE)) {
> +			return -EINVAL;
> +		}
> +		kvm->fill_efault_info = cap->args[0] == KVM_MEMORY_FAULT_INFO_ENABLE;
> +		return 0;
> +	}
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -6173,3 +6185,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  
>  	return init_context.err;
>  }
> +
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> +					uint64_t gpa, uint64_t len)
> +{
> +	if (!vcpu->kvm->fill_efault_info)
> +		return;
> +
> +	preempt_disable();
> +	/*
> +	 * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> +	 * would open the door for races between concurrent calls to this
> +	 * function.
> +	 */
> +	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> +		goto out;
> +	/*
> +	 * Try not to overwrite an already-populated run struct.
> +	 * This isn't a perfect solution, as there's no guarantee that the exit
> +	 * reason is set before the run struct is populated, but it should prevent
> +	 * at least some bugs.
> +	 */
> +	else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
> +		goto out;
> +
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.len = len;
> +	vcpu->run->memory_fault.flags = 0;
> +
> +out:
> +	preempt_enable();
> +}
> -- 
> 2.40.0.577.gac1e443424-goog
> 

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-01 19:52   ` Oliver Upton
@ 2023-06-01 20:30     ` Anish Moorthy
  2023-06-01 21:29       ` Oliver Upton
  0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-06-01 20:30 UTC (permalink / raw)
  To: Oliver Upton
  Cc: pbonzini, maz, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

On Thu, Jun 1, 2023 at 12:52 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> So the other angle of my concern w.r.t. NOWAIT exits is the fact that
> userspace gets to decide whether or not we annotate such an exit. We all
> agree that a NOWAIT exit w/o context isn't actionable, right?

Yup

> Sean is suggesting that we abuse the fact that kvm_run already contains
> junk for EFAULT exits and populate kvm_run::memory_fault unconditionally
> [*]. I agree with him, and it eliminates the odd quirk of 'bare' NOWAIT
> exits too. Old userspace will still see 'garbage' in kvm_run struct,
> but one man's trash is another man's treasure after all :)
>
> So, based on that, could you:
>
>  - Unconditionally prepare MEMORY_FAULT exits everywhere you're
>    converting here
>
>  - Redefine KVM_CAP_MEMORY_FAULT_INFO as an informational cap, and do
>    not accept an attempt to enable it. Instead, have calls to
>    KVM_CHECK_EXTENSION return a set of flags describing the supported
>    feature set.

Sure. I've been collecting feedback as it comes in, so I can send up a
v4 with everything up to now soon. The major thing left to resolve is
that the exact set of annotations is still waiting on feedback: I've
already gone ahead and dropped everything I wasn't sure of in [1], so
the next version will be quite a bit smaller. If it turns out that
I've dropped too much, then I can add things back in based on the
feedback.

[1] https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf

>    Eventually, you can stuff a bit in there to advertise that all
>    EFAULTs are reliable.

I don't think this is an objective: the idea is to annotate efaults
tracing back to user accesses (see [2]). That said, the idea of
annotating other efaults with some "unrecoverable" flag set has
been tossed around, so we may end up with that.

[2] https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#m5715f3a14a6a9ff9a4188918ec105592f0bfc69a
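
Purely for illustration, the userspace side of that idea might look
something like this (KVM_MEMORY_FAULT_FLAG_UNRECOVERABLE is not a real
flag in this series, just a sketch):

	if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		if (run->memory_fault.flags & KVM_MEMORY_FAULT_FLAG_UNRECOVERABLE) {
			/* Nothing userspace can do: log the range and bail. */
		} else {
			/*
			 * Resolvable, e.g. UFFDIO_CONTINUE or MADV_POPULATE_WRITE
			 * on the hva range backing [gpa, gpa + len).
			 */
		}
	}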

> [*] https://lore.kernel.org/kvmarm/ZHjqkdEOVUiazj5d@google.com/
>
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index cf7d3de6f3689..f3effc93cbef3 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> >       spin_lock_init(&kvm->mn_invalidate_lock);
> >       rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> >       xa_init(&kvm->vcpu_array);
> > +     kvm->fill_efault_info = false;
> >
> >       INIT_LIST_HEAD(&kvm->gpc_list);
> >       spin_lock_init(&kvm->gpc_lock);
> > @@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
> >                       put_pid(oldpid);
> >               }
> >               r = kvm_arch_vcpu_ioctl_run(vcpu);
> > +             WARN_ON_ONCE(r == -EFAULT &&
> > +                                      vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
>
> This might be a bit overkill, as it will definitely fire on unsupported
> architectures. Instead you may want to condition this on an architecture
> actually selecting support for MEMORY_FAULT_INFO.

Ah, that's embarrassing. Thanks for the catch.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-06-01 20:30     ` Anish Moorthy
@ 2023-06-01 21:29       ` Oliver Upton
  0 siblings, 0 replies; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 21:29 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
	axelrasmussen, peterx, kvm, kvmarm

On Thu, Jun 01, 2023 at 01:30:58PM -0700, Anish Moorthy wrote:
> On Thu, Jun 1, 2023 at 12:52 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >    Eventually, you can stuff a bit in there to advertise that all
> >    EFAULTs are reliable.
> 
> I don't think this is an objective: the idea is to annotate efaults
> tracing back to user accesses (see [2]). That said, the idea of
> annotating other efaults with some "unrecoverable" flag set has
> been tossed around, so we may end up with that.

Right, there's quite a bit of detail entailed by what such a bit
means... In any case, the idea would be to have a forward-looking
stance with the UAPI where we can bolt on more things to the existing
CAP in the future.

> [2] https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#m5715f3a14a6a9ff9a4188918ec105592f0bfc69a
> 
> > [*] https://lore.kernel.org/kvmarm/ZHjqkdEOVUiazj5d@google.com/
> >
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index cf7d3de6f3689..f3effc93cbef3 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > >       spin_lock_init(&kvm->mn_invalidate_lock);
> > >       rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > >       xa_init(&kvm->vcpu_array);
> > > +     kvm->fill_efault_info = false;
> > >
> > >       INIT_LIST_HEAD(&kvm->gpc_list);
> > >       spin_lock_init(&kvm->gpc_lock);
> > > @@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
> > >                       put_pid(oldpid);
> > >               }
> > >               r = kvm_arch_vcpu_ioctl_run(vcpu);
> > > +             WARN_ON_ONCE(r == -EFAULT &&
> > > +                                      vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
> >
> > This might be a bit overkill, as it will definitely fire on unsupported
> > architectures. Instead you may want to condition this on an architecture
> > actually selecting support for MEMORY_FAULT_INFO.
> 
> Ah, that's embarrassing. Thanks for the catch.

No problem at all. Pretty sure I've done a lot more actually egregious
changes than you have ;)

While we're here, forgot to mention it before but please clean up that
indentation too. I think you may've gotten in a fight with the Google3
styling of your editor and lost :)

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
  2023-05-23 17:49     ` Anish Moorthy
@ 2023-06-01 22:43       ` Oliver Upton
  0 siblings, 0 replies; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 22:43 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Sean Christopherson, David Matlack, pbonzini, maz, jthoughton,
	bgardon, ricarkol, axelrasmussen, peterx, kvm, kvmarm

On Tue, May 23, 2023 at 10:49:04AM -0700, Anish Moorthy wrote:
> On Wed, May 10, 2023 at 4:44 PM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Yeah, when I speed read the series, several of the conversions stood out as being
> > > "wrong".  My (potentially unstated) idea was that KVM would only signal
> > > KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
> > > i.e. when the fault _might_ be resolvable by userspace.
> >
> > Sean, besides direct_map which other patches did you notice as needing
> > to be dropped/marked as unrecoverable errors?
> 
> I tried going through on my own to try and identify the incorrect
> annotations: here's my read.
> 
> Correct (or can easily be corrected)
> -----------------------------------------------
> - user_mem_abort
>   Incorrect as is: the annotations in patch 19 are incorrect, as they
> cover an error-on-no-slot case and one more I don't fully understand:

That other case is a wart we endearingly refer to as MTE (Memory Tagging
Extension). You theoretically _could_ pop out an annotated exit here, as
userspace likely messed up the mapping (like PROT_MTE missing).

But I'm perfectly happy letting someone complain about it before we go
out of our way to annotate that one. So feel free to drop.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
  2023-04-12 21:34 ` [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
  2023-04-19 13:57   ` Hoo Robert
  2023-06-01 19:52   ` Oliver Upton
@ 2023-07-04 10:10   ` Kautuk Consul
  2 siblings, 0 replies; 103+ messages in thread
From: Kautuk Consul @ 2023-07-04 10:10 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
	dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm

On 2023-04-12 21:34:53, Anish Moorthy wrote:
> KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
> besides a return value of -1 and errno of EFAULT when a vCPU fails an
> access to guest memory.
> 
> Add documentation, updates to the KVM headers, and a helper function
> (kvm_populate_efault_info) for implementing the capability.
> 
> Besides simply filling the run struct, kvm_populate_efault_info takes
> two safety measures
> 
>   a. It tries to prevent concurrent fills on a single vCPU run struct
>      by checking that the run struct being modified corresponds to the
>      currently loaded vCPU.
>   b. It tries to avoid filling an already-populated run struct by
>      checking whether the exit reason has been modified since entry
>      into KVM_RUN.
> 
> Finally, mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86,
> even though EFAULT annotations are currently totally absent. Picking a
> point to declare the implementation "done" is difficult because
> 
>   1. Annotations will be performed incrementally in subsequent commits
>      across both core and arch-specific KVM.
>   2. The initial series will very likely miss some cases which need
>      annotation. Although these omissions are to be fixed in the future,
>      userspace thus still needs to expect and be able to handle
>      unannotated EFAULTs.
> 
> Given these qualifications, just marking it available here seems the
> least arbitrary thing to do.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 35 +++++++++++++++++++++++++++
>  arch/arm64/kvm/arm.c           |  1 +
>  arch/x86/kvm/x86.c             |  1 +
>  include/linux/kvm_host.h       | 12 ++++++++++
>  include/uapi/linux/kvm.h       | 16 +++++++++++++
>  tools/include/uapi/linux/kvm.h | 11 +++++++++
>  virt/kvm/kvm_main.c            | 44 ++++++++++++++++++++++++++++++++++
>  7 files changed, 120 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 48fad65568227..f174f43c38d45 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6637,6 +6637,18 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len; /* in bytes */
> +		} memory_fault;
> +
> +Indicates a vCPU memory fault on the guest physical address range
> +[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
> +
>  ::
>  
>      /* KVM_EXIT_NOTIFY */
> @@ -7670,6 +7682,29 @@ This capability is aimed to mitigate the threat that malicious VMs can
>  cause CPU stuck (due to event windows don't open up) and make the CPU
>  unavailable to host or other VMs.
>  
> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86, arm64
> +:Parameters: args[0] - KVM_MEMORY_FAULT_INFO_ENABLE|DISABLE to enable/disable
> +             the capability.
> +:Returns: 0 on success, or -EINVAL if unsupported or invalid args[0].
> +
> +When enabled, EFAULTs "returned" by KVM_RUN in response to failed vCPU guest
> +memory accesses may be annotated with additional information. When KVM_RUN
> +returns an error with errno=EFAULT, userspace may check the exit reason: if it
> +is KVM_EXIT_MEMORY_FAULT, userspace is then permitted to read the 'memory_fault'
> +member of the run struct.
> +
> +The 'gpa' and 'len' (in bytes) fields describe the range of guest
> +physical memory to which access failed, i.e. [gpa, gpa + len). 'flags' is
> +currently always zero.
> +
> +NOTE: The implementation of this capability is incomplete. Even with it enabled,
> +userspace may receive "bare" EFAULTs (i.e. exit reason !=
> +KVM_EXIT_MEMORY_FAULT) from KVM_RUN. These should be considered bugs and
> +reported to the maintainers.
> +
>  8. Other capabilities.
>  ======================
>  
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index a43e1cb3b7e97..a932346b59f61 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -220,6 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_VCPU_ATTRIBUTES:
>  	case KVM_CAP_PTP_KVM:
>  	case KVM_CAP_ARM_SYSTEM_SUSPEND:
> +	case KVM_CAP_MEMORY_FAULT_INFO:
>  		r = 1;
>  		break;
>  	case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ca73eb066af81..0925678e741de 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4432,6 +4432,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_VAPIC:
>  	case KVM_CAP_ENABLE_CAP:
>  	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
> +	case KVM_CAP_MEMORY_FAULT_INFO:
>  		r = 1;
>  		break;
>  	case KVM_CAP_EXIT_HYPERCALL:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 90edc16d37e59..776f9713f3921 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -805,6 +805,8 @@ struct kvm {
>  	struct notifier_block pm_notifier;
>  #endif
>  	char stats_id[KVM_STATS_NAME_SIZE];
> +
> +	bool fill_efault_info;
>  };
>  
>  #define kvm_err(fmt, ...) \
> @@ -2277,4 +2279,14 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +/*
> + * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
> + * populate the memory_fault field with the given information.
> + *
> + * Does nothing if KVM_CAP_MEMORY_FAULT_INFO is not enabled. WARNs and does
> + * nothing if the exit reason is not KVM_EXIT_UNKNOWN, or if 'vcpu' is not
> + * the current running vcpu.
> + */
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> +					uint64_t gpa, uint64_t len);
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 4003a166328cc..bc73e8381a2bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -505,6 +506,16 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			/*
> +			 * Indicates a memory fault on the guest physical address range
> +			 * [gpa, gpa + len). flags is always zero for now.
> +			 */
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len; /* in bytes */
> +		} memory_fault;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> @@ -1184,6 +1195,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
>  #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
>  #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
> +#define KVM_CAP_MEMORY_FAULT_INFO 227
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -2237,4 +2249,8 @@ struct kvm_s390_zpci_op {
>  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>  
> +/* flags for KVM_CAP_MEMORY_FAULT_INFO */
> +#define KVM_MEMORY_FAULT_INFO_DISABLE  0
> +#define KVM_MEMORY_FAULT_INFO_ENABLE   1
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 4003a166328cc..5c57796364d65 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -505,6 +506,16 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			/*
> +			 * Indicates a memory fault on the guest physical address range
> +			 * [gpa, gpa + len). flags is always zero for now.
> +			 */
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len; /* in bytes */
> +		} memory_fault;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cf7d3de6f3689..f3effc93cbef3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  	spin_lock_init(&kvm->mn_invalidate_lock);
>  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>  	xa_init(&kvm->vcpu_array);
> +	kvm->fill_efault_info = false;
>  
>  	INIT_LIST_HEAD(&kvm->gpc_list);
>  	spin_lock_init(&kvm->gpc_lock);
> @@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
>  			put_pid(oldpid);
>  		}
>  		r = kvm_arch_vcpu_ioctl_run(vcpu);
> +		WARN_ON_ONCE(r == -EFAULT &&
> +					 vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
>  		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
>  		break;
>  	}
> @@ -4672,6 +4675,15 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  
>  		return r;
>  	}
> +	case KVM_CAP_MEMORY_FAULT_INFO: {
> +		if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap)
> +			|| (cap->args[0] != KVM_MEMORY_FAULT_INFO_ENABLE
> +				&& cap->args[0] != KVM_MEMORY_FAULT_INFO_DISABLE)) {
> +			return -EINVAL;
> +		}
> +		kvm->fill_efault_info = cap->args[0] == KVM_MEMORY_FAULT_INFO_ENABLE;
> +		return 0;
> +	}
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -6173,3 +6185,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  
>  	return init_context.err;
>  }
> +
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> +					uint64_t gpa, uint64_t len)
> +{
> +	if (!vcpu->kvm->fill_efault_info)
> +		return;
> +
> +	preempt_disable();
> +	/*
> +	 * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> +	 * would open the door for races between concurrent calls to this
> +	 * function.
> +	 */
> +	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> +		goto out;
Why use WARN_ON_ONCE when there is a clear possibility of preemption
kicking in (with vcpu_load/vcpu_put possibly being called in the new
task) before preempt_disable() is called in this function?
WARN_ON_ONCE should be reserved for situations that ought to be
impossible, not for conditions the code itself allows to happen. The
WARN_ON_ONCE would make sense if kvm_populate_efault_info() were
called from atomic context, but not when this function is the one
disabling preemption. Basically, there is no way to guarantee that
preemption doesn't kick in before the preempt_disable(), so this
doesn't seem like a condition that deserves a kernel warning.
Can we get rid of the WARN_ON_ONCE and jump straight to the out label
when "(vcpu != __this_cpu_read(kvm_running_vcpu))" is true? Or please
correct me if I am wrong about something.
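
IOW, something like this (completely untested sketch):

	preempt_disable();
	/*
	 * Bail quietly if 'vcpu' is not the vCPU currently loaded on this
	 * physical CPU, rather than treating it as a condition worth a WARN.
	 */
	if (vcpu != __this_cpu_read(kvm_running_vcpu))
		goto out;
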
> +	/*
> +	 * Try not to overwrite an already-populated run struct.
> +	 * This isn't a perfect solution, as there's no guarantee that the exit
> +	 * reason is set before the run struct is populated, but it should prevent
> +	 * at least some bugs.
> +	 */
> +	else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
> +		goto out;
> +
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.len = len;
> +	vcpu->run->memory_fault.flags = 0;
> +
> +out:
> +	preempt_enable();
> +}
> -- 
> 2.40.0.577.gac1e443424-goog
> 

^ permalink raw reply	[flat|nested] 103+ messages in thread

end of thread, other threads:[~2023-07-04 10:11 UTC | newest]

Thread overview: 103+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
2023-04-19 13:51   ` Hoo Robert
2023-04-20 17:55     ` Anish Moorthy
2023-04-21 12:15       ` Robert Hoo
2023-04-21 16:21         ` Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
2023-04-19 13:36   ` Hoo Robert
2023-04-19 23:26     ` Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 03/22] KVM: Allow hva_pfn_fast() to resolve read-only faults Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN Anish Moorthy
2023-05-02 17:17   ` Anish Moorthy
2023-05-02 18:51     ` Sean Christopherson
2023-05-02 19:49       ` Anish Moorthy
2023-05-02 20:41         ` Sean Christopherson
2023-05-02 21:46           ` Anish Moorthy
2023-05-02 22:31             ` Sean Christopherson
2023-04-12 21:34 ` [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
2023-04-19 13:57   ` Hoo Robert
2023-04-20 18:09     ` Anish Moorthy
2023-04-21 12:28       ` Robert Hoo
2023-06-01 19:52   ` Oliver Upton
2023-06-01 20:30     ` Anish Moorthy
2023-06-01 21:29       ` Oliver Upton
2023-07-04 10:10   ` Kautuk Consul
2023-04-12 21:34 ` [PATCH v3 06/22] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page() Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page() Anish Moorthy
2023-04-20 20:52   ` Peter Xu
2023-04-20 23:29     ` Anish Moorthy
2023-04-21 15:00       ` Peter Xu
2023-04-12 21:34 ` [PATCH v3 08/22] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page() Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map() Anish Moorthy
2023-04-20 20:53   ` Peter Xu
2023-04-20 23:34     ` Anish Moorthy
2023-04-21 14:58       ` Peter Xu
2023-04-12 21:34 ` [PATCH v3 10/22] KVM: x86: Annotate -EFAULTs from kvm_mmu_page_fault() Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 11/22] KVM: x86: Annotate -EFAULTs from setup_vmgexit_scratch() Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 12/22] KVM: x86: Annotate -EFAULTs from kvm_handle_page_fault() Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 13/22] KVM: x86: Annotate -EFAULTs from kvm_hv_get_assist_page() Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 14/22] KVM: x86: Annotate -EFAULTs from kvm_pv_clock_pairing() Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 15/22] KVM: x86: Annotate -EFAULTs from direct_map() Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 16/22] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation Anish Moorthy
2023-04-19 14:00   ` Hoo Robert
2023-04-20 18:23     ` Anish Moorthy
2023-04-24 21:02   ` Sean Christopherson
2023-06-01 16:04     ` Oliver Upton
2023-06-01 18:19   ` Oliver Upton
2023-06-01 18:59     ` Sean Christopherson
2023-06-01 19:29       ` Oliver Upton
2023-06-01 19:34         ` Sean Christopherson
2023-04-12 21:35 ` [PATCH v3 18/22] KVM: x86: Implement KVM_CAP_ABSENT_MAPPING_FAULT Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 19/22] KVM: arm64: Annotate (some) -EFAULTs from user_mem_abort() Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 20/22] KVM: arm64: Implement KVM_CAP_ABSENT_MAPPING_FAULT Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 21/22] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
2023-04-19 14:09   ` Hoo Robert
2023-04-19 16:40     ` Anish Moorthy
2023-04-20 22:47     ` Anish Moorthy
2023-04-27 15:48   ` James Houghton
2023-05-01 18:01     ` Anish Moorthy
2023-04-19 19:55 ` [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Peter Xu
2023-04-19 20:15   ` Axel Rasmussen
2023-04-19 21:05     ` Peter Xu
     [not found]       ` <CAF7b7mo68VLNp=QynfT7QKgdq=d1YYGv1SEVEDxF9UwHzF6YDw@mail.gmail.com>
2023-04-20 21:29         ` Peter Xu
2023-04-21 16:58           ` Anish Moorthy
2023-04-21 17:39           ` Nadav Amit
2023-04-24 17:54             ` Anish Moorthy
2023-04-24 19:44               ` Nadav Amit
2023-04-24 20:35                 ` Sean Christopherson
2023-04-24 23:47                   ` Nadav Amit
2023-04-25  0:26                     ` Sean Christopherson
2023-04-25  0:37                       ` Nadav Amit
2023-04-25  0:15                 ` Anish Moorthy
2023-04-25  0:54                   ` Nadav Amit
2023-04-27 16:38                     ` James Houghton
2023-04-27 20:26                   ` Peter Xu
2023-05-03 19:45                     ` Anish Moorthy
2023-05-03 20:09                       ` Sean Christopherson
     [not found]                       ` <ZFLPlRReglM/Vgfu@x1n>
2023-05-03 21:27                         ` Peter Xu
2023-05-03 21:42                           ` Sean Christopherson
2023-05-03 23:45                             ` Peter Xu
2023-05-04 19:09                               ` Peter Xu
2023-05-05 18:32                                 ` Anish Moorthy
2023-05-08  1:23                                   ` Peter Xu
2023-05-09 20:52                                     ` Anish Moorthy
2023-05-10 21:50                                       ` Peter Xu
2023-05-11 17:17                                         ` David Matlack
2023-05-11 17:33                                           ` Axel Rasmussen
2023-05-11 19:05                                             ` David Matlack
2023-05-11 19:45                                               ` Axel Rasmussen
2023-05-15 15:16                                                 ` Peter Xu
2023-05-15 15:05                                             ` Peter Xu
2023-05-15 17:16                                         ` Anish Moorthy
2023-05-05 20:05                               ` Nadav Amit
2023-05-08  1:12                                 ` Peter Xu
2023-04-20 23:42         ` Anish Moorthy
2023-05-09 22:19 ` David Matlack
2023-05-10 16:35   ` Anish Moorthy
2023-05-10 22:35   ` Sean Christopherson
2023-05-10 23:44     ` Anish Moorthy
2023-05-23 17:49     ` Anish Moorthy
2023-06-01 22:43       ` Oliver Upton
