kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit
@ 2023-03-15  2:17 Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 01/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
                   ` (15 more replies)
  0 siblings, 16 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

Hi Sean, here's what I'm planing to send up as v2 of the scalable
userfaultfd series.

Don't worry, I'm not asking you to review this all :) I just have a few
remaining questions regarding KVM_CAP_MEMORY_FAULT_EXIT which seem important
enough to mention before I ask for more attention from others, and they'll be
clearer with the patches in hand. Anything else I'm happy to find out about when
I send the actual v2.

I want your opinion on

1. The general API I've set up for KVM_CAP_MEMORY_FAULT_EXIT
   (described in the api.rst file)
2. Whether the UNKNOWN exit reason cases (everywhere but
   handle_error_pfn atm) would need to be given "real" reasons
   before this could be merged.
3. If you think I've missed sites that currently -EFAULT to userspace

About (3): after we agreed to only tackle cases where -EFAULT currently makes it
to userspace, I went though our list and tried to trace which EFAULTS actually
bubble up to KVM_RUN. That set ended being suspiciously small, so I wanted to
sanity-check my findings with you. Lmk if you see obvious errors in my list
below.

--- EFAULTs under KVM_RUN ---

Confident that needs conversion (already converted)
---------------------------------------------------
* direct_map
* handle_error_pfn
* setup_vmgexit_scratch
* kvm_handle_page_fault
* FNAME(fetch)

EFAULT does not propagate to userspace (do not convert)
-------------------------------------------------------
* record_steal_time (arch/x86/kvm/x86.c:3463)
* hva_to_pfn_retry
* kvm_vcpu_map
* FNAME(update_accessed_dirty_bits)
* __kvm_gfn_to_hva_cache_init
  Might actually make it to userspace, but only through
  kvm_read|write_guest_offset_cached- would be covered by those conversions
* kvm_gfn_to_hva_cache_init
* __kvm_read_guest_page
* hva_to_pfn_remapped
  handle_error_pfn will handle this for the scalable uffd case. Don't think
  other callers -EFAULT to userspace.

Still unsure if needs conversion
--------------------------------
* __kvm_read_guest_atomic
  The EFAULT might be propagated though FNAME(sync_page)?
* kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226)
* __kvm_write_guest_page
  Called from kvm_write_guest_offset_cached: if that needs change, this does too
* kvm_write_guest_page
  Two interesting paths:
      - kvm_pv_clock_pairing returns a custom KVM_EFAULT error here
        (arch/x86/kvm/x86.c:9578)
      - kvm_write_guest_offset_cached returns this directly (so if that needs
        change, this does too)
* kvm_read_guest_offset_cached
  I actually do see a path to userspace, but it's through hyper-v, which we've
  said is out of scope for round 1.

--- Actual Cover Letter ---

Omitted: hasn't changed much since v1 anyways

--- Changelog ---

WIP v2
  - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
  - API changes:
        - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
          KVM_CAP_x86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
          requirement).
        - Switched to memslot flag
  - Take Oliver's simplification to the "allow fast gup for readable
    faults" logic.
  - Slightly redefine the return code of user_mem_abort.
  - Fix documentation errors brought up by Marc
  - Reword commit messages in imperative mood

v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/

Anish Moorthy (14):
  KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
    paging test
  KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
    signal errors via TEST_ASSERT
  KVM: Allow hva_pfn_fast to resolve read-only faults.
  KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run
    field
  KVM: x86: Implement memory fault exit for direct_map
  KVM: x86: Implement memory fault exit for kvm_handle_page_fault
  KVM: x86: Implement memory fault exit for setup_vmgexit_scratch
  KVM: x86: Implement memory fault exit for FNAME(fetch)
  KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  KVM: x86: Implement KVM_CAP_MEMORY_FAULT_NOWAIT
  KVM: arm64: Allow user_mem_abort to return 0 to signal a 'normal' exit
  KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT
  KVM: selftests: Add memslot_flags parameter to memstress_create_vm
  KVM: selftests: Handle memory fault exits in demand_paging_test

 Documentation/virt/kvm/api.rst                |  74 ++++-
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |  29 +-
 arch/x86/kvm/mmu/mmu.c                        |  42 ++-
 arch/x86/kvm/mmu/paging_tmpl.h                |   4 +-
 arch/x86/kvm/svm/sev.c                        |   4 +-
 arch/x86/kvm/x86.c                            |   2 +
 include/linux/kvm_host.h                      |  22 ++
 include/uapi/linux/kvm.h                      |  19 ++
 tools/include/uapi/linux/kvm.h                |  17 ++
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/access_tracking_perf_test.c |   2 +-
 .../selftests/kvm/demand_paging_test.c        | 253 ++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
 .../testing/selftests/kvm/include/memstress.h |   2 +-
 .../selftests/kvm/include/userfaultfd_util.h  |  18 +-
 tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 160 ++++++-----
 .../kvm/memslot_modification_stress_test.c    |   2 +-
 virt/kvm/kvm_main.c                           |  41 ++-
 20 files changed, 544 insertions(+), 158 deletions(-)

-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [WIP Patch v2 01/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 02/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

At the moment, demand_paging_test does not support profiling/testing
multiple vCPU threads concurrently faulting on a single uffd because

  (a) "-u" (run test in userfaultfd mode) creates a uffd for each vCPU's
      region, so that each uffd services a single vCPU thread.
  (b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
      simply doesn't work: the test tries to register the same memory
      to multiple uffds, causing an error.

Add support for many vcpus per uffd by
  (1) Keeping "-u" behavior unchanged.
  (2) Making "-u -a" create a single uffd for all of guest memory.
  (3) Making "-u -o" implicitly pass "-a", solving the problem in (b).
In cases (2) and (3) all vCPU threads fault on a single uffd.

With multiple potentially multiple vCPU per UFFD, it makes sense to
allow configuring the number reader threads per UFFD as well: add the
"-r" flag to do so.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/aarch64/page_fault_test.c   |  4 +-
 .../selftests/kvm/demand_paging_test.c        | 62 +++++++++----
 .../selftests/kvm/include/userfaultfd_util.h  | 18 +++-
 .../selftests/kvm/lib/userfaultfd_util.c      | 86 +++++++++++++------
 4 files changed, 125 insertions(+), 45 deletions(-)

diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index df10f1ffa20d9..3b6d228a9340d 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -376,14 +376,14 @@ static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
 		*pt_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						    pt_args.hva,
 						    pt_args.paging_size,
-						    test->uffd_pt_handler);
+						    1, test->uffd_pt_handler);
 
 	*data_uffd = NULL;
 	if (test->uffd_data_handler)
 		*data_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						      data_args.hva,
 						      data_args.paging_size,
-						      test->uffd_data_handler);
+						      1, test->uffd_data_handler);
 }
 
 static void free_uffd(struct test_desc *test, struct uffd_desc *pt_uffd,
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index b0e1fc4de9e29..fc9c6ac76660c 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -58,7 +58,7 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 }
 
 static int handle_uffd_page_request(int uffd_mode, int uffd,
-		struct uffd_msg *msg)
+									struct uffd_msg *msg)
 {
 	pid_t tid = syscall(__NR_gettid);
 	uint64_t addr = msg->arg.pagefault.address;
@@ -77,8 +77,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		copy.mode = 0;
 
 		r = ioctl(uffd, UFFDIO_COPY, &copy);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
+		/*
+		 * With multiple vCPU threads fault on a single page and there are
+		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
+		 * will fail with EEXIST: handle that case without signaling an
+		 * error.
+		 */
+		if (r == -1 && errno != EEXIST) {
+			pr_info(
+				"Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -89,8 +96,10 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		cont.range.len = demand_paging_size;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
+		/* See the note about EEXISTs in the UFFDIO_COPY branch. */
+		if (r == -1 && errno != EEXIST) {
+			pr_info(
+				"Failed UFFDIO_CONTINUE in 0x%lx from thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -110,7 +119,9 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 
 struct test_params {
 	int uffd_mode;
+	bool single_uffd;
 	useconds_t uffd_delay;
+	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
 };
@@ -133,7 +144,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i;
+	int i, num_uffds = 0;
+	uint64_t uffd_region_size;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -146,10 +158,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
 	if (p->uffd_mode) {
-		uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *));
+		num_uffds = p->single_uffd ? 1 : nr_vcpus;
+		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
+
+		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
 		TEST_ASSERT(uffd_descs, "Memory allocation failed");
 
-		for (i = 0; i < nr_vcpus; i++) {
+		for (i = 0; i < num_uffds; i++) {
 			struct memstress_vcpu_args *vcpu_args;
 			void *vcpu_hva;
 			void *vcpu_alias;
@@ -160,8 +175,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			vcpu_hva = addr_gpa2hva(vm, vcpu_args->gpa);
 			vcpu_alias = addr_gpa2alias(vm, vcpu_args->gpa);
 
-			prefault_mem(vcpu_alias,
-				vcpu_args->pages * memstress_args.guest_page_size);
+			prefault_mem(vcpu_alias, uffd_region_size);
 
 			/*
 			 * Set up user fault fd to handle demand paging
@@ -169,7 +183,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
 				p->uffd_mode, p->uffd_delay, vcpu_hva,
-				vcpu_args->pages * memstress_args.guest_page_size,
+				uffd_region_size,
+				p->readers_per_uffd,
 				&handle_uffd_page_request);
 		}
 	}
@@ -186,7 +201,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	if (p->uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
-		for (i = 0; i < nr_vcpus; i++)
+		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
 	}
 
@@ -206,14 +221,19 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-d uffd_delay_usec]\n"
-	       "          [-b memory] [-s type] [-v vcpus] [-o]\n", name);
+	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
+		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
+		   "          [-s type] [-v vcpus] [-o]\n", name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
+	printf(" -a: Use a single userfaultfd for all of guest memory, instead of\n"
+		   "     creating one for each region paged by a unique vCPU\n"
+		   "     Set implicitly with -o, and no effect without -u.\n");
 	printf(" -d: add a delay in usec to the User Fault\n"
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
+	printf(" -r: Set the number of reader threads per uffd.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -231,12 +251,14 @@ int main(int argc, char *argv[])
 	struct test_params p = {
 		.src_type = DEFAULT_VM_MEM_SRC,
 		.partition_vcpu_memory_access = true,
+		.readers_per_uffd = 1,
+		.single_uffd = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:o")) != -1) {
+	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -248,6 +270,9 @@ int main(int argc, char *argv[])
 				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
 			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
+		case 'a':
+			p.single_uffd = true;
+			break;
 		case 'd':
 			p.uffd_delay = strtoul(optarg, NULL, 0);
 			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
@@ -265,6 +290,13 @@ int main(int argc, char *argv[])
 			break;
 		case 'o':
 			p.partition_vcpu_memory_access = false;
+			p.single_uffd = true;
+			break;
+		case 'r':
+			p.readers_per_uffd = atoi(optarg);
+			TEST_ASSERT(p.readers_per_uffd >= 1,
+						"Invalid number of readers per uffd %d: must be >=1",
+						p.readers_per_uffd);
 			break;
 		case 'h':
 		default:
diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 877449c345928..92cc1f9ec0686 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -17,18 +17,30 @@
 
 typedef int (*uffd_handler_t)(int uffd_mode, int uffd, struct uffd_msg *msg);
 
+struct uffd_reader_args {
+	int uffd_mode;
+	int uffd;
+	useconds_t delay;
+	uffd_handler_t handler;
+	/* Holds the read end of the pipe for killing the reader. */
+	int pipe;
+};
+
 struct uffd_desc {
 	int uffd_mode;
 	int uffd;
-	int pipefds[2];
 	useconds_t delay;
 	uffd_handler_t handler;
-	pthread_t thread;
+	uint64_t num_readers;
+	/* Holds the write ends of the pipes for killing the readers. */
+	int *pipefds;
+	pthread_t *readers;
+	struct uffd_reader_args *reader_args;
 };
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler);
+					   uint64_t num_readers, uffd_handler_t handler);
 
 void uffd_stop_demand_paging(struct uffd_desc *uffd);
 
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 92cef20902f1f..2723ee1e3e1b2 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -27,10 +27,8 @@
 
 static void *uffd_handler_thread_fn(void *arg)
 {
-	struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
-	int uffd = uffd_desc->uffd;
-	int pipefd = uffd_desc->pipefds[0];
-	useconds_t delay = uffd_desc->delay;
+	struct uffd_reader_args *reader_args = (struct uffd_reader_args *)arg;
+	int uffd = reader_args->uffd;
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
@@ -44,7 +42,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 		pollfd[0].fd = uffd;
 		pollfd[0].events = POLLIN;
-		pollfd[1].fd = pipefd;
+		pollfd[1].fd = reader_args->pipe;
 		pollfd[1].events = POLLIN;
 
 		r = poll(pollfd, 2, -1);
@@ -92,9 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
 
-		if (delay)
-			usleep(delay);
-		r = uffd_desc->handler(uffd_desc->uffd_mode, uffd, &msg);
+		if (reader_args->delay)
+			usleep(reader_args->delay);
+		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
 		if (r < 0)
 			return NULL;
 		pages++;
@@ -110,7 +108,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler)
+					   uint64_t num_readers, uffd_handler_t handler)
 {
 	struct uffd_desc *uffd_desc;
 	bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
@@ -118,14 +116,26 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	struct uffdio_api uffdio_api;
 	struct uffdio_register uffdio_register;
 	uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY;
-	int ret;
+	int ret, i;
 
 	PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n",
 		       is_minor ? "MINOR" : "MISSING",
 		       is_minor ? "UFFDIO_CONINUE" : "UFFDIO_COPY");
 
 	uffd_desc = malloc(sizeof(struct uffd_desc));
-	TEST_ASSERT(uffd_desc, "malloc failed");
+	TEST_ASSERT(uffd_desc, "Failed to malloc uffd descriptor");
+
+	uffd_desc->pipefds = malloc(sizeof(int) * num_readers);
+	TEST_ASSERT(uffd_desc->pipefds, "Failed to malloc pipes");
+
+	uffd_desc->readers = malloc(sizeof(pthread_t) * num_readers);
+	TEST_ASSERT(uffd_desc->readers, "Failed to malloc reader threads");
+
+	uffd_desc->reader_args = malloc(
+		sizeof(struct uffd_reader_args) * num_readers);
+	TEST_ASSERT(uffd_desc->reader_args, "Failed to malloc reader_args");
+
+	uffd_desc->num_readers = num_readers;
 
 	/* In order to get minor faults, prefault via the alias. */
 	if (is_minor)
@@ -148,18 +158,32 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
 		    expected_ioctls, "missing userfaultfd ioctls");
 
-	ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
-	TEST_ASSERT(!ret, "Failed to set up pipefd");
-
 	uffd_desc->uffd_mode = uffd_mode;
 	uffd_desc->uffd = uffd;
 	uffd_desc->delay = delay;
 	uffd_desc->handler = handler;
-	pthread_create(&uffd_desc->thread, NULL, uffd_handler_thread_fn,
-		       uffd_desc);
 
-	PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n",
-		       hva, hva + len);
+	for (i = 0; i < uffd_desc->num_readers; ++i) {
+		int pipes[2];
+
+		ret = pipe2((int *) &pipes, O_CLOEXEC | O_NONBLOCK);
+		TEST_ASSERT(!ret, "Failed to set up pipefd %i for uffd_desc %p",
+					i, uffd_desc);
+
+		uffd_desc->pipefds[i] = pipes[1];
+
+		uffd_desc->reader_args[i].uffd_mode = uffd_mode;
+		uffd_desc->reader_args[i].uffd = uffd;
+		uffd_desc->reader_args[i].delay = delay;
+		uffd_desc->reader_args[i].handler = handler;
+		uffd_desc->reader_args[i].pipe = pipes[0];
+
+		pthread_create(&uffd_desc->readers[i], NULL, uffd_handler_thread_fn,
+					   &uffd_desc->reader_args[i]);
+
+		PER_VCPU_DEBUG("Created uffd thread %i for HVA range [%p, %p)\n",
+					   i, hva, hva + len);
+	}
 
 	return uffd_desc;
 }
@@ -167,19 +191,31 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 void uffd_stop_demand_paging(struct uffd_desc *uffd)
 {
 	char c = 0;
-	int ret;
+	int i, ret;
 
-	ret = write(uffd->pipefds[1], &c, 1);
-	TEST_ASSERT(ret == 1, "Unable to write to pipefd");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = write(uffd->pipefds[i], &c, 1);
+		TEST_ASSERT(
+			ret == 1, "Unable to write to pipefd %i for uffd_desc %p", i, uffd);
+	}
 
-	ret = pthread_join(uffd->thread, NULL);
-	TEST_ASSERT(ret == 0, "Pthread_join failed.");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = pthread_join(uffd->readers[i], NULL);
+		TEST_ASSERT(
+			ret == 0,
+			"Pthread_join failed on reader thread %i for uffd_desc %p", i, uffd);
+	}
 
 	close(uffd->uffd);
 
-	close(uffd->pipefds[1]);
-	close(uffd->pipefds[0]);
+	for (i = 0; i < uffd->num_readers; ++i) {
+		close(uffd->pipefds[i]);
+		close(uffd->reader_args[i].pipe);
+	}
 
+	free(uffd->pipefds);
+	free(uffd->readers);
+	free(uffd->reader_args);
 	free(uffd);
 }
 
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 02/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 01/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 03/14] KVM: Allow hva_pfn_fast to resolve read-only faults Anish Moorthy
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

With multiple reader threads POLLing a single UFFD, the test suffers
from the thundering herd problem: performance degrades as the number of
reader threads is increased. Solve this issue [1] by switching the
the polling mechanism to EPOLL + EPOLLEXCLUSIVE.

Also, change the error-handling convention of uffd_handler_thread_fn.
Instead of just printing errors and returning early from the polling
loop, check for them via TEST_ASSERT. "return NULL" is reserved for a
successful exit from uffd_handler_thread_fn, ie one triggered by a
write to the exit pipe.

Performance samples generated by the command in [2] are given below.

Num Reader Threads, Paging Rate (POLL), Paging Rate (EPOLL)
1      249k      185k
2      201k      235k
4      186k      155k
16     150k      217k
32     89k       198k

[1] Single-vCPU performance does suffer somewhat.
[2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        |  1 -
 .../selftests/kvm/lib/userfaultfd_util.c      | 76 +++++++++----------
 2 files changed, 37 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index fc9c6ac76660c..f8c1831614a9d 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -13,7 +13,6 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
-#include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
 #include <sys/syscall.h>
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 2723ee1e3e1b2..863840d340105 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -16,6 +16,7 @@
 #include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <sys/epoll.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -32,60 +33,56 @@ static void *uffd_handler_thread_fn(void *arg)
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
+	int epollfd;
+	struct epoll_event evt;
+
+	epollfd = epoll_create(1);
+	TEST_ASSERT(epollfd >= 0, "Failed to create epollfd.");
+
+	evt.events = EPOLLIN | EPOLLEXCLUSIVE;
+	evt.data.u32 = 0;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, uffd, &evt) == 0,
+				"Failed to add uffd to epollfd");
+
+	evt.events = EPOLLIN;
+	evt.data.u32 = 1;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, reader_args->pipe, &evt) == 0,
+				"Failed to add pipe to epollfd");
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
 	while (1) {
 		struct uffd_msg msg;
-		struct pollfd pollfd[2];
-		char tmp_chr;
 		int r;
 
-		pollfd[0].fd = uffd;
-		pollfd[0].events = POLLIN;
-		pollfd[1].fd = reader_args->pipe;
-		pollfd[1].events = POLLIN;
-
-		r = poll(pollfd, 2, -1);
-		switch (r) {
-		case -1:
-			pr_info("poll err");
-			continue;
-		case 0:
-			continue;
-		case 1:
-			break;
-		default:
-			pr_info("Polling uffd returned %d", r);
-			return NULL;
-		}
+		r = epoll_wait(epollfd, &evt, 1, -1);
+		TEST_ASSERT(
+			r == 1,
+			"Unexpected number of events (%d) returned by epoll, errno = %d",
+			r, errno);
 
-		if (pollfd[0].revents & POLLERR) {
-			pr_info("uffd revents has POLLERR");
-			return NULL;
-		}
+		if (evt.data.u32 == 1) {
+			char tmp_chr;
 
-		if (pollfd[1].revents & POLLIN) {
-			r = read(pollfd[1].fd, &tmp_chr, 1);
+			TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+						"Reader thread received EPOLLERR or EPOLLHUP on pipe.");
+			r = read(reader_args->pipe, &tmp_chr, 1);
 			TEST_ASSERT(r == 1,
-				    "Error reading pipefd in UFFD thread\n");
+						"Error reading pipefd in uffd reader thread");
 			return NULL;
 		}
 
-		if (!(pollfd[0].revents & POLLIN))
-			continue;
+		TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+					"Reader thread received EPOLLERR or EPOLLHUP on uffd.");
 
 		r = read(uffd, &msg, sizeof(msg));
 		if (r == -1) {
-			if (errno == EAGAIN)
-				continue;
-			pr_info("Read of uffd got errno %d\n", errno);
-			return NULL;
+			TEST_ASSERT(errno == EAGAIN,
+						"Error reading from UFFD: errno = %d", errno);
+			continue;
 		}
 
-		if (r != sizeof(msg)) {
-			pr_info("Read on uffd returned unexpected size: %d bytes", r);
-			return NULL;
-		}
+		TEST_ASSERT(r == sizeof(msg),
+					"Read on uffd returned unexpected number of bytes (%d)", r);
 
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
@@ -93,8 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (reader_args->delay)
 			usleep(reader_args->delay);
 		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
-		if (r < 0)
-			return NULL;
+		TEST_ASSERT(
+			r >= 0,
+			"Reader thread handler function returned negative value %d", r);
 		pages++;
 	}
 
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 03/14] KVM: Allow hva_pfn_fast to resolve read-only faults.
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 01/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 02/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field Anish Moorthy
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

hva_to_pfn_fast currently just fails for read-only faults, which is
unnecessary. Instead, try pinning the page without passing FOLL_WRITE.
This allows read-only faults to (potentially) be resolved without
falling back to slow GUP.

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331e..e38ddda05b261 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2479,7 +2479,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 }
 
 /*
- * The fast path to get the writable pfn which will be stored in @pfn,
+ * The fast path to get the pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
  * only part that runs if we can in atomic context.
  */
@@ -2487,16 +2487,14 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 			    bool *writable, kvm_pfn_t *pfn)
 {
 	struct page *page[1];
-
 	/*
 	 * Fast pin a writable pfn only if it is a write fault request
 	 * or the caller allows to map a writable pfn for a read fault
 	 * request.
 	 */
-	if (!(write_fault || writable))
-		return false;
+	unsigned int gup_flags = (write_fault || writable) ? FOLL_WRITE : 0;
 
-	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+	if (get_user_page_fast_only(addr, gup_flags, page)) {
 		*pfn = page_to_pfn(page[0]);
 
 		if (writable)
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (2 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 03/14] KVM: Allow hva_pfn_fast to resolve read-only faults Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-17  0:02   ` Isaku Yamahata
  2023-03-17 18:35   ` Oliver Upton
  2023-03-15  2:17 ` [WIP Patch v2 05/14] KVM: x86: Implement memory fault exit for direct_map Anish Moorthy
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

Memory fault exits allow KVM to return useful information from
KVM_RUN instead of having to -EFAULT when a guest memory access goes
wrong. Document the intent and API of the new capability, and introduce
helper functions which will be useful in places where it needs to be
implemented.

Also allow the capability to be enabled, even though that won't
currently *do* anything: implementations at the relevant -EFAULT sites
will performed in subsequent commits.
---
 Documentation/virt/kvm/api.rst | 37 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c             |  1 +
 include/linux/kvm_host.h       | 16 +++++++++++++++
 include/uapi/linux/kvm.h       | 16 +++++++++++++++
 tools/include/uapi/linux/kvm.h | 15 ++++++++++++++
 virt/kvm/kvm_main.c            | 28 +++++++++++++++++++++++++
 6 files changed, 113 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 62de0768d6aa5..f9ca18bbec879 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6636,6 +6636,19 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len; /* in bytes */
+		} memory_fault;
+
+Indicates a memory fault on the guest physical address range [gpa, gpa + len).
+flags is a bitfield describing the reasons(s) for the fault. See
+KVM_CAP_X86_MEMORY_FAULT_EXIT for more details.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
@@ -7669,6 +7682,30 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_X86_MEMORY_FAULT_EXIT
+----------------------------------
+
+:Architectures: x86
+:Parameters: args[0] is a bitfield specifying what reasons to exit upon.
+:Returns: 0 on success, -EINVAL if unsupported or if unrecognized exit reason
+          specified.
+
+This capability transforms -EFAULTs returned by KVM_RUN in response to guest
+memory accesses into VM exits (KVM_EXIT_MEMORY_FAULT), with 'gpa' and 'len'
+describing the problematic range of memory and 'flags' describing the reason(s)
+for the fault.
+
+The implementation is currently incomplete. Please notify the maintainers if you
+come across a case where it needs to be implemented.
+
+Through args[0], the capability can be set on a per-exit-reason basis.
+Currently, the only exit reasons supported are
+
+1. KVM_MEMFAULT_REASON_UNKNOWN (1 << 0)
+
+Memory fault exits with a reason of UNKNOWN should not be depended upon: they
+may be added, removed, or reclassified under a stable reason.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f706621c35b86..b3c1b2f57e680 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4425,6 +4425,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VAPIC:
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
+	case KVM_CAP_X86_MEMORY_FAULT_EXIT:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8ada23756b0ec..d3ccfead73e42 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -805,6 +805,7 @@ struct kvm {
 	struct notifier_block pm_notifier;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
+	uint64_t memfault_exit_reasons;
 };
 
 #define kvm_err(fmt, ...) \
@@ -2278,4 +2279,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+/*
+ * If memory fault exits are enabled for any of the reasons given in exit_flags
+ * then sets up a KVM_EXIT_MEMORY_FAULT for the given guest physical address,
+ * length, and flags and returns -1.
+ * Otherwise, returns -EFAULT
+ */
+inline int kvm_memfault_exit_or_efault(
+	struct kvm_vcpu *vcpu, uint64_t gpa, uint64_t len, uint64_t exit_flags);
+
+/*
+ * Checks that all of the bits specified in 'reasons' correspond to known
+ * memory fault exit reasons.
+ */
+bool kvm_memfault_exit_flags_valid(uint64_t reasons);
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d77aef872a0a0..0ba1d7f01346e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -505,6 +506,17 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			/*
+			 * Indicates a memory fault on the guest physical address range
+			 * [gpa, gpa + len). flags is a bitfield describing the reasons(s)
+			 * for the fault.
+			 */
+			__u64 flags;
+			__u64 gpa;
+			__u64 len; /* in bytes */
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1184,6 +1196,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
 #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
+#define KVM_CAP_X86_MEMORY_FAULT_EXIT 227
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2237,4 +2250,7 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* Exit reasons for KVM_EXIT_MEMORY_FAULT */
+#define KVM_MEMFAULT_REASON_UNKNOWN (1 << 0)
+
 #endif /* __LINUX_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 55155e262646e..2b468345f25c3 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -505,6 +506,17 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			/*
+			 * Indicates a memory fault on the guest physical address range
+			 * [gpa, gpa + len). flags is a bitfield describing the reasons(s)
+			 * for the fault.
+			 */
+			__u64 flags;
+			__u64 gpa;
+			__u64 len; /* in bytes */
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -2228,4 +2240,7 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* Exit reasons for KVM_EXIT_MEMORY_FAULT */
+#define KVM_MEMFAULT_REASON_UNKNOWN (1 << 0)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e38ddda05b261..00aec43860ff1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
+	kvm->memfault_exit_reasons = 0;
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
@@ -4671,6 +4672,14 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 
 		return r;
 	}
+	case KVM_CAP_X86_MEMORY_FAULT_EXIT: {
+		if (!kvm_vm_ioctl_check_extension(kvm, KVM_CAP_X86_MEMORY_FAULT_EXIT))
+			return -EINVAL;
+		else if (!kvm_memfault_exit_flags_valid(cap->args[0]))
+			return -EINVAL;
+		kvm->memfault_exit_reasons = cap->args[0];
+		return 0;
+	}
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
@@ -6172,3 +6181,22 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
 
 	return init_context.err;
 }
+
+inline int kvm_memfault_exit_or_efault(
+	struct kvm_vcpu *vcpu, uint64_t gpa, uint64_t len, uint64_t exit_flags)
+{
+	if (!(vcpu->kvm->memfault_exit_reasons & exit_flags))
+		return -EFAULT;
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.len = len;
+	vcpu->run->memory_fault.flags = exit_flags;
+	return -1;
+}
+
+bool kvm_memfault_exit_flags_valid(uint64_t reasons)
+{
+	uint64_t valid_flags = KVM_MEMFAULT_REASON_UNKNOWN;
+
+	return !(reasons & !valid_flags);
+}
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 05/14] KVM: x86: Implement memory fault exit for direct_map
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (3 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 06/14] KVM: x86: Implement memory fault exit for kvm_handle_page_fault Anish Moorthy
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

TODO: The return value of this function is ignored in
kvm_arch_async_page_ready. Make sure that the side effects of
memory_fault_exit_or_efault are acceptable there.
---
 arch/x86/kvm/mmu/mmu.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8ebe542c565f..0b02e2c360c08 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3193,7 +3193,10 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	}
 
 	if (WARN_ON_ONCE(it.level != fault->goal_level))
-		return -EFAULT;
+		return kvm_memfault_exit_or_efault(
+			vcpu, fault->gfn * PAGE_SIZE,
+			KVM_PAGES_PER_HPAGE(fault->goal_level),
+			KVM_MEMFAULT_REASON_UNKNOWN);
 
 	ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL,
 			   base_gfn, fault->pfn, fault);
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 06/14] KVM: x86: Implement memory fault exit for kvm_handle_page_fault
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (4 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 05/14] KVM: x86: Implement memory fault exit for direct_map Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 07/14] KVM: x86: Implement memory fault exit for setup_vmgexit_scratch Anish Moorthy
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

---
 arch/x86/kvm/mmu/mmu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0b02e2c360c08..5e0140db384f6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4375,7 +4375,9 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 #ifndef CONFIG_X86_64
 	/* A 64-bit CR2 should be impossible on 32-bit KVM. */
 	if (WARN_ON_ONCE(fault_address >> 32))
-		return -EFAULT;
+		return kvm_mefault_exit_or_efault(
+			vcpu, fault_address, PAGE_SIZE,
+			KVM_MEMFAULT_REASON_UNKNOWN);
 #endif
 
 	vcpu->arch.l1tf_flush_l1d = true;
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 07/14] KVM: x86: Implement memory fault exit for setup_vmgexit_scratch
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (5 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 06/14] KVM: x86: Implement memory fault exit for kvm_handle_page_fault Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 08/14] KVM: x86: Implement memory fault exit for FNAME(fetch) Anish Moorthy
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

---
 arch/x86/kvm/svm/sev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c25aeb550cd97..c042d385350de 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2683,7 +2683,9 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len)
 			pr_err("vmgexit: kvm_read_guest for scratch area failed\n");
 
 			kvfree(scratch_va);
-			return -EFAULT;
+			return kvm_memfault_exit_or_efault(
+				&svm->vcpu, scratch_gpa_beg, len,
+				KVM_MEMFAULT_REASON_UNKNOWN);
 		}
 
 		/*
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 08/14] KVM: x86: Implement memory fault exit for FNAME(fetch)
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (6 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 07/14] KVM: x86: Implement memory fault exit for setup_vmgexit_scratch Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation Anish Moorthy
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

---
 arch/x86/kvm/mmu/paging_tmpl.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 57f0b75c80f9d..ed996dccc03bf 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -717,7 +717,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	}
 
 	if (WARN_ON_ONCE(it.level != fault->goal_level))
-		return -EFAULT;
+		return kvm_memfault_exit_or_efault(
+			vcpu, fault->gfn * PAGE_SIZE, KVM_PAGES_PER_HPAGE(fault->goal_level),
+			KVM_MEMFAULT_REASON_UNKNOWN);
 
 	ret = mmu_set_spte(vcpu, fault->slot, it.sptep, gw->pte_access,
 			   base_gfn, fault->pfn, fault);
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (7 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 08/14] KVM: x86: Implement memory fault exit for FNAME(fetch) Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-17 18:59   ` Oliver Upton
  2023-03-15  2:17 ` [WIP Patch v2 10/14] KVM: x86: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

Add documentation, memslot flags, useful helper functions, and the
actual new capability itself.

Memory fault exits on absent mappings are particularly useful for
userfaultfd-based live migration postcopy. When many vCPUs fault upon a
single userfaultfd the faults can take a while to surface to userspace
due to having to contend for uffd wait queue locks. Bypassing the uffd
entirely by triggering a vCPU exit avoids this contention and can improve
the fault rate by as much as 10x.
---
 Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++++++---
 include/linux/kvm_host.h       |  6 ++++++
 include/uapi/linux/kvm.h       |  3 +++
 tools/include/uapi/linux/kvm.h |  2 ++
 virt/kvm/kvm_main.c            |  7 ++++++-
 5 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f9ca18bbec879..4932c0f62eb3d 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
   /* for kvm_userspace_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1342,12 +1343,15 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
+The flags field supports three flags
+
+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
 writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
+use it.
+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
 to make a new slot read-only.  In this case, writes to this memory will be
 posted to userspace as KVM_EXIT_MMIO exits.
+3.  KVM_MEM_ABSENT_MAPPING_FAULT: see KVM_CAP_MEMORY_FAULT_NOWAIT for details.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
@@ -7702,10 +7706,37 @@ Through args[0], the capability can be set on a per-exit-reason basis.
 Currently, the only exit reasons supported are
 
 1. KVM_MEMFAULT_REASON_UNKNOWN (1 << 0)
+2. KVM_MEMFAULT_REASON_ABSENT_MAPPING (1 << 1)
 
 Memory fault exits with a reason of UNKNOWN should not be depended upon: they
 may be added, removed, or reclassified under a stable reason.
 
+7.35 KVM_CAP_MEMORY_FAULT_NOWAIT
+--------------------------------
+
+:Architectures: x86, arm64
+:Returns: -EINVAL.
+
+The presence of this capability indicates that userspace may pass the
+KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
+to exit to populate 'kvm_run.memory_fault' and exit to userspace (*) in response
+to page faults for which the userspace page tables do not contain present
+mappings. Attempting to enable the capability directly will fail.
+
+The 'gpa' and 'len' fields of kvm_run.memory_fault will be set to the starting
+address and length (in bytes) of the faulting page. 'flags' will be set to
+KVM_MEMFAULT_REASON_ABSENT_MAPPING.
+
+Userspace should determine how best to make the mapping present, then take
+appropriate action. For instance, in the case of absent mappings this might
+involve establishing the mapping for the first time via UFFDIO_COPY/CONTINUE or
+faulting the mapping in using MADV_POPULATE_READ/WRITE. After establishing the
+mapping, userspace can return to KVM to retry the previous memory access.
+
+(*) NOTE: On x86, KVM_CAP_X86_MEMORY_FAULT_EXIT must be enabled for the
+KVM_MEMFAULT_REASON_ABSENT_MAPPING_reason: otherwise userspace will only receive
+a -EFAULT from KVM_RUN without any useful information.
+
 8. Other capabilities.
 ======================
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d3ccfead73e42..c28330f25526f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -593,6 +593,12 @@ static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *sl
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
 }
 
+static inline bool kvm_slot_fault_on_absent_mapping(
+	const struct kvm_memory_slot *slot)
+{
+	return slot->flags & KVM_MEM_ABSENT_MAPPING_FAULT;
+}
+
 static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
 {
 	return ALIGN(memslot->npages, BITS_PER_LONG) / 8;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0ba1d7f01346e..2146b27cdd61a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_ABSENT_MAPPING_FAULT	(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1197,6 +1198,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
 #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
 #define KVM_CAP_X86_MEMORY_FAULT_EXIT 227
+#define KVM_CAP_MEMORY_FAULT_NOWAIT 228
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2252,5 +2254,6 @@ struct kvm_s390_zpci_op {
 
 /* Exit reasons for KVM_EXIT_MEMORY_FAULT */
 #define KVM_MEMFAULT_REASON_UNKNOWN (1 << 0)
+#define KVM_MEMFAULT_REASON_ABSENT_MAPPING (1 << 1)
 
 #endif /* __LINUX_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 2b468345f25c3..1a1707d9f442a 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -2242,5 +2243,6 @@ struct kvm_s390_zpci_op {
 
 /* Exit reasons for KVM_EXIT_MEMORY_FAULT */
 #define KVM_MEMFAULT_REASON_UNKNOWN (1 << 0)
+#define KVM_MEMFAULT_REASON_ABSENT_MAPPING (1 << 1)
 
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 00aec43860ff1..aa3b59410a356 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1525,6 +1525,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
 	valid_flags |= KVM_MEM_READONLY;
 #endif
 
+	if (kvm_vm_ioctl_check_extension(NULL, KVM_CAP_MEMORY_FAULT_NOWAIT))
+		valid_flags |= KVM_MEM_ABSENT_MAPPING_FAULT;
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
@@ -6196,7 +6199,9 @@ inline int kvm_memfault_exit_or_efault(
 
 bool kvm_memfault_exit_flags_valid(uint64_t reasons)
 {
-	uint64_t valid_flags = KVM_MEMFAULT_REASON_UNKNOWN;
+	uint64_t valid_flags
+		= KVM_MEMFAULT_REASON_UNKNOWN
+		| KVM_MEMFAULT_REASON_ABSENT_MAPPING;
 
 	return !(reasons & !valid_flags);
 }
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 10/14] KVM: x86: Implement KVM_CAP_MEMORY_FAULT_NOWAIT
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (8 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-17  0:32   ` Isaku Yamahata
  2023-03-15  2:17 ` [WIP Patch v2 11/14] KVM: arm64: Allow user_mem_abort to return 0 to signal a 'normal' exit Anish Moorthy
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

When a memslot has the KVM_MEM_MEMORY_FAULT_EXIT flag set, exit to
userspace upon encountering a page fault for which the userspace
page tables do not contain a present mapping.
---
 arch/x86/kvm/mmu/mmu.c | 33 +++++++++++++++++++++++++--------
 arch/x86/kvm/x86.c     |  1 +
 2 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5e0140db384f6..68bc4ab2bd942 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3214,7 +3214,9 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
 	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);
 }
 
-static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int kvm_handle_error_pfn(
+	struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+	bool faulted_on_absent_mapping)
 {
 	if (is_sigpending_pfn(fault->pfn)) {
 		kvm_handle_signal_exit(vcpu);
@@ -3234,7 +3236,11 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
 		return RET_PF_RETRY;
 	}
 
-	return -EFAULT;
+	return kvm_memfault_exit_or_efault(
+		vcpu, fault->gfn * PAGE_SIZE, PAGE_SIZE,
+		faulted_on_absent_mapping
+			? KVM_MEMFAULT_REASON_ABSENT_MAPPING
+			: KVM_MEMFAULT_REASON_UNKNOWN);
 }
 
 static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
@@ -4209,7 +4215,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
 }
 
-static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int __kvm_faultin_pfn(
+	struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+	bool fault_on_absent_mapping)
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
@@ -4242,9 +4250,15 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	}
 
 	async = false;
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
-					  fault->write, &fault->map_writable,
-					  &fault->hva);
+
+	fault->pfn = __gfn_to_pfn_memslot(
+		slot, fault->gfn,
+		fault_on_absent_mapping,
+		false,
+		fault_on_absent_mapping ? NULL : &async,
+		fault->write, &fault->map_writable,
+		&fault->hva);
+
 	if (!async)
 		return RET_PF_CONTINUE; /* *pfn has correct page already */
 
@@ -4274,16 +4288,19 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 			   unsigned int access)
 {
 	int ret;
+	bool fault_on_absent_mapping
+		= likely(fault->slot) && kvm_slot_fault_on_absent_mapping(fault->slot);
 
 	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	smp_rmb();
 
-	ret = __kvm_faultin_pfn(vcpu, fault);
+	ret = __kvm_faultin_pfn(
+		vcpu, fault, fault_on_absent_mapping);
 	if (ret != RET_PF_CONTINUE)
 		return ret;
 
 	if (unlikely(is_error_pfn(fault->pfn)))
-		return kvm_handle_error_pfn(vcpu, fault);
+		return kvm_handle_error_pfn(vcpu, fault, fault_on_absent_mapping);
 
 	if (unlikely(!fault->slot))
 		return kvm_handle_noslot_fault(vcpu, fault, access);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b3c1b2f57e680..41435324b41d7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4426,6 +4426,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_X86_MEMORY_FAULT_EXIT:
+	case KVM_CAP_MEMORY_FAULT_NOWAIT:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 11/14] KVM: arm64: Allow user_mem_abort to return 0 to signal a 'normal' exit
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (9 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 10/14] KVM: x86: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-17 18:18   ` Oliver Upton
  2023-03-15  2:17 ` [WIP Patch v2 12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

kvm_handle_guest_abort currently just returns 1 if user_mem_abort
returns 0. Since 1 is the "resume the guest" code, user_mem_abort is
essentially incapable of triggering a "normal" exit: it can only trigger
exits by returning a negative value, which indicates an error.

Remove the "if (ret == 0) ret = 1;" statement from
kvm_handle_guest_abort and refactor user_mem_abort slightly to allow it
to trigger 'normal' exits by returning 0.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/arm64/kvm/mmu.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7113587222ffe..735044859eb25 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1190,7 +1190,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
 {
-	int ret = 0;
+	int ret = 1;
 	bool write_fault, writable, force_pte = false;
 	bool exec_fault;
 	bool device = false;
@@ -1281,8 +1281,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	    (logging_active && write_fault)) {
 		ret = kvm_mmu_topup_memory_cache(memcache,
 						 kvm_mmu_cache_min_pages(kvm));
-		if (ret)
+		if (ret < 0)
 			return ret;
+		else
+			ret = 1;
 	}
 
 	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
@@ -1305,7 +1307,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 				   write_fault, &writable, NULL);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
-		return 0;
+		return 1;
 	}
 	if (is_error_noslot_pfn(pfn))
 		return -EFAULT;
@@ -1387,6 +1389,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 					     KVM_PGTABLE_WALK_HANDLE_FAULT |
 					     KVM_PGTABLE_WALK_SHARED);
 
+	if (ret == 0)
+		ret = 1;
+
 	/* Mark the page dirty only if the fault is handled successfully */
 	if (writable && !ret) {
 		kvm_set_pfn_dirty(pfn);
@@ -1397,7 +1402,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	read_unlock(&kvm->mmu_lock);
 	kvm_set_pfn_accessed(pfn);
 	kvm_release_pfn_clean(pfn);
-	return ret != -EAGAIN ? ret : 0;
+	return ret != -EAGAIN ? ret : 1;
 }
 
 /* Resolve the access fault by making the page young again. */
@@ -1549,8 +1554,6 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 	}
 
 	ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
-	if (ret == 0)
-		ret = 1;
 out:
 	if (ret == -ENOEXEC) {
 		kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu));
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (10 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 11/14] KVM: arm64: Allow user_mem_abort to return 0 to signal a 'normal' exit Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-17 18:27   ` Oliver Upton
  2023-03-15  2:17 ` [WIP Patch v2 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm Anish Moorthy
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

When a memslot has the KVM_MEM_MEMORY_FAULT_EXIT flag set, exit to
userspace upon encountering a page fault for which the userspace
page tables do not contain a present mapping.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 arch/arm64/kvm/arm.c |  1 +
 arch/arm64/kvm/mmu.c | 14 ++++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 3bd732eaf0872..f8337e757c777 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -220,6 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VCPU_ATTRIBUTES:
 	case KVM_CAP_PTP_KVM:
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
+	case KVM_CAP_MEMORY_FAULT_NOWAIT:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 735044859eb25..0d04ffc81f783 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1206,6 +1206,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	unsigned long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
+	bool exit_on_memory_fault = kvm_slot_fault_on_absent_mapping(memslot);
 
 	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
 	write_fault = kvm_is_write_fault(vcpu);
@@ -1303,8 +1304,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 */
 	smp_rmb();
 
-	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-				   write_fault, &writable, NULL);
+	pfn = __gfn_to_pfn_memslot(
+		memslot, gfn, exit_on_memory_fault, false, NULL,
+		write_fault, &writable, NULL);
+
+	if (exit_on_memory_fault && pfn == KVM_PFN_ERR_FAULT) {
+		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+		vcpu->run->memory_fault.flags = 0;
+		vcpu->run->memory_fault.gpa = gfn << PAGE_SHIFT;
+		vcpu->run->memory_fault.len = vma_pagesize;
+		return 0;
+	}
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 1;
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (11 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-15  2:17 ` [WIP Patch v2 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

Memslot flags aren't currently exposed to the tests, and are just always
set to 0. Add a parameter to allow tests to manually set those flags.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 tools/testing/selftests/kvm/access_tracking_perf_test.c     | 2 +-
 tools/testing/selftests/kvm/demand_paging_test.c            | 6 ++++--
 tools/testing/selftests/kvm/dirty_log_perf_test.c           | 2 +-
 tools/testing/selftests/kvm/include/memstress.h             | 2 +-
 tools/testing/selftests/kvm/lib/memstress.c                 | 4 ++--
 .../selftests/kvm/memslot_modification_stress_test.c        | 2 +-
 6 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f567..b51656b408b83 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -306,7 +306,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	int nr_vcpus = params->nr_vcpus;
 
-	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1, 0,
 				 params->backing_src, !overlap_memory_access);
 
 	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index f8c1831614a9d..607cd2846e39c 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -146,8 +146,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	int i, num_uffds = 0;
 	uint64_t uffd_region_size;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
-				 p->src_type, p->partition_vcpu_memory_access);
+	vm = memstress_create_vm(
+		mode, nr_vcpus, guest_percpu_mem_size,
+		1, 0,
+		p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
 
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index e9d6d1aecf89c..6c8749193cfa4 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -224,7 +224,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	int i;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
-				 p->slots, p->backing_src,
+				 p->slots, 0, p->backing_src,
 				 p->partition_vcpu_memory_access);
 
 	pr_info("Random seed: %u\n", p->random_seed);
diff --git a/tools/testing/selftests/kvm/include/memstress.h b/tools/testing/selftests/kvm/include/memstress.h
index 72e3e358ef7bd..1cba965d2d331 100644
--- a/tools/testing/selftests/kvm/include/memstress.h
+++ b/tools/testing/selftests/kvm/include/memstress.h
@@ -56,7 +56,7 @@ struct memstress_args {
 extern struct memstress_args memstress_args;
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access);
 void memstress_destroy_vm(struct kvm_vm *vm);
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index 5f1d3173c238c..7589b8cef6911 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -119,7 +119,7 @@ void memstress_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
 }
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access)
 {
@@ -207,7 +207,7 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 
 		vm_userspace_mem_region_add(vm, backing_src, region_start,
 					    MEMSTRESS_MEM_SLOT_INDEX + i,
-					    region_pages, 0);
+					    region_pages, slot_flags);
 	}
 
 	/* Do mapping for the demand paging memory slot */
diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
index 9855c41ca811f..0b19ec3ecc9cc 100644
--- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
+++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
@@ -95,7 +95,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct test_params *p = arg;
 	struct kvm_vm *vm;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
 				 VM_MEM_SRC_ANONYMOUS,
 				 p->partition_vcpu_memory_access);
 
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [WIP Patch v2 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (12 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm Anish Moorthy
@ 2023-03-15  2:17 ` Anish Moorthy
  2023-03-17 17:43 ` [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Oliver Upton
  2023-03-17 20:35 ` Sean Christopherson
  15 siblings, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-15  2:17 UTC (permalink / raw)
  To: seanjc; +Cc: jthoughton, kvm, Anish Moorthy

Demonstrate a (very basic) scheme for supporting memory fault exits.

From the vCPU threads:
1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
   with the purpose of establishing the absent mappings. Do so with
   wake_waiters=false to avoid serializing on the userfaultfd wait queue
   locks.

2. When the UFFDIO_COPY/CONTINUE in (1) fails with EEXIST,
   assume that the mapping was already established but is currently
   absent [A] and attempt to populate it using MADV_POPULATE_WRITE.

Issue UFFDIO_COPY/CONTINUEs from the reader threads as well, but with
wake_waiters=true to ensure that any threads sleeping on the uffd are
eventually woken up.

A real VMM would track whether it had already COPY/CONTINUEd pages (eg,
via a bitmap) to avoid calls destined to EEXIST. However, even the
naive approach is enough to demonstrate the performance advantages of
KVM_EXIT_MEMORY_FAULT.

[A] In reality it is much likelier that the vCPU thread simply lost a
    race to establish the mapping for the page.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        | 220 +++++++++++++-----
 1 file changed, 164 insertions(+), 56 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 607cd2846e39c..dce72adcb1632 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -15,6 +15,7 @@
 #include <time.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <sys/mman.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -31,6 +32,60 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
 static size_t demand_paging_size;
 static char *guest_data_prototype;
 
+static int num_uffds;
+static size_t uffd_region_size;
+static struct uffd_desc **uffd_descs;
+/*
+ * Delay when demand paging is performed through userfaultfd or directly by
+ * vcpu_worker in the case of a KVM_EXIT_MEMORY_FAULT.
+ */
+static useconds_t uffd_delay;
+static int uffd_mode;
+
+
+static int handle_uffd_page_request(
+	int uffd_mode, int uffd, uint64_t hva, bool is_vcpu
+);
+
+static void madv_write_or_err(uint64_t gpa)
+{
+	int r;
+	void *hva = addr_gpa2hva(memstress_args.vm, gpa);
+
+	r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
+	TEST_ASSERT(
+		r == 0,
+		"MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) failed with errno %i\n",
+		(uintptr_t) hva, gpa, errno);
+}
+
+static void ready_page(uint64_t gpa)
+{
+	int r, uffd;
+
+	/*
+	 * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
+	 * the registered ranges should fault in the physical pages through
+	 * MADV_POPULATE_WRITE.
+	 */
+	if ((gpa < memstress_args.gpa)
+		|| (gpa >= memstress_args.gpa + memstress_args.size)) {
+		madv_write_or_err(gpa);
+	} else {
+		if (uffd_delay)
+			usleep(uffd_delay);
+
+		uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
+
+		r = handle_uffd_page_request(
+			uffd_mode, uffd,
+			(uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
+
+		if (r == EEXIST)
+			madv_write_or_err(gpa);
+	}
+}
+
 static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 {
 	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
@@ -42,25 +97,37 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
 
-	/* Let the guest access its memory */
-	ret = _vcpu_run(vcpu);
-	TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-	if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
-		TEST_ASSERT(false,
-			    "Invalid guest sync status: exit_reason=%s\n",
-			    exit_reason_str(run->exit_reason));
-	}
+	while (true) {
+		/* Let the guest access its memory */
+		ret = _vcpu_run(vcpu);
+		TEST_ASSERT(ret == 0 || (run->exit_reason == KVM_EXIT_MEMORY_FAULT),
+					"vcpu_run failed: %d\n", ret);
+		if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
+
+			if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
+				TEST_ASSERT(run->memory_fault.flags == 0,
+							"Unrecognized flags 0x%llx on memory fault exit",
+							run->memory_fault.flags);
+				ready_page(run->memory_fault.gpa);
+				continue;
+			}
+
+			TEST_ASSERT(false,
+					"Invalid guest sync status: exit_reason=%s\n",
+					exit_reason_str(run->exit_reason));
+		}
 
-	ts_diff = timespec_elapsed(start);
-	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
-		       ts_diff.tv_sec, ts_diff.tv_nsec);
+		ts_diff = timespec_elapsed(start);
+		PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
+				ts_diff.tv_sec, ts_diff.tv_nsec);
+		break;
+	}
 }
 
-static int handle_uffd_page_request(int uffd_mode, int uffd,
-									struct uffd_msg *msg)
+static int handle_uffd_page_request(
+	int uffd_mode, int uffd, uint64_t hva, bool is_vcpu)
 {
 	pid_t tid = syscall(__NR_gettid);
-	uint64_t addr = msg->arg.pagefault.address;
 	struct timespec start;
 	struct timespec ts_diff;
 	int r;
@@ -71,58 +138,81 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		struct uffdio_copy copy;
 
 		copy.src = (uint64_t)guest_data_prototype;
-		copy.dst = addr;
+		copy.dst = hva;
 		copy.len = demand_paging_size;
-		copy.mode = 0;
+		copy.mode = UFFDIO_COPY_MODE_DONTWAKE;
 
-		r = ioctl(uffd, UFFDIO_COPY, &copy);
 		/*
-		 * With multiple vCPU threads fault on a single page and there are
-		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
-		 * will fail with EEXIST: handle that case without signaling an
-		 * error.
+		 * With multiple vCPU threads and at least one of multiple reader threads
+		 * or vCPU memory faults, multiple vCPUs accessing an absent page will
+		 * almost certainly cause some thread doing the UFFDIO_COPY here to get
+		 * EEXIST: make sure to allow that case.
 		 */
-		if (r == -1 && errno != EEXIST) {
-			pr_info(
-				"Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		r = ioctl(uffd, UFFDIO_COPY, &copy);
+		TEST_ASSERT(
+			r == 0 || errno == EEXIST,
+			"Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
+			gettid(), hva, errno);
 	} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+		/* The comments in the UFFDIO_COPY branch also apply here. */
 		struct uffdio_continue cont = {0};
 
-		cont.range.start = addr;
+		cont.range.start = hva;
 		cont.range.len = demand_paging_size;
+		cont.mode = UFFDIO_CONTINUE_MODE_DONTWAKE;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
-		/* See the note about EEXISTs in the UFFDIO_COPY branch. */
-		if (r == -1 && errno != EEXIST) {
-			pr_info(
-				"Failed UFFDIO_CONTINUE in 0x%lx from thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		TEST_ASSERT(
+			r == 0 || errno == EEXIST,
+			"Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
+			gettid(), hva, errno);
 	} else {
 		TEST_FAIL("Invalid uffd mode %d", uffd_mode);
 	}
 
+	/*
+	 * If the above UFFDIO_COPY/CONTINUE fails with EEXIST, it will do so without
+	 * waking threads waiting on the UFFD: make sure that happens here.
+	 */
+	if (!is_vcpu) {
+		struct uffdio_range range = {
+			.start = hva,
+			.len = demand_paging_size
+		};
+		r = ioctl(uffd, UFFDIO_WAKE, &range);
+		TEST_ASSERT(
+			r == 0,
+			"Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
+			gettid(), hva, errno);
+	}
+
 	ts_diff = timespec_elapsed(start);
 
 	PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
 		       timespec_to_ns(ts_diff));
 	PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
-		       demand_paging_size, addr, tid);
+		       demand_paging_size, hva, tid);
 
 	return 0;
 }
 
+static int handle_uffd_page_request_from_uffd(
+	int uffd_mode, int uffd, struct uffd_msg *msg)
+{
+	TEST_ASSERT(
+		msg->event == UFFD_EVENT_PAGEFAULT,
+		"Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
+		msg->event);
+	return handle_uffd_page_request(
+		uffd_mode, uffd, msg->arg.pagefault.address, false);
+}
+
 struct test_params {
-	int uffd_mode;
 	bool single_uffd;
-	useconds_t uffd_delay;
 	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
+	bool memfault_exits;
 };
 
 static void prefault_mem(void *alias, uint64_t len)
@@ -139,18 +229,31 @@ static void prefault_mem(void *alias, uint64_t len)
 static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct test_params *p = arg;
-	struct uffd_desc **uffd_descs = NULL;
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i, num_uffds = 0;
-	uint64_t uffd_region_size;
+	int i;
+	uint32_t slot_flags = 0;
+	bool uffd_memfault_exits = uffd_mode && p->memfault_exits;
+
+	if (uffd_memfault_exits) {
+		TEST_ASSERT(kvm_has_cap(KVM_CAP_MEMORY_FAULT_NOWAIT) > 0,
+					"KVM does not have KVM_CAP_MEMORY_FAULT_NOWAIT");
+		slot_flags = KVM_MEM_ABSENT_MAPPING_FAULT;
+	}
 
 	vm = memstress_create_vm(
 		mode, nr_vcpus, guest_percpu_mem_size,
-		1, 0,
+		1, slot_flags,
 		p->src_type, p->partition_vcpu_memory_access);
 
+	if (uffd_memfault_exits) {
+		if (kvm_has_cap(KVM_CAP_X86_MEMORY_FAULT_EXIT))
+			vm_enable_cap(
+				vm, KVM_CAP_X86_MEMORY_FAULT_EXIT,
+				KVM_MEMFAULT_REASON_ABSENT_MAPPING);
+	}
+
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
 
 	guest_data_prototype = malloc(demand_paging_size);
@@ -158,12 +261,12 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		    "Failed to allocate buffer for guest data pattern");
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		num_uffds = p->single_uffd ? 1 : nr_vcpus;
 		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
 
 		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
-		TEST_ASSERT(uffd_descs, "Memory allocation failed");
+		TEST_ASSERT(uffd_descs, "Failed to allocate memory of uffd descriptors");
 
 		for (i = 0; i < num_uffds; i++) {
 			struct memstress_vcpu_args *vcpu_args;
@@ -183,10 +286,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 * requests.
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
-				p->uffd_mode, p->uffd_delay, vcpu_hva,
+				uffd_mode, uffd_delay, vcpu_hva,
 				uffd_region_size,
 				p->readers_per_uffd,
-				&handle_uffd_page_request);
+				&handle_uffd_page_request_from_uffd);
 		}
 	}
 
@@ -200,7 +303,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	ts_diff = timespec_elapsed(start);
 	pr_info("All vCPU threads joined\n");
 
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
 		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
@@ -215,7 +318,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memstress_destroy_vm(vm);
 
 	free(guest_data_prototype);
-	if (p->uffd_mode)
+	if (uffd_mode)
 		free(uffd_descs);
 }
 
@@ -224,7 +327,7 @@ static void help(char *name)
 	puts("");
 	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
 		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
-		   "          [-s type] [-v vcpus] [-o]\n", name);
+		   "          [-w] [-s type] [-v vcpus] [-o]\n", name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
@@ -235,6 +338,7 @@ static void help(char *name)
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
 	printf(" -r: Set the number of reader threads per uffd.\n");
+	printf(" -w: Enable kvm cap for memory fault exits.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -254,29 +358,30 @@ int main(int argc, char *argv[])
 		.partition_vcpu_memory_access = true,
 		.readers_per_uffd = 1,
 		.single_uffd = false,
+		.memfault_exits = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
+	while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
 			break;
 		case 'u':
 			if (!strcmp("MISSING", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
+				uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
 			else if (!strcmp("MINOR", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
-			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
+				uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
+			TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
 		case 'a':
 			p.single_uffd = true;
 			break;
 		case 'd':
-			p.uffd_delay = strtoul(optarg, NULL, 0);
-			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
+			uffd_delay = strtoul(optarg, NULL, 0);
+			TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
 			break;
 		case 'b':
 			guest_percpu_mem_size = parse_size(optarg);
@@ -299,6 +404,9 @@ int main(int argc, char *argv[])
 						"Invalid number of readers per uffd %d: must be >=1",
 						p.readers_per_uffd);
 			break;
+		case 'w':
+			p.memfault_exits = true;
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
@@ -306,7 +414,7 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
+	if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
 	    !backing_src_is_shared(p.src_type)) {
 		TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
 	}
-- 
2.40.0.rc1.284.g88254d51c5-goog


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-15  2:17 ` [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field Anish Moorthy
@ 2023-03-17  0:02   ` Isaku Yamahata
  2023-03-17 18:33     ` Anish Moorthy
  2023-03-17 18:35   ` Oliver Upton
  1 sibling, 1 reply; 60+ messages in thread
From: Isaku Yamahata @ 2023-03-17  0:02 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: seanjc, jthoughton, kvm, isaku.yamahata

On Wed, Mar 15, 2023 at 02:17:28AM +0000,
Anish Moorthy <amoorthy@google.com> wrote:

> Memory fault exits allow KVM to return useful information from
> KVM_RUN instead of having to -EFAULT when a guest memory access goes
> wrong. Document the intent and API of the new capability, and introduce
> helper functions which will be useful in places where it needs to be
> implemented.
> 
> Also allow the capability to be enabled, even though that won't
> currently *do* anything: implementations at the relevant -EFAULT sites
> will performed in subsequent commits.
> ---
>  Documentation/virt/kvm/api.rst | 37 ++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c             |  1 +
>  include/linux/kvm_host.h       | 16 +++++++++++++++
>  include/uapi/linux/kvm.h       | 16 +++++++++++++++
>  tools/include/uapi/linux/kvm.h | 15 ++++++++++++++
>  virt/kvm/kvm_main.c            | 28 +++++++++++++++++++++++++
>  6 files changed, 113 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 62de0768d6aa5..f9ca18bbec879 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6636,6 +6636,19 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len; /* in bytes */
> +		} memory_fault;
> +
> +Indicates a memory fault on the guest physical address range [gpa, gpa + len).
> +flags is a bitfield describing the reasons(s) for the fault. See
> +KVM_CAP_X86_MEMORY_FAULT_EXIT for more details.
> +
>  ::
>  
>      /* KVM_EXIT_NOTIFY */
> @@ -7669,6 +7682,30 @@ This capability is aimed to mitigate the threat that malicious VMs can
>  cause CPU stuck (due to event windows don't open up) and make the CPU
>  unavailable to host or other VMs.
>  
> +7.34 KVM_CAP_X86_MEMORY_FAULT_EXIT
> +----------------------------------
> +
> +:Architectures: x86

Why x86 specific?

> +:Parameters: args[0] is a bitfield specifying what reasons to exit upon.
> +:Returns: 0 on success, -EINVAL if unsupported or if unrecognized exit reason
> +          specified.
> +
> +This capability transforms -EFAULTs returned by KVM_RUN in response to guest
> +memory accesses into VM exits (KVM_EXIT_MEMORY_FAULT), with 'gpa' and 'len'
> +describing the problematic range of memory and 'flags' describing the reason(s)
> +for the fault.
> +
> +The implementation is currently incomplete. Please notify the maintainers if you
> +come across a case where it needs to be implemented.
> +
> +Through args[0], the capability can be set on a per-exit-reason basis.
> +Currently, the only exit reasons supported are
> +
> +1. KVM_MEMFAULT_REASON_UNKNOWN (1 << 0)
> +
> +Memory fault exits with a reason of UNKNOWN should not be depended upon: they
> +may be added, removed, or reclassified under a stable reason.
> +
>  8. Other capabilities.
>  ======================
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f706621c35b86..b3c1b2f57e680 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4425,6 +4425,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_VAPIC:
>  	case KVM_CAP_ENABLE_CAP:
>  	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
> +	case KVM_CAP_X86_MEMORY_FAULT_EXIT:
>  		r = 1;
>  		break;
>  	case KVM_CAP_EXIT_HYPERCALL:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8ada23756b0ec..d3ccfead73e42 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -805,6 +805,7 @@ struct kvm {
>  	struct notifier_block pm_notifier;
>  #endif
>  	char stats_id[KVM_STATS_NAME_SIZE];
> +	uint64_t memfault_exit_reasons;
>  };
>  
>  #define kvm_err(fmt, ...) \
> @@ -2278,4 +2279,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +/*
> + * If memory fault exits are enabled for any of the reasons given in exit_flags
> + * then sets up a KVM_EXIT_MEMORY_FAULT for the given guest physical address,
> + * length, and flags and returns -1.
> + * Otherwise, returns -EFAULT
> + */
> +inline int kvm_memfault_exit_or_efault(
> +	struct kvm_vcpu *vcpu, uint64_t gpa, uint64_t len, uint64_t exit_flags);
> +
> +/*
> + * Checks that all of the bits specified in 'reasons' correspond to known
> + * memory fault exit reasons.
> + */
> +bool kvm_memfault_exit_flags_valid(uint64_t reasons);
> +
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index d77aef872a0a0..0ba1d7f01346e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -505,6 +506,17 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			/*
> +			 * Indicates a memory fault on the guest physical address range
> +			 * [gpa, gpa + len). flags is a bitfield describing the reasons(s)
> +			 * for the fault.
> +			 */
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len; /* in bytes */
> +		} memory_fault;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> @@ -1184,6 +1196,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
>  #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
>  #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
> +#define KVM_CAP_X86_MEMORY_FAULT_EXIT 227
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -2237,4 +2250,7 @@ struct kvm_s390_zpci_op {
>  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>  
> +/* Exit reasons for KVM_EXIT_MEMORY_FAULT */
> +#define KVM_MEMFAULT_REASON_UNKNOWN (1 << 0)
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 55155e262646e..2b468345f25c3 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -505,6 +506,17 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			/*
> +			 * Indicates a memory fault on the guest physical address range
> +			 * [gpa, gpa + len). flags is a bitfield describing the reasons(s)
> +			 * for the fault.
> +			 */
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len; /* in bytes */
> +		} memory_fault;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> @@ -2228,4 +2240,7 @@ struct kvm_s390_zpci_op {
>  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>  
> +/* Exit reasons for KVM_EXIT_MEMORY_FAULT */
> +#define KVM_MEMFAULT_REASON_UNKNOWN (1 << 0)
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e38ddda05b261..00aec43860ff1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  	spin_lock_init(&kvm->mn_invalidate_lock);
>  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>  	xa_init(&kvm->vcpu_array);
> +	kvm->memfault_exit_reasons = 0;
>  
>  	INIT_LIST_HEAD(&kvm->gpc_list);
>  	spin_lock_init(&kvm->gpc_lock);
> @@ -4671,6 +4672,14 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  
>  		return r;
>  	}
> +	case KVM_CAP_X86_MEMORY_FAULT_EXIT: {
> +		if (!kvm_vm_ioctl_check_extension(kvm, KVM_CAP_X86_MEMORY_FAULT_EXIT))
> +			return -EINVAL;
> +		else if (!kvm_memfault_exit_flags_valid(cap->args[0]))
> +			return -EINVAL;
> +		kvm->memfault_exit_reasons = cap->args[0];
> +		return 0;
> +	}

Is KVM_CAP_X86_MEMORY_FAULT_EXIT really specific to x86?
If so, this should go to kvm_vm_ioctl_enable_cap() in arch/x86/kvm/x86.c.
(Or make it non-arch specific.)


>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -6172,3 +6181,22 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  
>  	return init_context.err;
>  }
> +
> +inline int kvm_memfault_exit_or_efault(
> +	struct kvm_vcpu *vcpu, uint64_t gpa, uint64_t len, uint64_t exit_flags)
> +{
> +	if (!(vcpu->kvm->memfault_exit_reasons & exit_flags))
> +		return -EFAULT;
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.len = len;
> +	vcpu->run->memory_fault.flags = exit_flags;
> +	return -1;

Why -1? 0? Anyway enum exit_fastpath_completion is x86 kvm mmu internal
convention. As WIP, it's okay for now, though.


> +}
> +
> +bool kvm_memfault_exit_flags_valid(uint64_t reasons)
> +{
> +	uint64_t valid_flags = KVM_MEMFAULT_REASON_UNKNOWN;
> +
> +	return !(reasons & !valid_flags);
> +}
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 10/14] KVM: x86: Implement KVM_CAP_MEMORY_FAULT_NOWAIT
  2023-03-15  2:17 ` [WIP Patch v2 10/14] KVM: x86: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
@ 2023-03-17  0:32   ` Isaku Yamahata
  0 siblings, 0 replies; 60+ messages in thread
From: Isaku Yamahata @ 2023-03-17  0:32 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: seanjc, jthoughton, kvm, isaku.yamahata

On Wed, Mar 15, 2023 at 02:17:34AM +0000,
Anish Moorthy <amoorthy@google.com> wrote:

> When a memslot has the KVM_MEM_MEMORY_FAULT_EXIT flag set, exit to
> userspace upon encountering a page fault for which the userspace
> page tables do not contain a present mapping.
> ---
>  arch/x86/kvm/mmu/mmu.c | 33 +++++++++++++++++++++++++--------
>  arch/x86/kvm/x86.c     |  1 +
>  2 files changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5e0140db384f6..68bc4ab2bd942 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3214,7 +3214,9 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
>  	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);
>  }
>  
> -static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> +static int kvm_handle_error_pfn(
> +	struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> +	bool faulted_on_absent_mapping)
>  {
>  	if (is_sigpending_pfn(fault->pfn)) {
>  		kvm_handle_signal_exit(vcpu);
> @@ -3234,7 +3236,11 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
>  		return RET_PF_RETRY;
>  	}
>  
> -	return -EFAULT;
> +	return kvm_memfault_exit_or_efault(
> +		vcpu, fault->gfn * PAGE_SIZE, PAGE_SIZE,
> +		faulted_on_absent_mapping
> +			? KVM_MEMFAULT_REASON_ABSENT_MAPPING
> +			: KVM_MEMFAULT_REASON_UNKNOWN);
>  }
>  
>  static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
> @@ -4209,7 +4215,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
>  }
>  
> -static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> +static int __kvm_faultin_pfn(
> +	struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> +	bool fault_on_absent_mapping)
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
>  	bool async;
> @@ -4242,9 +4250,15 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	}
>  
>  	async = false;
> -	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> -					  fault->write, &fault->map_writable,
> -					  &fault->hva);
> +
> +	fault->pfn = __gfn_to_pfn_memslot(
> +		slot, fault->gfn,
> +		fault_on_absent_mapping,
> +		false,
> +		fault_on_absent_mapping ? NULL : &async,
> +		fault->write, &fault->map_writable,
> +		&fault->hva);
> +
>  	if (!async)
>  		return RET_PF_CONTINUE; /* *pfn has correct page already */
>  
> @@ -4274,16 +4288,19 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  			   unsigned int access)
>  {
>  	int ret;
> +	bool fault_on_absent_mapping
> +		= likely(fault->slot) && kvm_slot_fault_on_absent_mapping(fault->slot);

nit: Instead of passing around the value, we can add a new member to
struct kvm_page_fault::fault_on_absent_mapping.

  fault->fault_on_absent_mapping = likely(fault->slot) && kvm_slot_fault_on_absent_mapping(fault->slot);

Thanks,
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (13 preceding siblings ...)
  2023-03-15  2:17 ` [WIP Patch v2 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
@ 2023-03-17 17:43 ` Oliver Upton
  2023-03-17 18:13   ` Sean Christopherson
  2023-03-17 20:35 ` Sean Christopherson
  15 siblings, 1 reply; 60+ messages in thread
From: Oliver Upton @ 2023-03-17 17:43 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: seanjc, jthoughton, kvm, maz

Anish,

Generally the 'RFC PATCH' prefix is used for patches that are for feedback
only (i.e. not to be considered for inclusion).

On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> Hi Sean, here's what I'm planing to send up as v2 of the scalable
> userfaultfd series.

I don't see a ton of value in sending a targeted posting of a series to the
list. IOW, just CC all of the appropriate reviewers+maintainers. I promise,
we won't bite.

> Don't worry, I'm not asking you to review this all :) I just have a few
> remaining questions regarding KVM_CAP_MEMORY_FAULT_EXIT which seem important
> enough to mention before I ask for more attention from others, and they'll be
> clearer with the patches in hand. Anything else I'm happy to find out about when
> I send the actual v2.
> 
> I want your opinion on
> 
> 1. The general API I've set up for KVM_CAP_MEMORY_FAULT_EXIT
>    (described in the api.rst file)
> 2. Whether the UNKNOWN exit reason cases (everywhere but
>    handle_error_pfn atm) would need to be given "real" reasons
>    before this could be merged.
> 3. If you think I've missed sites that currently -EFAULT to userspace
> 
> About (3): after we agreed to only tackle cases where -EFAULT currently makes it
> to userspace, I went though our list and tried to trace which EFAULTS actually
> bubble up to KVM_RUN. That set ended being suspiciously small, so I wanted to
> sanity-check my findings with you. Lmk if you see obvious errors in my list
> below.
> 
> --- EFAULTs under KVM_RUN ---
> 
> Confident that needs conversion (already converted)
> ---------------------------------------------------
> * direct_map
> * handle_error_pfn
> * setup_vmgexit_scratch
> * kvm_handle_page_fault
> * FNAME(fetch)
> 
> EFAULT does not propagate to userspace (do not convert)
> -------------------------------------------------------
> * record_steal_time (arch/x86/kvm/x86.c:3463)
> * hva_to_pfn_retry
> * kvm_vcpu_map
> * FNAME(update_accessed_dirty_bits)
> * __kvm_gfn_to_hva_cache_init
>   Might actually make it to userspace, but only through
>   kvm_read|write_guest_offset_cached- would be covered by those conversions
> * kvm_gfn_to_hva_cache_init
> * __kvm_read_guest_page
> * hva_to_pfn_remapped
>   handle_error_pfn will handle this for the scalable uffd case. Don't think
>   other callers -EFAULT to userspace.
> 
> Still unsure if needs conversion
> --------------------------------
> * __kvm_read_guest_atomic
>   The EFAULT might be propagated though FNAME(sync_page)?
> * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226)
> * __kvm_write_guest_page
>   Called from kvm_write_guest_offset_cached: if that needs change, this does too

The low-level accessors are common across architectures and can be called from
other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT
and convert that into an exit?

> * kvm_write_guest_page
>   Two interesting paths:
>       - kvm_pv_clock_pairing returns a custom KVM_EFAULT error here
>         (arch/x86/kvm/x86.c:9578)

This is a hypercall handler, so the return code is ABI with the guest. So it
shouldn't be converted to an exit to userspace.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit
  2023-03-17 17:43 ` [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Oliver Upton
@ 2023-03-17 18:13   ` Sean Christopherson
  2023-03-17 18:46     ` David Matlack
  0 siblings, 1 reply; 60+ messages in thread
From: Sean Christopherson @ 2023-03-17 18:13 UTC (permalink / raw)
  To: Oliver Upton; +Cc: Anish Moorthy, jthoughton, kvm, maz

On Fri, Mar 17, 2023, Oliver Upton wrote:
> On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> > Hi Sean, here's what I'm planing to send up as v2 of the scalable
> > userfaultfd series.
> 
> I don't see a ton of value in sending a targeted posting of a series to the
> list. IOW, just CC all of the appropriate reviewers+maintainers. I promise,
> we won't bite.

+1.  And though I discourage off-list review, if something is really truly not
ready for public review, e.g. will do more harm than good by causing confusing,
then just send the patches off-list.  Half measures like this will just make folks
grumpy.

> > Don't worry, I'm not asking you to review this all :) I just have a few
> > remaining questions regarding KVM_CAP_MEMORY_FAULT_EXIT which seem important
> > enough to mention before I ask for more attention from others, and they'll be
> > clearer with the patches in hand. Anything else I'm happy to find out about when
> > I send the actual v2.
> > 
> > I want your opinion on
> > 
> > 1. The general API I've set up for KVM_CAP_MEMORY_FAULT_EXIT
> >    (described in the api.rst file)
> > 2. Whether the UNKNOWN exit reason cases (everywhere but
> >    handle_error_pfn atm) would need to be given "real" reasons
> >    before this could be merged.
> > 3. If you think I've missed sites that currently -EFAULT to userspace
> > 
> > About (3): after we agreed to only tackle cases where -EFAULT currently makes it
> > to userspace, I went though our list and tried to trace which EFAULTS actually
> > bubble up to KVM_RUN. That set ended being suspiciously small, so I wanted to
> > sanity-check my findings with you. Lmk if you see obvious errors in my list
> > below.
> > 
> > --- EFAULTs under KVM_RUN ---
> > 
> > Confident that needs conversion (already converted)
> > ---------------------------------------------------
> > * direct_map
> > * handle_error_pfn
> > * setup_vmgexit_scratch
> > * kvm_handle_page_fault
> > * FNAME(fetch)
> > 
> > EFAULT does not propagate to userspace (do not convert)
> > -------------------------------------------------------
> > * record_steal_time (arch/x86/kvm/x86.c:3463)
> > * hva_to_pfn_retry
> > * kvm_vcpu_map
> > * FNAME(update_accessed_dirty_bits)
> > * __kvm_gfn_to_hva_cache_init
> >   Might actually make it to userspace, but only through
> >   kvm_read|write_guest_offset_cached- would be covered by those conversions
> > * kvm_gfn_to_hva_cache_init
> > * __kvm_read_guest_page
> > * hva_to_pfn_remapped
> >   handle_error_pfn will handle this for the scalable uffd case. Don't think
> >   other callers -EFAULT to userspace.
> >
> > Still unsure if needs conversion
> > --------------------------------
> > * __kvm_read_guest_atomic
> >   The EFAULT might be propagated though FNAME(sync_page)?
> > * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226)
> > * __kvm_write_guest_page
> >   Called from kvm_write_guest_offset_cached: if that needs change, this does too
> 
> The low-level accessors are common across architectures and can be called from
> other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT
> and convert that into an exit?

Ya, as things stand today, the conversions _must_ be performed at the caller, as
there are (sadly) far too many flows where KVM squashes the error.  E.g. almost
all of x86's paravirt code just suppresses user memory faults :-(

Anish, when we discussed this off-list, what I meant by limiting the intial support
to existing -EFAULT cases was limiting support to existing cases where KVM directly
returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever
returned _within KVM_ while handling KVM_RUN.  My apologies if I didn't make that clear.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 11/14] KVM: arm64: Allow user_mem_abort to return 0 to signal a 'normal' exit
  2023-03-15  2:17 ` [WIP Patch v2 11/14] KVM: arm64: Allow user_mem_abort to return 0 to signal a 'normal' exit Anish Moorthy
@ 2023-03-17 18:18   ` Oliver Upton
  0 siblings, 0 replies; 60+ messages in thread
From: Oliver Upton @ 2023-03-17 18:18 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: seanjc, jthoughton, kvm

On Wed, Mar 15, 2023 at 02:17:35AM +0000, Anish Moorthy wrote:
> kvm_handle_guest_abort currently just returns 1 if user_mem_abort
> returns 0. Since 1 is the "resume the guest" code, user_mem_abort is
> essentially incapable of triggering a "normal" exit: it can only trigger
> exits by returning a negative value, which indicates an error.
> 
> Remove the "if (ret == 0) ret = 1;" statement from
> kvm_handle_guest_abort and refactor user_mem_abort slightly to allow it
> to trigger 'normal' exits by returning 0.

You should append '()' to function names, as it makes it abundantly obvious to
the reader that the symbols you describe are indeed functions.

I find the changelog a bit too mechanical and doesn't capture the nuance.

  Generally, in the context of a vCPU exit, a return value of 1 is used
  to indicate KVM should return to the guest and 0 is used to complete a
  'normal' exit to userspace. user_mem_abort() deviates from this
  slightly, using 0 to return to the guest.

  Just return 1 from user_mem_abort() to return to the guest and drop
  the return code conversion from kvm_handle_guest_abort(). It is now
  possible to do a 'normal' exit to userspace from user_mem_abort(),
  which will be used in a later change.

> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  arch/arm64/kvm/mmu.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 7113587222ffe..735044859eb25 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1190,7 +1190,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
>  {
> -	int ret = 0;
> +	int ret = 1;
>  	bool write_fault, writable, force_pte = false;
>  	bool exec_fault;
>  	bool device = false;
> @@ -1281,8 +1281,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	    (logging_active && write_fault)) {
>  		ret = kvm_mmu_topup_memory_cache(memcache,
>  						 kvm_mmu_cache_min_pages(kvm));
> -		if (ret)
> +		if (ret < 0)

There's no need to change this condition.

>  			return ret;
> +		else
> +			ret = 1;

I'd prefer if you set 'ret' close to where it is actually used, which I
believe is only if mmu_invalidate_retry():

	if (mmu_invalidate_retry(kvm, mmu_seq)) {
		ret = 1;
		goto out_unlock;
	}

Otherwise ret gets written to before exiting.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT
  2023-03-15  2:17 ` [WIP Patch v2 12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
@ 2023-03-17 18:27   ` Oliver Upton
  2023-03-17 19:00     ` Anish Moorthy
  0 siblings, 1 reply; 60+ messages in thread
From: Oliver Upton @ 2023-03-17 18:27 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: seanjc, jthoughton, kvm

On Wed, Mar 15, 2023 at 02:17:36AM +0000, Anish Moorthy wrote:
> When a memslot has the KVM_MEM_MEMORY_FAULT_EXIT flag set, exit to
> userspace upon encountering a page fault for which the userspace
> page tables do not contain a present mapping.
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
>  arch/arm64/kvm/arm.c |  1 +
>  arch/arm64/kvm/mmu.c | 14 ++++++++++++--
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 3bd732eaf0872..f8337e757c777 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -220,6 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_VCPU_ATTRIBUTES:
>  	case KVM_CAP_PTP_KVM:
>  	case KVM_CAP_ARM_SYSTEM_SUSPEND:
> +	case KVM_CAP_MEMORY_FAULT_NOWAIT:
>  		r = 1;
>  		break;
>  	case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 735044859eb25..0d04ffc81f783 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1206,6 +1206,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	unsigned long vma_pagesize, fault_granule;
>  	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>  	struct kvm_pgtable *pgt;
> +	bool exit_on_memory_fault = kvm_slot_fault_on_absent_mapping(memslot);
>  
>  	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
>  	write_fault = kvm_is_write_fault(vcpu);
> @@ -1303,8 +1304,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	 */
>  	smp_rmb();
>  
> -	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
> -				   write_fault, &writable, NULL);
> +	pfn = __gfn_to_pfn_memslot(
> +		memslot, gfn, exit_on_memory_fault, false, NULL,
> +		write_fault, &writable, NULL);

As stated before [*], this google3-esque style does not match the kernel style
guide. You may want to check if your work machine is setting up a G3-specific
editor configuration behind your back.

[*] https://lore.kernel.org/kvm/Y+0QRsZ4yWyUdpnc@google.com/

> +	if (exit_on_memory_fault && pfn == KVM_PFN_ERR_FAULT) {

nit: I don't think the local is explicitly necessary. I still find this
readable:

	if (pfn == KVM_PFN_ERR_FAULT && kvm_slot_fault_on_absent_mapping(memslot))

> +		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +		vcpu->run->memory_fault.flags = 0;
> +		vcpu->run->memory_fault.gpa = gfn << PAGE_SHIFT;
> +		vcpu->run->memory_fault.len = vma_pagesize;
> +		return 0;
> +	}
>  	if (pfn == KVM_PFN_ERR_HWPOISON) {
>  		kvm_send_hwpoison_signal(hva, vma_shift);
>  		return 1;
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 
> 

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-17  0:02   ` Isaku Yamahata
@ 2023-03-17 18:33     ` Anish Moorthy
  2023-03-17 19:30       ` Oliver Upton
  2023-03-17 21:50       ` Sean Christopherson
  0 siblings, 2 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-17 18:33 UTC (permalink / raw)
  To: Isaku Yamahata, Marc Zyngier, Oliver Upton; +Cc: seanjc, jthoughton, kvm

On Thu, Mar 16, 2023 at 5:02 PM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:

> > +7.34 KVM_CAP_X86_MEMORY_FAULT_EXIT
> > +----------------------------------
> > +
> > +:Architectures: x86
>
> Why x86 specific?

Sean was the only one to bring this functionality up and originally
did so in the context of some x86-specific functions, so I assumed
that x86 was the only ask and that maybe the other architectures had
alternative solutions. Admittedly I also wanted to avoid wading
through another big set of -EFAULT references :/

Those are the only reasons though. Marc, Oliver, should I bring this
capability to Arm as well?

> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index e38ddda05b261..00aec43860ff1 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> >       spin_lock_init(&kvm->mn_invalidate_lock);
> >       rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> >       xa_init(&kvm->vcpu_array);
> > +     kvm->memfault_exit_reasons = 0;
> >
> >       INIT_LIST_HEAD(&kvm->gpc_list);
> >       spin_lock_init(&kvm->gpc_lock);
> > @@ -4671,6 +4672,14 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
> >
> >               return r;
> >       }
> > +     case KVM_CAP_X86_MEMORY_FAULT_EXIT: {
> > +             if (!kvm_vm_ioctl_check_extension(kvm, KVM_CAP_X86_MEMORY_FAULT_EXIT))
> > +                     return -EINVAL;
> > +             else if (!kvm_memfault_exit_flags_valid(cap->args[0]))
> > +                     return -EINVAL;
> > +             kvm->memfault_exit_reasons = cap->args[0];
> > +             return 0;
> > +     }
>
> Is KVM_CAP_X86_MEMORY_FAULT_EXIT really specific to x86?
> If so, this should go to kvm_vm_ioctl_enable_cap() in arch/x86/kvm/x86.c.
> (Or make it non-arch specific.)

Ah, thanks for the catch: I renamed my old non-x86 specific
capability, and forgot to move this block.

> > +inline int kvm_memfault_exit_or_efault(
> > +     struct kvm_vcpu *vcpu, uint64_t gpa, uint64_t len, uint64_t exit_flags)
> > +{
> > +     if (!(vcpu->kvm->memfault_exit_reasons & exit_flags))
> > +             return -EFAULT;
> > +     vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +     vcpu->run->memory_fault.gpa = gpa;
> > +     vcpu->run->memory_fault.len = len;
> > +     vcpu->run->memory_fault.flags = exit_flags;
> > +     return -1;
>
> Why -1? 0? Anyway enum exit_fastpath_completion is x86 kvm mmu internal
> convention. As WIP, it's okay for now, though.

The -1 isn't to indicate a failure in this function itself, but to
allow callers to substitute this for "return -EFAULT." A return code
of zero would mask errors and cause KVM to proceed in ways that it
shouldn't. For instance, "setup_vmgexit_scratch" uses it like this

if (kvm_read_guest(svm->vcpu.kvm, scratch_gpa_beg, scratch_va, len)) {
    ...
-  return -EFAULT;
+ return kvm_memfault_exit_or_efault(...);
}

and looking at one of its callers (sev_handle_vmgexit) shows how a
return code of zero would cause a different control flow

case SVM_VMGEXIT_MMIO_READ:
ret = setup_vmgexit_scratch(svm, true, control->exit_info_2);
if (ret)
    break;

ret = ret = kvm_sev_es_mmio_read(vcpu,

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-15  2:17 ` [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field Anish Moorthy
  2023-03-17  0:02   ` Isaku Yamahata
@ 2023-03-17 18:35   ` Oliver Upton
  1 sibling, 0 replies; 60+ messages in thread
From: Oliver Upton @ 2023-03-17 18:35 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: seanjc, jthoughton, kvm

On Wed, Mar 15, 2023 at 02:17:28AM +0000, Anish Moorthy wrote:

[...]

> @@ -6172,3 +6181,22 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  
>  	return init_context.err;
>  }
> +
> +inline int kvm_memfault_exit_or_efault(
> +	struct kvm_vcpu *vcpu, uint64_t gpa, uint64_t len, uint64_t exit_flags)
> +{
> +	if (!(vcpu->kvm->memfault_exit_reasons & exit_flags))
> +		return -EFAULT;

<snip>

> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.len = len;
> +	vcpu->run->memory_fault.flags = exit_flags;

</snip>

Please spin this off into a helper and make use of it on the arm64 side.

> +	return -1;
> +}
> +
> +bool kvm_memfault_exit_flags_valid(uint64_t reasons)
> +{
> +	uint64_t valid_flags = KVM_MEMFAULT_REASON_UNKNOWN;
> +
> +	return !(reasons & !valid_flags);
> +}
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 
> 

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit
  2023-03-17 18:13   ` Sean Christopherson
@ 2023-03-17 18:46     ` David Matlack
  2023-03-17 18:54       ` Oliver Upton
  0 siblings, 1 reply; 60+ messages in thread
From: David Matlack @ 2023-03-17 18:46 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Oliver Upton, Anish Moorthy, jthoughton, kvm, maz

On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Mar 17, 2023, Oliver Upton wrote:
> > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> > > Hi Sean, here's what I'm planing to send up as v2 of the scalable
> > > userfaultfd series.
> >
> > I don't see a ton of value in sending a targeted posting of a series to the
> > list.

But isn't it already generating value as you were able to weigh in and
provide feedback on technical aspects that you would not have been
otherwise able to if Anish had just messaged Sean?

> > IOW, just CC all of the appropriate reviewers+maintainers. I promise,
> > we won't bite.

I disagree. While I think it's fine to reach out to someone off-list
to discuss a specific question, if you're going to message all
reviewers and maintainers, you should also CC the mailing list. That
allows more people to follow along and weigh in if necessary.

>
> +1.  And though I discourage off-list review, if something is really truly not
> ready for public review, e.g. will do more harm than good by causing confusing,
> then just send the patches off-list.  Half measures like this will just make folks
> grumpy.

In this specific case, Anish very clearly laid out the reason for
sending the patches and asked very specific directed questions in the
cover letter and called it out as WIP. Yes "WIP" should have been
"RFC" but other than that should anything have been different?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit
  2023-03-17 18:46     ` David Matlack
@ 2023-03-17 18:54       ` Oliver Upton
  2023-03-17 18:59         ` David Matlack
  0 siblings, 1 reply; 60+ messages in thread
From: Oliver Upton @ 2023-03-17 18:54 UTC (permalink / raw)
  To: David Matlack; +Cc: Sean Christopherson, Anish Moorthy, jthoughton, kvm, maz

David,

On Fri, Mar 17, 2023 at 11:46:58AM -0700, David Matlack wrote:
> On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Mar 17, 2023, Oliver Upton wrote:
> > > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> > > > Hi Sean, here's what I'm planing to send up as v2 of the scalable
> > > > userfaultfd series.
> > >
> > > I don't see a ton of value in sending a targeted posting of a series to the
> > > list.
> 
> But isn't it already generating value as you were able to weigh in and
> provide feedback on technical aspects that you would not have been
> otherwise able to if Anish had just messaged Sean?

No, I only happened upon this series looking at lore. My problem is that
none of the affected maintainers or reviewers were cc'ed on the series.

> > > IOW, just CC all of the appropriate reviewers+maintainers. I promise,
> > > we won't bite.
> 
> I disagree. While I think it's fine to reach out to someone off-list
> to discuss a specific question, if you're going to message all
> reviewers and maintainers, you should also CC the mailing list. That
> allows more people to follow along and weigh in if necessary.

I think there may be a slight disconnect here :) I'm in no way encouraging
off-list discussion and instead asking that mail on the list arrives in
the right folks' inboxes.

Posting an RFC on the list was absolutely the right thing to do.

> >
> > +1.  And though I discourage off-list review, if something is really truly not
> > ready for public review, e.g. will do more harm than good by causing confusing,
> > then just send the patches off-list.  Half measures like this will just make folks
> > grumpy.
> 
> In this specific case, Anish very clearly laid out the reason for
> sending the patches and asked very specific directed questions in the
> cover letter and called it out as WIP. Yes "WIP" should have been
> "RFC" but other than that should anything have been different?

See above

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit
  2023-03-17 18:54       ` Oliver Upton
@ 2023-03-17 18:59         ` David Matlack
  2023-03-17 19:53           ` Anish Moorthy
  0 siblings, 1 reply; 60+ messages in thread
From: David Matlack @ 2023-03-17 18:59 UTC (permalink / raw)
  To: Oliver Upton; +Cc: Sean Christopherson, Anish Moorthy, jthoughton, kvm, maz

On Fri, Mar 17, 2023 at 11:54 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> David,
>
> On Fri, Mar 17, 2023 at 11:46:58AM -0700, David Matlack wrote:
> > On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, Mar 17, 2023, Oliver Upton wrote:
> > > > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> > > > > Hi Sean, here's what I'm planing to send up as v2 of the scalable
> > > > > userfaultfd series.
> > > >
> > > > I don't see a ton of value in sending a targeted posting of a series to the
> > > > list.
> >
> > But isn't it already generating value as you were able to weigh in and
> > provide feedback on technical aspects that you would not have been
> > otherwise able to if Anish had just messaged Sean?
>
> No, I only happened upon this series looking at lore. My problem is that
> none of the affected maintainers or reviewers were cc'ed on the series.
>
> > > > IOW, just CC all of the appropriate reviewers+maintainers. I promise,
> > > > we won't bite.
> >
> > I disagree. While I think it's fine to reach out to someone off-list
> > to discuss a specific question, if you're going to message all
> > reviewers and maintainers, you should also CC the mailing list. That
> > allows more people to follow along and weigh in if necessary.
>
> I think there may be a slight disconnect here :) I'm in no way encouraging
> off-list discussion and instead asking that mail on the list arrives in
> the right folks' inboxes.
>
> Posting an RFC on the list was absolutely the right thing to do.

Doh. I misunderstood what you meant. We are in violent agreement!

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-15  2:17 ` [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation Anish Moorthy
@ 2023-03-17 18:59   ` Oliver Upton
  2023-03-17 20:15     ` Anish Moorthy
  2023-03-17 20:17     ` Sean Christopherson
  0 siblings, 2 replies; 60+ messages in thread
From: Oliver Upton @ 2023-03-17 18:59 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: seanjc, jthoughton, kvm

On Wed, Mar 15, 2023 at 02:17:33AM +0000, Anish Moorthy wrote:
> Add documentation, memslot flags, useful helper functions, and the
> actual new capability itself.
> 
> Memory fault exits on absent mappings are particularly useful for
> userfaultfd-based live migration postcopy. When many vCPUs fault upon a
> single userfaultfd the faults can take a while to surface to userspace
> due to having to contend for uffd wait queue locks. Bypassing the uffd
> entirely by triggering a vCPU exit avoids this contention and can improve
> the fault rate by as much as 10x.
> ---
>  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++++++---
>  include/linux/kvm_host.h       |  6 ++++++
>  include/uapi/linux/kvm.h       |  3 +++
>  tools/include/uapi/linux/kvm.h |  2 ++
>  virt/kvm/kvm_main.c            |  7 ++++++-
>  5 files changed, 51 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f9ca18bbec879..4932c0f62eb3d 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
>    /* for kvm_userspace_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>    #define KVM_MEM_READONLY	(1UL << 1)
> +  #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)

call it KVM_MEM_EXIT_ABSENT_MAPPING

>  
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1342,12 +1343,15 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>  be identical.  This allows large pages in the guest to be backed by large
>  pages in the host.
>  
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> +The flags field supports three flags
> +
> +1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
>  writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> +use it.
> +2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
>  to make a new slot read-only.  In this case, writes to this memory will be
>  posted to userspace as KVM_EXIT_MMIO exits.
> +3.  KVM_MEM_ABSENT_MAPPING_FAULT: see KVM_CAP_MEMORY_FAULT_NOWAIT for details.
>  
>  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>  the memory region are automatically reflected into the guest.  For example, an
> @@ -7702,10 +7706,37 @@ Through args[0], the capability can be set on a per-exit-reason basis.
>  Currently, the only exit reasons supported are
>  
>  1. KVM_MEMFAULT_REASON_UNKNOWN (1 << 0)
> +2. KVM_MEMFAULT_REASON_ABSENT_MAPPING (1 << 1)
>  
>  Memory fault exits with a reason of UNKNOWN should not be depended upon: they
>  may be added, removed, or reclassified under a stable reason.
>  
> +7.35 KVM_CAP_MEMORY_FAULT_NOWAIT
> +--------------------------------
> +
> +:Architectures: x86, arm64
> +:Returns: -EINVAL.
> +
> +The presence of this capability indicates that userspace may pass the
> +KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> +to exit to populate 'kvm_run.memory_fault' and exit to userspace (*) in response
> +to page faults for which the userspace page tables do not contain present
> +mappings. Attempting to enable the capability directly will fail.
> +
> +The 'gpa' and 'len' fields of kvm_run.memory_fault will be set to the starting
> +address and length (in bytes) of the faulting page. 'flags' will be set to
> +KVM_MEMFAULT_REASON_ABSENT_MAPPING.
> +
> +Userspace should determine how best to make the mapping present, then take
> +appropriate action. For instance, in the case of absent mappings this might
> +involve establishing the mapping for the first time via UFFDIO_COPY/CONTINUE or
> +faulting the mapping in using MADV_POPULATE_READ/WRITE. After establishing the
> +mapping, userspace can return to KVM to retry the previous memory access.
> +
> +(*) NOTE: On x86, KVM_CAP_X86_MEMORY_FAULT_EXIT must be enabled for the
> +KVM_MEMFAULT_REASON_ABSENT_MAPPING_reason: otherwise userspace will only receive
> +a -EFAULT from KVM_RUN without any useful information.

I'm not a fan of this architecture-specific dependency. Userspace is already
explicitly opting in to this behavior by way of the memslot flag. These sort
of exits are entirely orthogonal to the -EFAULT conversion earlier in the
series.

>  8. Other capabilities.
>  ======================
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d3ccfead73e42..c28330f25526f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -593,6 +593,12 @@ static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *sl
>  	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
>  }
>  
> +static inline bool kvm_slot_fault_on_absent_mapping(
> +	const struct kvm_memory_slot *slot)

Style again...

I'd strongly recommend using 'exit' instead of 'fault' in the verbiage of the
KVM implementation. I understand we're giving userspace the illusion of a page
fault mechanism, but the term is then overloaded in KVM since we handle
literal faults from hardware.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT
  2023-03-17 18:27   ` Oliver Upton
@ 2023-03-17 19:00     ` Anish Moorthy
  2023-03-17 19:03       ` Oliver Upton
  2023-03-17 19:24       ` Sean Christopherson
  0 siblings, 2 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-17 19:00 UTC (permalink / raw)
  To: Oliver Upton; +Cc: seanjc, jthoughton, kvm

On Fri, Mar 17, 2023 at 11:27 AM Oliver Upton <oliver.upton@linux.dev> wrote:

> > +     pfn = __gfn_to_pfn_memslot(
> > +             memslot, gfn, exit_on_memory_fault, false, NULL,
> > +             write_fault, &writable, NULL);
>
> As stated before [*], this google3-esque style does not match the kernel style
> guide. You may want to check if your work machine is setting up a G3-specific
> editor configuration behind your back.
>
> [*] https://lore.kernel.org/kvm/Y+0QRsZ4yWyUdpnc@google.com/

If you're referring to the indentation, then that was definitely me.
I'll give the style guide another readthrough before I submit the next
version then, since checkpatch.pl doesn't seem to complain here.

> > +     if (exit_on_memory_fault && pfn == KVM_PFN_ERR_FAULT) {
>
> nit: I don't think the local is explicitly necessary. I still find this
> readable:

The local was for keeping a consistent value between the two blocks of code here

    pfn = __gfn_to_pfn_memslot(
        memslot, gfn, exit_on_memory_fault, false, NULL,
        write_fault, &writable, NULL);

    if (exit_on_memory_fault && pfn == KVM_PFN_ERR_FAULT) {
        // Set up vCPU exit and return 0
    }

I wanted to avoid the possibility of causing an early
__gfn_to_pfn_memslot exit but then not populating the vCPU exit.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT
  2023-03-17 19:00     ` Anish Moorthy
@ 2023-03-17 19:03       ` Oliver Upton
  2023-03-17 19:24       ` Sean Christopherson
  1 sibling, 0 replies; 60+ messages in thread
From: Oliver Upton @ 2023-03-17 19:03 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: seanjc, jthoughton, kvm

On Fri, Mar 17, 2023 at 12:00:30PM -0700, Anish Moorthy wrote:
> On Fri, Mar 17, 2023 at 11:27 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> > > +     pfn = __gfn_to_pfn_memslot(
> > > +             memslot, gfn, exit_on_memory_fault, false, NULL,
> > > +             write_fault, &writable, NULL);
> >
> > As stated before [*], this google3-esque style does not match the kernel style
> > guide. You may want to check if your work machine is setting up a G3-specific
> > editor configuration behind your back.
> >
> > [*] https://lore.kernel.org/kvm/Y+0QRsZ4yWyUdpnc@google.com/
> 
> If you're referring to the indentation, then that was definitely me.
> I'll give the style guide another readthrough before I submit the next
> version then, since checkpatch.pl doesn't seem to complain here.
> 
> > > +     if (exit_on_memory_fault && pfn == KVM_PFN_ERR_FAULT) {
> >
> > nit: I don't think the local is explicitly necessary. I still find this
> > readable:
> 
> The local was for keeping a consistent value between the two blocks of code here
> 
>     pfn = __gfn_to_pfn_memslot(
>         memslot, gfn, exit_on_memory_fault, false, NULL,
>         write_fault, &writable, NULL);
> 
>     if (exit_on_memory_fault && pfn == KVM_PFN_ERR_FAULT) {
>         // Set up vCPU exit and return 0
>     }
> 
> I wanted to avoid the possibility of causing an early
> __gfn_to_pfn_memslot exit but then not populating the vCPU exit.

Ignore me, I didn't see the other use of the local.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT
  2023-03-17 19:00     ` Anish Moorthy
  2023-03-17 19:03       ` Oliver Upton
@ 2023-03-17 19:24       ` Sean Christopherson
  1 sibling, 0 replies; 60+ messages in thread
From: Sean Christopherson @ 2023-03-17 19:24 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Oliver Upton, jthoughton, kvm

On Fri, Mar 17, 2023, Anish Moorthy wrote:
> On Fri, Mar 17, 2023 at 11:27 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> > > +     pfn = __gfn_to_pfn_memslot(
> > > +             memslot, gfn, exit_on_memory_fault, false, NULL,
> > > +             write_fault, &writable, NULL);
> >
> > As stated before [*], this google3-esque style does not match the kernel style
> > guide. You may want to check if your work machine is setting up a G3-specific
> > editor configuration behind your back.
> >
> > [*] https://lore.kernel.org/kvm/Y+0QRsZ4yWyUdpnc@google.com/
> 
> If you're referring to the indentation, then that was definitely me.

The two issues are (1) don't put newlines immediately after an opening '(', and
(2) align indentation relative to the direct parent '(' that encapsulates the code.

Concretely, the above should be:

	pfn = __gfn_to_pfn_memslot(memslot, gfn, exit_on_memory_fault, false,
				   NULL, write_fault, &writable, NULL);

> I'll give the style guide another readthrough before I submit the next
> version then, since checkpatch.pl doesn't seem to complain here.

I don't think checkpatch looks for these particular style issues.  FWIW, you
really shouldn't need to read through the formal documentation for these "basic"
rules, just spend time poking around the code base.  If your code looks different
than everything else, then you're likely doing it wrong.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-17 18:33     ` Anish Moorthy
@ 2023-03-17 19:30       ` Oliver Upton
  2023-03-17 21:50       ` Sean Christopherson
  1 sibling, 0 replies; 60+ messages in thread
From: Oliver Upton @ 2023-03-17 19:30 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Isaku Yamahata, Marc Zyngier, seanjc, jthoughton, kvm

On Fri, Mar 17, 2023 at 11:33:38AM -0700, Anish Moorthy wrote:
> On Thu, Mar 16, 2023 at 5:02 PM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
> 
> > > +7.34 KVM_CAP_X86_MEMORY_FAULT_EXIT
> > > +----------------------------------
> > > +
> > > +:Architectures: x86
> >
> > Why x86 specific?
> 
> Sean was the only one to bring this functionality up and originally
> did so in the context of some x86-specific functions, so I assumed
> that x86 was the only ask and that maybe the other architectures had
> alternative solutions. Admittedly I also wanted to avoid wading
> through another big set of -EFAULT references :/

There isn't much :) Sanity checks in mmu.c and some currently unhandled
failures to write guest memory in pvtime.c

> Those are the only reasons though. Marc, Oliver, should I bring this
> capability to Arm as well?

The x86 implementation shouldn't preclude UAPI reuse, but I'm not strongly
motivated in either direction on this. A clear use case where the exit
information is actionable rather than just informational would make the change
more desirable.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit
  2023-03-17 18:59         ` David Matlack
@ 2023-03-17 19:53           ` Anish Moorthy
  2023-03-17 22:03             ` Sean Christopherson
  0 siblings, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-17 19:53 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Oliver Upton, jthoughton, kvm, maz, Isaku Yamahata

On Fri, Mar 17, 2023 at 12:00 PM David Matlack <dmatlack@google.com> wrote:
>
> On Fri, Mar 17, 2023 at 11:54 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > David,
> >
> > On Fri, Mar 17, 2023 at 11:46:58AM -0700, David Matlack wrote:
> > > On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Fri, Mar 17, 2023, Oliver Upton wrote:
> > > > > On Wed, Mar 15, 2023 at 02:17:24AM +0000, Anish Moorthy wrote:
> > > > > > Hi Sean, here's what I'm planing to send up as v2 of the scalable
> > > > > > userfaultfd series.
> > > > >
> > > > > I don't see a ton of value in sending a targeted posting of a series to the
> > > > > list.
> > >
> > > But isn't it already generating value as you were able to weigh in and
> > > provide feedback on technical aspects that you would not have been
> > > otherwise able to if Anish had just messaged Sean?
> >
> > No, I only happened upon this series looking at lore. My problem is that
> > none of the affected maintainers or reviewers were cc'ed on the series.
> >
> > > > > IOW, just CC all of the appropriate reviewers+maintainers. I promise,
> > > > > we won't bite.
> > >
> > > I disagree. While I think it's fine to reach out to someone off-list
> > > to discuss a specific question, if you're going to message all
> > > reviewers and maintainers, you should also CC the mailing list. That
> > > allows more people to follow along and weigh in if necessary.
> >
> > I think there may be a slight disconnect here :) I'm in no way encouraging
> > off-list discussion and instead asking that mail on the list arrives in
> > the right folks' inboxes.
> >
> > Posting an RFC on the list was absolutely the right thing to do.
>
> Doh. I misunderstood what you meant. We are in violent agreement!

Noted. Also, thanks Oliver and Isaku for paying attention to the
series despite it being obscure.

On Fri, Mar 17, 2023 at 11:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Mar 17, 2023, Oliver Upton wrote:
> > > Still unsure if needs conversion
> > > --------------------------------
> > > * __kvm_read_guest_atomic
> > >   The EFAULT might be propagated though FNAME(sync_page)?
> > > * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226)
> > > * __kvm_write_guest_page
> > >   Called from kvm_write_guest_offset_cached: if that needs change, this does too
> >
> > The low-level accessors are common across architectures and can be called from
> > other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT
> > and convert that into an exit?
>
> Ya, as things stand today, the conversions _must_ be performed at the caller, as
> there are (sadly) far too many flows where KVM squashes the error.  E.g. almost
> all of x86's paravirt code just suppresses user memory faults :-(
>
> Anish, when we discussed this off-list, what I meant by limiting the intial support
> to existing -EFAULT cases was limiting support to existing cases where KVM directly
> returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever
> returned _within KVM_ while handling KVM_RUN.  My apologies if I didn't make that clear.

Don't worry, we eventually got there off-list :)

This brings us back to my original set of questions. As has already
been pointed out, I'll have to revisit my "Confident that needs
conversion" changes and tweak them so that the vCPU exit is populated
only for the call sites where the -EFAULT makes it to userspace. I
still want feedback on if I've mis-identified any of the functions in
my "EFAULT does not propagate to userspace" list and whether there are
functions/callers in the "Still unsure if needs conversion" which do
have return paths to KVM_RUN.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-17 18:59   ` Oliver Upton
@ 2023-03-17 20:15     ` Anish Moorthy
  2023-03-17 20:54       ` Sean Christopherson
  2023-03-17 20:17     ` Sean Christopherson
  1 sibling, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-17 20:15 UTC (permalink / raw)
  To: Oliver Upton, Sean Christopherson; +Cc: jthoughton, kvm

On Fri, Mar 17, 2023 at 11:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:

> > +  #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
>
> call it KVM_MEM_EXIT_ABSENT_MAPPING
> ...
> I'm not a fan of this architecture-specific dependency. Userspace is already
> explicitly opting in to this behavior by way of the memslot flag. These sort
> of exits are entirely orthogonal to the -EFAULT conversion earlier in the
> series.

I'm not a fan of the semantics varying between architectures either:
but the reason I have it like that (and that the EFAULT conversions
exist in this series in the first place) is (a) not having
KVM_CAP_MEMORY_FAULT_EXIT implemented for arm and (b) Sean's following
statement from https://lore.kernel.org/kvm/Y%2FfS0eab7GG0NVKS@google.com/

On Thu, Feb 23, 2023 at 12:55 PM Sean Christopherson <seanjc@google.com> wrote:
>
> The new memslot flag should depend on KVM_CAP_MEMORY_FAULT_EXIT, but
> KVM_CAP_MEMORY_FAULT_EXIT should be a standalone thing, i.e. should convert "all"
> guest-memory -EFAULTS to KVM_CAP_MEMORY_FAULT_EXIT.  All in quotes because I would
> likely be ok with a partial conversion for the initial implementation if there
> are paths that would require an absurd amount of work to convert.

The best way that I thought of how to do that was to have one cap
(KVM_CAP_MEMORY_FAULT_NOWAIT) to make KVM -EFAULT without calling slow
GUP, and KVM_CAP_MEMORY_FAULT_EXIT to transform efaults to useful vm
exits. But if you think the two are really orthogonal, then we need to
resolve the apparent disagreement.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-17 18:59   ` Oliver Upton
  2023-03-17 20:15     ` Anish Moorthy
@ 2023-03-17 20:17     ` Sean Christopherson
  2023-03-20 22:22       ` Oliver Upton
  1 sibling, 1 reply; 60+ messages in thread
From: Sean Christopherson @ 2023-03-17 20:17 UTC (permalink / raw)
  To: Oliver Upton; +Cc: Anish Moorthy, jthoughton, kvm

On Fri, Mar 17, 2023, Oliver Upton wrote:
> On Wed, Mar 15, 2023 at 02:17:33AM +0000, Anish Moorthy wrote:
> > Add documentation, memslot flags, useful helper functions, and the
> > actual new capability itself.
> > 
> > Memory fault exits on absent mappings are particularly useful for
> > userfaultfd-based live migration postcopy. When many vCPUs fault upon a
> > single userfaultfd the faults can take a while to surface to userspace
> > due to having to contend for uffd wait queue locks. Bypassing the uffd
> > entirely by triggering a vCPU exit avoids this contention and can improve
> > the fault rate by as much as 10x.
> > ---
> >  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++++++---
> >  include/linux/kvm_host.h       |  6 ++++++
> >  include/uapi/linux/kvm.h       |  3 +++
> >  tools/include/uapi/linux/kvm.h |  2 ++
> >  virt/kvm/kvm_main.c            |  7 ++++++-
> >  5 files changed, 51 insertions(+), 4 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f9ca18bbec879..4932c0f62eb3d 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
> >    /* for kvm_userspace_memory_region::flags */
> >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >    #define KVM_MEM_READONLY	(1UL << 1)
> > +  #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
> 
> call it KVM_MEM_EXIT_ABSENT_MAPPING

Ooh, look, a bikeshed!  :-)

I don't think it should have "EXIT" in the name.  The exit to userspace is a side
effect, e.g. KVM already exits to userspace on unresolved userfaults.  The only
thing this knob _directly_ controls is whether or not KVM attempts the slow path.
If we give the flag a name like "exit on absent userspace mappings", then KVM will
appear to do the wrong thing when KVM exits on a truly absent userspace mapping.

And as I argued in the last version[*], I am _strongly_ opposed to KVM speculating
on why KVM is exiting to userspace.  I.e. KVM should not set a special flag if
the memslot has "fast only" behavior.  The only thing the flag should do is control
whether or not KVM tries slow paths, what KVM does in response to an unresolved
fault should be an orthogonal thing.

E.g. If KVM encounters an unmapped page while prefetching SPTEs, KVM will (correctly)
not exit to userspace and instead simply terminate the prefetch.  Obviously we
could solve that through documentation, but I don't see any benefit in making this
more complex than it needs to be.

[*] https://lkml.kernel.org/r/Y%2B0RYMfw6pHrSLX4%40google.com

> > +7.35 KVM_CAP_MEMORY_FAULT_NOWAIT
> > +--------------------------------
> > +
> > +:Architectures: x86, arm64
> > +:Returns: -EINVAL.
> > +
> > +The presence of this capability indicates that userspace may pass the
> > +KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> > +to exit to populate 'kvm_run.memory_fault' and exit to userspace (*) in response
> > +to page faults for which the userspace page tables do not contain present
> > +mappings. Attempting to enable the capability directly will fail.
> > +
> > +The 'gpa' and 'len' fields of kvm_run.memory_fault will be set to the starting
> > +address and length (in bytes) of the faulting page. 'flags' will be set to
> > +KVM_MEMFAULT_REASON_ABSENT_MAPPING.
> > +
> > +Userspace should determine how best to make the mapping present, then take
> > +appropriate action. For instance, in the case of absent mappings this might
> > +involve establishing the mapping for the first time via UFFDIO_COPY/CONTINUE or
> > +faulting the mapping in using MADV_POPULATE_READ/WRITE. After establishing the
> > +mapping, userspace can return to KVM to retry the previous memory access.
> > +
> > +(*) NOTE: On x86, KVM_CAP_X86_MEMORY_FAULT_EXIT must be enabled for the
> > +KVM_MEMFAULT_REASON_ABSENT_MAPPING_reason: otherwise userspace will only receive
> > +a -EFAULT from KVM_RUN without any useful information.
> 
> I'm not a fan of this architecture-specific dependency. Userspace is already
> explicitly opting in to this behavior by way of the memslot flag. These sort
> of exits are entirely orthogonal to the -EFAULT conversion earlier in the
> series.

Ya, yet another reason not to speculate on why KVM wasn't able to resolve a fault.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit
  2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
                   ` (14 preceding siblings ...)
  2023-03-17 17:43 ` [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Oliver Upton
@ 2023-03-17 20:35 ` Sean Christopherson
  15 siblings, 0 replies; 60+ messages in thread
From: Sean Christopherson @ 2023-03-17 20:35 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: jthoughton, kvm

On Wed, Mar 15, 2023, Anish Moorthy wrote:
> Still unsure if needs conversion
> --------------------------------
> * __kvm_read_guest_atomic
>   The EFAULT might be propagated though FNAME(sync_page)?
> * kvm_write_guest_offset_cached (virt/kvm/kvm_main.c:3226)
> * __kvm_write_guest_page
>   Called from kvm_write_guest_offset_cached: if that needs change, this does too
> * kvm_write_guest_page
>   Two interesting paths:
>       - kvm_pv_clock_pairing returns a custom KVM_EFAULT error here
>         (arch/x86/kvm/x86.c:9578)
>       - kvm_write_guest_offset_cached returns this directly (so if that needs
>         change, this does too)
> * kvm_read_guest_offset_cached
>   I actually do see a path to userspace, but it's through hyper-v, which we've
>   said is out of scope for round 1.

To clarify: I didn't intend to make Hyper-V explicitly out-of-scope, rather Hyper-V
happened to be out-of-scope because the existing code suppresses -EFAULT.  I don't
think we should make any particular feature/area out-of-scope, as that will lead
to even more arbitrary behavior than we already have.

What I intended, and what I still think we should do, is limit the scope of the
capability to existing paths that return -EFAULT to userspace.  Trying to fix all
of the paths that suppress -EFAULT is going to be ridiculously difficult as so
much of the behavior is arguaby ABI, and there's no authoritative documentation
on what's supposed to happen.  I definitely would love to fix those paths in the
long term, but for the initial implementation/conversion, I think it makes sense
to punt on them, otherwise it'll take months/years to merge this code.

Back to the Hyper-V case, assuming you're referring to the use of kvm_hv_verify_vp_assist()
in nested_svm_vmrun(), that code is a mess.  KVM shouldn't inject a #GP and then
exit to userspace, e.g. the guest might see a spurious #GP if userspace fixes the
fault and resume the instruction.  And just a few lines below, KVM skips the
instruction if kvm_vcpu_map() returns -EFAULT.

As above, ideally that code would be converted to gracefully report the error,
but it's such a snafu that the easiest thing might be to change the "return ret;"
to "return 1;" until we fix all such KVM-on-HyperV code.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-17 20:15     ` Anish Moorthy
@ 2023-03-17 20:54       ` Sean Christopherson
  2023-03-17 23:42         ` Anish Moorthy
  0 siblings, 1 reply; 60+ messages in thread
From: Sean Christopherson @ 2023-03-17 20:54 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Oliver Upton, jthoughton, kvm

On Fri, Mar 17, 2023, Anish Moorthy wrote:
> On Fri, Mar 17, 2023 at 11:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> > > +  #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
> >
> > call it KVM_MEM_EXIT_ABSENT_MAPPING
> > ...
> > I'm not a fan of this architecture-specific dependency. Userspace is already
> > explicitly opting in to this behavior by way of the memslot flag. These sort
> > of exits are entirely orthogonal to the -EFAULT conversion earlier in the
> > series.
> 
> I'm not a fan of the semantics varying between architectures either:
> but the reason I have it like that (and that the EFAULT conversions
> exist in this series in the first place) is (a) not having
> KVM_CAP_MEMORY_FAULT_EXIT implemented for arm and (b) Sean's following
> statement from https://lore.kernel.org/kvm/Y%2FfS0eab7GG0NVKS@google.com/

Strictly speaking, if y'all buy my argument that the flag shouldn't control the
gup behavior, there won't be semantic differences for the memslot flag.  KVM will
(obviously) behavior differently if KVM_CAP_MEMORY_FAULT_EXIT is not set, but that
will hold true for x86 as well.  The only difference is that x86 will also support
an orthogonal flag that makes the fast-only memslot flag useful in practice.

So yeah, there will be an arch dependency, but only because arch code needs to
actually handle perform the exit, and that's true no matter what.

That said, there's zero reason to put X86 in the name.  Just add the capability
as KVM_CAP_MEMORY_FAULT_EXIT or whatever and mark it as x86 in the documentation.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-17 18:33     ` Anish Moorthy
  2023-03-17 19:30       ` Oliver Upton
@ 2023-03-17 21:50       ` Sean Christopherson
  2023-03-17 22:44         ` Anish Moorthy
  1 sibling, 1 reply; 60+ messages in thread
From: Sean Christopherson @ 2023-03-17 21:50 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Fri, Mar 17, 2023, Anish Moorthy wrote:
> On Thu, Mar 16, 2023 at 5:02 PM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
> > > +inline int kvm_memfault_exit_or_efault(
> > > +     struct kvm_vcpu *vcpu, uint64_t gpa, uint64_t len, uint64_t exit_flags)
> > > +{
> > > +     if (!(vcpu->kvm->memfault_exit_reasons & exit_flags))
> > > +             return -EFAULT;
> > > +     vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > > +     vcpu->run->memory_fault.gpa = gpa;
> > > +     vcpu->run->memory_fault.len = len;
> > > +     vcpu->run->memory_fault.flags = exit_flags;
> > > +     return -1;
> >
> > Why -1? 0? Anyway enum exit_fastpath_completion is x86 kvm mmu internal
> > convention. As WIP, it's okay for now, though.
> 
> The -1 isn't to indicate a failure in this function itself, but to
> allow callers to substitute this for "return -EFAULT." A return code
> of zero would mask errors and cause KVM to proceed in ways that it
> shouldn't. For instance, "setup_vmgexit_scratch" uses it like this
> 
> if (kvm_read_guest(svm->vcpu.kvm, scratch_gpa_beg, scratch_va, len)) {
>     ...
> -  return -EFAULT;
> + return kvm_memfault_exit_or_efault(...);
> }
> 
> and looking at one of its callers (sev_handle_vmgexit) shows how a
> return code of zero would cause a different control flow
> 
> case SVM_VMGEXIT_MMIO_READ:
> ret = setup_vmgexit_scratch(svm, true, control->exit_info_2);
> if (ret)
>     break;
> 
> ret = ret = kvm_sev_es_mmio_read(vcpu,

Hmm, I generally agree with Isaku, the helper should really return 0.  Returning
-1 might work, but it'll likely confuse userspace, and will definitely confuse
KVM developers.

The "0 means exit to userspace" behavior is definitely a pain though, and is likely
going to make this all extremely fragile.

I wonder if we can get away with returning -EFAULT, but still filling vcpu->run
with KVM_EXIT_MEMORY_FAULT and all the other metadata.  That would likely simplify
the implementation greatly, and would let KVM fill vcpu->run unconditonally.  KVM
would still need a capability to advertise support to userspace, but userspace
wouldn't need to opt in.  I think this may have been my very original though, and
I just never actually wrote it down...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit
  2023-03-17 19:53           ` Anish Moorthy
@ 2023-03-17 22:03             ` Sean Christopherson
  2023-03-20 15:56               ` Sean Christopherson
  0 siblings, 1 reply; 60+ messages in thread
From: Sean Christopherson @ 2023-03-17 22:03 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Oliver Upton, jthoughton, kvm, maz, Isaku Yamahata

On Fri, Mar 17, 2023, Anish Moorthy wrote:
> On Fri, Mar 17, 2023 at 12:00 PM David Matlack <dmatlack@google.com> wrote:
> > > The low-level accessors are common across architectures and can be called from
> > > other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT
> > > and convert that into an exit?
> >
> > Ya, as things stand today, the conversions _must_ be performed at the caller, as
> > there are (sadly) far too many flows where KVM squashes the error.  E.g. almost
> > all of x86's paravirt code just suppresses user memory faults :-(
> >
> > Anish, when we discussed this off-list, what I meant by limiting the intial support
> > to existing -EFAULT cases was limiting support to existing cases where KVM directly
> > returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever
> > returned _within KVM_ while handling KVM_RUN.  My apologies if I didn't make that clear.
> 
> Don't worry, we eventually got there off-list :)
> 
> This brings us back to my original set of questions. As has already
> been pointed out, I'll have to revisit my "Confident that needs
> conversion" changes and tweak them so that the vCPU exit is populated
> only for the call sites where the -EFAULT makes it to userspace. I
> still want feedback on if I've mis-identified any of the functions in
> my "EFAULT does not propagate to userspace" list and whether there are
> functions/callers in the "Still unsure if needs conversion" which do
> have return paths to KVM_RUN.

As you've probably gathered from the type of feedback you're receiving, identifying
the conversion touchpoints isn't going to be the long pole of this series.  Correctly
identifying all of the touchpoints may not be easy, but fixing any cases we get wrong
will likely be straightforward.  And realistically, no matter how many eyeballs look
at the code, odds are good we'll miss at least one case.  In other words, don't worry
too much about getting all the touchpoints correct on the first version.  Getting the
uAPI right is much more important.

And rather than rely on code review to get things right, we should be able to
detect issues programmatically.  E.g. use fault injection to make gup() and/or
uaccess fail (might even be wired up already?), and hack in a WARN in the KVM_RUN
path to assert that KVM_EXIT_MEMORY_FAULT is filled if the return code is -EFAULT
(assuming we go don't try to get KVM to return 0 everywhere), e.g. something like
the below would at least flag the "misses", although debug could still prove to be
annoying.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 67b890e54cf1..cccae0ad1436 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4100,6 +4100,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
                }
                r = kvm_arch_vcpu_ioctl_run(vcpu);
                trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
+               WARN_ON(r == -EFAULT &&
+                       vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT);
                break;
        }
        case KVM_GET_REGS: {


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-17 21:50       ` Sean Christopherson
@ 2023-03-17 22:44         ` Anish Moorthy
  2023-03-20 15:53           ` Sean Christopherson
  0 siblings, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-17 22:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Fri, Mar 17, 2023 at 2:50 PM Sean Christopherson <seanjc@google.com> wrote:
> I wonder if we can get away with returning -EFAULT, but still filling vcpu->run
> with KVM_EXIT_MEMORY_FAULT and all the other metadata.  That would likely simplify
> the implementation greatly, and would let KVM fill vcpu->run unconditonally.  KVM
> would still need a capability to advertise support to userspace, but userspace
> wouldn't need to opt in.  I think this may have been my very original though, and
> I just never actually wrote it down...

Oh, good to know that's actually an option. I thought of that too, but
assumed that returning a negative error code was a no-go for a proper
vCPU exit. But if that's not true then I think it's the obvious
solution because it precludes any uncaught behavior-change bugs.

A couple of notes
1. Since we'll likely miss some -EFAULT returns, we'll need to make
sure that the user can check for / doesn't see a stale
kvm_run::memory_fault field when a missed -EFAULT makes it to
userspace. It's a small and easy-to-fix detail, but I thought I'd
point it out.
2. I don't think this would simplify the series that much, since we
still need to find the call sites returning -EFAULT to userspace and
populate memory_fault only in those spots to avoid populating it for
-EFAULTs which don't make it to userspace. We *could* relax that
condition and just document that memory_fault should be ignored when
KVM_RUN does not return -EFAULT... but I don't think that's a good
solution from a coder/maintainer perspective.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-17 20:54       ` Sean Christopherson
@ 2023-03-17 23:42         ` Anish Moorthy
  2023-03-20 15:13           ` Sean Christopherson
  0 siblings, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-17 23:42 UTC (permalink / raw)
  To: Sean Christopherson, Oliver Upton; +Cc: jthoughton, kvm

On Fri, Mar 17, 2023 at 1:17 PM Sean Christopherson <seanjc@google.com> wrote:
>
> And as I argued in the last version[*], I am _strongly_ opposed to KVM speculating
> on why KVM is exiting to userspace.  I.e. KVM should not set a special flag if
> the memslot has "fast only" behavior.  The only thing the flag should do is control
> whether or not KVM tries slow paths, what KVM does in response to an unresolved
> fault should be an orthogonal thing.

I'm guessing you would want changes to patch 10 of this series [*]
then, right? Setting a bit/exit reason in kvm_run::memory_fault.flags
depending on whether the failure originated from a "fast only" fault
is... exactly what I'm doing :/ I'm not totally clear on your usages
of the word "flag" above though, the "KVM should not set a special
flag... the only thing *the* flag should do" part is throwing me off a
bit. What I think you're saying is

"KVM should not set a special bit in kvm_run::memory_fault.flags if
the memslot has fast-only behavior. The only thing
KVM_MEM_ABSENT_MAPPING_FAULT should do is..."

[1] https://lore.kernel.org/all/20230315021738.1151386-11-amoorthy@google.com/

On Fri, Mar 17, 2023 at 1:54 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Strictly speaking, if y'all buy my argument that the flag shouldn't control the
> gup behavior, there won't be semantic differences for the memslot flag.  KVM will
> (obviously) behavior differently if KVM_CAP_MEMORY_FAULT_EXIT is not set, but that
> will hold true for x86 as well.  The only difference is that x86 will also support
> an orthogonal flag that makes the fast-only memslot flag useful in practice.
>
> So yeah, there will be an arch dependency, but only because arch code needs to
> actually handle perform the exit, and that's true no matter what.
>
> That said, there's zero reason to put X86 in the name.  Just add the capability
> as KVM_CAP_MEMORY_FAULT_EXIT or whatever and mark it as x86 in the documentation.
>
> That said, there's zero reason to put X86 in the name.  Just add the capability
> as KVM_CAP_MEMORY_FAULT_EXIT or whatever and mark it as x86 in the documentation.

Again, a little confused on your first "flag" usage here. I figure you
can't mean the memslot flag because the whole point of that is to
control the GUP behavior, but I'm not sure what else you'd be
referring to.

Anyways the idea of having orthogonal features, one to -EFAULTing
early before a slow path and another to transform/augment -EFAULTs
into/with useful information does make sense to me. But I think the
issue here is that we want the fast-only memslot flag to be useful on
Arm as well, and with KVM_CAP_MEMORY_FAULT_NOWAIT written as it is now
there is a semantic differences between x86 and Arm.

I don't see a way to keep the two features here orthogonal on x86 and
linked on arm without keeping that semantic difference. Perhaps the
solution here is a bare-bones implementation of
KVM_CAP_MEMORY_FAULT_EXIT for Arm? All that actually *needs* to be
covered to resolve this difference is the one call site in
user_mem_abort. since KVM_CAP_MEMORY_FAULT_EXIT will be allowed to
have holes anyways.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-17 23:42         ` Anish Moorthy
@ 2023-03-20 15:13           ` Sean Christopherson
  2023-03-20 19:53             ` Anish Moorthy
  0 siblings, 1 reply; 60+ messages in thread
From: Sean Christopherson @ 2023-03-20 15:13 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Oliver Upton, jthoughton, kvm

On Fri, Mar 17, 2023, Anish Moorthy wrote:
> On Fri, Mar 17, 2023 at 1:17 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > And as I argued in the last version[*], I am _strongly_ opposed to KVM speculating
> > on why KVM is exiting to userspace.  I.e. KVM should not set a special flag if
> > the memslot has "fast only" behavior.  The only thing the flag should do is control
> > whether or not KVM tries slow paths, what KVM does in response to an unresolved
> > fault should be an orthogonal thing.
> 
> I'm guessing you would want changes to patch 10 of this series [*]
> then, right? Setting a bit/exit reason in kvm_run::memory_fault.flags
> depending on whether the failure originated from a "fast only" fault
> is... exactly what I'm doing :/ I'm not totally clear on your usages
> of the word "flag" above though, the "KVM should not set a special
> flag... the only thing *the* flag should do" part is throwing me off a
> bit. What I think you're saying is

Heh, the second "the flag" is referring to the memslot flag.  Rewriting the above:

  KVM should not set a special flag in kvm_run::memory_fault.flags ... the
  only thing KVM_MEM_FAST_FAULT_ONLY should do is ..."

> "KVM should not set a special bit in kvm_run::memory_fault.flags if
> the memslot has fast-only behavior. The only thing
> KVM_MEM_ABSENT_MAPPING_FAULT should do is..."
> 
> [1] https://lore.kernel.org/all/20230315021738.1151386-11-amoorthy@google.com/
> 
> On Fri, Mar 17, 2023 at 1:54 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Strictly speaking, if y'all buy my argument that the flag shouldn't control the
> > gup behavior, there won't be semantic differences for the memslot flag.  KVM will
> > (obviously) behavior differently if KVM_CAP_MEMORY_FAULT_EXIT is not set, but that
> > will hold true for x86 as well.  The only difference is that x86 will also support
> > an orthogonal flag that makes the fast-only memslot flag useful in practice.
> >
> > So yeah, there will be an arch dependency, but only because arch code needs to
> > actually handle perform the exit, and that's true no matter what.
> >
> > That said, there's zero reason to put X86 in the name.  Just add the capability
> > as KVM_CAP_MEMORY_FAULT_EXIT or whatever and mark it as x86 in the documentation.
> >
> > That said, there's zero reason to put X86 in the name.  Just add the capability
> > as KVM_CAP_MEMORY_FAULT_EXIT or whatever and mark it as x86 in the documentation.
> 
> Again, a little confused on your first "flag" usage here. I figure you
> can't mean the memslot flag because the whole point of that is to
> control the GUP behavior, but I'm not sure what else you'd be
> referring to.
> 
> Anyways the idea of having orthogonal features, one to -EFAULTing
> early before a slow path and another to transform/augment -EFAULTs
> into/with useful information does make sense to me. But I think the
> issue here is that we want the fast-only memslot flag to be useful on
> Arm as well, and with KVM_CAP_MEMORY_FAULT_NOWAIT written as it is now
> there is a semantic differences between x86 and Arm.

If and only if userspace enables the capability that transforms -EFAULT.

> I don't see a way to keep the two features here orthogonal on x86 and
> linked on arm without keeping that semantic difference. Perhaps the
> solution here is a bare-bones implementation of
> KVM_CAP_MEMORY_FAULT_EXIT for Arm? All that actually *needs* to be
> covered to resolve this difference is the one call site in
> user_mem_abort. since KVM_CAP_MEMORY_FAULT_EXIT will be allowed to
> have holes anyways.

As above, so long as userspace must opt into transforming -EFAULT, and can do
so independent of KVM_MEM_FAST_FAULT_ONLY (or whatever we call it), the behavior
of KVM_MEM_FAST_FAULT_ONLY itself is semantically identical across all
architectures.

KVM_MEM_FAST_FAULT_ONLY is obviously not very useful without precise information
about the failing address, but IMO that's not reason enough to tie the two
together.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-17 22:44         ` Anish Moorthy
@ 2023-03-20 15:53           ` Sean Christopherson
  2023-03-20 18:19             ` Anish Moorthy
  2023-03-20 22:11             ` Anish Moorthy
  0 siblings, 2 replies; 60+ messages in thread
From: Sean Christopherson @ 2023-03-20 15:53 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Fri, Mar 17, 2023, Anish Moorthy wrote:
> On Fri, Mar 17, 2023 at 2:50 PM Sean Christopherson <seanjc@google.com> wrote:
> > I wonder if we can get away with returning -EFAULT, but still filling vcpu->run
> > with KVM_EXIT_MEMORY_FAULT and all the other metadata.  That would likely simplify
> > the implementation greatly, and would let KVM fill vcpu->run unconditonally.  KVM
> > would still need a capability to advertise support to userspace, but userspace
> > wouldn't need to opt in.  I think this may have been my very original though, and
> > I just never actually wrote it down...
> 
> Oh, good to know that's actually an option. I thought of that too, but
> assumed that returning a negative error code was a no-go for a proper
> vCPU exit. But if that's not true then I think it's the obvious
> solution because it precludes any uncaught behavior-change bugs.
> 
> A couple of notes
> 1. Since we'll likely miss some -EFAULT returns, we'll need to make
> sure that the user can check for / doesn't see a stale
> kvm_run::memory_fault field when a missed -EFAULT makes it to
> userspace. It's a small and easy-to-fix detail, but I thought I'd
> point it out.

Ya, this is the main concern for me as well.  I'm not as confident that it's
easy-to-fix/avoid though.

> 2. I don't think this would simplify the series that much, since we
> still need to find the call sites returning -EFAULT to userspace and
> populate memory_fault only in those spots to avoid populating it for
> -EFAULTs which don't make it to userspace.

Filling kvm_run::memory_fault even if KVM never exits to userspace is perfectly
ok.  It's not ideal, but it's ok.

> We *could* relax that condition and just document that memory_fault should be
> ignored when KVM_RUN does not return -EFAULT... but I don't think that's a
> good solution from a coder/maintainer perspective.

You've got things backward.  memory_fault _must_ be ignored if KVM doesn't return
the associated "magic combo", where the magic value is either "0+KVM_EXIT_MEMORY_FAULT"
or "-EFAULT+KVM_EXIT_MEMORY_FAULT".

Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
never sees the data, i.e. userspace is completely unaware.  This behavior is not
ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
exiting to userspace can lead to other bugs, e.g. effective corruption of the
kvm_run union, but at least from a uABI perspective, the behavior is acceptable.

The reverse, userspace consuming kvm_run::memory_fault without being explicitly
told the data is valid, is not ok/safe.  KVM's contract is that fields contained
in kvm_run's big union are valid if and only if KVM returns '0' and the associated
exit reason is set in kvm_run::exit_reason.

From an ABI perspective, I don't see anything fundamentally wrong with bending
that rule slightly by saying that kvm_run::memory_fault is valid if KVM returns
-EFAULT+KVM_EXIT_MEMORY_FAULT.  It won't break existing userspace that is unaware
of KVM_EXIT_MEMORY_FAULT, and userspace can precisely check for the combination.

My big concern with piggybacking -EFAULT is that userspace will be fed stale if
KVM exits with -EFAULT in a patch that _doesn't_ fill kvm_run::memory_fault.
Returning a negative error code isn't hazardous in and of itself, e.g. KVM has
had bugs in the past where KVM returns '0' but doesn't fill kvm_run::exit_reason.
The big danger is that KVM has existing paths that return -EFAULT, i.e. we can
introduce bugs simply by doing nothing, whereas returning '0' would largely be
limited to new code.

The counter-argument is that propagating '0' correctly up the stack carries its
own risk due to plenty of code correctly treating '0' as "success" and not "exit
to userspace".

And we can mitigate the risk of using -EFAULT.  E.g. fill in kvm_run::memory_fault
even if we are 99.9999% confident the -EFAULT can't get out to userspace in the
context of KVM_RUN, and set kvm_run::exit_reason to some arbitrary value at the
start of KVM_RUN to prevent reusing memory_fault from a previous userspace exit.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit
  2023-03-17 22:03             ` Sean Christopherson
@ 2023-03-20 15:56               ` Sean Christopherson
  0 siblings, 0 replies; 60+ messages in thread
From: Sean Christopherson @ 2023-03-20 15:56 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Oliver Upton, jthoughton, kvm, maz, Isaku Yamahata

On Fri, Mar 17, 2023, Sean Christopherson wrote:
> On Fri, Mar 17, 2023, Anish Moorthy wrote:
> > On Fri, Mar 17, 2023 at 12:00 PM David Matlack <dmatlack@google.com> wrote:
> > > > The low-level accessors are common across architectures and can be called from
> > > > other contexts besides a vCPU. Is it possible for the caller to catch -EFAULT
> > > > and convert that into an exit?
> > >
> > > Ya, as things stand today, the conversions _must_ be performed at the caller, as
> > > there are (sadly) far too many flows where KVM squashes the error.  E.g. almost
> > > all of x86's paravirt code just suppresses user memory faults :-(
> > >
> > > Anish, when we discussed this off-list, what I meant by limiting the intial support
> > > to existing -EFAULT cases was limiting support to existing cases where KVM directly
> > > returns -EFAULT to userspace, not to all existing cases where -EFAULT is ever
> > > returned _within KVM_ while handling KVM_RUN.  My apologies if I didn't make that clear.
> > 
> > Don't worry, we eventually got there off-list :)
> > 
> > This brings us back to my original set of questions. As has already
> > been pointed out, I'll have to revisit my "Confident that needs
> > conversion" changes and tweak them so that the vCPU exit is populated
> > only for the call sites where the -EFAULT makes it to userspace. I
> > still want feedback on if I've mis-identified any of the functions in
> > my "EFAULT does not propagate to userspace" list and whether there are
> > functions/callers in the "Still unsure if needs conversion" which do
> > have return paths to KVM_RUN.
> 
> As you've probably gathered from the type of feedback you're receiving, identifying
> the conversion touchpoints isn't going to be the long pole of this series.  Correctly
> identifying all of the touchpoints may not be easy, but fixing any cases we get wrong
> will likely be straightforward.  And realistically, no matter how many eyeballs look
> at the code, odds are good we'll miss at least one case.  In other words, don't worry
> too much about getting all the touchpoints correct on the first version.  Getting the
> uAPI right is much more important.
> 
> And rather than rely on code review to get things right, we should be able to
> detect issues programmatically.  E.g. use fault injection to make gup() and/or
> uaccess fail (might even be wired up already?), and hack in a WARN in the KVM_RUN
> path to assert that KVM_EXIT_MEMORY_FAULT is filled if the return code is -EFAULT
> (assuming we go don't try to get KVM to return 0 everywhere), e.g. something like
> the below would at least flag the "misses", although debug could still prove to be
> annoying.
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 67b890e54cf1..cccae0ad1436 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4100,6 +4100,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
>                 }
>                 r = kvm_arch_vcpu_ioctl_run(vcpu);
>                 trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
> +               WARN_ON(r == -EFAULT &&
> +                       vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT);

Gah, I inverted the second check, this should be 

		WARN_ON(r == -EFAULT &&
			vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
		
>                 break;
>         }
>         case KVM_GET_REGS: {
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-20 15:53           ` Sean Christopherson
@ 2023-03-20 18:19             ` Anish Moorthy
  2023-03-20 22:11             ` Anish Moorthy
  1 sibling, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-20 18:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Mar 17, 2023, Anish Moorthy wrote:
> > On Fri, Mar 17, 2023 at 2:50 PM Sean Christopherson <seanjc@google.com> wrote:
> > > I wonder if we can get away with returning -EFAULT, but still filling vcpu->run
> > > with KVM_EXIT_MEMORY_FAULT and all the other metadata.  That would likely simplify
> > > the implementation greatly, and would let KVM fill vcpu->run unconditonally.  KVM
> > > would still need a capability to advertise support to userspace, but userspace
> > > wouldn't need to opt in.  I think this may have been my very original though, and
> > > I just never actually wrote it down...
> >
> > Oh, good to know that's actually an option. I thought of that too, but
> > assumed that returning a negative error code was a no-go for a proper
> > vCPU exit. But if that's not true then I think it's the obvious
> > solution because it precludes any uncaught behavior-change bugs.
> >
> > A couple of notes
> > 1. Since we'll likely miss some -EFAULT returns, we'll need to make
> > sure that the user can check for / doesn't see a stale
> > kvm_run::memory_fault field when a missed -EFAULT makes it to
> > userspace. It's a small and easy-to-fix detail, but I thought I'd
> > point it out.
>
> Ya, this is the main concern for me as well.  I'm not as confident that it's
> easy-to-fix/avoid though.
>
> > 2. I don't think this would simplify the series that much, since we
> > still need to find the call sites returning -EFAULT to userspace and
> > populate memory_fault only in those spots to avoid populating it for
> > -EFAULTs which don't make it to userspace.
>
> Filling kvm_run::memory_fault even if KVM never exits to userspace is perfectly
> ok.  It's not ideal, but it's ok.

Right- I was just pointing out that doing so could mislead readers of
the code if they assume that kvm_run::memory_fault is populated iff it
was going to be associated w/ an exit to userspace," which I know I
would.

> > We *could* relax that condition and just document that memory_fault should be
> > ignored when KVM_RUN does not return -EFAULT... but I don't think that's a
> > good solution from a coder/maintainer perspective.
>
> You've got things backward.  memory_fault _must_ be ignored if KVM doesn't return
> the associated "magic combo", where the magic value is either "0+KVM_EXIT_MEMORY_FAULT"
> or "-EFAULT+KVM_EXIT_MEMORY_FAULT".

I think we're saying the same thing- I was using "should" to mean "must."

> Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> never sees the data, i.e. userspace is completely unaware.  This behavior is not
> ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> exiting to userspace can lead to other bugs, e.g. effective corruption of the
> kvm_run union

Ooh, I didn't think of the corruption issue here: thanks for pointing it out.

> but at least from a uABI perspective, the behavior is acceptable.

This does complicate things for KVM implementation though, right? In
particular, we'd have to make sure that KVM_RUN never conditionally
modifies its return value/exit reason based on reads from kvm_run:
that seems like a slightly weird thing to do, but I don't want to
assume anything here.

Anyways, unless that's not (and never will be) a problem, allowing
corruption of kvm_run seems very risky.

> The reverse, userspace consuming kvm_run::memory_fault without being explicitly
> told the data is valid, is not ok/safe.  KVM's contract is that fields contained
> in kvm_run's big union are valid if and only if KVM returns '0' and the associated
> exit reason is set in kvm_run::exit_reason.
>
> From an ABI perspective, I don't see anything fundamentally wrong with bending
> that rule slightly by saying that kvm_run::memory_fault is valid if KVM returns
> -EFAULT+KVM_EXIT_MEMORY_FAULT.  It won't break existing userspace that is unaware
> of KVM_EXIT_MEMORY_FAULT, and userspace can precisely check for the combination.
>
> My big concern with piggybacking -EFAULT is that userspace will be fed stale if
> KVM exits with -EFAULT in a patch that _doesn't_ fill kvm_run::memory_fault.
> Returning a negative error code isn't hazardous in and of itself, e.g. KVM has
> had bugs in the past where KVM returns '0' but doesn't fill kvm_run::exit_reason.
> The big danger is that KVM has existing paths that return -EFAULT, i.e. we can
> introduce bugs simply by doing nothing, whereas returning '0' would largely be
> limited to new code.
>
> The counter-argument is that propagating '0' correctly up the stack carries its
> own risk due to plenty of code correctly treating '0' as "success" and not "exit
> to userspace".
>
> And we can mitigate the risk of using -EFAULT.  E.g. fill in kvm_run::memory_fault
> even if we are 99.9999% confident the -EFAULT can't get out to userspace in the
> context of KVM_RUN, and set kvm_run::exit_reason to some arbitrary value at the
> start of KVM_RUN to prevent reusing memory_fault from a previous userspace exit.

Right, this is what I had in mind when I called this "small and
easy-to-fix." Piggybacking -EFAULT seems like the right thing to do to
me, but I'm still uneasy about possibly corrupting kvm_run for masked
-EFAULTS.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-20 15:13           ` Sean Christopherson
@ 2023-03-20 19:53             ` Anish Moorthy
  0 siblings, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-03-20 19:53 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Oliver Upton, jthoughton, kvm

On Mon, Mar 20, 2023 at 8:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Mar 17, 2023, Anish Moorthy wrote:
> > On Fri, Mar 17, 2023 at 1:17 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > And as I argued in the last version[*], I am _strongly_ opposed to KVM speculating
> > > on why KVM is exiting to userspace.  I.e. KVM should not set a special flag if
> > > the memslot has "fast only" behavior.  The only thing the flag should do is control
> > > whether or not KVM tries slow paths, what KVM does in response to an unresolved
> > > fault should be an orthogonal thing.
> >
> > I'm guessing you would want changes to patch 10 of this series [*]
> > then, right? Setting a bit/exit reason in kvm_run::memory_fault.flags
> > depending on whether the failure originated from a "fast only" fault
> > is... exactly what I'm doing :/ I'm not totally clear on your usages
> > of the word "flag" above though, the "KVM should not set a special
> > flag... the only thing *the* flag should do" part is throwing me off a
> > bit. What I think you're saying is
>
> Heh, the second "the flag" is referring to the memslot flag.  Rewriting the above:
>
>   KVM should not set a special flag in kvm_run::memory_fault.flags ... the
>   only thing KVM_MEM_FAST_FAULT_ONLY should do is ..."
>
> > "KVM should not set a special bit in kvm_run::memory_fault.flags if
> > the memslot has fast-only behavior. The only thing
> > KVM_MEM_ABSENT_MAPPING_FAULT should do is..."
> >
> > [1] https://lore.kernel.org/all/20230315021738.1151386-11-amoorthy@google.com/

Ok so, just to be clear, you are not opposed to

(a) all -EFAULTs from kvm_faultin_pfn populating the
kvm_run.memory_fault and setting kvm_run.memory_fault.flags to, say,
FAULTIN_FAILURE if/when kvm_cap_memory_fault_exit is enabled

but *are* opposed to

(b) the combination of the memslot flag and kvm_cap_memory_fault_exit
providing any additional information on top of: for instance, a
kvm_run.memory_fault.flags of FAULTIN_FAILURE & FAST_FAULT_ONLY.

Is that right?


> > On Fri, Mar 17, 2023 at 1:54 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Strictly speaking, if y'all buy my argument that the flag shouldn't control the
> > > gup behavior, there won't be semantic differences for the memslot flag.  KVM will
> > > (obviously) behavior differently if KVM_CAP_MEMORY_FAULT_EXIT is not set, but that
> > > will hold true for x86 as well.  The only difference is that x86 will also support
> > > an orthogonal flag that makes the fast-only memslot flag useful in practice.
> > >
> > > So yeah, there will be an arch dependency, but only because arch code needs to
> > > actually handle perform the exit, and that's true no matter what.
> > >
> > > That said, there's zero reason to put X86 in the name.  Just add the capability
> > > as KVM_CAP_MEMORY_FAULT_EXIT or whatever and mark it as x86 in the documentation.
> > >
> > > That said, there's zero reason to put X86 in the name.  Just add the capability
> > > as KVM_CAP_MEMORY_FAULT_EXIT or whatever and mark it as x86 in the documentation.
> >
> > Again, a little confused on your first "flag" usage here. I figure you
> > can't mean the memslot flag because the whole point of that is to
> > control the GUP behavior, but I'm not sure what else you'd be
> > referring to.
> >
> > Anyways the idea of having orthogonal features, one to -EFAULTing
> > early before a slow path and another to transform/augment -EFAULTs
> > into/with useful information does make sense to me. But I think the
> > issue here is that we want the fast-only memslot flag to be useful on
> > Arm as well, and with KVM_CAP_MEMORY_FAULT_NOWAIT written as it is now
> > there is a semantic differences between x86 and Arm.
>
> If and only if userspace enables the capability that transforms -EFAULT.
>
> > I don't see a way to keep the two features here orthogonal on x86 and
> > linked on arm without keeping that semantic difference. Perhaps the
> > solution here is a bare-bones implementation of
> > KVM_CAP_MEMORY_FAULT_EXIT for Arm? All that actually *needs* to be
> > covered to resolve this difference is the one call site in
> > user_mem_abort. since KVM_CAP_MEMORY_FAULT_EXIT will be allowed to
> > have holes anyways.
>
> As above, so long as userspace must opt into transforming -EFAULT, and can do
> so independent of KVM_MEM_FAST_FAULT_ONLY (or whatever we call it), the behavior
> of KVM_MEM_FAST_FAULT_ONLY itself is semantically identical across all
> architectures.
>
> KVM_MEM_FAST_FAULT_ONLY is obviously not very useful without precise information
> about the failing address, but IMO that's not reason enough to tie the two
> together.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-20 15:53           ` Sean Christopherson
  2023-03-20 18:19             ` Anish Moorthy
@ 2023-03-20 22:11             ` Anish Moorthy
  2023-03-21 15:21               ` Sean Christopherson
  1 sibling, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-20 22:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Mar 17, 2023, Anish Moorthy wrote:
> > On Fri, Mar 17, 2023 at 2:50 PM Sean Christopherson <seanjc@google.com> wrote:
> > > I wonder if we can get away with returning -EFAULT, but still filling vcpu->run
> > > with KVM_EXIT_MEMORY_FAULT and all the other metadata.  That would likely simplify
> > > the implementation greatly, and would let KVM fill vcpu->run unconditonally.  KVM
> > > would still need a capability to advertise support to userspace, but userspace
> > > wouldn't need to opt in.  I think this may have been my very original though, and
> > > I just never actually wrote it down...
> >
> > Oh, good to know that's actually an option. I thought of that too, but
> > assumed that returning a negative error code was a no-go for a proper
> > vCPU exit. But if that's not true then I think it's the obvious
> > solution because it precludes any uncaught behavior-change bugs.
> >
> > A couple of notes
> > 1. Since we'll likely miss some -EFAULT returns, we'll need to make
> > sure that the user can check for / doesn't see a stale
> > kvm_run::memory_fault field when a missed -EFAULT makes it to
> > userspace. It's a small and easy-to-fix detail, but I thought I'd
> > point it out.
>
> Ya, this is the main concern for me as well.  I'm not as confident that it's
> easy-to-fix/avoid though.
>
> > 2. I don't think this would simplify the series that much, since we
> > still need to find the call sites returning -EFAULT to userspace and
> > populate memory_fault only in those spots to avoid populating it for
> > -EFAULTs which don't make it to userspace.
>
> Filling kvm_run::memory_fault even if KVM never exits to userspace is perfectly
> ok.  It's not ideal, but it's ok.
>
> > We *could* relax that condition and just document that memory_fault should be
> > ignored when KVM_RUN does not return -EFAULT... but I don't think that's a
> > good solution from a coder/maintainer perspective.
>
> You've got things backward.  memory_fault _must_ be ignored if KVM doesn't return
> the associated "magic combo", where the magic value is either "0+KVM_EXIT_MEMORY_FAULT"
> or "-EFAULT+KVM_EXIT_MEMORY_FAULT".
>
> Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> never sees the data, i.e. userspace is completely unaware.  This behavior is not
> ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> exiting to userspace can lead to other bugs, e.g. effective corruption of the
> kvm_run union, but at least from a uABI perspective, the behavior is acceptable.

Actually, I don't think the idea of filling in kvm_run.memory_fault
for -EFAULTs which don't make it to userspace works at all. Consider
the direct_map function, which bubbles its -EFAULT to
kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called from both
kvm_arch_async_page_ready (which ignores the return value), and by
kvm_mmu_page_fault (where the return value does make it to userspace).
Populating kvm_run.memory_fault anywhere in or under
kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
already-set kvm_run.memory_fault / other kvm_run field.

That in turn looks problematic for the
memory-fault-exit-on-fast-gup-failure part of this series, because
there are at least a couple of cases for which kvm_mmu_do_page_fault
will -EFAULT. One is the early-efault-on-fast-gup-failure case which
was the original purpose of this series. Another is a -EFAULT from
FNAME(fetch) (passed up through FNAME(page_fault)). There might be
other cases as well. But unless userspace can/should resolve *all*
such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
"kvm_mmu_page_fault" wouldn't be actionable. At least, not without a
whole lot of plumbing code to make it so.

Sean, am I missing anything here?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-17 20:17     ` Sean Christopherson
@ 2023-03-20 22:22       ` Oliver Upton
  2023-03-21 14:50         ` Sean Christopherson
  0 siblings, 1 reply; 60+ messages in thread
From: Oliver Upton @ 2023-03-20 22:22 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Anish Moorthy, jthoughton, kvm

Sean,

On Fri, Mar 17, 2023 at 01:17:22PM -0700, Sean Christopherson wrote:
> On Fri, Mar 17, 2023, Oliver Upton wrote:
> > On Wed, Mar 15, 2023 at 02:17:33AM +0000, Anish Moorthy wrote:
> > > Add documentation, memslot flags, useful helper functions, and the
> > > actual new capability itself.
> > > 
> > > Memory fault exits on absent mappings are particularly useful for
> > > userfaultfd-based live migration postcopy. When many vCPUs fault upon a
> > > single userfaultfd the faults can take a while to surface to userspace
> > > due to having to contend for uffd wait queue locks. Bypassing the uffd
> > > entirely by triggering a vCPU exit avoids this contention and can improve
> > > the fault rate by as much as 10x.
> > > ---
> > >  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++++++---
> > >  include/linux/kvm_host.h       |  6 ++++++
> > >  include/uapi/linux/kvm.h       |  3 +++
> > >  tools/include/uapi/linux/kvm.h |  2 ++
> > >  virt/kvm/kvm_main.c            |  7 ++++++-
> > >  5 files changed, 51 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index f9ca18bbec879..4932c0f62eb3d 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
> > >    /* for kvm_userspace_memory_region::flags */
> > >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> > >    #define KVM_MEM_READONLY	(1UL << 1)
> > > +  #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
> > 
> > call it KVM_MEM_EXIT_ABSENT_MAPPING
> 
> Ooh, look, a bikeshed!  :-)

Couldn't help myself :)

> I don't think it should have "EXIT" in the name.  The exit to userspace is a side
> effect, e.g. KVM already exits to userspace on unresolved userfaults.  The only
> thing this knob _directly_ controls is whether or not KVM attempts the slow path.
> If we give the flag a name like "exit on absent userspace mappings", then KVM will
> appear to do the wrong thing when KVM exits on a truly absent userspace mapping.
> 
> And as I argued in the last version[*], I am _strongly_ opposed to KVM speculating
> on why KVM is exiting to userspace.  I.e. KVM should not set a special flag if
> the memslot has "fast only" behavior.  The only thing the flag should do is control
> whether or not KVM tries slow paths, what KVM does in response to an unresolved
> fault should be an orthogonal thing.
> 
> E.g. If KVM encounters an unmapped page while prefetching SPTEs, KVM will (correctly)
> not exit to userspace and instead simply terminate the prefetch.  Obviously we
> could solve that through documentation, but I don't see any benefit in making this
> more complex than it needs to be.

I couldn't care less about what the user-facing portion of this thing is
called, TBH. We could just refer to it as KVM_MEM_BIT_2 /s

The only bit I wanted to avoid is having a collision in the kernel between
literal faults arising from hardware and exits to userspace that we are also
calling 'faults'.

> [*] https://lkml.kernel.org/r/Y%2B0RYMfw6pHrSLX4%40google.com
> 
> > > +7.35 KVM_CAP_MEMORY_FAULT_NOWAIT
> > > +--------------------------------
> > > +
> > > +:Architectures: x86, arm64
> > > +:Returns: -EINVAL.
> > > +
> > > +The presence of this capability indicates that userspace may pass the
> > > +KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> > > +to exit to populate 'kvm_run.memory_fault' and exit to userspace (*) in response
> > > +to page faults for which the userspace page tables do not contain present
> > > +mappings. Attempting to enable the capability directly will fail.
> > > +
> > > +The 'gpa' and 'len' fields of kvm_run.memory_fault will be set to the starting
> > > +address and length (in bytes) of the faulting page. 'flags' will be set to
> > > +KVM_MEMFAULT_REASON_ABSENT_MAPPING.
> > > +
> > > +Userspace should determine how best to make the mapping present, then take
> > > +appropriate action. For instance, in the case of absent mappings this might
> > > +involve establishing the mapping for the first time via UFFDIO_COPY/CONTINUE or
> > > +faulting the mapping in using MADV_POPULATE_READ/WRITE. After establishing the
> > > +mapping, userspace can return to KVM to retry the previous memory access.
> > > +
> > > +(*) NOTE: On x86, KVM_CAP_X86_MEMORY_FAULT_EXIT must be enabled for the
> > > +KVM_MEMFAULT_REASON_ABSENT_MAPPING_reason: otherwise userspace will only receive
> > > +a -EFAULT from KVM_RUN without any useful information.
> > 
> > I'm not a fan of this architecture-specific dependency. Userspace is already
> > explicitly opting in to this behavior by way of the memslot flag. These sort
> > of exits are entirely orthogonal to the -EFAULT conversion earlier in the
> > series.
> 
> Ya, yet another reason not to speculate on why KVM wasn't able to resolve a fault.

Regardless of what we name this memslot flag, we're already getting explicit
opt-in from userspace for new behavior. There seems to be zero value in
supporting memslot_flag && !MEMORY_FAULT_EXIT (i.e. returning EFAULT),
so why even bother?

Requiring two levels of opt-in to have the intended outcome for a single
architecture seems nauseating from a userspace perspective.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-20 22:22       ` Oliver Upton
@ 2023-03-21 14:50         ` Sean Christopherson
  2023-03-21 20:23           ` Oliver Upton
  0 siblings, 1 reply; 60+ messages in thread
From: Sean Christopherson @ 2023-03-21 14:50 UTC (permalink / raw)
  To: Oliver Upton; +Cc: Anish Moorthy, jthoughton, kvm

On Mon, Mar 20, 2023, Oliver Upton wrote:
> On Fri, Mar 17, 2023 at 01:17:22PM -0700, Sean Christopherson wrote:
> > On Fri, Mar 17, 2023, Oliver Upton wrote:
> > > I'm not a fan of this architecture-specific dependency. Userspace is already
> > > explicitly opting in to this behavior by way of the memslot flag. These sort
> > > of exits are entirely orthogonal to the -EFAULT conversion earlier in the
> > > series.
> > 
> > Ya, yet another reason not to speculate on why KVM wasn't able to resolve a fault.
> 
> Regardless of what we name this memslot flag, we're already getting explicit
> opt-in from userspace for new behavior. There seems to be zero value in
> supporting memslot_flag && !MEMORY_FAULT_EXIT (i.e. returning EFAULT),
> so why even bother?

Because there are use cases for MEMORY_FAULT_EXIT beyond fast-only gup.  We could
have the memslot feature depend on the MEMORY_FAULT_EXIT capability, but I don't
see how that adds value for either KVM or userspace.

Filling MEMORY_FAULT_EXIT iff the memslot flag is set would also lead to a weird
ABI and/or funky KVM code.  E.g. if MEMORY_FAULT_EXIT is tied to the fast-only
memslot flag, what's the defined behavior if the gfn=>hva translation fails?  KVM
hasn't actually tried to gup() anything.  Obviously not the end of the world, but
I'd prefer not to avoid introduce more odditites into KVM, however minor.

> Requiring two levels of opt-in to have the intended outcome for a single
> architecture seems nauseating from a userspace perspective.

If we usurp -EFAULT, I don't think we'll actually need an opt-in for
MEMORY_FAULT_EXIT.  KVM will need to add a capability so that userspace can query
KVM support, but the actual filling of of kvm_run could be done unconditionally.

Even if we do end up making the behavior opt-in, I would expect them to be largely
orthogonal in userspace.  E.g. userspace would always enable MEMORY_FAULT_EXIT
during startup, and then toggle the memslot flag during postcopy.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-20 22:11             ` Anish Moorthy
@ 2023-03-21 15:21               ` Sean Christopherson
  2023-03-21 18:01                 ` Anish Moorthy
  0 siblings, 1 reply; 60+ messages in thread
From: Sean Christopherson @ 2023-03-21 15:21 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Mon, Mar 20, 2023, Anish Moorthy wrote:
> On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Mar 17, 2023, Anish Moorthy wrote:
> > > On Fri, Mar 17, 2023 at 2:50 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > I wonder if we can get away with returning -EFAULT, but still filling vcpu->run
> > > > with KVM_EXIT_MEMORY_FAULT and all the other metadata.  That would likely simplify
> > > > the implementation greatly, and would let KVM fill vcpu->run unconditonally.  KVM
> > > > would still need a capability to advertise support to userspace, but userspace
> > > > wouldn't need to opt in.  I think this may have been my very original though, and
> > > > I just never actually wrote it down...
> > >
> > > Oh, good to know that's actually an option. I thought of that too, but
> > > assumed that returning a negative error code was a no-go for a proper
> > > vCPU exit. But if that's not true then I think it's the obvious
> > > solution because it precludes any uncaught behavior-change bugs.
> > >
> > > A couple of notes
> > > 1. Since we'll likely miss some -EFAULT returns, we'll need to make
> > > sure that the user can check for / doesn't see a stale
> > > kvm_run::memory_fault field when a missed -EFAULT makes it to
> > > userspace. It's a small and easy-to-fix detail, but I thought I'd
> > > point it out.
> >
> > Ya, this is the main concern for me as well.  I'm not as confident that it's
> > easy-to-fix/avoid though.
> >
> > > 2. I don't think this would simplify the series that much, since we
> > > still need to find the call sites returning -EFAULT to userspace and
> > > populate memory_fault only in those spots to avoid populating it for
> > > -EFAULTs which don't make it to userspace.
> >
> > Filling kvm_run::memory_fault even if KVM never exits to userspace is perfectly
> > ok.  It's not ideal, but it's ok.
> >
> > > We *could* relax that condition and just document that memory_fault should be
> > > ignored when KVM_RUN does not return -EFAULT... but I don't think that's a
> > > good solution from a coder/maintainer perspective.
> >
> > You've got things backward.  memory_fault _must_ be ignored if KVM doesn't return
> > the associated "magic combo", where the magic value is either "0+KVM_EXIT_MEMORY_FAULT"
> > or "-EFAULT+KVM_EXIT_MEMORY_FAULT".
> >
> > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > never sees the data, i.e. userspace is completely unaware.  This behavior is not
> > ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > kvm_run union, but at least from a uABI perspective, the behavior is acceptable.
> 
> Actually, I don't think the idea of filling in kvm_run.memory_fault
> for -EFAULTs which don't make it to userspace works at all. Consider
> the direct_map function, which bubbles its -EFAULT to
> kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called from both
> kvm_arch_async_page_ready (which ignores the return value), and by
> kvm_mmu_page_fault (where the return value does make it to userspace).
> Populating kvm_run.memory_fault anywhere in or under
> kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> already-set kvm_run.memory_fault / other kvm_run field.

This particular case is a non-issue.  kvm_check_async_pf_completion() is called
only when the current task has control of the vCPU, i.e. is the current "running"
vCPU.  That's not a coincidence either, invoking kvm_mmu_do_page_fault() without
having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
context would be unstable.

That will hold true for all cases.  Using a vCPU that is not loaded (not the
current "running" vCPU in KVM's misleading terminology) to access guest memory is
simply not safe, as the vCPU state is non-deterministic.  There are paths where
KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
and making requests, but those are very controlled flows with dedicated machinery
to make them SMP safe.

That said, I agree that there's a risk that KVM could clobber vcpu->run_run by
hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
without the target vCPU being loaded:

	int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
	{
		preempt_disable();
		if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
			goto out;

		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
		...
	out:
		preempt_enable();
		return -EFAULT;
	}

FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
that KVM "immediately" exits to userspace isn't ideal, but given the amount of
historical code that we need to deal with, it seems like the lesser of all evils.
Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
better failure mode than KVM not filling kvm_run when it should, i.e. false
positives are ok, false negatives are fatal.

> That in turn looks problematic for the
> memory-fault-exit-on-fast-gup-failure part of this series, because
> there are at least a couple of cases for which kvm_mmu_do_page_fault
> will -EFAULT. One is the early-efault-on-fast-gup-failure case which
> was the original purpose of this series. Another is a -EFAULT from
> FNAME(fetch) (passed up through FNAME(page_fault)). There might be
> other cases as well. But unless userspace can/should resolve *all*
> such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
> "kvm_mmu_page_fault" wouldn't be actionable.

Killing the VM, which is what all VMMs do today in response to -EFAULT, is an
action.  As I've pointed out elsewhere in this thread, userspace needs to be able
to identify "faults" that it (userspace) can resolve without a hint from KVM.

In other words, KVM is still returning -EFAULT (or a variant thereof), the _only_
difference, for all intents and purposes, is that userspace is given a bit more
information about the source of the -EFAULT.

> At least, not without a whole lot of plumbing code to make it so.

Plumbing where?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-21 15:21               ` Sean Christopherson
@ 2023-03-21 18:01                 ` Anish Moorthy
  2023-03-21 19:43                   ` Sean Christopherson
  0 siblings, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-21 18:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, Mar 17, 2023, Anish Moorthy wrote:
> > > > On Fri, Mar 17, 2023 at 2:50 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > I wonder if we can get away with returning -EFAULT, but still filling vcpu->run
> > > > > with KVM_EXIT_MEMORY_FAULT and all the other metadata.  That would likely simplify
> > > > > the implementation greatly, and would let KVM fill vcpu->run unconditonally.  KVM
> > > > > would still need a capability to advertise support to userspace, but userspace
> > > > > wouldn't need to opt in.  I think this may have been my very original though, and
> > > > > I just never actually wrote it down...
> > > >
> > > > Oh, good to know that's actually an option. I thought of that too, but
> > > > assumed that returning a negative error code was a no-go for a proper
> > > > vCPU exit. But if that's not true then I think it's the obvious
> > > > solution because it precludes any uncaught behavior-change bugs.
> > > >
> > > > A couple of notes
> > > > 1. Since we'll likely miss some -EFAULT returns, we'll need to make
> > > > sure that the user can check for / doesn't see a stale
> > > > kvm_run::memory_fault field when a missed -EFAULT makes it to
> > > > userspace. It's a small and easy-to-fix detail, but I thought I'd
> > > > point it out.
> > >
> > > Ya, this is the main concern for me as well.  I'm not as confident that it's
> > > easy-to-fix/avoid though.
> > >
> > > > 2. I don't think this would simplify the series that much, since we
> > > > still need to find the call sites returning -EFAULT to userspace and
> > > > populate memory_fault only in those spots to avoid populating it for
> > > > -EFAULTs which don't make it to userspace.
> > >
> > > Filling kvm_run::memory_fault even if KVM never exits to userspace is perfectly
> > > ok.  It's not ideal, but it's ok.
> > >
> > > > We *could* relax that condition and just document that memory_fault should be
> > > > ignored when KVM_RUN does not return -EFAULT... but I don't think that's a
> > > > good solution from a coder/maintainer perspective.
> > >
> > > You've got things backward.  memory_fault _must_ be ignored if KVM doesn't return
> > > the associated "magic combo", where the magic value is either "0+KVM_EXIT_MEMORY_FAULT"
> > > or "-EFAULT+KVM_EXIT_MEMORY_FAULT".
> > >
> > > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > > never sees the data, i.e. userspace is completely unaware.  This behavior is not
> > > ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> > > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > > kvm_run union, but at least from a uABI perspective, the behavior is acceptable.
> >
> > Actually, I don't think the idea of filling in kvm_run.memory_fault
> > for -EFAULTs which don't make it to userspace works at all. Consider
> > the direct_map function, which bubbles its -EFAULT to
> > kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called from both
> > kvm_arch_async_page_ready (which ignores the return value), and by
> > kvm_mmu_page_fault (where the return value does make it to userspace).
> > Populating kvm_run.memory_fault anywhere in or under
> > kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> > kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > already-set kvm_run.memory_fault / other kvm_run field.
>
> This particular case is a non-issue.  kvm_check_async_pf_completion() is called
> only when the current task has control of the vCPU, i.e. is the current "running"
> vCPU.  That's not a coincidence either, invoking kvm_mmu_do_page_fault() without
> having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
> context would be unstable.
>
> That will hold true for all cases.  Using a vCPU that is not loaded (not the
> current "running" vCPU in KVM's misleading terminology) to access guest memory is
> simply not safe, as the vCPU state is non-deterministic.  There are paths where
> KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
> and making requests, but those are very controlled flows with dedicated machinery
> to make them SMP safe.
>
> That said, I agree that there's a risk that KVM could clobber vcpu->run_run by
> hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> without the target vCPU being loaded:
>
>         int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
>         {
>                 preempt_disable();
>                 if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
>                         goto out;
>
>                 vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
>                 ...
>         out:
>                 preempt_enable();
>                 return -EFAULT;
>         }
>
> FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> historical code that we need to deal with, it seems like the lesser of all evils.
> Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> better failure mode than KVM not filling kvm_run when it should, i.e. false
> positives are ok, false negatives are fatal.

Don't you have this in reverse? False negatives will just result in
userspace not having useful extra information for the -EFAULT it
receives from KVM_RUN, in which case userspace can do what you
mentioned all VMMs do today and just terminate the VM. Whereas a false
positive might cause a double-write to the KVM_RUN struct, either
putting incorrect information in kvm_run.memory_fault or corrupting
another member of the union.

> > That in turn looks problematic for the
> > memory-fault-exit-on-fast-gup-failure part of this series, because
> > there are at least a couple of cases for which kvm_mmu_do_page_fault
> > will -EFAULT. One is the early-efault-on-fast-gup-failure case which
> > was the original purpose of this series. Another is a -EFAULT from
> > FNAME(fetch) (passed up through FNAME(page_fault)). There might be
> > other cases as well. But unless userspace can/should resolve *all*
> > such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
> > "kvm_mmu_page_fault" wouldn't be actionable.
>
> Killing the VM, which is what all VMMs do today in response to -EFAULT, is an
> action.  As I've pointed out elsewhere in this thread, userspace needs to be able
> to identify "faults" that it (userspace) can resolve without a hint from KVM.
>
> In other words, KVM is still returning -EFAULT (or a variant thereof), the _only_
> difference, for all intents and purposes, is that userspace is given a bit more
> information about the source of the -EFAULT.
>
> > At least, not without a whole lot of plumbing code to make it so.
>
> Plumbing where?

In this example, I meant plumbing code to get a
kvm_run.memory_fault.flags which is more specific than (eg)
MEMFAULT_REASON_PAGE_FAULT_FAILURE from the -EFAULT paths under
kvm_mmu_page_fault. My idea for how userspace would distinguish
fast-gup failures was that kvm_faultin_pfn would set a special bit in
kvm_run.memory_fault.flags to indicate its failure. But (still
assuming that we shouldn't have false-positive kvm_run.memory_fault
fills) if the memory_fault can only be populated from
kvm_mmu_page_fault then either failures from FNAME(page_fault) and
kvm_faultin_pfn will be indistinguishable to userspace, or those
functions will need to plumb more specific exit reasons all the way up
to kvm_mmu_page_fault.

But, since you've made this point elsewhere, my guess is that your
answer is that it's actually userspace's job to detect the "specific"
reason for the fault and resolve it.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-21 18:01                 ` Anish Moorthy
@ 2023-03-21 19:43                   ` Sean Christopherson
  2023-03-22 21:06                     ` Anish Moorthy
  2023-03-28 22:19                     ` Anish Moorthy
  0 siblings, 2 replies; 60+ messages in thread
From: Sean Christopherson @ 2023-03-21 19:43 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Tue, Mar 21, 2023, Anish Moorthy wrote:
> On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > > > never sees the data, i.e. userspace is completely unaware.  This behavior is not
> > > > ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> > > > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > > > kvm_run union, but at least from a uABI perspective, the behavior is acceptable.
> > >
> > > Actually, I don't think the idea of filling in kvm_run.memory_fault
> > > for -EFAULTs which don't make it to userspace works at all. Consider
> > > the direct_map function, which bubbles its -EFAULT to
> > > kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called from both
> > > kvm_arch_async_page_ready (which ignores the return value), and by
> > > kvm_mmu_page_fault (where the return value does make it to userspace).
> > > Populating kvm_run.memory_fault anywhere in or under
> > > kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> > > kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > > already-set kvm_run.memory_fault / other kvm_run field.
> >
> > This particular case is a non-issue.  kvm_check_async_pf_completion() is called
> > only when the current task has control of the vCPU, i.e. is the current "running"
> > vCPU.  That's not a coincidence either, invoking kvm_mmu_do_page_fault() without
> > having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
> > context would be unstable.
> >
> > That will hold true for all cases.  Using a vCPU that is not loaded (not the
> > current "running" vCPU in KVM's misleading terminology) to access guest memory is
> > simply not safe, as the vCPU state is non-deterministic.  There are paths where
> > KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
> > and making requests, but those are very controlled flows with dedicated machinery
> > to make them SMP safe.
> >
> > That said, I agree that there's a risk that KVM could clobber vcpu->run_run by
> > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> > the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> > without the target vCPU being loaded:
> >
> >         int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> >         {
> >                 preempt_disable();
> >                 if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> >                         goto out;
> >
> >                 vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> >                 ...
> >         out:
> >                 preempt_enable();
> >                 return -EFAULT;
> >         }
> >
> > FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> > that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> > historical code that we need to deal with, it seems like the lesser of all evils.
> > Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> > better failure mode than KVM not filling kvm_run when it should, i.e. false
> > positives are ok, false negatives are fatal.
> 
> Don't you have this in reverse?

No, I don't think so.

> False negatives will just result in userspace not having useful extra
> information for the -EFAULT it receives from KVM_RUN, in which case userspace
> can do what you mentioned all VMMs do today and just terminate the VM.

And that is _really_ bad behavior if we have any hope of userspace actually being
able to rely on this functionality.  E.g. any false negative when userspace is
trying to do postcopy demand paging will be fatal to the VM.

> Whereas a false positive might cause a double-write to the KVM_RUN struct,
> either putting incorrect information in kvm_run.memory_fault or

Recording unused information on -EFAULT in kvm_run doesn't make the information
incorrect.

> corrupting another member of the union.

Only if KVM accesses guest memory after initiating an exit to userspace, which
would be a KVM irrespective of kvm_run.memory_fault.  We actually have exactly
this type of bug today in the trainwreck that is KVM's MMIO emulation[*], but
KVM gets away with the shoddy behavior by virtue of the scenario simply not
triggered by any real-world code.

And if we're really concerned about clobbering state, we could add hardening/auditing
code to ensure that KVM actually exits when kvm_run.exit_reason is set (though there
are a non-zero number of exceptions, e.g. the aformentioned MMIO mess, nested SVM/VMX
pages, and probably a few others).

Prior to cleanups a few years back[2], emulation failures had issues similar to
what we are discussing, where KVM would fail to exit to userspace, not fill kvm_run,
etc.  Those are the types of bugs I want to avoid here.

[1] https://lkml.kernel.org/r/ZBNrWZQhMX8AHzWM%40google.com
[2] https://lore.kernel.org/kvm/20190823010709.24879-1-sean.j.christopherson@intel.com

> > > That in turn looks problematic for the
> > > memory-fault-exit-on-fast-gup-failure part of this series, because
> > > there are at least a couple of cases for which kvm_mmu_do_page_fault
> > > will -EFAULT. One is the early-efault-on-fast-gup-failure case which
> > > was the original purpose of this series. Another is a -EFAULT from
> > > FNAME(fetch) (passed up through FNAME(page_fault)). There might be
> > > other cases as well. But unless userspace can/should resolve *all*
> > > such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
> > > "kvm_mmu_page_fault" wouldn't be actionable.
> >
> > Killing the VM, which is what all VMMs do today in response to -EFAULT, is an
> > action.  As I've pointed out elsewhere in this thread, userspace needs to be able
> > to identify "faults" that it (userspace) can resolve without a hint from KVM.
> >
> > In other words, KVM is still returning -EFAULT (or a variant thereof), the _only_
> > difference, for all intents and purposes, is that userspace is given a bit more
> > information about the source of the -EFAULT.
> >
> > > At least, not without a whole lot of plumbing code to make it so.
> >
> > Plumbing where?
> 
> In this example, I meant plumbing code to get a kvm_run.memory_fault.flags
> which is more specific than (eg) MEMFAULT_REASON_PAGE_FAULT_FAILURE from the
> -EFAULT paths under kvm_mmu_page_fault. My idea for how userspace would
> distinguish fast-gup failures was that kvm_faultin_pfn would set a special
> bit in kvm_run.memory_fault.flags to indicate its failure. But (still
> assuming that we shouldn't have false-positive kvm_run.memory_fault fills) if
> the memory_fault can only be populated from kvm_mmu_page_fault then either
> failures from FNAME(page_fault) and kvm_faultin_pfn will be indistinguishable
> to userspace, or those functions will need to plumb more specific exit
> reasons all the way up to kvm_mmu_page_fault.

Setting a flag that essentially says "failure when handling a guest page fault"
is problematic on multiple fronts.  Tying the ABI to KVM's internal implementation
is not an option, i.e. the ABI would need to be defined as "on page faults from
the guest".  And then the resulting behavior would be non-deterministic, e.g.
userspace would see different behavior if KVM accessed a "bad" gfn via emulation
instead of in response to a guest page fault.  And because of hardware TLBs, it
would even be possible for the behavior to be non-deterministic on the same
platform running the same guest code (though this would be exteremly unliklely
in practice).

And even if userspace is ok with only handling guest page faults_today_, I highly
doubt that will hold forever.  I.e. at some point there will be a use case that
wants to react to uaccess failures on fast-only memslots.

Ignoring all of those issues, simplify flagging "this -EFAULT occurred when
handling a guest page fault" isn't precise enough for userspace to blindly resolve
the failure.  Even if KVM went through the trouble of setting information if and
only if get_user_page_fast_only() failed while handling a guest page fault,
userspace would still need/want a way to verify that the failure was expected and
can be resolved, e.g. to guard against userspace bugs due to wrongly unmapping
or mprotecting a page.

> But, since you've made this point elsewhere, my guess is that your answer is
> that it's actually userspace's job to detect the "specific" reason for the
> fault and resolve it.

Yes, it's userspace's responsibity.  I simply don't see how KVM can provide
information that userspace doesn't already have without creating an unmaintainable
uABI, at least not without some deep, deep plumbing into gup().  I.e. unless gup()
were changed to explicitly communicate that it failed because of a uffd equivalent,
at best a flag in kvm_run would be a hint that userspace _might_ be able to resolve
the fault.  And even if we modified gup(), we'd still have all the open questions
about what to do when KVM encounters a fault on a uaccess.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-21 14:50         ` Sean Christopherson
@ 2023-03-21 20:23           ` Oliver Upton
  2023-03-21 21:01             ` Sean Christopherson
  0 siblings, 1 reply; 60+ messages in thread
From: Oliver Upton @ 2023-03-21 20:23 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Anish Moorthy, jthoughton, kvm

On Tue, Mar 21, 2023 at 07:50:35AM -0700, Sean Christopherson wrote:
> On Mon, Mar 20, 2023, Oliver Upton wrote:
> > On Fri, Mar 17, 2023 at 01:17:22PM -0700, Sean Christopherson wrote:
> > > On Fri, Mar 17, 2023, Oliver Upton wrote:
> > > > I'm not a fan of this architecture-specific dependency. Userspace is already
> > > > explicitly opting in to this behavior by way of the memslot flag. These sort
> > > > of exits are entirely orthogonal to the -EFAULT conversion earlier in the
> > > > series.
> > > 
> > > Ya, yet another reason not to speculate on why KVM wasn't able to resolve a fault.
> > 
> > Regardless of what we name this memslot flag, we're already getting explicit
> > opt-in from userspace for new behavior. There seems to be zero value in
> > supporting memslot_flag && !MEMORY_FAULT_EXIT (i.e. returning EFAULT),
> > so why even bother?
> 
> Because there are use cases for MEMORY_FAULT_EXIT beyond fast-only gup.

To be abundantly clear -- I have no issue with (nor care about) the other
MEMORY_FAULT_EXIT changes. If we go the route of explicit user opt-in then
that deserves its own distinct bit of UAPI. None of my objection pertains
to the conversion of existing -EFAULT exits.

> We could have the memslot feature depend on the MEMORY_FAULT_EXIT capability,
> but I don't see how that adds value for either KVM or userspace.

That is exactly what I want to avoid! My issue was the language here:

  +(*) NOTE: On x86, KVM_CAP_X86_MEMORY_FAULT_EXIT must be enabled for the
  +KVM_MEMFAULT_REASON_ABSENT_MAPPING_reason: otherwise userspace will only receive
  +a -EFAULT from KVM_RUN without any useful information.

Which sounds to me as though there are *two* UAPI bits for the whole fast-gup
failed interaction (flip a bit in the CAP and set a bit on the memslot, but
only for x86).

What I'm asking for is this:

 1) A capability advertising MEMORY_FAULT_EXIT to userspace. Either usurp
   EFAULT or require userspace to enable this capability to convert
   _existing_ EFAULT exits to the new way of the world.

 2) A capability and a single memslot flag to enable the fast-gup-only
   behavior (naming TBD). This does not depend on (1) in any way, i.e.
   only setting (2) should still result in MEMORY_FAULT_EXITs when fast
   gup fails. IOW, enabling (2) should always yield precise fault
   information to userspace.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation
  2023-03-21 20:23           ` Oliver Upton
@ 2023-03-21 21:01             ` Sean Christopherson
  0 siblings, 0 replies; 60+ messages in thread
From: Sean Christopherson @ 2023-03-21 21:01 UTC (permalink / raw)
  To: Oliver Upton; +Cc: Anish Moorthy, jthoughton, kvm

On Tue, Mar 21, 2023, Oliver Upton wrote:
> On Tue, Mar 21, 2023 at 07:50:35AM -0700, Sean Christopherson wrote:
> > On Mon, Mar 20, 2023, Oliver Upton wrote:
> > > On Fri, Mar 17, 2023 at 01:17:22PM -0700, Sean Christopherson wrote:
> > > > On Fri, Mar 17, 2023, Oliver Upton wrote:
> > > > > I'm not a fan of this architecture-specific dependency. Userspace is already
> > > > > explicitly opting in to this behavior by way of the memslot flag. These sort
> > > > > of exits are entirely orthogonal to the -EFAULT conversion earlier in the
> > > > > series.
> > > > 
> > > > Ya, yet another reason not to speculate on why KVM wasn't able to resolve a fault.
> > > 
> > > Regardless of what we name this memslot flag, we're already getting explicit
> > > opt-in from userspace for new behavior. There seems to be zero value in
> > > supporting memslot_flag && !MEMORY_FAULT_EXIT (i.e. returning EFAULT),
> > > so why even bother?
> > 
> > Because there are use cases for MEMORY_FAULT_EXIT beyond fast-only gup.
> 
> To be abundantly clear -- I have no issue with (nor care about) the other
> MEMORY_FAULT_EXIT changes. If we go the route of explicit user opt-in then
> that deserves its own distinct bit of UAPI. None of my objection pertains
> to the conversion of existing -EFAULT exits.
> 
> > We could have the memslot feature depend on the MEMORY_FAULT_EXIT capability,
> > but I don't see how that adds value for either KVM or userspace.
> 
> That is exactly what I want to avoid! My issue was the language here:
> 
>   +(*) NOTE: On x86, KVM_CAP_X86_MEMORY_FAULT_EXIT must be enabled for the
>   +KVM_MEMFAULT_REASON_ABSENT_MAPPING_reason: otherwise userspace will only receive
>   +a -EFAULT from KVM_RUN without any useful information.
> 
> Which sounds to me as though there are *two* UAPI bits for the whole fast-gup
> failed interaction (flip a bit in the CAP and set a bit on the memslot, but
> only for x86).

It won't be x86 only.  Anish's proposed patch has it as x86 specific, but I think
we're all in agreement that that is undesirable.  There will inevitably be per-arch
enabling and enumeration, e.g. to actually fill information and kick out to
userspace, but I don't see a sane way to avoid that since the common paths don't
have the vCPU (largely by design).

> What I'm asking for is this:
> 
>  1) A capability advertising MEMORY_FAULT_EXIT to userspace. Either usurp
>    EFAULT or require userspace to enable this capability to convert
>    _existing_ EFAULT exits to the new way of the world.
> 
>  2) A capability and a single memslot flag to enable the fast-gup-only
>    behavior (naming TBD). This does not depend on (1) in any way, i.e.
>    only setting (2) should still result in MEMORY_FAULT_EXITs when fast
>    gup fails. IOW, enabling (2) should always yield precise fault
>    information to userspace.

Ah, so 2.2, providing precise fault information on fast-gup-only failures, is the
biggest (only?) point of contention.

My objection to that behavior is that it's either going to annoyingly difficult to
get right in KVM, and even more annoying to maintain, or we'll end up with "fuzzy"
behavior that userspace will inevitably come to rely on, and then we'll be in a real
pickle.  E.g. if KVM sets the information without checking if gup() itself actually
failed, then KVM _might_ fill the info, depending on when KVM detects a problem.

Conversly, if KVM's contract is that it provides precise information if and only
if gup() fails, then KVM needs to precisely propagate back up the stack that gup()
failed.

To avoid spending more time going in circles, I propose we try to usurp -EFAULT
and convert all userspace-exits-from-KVM_RUN -EFAULT paths on x86 (as a guinea pig)
without requiring userspace to opt-in.  If that approach pans out, then this point
of contention goes away because 2.2 Just Works.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-21 19:43                   ` Sean Christopherson
@ 2023-03-22 21:06                     ` Anish Moorthy
  2023-03-22 23:17                       ` Sean Christopherson
  2023-03-28 22:19                     ` Anish Moorthy
  1 sibling, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-22 21:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Mar 21, 2023, Anish Moorthy wrote:
> > On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > > > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > > > > never sees the data, i.e. userspace is completely unaware.  This behavior is not
> > > > > ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> > > > > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > > > > kvm_run union, but at least from a uABI perspective, the behavior is acceptable.
> > > >
> > > > Actually, I don't think the idea of filling in kvm_run.memory_fault
> > > > for -EFAULTs which don't make it to userspace works at all. Consider
> > > > the direct_map function, which bubbles its -EFAULT to
> > > > kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called from both
> > > > kvm_arch_async_page_ready (which ignores the return value), and by
> > > > kvm_mmu_page_fault (where the return value does make it to userspace).
> > > > Populating kvm_run.memory_fault anywhere in or under
> > > > kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> > > > kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > > > already-set kvm_run.memory_fault / other kvm_run field.
> > >
> > > This particular case is a non-issue.  kvm_check_async_pf_completion() is called
> > > only when the current task has control of the vCPU, i.e. is the current "running"
> > > vCPU.  That's not a coincidence either, invoking kvm_mmu_do_page_fault() without
> > > having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
> > > context would be unstable.
> > >
> > > That will hold true for all cases.  Using a vCPU that is not loaded (not the
> > > current "running" vCPU in KVM's misleading terminology) to access guest memory is
> > > simply not safe, as the vCPU state is non-deterministic.  There are paths where
> > > KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
> > > and making requests, but those are very controlled flows with dedicated machinery
> > > to make them SMP safe.
> > >
> > > That said, I agree that there's a risk that KVM could clobber vcpu->run_run by
> > > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> > > the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> > > without the target vCPU being loaded:
> > >
> > >         int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> > >         {
> > >                 preempt_disable();
> > >                 if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > >                         goto out;
> > >
> > >                 vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > >                 ...
> > >         out:
> > >                 preempt_enable();
> > >                 return -EFAULT;
> > >         }
> > >
> > > FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> > > that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> > > historical code that we need to deal with, it seems like the lesser of all evils.
> > > Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> > > better failure mode than KVM not filling kvm_run when it should, i.e. false
> > > positives are ok, false negatives are fatal.
> >
> > Don't you have this in reverse?
>
> No, I don't think so.
>
> > False negatives will just result in userspace not having useful extra
> > information for the -EFAULT it receives from KVM_RUN, in which case userspace
> > can do what you mentioned all VMMs do today and just terminate the VM.
>
> And that is _really_ bad behavior if we have any hope of userspace actually being
> able to rely on this functionality.  E.g. any false negative when userspace is
> trying to do postcopy demand paging will be fatal to the VM.

But since -EFAULTs from KVM_RUN today are already fatal, so there's no
new failure introduced by an -EFAULT w/o a populated memory_fault
field right? Obviously that's of no real use to userspace, but that
seems like part of the point of starting with a partial conversion: to
allow for filling holes in the implementation in the future.

It seems like what you're really concerned about here is the
interaction with the memslot fast-gup-only flag. Obviously, failing to
populate kvm_run.memory_fault for new userspace-visible -EFAULTs
caused by that flag would cause new fatal failures for the guest,
which would make the feature actually harmful. But as far as I know
(and please lmk if I'm wrong), the memslot flag only needs to be used
by the kvm_handle_error_pfn (x86) and user_mem_abort (arm64)
functions, meaning that those are the only places where we need to
check/populate kvm_run.memory_fault for new userspace-visible
-EFAULTs.

> > Whereas a false positive might cause a double-write to the KVM_RUN struct,
> > either putting incorrect information in kvm_run.memory_fault or
>
> Recording unused information on -EFAULT in kvm_run doesn't make the information
> incorrect.
>
> > corrupting another member of the union.
>
> Only if KVM accesses guest memory after initiating an exit to userspace, which
> would be a KVM irrespective of kvm_run.memory_fault.

Ah good: I was concerned that this was a valid set of code paths in
KVM. Although I'm assuming that "initiating an exit to userspace"
includes the "returning -EFAULT from KVM_RUN" cases, because we
wouldn't want EFAULTs to stomp on each other as well (the
kvm_mmu_do_page_fault usages were supposed to be one such example,
though I'm glad to know that they're not a problem).

> And if we're really concerned about clobbering state, we could add hardening/auditing
> code to ensure that KVM actually exits when kvm_run.exit_reason is set (though there
> are a non-zero number of exceptions, e.g. the aformentioned MMIO mess, nested SVM/VMX
> pages, and probably a few others).
>
> Prior to cleanups a few years back[2], emulation failures had issues similar to
> what we are discussing, where KVM would fail to exit to userspace, not fill kvm_run,
> etc.  Those are the types of bugs I want to avoid here.
>
> [1] https://lkml.kernel.org/r/ZBNrWZQhMX8AHzWM%40google.com
> [2] https://lore.kernel.org/kvm/20190823010709.24879-1-sean.j.christopherson@intel.com
>
> > > > That in turn looks problematic for the
> > > > memory-fault-exit-on-fast-gup-failure part of this series, because
> > > > there are at least a couple of cases for which kvm_mmu_do_page_fault
> > > > will -EFAULT. One is the early-efault-on-fast-gup-failure case which
> > > > was the original purpose of this series. Another is a -EFAULT from
> > > > FNAME(fetch) (passed up through FNAME(page_fault)). There might be
> > > > other cases as well. But unless userspace can/should resolve *all*
> > > > such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
> > > > "kvm_mmu_page_fault" wouldn't be actionable.
> > >
> > > Killing the VM, which is what all VMMs do today in response to -EFAULT, is an
> > > action.  As I've pointed out elsewhere in this thread, userspace needs to be able
> > > to identify "faults" that it (userspace) can resolve without a hint from KVM.
> > >
> > > In other words, KVM is still returning -EFAULT (or a variant thereof), the _only_
> > > difference, for all intents and purposes, is that userspace is given a bit more
> > > information about the source of the -EFAULT.
> > >
> > > > At least, not without a whole lot of plumbing code to make it so.
> > >
> > > Plumbing where?
> >
> > In this example, I meant plumbing code to get a kvm_run.memory_fault.flags
> > which is more specific than (eg) MEMFAULT_REASON_PAGE_FAULT_FAILURE from the
> > -EFAULT paths under kvm_mmu_page_fault. My idea for how userspace would
> > distinguish fast-gup failures was that kvm_faultin_pfn would set a special
> > bit in kvm_run.memory_fault.flags to indicate its failure. But (still
> > assuming that we shouldn't have false-positive kvm_run.memory_fault fills) if
> > the memory_fault can only be populated from kvm_mmu_page_fault then either
> > failures from FNAME(page_fault) and kvm_faultin_pfn will be indistinguishable
> > to userspace, or those functions will need to plumb more specific exit
> > reasons all the way up to kvm_mmu_page_fault.
>
> Setting a flag that essentially says "failure when handling a guest page fault"
> is problematic on multiple fronts.  Tying the ABI to KVM's internal implementation
> is not an option, i.e. the ABI would need to be defined as "on page faults from
> the guest".  And then the resulting behavior would be non-deterministic, e.g.
> userspace would see different behavior if KVM accessed a "bad" gfn via emulation
> instead of in response to a guest page fault.  And because of hardware TLBs, it
> would even be possible for the behavior to be non-deterministic on the same
> platform running the same guest code (though this would be exteremly unliklely
> in practice).
>
> And even if userspace is ok with only handling guest page faults_today_, I highly
> doubt that will hold forever.  I.e. at some point there will be a use case that
> wants to react to uaccess failures on fast-only memslots.
>
> Ignoring all of those issues, simplify flagging "this -EFAULT occurred when
> handling a guest page fault" isn't precise enough for userspace to blindly resolve
> the failure.  Even if KVM went through the trouble of setting information if and
> only if get_user_page_fast_only() failed while handling a guest page fault,
> userspace would still need/want a way to verify that the failure was expected and
> can be resolved, e.g. to guard against userspace bugs due to wrongly unmapping
> or mprotecting a page.
>
> > But, since you've made this point elsewhere, my guess is that your answer is
> > that it's actually userspace's job to detect the "specific" reason for the
> > fault and resolve it.
>
> Yes, it's userspace's responsibity.  I simply don't see how KVM can provide
> information that userspace doesn't already have without creating an unmaintainable
> uABI, at least not without some deep, deep plumbing into gup().  I.e. unless gup()
> were changed to explicitly communicate that it failed because of a uffd equivalent,
> at best a flag in kvm_run would be a hint that userspace _might_ be able to resolve
> the fault.  And even if we modified gup(), we'd still have all the open questions
> about what to do when KVM encounters a fault on a uaccess.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-22 21:06                     ` Anish Moorthy
@ 2023-03-22 23:17                       ` Sean Christopherson
  0 siblings, 0 replies; 60+ messages in thread
From: Sean Christopherson @ 2023-03-22 23:17 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Wed, Mar 22, 2023, Anish Moorthy wrote:
> On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Mar 21, 2023, Anish Moorthy wrote:
> > > > FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> > > > that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> > > > historical code that we need to deal with, it seems like the lesser of all evils.
> > > > Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> > > > better failure mode than KVM not filling kvm_run when it should, i.e. false
> > > > positives are ok, false negatives are fatal.
> > >
> > > Don't you have this in reverse?
> >
> > No, I don't think so.
> >
> > > False negatives will just result in userspace not having useful extra
> > > information for the -EFAULT it receives from KVM_RUN, in which case userspace
> > > can do what you mentioned all VMMs do today and just terminate the VM.
> >
> > And that is _really_ bad behavior if we have any hope of userspace actually being
> > able to rely on this functionality.  E.g. any false negative when userspace is
> > trying to do postcopy demand paging will be fatal to the VM.
> 
> But since -EFAULTs from KVM_RUN today are already fatal, so there's no
> new failure introduced by an -EFAULT w/o a populated memory_fault
> field right?

Yes, but it's a bit of a moot piont since the goal of the feature is to avoid
killing the VM.

> Obviously that's of no real use to userspace, but that seems like part of the
> point of starting with a partial conversion: to allow for filling holes in
> the implementation in the future.

Yes, but I want a forcing function to reveal any holes we missed sooner than
later, otherwise the feature will languish since it won't be useful beyond the
fast-gup-only use case.

> It seems like what you're really concerned about here is the interaction with
> the memslot fast-gup-only flag. Obviously, failing to populate
> kvm_run.memory_fault for new userspace-visible -EFAULTs caused by that flag
> would cause new fatal failures for the guest, which would make the feature
> actually harmful. But as far as I know (and please lmk if I'm wrong), the
> memslot flag only needs to be used by the kvm_handle_error_pfn (x86) and
> user_mem_abort (arm64) functions, meaning that those are the only places
> where we need to check/populate kvm_run.memory_fault for new
> userspace-visible -EFAULTs.

No.  As you point out, the fast-gup-only case should be pretty easy to get correct,
i.e. this should all work just fine for _GCE's current_ use case.  I'm more concerned
with setting KVM up for success when future use cases come along that might not be ok
with unhandled faults in random guest accesses killing the VM.

To be clear, I do not expect us to get this 100% correct on the first attempt,
but I do want to have mechanisms in place that will detect any bugs/misses so
that we can fix the issues _before_ a use case comes along that needs 100%
accuracy.

> > > Whereas a false positive might cause a double-write to the KVM_RUN struct,
> > > either putting incorrect information in kvm_run.memory_fault or
> >
> > Recording unused information on -EFAULT in kvm_run doesn't make the information
> > incorrect.
> >
> > > corrupting another member of the union.
> >
> > Only if KVM accesses guest memory after initiating an exit to userspace, which
> > would be a KVM irrespective of kvm_run.memory_fault.
> 
> Ah good: I was concerned that this was a valid set of code paths in
> KVM. Although I'm assuming that "initiating an exit to userspace"
> includes the "returning -EFAULT from KVM_RUN" cases, because we
> wouldn't want EFAULTs to stomp on each other as well (the
> kvm_mmu_do_page_fault usages were supposed to be one such example,
> though I'm glad to know that they're not a problem).

This one gets into a bit of a grey area.  The "rule" is really about the intent,
i.e. once KVM intends to exit to userspace, it's a bug if KVM encounters something
else and runs into the weeds.

In no small part because of the myriad paths where KVM ignores what be fatal errors
in most flows, e.g. record_steal_time(), simply returning -EFAULT from some low
level helper doesn't necessarily signal an intent to exit all the way to userspace.

To be honest, I don't have a clear idea of how difficult it will be to detect bugs.
In most cases, failure to exit to userspace leads to a fatal error fairly quickly.
With userspace faults, it's entirely possible that an exit could be missed and
nothing bad would happen.

Hmm, one idea would be to have the initial -EFAULT detection fill kvm_run.memory_fault,
but set kvm_run.exit_reason to some magic number, e.g. zero it out.  Then KVM could
WARN if something tries to overwrite kvm_run.exit_reason.  The WARN would need to
be buried by a Kconfig or something since kvm_run can be modified by userspace,
but other than that I think it would work.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-21 19:43                   ` Sean Christopherson
  2023-03-22 21:06                     ` Anish Moorthy
@ 2023-03-28 22:19                     ` Anish Moorthy
  2023-04-04 19:34                       ` Sean Christopherson
  1 sibling, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-03-28 22:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Mar 21, 2023, Anish Moorthy wrote:
> > On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > > > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > > > > never sees the data, i.e. userspace is completely unaware.  This behavior is not
> > > > > ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> > > > > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > > > > kvm_run union, but at least from a uABI perspective, the behavior is acceptable.
> > > >
> > > > Actually, I don't think the idea of filling in kvm_run.memory_fault
> > > > for -EFAULTs which don't make it to userspace works at all. Consider
> > > > the direct_map function, which bubbles its -EFAULT to
> > > > kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called from both
> > > > kvm_arch_async_page_ready (which ignores the return value), and by
> > > > kvm_mmu_page_fault (where the return value does make it to userspace).
> > > > Populating kvm_run.memory_fault anywhere in or under
> > > > kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> > > > kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > > > already-set kvm_run.memory_fault / other kvm_run field.
> > >
> > > This particular case is a non-issue.  kvm_check_async_pf_completion() is called
> > > only when the current task has control of the vCPU, i.e. is the current "running"
> > > vCPU.  That's not a coincidence either, invoking kvm_mmu_do_page_fault() without
> > > having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
> > > context would be unstable.
> > >
> > > That will hold true for all cases.  Using a vCPU that is not loaded (not the
> > > current "running" vCPU in KVM's misleading terminology) to access guest memory is
> > > simply not safe, as the vCPU state is non-deterministic.  There are paths where
> > > KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
> > > and making requests, but those are very controlled flows with dedicated machinery
> > > to make them SMP safe.
> > >
> > > That said, I agree that there's a risk that KVM could clobber vcpu->run_run by
> > > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> > > the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> > > without the target vCPU being loaded:
> > >
> > >         int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> > >         {
> > >                 preempt_disable();
> > >                 if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > >                         goto out;
> > >
> > >                 vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > >                 ...
> > >         out:
> > >                 preempt_enable();
> > >                 return -EFAULT;
> > >         }
> > >
> > > FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> > > that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> > > historical code that we need to deal with, it seems like the lesser of all evils.
> > > Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> > > better failure mode than KVM not filling kvm_run when it should, i.e. false
> > > positives are ok, false negatives are fatal.
> >
> > Don't you have this in reverse?
>
> No, I don't think so.
>
> > False negatives will just result in userspace not having useful extra
> > information for the -EFAULT it receives from KVM_RUN, in which case userspace
> > can do what you mentioned all VMMs do today and just terminate the VM.
>
> And that is _really_ bad behavior if we have any hope of userspace actually being
> able to rely on this functionality.  E.g. any false negative when userspace is
> trying to do postcopy demand paging will be fatal to the VM.
>
> > Whereas a false positive might cause a double-write to the KVM_RUN struct,
> > either putting incorrect information in kvm_run.memory_fault or
>
> Recording unused information on -EFAULT in kvm_run doesn't make the information
> incorrect.

Let's say that some function (converted to annotate its EFAULTs) fills
in kvm_run.memory_fault, but the EFAULT is suppressed from being
returned from kvm_run. What if, later within the same kvm_run call,
some other function (which we've completely overlooked) EFAULTs and
that return value actually does make it out to kvm_run? Userspace
would get stale information, which could be catastrophic.

Actually even performing the annotations only in functions that
currently always bubble EFAULTs to userspace still seems brittle: if
new callers are ever added which don't bubble the EFAULTs, then we end
up in the same situation.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-03-28 22:19                     ` Anish Moorthy
@ 2023-04-04 19:34                       ` Sean Christopherson
  2023-04-04 20:40                         ` Anish Moorthy
  0 siblings, 1 reply; 60+ messages in thread
From: Sean Christopherson @ 2023-04-04 19:34 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Tue, Mar 28, 2023, Anish Moorthy wrote:
> On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Mar 21, 2023, Anish Moorthy wrote:
> > > On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> > > > that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> > > > historical code that we need to deal with, it seems like the lesser of all evils.
> > > > Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> > > > better failure mode than KVM not filling kvm_run when it should, i.e. false
> > > > positives are ok, false negatives are fatal.
> > >
> > > Don't you have this in reverse?
> >
> > No, I don't think so.
> >
> > > False negatives will just result in userspace not having useful extra
> > > information for the -EFAULT it receives from KVM_RUN, in which case userspace
> > > can do what you mentioned all VMMs do today and just terminate the VM.
> >
> > And that is _really_ bad behavior if we have any hope of userspace actually being
> > able to rely on this functionality.  E.g. any false negative when userspace is
> > trying to do postcopy demand paging will be fatal to the VM.
> >
> > > Whereas a false positive might cause a double-write to the KVM_RUN struct,
> > > either putting incorrect information in kvm_run.memory_fault or
> >
> > Recording unused information on -EFAULT in kvm_run doesn't make the information
> > incorrect.
> 
> Let's say that some function (converted to annotate its EFAULTs) fills
> in kvm_run.memory_fault, but the EFAULT is suppressed from being
> returned from kvm_run. What if, later within the same kvm_run call,
> some other function (which we've completely overlooked) EFAULTs and
> that return value actually does make it out to kvm_run? Userspace
> would get stale information, which could be catastrophic.

"catastrophic" is a bit hyperbolic.  Yes, it would be bad, but at _worst_ userspace
will kill the VM, which is the status quo today.

> Actually even performing the annotations only in functions that
> currently always bubble EFAULTs to userspace still seems brittle: if
> new callers are ever added which don't bubble the EFAULTs, then we end
> up in the same situation.

Because of KVM's semi-magical '1 == resume, -errno/0 == exit' "design", that's
true for literally every exit to userspace in KVM and every VM-Exit handler.
E.g. see commit 2368048bf5c2 ("KVM: x86: Signal #GP, not -EPERM, on bad
WRMSR(MCi_CTL/STATUS)"), where KVM returned '-1' instead of '1' when rejecting
MSR accesses and inadvertantly killed the VM.  A similar bug would be if KVM
returned EFAULT instead of -EFAULT, in which case vcpu_run() would resume the
guest instead of exiting to userspace and likely put the vCPU into an infinite
loop.

Do I want to harden KVM to make things like this less brittle?  Absolutely.  Do I
think we should hold up this functionality just because it doesn't solve all of
pre-existing flaws in the related KVM code?  No.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-04-04 19:34                       ` Sean Christopherson
@ 2023-04-04 20:40                         ` Anish Moorthy
  2023-04-04 22:07                           ` Sean Christopherson
  0 siblings, 1 reply; 60+ messages in thread
From: Anish Moorthy @ 2023-04-04 20:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Tue, Apr 4, 2023 at 12:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > Let's say that some function (converted to annotate its EFAULTs) fills
> > in kvm_run.memory_fault, but the EFAULT is suppressed from being
> > returned from kvm_run. What if, later within the same kvm_run call,
> > some other function (which we've completely overlooked) EFAULTs and
> > that return value actually does make it out to kvm_run? Userspace
> > would get stale information, which could be catastrophic.
>
> "catastrophic" is a bit hyperbolic.  Yes, it would be bad, but at _worst_ userspace
> will kill the VM, which is the status quo today.

Well what I'm saying is that in these cases userspace *wouldn't know*
that kvm_run.memory_fault contains incorrect information for the
-EFAULT it actually got (do you disagree?), which could presumably
cause it to do bad things like "resolve" faults on incorrect pages
and/or infinite-loop on KVM_RUN, etc.

Annotating the efault information as valid only from the call sites
which return directly to userspace prevents this class of problem, at
the cost of allowing un-annotated EFAULTs to make it to userspace. But
to me, paying that cost to make sure the EFAULT information is always
correct seems by far preferable to not paying it and allowing
userspace to get silently incorrect information.

> > Actually even performing the annotations only in functions that
> > currently always bubble EFAULTs to userspace still seems brittle: if
> > new callers are ever added which don't bubble the EFAULTs, then we end
> > up in the same situation.
>
> Because of KVM's semi-magical '1 == resume, -errno/0 == exit' "design", that's
> true for literally every exit to userspace in KVM and every VM-Exit handler.
> E.g. see commit 2368048bf5c2 ("KVM: x86: Signal #GP, not -EPERM, on bad
> WRMSR(MCi_CTL/STATUS)"), where KVM returned '-1' instead of '1' when rejecting
> MSR accesses and inadvertantly killed the VM.  A similar bug would be if KVM
> returned EFAULT instead of -EFAULT, in which case vcpu_run() would resume the
> guest instead of exiting to userspace and likely put the vCPU into an infinite
> loop.

Right, good point.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-04-04 20:40                         ` Anish Moorthy
@ 2023-04-04 22:07                           ` Sean Christopherson
  2023-04-05 20:21                             ` Anish Moorthy
  0 siblings, 1 reply; 60+ messages in thread
From: Sean Christopherson @ 2023-04-04 22:07 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Isaku Yamahata, Marc Zyngier, Oliver Upton, jthoughton, kvm

On Tue, Apr 04, 2023, Anish Moorthy wrote:
> On Tue, Apr 4, 2023 at 12:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > > Let's say that some function (converted to annotate its EFAULTs) fills
> > > in kvm_run.memory_fault, but the EFAULT is suppressed from being
> > > returned from kvm_run. What if, later within the same kvm_run call,
> > > some other function (which we've completely overlooked) EFAULTs and
> > > that return value actually does make it out to kvm_run? Userspace
> > > would get stale information, which could be catastrophic.
> >
> > "catastrophic" is a bit hyperbolic.  Yes, it would be bad, but at _worst_ userspace
> > will kill the VM, which is the status quo today.
> 
> Well what I'm saying is that in these cases userspace *wouldn't know*
> that kvm_run.memory_fault contains incorrect information for the
> -EFAULT it actually got (do you disagree?),

I disagree in the sense that if the stale information causes a problem, then by
definition userspace has to know.  It's the whole "if a tree falls in a forest"
thing.  If KVM reports stale information and literally nothing bad happens, ever,
then is the superfluous exit really a problem?  Not saying it wouldn't be treated
as a bug, just that it might not even warrant a stable backport if the worst case
scenario is a spurious exit to userspace (for example).

> which could presumably cause it to do bad things like "resolve" faults on
> incorrect pages and/or infinite-loop on KVM_RUN, etc.

Putting the vCPU into an infinite loop is _very_ visible, e.g. see the entire
mess surrounding commit 31c25585695a ("Revert "KVM: SVM: avoid infinite loop on
NPF from bad address"").

As above, fixing pages that don't need to be fixed isn't itself a major problem.
If the extra exits lead to a performance issue, then _that_ is a problem, but
again _something_ has to detect the problem and thus it becomes a known thing.

> Annotating the efault information as valid only from the call sites
> which return directly to userspace prevents this class of problem, at
> the cost of allowing un-annotated EFAULTs to make it to userspace. But
> to me, paying that cost to make sure the EFAULT information is always
> correct seems by far preferable to not paying it and allowing
> userspace to get silently incorrect information.

I don't think that's a maintainable approach.  Filling kvm_run if and only if the
-EFAULT has a direct path to userspace is (a) going to require a signficant amount
of code churn and (b) falls apart the instant code further up the stack changes.
E.g. the relatively straightforward page fault case requires bouncing through 7+
functions to get from kvm_handle_error_pfn() to kvm_arch_vcpu_ioctl_run(), and not
all of those are obviously "direct"

	if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
		r = kvm_tdp_page_fault(vcpu, &fault);
	else
		r = vcpu->arch.mmu->page_fault(vcpu, &fault);

	if (fault.write_fault_to_shadow_pgtable && emulation_type)
		*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;

	/*
	 * Similar to above, prefetch faults aren't truly spurious, and the
	 * async #PF path doesn't do emulation.  Do count faults that are fixed
	 * by the async #PF handler though, otherwise they'll never be counted.
	 */
	if (r == RET_PF_FIXED)
		vcpu->stat.pf_fixed++;
	else if (prefetch)
		;
	else if (r == RET_PF_EMULATE)
		vcpu->stat.pf_emulate++;
	else if (r == RET_PF_SPURIOUS)
		vcpu->stat.pf_spurious++;
	return r;


...

	if (r == RET_PF_INVALID) {
		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
					  lower_32_bits(error_code), false,
					  &emulation_type);
		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
			return -EIO;
	}

	if (r < 0)
		return r;
	if (r != RET_PF_EMULATE)
		return 1;

In other words, the "only if it's direct" rule requires visually auditing changes,
i.e. catching "violations" via code review, not only to code that adds a new -EFAULT
return, but to all code throughout rather large swaths of KVM.  The odds of us (or
whoever the future maintainers/reviewers are) remembering to enforce the "rule", let
alone actually having 100% accuracy, are basically nil.

On the flip side, if we add a helper to fill kvm_run and return -EFAULT, then we can
add rule that only time KVM is allowed to return a bare -EFAULT is immediately after
a uaccess, i.e. after copy_to/from_user() and the many variants.  And _that_ can be
enforced through static checkers, e.g. someone with more (read: any) awk/sed skills
than me could bang something out in a matter of minutes.  Such a static checker won't
catch everything, but there would be very, very few bare non-uaccess -EFAULTS left,
and those could be filtered out with an allowlist, e.g. similar to how the folks that
run smatch and whatnot deal with false positives.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
  2023-04-04 22:07                           ` Sean Christopherson
@ 2023-04-05 20:21                             ` Anish Moorthy
  0 siblings, 0 replies; 60+ messages in thread
From: Anish Moorthy @ 2023-04-05 20:21 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Marc Zyngier, Oliver Upton, jthoughton, kvm

Ok. I'm still concerned about the implications of the "annotate
everywhere" approach, but I spoke with James and he shares your
opinion on the severity of the potential issues. I'll put the patches
together and send up a proper v3.

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2023-04-05 20:21 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 01/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 02/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 03/14] KVM: Allow hva_pfn_fast to resolve read-only faults Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field Anish Moorthy
2023-03-17  0:02   ` Isaku Yamahata
2023-03-17 18:33     ` Anish Moorthy
2023-03-17 19:30       ` Oliver Upton
2023-03-17 21:50       ` Sean Christopherson
2023-03-17 22:44         ` Anish Moorthy
2023-03-20 15:53           ` Sean Christopherson
2023-03-20 18:19             ` Anish Moorthy
2023-03-20 22:11             ` Anish Moorthy
2023-03-21 15:21               ` Sean Christopherson
2023-03-21 18:01                 ` Anish Moorthy
2023-03-21 19:43                   ` Sean Christopherson
2023-03-22 21:06                     ` Anish Moorthy
2023-03-22 23:17                       ` Sean Christopherson
2023-03-28 22:19                     ` Anish Moorthy
2023-04-04 19:34                       ` Sean Christopherson
2023-04-04 20:40                         ` Anish Moorthy
2023-04-04 22:07                           ` Sean Christopherson
2023-04-05 20:21                             ` Anish Moorthy
2023-03-17 18:35   ` Oliver Upton
2023-03-15  2:17 ` [WIP Patch v2 05/14] KVM: x86: Implement memory fault exit for direct_map Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 06/14] KVM: x86: Implement memory fault exit for kvm_handle_page_fault Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 07/14] KVM: x86: Implement memory fault exit for setup_vmgexit_scratch Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 08/14] KVM: x86: Implement memory fault exit for FNAME(fetch) Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation Anish Moorthy
2023-03-17 18:59   ` Oliver Upton
2023-03-17 20:15     ` Anish Moorthy
2023-03-17 20:54       ` Sean Christopherson
2023-03-17 23:42         ` Anish Moorthy
2023-03-20 15:13           ` Sean Christopherson
2023-03-20 19:53             ` Anish Moorthy
2023-03-17 20:17     ` Sean Christopherson
2023-03-20 22:22       ` Oliver Upton
2023-03-21 14:50         ` Sean Christopherson
2023-03-21 20:23           ` Oliver Upton
2023-03-21 21:01             ` Sean Christopherson
2023-03-15  2:17 ` [WIP Patch v2 10/14] KVM: x86: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
2023-03-17  0:32   ` Isaku Yamahata
2023-03-15  2:17 ` [WIP Patch v2 11/14] KVM: arm64: Allow user_mem_abort to return 0 to signal a 'normal' exit Anish Moorthy
2023-03-17 18:18   ` Oliver Upton
2023-03-15  2:17 ` [WIP Patch v2 12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
2023-03-17 18:27   ` Oliver Upton
2023-03-17 19:00     ` Anish Moorthy
2023-03-17 19:03       ` Oliver Upton
2023-03-17 19:24       ` Sean Christopherson
2023-03-15  2:17 ` [WIP Patch v2 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
2023-03-17 17:43 ` [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Oliver Upton
2023-03-17 18:13   ` Sean Christopherson
2023-03-17 18:46     ` David Matlack
2023-03-17 18:54       ` Oliver Upton
2023-03-17 18:59         ` David Matlack
2023-03-17 19:53           ` Anish Moorthy
2023-03-17 22:03             ` Sean Christopherson
2023-03-20 15:56               ` Sean Christopherson
2023-03-17 20:35 ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).