* [PATCH 0/8] Add memory fault exits to avoid slow GUP
@ 2023-02-15  1:16 Anish Moorthy
  2023-02-15  1:16 ` [PATCH 1/8] selftests/kvm: Fix bug in how demand_paging_test calculates paging rate Anish Moorthy
                   ` (8 more replies)
  0 siblings, 9 replies; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15  1:16 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier
  Cc: Oliver Upton, Sean Christopherson, James Houghton, Anish Moorthy,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

This series improves the scalability of userfaultfd-based postcopy live
migration. It implements the no-slow-gup approach which James Houghton
described in his earlier RFC ([1]). A new capability,
KVM_CAP_MEM_FAULT_NOWAIT, is introduced which causes KVM to exit to
userspace if fast get_user_pages (GUP) fails while resolving a page
fault. The motivation is to allow (most) EPT violations to be resolved
without going through userfaultfd, which involves serializing faults on
internal locks: see [1] for more details.

After receiving the new exit, userspace can check if it has previously
UFFDIO_COPY/CONTINUEd the faulting address; if not, then it knows that
fast GUP could not possibly have succeeded, and so the fault has to be
resolved via UFFDIO_COPY/CONTINUE. In these cases a UFFDIO_WAKE is
unnecessary, as the vCPU thread hasn't been put to sleep waiting on the
uffd.

If userspace *has* already COPY/CONTINUEd the address, then it must take
some other action to make fast GUP succeed: such as swapping in the
page (for instance, via MADV_POPULATE_WRITE for writable mappings).
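Taken together, the two cases above amount to a simple dispatch on the
VMM's own record of which pages it has populated. A minimal sketch (the
helper name, enum, and bitmap bookkeeping here are illustrative, not
part of this series):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch (not from the series) of how userspace might
 * dispatch a memory fault exit. `populated` is the VMM's own record of
 * which guest pages it has already UFFDIO_COPY/CONTINUEd.
 */
enum fault_action {
	FAULT_DO_UFFDIO_COPY,	/* first touch: fast GUP cannot have succeeded */
	FAULT_DO_MADV_POPULATE,	/* already copied: swap the page back in */
};

static enum fault_action dispatch_mem_fault(const bool *populated,
					    uint64_t gpa, uint64_t page_size)
{
	uint64_t pfn = gpa / page_size;

	/*
	 * Not yet COPY/CONTINUEd: resolve via the uffd. No UFFDIO_WAKE is
	 * needed afterwards, since the vCPU thread never slept on the uffd.
	 */
	if (!populated[pfn])
		return FAULT_DO_UFFDIO_COPY;

	/*
	 * Already resolved once: the page was presumably reclaimed, so make
	 * fast GUP succeed again, e.g. via MADV_POPULATE_WRITE on writable
	 * mappings.
	 */
	return FAULT_DO_MADV_POPULATE;
}
```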

This feature should only be enabled during userfaultfd postcopy, as it
prevents the generation of async page faults.

The actual kernel changes to implement the change on arm64/x86 are
small: most of this series is actually just adding support for the new
feature in the demand paging self test. Performance samples (rates
reported in thousands of pages/s, average of five runs each), generated
using [2] on an x86 machine with 256 cores, are shown below.

vCPUs, Paging Rate (w/o new cap), Paging Rate (w/ new cap)
1       150     340
2       191     477
4       210     809
8       155     1239
16      130     1595
32      108     2299
64      86      3482
128     62      4134
256     36      4012

[1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
[2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
    A quick rundown of the new flags (also detailed in later commits)
        -a registers all of guest memory to a single uffd.
        -r specifies the number of reader threads for polling the uffd.
        -w is what actually enables memory fault exits.
    All data was collected after applying the entire series.

This series is based on the latest kvm/next (7cb79f433e75).

Anish Moorthy (8):
  selftests/kvm: Fix bug in how demand_paging_test calculates paging
    rate
  selftests/kvm: Allow many vcpus per UFFD in demand paging test
  selftests/kvm: Switch demand paging uffd readers to epoll
  kvm: Allow hva_pfn_fast to resolve read-only faults.
  kvm: Add cap/kvm_run field for memory fault exits
  kvm/x86: Add mem fault exit on EPT violations
  kvm/arm64: Implement KVM_CAP_MEM_FAULT_NOWAIT for arm64
  selftests/kvm: Handle mem fault exits in demand paging test

 Documentation/virt/kvm/api.rst                |  42 ++++
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |  14 +-
 arch/x86/kvm/mmu/mmu.c                        |  23 +-
 arch/x86/kvm/x86.c                            |   1 +
 include/linux/kvm_host.h                      |  13 +
 include/uapi/linux/kvm.h                      |  13 +-
 tools/include/uapi/linux/kvm.h                |   7 +
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/demand_paging_test.c        | 237 ++++++++++++++----
 .../selftests/kvm/include/userfaultfd_util.h  |  18 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 160 +++++++-----
 virt/kvm/kvm_main.c                           |  48 +++-
 13 files changed, 442 insertions(+), 139 deletions(-)

-- 
2.39.1.581.gbfd45094c4-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/8] selftests/kvm: Fix bug in how demand_paging_test calculates paging rate
  2023-02-15  1:16 [PATCH 0/8] Add memory fault exits to avoid slow GUP Anish Moorthy
@ 2023-02-15  1:16 ` Anish Moorthy
  2023-02-15  7:27   ` Oliver Upton
  2023-02-15  1:16 ` [PATCH 2/8] selftests/kvm: Allow many vcpus per UFFD in demand paging test Anish Moorthy
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15  1:16 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier
  Cc: Oliver Upton, Sean Christopherson, James Houghton, Anish Moorthy,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

Currently we're dividing tv_nsec by 1E8 rather than 1E9, which
overstates the elapsed time and thus understates the paging rate.

Reported-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 tools/testing/selftests/kvm/demand_paging_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index b0e1fc4de9e29..6809184ce2390 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -194,7 +194,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		ts_diff.tv_sec, ts_diff.tv_nsec);
 	pr_info("Overall demand paging rate: %f pgs/sec\n",
 		memstress_args.vcpu_args[0].pages * nr_vcpus /
-		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 100000000.0));
+		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 1E9));
 
 	memstress_destroy_vm(vm);
 
-- 
2.39.1.581.gbfd45094c4-goog



* [PATCH 2/8] selftests/kvm: Allow many vcpus per UFFD in demand paging test
  2023-02-15  1:16 [PATCH 0/8] Add memory fault exits to avoid slow GUP Anish Moorthy
  2023-02-15  1:16 ` [PATCH 1/8] selftests/kvm: Fix bug in how demand_paging_test calculates paging rate Anish Moorthy
@ 2023-02-15  1:16 ` Anish Moorthy
  2023-02-15  1:16 ` [PATCH 3/8] selftests/kvm: Switch demand paging uffd readers to epoll Anish Moorthy
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15  1:16 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier
  Cc: Oliver Upton, Sean Christopherson, James Houghton, Anish Moorthy,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

The aim is to enable the demand_paging_selftest to benchmark multiple
vCPU threads concurrently faulting over a single region/uffd.

Currently,
  (a) "-u" (run test in userfaultfd mode) will create a uffd for each
      vCPU's region, so that each uffd services a single vcpu thread.
  (b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
      simply doesn't work (the test will try to register all of guest
      memory to multiple uffds, and get an error doing so).

With this change
  (1) "-u" behavior is unchanged.
  (2) "-u -a" will create a single uffd for *all* of guest memory.
  (3) "-u -o" will implicitly pass "-a", resolving the breakage in (b).

In cases (2) and (3) all vCPU threads will fault on a single uffd (with
and without partitioned accesses respectively), giving us the behavior
we want to test.

With multiple threads now allowed to fault on a single UFFD, it makes
sense to allow more than one reader thread per UFFD as well: an option
for this (-r) is also added.
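The resulting region sizing can be sketched as follows; the rule
matches the num_uffds/uffd_region_size computation this patch adds to
run_test, but the helper name is illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the test's region-sizing rule: with -a (single_uffd), one
 * uffd covers all of guest memory; otherwise each of the nr_vcpus
 * uffds covers exactly one vCPU's region.
 */
static uint64_t uffd_region_size(bool single_uffd, int nr_vcpus,
				 uint64_t guest_percpu_mem_size)
{
	int num_uffds = single_uffd ? 1 : nr_vcpus;

	return nr_vcpus * guest_percpu_mem_size / num_uffds;
}
```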

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/aarch64/page_fault_test.c   |  4 +-
 .../selftests/kvm/demand_paging_test.c        | 62 +++++++++----
 .../selftests/kvm/include/userfaultfd_util.h  | 18 +++-
 .../selftests/kvm/lib/userfaultfd_util.c      | 86 +++++++++++++------
 4 files changed, 125 insertions(+), 45 deletions(-)

diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index beb944fa6fd46..de4251b811e3b 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -377,14 +377,14 @@ static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
 		*pt_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						    pt_args.hva,
 						    pt_args.paging_size,
-						    test->uffd_pt_handler);
+						    1, test->uffd_pt_handler);
 
 	*data_uffd = NULL;
 	if (test->uffd_data_handler)
 		*data_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						      data_args.hva,
 						      data_args.paging_size,
-						      test->uffd_data_handler);
+						      1, test->uffd_data_handler);
 }
 
 static void free_uffd(struct test_desc *test, struct uffd_desc *pt_uffd,
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 6809184ce2390..3c1d5b81c9822 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -58,7 +58,7 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 }
 
 static int handle_uffd_page_request(int uffd_mode, int uffd,
-		struct uffd_msg *msg)
+									struct uffd_msg *msg)
 {
 	pid_t tid = syscall(__NR_gettid);
 	uint64_t addr = msg->arg.pagefault.address;
@@ -77,8 +77,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		copy.mode = 0;
 
 		r = ioctl(uffd, UFFDIO_COPY, &copy);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
+		/*
+		 * When multiple vCPU threads fault on a single page and there are
+		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
+		 * will fail with EEXIST: handle that case without signaling an
+		 * error.
+		 */
+		if (r == -1 && errno != EEXIST) {
+			pr_info(
+				"Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -89,8 +96,10 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		cont.range.len = demand_paging_size;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
+		/* See the note about EEXISTs in the UFFDIO_COPY branch. */
+		if (r == -1 && errno != EEXIST) {
+			pr_info(
+				"Failed UFFDIO_CONTINUE in 0x%lx from thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -110,7 +119,9 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 
 struct test_params {
 	int uffd_mode;
+	bool single_uffd;
 	useconds_t uffd_delay;
+	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
 };
@@ -133,7 +144,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i;
+	int i, num_uffds = 0;
+	uint64_t uffd_region_size;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -146,10 +158,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
 	if (p->uffd_mode) {
-		uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *));
+		num_uffds = p->single_uffd ? 1 : nr_vcpus;
+		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
+
+		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
 		TEST_ASSERT(uffd_descs, "Memory allocation failed");
 
-		for (i = 0; i < nr_vcpus; i++) {
+		for (i = 0; i < num_uffds; i++) {
 			struct memstress_vcpu_args *vcpu_args;
 			void *vcpu_hva;
 			void *vcpu_alias;
@@ -160,8 +175,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			vcpu_hva = addr_gpa2hva(vm, vcpu_args->gpa);
 			vcpu_alias = addr_gpa2alias(vm, vcpu_args->gpa);
 
-			prefault_mem(vcpu_alias,
-				vcpu_args->pages * memstress_args.guest_page_size);
+			prefault_mem(vcpu_alias, uffd_region_size);
 
 			/*
 			 * Set up user fault fd to handle demand paging
@@ -169,7 +183,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
 				p->uffd_mode, p->uffd_delay, vcpu_hva,
-				vcpu_args->pages * memstress_args.guest_page_size,
+				uffd_region_size,
+				p->readers_per_uffd,
 				&handle_uffd_page_request);
 		}
 	}
@@ -186,7 +201,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	if (p->uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
-		for (i = 0; i < nr_vcpus; i++)
+		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
 	}
 
@@ -206,14 +221,19 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-d uffd_delay_usec]\n"
-	       "          [-b memory] [-s type] [-v vcpus] [-o]\n", name);
+	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
+		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
+		   "          [-s type] [-v vcpus] [-o]\n", name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
+	printf(" -a: Use a single userfaultfd for all of guest memory, instead of\n"
+		   "     creating one for each region paged by a unique vCPU\n"
+		   "     Set implicitly with -o, and no effect without -u.\n");
 	printf(" -d: add a delay in usec to the User Fault\n"
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
+	printf(" -r: Set the number of reader threads per uffd.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -231,12 +251,14 @@ int main(int argc, char *argv[])
 	struct test_params p = {
 		.src_type = DEFAULT_VM_MEM_SRC,
 		.partition_vcpu_memory_access = true,
+		.readers_per_uffd = 1,
+		.single_uffd = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:o")) != -1) {
+	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -248,6 +270,9 @@ int main(int argc, char *argv[])
 				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
 			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
+		case 'a':
+			p.single_uffd = true;
+			break;
 		case 'd':
 			p.uffd_delay = strtoul(optarg, NULL, 0);
 			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
@@ -265,6 +290,13 @@ int main(int argc, char *argv[])
 			break;
 		case 'o':
 			p.partition_vcpu_memory_access = false;
+			p.single_uffd = true;
+			break;
+		case 'r':
+			p.readers_per_uffd = atoi(optarg);
+			TEST_ASSERT(p.readers_per_uffd >= 1,
+						"Invalid number of readers per uffd %d: must be >=1",
+						p.readers_per_uffd);
 			break;
 		case 'h':
 		default:
diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 877449c345928..92cc1f9ec0686 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -17,18 +17,30 @@
 
 typedef int (*uffd_handler_t)(int uffd_mode, int uffd, struct uffd_msg *msg);
 
+struct uffd_reader_args {
+	int uffd_mode;
+	int uffd;
+	useconds_t delay;
+	uffd_handler_t handler;
+	/* Holds the read end of the pipe for killing the reader. */
+	int pipe;
+};
+
 struct uffd_desc {
 	int uffd_mode;
 	int uffd;
-	int pipefds[2];
 	useconds_t delay;
 	uffd_handler_t handler;
-	pthread_t thread;
+	uint64_t num_readers;
+	/* Holds the write ends of the pipes for killing the readers. */
+	int *pipefds;
+	pthread_t *readers;
+	struct uffd_reader_args *reader_args;
 };
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler);
+					   uint64_t num_readers, uffd_handler_t handler);
 
 void uffd_stop_demand_paging(struct uffd_desc *uffd);
 
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 92cef20902f1f..2723ee1e3e1b2 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -27,10 +27,8 @@
 
 static void *uffd_handler_thread_fn(void *arg)
 {
-	struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
-	int uffd = uffd_desc->uffd;
-	int pipefd = uffd_desc->pipefds[0];
-	useconds_t delay = uffd_desc->delay;
+	struct uffd_reader_args *reader_args = (struct uffd_reader_args *)arg;
+	int uffd = reader_args->uffd;
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
@@ -44,7 +42,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 		pollfd[0].fd = uffd;
 		pollfd[0].events = POLLIN;
-		pollfd[1].fd = pipefd;
+		pollfd[1].fd = reader_args->pipe;
 		pollfd[1].events = POLLIN;
 
 		r = poll(pollfd, 2, -1);
@@ -92,9 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
 
-		if (delay)
-			usleep(delay);
-		r = uffd_desc->handler(uffd_desc->uffd_mode, uffd, &msg);
+		if (reader_args->delay)
+			usleep(reader_args->delay);
+		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
 		if (r < 0)
 			return NULL;
 		pages++;
@@ -110,7 +108,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler)
+					   uint64_t num_readers, uffd_handler_t handler)
 {
 	struct uffd_desc *uffd_desc;
 	bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
@@ -118,14 +116,26 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	struct uffdio_api uffdio_api;
 	struct uffdio_register uffdio_register;
 	uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY;
-	int ret;
+	int ret, i;
 
 	PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n",
 		       is_minor ? "MINOR" : "MISSING",
 		       is_minor ? "UFFDIO_CONINUE" : "UFFDIO_COPY");
 
 	uffd_desc = malloc(sizeof(struct uffd_desc));
-	TEST_ASSERT(uffd_desc, "malloc failed");
+	TEST_ASSERT(uffd_desc, "Failed to malloc uffd descriptor");
+
+	uffd_desc->pipefds = malloc(sizeof(int) * num_readers);
+	TEST_ASSERT(uffd_desc->pipefds, "Failed to malloc pipes");
+
+	uffd_desc->readers = malloc(sizeof(pthread_t) * num_readers);
+	TEST_ASSERT(uffd_desc->readers, "Failed to malloc reader threads");
+
+	uffd_desc->reader_args = malloc(
+		sizeof(struct uffd_reader_args) * num_readers);
+	TEST_ASSERT(uffd_desc->reader_args, "Failed to malloc reader_args");
+
+	uffd_desc->num_readers = num_readers;
 
 	/* In order to get minor faults, prefault via the alias. */
 	if (is_minor)
@@ -148,18 +158,32 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
 		    expected_ioctls, "missing userfaultfd ioctls");
 
-	ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
-	TEST_ASSERT(!ret, "Failed to set up pipefd");
-
 	uffd_desc->uffd_mode = uffd_mode;
 	uffd_desc->uffd = uffd;
 	uffd_desc->delay = delay;
 	uffd_desc->handler = handler;
-	pthread_create(&uffd_desc->thread, NULL, uffd_handler_thread_fn,
-		       uffd_desc);
 
-	PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n",
-		       hva, hva + len);
+	for (i = 0; i < uffd_desc->num_readers; ++i) {
+		int pipes[2];
+
+		ret = pipe2((int *) &pipes, O_CLOEXEC | O_NONBLOCK);
+		TEST_ASSERT(!ret, "Failed to set up pipefd %i for uffd_desc %p",
+					i, uffd_desc);
+
+		uffd_desc->pipefds[i] = pipes[1];
+
+		uffd_desc->reader_args[i].uffd_mode = uffd_mode;
+		uffd_desc->reader_args[i].uffd = uffd;
+		uffd_desc->reader_args[i].delay = delay;
+		uffd_desc->reader_args[i].handler = handler;
+		uffd_desc->reader_args[i].pipe = pipes[0];
+
+		pthread_create(&uffd_desc->readers[i], NULL, uffd_handler_thread_fn,
+					   &uffd_desc->reader_args[i]);
+
+		PER_VCPU_DEBUG("Created uffd thread %i for HVA range [%p, %p)\n",
+					   i, hva, hva + len);
+	}
 
 	return uffd_desc;
 }
@@ -167,19 +191,31 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 void uffd_stop_demand_paging(struct uffd_desc *uffd)
 {
 	char c = 0;
-	int ret;
+	int i, ret;
 
-	ret = write(uffd->pipefds[1], &c, 1);
-	TEST_ASSERT(ret == 1, "Unable to write to pipefd");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = write(uffd->pipefds[i], &c, 1);
+		TEST_ASSERT(
+			ret == 1, "Unable to write to pipefd %i for uffd_desc %p", i, uffd);
+	}
 
-	ret = pthread_join(uffd->thread, NULL);
-	TEST_ASSERT(ret == 0, "Pthread_join failed.");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = pthread_join(uffd->readers[i], NULL);
+		TEST_ASSERT(
+			ret == 0,
+			"Pthread_join failed on reader thread %i for uffd_desc %p", i, uffd);
+	}
 
 	close(uffd->uffd);
 
-	close(uffd->pipefds[1]);
-	close(uffd->pipefds[0]);
+	for (i = 0; i < uffd->num_readers; ++i) {
+		close(uffd->pipefds[i]);
+		close(uffd->reader_args[i].pipe);
+	}
 
+	free(uffd->pipefds);
+	free(uffd->readers);
+	free(uffd->reader_args);
 	free(uffd);
 }
 
-- 
2.39.1.581.gbfd45094c4-goog



* [PATCH 3/8] selftests/kvm: Switch demand paging uffd readers to epoll
  2023-02-15  1:16 [PATCH 0/8] Add memory fault exits to avoid slow GUP Anish Moorthy
  2023-02-15  1:16 ` [PATCH 1/8] selftests/kvm: Fix bug in how demand_paging_test calculates paging rate Anish Moorthy
  2023-02-15  1:16 ` [PATCH 2/8] selftests/kvm: Allow many vcpus per UFFD in demand paging test Anish Moorthy
@ 2023-02-15  1:16 ` Anish Moorthy
  2023-02-15  1:16 ` [PATCH 4/8] kvm: Allow hva_pfn_fast to resolve read-only faults Anish Moorthy
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15  1:16 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier
  Cc: Oliver Upton, Sean Christopherson, James Houghton, Anish Moorthy,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

With multiple reader threads for each UFFD, the test suffers from the
thundering herd problem: performance degrades as the number of reader
threads is increased. Switching the readers over to EPOLL (which offers
EPOLLEXCLUSIVE) solves this problem, although base-case performance does
suffer significantly.

This commit also changes the error-handling convention of
uffd_handler_thread_fn: instead of checking for errors and returning
when they're found, the function now uses TEST_ASSERT, so "return NULL"
indicates a successful exit (i.e., one triggered by a write to the
corresponding pipe).

Performance samples are given below, generated by the command in [1].

Num Reader Threads, Paging Rate (POLL), Paging Rate (EPOLL)
1      249k      185k
2      201k      235k
4      186k      155k
16     150k      217k
32     89k       198k

[1] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <n>
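The EPOLLEXCLUSIVE registration at the heart of this change can be
demonstrated standalone. A sketch, with a pipe standing in for the
userfaultfd (names here are illustrative, not from the selftest):

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Minimal sketch of the reader-thread setup: register an fd with
 * EPOLLIN | EPOLLEXCLUSIVE so that, when many threads block in
 * epoll_wait() on epoll instances watching the same target fd, the
 * kernel wakes only one (or a few) of them per event instead of the
 * whole herd. Returns 0 if the event round-trips as expected.
 */
int demo_epollexclusive(void)
{
	int pipefds[2];
	int epollfd;
	struct epoll_event evt = { .events = EPOLLIN | EPOLLEXCLUSIVE };

	if (pipe(pipefds))
		return -1;
	epollfd = epoll_create1(0);
	if (epollfd < 0)
		return -1;
	/* EPOLLEXCLUSIVE is only valid with EPOLL_CTL_ADD. */
	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, pipefds[0], &evt))
		return -1;

	if (write(pipefds[1], "x", 1) != 1)
		return -1;
	if (epoll_wait(epollfd, &evt, 1, /*timeout_ms=*/100) != 1)
		return -1;

	close(epollfd);
	close(pipefds[0]);
	close(pipefds[1]);
	return (evt.events & EPOLLIN) ? 0 : -1;
}
```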

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        |  1 -
 .../selftests/kvm/lib/userfaultfd_util.c      | 76 +++++++++----------
 2 files changed, 37 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 3c1d5b81c9822..34d5ba2044a2c 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -13,7 +13,6 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
-#include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
 #include <sys/syscall.h>
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 2723ee1e3e1b2..863840d340105 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -16,6 +16,7 @@
 #include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <sys/epoll.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -32,60 +33,56 @@ static void *uffd_handler_thread_fn(void *arg)
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
+	int epollfd;
+	struct epoll_event evt;
+
+	epollfd = epoll_create(1);
+	TEST_ASSERT(epollfd >= 0, "Failed to create epollfd.");
+
+	evt.events = EPOLLIN | EPOLLEXCLUSIVE;
+	evt.data.u32 = 0;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, uffd, &evt) == 0,
+				"Failed to add uffd to epollfd");
+
+	evt.events = EPOLLIN;
+	evt.data.u32 = 1;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, reader_args->pipe, &evt) == 0,
+				"Failed to add pipe to epollfd");
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
 	while (1) {
 		struct uffd_msg msg;
-		struct pollfd pollfd[2];
-		char tmp_chr;
 		int r;
 
-		pollfd[0].fd = uffd;
-		pollfd[0].events = POLLIN;
-		pollfd[1].fd = reader_args->pipe;
-		pollfd[1].events = POLLIN;
-
-		r = poll(pollfd, 2, -1);
-		switch (r) {
-		case -1:
-			pr_info("poll err");
-			continue;
-		case 0:
-			continue;
-		case 1:
-			break;
-		default:
-			pr_info("Polling uffd returned %d", r);
-			return NULL;
-		}
+		r = epoll_wait(epollfd, &evt, 1, -1);
+		TEST_ASSERT(
+			r == 1,
+			"Unexpected number of events (%d) returned by epoll, errno = %d",
+			r, errno);
 
-		if (pollfd[0].revents & POLLERR) {
-			pr_info("uffd revents has POLLERR");
-			return NULL;
-		}
+		if (evt.data.u32 == 1) {
+			char tmp_chr;
 
-		if (pollfd[1].revents & POLLIN) {
-			r = read(pollfd[1].fd, &tmp_chr, 1);
+			TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+						"Reader thread received EPOLLERR or EPOLLHUP on pipe.");
+			r = read(reader_args->pipe, &tmp_chr, 1);
 			TEST_ASSERT(r == 1,
-				    "Error reading pipefd in UFFD thread\n");
+						"Error reading pipefd in uffd reader thread");
 			return NULL;
 		}
 
-		if (!(pollfd[0].revents & POLLIN))
-			continue;
+		TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+					"Reader thread received EPOLLERR or EPOLLHUP on uffd.");
 
 		r = read(uffd, &msg, sizeof(msg));
 		if (r == -1) {
-			if (errno == EAGAIN)
-				continue;
-			pr_info("Read of uffd got errno %d\n", errno);
-			return NULL;
+			TEST_ASSERT(errno == EAGAIN,
+						"Error reading from UFFD: errno = %d", errno);
+			continue;
 		}
 
-		if (r != sizeof(msg)) {
-			pr_info("Read on uffd returned unexpected size: %d bytes", r);
-			return NULL;
-		}
+		TEST_ASSERT(r == sizeof(msg),
+					"Read on uffd returned unexpected number of bytes (%d)", r);
 
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
@@ -93,8 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (reader_args->delay)
 			usleep(reader_args->delay);
 		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
-		if (r < 0)
-			return NULL;
+		TEST_ASSERT(
+			r >= 0,
+			"Reader thread handler function returned negative value %d", r);
 		pages++;
 	}
 
-- 
2.39.1.581.gbfd45094c4-goog



* [PATCH 4/8] kvm: Allow hva_pfn_fast to resolve read-only faults.
  2023-02-15  1:16 [PATCH 0/8] Add memory fault exits to avoid slow GUP Anish Moorthy
                   ` (2 preceding siblings ...)
  2023-02-15  1:16 ` [PATCH 3/8] selftests/kvm: Switch demand paging uffd readers to epoll Anish Moorthy
@ 2023-02-15  1:16 ` Anish Moorthy
  2023-02-15  9:01   ` Oliver Upton
  2023-02-15  1:16 ` [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits Anish Moorthy
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15  1:16 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier
  Cc: Oliver Upton, Sean Christopherson, James Houghton, Anish Moorthy,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

The upcoming mem_fault_nowait commits will make it so that, when the
relevant cap is enabled, hva_to_pfn will return after calling
hva_to_pfn_fast without ever attempting to pin memory via
hva_to_pfn_slow.

hva_to_pfn_fast currently just fails for read-only faults. However,
there doesn't seem to be a reason that we can't just try pinning the
page without FOLL_WRITE instead of immediately falling back to slow-GUP.
This commit implements that behavior.
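The resulting flag selection reduces to a small helper. A sketch using
a mock FOLL_WRITE value (the real constant lives in the kernel's mm
headers, and the helper name is illustrative):

```c
#include <stdbool.h>

#define MOCK_FOLL_WRITE 0x1	/* stand-in for the kernel's FOLL_WRITE */

/*
 * Sketch of the flag selection hva_to_pfn_fast now performs: request a
 * writable pin only when the fault is a write, or when the caller is
 * willing to accept a writable pfn for a read fault; otherwise attempt
 * a read-only fast pin instead of failing outright.
 */
static unsigned int fast_gup_flags(bool write_fault, bool caller_allows_writable)
{
	return (write_fault || caller_allows_writable) ? MOCK_FOLL_WRITE : 0;
}
```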

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331e..dae5f48151032 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2479,7 +2479,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 }
 
 /*
- * The fast path to get the writable pfn which will be stored in @pfn,
+ * The fast path to get the pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
  * only part that runs if we can in atomic context.
  */
@@ -2487,16 +2487,18 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 			    bool *writable, kvm_pfn_t *pfn)
 {
 	struct page *page[1];
+	bool found_by_fast_gup =
+		get_user_page_fast_only(
+			addr,
+			/*
+			 * Fast pin a writable pfn only if it is a write fault request
+			 * or the caller allows to map a writable pfn for a read fault
+			 * request.
+			 */
+			(write_fault || writable) ? FOLL_WRITE : 0,
+			page);
 
-	/*
-	 * Fast pin a writable pfn only if it is a write fault request
-	 * or the caller allows to map a writable pfn for a read fault
-	 * request.
-	 */
-	if (!(write_fault || writable))
-		return false;
-
-	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+	if (found_by_fast_gup) {
 		*pfn = page_to_pfn(page[0]);
 
 		if (writable)
-- 
2.39.1.581.gbfd45094c4-goog



* [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-15  1:16 [PATCH 0/8] Add memory fault exits to avoid slow GUP Anish Moorthy
                   ` (3 preceding siblings ...)
  2023-02-15  1:16 ` [PATCH 4/8] kvm: Allow hva_pfn_fast to resolve read-only faults Anish Moorthy
@ 2023-02-15  1:16 ` Anish Moorthy
  2023-02-15  8:41   ` Marc Zyngier
  2023-02-15  8:59   ` Oliver Upton
  2023-02-15  1:16 ` [PATCH 6/8] kvm/x86: Add mem fault exit on EPT violations Anish Moorthy
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15  1:16 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier
  Cc: Oliver Upton, Sean Christopherson, James Houghton, Anish Moorthy,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

This new KVM exit allows userspace to handle missing memory. It
indicates that the pages in the range [gpa, gpa + size) must be mapped.

The "flags" field actually goes unused in this series: it's included for
forward compatibility with [1], should this series happen to go in
first.

[1] https://lore.kernel.org/all/CA+EHjTyzZ2n8kQxH_Qx72aRq1k+dETJXTsoOM3tggPZAZkYbCA@mail.gmail.com/
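For illustration, a userspace consumer of the new exit might look like
the following standalone sketch. The exit-reason constant and struct
here are mocks mirroring the uapi this patch adds, not real
<linux/kvm.h> definitions:

```c
#include <stdint.h>

#define MOCK_KVM_EXIT_MEMORY_FAULT 39	/* illustrative value only */

/* Mirrors the kvm_run union member added by this patch (mocked here). */
struct mock_kvm_run {
	uint32_t exit_reason;
	struct {
		uint64_t gpa;
		uint64_t size;
	} memory_fault;
};

/*
 * Returns the number of guest pages the VMM must map before
 * re-entering the vCPU, or 0 if the exit is not a memory fault.
 */
static uint64_t pages_to_populate(const struct mock_kvm_run *run,
				  uint64_t page_size)
{
	if (run->exit_reason != MOCK_KVM_EXIT_MEMORY_FAULT)
		return 0;
	return (run->memory_fault.size + page_size - 1) / page_size;
}
```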

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 Documentation/virt/kvm/api.rst | 42 ++++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h       | 13 +++++++++++
 include/uapi/linux/kvm.h       | 13 ++++++++++-
 tools/include/uapi/linux/kvm.h |  7 ++++++
 virt/kvm/kvm_main.c            | 26 +++++++++++++++++++++
 5 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9807b05a1b571..4b06e60668686 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5937,6 +5937,18 @@ delivery must be provided via the "reg_aen" struct.
 The "pad" and "reserved" fields may be used for future extensions and should be
 set to 0s by userspace.
 
+4.137 KVM_SET_MEM_FAULT_NOWAIT
+------------------------------
+
+:Capability: KVM_CAP_MEM_FAULT_NOWAIT
+:Architectures: x86, arm64
+:Type: vm ioctl
+:Parameters: bool state (in)
+:Returns: 0 on success, or -1 if KVM_CAP_MEM_FAULT_NOWAIT is not present.
+
+Enables (state=true) or disables (state=false) waitless memory faults. For more
+information, see the documentation of KVM_CAP_MEM_FAULT_NOWAIT.
+
 5. The kvm_run structure
 ========================
 
@@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
+
+If the exit reason is KVM_EXIT_MEMORY_FAULT, the VCPU has encountered a
+memory error which is not handled by the KVM kernel module and which
+userspace may choose to handle.
+
+'gpa' and 'size' indicate the memory range in which the error occurred.
+Userspace may handle the error and return to KVM to retry the memory access.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
@@ -7577,6 +7604,21 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEM_FAULT_NOWAIT
+-----------------------------
+
+:Architectures: x86, arm64
+:Target: VM
+:Parameters: None
+:Returns: 0 on success, or -EINVAL if capability is not supported.
+
+The presence of this capability indicates that userspace can enable/disable
+waitless memory faults through the KVM_SET_MEM_FAULT_NOWAIT ioctl.
+
+When waitless memory faults are enabled, fast get_user_pages failures when
+handling EPT/Shadow Page Table violations will cause a vCPU exit
+(KVM_EXIT_MEMORY_FAULT) instead of a fallback to slow get_user_pages.
+
 8. Other capabilities.
 ======================
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 109b18e2789c4..9352e7f8480fb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -801,6 +801,9 @@ struct kvm {
 	bool vm_bugged;
 	bool vm_dead;
 
+	rwlock_t mem_fault_nowait_lock;
+	bool mem_fault_nowait;
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
 #endif
@@ -2278,4 +2281,14 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+static inline bool memory_faults_enabled(struct kvm *kvm)
+{
+	bool ret;
+
+	read_lock(&kvm->mem_fault_nowait_lock);
+	ret = kvm->mem_fault_nowait;
+	read_unlock(&kvm->mem_fault_nowait_lock);
+	return ret;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 55155e262646e..064fbfed97f01 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -505,6 +506,12 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1175,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
+#define KVM_CAP_MEM_FAULT_NOWAIT 226
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1658,7 +1666,7 @@ struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
 
-/* Available with  KVM_CAP_S390_VCPU_RESETS */
+/* Available with KVM_CAP_S390_VCPU_RESETS */
 #define KVM_S390_NORMAL_RESET	_IO(KVMIO,   0xc3)
 #define KVM_S390_CLEAR_RESET	_IO(KVMIO,   0xc4)
 
@@ -2228,4 +2236,7 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* Available with KVM_CAP_MEM_FAULT_NOWAIT */
+#define KVM_SET_MEM_FAULT_NOWAIT _IOWR(KVMIO, 0xd2, bool)
+
 #endif /* __LINUX_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 20522d4ba1e0d..5d9e3f48a9634 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -505,6 +506,12 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dae5f48151032..8e5bfc00d1181 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1149,6 +1149,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	INIT_LIST_HEAD(&kvm->devices);
 	kvm->max_vcpus = KVM_MAX_VCPUS;
 
+	rwlock_init(&kvm->mem_fault_nowait_lock);
+	kvm->mem_fault_nowait = false;
+
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
 
 	/*
@@ -2313,6 +2316,16 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
+static int kvm_vm_ioctl_set_mem_fault_nowait(struct kvm *kvm, bool state)
+{
+	if (!kvm_vm_ioctl_check_extension(kvm, KVM_CAP_MEM_FAULT_NOWAIT))
+		return -1;
+	write_lock(&kvm->mem_fault_nowait_lock);
+	kvm->mem_fault_nowait = state;
+	write_unlock(&kvm->mem_fault_nowait_lock);
+	return 0;
+}
+
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
 	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
@@ -4675,6 +4688,10 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 
 		return r;
 	}
+	case KVM_CAP_MEM_FAULT_NOWAIT:
+		if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap))
+			return -EINVAL;
+		return 0;
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
@@ -4892,6 +4909,15 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_SET_MEM_FAULT_NOWAIT: {
+		bool state;
+
+		r = -EFAULT;
+		if (copy_from_user(&state, argp, sizeof(state)))
+			goto out;
+		r = kvm_vm_ioctl_set_mem_fault_nowait(kvm, state);
+		break;
+	}
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
-- 
2.39.1.581.gbfd45094c4-goog



* [PATCH 6/8] kvm/x86: Add mem fault exit on EPT violations
  2023-02-15  1:16 [PATCH 0/8] Add memory fault exits to avoid slow GUP Anish Moorthy
                   ` (4 preceding siblings ...)
  2023-02-15  1:16 ` [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits Anish Moorthy
@ 2023-02-15  1:16 ` Anish Moorthy
  2023-02-15 17:23   ` Sean Christopherson
  2023-02-15  1:16 ` [PATCH 7/8] kvm/arm64: Implement KVM_CAP_MEM_FAULT_NOWAIT for arm64 Anish Moorthy
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15  1:16 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier
  Cc: Oliver Upton, Sean Christopherson, James Houghton, Anish Moorthy,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

With the relevant KVM cap enabled, EPT violations will exit to userspace
with reason KVM_EXIT_MEMORY_FAULT instead of resolving the fault via slow
get_user_pages.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Suggested-by: James Houghton <jthoughton@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 23 ++++++++++++++++++++---
 arch/x86/kvm/x86.c     |  1 +
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index aeb240b339f54..28af8d60adee6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4201,6 +4201,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
+	bool mem_fault_nowait;
 
 	/*
 	 * Retry the page fault if the gfn hit a memslot that is being deleted
@@ -4230,9 +4231,25 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	}
 
 	async = false;
-	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
-					  fault->write, &fault->map_writable,
-					  &fault->hva);
+	mem_fault_nowait = memory_faults_enabled(vcpu->kvm);
+
+	fault->pfn = __gfn_to_pfn_memslot(
+		slot, fault->gfn,
+		mem_fault_nowait,
+		false,
+		mem_fault_nowait ? NULL : &async,
+		fault->write, &fault->map_writable,
+		&fault->hva);
+
+	if (mem_fault_nowait) {
+		if (fault->pfn == KVM_PFN_ERR_FAULT) {
+			vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+			vcpu->run->memory_fault.gpa = fault->gfn << PAGE_SHIFT;
+			vcpu->run->memory_fault.size = PAGE_SIZE;
+		}
+		return RET_PF_CONTINUE;
+	}
+
 	if (!async)
 		return RET_PF_CONTINUE; /* *pfn has correct page already */
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 508074e47bc0e..fe39ab2af5db4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4427,6 +4427,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VAPIC:
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
+	case KVM_CAP_MEM_FAULT_NOWAIT:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
-- 
2.39.1.581.gbfd45094c4-goog



* [PATCH 7/8] kvm/arm64: Implement KVM_CAP_MEM_FAULT_NOWAIT for arm64
  2023-02-15  1:16 [PATCH 0/8] Add memory fault exits to avoid slow GUP Anish Moorthy
                   ` (5 preceding siblings ...)
  2023-02-15  1:16 ` [PATCH 6/8] kvm/x86: Add mem fault exit on EPT violations Anish Moorthy
@ 2023-02-15  1:16 ` Anish Moorthy
  2023-02-15 18:24   ` Oliver Upton
  2023-02-15  1:16 ` [PATCH 8/8] selftests/kvm: Handle mem fault exits in demand paging test Anish Moorthy
  2023-02-15  1:46 ` [PATCH 0/8] Add memory fault exits to avoid slow GUP James Houghton
  8 siblings, 1 reply; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15  1:16 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier
  Cc: Oliver Upton, Sean Christopherson, James Houghton, Anish Moorthy,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

Just do an atomic gfn_to_pfn_memslot() when the cap is enabled. Since we
don't have to deal with async page faults, the implementation is even
simpler than on x86.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 arch/arm64/kvm/arm.c |  1 +
 arch/arm64/kvm/mmu.c | 14 ++++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 698787ed87e92..31bec7866c346 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -220,6 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VCPU_ATTRIBUTES:
 	case KVM_CAP_PTP_KVM:
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
+	case KVM_CAP_MEM_FAULT_NOWAIT:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 01352f5838a00..964af7cd5f1c8 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1206,6 +1206,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	unsigned long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
+	bool mem_fault_nowait;
 
 	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
 	write_fault = kvm_is_write_fault(vcpu);
@@ -1301,8 +1302,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 */
 	smp_rmb();
 
-	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-				   write_fault, &writable, NULL);
+	mem_fault_nowait = memory_faults_enabled(vcpu->kvm);
+	pfn = __gfn_to_pfn_memslot(
+		memslot, gfn, mem_fault_nowait, false, NULL,
+		write_fault, &writable, NULL);
+
+	if (mem_fault_nowait && pfn == KVM_PFN_ERR_FAULT) {
+		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+		vcpu->run->memory_fault.gpa = gfn << PAGE_SHIFT;
+		vcpu->run->memory_fault.size = vma_pagesize;
+		return -EFAULT;
+	}
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
-- 
2.39.1.581.gbfd45094c4-goog



* [PATCH 8/8] selftests/kvm: Handle mem fault exits in demand paging test
  2023-02-15  1:16 [PATCH 0/8] Add memory fault exits to avoid slow GUP Anish Moorthy
                   ` (6 preceding siblings ...)
  2023-02-15  1:16 ` [PATCH 7/8] kvm/arm64: Implement KVM_CAP_MEM_FAULT_NOWAIT for arm64 Anish Moorthy
@ 2023-02-15  1:16 ` Anish Moorthy
  2023-02-15  1:46 ` [PATCH 0/8] Add memory fault exits to avoid slow GUP James Houghton
  8 siblings, 0 replies; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15  1:16 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier
  Cc: Oliver Upton, Sean Christopherson, James Houghton, Anish Moorthy,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

When a memory fault exit is received for a GFN which has not yet been
UFFDIO_COPY/CONTINUEd, the corresponding call is issued (and any EEXIST is
ignored). Otherwise, the pages are faulted in via MADV_POPULATE_WRITE.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        | 206 +++++++++++++-----
 1 file changed, 151 insertions(+), 55 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 34d5ba2044a2c..ff731aef52fd0 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -15,6 +15,7 @@
 #include <time.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <sys/mman.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -31,6 +32,60 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
 static size_t demand_paging_size;
 static char *guest_data_prototype;
 
+static int num_uffds;
+static size_t uffd_region_size;
+static struct uffd_desc **uffd_descs;
+/*
+ * Delay when demand paging is performed through userfaultfd or directly by
+ * vcpu_worker in the case of a KVM_EXIT_MEMORY_FAULT.
+ */
+static useconds_t uffd_delay;
+static int uffd_mode;
+
+
+static int handle_uffd_page_request(
+	int uffd_mode, int uffd, uint64_t hva, bool is_vcpu
+);
+
+static void madv_write_or_err(uint64_t gpa)
+{
+	int r;
+	void *hva = addr_gpa2hva(memstress_args.vm, gpa);
+
+	r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
+	TEST_ASSERT(
+		r == 0,
+		"MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) failed with errno %i\n",
+		(uintptr_t) hva, gpa, errno);
+}
+
+static void ready_page(uint64_t gpa)
+{
+	int r, uffd;
+
+	/*
+	 * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
+	 * the registered ranges should fault in the physical pages through
+	 * MADV_POPULATE_WRITE.
+	 */
+	if ((gpa < memstress_args.gpa)
+		|| (gpa >= memstress_args.gpa + memstress_args.size)) {
+		madv_write_or_err(gpa);
+	} else {
+		if (uffd_delay)
+			usleep(uffd_delay);
+
+		uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
+
+		r = handle_uffd_page_request(
+			uffd_mode, uffd,
+			(uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
+
+		if (r == EEXIST)
+			madv_write_or_err(gpa);
+	}
+}
+
 static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 {
 	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
@@ -42,25 +97,34 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
 
-	/* Let the guest access its memory */
-	ret = _vcpu_run(vcpu);
-	TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-	if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
-		TEST_ASSERT(false,
-			    "Invalid guest sync status: exit_reason=%s\n",
-			    exit_reason_str(run->exit_reason));
-	}
+	while (true) {
+		/* Let the guest access its memory */
+		ret = _vcpu_run(vcpu);
+		TEST_ASSERT(ret == 0 || (run->exit_reason == KVM_EXIT_MEMORY_FAULT),
+					"vcpu_run failed: %d\n", ret);
+		if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
+
+			if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
+				ready_page(run->memory_fault.gpa);
+				continue;
+			}
+
+			TEST_ASSERT(false,
+					"Invalid guest sync status: exit_reason=%s\n",
+					exit_reason_str(run->exit_reason));
+		}
 
-	ts_diff = timespec_elapsed(start);
-	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
-		       ts_diff.tv_sec, ts_diff.tv_nsec);
+		ts_diff = timespec_elapsed(start);
+		PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
+				ts_diff.tv_sec, ts_diff.tv_nsec);
+		break;
+	}
 }
 
-static int handle_uffd_page_request(int uffd_mode, int uffd,
-									struct uffd_msg *msg)
+static int handle_uffd_page_request(
+	int uffd_mode, int uffd, uint64_t hva, bool is_vcpu)
 {
 	pid_t tid = syscall(__NR_gettid);
-	uint64_t addr = msg->arg.pagefault.address;
 	struct timespec start;
 	struct timespec ts_diff;
 	int r;
@@ -71,58 +135,81 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		struct uffdio_copy copy;
 
 		copy.src = (uint64_t)guest_data_prototype;
-		copy.dst = addr;
+		copy.dst = hva;
 		copy.len = demand_paging_size;
-		copy.mode = 0;
+		copy.mode = UFFDIO_COPY_MODE_DONTWAKE;
 
-		r = ioctl(uffd, UFFDIO_COPY, &copy);
 		/*
-		 * With multiple vCPU threads fault on a single page and there are
-		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
-		 * will fail with EEXIST: handle that case without signaling an
-		 * error.
+		 * With multiple vCPU threads and at least one of multiple reader threads
+		 * or vCPU memory faults, multiple vCPUs accessing an absent page will
+		 * almost certainly cause some thread doing the UFFDIO_COPY here to get
+		 * EEXIST: make sure to allow that case.
 		 */
-		if (r == -1 && errno != EEXIST) {
-			pr_info(
-				"Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		r = ioctl(uffd, UFFDIO_COPY, &copy);
+		TEST_ASSERT(
+			r == 0 || errno == EEXIST,
+			"Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
+			gettid(), hva, errno);
 	} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+		/* The comments in the UFFDIO_COPY branch also apply here. */
 		struct uffdio_continue cont = {0};
 
-		cont.range.start = addr;
+		cont.range.start = hva;
 		cont.range.len = demand_paging_size;
+		cont.mode = UFFDIO_CONTINUE_MODE_DONTWAKE;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
-		/* See the note about EEXISTs in the UFFDIO_COPY branch. */
-		if (r == -1 && errno != EEXIST) {
-			pr_info(
-				"Failed UFFDIO_CONTINUE in 0x%lx from thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		TEST_ASSERT(
+			r == 0 || errno == EEXIST,
+			"Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
+			gettid(), hva, errno);
 	} else {
 		TEST_FAIL("Invalid uffd mode %d", uffd_mode);
 	}
 
+	/*
+	 * If the above UFFDIO_COPY/CONTINUE fails with EEXIST, it will do so without
+	 * waking threads waiting on the UFFD: make sure that happens here.
+	 */
+	if (!is_vcpu) {
+		struct uffdio_range range = {
+			.start = hva,
+			.len = demand_paging_size
+		};
+		r = ioctl(uffd, UFFDIO_WAKE, &range);
+		TEST_ASSERT(
+			r == 0,
+			"Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
+			gettid(), hva, errno);
+	}
+
 	ts_diff = timespec_elapsed(start);
 
 	PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
 		       timespec_to_ns(ts_diff));
 	PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
-		       demand_paging_size, addr, tid);
+		       demand_paging_size, hva, tid);
 
 	return 0;
 }
 
+static int handle_uffd_page_request_from_uffd(
+	int uffd_mode, int uffd, struct uffd_msg *msg)
+{
+	TEST_ASSERT(
+		msg->event == UFFD_EVENT_PAGEFAULT,
+		"Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
+		msg->event);
+	return handle_uffd_page_request(
+		uffd_mode, uffd, msg->arg.pagefault.address, false);
+}
+
 struct test_params {
-	int uffd_mode;
 	bool single_uffd;
-	useconds_t uffd_delay;
 	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
+	bool memfault_exits;
 };
 
 static void prefault_mem(void *alias, uint64_t len)
@@ -139,12 +226,10 @@ static void prefault_mem(void *alias, uint64_t len)
 static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct test_params *p = arg;
-	struct uffd_desc **uffd_descs = NULL;
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i, num_uffds = 0;
-	uint64_t uffd_region_size;
+	int i;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -156,12 +241,18 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		    "Failed to allocate buffer for guest data pattern");
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		num_uffds = p->single_uffd ? 1 : nr_vcpus;
 		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
 
 		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
-		TEST_ASSERT(uffd_descs, "Memory allocation failed");
+		TEST_ASSERT(uffd_descs, "Failed to allocate memory for uffd descriptors");
+
+		if (p->memfault_exits) {
+			TEST_ASSERT(vm_check_cap(vm, KVM_CAP_MEM_FAULT_NOWAIT) > 0,
+						"VM does not have KVM_CAP_MEM_FAULT_NOWAIT");
+			vm_ioctl(vm, KVM_SET_MEM_FAULT_NOWAIT, &p->memfault_exits);
+		}
 
 		for (i = 0; i < num_uffds; i++) {
 			struct memstress_vcpu_args *vcpu_args;
@@ -181,10 +272,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 * requests.
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
-				p->uffd_mode, p->uffd_delay, vcpu_hva,
+				uffd_mode, uffd_delay, vcpu_hva,
 				uffd_region_size,
 				p->readers_per_uffd,
-				&handle_uffd_page_request);
+				&handle_uffd_page_request_from_uffd);
 		}
 	}
 
@@ -198,7 +289,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	ts_diff = timespec_elapsed(start);
 	pr_info("All vCPU threads joined\n");
 
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
 		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
@@ -213,7 +304,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memstress_destroy_vm(vm);
 
 	free(guest_data_prototype);
-	if (p->uffd_mode)
+	if (uffd_mode)
 		free(uffd_descs);
 }
 
@@ -222,7 +313,7 @@ static void help(char *name)
 	puts("");
 	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
 		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
-		   "          [-s type] [-v vcpus] [-o]\n", name);
+		   "          [-w] [-s type] [-v vcpus] [-o]\n", name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
@@ -233,6 +324,7 @@ static void help(char *name)
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
 	printf(" -r: Set the number of reader threads per uffd.\n");
+	printf(" -w: Enable kvm cap for memory fault exits.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -252,29 +344,30 @@ int main(int argc, char *argv[])
 		.partition_vcpu_memory_access = true,
 		.readers_per_uffd = 1,
 		.single_uffd = false,
+		.memfault_exits = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
+	while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
 			break;
 		case 'u':
 			if (!strcmp("MISSING", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
+				uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
 			else if (!strcmp("MINOR", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
-			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
+				uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
+			TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
 		case 'a':
 			p.single_uffd = true;
 			break;
 		case 'd':
-			p.uffd_delay = strtoul(optarg, NULL, 0);
-			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
+			uffd_delay = strtoul(optarg, NULL, 0);
+			TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
 			break;
 		case 'b':
 			guest_percpu_mem_size = parse_size(optarg);
@@ -297,6 +390,9 @@ int main(int argc, char *argv[])
 						"Invalid number of readers per uffd %d: must be >=1",
 						p.readers_per_uffd);
 			break;
+		case 'w':
+			p.memfault_exits = true;
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
@@ -304,7 +400,7 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
+	if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
 	    !backing_src_is_shared(p.src_type)) {
 		TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
 	}
-- 
2.39.1.581.gbfd45094c4-goog



* Re: [PATCH 0/8] Add memory fault exits to avoid slow GUP
  2023-02-15  1:16 [PATCH 0/8] Add memory fault exits to avoid slow GUP Anish Moorthy
                   ` (7 preceding siblings ...)
  2023-02-15  1:16 ` [PATCH 8/8] selftests/kvm: Handle mem fault exits in demand paging test Anish Moorthy
@ 2023-02-15  1:46 ` James Houghton
  8 siblings, 0 replies; 35+ messages in thread
From: James Houghton @ 2023-02-15  1:46 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Sean Christopherson,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Tue, Feb 14, 2023 at 5:16 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> This series improves scalability with userfaultfd-based postcopy live
> migration. It implements the no-slow-gup approach which James Houghton
> described in his earlier RFC ([1]). The new cap
> KVM_CAP_MEM_FAULT_NOWAIT, is introduced, which causes KVM to exit to
> userspace if fast get_user_pages (GUP) fails while resolving a page
> fault. The motivation is to allow (most) EPT violations to be resolved
> without going through userfaultfd, which involves serializing faults on
> internal locks: see [1] for more details.

To clarify a little bit here:

One big question: Why do we need a new KVM CAP? Couldn't we just use
UFFD_FEATURE_SIGBUS?

The original RFC thread[1] addresses this question, but to reiterate
here: the difference comes down to non-vCPU guest memory accesses,
like if KVM needs to read memory to emulate an instruction. If we use
UFFD_FEATURE_SIGBUS, KVM's copy_{to,from}_user will just fail, and the
VM will probably just die (depending on what exactly KVM was trying to
do). In these cases, we want KVM to sleep in handle_userfault(). Given
that we couldn't just use UFFD_FEATURE_SIGBUS, a new KVM CAP seemed to
be the most natural solution.

> After receiving the new exit, userspace can check if it has previously
> UFFDIO_COPY/CONTINUEd the faulting address- if not, then it knows that
> fast GUP could not possibly have succeeded, and so the fault has to be
> resolved via UFFDIO_COPY/CONTINUE. In these cases a UFFDIO_WAKE is
> unnecessary, as the vCPU thread hasn't been put to sleep waiting on the
> uffd.
>
> If userspace *has* already COPY/CONTINUEd the address, then it must take
> some other action to make fast GUP succeed: such as swapping in the
> page (for instance, via MADV_POPULATE_WRITE for writable mappings).
>
> This feature should only be enabled during userfaultfd postcopy, as it
> prevents the generation of async page faults.
>
> The actual kernel changes to implement the change on arm64/x86 are
> small: most of this series is actually just adding support for the new
> feature in the demand paging self test. Performance samples (rates
> reported in thousands of pages/s, average of five runs each) generated
> using [2] on an x86 machine with 256 cores, are shown below.
>
> vCPUs, Paging Rate (w/o new cap), Paging Rate (w/ new cap)
> 1       150     340
> 2       191     477
> 4       210     809
> 8       155     1239
> 16      130     1595
> 32      108     2299
> 64      86      3482
> 128     62      4134
> 256     36      4012

Thank you, Anish! :)

> [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/


* Re: [PATCH 1/8] selftests/kvm: Fix bug in how demand_paging_test calculates paging rate
  2023-02-15  1:16 ` [PATCH 1/8] selftests/kvm: Fix bug in how demand_paging_test calculates paging rate Anish Moorthy
@ 2023-02-15  7:27   ` Oliver Upton
  2023-02-15 16:44     ` Sean Christopherson
  0 siblings, 1 reply; 35+ messages in thread
From: Oliver Upton @ 2023-02-15  7:27 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Paolo Bonzini, Marc Zyngier, Sean Christopherson, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

Hi Anish,

On Wed, Feb 15, 2023 at 01:16:07AM +0000, Anish Moorthy wrote:
> Currently we're dividing tv_nsec by 1E8, not 1E9.
> 
> Reported-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  tools/testing/selftests/kvm/demand_paging_test.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> index b0e1fc4de9e29..6809184ce2390 100644
> --- a/tools/testing/selftests/kvm/demand_paging_test.c
> +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> @@ -194,7 +194,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>  		ts_diff.tv_sec, ts_diff.tv_nsec);
>  	pr_info("Overall demand paging rate: %f pgs/sec\n",
>  		memstress_args.vcpu_args[0].pages * nr_vcpus /
> -		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 100000000.0));
> +		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 1E9));

Use NSEC_PER_SEC instead so the conversion taking place is immediately
obvious.

-- 
Thanks,
Oliver


* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-15  1:16 ` [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits Anish Moorthy
@ 2023-02-15  8:41   ` Marc Zyngier
  2023-02-15 17:07     ` Sean Christopherson
  2023-02-16 18:53     ` Anish Moorthy
  2023-02-15  8:59   ` Oliver Upton
  1 sibling, 2 replies; 35+ messages in thread
From: Marc Zyngier @ 2023-02-15  8:41 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Paolo Bonzini, Oliver Upton, Sean Christopherson, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, 15 Feb 2023 01:16:11 +0000,
Anish Moorthy <amoorthy@google.com> wrote:
> 
> This new KVM exit allows userspace to handle missing memory. It
> indicates that the pages in the range [gpa, gpa + size) must be mapped.
> 
> The "flags" field actually goes unused in this series: it's included for
> forward compatibility with [1], should this series happen to go in
> first.
> 
> [1] https://lore.kernel.org/all/CA+EHjTyzZ2n8kQxH_Qx72aRq1k+dETJXTsoOM3tggPZAZkYbCA@mail.gmail.com/
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 42 ++++++++++++++++++++++++++++++++++
>  include/linux/kvm_host.h       | 13 +++++++++++
>  include/uapi/linux/kvm.h       | 13 ++++++++++-
>  tools/include/uapi/linux/kvm.h |  7 ++++++
>  virt/kvm/kvm_main.c            | 26 +++++++++++++++++++++
>  5 files changed, 100 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9807b05a1b571..4b06e60668686 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5937,6 +5937,18 @@ delivery must be provided via the "reg_aen" struct.
>  The "pad" and "reserved" fields may be used for future extensions and should be
>  set to 0s by userspace.
>  
> +4.137 KVM_SET_MEM_FAULT_NOWAIT
> +------------------------------
> +
> +:Capability: KVM_CAP_MEM_FAULT_NOWAIT
> +:Architectures: x86, arm64
> +:Type: vm ioctl
> +:Parameters: bool state (in)

Huh. See towards the end of this patch why I think this is a pretty
bad idea.

> +:Returns: 0 on success, or -1 if KVM_CAP_MEM_FAULT_NOWAIT is not present.

Please pick a meaningful error code instead of -1. And if you intended
this as -EPERM, please explain the rationale (-EINVAL would seem more
appropriate).

> +
> +Enables (state=true) or disables (state=false) waitless memory faults. For more
> +information, see the documentation of KVM_CAP_MEM_FAULT_NOWAIT.
> +
>  5. The kvm_run structure
>  ========================
>  
> @@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
> +
> +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> +encountered a memory error which is not handled by KVM kernel module and
> +which userspace may choose to handle.

No, the vcpu hasn't "encountered a memory error". The hypervisor has
taken a page fault, which is very different. And it isn't that KVM
couldn't handle it (or we wouldn't have a hypervisor at all). From
what I understand of this series (possibly very little), userspace has
to *ask* for these, and they are delivered in specific circumstances.
Which are?

> +
> +'gpa' and 'size' indicate the memory range the error occurs at. Userspace
> +may handle the error and return to KVM to retry the previous memory access.

What are these *exactly*? In what unit? What guarantees that the
process eventually converges? How is userspace supposed to handle the
fault (which is *not* an error)? Don't you need to communicate other
information, such as the type of fault (read, write, permission or
translation fault...)?

> +
>  ::
>  
>      /* KVM_EXIT_NOTIFY */
> @@ -7577,6 +7604,21 @@ This capability is aimed to mitigate the threat that malicious VMs can
>  cause CPU stuck (due to event windows don't open up) and make the CPU
>  unavailable to host or other VMs.
>  
> +7.34 KVM_CAP_MEM_FAULT_NOWAIT
> +-----------------------------
> +
> +:Architectures: x86, arm64
> +:Target: VM
> +:Parameters: None
> +:Returns: 0 on success, or -EINVAL if capability is not supported.
> +
> +The presence of this capability indicates that userspace can enable/disable
> +waitless memory faults through the KVM_SET_MEM_FAULT_NOWAIT ioctl.
> +
> +When waitless memory faults are enabled, fast get_user_pages failures when
> +handling EPT/Shadow Page Table violations will cause a vCPU exit
> +(KVM_EXIT_MEMORY_FAULT) instead of a fallback to slow get_user_pages.

Do you really expect a random VMM hacker to understand what GUP is?
Also, the x86 verbiage makes zero sense to me. Please write this in a
way that is useful for userspace people, because they are the ones
consuming this documentation. I really can't imagine anyone being able
to write something useful based on this.

> +
>  8. Other capabilities.
>  ======================
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 109b18e2789c4..9352e7f8480fb 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -801,6 +801,9 @@ struct kvm {
>  	bool vm_bugged;
>  	bool vm_dead;
>  
> +	rwlock_t mem_fault_nowait_lock;
> +	bool mem_fault_nowait;

A full-fat rwlock to protect a single bool? What benefits do you
expect from a rwlock? Why is it preferable to an atomic access, or a
simple bitop?
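[Editor's note: a minimal user-space sketch of the lock-free alternative being suggested, written with C11 atomics rather than kernel bitops. The names mirror the patch but this is illustrative, not kernel code.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for struct kvm; only the field under discussion. */
struct fake_kvm {
	atomic_bool mem_fault_nowait;
};

/* A single flag needs no lock: readers just load it... */
static bool memory_faults_enabled(struct fake_kvm *kvm)
{
	return atomic_load_explicit(&kvm->mem_fault_nowait, memory_order_relaxed);
}

/* ...and the writer just stores it. */
static void set_mem_fault_nowait(struct fake_kvm *kvm, bool state)
{
	atomic_store_explicit(&kvm->mem_fault_nowait, state, memory_order_relaxed);
}
```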

> +
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  	struct notifier_block pm_notifier;
>  #endif
> @@ -2278,4 +2281,14 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +static inline bool memory_faults_enabled(struct kvm *kvm)
> +{
> +	bool ret;
> +
> +	read_lock(&kvm->mem_fault_nowait_lock);
> +	ret = kvm->mem_fault_nowait;
> +	read_unlock(&kvm->mem_fault_nowait_lock);
> +	return ret;
> +}
> +
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 55155e262646e..064fbfed97f01 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -505,6 +506,12 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;

Sigh... This doesn't even match the documentation.

>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> @@ -1175,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
>  #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
>  #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
> +#define KVM_CAP_MEM_FAULT_NOWAIT 226
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1658,7 +1666,7 @@ struct kvm_enc_region {
>  /* Available with KVM_CAP_ARM_SVE */
>  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
>  
> -/* Available with  KVM_CAP_S390_VCPU_RESETS */
> +/* Available with KVM_CAP_S390_VCPU_RESETS */

Unrelated change?

>  #define KVM_S390_NORMAL_RESET	_IO(KVMIO,   0xc3)
>  #define KVM_S390_CLEAR_RESET	_IO(KVMIO,   0xc4)
>  
> @@ -2228,4 +2236,7 @@ struct kvm_s390_zpci_op {
>  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>  
> +/* Available with KVM_CAP_MEM_FAULT_NOWAIT */
> +#define KVM_SET_MEM_FAULT_NOWAIT _IOWR(KVMIO, 0xd2, bool)
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 20522d4ba1e0d..5d9e3f48a9634 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -505,6 +506,12 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dae5f48151032..8e5bfc00d1181 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1149,6 +1149,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  	INIT_LIST_HEAD(&kvm->devices);
>  	kvm->max_vcpus = KVM_MAX_VCPUS;
>  
> +	rwlock_init(&kvm->mem_fault_nowait_lock);
> +	kvm->mem_fault_nowait = false;
> +
>  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>  
>  	/*
> @@ -2313,6 +2316,16 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>  
> +static int kvm_vm_ioctl_set_mem_fault_nowait(struct kvm *kvm, bool state)
> +{
> +	if (!kvm_vm_ioctl_check_extension(kvm, KVM_CAP_MEM_FAULT_NOWAIT))
> +		return -1;
> +	write_lock(&kvm->mem_fault_nowait_lock);
> +	kvm->mem_fault_nowait = state;
> +	write_unlock(&kvm->mem_fault_nowait_lock);
> +	return 0;
> +}
> +
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>  {
>  	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -4675,6 +4688,10 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  
>  		return r;
>  	}
> +	case KVM_CAP_MEM_FAULT_NOWAIT:
> +		if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap))
> +			return -EINVAL;
> +		return 0;
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -4892,6 +4909,15 @@ static long kvm_vm_ioctl(struct file *filp,
>  		r = 0;
>  		break;
>  	}
> +	case KVM_SET_MEM_FAULT_NOWAIT: {
> +		bool state;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&state, argp, sizeof(state)))

A copy_from_user for... a bool? Why don't you directly treat argp as
the boolean itself? Or even better, make it a more interesting
quantity so that the API can evolve in the future. As it stands, it is
not extensible, which means it is already dead, if Linux APIs are
anything to go by.
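[Editor's note: one hypothetical way to address the extensibility concern is a flags-plus-reserved struct in place of the bare bool. The layout and names below are invented for illustration; they are not part of the series.]

```c
#include <stdint.h>

/*
 * Hypothetical extensible ioctl argument: a flags word plus reserved
 * space, so future fields can be added without a new ioctl number.
 */
struct kvm_mem_fault_nowait {
	uint64_t flags;        /* bit 0: enable; all other bits must be zero */
	uint64_t reserved[3];  /* must be zero, reserved for future use */
};

#define KVM_MEM_FAULT_NOWAIT_ENABLE (1ULL << 0)
```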

On a related subject: is there a set of patches for any of the main
VMMs making any use of this?

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-15  1:16 ` [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits Anish Moorthy
  2023-02-15  8:41   ` Marc Zyngier
@ 2023-02-15  8:59   ` Oliver Upton
  1 sibling, 0 replies; 35+ messages in thread
From: Oliver Upton @ 2023-02-15  8:59 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Paolo Bonzini, Marc Zyngier, Sean Christopherson, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, Feb 15, 2023 at 01:16:11AM +0000, Anish Moorthy wrote:
> This new KVM exit allows userspace to handle missing memory. It
> indicates that the pages in the range [gpa, gpa + size) must be mapped.
> 
> The "flags" field actually goes unused in this series: it's included for
> forward compatibility with [1], should this series happen to go in
> first.
> 
> [1] https://lore.kernel.org/all/CA+EHjTyzZ2n8kQxH_Qx72aRq1k+dETJXTsoOM3tggPZAZkYbCA@mail.gmail.com/
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 42 ++++++++++++++++++++++++++++++++++
>  include/linux/kvm_host.h       | 13 +++++++++++
>  include/uapi/linux/kvm.h       | 13 ++++++++++-
>  tools/include/uapi/linux/kvm.h |  7 ++++++
>  virt/kvm/kvm_main.c            | 26 +++++++++++++++++++++
>  5 files changed, 100 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9807b05a1b571..4b06e60668686 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5937,6 +5937,18 @@ delivery must be provided via the "reg_aen" struct.
>  The "pad" and "reserved" fields may be used for future extensions and should be
>  set to 0s by userspace.
>  
> +4.137 KVM_SET_MEM_FAULT_NOWAIT
> +------------------------------
> +
> +:Capability: KVM_CAP_MEM_FAULT_NOWAIT
> +:Architectures: x86, arm64
> +:Type: vm ioctl
> +:Parameters: bool state (in)
> +:Returns: 0 on success, or -1 if KVM_CAP_MEM_FAULT_NOWAIT is not present.
> +
> +Enables (state=true) or disables (state=false) waitless memory faults. For more
> +information, see the documentation of KVM_CAP_MEM_FAULT_NOWAIT.

Why not use KVM_ENABLE_CAP instead for this? It's a preexisting UAPI for
toggling KVM behaviors.

>  5. The kvm_run structure
>  ========================
>  
> @@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
> +
> +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> +encountered a memory error which is not handled by KVM kernel module and
> +which userspace may choose to handle.
> +
> +'gpa' and 'size' indicate the memory range the error occurs at. Userspace
> +may handle the error and return to KVM to retry the previous memory access.

How is userspace expected to differentiate the gup_fast() failed exit
from the guest-private memory exit? I don't think flags are a good idea
for this, as it comes with the illusion that both events can happen on a
single exit. In reality, these are mutually exclusive.

A fault type/code would be better here, with the option to add flags at
a later date that could be used to further describe the exit (if
needed).
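[Editor's note: a hypothetical exit layout along the lines suggested — a discriminating code first, with flags reserved for later. All names below are invented for illustration and do not reflect the actual UAPI.]

```c
#include <stdint.h>

/* Which event caused the KVM_EXIT_MEMORY_FAULT exit; mutually exclusive. */
enum memory_fault_code {
	MEMORY_FAULT_GUP_FAST_FAILED = 1,
	MEMORY_FAULT_PRIVATE_MEM     = 2,
};

struct memory_fault_exit {
	uint32_t code;   /* enum memory_fault_code */
	uint32_t flags;  /* reserved: may further describe the exit later */
	uint64_t gpa;    /* start of the faulting guest-physical range */
	uint64_t size;   /* length of the range, in bytes */
};
```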

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/8] kvm: Allow hva_pfn_fast to resolve read-only faults.
  2023-02-15  1:16 ` [PATCH 4/8] kvm: Allow hva_pfn_fast to resolve read-only faults Anish Moorthy
@ 2023-02-15  9:01   ` Oliver Upton
  2023-02-15 17:03     ` Sean Christopherson
  0 siblings, 1 reply; 35+ messages in thread
From: Oliver Upton @ 2023-02-15  9:01 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Paolo Bonzini, Marc Zyngier, Sean Christopherson, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, Feb 15, 2023 at 01:16:10AM +0000, Anish Moorthy wrote:
> The upcoming mem_fault_nowait commits will make it so that, when the
> relevant cap is enabled, hva_to_pfn will return after calling
> hva_to_pfn_fast without ever attempting to pin memory via
> hva_to_pfn_slow.
> 
> hva_to_pfn_fast currently just fails for read-only faults. However,
> there doesn't seem to be a reason that we can't just try pinning the
> page without FOLL_WRITE instead of immediately falling back to slow-GUP.
> This commit implements that behavior.
> 
> Suggested-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  virt/kvm/kvm_main.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d255964ec331e..dae5f48151032 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2479,7 +2479,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>  }
>  
>  /*
> - * The fast path to get the writable pfn which will be stored in @pfn,
> + * The fast path to get the pfn which will be stored in @pfn,
>   * true indicates success, otherwise false is returned.  It's also the
>   * only part that runs if we can in atomic context.
>   */
> @@ -2487,16 +2487,18 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
>  			    bool *writable, kvm_pfn_t *pfn)
>  {
>  	struct page *page[1];
> +	bool found_by_fast_gup =
> +		get_user_page_fast_only(
> +			addr,
> +			/*
> +			 * Fast pin a writable pfn only if it is a write fault request
> +			 * or the caller allows to map a writable pfn for a read fault
> +			 * request.
> +			 */
> +			(write_fault || writable) ? FOLL_WRITE : 0,
> +			page);
>  
> -	/*
> -	 * Fast pin a writable pfn only if it is a write fault request
> -	 * or the caller allows to map a writable pfn for a read fault
> -	 * request.
> -	 */
> -	if (!(write_fault || writable))
> -		return false;
> -
> -	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> +	if (found_by_fast_gup) {

You could have a smaller diff (and arrive at something more readable)
using a local for the gup flags:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9c60384b5ae0..57f92ff3728a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2494,6 +2494,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 			    bool *writable, kvm_pfn_t *pfn)
 {
+	unsigned int gup_flags;
 	struct page *page[1];
 
 	/*
@@ -2501,10 +2502,9 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 	 * or the caller allows to map a writable pfn for a read fault
 	 * request.
 	 */
-	if (!(write_fault || writable))
-		return false;
+	gup_flags = (write_fault || writable) ? FOLL_WRITE : 0;
 
-	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+	if (get_user_page_fast_only(addr, gup_flags, page)) {
 		*pfn = page_to_pfn(page[0]);
 
 		if (writable)

-- 
Thanks,
Oliver

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/8] selftests/kvm: Fix bug in how demand_paging_test calculates paging rate
  2023-02-15  7:27   ` Oliver Upton
@ 2023-02-15 16:44     ` Sean Christopherson
  2023-02-15 18:05       ` Anish Moorthy
  0 siblings, 1 reply; 35+ messages in thread
From: Sean Christopherson @ 2023-02-15 16:44 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Anish Moorthy, Paolo Bonzini, Marc Zyngier, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, Feb 15, 2023, Oliver Upton wrote:
> Hi Anish,
> 
> On Wed, Feb 15, 2023 at 01:16:07AM +0000, Anish Moorthy wrote:
> > Currently we're dividing tv_nsec by 1E8, not 1E9.
> > 
> > Reported-by: James Houghton <jthoughton@google.com>
> > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > ---
> >  tools/testing/selftests/kvm/demand_paging_test.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> > index b0e1fc4de9e29..6809184ce2390 100644
> > --- a/tools/testing/selftests/kvm/demand_paging_test.c
> > +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> > @@ -194,7 +194,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >  		ts_diff.tv_sec, ts_diff.tv_nsec);
> >  	pr_info("Overall demand paging rate: %f pgs/sec\n",
> >  		memstress_args.vcpu_args[0].pages * nr_vcpus /
> > -		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 100000000.0));
> > +		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 1E9));
> 
> Use NSEC_PER_SEC instead so the conversion taking place is immediately
> obvious.

And please post this separately, it's a fix that's independent of the rest of the
series and can/should be applied sooner than later.

Thanks!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/8] kvm: Allow hva_pfn_fast to resolve read-only faults.
  2023-02-15  9:01   ` Oliver Upton
@ 2023-02-15 17:03     ` Sean Christopherson
  2023-02-15 18:19       ` Anish Moorthy
  0 siblings, 1 reply; 35+ messages in thread
From: Sean Christopherson @ 2023-02-15 17:03 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Anish Moorthy, Paolo Bonzini, Marc Zyngier, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, Feb 15, 2023, Oliver Upton wrote:
> On Wed, Feb 15, 2023 at 01:16:10AM +0000, Anish Moorthy wrote:
> > The upcoming mem_fault_nowait commits will make it so that, when the
> > relevant cap is enabled, hva_to_pfn will return after calling
> > hva_to_pfn_fast without ever attempting to pin memory via
> > hva_to_pfn_slow.
> > 
> > hva_to_pfn_fast currently just fails for read-only faults. However,
> > there doesn't seem to be a reason that we can't just try pinning the
> > page without FOLL_WRITE instead of immediately falling back to slow-GUP.
> > This commit implements that behavior.

State what the patch does, avoid pronouns, and especially don't have "This commit"
or "This patch" anywhere.  From Documentation/process/submitting-patches.rst:

  Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
  to do frotz", as if you are giving orders to the codebase to change
  its behaviour.

> > Suggested-by: James Houghton <jthoughton@google.com>
> > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > ---
> >  virt/kvm/kvm_main.c | 22 ++++++++++++----------
> >  1 file changed, 12 insertions(+), 10 deletions(-)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index d255964ec331e..dae5f48151032 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2479,7 +2479,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> >  }
> >  
> >  /*
> > - * The fast path to get the writable pfn which will be stored in @pfn,
> > + * The fast path to get the pfn which will be stored in @pfn,
> >   * true indicates success, otherwise false is returned.  It's also the
> >   * only part that runs if we can in atomic context.
> >   */
> > @@ -2487,16 +2487,18 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> >  			    bool *writable, kvm_pfn_t *pfn)
> >  {
> >  	struct page *page[1];
> > +	bool found_by_fast_gup =
> > +		get_user_page_fast_only(
> > +			addr,
> > +			/*
> > +			 * Fast pin a writable pfn only if it is a write fault request
> > +			 * or the caller allows to map a writable pfn for a read fault
> > +			 * request.
> > +			 */
> > +			(write_fault || writable) ? FOLL_WRITE : 0,
> > +			page);
> >  
> > -	/*
> > -	 * Fast pin a writable pfn only if it is a write fault request
> > -	 * or the caller allows to map a writable pfn for a read fault
> > -	 * request.
> > -	 */
> > -	if (!(write_fault || writable))
> > -		return false;
> > -
> > -	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> > +	if (found_by_fast_gup) {
> 
> You could have a smaller diff (and arrive at something more readable)

Heh, this whole series just screams "google3". :-)

Anish, please read through 

  Documentation/process/coding-style.rst

and 

  Documentation/process/submitting-patches.rst

particularly the "Describe your changes" and "Style-check your changes"
sections.  Bonus points if you work through the mostly redundant process/
documentation, e.g. these have supplementary info.

  Documentation/process/4.Coding.rst
  Documentation/process/5.Posting.rst

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-15  8:41   ` Marc Zyngier
@ 2023-02-15 17:07     ` Sean Christopherson
  2023-02-16 18:53     ` Anish Moorthy
  1 sibling, 0 replies; 35+ messages in thread
From: Sean Christopherson @ 2023-02-15 17:07 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Anish Moorthy, Paolo Bonzini, Oliver Upton, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, Feb 15, 2023, Marc Zyngier wrote:
> On Wed, 15 Feb 2023 01:16:11 +0000, Anish Moorthy <amoorthy@google.com> wrote:
> >  8. Other capabilities.
> >  ======================
> >  
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 109b18e2789c4..9352e7f8480fb 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -801,6 +801,9 @@ struct kvm {
> >  	bool vm_bugged;
> >  	bool vm_dead;
> >  
> > +	rwlock_t mem_fault_nowait_lock;
> > +	bool mem_fault_nowait;
> 
> A full-fat rwlock to protect a single bool? What benefits do you
> expect from a rwlock? Why is it preferable to an atomic access, or a
> simple bitop?

There's no need to have any kind of dedicated atomicity.  The only readers are
in vCPU context, just disallow KVM_CAP_MEM_FAULT_NOWAIT after vCPUs are created.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/8] kvm/x86: Add mem fault exit on EPT violations
  2023-02-15  1:16 ` [PATCH 6/8] kvm/x86: Add mem fault exit on EPT violations Anish Moorthy
@ 2023-02-15 17:23   ` Sean Christopherson
  2023-02-16 22:55     ` Peter Xu
  2023-02-23  0:35     ` Anish Moorthy
  0 siblings, 2 replies; 35+ messages in thread
From: Sean Christopherson @ 2023-02-15 17:23 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, Feb 15, 2023, Anish Moorthy wrote:
> With the relevant kvm cap enabled, EPT violations will exit to userspace
> w/ reason KVM_EXIT_MEMORY_FAULT instead of resolving the fault via slow
> get_user_pages.

Similar to the other patch's feedback, please rewrite the changelog to phrase the
changes as commands, not as descriptions of the resulting behavior.

> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Suggested-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 23 ++++++++++++++++++++---
>  arch/x86/kvm/x86.c     |  1 +
>  2 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index aeb240b339f54..28af8d60adee6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4201,6 +4201,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
>  	bool async;
> +	bool mem_fault_nowait;
>  
>  	/*
>  	 * Retry the page fault if the gfn hit a memslot that is being deleted
> @@ -4230,9 +4231,25 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	}
>  
>  	async = false;
> -	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> -					  fault->write, &fault->map_writable,
> -					  &fault->hva);
> +	mem_fault_nowait = memory_faults_enabled(vcpu->kvm);
> +
> +	fault->pfn = __gfn_to_pfn_memslot(
> +		slot, fault->gfn,
> +		mem_fault_nowait,

Hrm, as prep work for this series, I think we should first clean up the pile o'
bools.  This came up before when the "interruptible" flag was added[*].  We punted
then, but I think it's time to bite the bullet, especially since "nowait" and
"async" need to be mutually exclusive.

[*] https://lore.kernel.org/all/YrR9i3yHzh5ftOxB@google.com
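[Editor's note: a sketch of the kind of cleanup being suggested — collapsing the bool parameters into one flags set, with the mutual exclusion made explicit. The flag names are hypothetical.]

```c
/* Hypothetical replacement for __gfn_to_pfn_memslot()'s bool arguments. */
enum kvm_gtp_flags {
	KVM_GTP_ATOMIC        = 1 << 0,
	KVM_GTP_INTERRUPTIBLE = 1 << 1,
	KVM_GTP_WRITE         = 1 << 2,
	KVM_GTP_NOWAIT        = 1 << 3,  /* fast GUP only, fail instead of sleeping */
	KVM_GTP_ASYNC_OK      = 1 << 4,  /* must not be combined with NOWAIT */
};
```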

> +		false,
> +		mem_fault_nowait ? NULL : &async,
> +		fault->write, &fault->map_writable,
> +		&fault->hva);

Same comments as the other patch: follow kernel style, not google3's madness :-)

> +	if (mem_fault_nowait) {
> +		if (fault->pfn == KVM_PFN_ERR_FAULT) {
> +			vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +			vcpu->run->memory_fault.gpa = fault->gfn << PAGE_SHIFT;
> +			vcpu->run->memory_fault.size = PAGE_SIZE;

This belongs in a separate patch, and the exit stuff should be filled in by
kvm_handle_error_pfn().  Then this if-statement goes away entirely because the
"if (!async)" will always evaluate true in the nowait case.

> +		}
> +		return RET_PF_CONTINUE;
> +	}
> +
>  	if (!async)
>  		return RET_PF_CONTINUE; /* *pfn has correct page already */
>  

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/8] selftests/kvm: Fix bug in how demand_paging_test calculates paging rate
  2023-02-15 16:44     ` Sean Christopherson
@ 2023-02-15 18:05       ` Anish Moorthy
  0 siblings, 0 replies; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15 18:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, Paolo Bonzini, Marc Zyngier, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, Feb 15, 2023 at 8:44 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Feb 15, 2023, Oliver Upton wrote:
> > Hi Anish,
> >
> > On Wed, Feb 15, 2023 at 01:16:07AM +0000, Anish Moorthy wrote:
> > > Currently we're dividing tv_nsec by 1E8, not 1E9.
> > >
> > > Reported-by: James Houghton <jthoughton@google.com>
> > > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > > ---
> > >  tools/testing/selftests/kvm/demand_paging_test.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> > > index b0e1fc4de9e29..6809184ce2390 100644
> > > --- a/tools/testing/selftests/kvm/demand_paging_test.c
> > > +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> > > @@ -194,7 +194,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> > >             ts_diff.tv_sec, ts_diff.tv_nsec);
> > >     pr_info("Overall demand paging rate: %f pgs/sec\n",
> > >             memstress_args.vcpu_args[0].pages * nr_vcpus /
> > > -           ((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 100000000.0));
> > > +           ((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 1E9));
> >
> > Use NSEC_PER_SEC instead so the conversion taking place is immediately
> > obvious.
>
> And please post this separately, it's a fix that's independent of the rest of the
> series and can/should be applied sooner than later.

Will do, and thanks for the pointer to NSEC_PER_SEC Oliver.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/8] kvm: Allow hva_pfn_fast to resolve read-only faults.
  2023-02-15 17:03     ` Sean Christopherson
@ 2023-02-15 18:19       ` Anish Moorthy
  0 siblings, 0 replies; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15 18:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, Paolo Bonzini, Marc Zyngier, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, Feb 15, 2023 at 1:02 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> You could have a smaller diff (and arrive at something more readable)
> using a local for the gup flags:
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9c60384b5ae0..57f92ff3728a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2494,6 +2494,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
>  static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
>                             bool *writable, kvm_pfn_t *pfn)
>  {
> +       unsigned int gup_flags;
>         struct page *page[1];
>
>         /*
> @@ -2501,10 +2502,9 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
>          * or the caller allows to map a writable pfn for a read fault
>          * request.
>          */
> -       if (!(write_fault || writable))
> -               return false;
> +       gup_flags = (write_fault || writable) ? FOLL_WRITE : 0;
>
> -       if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> +       if (get_user_page_fast_only(addr, gup_flags, page)) {
>                 *pfn = page_to_pfn(page[0]);
>
>                 if (writable)

Good idea, will do.

On Wed, Feb 15, 2023 at 9:03 AM Sean Christopherson <seanjc@google.com> wrote:
>
> State what the patch does, avoid pronouns, and especially don't have "This commit"
> or "This patch" anywhere.  From Documentation/process/submitting-patches.rst:
>
>   Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
>   instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
>   to do frotz", as if you are giving orders to the codebase to change
>   its behaviour.
>
> Heh, this whole series just screams "google3". :-)
>
> Anish, please read through
>
>   Documentation/process/coding-style.rst
>
> and
>
>   Documentation/process/submitting-patches.rst
>
> particularly the "Describe your changes" and "Style-check your changes"
> sections.  Bonus points if you work through the mostly redundant process/
> documentation, e.g. these have supplementary info.
>
>   Documentation/process/4.Coding.rst
>   Documentation/process/5.Posting.rst

Thanks for the resources Sean- I'll go through the series and rework the commit
messages/documentation appropriately.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/8] kvm/arm64: Implement KVM_CAP_MEM_FAULT_NOWAIT for arm64
  2023-02-15  1:16 ` [PATCH 7/8] kvm/arm64: Implement KVM_CAP_MEM_FAULT_NOWAIT for arm64 Anish Moorthy
@ 2023-02-15 18:24   ` Oliver Upton
  2023-02-15 23:28     ` Anish Moorthy
  0 siblings, 1 reply; 35+ messages in thread
From: Oliver Upton @ 2023-02-15 18:24 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Paolo Bonzini, Marc Zyngier, Sean Christopherson, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, Feb 15, 2023 at 01:16:13AM +0000, Anish Moorthy wrote:
> Just do atomic gfn_to_pfn_memslot when the cap is enabled. Since we
> don't have to deal with async page faults, the implementation is even
> simpler than on x86

All of Sean's suggestions about writing a change description apply here
too.

> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
>  arch/arm64/kvm/arm.c |  1 +
>  arch/arm64/kvm/mmu.c | 14 ++++++++++++--
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 698787ed87e92..31bec7866c346 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -220,6 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_VCPU_ATTRIBUTES:
>  	case KVM_CAP_PTP_KVM:
>  	case KVM_CAP_ARM_SYSTEM_SUSPEND:
> +	case KVM_CAP_MEM_FAULT_NOWAIT:
>  		r = 1;
>  		break;
>  	case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 01352f5838a00..964af7cd5f1c8 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1206,6 +1206,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	unsigned long vma_pagesize, fault_granule;
>  	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>  	struct kvm_pgtable *pgt;
> +	bool mem_fault_nowait;
>  
>  	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
>  	write_fault = kvm_is_write_fault(vcpu);
> @@ -1301,8 +1302,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	 */
>  	smp_rmb();
>  
> -	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
> -				   write_fault, &writable, NULL);
> +	mem_fault_nowait = memory_faults_enabled(vcpu->kvm);
> +	pfn = __gfn_to_pfn_memslot(
> +		memslot, gfn, mem_fault_nowait, false, NULL,
> +		write_fault, &writable, NULL);
> +
> +	if (mem_fault_nowait && pfn == KVM_PFN_ERR_FAULT) {
> +		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +		vcpu->run->memory_fault.gpa = gfn << PAGE_SHIFT;
> +		vcpu->run->memory_fault.size = vma_pagesize;
> +		return -EFAULT;

We really don't want to get out to userspace with EFAULT. Instead, we
should get out to userspace with 0 as the return code to indicate a
'normal' / expected exit.

That will require a bit of redefinition on user_mem_abort()'s return
values:

 - < 0, return to userspace with an error
 - 0, return to userspace for a 'normal' exit
 - 1, resume the guest

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/8] kvm/arm64: Implement KVM_CAP_MEM_FAULT_NOWAIT for arm64
  2023-02-15 18:24   ` Oliver Upton
@ 2023-02-15 23:28     ` Anish Moorthy
  2023-02-15 23:37       ` Oliver Upton
  0 siblings, 1 reply; 35+ messages in thread
From: Anish Moorthy @ 2023-02-15 23:28 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Paolo Bonzini, Marc Zyngier, Sean Christopherson, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, Feb 15, 2023 at 10:24 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> All of Sean's suggestions about writing a change description apply here
> too.

Ack

> > +     if (mem_fault_nowait && pfn == KVM_PFN_ERR_FAULT) {
> > +             vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +             vcpu->run->memory_fault.gpa = gfn << PAGE_SHIFT;
> > +             vcpu->run->memory_fault.size = vma_pagesize;
> > +             return -EFAULT;
>
> We really don't want to get out to userspace with EFAULT. Instead, we
> should get out to userspace with 0 as the return code to indicate a
> 'normal' / expected exit.
>
> That will require a bit of redefinition on user_mem_abort()'s return
> values:
>
>  - < 0, return to userspace with an error
>  - 0, return to userspace for a 'normal' exit
>  - 1, resume the guest

Ok, easy enough: do you want that patch sent separately or as part
of the next version of this series?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/8] kvm/arm64: Implement KVM_CAP_MEM_FAULT_NOWAIT for arm64
  2023-02-15 23:28     ` Anish Moorthy
@ 2023-02-15 23:37       ` Oliver Upton
  0 siblings, 0 replies; 35+ messages in thread
From: Oliver Upton @ 2023-02-15 23:37 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Paolo Bonzini, Marc Zyngier, Sean Christopherson, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm

On Wed, Feb 15, 2023 at 03:28:31PM -0800, Anish Moorthy wrote:
> On Wed, Feb 15, 2023 at 10:24 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > All of Sean's suggestions about writing a change description apply here
> > too.
> 
> Ack
> 
> > > +     if (mem_fault_nowait && pfn == KVM_PFN_ERR_FAULT) {
> > > +             vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > > +             vcpu->run->memory_fault.gpa = gfn << PAGE_SHIFT;
> > > +             vcpu->run->memory_fault.size = vma_pagesize;
> > > +             return -EFAULT;
> >
> > We really don't want to get out to userspace with EFAULT. Instead, we
> > should get out to userspace with 0 as the return code to indicate a
> > 'normal' / expected exit.
> >
> > That will require a bit of redefinition on user_mem_abort()'s return
> > values:
> >
> >  - < 0, return to userspace with an error
> >  - 0, return to userspace for a 'normal' exit
> >  - 1, resume the guest
> 
> Ok, easy enough: do you want that patch sent separately or as part
> of the next version of this series?

Roll it into the next spin of the series. Splitting off patches (as asked
in patch 1) is only useful if there's a bugfix or some other reason for
inclusion ahead of the entire series.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-15  8:41   ` Marc Zyngier
  2023-02-15 17:07     ` Sean Christopherson
@ 2023-02-16 18:53     ` Anish Moorthy
  2023-02-16 21:38       ` Sean Christopherson
  1 sibling, 1 reply; 35+ messages in thread
From: Anish Moorthy @ 2023-02-16 18:53 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Paolo Bonzini, Oliver Upton, Sean Christopherson, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm, peterx

On Wed, Feb 15, 2023 at 12:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Why not use KVM_ENABLE_CAP instead for this? Its a preexisting UAPI for
> toggling KVM behaviors.

Oh I wrote it this way because, even with the "args" field in "struct
kvm_enable_cap" to express the enable/disable intent, it felt weird to allow
disabling the feature through KVM_ENABLE_CAP. But it seems like that's the
convention, so I'll make the change.

> >  5. The kvm_run structure
> >  ========================
> >
> > @@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> > +
> > +             /* KVM_EXIT_MEMORY_FAULT */
> > +             struct {
> > +                     __u64 gpa;
> > +                     __u64 size;
> > +             } memory_fault;
> > +
>
> How is userspace expected to differentiate the gup_fast() failed exit
> from the guest-private memory exit? I don't think flags are a good idea
> for this, as it comes with the illusion that both events can happen on a
> single exit. In reality, these are mutually exclusive.
>
> A fault type/code would be better here, with the option to add flags at
> a later date that could be used to further describe the exit (if
> needed).

Agreed. Something like this, then?

+    struct {
+        __u32 fault_code;
+        __u64 reserved;
+        __u64 gpa;
+        __u64 size;
+    } memory_fault;

The "reserved" field is meant to be the placeholder for a future "flags" field.
Let me know if there's a better/more conventional way to achieve this.

On Wed, Feb 15, 2023 at 9:07 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Feb 15, 2023, Marc Zyngier wrote:
> > On Wed, 15 Feb 2023 01:16:11 +0000, Anish Moorthy <amoorthy@google.com> wrote:
> > >  8. Other capabilities.
> > >  ======================
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 109b18e2789c4..9352e7f8480fb 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -801,6 +801,9 @@ struct kvm {
> > >     bool vm_bugged;
> > >     bool vm_dead;
> > >
> > > +   rwlock_t mem_fault_nowait_lock;
> > > +   bool mem_fault_nowait;
> >
> > A full-fat rwlock to protect a single bool? What benefits do you
> > expect from a rwlock? Why is it preferable to an atomic access, or a
> > simple bitop?
>
> There's no need to have any kind of dedicated atomicity.  The only readers are
> in vCPU context, just disallow KVM_CAP_MEM_FAULT_NOWAIT after vCPUs are created.

I think we do need atomicity here. When KVM_CAP_MEM_FAULT_NOWAIT is enabled
async page faults are essentially disabled: so userspace will likely want to
disable the cap at some point (such as the end of live migration post-copy).
Since we want to support this without having to pause vCPUs, there's an
atomicity requirement.

Marc, I used an rwlock simply because it seemed like the easiest correct thing
to do. I had a hunch that I'd be asked to change this to an atomic, so I can go
ahead and do that barring any other suggestions.

On Wed, Feb 15, 2023 at 12:41 AM Marc Zyngier <maz@kernel.org> wrote:
>
> > +:Returns: 0 on success, or -1 if KVM_CAP_MEM_FAULT_NOWAIT is not present.
>
> Please pick a meaningful error code instead of -1. And if you intended
> this as -EPERM, please explain the rationale (-EINVAL would seem more
> appropriate).

As I mention earlier, I'll be switching to toggling the capability via
KVM_ENABLE_CAP so this KVM_SET_MEM_FAULT_NOWAIT ioctl is going to go away. I
will make sure to set an appropriate error code if the user passes nonsensical
"args" to KVM_ENABLE_CAP though, whatever that ends up meaning.

> > +
> > +Enables (state=true) or disables (state=false) waitless memory faults. For more
> > +information, see the documentation of KVM_CAP_MEM_FAULT_NOWAIT.
> > +
> >  5. The kvm_run structure
> >  ========================
> >
> > @@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > +             /* KVM_EXIT_MEMORY_FAULT */
> > +             struct {
> > +                     __u64 gpa;
> > +                     __u64 size;
> > +             } memory_fault;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +which userspace may choose to handle.
>
> No, the vcpu hasn't "encountered a memory error". The hypervisor has
> taken a page fault, which is very different. And it isn't that KVM
> couldn't handle it (or we wouldn't have a hypervisor at all). From
> what I understand of this series (possibly very little), userspace has
> to *ask* for these, and they are delivered in specific circumstances.
> Which are?

Thanks for the correction: "encountered a memory error" was incorrect phrasing.
What is your opinion on the following, hopefully more accurate, documentation?

"An exit reason of KVM_EXIT_MEMORY_FAULT indicates that the vCPU encountered a
memory fault for which the userspace page tables did not contain a present
mapping. This exit is only generated when KVM_CAP_MEM_FAULT_NOWAIT is enabled:
otherwise KVM will attempt to resolve the fault without exiting to userspace,
returning -1 with errno=EFAULT when it cannot."

It might also help to note that the vCPUs exit in exactly the same cases where
userspace would get a page fault when trying to access the memory through the
VMA/mapping provided to the memslot. Please let me know if you feel that this
information should be included in the documentation, or if you see anything else
I've gotten wrong.

As for handling these faults: userspace should consider its own knowledge of the
guest and determine how best to resolve each fault. For instance if userspace
hasn't UFFDIO_COPY/CONTINUEd a faulting page yet, then it must do so to
establish the mapping. If it already did, then the next step would be to fault
in the page via MADV_POPULATE_READ|WRITE.

A naive userspace could just always MADV_POPULATE_READ|WRITE in response to
these faults: that would basically yield KVM's behavior *without* the capability
enabled, just with extra steps, worse performance, and no chance for async page
faults.

A final note: this is just how the fault applies in the context of
userfaultfd-based post-copy live migration. Other uses might come up in the
future, in which case userspace would want to take some other sort of action.

> > +'gpa' and 'size' indicate the memory range the error occurs at. Userspace
> > +may handle the error and return to KVM to retry the previous memory access.
>
> What are these *exactly*? In what unit?

Sorry, I'll add some descriptive comments here. "gpa" is the guest physical
address of the faulting page (rounded down to be aligned with "size"), and
"size" is just the size in bytes of the faulting page.

> ... What guarantees that the
> process eventually converges? How is userspace supposed to handle the
> fault (which is *not* an error)? Don't you need to communicate other
> information, such as the type of fault (read, write, permission or
> translation fault...)?

Ok, two parts to the response here.

1. As Oliver touches on earlier, we'll probably want to use this same field for
   different classes of memory fault in the future (such as the ones which Chao
   is introducing in [1]): so it does make sense to add "code" and "flags"
   fields which can be used to communicate more information to the user (and
   which can just be set to MEM_FAULT_NOWAIT/0 in this series).

2. As to why a code/flags of MEM_FAULT_NOWAIT/0 should be enough information to
   convey to the user here: the assumption behind this series is that
   MADV_POPULATE_READ|WRITE will always be enough to ensure that the vCPU can
   run again. This goes back to the earlier claim about vCPUs exiting in the
   exact same cases as when userspace would get a page fault trying to access
   the same memory. MADV_POPULATE_READ|WRITE is intended to resolve exactly
   these cases, so userspace shouldn't need anything more.

Again, a VMM enabling the new exit to make post-copy faster must determine what
action to take depending on whether it has UFFDIO_COPY|CONTINUEd the faulting
page (at least for now, perhaps there will be more uses in the future): we can
keep discussing if this seems fragile. Just keeping things limited to
userfaultfd, having KVM communicate to userspace which action is required could
get difficult. For UFFDIO_CONTINUE we'd have to check if the VMA had
VM_UFFD_MINOR, and for UFFDIO_COPY we'd have to at least check the page cache.

> >
> > +7.34 KVM_CAP_MEM_FAULT_NOWAIT
> > +-----------------------------
> > +
> > +:Architectures: x86, arm64
> > +:Target: VM
> > +:Parameters: None
> > +:Returns: 0 on success, or -EINVAL if capability is not supported.
> > +
> > +The presence of this capability indicates that userspace can enable/disable
> > +waitless memory faults through the KVM_SET_MEM_FAULT_NOWAIT ioctl.
> > +
> > +When waitless memory faults are enabled, fast get_user_pages failures when
> > +handling EPT/Shadow Page Table violations will cause a vCPU exit
> > +(KVM_EXIT_MEMORY_FAULT) instead of a fallback to slow get_user_pages.
>
> Do you really expect a random VMM hacker to understand what GUP is?
> Also, the x86 verbiage makes zero sense to me. Please write this in a
> way that is useful for userspace people, because they are the ones
> consuming this documentation. I really can't imagine anyone being able
> to write something useful based on this.

My bad: obviously I shouldn't be exposing the kernel's implementation details
here. As for what the documentation should actually say, I'm thinking I can
adapt the revised description of the KVM exit from earlier ("An exit reason of
KVM_EXIT_MEMORY_FAULT indicates..."). Again, please let me know if you see
anything which should be changed there.

> > +             /* KVM_EXIT_MEMORY_FAULT */
> > +             struct {
> > +                     __u64 flags;
> > +                     __u64 gpa;
> > +                     __u64 size;
> > +             } memory_fault;
>
> Sigh... This doesn't even match the documentation.

Ack! Ack.

> > -/* Available with  KVM_CAP_S390_VCPU_RESETS */
> > +/* Available with KVM_CAP_S390_VCPU_RESETS */
>
> Unrelated change?

Yes, I'll drop it from the next revision.

> > +             r = -EFAULT;
> > +             if (copy_from_user(&state, argp, sizeof(state)))
>
> A copy_from_user for... a bool? Why don't you directly treat argp as
> the boolean itself? Of even better, make it a more interesting
> quantity so that the API can evolve in the future. As it stands, it is
> not extensible, which means it is already dead, if Linux APIs are
> anything to go by.

I'll keep that in mind for the future, but this is now moot since I'll be
removing KVM_SET_MEM_FAULT_NOWAIT.

> On a related subject: is there a set of patches for any of the main
> VMMs making any use of this?

No. For what it's worth, the demand_paging_selftest demonstrates the basics of
what userspace can do with the new exit. Would it be helpful to see how QEMU
could use it as well?

[1] https://lore.kernel.org/all/CA+EHjTyzZ2n8kQxH_Qx72aRq1k+dETJXTsoOM3tggPZAZkYbCA@mail.gmail.com/

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-16 18:53     ` Anish Moorthy
@ 2023-02-16 21:38       ` Sean Christopherson
  2023-02-17 19:14         ` Anish Moorthy
  0 siblings, 1 reply; 35+ messages in thread
From: Sean Christopherson @ 2023-02-16 21:38 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Marc Zyngier, Paolo Bonzini, Oliver Upton, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm, peterx

On Thu, Feb 16, 2023, Anish Moorthy wrote:
> On Wed, Feb 15, 2023 at 12:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index 109b18e2789c4..9352e7f8480fb 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -801,6 +801,9 @@ struct kvm {
> > > >     bool vm_bugged;
> > > >     bool vm_dead;
> > > >
> > > > +   rwlock_t mem_fault_nowait_lock;
> > > > +   bool mem_fault_nowait;
> > >
> > > A full-fat rwlock to protect a single bool? What benefits do you
> > > expect from a rwlock? Why is it preferable to an atomic access, or a
> > > simple bitop?
> >
> > There's no need to have any kind of dedicated atomicity.  The only readers are
> > in vCPU context, just disallow KVM_CAP_MEM_FAULT_NOWAIT after vCPUs are created.
> 
> I think we do need atomicity here.

Atomicity, yes.  Mutual exclusivity, no.  AFAICT, nothing will break if userspace
has multiple in-flight calls to toggle the flag.  And if we do want to guarantee
there's only one writer, then kvm->lock or kvm->slots_lock will suffice.

> When KVM_CAP_MEM_FAULT_NOWAIT is enabled async page faults are essentially
> disabled: so userspace will likely want to disable the cap at some point
> (such as the end of live migration post-copy).

Ah, this is a dynamic thing and not a set-and-forget thing.

> Since we want to support this without having to pause vCPUs, there's an
> atomicity requirement.

Ensuring that vCPUs "see" the new value and not corrupting memory are two very
different things.  Making the flag an atomic, wrapping with a rwlock, etc... do
nothing to ensure vCPUs observe the new value.  And for non-crazy usage of bools,
they're not even necessary to avoid memory corruption, e.g. the result of concurrent
writes to a bool is non-deterministic, but so is the order of two tasks contending
for a lock, so it's a moot point.

I think what you really want to achieve is that vCPUs observe the NOWAIT flag
before KVM returns to userspace.  There are a variety of ways to make that happen,
but since this all about accessing guest memory, the simplest is likely to
"protect" the flag with kvm->srcu, i.e. require SRCU be held by readers and then
do a synchronize_srcu() to ensure all vCPUs have picked up the new value.

Speaking of SRCU (which protects memslots), why not make this a memslot flag?  If
the goal is to be able to turn the behavior on/off dynamically, wouldn't it be
beneficial to turn off the NOWAIT behavior when a memslot is fully transferred?

A memslot flag would likely be simpler to implement as it would piggyback all of
the existing infrastructure to handle memslot updates.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/8] kvm/x86: Add mem fault exit on EPT violations
  2023-02-15 17:23   ` Sean Christopherson
@ 2023-02-16 22:55     ` Peter Xu
  2023-02-23  0:35     ` Anish Moorthy
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Xu @ 2023-02-16 22:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Anish Moorthy, Paolo Bonzini, Marc Zyngier, Oliver Upton,
	James Houghton, Ben Gardon, David Matlack, Ricardo Koller,
	Chao Peng, Axel Rasmussen, kvm, kvmarm

Hi, Sean,

On Wed, Feb 15, 2023 at 09:23:55AM -0800, Sean Christopherson wrote:
> > @@ -4230,9 +4231,25 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  	}
> >  
> >  	async = false;
> > -	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> > -					  fault->write, &fault->map_writable,
> > -					  &fault->hva);
> > +	mem_fault_nowait = memory_faults_enabled(vcpu->kvm);
> > +
> > +	fault->pfn = __gfn_to_pfn_memslot(
> > +		slot, fault->gfn,
> > +		mem_fault_nowait,
> 
> Hrm, as prep work for this series, I think we should first clean up the pile o'
> bools.  This came up before when the "interruptible" flag was added[*].  We punted
> then, but I think it's time to bite the bullet, especially since "nowait" and
> "async" need to be mutually exclusive.
> 
> [*] https://lore.kernel.org/all/YrR9i3yHzh5ftOxB@google.com

IIUC at that time we didn't reach a consensus on how to rework it, and the
suggestion was to use a bool and separate out the cleanup (while the flag
cleanup got NACKed...), which makes sense to me.

Personally I like your other patch a lot:

https://lore.kernel.org/all/YrTNGVpT8Cw2yrnr@google.com/

The question is, after we merge the bools into a flag, do we still need a
struct* anyway?  If we can reach a consensus, I'll be happy to continue
working on the cleanup (by picking up your patch first).  Or if Anish would
like to take it over in any form that'll be also perfect.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-16 21:38       ` Sean Christopherson
@ 2023-02-17 19:14         ` Anish Moorthy
  2023-02-17 20:33           ` Sean Christopherson
  2023-02-17 20:47           ` Sean Christopherson
  0 siblings, 2 replies; 35+ messages in thread
From: Anish Moorthy @ 2023-02-17 19:14 UTC (permalink / raw)
  To: Sean Christopherson, Oliver Upton
  Cc: Marc Zyngier, Paolo Bonzini, James Houghton, Ben Gardon,
	David Matlack, Ricardo Koller, Chao Peng, Axel Rasmussen, kvm,
	kvmarm, peterx

On Thu, Feb 16, 2023 at 10:53 AM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Wed, Feb 15, 2023 at 12:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > How is userspace expected to differentiate the gup_fast() failed exit
> > from the guest-private memory exit? I don't think flags are a good idea
> > for this, as it comes with the illusion that both events can happen on a
> > single exit. In reality, these are mutually exclusive.
> >
> > A fault type/code would be better here, with the option to add flags at
> > a later date that could be used to further describe the exit (if
> > needed).
>
> Agreed. Something like this, then?
>
> +    struct {
> +        __u32 fault_code;
> +        __u64 reserved;
> +        __u64 gpa;
> +        __u64 size;
> +    } memory_fault;
>
> The "reserved" field is meant to be the placeholder for a future "flags" field.
> Let me know if there's a better/more conventional way to achieve this.

On Thu, Feb 16, 2023 at 10:53 AM Anish Moorthy <amoorthy@google.com> wrote:
>
> 1. As Oliver touches on earlier, we'll probably want to use this same field for
>    different classes of memory fault in the future (such as the ones which Chao
>    is introducing in [1]): so it does make sense to add "code" and "flags"
>    fields which can be used to communicate more information to the user (and
>    which can just be set to MEM_FAULT_NOWAIT/0 in this series).

Let me walk back my responses here: I took a closer look at Chao's series, and
it doesn't seem that I should be trying to share KVM_EXIT_MEMORY_FAULT with it
in the first place. As far as I can understand (not that much, to be clear :)
we're signaling unrelated things, so it makes more sense to use different exits
(ie, rename mine to KVM_EXIT_MEMORY_FAULT_NOWAIT). That would prevent any
potential confusion about mutual exclusivity.

That also removes the need for a "fault_code" field in "memory_fault", which I
could rename to something more general like "guest_memory_range". As for the
"reserved" field, we could keep it around if we care about reusing
"guest_memory_range" between exits, or discard it if we don't.

On Thu, Feb 16, 2023 at 1:38 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Feb 16, 2023, Anish Moorthy wrote:
> > On Wed, Feb 15, 2023 at 12:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > index 109b18e2789c4..9352e7f8480fb 100644
> > > > > --- a/include/linux/kvm_host.h
> > > > > +++ b/include/linux/kvm_host.h
> > > > > @@ -801,6 +801,9 @@ struct kvm {
> > > > >     bool vm_bugged;
> > > > >     bool vm_dead;
> > > > >
> > > > > +   rwlock_t mem_fault_nowait_lock;
> > > > > +   bool mem_fault_nowait;
> > > >
> > > > A full-fat rwlock to protect a single bool? What benefits do you
> > > > expect from a rwlock? Why is it preferable to an atomic access, or a
> > > > simple bitop?
> > >
> > > There's no need to have any kind of dedicated atomicity.  The only readers are
> > > in vCPU context, just disallow KVM_CAP_MEM_FAULT_NOWAIT after vCPUs are created.
> >
> > I think we do need atomicity here.
>
> Atomicity, yes.  Mutual exclusivity, no.  AFAICT, nothing will break if userspace
> has multiple in-flight calls to toggle the flag.  And if we do want to guarantee
> there's only one writer, then kvm->lock or kvm->slots_lock will suffice.
>
> > Since we want to support this without having to pause vCPUs, there's an
> > atomicity requirement.
>
> Ensuring that vCPUs "see" the new value and not corrupting memory are two very
> different things.  Making the flag an atomic, wrapping with a rwlock, etc... do
> nothing to ensure vCPUs observe the new value.  And for non-crazy usage of bools,
> they're not even necessary to avoid memory corruption...

Oh, that's news to me- I've learned to treat any unprotected concurrent accesses
to memory as undefined behavior: guess there's always more to learn.

> I think what you really want to achieve is that vCPUs observe the NOWAIT flag
> before KVM returns to userspace.  There are a variety of ways to make that happen,
> but since this all about accessing guest memory, the simplest is likely to
> "protect" the flag with kvm->srcu, i.e. require SRCU be held by readers and then
> do a synchronize_srcu() to ensure all vCPUs have picked up the new value.
>
> Speaking of SRCU (which protects memslots), why not make this a memslot flag?  If
> the goal is to be able to turn the behavior on/off dynamically, wouldn't it be
> beneficial to turn off the NOWAIT behavior when a memslot is fully transferred?

Thanks for the suggestions, I never considered making this a memslot flag
actually, so I'll go look into that.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-17 19:14         ` Anish Moorthy
@ 2023-02-17 20:33           ` Sean Christopherson
  2023-02-23  1:16             ` Anish Moorthy
  2023-02-17 20:47           ` Sean Christopherson
  1 sibling, 1 reply; 35+ messages in thread
From: Sean Christopherson @ 2023-02-17 20:33 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Oliver Upton, Marc Zyngier, Paolo Bonzini, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm, peterx

On Fri, Feb 17, 2023, Anish Moorthy wrote:
> On Thu, Feb 16, 2023 at 10:53 AM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Wed, Feb 15, 2023 at 12:59 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > How is userspace expected to differentiate the gup_fast() failed exit
> > > from the guest-private memory exit?

Sorry, missed this comment first time around.  I don't see any reason why userspace
needs to uniquely identify the gup_fast() case.  Similar to page faults from
hardware, KVM should describe the access in sufficient detail so that userspace
can resolve the issue, whatever that may be, but KVM shouldn't make any assumptions
about _why_ the failure occurred.  

gup_fast() failing in itself isn't interesting.  The _intended_ behavior is that
KVM will exit if and only if the guest accesses a valid page that hasn't yet been
transferred from the source, but the _actual_ behavior is that KVM will exit if
the page isn't faulted in for _any_ reason.  Even tagging the access NOWAIT is
speculative to some degree, as that suggests the access may have succeeded if
KVM "waited", which may or may not be true.

E.g. see the virtiofs truncation use case that planted the original seed for
this approach: https://lore.kernel.org/kvm/20201005161620.GC11938@linux.intel.com.

> > > I don't think flags are a good idea for this, as it comes with the
> > > illusion that both events can happen on a single exit. In reality, these
> > > are mutually exclusive.

They aren't mutually exclusive.  Obviously KVM will only detect one or the
other, but it's entirely possible that a guest could be accessing the "wrong"
flavor of a page _and_ for that flavor to not be faulted in.  A smart userspace
should see that (a) it needs to change the memory attributes and (b) that it
needs to demand the to-be-installed page from the source.

> > > A fault type/code would be better here, with the option to add flags at
> > > a later date that could be used to further describe the exit (if
> > > needed).
> >
> > Agreed.

Not agreed :-)

> > +    struct {
> > +        __u32 fault_code;
> > +        __u64 reserved;
> > +        __u64 gpa;
> > +        __u64 size;
> > +    } memory_fault;
> >
> > The "reserved" field is meant to be the placeholder for a future "flags" field.
> > Let me know if there's a better/more conventional way to achieve this.
> 
> On Thu, Feb 16, 2023 at 10:53 AM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > 1. As Oliver touches on earlier, we'll probably want to use this same field for
> >    different classes of memory fault in the future (such as the ones which Chao
> >    is introducing in [1]): so it does make sense to add "code" and "flags"
> >    fields which can be used to communicate more information to the user (and
> >    which can just be set to MEM_FAULT_NOWAIT/0 in this series).
> 
> Let me walk back my responses here: I took a closer look at Chao's series, and
> it doesn't seem that I should be trying to share KVM_EXIT_MEMORY_FAULT with it
> in the first place. As far as I can understand (not that much, to be clear :)
> we're signaling unrelated things, so it makes more sense to use different exits
> (ie, rename mine to KVM_EXIT_MEMORY_FAULT_NOWAIT). That would prevent any
> potential confusion about mutual exclusivity.

Hard "no" on a separate exit reason unless someone comes up with a very compelling
argument.

Chao's UPM series is not yet merged, i.e. is not set in stone.  If the proposed
functionality in Chao's series is lacking and/or will conflict with this UFFD,
then we can and should address those issues _before_ it gets merged.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-17 19:14         ` Anish Moorthy
  2023-02-17 20:33           ` Sean Christopherson
@ 2023-02-17 20:47           ` Sean Christopherson
  1 sibling, 0 replies; 35+ messages in thread
From: Sean Christopherson @ 2023-02-17 20:47 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Oliver Upton, Marc Zyngier, Paolo Bonzini, James Houghton,
	Ben Gardon, David Matlack, Ricardo Koller, Chao Peng,
	Axel Rasmussen, kvm, kvmarm, peterx

On Fri, Feb 17, 2023, Anish Moorthy wrote:
> On Thu, Feb 16, 2023 at 1:38 PM Sean Christopherson <seanjc@google.com> wrote:
> > Ensuring that vCPUs "see" the new value and not corrupting memory are two very
> > different things.  Making the flag an atomic, wrapping with a rwlock, etc... do
> > nothing to ensure vCPUs observe the new value.  And for non-crazy usage of bools,
> > they're not even necessary to avoid memory corruption...
> 
> Oh, that's news to me- I've learned to treat any unprotected concurrent accesses
> to memory as undefined behavior: guess there's always more to learn.

To expand a bit, undefined and non-deterministic are two different things.  In
this case, the behavior is not deterministic, but it _is_ defined.  Readers will
either see %true or %false, but they will not see %beagle. 

And more importantly _KVM_ doesn't care whether or not the behavior is deterministic,
that's userspace's responsibility.  E.g. if KVM were endangered by the flag changing
then some form of locking absolutely would be needed, but that's not the case here.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/8] kvm/x86: Add mem fault exit on EPT violations
  2023-02-15 17:23   ` Sean Christopherson
  2023-02-16 22:55     ` Peter Xu
@ 2023-02-23  0:35     ` Anish Moorthy
  2023-02-23 20:11       ` Sean Christopherson
  1 sibling, 1 reply; 35+ messages in thread
From: Anish Moorthy @ 2023-02-23  0:35 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm

On Wed, Feb 15, 2023 at 9:24 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Feb 15, 2023, Anish Moorthy wrote:
> > +     if (mem_fault_nowait) {
> > +             if (fault->pfn == KVM_PFN_ERR_FAULT) {
> > +                     vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +                     vcpu->run->memory_fault.gpa = fault->gfn << PAGE_SHIFT;
> > +                     vcpu->run->memory_fault.size = PAGE_SIZE;
>
> This belongs in a separate patch, and the exit stuff should be filled in by
> kvm_handle_error_pfn().  Then this if-statement goes away entirely because the
> "if (!async)" will always evaluate true in the nowait case.

Hi Sean, what exactly did you want "in a separate patch"? Unless you mean
separating the changes to arch/x86/kvm/mmu/mmu.c and the one-liner to enable
the cap in arch/x86/kvm/x86.c, I don't really understand how this patch could
logically be split.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-17 20:33           ` Sean Christopherson
@ 2023-02-23  1:16             ` Anish Moorthy
  2023-02-23 20:55               ` Sean Christopherson
  0 siblings, 1 reply; 35+ messages in thread
From: Anish Moorthy @ 2023-02-23  1:16 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Oliver Upton, Chao Peng, kvm

On Fri, Feb 17, 2023 at 12:33 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > I don't think flags are a good idea for this, as it comes with the
> > > > illusion that both events can happen on a single exit. In reality, these
> > > > are mutually exclusive.
>
> They aren't mutually exclusive.  Obviously KVM will only detect one or the
> other, but it's entirely possible that a guest could be accessing the "wrong"
> flavor of a page _and_ for that flavor to not be faulted in.  A smart userspace
> should see that (a) it needs to change the memory attributes and (b) that it
> needs to demand the to-be-installed page from the source.
>
> > > > A fault type/code would be better here, with the option to add flags at
> > > > a later date that could be used to further describe the exit (if
> > > > needed).
> > >
> > > Agreed.
>
> Not agreed :-)
> ...
> Hard "no" on a separate exit reason unless someone comes up with a very compelling
> argument.
>
> Chao's UPM series is not yet merged, i.e. is not set in stone.  If the proposed
> functionality in Chao's series is lacking and/or will conflict with this UFFD,
> then we can and should address those issues _before_ it gets merged.

Ok so I have a v2 of the series basically ready to go, but I realized
that I should
probably have brought up my modified API here to make sure it was
sane: so, I'll do
that first

In v2, I've
(1)  renamed the kvm cap from KVM_CAP_MEM_FAULT_NOWAIT to
KVM_CAP_MEMORY_FAULT_EXIT due to Sean's earlier comment

> gup_fast() failing in itself isn't interesting.  The _intended_ behavior is that
> KVM will exit if and only if the guest accesses a valid page that hasn't yet been
> transfered from the source, but the _actual_ behavior is that KVM will exit if
> the page isn't faulted in for _any_ reason.  Even tagging the access NOWAIT is
> speculative to some degree, as that suggests the access may have succeeded if
> KVM "waited", which may or may not be true.

(2) kept the definition of kvm_run.memory_fault as
struct {
    __u64 flags;
    __u64 gpa;
    __u64 len;
} memory_fault;
which, apart from the name of the "len" field, is exactly what Chao
has in their series.
flags remains a bitfield describing the reason for the memory fault:
in the two places
this series generates this fault, it sets a bit in flags. Userspace is
meant to check whether
a memory_fault was generated due to KVM_CAP_MEMORY_FAULT_EXIT using the
KVM_MEMORY_FAULT_EXIT_REASON_ABSENT mask.

(3) switched over to a memslot flag: KVM_CAP_MEMORY_FAULT_EXIT simply
indicates whether this flag is supported. As such, trying to enable
the new capability
directly generates an EINVAL, which is meant to make the incorrect usage clear.

Hopefully this all appears sane?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/8] kvm/x86: Add mem fault exit on EPT violations
  2023-02-23  0:35     ` Anish Moorthy
@ 2023-02-23 20:11       ` Sean Christopherson
  0 siblings, 0 replies; 35+ messages in thread
From: Sean Christopherson @ 2023-02-23 20:11 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: kvm

On Wed, Feb 22, 2023, Anish Moorthy wrote:
> On Wed, Feb 15, 2023 at 9:24 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Feb 15, 2023, Anish Moorthy wrote:
> > > +     if (mem_fault_nowait) {
> > > +             if (fault->pfn == KVM_PFN_ERR_FAULT) {
> > > +                     vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > > +                     vcpu->run->memory_fault.gpa = fault->gfn << PAGE_SHIFT;
> > > +                     vcpu->run->memory_fault.size = PAGE_SIZE;
> >
> > This belongs in a separate patch, and the exit stuff should be filled in by
> > kvm_handle_error_pfn().  Then this if-statement goes away entirely because the
> > "if (!async)" will always evaluate true in the nowait case.
> 
> Hi Sean, what exactly did you want "in a separate patch"?

Separate "exit if fast gup() fails", a.k.a. nowait, from "exit with KVM_EXIT_MEMORY_FAULT
instead of -EFAULT on errors".  I.e. don't tie KVM_EXIT_MEMORY_FAULT to the fast
gup() capability (or memslot flag?).  That way filling vcpu->run->memory_fault
lands in a separate, largely generic patch, and this patch only introduces the
fast gup() logic.

That probably means adding yet another capability, but capabilities are cheap.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-23  1:16             ` Anish Moorthy
@ 2023-02-23 20:55               ` Sean Christopherson
  2023-02-23 23:03                 ` Anish Moorthy
  0 siblings, 1 reply; 35+ messages in thread
From: Sean Christopherson @ 2023-02-23 20:55 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Oliver Upton, Chao Peng, kvm

On Wed, Feb 22, 2023, Anish Moorthy wrote:
> On Fri, Feb 17, 2023 at 12:33 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > I don't think flags are a good idea for this, as it comes with the
> > > > > illusion that both events can happen on a single exit. In reality, these
> > > > > are mutually exclusive.
> >
> > They aren't mutually exclusive.  Obviously KVM will only detect one or the
> > other, but it's entirely possible that a guest could be accessing the "wrong"
> > flavor of a page _and_ for that flavor to not be faulted in.  A smart userspace
> > should see that (a) it needs to change the memory attributes and (b) that it
> > needs to demand the to-be-installed page from the source.
> >
> > > > > A fault type/code would be better here, with the option to add flags at
> > > > > a later date that could be used to further describe the exit (if
> > > > > needed).
> > > >
> > > > Agreed.
> >
> > Not agreed :-)
> > ...
> > Hard "no" on a separate exit reason unless someone comes up with a very compelling
> > argument.
> >
> > Chao's UPM series is not yet merged, i.e. is not set in stone.  If the proposed
> > functionality in Chao's series is lacking and/or will conflict with this UFFD,
> > then we can and should address those issues _before_ it gets merged.
> 
> Ok so I have a v2 of the series basically ready to go, but I realized
> that I should
> probably have brought up my modified API here to make sure it was
> sane: so, I'll do
> that first
> 
> In v2, I've
> (1)  renamed the kvm cap from KVM_CAP_MEM_FAULT_NOWAIT to
> KVM_CAP_MEMORY_FAULT_EXIT due to Sean's earlier comment
> 
> > gup_fast() failing in itself isn't interesting.  The _intended_ behavior is that
> > KVM will exit if and only if the guest accesses a valid page that hasn't yet been
> > transfered from the source, but the _actual_ behavior is that KVM will exit if
> > the page isn't faulted in for _any_ reason.  Even tagging the access NOWAIT is
> > speculative to some degree, as that suggests the access may have succeeded if
> > KVM "waited", which may or may not be true.
> 
> (2) kept the definition of kvm_run.memory_fault as
> struct {
>     __u64 flags;
>     __u64 gpa;
> >     __u64 len;
> } memory_fault;
> which, apart from the name of the "len" field, is exactly what Chao
> has in their series.

Off-topic, please adjust whatever email client you're using to not wrap so
aggressively and at seemingly random times.

As written, this makes my eyes bleed, whereas formatting like so does not :-)

  Ok so I have a v2 of the series basically ready to go, but I realized that I
  should probably have brought up my modified API here to make sure it was sane:
  so, I'll do that first

  ...

  which, apart from the name of the "len" field, is exactly what Chao
  has in their series.

  Flags remains a bitfield describing the reason for the memory fault:
  in the two places this series generates this fault, it sets a bit in flags.
  Userspace is meant to check whether a memory_fault was generated due to
  KVM_CAP_MEMORY_FAULT_EXIT using the KVM_MEMORY_FAULT_EXIT_REASON_ABSENT mask.

> flags remains a bitfield describing the reason for the memory fault:
> in the two places
> this series generates this fault, it sets a bit in flags. Userspace is
> meant to check whether
> a memory_fault was generated due to KVM_CAP_MEMORY_FAULT_EXIT using the
> KVM_MEMORY_FAULT_EXIT_REASON_ABSENT mask.

Before sending a new version, let's bottom out on whether or not a
KVM_MEMORY_FAULT_EXIT_REASON_ABSENT flag is necessary.  I'm not dead set against
a flag, but as I called out earlier[*], it can have false positives.  I.e. userspace
needs to be able to suss out the real problem anyways.  And if userspace needs to
be that smart, what's the point of the flag?

[*] https://lore.kernel.org/all/Y+%2FkgMxQPOswAz%2F2@google.com

> 
> (3) switched over to a memslot flag: KVM_CAP_MEMORY_FAULT_EXIT simply
> indicates whether this flag is supported.

The new memslot flag should depend on KVM_CAP_MEMORY_FAULT_EXIT, but
KVM_CAP_MEMORY_FAULT_EXIT should be a standalone thing, i.e. should convert "all"
guest-memory -EFAULTs to KVM_EXIT_MEMORY_FAULT.  All in quotes because I would
likely be ok with a partial conversion for the initial implementation if there
are paths that would require an absurd amount of work to convert.

> As such, trying to enable the new capability directly generates an EINVAL,
> which is meant to make the incorrect usage clear.
> 
> Hopefully this all appears sane?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-23 20:55               ` Sean Christopherson
@ 2023-02-23 23:03                 ` Anish Moorthy
  2023-02-24  0:01                   ` Sean Christopherson
  0 siblings, 1 reply; 35+ messages in thread
From: Anish Moorthy @ 2023-02-23 23:03 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Oliver Upton, Chao Peng, kvm, James Houghton

On Thu, Feb 23, 2023 at 12:55 PM Sean Christopherson <seanjc@google.com> wrote:
> Off-topic, please adjust whatever email client you're using to not wrap so
> aggressively and at seemingly random times.
> As written, this makes my eyes bleed, whereas formatting like so does not :-)

Oof, that *is* pretty bad. I think I have the cause figured out
though, so I'll be careful about that from now on :)

>   Ok so I have a v2 of the series basically ready to go, but I realized that I
>   should probably have brought up my modified API here to make sure it was sane:
>   so, I'll do that first
>
>   ...
>
>   which, apart from the name of the "len" field, is exactly what Chao
>   has in their series.
>
>   Flags remains a bitfield describing the reason for the memory fault:
>   in the two places this series generates this fault, it sets a bit in flags.
>   Userspace is meant to check whether a memory_fault was generated due to
>   KVM_CAP_MEMORY_FAULT_EXIT using the KVM_MEMORY_FAULT_EXIT_REASON_ABSENT mask.
>
> > flags remains a bitfield describing the reason for the memory fault:
> > in the two places
> > this series generates this fault, it sets a bit in flags. Userspace is
> > meant to check whether
> > a memory_fault was generated due to KVM_CAP_MEMORY_FAULT_EXIT using the
> > KVM_MEMORY_FAULT_EXIT_REASON_ABSENT mask.
>
> Before sending a new version, let's bottom out on whether or not a
> KVM_MEMORY_FAULT_EXIT_REASON_ABSENT flag is necessary.  I'm not dead set against
> a flag, but as I called out earlier[*], it can have false positives.  I.e. userspace
> needs to be able to suss out the real problem anyways.  And if userspace needs to
> be that smart, what's the point of the flag?
>
> [*] https://lore.kernel.org/all/Y+%2FkgMxQPOswAz%2F2@google.com

My understanding of your previous message was off: I didn't realize
that fast gup would also fail for present mappings where the
read/write permission was incorrect. So I'd agree that
KVM_MEMORY_FAULT_EXIT_REASON_ABSENT should be dropped: best not to
mislead with false positives.

> >
> > (3) switched over to a memslot flag: KVM_CAP_MEMORY_FAULT_EXIT simply
> > indicates whether this flag is supported.
>
> The new memslot flag should depend on KVM_CAP_MEMORY_FAULT_EXIT, but
> KVM_CAP_MEMORY_FAULT_EXIT should be a standalone thing, i.e. should convert "all"
> guest-memory -EFAULTs to KVM_EXIT_MEMORY_FAULT.  All in quotes because I would
> likely be ok with a partial conversion for the initial implementation if there
> are paths that would require an absurd amount of work to convert.

I'm actually not sure where all of the sources of guest-memory
-EFAULTs are or how I'd go about finding them. Is the standalone
behavior of KVM_CAP_MEMORY_FAULT_EXIT something you're suggesting
because that new name is too broad for what it does right now? If so
then I'd rather just rename it again: but if that functionality should
be included with this series, then I'll take a look at the required
work given a couple of pointers :)

I will say, a partial implementation seems worse than no
implementation: isn't there a risk that someone ends up depending on
the incomplete behavior?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits
  2023-02-23 23:03                 ` Anish Moorthy
@ 2023-02-24  0:01                   ` Sean Christopherson
  0 siblings, 0 replies; 35+ messages in thread
From: Sean Christopherson @ 2023-02-24  0:01 UTC (permalink / raw)
  To: Anish Moorthy; +Cc: Oliver Upton, Chao Peng, kvm, James Houghton

On Thu, Feb 23, 2023, Anish Moorthy wrote:
> On Thu, Feb 23, 2023 at 12:55 PM Sean Christopherson <seanjc@google.com> wrote:
> > > (3) switched over to a memslot flag: KVM_CAP_MEMORY_FAULT_EXIT simply
> > > indicates whether this flag is supported.
> >
> > The new memslot flag should depend on KVM_CAP_MEMORY_FAULT_EXIT, but
> > KVM_CAP_MEMORY_FAULT_EXIT should be a standalone thing, i.e. should convert "all"
> > guest-memory -EFAULTs to KVM_EXIT_MEMORY_FAULT.  All in quotes because I would
> > likely be ok with a partial conversion for the initial implementation if there
> > are paths that would require an absurd amount of work to convert.
> 
> I'm actually not sure where all of the sources of guest-memory -EFAULTs are
> or how I'd go about finding them.

Heh, if anyone says they can list _all_ sources of KVM accesses to guest
memory off the top of their head, they're lying.

Finding the sources is a bit tedious, but shouldn't be too hard.  The scope is
-EFAULT in the context of KVM_RUN that gets returned to userspace.  There are
150+ references to -EFAULT to wade through, but the vast majority are obviously
not in scope, e.g. are for uaccess failures in other ioctls().
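
A rough way to enumerate candidates (the directories below are the usual KVM
source locations; every hit still needs manual triage to decide whether it can
actually reach userspace via ioctl(KVM_RUN)):

```shell
# Sketch: from the root of a kernel tree, list candidate -EFAULT
# returns in the common KVM directories for manual triage.
for d in virt/kvm arch/x86/kvm; do
	[ -d "$d" ] && grep -rn 'return -EFAULT' "$d"
done
```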

> Is the standalone behavior of KVM_CAP_MEMORY_FAULT_EXIT something you're
> suggesting because that new name is too broad for what it does right now?

No, I want a standalone thing because I want to start driving toward KVM never
returning -EFAULT to host userspace for accesses to guest memory in the context
of ioctl(KVM_RUN).  E.g. I want to replace the "return -EFAULT" in
kvm_handle_error_pfn() with something like "return kvm_handle_memory_fault_exit(...)".

My hope/goal is that returning useful information will allow userspace to do more
interesting things with guest memory without needing invasive, one-off changes
to KVM.  At the very least, it should get us to the point where a memory fault
from KVM_RUN exits to userspace with sane, helpful information, not a generic
"-EFAULT, have fun debugging!".

> If so then I'd rather just rename it again: but if that functionality should
> be included with this series, then I'll take a look at the required work
> given a couple of pointers :)
> 
> I will say, a partial implementation seems worse than no
> implementation: isn't there a risk that someone ends up depending on
> the incomplete behavior?

In this case, I don't think so.  We definitely would need to document that KVM
may still return -EFAULT in certain scenarios, but we have at least one known
use case (this one) where catching the common cases is sufficient.  And if/when
use cases come along that need 100% accuracy, we can hunt down and fix KVM's
remaining "bugs".

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2023-02-24  0:02 UTC | newest]

Thread overview: 35+ messages
2023-02-15  1:16 [PATCH 0/8] Add memory fault exits to avoid slow GUP Anish Moorthy
2023-02-15  1:16 ` [PATCH 1/8] selftests/kvm: Fix bug in how demand_paging_test calculates paging rate Anish Moorthy
2023-02-15  7:27   ` Oliver Upton
2023-02-15 16:44     ` Sean Christopherson
2023-02-15 18:05       ` Anish Moorthy
2023-02-15  1:16 ` [PATCH 2/8] selftests/kvm: Allow many vcpus per UFFD in demand paging test Anish Moorthy
2023-02-15  1:16 ` [PATCH 3/8] selftests/kvm: Switch demand paging uffd readers to epoll Anish Moorthy
2023-02-15  1:16 ` [PATCH 4/8] kvm: Allow hva_pfn_fast to resolve read-only faults Anish Moorthy
2023-02-15  9:01   ` Oliver Upton
2023-02-15 17:03     ` Sean Christopherson
2023-02-15 18:19       ` Anish Moorthy
2023-02-15  1:16 ` [PATCH 5/8] kvm: Add cap/kvm_run field for memory fault exits Anish Moorthy
2023-02-15  8:41   ` Marc Zyngier
2023-02-15 17:07     ` Sean Christopherson
2023-02-16 18:53     ` Anish Moorthy
2023-02-16 21:38       ` Sean Christopherson
2023-02-17 19:14         ` Anish Moorthy
2023-02-17 20:33           ` Sean Christopherson
2023-02-23  1:16             ` Anish Moorthy
2023-02-23 20:55               ` Sean Christopherson
2023-02-23 23:03                 ` Anish Moorthy
2023-02-24  0:01                   ` Sean Christopherson
2023-02-17 20:47           ` Sean Christopherson
2023-02-15  8:59   ` Oliver Upton
2023-02-15  1:16 ` [PATCH 6/8] kvm/x86: Add mem fault exit on EPT violations Anish Moorthy
2023-02-15 17:23   ` Sean Christopherson
2023-02-16 22:55     ` Peter Xu
2023-02-23  0:35     ` Anish Moorthy
2023-02-23 20:11       ` Sean Christopherson
2023-02-15  1:16 ` [PATCH 7/8] kvm/arm64: Implement KVM_CAP_MEM_FAULT_NOWAIT for arm64 Anish Moorthy
2023-02-15 18:24   ` Oliver Upton
2023-02-15 23:28     ` Anish Moorthy
2023-02-15 23:37       ` Oliver Upton
2023-02-15  1:16 ` [PATCH 8/8] selftests/kvm: Handle mem fault exits in demand paging test Anish Moorthy
2023-02-15  1:46 ` [PATCH 0/8] Add memory fault exits to avoid slow GUP James Houghton
