* [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population
@ 2021-11-11  0:12 David Matlack
  2021-11-11  0:12 ` [PATCH 1/4] KVM: selftests: Start at iteration 0 instead of -1 David Matlack
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: David Matlack @ 2021-11-11  0:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Sean Christopherson, Ben Gardon, Andrew Jones, Jim Mattson,
	Yanan Wang, Peter Xu, Aaron Lewis, David Matlack

This series fixes a performance issue in the KVM selftests, specifically
those that use perf_test_util. These tests create vCPU threads which
immediately enter guest mode and start faulting in memory. Creating
vCPU threads while faulting in memory is a recipe for generating a lot
of contention on the mmap_sem, as thread creation requires acquiring the
mmap_sem in write mode.
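
The problem is not specific to KVM. As a minimal, self-contained sketch
(illustrative only, not part of this series), the same pattern can be
reproduced with plain anonymous memory: each new thread immediately starts
faulting in pages of a shared mapping, so every subsequent pthread_create(),
which needs the mmap_sem in write mode to map the new thread's stack, ends
up serializing with (and then blocking) the faulting threads:

  /* Illustrative only; thread count and sizes are arbitrary. */
  #include <pthread.h>
  #include <stdio.h>
  #include <sys/mman.h>

  #define NTHREADS   8
  #define PER_THREAD (512UL << 20) /* 512 MiB per thread */

  static char *mem;

  static void *toucher(void *arg)
  {
          unsigned long slot = (unsigned long)arg;
          char *slice = mem + slot * PER_THREAD;
          unsigned long off;

          /* Every store faults in a page, taking the mmap_sem for read. */
          for (off = 0; off < PER_THREAD; off += 4096)
                  slice[off] = 1;

          return NULL;
  }

  int main(void)
  {
          pthread_t threads[NTHREADS];
          unsigned long i;

          mem = mmap(NULL, NTHREADS * PER_THREAD, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
          if (mem == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          /*
           * Each pthread_create() maps a stack for the new thread, which
           * takes the mmap_sem in write mode and so contends with the page
           * faults issued by the threads created before it.
           */
          for (i = 0; i < NTHREADS; i++)
                  pthread_create(&threads[i], NULL, toucher, (void *)i);

          for (i = 0; i < NTHREADS; i++)
                  pthread_join(threads[i], NULL);

          return 0;
  }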

This series fixes this issue by ensuring that all vCPU threads are
created before entering guest mode. As part of the fix I consolidated
the code to create and join vCPU threads across all users of
perf_test_util.

The last commit is an unrelated perf_test_util cleanup.

Note: This series applies on top of
https://lore.kernel.org/kvm/20211111000310.1435032-1-dmatlack@google.com/,
although the dependency on the series is just cosmetic.

David Matlack (4):
  KVM: selftests: Start at iteration 0 instead of -1
  KVM: selftests: Move vCPU thread creation and joining to common
    helpers
  KVM: selftests: Wait for all vCPUs to be created before entering guest
    mode
  KVM: selftests: Use perf_test_destroy_vm in
    memslot_modification_stress_test

 .../selftests/kvm/access_tracking_perf_test.c | 46 +++---------
 .../selftests/kvm/demand_paging_test.c        | 25 +------
 .../selftests/kvm/dirty_log_perf_test.c       | 19 ++---
 .../selftests/kvm/include/perf_test_util.h    |  5 ++
 .../selftests/kvm/lib/perf_test_util.c        | 72 +++++++++++++++++++
 .../kvm/memslot_modification_stress_test.c    | 25 ++-----
 6 files changed, 96 insertions(+), 96 deletions(-)

-- 
2.34.0.rc1.387.gb447b232ab-goog


* [PATCH 1/4] KVM: selftests: Start at iteration 0 instead of -1
  2021-11-11  0:12 [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population David Matlack
@ 2021-11-11  0:12 ` David Matlack
  2021-11-11  0:12 ` [PATCH 2/4] KVM: selftests: Move vCPU thread creation and joining to common helpers David Matlack
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: David Matlack @ 2021-11-11  0:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Sean Christopherson, Ben Gardon, Andrew Jones, Jim Mattson,
	Yanan Wang, Peter Xu, Aaron Lewis, David Matlack

Start at iteration 0 instead of -1 to avoid having to initialize
vcpu_last_completed_iteration when setting up vCPU threads. This
simplifies the next commit where we move vCPU thread initialization
out to a common helper.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 tools/testing/selftests/kvm/access_tracking_perf_test.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 5364a2ed7c68..7f25a06e19c9 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -47,7 +47,7 @@
 #include "guest_modes.h"
 
 /* Global variable used to synchronize all of the vCPU threads. */
-static int iteration = -1;
+static int iteration;
 
 /* Defines what vCPU threads should do during a given iteration. */
 static enum {
@@ -220,7 +220,7 @@ static void *vcpu_thread_main(void *arg)
 	struct perf_test_vcpu_args *vcpu_args = arg;
 	struct kvm_vm *vm = perf_test_args.vm;
 	int vcpu_id = vcpu_args->vcpu_id;
-	int current_iteration = -1;
+	int current_iteration = 0;
 
 	while (spin_wait_for_next_iteration(&current_iteration)) {
 		switch (READ_ONCE(iteration_work)) {
@@ -303,11 +303,9 @@ static pthread_t *create_vcpu_threads(int vcpus)
 	vcpu_threads = malloc(vcpus * sizeof(vcpu_threads[0]));
 	TEST_ASSERT(vcpu_threads, "Failed to allocate vcpu_threads.");
 
-	for (i = 0; i < vcpus; i++) {
-		vcpu_last_completed_iteration[i] = iteration;
+	for (i = 0; i < vcpus; i++)
 		pthread_create(&vcpu_threads[i], NULL, vcpu_thread_main,
 			       &perf_test_args.vcpu_args[i]);
-	}
 
 	return vcpu_threads;
 }
-- 
2.34.0.rc1.387.gb447b232ab-goog


* [PATCH 2/4] KVM: selftests: Move vCPU thread creation and joining to common helpers
  2021-11-11  0:12 [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population David Matlack
  2021-11-11  0:12 ` [PATCH 1/4] KVM: selftests: Start at iteration 0 instead of -1 David Matlack
@ 2021-11-11  0:12 ` David Matlack
  2021-11-11 18:08   ` Ben Gardon
  2021-11-11  0:12 ` [PATCH 3/4] KVM: selftests: Wait for all vCPUs to be created before entering guest mode David Matlack
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: David Matlack @ 2021-11-11  0:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Sean Christopherson, Ben Gardon, Andrew Jones, Jim Mattson,
	Yanan Wang, Peter Xu, Aaron Lewis, David Matlack

Move vCPU thread creation and joining to common helper functions. This
is in preparation for the next commit which ensures that all vCPU
threads are fully created before entering guest mode on any one
vCPU.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
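As a quick illustration of the resulting API (the helper names and the worker
signature are taken from the hunks below; the bodies here are placeholders),
a test now only has to provide a worker that takes a struct
perf_test_vcpu_args and bracket its run with the two new helpers:

  static void vcpu_worker(struct perf_test_vcpu_args *vcpu_args)
  {
          /* Run the vCPU identified by vcpu_args->vcpu_id until told to stop. */
  }

  static void run_test(int nr_vcpus)
  {
          perf_test_start_vcpu_threads(nr_vcpus, vcpu_worker);

          /* Drive the test from the main thread. */

          perf_test_join_vcpu_threads(nr_vcpus);
  }
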
 .../selftests/kvm/access_tracking_perf_test.c | 40 +++-------------
 .../selftests/kvm/demand_paging_test.c        | 25 ++--------
 .../selftests/kvm/dirty_log_perf_test.c       | 19 ++------
 .../selftests/kvm/include/perf_test_util.h    |  5 ++
 .../selftests/kvm/lib/perf_test_util.c        | 46 +++++++++++++++++++
 .../kvm/memslot_modification_stress_test.c    | 22 ++-------
 6 files changed, 67 insertions(+), 90 deletions(-)

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 7f25a06e19c9..d8909032317a 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -215,9 +215,8 @@ static bool spin_wait_for_next_iteration(int *current_iteration)
 	return true;
 }
 
-static void *vcpu_thread_main(void *arg)
+static void vcpu_thread_main(struct perf_test_vcpu_args *vcpu_args)
 {
-	struct perf_test_vcpu_args *vcpu_args = arg;
 	struct kvm_vm *vm = perf_test_args.vm;
 	int vcpu_id = vcpu_args->vcpu_id;
 	int current_iteration = 0;
@@ -235,8 +234,6 @@ static void *vcpu_thread_main(void *arg)
 
 		vcpu_last_completed_iteration[vcpu_id] = current_iteration;
 	}
-
-	return NULL;
 }
 
 static void spin_wait_for_vcpu(int vcpu_id, int target_iteration)
@@ -295,43 +292,16 @@ static void mark_memory_idle(struct kvm_vm *vm, int vcpus)
 	run_iteration(vm, vcpus, "Mark memory idle");
 }
 
-static pthread_t *create_vcpu_threads(int vcpus)
-{
-	pthread_t *vcpu_threads;
-	int i;
-
-	vcpu_threads = malloc(vcpus * sizeof(vcpu_threads[0]));
-	TEST_ASSERT(vcpu_threads, "Failed to allocate vcpu_threads.");
-
-	for (i = 0; i < vcpus; i++)
-		pthread_create(&vcpu_threads[i], NULL, vcpu_thread_main,
-			       &perf_test_args.vcpu_args[i]);
-
-	return vcpu_threads;
-}
-
-static void terminate_vcpu_threads(pthread_t *vcpu_threads, int vcpus)
-{
-	int i;
-
-	/* Set done to signal the vCPU threads to exit */
-	done = true;
-
-	for (i = 0; i < vcpus; i++)
-		pthread_join(vcpu_threads[i], NULL);
-}
-
 static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct test_params *params = arg;
 	struct kvm_vm *vm;
-	pthread_t *vcpu_threads;
 	int vcpus = params->vcpus;
 
 	vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes, 1,
 				 params->backing_src, !overlap_memory_access);
 
-	vcpu_threads = create_vcpu_threads(vcpus);
+	perf_test_start_vcpu_threads(vcpus, vcpu_thread_main);
 
 	pr_info("\n");
 	access_memory(vm, vcpus, ACCESS_WRITE, "Populating memory");
@@ -346,8 +316,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	mark_memory_idle(vm, vcpus);
 	access_memory(vm, vcpus, ACCESS_READ, "Reading from idle memory");
 
-	terminate_vcpu_threads(vcpu_threads, vcpus);
-	free(vcpu_threads);
+	/* Set done to signal the vCPU threads to exit */
+	done = true;
+
+	perf_test_join_vcpu_threads(vcpus);
 	perf_test_destroy_vm(vm);
 }
 
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 26f8fd8a57ec..6a719d065599 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -42,10 +42,9 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
 static size_t demand_paging_size;
 static char *guest_data_prototype;
 
-static void *vcpu_worker(void *data)
+static void vcpu_worker(struct perf_test_vcpu_args *vcpu_args)
 {
 	int ret;
-	struct perf_test_vcpu_args *vcpu_args = (struct perf_test_vcpu_args *)data;
 	int vcpu_id = vcpu_args->vcpu_id;
 	struct kvm_vm *vm = perf_test_args.vm;
 	struct kvm_run *run;
@@ -68,8 +67,6 @@ static void *vcpu_worker(void *data)
 	ts_diff = timespec_elapsed(start);
 	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_id,
 		       ts_diff.tv_sec, ts_diff.tv_nsec);
-
-	return NULL;
 }
 
 static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t addr)
@@ -282,7 +279,6 @@ struct test_params {
 static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct test_params *p = arg;
-	pthread_t *vcpu_threads;
 	pthread_t *uffd_handler_threads = NULL;
 	struct uffd_handler_args *uffd_args = NULL;
 	struct timespec start;
@@ -302,9 +298,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		    "Failed to allocate buffer for guest data pattern");
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
-	vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads));
-	TEST_ASSERT(vcpu_threads, "Memory allocation failed");
-
 	if (p->uffd_mode) {
 		uffd_handler_threads =
 			malloc(nr_vcpus * sizeof(*uffd_handler_threads));
@@ -346,22 +339,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	pr_info("Finished creating vCPUs and starting uffd threads\n");
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
-
-	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
-		pthread_create(&vcpu_threads[vcpu_id], NULL, vcpu_worker,
-			       &perf_test_args.vcpu_args[vcpu_id]);
-	}
-
+	perf_test_start_vcpu_threads(nr_vcpus, vcpu_worker);
 	pr_info("Started all vCPUs\n");
 
-	/* Wait for the vcpu threads to quit */
-	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
-		pthread_join(vcpu_threads[vcpu_id], NULL);
-		PER_VCPU_DEBUG("Joined thread for vCPU %d\n", vcpu_id);
-	}
-
+	perf_test_join_vcpu_threads(nr_vcpus);
 	ts_diff = timespec_elapsed(start);
-
 	pr_info("All vCPU threads joined\n");
 
 	if (p->uffd_mode) {
@@ -385,7 +367,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	perf_test_destroy_vm(vm);
 
 	free(guest_data_prototype);
-	free(vcpu_threads);
 	if (p->uffd_mode) {
 		free(uffd_handler_threads);
 		free(uffd_args);
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 583b4d95aa98..1954b964d1cf 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -31,7 +31,7 @@ static bool host_quit;
 static int iteration;
 static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
 
-static void *vcpu_worker(void *data)
+static void vcpu_worker(struct perf_test_vcpu_args *vcpu_args)
 {
 	int ret;
 	struct kvm_vm *vm = perf_test_args.vm;
@@ -41,7 +41,6 @@ static void *vcpu_worker(void *data)
 	struct timespec ts_diff;
 	struct timespec total = (struct timespec){0};
 	struct timespec avg;
-	struct perf_test_vcpu_args *vcpu_args = (struct perf_test_vcpu_args *)data;
 	int vcpu_id = vcpu_args->vcpu_id;
 
 	run = vcpu_state(vm, vcpu_id);
@@ -83,8 +82,6 @@ static void *vcpu_worker(void *data)
 	pr_debug("\nvCPU %d dirtied 0x%lx pages over %d iterations in %ld.%.9lds. (Avg %ld.%.9lds/iteration)\n",
 		vcpu_id, pages_count, vcpu_last_completed_iteration[vcpu_id],
 		total.tv_sec, total.tv_nsec, avg.tv_sec, avg.tv_nsec);
-
-	return NULL;
 }
 
 struct test_params {
@@ -170,7 +167,6 @@ static void free_bitmaps(unsigned long *bitmaps[], int slots)
 static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct test_params *p = arg;
-	pthread_t *vcpu_threads;
 	struct kvm_vm *vm;
 	unsigned long **bitmaps;
 	uint64_t guest_num_pages;
@@ -204,20 +200,15 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		vm_enable_cap(vm, &cap);
 	}
 
-	vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads));
-	TEST_ASSERT(vcpu_threads, "Memory allocation failed");
-
 	/* Start the iterations */
 	iteration = 0;
 	host_quit = false;
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
-	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
+	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
 		vcpu_last_completed_iteration[vcpu_id] = -1;
 
-		pthread_create(&vcpu_threads[vcpu_id], NULL, vcpu_worker,
-			       &perf_test_args.vcpu_args[vcpu_id]);
-	}
+	perf_test_start_vcpu_threads(nr_vcpus, vcpu_worker);
 
 	/* Allow the vCPUs to populate memory */
 	pr_debug("Starting iteration %d - Populating\n", iteration);
@@ -286,8 +277,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	/* Tell the vcpu thread to quit */
 	host_quit = true;
-	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
-		pthread_join(vcpu_threads[vcpu_id], NULL);
+	perf_test_join_vcpu_threads(nr_vcpus);
 
 	avg = timespec_div(get_dirty_log_total, p->iterations);
 	pr_info("Get dirty log over %lu iterations took %ld.%.9lds. (Avg %ld.%.9lds/iteration)\n",
@@ -302,7 +292,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	}
 
 	free_bitmaps(bitmaps, p->slots);
-	free(vcpu_threads);
 	perf_test_destroy_vm(vm);
 }
 
diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
index 74e3622b3a6e..a86f953d8d36 100644
--- a/tools/testing/selftests/kvm/include/perf_test_util.h
+++ b/tools/testing/selftests/kvm/include/perf_test_util.h
@@ -8,6 +8,8 @@
 #ifndef SELFTEST_KVM_PERF_TEST_UTIL_H
 #define SELFTEST_KVM_PERF_TEST_UTIL_H
 
+#include <pthread.h>
+
 #include "kvm_util.h"
 
 /* Default guest test virtual memory offset */
@@ -45,4 +47,7 @@ void perf_test_destroy_vm(struct kvm_vm *vm);
 
 void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract);
 
+void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *));
+void perf_test_join_vcpu_threads(int vcpus);
+
 #endif /* SELFTEST_KVM_PERF_TEST_UTIL_H */
diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
index 77f9eb5667c9..d646477ed16a 100644
--- a/tools/testing/selftests/kvm/lib/perf_test_util.c
+++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
@@ -16,6 +16,20 @@ struct perf_test_args perf_test_args;
  */
 static uint64_t guest_test_virt_mem = DEFAULT_GUEST_TEST_MEM;
 
+struct vcpu_thread {
+	/* The id of the vCPU. */
+	int vcpu_id;
+
+	/* The pthread backing the vCPU. */
+	pthread_t thread;
+};
+
+/* The vCPU threads involved in this test. */
+static struct vcpu_thread vcpu_threads[KVM_MAX_VCPUS];
+
+/* The function run by each vCPU thread, as provided by the test. */
+static void (*vcpu_thread_fn)(struct perf_test_vcpu_args *);
+
 /*
  * Continuously write to the first 8 bytes of each page in the
  * specified region.
@@ -177,3 +191,35 @@ void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract)
 	perf_test_args.wr_fract = wr_fract;
 	sync_global_to_guest(vm, perf_test_args);
 }
+
+static void *vcpu_thread_main(void *data)
+{
+	struct vcpu_thread *vcpu = data;
+
+	vcpu_thread_fn(&perf_test_args.vcpu_args[vcpu->vcpu_id]);
+
+	return NULL;
+}
+
+void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *))
+{
+	int vcpu_id;
+
+	vcpu_thread_fn = vcpu_fn;
+
+	for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
+		struct vcpu_thread *vcpu = &vcpu_threads[vcpu_id];
+
+		vcpu->vcpu_id = vcpu_id;
+
+		pthread_create(&vcpu->thread, NULL, vcpu_thread_main, vcpu);
+	}
+}
+
+void perf_test_join_vcpu_threads(int vcpus)
+{
+	int vcpu_id;
+
+	for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
+		pthread_join(vcpu_threads[vcpu_id].thread, NULL);
+}
diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
index df431d0da1ee..5bd0b076f57f 100644
--- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
+++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
@@ -36,11 +36,9 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
 
 static bool run_vcpus = true;
 
-static void *vcpu_worker(void *data)
+static void vcpu_worker(struct perf_test_vcpu_args *vcpu_args)
 {
 	int ret;
-	struct perf_test_vcpu_args *vcpu_args =
-		(struct perf_test_vcpu_args *)data;
 	int vcpu_id = vcpu_args->vcpu_id;
 	struct kvm_vm *vm = perf_test_args.vm;
 	struct kvm_run *run;
@@ -59,8 +57,6 @@ static void *vcpu_worker(void *data)
 			    "Invalid guest sync status: exit_reason=%s\n",
 			    exit_reason_str(run->exit_reason));
 	}
-
-	return NULL;
 }
 
 struct memslot_antagonist_args {
@@ -100,22 +96,15 @@ struct test_params {
 static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct test_params *p = arg;
-	pthread_t *vcpu_threads;
 	struct kvm_vm *vm;
-	int vcpu_id;
 
 	vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 VM_MEM_SRC_ANONYMOUS,
 				 p->partition_vcpu_memory_access);
 
-	vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads));
-	TEST_ASSERT(vcpu_threads, "Memory allocation failed");
-
 	pr_info("Finished creating vCPUs\n");
 
-	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
-		pthread_create(&vcpu_threads[vcpu_id], NULL, vcpu_worker,
-			       &perf_test_args.vcpu_args[vcpu_id]);
+	perf_test_start_vcpu_threads(nr_vcpus, vcpu_worker);
 
 	pr_info("Started all vCPUs\n");
 
@@ -124,16 +113,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	run_vcpus = false;
 
-	/* Wait for the vcpu threads to quit */
-	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
-		pthread_join(vcpu_threads[vcpu_id], NULL);
-
+	perf_test_join_vcpu_threads(nr_vcpus);
 	pr_info("All vCPU threads joined\n");
 
 	ucall_uninit(vm);
 	kvm_vm_free(vm);
-
-	free(vcpu_threads);
 }
 
 static void help(char *name)
-- 
2.34.0.rc1.387.gb447b232ab-goog


* [PATCH 3/4] KVM: selftests: Wait for all vCPUs to be created before entering guest mode
  2021-11-11  0:12 [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population David Matlack
  2021-11-11  0:12 ` [PATCH 1/4] KVM: selftests: Start at iteration 0 instead of -1 David Matlack
  2021-11-11  0:12 ` [PATCH 2/4] KVM: selftests: Move vCPU thread creation and joining to common helpers David Matlack
@ 2021-11-11  0:12 ` David Matlack
  2021-11-11 18:17   ` Ben Gardon
  2021-11-11  0:12 ` [PATCH 4/4] KVM: selftests: Use perf_test_destroy_vm in memslot_modification_stress_test David Matlack
  2021-11-16 11:13 ` [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population Paolo Bonzini
  4 siblings, 1 reply; 14+ messages in thread
From: David Matlack @ 2021-11-11  0:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Sean Christopherson, Ben Gardon, Andrew Jones, Jim Mattson,
	Yanan Wang, Peter Xu, Aaron Lewis, David Matlack

Thread creation requires taking the mmap_sem in write mode, which causes
vCPU threads running in guest mode to block while they are populating
memory. Fix this by waiting for all vCPU threads to be created and start
running before entering guest mode on any one vCPU thread.

This substantially improves the "Populate memory time" when using 1GiB
pages since it allows all vCPUs to zero pages in parallel rather than
blocking because a writer is waiting (which is waiting for another vCPU
that is busy zeroing a 1GiB page).

Before:

  $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
  ...
  Populate memory time: 52.811184013s

After:

  $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
  ...
  Populate memory time: 10.204573342s

Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../selftests/kvm/lib/perf_test_util.c        | 26 +++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
index d646477ed16a..722df3a28791 100644
--- a/tools/testing/selftests/kvm/lib/perf_test_util.c
+++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
@@ -22,6 +22,9 @@ struct vcpu_thread {
 
 	/* The pthread backing the vCPU. */
 	pthread_t thread;
+
+	/* Set to true once the vCPU thread is up and running. */
+	bool running;
 };
 
 /* The vCPU threads involved in this test. */
@@ -30,6 +33,9 @@ static struct vcpu_thread vcpu_threads[KVM_MAX_VCPUS];
 /* The function run by each vCPU thread, as provided by the test. */
 static void (*vcpu_thread_fn)(struct perf_test_vcpu_args *);
 
+/* Set to true once all vCPU threads are up and running. */
+static bool all_vcpu_threads_running;
+
 /*
  * Continuously write to the first 8 bytes of each page in the
  * specified region.
@@ -196,6 +202,17 @@ static void *vcpu_thread_main(void *data)
 {
 	struct vcpu_thread *vcpu = data;
 
+	WRITE_ONCE(vcpu->running, true);
+
+	/*
+	 * Wait for all vCPU threads to be up and running before calling the test-
+	 * provided vCPU thread function. This prevents thread creation (which
+	 * requires taking the mmap_sem in write mode) from interfering with the
+	 * guest faulting in its memory.
+	 */
+	while (!READ_ONCE(all_vcpu_threads_running))
+		;
+
 	vcpu_thread_fn(&perf_test_args.vcpu_args[vcpu->vcpu_id]);
 
 	return NULL;
@@ -206,14 +223,23 @@ void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vc
 	int vcpu_id;
 
 	vcpu_thread_fn = vcpu_fn;
+	WRITE_ONCE(all_vcpu_threads_running, false);
 
 	for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
 		struct vcpu_thread *vcpu = &vcpu_threads[vcpu_id];
 
 		vcpu->vcpu_id = vcpu_id;
+		WRITE_ONCE(vcpu->running, false);
 
 		pthread_create(&vcpu->thread, NULL, vcpu_thread_main, vcpu);
 	}
+
+	for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
+		while (!READ_ONCE(vcpu_threads[vcpu_id].running))
+			;
+	}
+
+	WRITE_ONCE(all_vcpu_threads_running, true);
 }
 
 void perf_test_join_vcpu_threads(int vcpus)
-- 
2.34.0.rc1.387.gb447b232ab-goog


* [PATCH 4/4] KVM: selftests: Use perf_test_destroy_vm in memslot_modification_stress_test
  2021-11-11  0:12 [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population David Matlack
                   ` (2 preceding siblings ...)
  2021-11-11  0:12 ` [PATCH 3/4] KVM: selftests: Wait for all vCPUs to be created before entering guest mode David Matlack
@ 2021-11-11  0:12 ` David Matlack
  2021-11-11 18:18   ` Ben Gardon
  2021-11-16 11:13 ` [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population Paolo Bonzini
  4 siblings, 1 reply; 14+ messages in thread
From: David Matlack @ 2021-11-11  0:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Sean Christopherson, Ben Gardon, Andrew Jones, Jim Mattson,
	Yanan Wang, Peter Xu, Aaron Lewis, David Matlack

Change memslot_modification_stress_test to use perf_test_destroy_vm
instead of manually calling ucall_uninit and kvm_vm_free.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 tools/testing/selftests/kvm/memslot_modification_stress_test.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
index 5bd0b076f57f..1410d0a9141a 100644
--- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
+++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
@@ -116,8 +116,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	perf_test_join_vcpu_threads(nr_vcpus);
 	pr_info("All vCPU threads joined\n");
 
-	ucall_uninit(vm);
-	kvm_vm_free(vm);
+	perf_test_destroy_vm(vm);
 }
 
 static void help(char *name)
-- 
2.34.0.rc1.387.gb447b232ab-goog


* Re: [PATCH 2/4] KVM: selftests: Move vCPU thread creation and joining to common helpers
  2021-11-11  0:12 ` [PATCH 2/4] KVM: selftests: Move vCPU thread creation and joining to common helpers David Matlack
@ 2021-11-11 18:08   ` Ben Gardon
  2021-11-12  0:47     ` David Matlack
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Gardon @ 2021-11-11 18:08 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Andrew Jones,
	Jim Mattson, Yanan Wang, Peter Xu, Aaron Lewis

On Wed, Nov 10, 2021 at 4:13 PM David Matlack <dmatlack@google.com> wrote:
>
> Move vCPU thread creation and joining to common helper functions. This
> is in preparation for the next commit which ensures that all vCPU
> threads are fully created before entering guest mode on any one
> vCPU.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>

Besides one comment below, this is an awesome cleanup!

> ---
>  .../selftests/kvm/access_tracking_perf_test.c | 40 +++-------------
>  .../selftests/kvm/demand_paging_test.c        | 25 ++--------
>  .../selftests/kvm/dirty_log_perf_test.c       | 19 ++------
>  .../selftests/kvm/include/perf_test_util.h    |  5 ++
>  .../selftests/kvm/lib/perf_test_util.c        | 46 +++++++++++++++++++
>  .../kvm/memslot_modification_stress_test.c    | 22 ++-------
>  6 files changed, 67 insertions(+), 90 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> index 7f25a06e19c9..d8909032317a 100644
> --- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
> +++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> @@ -215,9 +215,8 @@ static bool spin_wait_for_next_iteration(int *current_iteration)
>         return true;
>  }
>
> -static void *vcpu_thread_main(void *arg)
> +static void vcpu_thread_main(struct perf_test_vcpu_args *vcpu_args)
>  {
> -       struct perf_test_vcpu_args *vcpu_args = arg;
>         struct kvm_vm *vm = perf_test_args.vm;
>         int vcpu_id = vcpu_args->vcpu_id;
>         int current_iteration = 0;
> @@ -235,8 +234,6 @@ static void *vcpu_thread_main(void *arg)
>
>                 vcpu_last_completed_iteration[vcpu_id] = current_iteration;
>         }
> -
> -       return NULL;
>  }
>
>  static void spin_wait_for_vcpu(int vcpu_id, int target_iteration)
> @@ -295,43 +292,16 @@ static void mark_memory_idle(struct kvm_vm *vm, int vcpus)
>         run_iteration(vm, vcpus, "Mark memory idle");
>  }
>
> -static pthread_t *create_vcpu_threads(int vcpus)
> -{
> -       pthread_t *vcpu_threads;
> -       int i;
> -
> -       vcpu_threads = malloc(vcpus * sizeof(vcpu_threads[0]));
> -       TEST_ASSERT(vcpu_threads, "Failed to allocate vcpu_threads.");
> -
> -       for (i = 0; i < vcpus; i++)
> -               pthread_create(&vcpu_threads[i], NULL, vcpu_thread_main,
> -                              &perf_test_args.vcpu_args[i]);
> -
> -       return vcpu_threads;
> -}
> -
> -static void terminate_vcpu_threads(pthread_t *vcpu_threads, int vcpus)
> -{
> -       int i;
> -
> -       /* Set done to signal the vCPU threads to exit */
> -       done = true;
> -
> -       for (i = 0; i < vcpus; i++)
> -               pthread_join(vcpu_threads[i], NULL);
> -}
> -
>  static void run_test(enum vm_guest_mode mode, void *arg)
>  {
>         struct test_params *params = arg;
>         struct kvm_vm *vm;
> -       pthread_t *vcpu_threads;
>         int vcpus = params->vcpus;
>
>         vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes, 1,
>                                  params->backing_src, !overlap_memory_access);
>
> -       vcpu_threads = create_vcpu_threads(vcpus);
> +       perf_test_start_vcpu_threads(vcpus, vcpu_thread_main);
>
>         pr_info("\n");
>         access_memory(vm, vcpus, ACCESS_WRITE, "Populating memory");
> @@ -346,8 +316,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>         mark_memory_idle(vm, vcpus);
>         access_memory(vm, vcpus, ACCESS_READ, "Reading from idle memory");
>
> -       terminate_vcpu_threads(vcpu_threads, vcpus);
> -       free(vcpu_threads);
> +       /* Set done to signal the vCPU threads to exit */
> +       done = true;
> +
> +       perf_test_join_vcpu_threads(vcpus);
>         perf_test_destroy_vm(vm);
>  }
>
> diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> index 26f8fd8a57ec..6a719d065599 100644
> --- a/tools/testing/selftests/kvm/demand_paging_test.c
> +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> @@ -42,10 +42,9 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
>  static size_t demand_paging_size;
>  static char *guest_data_prototype;
>
> -static void *vcpu_worker(void *data)
> +static void vcpu_worker(struct perf_test_vcpu_args *vcpu_args)
>  {
>         int ret;
> -       struct perf_test_vcpu_args *vcpu_args = (struct perf_test_vcpu_args *)data;
>         int vcpu_id = vcpu_args->vcpu_id;
>         struct kvm_vm *vm = perf_test_args.vm;
>         struct kvm_run *run;
> @@ -68,8 +67,6 @@ static void *vcpu_worker(void *data)
>         ts_diff = timespec_elapsed(start);
>         PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_id,
>                        ts_diff.tv_sec, ts_diff.tv_nsec);
> -
> -       return NULL;
>  }
>
>  static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t addr)
> @@ -282,7 +279,6 @@ struct test_params {
>  static void run_test(enum vm_guest_mode mode, void *arg)
>  {
>         struct test_params *p = arg;
> -       pthread_t *vcpu_threads;
>         pthread_t *uffd_handler_threads = NULL;
>         struct uffd_handler_args *uffd_args = NULL;
>         struct timespec start;
> @@ -302,9 +298,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>                     "Failed to allocate buffer for guest data pattern");
>         memset(guest_data_prototype, 0xAB, demand_paging_size);
>
> -       vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads));
> -       TEST_ASSERT(vcpu_threads, "Memory allocation failed");
> -
>         if (p->uffd_mode) {
>                 uffd_handler_threads =
>                         malloc(nr_vcpus * sizeof(*uffd_handler_threads));
> @@ -346,22 +339,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>         pr_info("Finished creating vCPUs and starting uffd threads\n");
>
>         clock_gettime(CLOCK_MONOTONIC, &start);
> -
> -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
> -               pthread_create(&vcpu_threads[vcpu_id], NULL, vcpu_worker,
> -                              &perf_test_args.vcpu_args[vcpu_id]);
> -       }
> -
> +       perf_test_start_vcpu_threads(nr_vcpus, vcpu_worker);
>         pr_info("Started all vCPUs\n");
>
> -       /* Wait for the vcpu threads to quit */
> -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
> -               pthread_join(vcpu_threads[vcpu_id], NULL);
> -               PER_VCPU_DEBUG("Joined thread for vCPU %d\n", vcpu_id);
> -       }
> -
> +       perf_test_join_vcpu_threads(nr_vcpus);
>         ts_diff = timespec_elapsed(start);
> -
>         pr_info("All vCPU threads joined\n");
>
>         if (p->uffd_mode) {
> @@ -385,7 +367,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>         perf_test_destroy_vm(vm);
>
>         free(guest_data_prototype);
> -       free(vcpu_threads);
>         if (p->uffd_mode) {
>                 free(uffd_handler_threads);
>                 free(uffd_args);
> diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> index 583b4d95aa98..1954b964d1cf 100644
> --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> @@ -31,7 +31,7 @@ static bool host_quit;
>  static int iteration;
>  static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
>
> -static void *vcpu_worker(void *data)
> +static void vcpu_worker(struct perf_test_vcpu_args *vcpu_args)
>  {
>         int ret;
>         struct kvm_vm *vm = perf_test_args.vm;
> @@ -41,7 +41,6 @@ static void *vcpu_worker(void *data)
>         struct timespec ts_diff;
>         struct timespec total = (struct timespec){0};
>         struct timespec avg;
> -       struct perf_test_vcpu_args *vcpu_args = (struct perf_test_vcpu_args *)data;
>         int vcpu_id = vcpu_args->vcpu_id;
>
>         run = vcpu_state(vm, vcpu_id);
> @@ -83,8 +82,6 @@ static void *vcpu_worker(void *data)
>         pr_debug("\nvCPU %d dirtied 0x%lx pages over %d iterations in %ld.%.9lds. (Avg %ld.%.9lds/iteration)\n",
>                 vcpu_id, pages_count, vcpu_last_completed_iteration[vcpu_id],
>                 total.tv_sec, total.tv_nsec, avg.tv_sec, avg.tv_nsec);
> -
> -       return NULL;
>  }
>
>  struct test_params {
> @@ -170,7 +167,6 @@ static void free_bitmaps(unsigned long *bitmaps[], int slots)
>  static void run_test(enum vm_guest_mode mode, void *arg)
>  {
>         struct test_params *p = arg;
> -       pthread_t *vcpu_threads;
>         struct kvm_vm *vm;
>         unsigned long **bitmaps;
>         uint64_t guest_num_pages;
> @@ -204,20 +200,15 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>                 vm_enable_cap(vm, &cap);
>         }
>
> -       vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads));
> -       TEST_ASSERT(vcpu_threads, "Memory allocation failed");
> -
>         /* Start the iterations */
>         iteration = 0;
>         host_quit = false;
>
>         clock_gettime(CLOCK_MONOTONIC, &start);
> -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
> +       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
>                 vcpu_last_completed_iteration[vcpu_id] = -1;

Did you miss this in the previous commit or mean to leave it here?

>
> -               pthread_create(&vcpu_threads[vcpu_id], NULL, vcpu_worker,
> -                              &perf_test_args.vcpu_args[vcpu_id]);
> -       }
> +       perf_test_start_vcpu_threads(nr_vcpus, vcpu_worker);
>
>         /* Allow the vCPUs to populate memory */
>         pr_debug("Starting iteration %d - Populating\n", iteration);
> @@ -286,8 +277,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>
>         /* Tell the vcpu thread to quit */
>         host_quit = true;
> -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
> -               pthread_join(vcpu_threads[vcpu_id], NULL);
> +       perf_test_join_vcpu_threads(nr_vcpus);
>
>         avg = timespec_div(get_dirty_log_total, p->iterations);
>         pr_info("Get dirty log over %lu iterations took %ld.%.9lds. (Avg %ld.%.9lds/iteration)\n",
> @@ -302,7 +292,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>         }
>
>         free_bitmaps(bitmaps, p->slots);
> -       free(vcpu_threads);
>         perf_test_destroy_vm(vm);
>  }
>
> diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
> index 74e3622b3a6e..a86f953d8d36 100644
> --- a/tools/testing/selftests/kvm/include/perf_test_util.h
> +++ b/tools/testing/selftests/kvm/include/perf_test_util.h
> @@ -8,6 +8,8 @@
>  #ifndef SELFTEST_KVM_PERF_TEST_UTIL_H
>  #define SELFTEST_KVM_PERF_TEST_UTIL_H
>
> +#include <pthread.h>
> +
>  #include "kvm_util.h"
>
>  /* Default guest test virtual memory offset */
> @@ -45,4 +47,7 @@ void perf_test_destroy_vm(struct kvm_vm *vm);
>
>  void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract);
>
> +void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *));
> +void perf_test_join_vcpu_threads(int vcpus);
> +
>  #endif /* SELFTEST_KVM_PERF_TEST_UTIL_H */
> diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> index 77f9eb5667c9..d646477ed16a 100644
> --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> @@ -16,6 +16,20 @@ struct perf_test_args perf_test_args;
>   */
>  static uint64_t guest_test_virt_mem = DEFAULT_GUEST_TEST_MEM;
>
> +struct vcpu_thread {
> +       /* The id of the vCPU. */
> +       int vcpu_id;
> +
> +       /* The pthread backing the vCPU. */
> +       pthread_t thread;
> +};
> +
> +/* The vCPU threads involved in this test. */
> +static struct vcpu_thread vcpu_threads[KVM_MAX_VCPUS];
> +
> +/* The function run by each vCPU thread, as provided by the test. */
> +static void (*vcpu_thread_fn)(struct perf_test_vcpu_args *);
> +
>  /*
>   * Continuously write to the first 8 bytes of each page in the
>   * specified region.
> @@ -177,3 +191,35 @@ void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract)
>         perf_test_args.wr_fract = wr_fract;
>         sync_global_to_guest(vm, perf_test_args);
>  }
> +
> +static void *vcpu_thread_main(void *data)
> +{
> +       struct vcpu_thread *vcpu = data;
> +
> +       vcpu_thread_fn(&perf_test_args.vcpu_args[vcpu->vcpu_id]);
> +
> +       return NULL;
> +}
> +
> +void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *))
> +{
> +       int vcpu_id;
> +
> +       vcpu_thread_fn = vcpu_fn;
> +
> +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
> +               struct vcpu_thread *vcpu = &vcpu_threads[vcpu_id];
> +
> +               vcpu->vcpu_id = vcpu_id;
> +
> +               pthread_create(&vcpu->thread, NULL, vcpu_thread_main, vcpu);
> +       }
> +}
> +
> +void perf_test_join_vcpu_threads(int vcpus)
> +{
> +       int vcpu_id;
> +
> +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
> +               pthread_join(vcpu_threads[vcpu_id].thread, NULL);
> +}
> diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
> index df431d0da1ee..5bd0b076f57f 100644
> --- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
> +++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
> @@ -36,11 +36,9 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
>
>  static bool run_vcpus = true;
>
> -static void *vcpu_worker(void *data)
> +static void vcpu_worker(struct perf_test_vcpu_args *vcpu_args)
>  {
>         int ret;
> -       struct perf_test_vcpu_args *vcpu_args =
> -               (struct perf_test_vcpu_args *)data;
>         int vcpu_id = vcpu_args->vcpu_id;
>         struct kvm_vm *vm = perf_test_args.vm;
>         struct kvm_run *run;
> @@ -59,8 +57,6 @@ static void *vcpu_worker(void *data)
>                             "Invalid guest sync status: exit_reason=%s\n",
>                             exit_reason_str(run->exit_reason));
>         }
> -
> -       return NULL;
>  }
>
>  struct memslot_antagonist_args {
> @@ -100,22 +96,15 @@ struct test_params {
>  static void run_test(enum vm_guest_mode mode, void *arg)
>  {
>         struct test_params *p = arg;
> -       pthread_t *vcpu_threads;
>         struct kvm_vm *vm;
> -       int vcpu_id;
>
>         vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
>                                  VM_MEM_SRC_ANONYMOUS,
>                                  p->partition_vcpu_memory_access);
>
> -       vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads));
> -       TEST_ASSERT(vcpu_threads, "Memory allocation failed");
> -
>         pr_info("Finished creating vCPUs\n");
>
> -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
> -               pthread_create(&vcpu_threads[vcpu_id], NULL, vcpu_worker,
> -                              &perf_test_args.vcpu_args[vcpu_id]);
> +       perf_test_start_vcpu_threads(nr_vcpus, vcpu_worker);
>
>         pr_info("Started all vCPUs\n");
>
> @@ -124,16 +113,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>
>         run_vcpus = false;
>
> -       /* Wait for the vcpu threads to quit */
> -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
> -               pthread_join(vcpu_threads[vcpu_id], NULL);
> -
> +       perf_test_join_vcpu_threads(nr_vcpus);
>         pr_info("All vCPU threads joined\n");
>
>         ucall_uninit(vm);
>         kvm_vm_free(vm);
> -
> -       free(vcpu_threads);
>  }
>
>  static void help(char *name)
> --
> 2.34.0.rc1.387.gb447b232ab-goog
>

* Re: [PATCH 3/4] KVM: selftests: Wait for all vCPUs to be created before entering guest mode
  2021-11-11  0:12 ` [PATCH 3/4] KVM: selftests: Wait for all vCPUs to be created before entering guest mode David Matlack
@ 2021-11-11 18:17   ` Ben Gardon
  2021-11-12  0:51     ` David Matlack
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Gardon @ 2021-11-11 18:17 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Andrew Jones,
	Jim Mattson, Yanan Wang, Peter Xu, Aaron Lewis

On Wed, Nov 10, 2021 at 4:13 PM David Matlack <dmatlack@google.com> wrote:
>
> Thread creation requires taking the mmap_sem in write mode, which causes
> vCPU threads running in guest mode to block while they are populating
> memory. Fix this by waiting for all vCPU threads to be created and start
> running before entering guest mode on any one vCPU thread.
>
> This substantially improves the "Populate memory time" when using 1GiB
> pages since it allows all vCPUs to zero pages in parallel rather than
> blocking because a writer is waiting (which is waiting for another vCPU
> that is busy zeroing a 1GiB page).
>
> Before:
>
>   $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
>   ...
>   Populate memory time: 52.811184013s
>
> After:
>
>   $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
>   ...
>   Populate memory time: 10.204573342s
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  .../selftests/kvm/lib/perf_test_util.c        | 26 +++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> index d646477ed16a..722df3a28791 100644
> --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> @@ -22,6 +22,9 @@ struct vcpu_thread {
>
>         /* The pthread backing the vCPU. */
>         pthread_t thread;
> +
> +       /* Set to true once the vCPU thread is up and running. */
> +       bool running;
>  };
>
>  /* The vCPU threads involved in this test. */
> @@ -30,6 +33,9 @@ static struct vcpu_thread vcpu_threads[KVM_MAX_VCPUS];
>  /* The function run by each vCPU thread, as provided by the test. */
>  static void (*vcpu_thread_fn)(struct perf_test_vcpu_args *);
>
> +/* Set to true once all vCPU threads are up and running. */
> +static bool all_vcpu_threads_running;
> +
>  /*
>   * Continuously write to the first 8 bytes of each page in the
>   * specified region.
> @@ -196,6 +202,17 @@ static void *vcpu_thread_main(void *data)
>  {
>         struct vcpu_thread *vcpu = data;
>
> +       WRITE_ONCE(vcpu->running, true);
> +
> +       /*
> +        * Wait for all vCPU threads to be up and running before calling the test-
> +        * provided vCPU thread function. This prevents thread creation (which
> +        * requires taking the mmap_sem in write mode) from interfering with the
> +        * guest faulting in its memory.
> +        */
> +       while (!READ_ONCE(all_vcpu_threads_running))
> +               ;
> +

I can never remember the rules on this so I could be wrong, but you
may want a cpu_relax() in that loop to prevent it from being optimized
out. Maybe the READ_ONCE is sufficient though.
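
Something along the lines of the below is what I had in mind (untested, and
the cpu_relax() here is just a local x86-only stand-in since I'm not sure
the selftests have a generic helper for it):

  static inline void cpu_relax(void)
  {
          __builtin_ia32_pause(); /* x86 "pause"; other arches need their own hint */
  }

and then in vcpu_thread_main():

          while (!READ_ONCE(all_vcpu_threads_running))
                  cpu_relax();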

>         vcpu_thread_fn(&perf_test_args.vcpu_args[vcpu->vcpu_id]);
>
>         return NULL;
> @@ -206,14 +223,23 @@ void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vc
>         int vcpu_id;
>
>         vcpu_thread_fn = vcpu_fn;
> +       WRITE_ONCE(all_vcpu_threads_running, false);
>
>         for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
>                 struct vcpu_thread *vcpu = &vcpu_threads[vcpu_id];
>
>                 vcpu->vcpu_id = vcpu_id;
> +               WRITE_ONCE(vcpu->running, false);

Do these need to be WRITE_ONCE? I don't think WRITE_ONCE provides any
extra memory ordering guarantees and I don't know why the compiler
would optimize these out. If they do need to be WRITE_ONCE, they
probably merit comments.
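
If stronger ordering ever turns out to be needed, one option that makes the
intent explicit (purely illustrative, not something this patch has to adopt)
would be C11 atomics for the flags, e.g.:

  #include <stdatomic.h>

  static atomic_bool all_vcpu_threads_running;

  /* Main thread, once every vcpu->running has been observed as true: */
  atomic_store_explicit(&all_vcpu_threads_running, true, memory_order_release);

  /* vCPU threads: */
  while (!atomic_load_explicit(&all_vcpu_threads_running, memory_order_acquire))
          ;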

>
>                 pthread_create(&vcpu->thread, NULL, vcpu_thread_main, vcpu);
>         }
> +
> +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
> +               while (!READ_ONCE(vcpu_threads[vcpu_id].running))
> +                       ;
> +       }
> +
> +       WRITE_ONCE(all_vcpu_threads_running, true);
>  }
>
>  void perf_test_join_vcpu_threads(int vcpus)
> --
> 2.34.0.rc1.387.gb447b232ab-goog
>

* Re: [PATCH 4/4] KVM: selftests: Use perf_test_destroy_vm in memslot_modification_stress_test
  2021-11-11  0:12 ` [PATCH 4/4] KVM: selftests: Use perf_test_destroy_vm in memslot_modification_stress_test David Matlack
@ 2021-11-11 18:18   ` Ben Gardon
  0 siblings, 0 replies; 14+ messages in thread
From: Ben Gardon @ 2021-11-11 18:18 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Andrew Jones,
	Jim Mattson, Yanan Wang, Peter Xu, Aaron Lewis

On Wed, Nov 10, 2021 at 4:13 PM David Matlack <dmatlack@google.com> wrote:
>
> Change memslot_modification_stress_test to use perf_test_destroy_vm
> instead of manually calling ucall_uninit and kvm_vm_free.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>

> ---
>  tools/testing/selftests/kvm/memslot_modification_stress_test.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
> index 5bd0b076f57f..1410d0a9141a 100644
> --- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
> +++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
> @@ -116,8 +116,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>         perf_test_join_vcpu_threads(nr_vcpus);
>         pr_info("All vCPU threads joined\n");
>
> -       ucall_uninit(vm);
> -       kvm_vm_free(vm);
> +       perf_test_destroy_vm(vm);
>  }
>
>  static void help(char *name)
> --
> 2.34.0.rc1.387.gb447b232ab-goog
>

* Re: [PATCH 2/4] KVM: selftests: Move vCPU thread creation and joining to common helpers
  2021-11-11 18:08   ` Ben Gardon
@ 2021-11-12  0:47     ` David Matlack
  0 siblings, 0 replies; 14+ messages in thread
From: David Matlack @ 2021-11-12  0:47 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Andrew Jones,
	Jim Mattson, Yanan Wang, Peter Xu, Aaron Lewis

On Thu, Nov 11, 2021 at 10:08 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Nov 10, 2021 at 4:13 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Move vCPU thread creation and joining to common helper functions. This
> > is in preparation for the next commit which ensures that all vCPU
> > threads are fully created before entering guest mode on any one
> > vCPU.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> Reviewed-by: Ben Gardon <bgardon@google.com>
>
> Besides one comment below, this is an awesome cleanup!
>
> > ---
> >  .../selftests/kvm/access_tracking_perf_test.c | 40 +++-------------
> >  .../selftests/kvm/demand_paging_test.c        | 25 ++--------
> >  .../selftests/kvm/dirty_log_perf_test.c       | 19 ++------
> >  .../selftests/kvm/include/perf_test_util.h    |  5 ++
> >  .../selftests/kvm/lib/perf_test_util.c        | 46 +++++++++++++++++++
> >  .../kvm/memslot_modification_stress_test.c    | 22 ++-------
> >  6 files changed, 67 insertions(+), 90 deletions(-)
> >
> > diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> > index 7f25a06e19c9..d8909032317a 100644
> > --- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
> > +++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> > @@ -215,9 +215,8 @@ static bool spin_wait_for_next_iteration(int *current_iteration)
> >         return true;
> >  }
> >
> > -static void *vcpu_thread_main(void *arg)
> > +static void vcpu_thread_main(struct perf_test_vcpu_args *vcpu_args)
> >  {
> > -       struct perf_test_vcpu_args *vcpu_args = arg;
> >         struct kvm_vm *vm = perf_test_args.vm;
> >         int vcpu_id = vcpu_args->vcpu_id;
> >         int current_iteration = 0;
> > @@ -235,8 +234,6 @@ static void *vcpu_thread_main(void *arg)
> >
> >                 vcpu_last_completed_iteration[vcpu_id] = current_iteration;
> >         }
> > -
> > -       return NULL;
> >  }
> >
> >  static void spin_wait_for_vcpu(int vcpu_id, int target_iteration)
> > @@ -295,43 +292,16 @@ static void mark_memory_idle(struct kvm_vm *vm, int vcpus)
> >         run_iteration(vm, vcpus, "Mark memory idle");
> >  }
> >
> > -static pthread_t *create_vcpu_threads(int vcpus)
> > -{
> > -       pthread_t *vcpu_threads;
> > -       int i;
> > -
> > -       vcpu_threads = malloc(vcpus * sizeof(vcpu_threads[0]));
> > -       TEST_ASSERT(vcpu_threads, "Failed to allocate vcpu_threads.");
> > -
> > -       for (i = 0; i < vcpus; i++)
> > -               pthread_create(&vcpu_threads[i], NULL, vcpu_thread_main,
> > -                              &perf_test_args.vcpu_args[i]);
> > -
> > -       return vcpu_threads;
> > -}
> > -
> > -static void terminate_vcpu_threads(pthread_t *vcpu_threads, int vcpus)
> > -{
> > -       int i;
> > -
> > -       /* Set done to signal the vCPU threads to exit */
> > -       done = true;
> > -
> > -       for (i = 0; i < vcpus; i++)
> > -               pthread_join(vcpu_threads[i], NULL);
> > -}
> > -
> >  static void run_test(enum vm_guest_mode mode, void *arg)
> >  {
> >         struct test_params *params = arg;
> >         struct kvm_vm *vm;
> > -       pthread_t *vcpu_threads;
> >         int vcpus = params->vcpus;
> >
> >         vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes, 1,
> >                                  params->backing_src, !overlap_memory_access);
> >
> > -       vcpu_threads = create_vcpu_threads(vcpus);
> > +       perf_test_start_vcpu_threads(vcpus, vcpu_thread_main);
> >
> >         pr_info("\n");
> >         access_memory(vm, vcpus, ACCESS_WRITE, "Populating memory");
> > @@ -346,8 +316,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >         mark_memory_idle(vm, vcpus);
> >         access_memory(vm, vcpus, ACCESS_READ, "Reading from idle memory");
> >
> > -       terminate_vcpu_threads(vcpu_threads, vcpus);
> > -       free(vcpu_threads);
> > +       /* Set done to signal the vCPU threads to exit */
> > +       done = true;
> > +
> > +       perf_test_join_vcpu_threads(vcpus);
> >         perf_test_destroy_vm(vm);
> >  }
> >
> > diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> > index 26f8fd8a57ec..6a719d065599 100644
> > --- a/tools/testing/selftests/kvm/demand_paging_test.c
> > +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> > @@ -42,10 +42,9 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
> >  static size_t demand_paging_size;
> >  static char *guest_data_prototype;
> >
> > -static void *vcpu_worker(void *data)
> > +static void vcpu_worker(struct perf_test_vcpu_args *vcpu_args)
> >  {
> >         int ret;
> > -       struct perf_test_vcpu_args *vcpu_args = (struct perf_test_vcpu_args *)data;
> >         int vcpu_id = vcpu_args->vcpu_id;
> >         struct kvm_vm *vm = perf_test_args.vm;
> >         struct kvm_run *run;
> > @@ -68,8 +67,6 @@ static void *vcpu_worker(void *data)
> >         ts_diff = timespec_elapsed(start);
> >         PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_id,
> >                        ts_diff.tv_sec, ts_diff.tv_nsec);
> > -
> > -       return NULL;
> >  }
> >
> >  static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t addr)
> > @@ -282,7 +279,6 @@ struct test_params {
> >  static void run_test(enum vm_guest_mode mode, void *arg)
> >  {
> >         struct test_params *p = arg;
> > -       pthread_t *vcpu_threads;
> >         pthread_t *uffd_handler_threads = NULL;
> >         struct uffd_handler_args *uffd_args = NULL;
> >         struct timespec start;
> > @@ -302,9 +298,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >                     "Failed to allocate buffer for guest data pattern");
> >         memset(guest_data_prototype, 0xAB, demand_paging_size);
> >
> > -       vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads));
> > -       TEST_ASSERT(vcpu_threads, "Memory allocation failed");
> > -
> >         if (p->uffd_mode) {
> >                 uffd_handler_threads =
> >                         malloc(nr_vcpus * sizeof(*uffd_handler_threads));
> > @@ -346,22 +339,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >         pr_info("Finished creating vCPUs and starting uffd threads\n");
> >
> >         clock_gettime(CLOCK_MONOTONIC, &start);
> > -
> > -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
> > -               pthread_create(&vcpu_threads[vcpu_id], NULL, vcpu_worker,
> > -                              &perf_test_args.vcpu_args[vcpu_id]);
> > -       }
> > -
> > +       perf_test_start_vcpu_threads(nr_vcpus, vcpu_worker);
> >         pr_info("Started all vCPUs\n");
> >
> > -       /* Wait for the vcpu threads to quit */
> > -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
> > -               pthread_join(vcpu_threads[vcpu_id], NULL);
> > -               PER_VCPU_DEBUG("Joined thread for vCPU %d\n", vcpu_id);
> > -       }
> > -
> > +       perf_test_join_vcpu_threads(nr_vcpus);
> >         ts_diff = timespec_elapsed(start);
> > -
> >         pr_info("All vCPU threads joined\n");
> >
> >         if (p->uffd_mode) {
> > @@ -385,7 +367,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >         perf_test_destroy_vm(vm);
> >
> >         free(guest_data_prototype);
> > -       free(vcpu_threads);
> >         if (p->uffd_mode) {
> >                 free(uffd_handler_threads);
> >                 free(uffd_args);
> > diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> > index 583b4d95aa98..1954b964d1cf 100644
> > --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
> > +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> > @@ -31,7 +31,7 @@ static bool host_quit;
> >  static int iteration;
> >  static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
> >
> > -static void *vcpu_worker(void *data)
> > +static void vcpu_worker(struct perf_test_vcpu_args *vcpu_args)
> >  {
> >         int ret;
> >         struct kvm_vm *vm = perf_test_args.vm;
> > @@ -41,7 +41,6 @@ static void *vcpu_worker(void *data)
> >         struct timespec ts_diff;
> >         struct timespec total = (struct timespec){0};
> >         struct timespec avg;
> > -       struct perf_test_vcpu_args *vcpu_args = (struct perf_test_vcpu_args *)data;
> >         int vcpu_id = vcpu_args->vcpu_id;
> >
> >         run = vcpu_state(vm, vcpu_id);
> > @@ -83,8 +82,6 @@ static void *vcpu_worker(void *data)
> >         pr_debug("\nvCPU %d dirtied 0x%lx pages over %d iterations in %ld.%.9lds. (Avg %ld.%.9lds/iteration)\n",
> >                 vcpu_id, pages_count, vcpu_last_completed_iteration[vcpu_id],
> >                 total.tv_sec, total.tv_nsec, avg.tv_sec, avg.tv_nsec);
> > -
> > -       return NULL;
> >  }
> >
> >  struct test_params {
> > @@ -170,7 +167,6 @@ static void free_bitmaps(unsigned long *bitmaps[], int slots)
> >  static void run_test(enum vm_guest_mode mode, void *arg)
> >  {
> >         struct test_params *p = arg;
> > -       pthread_t *vcpu_threads;
> >         struct kvm_vm *vm;
> >         unsigned long **bitmaps;
> >         uint64_t guest_num_pages;
> > @@ -204,20 +200,15 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >                 vm_enable_cap(vm, &cap);
> >         }
> >
> > -       vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads));
> > -       TEST_ASSERT(vcpu_threads, "Memory allocation failed");
> > -
> >         /* Start the iterations */
> >         iteration = 0;
> >         host_quit = false;
> >
> >         clock_gettime(CLOCK_MONOTONIC, &start);
> > -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
> > +       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
> >                 vcpu_last_completed_iteration[vcpu_id] = -1;
>
> Did you miss this in the previous commit or mean to leave it here?

Ah I missed cleaning this one up. I'll get rid of this in v2. Thanks!

>
> >
> > -               pthread_create(&vcpu_threads[vcpu_id], NULL, vcpu_worker,
> > -                              &perf_test_args.vcpu_args[vcpu_id]);
> > -       }
> > +       perf_test_start_vcpu_threads(nr_vcpus, vcpu_worker);
> >
> >         /* Allow the vCPUs to populate memory */
> >         pr_debug("Starting iteration %d - Populating\n", iteration);
> > @@ -286,8 +277,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >
> >         /* Tell the vcpu thread to quit */
> >         host_quit = true;
> > -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
> > -               pthread_join(vcpu_threads[vcpu_id], NULL);
> > +       perf_test_join_vcpu_threads(nr_vcpus);
> >
> >         avg = timespec_div(get_dirty_log_total, p->iterations);
> >         pr_info("Get dirty log over %lu iterations took %ld.%.9lds. (Avg %ld.%.9lds/iteration)\n",
> > @@ -302,7 +292,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >         }
> >
> >         free_bitmaps(bitmaps, p->slots);
> > -       free(vcpu_threads);
> >         perf_test_destroy_vm(vm);
> >  }
> >
> > diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
> > index 74e3622b3a6e..a86f953d8d36 100644
> > --- a/tools/testing/selftests/kvm/include/perf_test_util.h
> > +++ b/tools/testing/selftests/kvm/include/perf_test_util.h
> > @@ -8,6 +8,8 @@
> >  #ifndef SELFTEST_KVM_PERF_TEST_UTIL_H
> >  #define SELFTEST_KVM_PERF_TEST_UTIL_H
> >
> > +#include <pthread.h>
> > +
> >  #include "kvm_util.h"
> >
> >  /* Default guest test virtual memory offset */
> > @@ -45,4 +47,7 @@ void perf_test_destroy_vm(struct kvm_vm *vm);
> >
> >  void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract);
> >
> > +void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *));
> > +void perf_test_join_vcpu_threads(int vcpus);
> > +
> >  #endif /* SELFTEST_KVM_PERF_TEST_UTIL_H */
> > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> > index 77f9eb5667c9..d646477ed16a 100644
> > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> > @@ -16,6 +16,20 @@ struct perf_test_args perf_test_args;
> >   */
> >  static uint64_t guest_test_virt_mem = DEFAULT_GUEST_TEST_MEM;
> >
> > +struct vcpu_thread {
> > +       /* The id of the vCPU. */
> > +       int vcpu_id;
> > +
> > +       /* The pthread backing the vCPU. */
> > +       pthread_t thread;
> > +};
> > +
> > +/* The vCPU threads involved in this test. */
> > +static struct vcpu_thread vcpu_threads[KVM_MAX_VCPUS];
> > +
> > +/* The function run by each vCPU thread, as provided by the test. */
> > +static void (*vcpu_thread_fn)(struct perf_test_vcpu_args *);
> > +
> >  /*
> >   * Continuously write to the first 8 bytes of each page in the
> >   * specified region.
> > @@ -177,3 +191,35 @@ void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract)
> >         perf_test_args.wr_fract = wr_fract;
> >         sync_global_to_guest(vm, perf_test_args);
> >  }
> > +
> > +static void *vcpu_thread_main(void *data)
> > +{
> > +       struct vcpu_thread *vcpu = data;
> > +
> > +       vcpu_thread_fn(&perf_test_args.vcpu_args[vcpu->vcpu_id]);
> > +
> > +       return NULL;
> > +}
> > +
> > +void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *))
> > +{
> > +       int vcpu_id;
> > +
> > +       vcpu_thread_fn = vcpu_fn;
> > +
> > +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
> > +               struct vcpu_thread *vcpu = &vcpu_threads[vcpu_id];
> > +
> > +               vcpu->vcpu_id = vcpu_id;
> > +
> > +               pthread_create(&vcpu->thread, NULL, vcpu_thread_main, vcpu);
> > +       }
> > +}
> > +
> > +void perf_test_join_vcpu_threads(int vcpus)
> > +{
> > +       int vcpu_id;
> > +
> > +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
> > +               pthread_join(vcpu_threads[vcpu_id].thread, NULL);
> > +}
> > diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
> > index df431d0da1ee..5bd0b076f57f 100644
> > --- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
> > +++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
> > @@ -36,11 +36,9 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
> >
> >  static bool run_vcpus = true;
> >
> > -static void *vcpu_worker(void *data)
> > +static void vcpu_worker(struct perf_test_vcpu_args *vcpu_args)
> >  {
> >         int ret;
> > -       struct perf_test_vcpu_args *vcpu_args =
> > -               (struct perf_test_vcpu_args *)data;
> >         int vcpu_id = vcpu_args->vcpu_id;
> >         struct kvm_vm *vm = perf_test_args.vm;
> >         struct kvm_run *run;
> > @@ -59,8 +57,6 @@ static void *vcpu_worker(void *data)
> >                             "Invalid guest sync status: exit_reason=%s\n",
> >                             exit_reason_str(run->exit_reason));
> >         }
> > -
> > -       return NULL;
> >  }
> >
> >  struct memslot_antagonist_args {
> > @@ -100,22 +96,15 @@ struct test_params {
> >  static void run_test(enum vm_guest_mode mode, void *arg)
> >  {
> >         struct test_params *p = arg;
> > -       pthread_t *vcpu_threads;
> >         struct kvm_vm *vm;
> > -       int vcpu_id;
> >
> >         vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
> >                                  VM_MEM_SRC_ANONYMOUS,
> >                                  p->partition_vcpu_memory_access);
> >
> > -       vcpu_threads = malloc(nr_vcpus * sizeof(*vcpu_threads));
> > -       TEST_ASSERT(vcpu_threads, "Memory allocation failed");
> > -
> >         pr_info("Finished creating vCPUs\n");
> >
> > -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
> > -               pthread_create(&vcpu_threads[vcpu_id], NULL, vcpu_worker,
> > -                              &perf_test_args.vcpu_args[vcpu_id]);
> > +       perf_test_start_vcpu_threads(nr_vcpus, vcpu_worker);
> >
> >         pr_info("Started all vCPUs\n");
> >
> > @@ -124,16 +113,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >
> >         run_vcpus = false;
> >
> > -       /* Wait for the vcpu threads to quit */
> > -       for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
> > -               pthread_join(vcpu_threads[vcpu_id], NULL);
> > -
> > +       perf_test_join_vcpu_threads(nr_vcpus);
> >         pr_info("All vCPU threads joined\n");
> >
> >         ucall_uninit(vm);
> >         kvm_vm_free(vm);
> > -
> > -       free(vcpu_threads);
> >  }
> >
> >  static void help(char *name)
> > --
> > 2.34.0.rc1.387.gb447b232ab-goog
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] KVM: selftests: Wait for all vCPU to be created before entering guest mode
  2021-11-11 18:17   ` Ben Gardon
@ 2021-11-12  0:51     ` David Matlack
  2021-11-12  1:46       ` David Matlack
  0 siblings, 1 reply; 14+ messages in thread
From: David Matlack @ 2021-11-12  0:51 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Andrew Jones,
	Jim Mattson, Yanan Wang, Peter Xu, Aaron Lewis

On Thu, Nov 11, 2021 at 10:17 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Nov 10, 2021 at 4:13 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Thread creation requires taking the mmap_sem in write mode, which causes
> > vCPU threads running in guest mode to block while they are populating
> > memory. Fix this by waiting for all vCPU threads to be created and start
> > running before entering guest mode on any one vCPU thread.
> >
> > This substantially improves the "Populate memory time" when using 1GiB
> > pages since it allows all vCPUs to zero pages in parallel rather than
> > blocking because a writer is waiting (which is waiting for another vCPU
> > that is busy zeroing a 1GiB page).
> >
> > Before:
> >
> >   $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
> >   ...
> >   Populate memory time: 52.811184013s
> >
> > After:
> >
> >   $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
> >   ...
> >   Populate memory time: 10.204573342s
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  .../selftests/kvm/lib/perf_test_util.c        | 26 +++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> > index d646477ed16a..722df3a28791 100644
> > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> > @@ -22,6 +22,9 @@ struct vcpu_thread {
> >
> >         /* The pthread backing the vCPU. */
> >         pthread_t thread;
> > +
> > +       /* Set to true once the vCPU thread is up and running. */
> > +       bool running;
> >  };
> >
> >  /* The vCPU threads involved in this test. */
> > @@ -30,6 +33,9 @@ static struct vcpu_thread vcpu_threads[KVM_MAX_VCPUS];
> >  /* The function run by each vCPU thread, as provided by the test. */
> >  static void (*vcpu_thread_fn)(struct perf_test_vcpu_args *);
> >
> > +/* Set to true once all vCPU threads are up and running. */
> > +static bool all_vcpu_threads_running;
> > +
> >  /*
> >   * Continuously write to the first 8 bytes of each page in the
> >   * specified region.
> > @@ -196,6 +202,17 @@ static void *vcpu_thread_main(void *data)
> >  {
> >         struct vcpu_thread *vcpu = data;
> >
> > +       WRITE_ONCE(vcpu->running, true);
> > +
> > +       /*
> > +        * Wait for all vCPU threads to be up and running before calling the test-
> > +        * provided vCPU thread function. This prevents thread creation (which
> > +        * requires taking the mmap_sem in write mode) from interfering with the
> > +        * guest faulting in its memory.
> > +        */
> > +       while (!READ_ONCE(all_vcpu_threads_running))
> > +               ;
> > +
>
> I can never remember the rules on this so I could be wrong, but you
> may want a cpu_relax() in that loop to prevent it from being optimized
> out. Maybe the READ_ONCE is sufficient though.

READ_ONCE is sufficient to prevent the loop from being optimized out
but cpu_relax() is nice to have to play nice with our hyperthread
buddy.

On that note there are a lot of spin waits in the KVM selftests and
none of the ones I've seen use cpu_relax().

I'll take a look at adding cpu_relax() throughout the selftests in v2.
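
Concretely, for the wait loops added in this series that would be
something like (assuming a cpu_relax() helper is available to the
selftests):

	while (!READ_ONCE(all_vcpu_threads_running))
		cpu_relax();

and likewise for the per-vCPU wait loop in
perf_test_start_vcpu_threads().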

>
> >         vcpu_thread_fn(&perf_test_args.vcpu_args[vcpu->vcpu_id]);
> >
> >         return NULL;
> > @@ -206,14 +223,23 @@ void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vc
> >         int vcpu_id;
> >
> >         vcpu_thread_fn = vcpu_fn;
> > +       WRITE_ONCE(all_vcpu_threads_running, false);
> >
> >         for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
> >                 struct vcpu_thread *vcpu = &vcpu_threads[vcpu_id];
> >
> >                 vcpu->vcpu_id = vcpu_id;
> > +               WRITE_ONCE(vcpu->running, false);
>
> Do these need to be WRITE_ONCE? I don't think WRITE_ONCE provides any
> extra memory ordering guarantees and I don't know why the compiler
> would optimize these out. If they do need to be WRITE_ONCE, they
> probably merit comments.

To be completely honest I'm not sure. I included WRITE_ONCE out of
caution to ensure the compiler does not reorder the writes with
respect to the READ_ONCE. I'll need to do a bit more research to
confirm if it's really necessary.

>
> >
> >                 pthread_create(&vcpu->thread, NULL, vcpu_thread_main, vcpu);
> >         }
> > +
> > +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
> > +               while (!READ_ONCE(vcpu_threads[vcpu_id].running))
> > +                       ;
> > +       }
> > +
> > +       WRITE_ONCE(all_vcpu_threads_running, true);
> >  }
> >
> >  void perf_test_join_vcpu_threads(int vcpus)
> > --
> > 2.34.0.rc1.387.gb447b232ab-goog
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] KVM: selftests: Wait for all vCPU to be created before entering guest mode
  2021-11-12  0:51     ` David Matlack
@ 2021-11-12  1:46       ` David Matlack
  2021-11-12 16:32         ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: David Matlack @ 2021-11-12  1:46 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Andrew Jones,
	Jim Mattson, Yanan Wang, Peter Xu, Aaron Lewis

On Thu, Nov 11, 2021 at 4:51 PM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, Nov 11, 2021 at 10:17 AM Ben Gardon <bgardon@google.com> wrote:
> >
> > On Wed, Nov 10, 2021 at 4:13 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > Thread creation requires taking the mmap_sem in write mode, which causes
> > > vCPU threads running in guest mode to block while they are populating
> > > memory. Fix this by waiting for all vCPU threads to be created and start
> > > running before entering guest mode on any one vCPU thread.
> > >
> > > This substantially improves the "Populate memory time" when using 1GiB
> > > pages since it allows all vCPUs to zero pages in parallel rather than
> > > blocking because a writer is waiting (which is waiting for another vCPU
> > > that is busy zeroing a 1GiB page).
> > >
> > > Before:
> > >
> > >   $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
> > >   ...
> > >   Populate memory time: 52.811184013s
> > >
> > > After:
> > >
> > >   $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
> > >   ...
> > >   Populate memory time: 10.204573342s
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  .../selftests/kvm/lib/perf_test_util.c        | 26 +++++++++++++++++++
> > >  1 file changed, 26 insertions(+)
> > >
> > > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> > > index d646477ed16a..722df3a28791 100644
> > > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> > > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> > > @@ -22,6 +22,9 @@ struct vcpu_thread {
> > >
> > >         /* The pthread backing the vCPU. */
> > >         pthread_t thread;
> > > +
> > > +       /* Set to true once the vCPU thread is up and running. */
> > > +       bool running;
> > >  };
> > >
> > >  /* The vCPU threads involved in this test. */
> > > @@ -30,6 +33,9 @@ static struct vcpu_thread vcpu_threads[KVM_MAX_VCPUS];
> > >  /* The function run by each vCPU thread, as provided by the test. */
> > >  static void (*vcpu_thread_fn)(struct perf_test_vcpu_args *);
> > >
> > > +/* Set to true once all vCPU threads are up and running. */
> > > +static bool all_vcpu_threads_running;
> > > +
> > >  /*
> > >   * Continuously write to the first 8 bytes of each page in the
> > >   * specified region.
> > > @@ -196,6 +202,17 @@ static void *vcpu_thread_main(void *data)
> > >  {
> > >         struct vcpu_thread *vcpu = data;
> > >
> > > +       WRITE_ONCE(vcpu->running, true);
> > > +
> > > +       /*
> > > +        * Wait for all vCPU threads to be up and running before calling the test-
> > > +        * provided vCPU thread function. This prevents thread creation (which
> > > +        * requires taking the mmap_sem in write mode) from interfering with the
> > > +        * guest faulting in its memory.
> > > +        */
> > > +       while (!READ_ONCE(all_vcpu_threads_running))
> > > +               ;
> > > +
> >
> > I can never remember the rules on this so I could be wrong, but you
> > may want a cpu_relax() in that loop to prevent it from being optimized
> > out. Maybe the READ_ONCE is sufficient though.
>
> READ_ONCE is sufficient to prevent the loop from being optimized out
> but cpu_relax() is nice to have to play nice with our hyperthread
> buddy.
>
> On that note there are a lot of spin waits in the KVM selftests and
> none of the ones I've seen use cpu_relax().
>
> I'll take a look at adding cpu_relax() throughout the selftests in v2.
>
> >
> > >         vcpu_thread_fn(&perf_test_args.vcpu_args[vcpu->vcpu_id]);
> > >
> > >         return NULL;
> > > @@ -206,14 +223,23 @@ void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vc
> > >         int vcpu_id;
> > >
> > >         vcpu_thread_fn = vcpu_fn;
> > > +       WRITE_ONCE(all_vcpu_threads_running, false);
> > >
> > >         for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
> > >                 struct vcpu_thread *vcpu = &vcpu_threads[vcpu_id];
> > >
> > >                 vcpu->vcpu_id = vcpu_id;
> > > +               WRITE_ONCE(vcpu->running, false);
> >
> > Do these need to be WRITE_ONCE? I don't think WRITE_ONCE provides any
> > extra memory ordering guarantees and I don't know why the compiler
> > would optimize these out. If they do need to be WRITE_ONCE, they
> > probably merit comments.
>
> To be completely honest I'm not sure. I included WRITE_ONCE out of
> caution to ensure the compiler does not reorder the writes with
> respect to the READ_ONCE. I'll need to do a bit more research to
> confirm if it's really necessary.

FWIW removing WRITE_ONCE and bumping the optimization level up to O3
did not cause any problems. But this is no proof of course.

This quote from memory-barriers.txt makes me think it'd be prudent to
keep the WRITE_ONCE:

     You should assume that the compiler can move READ_ONCE() and
     WRITE_ONCE() past code not containing READ_ONCE(), WRITE_ONCE(),
     barrier(), or similar primitives.

So, for example, the compiler could potentially re-order the READ_ONCE
loop below to after the write to all_vcpu_threads_running if we did not
include WRITE_ONCE?
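
I.e. the transformation I'm worried about (purely hypothetical, I have
not seen a compiler actually do this) would be hoisting a plain store
above the READ_ONCE() loop, e.g. turning

	for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
		while (!READ_ONCE(vcpu_threads[vcpu_id].running))
			;
	}
	all_vcpu_threads_running = true;

into

	all_vcpu_threads_running = true;
	for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
		while (!READ_ONCE(vcpu_threads[vcpu_id].running))
			;
	}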

>
> >
> > >
> > >                 pthread_create(&vcpu->thread, NULL, vcpu_thread_main, vcpu);
> > >         }
> > > +
> > > +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
> > > +               while (!READ_ONCE(vcpu_threads[vcpu_id].running))
> > > +                       ;
> > > +       }
> > > +
> > > +       WRITE_ONCE(all_vcpu_threads_running, true);
> > >  }
> > >
> > >  void perf_test_join_vcpu_threads(int vcpus)
> > > --
> > > 2.34.0.rc1.387.gb447b232ab-goog
> > >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] KVM: selftests: Wait for all vCPU to be created before entering guest mode
  2021-11-12  1:46       ` David Matlack
@ 2021-11-12 16:32         ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2021-11-12 16:32 UTC (permalink / raw)
  To: David Matlack
  Cc: Ben Gardon, Paolo Bonzini, kvm, Andrew Jones, Jim Mattson,
	Yanan Wang, Peter Xu, Aaron Lewis

On Thu, Nov 11, 2021, David Matlack wrote:
> On Thu, Nov 11, 2021 at 4:51 PM David Matlack <dmatlack@google.com> wrote:
> > > > @@ -196,6 +202,17 @@ static void *vcpu_thread_main(void *data)
> > > >  {
> > > >         struct vcpu_thread *vcpu = data;
> > > >
> > > > +       WRITE_ONCE(vcpu->running, true);
> > > > +
> > > > +       /*
> > > > +        * Wait for all vCPU threads to be up and running before calling the test-
> > > > +        * provided vCPU thread function. This prevents thread creation (which
> > > > +        * requires taking the mmap_sem in write mode) from interfering with the
> > > > +        * guest faulting in its memory.
> > > > +        */
> > > > +       while (!READ_ONCE(all_vcpu_threads_running))

This is way more convoluted than it needs to be:

	atomic_inc(&perf_test_args.nr_running_vcpus);

	while (atomic_read(&perf_test_args.nr_running_vcpus) < perf_test_args.nr_vcpus)
		cpu_relax();

diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
index df9f1a3a3ffb..ce9039ec8c18 100644
--- a/tools/testing/selftests/kvm/include/perf_test_util.h
+++ b/tools/testing/selftests/kvm/include/perf_test_util.h
@@ -30,6 +30,8 @@ struct perf_test_args {
        uint64_t host_page_size;
        uint64_t guest_page_size;
        int wr_fract;
+       int nr_vcpus;
+       atomic_t nr_running_vcpus;
 
        struct perf_test_vcpu_args vcpu_args[KVM_MAX_VCPUS];
 };


Alternatively it could be a countdown to zero so that only the atomic_t is needed.
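
E.g. a rough sketch of the countdown variant (field name is made up,
and this assumes atomic_set()/atomic_dec() helpers are usable from the
selftests):

	/* perf_test_start_vcpu_threads(), before spawning the threads: */
	atomic_set(&perf_test_args.nr_starting_vcpus, vcpus);

	/* vcpu_thread_main(): */
	atomic_dec(&perf_test_args.nr_starting_vcpus);

	while (atomic_read(&perf_test_args.nr_starting_vcpus))
		cpu_relax();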


> > > I can never remember the rules on this so I could be wrong, but you
> > > may want a cpu_relax() in that loop to prevent it from being optimized
> > > out. Maybe the READ_ONCE is sufficient though.
> >
> > READ_ONCE is sufficient to prevent the loop from being optimized out
> > but cpu_relax() is nice to have to play nice with our hyperthread
> > buddy.
> >
> > On that note there are a lot of spin waits in the KVM selftests and
> > none of the ones I've seen use cpu_relax().
> >
> > I'll take a look at adding cpu_relax() throughout the selftests in v2.
> >
> > >
> > > >         vcpu_thread_fn(&perf_test_args.vcpu_args[vcpu->vcpu_id]);
> > > >
> > > >         return NULL;
> > > > @@ -206,14 +223,23 @@ void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vc
> > > >         int vcpu_id;
> > > >
> > > >         vcpu_thread_fn = vcpu_fn;
> > > > +       WRITE_ONCE(all_vcpu_threads_running, false);
> > > >
> > > >         for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++) {
> > > >                 struct vcpu_thread *vcpu = &vcpu_threads[vcpu_id];
> > > >
> > > >                 vcpu->vcpu_id = vcpu_id;
> > > > +               WRITE_ONCE(vcpu->running, false);
> > >
> > > Do these need to be WRITE_ONCE? I don't think WRITE_ONCE provides any
> > > extra memory ordering guarantees and I don't know why the compiler
> > > would optimize these out. If they do need to be WRITE_ONCE, they
> > > probably merit comments.
> >
> > To be completely honest I'm not sure. I included WRITE_ONCE out of
> > caution to ensure the compiler does not reorder the writes with
> > respect to the READ_ONCE. I'll need to do a bit more research to
> > confirm if it's really necessary.
> 
> FWIW removing WRITE_ONCE and bumping the optimization level up to O3
> did not cause any problems. But this is no proof of course.
> 
> This quote from memory-barriers.txt makes me think it'd be prudent to
> keep the WRITE_ONCE:
> 
>      You should assume that the compiler can move READ_ONCE() and
>      WRITE_ONCE() past code not containing READ_ONCE(), WRITE_ONCE(),
>      barrier(), or similar primitives.
> 
> So, for example, the compiler could potentially re-order the READ_ONCE
> loop below to after the write to all_vcpu_threads_running if we did not
> include WRITE_ONCE?

Practically speaking, no.  pthread_create() undoubtedly has barriers galore, so
unless @vcpus==0, all_vcpu_threads_running won't get re-ordered.

As noted above, READ_ONCE/WRITE_ONCE do provide _some_ guarantees about memory
ordering with respect to the _compiler_.  Notably, the compiler is allowed to
reorder non-ONCE loads/stores around {READ,WRITE}_ONCE.  And emphasis on "compiler".
On x86, that's effectively the same as memory ordering in hardware, because ignoring
WC memory, x86 is strongly ordered.  But arm64 and others are weakly ordered, in
which case {READ,WRITE}_ONCE do not provide any guarantees about how loads/stores
will complete when run on the CPU.

This is a bad example because there's not really a race to be had, e.g. aside from
pthread_create(), there's also the fact that all_vcpu_threads_running=false is a
likely nop since it's zero initialized.  

Here's a contrived example:

static void *vcpu_thread_main(void *data)
{
	while (!READ_ONCE(test_stage))
		cpu_relax();

	READ_ONCE(test_fn)();
}


int main(...)
{
	for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
		pthread_create(&vcpu->thread, NULL, vcpu_thread_main, vcpu);

	test_fn = do_work;

	WRITE_ONCE(test_stage, 1);
}

On any architecture, the thread could observe a NULL test_fn because the compiler
is allowed to reorder the store to test_fn around the store to test_stage.  Making it

	WRITE_ONCE(test_fn, do_work);
	WRITE_ONCE(test_stage, 1);

would solve the problem on x86 as it would effectively force the compiler to emit:

	test_stage = 0
	barrier() // because of pthread_create()
	test_fn    = do_work
	test_stage = 1

but on arm64 and others, hardware can complete the second store to test_stage
_before_ the store to test_fn, and so the threads could observe a NULL test_fn
even though the compiler was forced to generate a specific order of operations.

To ensure something like the above works on weakly-ordered architectures, the code
should instead be something like:

static void *vcpu_thread_main(void *data)
{
        while (!READ_ONCE(test_stage))
                cpu_relax();

	smp_rmb();
        test_fn();
}

int main(...)
{
        for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
                pthread_create(&vcpu->thread, NULL, vcpu_thread_main, vcpu);

        test_fn = do_work;
	smp_wmb();
        test_stage = 1;
}

Final note, the READ_ONCE() in the while loop is still necessary because without
that the compiler would be allowed to assume that test_stage can't be changed
in that loop, e.g. on x86 it could generate the following and hang the task:

	mov [test_stage], %rax
1:	test %rax, %rax
	jnz 2f
	pause
	jmp 1b
2:
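
With READ_ONCE() the load of test_stage has to be re-issued on every
iteration, i.e. the loop becomes roughly:

1:	mov [test_stage], %rax
	test %rax, %rax
	jnz 2f
	pause
	jmp 1b
2: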

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population
  2021-11-11  0:12 [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population David Matlack
                   ` (3 preceding siblings ...)
  2021-11-11  0:12 ` [PATCH 4/4] KVM: selftests: Use perf_test_destroy_vm in memslot_modification_stress_test David Matlack
@ 2021-11-16 11:13 ` Paolo Bonzini
  2021-11-17  0:21   ` David Matlack
  4 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2021-11-16 11:13 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Sean Christopherson, Ben Gardon, Andrew Jones, Jim Mattson,
	Yanan Wang, Peter Xu, Aaron Lewis

On 11/11/21 01:12, David Matlack wrote:
> This series fixes a performance issue in the KVM selftests, specifically
> those that use perf_test_util. These tests create vCPU threads which
> immediately enter guest mode and start faulting in memory. Creating
> vCPU threads while faulting in memory is a recipe for generating a lot
> of contention on the mmap_sem, as thread creation requires acquiring the
> mmap_sem in write mode.
> 
> This series fixes this issue by ensuring that all vCPUs threads are
> created before entering guest mode. As part of fixing this issue I
> consolidated the code to create and join vCPU threads across all users
> of perf_test_util.
> 
> The last commit is an unrelated perf_test_util cleanup.
> 
> Note: This series applies on top of
> https://lore.kernel.org/kvm/20211111000310.1435032-1-dmatlack@google.com/,
> although the dependency on the series is just cosmetic.
> 
> David Matlack (4):
>    KVM: selftests: Start at iteration 0 instead of -1
>    KVM: selftests: Move vCPU thread creation and joining to common
>      helpers
>    KVM: selftests: Wait for all vCPU to be created before entering guest
>      mode
>    KVM: selftests: Use perf_test_destroy_vm in
>      memslot_modification_stress_test
> 
>   .../selftests/kvm/access_tracking_perf_test.c | 46 +++---------
>   .../selftests/kvm/demand_paging_test.c        | 25 +------
>   .../selftests/kvm/dirty_log_perf_test.c       | 19 ++---
>   .../selftests/kvm/include/perf_test_util.h    |  5 ++
>   .../selftests/kvm/lib/perf_test_util.c        | 72 +++++++++++++++++++
>   .../kvm/memslot_modification_stress_test.c    | 25 ++-----
>   6 files changed, 96 insertions(+), 96 deletions(-)
> 

Queued, thanks.

Paolo


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population
  2021-11-16 11:13 ` [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population Paolo Bonzini
@ 2021-11-17  0:21   ` David Matlack
  0 siblings, 0 replies; 14+ messages in thread
From: David Matlack @ 2021-11-17  0:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Sean Christopherson, Ben Gardon, Andrew Jones, Jim Mattson,
	Yanan Wang, Peter Xu, Aaron Lewis

On Tue, Nov 16, 2021 at 3:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Queued, thanks.

I was planning to send a v2 to resolve the feedback on [PATCH 3/4]
from Sean and Ben. How should I proceed? The code should be functional
as is, so I can send a follow-on, or you can make the changes Sean
suggested (it's pretty straightforward).

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread

Thread overview: 14+ messages
2021-11-11  0:12 [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population David Matlack
2021-11-11  0:12 ` [PATCH 1/4] KVM: selftests: Start at iteration 0 instead of -1 David Matlack
2021-11-11  0:12 ` [PATCH 2/4] KVM: selftests: Move vCPU thread creation and joining to common helpers David Matlack
2021-11-11 18:08   ` Ben Gardon
2021-11-12  0:47     ` David Matlack
2021-11-11  0:12 ` [PATCH 3/4] KVM: selftests: Wait for all vCPU to be created before entering guest mode David Matlack
2021-11-11 18:17   ` Ben Gardon
2021-11-12  0:51     ` David Matlack
2021-11-12  1:46       ` David Matlack
2021-11-12 16:32         ` Sean Christopherson
2021-11-11  0:12 ` [PATCH 4/4] KVM: selftests: Use perf_test_destroy_vm in memslot_modification_stress_test David Matlack
2021-11-11 18:18   ` Ben Gardon
2021-11-16 11:13 ` [PATCH 0/4] KVM: selftests: Avoid mmap_sem contention during memory population Paolo Bonzini
2021-11-17  0:21   ` David Matlack
