kvm.vger.kernel.org archive mirror
* [PATCH v2 0/2] Calculate memory access latency stats
@ 2023-03-16 22:27 Colton Lewis
  2023-03-16 22:27 ` [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter Colton Lewis
  2023-03-16 22:27 ` [PATCH v2 2/2] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis
  0 siblings, 2 replies; 17+ messages in thread
From: Colton Lewis @ 2023-03-16 22:27 UTC (permalink / raw)
  To: Paolo Bonzini, Shuah Khan, Colton Lewis, Sean Christopherson,
	David Matlack, Vipin Sharma, Andrew Jones, Marc Zyngier,
	Ben Gardon, Ricardo Koller, Oliver Upton
  Cc: kvm

Sample the latency of memory accesses in dirty_log_perf_test and
report summary stats to give a picture of the latency
distribution. Specifically, focus on the right tail, with the 50th,
90th, and 99th percentiles reported in ns.

v2:
* rebase to v6.3-rc2
* move counter measurement to common code helpers
* fix page allocation math for only the exact number needed
* add command line argument to control number of samples

v1: https://lore.kernel.org/kvm/20221115173258.2530923-1-coltonlewis@google.com/

Colton Lewis (2):
  KVM: selftests: Provide generic way to read system counter
  KVM: selftests: Print summary stats of memory latency distribution

 .../selftests/kvm/access_tracking_perf_test.c |  3 +-
 .../selftests/kvm/demand_paging_test.c        |  2 +-
 .../selftests/kvm/dirty_log_perf_test.c       | 10 ++-
 .../testing/selftests/kvm/include/kvm_util.h  | 15 ++++
 .../testing/selftests/kvm/include/memstress.h | 10 ++-
 tools/testing/selftests/kvm/lib/kvm_util.c    | 30 ++++++++
 tools/testing/selftests/kvm/lib/memstress.c   | 68 ++++++++++++++++---
 .../kvm/memslot_modification_stress_test.c    |  2 +-
 .../kvm/system_counter_offset_test.c          | 10 +--
 9 files changed, 128 insertions(+), 22 deletions(-)

--
2.40.0.rc1.284.g88254d51c5-goog


* [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter
  2023-03-16 22:27 [PATCH v2 0/2] Calculate memory access latency stats Colton Lewis
@ 2023-03-16 22:27 ` Colton Lewis
  2023-03-17  9:26   ` Andrew Jones
                     ` (2 more replies)
  2023-03-16 22:27 ` [PATCH v2 2/2] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis
  1 sibling, 3 replies; 17+ messages in thread
From: Colton Lewis @ 2023-03-16 22:27 UTC (permalink / raw)
  To: Paolo Bonzini, Shuah Khan, Colton Lewis, Sean Christopherson,
	David Matlack, Vipin Sharma, Andrew Jones, Marc Zyngier,
	Ben Gardon, Ricardo Koller, Oliver Upton
  Cc: kvm

Provide a generic function to read the system counter from the guest
for timing purposes. A common and important way to measure guest
performance is to measure the amount of time different actions take in
the guest. Provide also a mathematical conversion from cycles to
nanoseconds and a macro for timing individual statements.

Substitute the previous custom implementation of a similar function in
system_counter_offset_test with this new implementation.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 .../testing/selftests/kvm/include/kvm_util.h  | 15 ++++++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 30 +++++++++++++++++++
 .../kvm/system_counter_offset_test.c          | 10 ++-----
 3 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index c9286811a4cb..8b478eabee4c 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -10,4 +10,19 @@
 #include "kvm_util_base.h"
 #include "ucall_common.h"

+#if defined(__aarch64__) || defined(__x86_64__)
+
+uint64_t cycles_read(void);
+double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles);
+
+#define MEASURE_CYCLES(x)			\
+	({					\
+		uint64_t start;			\
+		start = cycles_read();		\
+		x;				\
+		cycles_read() - start;		\
+	})
+
+#endif
+
 #endif /* SELFTEST_KVM_UTIL_H */
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 3ea24a5f4c43..780481a92efe 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2135,3 +2135,34 @@ void __attribute((constructor)) kvm_selftest_init(void)

 	kvm_selftest_arch_init();
 }
+
+#if defined(__aarch64__)
+
+#include "arch_timer.h"
+
+uint64_t cycles_read(void)
+{
+	return timer_get_cntct(VIRTUAL);
+}
+
+double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles)
+{
+	return cycles * (1e9 / timer_get_cntfrq());
+}
+
+#elif defined(__x86_64__)
+
+#include "processor.h"
+
+uint64_t cycles_read(void)
+{
+	return rdtsc();
+}
+
+double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles)
+{
+	uint64_t tsc_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
+
+	return cycles * (1e9 / (tsc_khz * 1000));
+}
+#endif
diff --git a/tools/testing/selftests/kvm/system_counter_offset_test.c b/tools/testing/selftests/kvm/system_counter_offset_test.c
index 7f5b330b6a1b..44101d0fcb48 100644
--- a/tools/testing/selftests/kvm/system_counter_offset_test.c
+++ b/tools/testing/selftests/kvm/system_counter_offset_test.c
@@ -39,14 +39,9 @@ static void setup_system_counter(struct kvm_vcpu *vcpu, struct test_case *test)
 			     &test->tsc_offset);
 }

-static uint64_t guest_read_system_counter(struct test_case *test)
-{
-	return rdtsc();
-}
-
 static uint64_t host_read_guest_system_counter(struct test_case *test)
 {
-	return rdtsc() + test->tsc_offset;
+	return cycles_read() + test->tsc_offset;
 }

 #else /* __x86_64__ */
@@ -63,9 +58,8 @@ static void guest_main(void)
 	int i;

 	for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
-		struct test_case *test = &test_cases[i];

-		GUEST_SYNC_CLOCK(i, guest_read_system_counter(test));
+		GUEST_SYNC_CLOCK(i, cycles_read());
 	}
 }

--
2.40.0.rc1.284.g88254d51c5-goog


* [PATCH v2 2/2] KVM: selftests: Print summary stats of memory latency distribution
  2023-03-16 22:27 [PATCH v2 0/2] Calculate memory access latency stats Colton Lewis
  2023-03-16 22:27 ` [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter Colton Lewis
@ 2023-03-16 22:27 ` Colton Lewis
  2023-03-17 18:16   ` Vipin Sharma
  1 sibling, 1 reply; 17+ messages in thread
From: Colton Lewis @ 2023-03-16 22:27 UTC (permalink / raw)
  To: Paolo Bonzini, Shuah Khan, Colton Lewis, Sean Christopherson,
	David Matlack, Vipin Sharma, Andrew Jones, Marc Zyngier,
	Ben Gardon, Ricardo Koller, Oliver Upton
  Cc: kvm

Print summary stats of the memory latency distribution in nanoseconds
for dirty_log_perf_test. For every iteration, this prints the minimum,
the maximum, and the 50th, 90th, and 99th percentiles.

Stats are calculated by sorting the samples taken from all vcpus and
picking the value at the index corresponding to each percentile.

Because keeping all samples is impractical due to the space required
in some cases (pooled memory w/ 64 vcpus would be 64 GB/vcpu * 64
vcpus * 250,000 samples/GB * 8 bytes/sample ~ 8 GB extra memory just
for samples), reservoir sampling [1] is used to only keep a small
number of samples per vcpu (settable via command argument). There is a
tradeoff between stat accuracy and memory usage. More samples means
more accuracy but also more memory use.
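
Sketched in isolation (simplified from the guest loop in the diff below;
prng_u32() here stands in for guest_random_u32()), Algorithm R amounts
to:

static void reservoir_feed(uint64_t *reservoir, uint64_t nr_slots,
			   uint64_t i, uint64_t sample)
{
	uint64_t slot;

	/* Fill phase: always keep the first nr_slots samples. */
	if (i < nr_slots) {
		reservoir[i] = sample;
		return;
	}

	/* Afterwards, keep the i-th (0-based) sample with probability
	 * nr_slots / (i + 1) by overwriting a uniformly chosen slot.
	 */
	slot = prng_u32() % (i + 1);
	if (slot < nr_slots)
		reservoir[slot] = sample;
}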

Because other selftests use memstress.c, those tests were hardcoded to
take 0 samples at this time.

[1] https://en.wikipedia.org/wiki/Reservoir_sampling#Simple:_Algorithm_R

Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 .../selftests/kvm/access_tracking_perf_test.c |  3 +-
 .../selftests/kvm/demand_paging_test.c        |  2 +-
 .../selftests/kvm/dirty_log_perf_test.c       | 10 ++-
 .../testing/selftests/kvm/include/memstress.h | 10 ++-
 tools/testing/selftests/kvm/lib/memstress.c   | 68 ++++++++++++++++---
 .../kvm/memslot_modification_stress_test.c    |  2 +-
 6 files changed, 81 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..89947aa4ff3a 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -306,7 +306,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	int nr_vcpus = params->nr_vcpus;

-	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes,
+				 0, 1,
 				 params->backing_src, !overlap_memory_access);

 	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index b0e1fc4de9e2..bc5c171eca9e 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -135,7 +135,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	int i;

-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 0, 1,
 				 p->src_type, p->partition_vcpu_memory_access);

 	demand_paging_size = get_backing_src_pagesz(p->src_type);
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index e9d6d1aecf89..40058fe2b58f 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -133,6 +133,7 @@ struct test_params {
 	int slots;
 	uint32_t write_percent;
 	uint32_t random_seed;
+	uint64_t samples_per_vcpu;
 	bool random_access;
 };

@@ -224,6 +225,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	int i;

 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
+				 p->samples_per_vcpu,
 				 p->slots, p->backing_src,
 				 p->partition_vcpu_memory_access);

@@ -274,6 +276,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	ts_diff = timespec_elapsed(start);
 	pr_info("Populate memory time: %ld.%.9lds\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
+	memstress_print_percentiles(vm, nr_vcpus);

 	/* Enable dirty logging */
 	clock_gettime(CLOCK_MONOTONIC, &start);
@@ -304,6 +307,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		vcpu_dirty_total = timespec_add(vcpu_dirty_total, ts_diff);
 		pr_info("Iteration %d dirty memory time: %ld.%.9lds\n",
 			iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
+		memstress_print_percentiles(vm, nr_vcpus);

 		clock_gettime(CLOCK_MONOTONIC, &start);
 		get_dirty_log(vm, bitmaps, p->slots);
@@ -428,6 +432,7 @@ int main(int argc, char *argv[])
 		.slots = 1,
 		.random_seed = 1,
 		.write_percent = 100,
+		.samples_per_vcpu = 1000,
 	};
 	int opt;

@@ -438,7 +443,7 @@ int main(int argc, char *argv[])

 	guest_modes_append_default();

-	while ((opt = getopt(argc, argv, "ab:c:eghi:m:nop:r:s:v:x:w:")) != -1) {
+	while ((opt = getopt(argc, argv, "ab:c:eghi:m:nop:r:s:S:v:x:w:")) != -1) {
 		switch (opt) {
 		case 'a':
 			p.random_access = true;
@@ -480,6 +485,9 @@ int main(int argc, char *argv[])
 		case 's':
 			p.backing_src = parse_backing_src_type(optarg);
 			break;
+		case 'S':
+			p.samples_per_vcpu = atoi_positive("Number of samples/vcpu", optarg);
+			break;
 		case 'v':
 			nr_vcpus = atoi_positive("Number of vCPUs", optarg);
 			TEST_ASSERT(nr_vcpus <= max_vcpus,
diff --git a/tools/testing/selftests/kvm/include/memstress.h b/tools/testing/selftests/kvm/include/memstress.h
index 72e3e358ef7b..25a3b5e308e9 100644
--- a/tools/testing/selftests/kvm/include/memstress.h
+++ b/tools/testing/selftests/kvm/include/memstress.h
@@ -37,6 +37,12 @@ struct memstress_args {
 	uint64_t guest_page_size;
 	uint32_t random_seed;
 	uint32_t write_percent;
+	uint64_t samples_per_vcpu;
+	/* Store all samples in a flat array so they can be easily
+	 * sorted later.
+	 */
+	vm_vaddr_t latency_samples;
+

 	/* Run vCPUs in L2 instead of L1, if the architecture supports it. */
 	bool nested;
@@ -56,7 +62,8 @@ struct memstress_args {
 extern struct memstress_args memstress_args;

 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes,
+				   uint64_t samples_per_vcpu, int slots,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access);
 void memstress_destroy_vm(struct kvm_vm *vm);
@@ -72,4 +79,5 @@ void memstress_guest_code(uint32_t vcpu_id);
 uint64_t memstress_nested_pages(int nr_vcpus);
 void memstress_setup_nested(struct kvm_vm *vm, int nr_vcpus, struct kvm_vcpu *vcpus[]);

+void memstress_print_percentiles(struct kvm_vm *vm, int nr_vcpus);
 #endif /* SELFTEST_KVM_MEMSTRESS_H */
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index 5f1d3173c238..c275471ad6d8 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -48,14 +48,17 @@ void memstress_guest_code(uint32_t vcpu_idx)
 {
 	struct memstress_args *args = &memstress_args;
 	struct memstress_vcpu_args *vcpu_args = &args->vcpu_args[vcpu_idx];
-	struct guest_random_state rand_state;
 	uint64_t gva;
 	uint64_t pages;
 	uint64_t addr;
 	uint64_t page;
 	int i;
-
-	rand_state = new_guest_random_state(args->random_seed + vcpu_idx);
+	struct guest_random_state rand_state =
+		new_guest_random_state(args->random_seed + vcpu_idx);
+	uint64_t *vcpu_samples = (uint64_t *)args->latency_samples
+		+ args->samples_per_vcpu * vcpu_idx;
+	uint64_t cycles;
+	uint32_t maybe_sample;

 	gva = vcpu_args->gva;
 	pages = vcpu_args->pages;
@@ -73,15 +76,46 @@ void memstress_guest_code(uint32_t vcpu_idx)
 			addr = gva + (page * args->guest_page_size);

 			if (guest_random_u32(&rand_state) % 100 < args->write_percent)
-				*(uint64_t *)addr = 0x0123456789ABCDEF;
+				cycles = MEASURE_CYCLES(*(uint64_t *)addr = 0x0123456789ABCDEF);
 			else
-				READ_ONCE(*(uint64_t *)addr);
+				cycles = MEASURE_CYCLES(READ_ONCE(*(uint64_t *)addr));
+
+			if (i < args->samples_per_vcpu) {
+				vcpu_samples[i] = cycles;
+				continue;
+			}
+
+			maybe_sample = guest_random_u32(&rand_state) % (i + 1);
+
+			if (maybe_sample < args->samples_per_vcpu)
+				vcpu_samples[maybe_sample] = cycles;
 		}

 		GUEST_SYNC(1);
 	}
 }

+/* compare function for qsort */
+static int memstress_qcmp(const void *a, const void *b)
+{
+	return *(int *)a - *(int *)b;
+}
+
+void memstress_print_percentiles(struct kvm_vm *vm, int nr_vcpus)
+{
+	uint64_t n_samples = nr_vcpus * memstress_args.samples_per_vcpu;
+	uint64_t *host_latency_samples = addr_gva2hva(vm, memstress_args.latency_samples);
+
+	qsort(host_latency_samples, n_samples, sizeof(uint64_t), &memstress_qcmp);
+
+	pr_info("Latency distribution (ns) = min:%6.0lf, 50th:%6.0lf, 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n",
+		cycles_to_ns(vcpus[0], (double)host_latency_samples[0]),
+		cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples / 2]),
+		cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples * 9 / 10]),
+		cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples * 99 / 100]),
+		cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples - 1]));
+}
+
 void memstress_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
 			   struct kvm_vcpu *vcpus[],
 			   uint64_t vcpu_memory_bytes,
@@ -119,21 +153,26 @@ void memstress_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
 }

 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes,
+				   uint64_t samples_per_vcpu, int slots,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access)
 {
 	struct memstress_args *args = &memstress_args;
 	struct kvm_vm *vm;
-	uint64_t guest_num_pages, slot0_pages = 0;
+	uint64_t guest_num_pages, sample_pages, slot0_pages = 0;
 	uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src);
 	uint64_t region_end_gfn;
+	uint64_t sample_size;
+	uint64_t sample_slot;
+	uint64_t sample_start;
 	int i;

 	pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode));

 	/* By default vCPUs will write to memory. */
 	args->write_percent = 100;
+	args->samples_per_vcpu = samples_per_vcpu;

 	/*
 	 * Snapshot the non-huge page size.  This is used by the guest code to
@@ -159,12 +198,17 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 	if (args->nested)
 		slot0_pages += memstress_nested_pages(nr_vcpus);

+	sample_size = samples_per_vcpu * nr_vcpus * sizeof(uint64_t);
+	sample_pages = vm_adjust_num_guest_pages(
+		mode, 1 + sample_size / args->guest_page_size);
+
 	/*
 	 * Pass guest_num_pages to populate the page tables for test memory.
 	 * The memory is also added to memslot 0, but that's a benign side
 	 * effect as KVM allows aliasing HVAs in meslots.
 	 */
-	vm = __vm_create_with_vcpus(mode, nr_vcpus, slot0_pages + guest_num_pages,
+	vm = __vm_create_with_vcpus(mode, nr_vcpus,
+				    slot0_pages + guest_num_pages + sample_pages,
 				    memstress_guest_code, vcpus);

 	args->vm = vm;
@@ -190,7 +234,7 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 		    " nr_vcpus: %d wss: %" PRIx64 "]\n",
 		    guest_num_pages, region_end_gfn - 1, nr_vcpus, vcpu_memory_bytes);

-	args->gpa = (region_end_gfn - guest_num_pages - 1) * args->guest_page_size;
+	args->gpa = (region_end_gfn - guest_num_pages - sample_pages - 1) * args->guest_page_size;
 	args->gpa = align_down(args->gpa, backing_src_pagesz);
 #ifdef __s390x__
 	/* Align to 1M (segment size) */
@@ -213,6 +257,12 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 	/* Do mapping for the demand paging memory slot */
 	virt_map(vm, guest_test_virt_mem, args->gpa, guest_num_pages);

+	memstress_args.latency_samples = guest_test_virt_mem + args->size;
+	sample_start = args->gpa + args->size;
+	sample_slot = MEMSTRESS_MEM_SLOT_INDEX + slots;
+	vm_userspace_mem_region_add(vm, backing_src, sample_start, sample_slot, sample_pages, 0);
+	virt_map(vm, memstress_args.latency_samples, sample_start, sample_pages);
+
 	memstress_setup_vcpus(vm, nr_vcpus, vcpus, vcpu_memory_bytes,
 			      partition_vcpu_memory_access);

diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
index 9855c41ca811..0a14f7d16e0c 100644
--- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
+++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
@@ -95,7 +95,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct test_params *p = arg;
 	struct kvm_vm *vm;

-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 0, 1,
 				 VM_MEM_SRC_ANONYMOUS,
 				 p->partition_vcpu_memory_access);

--
2.40.0.rc1.284.g88254d51c5-goog


* Re: [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter
  2023-03-16 22:27 ` [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter Colton Lewis
@ 2023-03-17  9:26   ` Andrew Jones
  2023-03-21 19:08     ` Colton Lewis
  2023-03-17 17:04   ` Vipin Sharma
  2023-03-17 17:09   ` Marc Zyngier
  2 siblings, 1 reply; 17+ messages in thread
From: Andrew Jones @ 2023-03-17  9:26 UTC (permalink / raw)
  To: Colton Lewis
  Cc: Paolo Bonzini, Shuah Khan, Sean Christopherson, David Matlack,
	Vipin Sharma, Marc Zyngier, Ben Gardon, Ricardo Koller,
	Oliver Upton, kvm

On Thu, Mar 16, 2023 at 10:27:51PM +0000, Colton Lewis wrote:
> Provide a generic function to read the system counter from the guest
> for timing purposes. A common and important way to measure guest
> performance is to measure the amount of time different actions take in
> the guest. Provide also a mathematical conversion from cycles to
> nanoseconds and a macro for timing individual statements.
> 
> Substitute the previous custom implementation of a similar function in
> system_counter_offset_test with this new implementation.
> 
> Signed-off-by: Colton Lewis <coltonlewis@google.com>
> ---
>  .../testing/selftests/kvm/include/kvm_util.h  | 15 ++++++++++
>  tools/testing/selftests/kvm/lib/kvm_util.c    | 30 +++++++++++++++++++
>  .../kvm/system_counter_offset_test.c          | 10 ++-----
>  3 files changed, 47 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index c9286811a4cb..8b478eabee4c 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -10,4 +10,19 @@
>  #include "kvm_util_base.h"
>  #include "ucall_common.h"
> 
> +#if defined(__aarch64__) || defined(__x86_64__)
> +
> +uint64_t cycles_read(void);
> +double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles);
> +
> +#define MEASURE_CYCLES(x)			\
> +	({					\
> +		uint64_t start;			\
> +		start = cycles_read();		\
> +		x;				\
> +		cycles_read() - start;		\
> +	})
> +
> +#endif
> +
>  #endif /* SELFTEST_KVM_UTIL_H */
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 3ea24a5f4c43..780481a92efe 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -2135,3 +2135,34 @@ void __attribute((constructor)) kvm_selftest_init(void)
> 
>  	kvm_selftest_arch_init();
>  }
> +
> +#if defined(__aarch64__)
> +
> +#include "arch_timer.h"
> +
> +uint64_t cycles_read(void)
> +{
> +	return timer_get_cntct(VIRTUAL);
> +}
> +
> +double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles)
> +{
> +	return cycles * (1e9 / timer_get_cntfrq());
> +}
> +
> +#elif defined(__x86_64__)
> +
> +#include "processor.h"
> +
> +uint64_t cycles_read(void)
> +{
> +	return rdtsc();
> +}
> +
> +double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles)
> +{
> +	uint64_t tsc_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
> +
> +	return cycles * (1e9 / (tsc_khz * 1000));
> +}
> +#endif

Instead of the #ifdefs, why not just put these functions in
lib/ARCH/processor.c?

Thanks,
drew


* Re: [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter
  2023-03-16 22:27 ` [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter Colton Lewis
  2023-03-17  9:26   ` Andrew Jones
@ 2023-03-17 17:04   ` Vipin Sharma
  2023-03-21 19:09     ` Colton Lewis
  2023-03-17 17:09   ` Marc Zyngier
  2 siblings, 1 reply; 17+ messages in thread
From: Vipin Sharma @ 2023-03-17 17:04 UTC (permalink / raw)
  To: Colton Lewis
  Cc: Paolo Bonzini, Shuah Khan, Sean Christopherson, David Matlack,
	Andrew Jones, Marc Zyngier, Ben Gardon, Ricardo Koller,
	Oliver Upton, kvm

On Thu, Mar 16, 2023 at 3:29 PM Colton Lewis <coltonlewis@google.com> wrote:
>
> Provide a generic function to read the system counter from the guest
> for timing purposes. A common and important way to measure guest
> performance is to measure the amount of time different actions take in
> the guest. Provide also a mathematical conversion from cycles to
> nanoseconds and a macro for timing individual statements.
>
> Substitute the previous custom implementation of a similar function in

Maybe specify the specific name: guest_read_system_counter()

> system_counter_offset_test with this new implementation.
>
> Signed-off-by: Colton Lewis <coltonlewis@google.com>
> ---
>  .../testing/selftests/kvm/include/kvm_util.h  | 15 ++++++++++
>  tools/testing/selftests/kvm/lib/kvm_util.c    | 30 +++++++++++++++++++
>  .../kvm/system_counter_offset_test.c          | 10 ++-----
>  3 files changed, 47 insertions(+), 8 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index c9286811a4cb..8b478eabee4c 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -10,4 +10,19 @@
>  #include "kvm_util_base.h"
>  #include "ucall_common.h"
>
> +#if defined(__aarch64__) || defined(__x86_64__)
> +
> +uint64_t cycles_read(void);
> +double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles);
> +
> +#define MEASURE_CYCLES(x)                      \
> +       ({                                      \
> +               uint64_t start;                 \
> +               start = cycles_read();          \
> +               x;                              \
> +               cycles_read() - start;          \
> +       })
> +

MEASURE_CYCLES should be moved to the next patch where it is getting
used. Does it have to be macro or can it be replaced with a function?

> +#endif
> +
>  #endif /* SELFTEST_KVM_UTIL_H */
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 3ea24a5f4c43..780481a92efe 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -2135,3 +2135,34 @@ void __attribute((constructor)) kvm_selftest_init(void)
>
>         kvm_selftest_arch_init();
>  }
> +
> +#if defined(__aarch64__)
> +
> +#include "arch_timer.h"
> +
> +uint64_t cycles_read(void)
> +{
> +       return timer_get_cntct(VIRTUAL);
> +}
> +
> +double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles)
> +{
> +       return cycles * (1e9 / timer_get_cntfrq());
> +}
> +
> +#elif defined(__x86_64__)
> +
> +#include "processor.h"
> +
> +uint64_t cycles_read(void)
> +{
> +       return rdtsc();
> +}
> +
> +double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles)
> +{
> +       uint64_t tsc_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
> +
> +       return cycles * (1e9 / (tsc_khz * 1000));
> +}
> +#endif

As Andrew noted,  these should be in the respective processor files.


* Re: [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter
  2023-03-16 22:27 ` [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter Colton Lewis
  2023-03-17  9:26   ` Andrew Jones
  2023-03-17 17:04   ` Vipin Sharma
@ 2023-03-17 17:09   ` Marc Zyngier
  2023-03-21 19:10     ` Colton Lewis
  2 siblings, 1 reply; 17+ messages in thread
From: Marc Zyngier @ 2023-03-17 17:09 UTC (permalink / raw)
  To: Colton Lewis
  Cc: Paolo Bonzini, Shuah Khan, Sean Christopherson, David Matlack,
	Vipin Sharma, Andrew Jones, Ben Gardon, Ricardo Koller,
	Oliver Upton, kvm

On Thu, 16 Mar 2023 22:27:51 +0000,
Colton Lewis <coltonlewis@google.com> wrote:
> 
> Provide a generic function to read the system counter from the guest
> for timing purposes. A common and important way to measure guest
> performance is to measure the amount of time different actions take in
> the guest. Provide also a mathematical conversion from cycles to
> nanoseconds and a macro for timing individual statements.
> 
> Substitute the previous custom implementation of a similar function in
> system_counter_offset_test with this new implementation.
> 
> Signed-off-by: Colton Lewis <coltonlewis@google.com>
> ---
>  .../testing/selftests/kvm/include/kvm_util.h  | 15 ++++++++++
>  tools/testing/selftests/kvm/lib/kvm_util.c    | 30 +++++++++++++++++++
>  .../kvm/system_counter_offset_test.c          | 10 ++-----
>  3 files changed, 47 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index c9286811a4cb..8b478eabee4c 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -10,4 +10,19 @@
>  #include "kvm_util_base.h"
>  #include "ucall_common.h"
> 
> +#if defined(__aarch64__) || defined(__x86_64__)
> +
> +uint64_t cycles_read(void);
> +double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles);
> +
> +#define MEASURE_CYCLES(x)			\
> +	({					\
> +		uint64_t start;			\
> +		start = cycles_read();		\
> +		x;				\

You insert memory accesses inside a sequence that has no dependency
with it. On a weakly ordered memory system, there is absolutely no
reason why the memory access shouldn't be moved around. What do you
exactly measure in that case?

> +		cycles_read() - start;		\

I also question the usefulness of this exercise. You're comparing the
time it takes for a multi-GHz system to put a write in a store buffer
(assuming it didn't miss in the TLBs) vs a counter that gets updated
at a frequency of a few tens of MHz.

My guts feeling is that this results in a big fat zero most of the
time, but I'm happy to be explained otherwise.

> +	})
> +
> +#endif
> +
>  #endif /* SELFTEST_KVM_UTIL_H */
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 3ea24a5f4c43..780481a92efe 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -2135,3 +2135,34 @@ void __attribute((constructor)) kvm_selftest_init(void)
> 
>  	kvm_selftest_arch_init();
>  }
> +
> +#if defined(__aarch64__)
> +
> +#include "arch_timer.h"
> +
> +uint64_t cycles_read(void)
> +{
> +	return timer_get_cntct(VIRTUAL);
> +}
> +
> +double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles)
> +{
> +	return cycles * (1e9 / timer_get_cntfrq());

We already have all the required code to deal with ns conversions
using a multiplier and a shift, avoiding floating point like the
plague it is. Please reuse the kernel code for this, as you're quite
likely to only measure the time it takes for KVM to trap the FP
registers and perform a FP/SIMD switch...

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v2 2/2] KVM: selftests: Print summary stats of memory latency distribution
  2023-03-16 22:27 ` [PATCH v2 2/2] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis
@ 2023-03-17 18:16   ` Vipin Sharma
  2023-03-21 19:10     ` Colton Lewis
  0 siblings, 1 reply; 17+ messages in thread
From: Vipin Sharma @ 2023-03-17 18:16 UTC (permalink / raw)
  To: Colton Lewis
  Cc: Paolo Bonzini, Shuah Khan, Sean Christopherson, David Matlack,
	Andrew Jones, Marc Zyngier, Ben Gardon, Ricardo Koller,
	Oliver Upton, kvm

On Thu, Mar 16, 2023 at 3:29 PM Colton Lewis <coltonlewis@google.com> wrote:
>
> Print summary stats of the memory latency distribution in nanoseconds
> for dirty_log_perf_test. For every iteration, this prints the minimum,
> the maximum, and the 50th, 90th, and 99th percentiles.
>

Can you also write how this is useful and why these 5 specific
percentiles are valuable for testers? Generally, five-number summaries
are the 0, 25, 50, 75, and 100 %iles.
It might also be too much data to display with each iteration.

Nit: minimum, 50th, 90th, 99th and maximum since this is the order you
are printing.


> @@ -428,6 +432,7 @@ int main(int argc, char *argv[])
>                 .slots = 1,
>                 .random_seed = 1,
>                 .write_percent = 100,
> +               .samples_per_vcpu = 1000,

Why is 1000 the right default choice? Maybe the default should be 0
and if anyone wants to use it then they can use the command line
option to set it?

> @@ -438,7 +443,7 @@ int main(int argc, char *argv[])
>
>         guest_modes_append_default();
>
> -       while ((opt = getopt(argc, argv, "ab:c:eghi:m:nop:r:s:v:x:w:")) != -1) {
> +       while ((opt = getopt(argc, argv, "ab:c:eghi:m:nop:r:s:S:v:x:w:")) != -1) {

1. Please add the help section to tell about the new command line option.
2. Instead of having s and S, it may be better to use a different
lower case letter, like "l" for latency. Passing this option would print
memory latency stats, with users providing the number of samples they
prefer per vCPU.

> @@ -480,6 +485,9 @@ int main(int argc, char *argv[])
>                 case 's':
>                         p.backing_src = parse_backing_src_type(optarg);
>                         break;
> +               case 'S':
> +                       p.samples_per_vcpu = atoi_positive("Number of samples/vcpu", optarg);
> +                       break;

This will force users to always see latency stats even when they do not
want to see them. I think this patch should be modified to make it easy to
disable this feature.
I might be wrong here and it is actually a useful metric to see with
each run. If this is true then maybe the commit should mention why it
is good for each run.

> +void memstress_print_percentiles(struct kvm_vm *vm, int nr_vcpus)
> +{
> +       uint64_t n_samples = nr_vcpus * memstress_args.samples_per_vcpu;
> +       uint64_t *host_latency_samples = addr_gva2hva(vm, memstress_args.latency_samples);
> +
> +       qsort(host_latency_samples, n_samples, sizeof(uint64_t), &memstress_qcmp);
> +
> +       pr_info("Latency distribution (ns) = min:%6.0lf, 50th:%6.0lf, 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n",
> +               cycles_to_ns(vcpus[0], (double)host_latency_samples[0]),

I am not much aware of how tsc is set up and used. Will all vCPUs have
the same tsc value? Can this change if vCPU gets scheduled to
different pCPU on the host?

> +               cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples / 2]),
> +               cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples * 9 / 10]),
> +               cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples * 99 / 100]),
> +               cycles_to_ns(vcpus[0], (double)host_latency_samples[n_samples - 1]));
> +}

All of the dirty_log_perf_tests print data in seconds followed by
nanoseconds. I would recommend keeping it the same.
Also, use 9 digits for nanoseconds instead of 6, matching the other output.


* Re: [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter
  2023-03-17  9:26   ` Andrew Jones
@ 2023-03-21 19:08     ` Colton Lewis
  0 siblings, 0 replies; 17+ messages in thread
From: Colton Lewis @ 2023-03-21 19:08 UTC (permalink / raw)
  To: Andrew Jones
  Cc: pbonzini, shuah, seanjc, dmatlack, vipinsh, maz, bgardon,
	ricarkol, oliver.upton, kvm

Andrew Jones <andrew.jones@linux.dev> writes:

>> +#if defined(__aarch64__)
>> +
>> +#include "arch_timer.h"
>> +
>> +uint64_t cycles_read(void)
>> +{
>> +	return timer_get_cntct(VIRTUAL);
>> +}
>> +
>> +double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles)
>> +{
>> +	return cycles * (1e9 / timer_get_cntfrq());
>> +}
>> +
>> +#elif defined(__x86_64__)
>> +
>> +#include "processor.h"
>> +
>> +uint64_t cycles_read(void)
>> +{
>> +	return rdtsc();
>> +}
>> +
>> +double cycles_to_ns(struct kvm_vcpu *vcpu, double cycles)
>> +{
>> +	uint64_t tsc_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
>> +
>> +	return cycles * (1e9 / (tsc_khz * 1000));
>> +}
>> +#endif

> Instead of the #ifdefs, why not just put these functions in
> lib/ARCH/processor.c?

Seems like a good suggestion.
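
i.e. roughly this split, I think (sketch; only the declarations would
stay in kvm_util.h, and cycles_to_ns() would move the same way):

/* tools/testing/selftests/kvm/lib/aarch64/processor.c */
uint64_t cycles_read(void)
{
	return timer_get_cntct(VIRTUAL);
}

/* tools/testing/selftests/kvm/lib/x86_64/processor.c */
uint64_t cycles_read(void)
{
	return rdtsc();
}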


* Re: [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter
  2023-03-17 17:04   ` Vipin Sharma
@ 2023-03-21 19:09     ` Colton Lewis
  0 siblings, 0 replies; 17+ messages in thread
From: Colton Lewis @ 2023-03-21 19:09 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: pbonzini, shuah, seanjc, dmatlack, andrew.jones, maz, bgardon,
	ricarkol, oliver.upton, kvm

Vipin Sharma <vipinsh@google.com> writes:

> On Thu, Mar 16, 2023 at 3:29 PM Colton Lewis <coltonlewis@google.com>  
> wrote:

>> Provide a generic function to read the system counter from the guest
>> for timing purposes. A common and important way to measure guest
>> performance is to measure the amount of time different actions take in
>> the guest. Provide also a mathematical conversion from cycles to
>> nanoseconds and a macro for timing individual statements.

>> Substitute the previous custom implementation of a similar function in

> May be specify specific name:  guest_read_system_counter()

Will do.

>> +#define MEASURE_CYCLES(x)                      \
>> +       ({                                      \
>> +               uint64_t start;                 \
>> +               start = cycles_read();          \
>> +               x;                              \
>> +               cycles_read() - start;          \
>> +       })
>> +

> MEASURE_CYCLES should be moved to the next patch where it is getting
> used. Does it have to be macro or can it be replaced with a function?

Will move to the next patch. It can't be replaced by a function because
then 'x' would be executed before the function begins and there would be
nothing left to measure.
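
To illustrate with a hypothetical function form:

/* The argument is evaluated at the call site, so the memory access has
 * already happened before the first counter read ever runs.
 */
static uint64_t measure_cycles_fn(uint64_t already_evaluated)
{
	uint64_t start = cycles_read();

	(void)already_evaluated;	/* nothing left to time here */
	return cycles_read() - start;
}

whereas the macro expands in place, so the access really does happen
between the two counter reads.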


* Re: [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter
  2023-03-17 17:09   ` Marc Zyngier
@ 2023-03-21 19:10     ` Colton Lewis
  2023-03-28 10:09       ` Marc Zyngier
  0 siblings, 1 reply; 17+ messages in thread
From: Colton Lewis @ 2023-03-21 19:10 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: pbonzini, shuah, seanjc, dmatlack, vipinsh, andrew.jones,
	bgardon, ricarkol, oliver.upton, kvm

Marc Zyngier <maz@kernel.org> writes:

>> +#define MEASURE_CYCLES(x)			\
>> +	({					\
>> +		uint64_t start;			\
>> +		start = cycles_read();		\
>> +		x;				\

> You insert memory accesses inside a sequence that has no dependency
> with it. On a weakly ordered memory system, there is absolutely no
> reason why the memory access shouldn't be moved around. What do you
> exactly measure in that case?

cycles_read is built on another function timer_get_cntct which includes
its own barriers. Stripped of some abstraction, the sequence is:

timer_get_cntct (isb+read timer)
whatever is being measured
timer_get_cntct

I hadn't looked at it too closely before but on review of the manual
I think you are correct. Borrowing from example D7-2 in the manual, it
should be:

timer_get_cntct
isb
whatever is being measured
dsb
timer_get_cntct
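
i.e. roughly (barrier placement only; the isb/dsb are written as plain
inline asm here rather than any existing selftest helpers):

#define MEASURE_CYCLES(x)						\
	({								\
		uint64_t __start, __end;				\
									\
		__start = timer_get_cntct(VIRTUAL);			\
		/* keep x from starting before the first read */	\
		asm volatile("isb" : : : "memory");			\
		x;							\
		/* make sure x's memory access has completed */		\
		asm volatile("dsb sy" : : : "memory");			\
		__end = timer_get_cntct(VIRTUAL);			\
		__end - __start;					\
	})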

>> +		cycles_read() - start;		\

> I also question the usefulness of this exercise. You're comparing the
> time it takes for a multi-GHz system to put a write in a store buffer
> (assuming it didn't miss in the TLBs) vs a counter that gets updated
> at a frequency of a few tens of MHz.

> My guts feeling is that this results in a big fat zero most of the
> time, but I'm happy to be explained otherwise.


In context, I'm trying to measure the time it takes to write to a buffer
*with dirty memory logging enabled*. What do you mean by zero? I can
confirm from running this code I am not measuring zero time.


> We already have all the required code to deal with ns conversions
> using a multiplier and a shift, avoiding floating point like the
> plague it is. Please reuse the kernel code for this, as you're quite
> likely to only measure the time it takes for KVM to trap the FP
> registers and perform a FP/SIMD switch...

Will do.
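
Roughly like this, I think (integer-only sketch; the kernel's
clocks_calc_mult_shift() picks mult/shift properly and handles the
overflow cases):

#define CYC_TO_NS_SHIFT	24

static uint64_t cyc_to_ns_mult;

/* Compute the multiplier once from the counter frequency. */
static void cycles_to_ns_setup(uint64_t freq_hz)
{
	cyc_to_ns_mult = ((uint64_t)1000000000 << CYC_TO_NS_SHIFT) / freq_hz;
}

/* ns = cycles * (1e9 / freq) in fixed point. Fine for latency-sized
 * deltas; very large deltas would overflow the 64-bit multiply and need
 * a more careful mult/shift choice.
 */
static uint64_t cycles_to_ns_fixed(uint64_t cycles)
{
	return (cycles * cyc_to_ns_mult) >> CYC_TO_NS_SHIFT;
}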


* Re: [PATCH v2 2/2] KVM: selftests: Print summary stats of memory latency distribution
  2023-03-17 18:16   ` Vipin Sharma
@ 2023-03-21 19:10     ` Colton Lewis
  2023-03-21 21:21       ` Sean Christopherson
  0 siblings, 1 reply; 17+ messages in thread
From: Colton Lewis @ 2023-03-21 19:10 UTC (permalink / raw)
  To: Vipin Sharma
  Cc: pbonzini, shuah, seanjc, dmatlack, andrew.jones, maz, bgardon,
	ricarkol, oliver.upton, kvm

Vipin Sharma <vipinsh@google.com> writes:

> On Thu, Mar 16, 2023 at 3:29 PM Colton Lewis <coltonlewis@google.com>  
> wrote:

>> Print summary stats of the memory latency distribution in nanoseconds
>> for dirty_log_perf_test. For every iteration, this prints the minimum,
>> the maximum, and the 50th, 90th, and 99th percentiles.


> Can you also write how this is useful and why these 5 specific
> percentiles are valuable for testers? Generally, 5 number summaries
> are 0, 25, 50, 75, 100 %ile.
> It might also be too much data to display with each iteration.

Generally, when measuring latency you care more about the right half of
the distribution. The left half of the distribution is assumed to be
bunched close together at some low value, and I do see in my
measurements there is not much difference between the minimum and the
50th percentile. The right half of the distribution is where things
might be going wrong. 50th, 90th, or 99th are common measures to take in
other contexts, specifically networking.

One example:
https://cloud.google.com/spanner/docs/latency-metrics#overview

> Nit: minimum, 50th, 90th, 99th and maximum since this is the order you
> are printing.

Ok.

>> @@ -428,6 +432,7 @@ int main(int argc, char *argv[])
>>                  .slots = 1,
>>                  .random_seed = 1,
>>                  .write_percent = 100,
>> +               .samples_per_vcpu = 1000,

> Why is 1000 the right default choice? Maybe the default should be 0
> and if anyone wants to use it then they can use the command line
> option to set it?

Good point.

>> @@ -438,7 +443,7 @@ int main(int argc, char *argv[])

>>          guest_modes_append_default();

>> -       while ((opt = getopt(argc,  
>> argv, "ab:c:eghi:m:nop:r:s:v:x:w:")) != -1) {
>> +       while ((opt = getopt(argc,  
>> argv, "ab:c:eghi:m:nop:r:s:S:v:x:w:")) != -1) {

> 1. Please add the help section to tell about the new command line option.

Thanks for the reminder.

> 2. Instead of having s and S, it may be better to use a different
> lower case letter, like "l" for latency. Giving this option will print
> memory latency and users need to provide the number of samples they
> prefer per vCPU.

Probably a good idea.

>> @@ -480,6 +485,9 @@ int main(int argc, char *argv[])
>>                  case 's':
>>                          p.backing_src = parse_backing_src_type(optarg);
>>                          break;
>> +               case 'S':
>> +                       p.samples_per_vcpu = atoi_positive("Number of  
>> samples/vcpu", optarg);
>> +                       break;

> This will force users to always see latency stat when they do not want
> to see it. I think this patch should be modified in a way to easily
> disable this feature.
> I might be wrong here and it is actually a useful metric to see with
> each run. If this is true then maybe the commit should mention why it
> is good for each run.

It gives an extra line of information per run. This would help people
distinguish between cases where the average is high because most
accesses are high or because a small percentage of accesses are
*really* high.

But in general I agree it should be silent unless requested.

>> +void memstress_print_percentiles(struct kvm_vm *vm, int nr_vcpus)
>> +{
>> +       uint64_t n_samples = nr_vcpus * memstress_args.samples_per_vcpu;
>> +       uint64_t *host_latency_samples = addr_gva2hva(vm,  
>> memstress_args.latency_samples);
>> +
>> +       qsort(host_latency_samples, n_samples, sizeof(uint64_t),  
>> &memstress_qcmp);
>> +
>> +       pr_info("Latency distribution (ns) = min:%6.0lf, 50th:%6.0lf,  
>> 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n",
>> +               cycles_to_ns(vcpus[0], (double)host_latency_samples[0]),

> I am not much aware of how tsc is set up and used. Will all vCPUs have
> the same tsc value? Can this change if vCPU gets scheduled to
> different pCPU on the host?

All vCPUs *in one VM* should have the same frequency. The alternative is
probably possible but so weird I can't imagine a reason for doing it.

>> +               cycles_to_ns(vcpus[0],  
>> (double)host_latency_samples[n_samples / 2]),
>> +               cycles_to_ns(vcpus[0],  
>> (double)host_latency_samples[n_samples * 9 / 10]),
>> +               cycles_to_ns(vcpus[0],  
>> (double)host_latency_samples[n_samples * 99 / 100]),
>> +               cycles_to_ns(vcpus[0],  
>> (double)host_latency_samples[n_samples - 1]));
>> +}

> All of the dirty_log_perf_tests print data in seconds followed by
> nanoseconds. I will recommend keeping it the same.
> Also, 9 digits for nanoseconds instead of 6 as in other formats.

Consistency here is a good thing. Thanks for pointing out.
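
Something like this for the output, then (rough sketch, name made up
here):

/* Print a nanosecond value in the same <sec>.<9-digit nsec> style the
 * rest of dirty_log_perf_test uses.
 */
static void pr_ns(const char *name, uint64_t ns)
{
	pr_info("%s: %" PRIu64 ".%09" PRIu64 "s\n", name,
		ns / 1000000000, ns % 1000000000);
}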


* Re: [PATCH v2 2/2] KVM: selftests: Print summary stats of memory latency distribution
  2023-03-21 19:10     ` Colton Lewis
@ 2023-03-21 21:21       ` Sean Christopherson
  2023-03-24 16:26         ` Colton Lewis
  0 siblings, 1 reply; 17+ messages in thread
From: Sean Christopherson @ 2023-03-21 21:21 UTC (permalink / raw)
  To: Colton Lewis
  Cc: Vipin Sharma, pbonzini, shuah, dmatlack, andrew.jones, maz,
	bgardon, ricarkol, oliver.upton, kvm

On Tue, Mar 21, 2023, Colton Lewis wrote:
> Vipin Sharma <vipinsh@google.com> writes:
> 
> > On Thu, Mar 16, 2023 at 3:29 PM Colton Lewis <coltonlewis@google.com>
> > wrote:
> > > +       pr_info("Latency distribution (ns) = min:%6.0lf,
> > > 50th:%6.0lf, 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n",
> > > +               cycles_to_ns(vcpus[0], (double)host_latency_samples[0]),
> 
> > I am not much aware of how tsc is set up and used. Will all vCPUs have
> > the same tsc value? Can this change if vCPU gets scheduled to
> > different pCPU on the host?

FWIW, if this test were run on older CPUs, there would be potential divergence
across pCPUs that would then bleed into vCPUs to some extent.  Older CPUs tied
the TSC frequency to the core frequency, e.g. would change frequency depending
on the power/turbo state, and the TSC would even stop counting altogether at
certain C-states.  KVM does its best to adjust the guest's perception of the TSC,
but it can't be hidden completely.

But for what this test is trying to do, IMO there's zero reason to worry about
that.
 
> All vCPUs *in one VM* should have the same frequency. The alternative is
> probably possible but so weird I can't imagine a reason for doing it.

Somewhat related to Vipin's question, "host_latency_samples" is a confusing name.
It's easy to miss that "host_latency_samples" are actually samples collected in
the guest, and thus to think that this code will depend on which pCPU it runs on.

I don't see any reason for such a verbose name, e.g. can't it just be "samples"?


* Re: [PATCH v2 2/2] KVM: selftests: Print summary stats of memory latency distribution
  2023-03-21 21:21       ` Sean Christopherson
@ 2023-03-24 16:26         ` Colton Lewis
  0 siblings, 0 replies; 17+ messages in thread
From: Colton Lewis @ 2023-03-24 16:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: vipinsh, pbonzini, shuah, dmatlack, andrew.jones, maz, bgardon,
	ricarkol, oliver.upton, kvm

Sean Christopherson <seanjc@google.com> writes:

> On Tue, Mar 21, 2023, Colton Lewis wrote:
>> Vipin Sharma <vipinsh@google.com> writes:

>> > On Thu, Mar 16, 2023 at 3:29 PM Colton Lewis <coltonlewis@google.com>
>> > wrote:
>> > > +       pr_info("Latency distribution (ns) = min:%6.0lf,
>> > > 50th:%6.0lf, 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n",
>> > > +               cycles_to_ns(vcpus[0],  
>> (double)host_latency_samples[0]),

>> > I am not much aware of how tsc is set up and used. Will all vCPUs have
>> > the same tsc value? Can this change if vCPU gets scheduled to
>> > different pCPU on the host?

> FWIW, if this test were run on older CPUs, there would be potential  
> divergence
> across pCPUs that would then bleed into vCPUs to some extent.  Older CPUs  
> tied
> the TSC frequency to the core frequency, e.g. would change frequency  
> depending
> on the power/turbo state, and the TSC would even stop counting altogether  
> at
> certain C-states.  KVM does its best to adjust the guest's perception of  
> the TSC,
> but it can't be hidden completely.

> But for what this test is trying to do, IMO there's zero reason to worry  
> about
> that.

Thanks for the confirmation.

>> All vCPUs *in one VM* should have the same frequency. The alternative is
>> probably possible but so weird I can't imagine a reason for doing it.

> Somewhat related to Vipin's question, "host_latency_samples" is a  
> confusing name.
> It's easy to miss that "host_latency_samples" are actually samples  
> collected in
> the guest, and thus to think that this code will depend on which pCPU it  
> runs on.

> I don't see any reason for such a verbose name, e.g. can't it just  
> be "samples"?

Yes it can.


* Re: [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter
  2023-03-21 19:10     ` Colton Lewis
@ 2023-03-28 10:09       ` Marc Zyngier
  2023-03-28 15:38         ` Sean Christopherson
  2023-03-28 21:51         ` Colton Lewis
  0 siblings, 2 replies; 17+ messages in thread
From: Marc Zyngier @ 2023-03-28 10:09 UTC (permalink / raw)
  To: Colton Lewis
  Cc: pbonzini, shuah, seanjc, dmatlack, vipinsh, andrew.jones,
	bgardon, ricarkol, oliver.upton, kvm

On Tue, 21 Mar 2023 19:10:04 +0000,
Colton Lewis <coltonlewis@google.com> wrote:
> 
> Marc Zyngier <maz@kernel.org> writes:
> 
> >> +#define MEASURE_CYCLES(x)			\
> >> +	({					\
> >> +		uint64_t start;			\
> >> +		start = cycles_read();		\
> >> +		x;				\
> 
> > You insert memory accesses inside a sequence that has no dependency
> > with it. On a weakly ordered memory system, there is absolutely no
> > reason why the memory access shouldn't be moved around. What do you
> > exactly measure in that case?
> 
> cycles_read is built on another function timer_get_cntct which includes
> its own barriers. Stripped of some abstraction, the sequence is:
> 
> timer_get_cntct (isb+read timer)
> whatever is being measured
> timer_get_cntct
> 
> I hadn't looked at it too closely before but on review of the manual
> I think you are correct. Borrowing from example D7-2 in the manual, it
> should be:
> 
> timer_get_cntct
> isb
> whatever is being measured
> dsb
> timer_get_cntct

That's better, but also very heavy handed. You'd be better off
constructing an address dependency from the timer value, and feed that
into a load-acquire/store-release pair wrapping your payload.

> 
> >> +		cycles_read() - start;		\
> 
> > I also question the usefulness of this exercise. You're comparing the
> > time it takes for a multi-GHz system to put a write in a store buffer
> > (assuming it didn't miss in the TLBs) vs a counter that gets updated
> > at a frequency of a few tens of MHz.
> 
> > My guts feeling is that this results in a big fat zero most of the
> > time, but I'm happy to be explained otherwise.
> 
> 
> In context, I'm trying to measure the time it takes to write to a buffer
> *with dirty memory logging enabled*. What do you mean by zero? I can
> confirm from running this code I am not measuring zero time.

See my earlier point: the counter tick is a few MHz, and the CPU
multiple GHz. So unless "whatever" is something that takes a
significant time (several thousands of CPU cycles), you'll measure
nothing using the counter. Page faults will probably show, but not a
normal access.

The right tool for this job is to use PMU events, as they count at the
CPU frequency.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter
  2023-03-28 10:09       ` Marc Zyngier
@ 2023-03-28 15:38         ` Sean Christopherson
  2023-03-28 15:50           ` Marc Zyngier
  2023-03-28 21:51         ` Colton Lewis
  1 sibling, 1 reply; 17+ messages in thread
From: Sean Christopherson @ 2023-03-28 15:38 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Colton Lewis, pbonzini, shuah, dmatlack, vipinsh, andrew.jones,
	bgardon, ricarkol, oliver.upton, kvm

On Tue, Mar 28, 2023, Marc Zyngier wrote:
> On Tue, 21 Mar 2023 19:10:04 +0000,
> Colton Lewis <coltonlewis@google.com> wrote:
> > In context, I'm trying to measure the time it takes to write to a buffer
> > *with dirty memory logging enabled*. What do you mean by zero? I can
> > confirm from running this code I am not measuring zero time.
> 
> See my earlier point: the counter tick is a few MHz, and the CPU
> multiple GHz.

On x86, the system counter (TSC) counts at multiple GHz, so we should be able to
continue with that approach for x86.

> So unless "whatever" is something that takes a significant time (several
> thousands of CPU cycles), you'll measure nothing using the counter. Page
> faults will probably show, but not a normal access.
> 
> The right tool for this job is to use PMU events, as they count at the CPU
> frequency.

Out of curiosity, what does the kernel end up using for things like ndelay()?  I
tried to follow the breadcrumbs for ARM and got as far as arm_arch_timer.c, but
after that I'm more than a bit lost.


* Re: [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter
  2023-03-28 15:38         ` Sean Christopherson
@ 2023-03-28 15:50           ` Marc Zyngier
  0 siblings, 0 replies; 17+ messages in thread
From: Marc Zyngier @ 2023-03-28 15:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Colton Lewis, pbonzini, shuah, dmatlack, vipinsh, andrew.jones,
	bgardon, ricarkol, oliver.upton, kvm

On Tue, 28 Mar 2023 16:38:37 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Tue, Mar 28, 2023, Marc Zyngier wrote:
> > On Tue, 21 Mar 2023 19:10:04 +0000,
> > Colton Lewis <coltonlewis@google.com> wrote:
> > > In context, I'm trying to measure the time it takes to write to a buffer
> > > *with dirty memory logging enabled*. What do you mean by zero? I can
> > > confirm from running this code I am not measuring zero time.
> > 
> > See my earlier point: the counter tick is a few MHz, and the CPU
> > multiple GHz.
> 
> On x86, the system counter (TSC) counts at multiple GHz, so we
> should be able to continue with that approach for x86.
> 
> > So unless "whatever" is something that takes a significant time (several
> > thousands of CPU cycles), you'll measure nothing using the counter. Page
> > faults will probably show, but not a normal access.
> > 
> > The right tool for this job is to use PMU events, as they count at the CPU
> > frequency.
> 
> Out of curiosity, what does the kernel end up using for things like ndelay()?  I
> tried to follow the breadcrumbs for ARM and got as far as arm_arch_timer.c, but
> after that I'm more than a bit lost.

That's where it ends. We use the counter for everything. Even on
ARMv8.6+ HW that is supposed to give you a 1GHz counter,
implementations are allowed to perform "lumpy" increments (50MHz
counter with increments of 20 per tick, for example).

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v2 1/2] KVM: selftests: Provide generic way to read system counter
  2023-03-28 10:09       ` Marc Zyngier
  2023-03-28 15:38         ` Sean Christopherson
@ 2023-03-28 21:51         ` Colton Lewis
  1 sibling, 0 replies; 17+ messages in thread
From: Colton Lewis @ 2023-03-28 21:51 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: pbonzini, shuah, seanjc, dmatlack, vipinsh, andrew.jones,
	bgardon, ricarkol, oliver.upton, kvm

Marc Zyngier <maz@kernel.org> writes:

> On Tue, 21 Mar 2023 19:10:04 +0000,
> Colton Lewis <coltonlewis@google.com> wrote:

>> Marc Zyngier <maz@kernel.org> writes:

>> >> +#define MEASURE_CYCLES(x)			\
>> >> +	({					\
>> >> +		uint64_t start;			\
>> >> +		start = cycles_read();		\
>> >> +		x;				\

>> > You insert memory accesses inside a sequence that has no dependency
>> > with it. On a weakly ordered memory system, there is absolutely no
>> > reason why the memory access shouldn't be moved around. What do you
>> > exactly measure in that case?

>> cycles_read is built on another function timer_get_cntct which includes
>> its own barriers. Stripped of some abstraction, the sequence is:

>> timer_get_cntct (isb+read timer)
>> whatever is being measured
>> timer_get_cntct

>> I hadn't looked at it too closely before but on review of the manual
>> I think you are correct. Borrowing from example D7-2 in the manual, it
>> should be:

>> timer_get_cntct
>> isb
>> whatever is being measured
>> dsb
>> timer_get_cntct

> That's better, but also very heavy handed. You'd be better off
> constructing an address dependency from the timer value, and feed that
> into a load-acquire/store-release pair wrapping your payload.

I can do something like that.


>> >> +		cycles_read() - start;		\

>> > I also question the usefulness of this exercise. You're comparing the
>> > time it takes for a multi-GHz system to put a write in a store buffer
>> > (assuming it didn't miss in the TLBs) vs a counter that gets updated
>> > at a frequency of a few tens of MHz.

>> > My guts feeling is that this results in a big fat zero most of the
>> > time, but I'm happy to be explained otherwise.


>> In context, I'm trying to measure the time it takes to write to a buffer
>> *with dirty memory logging enabled*. What do you mean by zero? I can
>> confirm from running this code I am not measuring zero time.

> See my earlier point: the counter tick is a few MHz, and the CPU
> multiple GHz. So unless "whatever" is something that takes a
> significant time (several thousands of CPU cycles), you'll measure
> nothing using the counter. Page faults will probably show, but not a
> normal access.

> The right tool for this job is to use PMU events, as they count at the
> CPU frequency.

Thanks. I understand you clearly now. I think it works out to tens of
CPU cycles, not thousands, to observe a timer tick in the usual case
(2 GHz / 25 MHz = 80 cycles per tick; slower timers exist, of course),
but I agree with you that a more precise tool is called for.
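
On the arm64 side I assume that means reading PMCCNTR_EL0 rather than
the generic timer, something like the below (sketch only; it assumes the
guest's PMU cycle counter has already been enabled):

static inline uint64_t pmu_cycles_read(void)
{
	uint64_t cycles;

	/* Order against preceding instructions, then read the PMU cycle
	 * counter, which ticks at the CPU frequency.
	 */
	asm volatile("isb; mrs %0, pmccntr_el0" : "=r"(cycles));
	return cycles;
}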

