* [PATCH 0/3] Calculate memory access latency stats
@ 2022-11-15 17:32 Colton Lewis
  2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Colton Lewis @ 2022-11-15 17:32 UTC (permalink / raw)
  To: kvm
  Cc: pbonzini, maz, dmatlack, seanjc, bgardon, oupton, ricarkol, Colton Lewis

Sample the latency of memory accesses in dirty_log_perf_test and
report summary stats to give a picture of the latency
distribution. Specifically, focus on the right tail: the 50th, 90th,
and 99th percentiles are reported in ns.

This series depends on my previous dirty_log_perf_test patch, which
added the ability to randomize memory accesses; the pRNG introduced
there is needed here for random sampling.

https://lore.kernel.org/kvm/20221107182208.479157-1-coltonlewis@google.com/

Colton Lewis (3):
  KVM: selftests: Allocate additional space for latency samples
  KVM: selftests: Collect memory access latency samples
  KVM: selftests: Print summary stats of memory latency distribution

 .../selftests/kvm/dirty_log_perf_test.c       |   2 +
 .../selftests/kvm/include/perf_test_util.h    |   2 +
 .../selftests/kvm/lib/perf_test_util.c        | 103 +++++++++++++++++-
 3 files changed, 104 insertions(+), 3 deletions(-)

--
2.38.1.431.g37b22c650d-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples
  2022-11-15 17:32 [PATCH 0/3] Calculate memory access latency stats Colton Lewis
@ 2022-11-15 17:32 ` Colton Lewis
  2023-01-17 20:32   ` Ricardo Koller
  2023-01-18 16:49   ` Sean Christopherson
  2022-11-15 17:32 ` [PATCH 2/3] KVM: selftests: Collect memory access " Colton Lewis
  2022-11-15 17:32 ` [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis
  2 siblings, 2 replies; 19+ messages in thread
From: Colton Lewis @ 2022-11-15 17:32 UTC (permalink / raw)
  To: kvm
  Cc: pbonzini, maz, dmatlack, seanjc, bgardon, oupton, ricarkol, Colton Lewis

Allocate additional space for latency samples. This is separated into
its own patch to call attention to the additional VM memory
allocation; without it, the test runs out of physical pages. The 100x
multiplier for the page count was determined by trial and error; a
better-reasoned calculation would be preferable.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 tools/testing/selftests/kvm/lib/perf_test_util.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
index 137be359b09e..a48904b64e19 100644
--- a/tools/testing/selftests/kvm/lib/perf_test_util.c
+++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
@@ -38,6 +38,12 @@ static bool all_vcpu_threads_running;
 
 static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 
+#define SAMPLES_PER_VCPU 1000
+#define SAMPLE_CAPACITY (SAMPLES_PER_VCPU * KVM_MAX_VCPUS)
+
+/* Store all samples in a flat array so they can be easily sorted later. */
+uint64_t latency_samples[SAMPLE_CAPACITY];
+
 /*
  * Continuously write to the first 8 bytes of each page in the
  * specified region.
@@ -122,7 +128,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 {
 	struct perf_test_args *pta = &perf_test_args;
 	struct kvm_vm *vm;
-	uint64_t guest_num_pages, slot0_pages = 0;
+	uint64_t guest_num_pages, sample_pages, slot0_pages = 0;
 	uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src);
 	uint64_t region_end_gfn;
 	int i;
@@ -161,7 +167,9 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 	 * The memory is also added to memslot 0, but that's a benign side
 	 * effect as KVM allows aliasing HVAs in memslots.
 	 */
-	vm = __vm_create_with_vcpus(mode, nr_vcpus, slot0_pages + guest_num_pages,
+	sample_pages = 100 * sizeof(latency_samples) / pta->guest_page_size;
+	vm = __vm_create_with_vcpus(mode, nr_vcpus,
+				    slot0_pages + guest_num_pages + sample_pages,
 				    perf_test_guest_code, vcpus);
 
 	pta->vm = vm;
-- 
2.38.1.431.g37b22c650d-goog



* [PATCH 2/3] KVM: selftests: Collect memory access latency samples
  2022-11-15 17:32 [PATCH 0/3] Calculate memory access latency stats Colton Lewis
  2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis
@ 2022-11-15 17:32 ` Colton Lewis
  2023-01-17 20:43   ` Ricardo Koller
  2023-01-17 20:48   ` Ricardo Koller
  2022-11-15 17:32 ` [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis
  2 siblings, 2 replies; 19+ messages in thread
From: Colton Lewis @ 2022-11-15 17:32 UTC (permalink / raw)
  To: kvm
  Cc: pbonzini, maz, dmatlack, seanjc, bgardon, oupton, ricarkol, Colton Lewis

Collect memory access latency measured in clock cycles.

This introduces a dependency on the timers for ARM and x86. No other
architectures are implemented and their samples will all be 0.

Because keeping all samples is impractical due to the space required
in some cases (pooled memory w/ 64 vcpus would be 64 GB/vcpu * 64
vcpus * 250,000 samples/GB * 8 bytes/sample ~ 8 GB of extra memory
just for samples), reservoir sampling is used to keep only a small
number of samples per vcpu (1000 samples in this patch).

Reservoir sampling means that despite keeping only a small number of
samples, each sample has an equal chance of making it into the
reservoir. Simple proofs of this can be found online. This makes the
reservoir a good representation of the distribution of samples and
enables calculation of reasonably accurate percentiles.

All samples are stored in a statically allocated flat array for ease
of combining them later. Samples are stored at an offset into this
array calculated from the vcpu index, so vcpu 5 sample 10 would be
stored at address latency_samples + 5 * SAMPLES_PER_VCPU + 10.
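
As a rough illustration (not the code in this patch), the update step
of the reservoir is the classic Algorithm R; a minimal host-side
sketch, with illustrative names and the C library pRNG standing in for
the guest one:

```c
#include <stdint.h>
#include <stdlib.h>

#define RESERVOIR_SIZE 1000

/*
 * Algorithm R: offer sample number i (zero-based) to the reservoir.
 * The first RESERVOIR_SIZE samples fill it; afterwards each new sample
 * replaces a uniformly random slot with probability RESERVOIR_SIZE/(i+1),
 * so every sample seen so far is retained with equal probability.
 */
static void reservoir_record(uint64_t *reservoir, uint64_t sample, uint64_t i)
{
	uint64_t slot = (uint64_t)rand() % (i + 1);

	if (i < RESERVOIR_SIZE)
		reservoir[i] = sample;		/* fill phase */
	else if (slot < RESERVOIR_SIZE)
		reservoir[slot] = sample;	/* random replacement phase */
}
```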

Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 .../selftests/kvm/lib/perf_test_util.c        | 34 +++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
index a48904b64e19..0311da76bae0 100644
--- a/tools/testing/selftests/kvm/lib/perf_test_util.c
+++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
@@ -4,6 +4,9 @@
  */
 #include <inttypes.h>
 
+#if defined(__aarch64__)
+#include "aarch64/arch_timer.h"
+#endif
 #include "kvm_util.h"
 #include "perf_test_util.h"
 #include "processor.h"
@@ -44,6 +47,18 @@ static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 /* Store all samples in a flat array so they can be easily sorted later. */
 uint64_t latency_samples[SAMPLE_CAPACITY];
 
+static uint64_t perf_test_timer_read(void)
+{
+#if defined(__aarch64__)
+	return timer_get_cntct(VIRTUAL);
+#elif defined(__x86_64__)
+	return rdtsc();
+#else
+#warning "perf_test_timer_read() is not implemented for this architecture, will return 0"
+	return 0;
+#endif
+}
+
 /*
  * Continuously write to the first 8 bytes of each page in the
  * specified region.
@@ -59,6 +74,10 @@ void perf_test_guest_code(uint32_t vcpu_idx)
 	int i;
 	struct guest_random_state rand_state =
 		new_guest_random_state(pta->random_seed + vcpu_idx);
+	uint64_t *latency_samples_offset = latency_samples + SAMPLES_PER_VCPU * vcpu_idx;
+	uint64_t count_before;
+	uint64_t count_after;
+	uint32_t maybe_sample;
 
 	gva = vcpu_args->gva;
 	pages = vcpu_args->pages;
@@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx)
 
 			addr = gva + (page * pta->guest_page_size);
 
-			if (guest_random_u32(&rand_state) % 100 < pta->write_percent)
+			if (guest_random_u32(&rand_state) % 100 < pta->write_percent) {
+				count_before = perf_test_timer_read();
 				*(uint64_t *)addr = 0x0123456789ABCDEF;
-			else
+				count_after = perf_test_timer_read();
+			} else {
+				count_before = perf_test_timer_read();
 				READ_ONCE(*(uint64_t *)addr);
+				count_after = perf_test_timer_read();
+			}
+
+			maybe_sample = guest_random_u32(&rand_state) % (i + 1);
+			if (i < SAMPLES_PER_VCPU)
+				latency_samples_offset[i] = count_after - count_before;
+			else if (maybe_sample < SAMPLES_PER_VCPU)
+				latency_samples_offset[maybe_sample] = count_after - count_before;
 		}
 
 		GUEST_SYNC(1);
-- 
2.38.1.431.g37b22c650d-goog



* [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution
  2022-11-15 17:32 [PATCH 0/3] Calculate memory access latency stats Colton Lewis
  2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis
  2022-11-15 17:32 ` [PATCH 2/3] KVM: selftests: Collect memory access " Colton Lewis
@ 2022-11-15 17:32 ` Colton Lewis
  2023-01-17 20:45   ` Ricardo Koller
  2023-01-18 16:43   ` Sean Christopherson
  2 siblings, 2 replies; 19+ messages in thread
From: Colton Lewis @ 2022-11-15 17:32 UTC (permalink / raw)
  To: kvm
  Cc: pbonzini, maz, dmatlack, seanjc, bgardon, oupton, ricarkol, Colton Lewis

Print summary stats of the memory latency distribution in
nanoseconds. For every iteration, this prints the minimum, the
maximum, and the 50th, 90th, and 99th percentiles.

Stats are calculated by sorting the samples taken from all vcpus and
picking from the index corresponding to each percentile.
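
The sort-then-index step can be sketched as follows (illustrative
only, with assumed helper names, not the code in this patch):

```c
#include <stdint.h>
#include <stdlib.h>

/* Compare uint64_t samples for qsort; plain int subtraction could truncate. */
static int cmp_u64(const void *a, const void *b)
{
	const uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Pick the p-th percentile (0-100) from an array of n sorted samples. */
static uint64_t percentile(const uint64_t *sorted, uint64_t n, unsigned int p)
{
	return sorted[n * p / 100];
}
```

With samples sorted by cmp_u64, percentile(samples, n, 90) returns the
sample at the 90th-percentile index, matching the scheme above.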

The conversion to nanoseconds needs the frequency of the x86
timestamp counter (TSC), which is estimated by reading the counter
before and after sleeping for 1 second. This is not a pretty trick,
but the same approach already exists in vmx_nested_tsc_scaling_test.c.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 .../selftests/kvm/dirty_log_perf_test.c       |  2 +
 .../selftests/kvm/include/perf_test_util.h    |  2 +
 .../selftests/kvm/lib/perf_test_util.c        | 62 +++++++++++++++++++
 3 files changed, 66 insertions(+)

diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 202f38a72851..2bc066bba460 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -274,6 +274,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	ts_diff = timespec_elapsed(start);
 	pr_info("Populate memory time: %ld.%.9lds\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
+	perf_test_print_percentiles(vm, nr_vcpus);
 
 	/* Enable dirty logging */
 	clock_gettime(CLOCK_MONOTONIC, &start);
@@ -304,6 +305,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		vcpu_dirty_total = timespec_add(vcpu_dirty_total, ts_diff);
 		pr_info("Iteration %d dirty memory time: %ld.%.9lds\n",
 			iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
+		perf_test_print_percentiles(vm, nr_vcpus);
 
 		clock_gettime(CLOCK_MONOTONIC, &start);
 		get_dirty_log(vm, bitmaps, p->slots);
diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
index 3d0b75ea866a..ca378c262f12 100644
--- a/tools/testing/selftests/kvm/include/perf_test_util.h
+++ b/tools/testing/selftests/kvm/include/perf_test_util.h
@@ -47,6 +47,8 @@ struct perf_test_args {
 
 extern struct perf_test_args perf_test_args;
 
+void perf_test_print_percentiles(struct kvm_vm *vm, int nr_vcpus);
+
 struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 				   uint64_t vcpu_memory_bytes, int slots,
 				   enum vm_mem_backing_src_type backing_src,
diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
index 0311da76bae0..927d22421f7c 100644
--- a/tools/testing/selftests/kvm/lib/perf_test_util.c
+++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
@@ -115,6 +115,68 @@ void perf_test_guest_code(uint32_t vcpu_idx)
 	}
 }
 
+#if defined(__x86_64__)
+/* This could be determined with the right sequence of cpuid
+ * instructions, but that's oddly complicated.
+ */
+static uint64_t perf_test_intel_timer_frequency(void)
+{
+	uint64_t count_before;
+	uint64_t count_after;
+	uint64_t measured_freq;
+	uint64_t adjusted_freq;
+
+	count_before = perf_test_timer_read();
+	sleep(1);
+	count_after = perf_test_timer_read();
+
+	/* Using 1 second implies our units are in Hz already. */
+	measured_freq = count_after - count_before;
+	/* Truncate to the nearest MHz. Clock frequencies are round numbers. */
+	adjusted_freq = measured_freq / 1000000 * 1000000;
+
+	return adjusted_freq;
+}
+#endif
+
+static double perf_test_cycles_to_ns(double cycles)
+{
+#if defined(__aarch64__)
+	return cycles * (1e9 / timer_get_cntfrq());
+#elif defined(__x86_64__)
+	static uint64_t timer_frequency;
+
+	if (timer_frequency == 0)
+		timer_frequency = perf_test_intel_timer_frequency();
+
+	return cycles * (1e9 / timer_frequency);
+#else
+#warning "perf_test_cycles_to_ns() is not implemented for this architecture, will return 0"
+	return 0.0;
+#endif
+}
+
+/* Compare uint64_t samples for qsort; int subtraction could mis-order them. */
+static int perf_test_qcmp(const void *a, const void *b)
+{
+	return (*(const uint64_t *)a > *(const uint64_t *)b) - (*(const uint64_t *)a < *(const uint64_t *)b);
+}
+
+void perf_test_print_percentiles(struct kvm_vm *vm, int nr_vcpus)
+{
+	uint64_t n_samples = nr_vcpus * SAMPLES_PER_VCPU;
+
+	sync_global_from_guest(vm, latency_samples);
+	qsort(latency_samples, n_samples, sizeof(uint64_t), &perf_test_qcmp);
+
+	pr_info("Latency distribution (ns) = min:%6.0lf, 50th:%6.0lf, 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n",
+		perf_test_cycles_to_ns((double)latency_samples[0]),
+		perf_test_cycles_to_ns((double)latency_samples[n_samples / 2]),
+		perf_test_cycles_to_ns((double)latency_samples[n_samples * 9 / 10]),
+		perf_test_cycles_to_ns((double)latency_samples[n_samples * 99 / 100]),
+		perf_test_cycles_to_ns((double)latency_samples[n_samples - 1]));
+}
+
 void perf_test_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
 			   struct kvm_vcpu *vcpus[],
 			   uint64_t vcpu_memory_bytes,
-- 
2.38.1.431.g37b22c650d-goog



* Re: [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples
  2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis
@ 2023-01-17 20:32   ` Ricardo Koller
  2023-01-18 16:49   ` Sean Christopherson
  1 sibling, 0 replies; 19+ messages in thread
From: Ricardo Koller @ 2023-01-17 20:32 UTC (permalink / raw)
  To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton

Hi Colton,

On Tue, Nov 15, 2022 at 05:32:56PM +0000, Colton Lewis wrote:
> Allocate additional space for latency samples. This has been separated
> out to call attention to the additional VM memory allocation. The test
> runs out of physical pages without the additional allocation. The 100
> multiple for pages was determined by trial and error. A more
> well-reasoned calculation would be preferable.
> 
> Signed-off-by: Colton Lewis <coltonlewis@google.com>
> ---
>  tools/testing/selftests/kvm/lib/perf_test_util.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> index 137be359b09e..a48904b64e19 100644
> --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> @@ -38,6 +38,12 @@ static bool all_vcpu_threads_running;
>  
>  static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>  
> +#define SAMPLES_PER_VCPU 1000
> +#define SAMPLE_CAPACITY (SAMPLES_PER_VCPU * KVM_MAX_VCPUS)
> +
> +/* Store all samples in a flat array so they can be easily sorted later. */
> +uint64_t latency_samples[SAMPLE_CAPACITY];
> +
>  /*
>   * Continuously write to the first 8 bytes of each page in the
>   * specified region.
> @@ -122,7 +128,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus,
>  {
>  	struct perf_test_args *pta = &perf_test_args;
>  	struct kvm_vm *vm;
> -	uint64_t guest_num_pages, slot0_pages = 0;
> +	uint64_t guest_num_pages, sample_pages, slot0_pages = 0;
>  	uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src);
>  	uint64_t region_end_gfn;
>  	int i;
> @@ -161,7 +167,9 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus,
>  	 * The memory is also added to memslot 0, but that's a benign side
>  	 * effect as KVM allows aliasing HVAs in meslots.
>  	 */
> -	vm = __vm_create_with_vcpus(mode, nr_vcpus, slot0_pages + guest_num_pages,
> +	sample_pages = 100 * sizeof(latency_samples) / pta->guest_page_size;

I don't think there's any need to guess. The number of accesses is
vcpu_args->pages (one access per guest page). So all memory could be
allocated dynamically to hold "vcpu_args->pages * sample_sz".
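
For the reservoir as posted (a fixed number of samples per vCPU), the
exact backing could also be computed directly instead of the 100x
factor; a sketch under that assumption, with hypothetical names:

```c
#include <stdint.h>

/*
 * Pages needed to back the sample array exactly: one uint64_t per
 * sample, samples_per_vcpu samples per vCPU, rounded up to a whole
 * number of guest pages. Sketch only, not the posted code.
 */
static uint64_t sample_pages_needed(int nr_vcpus, uint64_t samples_per_vcpu,
				    uint64_t guest_page_size)
{
	uint64_t bytes = (uint64_t)nr_vcpus * samples_per_vcpu * sizeof(uint64_t);

	return (bytes + guest_page_size - 1) / guest_page_size;
}
```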

> +	vm = __vm_create_with_vcpus(mode, nr_vcpus,
> +				    slot0_pages + guest_num_pages + sample_pages,
>  				    perf_test_guest_code, vcpus);
>  
>  	pta->vm = vm;
> -- 
> 2.38.1.431.g37b22c650d-goog
> 

Thanks,
Ricardo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples
  2022-11-15 17:32 ` [PATCH 2/3] KVM: selftests: Collect memory access " Colton Lewis
@ 2023-01-17 20:43   ` Ricardo Koller
  2023-01-18 16:32     ` Sean Christopherson
  2023-01-26 17:58     ` Colton Lewis
  2023-01-17 20:48   ` Ricardo Koller
  1 sibling, 2 replies; 19+ messages in thread
From: Ricardo Koller @ 2023-01-17 20:43 UTC (permalink / raw)
  To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton

On Tue, Nov 15, 2022 at 05:32:57PM +0000, Colton Lewis wrote:
> Collect memory access latency measured in clock cycles.
> 
> This introduces a dependency on the timers for ARM and x86. No other
> architectures are implemented and their samples will all be 0.
> 
> Because keeping all samples is impractical due to the space required
> in some cases (pooled memory w/ 64 vcpus would be 64 GB/vcpu * 64
> vcpus * 250,000 samples/GB * 8 bytes/sample ~ 8 Gb extra memory just
> for samples), resevior sampling is used to only keep a small number of

nit: reservoir

> samples per vcpu (1000 samples in this patch).

Didn't see this before my previous comment, but I guess it still
applies: isn't it possible to know the number of events to store, to
avoid the "100" obtained via trial and error?

> 
> Resevoir sampling means despite keeping only a small number of
> samples, each sample has an equal chance of making it to the
> resevoir. Simple proofs of this can be found online. This makes the
> resevoir a good representation of the distribution of samples and

reservoir

> enables calculation of reasonably accurate percentiles.
> 
> All samples are stored in a statically allocated flat array for ease
> of combining them later. Samples are stored at an offset in this array
> calculated by the vcpu index (so vcpu 5 sample 10 would be stored at
> address sample_times + 5 * vcpu_idx + 10).
> 
> Signed-off-by: Colton Lewis <coltonlewis@google.com>
> ---
>  .../selftests/kvm/lib/perf_test_util.c        | 34 +++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> index a48904b64e19..0311da76bae0 100644
> --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> @@ -4,6 +4,9 @@
>   */
>  #include <inttypes.h>
>  
> +#if defined(__aarch64__)
> +#include "aarch64/arch_timer.h"
> +#endif
>  #include "kvm_util.h"
>  #include "perf_test_util.h"
>  #include "processor.h"
> @@ -44,6 +47,18 @@ static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>  /* Store all samples in a flat array so they can be easily sorted later. */
>  uint64_t latency_samples[SAMPLE_CAPACITY];
>  
> +static uint64_t perf_test_timer_read(void)
> +{
> +#if defined(__aarch64__)
> +	return timer_get_cntct(VIRTUAL);
> +#elif defined(__x86_64__)
> +	return rdtsc();
> +#else
> +#warn __func__ " is not implemented for this architecture, will return 0"
> +	return 0;
> +#endif
> +}
> +
>  /*
>   * Continuously write to the first 8 bytes of each page in the
>   * specified region.
> @@ -59,6 +74,10 @@ void perf_test_guest_code(uint32_t vcpu_idx)
>  	int i;
>  	struct guest_random_state rand_state =
>  		new_guest_random_state(pta->random_seed + vcpu_idx);
> +	uint64_t *latency_samples_offset = latency_samples + SAMPLES_PER_VCPU * vcpu_idx;
> +	uint64_t count_before;
> +	uint64_t count_after;
> +	uint32_t maybe_sample;
>  
>  	gva = vcpu_args->gva;
>  	pages = vcpu_args->pages;
> @@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx)
>  
>  			addr = gva + (page * pta->guest_page_size);
>  
> -			if (guest_random_u32(&rand_state) % 100 < pta->write_percent)
> +			if (guest_random_u32(&rand_state) % 100 < pta->write_percent) {
> +				count_before = perf_test_timer_read();
>  				*(uint64_t *)addr = 0x0123456789ABCDEF;
> -			else
> +				count_after = perf_test_timer_read();
> +			} else {
> +				count_before = perf_test_timer_read();
>  				READ_ONCE(*(uint64_t *)addr);
> +				count_after = perf_test_timer_read();

"count_before ... ACCESS count_after" could be moved to some macro,
e.g.,:
	t = MEASURE(READ_ONCE(*(uint64_t *)addr));
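
For what it's worth, one possible shape for such a macro, sketched
with a stub counter standing in for the guest timer read (statement
expressions are a GCC extension the selftests already rely on):

```c
#include <stdint.h>

/* Stub counter for illustration only; the guest would read its arch timer. */
static uint64_t fake_ticks;
static uint64_t perf_test_timer_read(void)
{
	return fake_ticks += 10;
}

/*
 * Time one expression in timer ticks: read the counter, evaluate the
 * expression exactly once, read the counter again, yield the difference.
 */
#define MEASURE(expr)					\
({							\
	uint64_t _before = perf_test_timer_read();	\
	expr;						\
	perf_test_timer_read() - _before;		\
})
```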

> +			}
> +
> +			maybe_sample = guest_random_u32(&rand_state) % (i + 1);
> +			if (i < SAMPLES_PER_VCPU)
> +				latency_samples_offset[i] = count_after - count_before;
> +			else if (maybe_sample < SAMPLES_PER_VCPU)
> +				latency_samples_offset[maybe_sample] = count_after - count_before;

I would prefer these reservoir sampling details to be in a helper, 
e.g.,:
	reservoir_sample_record(t, i);

>  		}
>  
>  		GUEST_SYNC(1);
> -- 
> 2.38.1.431.g37b22c650d-goog
> 


* Re: [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution
  2022-11-15 17:32 ` [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis
@ 2023-01-17 20:45   ` Ricardo Koller
  2023-01-26 17:58     ` Colton Lewis
  2023-01-18 16:43   ` Sean Christopherson
  1 sibling, 1 reply; 19+ messages in thread
From: Ricardo Koller @ 2023-01-17 20:45 UTC (permalink / raw)
  To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton

On Tue, Nov 15, 2022 at 05:32:58PM +0000, Colton Lewis wrote:
> Print summary stats of the memory latency distribution in
> nanoseconds. For every iteration, this prints the minimum, the
> maximum, and the 50th, 90th, and 99th percentiles.
> 
> Stats are calculated by sorting the samples taken from all vcpus and
> picking from the index corresponding with each percentile.
> 
> The conversion to nanoseconds needs the frequency of the Intel
> timestamp counter, which is estimated by reading the counter before
> and after sleeping for 1 second. This is not a pretty trick, but it
> also exists in vmx_nested_tsc_scaling_test.c
> 
> Signed-off-by: Colton Lewis <coltonlewis@google.com>
> ---
>  .../selftests/kvm/dirty_log_perf_test.c       |  2 +
>  .../selftests/kvm/include/perf_test_util.h    |  2 +
>  .../selftests/kvm/lib/perf_test_util.c        | 62 +++++++++++++++++++
>  3 files changed, 66 insertions(+)
> 
> diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> index 202f38a72851..2bc066bba460 100644
> --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> @@ -274,6 +274,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>  	ts_diff = timespec_elapsed(start);
>  	pr_info("Populate memory time: %ld.%.9lds\n",
>  		ts_diff.tv_sec, ts_diff.tv_nsec);
> +	perf_test_print_percentiles(vm, nr_vcpus);
>  
>  	/* Enable dirty logging */
>  	clock_gettime(CLOCK_MONOTONIC, &start);
> @@ -304,6 +305,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>  		vcpu_dirty_total = timespec_add(vcpu_dirty_total, ts_diff);
>  		pr_info("Iteration %d dirty memory time: %ld.%.9lds\n",
>  			iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
> +		perf_test_print_percentiles(vm, nr_vcpus);
>  
>  		clock_gettime(CLOCK_MONOTONIC, &start);
>  		get_dirty_log(vm, bitmaps, p->slots);
> diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
> index 3d0b75ea866a..ca378c262f12 100644
> --- a/tools/testing/selftests/kvm/include/perf_test_util.h
> +++ b/tools/testing/selftests/kvm/include/perf_test_util.h
> @@ -47,6 +47,8 @@ struct perf_test_args {
>  
>  extern struct perf_test_args perf_test_args;
>  
> +void perf_test_print_percentiles(struct kvm_vm *vm, int nr_vcpus);
> +
>  struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus,
>  				   uint64_t vcpu_memory_bytes, int slots,
>  				   enum vm_mem_backing_src_type backing_src,
> diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> index 0311da76bae0..927d22421f7c 100644
> --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> @@ -115,6 +115,68 @@ void perf_test_guest_code(uint32_t vcpu_idx)
>  	}
>  }
>  
> +#if defined(__x86_64__)
> +/* This could be determined with the right sequence of cpuid
> + * instructions, but that's oddly complicated.
> + */
> +static uint64_t perf_test_intel_timer_frequency(void)
> +{
> +	uint64_t count_before;
> +	uint64_t count_after;
> +	uint64_t measured_freq;
> +	uint64_t adjusted_freq;
> +
> +	count_before = perf_test_timer_read();
> +	sleep(1);
> +	count_after = perf_test_timer_read();
> +
> +	/* Using 1 second implies our units are in Hz already. */
> +	measured_freq = count_after - count_before;
> +	/* Truncate to the nearest MHz. Clock frequencies are round numbers. */
> +	adjusted_freq = measured_freq / 1000000 * 1000000;
> +
> +	return adjusted_freq;
> +}
> +#endif
> +
> +static double perf_test_cycles_to_ns(double cycles)
> +{
> +#if defined(__aarch64__)
> +	return cycles * (1e9 / timer_get_cntfrq());
> +#elif defined(__x86_64__)
> +	static uint64_t timer_frequency;
> +
> +	if (timer_frequency == 0)
> +		timer_frequency = perf_test_intel_timer_frequency();
> +
> +	return cycles * (1e9 / timer_frequency);
> +#else
> +#warn __func__ " is not implemented for this architecture, will return 0"
> +	return 0.0;
> +#endif
> +}
> +
> +/* compare function for qsort */
> +static int perf_test_qcmp(const void *a, const void *b)
> +{
> +	return *(int *)a - *(int *)b;
> +}
> +
> +void perf_test_print_percentiles(struct kvm_vm *vm, int nr_vcpus)
> +{
> +	uint64_t n_samples = nr_vcpus * SAMPLES_PER_VCPU;
> +
> +	sync_global_from_guest(vm, latency_samples);
> +	qsort(latency_samples, n_samples, sizeof(uint64_t), &perf_test_qcmp);
> +
> +	pr_info("Latency distribution (ns) = min:%6.0lf, 50th:%6.0lf, 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n",
> +		perf_test_cycles_to_ns((double)latency_samples[0]),
> +		perf_test_cycles_to_ns((double)latency_samples[n_samples / 2]),
> +		perf_test_cycles_to_ns((double)latency_samples[n_samples * 9 / 10]),
> +		perf_test_cycles_to_ns((double)latency_samples[n_samples * 99 / 100]),
> +		perf_test_cycles_to_ns((double)latency_samples[n_samples - 1]));
> +}

Latency distribution (ns) = min:   732, 50th:   792, 90th:   901, 99th:
                                ^^^
nit: would prefer to avoid the spaces 

> +
>  void perf_test_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
>  			   struct kvm_vcpu *vcpus[],
>  			   uint64_t vcpu_memory_bytes,
> -- 
> 2.38.1.431.g37b22c650d-goog
> 


* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples
  2022-11-15 17:32 ` [PATCH 2/3] KVM: selftests: Collect memory access " Colton Lewis
  2023-01-17 20:43   ` Ricardo Koller
@ 2023-01-17 20:48   ` Ricardo Koller
  2023-01-26 17:59     ` Colton Lewis
  1 sibling, 1 reply; 19+ messages in thread
From: Ricardo Koller @ 2023-01-17 20:48 UTC (permalink / raw)
  To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton

On Tue, Nov 15, 2022 at 05:32:57PM +0000, Colton Lewis wrote:
> Collect memory access latency measured in clock cycles.
> 
> This introduces a dependency on the timers for ARM and x86. No other
> architectures are implemented and their samples will all be 0.
> 
> Because keeping all samples is impractical due to the space required
> in some cases (pooled memory w/ 64 vcpus would be 64 GB/vcpu * 64
> vcpus * 250,000 samples/GB * 8 bytes/sample ~ 8 Gb extra memory just
> for samples), resevior sampling is used to only keep a small number of
> samples per vcpu (1000 samples in this patch).
> 
> Resevoir sampling means despite keeping only a small number of

Could you add a reference? I'm trying to understand the algorithm, but
I'm not even sure which version is implemented ("Optimal: Algorithm
L"?).

> samples, each sample has an equal chance of making it to the
> resevoir. Simple proofs of this can be found online. This makes the
> resevoir a good representation of the distribution of samples and
> enables calculation of reasonably accurate percentiles.
> 
> All samples are stored in a statically allocated flat array for ease
> of combining them later. Samples are stored at an offset in this array
> calculated by the vcpu index (so vcpu 5 sample 10 would be stored at
> address sample_times + 5 * vcpu_idx + 10).
> 
> Signed-off-by: Colton Lewis <coltonlewis@google.com>
> ---
>  .../selftests/kvm/lib/perf_test_util.c        | 34 +++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> index a48904b64e19..0311da76bae0 100644
> --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> @@ -4,6 +4,9 @@
>   */
>  #include <inttypes.h>
>  
> +#if defined(__aarch64__)
> +#include "aarch64/arch_timer.h"
> +#endif
>  #include "kvm_util.h"
>  #include "perf_test_util.h"
>  #include "processor.h"
> @@ -44,6 +47,18 @@ static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>  /* Store all samples in a flat array so they can be easily sorted later. */
>  uint64_t latency_samples[SAMPLE_CAPACITY];
>  
> +static uint64_t perf_test_timer_read(void)
> +{
> +#if defined(__aarch64__)
> +	return timer_get_cntct(VIRTUAL);
> +#elif defined(__x86_64__)
> +	return rdtsc();
> +#else
> +#warn __func__ " is not implemented for this architecture, will return 0"
> +	return 0;
> +#endif
> +}
> +
>  /*
>   * Continuously write to the first 8 bytes of each page in the
>   * specified region.
> @@ -59,6 +74,10 @@ void perf_test_guest_code(uint32_t vcpu_idx)
>  	int i;
>  	struct guest_random_state rand_state =
>  		new_guest_random_state(pta->random_seed + vcpu_idx);
> +	uint64_t *latency_samples_offset = latency_samples + SAMPLES_PER_VCPU * vcpu_idx;
> +	uint64_t count_before;
> +	uint64_t count_after;
> +	uint32_t maybe_sample;
>  
>  	gva = vcpu_args->gva;
>  	pages = vcpu_args->pages;
> @@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx)
>  
>  			addr = gva + (page * pta->guest_page_size);
>  
> -			if (guest_random_u32(&rand_state) % 100 < pta->write_percent)
> +			if (guest_random_u32(&rand_state) % 100 < pta->write_percent) {
> +				count_before = perf_test_timer_read();
>  				*(uint64_t *)addr = 0x0123456789ABCDEF;
> -			else
> +				count_after = perf_test_timer_read();
> +			} else {
> +				count_before = perf_test_timer_read();
>  				READ_ONCE(*(uint64_t *)addr);
> +				count_after = perf_test_timer_read();
> +			}
> +
> +			maybe_sample = guest_random_u32(&rand_state) % (i + 1);
> +			if (i < SAMPLES_PER_VCPU)
> +				latency_samples_offset[i] = count_after - count_before;
> +			else if (maybe_sample < SAMPLES_PER_VCPU)
> +				latency_samples_offset[maybe_sample] = count_after - count_before;
>  		}
>  
>  		GUEST_SYNC(1);
> -- 
> 2.38.1.431.g37b22c650d-goog
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples
  2023-01-17 20:43   ` Ricardo Koller
@ 2023-01-18 16:32     ` Sean Christopherson
  2023-01-26 18:00       ` Colton Lewis
  2023-01-26 17:58     ` Colton Lewis
  1 sibling, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2023-01-18 16:32 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: Colton Lewis, kvm, pbonzini, maz, dmatlack, bgardon, oupton

On Tue, Jan 17, 2023, Ricardo Koller wrote:
> On Tue, Nov 15, 2022 at 05:32:57PM +0000, Colton Lewis wrote:
> > @@ -44,6 +47,18 @@ static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
> >  /* Store all samples in a flat array so they can be easily sorted later. */
> >  uint64_t latency_samples[SAMPLE_CAPACITY];
> >  
> > +static uint64_t perf_test_timer_read(void)
> > +{
> > +#if defined(__aarch64__)
> > +	return timer_get_cntct(VIRTUAL);
> > +#elif defined(__x86_64__)
> > +	return rdtsc();
> > +#else
> > +#warn __func__ " is not implemented for this architecture, will return 0"
> > +	return 0;
> > +#endif
> > +}

I would prefer to put the guest-side timer helpers into common code, e.g. as
guest_read_system_counter(), replacing system_counter_offset_test.c's one-off
version.

> >  /*
> >   * Continuously write to the first 8 bytes of each page in the
> >   * specified region.
> > @@ -59,6 +74,10 @@ void perf_test_guest_code(uint32_t vcpu_idx)
> >  	int i;
> >  	struct guest_random_state rand_state =
> >  		new_guest_random_state(pta->random_seed + vcpu_idx);
> > +	uint64_t *latency_samples_offset = latency_samples + SAMPLES_PER_VCPU * vcpu_idx;

"offset" is confusing because the system counter (TSC in x86) has an offset for
the guest-perceived value.  Maybe just "latencies"?

> > +	uint64_t count_before;
> > +	uint64_t count_after;

Maybe s/count/time?  Yeah, it's technically wrong to call it "time", but "count"
is too generic.

> > +	uint32_t maybe_sample;
> >  
> >  	gva = vcpu_args->gva;
> >  	pages = vcpu_args->pages;
> > @@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx)
> >  
> >  			addr = gva + (page * pta->guest_page_size);
> >  
> > -			if (guest_random_u32(&rand_state) % 100 < pta->write_percent)
> > +			if (guest_random_u32(&rand_state) % 100 < pta->write_percent) {
> > +				count_before = perf_test_timer_read();
> >  				*(uint64_t *)addr = 0x0123456789ABCDEF;
> > -			else
> > +				count_after = perf_test_timer_read();
> > +			} else {
> > +				count_before = perf_test_timer_read();
> >  				READ_ONCE(*(uint64_t *)addr);
> > +				count_after = perf_test_timer_read();
> 
> "count_before ... ACCESS count_after" could be moved to some macro,
> e.g.,:
> 	t = MEASURE(READ_ONCE(*(uint64_t *)addr));

Even better, capture the read vs. write in a local variable to self-document the
use of the RNG, then the motivation for reading the system counter inside the
if/else-statements goes away.  That way we don't need to come up with a name
that documents what MEASURE() measures.

			write = guest_random_u32(&rand_state) % 100 < args->write_percent;

			time_before = guest_system_counter_read();
			if (write)
				*(uint64_t *)addr = 0x0123456789ABCDEF;
			else
				READ_ONCE(*(uint64_t *)addr);
			time_after = guest_system_counter_read();

> > +			}
> > +
> > +			maybe_sample = guest_random_u32(&rand_state) % (i + 1);

No need to generate a random number for iterations that always sample.  And I
think it will make the code easier to follow if there is a single write to the
array.  The derivation of the index is what's interesting and different, we should
use code to highlight that.

			/*
			 * Always sample early iterations to ensure at least the
			 * number of requested samples is collected.  Once the
			 * array has been filled, <here is a comment from Colton
			 * briefly explaining the math>.
			 */
			if (i < SAMPLES_PER_VCPU)
				idx = i;
			else
				idx = guest_random_u32(&rand_state) % (i + 1);

			if (idx < SAMPLES_PER_VCPU)
				latencies[idx] = time_after - time_before;

> > +			if (i < SAMPLES_PER_VCPU)

Would it make sense to let the user configure the number of samples?  Seems easy
enough and would let the user ultimately decide how much memory to burn on samples.

> > +				latency_samples_offset[i] = count_after - count_before;
> > +			else if (maybe_sample < SAMPLES_PER_VCPU)
> > +				latency_samples_offset[maybe_sample] = count_after - count_before;
> 
> I would prefer these reservoir sampling details to be in a helper, 
> e.g.,:
> 	reservoir_sample_record(t, i);

Heh, I vote to open code the behavior.  I dislike fancy names that hide relatively
simple logic.  IMO, readers won't care how the modulo math provides even distribution,
just that it does and that the early iterations always sample to ensure every
bucket is filled.

In this case, I can pretty much guarantee that I'd end up spending more time
digging into what "reservoir" means than I would to understand the basic flow.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution
  2022-11-15 17:32 ` [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis
  2023-01-17 20:45   ` Ricardo Koller
@ 2023-01-18 16:43   ` Sean Christopherson
  2023-01-26 17:59     ` Colton Lewis
  1 sibling, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2023-01-18 16:43 UTC (permalink / raw)
  To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, bgardon, oupton, ricarkol

On Tue, Nov 15, 2022, Colton Lewis wrote:
> Print summary stats of the memory latency distribution in
> nanoseconds. For every iteration, this prints the minimum, the
> maximum, and the 50th, 90th, and 99th percentiles.
> 
> Stats are calculated by sorting the samples taken from all vcpus and
> picking from the index corresponding with each percentile.
> 
> The conversion to nanoseconds needs the frequency of the Intel
> timestamp counter, which is estimated by reading the counter before
> and after sleeping for 1 second. This is not a pretty trick, but it
> also exists in vmx_nested_tsc_scaling_test.c

This test shouldn't need to guesstimate the frequency, just use a VM-scoped
KVM_GET_TSC_KHZ, which will provide KVM's default TSC frequency, i.e. the host
frequency.  For hardware with a constant TSC, which is everything modern, that
will be as accurate as we can get.  For hardware without a constant TSC, well, buy
new hardware :-)

vmx_nested_tsc_scaling_test.c does the weird sleep() behavior because it's trying
to validate that the guest TSC counts at the correct rate, i.e. it is validating
KVM_GET_TSC_KHZ to some extent, and so obviously doesn't fully trust its result.
   
For tests that just want to measure time, there's no reason not to trust KVM.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples
  2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis
  2023-01-17 20:32   ` Ricardo Koller
@ 2023-01-18 16:49   ` Sean Christopherson
  2023-01-26 18:00     ` Colton Lewis
  1 sibling, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2023-01-18 16:49 UTC (permalink / raw)
  To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, bgardon, oupton, ricarkol

On Tue, Nov 15, 2022, Colton Lewis wrote:
> Allocate additional space for latency samples. This has been separated
> out to call attention to the additional VM memory allocation.

A blurb in the changelog is sufficient, no need to split allocation and use into two
patches.  I would actually collapse all three into one.  The changes aren't so big
that errors will be difficult to bisect, and without the final printing, the other
changes are useless for all intents and purposes, i.e. if for some reason we want
to revert the sampling, it will be all or nothing.

I do think it makes sense to separate the system counter stuff to a separate
patch (and land it in generic code), e.g. add helpers to read the system counter
from the guest and convert the result to nanoseconds in a separate patch.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution
  2023-01-17 20:45   ` Ricardo Koller
@ 2023-01-26 17:58     ` Colton Lewis
  0 siblings, 0 replies; 19+ messages in thread
From: Colton Lewis @ 2023-01-26 17:58 UTC (permalink / raw)
  To: Ricardo Koller; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton

Ricardo Koller <ricarkol@google.com> writes:

> Latency distribution (ns) = min:   732, 50th:   792, 90th:   901, 99th:
>                                  ^^^
> nit: would prefer to avoid the spaces

The spaces are there so each number has a fixed width, which makes the
lines easy to compare visually because the numbers align to the same
columns. No way to avoid some extra spaces and preserve this
property. What if a number is a different length than expected?  I chose
a uniform width I was confident would always fit a measurement of the
max.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples
  2023-01-17 20:43   ` Ricardo Koller
  2023-01-18 16:32     ` Sean Christopherson
@ 2023-01-26 17:58     ` Colton Lewis
  2023-01-26 18:30       ` Ricardo Koller
  1 sibling, 1 reply; 19+ messages in thread
From: Colton Lewis @ 2023-01-26 17:58 UTC (permalink / raw)
  To: Ricardo Koller; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton

Ricardo Koller <ricarkol@google.com> writes:


> nit: reservoir


Will fix


> Didn't see this before my previous comment. But, I guess it still
> applies: isn't it possible to know the number of events to store?  to
> avoid the "100" obtained via trial and error.


That's what I thought I was calculating with sample_pages =
sizeof(latency_samples) / pta->guest_page_size. Both values are in bytes
so that should give the number of guest pages I need to allocate to hold
latency_samples. The 100 is a fudge factor added to the calculation
after encountering crashes. Any idea why the math above is incorrect?


> reservoir


Will fix.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples
  2023-01-17 20:48   ` Ricardo Koller
@ 2023-01-26 17:59     ` Colton Lewis
  0 siblings, 0 replies; 19+ messages in thread
From: Colton Lewis @ 2023-01-26 17:59 UTC (permalink / raw)
  To: Ricardo Koller; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton

Ricardo Koller <ricarkol@google.com> writes:


>> Resevoir sampling means despite keeping only a small number of

> Could you add a reference? I'm trying to understand the algorithm, but
> I'm not even sure what's the version implemented ("Optimal: Algorithm
> L"?).


I can add an external link to Wikipedia. That was the source I used. The
version implemented here is the simple version that Wikipedia names
"Algorithm R".

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution
  2023-01-18 16:43   ` Sean Christopherson
@ 2023-01-26 17:59     ` Colton Lewis
  0 siblings, 0 replies; 19+ messages in thread
From: Colton Lewis @ 2023-01-26 17:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, pbonzini, maz, dmatlack, bgardon, oupton, ricarkol

Sean Christopherson <seanjc@google.com> writes:

> This test shouldn't need to guesstimate the frequency, just use a  
> VM-scoped
> KVM_GET_TSC_KHZ, which will provide KVM's default TSC frequency, i.e. the  
> host
> frequency.  For hardware with a constant TSC, which is everything modern,  
> that
> will be as accurate as we can get.  For hardware without a constant TSC,  
> well, buy
> new hardware :-)


Thanks. That simplifies things. I encountered that number before but was
unsure if I could trust it for reasons I no longer remember.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples
  2023-01-18 16:49   ` Sean Christopherson
@ 2023-01-26 18:00     ` Colton Lewis
  0 siblings, 0 replies; 19+ messages in thread
From: Colton Lewis @ 2023-01-26 18:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, pbonzini, maz, dmatlack, bgardon, oupton, ricarkol

Sean Christopherson <seanjc@google.com> writes:

> On Tue, Nov 15, 2022, Colton Lewis wrote:
>> Allocate additional space for latency samples. This has been separated
>> out to call attention to the additional VM memory allocation.

> A blurb in the changelog is sufficient, no need to split allocation and  
> use into two
> patches.  I would actually collapse all three into one.  The changes  
> aren't so big
> that errors will be difficult to bisect, and without the final printing,  
> the other
> changes are useless for all intents and purposes, i.e. if for some reason  
> we want
> to revert the sampling, it will be all or nothing.

Will do. It's easier to merge commits than split them up.


> I do think it makes sense to separate the system counter stuff to a  
> separate
> patch (and land it in generic code), e.g. add helpers to read the system  
> counter
> from the guest and convert the result to nanoseconds in a separate
> patch.

Will do.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples
  2023-01-18 16:32     ` Sean Christopherson
@ 2023-01-26 18:00       ` Colton Lewis
  2023-01-26 19:07         ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Colton Lewis @ 2023-01-26 18:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: ricarkol, kvm, pbonzini, maz, dmatlack, bgardon, oupton

Sean Christopherson <seanjc@google.com> writes:
> On Tue, Jan 17, 2023, Ricardo Koller wrote:
>> On Tue, Nov 15, 2022 at 05:32:57PM +0000, Colton Lewis wrote:
>> > @@ -44,6 +47,18 @@ static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>> >  /* Store all samples in a flat array so they can be easily sorted  
>> later. */
>> >  uint64_t latency_samples[SAMPLE_CAPACITY];
>> >
>> > +static uint64_t perf_test_timer_read(void)
>> > +{
>> > +#if defined(__aarch64__)
>> > +	return timer_get_cntct(VIRTUAL);
>> > +#elif defined(__x86_64__)
>> > +	return rdtsc();
>> > +#else
>> > +#warn __func__ " is not implemented for this architecture, will  
>> return 0"
>> > +	return 0;
>> > +#endif
>> > +}

> I would prefer to put the guest-side timer helpers into common code, e.g.  
> as
> guest_read_system_counter(), replacing system_counter_offset_test.c's  
> one-off
> version.

Will do.

>> >  /*
>> >   * Continuously write to the first 8 bytes of each page in the
>> >   * specified region.
>> > @@ -59,6 +74,10 @@ void perf_test_guest_code(uint32_t vcpu_idx)
>> >  	int i;
>> >  	struct guest_random_state rand_state =
>> >  		new_guest_random_state(pta->random_seed + vcpu_idx);
>> > +	uint64_t *latency_samples_offset = latency_samples +  
>> SAMPLES_PER_VCPU * vcpu_idx;

> "offset" is confusing because the system counter (TSC in x86) has an  
> offset for
> the guest-perceived value.  Maybe just "latencies"?

Will do.

>> > +	uint64_t count_before;
>> > +	uint64_t count_after;

> Maybe s/count/time?  Yeah, it's technically wrong to call it "time",  
> but "count"
> is too generic.

I could say "cycles".

>> > +	uint32_t maybe_sample;
>> >
>> >  	gva = vcpu_args->gva;
>> >  	pages = vcpu_args->pages;
>> > @@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx)
>> >
>> >  			addr = gva + (page * pta->guest_page_size);
>> >
>> > -			if (guest_random_u32(&rand_state) % 100 < pta->write_percent)
>> > +			if (guest_random_u32(&rand_state) % 100 < pta->write_percent) {
>> > +				count_before = perf_test_timer_read();
>> >  				*(uint64_t *)addr = 0x0123456789ABCDEF;
>> > -			else
>> > +				count_after = perf_test_timer_read();
>> > +			} else {
>> > +				count_before = perf_test_timer_read();
>> >  				READ_ONCE(*(uint64_t *)addr);
>> > +				count_after = perf_test_timer_read();

>> "count_before ... ACCESS count_after" could be moved to some macro,
>> e.g.,:
>> 	t = MEASURE(READ_ONCE(*(uint64_t *)addr));

> Even better, capture the read vs. write in a local variable to  
> self-document the
> use of the RNG, then the motivation for reading the system counter inside  
> the
> if/else-statements goes away.  That way we don't need to come up with a  
> name
> that documents what MEASURE() measures.

> 			write = guest_random_u32(&rand_state) % 100 < args->write_percent;

> 			time_before = guest_system_counter_read();
> 			if (write)
> 				*(uint64_t *)addr = 0x0123456789ABCDEF;
> 			else
> 				READ_ONCE(*(uint64_t *)addr);
> 			time_after = guest_system_counter_read();

Couldn't timing before and after the if statement produce bad
measurements? We might be including a branch mispredict in our memory
access latency and this could happen a lot because it's random so no way
for the CPU to predict.

>> > +			}
>> > +
>> > +			maybe_sample = guest_random_u32(&rand_state) % (i + 1);

> No need to generate a random number for iterations that always sample.   
> And I
> think it will make the code easier to follow if there is a single write  
> to the
> array.  The derivation of the index is what's interesting and different,  
> we should
> use code to highlight that.

> 			/*
> 			 * Always sample early iterations to ensure at least the
> 			 * number of requested samples is collected.  Once the
> 			 * array has been filled, <here is a comment from Colton
> 			 * briefly explaining the math>.
> 			 */
> 			if (i < SAMPLES_PER_VCPU)
> 				idx = i;
> 			else
> 				idx = guest_random_u32(&rand_state) % (i + 1);

> 			if (idx < SAMPLES_PER_VCPU)
> 				latencies[idx] = time_after - time_before;

Will do.

>> > +			if (i < SAMPLES_PER_VCPU)

> Would it make sense to let the user configure the number of samples?   
> Seems easy
> enough and would let the user ultimately decide how much memory to burn  
> on samples.

Theoretically users may wish to tweak the accuracy vs memory use
tradeoff. Seemed like a shaky value proposition to me because of
diminishing returns to increased accuracy, but I will include an option
if you insist.

>> > +				latency_samples_offset[i] = count_after - count_before;
>> > +			else if (maybe_sample < SAMPLES_PER_VCPU)
>> > +				latency_samples_offset[maybe_sample] = count_after - count_before;

>> I would prefer these reservoir sampling details to be in a helper,
>> e.g.,:
>> 	reservoir_sample_record(t, i);

> Heh, I vote to open code the behavior.  I dislike fancy names that hide  
> relatively
> simple logic.  IMO, readers won't care how the modulo math provides even  
> distribution,
> just that it does and that the early iterations always sample to ensure
> every bucket
> is filled.

> In this case, I can pretty much guarantee that I'd end up spending more  
> time
> digging into what "reservoir" means than I would to understand the basic  
> flow.

I agree. The logic is simple enough that more names would confuse readers
as to what's really happening.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples
  2023-01-26 17:58     ` Colton Lewis
@ 2023-01-26 18:30       ` Ricardo Koller
  0 siblings, 0 replies; 19+ messages in thread
From: Ricardo Koller @ 2023-01-26 18:30 UTC (permalink / raw)
  To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton

On Thu, Jan 26, 2023 at 9:58 AM Colton Lewis <coltonlewis@google.com> wrote:
>
> Ricardo Koller <ricarkol@google.com> writes:
>
>
> > nit: reservoir
>
>
> Will fix
>
>
> > Didn't see this before my previous comment. But, I guess it still
> > applies: isn't it possible to know the number of events to store?  to
> > avoid the "100" obtained via trial and error.
>
>
> That's what I thought I was calculating with sample_pages =
> sizeof(latency_samples) / pta->guest_page_size. Both values are in bytes
> so that should give the number of guest pages I need to allocate to hold
> latency_samples. The 100 is a fudge factor added to the calculation
> after encountering crashes. Any idea why the math above is incorrect?
>

Mm, I don't know to be honest. But I do think it's required to calculate
the exact number of bytes needed. Otherwise, 100 might not be enough for
a bigger VM with more samples.

>
> > reservoir
>
>
> Will fix.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples
  2023-01-26 18:00       ` Colton Lewis
@ 2023-01-26 19:07         ` Sean Christopherson
  0 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2023-01-26 19:07 UTC (permalink / raw)
  To: Colton Lewis; +Cc: ricarkol, kvm, pbonzini, maz, dmatlack, bgardon, oupton

On Thu, Jan 26, 2023, Colton Lewis wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > Maybe s/count/time?  Yeah, it's technically wrong to call it "time", but
> > "count" is too generic.
> 
> I could say "cycles".

Works for me.

> > > > +	uint32_t maybe_sample;
> > > >
> > > >  	gva = vcpu_args->gva;
> > > >  	pages = vcpu_args->pages;
> > > > @@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx)
> > > >
> > > >  			addr = gva + (page * pta->guest_page_size);
> > > >
> > > > -			if (guest_random_u32(&rand_state) % 100 < pta->write_percent)
> > > > +			if (guest_random_u32(&rand_state) % 100 < pta->write_percent) {
> > > > +				count_before = perf_test_timer_read();
> > > >  				*(uint64_t *)addr = 0x0123456789ABCDEF;
> > > > -			else
> > > > +				count_after = perf_test_timer_read();
> > > > +			} else {
> > > > +				count_before = perf_test_timer_read();
> > > >  				READ_ONCE(*(uint64_t *)addr);
> > > > +				count_after = perf_test_timer_read();
> 
> > > "count_before ... ACCESS count_after" could be moved to some macro,
> > > e.g.,:
> > > 	t = MEASURE(READ_ONCE(*(uint64_t *)addr));
> 
> > Even better, capture the read vs. write in a local variable to
> > self-document the
> > use of the RNG, then the motivation for reading the system counter
> > inside the
> > if/else-statements goes away.  That way we don't need to come up with a
> > name
> > that documents what MEASURE() measures.
> 
> > 			write = guest_random_u32(&rand_state) % 100 < args->write_percent;
> 
> > 			time_before = guest_system_counter_read();
> > 			if (write)
> > 				*(uint64_t *)addr = 0x0123456789ABCDEF;
> > 			else
> > 				READ_ONCE(*(uint64_t *)addr);
> > 			time_after = guest_system_counter_read();
> 
> Couldn't timing before and after the if statement produce bad
> measurements? We might be including a branch mispredict in our memory
> access latency and this could happen a lot because it's random so no way
> for the CPU to predict.

Hmm, I was assuming the latency due to a (mispredicted) branch would be in the
noise compared to the latency of a VM-Exit needed to handle the fault.

On the other hand, adding a common macro would be trivial, it's only the naming
that's hard.  What if we keep with the "cycles" theme and do as Ricardo suggested?
E.g. minus backslashes, this doesn't look awful.

#define MEASURE_CYCLES(x)
({
	uint64_t start;

	start = guest_system_counter_read();
	x;
	guest_system_counter_read() - start;
})



> > > > +			if (i < SAMPLES_PER_VCPU)
> 
> > Would it make sense to let the user configure the number of samples?
> > Seems easy enough and would let the user ultimately decide how much memory
> > to burn on samples.
> 
> Theoretically users may wish to tweak the accuracy vs memory use
> tradeoff. Seemed like a shakey value proposition to me because of
> diminishing returns to increased accuracy, but I will include an option
> if you insist.

It's not so much that I expect users to want better accuracy, it's that I dislike
arbitrary magic numbers.  E.g. why 1000 and not 500 or 2000?  Especially since
the number of accesses _is_ controllable by the user.  Hmm, the other thing to
consider is that, as proposed, the vast majority of the capacity will go unused,
e.g. default is 4, max is 512 (and that should really be tied to KVM's max of 4096).

What if we invert the calculation and define the capacity in bytes, and derive
the number of samples based on capacity / nr_vcpus?  And then push the capacity
as high as possible, e.g. I assume there's a threshold where the test will fail
because of the selftests' poor page table management.  That way the sampling is as
efficient as possible, and the arbitrary capacity comes from limitations within
the framework, as opposed to a human defined magic number.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-01-26 19:08 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-15 17:32 [PATCH 0/3] Calculate memory access latency stats Colton Lewis
2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis
2023-01-17 20:32   ` Ricardo Koller
2023-01-18 16:49   ` Sean Christopherson
2023-01-26 18:00     ` Colton Lewis
2022-11-15 17:32 ` [PATCH 2/3] KVM: selftests: Collect memory access " Colton Lewis
2023-01-17 20:43   ` Ricardo Koller
2023-01-18 16:32     ` Sean Christopherson
2023-01-26 18:00       ` Colton Lewis
2023-01-26 19:07         ` Sean Christopherson
2023-01-26 17:58     ` Colton Lewis
2023-01-26 18:30       ` Ricardo Koller
2023-01-17 20:48   ` Ricardo Koller
2023-01-26 17:59     ` Colton Lewis
2022-11-15 17:32 ` [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis
2023-01-17 20:45   ` Ricardo Koller
2023-01-26 17:58     ` Colton Lewis
2023-01-18 16:43   ` Sean Christopherson
2023-01-26 17:59     ` Colton Lewis
