[kvm-unit-tests PATCH] Support micro operation measurement on arm64

All of lore.kernel.org
 help / color / mirror / Atom feed

* [kvm-unit-tests PATCH] Support micro operation measurement on arm64
@ 2017-12-15 21:15 Shih-Wei Li
  2017-12-15 21:15 ` [kvm-unit-tests PATCH] arm64: add micro test Shih-Wei Li
  2017-12-18 18:14 ` [kvm-unit-tests PATCH] Support micro operation measurement on arm64 Andrew Jones
  0 siblings, 2 replies; 27+ messages in thread
From: Shih-Wei Li @ 2017-12-15 21:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: christoffer.dall, marc.zyngier, drjones, pbonzini, Shih-Wei Li

The patch provides support for quantifying the cost of micro level
operations on arm64 hardware. The supported operations include hypercall,
mmio accesses, EOI virtual interrupt, and IPI send. Measurements are
currently obtained using timer counters. Further modifications in KVM
will be required to support timestamping using cycle counters, as KVM
now disables accesses to the PMU counters from the VM.

We iterate each of the tests for millions of times and output their
average, minimum and maximum cost in timer counts. Instruction barriers
were used before and after taking timestamps to avoid out-of-order
execution or pipelining from skewing our measurements.

To improve precision in the measurements, one should consider pinning
each VCPU to a specific physical CPU (PCPU) and ensure no other task
could run on that PCPU to skew the results. This can be achieved by
enabling QMP server in the QEMU command in unittest.cfg for micro test,
allowing a client program to get the thread_id for each VCPU thread
from the QMP server. Based on the information, the client program can
then pin the corresponding VCPUs to dedicated PCPUs and isolate
interrupts and tasks from those PCPUs.

The patch has been tested on arm64 hardware including AMD Seattle and
ThunderX2, which has GICv2 and GICv3 respectively.

Shih-Wei Li (1):
  arm64: add micro test

 arm/Makefile.common |   1 +
 arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 arm/unittests.cfg   |   6 ++
 3 files changed, 296 insertions(+)
 create mode 100644 arm/micro-test.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-15 21:15 [kvm-unit-tests PATCH] Support micro operation measurement on arm64 Shih-Wei Li
@ 2017-12-15 21:15 ` Shih-Wei Li
  2017-12-18 17:31   ` Yury Norov
                     ` (4 more replies)
  2017-12-18 18:14 ` [kvm-unit-tests PATCH] Support micro operation measurement on arm64 Andrew Jones
  1 sibling, 5 replies; 27+ messages in thread
From: Shih-Wei Li @ 2017-12-15 21:15 UTC (permalink / raw)
  To: kvm, kvmarm; +Cc: marc.zyngier, pbonzini, Shih-Wei Li

Here we provide the support for measuring various micro level
operations on arm64. We iterate each of the tests for millions of
times and output their average, minimum and maximum cost in timer
counts. Instruction barriers are used before and after taking
timestamps to avoid out-of-order execution or pipelining from
skewing our measurements.

The tests we currently support and measure are mostly
straightforward by the function names and the respective comments.
For IPI test, we measure the cost of sending IPI from a source
VCPU to a target VCPU, until the target VCPU receives the IPI.

Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
---
 arm/Makefile.common |   1 +
 arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 arm/unittests.cfg   |   6 ++
 3 files changed, 296 insertions(+)
 create mode 100644 arm/micro-test.c

diff --git a/arm/Makefile.common b/arm/Makefile.common
index 0a039cf..c7d5c27 100644
--- a/arm/Makefile.common
+++ b/arm/Makefile.common
@@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
 tests-common += $(TEST_DIR)/gic.flat
 tests-common += $(TEST_DIR)/psci.flat
 tests-common += $(TEST_DIR)/sieve.flat
+tests-common += $(TEST_DIR)/micro-test.flat
 
 tests-all = $(tests-common) $(tests)
 all: directories $(tests-all)
diff --git a/arm/micro-test.c b/arm/micro-test.c
new file mode 100644
index 0000000..7df2272
--- /dev/null
+++ b/arm/micro-test.c
@@ -0,0 +1,289 @@
+#include <util.h>
+#include <asm/gic.h>
+
+static volatile bool second_cpu_up;
+static volatile bool first_cpu_ack;
+static volatile bool ipi_acked;
+static volatile bool ipi_received;
+static volatile bool ipi_ready;
+#define IPI_IRQ		1
+
+#define TIMEOUT (1U << 28)
+
+#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
+#define for_each_test(_iter, _tests, _tmp) \
+	for (_tmp = 0, _iter = _tests; \
+			_tmp < ARR_SIZE(_tests); \
+			_tmp++, _iter++)
+
+#define CYCLE_COUNT(c1, c2) \
+	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
+
+#define IPI_DEBUG 0
+
+#if IPI_DEBUG == 1
+#define debug(fmt, ...) \
+	printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
+#else
+#define debug(fmt, ...) {}
+#endif
+
+static uint64_t read_cc(void)
+{
+	uint64_t cc;
+	asm volatile(
+		"isb\n"
+		"mrs %0, CNTPCT_EL0\n"
+		"isb\n"
+		: [reg] "=r" (cc)
+		::
+	);
+	return cc;
+}
+
+static void ipi_irq_handler(struct pt_regs *regs __unused)
+{
+	u32 ack;
+	ipi_ready = false;
+	ipi_received = true;
+	ack = gic_read_iar();
+	ipi_acked = true;
+	gic_write_eoir(ack);
+	ipi_ready = true;
+}
+
+static void ipi_test_secondary_entry(void)
+{
+	unsigned int timeout = TIMEOUT;
+
+	debug("secondary core up\n");
+
+	enum vector v = EL1H_IRQ;
+	install_irq_handler(v, ipi_irq_handler);
+
+	gic_enable_defaults();
+
+	second_cpu_up = true;
+
+	debug("secondary initialized vgic\n");
+
+	while (!first_cpu_ack && timeout--);
+	if (!first_cpu_ack) {
+		debug("ipi_test: First CPU did not ack wake-up\n");
+		exit(1);
+	}
+
+	debug("detected first cpu ack\n");
+
+	local_irq_enable(); /* Enter small wait-loop */
+	ipi_ready = true;
+	while (true);
+}
+
+static int test_init(void)
+{
+	int ret;
+	unsigned int timeout = TIMEOUT;
+
+	ret = gic_init();
+	if (!ret) {
+		debug("No supported gic present, skipping tests...\n");
+		goto out;
+	}
+
+	ipi_ready = false;
+
+	gic_enable_defaults();
+
+	debug("starting second CPU\n");
+	smp_boot_secondary(1, ipi_test_secondary_entry);
+
+	while (!second_cpu_up && timeout--); /* Wait for second CPU! */
+
+	if (!second_cpu_up) {
+		debug("ipi_test: timeout waiting for secondary CPU\n");
+		ret = 0;
+		goto out;
+	}
+
+	debug("detected secondary core up\n");
+
+	first_cpu_ack = true;
+
+	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
+
+out:
+	return ret;
+}
+
+static unsigned long ipi_test(void)
+{
+	unsigned int timeout = TIMEOUT;
+	unsigned long c1, c2;
+
+	while (!ipi_ready && timeout--);
+	if (!ipi_ready) {
+		debug("ipi_test: second core not ready for IPIs\n");
+		exit(1);
+	}
+
+	ipi_received = false;
+
+	c1 = read_cc();
+
+	gic_ipi_send_single(IPI_IRQ, 1);
+
+	timeout = TIMEOUT;
+	while (!ipi_received && timeout--);
+	if (!ipi_received) {
+		debug("ipi_test: secondary core never received ipi\n");
+		exit(1);
+	}
+
+	c2 = read_cc();
+	return CYCLE_COUNT(c1, c2);
+}
+
+
+static unsigned long hvc_test(void)
+{
+	unsigned long c1, c2;
+
+	c1 = read_cc();
+	asm volatile("mov w0, #0x4b000000; hvc #0");
+	c2 = read_cc();
+	return CYCLE_COUNT(c1, c2);
+}
+
+static void __noop(void)
+{
+}
+
+static unsigned long noop_guest(void)
+{
+	unsigned long c1, c2;
+
+	c1 = read_cc();
+	__noop();
+	c2 = read_cc();
+	return CYCLE_COUNT(c1, c2);
+}
+
+static unsigned long mmio_read_user(void)
+{
+	unsigned long c1, c2;
+	void *mmio_read_user_addr = (void*) 0x0a000008;
+
+	/* Measure MMIO exit to QEMU in userspace */
+	c1 = read_cc();
+	readl(mmio_read_user_addr);
+	c2 = read_cc();
+	return CYCLE_COUNT(c1, c2);
+}
+
+static unsigned long mmio_read_vgic(void)
+{
+	unsigned long c1, c2;
+	int v = gic_version();
+	void *vgic_dist_addr = NULL;
+
+	if (v == 2)
+		vgic_dist_addr = gicv2_dist_base();
+	else if (v == 3)
+		vgic_dist_addr = gicv3_dist_base();
+
+	/* Measure MMIO exit to host kernel */
+	c1 = read_cc();
+	readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
+	c2 = read_cc();
+	return CYCLE_COUNT(c1, c2);
+}
+
+static unsigned long eoi_test(void)
+{
+	unsigned long c1, c2;
+	int v = gic_version();
+	void (*write_eoir)(u32 irqstat) = NULL;
+
+	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
+
+	if (v == 2)
+		write_eoir = gicv2_write_eoir;
+	else if (v == 3)
+		write_eoir = gicv3_write_eoir;
+
+	c1 = read_cc();
+	write_eoir(val);
+	c2 = read_cc();
+
+	return CYCLE_COUNT(c1, c2);
+}
+
+struct exit_test {
+	const char *name;
+	unsigned long (*test_fn)(void);
+	bool run;
+};
+
+static struct exit_test available_tests[] = {
+	{"hvc",                hvc_test,           true},
+	{"noop_guest",         noop_guest,         true},
+	{"mmio_read_user",     mmio_read_user,     true},
+	{"mmio_read_vgic",     mmio_read_vgic,     true},
+	{"eoi",                eoi_test,           true},
+	{"ipi",                ipi_test,           true},
+};
+
+static void loop_test(struct exit_test *test)
+{
+	unsigned long i, iterations = 32;
+	unsigned long sample, cycles;
+	unsigned long long min = 0, max = 0;
+	const unsigned long long goal = (1ULL << 29);
+
+	do {
+		iterations *= 2;
+		cycles = 0;
+		for (i = 0; i < iterations; i++) {
+			sample = test->test_fn();
+			if (sample == 0) {
+				/*
+				 * If something went wrong or we had an
+				 * overflow, don't count that sample.
+				 */
+				iterations--;
+				i--;
+				debug("cycle count overflow: %lu\n", sample);
+				continue;
+			}
+			cycles += sample;
+			if (min == 0 || min > sample)
+				min = sample;
+			if (max < sample)
+				max = sample;
+		}
+	} while (cycles < goal);
+	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
+		test->name, cycles / (iterations), min, max);
+}
+
+void kvm_unit_test(void)
+{
+	int i=0;
+	struct exit_test *test;
+	for_each_test(test, available_tests, i) {
+		if (!test->run)
+			continue;
+		loop_test(test);
+	}
+
+	return;
+}
+
+int main(int argc, char **argv)
+{
+	if (!test_init())
+		exit(1);
+	kvm_unit_test();
+	return 0;
+}
diff --git a/arm/unittests.cfg b/arm/unittests.cfg
index 44b98cf..1d0c4ca 100644
--- a/arm/unittests.cfg
+++ b/arm/unittests.cfg
@@ -116,3 +116,9 @@ file = timer.flat
 groups = timer
 timeout = 2s
 arch = arm64
+
+# Exit tests
+[micro-test]
+file = micro-test.flat
+smp = 2
+groups = micro-test
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-15 21:15 ` [kvm-unit-tests PATCH] arm64: add micro test Shih-Wei Li
@ 2017-12-18 17:31   ` Yury Norov
  2017-12-18 21:32     ` Shih-Wei Li
  2017-12-19  9:12     ` Christoffer Dall
  2017-12-18 19:10   ` Andrew Jones
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 27+ messages in thread
From: Yury Norov @ 2017-12-18 17:31 UTC (permalink / raw)
  To: Shih-Wei Li
  Cc: Goutham, Sunil, kvm, marc.zyngier, Cherian, Linu, pbonzini, kvmarm

Hi Shih-Wei,

(CC Cavium folks)

Thank you for the test! I was going to write something like
this, and instead found your patch submitted - good luck to me. 

Soon I'll test my hardware with it and share you results.
Comments inline.

Yury

On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests for millions of
> times and output their average, minimum and maximum cost in timer
> counts. Instruction barriers are used before and after taking
> timestamps to avoid out-of-order execution or pipelining from
> skewing our measurements.
> 
> The tests we currently support and measure are mostly
> straightforward by the function names and the respective comments.
> For IPI test, we measure the cost of sending IPI from a source
> VCPU to a target VCPU, until the target VCPU receives the IPI.
> 
> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> ---
>  arm/Makefile.common |   1 +
>  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  arm/unittests.cfg   |   6 ++
>  3 files changed, 296 insertions(+)
>  create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>  tests-common += $(TEST_DIR)/gic.flat
>  tests-common += $(TEST_DIR)/psci.flat
>  tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>  
>  tests-all = $(tests-common) $(tests)
>  all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..7df2272
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,289 @@
> +#include <util.h>
> +#include <asm/gic.h>
> +
> +static volatile bool second_cpu_up;
> +static volatile bool first_cpu_ack;
> +static volatile bool ipi_acked;
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +#define IPI_IRQ		1
> +
> +#define TIMEOUT (1U << 28)
> +
> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> +#define for_each_test(_iter, _tests, _tmp) \
> +	for (_tmp = 0, _iter = _tests; \
> +			_tmp < ARR_SIZE(_tests); \
> +			_tmp++, _iter++)
> +
> +#define CYCLE_COUNT(c1, c2) \
> +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))

Is my understanding correct that this is overflow protection?
c1 and c2 are 64-bit values. To overflow them you need 58 years
at 1G CPU freq.

And anyway,
#define CYCLE_COUNT(c1, c2)  ((c1) >= (c2) ? 0 : (c2) - (c1))

> +
> +#define IPI_DEBUG 0
> +

If you assume to pass IPI_DEBUG as GCC parameter, like -DIPI_DEBUG=1,
it should be 

#ifndef IPI_DEBUG
#define IPI_DEBUG 0
#endif

But I wonder if there's already existing switch for debug info?

> +#if IPI_DEBUG == 1
> +#define debug(fmt, ...) \
> +	printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
> +#else
> +#define debug(fmt, ...) {}
> +#endif

There are some fancy printing functions in lib/report.c. I didn't
look deeply into thet, but at first glance you can use it here.

> +
> +static uint64_t read_cc(void)
> +{
> +	uint64_t cc;
> +	asm volatile(
> +		"isb\n"
> +		"mrs %0, CNTPCT_EL0\n"
> +		"isb\n"
> +		: [reg] "=r" (cc)
> +		::
> +	);
> +	return cc;
> +}

Flushing pipeline before mrs is enough. Also, you can use read_sysreg()
instead of inline assembly. Refer arch_counter_get_cntpct() in kernel:

151 static inline u64 arch_counter_get_cntpct(void)
152 {
153         isb();   
154         return arch_timer_reg_read_stable(cntpct_el0);
155 }

> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	ipi_acked = true;
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +
> +	debug("secondary core up\n");
> +
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +
> +	second_cpu_up = true;
> +
> +	debug("secondary initialized vgic\n");
> +
> +	while (!first_cpu_ack && timeout--);
> +	if (!first_cpu_ack) {
> +		debug("ipi_test: First CPU did not ack wake-up\n");
> +		exit(1);
> +	}

Nit: here and later, timeout is not actually timeout, but something like
counter of attempts. Maybe it worth to use read_cc() here if you need
time intervals? 

> +
> +	debug("detected first cpu ack\n");
> +
> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);
> +}
> +
> +static int test_init(void)
> +{
> +	int ret;
> +	unsigned int timeout = TIMEOUT;
> +
> +	ret = gic_init();
> +	if (!ret) {
> +		debug("No supported gic present, skipping tests...\n");
> +		goto out;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +
> +	debug("starting second CPU\n");
> +	smp_boot_secondary(1, ipi_test_secondary_entry);
> +
> +	while (!second_cpu_up && timeout--); /* Wait for second CPU! */
> +
> +	if (!second_cpu_up) {
> +		debug("ipi_test: timeout waiting for secondary CPU\n");
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	debug("detected secondary core up\n");
> +
> +	first_cpu_ack = true;
> +
> +	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
> +
> +out:

Nit: it's simpler to return error at place and drop the 'out' label,
like this:
if (!gic_init()) {
        debug(...);
        return 0;
}

> +	return ret;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +	unsigned long c1, c2;

read_cc() returns uint64_t, so c1 and c2 should also be fixed-size
variables. Maybe one day we'll port test to arm32 where unsigned long is
32-bit...

> +
> +	while (!ipi_ready && timeout--);
> +	if (!ipi_ready) {
> +		debug("ipi_test: second core not ready for IPIs\n");
> +		exit(1);
> +	}
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	timeout = TIMEOUT;
> +	while (!ipi_received && timeout--);
> +	if (!ipi_received) {
> +		debug("ipi_test: secondary core never received ipi\n");
> +		exit(1);
> +	}
> +
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +

Nit: odd line.

> +static unsigned long hvc_test(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0");
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}

This will be broken if compiler decide to assign r0 for variable c1,
or I miss something?

> +
> +static void __noop(void)
> +{
> +}

This would be completely optimized out. Is it for demonstrative
purpose?

> +
> +static unsigned long noop_guest(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	__noop();
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	unsigned long c1, c2;
> +	void *mmio_read_user_addr = (void*) 0x0a000008;
> +
> +	/* Measure MMIO exit to QEMU in userspace */
> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void *vgic_dist_addr = NULL;
> +
> +	if (v == 2)
> +		vgic_dist_addr = gicv2_dist_base();
> +	else if (v == 3)
> +		vgic_dist_addr = gicv3_dist_base();
> +
> +	/* Measure MMIO exit to host kernel */
> +	c1 = read_cc();
> +	readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}

So if gic version is not 2 or 3, it will readl() at broken
address. I think it's better to return error and print relevant
message to system log.

> +
> +static unsigned long eoi_test(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void (*write_eoir)(u32 irqstat) = NULL;
> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	if (v == 2)
> +		write_eoir = gicv2_write_eoir;
> +	else if (v == 3)
> +		write_eoir = gicv3_write_eoir;
> +
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return CYCLE_COUNT(c1, c2);
> +}

Similar here.

> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test available_tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"noop_guest",         noop_guest,         true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long long min = 0, max = 0;
> +	const unsigned long long goal = (1ULL << 29);
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		for (i = 0; i < iterations; i++) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				/*
> +				 * If something went wrong or we had an
> +				 * overflow, don't count that sample.
> +				 */
> +				iterations--;
> +				i--;
> +				debug("cycle count overflow: %lu\n", sample);
> +				continue;
> +			}
> +			cycles += sample;
> +			if (min == 0 || min > sample)
> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +		}
> +	} while (cycles < goal);
> +	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> +		test->name, cycles / (iterations), min, max);
> +}
> +
> +void kvm_unit_test(void)
> +{
> +	int i=0;
> +	struct exit_test *test;
> +	for_each_test(test, available_tests, i) {
> +		if (!test->run)
> +			continue;
> +		loop_test(test);
> +	}
> +
> +	return;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	if (!test_init())
> +		exit(1);
> +	kvm_unit_test();
> +	return 0;
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..1d0c4ca 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,9 @@ file = timer.flat
>  groups = timer
>  timeout = 2s
>  arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = micro-test
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] Support micro operation measurement on arm64
  2017-12-15 21:15 [kvm-unit-tests PATCH] Support micro operation measurement on arm64 Shih-Wei Li
  2017-12-15 21:15 ` [kvm-unit-tests PATCH] arm64: add micro test Shih-Wei Li
@ 2017-12-18 18:14 ` Andrew Jones
  2017-12-18 20:58   ` Shih-Wei Li
  1 sibling, 1 reply; 27+ messages in thread
From: Andrew Jones @ 2017-12-18 18:14 UTC (permalink / raw)
  To: Shih-Wei Li; +Cc: marc.zyngier, kvmarm, kvm, pbonzini

Hi Shih-Wei,

Thanks for doing this! Porting Christoffer's selftests to kvm-unit-tests
has been on the kvm-unit-tests' TODO list since it was first introduced.

On Fri, Dec 15, 2017 at 04:15:38PM -0500, Shih-Wei Li wrote:
> The patch provides support for quantifying the cost of micro level
> operations on arm64 hardware. The supported operations include hypercall,
> mmio accesses, EOI virtual interrupt, and IPI send. Measurements are
> currently obtained using timer counters. Further modifications in KVM
> will be required to support timestamping using cycle counters, as KVM
> now disables accesses to the PMU counters from the VM.

KVM only disables access when userspace tells it to, which it doesn't
do by default. Is there something else missing keeping the PMU counters
from being used?

> 
> We iterate each of the tests for millions of times and output their
> average, minimum and maximum cost in timer counts. Instruction barriers

Can we reduce the number of iterations and still get valid results? The
test takes so long that of all the platforms I tested it on timed out
before it completed, except seattle. The default timeout for kvm-unit-
tests is 90 seconds. I'd rather a unit test execute in much shorter time
than that too, in order to keep people encouraged to run them frequently.
If these tests must run a long time, then I think we should add them to
the nodefault group.

> were used before and after taking timestamps to avoid out-of-order
> execution or pipelining from skewing our measurements.
> 
> To improve precision in the measurements, one should consider pinning
> each VCPU to a specific physical CPU (PCPU) and ensure no other task
> could run on that PCPU to skew the results. This can be achieved by
> enabling QMP server in the QEMU command in unittest.cfg for micro test,
> allowing a client program to get the thread_id for each VCPU thread
> from the QMP server. Based on the information, the client program can
> then pin the corresponding VCPUs to dedicated PCPUs and isolate
> interrupts and tasks from those PCPUs.

To isolate the CPUs one would need to boot the host with the isolcpus
kernel command line option. Pinning the VCPUs is pretty easy though,
so we could provide a script that does that in kvm-unit-tests and then
always use it for this test. The script could also warn if we're
pinning to CPUs that haven't been isolated.

> 
> The patch has been tested on arm64 hardware including AMD Seattle and
> ThunderX2, which has GICv2 and GICv3 respectively.

I tried thunderx2, amberwing, mustang, and seattle. Only seattle
completed, the rest timed out.

Thanks,
drew

> 
> Shih-Wei Li (1):
>   arm64: add micro test
> 
>  arm/Makefile.common |   1 +
>  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  arm/unittests.cfg   |   6 ++
>  3 files changed, 296 insertions(+)
>  create mode 100644 arm/micro-test.c
> 
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-15 21:15 ` [kvm-unit-tests PATCH] arm64: add micro test Shih-Wei Li
  2017-12-18 17:31   ` Yury Norov
@ 2017-12-18 19:10   ` Andrew Jones
  2017-12-18 21:58     ` Shih-Wei Li
  2017-12-20 17:00     ` Andrew Jones
  2017-12-18 22:17   ` Kalra, Ashish
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 27+ messages in thread
From: Andrew Jones @ 2017-12-18 19:10 UTC (permalink / raw)
  To: Shih-Wei Li; +Cc: kvm, kvmarm, christoffer.dall, marc.zyngier, pbonzini

On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests for millions of
> times and output their average, minimum and maximum cost in timer
> counts. Instruction barriers are used before and after taking
> timestamps to avoid out-of-order execution or pipelining from
> skewing our measurements.
> 
> The tests we currently support and measure are mostly
> straightforward by the function names and the respective comments.
> For IPI test, we measure the cost of sending IPI from a source
> VCPU to a target VCPU, until the target VCPU receives the IPI.
> 
> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> ---
>  arm/Makefile.common |   1 +
>  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  arm/unittests.cfg   |   6 ++
>  3 files changed, 296 insertions(+)
>  create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>  tests-common += $(TEST_DIR)/gic.flat
>  tests-common += $(TEST_DIR)/psci.flat
>  tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>  
>  tests-all = $(tests-common) $(tests)
>  all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..7df2272
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,289 @@

Please add the copyright and license header like we have in other files.

> +#include <util.h>
> +#include <asm/gic.h>
> +
> +static volatile bool second_cpu_up;
> +static volatile bool first_cpu_ack;
> +static volatile bool ipi_acked;
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +#define IPI_IRQ		1
> +
> +#define TIMEOUT (1U << 28)
> +
> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))

We have this in libcflat already, ARRAY_SIZE()

> +#define for_each_test(_iter, _tests, _tmp) \
> +	for (_tmp = 0, _iter = _tests; \
> +			_tmp < ARR_SIZE(_tests); \
> +			_tmp++, _iter++)

One time used macro is probably not necessary and the tmp
var wouldn't be necessary if you make the last entry of
available_tests a null entry.

> +
> +#define CYCLE_COUNT(c1, c2) \
> +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> +
> +#define IPI_DEBUG 0
> +
> +#if IPI_DEBUG == 1
> +#define debug(fmt, ...) \
> +	printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
> +#else
> +#define debug(fmt, ...) {}
> +#endif

We don't generally turn off print statements in kvm-unit-tests.
If the statement is really a temporary debug statement, then
maybe just remove it before posting the patch. If it's even just a
little helpful, then just leave it in and always print it. To promote
messages to a higher level of importance than printf use report_*
functions.

> +
> +static uint64_t read_cc(void)
> +{
> +	uint64_t cc;
> +	asm volatile(
> +		"isb\n"
> +		"mrs %0, CNTPCT_EL0\n"
> +		"isb\n"
> +		: [reg] "=r" (cc)
> +		::
> +	);
> +	return cc;
> +}
> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	ipi_acked = true;
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +
> +	debug("secondary core up\n");
> +
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +
> +	second_cpu_up = true;
> +
> +	debug("secondary initialized vgic\n");
> +
> +	while (!first_cpu_ack && timeout--);

cpu_relax() here will help get that first-cpu ack on TCG faster. That
said, this test doesn't make sense on TCG anyway, other than debugging
it. So you could just add 'accel = kvm' to it's unittests.cfg parameter
list.

I just tried on TCG now. It doesn't run. It gets

Timer Frequency 62500000 Hz (Output in timer count)
Unhandled exception ec=0 (UNKNOWN)
Vector: 4 (el1h_sync)
ESR_EL1:         02000000, ec=0 (UNKNOWN)
FAR_EL1: 0000000000000000 (not valid)
Exception frame registers:
pc : [<0000000040080088>] lr : [<00000000400803e8>] pstate: 800003c5
sp : 00000000400aff90
x29: 0000000000000000 x28: 0000000000000000 
x27: 0000000040090000 x26: 0000000040090c60 
x25: 0000000040090000 x24: 000000001fffffff 
x23: 0000000000000000 x22: 0000000000000000 
x21: 0000000000000040 x20: 0000000000000000 
x19: 0000000000000000 x18: 00000000400b0000 
x17: 0000000000000000 x16: 0000000000000000 
x15: 00000000400afe8c x14: 00000000400b0000 
x13: 00000000400afecc x12: 0000000000001680 
x11: 0000000000000000 x10: 6666666666666667 
x9 : 0000000000000030 x8 : 0000000000000030 
x7 : 00000000400af670 x6 : 00000000400af673 
x5 : 00000000400af678 x4 : 00000000000007b7 
x3 : 00000000400af6ec x2 : 0000000040090000 
x1 : 000000000015909e x0 : 000000004b000000

You don't need to get it to work on TCG if you add the 'accel = kvm'
parameter, but it's sometimes indicative of a bug in the unit test
when it doesn't run, so you might want to take a look.

Also need to do s/timeout/tries/ or something, as it's not time related.

> +	if (!first_cpu_ack) {
> +		debug("ipi_test: First CPU did not ack wake-up\n");

I think you should drop the debug() macro completely, but in here
it should certainly be changed. Erroring out shouldn't be silent,
as it would be when out debug turned on.

> +		exit(1);
> +	}
> +
> +	debug("detected first cpu ack\n");
> +
> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);
> +}
> +
> +static int test_init(void)
> +{
> +	int ret;
> +	unsigned int timeout = TIMEOUT;
> +
> +	ret = gic_init();
> +	if (!ret) {
> +		debug("No supported gic present, skipping tests...\n");
> +		goto out;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +
> +	debug("starting second CPU\n");
> +	smp_boot_secondary(1, ipi_test_secondary_entry);
> +
> +	while (!second_cpu_up && timeout--); /* Wait for second CPU! */
> +
> +	if (!second_cpu_up) {
> +		debug("ipi_test: timeout waiting for secondary CPU\n");
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	debug("detected secondary core up\n");
> +
> +	first_cpu_ack = true;

I think you should be able to avoid most of this manual cpu
synchronization by using the on_cpu() call, which is the
recommended way to start secondaries anyway.

> +
> +	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
> +
> +out:
> +	return ret;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +	unsigned long c1, c2;
> +
> +	while (!ipi_ready && timeout--);
> +	if (!ipi_ready) {
> +		debug("ipi_test: second core not ready for IPIs\n");
> +		exit(1);
> +	}
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	timeout = TIMEOUT;
> +	while (!ipi_received && timeout--);
> +	if (!ipi_received) {
> +		debug("ipi_test: secondary core never received ipi\n");
> +		exit(1);
> +	}
> +
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +
> +static unsigned long hvc_test(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0");
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static void __noop(void)
> +{
> +}
> +
> +static unsigned long noop_guest(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	__noop();
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	unsigned long c1, c2;
> +	void *mmio_read_user_addr = (void*) 0x0a000008;

This is a virtio-mmio transport device-id address. It's harmless to
read it, as long as we don't change it. I've had plans to provide
mmio addresses with different read, write permissions and sizes through
extending a test-dev for quite a while. I should do that. For this patch
I guess this is fine, but please comment it with a big FIXME or TODO to
make sure we change it when we can someday.

> +
> +	/* Measure MMIO exit to QEMU in userspace */
> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void *vgic_dist_addr = NULL;
> +
> +	if (v == 2)
> +		vgic_dist_addr = gicv2_dist_base();
> +	else if (v == 3)
> +		vgic_dist_addr = gicv3_dist_base();
> +
> +	/* Measure MMIO exit to host kernel */
> +	c1 = read_cc();
> +	readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */

You can add GICD_IIDR to lib/arm/asm/gic.h

> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long eoi_test(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void (*write_eoir)(u32 irqstat) = NULL;
> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	if (v == 2)
> +		write_eoir = gicv2_write_eoir;
> +	else if (v == 3)
> +		write_eoir = gicv3_write_eoir;

You don't need this, we have gic_write_eoir(), which you used above.

> +
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test available_tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"noop_guest",         noop_guest,         true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long long min = 0, max = 0;
> +	const unsigned long long goal = (1ULL << 29);
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		for (i = 0; i < iterations; i++) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				/*
> +				 * If something went wrong or we had an
> +				 * overflow, don't count that sample.
> +				 */
> +				iterations--;
> +				i--;

I'd prefer incrementing a different counter after the sample==0 test /
continue, than to decrement both i and iterations.

> +				debug("cycle count overflow: %lu\n", sample);
> +				continue;
> +			}
> +			cycles += sample;
> +			if (min == 0 || min > sample)
> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +		}
> +	} while (cycles < goal);
> +	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> +		test->name, cycles / (iterations), min, max);

No need for () around iterations.

> +}
> +
> +void kvm_unit_test(void)
> +{
> +	int i=0;
> +	struct exit_test *test;
> +	for_each_test(test, available_tests, i) {
> +		if (!test->run)
> +			continue;
> +		loop_test(test);
> +	}
> +
> +	return;

pointless return

> +}

nit: no need for the above function, just inline it in main().

> +
> +int main(int argc, char **argv)
> +{
> +	if (!test_init())
> +		exit(1);

I think this and all exit(1)'s in this unit test could/should be
replaced with asserts.

> +	kvm_unit_test();
> +	return 0;
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..1d0c4ca 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,9 @@ file = timer.flat
>  groups = timer
>  timeout = 2s
>  arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = micro-test
> -- 
> 1.9.1
> 
>

Thanks again for contributing to kvm-unit-tests. I look forward to v2.

drew

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] Support micro operation measurement on arm64
  2017-12-18 18:14 ` [kvm-unit-tests PATCH] Support micro operation measurement on arm64 Andrew Jones
@ 2017-12-18 20:58   ` Shih-Wei Li
  2017-12-19  9:06     ` Christoffer Dall
  0 siblings, 1 reply; 27+ messages in thread
From: Shih-Wei Li @ 2017-12-18 20:58 UTC (permalink / raw)
  To: Andrew Jones; +Cc: Marc Zyngier, kvmarm, kvm, Paolo Bonzini

On Mon, Dec 18, 2017 at 1:14 PM, Andrew Jones <drjones@redhat.com> wrote:
> Hi Shih-Wei,
>
> Thanks for doing this! Porting Christoffer's selftests to kvm-unit-tests
> has been on the kvm-unit-tests' TODO list since it was first introduced.
>
> On Fri, Dec 15, 2017 at 04:15:38PM -0500, Shih-Wei Li wrote:
>> The patch provides support for quantifying the cost of micro level
>> operations on arm64 hardware. The supported operations include hypercall,
>> mmio accesses, EOI virtual interrupt, and IPI send. Measurements are
>> currently obtained using timer counters. Further modifications in KVM
>> will be required to support timestamping using cycle counters, as KVM
>> now disables accesses to the PMU counters from the VM.
>
> KVM only disables access when userspace tells it to, which it doesn't
> do by default. Is there something else missing keeping the PMU counters
> from being used?

Thanks for the feedback! What I meant by PMU counters here was for
"CPU cycle counter" specifically. I'm not aware of a way to enable the
PMU cycle counter from QEMU, did I miss something here?

>
>>
>> We iterate each of the tests for millions of times and output their
>> average, minimum and maximum cost in timer counts. Instruction barriers
>
> Can we reduce the number of iterations and still get valid results? The
> test takes so long that of all the platforms I tested it on timed out
> before it completed, except seattle. The default timeout for kvm-unit-
> tests is 90 seconds. I'd rather a unit test execute in much shorter time
> than that too, in order to keep people encouraged to run them frequently.
> If these tests must run a long time, then I think we should add them to
> the nodefault group.

I think it's possible to reduce the timeout without losing accuracy. I
can look into this further.

>
>> were used before and after taking timestamps to avoid out-of-order
>> execution or pipelining from skewing our measurements.
>>
>> To improve precision in the measurements, one should consider pinning
>> each VCPU to a specific physical CPU (PCPU) and ensure no other task
>> could run on that PCPU to skew the results. This can be achieved by
>> enabling QMP server in the QEMU command in unittest.cfg for micro test,
>> allowing a client program to get the thread_id for each VCPU thread
>> from the QMP server. Based on the information, the client program can
>> then pin the corresponding VCPUs to dedicated PCPUs and isolate
>> interrupts and tasks from those PCPUs.
>
> To isolate the CPUs one would need to boot the host with the isolcpus
> kernel command line option. Pinning the VCPUs is pretty easy though,
> so we could provide a script that does that in kvm-unit-tests and then
> always use it for this test. The script could also warn if we're
> pinning to CPUs that haven't been isolated.
>

My intention was to support VCPU pinning as an optional feature,
so the users that care about extra precision can add qmp option in
QEMU config and run the script to pin VCPUs. Otherwise, the test can
be conducted in a fashion similar to what's done in vmexit on x86.

>>
>> The patch has been tested on arm64 hardware including AMD Seattle and
>> ThunderX2, which has GICv2 and GICv3 respectively.
>
> I tried thunderx2, amberwing, mustang, and seattle. Only seattle
> completed, the rest timed out.

I have only tested the code by invoking test directly using make
standalone like the following. I did notice that it took ~90 seconds
to finish the test itself.
./"tests/micro-cost"

Thanks,
Shih-Wei

>
> Thanks,
> drew
>
>>
>> Shih-Wei Li (1):
>>   arm64: add micro test
>>
>>  arm/Makefile.common |   1 +
>>  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  arm/unittests.cfg   |   6 ++
>>  3 files changed, 296 insertions(+)
>>  create mode 100644 arm/micro-test.c
>>
>> --
>> 1.9.1
>>
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-18 17:31   ` Yury Norov
@ 2017-12-18 21:32     ` Shih-Wei Li
  2017-12-19  9:12     ` Christoffer Dall
  1 sibling, 0 replies; 27+ messages in thread
From: Shih-Wei Li @ 2017-12-18 21:32 UTC (permalink / raw)
  To: Yury Norov
  Cc: Goutham, Sunil, kvm, Marc Zyngier, Cherian, Linu, Paolo Bonzini, kvmarm

On Mon, Dec 18, 2017 at 12:31 PM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> Hi Shih-Wei,
>
> (CC Cavium folks)
>
> Thank you for the test! I was going to write something like
> this, and instead found your patch submitted - good luck to me.
>
> Soon I'll test my hardware with it and share you results.
> Comments inline.
>
> Yury
>
> On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
>> Here we provide the support for measuring various micro level
>> operations on arm64. We iterate each of the tests for millions of
>> times and output their average, minimum and maximum cost in timer
>> counts. Instruction barriers are used before and after taking
>> timestamps to avoid out-of-order execution or pipelining from
>> skewing our measurements.
>>
>> The tests we currently support and measure are mostly
>> straightforward by the function names and the respective comments.
>> For IPI test, we measure the cost of sending IPI from a source
>> VCPU to a target VCPU, until the target VCPU receives the IPI.
>>
>> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
>> ---
>>  arm/Makefile.common |   1 +
>>  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  arm/unittests.cfg   |   6 ++
>>  3 files changed, 296 insertions(+)
>>  create mode 100644 arm/micro-test.c
>>
>> diff --git a/arm/Makefile.common b/arm/Makefile.common
>> index 0a039cf..c7d5c27 100644
>> --- a/arm/Makefile.common
>> +++ b/arm/Makefile.common
>> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>>  tests-common += $(TEST_DIR)/gic.flat
>>  tests-common += $(TEST_DIR)/psci.flat
>>  tests-common += $(TEST_DIR)/sieve.flat
>> +tests-common += $(TEST_DIR)/micro-test.flat
>>
>>  tests-all = $(tests-common) $(tests)
>>  all: directories $(tests-all)
>> diff --git a/arm/micro-test.c b/arm/micro-test.c
>> new file mode 100644
>> index 0000000..7df2272
>> --- /dev/null
>> +++ b/arm/micro-test.c
>> @@ -0,0 +1,289 @@
>> +#include <util.h>
>> +#include <asm/gic.h>
>> +
>> +static volatile bool second_cpu_up;
>> +static volatile bool first_cpu_ack;
>> +static volatile bool ipi_acked;
>> +static volatile bool ipi_received;
>> +static volatile bool ipi_ready;
>> +#define IPI_IRQ              1
>> +
>> +#define TIMEOUT (1U << 28)
>> +
>> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
>> +#define for_each_test(_iter, _tests, _tmp) \
>> +     for (_tmp = 0, _iter = _tests; \
>> +                     _tmp < ARR_SIZE(_tests); \
>> +                     _tmp++, _iter++)
>> +
>> +#define CYCLE_COUNT(c1, c2) \
>> +     (((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
>
> Is my understanding correct that this is overflow protection?
> c1 and c2 are 64-bit values. To overflow them you need 58 years
> at 1G CPU freq.
>
> And anyway,
> #define CYCLE_COUNT(c1, c2)  ((c1) >= (c2) ? 0 : (c2) - (c1))

Yes, it's for overflow protection.

>
>> +
>> +#define IPI_DEBUG 0
>> +
>
> If you assume to pass IPI_DEBUG as GCC parameter, like -DIPI_DEBUG=1,
> it should be
>
> #ifndef IPI_DEBUG
> #define IPI_DEBUG 0
> #endif
>
> But I wonder if there's already existing switch for debug info?
>
>> +#if IPI_DEBUG == 1
>> +#define debug(fmt, ...) \
>> +     printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
>> +#else
>> +#define debug(fmt, ...) {}
>> +#endif
>
> There are some fancy printing functions in lib/report.c. I didn't
> look deeply into thet, but at first glance you can use it here.
>
>> +
>> +static uint64_t read_cc(void)
>> +{
>> +     uint64_t cc;
>> +     asm volatile(
>> +             "isb\n"
>> +             "mrs %0, CNTPCT_EL0\n"
>> +             "isb\n"
>> +             : [reg] "=r" (cc)
>> +             ::
>> +     );
>> +     return cc;
>> +}
>
> Flushing pipeline before mrs is enough. Also, you can use read_sysreg()
> instead of inline assembly. Refer arch_counter_get_cntpct() in kernel:
>
> 151 static inline u64 arch_counter_get_cntpct(void)
> 152 {
> 153         isb();
> 154         return arch_timer_reg_read_stable(cntpct_el0);
> 155 }
>
>> +
>> +static void ipi_irq_handler(struct pt_regs *regs __unused)
>> +{
>> +     u32 ack;
>> +     ipi_ready = false;
>> +     ipi_received = true;
>> +     ack = gic_read_iar();
>> +     ipi_acked = true;
>> +     gic_write_eoir(ack);
>> +     ipi_ready = true;
>> +}
>> +
>> +static void ipi_test_secondary_entry(void)
>> +{
>> +     unsigned int timeout = TIMEOUT;
>> +
>> +     debug("secondary core up\n");
>> +
>> +     enum vector v = EL1H_IRQ;
>> +     install_irq_handler(v, ipi_irq_handler);
>> +
>> +     gic_enable_defaults();
>> +
>> +     second_cpu_up = true;
>> +
>> +     debug("secondary initialized vgic\n");
>> +
>> +     while (!first_cpu_ack && timeout--);
>> +     if (!first_cpu_ack) {
>> +             debug("ipi_test: First CPU did not ack wake-up\n");
>> +             exit(1);
>> +     }
>
> Nit: here and later, timeout is not actually timeout, but something like
> counter of attempts. Maybe it worth to use read_cc() here if you need
> time intervals?
>
>> +
>> +     debug("detected first cpu ack\n");
>> +
>> +     local_irq_enable(); /* Enter small wait-loop */
>> +     ipi_ready = true;
>> +     while (true);
>> +}
>> +
>> +static int test_init(void)
>> +{
>> +     int ret;
>> +     unsigned int timeout = TIMEOUT;
>> +
>> +     ret = gic_init();
>> +     if (!ret) {
>> +             debug("No supported gic present, skipping tests...\n");
>> +             goto out;
>> +     }
>> +
>> +     ipi_ready = false;
>> +
>> +     gic_enable_defaults();
>> +
>> +     debug("starting second CPU\n");
>> +     smp_boot_secondary(1, ipi_test_secondary_entry);
>> +
>> +     while (!second_cpu_up && timeout--); /* Wait for second CPU! */
>> +
>> +     if (!second_cpu_up) {
>> +             debug("ipi_test: timeout waiting for secondary CPU\n");
>> +             ret = 0;
>> +             goto out;
>> +     }
>> +
>> +     debug("detected secondary core up\n");
>> +
>> +     first_cpu_ack = true;
>> +
>> +     printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
>> +
>> +out:
>
> Nit: it's simpler to return error at place and drop the 'out' label,
> like this:
> if (!gic_init()) {
>         debug(...);
>         return 0;
> }

Yes.

>
>> +     return ret;
>> +}
>> +
>> +static unsigned long ipi_test(void)
>> +{
>> +     unsigned int timeout = TIMEOUT;
>> +     unsigned long c1, c2;
>
> read_cc() returns uint64_t, so c1 and c2 should also be fixed-size
> variables. Maybe one day we'll port test to arm32 where unsigned long is
> 32-bit...

Right, I'll look at it.

>
>> +
>> +     while (!ipi_ready && timeout--);
>> +     if (!ipi_ready) {
>> +             debug("ipi_test: second core not ready for IPIs\n");
>> +             exit(1);
>> +     }
>> +
>> +     ipi_received = false;
>> +
>> +     c1 = read_cc();
>> +
>> +     gic_ipi_send_single(IPI_IRQ, 1);
>> +
>> +     timeout = TIMEOUT;
>> +     while (!ipi_received && timeout--);
>> +     if (!ipi_received) {
>> +             debug("ipi_test: secondary core never received ipi\n");
>> +             exit(1);
>> +     }
>> +
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +
>
> Nit: odd line.
>
>> +static unsigned long hvc_test(void)
>> +{
>> +     unsigned long c1, c2;
>> +
>> +     c1 = read_cc();
>> +     asm volatile("mov w0, #0x4b000000; hvc #0");
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>
> This will be broken if compiler decide to assign r0 for variable c1,
> or I miss something?
>
>> +
>> +static void __noop(void)
>> +{
>> +}
>
> This would be completely optimized out. Is it for demonstrative
> purpose?

Yes, this is to show the cost using isb().

>
>> +
>> +static unsigned long noop_guest(void)
>> +{
>> +     unsigned long c1, c2;
>> +
>> +     c1 = read_cc();
>> +     __noop();
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_user(void)
>> +{
>> +     unsigned long c1, c2;
>> +     void *mmio_read_user_addr = (void*) 0x0a000008;
>> +
>> +     /* Measure MMIO exit to QEMU in userspace */
>> +     c1 = read_cc();
>> +     readl(mmio_read_user_addr);
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_vgic(void)
>> +{
>> +     unsigned long c1, c2;
>> +     int v = gic_version();
>> +     void *vgic_dist_addr = NULL;
>> +
>> +     if (v == 2)
>> +             vgic_dist_addr = gicv2_dist_base();
>> +     else if (v == 3)
>> +             vgic_dist_addr = gicv3_dist_base();
>> +
>> +     /* Measure MMIO exit to host kernel */
>> +     c1 = read_cc();
>> +     readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>
> So if gic version is not 2 or 3, it will readl() at broken
> address. I think it's better to return error and print relevant
> message to system log.
>
>> +
>> +static unsigned long eoi_test(void)
>> +{
>> +     unsigned long c1, c2;
>> +     int v = gic_version();
>> +     void (*write_eoir)(u32 irqstat) = NULL;
>> +
>> +     u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
>> +
>> +     if (v == 2)
>> +             write_eoir = gicv2_write_eoir;
>> +     else if (v == 3)
>> +             write_eoir = gicv3_write_eoir;
>> +
>> +     c1 = read_cc();
>> +     write_eoir(val);
>> +     c2 = read_cc();
>> +
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>
> Similar here.
>
>> +
>> +struct exit_test {
>> +     const char *name;
>> +     unsigned long (*test_fn)(void);
>> +     bool run;
>> +};
>> +
>> +static struct exit_test available_tests[] = {
>> +     {"hvc",                hvc_test,           true},
>> +     {"noop_guest",         noop_guest,         true},
>> +     {"mmio_read_user",     mmio_read_user,     true},
>> +     {"mmio_read_vgic",     mmio_read_vgic,     true},
>> +     {"eoi",                eoi_test,           true},
>> +     {"ipi",                ipi_test,           true},
>> +};
>> +
>> +static void loop_test(struct exit_test *test)
>> +{
>> +     unsigned long i, iterations = 32;
>> +     unsigned long sample, cycles;
>> +     unsigned long long min = 0, max = 0;
>> +     const unsigned long long goal = (1ULL << 29);
>> +
>> +     do {
>> +             iterations *= 2;
>> +             cycles = 0;
>> +             for (i = 0; i < iterations; i++) {
>> +                     sample = test->test_fn();
>> +                     if (sample == 0) {
>> +                             /*
>> +                              * If something went wrong or we had an
>> +                              * overflow, don't count that sample.
>> +                              */
>> +                             iterations--;
>> +                             i--;
>> +                             debug("cycle count overflow: %lu\n", sample);
>> +                             continue;
>> +                     }
>> +                     cycles += sample;
>> +                     if (min == 0 || min > sample)
>> +                             min = sample;
>> +                     if (max < sample)
>> +                             max = sample;
>> +             }
>> +     } while (cycles < goal);
>> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
>> +             test->name, cycles / (iterations), min, max);
>> +}
>> +
>> +void kvm_unit_test(void)
>> +{
>> +     int i=0;
>> +     struct exit_test *test;
>> +     for_each_test(test, available_tests, i) {
>> +             if (!test->run)
>> +                     continue;
>> +             loop_test(test);
>> +     }
>> +
>> +     return;
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> +     if (!test_init())
>> +             exit(1);
>> +     kvm_unit_test();
>> +     return 0;
>> +}
>> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
>> index 44b98cf..1d0c4ca 100644
>> --- a/arm/unittests.cfg
>> +++ b/arm/unittests.cfg
>> @@ -116,3 +116,9 @@ file = timer.flat
>>  groups = timer
>>  timeout = 2s
>>  arch = arm64
>> +
>> +# Exit tests
>> +[micro-test]
>> +file = micro-test.flat
>> +smp = 2
>> +groups = micro-test
>> --
>> 1.9.1
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-18 19:10   ` Andrew Jones
@ 2017-12-18 21:58     ` Shih-Wei Li
  2017-12-19 12:00       ` Andrew Jones
  2017-12-20 17:00     ` Andrew Jones
  1 sibling, 1 reply; 27+ messages in thread
From: Shih-Wei Li @ 2017-12-18 21:58 UTC (permalink / raw)
  To: Andrew Jones; +Cc: kvm, kvmarm, Christoffer Dall, Marc Zyngier, Paolo Bonzini

On Mon, Dec 18, 2017 at 2:10 PM, Andrew Jones <drjones@redhat.com> wrote:
> On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
>> Here we provide the support for measuring various micro level
>> operations on arm64. We iterate each of the tests for millions of
>> times and output their average, minimum and maximum cost in timer
>> counts. Instruction barriers are used before and after taking
>> timestamps to avoid out-of-order execution or pipelining from
>> skewing our measurements.
>>
>> The tests we currently support and measure are mostly
>> straightforward by the function names and the respective comments.
>> For IPI test, we measure the cost of sending IPI from a source
>> VCPU to a target VCPU, until the target VCPU receives the IPI.
>>
>> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
>> ---
>>  arm/Makefile.common |   1 +
>>  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  arm/unittests.cfg   |   6 ++
>>  3 files changed, 296 insertions(+)
>>  create mode 100644 arm/micro-test.c
>>
>> diff --git a/arm/Makefile.common b/arm/Makefile.common
>> index 0a039cf..c7d5c27 100644
>> --- a/arm/Makefile.common
>> +++ b/arm/Makefile.common
>> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>>  tests-common += $(TEST_DIR)/gic.flat
>>  tests-common += $(TEST_DIR)/psci.flat
>>  tests-common += $(TEST_DIR)/sieve.flat
>> +tests-common += $(TEST_DIR)/micro-test.flat
>>
>>  tests-all = $(tests-common) $(tests)
>>  all: directories $(tests-all)
>> diff --git a/arm/micro-test.c b/arm/micro-test.c
>> new file mode 100644
>> index 0000000..7df2272
>> --- /dev/null
>> +++ b/arm/micro-test.c
>> @@ -0,0 +1,289 @@
>
> Please add the copyright and license header like we have in other files.
>
>> +#include <util.h>
>> +#include <asm/gic.h>
>> +
>> +static volatile bool second_cpu_up;
>> +static volatile bool first_cpu_ack;
>> +static volatile bool ipi_acked;
>> +static volatile bool ipi_received;
>> +static volatile bool ipi_ready;
>> +#define IPI_IRQ              1
>> +
>> +#define TIMEOUT (1U << 28)
>> +
>> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
>
> We have this in libcflat already, ARRAY_SIZE()
>
>> +#define for_each_test(_iter, _tests, _tmp) \
>> +     for (_tmp = 0, _iter = _tests; \
>> +                     _tmp < ARR_SIZE(_tests); \
>> +                     _tmp++, _iter++)
>
> One time used macro is probably not necessary and the tmp
> var wouldn't be necessary if you make the last entry of
> available_tests a null entry.

You're right. I'll look into this.

>
>> +
>> +#define CYCLE_COUNT(c1, c2) \
>> +     (((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
>> +
>> +#define IPI_DEBUG 0
>> +
>> +#if IPI_DEBUG == 1
>> +#define debug(fmt, ...) \
>> +     printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
>> +#else
>> +#define debug(fmt, ...) {}
>> +#endif
>
> We don't generally turn off print statements in kvm-unit-tests.
> If the statement is really a temporary debug statement, then
> maybe just remove it before posting the patch. If it's even just a
> little helpful, then just leave it in and always print it. To promote
> messages to a higher level of importance than printf use report_*
> functions.
>

Thanks. I'll look at it.

>> +
>> +static uint64_t read_cc(void)
>> +{
>> +     uint64_t cc;
>> +     asm volatile(
>> +             "isb\n"
>> +             "mrs %0, CNTPCT_EL0\n"
>> +             "isb\n"
>> +             : [reg] "=r" (cc)
>> +             ::
>> +     );
>> +     return cc;
>> +}
>> +
>> +static void ipi_irq_handler(struct pt_regs *regs __unused)
>> +{
>> +     u32 ack;
>> +     ipi_ready = false;
>> +     ipi_received = true;
>> +     ack = gic_read_iar();
>> +     ipi_acked = true;
>> +     gic_write_eoir(ack);
>> +     ipi_ready = true;
>> +}
>> +
>> +static void ipi_test_secondary_entry(void)
>> +{
>> +     unsigned int timeout = TIMEOUT;
>> +
>> +     debug("secondary core up\n");
>> +
>> +     enum vector v = EL1H_IRQ;
>> +     install_irq_handler(v, ipi_irq_handler);
>> +
>> +     gic_enable_defaults();
>> +
>> +     second_cpu_up = true;
>> +
>> +     debug("secondary initialized vgic\n");
>> +
>> +     while (!first_cpu_ack && timeout--);
>
> cpu_relax() here will help get that first-cpu ack on TCG faster. That
> said, this test doesn't make sense on TCG anyway, other than debugging
> it. So you could just add 'accel = kvm' to it's unittests.cfg parameter
> list.

ok, thanks.

>
> I just tried on TCG now. It doesn't run. It gets
>
> Timer Frequency 62500000 Hz (Output in timer count)
> Unhandled exception ec=0 (UNKNOWN)
> Vector: 4 (el1h_sync)
> ESR_EL1:         02000000, ec=0 (UNKNOWN)
> FAR_EL1: 0000000000000000 (not valid)
> Exception frame registers:
> pc : [<0000000040080088>] lr : [<00000000400803e8>] pstate: 800003c5
> sp : 00000000400aff90
> x29: 0000000000000000 x28: 0000000000000000
> x27: 0000000040090000 x26: 0000000040090c60
> x25: 0000000040090000 x24: 000000001fffffff
> x23: 0000000000000000 x22: 0000000000000000
> x21: 0000000000000040 x20: 0000000000000000
> x19: 0000000000000000 x18: 00000000400b0000
> x17: 0000000000000000 x16: 0000000000000000
> x15: 00000000400afe8c x14: 00000000400b0000
> x13: 00000000400afecc x12: 0000000000001680
> x11: 0000000000000000 x10: 6666666666666667
> x9 : 0000000000000030 x8 : 0000000000000030
> x7 : 00000000400af670 x6 : 00000000400af673
> x5 : 00000000400af678 x4 : 00000000000007b7
> x3 : 00000000400af6ec x2 : 0000000040090000
> x1 : 000000000015909e x0 : 000000004b000000
>
> You don't need to get it to work on TCG if you add the 'accel = kvm'
> parameter, but it's sometimes indicative of a bug in the unit test
> when it doesn't run, so you might want to take a look.
>
> Also need to do s/timeout/tries/ or something, as it's not time related.
>
>> +     if (!first_cpu_ack) {
>> +             debug("ipi_test: First CPU did not ack wake-up\n");
>
> I think you should drop the debug() macro completely, but in here
> it should certainly be changed. Erroring out shouldn't be silent,
> as it would be when out debug turned on.
>
>> +             exit(1);
>> +     }
>> +
>> +     debug("detected first cpu ack\n");
>> +
>> +     local_irq_enable(); /* Enter small wait-loop */
>> +     ipi_ready = true;
>> +     while (true);
>> +}
>> +
>> +static int test_init(void)
>> +{
>> +     int ret;
>> +     unsigned int timeout = TIMEOUT;
>> +
>> +     ret = gic_init();
>> +     if (!ret) {
>> +             debug("No supported gic present, skipping tests...\n");
>> +             goto out;
>> +     }
>> +
>> +     ipi_ready = false;
>> +
>> +     gic_enable_defaults();
>> +
>> +     debug("starting second CPU\n");
>> +     smp_boot_secondary(1, ipi_test_secondary_entry);
>> +
>> +     while (!second_cpu_up && timeout--); /* Wait for second CPU! */
>> +
>> +     if (!second_cpu_up) {
>> +             debug("ipi_test: timeout waiting for secondary CPU\n");
>> +             ret = 0;
>> +             goto out;
>> +     }
>> +
>> +     debug("detected secondary core up\n");
>> +
>> +     first_cpu_ack = true;
>
> I think you should be able to avoid most of this manual cpu
> synchronization by using the on_cpu() call, which is the
> recommended way to start secondaries anyway.

I'll take a look.

>
>> +
>> +     printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
>> +
>> +out:
>> +     return ret;
>> +}
>> +
>> +static unsigned long ipi_test(void)
>> +{
>> +     unsigned int timeout = TIMEOUT;
>> +     unsigned long c1, c2;
>> +
>> +     while (!ipi_ready && timeout--);
>> +     if (!ipi_ready) {
>> +             debug("ipi_test: second core not ready for IPIs\n");
>> +             exit(1);
>> +     }
>> +
>> +     ipi_received = false;
>> +
>> +     c1 = read_cc();
>> +
>> +     gic_ipi_send_single(IPI_IRQ, 1);
>> +
>> +     timeout = TIMEOUT;
>> +     while (!ipi_received && timeout--);
>> +     if (!ipi_received) {
>> +             debug("ipi_test: secondary core never received ipi\n");
>> +             exit(1);
>> +     }
>> +
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +
>> +static unsigned long hvc_test(void)
>> +{
>> +     unsigned long c1, c2;
>> +
>> +     c1 = read_cc();
>> +     asm volatile("mov w0, #0x4b000000; hvc #0");
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static void __noop(void)
>> +{
>> +}
>> +
>> +static unsigned long noop_guest(void)
>> +{
>> +     unsigned long c1, c2;
>> +
>> +     c1 = read_cc();
>> +     __noop();
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_user(void)
>> +{
>> +     unsigned long c1, c2;
>> +     void *mmio_read_user_addr = (void*) 0x0a000008;
>
> This is a virtio-mmio transport device-id address. It's harmless to
> read it, as long as we don't change it. I've had plans to provide
> mmio addresses with different read, write permissions and sizes through
> extending a test-dev for quite a while. I should do that. For this patch
> I guess this is fine, but please comment it with a big FIXME or TODO to
> make sure we change it when we can someday.

Thanks for pointing out. I'll take a look.

>
>> +
>> +     /* Measure MMIO exit to QEMU in userspace */
>> +     c1 = read_cc();
>> +     readl(mmio_read_user_addr);
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_vgic(void)
>> +{
>> +     unsigned long c1, c2;
>> +     int v = gic_version();
>> +     void *vgic_dist_addr = NULL;
>> +
>> +     if (v == 2)
>> +             vgic_dist_addr = gicv2_dist_base();
>> +     else if (v == 3)
>> +             vgic_dist_addr = gicv3_dist_base();
>> +
>> +     /* Measure MMIO exit to host kernel */
>> +     c1 = read_cc();
>> +     readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
>
> You can add GICD_IIDR to lib/arm/asm/gic.h
>
>> +     c2 = read_cc();
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long eoi_test(void)
>> +{
>> +     unsigned long c1, c2;
>> +     int v = gic_version();
>> +     void (*write_eoir)(u32 irqstat) = NULL;
>> +
>> +     u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
>> +
>> +     if (v == 2)
>> +             write_eoir = gicv2_write_eoir;
>> +     else if (v == 3)
>> +             write_eoir = gicv3_write_eoir;
>
> You don't need this, we have gic_write_eoir(), which you used above.

Yes. I was thinking of calling EOI function directly to avoid the assert.

>
>> +
>> +     c1 = read_cc();
>> +     write_eoir(val);
>> +     c2 = read_cc();
>> +
>> +     return CYCLE_COUNT(c1, c2);
>> +}
>> +
>> +struct exit_test {
>> +     const char *name;
>> +     unsigned long (*test_fn)(void);
>> +     bool run;
>> +};
>> +
>> +static struct exit_test available_tests[] = {
>> +     {"hvc",                hvc_test,           true},
>> +     {"noop_guest",         noop_guest,         true},
>> +     {"mmio_read_user",     mmio_read_user,     true},
>> +     {"mmio_read_vgic",     mmio_read_vgic,     true},
>> +     {"eoi",                eoi_test,           true},
>> +     {"ipi",                ipi_test,           true},
>> +};
>> +
>> +static void loop_test(struct exit_test *test)
>> +{
>> +     unsigned long i, iterations = 32;
>> +     unsigned long sample, cycles;
>> +     unsigned long long min = 0, max = 0;
>> +     const unsigned long long goal = (1ULL << 29);
>> +
>> +     do {
>> +             iterations *= 2;
>> +             cycles = 0;
>> +             for (i = 0; i < iterations; i++) {
>> +                     sample = test->test_fn();
>> +                     if (sample == 0) {
>> +                             /*
>> +                              * If something went wrong or we had an
>> +                              * overflow, don't count that sample.
>> +                              */
>> +                             iterations--;
>> +                             i--;
>
> I'd prefer incrementing a different counter after the sample==0 test /
> continue, than to decrement both i and iterations.

I'm not sure about the comment above. Could you please substantiate
how a new counter should work here?

>
>> +                             debug("cycle count overflow: %lu\n", sample);
>> +                             continue;
>> +                     }
>> +                     cycles += sample;
>> +                     if (min == 0 || min > sample)
>> +                             min = sample;
>> +                     if (max < sample)
>> +                             max = sample;
>> +             }
>> +     } while (cycles < goal);
>> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
>> +             test->name, cycles / (iterations), min, max);
>
> No need for () around iterations.
>
>> +}
>> +
>> +void kvm_unit_test(void)
>> +{
>> +     int i=0;
>> +     struct exit_test *test;
>> +     for_each_test(test, available_tests, i) {
>> +             if (!test->run)
>> +                     continue;
>> +             loop_test(test);
>> +     }
>> +
>> +     return;
>
> pointless return
>
>> +}
>
> nit: no need for the above function, just inline it in main().

Yes.

>
>> +
>> +int main(int argc, char **argv)
>> +{
>> +     if (!test_init())
>> +             exit(1);
>
> I think this and all exit(1)'s in this unit test could/should be
> replaced with asserts.
>
>> +     kvm_unit_test();
>> +     return 0;
>> +}
>> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
>> index 44b98cf..1d0c4ca 100644
>> --- a/arm/unittests.cfg
>> +++ b/arm/unittests.cfg
>> @@ -116,3 +116,9 @@ file = timer.flat
>>  groups = timer
>>  timeout = 2s
>>  arch = arm64
>> +
>> +# Exit tests
>> +[micro-test]
>> +file = micro-test.flat
>> +smp = 2
>> +groups = micro-test
>> --
>> 1.9.1
>>
>>
>
> Thanks again for contributing to kvm-unit-tests. I look forward to v2.

Sure. Thank you for the feedback as well! I'll send out the patch asap.

Shih-Wei

>
> drew

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-15 21:15 ` [kvm-unit-tests PATCH] arm64: add micro test Shih-Wei Li
  2017-12-18 17:31   ` Yury Norov
  2017-12-18 19:10   ` Andrew Jones
@ 2017-12-18 22:17   ` Kalra, Ashish
  2017-12-18 22:18   ` Kalra, Ashish
  2017-12-18 22:31   ` Kalra, Ashish
  4 siblings, 0 replies; 27+ messages in thread
From: Kalra, Ashish @ 2017-12-18 22:17 UTC (permalink / raw)
  To: Shih-Wei Li, kvm, kvmarm; +Cc: marc.zyngier, pbonzini

Hello Shih-Wei,

This "micro" test is really useful for micro-architectural performance 
measurement of KVM/ARM.

But, to get slightly more accurate cycle counts, it will be preferable 
to compute counter read (and isb) overhead and deduct it from the final 
cycle count, something like this :

+static unsigned long counter_read_overhead;

+static void compute_counter_read_overhead(void)
+{
+	unsigned long c1, c2;
+
+	c1 = read_cc();
+	c2 = read_cc();
+	counter_read_overhead = c2 - c1;
+}

+#define CYCLE_COUNT(c1, c2) \
+	(((c1) > (c2) || ((c1) == (c2))) ? 0 : ((c2) - (c1)) - 
counter_read_overhead )
+

static int test_init(void)
{
	int ret;
	unsigned int timeout = TIMEOUT;

+	compute_counter_read_overhead();
..
..

Thanks,
Ashish

On 12/16/2017 2:45 AM, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests for millions of
> times and output their average, minimum and maximum cost in timer
> counts. Instruction barriers are used before and after taking
> timestamps to avoid out-of-order execution or pipelining from
> skewing our measurements.
> 
> The tests we currently support and measure are mostly
> straightforward by the function names and the respective comments.
> For IPI test, we measure the cost of sending IPI from a source
> VCPU to a target VCPU, until the target VCPU receives the IPI.
> 
> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> ---
>   arm/Makefile.common |   1 +
>   arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>   arm/unittests.cfg   |   6 ++
>   3 files changed, 296 insertions(+)
>   create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>   tests-common += $(TEST_DIR)/gic.flat
>   tests-common += $(TEST_DIR)/psci.flat
>   tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>   
>   tests-all = $(tests-common) $(tests)
>   all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..7df2272
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,289 @@
> +#include <util.h>
> +#include <asm/gic.h>
> +
> +static volatile bool second_cpu_up;
> +static volatile bool first_cpu_ack;
> +static volatile bool ipi_acked;
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +#define IPI_IRQ		1
> +
> +#define TIMEOUT (1U << 28)
> +
> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> +#define for_each_test(_iter, _tests, _tmp) \
> +	for (_tmp = 0, _iter = _tests; \
> +			_tmp < ARR_SIZE(_tests); \
> +			_tmp++, _iter++)
> +
> +#define CYCLE_COUNT(c1, c2) \
> +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> +
> +#define IPI_DEBUG 0
> +
> +#if IPI_DEBUG == 1
> +#define debug(fmt, ...) \
> +	printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
> +#else
> +#define debug(fmt, ...) {}
> +#endif
> +
> +static uint64_t read_cc(void)
> +{
> +	uint64_t cc;
> +	asm volatile(
> +		"isb\n"
> +		"mrs %0, CNTPCT_EL0\n"
> +		"isb\n"
> +		: [reg] "=r" (cc)
> +		::
> +	);
> +	return cc;
> +}
> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	ipi_acked = true;
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +
> +	debug("secondary core up\n");
> +
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +
> +	second_cpu_up = true;
> +
> +	debug("secondary initialized vgic\n");
> +
> +	while (!first_cpu_ack && timeout--);
> +	if (!first_cpu_ack) {
> +		debug("ipi_test: First CPU did not ack wake-up\n");
> +		exit(1);
> +	}
> +
> +	debug("detected first cpu ack\n");
> +
> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);
> +}
> +
> +static int test_init(void)
> +{
> +	int ret;
> +	unsigned int timeout = TIMEOUT;
> +
> +	ret = gic_init();
> +	if (!ret) {
> +		debug("No supported gic present, skipping tests...\n");
> +		goto out;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +
> +	debug("starting second CPU\n");
> +	smp_boot_secondary(1, ipi_test_secondary_entry);
> +
> +	while (!second_cpu_up && timeout--); /* Wait for second CPU! */
> +
> +	if (!second_cpu_up) {
> +		debug("ipi_test: timeout waiting for secondary CPU\n");
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	debug("detected secondary core up\n");
> +
> +	first_cpu_ack = true;
> +
> +	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
> +
> +out:
> +	return ret;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +	unsigned long c1, c2;
> +
> +	while (!ipi_ready && timeout--);
> +	if (!ipi_ready) {
> +		debug("ipi_test: second core not ready for IPIs\n");
> +		exit(1);
> +	}
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	timeout = TIMEOUT;
> +	while (!ipi_received && timeout--);
> +	if (!ipi_received) {
> +		debug("ipi_test: secondary core never received ipi\n");
> +		exit(1);
> +	}
> +
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +
> +static unsigned long hvc_test(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0");
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static void __noop(void)
> +{
> +}
> +
> +static unsigned long noop_guest(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	__noop();
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	unsigned long c1, c2;
> +	void *mmio_read_user_addr = (void*) 0x0a000008;
> +
> +	/* Measure MMIO exit to QEMU in userspace */
> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void *vgic_dist_addr = NULL;
> +
> +	if (v == 2)
> +		vgic_dist_addr = gicv2_dist_base();
> +	else if (v == 3)
> +		vgic_dist_addr = gicv3_dist_base();
> +
> +	/* Measure MMIO exit to host kernel */
> +	c1 = read_cc();
> +	readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long eoi_test(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void (*write_eoir)(u32 irqstat) = NULL;
> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	if (v == 2)
> +		write_eoir = gicv2_write_eoir;
> +	else if (v == 3)
> +		write_eoir = gicv3_write_eoir;
> +
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test available_tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"noop_guest",         noop_guest,         true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long long min = 0, max = 0;
> +	const unsigned long long goal = (1ULL << 29);
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		for (i = 0; i < iterations; i++) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				/*
> +				 * If something went wrong or we had an
> +				 * overflow, don't count that sample.
> +				 */
> +				iterations--;
> +				i--;
> +				debug("cycle count overflow: %lu\n", sample);
> +				continue;
> +			}
> +			cycles += sample;
> +			if (min == 0 || min > sample)
> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +		}
> +	} while (cycles < goal);
> +	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> +		test->name, cycles / (iterations), min, max);
> +}
> +
> +void kvm_unit_test(void)
> +{
> +	int i=0;
> +	struct exit_test *test;
> +	for_each_test(test, available_tests, i) {
> +		if (!test->run)
> +			continue;
> +		loop_test(test);
> +	}
> +
> +	return;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	if (!test_init())
> +		exit(1);
> +	kvm_unit_test();
> +	return 0;
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..1d0c4ca 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,9 @@ file = timer.flat
>   groups = timer
>   timeout = 2s
>   arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = micro-test
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-15 21:15 ` [kvm-unit-tests PATCH] arm64: add micro test Shih-Wei Li
                     ` (2 preceding siblings ...)
  2017-12-18 22:17   ` Kalra, Ashish
@ 2017-12-18 22:18   ` Kalra, Ashish
  2017-12-18 22:31   ` Kalra, Ashish
  4 siblings, 0 replies; 27+ messages in thread
From: Kalra, Ashish @ 2017-12-18 22:18 UTC (permalink / raw)
  To: Shih-Wei Li, kvm, kvmarm; +Cc: marc.zyngier, pbonzini


[-- Attachment #1.1: Type: text/plain, Size: 10638 bytes --]

Hello Shih-Wei,

This "micro" test is really useful for micro-architectural performance
measurement of KVM/ARM.

But, to get more accurate cycle counts, it will be preferable to compute
counter read (and isb) overhead and deduct it from the final cycle
count, something like this :

+static unsigned long counter_read_overhead;

+static void compute_counter_read_overhead(void)
+{
+       unsigned long c1, c2;
+
+       c1 = read_cc();
+       c2 = read_cc();
+       counter_read_overhead = c2 - c1;
+}

+#define CYCLE_COUNT(c1, c2) \
+       (((c1) > (c2) || ((c1) == (c2))) ? 0 : ((c2) - (c1)) -
counter_read_overhead )
+

static int test_init(void)
{
         int ret;
         unsigned int timeout = TIMEOUT;

+       compute_counter_read_overhead();
..
..

Thanks,
Ashish

On 12/16/2017 2:45 AM, Shih-Wei Li wrote:
 > Here we provide the support for measuring various micro level
 > operations on arm64. We iterate each of the tests for millions of
 > times and output their average, minimum and maximum cost in timer
 > counts. Instruction barriers are used before and after taking
 > timestamps to avoid out-of-order execution or pipelining from
 > skewing our measurements.
 >
 > The tests we currently support and measure are mostly
 > straightforward by the function names and the respective comments.
 > For IPI test, we measure the cost of sending IPI from a source
 > VCPU to a target VCPU, until the target VCPU receives the IPI.
 >
 > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
 > ---
 >   arm/Makefile.common |   1 +
 >   arm/micro-test.c    | 289 
++++++++++++++++++++++++++++++++++++++++++++++++++++
 >   arm/unittests.cfg   |   6 ++
 >   3 files changed, 296 insertions(+)
 >   create mode 100644 arm/micro-test.c
 >
 > diff --git a/arm/Makefile.common b/arm/Makefile.common
 > index 0a039cf..c7d5c27 100644
 > --- a/arm/Makefile.common
 > +++ b/arm/Makefile.common
 > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
 >   tests-common += $(TEST_DIR)/gic.flat
 >   tests-common += $(TEST_DIR)/psci.flat
 >   tests-common += $(TEST_DIR)/sieve.flat
 > +tests-common += $(TEST_DIR)/micro-test.flat
 >
 >   tests-all = $(tests-common) $(tests)
 >   all: directories $(tests-all)
 > diff --git a/arm/micro-test.c b/arm/micro-test.c
 > new file mode 100644
 > index 0000000..7df2272
 > --- /dev/null
 > +++ b/arm/micro-test.c
 > @@ -0,0 +1,289 @@
 > +#include <util.h>
 > +#include <asm/gic.h>
 > +
 > +static volatile bool second_cpu_up;
 > +static volatile bool first_cpu_ack;
 > +static volatile bool ipi_acked;
 > +static volatile bool ipi_received;
 > +static volatile bool ipi_ready;
 > +#define IPI_IRQ              1
 > +
 > +#define TIMEOUT (1U << 28)
 > +
 > +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
 > +#define for_each_test(_iter, _tests, _tmp) \
 > +     for (_tmp = 0, _iter = _tests; \
 > +                     _tmp < ARR_SIZE(_tests); \
 > +                     _tmp++, _iter++)
 > +
 > +#define CYCLE_COUNT(c1, c2) \
 > +     (((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
 > +
 > +#define IPI_DEBUG 0
 > +
 > +#if IPI_DEBUG == 1
 > +#define debug(fmt, ...) \
 > +     printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
 > +#else
 > +#define debug(fmt, ...) {}
 > +#endif
 > +
 > +static uint64_t read_cc(void)
 > +{
 > +     uint64_t cc;
 > +     asm volatile(
 > +             "isb\n"
 > +             "mrs %0, CNTPCT_EL0\n"
 > +             "isb\n"
 > +             : [reg] "=r" (cc)
 > +             ::
 > +     );
 > +     return cc;
 > +}
 > +
 > +static void ipi_irq_handler(struct pt_regs *regs __unused)
 > +{
 > +     u32 ack;
 > +     ipi_ready = false;
 > +     ipi_received = true;
 > +     ack = gic_read_iar();
 > +     ipi_acked = true;
 > +     gic_write_eoir(ack);
 > +     ipi_ready = true;
 > +}
 > +
 > +static void ipi_test_secondary_entry(void)
 > +{
 > +     unsigned int timeout = TIMEOUT;
 > +
 > +     debug("secondary core up\n");
 > +
 > +     enum vector v = EL1H_IRQ;
 > +     install_irq_handler(v, ipi_irq_handler);
 > +
 > +     gic_enable_defaults();
 > +
 > +     second_cpu_up = true;
 > +
 > +     debug("secondary initialized vgic\n");
 > +
 > +     while (!first_cpu_ack && timeout--);
 > +     if (!first_cpu_ack) {
 > +             debug("ipi_test: First CPU did not ack wake-up\n");
 > +             exit(1);
 > +     }
 > +
 > +     debug("detected first cpu ack\n");
 > +
 > +     local_irq_enable(); /* Enter small wait-loop */
 > +     ipi_ready = true;
 > +     while (true);
 > +}
 > +
 > +static int test_init(void)
 > +{
 > +     int ret;
 > +     unsigned int timeout = TIMEOUT;
 > +
 > +     ret = gic_init();
 > +     if (!ret) {
 > +             debug("No supported gic present, skipping tests...\n");
 > +             goto out;
 > +     }
 > +
 > +     ipi_ready = false;
 > +
 > +     gic_enable_defaults();
 > +
 > +     debug("starting second CPU\n");
 > +     smp_boot_secondary(1, ipi_test_secondary_entry);
 > +
 > +     while (!second_cpu_up && timeout--); /* Wait for second CPU! */
 > +
 > +     if (!second_cpu_up) {
 > +             debug("ipi_test: timeout waiting for secondary CPU\n");
 > +             ret = 0;
 > +             goto out;
 > +     }
 > +
 > +     debug("detected secondary core up\n");
 > +
 > +     first_cpu_ack = true;
 > +
 > +     printf("Timer Frequency %d Hz (Output in timer count)\n", 
get_cntfrq());
 > +
 > +out:
 > +     return ret;
 > +}
 > +
 > +static unsigned long ipi_test(void)
 > +{
 > +     unsigned int timeout = TIMEOUT;
 > +     unsigned long c1, c2;
 > +
 > +     while (!ipi_ready && timeout--);
 > +     if (!ipi_ready) {
 > +             debug("ipi_test: second core not ready for IPIs\n");
 > +             exit(1);
 > +     }
 > +
 > +     ipi_received = false;
 > +
 > +     c1 = read_cc();
 > +
 > +     gic_ipi_send_single(IPI_IRQ, 1);
 > +
 > +     timeout = TIMEOUT;
 > +     while (!ipi_received && timeout--);
 > +     if (!ipi_received) {
 > +             debug("ipi_test: secondary core never received ipi\n");
 > +             exit(1);
 > +     }
 > +
 > +     c2 = read_cc();
 > +     return CYCLE_COUNT(c1, c2);
 > +}
 > +
 > +
 > +static unsigned long hvc_test(void)
 > +{
 > +     unsigned long c1, c2;
 > +
 > +     c1 = read_cc();
 > +     asm volatile("mov w0, #0x4b000000; hvc #0");
 > +     c2 = read_cc();
 > +     return CYCLE_COUNT(c1, c2);
 > +}
 > +
 > +static void __noop(void)
 > +{
 > +}
 > +
 > +static unsigned long noop_guest(void)
 > +{
 > +     unsigned long c1, c2;
 > +
 > +     c1 = read_cc();
 > +     __noop();
 > +     c2 = read_cc();
 > +     return CYCLE_COUNT(c1, c2);
 > +}
 > +
 > +static unsigned long mmio_read_user(void)
 > +{
 > +     unsigned long c1, c2;
 > +     void *mmio_read_user_addr = (void*) 0x0a000008;
 > +
 > +     /* Measure MMIO exit to QEMU in userspace */
 > +     c1 = read_cc();
 > +     readl(mmio_read_user_addr);
 > +     c2 = read_cc();
 > +     return CYCLE_COUNT(c1, c2);
 > +}
 > +
 > +static unsigned long mmio_read_vgic(void)
 > +{
 > +     unsigned long c1, c2;
 > +     int v = gic_version();
 > +     void *vgic_dist_addr = NULL;
 > +
 > +     if (v == 2)
 > +             vgic_dist_addr = gicv2_dist_base();
 > +     else if (v == 3)
 > +             vgic_dist_addr = gicv3_dist_base();
 > +
 > +     /* Measure MMIO exit to host kernel */
 > +     c1 = read_cc();
 > +     readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
 > +     c2 = read_cc();
 > +     return CYCLE_COUNT(c1, c2);
 > +}
 > +
 > +static unsigned long eoi_test(void)
 > +{
 > +     unsigned long c1, c2;
 > +     int v = gic_version();
 > +     void (*write_eoir)(u32 irqstat) = NULL;
 > +
 > +     u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
 > +
 > +     if (v == 2)
 > +             write_eoir = gicv2_write_eoir;
 > +     else if (v == 3)
 > +             write_eoir = gicv3_write_eoir;
 > +
 > +     c1 = read_cc();
 > +     write_eoir(val);
 > +     c2 = read_cc();
 > +
 > +     return CYCLE_COUNT(c1, c2);
 > +}
 > +
 > +struct exit_test {
 > +     const char *name;
 > +     unsigned long (*test_fn)(void);
 > +     bool run;
 > +};
 > +
 > +static struct exit_test available_tests[] = {
 > +     {"hvc",                hvc_test,           true},
 > +     {"noop_guest",         noop_guest,         true},
 > +     {"mmio_read_user",     mmio_read_user,     true},
 > +     {"mmio_read_vgic",     mmio_read_vgic,     true},
 > +     {"eoi",                eoi_test,           true},
 > +     {"ipi",                ipi_test,           true},
 > +};
 > +
 > +static void loop_test(struct exit_test *test)
 > +{
 > +     unsigned long i, iterations = 32;
 > +     unsigned long sample, cycles;
 > +     unsigned long long min = 0, max = 0;
 > +     const unsigned long long goal = (1ULL << 29);
 > +
 > +     do {
 > +             iterations *= 2;
 > +             cycles = 0;
 > +             for (i = 0; i < iterations; i++) {
 > +                     sample = test->test_fn();
 > +                     if (sample == 0) {
 > +                             /*
 > +                              * If something went wrong or we had an
 > +                              * overflow, don't count that sample.
 > +                              */
 > +                             iterations--;
 > +                             i--;
 > +                             debug("cycle count overflow: %lu\n", 
sample);
 > +                             continue;
 > +                     }
 > +                     cycles += sample;
 > +                     if (min == 0 || min > sample)
 > +                             min = sample;
 > +                     if (max < sample)
 > +                             max = sample;
 > +             }
 > +     } while (cycles < goal);
 > +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
 > +             test->name, cycles / (iterations), min, max);
 > +}
 > +
 > +void kvm_unit_test(void)
 > +{
 > +     int i=0;
 > +     struct exit_test *test;
 > +     for_each_test(test, available_tests, i) {
 > +             if (!test->run)
 > +                     continue;
 > +             loop_test(test);
 > +     }
 > +
 > +     return;
 > +}
 > +
 > +int main(int argc, char **argv)
 > +{
 > +     if (!test_init())
 > +             exit(1);
 > +     kvm_unit_test();
 > +     return 0;
 > +}
 > diff --git a/arm/unittests.cfg b/arm/unittests.cfg
 > index 44b98cf..1d0c4ca 100644
 > --- a/arm/unittests.cfg
 > +++ b/arm/unittests.cfg
 > @@ -116,3 +116,9 @@ file = timer.flat
 >   groups = timer
 >   timeout = 2s
 >   arch = arm64
 > +
 > +# Exit tests
 > +[micro-test]
 > +file = micro-test.flat
 > +smp = 2
 > +groups = micro-test
 >

[-- Attachment #1.2: Type: text/html, Size: 17592 bytes --]

[-- Attachment #2: Type: text/plain, Size: 151 bytes --]

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-15 21:15 ` [kvm-unit-tests PATCH] arm64: add micro test Shih-Wei Li
                     ` (3 preceding siblings ...)
  2017-12-18 22:18   ` Kalra, Ashish
@ 2017-12-18 22:31   ` Kalra, Ashish
  4 siblings, 0 replies; 27+ messages in thread
From: Kalra, Ashish @ 2017-12-18 22:31 UTC (permalink / raw)
  To: Shih-Wei Li, kvm, kvmarm; +Cc: marc.zyngier, pbonzini

Hello Shih-Wei,

This "micro" test is really useful for micro-architectural performance
measurement of KVM/ARM.

But, to get more accurate cycle counts, it will be preferable to compute
counter read (and isb) overhead and deduct it from the final cycle
count, something like this :

+static unsigned long counter_read_overhead;

+static void compute_counter_read_overhead(void)
+{
+       unsigned long c1, c2;
+
+       c1 = read_cc();
+       c2 = read_cc();
+       counter_read_overhead = c2 - c1;
+}

+#define CYCLE_COUNT(c1, c2) \
+       (((c1) > (c2) || ((c1) == (c2))) ? 0 : ((c2) - (c1)) -
counter_read_overhead )
+

static int test_init(void)
{
         int ret;
         unsigned int timeout = TIMEOUT;

+       compute_counter_read_overhead();
..
..

Thanks,
Ashish

On 12/16/2017 2:45 AM, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests for millions of
> times and output their average, minimum and maximum cost in timer
> counts. Instruction barriers are used before and after taking
> timestamps to avoid out-of-order execution or pipelining from
> skewing our measurements.
> 
> The tests we currently support and measure are mostly
> straightforward by the function names and the respective comments.
> For IPI test, we measure the cost of sending IPI from a source
> VCPU to a target VCPU, until the target VCPU receives the IPI.
> 
> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> ---
>   arm/Makefile.common |   1 +
>   arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>   arm/unittests.cfg   |   6 ++
>   3 files changed, 296 insertions(+)
>   create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>   tests-common += $(TEST_DIR)/gic.flat
>   tests-common += $(TEST_DIR)/psci.flat
>   tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>   
>   tests-all = $(tests-common) $(tests)
>   all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..7df2272
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,289 @@
> +#include <util.h>
> +#include <asm/gic.h>
> +
> +static volatile bool second_cpu_up;
> +static volatile bool first_cpu_ack;
> +static volatile bool ipi_acked;
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +#define IPI_IRQ		1
> +
> +#define TIMEOUT (1U << 28)
> +
> +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> +#define for_each_test(_iter, _tests, _tmp) \
> +	for (_tmp = 0, _iter = _tests; \
> +			_tmp < ARR_SIZE(_tests); \
> +			_tmp++, _iter++)
> +
> +#define CYCLE_COUNT(c1, c2) \
> +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> +
> +#define IPI_DEBUG 0
> +
> +#if IPI_DEBUG == 1
> +#define debug(fmt, ...) \
> +	printf("[cpu %d]: " fmt, smp_processor_id(),  ## __VA_ARGS__)
> +#else
> +#define debug(fmt, ...) {}
> +#endif
> +
> +static uint64_t read_cc(void)
> +{
> +	uint64_t cc;
> +	asm volatile(
> +		"isb\n"
> +		"mrs %0, CNTPCT_EL0\n"
> +		"isb\n"
> +		: [reg] "=r" (cc)
> +		::
> +	);
> +	return cc;
> +}
> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	ipi_acked = true;
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +
> +	debug("secondary core up\n");
> +
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +
> +	second_cpu_up = true;
> +
> +	debug("secondary initialized vgic\n");
> +
> +	while (!first_cpu_ack && timeout--);
> +	if (!first_cpu_ack) {
> +		debug("ipi_test: First CPU did not ack wake-up\n");
> +		exit(1);
> +	}
> +
> +	debug("detected first cpu ack\n");
> +
> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);
> +}
> +
> +static int test_init(void)
> +{
> +	int ret;
> +	unsigned int timeout = TIMEOUT;
> +
> +	ret = gic_init();
> +	if (!ret) {
> +		debug("No supported gic present, skipping tests...\n");
> +		goto out;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +
> +	debug("starting second CPU\n");
> +	smp_boot_secondary(1, ipi_test_secondary_entry);
> +
> +	while (!second_cpu_up && timeout--); /* Wait for second CPU! */
> +
> +	if (!second_cpu_up) {
> +		debug("ipi_test: timeout waiting for secondary CPU\n");
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	debug("detected secondary core up\n");
> +
> +	first_cpu_ack = true;
> +
> +	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
> +
> +out:
> +	return ret;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +	unsigned long c1, c2;
> +
> +	while (!ipi_ready && timeout--);
> +	if (!ipi_ready) {
> +		debug("ipi_test: second core not ready for IPIs\n");
> +		exit(1);
> +	}
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	timeout = TIMEOUT;
> +	while (!ipi_received && timeout--);
> +	if (!ipi_received) {
> +		debug("ipi_test: secondary core never received ipi\n");
> +		exit(1);
> +	}
> +
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +
> +static unsigned long hvc_test(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0");
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static void __noop(void)
> +{
> +}
> +
> +static unsigned long noop_guest(void)
> +{
> +	unsigned long c1, c2;
> +
> +	c1 = read_cc();
> +	__noop();
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	unsigned long c1, c2;
> +	void *mmio_read_user_addr = (void*) 0x0a000008;
> +
> +	/* Measure MMIO exit to QEMU in userspace */
> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void *vgic_dist_addr = NULL;
> +
> +	if (v == 2)
> +		vgic_dist_addr = gicv2_dist_base();
> +	else if (v == 3)
> +		vgic_dist_addr = gicv3_dist_base();
> +
> +	/* Measure MMIO exit to host kernel */
> +	c1 = read_cc();
> +	readl(vgic_dist_addr + 0x8); /* Read GICD_IIDR */
> +	c2 = read_cc();
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +static unsigned long eoi_test(void)
> +{
> +	unsigned long c1, c2;
> +	int v = gic_version();
> +	void (*write_eoir)(u32 irqstat) = NULL;
> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	if (v == 2)
> +		write_eoir = gicv2_write_eoir;
> +	else if (v == 3)
> +		write_eoir = gicv3_write_eoir;
> +
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return CYCLE_COUNT(c1, c2);
> +}
> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test available_tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"noop_guest",         noop_guest,         true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long long min = 0, max = 0;
> +	const unsigned long long goal = (1ULL << 29);
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		for (i = 0; i < iterations; i++) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				/*
> +				 * If something went wrong or we had an
> +				 * overflow, don't count that sample.
> +				 */
> +				iterations--;
> +				i--;
> +				debug("cycle count overflow: %lu\n", sample);
> +				continue;
> +			}
> +			cycles += sample;
> +			if (min == 0 || min > sample)
> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +		}
> +	} while (cycles < goal);
> +	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> +		test->name, cycles / (iterations), min, max);
> +}
> +
> +void kvm_unit_test(void)
> +{
> +	int i=0;
> +	struct exit_test *test;
> +	for_each_test(test, available_tests, i) {
> +		if (!test->run)
> +			continue;
> +		loop_test(test);
> +	}
> +
> +	return;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	if (!test_init())
> +		exit(1);
> +	kvm_unit_test();
> +	return 0;
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..1d0c4ca 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,9 @@ file = timer.flat
>   groups = timer
>   timeout = 2s
>   arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = micro-test
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] Support micro operation measurement on arm64
  2017-12-18 20:58   ` Shih-Wei Li
@ 2017-12-19  9:06     ` Christoffer Dall
  2017-12-19 12:11       ` Andrew Jones
  0 siblings, 1 reply; 27+ messages in thread
From: Christoffer Dall @ 2017-12-19  9:06 UTC (permalink / raw)
  To: Shih-Wei Li; +Cc: Marc Zyngier, Paolo Bonzini, kvmarm, kvm

On Mon, Dec 18, 2017 at 03:58:49PM -0500, Shih-Wei Li wrote:
> On Mon, Dec 18, 2017 at 1:14 PM, Andrew Jones <drjones@redhat.com> wrote:
> > Hi Shih-Wei,
> >
> > Thanks for doing this! Porting Christoffer's selftests to kvm-unit-tests
> > has been on the kvm-unit-tests' TODO list since it was first introduced.
> >
> > On Fri, Dec 15, 2017 at 04:15:38PM -0500, Shih-Wei Li wrote:
> >> The patch provides support for quantifying the cost of micro level
> >> operations on arm64 hardware. The supported operations include hypercall,
> >> mmio accesses, EOI virtual interrupt, and IPI send. Measurements are
> >> currently obtained using timer counters. Further modifications in KVM
> >> will be required to support timestamping using cycle counters, as KVM
> >> now disables accesses to the PMU counters from the VM.
> >
> > KVM only disables access when userspace tells it to, which it doesn't
> > do by default. Is there something else missing keeping the PMU counters
> > from being used?
> 
> Thanks for the feedback! What I meant by PMU counters here was for
> "CPU cycle counter" specifically. I'm not aware of a way to enable the
> PMU cycle counter from QEMU, did I miss something here?
> 

We always set MDSCR_EL2.TPM, meaning that you cannot reliably read a
cycle counter in the guest.

If userspace tells KVM to emulate a PMU, you will get an emulated result
when reading the cycle counter from a guest, instead of an undefined
exception, but you will never access the cycle counter directly.

Here we want to measure round-trip time from the VM through the
hypervisor, and we don't currently count cycles in EL2 with the PMU
emulation, and even if we did, we'd be counting additional round-trip
times, so if the goal is to get more precision than the arch counters,
this won't help you.

What we did for the papers was to hack KVM to not set the TPM bit and
jut read the cycle counter directly, but this isn't safe, as the guest
then gets full access to the PMU and can mess with the host.

If it's crucial to measure individual operations on a cycle-accurate
level, then our options are pretty much to either patch KVM when doing
so, or introduce a scary command line parameter, but I'm not thrilled
by the idea.

> >
> >>
> >> We iterate each of the tests for millions of times and output their
> >> average, minimum and maximum cost in timer counts. Instruction barriers
> >
> > Can we reduce the number of iterations and still get valid results? The
> > test takes so long that of all the platforms I tested it on timed out
> > before it completed, except seattle. The default timeout for kvm-unit-
> > tests is 90 seconds. I'd rather a unit test execute in much shorter time
> > than that too, in order to keep people encouraged to run them frequently.
> > If these tests must run a long time, then I think we should add them to
> > the nodefault group.
> 
> I think it's possible to reduce the timeout without losing accuracy. I
> can look into this further.
> 

I think just running them for 100,000 or maximum 1,000,000 times should
be sufficient.  Alternatively an option to run it for a long time could
be provided?

> >
> >> were used before and after taking timestamps to avoid out-of-order
> >> execution or pipelining from skewing our measurements.
> >>
> >> To improve precision in the measurements, one should consider pinning
> >> each VCPU to a specific physical CPU (PCPU) and ensure no other task
> >> could run on that PCPU to skew the results. This can be achieved by
> >> enabling QMP server in the QEMU command in unittest.cfg for micro test,
> >> allowing a client program to get the thread_id for each VCPU thread
> >> from the QMP server. Based on the information, the client program can
> >> then pin the corresponding VCPUs to dedicated PCPUs and isolate
> >> interrupts and tasks from those PCPUs.
> >
> > To isolate the CPUs one would need to boot the host with the isolcpus
> > kernel command line option. Pinning the VCPUs is pretty easy though,
> > so we could provide a script that does that in kvm-unit-tests and then
> > always use it for this test. The script could also warn if we're
> > pinning to CPUs that haven't been isolated.
> >
> 
> My intention was to support VCPU pinning as an optional feature,
> so the users that care about extra precision can add qmp option in
> QEMU config and run the script to pin VCPUs. Otherwise, the test can
> be conducted in a fashion similar to what's done in vmexit on x86.
> 

If we can script VCPU pinning, I think that's preferred.  In our
experiments we never actually saw measurable differences between
isolcpus and simple vcpu pinning when using a high enough number of
iterations, except when looking at things like jitter, which we don't do
for these tests.

That notwithstanding, I think it's an optional feature that can be added
later.

> >>
> >> The patch has been tested on arm64 hardware including AMD Seattle and
> >> ThunderX2, which has GICv2 and GICv3 respectively.
> >
> > I tried thunderx2, amberwing, mustang, and seattle. Only seattle
> > completed, the rest timed out.
> 
> I have only tested the code by invoking test directly using make
> standalone like the following. I did notice that it took ~90 seconds
> to finish the test itself.
> ./"tests/micro-cost"
> 

Let's try to bring this down for the next iteration.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-18 17:31   ` Yury Norov
  2017-12-18 21:32     ` Shih-Wei Li
@ 2017-12-19  9:12     ` Christoffer Dall
  2017-12-19 10:05       ` Yury Norov
  1 sibling, 1 reply; 27+ messages in thread
From: Christoffer Dall @ 2017-12-19  9:12 UTC (permalink / raw)
  To: Yury Norov
  Cc: Goutham, Sunil, kvm, marc.zyngier, Cherian, Linu, pbonzini,
	kvmarm, Shih-Wei Li

On Mon, Dec 18, 2017 at 08:31:21PM +0300, Yury Norov wrote:
> On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
> > Here we provide the support for measuring various micro level
> > operations on arm64. We iterate each of the tests for millions of
> > times and output their average, minimum and maximum cost in timer
> > counts. Instruction barriers are used before and after taking
> > timestamps to avoid out-of-order execution or pipelining from
> > skewing our measurements.
> > 
> > The tests we currently support and measure are mostly
> > straightforward by the function names and the respective comments.
> > For IPI test, we measure the cost of sending IPI from a source
> > VCPU to a target VCPU, until the target VCPU receives the IPI.
> > 
> > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> > ---
> >  arm/Makefile.common |   1 +
> >  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  arm/unittests.cfg   |   6 ++
> >  3 files changed, 296 insertions(+)
> >  create mode 100644 arm/micro-test.c
> > 
> > diff --git a/arm/Makefile.common b/arm/Makefile.common
> > index 0a039cf..c7d5c27 100644
> > --- a/arm/Makefile.common
> > +++ b/arm/Makefile.common
> > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
> >  tests-common += $(TEST_DIR)/gic.flat
> >  tests-common += $(TEST_DIR)/psci.flat
> >  tests-common += $(TEST_DIR)/sieve.flat
> > +tests-common += $(TEST_DIR)/micro-test.flat
> >  
> >  tests-all = $(tests-common) $(tests)
> >  all: directories $(tests-all)
> > diff --git a/arm/micro-test.c b/arm/micro-test.c
> > new file mode 100644
> > index 0000000..7df2272
> > --- /dev/null
> > +++ b/arm/micro-test.c
> > @@ -0,0 +1,289 @@
> > +#include <util.h>
> > +#include <asm/gic.h>
> > +
> > +static volatile bool second_cpu_up;
> > +static volatile bool first_cpu_ack;
> > +static volatile bool ipi_acked;
> > +static volatile bool ipi_received;
> > +static volatile bool ipi_ready;
> > +#define IPI_IRQ		1
> > +
> > +#define TIMEOUT (1U << 28)
> > +
> > +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> > +#define for_each_test(_iter, _tests, _tmp) \
> > +	for (_tmp = 0, _iter = _tests; \
> > +			_tmp < ARR_SIZE(_tests); \
> > +			_tmp++, _iter++)
> > +
> > +#define CYCLE_COUNT(c1, c2) \
> > +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> 
> Is my understanding correct that this is overflow protection?
> c1 and c2 are 64-bit values. To overflow them you need 58 years
> at 1G CPU freq.
> 

That's assuming your cycle counter starts at 0, and that nobody
programmed it near the overflow value to get an overflow interrupt.

So if you get rid of this, you have to make sure the host never plays
with the cycle counter behind your back, and that you've initialized it
to zero.

FWIW, when we first wrote some version of this code 5+ years ago in a
different world, we actually saw overflow as a result of the way we ran
this code.  That may have changed, but it's not as clear cut.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-19  9:12     ` Christoffer Dall
@ 2017-12-19 10:05       ` Yury Norov
  2017-12-19 13:04         ` Christoffer Dall
  0 siblings, 1 reply; 27+ messages in thread
From: Yury Norov @ 2017-12-19 10:05 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Shih-Wei Li, kvm, kvmarm, marc.zyngier, drjones, pbonzini, Kalra,
	Ashish, Goutham, Sunil, Cherian, Linu

On Tue, Dec 19, 2017 at 10:12:00AM +0100, Christoffer Dall wrote:
> On Mon, Dec 18, 2017 at 08:31:21PM +0300, Yury Norov wrote:
> > On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
> > > Here we provide the support for measuring various micro level
> > > operations on arm64. We iterate each of the tests for millions of
> > > times and output their average, minimum and maximum cost in timer
> > > counts. Instruction barriers are used before and after taking
> > > timestamps to avoid out-of-order execution or pipelining from
> > > skewing our measurements.
> > > 
> > > The tests we currently support and measure are mostly
> > > straightforward by the function names and the respective comments.
> > > For IPI test, we measure the cost of sending IPI from a source
> > > VCPU to a target VCPU, until the target VCPU receives the IPI.
> > > 
> > > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> > > ---
> > >  arm/Makefile.common |   1 +
> > >  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  arm/unittests.cfg   |   6 ++
> > >  3 files changed, 296 insertions(+)
> > >  create mode 100644 arm/micro-test.c
> > > 
> > > diff --git a/arm/Makefile.common b/arm/Makefile.common
> > > index 0a039cf..c7d5c27 100644
> > > --- a/arm/Makefile.common
> > > +++ b/arm/Makefile.common
> > > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
> > >  tests-common += $(TEST_DIR)/gic.flat
> > >  tests-common += $(TEST_DIR)/psci.flat
> > >  tests-common += $(TEST_DIR)/sieve.flat
> > > +tests-common += $(TEST_DIR)/micro-test.flat
> > >  
> > >  tests-all = $(tests-common) $(tests)
> > >  all: directories $(tests-all)
> > > diff --git a/arm/micro-test.c b/arm/micro-test.c
> > > new file mode 100644
> > > index 0000000..7df2272
> > > --- /dev/null
> > > +++ b/arm/micro-test.c
> > > @@ -0,0 +1,289 @@
> > > +#include <util.h>
> > > +#include <asm/gic.h>
> > > +
> > > +static volatile bool second_cpu_up;
> > > +static volatile bool first_cpu_ack;
> > > +static volatile bool ipi_acked;
> > > +static volatile bool ipi_received;
> > > +static volatile bool ipi_ready;
> > > +#define IPI_IRQ		1
> > > +
> > > +#define TIMEOUT (1U << 28)
> > > +
> > > +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> > > +#define for_each_test(_iter, _tests, _tmp) \
> > > +	for (_tmp = 0, _iter = _tests; \
> > > +			_tmp < ARR_SIZE(_tests); \
> > > +			_tmp++, _iter++)
> > > +
> > > +#define CYCLE_COUNT(c1, c2) \
> > > +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> > 
> > Is my understanding correct that this is overflow protection?
> > c1 and c2 are 64-bit values. To overflow them you need 58 years
> > at 1G CPU freq.
> > 
> 
> That's assuming your cycle counter starts at 0, and that nobody
> programmed it near the overflow value to get an overflow interrupt.
> 
> So if you get rid of this, you have to make sure the host never plays
> with the cycle counter behind your back, and that you've initialized it
> to zero.

Ah, now I understand it. But if there's no evil intention to break the
test, there's single chance to overflow in 584 years. And if it happened,
isn't it simpler to run test again? 

If host shifts cycle counter during the test, the whole test becomes useless,
I think.

> FWIW, when we first wrote some version of this code 5+ years ago in a
> different world, we actually saw overflow as a result of the way we ran
> this code.  That may have changed, but it's not as clear cut.

Maybe 5+ years ago it was ARM32, and overflow was the result of
erroneous cast to unsigned long? (I guess it because the cast exists
in this patch.)

Anyway, this is minor. I'll happily take the test with or without this
hunk. But if you / Shih-Wei decide to save it, the comment would be
appreciated.

Yury

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-18 21:58     ` Shih-Wei Li
@ 2017-12-19 12:00       ` Andrew Jones
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Jones @ 2017-12-19 12:00 UTC (permalink / raw)
  To: Shih-Wei Li; +Cc: kvm, kvmarm, Christoffer Dall, Marc Zyngier, Paolo Bonzini

On Mon, Dec 18, 2017 at 04:58:53PM -0500, Shih-Wei Li wrote:
> On Mon, Dec 18, 2017 at 2:10 PM, Andrew Jones <drjones@redhat.com> wrote:
> >> +static void loop_test(struct exit_test *test)
> >> +{
> >> +     unsigned long i, iterations = 32;
> >> +     unsigned long sample, cycles;
> >> +     unsigned long long min = 0, max = 0;
> >> +     const unsigned long long goal = (1ULL << 29);
> >> +
> >> +     do {
> >> +             iterations *= 2;
> >> +             cycles = 0;
> >> +             for (i = 0; i < iterations; i++) {
> >> +                     sample = test->test_fn();
> >> +                     if (sample == 0) {
> >> +                             /*
> >> +                              * If something went wrong or we had an
> >> +                              * overflow, don't count that sample.
> >> +                              */
> >> +                             iterations--;
> >> +                             i--;
> >
> > I'd prefer incrementing a different counter after the sample==0 test /
> > continue, than to decrement both i and iterations.
> 
> I'm not sure about the comment above. Could you please substantiate
> how a new counter should work here?

It doesn't really matter, but how about

 do {
     iterations *= 2;
     cycles = 0;
     i = 0;
     while (i < iterations) {
         sample = test->test_fn();
         if (sample == 0)
             continue;
         cycles += sample;
         ...
         ++i;
     }
 } while (cycles < goal);


> 
> >
> >> +                             debug("cycle count overflow: %lu\n", sample);
> >> +                             continue;
> >> +                     }
> >> +                     cycles += sample;
> >> +                     if (min == 0 || min > sample)
> >> +                             min = sample;
> >> +                     if (max < sample)
> >> +                             max = sample;
> >> +             }
> >> +     } while (cycles < goal);
> >> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> >> +             test->name, cycles / (iterations), min, max);
> >

Thanks,
drew

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] Support micro operation measurement on arm64
  2017-12-19  9:06     ` Christoffer Dall
@ 2017-12-19 12:11       ` Andrew Jones
  2017-12-19 13:07         ` Christoffer Dall
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Jones @ 2017-12-19 12:11 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Shih-Wei Li, kvm, kvmarm, Marc Zyngier, Paolo Bonzini

On Tue, Dec 19, 2017 at 10:06:20AM +0100, Christoffer Dall wrote:
> On Mon, Dec 18, 2017 at 03:58:49PM -0500, Shih-Wei Li wrote:
> > On Mon, Dec 18, 2017 at 1:14 PM, Andrew Jones <drjones@redhat.com> wrote:
> > > Hi Shih-Wei,
> > >
> > > Thanks for doing this! Porting Christoffer's selftests to kvm-unit-tests
> > > has been on the kvm-unit-tests' TODO list since it was first introduced.
> > >
> > > On Fri, Dec 15, 2017 at 04:15:38PM -0500, Shih-Wei Li wrote:
> > >> The patch provides support for quantifying the cost of micro level
> > >> operations on arm64 hardware. The supported operations include hypercall,
> > >> mmio accesses, EOI virtual interrupt, and IPI send. Measurements are
> > >> currently obtained using timer counters. Further modifications in KVM
> > >> will be required to support timestamping using cycle counters, as KVM
> > >> now disables accesses to the PMU counters from the VM.
> > >
> > > KVM only disables access when userspace tells it to, which it doesn't
> > > do by default. Is there something else missing keeping the PMU counters
> > > from being used?
> > 
> > Thanks for the feedback! What I meant by PMU counters here was for
> > "CPU cycle counter" specifically. I'm not aware of a way to enable the
> > PMU cycle counter from QEMU, did I miss something here?
> > 
> 
> We always set MDSCR_EL2.TPM, meaning that you cannot reliably read a
> cycle counter in the guest.
> 
> If userspace tells KVM to emulate a PMU, you will get an emulated result
> when reading the cycle counter from a guest, instead of an undefined
> exception, but you will never access the cycle counter directly.

Ah, of course. Real vs. emulated access makes a big difference here.

> 
> Here we want to measure round-trip time from the VM through the
> hypervisor, and we don't currently count cycles in EL2 with the PMU
> emulation, and even if we did, we'd be counting additional round-trip
> times, so if the goal is to get more precision than the arch counters,
> this won't help you.
> 
> What we did for the papers was to hack KVM to not set the TPM bit and
> jut read the cycle counter directly, but this isn't safe, as the guest
> then gets full access to the PMU and can mess with the host.
> 
> If it's crucial to measure individual operations on a cycle-accurate
> level, then our options are pretty much to either patch KVM when doing
> so, or introduce a scary command line parameter, but I'm not thrilled
> by the idea.
> 
> > >
> > >>
> > >> We iterate each of the tests for millions of times and output their
> > >> average, minimum and maximum cost in timer counts. Instruction barriers
> > >
> > > Can we reduce the number of iterations and still get valid results? The
> > > test takes so long that of all the platforms I tested it on timed out
> > > before it completed, except seattle. The default timeout for kvm-unit-
> > > tests is 90 seconds. I'd rather a unit test execute in much shorter time
> > > than that too, in order to keep people encouraged to run them frequently.
> > > If these tests must run a long time, then I think we should add them to
> > > the nodefault group.
> > 
> > I think it's possible to reduce the timeout without losing accuracy. I
> > can look into this further.
> > 
> 
> I think just running them for 100,000 or maximum 1,000,000 times should
> be sufficient.  Alternatively an option to run it for a long time could
> be provided?

Providing a number of iterations option or something, that has a
reasonable default, sounds good to me.

> 
> > >
> > >> were used before and after taking timestamps to avoid out-of-order
> > >> execution or pipelining from skewing our measurements.
> > >>
> > >> To improve precision in the measurements, one should consider pinning
> > >> each VCPU to a specific physical CPU (PCPU) and ensure no other task
> > >> could run on that PCPU to skew the results. This can be achieved by
> > >> enabling QMP server in the QEMU command in unittest.cfg for micro test,
> > >> allowing a client program to get the thread_id for each VCPU thread
> > >> from the QMP server. Based on the information, the client program can
> > >> then pin the corresponding VCPUs to dedicated PCPUs and isolate
> > >> interrupts and tasks from those PCPUs.
> > >
> > > To isolate the CPUs one would need to boot the host with the isolcpus
> > > kernel command line option. Pinning the VCPUs is pretty easy though,
> > > so we could provide a script that does that in kvm-unit-tests and then
> > > always use it for this test. The script could also warn if we're
> > > pinning to CPUs that haven't been isolated.
> > >
> > 
> > My intention was to support VCPU pinning as an optional feature,
> > so the users that care about extra precision can add qmp option in
> > QEMU config and run the script to pin VCPUs. Otherwise, the test can
> > be conducted in a fashion similar to what's done in vmexit on x86.
> > 
> 
> If we can script VCPU pinning, I think that's preferred.  In our
> experiments we never actually saw measurable differences between
> isolcpus and simple vcpu pinning when using a high enough number of
> iterations, except when looking at things like jitter, which we don't do
> for these tests.
> 
> That notwithstanding, I think it's an optional feature that can be added
> later.

Yeah, let's do it later, but I think doing it makes enough sense that
it's worth writing more bash.

> 
> > >>
> > >> The patch has been tested on arm64 hardware including AMD Seattle and
> > >> ThunderX2, which has GICv2 and GICv3 respectively.
> > >
> > > I tried thunderx2, amberwing, mustang, and seattle. Only seattle
> > > completed, the rest timed out.
> > 
> > I have only tested the code by invoking test directly using make
> > standalone like the following. I did notice that it took ~90 seconds
> > to finish the test itself.
> > ./"tests/micro-cost"

standalone still uses timeout with 90 seconds. So your hardware was just
faster than mine, I guess :-)

> > 
> 
> Let's try to bring this down for the next iteration.
> 
> Thanks,
> -Christoffer

Thanks,
drew

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-19 10:05       ` Yury Norov
@ 2017-12-19 13:04         ` Christoffer Dall
  0 siblings, 0 replies; 27+ messages in thread
From: Christoffer Dall @ 2017-12-19 13:04 UTC (permalink / raw)
  To: Yury Norov
  Cc: Goutham, Sunil, kvm, marc.zyngier, Cherian, Linu, pbonzini,
	kvmarm, Shih-Wei Li

On Tue, Dec 19, 2017 at 01:05:21PM +0300, Yury Norov wrote:
> On Tue, Dec 19, 2017 at 10:12:00AM +0100, Christoffer Dall wrote:
> > On Mon, Dec 18, 2017 at 08:31:21PM +0300, Yury Norov wrote:
> > > On Fri, Dec 15, 2017 at 04:15:39PM -0500, Shih-Wei Li wrote:
> > > > Here we provide the support for measuring various micro level
> > > > operations on arm64. We iterate each of the tests for millions of
> > > > times and output their average, minimum and maximum cost in timer
> > > > counts. Instruction barriers are used before and after taking
> > > > timestamps to avoid out-of-order execution or pipelining from
> > > > skewing our measurements.
> > > > 
> > > > The tests we currently support and measure are mostly
> > > > straightforward by the function names and the respective comments.
> > > > For IPI test, we measure the cost of sending IPI from a source
> > > > VCPU to a target VCPU, until the target VCPU receives the IPI.
> > > > 
> > > > Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> > > > ---
> > > >  arm/Makefile.common |   1 +
> > > >  arm/micro-test.c    | 289 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  arm/unittests.cfg   |   6 ++
> > > >  3 files changed, 296 insertions(+)
> > > >  create mode 100644 arm/micro-test.c
> > > > 
> > > > diff --git a/arm/Makefile.common b/arm/Makefile.common
> > > > index 0a039cf..c7d5c27 100644
> > > > --- a/arm/Makefile.common
> > > > +++ b/arm/Makefile.common
> > > > @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
> > > >  tests-common += $(TEST_DIR)/gic.flat
> > > >  tests-common += $(TEST_DIR)/psci.flat
> > > >  tests-common += $(TEST_DIR)/sieve.flat
> > > > +tests-common += $(TEST_DIR)/micro-test.flat
> > > >  
> > > >  tests-all = $(tests-common) $(tests)
> > > >  all: directories $(tests-all)
> > > > diff --git a/arm/micro-test.c b/arm/micro-test.c
> > > > new file mode 100644
> > > > index 0000000..7df2272
> > > > --- /dev/null
> > > > +++ b/arm/micro-test.c
> > > > @@ -0,0 +1,289 @@
> > > > +#include <util.h>
> > > > +#include <asm/gic.h>
> > > > +
> > > > +static volatile bool second_cpu_up;
> > > > +static volatile bool first_cpu_ack;
> > > > +static volatile bool ipi_acked;
> > > > +static volatile bool ipi_received;
> > > > +static volatile bool ipi_ready;
> > > > +#define IPI_IRQ		1
> > > > +
> > > > +#define TIMEOUT (1U << 28)
> > > > +
> > > > +#define ARR_SIZE(_x) ((int)(sizeof(_x) / sizeof(_x[0])))
> > > > +#define for_each_test(_iter, _tests, _tmp) \
> > > > +	for (_tmp = 0, _iter = _tests; \
> > > > +			_tmp < ARR_SIZE(_tests); \
> > > > +			_tmp++, _iter++)
> > > > +
> > > > +#define CYCLE_COUNT(c1, c2) \
> > > > +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> > > 
> > > Is my understanding correct that this is overflow protection?
> > > c1 and c2 are 64-bit values. To overflow them you need 58 years
> > > at 1G CPU freq.
> > > 
> > 
> > That's assuming your cycle counter starts at 0, and that nobody
> > programmed it near the overflow value to get an overflow interrupt.
> > 
> > So if you get rid of this, you have to make sure the host never plays
> > with the cycle counter behind your back, and that you've initialized it
> > to zero.
> 
> Ah, now I understand it. But if there's no evil intention to break the
> test, there's single chance to overflow in 584 years. And if it happened,
> isn't it simpler to run test again? 

I believe we used this to throw away results where there was an
overflow, which is nicer to the user than leaving them guessing 'this is
a very high number, I better run the test again'.

-Christoffer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] Support micro operation measurement on arm64
  2017-12-19 12:11       ` Andrew Jones
@ 2017-12-19 13:07         ` Christoffer Dall
  2017-12-20 16:22           ` Andrew Jones
  0 siblings, 1 reply; 27+ messages in thread
From: Christoffer Dall @ 2017-12-19 13:07 UTC (permalink / raw)
  To: Andrew Jones; +Cc: Shih-Wei Li, kvm, kvmarm, Marc Zyngier, Paolo Bonzini

On Tue, Dec 19, 2017 at 01:11:09PM +0100, Andrew Jones wrote:
> On Tue, Dec 19, 2017 at 10:06:20AM +0100, Christoffer Dall wrote:
> > On Mon, Dec 18, 2017 at 03:58:49PM -0500, Shih-Wei Li wrote:
> > > On Mon, Dec 18, 2017 at 1:14 PM, Andrew Jones <drjones@redhat.com> wrote:
> > > > Hi Shih-Wei,
> > > >
> > > > Thanks for doing this! Porting Christoffer's selftests to kvm-unit-tests
> > > > has been on the kvm-unit-tests' TODO list since it was first introduced.
> > > >
> > > > On Fri, Dec 15, 2017 at 04:15:38PM -0500, Shih-Wei Li wrote:
> > > >> The patch provides support for quantifying the cost of micro level
> > > >> operations on arm64 hardware. The supported operations include hypercall,
> > > >> mmio accesses, EOI virtual interrupt, and IPI send. Measurements are
> > > >> currently obtained using timer counters. Further modifications in KVM
> > > >> will be required to support timestamping using cycle counters, as KVM
> > > >> now disables accesses to the PMU counters from the VM.
> > > >
> > > > KVM only disables access when userspace tells it to, which it doesn't
> > > > do by default. Is there something else missing keeping the PMU counters
> > > > from being used?
> > > 
> > > Thanks for the feedback! What I meant by PMU counters here was for
> > > "CPU cycle counter" specifically. I'm not aware of a way to enable the
> > > PMU cycle counter from QEMU, did I miss something here?
> > > 
> > 
> > We always set MDSCR_EL2.TPM, meaning that you cannot reliably read a
> > cycle counter in the guest.
> > 
> > If userspace tells KVM to emulate a PMU, you will get an emulated result
> > when reading the cycle counter from a guest, instead of an undefined
> > exception, but you will never access the cycle counter directly.
> 
> Ah, of course. Real vs. emulated access makes a big difference here.
> 
> > 
> > Here we want to measure round-trip time from the VM through the
> > hypervisor, and we don't currently count cycles in EL2 with the PMU
> > emulation, and even if we did, we'd be counting additional round-trip
> > times, so if the goal is to get more precision than the arch counters,
> > this won't help you.
> > 
> > What we did for the papers was to hack KVM to not set the TPM bit and
> > jut read the cycle counter directly, but this isn't safe, as the guest
> > then gets full access to the PMU and can mess with the host.
> > 
> > If it's crucial to measure individual operations on a cycle-accurate
> > level, then our options are pretty much to either patch KVM when doing
> > so, or introduce a scary command line parameter, but I'm not thrilled
> > by the idea.
> > 
> > > >
> > > >>
> > > >> We iterate each of the tests for millions of times and output their
> > > >> average, minimum and maximum cost in timer counts. Instruction barriers
> > > >
> > > > Can we reduce the number of iterations and still get valid results? The
> > > > test takes so long that of all the platforms I tested it on timed out
> > > > before it completed, except seattle. The default timeout for kvm-unit-
> > > > tests is 90 seconds. I'd rather a unit test execute in much shorter time
> > > > than that too, in order to keep people encouraged to run them frequently.
> > > > If these tests must run a long time, then I think we should add them to
> > > > the nodefault group.
> > > 
> > > I think it's possible to reduce the timeout without losing accuracy. I
> > > can look into this further.
> > > 
> > 
> > I think just running them for 100,000 or maximum 1,000,000 times should
> > be sufficient.  Alternatively an option to run it for a long time could
> > be provided?
> 
> Providing a number of iterations option or something, that has a
> reasonable default, sounds good to me.
> 
> > 
> > > >
> > > >> were used before and after taking timestamps to avoid out-of-order
> > > >> execution or pipelining from skewing our measurements.
> > > >>
> > > >> To improve precision in the measurements, one should consider pinning
> > > >> each VCPU to a specific physical CPU (PCPU) and ensure no other task
> > > >> could run on that PCPU to skew the results. This can be achieved by
> > > >> enabling QMP server in the QEMU command in unittest.cfg for micro test,
> > > >> allowing a client program to get the thread_id for each VCPU thread
> > > >> from the QMP server. Based on the information, the client program can
> > > >> then pin the corresponding VCPUs to dedicated PCPUs and isolate
> > > >> interrupts and tasks from those PCPUs.
> > > >
> > > > To isolate the CPUs one would need to boot the host with the isolcpus
> > > > kernel command line option. Pinning the VCPUs is pretty easy though,
> > > > so we could provide a script that does that in kvm-unit-tests and then
> > > > always use it for this test. The script could also warn if we're
> > > > pinning to CPUs that haven't been isolated.
> > > >
> > > 
> > > My intention was to support VCPU pinning as an optional feature,
> > > so the users that care about extra precision can add qmp option in
> > > QEMU config and run the script to pin VCPUs. Otherwise, the test can
> > > be conducted in a fashion similar to what's done in vmexit on x86.
> > > 
> > 
> > If we can script VCPU pinning, I think that's preferred.  In our
> > experiments we never actually saw measurable differences between
> > isolcpus and simple vcpu pinning when using a high enough number of
> > iterations, except when looking at things like jitter, which we don't do
> > for these tests.
> > 
> > That notwithstanding, I think it's an optional feature that can be added
> > later.
> 
> Yeah, let's do it later, but I think doing it makes enough sense that
> it's worth writing more bash.
> 

Agreed.

> > 
> > > >>
> > > >> The patch has been tested on arm64 hardware including AMD Seattle and
> > > >> ThunderX2, which has GICv2 and GICv3 respectively.
> > > >
> > > > I tried thunderx2, amberwing, mustang, and seattle. Only seattle
> > > > completed, the rest timed out.
> > > 
> > > I have only tested the code by invoking test directly using make
> > > standalone like the following. I did notice that it took ~90 seconds
> > > to finish the test itself.
> > > ./"tests/micro-cost"
> 
> standalone still uses timeout with 90 seconds. So your hardware was just
> faster than mine, I guess :-)
> 

[indentation confusion?]

You're responding to something Shih-Wei wrote here, but I didn't
understand Shih-Wei's answer, and I think he ran the work on Seattle, so
not sure what the difference was.

Anyway, let's have it execute more fast by default as you suggest.

-Christoffer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] Support micro operation measurement on arm64
  2017-12-19 13:07         ` Christoffer Dall
@ 2017-12-20 16:22           ` Andrew Jones
  2017-12-21 11:31             ` Christoffer Dall
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Jones @ 2017-12-20 16:22 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Shih-Wei Li, kvm, kvmarm, Marc Zyngier, Paolo Bonzini

On Tue, Dec 19, 2017 at 02:07:06PM +0100, Christoffer Dall wrote:
> On Tue, Dec 19, 2017 at 01:11:09PM +0100, Andrew Jones wrote:
> > On Tue, Dec 19, 2017 at 10:06:20AM +0100, Christoffer Dall wrote:
> > > On Mon, Dec 18, 2017 at 03:58:49PM -0500, Shih-Wei Li wrote:
> > > > 
> > > > I have only tested the code by invoking test directly using make
> > > > standalone like the following. I did notice that it took ~90 seconds
> > > > to finish the test itself.
> > > > ./"tests/micro-cost"
> > 
> > standalone still uses timeout with 90 seconds. So your hardware was just
> > faster than mine, I guess :-)
> > 
> 
> [indentation confusion?]

Actually I was just lazily replying to both you and Shih-Wei in the same
mail.

> 
> You're responding to something Shih-Wei wrote here, but I didn't
> understand Shih-Wei's answer, and I think he ran the work on Seattle, so
> not sure what the difference was.

It turns out the problem was just the opposite of what I say above. My
hardware must be faster. While testing v2 of the patch I still got
timeouts, so I decided to actually try to figure out what's going on.
The eoi_test() test is so fast that c1 always equals c2, so we always
assume an overflow occurred and keep trying to get a better sample.

I guess we need two changes to the patch. We need to handle c1 == c2
differently, probably just assume the measurement is smaller than the
timer frequency and report 1 tick instead of zero. We should also
have a way out of loop_test() when the test is constantly returning
zero.

Thanks,
drew

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-18 19:10   ` Andrew Jones
  2017-12-18 21:58     ` Shih-Wei Li
@ 2017-12-20 17:00     ` Andrew Jones
  2018-05-01 14:57       ` Kalra, Ashish
  1 sibling, 1 reply; 27+ messages in thread
From: Andrew Jones @ 2017-12-20 17:00 UTC (permalink / raw)
  To: Shih-Wei Li; +Cc: kvm, kvmarm, christoffer.dall, marc.zyngier, pbonzini

On Mon, Dec 18, 2017 at 08:10:37PM +0100, Andrew Jones wrote:
> I just tried on TCG now. It doesn't run. It gets
> 
> Timer Frequency 62500000 Hz (Output in timer count)
> Unhandled exception ec=0 (UNKNOWN)
> Vector: 4 (el1h_sync)
> ESR_EL1:         02000000, ec=0 (UNKNOWN)
> FAR_EL1: 0000000000000000 (not valid)
> Exception frame registers:
> pc : [<0000000040080088>] lr : [<00000000400803e8>] pstate: 800003c5
> sp : 00000000400aff90
> x29: 0000000000000000 x28: 0000000000000000 
> x27: 0000000040090000 x26: 0000000040090c60 
> x25: 0000000040090000 x24: 000000001fffffff 
> x23: 0000000000000000 x22: 0000000000000000 
> x21: 0000000000000040 x20: 0000000000000000 
> x19: 0000000000000000 x18: 00000000400b0000 
> x17: 0000000000000000 x16: 0000000000000000 
> x15: 00000000400afe8c x14: 00000000400b0000 
> x13: 00000000400afecc x12: 0000000000001680 
> x11: 0000000000000000 x10: 6666666666666667 
> x9 : 0000000000000030 x8 : 0000000000000030 
> x7 : 00000000400af670 x6 : 00000000400af673 
> x5 : 00000000400af678 x4 : 00000000000007b7 
> x3 : 00000000400af6ec x2 : 0000000040090000 
> x1 : 000000000015909e x0 : 000000004b000000
>

I looked into this. It's due to the hvc call. The exception goes away if
you add '-machine virtualization=yes' to the qemu command line. But as
there's no point in running it, and it's way too slow, then the
'accel = kvm' added in v2 of the patch is the right thing to do.

Thanks,
drew 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] Support micro operation measurement on arm64
  2017-12-20 16:22           ` Andrew Jones
@ 2017-12-21 11:31             ` Christoffer Dall
  2017-12-21 14:32               ` Andrew Jones
  0 siblings, 1 reply; 27+ messages in thread
From: Christoffer Dall @ 2017-12-21 11:31 UTC (permalink / raw)
  To: Andrew Jones; +Cc: Shih-Wei Li, kvm, kvmarm, Marc Zyngier, Paolo Bonzini

On Wed, Dec 20, 2017 at 05:22:54PM +0100, Andrew Jones wrote:
> On Tue, Dec 19, 2017 at 02:07:06PM +0100, Christoffer Dall wrote:
> > On Tue, Dec 19, 2017 at 01:11:09PM +0100, Andrew Jones wrote:
> > > On Tue, Dec 19, 2017 at 10:06:20AM +0100, Christoffer Dall wrote:
> > > > On Mon, Dec 18, 2017 at 03:58:49PM -0500, Shih-Wei Li wrote:
> > > > > 
> > > > > I have only tested the code by invoking test directly using make
> > > > > standalone like the following. I did notice that it took ~90 seconds
> > > > > to finish the test itself.
> > > > > ./"tests/micro-cost"
> > > 
> > > standalone still uses timeout with 90 seconds. So your hardware was just
> > > faster than mine, I guess :-)
> > > 
> > 
> > [indentation confusion?]
> 
> Actually I was just lazily replying to both you and Shih-Wei in the same
> mail.
> 

Ah, well the result had a clash with my OCD.

> > 
> > You're responding to something Shih-Wei wrote here, but I didn't
> > understand Shih-Wei's answer, and I think he ran the work on Seattle, so
> > not sure what the difference was.
> 
> It turns out the problem was just the opposite of what I say above. My
> hardware must be faster. While testing v2 of the patch I still got
> timeouts, so I decided to actually try to figure out what's going on.
> The eoi_test() test is so fast that c1 always equals c2, so we always
> assume an overflow occurred and keep trying to get a better sample.

Ah, I should have read this before responding to the other thread...

> 
> I guess we need two changes to the patch. We need to handle c1 == c2
> differently, probably just assume the measurement is smaller than the
> timer frequency and report 1 tick instead of zero. 

Yeah, things are different when using the generic timer counter instead
of the cycle counter in that sense.

Perhaps we should just have an initial assert of "is this thing on"
which does something that should register a moving count on even a slow
clocked arch counter, and then only check for overflow in the general
case?

> We should also
> have a way out of loop_test() when the test is constantly returning
> zero.
> 

Indeed, seems reasonable.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] Support micro operation measurement on arm64
  2017-12-21 11:31             ` Christoffer Dall
@ 2017-12-21 14:32               ` Andrew Jones
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Jones @ 2017-12-21 14:32 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Shih-Wei Li, kvm, kvmarm, Marc Zyngier, Paolo Bonzini

On Thu, Dec 21, 2017 at 12:31:50PM +0100, Christoffer Dall wrote:
> Perhaps we should just have an initial assert of "is this thing on"
> which does something that should register a moving count on even a slow
> clocked arch counter, and then only check for overflow in the general
> case?

Yeah, that sounds good to me.

Thanks,
drew

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2017-12-20 17:00     ` Andrew Jones
@ 2018-05-01 14:57       ` Kalra, Ashish
  2018-05-02  9:23         ` Marc Zyngier
  0 siblings, 1 reply; 27+ messages in thread
From: Kalra, Ashish @ 2018-05-01 14:57 UTC (permalink / raw)
  To: Andrew Jones, Shih-Wei Li
  Cc: christoffer.dall, marc.zyngier, kvmarm, kvm, pbonzini

Hello Shih-Wei,

This exception is probably due to a bug in the hvc_test (hvc call) code,
and there is a generic bug in the hvc test code.
In the function hvc_test(), local variables c1, c2 get mapped to 
registers (typically x1 & x2) and those are "clobbered" across the hvc 
call, hence, c1 gets clobbered and we get a large cycle count for 
hvc_test (c2 - 0), which messes-up the hvc test measurements.

As per SMC64/HVC64 ABI definitions, x0-x18 can be clobbered across
HVC/SMC calls.

Currently, i am using your micro-benchmark code with the following
hack applied ...

+register unsigned long c1_hvc asm ("x19"); 
 

static unsigned long hvc_test(void) 
 

{
+     unsigned long c2;
-     unsigned long c1, c2;
 
 
 
 
 

+     c1_hvc = read_cc();
-     c1 = read_cc();
      asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0"); 
 

      c2 = read_cc();
+     return CYCLE_COUNT(c1_hvc, c2);
-     return CYCLE_COUNT(c1, c2); 
 

}

Thanks,
Ashish

On 12/20/2017 10:30 PM, Andrew Jones wrote:
> On Mon, Dec 18, 2017 at 08:10:37PM +0100, Andrew Jones wrote:
>> I just tried on TCG now. It doesn't run. It gets
>>
>> Timer Frequency 62500000 Hz (Output in timer count)
>> Unhandled exception ec=0 (UNKNOWN)
>> Vector: 4 (el1h_sync)
>> ESR_EL1:         02000000, ec=0 (UNKNOWN)
>> FAR_EL1: 0000000000000000 (not valid)
>> Exception frame registers:
>> pc : [<0000000040080088>] lr : [<00000000400803e8>] pstate: 800003c5
>> sp : 00000000400aff90
>> x29: 0000000000000000 x28: 0000000000000000
>> x27: 0000000040090000 x26: 0000000040090c60
>> x25: 0000000040090000 x24: 000000001fffffff
>> x23: 0000000000000000 x22: 0000000000000000
>> x21: 0000000000000040 x20: 0000000000000000
>> x19: 0000000000000000 x18: 00000000400b0000
>> x17: 0000000000000000 x16: 0000000000000000
>> x15: 00000000400afe8c x14: 00000000400b0000
>> x13: 00000000400afecc x12: 0000000000001680
>> x11: 0000000000000000 x10: 6666666666666667
>> x9 : 0000000000000030 x8 : 0000000000000030
>> x7 : 00000000400af670 x6 : 00000000400af673
>> x5 : 00000000400af678 x4 : 00000000000007b7
>> x3 : 00000000400af6ec x2 : 0000000040090000
>> x1 : 000000000015909e x0 : 000000004b000000
>>
> 
> I looked into this. It's due to the hvc call. The exception goes away if
> you add '-machine virtualization=yes' to the qemu command line. But as
> there's no point in running it, and it's way too slow, then the
> 'accel = kvm' added in v2 of the patch is the right thing to do.
> 
> Thanks,
> drew
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2018-05-01 14:57       ` Kalra, Ashish
@ 2018-05-02  9:23         ` Marc Zyngier
  2018-05-03 11:12           ` Kalra, Ashish
  0 siblings, 1 reply; 27+ messages in thread
From: Marc Zyngier @ 2018-05-02  9:23 UTC (permalink / raw)
  To: Kalra, Ashish, Andrew Jones, Shih-Wei Li
  Cc: christoffer.dall, pbonzini, kvmarm, kvm

On 01/05/18 15:57, Kalra, Ashish wrote:
> Hello Shih-Wei,
> 
> This exception is probably due to a bug in the hvc_test (hvc call) code,
> and there is a generic bug in the hvc test code.
> In the function hvc_test(), local variables c1, c2 get mapped to 
> registers (typically x1 & x2) and those are "clobbered" across the hvc 
> call, hence, c1 gets clobbered and we get a large cycle count for 
> hvc_test (c2 - 0), which messes-up the hvc test measurements.
> 
> As per SMC64/HVC64 ABI definitions, x0-x18 can be clobbered across
> HVC/SMC calls.

I have two issues with this statement:

- Version 1.1 of the SMCCC actually says that only x0-x3 will be
potentially clobbered

- KVM makes a point in not clobbering any register, except for those
functions implemented by SMCCC 1.1.

That being said...

> 
> Currently, i am using your micro-benchmark code with the following
> hack applied ...
> 
> +register unsigned long c1_hvc asm ("x19"); 
>  
> 
> static unsigned long hvc_test(void) 
>  
> 
> {
> +     unsigned long c2;
> -     unsigned long c1, c2;
>  
>  
>  
>  
>  
> 
> +     c1_hvc = read_cc();
> -     c1 = read_cc();
>       asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0"); 
>  
> 
>       c2 = read_cc();

The reason for the problem you're seeing is probably that the
constraints are not quite right. Here, x0 is not simply clobbered, but
is actively written to in the middle of the sequence (it is at least an
early clobber). It is also, I assume, a result from the hypercall, so it
cannot simply be discarded.

It would help to get a disassembly of the function, but I'd tend to
rewrite the code as such:

extern int bar(void);

int foo(void)
{
	register int w0 asm("w0");
	int a, b;

	a = bar();
	w0 = 0x4b000000;
	asm volatile("hvc #0" : "+r" (w0) :: );
	b = bar();

	return a - b;
}

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2018-05-02  9:23         ` Marc Zyngier
@ 2018-05-03 11:12           ` Kalra, Ashish
  2018-05-03 16:24             ` Marc Zyngier
  0 siblings, 1 reply; 27+ messages in thread
From: Kalra, Ashish @ 2018-05-03 11:12 UTC (permalink / raw)
  To: Marc Zyngier, Andrew Jones, Shih-Wei Li
  Cc: pbonzini, kvmarm, kvm, Christoffer Dall


>> 
>> The reason for the problem you're seeing is probably that the
>> constraints are not quite right. Here, x0 is not simply clobbered, but
>> is actively written to in the middle of the sequence (it is at least an
>> early clobber). It is also, I assume, a result from the hypercall, so it
>> cannot simply be discarded.
>> 
>> It would help to get a disassembly of the function, but I'd tend to
>> rewrite the code as such:
>> 
>> extern int bar(void);
>> 
>> int foo(void)
>> {
>> 	register int w0 asm("w0");
>> 	int a, b;
>> 
>> 	a = bar();
>> 	w0 = 0x4b000000;
>> 	asm volatile("hvc #0" : "+r" (w0) :: );
>> 	b = bar();
>> 
>> 	return a - b;
>> }

There still is the issue of x1-x3 registers being used for local 
variables and clobbering of the same across the "hvc" call.

Comparatively, the fix below works more reliably :

asm volatile("hvc #0" : "+r" (w0) :: "x1","x2","x3");

Thanks,
Ashish

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2018-05-03 11:12           ` Kalra, Ashish
@ 2018-05-03 16:24             ` Marc Zyngier
  2018-05-03 18:08               ` Kalra, Ashish
  0 siblings, 1 reply; 27+ messages in thread
From: Marc Zyngier @ 2018-05-03 16:24 UTC (permalink / raw)
  To: Kalra, Ashish, Andrew Jones, Shih-Wei Li
  Cc: pbonzini, kvmarm, kvm, Christoffer Dall

On 03/05/18 12:12, Kalra, Ashish wrote:
> 
>>>
>>> The reason for the problem you're seeing is probably that the
>>> constraints are not quite right. Here, x0 is not simply clobbered, but
>>> is actively written to in the middle of the sequence (it is at least an
>>> early clobber). It is also, I assume, a result from the hypercall, so it
>>> cannot simply be discarded.
>>>
>>> It would help to get a disassembly of the function, but I'd tend to
>>> rewrite the code as such:
>>>
>>> extern int bar(void);
>>>
>>> int foo(void)
>>> {
>>> 	register int w0 asm("w0");
>>> 	int a, b;
>>>
>>> 	a = bar();
>>> 	w0 = 0x4b000000;
>>> 	asm volatile("hvc #0" : "+r" (w0) :: );
>>> 	b = bar();
>>>
>>> 	return a - b;
>>> }
> 
> There still is the issue of x1-x3 registers being used for local 
> variables and clobbering of the same across the "hvc" call.
> 
> Comparatively, the fix below works more reliably :
> 
> asm volatile("hvc #0" : "+r" (w0) :: "x1","x2","x3");

Is that because this hypercall is explicitly clobbering these registers?
or because KVM is corrupting them? If the former, fine. If the latter,
that'd be a KVM bug.

So which one is it?

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [kvm-unit-tests PATCH] arm64: add micro test
  2018-05-03 16:24             ` Marc Zyngier
@ 2018-05-03 18:08               ` Kalra, Ashish
  0 siblings, 0 replies; 27+ messages in thread
From: Kalra, Ashish @ 2018-05-03 18:08 UTC (permalink / raw)
  To: Marc Zyngier, Andrew Jones, Shih-Wei Li
  Cc: pbonzini, kvmarm, kvm, Christoffer Dall


 >>On 5/3/2018 9:54 PM, Marc Zyngier wrote:
>> On 03/05/18 12:12, Kalra, Ashish wrote:
>>>
>>>>>
>>>> The reason for the problem you're seeing is probably that the
>>>>> constraints are not quite right. Here, x0 is not simply clobbered, but
>>>>> is actively written to in the middle of the sequence (it is at least an
>>>>> early clobber). It is also, I assume, a result from the hypercall, so it
>>>>> cannot simply be discarded.
>>>>>
>>>>> It would help to get a disassembly of the function, but I'd tend to
>>>>> rewrite the code as such:
>>>>>
>>>>> extern int bar(void);
>>>>>
>>>>> int foo(void)
>>>>> {
>>>>> 	register int w0 asm("w0");
>>>>> 	int a, b;
>>>>>
>>>>> 	a = bar();
>>>>> 	w0 = 0x4b000000;
>>>>> 	asm volatile("hvc #0" : "+r" (w0) :: );
>>>>> 	b = bar();
>>>>>
>>>>> 	return a - b;
>>>>> }
>>>
>>> There still is the issue of x1-x3 registers being used for local
>>> variables and clobbering of the same across the "hvc" call.
>>>
>>> Comparatively, the fix below works more reliably :
>>>
>>>> asm volatile("hvc #0" : "+r" (w0) :: "x1","x2","x3");
>> 
>> Is that because this hypercall is explicitly clobbering these registers?
>> or because KVM is corrupting them? If the former, fine. If the latter,
>> that'd be a KVM bug.
>> 
>> So which one is it?
>> 

It is being explicitly done by the hypercall handler:

int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
{
	u32 func_id = smccc_get_function(vcpu);
         u32 val = PSCI_RET_NOT_SUPPORTED;
         ..
         ..
         smccc_set_retval(vcpu, val, 0, 0, 0);
}

Thanks,
Ashish

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2018-05-03 18:08 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-15 21:15 [kvm-unit-tests PATCH] Support micro operation measurement on arm64 Shih-Wei Li
2017-12-15 21:15 ` [kvm-unit-tests PATCH] arm64: add micro test Shih-Wei Li
2017-12-18 17:31   ` Yury Norov
2017-12-18 21:32     ` Shih-Wei Li
2017-12-19  9:12     ` Christoffer Dall
2017-12-19 10:05       ` Yury Norov
2017-12-19 13:04         ` Christoffer Dall
2017-12-18 19:10   ` Andrew Jones
2017-12-18 21:58     ` Shih-Wei Li
2017-12-19 12:00       ` Andrew Jones
2017-12-20 17:00     ` Andrew Jones
2018-05-01 14:57       ` Kalra, Ashish
2018-05-02  9:23         ` Marc Zyngier
2018-05-03 11:12           ` Kalra, Ashish
2018-05-03 16:24             ` Marc Zyngier
2018-05-03 18:08               ` Kalra, Ashish
2017-12-18 22:17   ` Kalra, Ashish
2017-12-18 22:18   ` Kalra, Ashish
2017-12-18 22:31   ` Kalra, Ashish
2017-12-18 18:14 ` [kvm-unit-tests PATCH] Support micro operation measurement on arm64 Andrew Jones
2017-12-18 20:58   ` Shih-Wei Li
2017-12-19  9:06     ` Christoffer Dall
2017-12-19 12:11       ` Andrew Jones
2017-12-19 13:07         ` Christoffer Dall
2017-12-20 16:22           ` Andrew Jones
2017-12-21 11:31             ` Christoffer Dall
2017-12-21 14:32               ` Andrew Jones

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.