[kvm-unit-tests PATCH V2 0/2] Support micro operation measurement on arm64

All of lore.kernel.org
 help / color / mirror / Atom feed

* [kvm-unit-tests PATCH V2 0/2] Support micro operation measurement on arm64
@ 2017-12-20  1:34 Shih-Wei Li
  2017-12-20  1:34 ` [kvm-unit-tests PATCH V2 1/2] arm/arm64: add GICD_IIDR definition Shih-Wei Li
  2017-12-20  1:34 ` [kvm-unit-tests PATCH V2 2/2] arm64: add micro test Shih-Wei Li
  0 siblings, 2 replies; 14+ messages in thread
From: Shih-Wei Li @ 2017-12-20  1:34 UTC (permalink / raw)
  To: kvm, kvmarm; +Cc: marc.zyngier, pbonzini, Shih-Wei Li

The patch provides support for quantifying the cost of micro level
operations on arm64 hardware. The supported operations include hypercall,
mmio accesses, EOI virtual interrupt, and IPI send. Measurements are
currently obtained using timer counters. Further modifications in KVM
will be required to support timestamping using cycle counters, as KVM
now disables accesses to the PMU counters from the VM.

We iterate each of the tests for millions of times and output their
average, minimum and maximum cost in timer counts. Instruction barriers
were used before and after taking timestamps to avoid out-of-order
execution or pipelining from skewing our measurements.

To improve precision in the measurements, one should consider pinning
each VCPU to a specific physical CPU (PCPU) and ensure no other task
could run on that PCPU to skew the results. This can be achieved by
enabling QMP server in the QEMU command in unittest.cfg for micro test,
allowing a client program to get the thread_id for each VCPU thread
from the QMP server. Based on the information, the client program can
then pin the corresponding VCPUs to dedicated PCPUs and isolate
interrupts and tasks from those PCPUs.

The patch has been tested on arm64 hardware including AMD Seattle and
ThunderX2, which has GICv2 and GICv3 respectively.

Shih-Wei Li (2):
  arm/arm64: add GICD_IIDR definition
  arm64: add micro test

 arm/Makefile.common |   1 +
 arm/micro-test.c    | 219 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 arm/unittests.cfg   |   7 ++
 lib/arm/asm/gic.h   |   1 +
 4 files changed, 228 insertions(+)
 create mode 100644 arm/micro-test.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [kvm-unit-tests PATCH V2 1/2] arm/arm64: add GICD_IIDR definition
  2017-12-20  1:34 [kvm-unit-tests PATCH V2 0/2] Support micro operation measurement on arm64 Shih-Wei Li
@ 2017-12-20  1:34 ` Shih-Wei Li
  2017-12-20  1:34 ` [kvm-unit-tests PATCH V2 2/2] arm64: add micro test Shih-Wei Li
  1 sibling, 0 replies; 14+ messages in thread
From: Shih-Wei Li @ 2017-12-20  1:34 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: christoffer.dall, marc.zyngier, drjones, pbonzini, Shih-Wei Li

Added the definition of GICD_IIDR for later use in micro test.

Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
---
 lib/arm/asm/gic.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/arm/asm/gic.h b/lib/arm/asm/gic.h
index 2eb4af8..3b9982b 100644
--- a/lib/arm/asm/gic.h
+++ b/lib/arm/asm/gic.h
@@ -10,6 +10,7 @@
 /* Distributor registers */
 #define GICD_CTLR			0x0000
 #define GICD_TYPER			0x0004
+#define GICD_IIDR			0x0008
 #define GICD_IGROUPR			0x0080
 #define GICD_ISENABLER			0x0100
 #define GICD_ISPENDR			0x0200
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-20  1:34 [kvm-unit-tests PATCH V2 0/2] Support micro operation measurement on arm64 Shih-Wei Li
  2017-12-20  1:34 ` [kvm-unit-tests PATCH V2 1/2] arm/arm64: add GICD_IIDR definition Shih-Wei Li
@ 2017-12-20  1:34 ` Shih-Wei Li
  2017-12-20 16:41   ` Andrew Jones
  2017-12-22  6:07   ` Yury Norov
  1 sibling, 2 replies; 14+ messages in thread
From: Shih-Wei Li @ 2017-12-20  1:34 UTC (permalink / raw)
  To: kvm, kvmarm; +Cc: marc.zyngier, pbonzini, Shih-Wei Li

Here we provide the support for measuring various micro level
operations on arm64. We iterate each of the tests for millions of
times and output their average, minimum and maximum cost in timer
counts. Instruction barriers were used before and after taking
timestamps to avoid out-of-order execution or pipelining from
skewing our measurements.

The operations we currently supported and measured are mostly
self-explanatory by their function name. For IPI, we measured the
cost of sending IPI from a source VCPU to a target VCPU, until the
target VCPU receives the IPI.

Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
---
 arm/Makefile.common |   1 +
 arm/micro-test.c    | 219 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 arm/unittests.cfg   |   7 ++
 3 files changed, 227 insertions(+)
 create mode 100644 arm/micro-test.c

diff --git a/arm/Makefile.common b/arm/Makefile.common
index 0a039cf..c7d5c27 100644
--- a/arm/Makefile.common
+++ b/arm/Makefile.common
@@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
 tests-common += $(TEST_DIR)/gic.flat
 tests-common += $(TEST_DIR)/psci.flat
 tests-common += $(TEST_DIR)/sieve.flat
+tests-common += $(TEST_DIR)/micro-test.flat
 
 tests-all = $(tests-common) $(tests)
 all: directories $(tests-all)
diff --git a/arm/micro-test.c b/arm/micro-test.c
new file mode 100644
index 0000000..c31c9ac
--- /dev/null
+++ b/arm/micro-test.c
@@ -0,0 +1,219 @@
+/*
+ * Measure the cost of micro level operations.
+ *
+ * Copyright Columbia University
+ * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
+ * Author: Christoffer Dall <cdall@cs.columbia.edu>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#include "libcflat.h"
+#include <util.h>
+#include <asm/gic.h>
+
+static volatile bool ipi_received;
+static volatile bool ipi_ready;
+#define IPI_IRQ		1
+
+#define TIMEOUT (1U << 28)
+
+/*
+ * The counter may not always start with zero, which means it could
+ * overflow after some time.
+ */
+#define COUNT(c1, c2) \
+	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
+
+static uint64_t read_cc(void)
+{
+	isb();
+	return read_sysreg(cntpct_el0);
+}
+
+static void ipi_irq_handler(struct pt_regs *regs __unused)
+{
+	u32 ack;
+	ipi_ready = false;
+	ipi_received = true;
+	ack = gic_read_iar();
+	gic_write_eoir(ack);
+	ipi_ready = true;
+}
+
+static void ipi_test_secondary_entry(void *data __unused)
+{
+	enum vector v = EL1H_IRQ;
+	install_irq_handler(v, ipi_irq_handler);
+
+	gic_enable_defaults();
+
+	local_irq_enable(); /* Enter small wait-loop */
+	ipi_ready = true;
+	while (true);
+}
+
+static int test_init(void)
+{
+	if (!gic_init()) {
+		printf("No supported gic present, skipping tests...\n");
+		return 0;
+	}
+
+	ipi_ready = false;
+
+	gic_enable_defaults();
+	on_cpu_async(1, ipi_test_secondary_entry, 0);
+
+	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
+
+	return 1;
+}
+
+static unsigned long ipi_test(void)
+{
+	unsigned int timeout = TIMEOUT;
+	uint64_t c1, c2;
+
+	while (!ipi_ready && timeout--);
+	assert(ipi_ready);
+
+	ipi_received = false;
+
+	c1 = read_cc();
+
+	gic_ipi_send_single(IPI_IRQ, 1);
+
+	timeout = TIMEOUT;
+	while (!ipi_received && timeout--);
+	assert(ipi_received);
+
+	c2 = read_cc();
+	return COUNT(c1, c2);
+}
+
+
+static unsigned long hvc_test(void)
+{
+	uint64_t c1, c2;
+
+	c1 = read_cc();
+	asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
+	c2 = read_cc();
+	return COUNT(c1, c2);
+}
+
+static unsigned long mmio_read_user(void)
+{
+	uint64_t c1, c2;
+	/*
+	 * FIXME: Read device-id in virtio mmio here. This address
+	 * needs to be updated in the future if any relavent
+	 * changes in QEMU test-dev are made.
+	 */
+	void *mmio_read_user_addr = (void*) 0x0a000008;
+
+	c1 = read_cc();
+	readl(mmio_read_user_addr);
+	c2 = read_cc();
+	return COUNT(c1, c2);
+}
+
+static unsigned long mmio_read_vgic(void)
+{
+	uint64_t c1, c2;
+	int v = gic_version();
+	void *vgic_dist_addr = NULL;
+
+	if (v == 2)
+		vgic_dist_addr = gicv2_dist_base();
+	else if (v == 3)
+		vgic_dist_addr = gicv3_dist_base();
+
+	c1 = read_cc();
+	readl(vgic_dist_addr + GICD_IIDR);
+	c2 = read_cc();
+	return COUNT(c1, c2);
+}
+
+static unsigned long eoi_test(void)
+{
+	uint64_t c1, c2;
+	int v = gic_version();
+	void (*write_eoir)(u32 irqstat) = NULL;
+
+	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
+
+	/* To avoid measuring assert(..) in gic_write_eoir */
+	if (v == 2)
+		write_eoir = gicv2_write_eoir;
+	else if (v == 3)
+		write_eoir = gicv3_write_eoir;
+
+	c1 = read_cc();
+	write_eoir(val);
+	c2 = read_cc();
+
+	return COUNT(c1, c2);
+}
+
+struct exit_test {
+	const char *name;
+	unsigned long (*test_fn)(void);
+	bool run;
+};
+
+static struct exit_test tests[] = {
+	{"hvc",                hvc_test,           true},
+	{"mmio_read_user",     mmio_read_user,     true},
+	{"mmio_read_vgic",     mmio_read_vgic,     true},
+	{"eoi",                eoi_test,           true},
+	{"ipi",                ipi_test,           true},
+};
+
+static void loop_test(struct exit_test *test)
+{
+	unsigned long i, iterations = 32;
+	unsigned long sample, cycles;
+	unsigned long long min = 0, max = 0;
+	const unsigned long long goal = (1ULL << 25);
+
+	do {
+		iterations *= 2;
+		cycles = 0;
+		i = 0;
+		while (i < iterations) {
+			sample = test->test_fn();
+			if (sample == 0) {
+				/*
+				 * If something went wrong or we had an
+				 * overflow, don't count that sample.
+				 */
+				printf("cycle count overflow: %lu\n", sample);
+				continue;
+			}
+			cycles += sample;
+			if (min == 0 || min > sample)
+				min = sample;
+			if (max < sample)
+				max = sample;
+			++i;
+		}
+	} while (cycles < goal);
+	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
+		test->name, cycles / iterations, min, max);
+}
+
+int main(int argc, char **argv)
+{
+	int i;
+
+	assert(test_init());
+
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		if (!tests[i].run)
+			continue;
+		loop_test(&tests[i]);
+	}
+
+	return 0;
+}
diff --git a/arm/unittests.cfg b/arm/unittests.cfg
index 44b98cf..a8fb0b3 100644
--- a/arm/unittests.cfg
+++ b/arm/unittests.cfg
@@ -116,3 +116,10 @@ file = timer.flat
 groups = timer
 timeout = 2s
 arch = arm64
+
+# Exit tests
+[micro-test]
+file = micro-test.flat
+smp = 2
+groups = micro-test
+accel = kvm
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-20  1:34 ` [kvm-unit-tests PATCH V2 2/2] arm64: add micro test Shih-Wei Li
@ 2017-12-20 16:41   ` Andrew Jones
  2017-12-20 21:41     ` Shih-Wei Li
  2017-12-22  6:07   ` Yury Norov
  1 sibling, 1 reply; 14+ messages in thread
From: Andrew Jones @ 2017-12-20 16:41 UTC (permalink / raw)
  To: Shih-Wei Li; +Cc: marc.zyngier, kvmarm, kvm, pbonzini

On Tue, Dec 19, 2017 at 08:34:45PM -0500, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests for millions of
> times and output their average, minimum and maximum cost in timer
> counts. Instruction barriers were used before and after taking
> timestamps to avoid out-of-order execution or pipelining from
> skewing our measurements.
> 
> The operations we currently supported and measured are mostly
> self-explanatory by their function name. For IPI, we measured the
> cost of sending IPI from a source VCPU to a target VCPU, until the
> target VCPU receives the IPI.
> 
> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> ---
>  arm/Makefile.common |   1 +
>  arm/micro-test.c    | 219 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  arm/unittests.cfg   |   7 ++
>  3 files changed, 227 insertions(+)
>  create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>  tests-common += $(TEST_DIR)/gic.flat
>  tests-common += $(TEST_DIR)/psci.flat
>  tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>  
>  tests-all = $(tests-common) $(tests)
>  all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..c31c9ac
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,219 @@
> +/*
> + * Measure the cost of micro level operations.
> + *
> + * Copyright Columbia University
> + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
> + * Author: Christoffer Dall <cdall@cs.columbia.edu>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.
> + */
> +#include "libcflat.h"
> +#include <util.h>
> +#include <asm/gic.h>
> +
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +#define IPI_IRQ		1
> +
> +#define TIMEOUT (1U << 28)
> +
> +/*
> + * The counter may not always start with zero, which means it could
> + * overflow after some time.
> + */
> +#define COUNT(c1, c2) \
> +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))

I just wrote about this in another mail, but here I think we should
handle c1 == c2 by returning 1 instead of zero.

> +
> +static uint64_t read_cc(void)
> +{
> +	isb();
> +	return read_sysreg(cntpct_el0);
> +}
> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void *data __unused)
> +{
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +
> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);
> +}
> +
> +static int test_init(void)
> +{
> +	if (!gic_init()) {
> +		printf("No supported gic present, skipping tests...\n");
> +		return 0;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +	on_cpu_async(1, ipi_test_secondary_entry, 0);
> +
> +	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
> +
> +	return 1;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +	uint64_t c1, c2;
> +
> +	while (!ipi_ready && timeout--);
> +	assert(ipi_ready);
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	timeout = TIMEOUT;
> +	while (!ipi_received && timeout--);
> +	assert(ipi_received);
> +
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +
> +static unsigned long hvc_test(void)
> +{
> +	uint64_t c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	uint64_t c1, c2;
> +	/*
> +	 * FIXME: Read device-id in virtio mmio here. This address
> +	 * needs to be updated in the future if any relavent
> +	 * changes in QEMU test-dev are made.
> +	 */
> +	void *mmio_read_user_addr = (void*) 0x0a000008;
> +
> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	uint64_t c1, c2;
> +	int v = gic_version();
> +	void *vgic_dist_addr = NULL;
> +
> +	if (v == 2)
> +		vgic_dist_addr = gicv2_dist_base();
> +	else if (v == 3)
> +		vgic_dist_addr = gicv3_dist_base();
> +
> +	c1 = read_cc();
> +	readl(vgic_dist_addr + GICD_IIDR);
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long eoi_test(void)
> +{
> +	uint64_t c1, c2;
> +	int v = gic_version();
> +	void (*write_eoir)(u32 irqstat) = NULL;
> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	/* To avoid measuring assert(..) in gic_write_eoir */
> +	if (v == 2)
> +		write_eoir = gicv2_write_eoir;
> +	else if (v == 3)
> +		write_eoir = gicv3_write_eoir;
> +
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return COUNT(c1, c2);
> +}
> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long long min = 0, max = 0;
> +	const unsigned long long goal = (1ULL << 25);
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		i = 0;
> +		while (i < iterations) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				/*
> +				 * If something went wrong or we had an
> +				 * overflow, don't count that sample.
> +				 */
> +				printf("cycle count overflow: %lu\n", sample);
> +				continue;

Also mentioned in the other mail, we should change this to something like

 if (sample == 0) {
     if (failures++ > MAX_FAILURES) {
         printf("%s: Too many cycle count overflows\n", test->name);
         return;
     }
     continue;
 }

Notice the dropping of the sample parameter (we know it's zero) and
the addition of the test name.

> +			}
> +			cycles += sample;
> +			if (min == 0 || min > sample)
> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +			++i;
> +		}
> +	} while (cycles < goal);
> +	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> +		test->name, cycles / iterations, min, max);

Please use %<num><type> format parameters in order to put this
output in columns. Also, I think we should convert the cycle
counts to time (us) using the timer frequency.

> +}
> +
> +int main(int argc, char **argv)
> +{
> +	int i;
> +
> +	assert(test_init());

Even though we'll probably never turn assert() into a no-op,
we try to use it in a way where that's an option, i.e. no
functional code used as the parameter. So please write this
instead as

 ret = test_init();
 assert(ret);

> +
> +	for (i = 0; i < ARRAY_SIZE(tests); i++) {
> +		if (!tests[i].run)
> +			continue;
> +		loop_test(&tests[i]);
> +	}
> +
> +	return 0;
> +}

We may still want to modify this test to take command
line parameters for the number of iterations and to select
which tests to run.

> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..a8fb0b3 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,10 @@ file = timer.flat
>  groups = timer
>  timeout = 2s
>  arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = micro-test
> +accel = kvm

Unless we come up with some expected values and thresholds for
these execution times in order to turn the results into PASS/FAIL
tests, then I'm still thinking we should probably add it to the
nodefault group in order to save time running the regression tests.

Thanks,
drew

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-20 16:41   ` Andrew Jones
@ 2017-12-20 21:41     ` Shih-Wei Li
  2017-12-21  9:21       ` Andrew Jones
  0 siblings, 1 reply; 14+ messages in thread
From: Shih-Wei Li @ 2017-12-20 21:41 UTC (permalink / raw)
  To: Andrew Jones; +Cc: Marc Zyngier, kvmarm, kvm, Paolo Bonzini

On Wed, Dec 20, 2017 at 11:41 AM, Andrew Jones <drjones@redhat.com> wrote:
> On Tue, Dec 19, 2017 at 08:34:45PM -0500, Shih-Wei Li wrote:
>> Here we provide the support for measuring various micro level
>> operations on arm64. We iterate each of the tests for millions of
>> times and output their average, minimum and maximum cost in timer
>> counts. Instruction barriers were used before and after taking
>> timestamps to avoid out-of-order execution or pipelining from
>> skewing our measurements.
>>
>> The operations we currently supported and measured are mostly
>> self-explanatory by their function name. For IPI, we measured the
>> cost of sending IPI from a source VCPU to a target VCPU, until the
>> target VCPU receives the IPI.
>>
>> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
>> ---
>>  arm/Makefile.common |   1 +
>>  arm/micro-test.c    | 219 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  arm/unittests.cfg   |   7 ++
>>  3 files changed, 227 insertions(+)
>>  create mode 100644 arm/micro-test.c
>>
>> diff --git a/arm/Makefile.common b/arm/Makefile.common
>> index 0a039cf..c7d5c27 100644
>> --- a/arm/Makefile.common
>> +++ b/arm/Makefile.common
>> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>>  tests-common += $(TEST_DIR)/gic.flat
>>  tests-common += $(TEST_DIR)/psci.flat
>>  tests-common += $(TEST_DIR)/sieve.flat
>> +tests-common += $(TEST_DIR)/micro-test.flat
>>
>>  tests-all = $(tests-common) $(tests)
>>  all: directories $(tests-all)
>> diff --git a/arm/micro-test.c b/arm/micro-test.c
>> new file mode 100644
>> index 0000000..c31c9ac
>> --- /dev/null
>> +++ b/arm/micro-test.c
>> @@ -0,0 +1,219 @@
>> +/*
>> + * Measure the cost of micro level operations.
>> + *
>> + * Copyright Columbia University
>> + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
>> + * Author: Christoffer Dall <cdall@cs.columbia.edu>
>> + *
>> + * This work is licensed under the terms of the GNU LGPL, version 2.
>> + */
>> +#include "libcflat.h"
>> +#include <util.h>
>> +#include <asm/gic.h>
>> +
>> +static volatile bool ipi_received;
>> +static volatile bool ipi_ready;
>> +#define IPI_IRQ              1
>> +
>> +#define TIMEOUT (1U << 28)
>> +
>> +/*
>> + * The counter may not always start with zero, which means it could
>> + * overflow after some time.
>> + */
>> +#define COUNT(c1, c2) \
>> +     (((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
>
> I just wrote about this in another mail, but here I think we should
> handle c1 == c2 by returning 1 instead of zero.
>
>> +
>> +static uint64_t read_cc(void)
>> +{
>> +     isb();
>> +     return read_sysreg(cntpct_el0);
>> +}
>> +
>> +static void ipi_irq_handler(struct pt_regs *regs __unused)
>> +{
>> +     u32 ack;
>> +     ipi_ready = false;
>> +     ipi_received = true;
>> +     ack = gic_read_iar();
>> +     gic_write_eoir(ack);
>> +     ipi_ready = true;
>> +}
>> +
>> +static void ipi_test_secondary_entry(void *data __unused)
>> +{
>> +     enum vector v = EL1H_IRQ;
>> +     install_irq_handler(v, ipi_irq_handler);
>> +
>> +     gic_enable_defaults();
>> +
>> +     local_irq_enable(); /* Enter small wait-loop */
>> +     ipi_ready = true;
>> +     while (true);
>> +}
>> +
>> +static int test_init(void)
>> +{
>> +     if (!gic_init()) {
>> +             printf("No supported gic present, skipping tests...\n");
>> +             return 0;
>> +     }
>> +
>> +     ipi_ready = false;
>> +
>> +     gic_enable_defaults();
>> +     on_cpu_async(1, ipi_test_secondary_entry, 0);
>> +
>> +     printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
>> +
>> +     return 1;
>> +}
>> +
>> +static unsigned long ipi_test(void)
>> +{
>> +     unsigned int timeout = TIMEOUT;
>> +     uint64_t c1, c2;
>> +
>> +     while (!ipi_ready && timeout--);
>> +     assert(ipi_ready);
>> +
>> +     ipi_received = false;
>> +
>> +     c1 = read_cc();
>> +
>> +     gic_ipi_send_single(IPI_IRQ, 1);
>> +
>> +     timeout = TIMEOUT;
>> +     while (!ipi_received && timeout--);
>> +     assert(ipi_received);
>> +
>> +     c2 = read_cc();
>> +     return COUNT(c1, c2);
>> +}
>> +
>> +
>> +static unsigned long hvc_test(void)
>> +{
>> +     uint64_t c1, c2;
>> +
>> +     c1 = read_cc();
>> +     asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
>> +     c2 = read_cc();
>> +     return COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_user(void)
>> +{
>> +     uint64_t c1, c2;
>> +     /*
>> +      * FIXME: Read device-id in virtio mmio here. This address
>> +      * needs to be updated in the future if any relavent
>> +      * changes in QEMU test-dev are made.
>> +      */
>> +     void *mmio_read_user_addr = (void*) 0x0a000008;
>> +
>> +     c1 = read_cc();
>> +     readl(mmio_read_user_addr);
>> +     c2 = read_cc();
>> +     return COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_vgic(void)
>> +{
>> +     uint64_t c1, c2;
>> +     int v = gic_version();
>> +     void *vgic_dist_addr = NULL;
>> +
>> +     if (v == 2)
>> +             vgic_dist_addr = gicv2_dist_base();
>> +     else if (v == 3)
>> +             vgic_dist_addr = gicv3_dist_base();
>> +
>> +     c1 = read_cc();
>> +     readl(vgic_dist_addr + GICD_IIDR);
>> +     c2 = read_cc();
>> +     return COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long eoi_test(void)
>> +{
>> +     uint64_t c1, c2;
>> +     int v = gic_version();
>> +     void (*write_eoir)(u32 irqstat) = NULL;
>> +
>> +     u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
>> +
>> +     /* To avoid measuring assert(..) in gic_write_eoir */
>> +     if (v == 2)
>> +             write_eoir = gicv2_write_eoir;
>> +     else if (v == 3)
>> +             write_eoir = gicv3_write_eoir;
>> +
>> +     c1 = read_cc();
>> +     write_eoir(val);
>> +     c2 = read_cc();
>> +
>> +     return COUNT(c1, c2);
>> +}
>> +
>> +struct exit_test {
>> +     const char *name;
>> +     unsigned long (*test_fn)(void);
>> +     bool run;
>> +};
>> +
>> +static struct exit_test tests[] = {
>> +     {"hvc",                hvc_test,           true},
>> +     {"mmio_read_user",     mmio_read_user,     true},
>> +     {"mmio_read_vgic",     mmio_read_vgic,     true},
>> +     {"eoi",                eoi_test,           true},
>> +     {"ipi",                ipi_test,           true},
>> +};
>> +
>> +static void loop_test(struct exit_test *test)
>> +{
>> +     unsigned long i, iterations = 32;
>> +     unsigned long sample, cycles;
>> +     unsigned long long min = 0, max = 0;
>> +     const unsigned long long goal = (1ULL << 25);
>> +
>> +     do {
>> +             iterations *= 2;
>> +             cycles = 0;
>> +             i = 0;
>> +             while (i < iterations) {
>> +                     sample = test->test_fn();
>> +                     if (sample == 0) {
>> +                             /*
>> +                              * If something went wrong or we had an
>> +                              * overflow, don't count that sample.
>> +                              */
>> +                             printf("cycle count overflow: %lu\n", sample);
>> +                             continue;
>
> Also mentioned in the other mail, we should change this to something like
>
>  if (sample == 0) {
>      if (failures++ > MAX_FAILURES) {
>          printf("%s: Too many cycle count overflows\n", test->name);
>          return;
>      }
>      continue;
>  }
>
> Notice the dropping of the sample parameter (we know it's zero) and
> the addition of the test name.
>

I think the chance which the timer counter actually overflows ((c1) >
(c2)) should be really really rare (mostly due to that (c1) == (c2)),
and we can just ignore the case if it happens. So I wonder if we can
leave the COUNT macro as the way it is, and change the print info in
the code like the following?

if (sample == 0) {
   if (failures++ > MAX_FAILURES) {
       printf("%s: Too many overflows, the cost is likely smaller than
a cycle count\n", test->name);
       return;
   }
   continue;
}

Also, what might be a good value for MAX_FAILURES based on your
experience? The problem in EOI never happened on the machines I used.

>> +                     }
>> +                     cycles += sample;
>> +                     if (min == 0 || min > sample)
>> +                             min = sample;
>> +                     if (max < sample)
>> +                             max = sample;
>> +                     ++i;
>> +             }
>> +     } while (cycles < goal);
>> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
>> +             test->name, cycles / iterations, min, max);
>
> Please use %<num><type> format parameters in order to put this
> output in columns. Also, I think we should convert the cycle
> counts to time (us) using the timer frequency.
>

One thing I noticed here is, since in kvm-unit-tests, printf does not
currently support floating points, we may lose precision in the output
for tests that have smaller cost. For example, in EOI, I saw the min
output on seattle became zero since its cost is smaller than 1us.
Outputting in nanoseconds might be an option but this then requires
type casting floats to unsigned (the timer frequency is around
hundreds of MHz in the hardware I tested) just to print the value.

I wonder if it's ok to output the timer frequency/counts as we used to
have so the users can do the math themselves?

>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> +     int i;
>> +
>> +     assert(test_init());
>
> Even though we'll probably never turn assert() into a no-op,
> we try to use it in a way where that's an option, i.e. no
> functional code used as the parameter. So please write this
> instead as
>
>  ret = test_init();
>  assert(ret);
>
>> +
>> +     for (i = 0; i < ARRAY_SIZE(tests); i++) {
>> +             if (!tests[i].run)
>> +                     continue;
>> +             loop_test(&tests[i]);
>> +     }
>> +
>> +     return 0;
>> +}
>
> We may still want to modify this test to take command
> line parameters for the number of iterations and to select
> which tests to run.
>
>> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
>> index 44b98cf..a8fb0b3 100644
>> --- a/arm/unittests.cfg
>> +++ b/arm/unittests.cfg
>> @@ -116,3 +116,10 @@ file = timer.flat
>>  groups = timer
>>  timeout = 2s
>>  arch = arm64
>> +
>> +# Exit tests
>> +[micro-test]
>> +file = micro-test.flat
>> +smp = 2
>> +groups = micro-test
>> +accel = kvm
>
> Unless we come up with some expected values and thresholds for
> these execution times in order to turn the results into PASS/FAIL
> tests, then I'm still thinking we should probably add it to the
> nodefault group in order to save time running the regression tests.

I'm ok with making the micro test as a nodefault if we cannot find a
generic timeout value that works across the hardware.

Thanks,
Shih-Wei

>
> Thanks,
> drew

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-20 21:41     ` Shih-Wei Li
@ 2017-12-21  9:21       ` Andrew Jones
  2017-12-21 11:28         ` Christoffer Dall
  2017-12-21 16:12         ` Shih-Wei Li
  0 siblings, 2 replies; 14+ messages in thread
From: Andrew Jones @ 2017-12-21  9:21 UTC (permalink / raw)
  To: Shih-Wei Li; +Cc: Marc Zyngier, kvmarm, kvm, Paolo Bonzini

On Wed, Dec 20, 2017 at 04:41:24PM -0500, Shih-Wei Li wrote:
> On Wed, Dec 20, 2017 at 11:41 AM, Andrew Jones <drjones@redhat.com> wrote:
> > On Tue, Dec 19, 2017 at 08:34:45PM -0500, Shih-Wei Li wrote:
> >> +             while (i < iterations) {
> >> +                     sample = test->test_fn();
> >> +                     if (sample == 0) {
> >> +                             /*
> >> +                              * If something went wrong or we had an
> >> +                              * overflow, don't count that sample.
> >> +                              */
> >> +                             printf("cycle count overflow: %lu\n", sample);
> >> +                             continue;
> >
> > Also mentioned in the other mail, we should change this to something like
> >
> >  if (sample == 0) {
> >      if (failures++ > MAX_FAILURES) {
> >          printf("%s: Too many cycle count overflows\n", test->name);
> >          return;
> >      }
> >      continue;
> >  }
> >
> > Notice the dropping of the sample parameter (we know it's zero) and
> > the addition of the test name.
> >
> 
> I think the chance which the timer counter actually overflows ((c1) >
> (c2)) should be really really rare (mostly due to that (c1) == (c2)),
> and we can just ignore the case if it happens. So I wonder if we can
> leave the COUNT macro as the way it is, and change the print info in
> the code like the following?
> 
> if (sample == 0) {
>    if (failures++ > MAX_FAILURES) {
>        printf("%s: Too many overflows, the cost is likely smaller than
> a cycle count\n", test->name);
>        return;
>    }
>    continue;
> }
> 
> Also, what might be a good value for MAX_FAILURES based on your
> experience? The problem in EOI never happened on the machines I used.

It's always zero on at least one of the platforms I use. So it won't
matter what MAX_FAILURES is set to, we'll always hit it. I think
the chance we overflow and get c1 == c2 is almost nil, so we might as
well report the sample as either zero or one in that case. But whatever
fixes the time outs is fine by me.

> 
> >> +                     }
> >> +                     cycles += sample;
> >> +                     if (min == 0 || min > sample)
> >> +                             min = sample;
> >> +                     if (max < sample)
> >> +                             max = sample;
> >> +                     ++i;
> >> +             }
> >> +     } while (cycles < goal);
> >> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> >> +             test->name, cycles / iterations, min, max);
> >
> > Please use %<num><type> format parameters in order to put this
> > output in columns. Also, I think we should convert the cycle
> > counts to time (us) using the timer frequency.
> >
> 
> One thing I noticed here is, since in kvm-unit-tests, printf does not
> currently support floating points, we may lose precision in the output
> for tests that have smaller cost. For example, in EOI, I saw the min
> output on seattle became zero since its cost is smaller than 1us.
> Outputting in nanoseconds might be an option but this then requires
> type casting floats to unsigned (the timer frequency is around
> hundreds of MHz in the hardware I tested) just to print the value.

We shouldn't need floats. Just use nanoseconds for calculations and
then convert to us.us-fraction manually.

 ns      = cycles * frq;
 us      = ns / 1000;
 us_frac = (ns % 1000) / 100; // one decimal place, use 10 for two
 printf("%d.d\n", us, us_frac);

> 
> I wonder if it's ok to output the timer frequency/counts as we used to
> have so the users can do the math themselves?

If you wish. I won't insist on the doing the conversion in the test,
but it seems easy and helpful enough to do.

> >> +# Exit tests
> >> +[micro-test]
> >> +file = micro-test.flat
> >> +smp = 2
> >> +groups = micro-test
> >> +accel = kvm
> >
> > Unless we come up with some expected values and thresholds for
> > these execution times in order to turn the results into PASS/FAIL
> > tests, then I'm still thinking we should probably add it to the
> > nodefault group in order to save time running the regression tests.
> 
> I'm ok with making the micro test as a nodefault if we cannot find a
> generic timeout value that works across the hardware.
>

OK, let's do that for now.

Thanks,
drew

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-21  9:21       ` Andrew Jones
@ 2017-12-21 11:28         ` Christoffer Dall
  2017-12-21 16:26           ` Shih-Wei Li
  2017-12-21 16:12         ` Shih-Wei Li
  1 sibling, 1 reply; 14+ messages in thread
From: Christoffer Dall @ 2017-12-21 11:28 UTC (permalink / raw)
  To: Andrew Jones; +Cc: Shih-Wei Li, kvm, kvmarm, Marc Zyngier, Paolo Bonzini

On Thu, Dec 21, 2017 at 10:21:29AM +0100, Andrew Jones wrote:
> On Wed, Dec 20, 2017 at 04:41:24PM -0500, Shih-Wei Li wrote:
> > On Wed, Dec 20, 2017 at 11:41 AM, Andrew Jones <drjones@redhat.com> wrote:
> > > On Tue, Dec 19, 2017 at 08:34:45PM -0500, Shih-Wei Li wrote:
> > >> +             while (i < iterations) {
> > >> +                     sample = test->test_fn();
> > >> +                     if (sample == 0) {
> > >> +                             /*
> > >> +                              * If something went wrong or we had an
> > >> +                              * overflow, don't count that sample.
> > >> +                              */
> > >> +                             printf("cycle count overflow: %lu\n", sample);
> > >> +                             continue;
> > >
> > > Also mentioned in the other mail, we should change this to something like
> > >
> > >  if (sample == 0) {
> > >      if (failures++ > MAX_FAILURES) {
> > >          printf("%s: Too many cycle count overflows\n", test->name);
> > >          return;
> > >      }
> > >      continue;
> > >  }
> > >
> > > Notice the dropping of the sample parameter (we know it's zero) and
> > > the addition of the test name.
> > >
> > 
> > I think the chance which the timer counter actually overflows ((c1) >
> > (c2)) should be really really rare (mostly due to that (c1) == (c2)),
> > and we can just ignore the case if it happens. So I wonder if we can
> > leave the COUNT macro as the way it is, and change the print info in
> > the code like the following?
> > 
> > if (sample == 0) {
> >    if (failures++ > MAX_FAILURES) {
> >        printf("%s: Too many overflows, the cost is likely smaller than
> > a cycle count\n", test->name);
> >        return;
> >    }
> >    continue;
> > }
> > 
> > Also, what might be a good value for MAX_FAILURES based on your
> > experience? The problem in EOI never happened on the machines I used.
> 
> It's always zero on at least one of the platforms I use. So it won't
> matter what MAX_FAILURES is set to, we'll always hit it. I think
> the chance we overflow and get c1 == c2 is almost nil, so we might as
> well report the sample as either zero or one in that case. But whatever
> fixes the time outs is fine by me.
> 

I think the overflow thing should be kept as a sanity check; if we don't
see a reasonable cycle count (either overflow or c1 == c2), just cancel
the test and report to the user that this thing is unreliable and they
should fix their hypervisor.  Thoughts?

> > 
> > >> +                     }
> > >> +                     cycles += sample;
> > >> +                     if (min == 0 || min > sample)
> > >> +                             min = sample;
> > >> +                     if (max < sample)
> > >> +                             max = sample;
> > >> +                     ++i;
> > >> +             }
> > >> +     } while (cycles < goal);
> > >> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> > >> +             test->name, cycles / iterations, min, max);
> > >
> > > Please use %<num><type> format parameters in order to put this
> > > output in columns. Also, I think we should convert the cycle
> > > counts to time (us) using the timer frequency.
> > >
> > 
> > One thing I noticed here is, since in kvm-unit-tests, printf does not
> > currently support floating points, we may lose precision in the output
> > for tests that have smaller cost. For example, in EOI, I saw the min
> > output on seattle became zero since its cost is smaller than 1us.
> > Outputting in nanoseconds might be an option but this then requires
> > type casting floats to unsigned (the timer frequency is around
> > hundreds of MHz in the hardware I tested) just to print the value.
> 
> We shouldn't need floats. Just use nanoseconds for calculations and
> then convert to us.us-fraction manually.
> 
>  ns      = cycles * frq;
>  us      = ns / 1000;
>  us_frac = (ns % 1000) / 100; // one decimal place, use 10 for two
>  printf("%d.d\n", us, us_frac);
> 
> > 
> > I wonder if it's ok to output the timer frequency/counts as we used to
> > have so the users can do the math themselves?
> 
> If you wish. I won't insist on the doing the conversion in the test,
> but it seems easy and helpful enough to do.
> 

I think we're asking for trouble if we output the raw timer counts as
I'm not sure people are aware this can vary across platforms.

What is the argument against scaling the count to a reasonable measure
of time?

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-21  9:21       ` Andrew Jones
  2017-12-21 11:28         ` Christoffer Dall
@ 2017-12-21 16:12         ` Shih-Wei Li
  2017-12-21 16:57           ` Andrew Jones
  1 sibling, 1 reply; 14+ messages in thread
From: Shih-Wei Li @ 2017-12-21 16:12 UTC (permalink / raw)
  To: Andrew Jones; +Cc: Marc Zyngier, kvmarm, kvm, Paolo Bonzini

On Thu, Dec 21, 2017 at 4:21 AM, Andrew Jones <drjones@redhat.com> wrote:
> On Wed, Dec 20, 2017 at 04:41:24PM -0500, Shih-Wei Li wrote:
>> On Wed, Dec 20, 2017 at 11:41 AM, Andrew Jones <drjones@redhat.com> wrote:
>> > On Tue, Dec 19, 2017 at 08:34:45PM -0500, Shih-Wei Li wrote:
>> >> +             while (i < iterations) {
>> >> +                     sample = test->test_fn();
>> >> +                     if (sample == 0) {
>> >> +                             /*
>> >> +                              * If something went wrong or we had an
>> >> +                              * overflow, don't count that sample.
>> >> +                              */
>> >> +                             printf("cycle count overflow: %lu\n", sample);
>> >> +                             continue;
>> >
>> > Also mentioned in the other mail, we should change this to something like
>> >
>> >  if (sample == 0) {
>> >      if (failures++ > MAX_FAILURES) {
>> >          printf("%s: Too many cycle count overflows\n", test->name);
>> >          return;
>> >      }
>> >      continue;
>> >  }
>> >
>> > Notice the dropping of the sample parameter (we know it's zero) and
>> > the addition of the test name.
>> >
>>
>> I think the chance which the timer counter actually overflows ((c1) >
>> (c2)) should be really really rare (mostly due to that (c1) == (c2)),
>> and we can just ignore the case if it happens. So I wonder if we can
>> leave the COUNT macro as the way it is, and change the print info in
>> the code like the following?
>>
>> if (sample == 0) {
>>    if (failures++ > MAX_FAILURES) {
>>        printf("%s: Too many overflows, the cost is likely smaller than
>> a cycle count\n", test->name);
>>        return;
>>    }
>>    continue;
>> }
>>
>> Also, what might be a good value for MAX_FAILURES based on your
>> experience? The problem in EOI never happened on the machines I used.
>
> It's always zero on at least one of the platforms I use. So it won't
> matter what MAX_FAILURES is set to, we'll always hit it. I think
> the chance we overflow and get c1 == c2 is almost nil, so we might as
> well report the sample as either zero or one in that case. But whatever
> fixes the time outs is fine by me.

Yes, sounds good to me.

>
>>
>> >> +                     }
>> >> +                     cycles += sample;
>> >> +                     if (min == 0 || min > sample)
>> >> +                             min = sample;
>> >> +                     if (max < sample)
>> >> +                             max = sample;
>> >> +                     ++i;
>> >> +             }
>> >> +     } while (cycles < goal);
>> >> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
>> >> +             test->name, cycles / iterations, min, max);
>> >
>> > Please use %<num><type> format parameters in order to put this
>> > output in columns. Also, I think we should convert the cycle
>> > counts to time (us) using the timer frequency.
>> >
>>
>> One thing I noticed here is, since in kvm-unit-tests, printf does not
>> currently support floating points, we may lose precision in the output
>> for tests that have smaller cost. For example, in EOI, I saw the min
>> output on seattle became zero since its cost is smaller than 1us.
>> Outputting in nanoseconds might be an option but this then requires
>> type casting floats to unsigned (the timer frequency is around
>> hundreds of MHz in the hardware I tested) just to print the value.
>
> We shouldn't need floats. Just use nanoseconds for calculations and
> then convert to us.us-fraction manually.
>
>  ns      = cycles * frq;
>  us      = ns / 1000;
>  us_frac = (ns % 1000) / 100; // one decimal place, use 10 for two
>  printf("%d.d\n", us, us_frac);
>

Just to make sure, the frq here is 1 / (get_cntfrq() / 10^9), right?
On my seattle, get_cntfrq() gives me 250000000 (2.5*10^8), which makes
the denominator less than 1, so I guess I'll need float to do the math
first, then cast the result to an unsigned number like the following.

float frq = 1 / (get_cntfrq() / 10^9);
ns      =  (unsigned) cycles * frq;
....

>>
>> I wonder if it's ok to output the timer frequency/counts as we used to
>> have so the users can do the math themselves?
>
> If you wish. I won't insist on the doing the conversion in the test,
> but it seems easy and helpful enough to do.
>
>> >> +# Exit tests
>> >> +[micro-test]
>> >> +file = micro-test.flat
>> >> +smp = 2
>> >> +groups = micro-test
>> >> +accel = kvm
>> >
>> > Unless we come up with some expected values and thresholds for
>> > these execution times in order to turn the results into PASS/FAIL
>> > tests, then I'm still thinking we should probably add it to the
>> > nodefault group in order to save time running the regression tests.
>>
>> I'm ok with making the micro test as a nodefault if we cannot find a
>> generic timeout value that works across the hardware.
>>
>
> OK, let's do that for now.

Yes. Should I just put a line like "groups = nodefault,micro-test" in
unittests.cfg?

Thanks,
Shih-Wei

>
> Thanks,
> drew

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-21 11:28         ` Christoffer Dall
@ 2017-12-21 16:26           ` Shih-Wei Li
  0 siblings, 0 replies; 14+ messages in thread
From: Shih-Wei Li @ 2017-12-21 16:26 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, Paolo Bonzini, kvmarm, kvm

On Thu, Dec 21, 2017 at 6:28 AM, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> On Thu, Dec 21, 2017 at 10:21:29AM +0100, Andrew Jones wrote:
>> On Wed, Dec 20, 2017 at 04:41:24PM -0500, Shih-Wei Li wrote:
>> > On Wed, Dec 20, 2017 at 11:41 AM, Andrew Jones <drjones@redhat.com> wrote:
>> > > On Tue, Dec 19, 2017 at 08:34:45PM -0500, Shih-Wei Li wrote:
>> > >> +             while (i < iterations) {
>> > >> +                     sample = test->test_fn();
>> > >> +                     if (sample == 0) {
>> > >> +                             /*
>> > >> +                              * If something went wrong or we had an
>> > >> +                              * overflow, don't count that sample.
>> > >> +                              */
>> > >> +                             printf("cycle count overflow: %lu\n", sample);
>> > >> +                             continue;
>> > >
>> > > Also mentioned in the other mail, we should change this to something like
>> > >
>> > >  if (sample == 0) {
>> > >      if (failures++ > MAX_FAILURES) {
>> > >          printf("%s: Too many cycle count overflows\n", test->name);
>> > >          return;
>> > >      }
>> > >      continue;
>> > >  }
>> > >
>> > > Notice the dropping of the sample parameter (we know it's zero) and
>> > > the addition of the test name.
>> > >
>> >
>> > I think the chance which the timer counter actually overflows ((c1) >
>> > (c2)) should be really really rare (mostly due to that (c1) == (c2)),
>> > and we can just ignore the case if it happens. So I wonder if we can
>> > leave the COUNT macro as the way it is, and change the print info in
>> > the code like the following?
>> >
>> > if (sample == 0) {
>> >    if (failures++ > MAX_FAILURES) {
>> >        printf("%s: Too many overflows, the cost is likely smaller than
>> > a cycle count\n", test->name);
>> >        return;
>> >    }
>> >    continue;
>> > }
>> >
>> > Also, what might be a good value for MAX_FAILURES based on your
>> > experience? The problem in EOI never happened on the machines I used.
>>
>> It's always zero on at least one of the platforms I use. So it won't
>> matter what MAX_FAILURES is set to, we'll always hit it. I think
>> the chance we overflow and get c1 == c2 is almost nil, so we might as
>> well report the sample as either zero or one in that case. But whatever
>> fixes the time outs is fine by me.
>>
>
> I think the overflow thing should be kept as a sanity check; if we don't
> see a reasonable cycle count (either overflow or c1 == c2), just cancel
> the test and report to the user that this thing is unreliable and they
> should fix their hypervisor.  Thoughts?

I guess you meant that we should simply cancel the test whenever we
see an unreasonable cycle count, instead of canceling it until this
happens for MAX_FAILURES of times, yes?

>
>> >
>> > >> +                     }
>> > >> +                     cycles += sample;
>> > >> +                     if (min == 0 || min > sample)
>> > >> +                             min = sample;
>> > >> +                     if (max < sample)
>> > >> +                             max = sample;
>> > >> +                     ++i;
>> > >> +             }
>> > >> +     } while (cycles < goal);
>> > >> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
>> > >> +             test->name, cycles / iterations, min, max);
>> > >
>> > > Please use %<num><type> format parameters in order to put this
>> > > output in columns. Also, I think we should convert the cycle
>> > > counts to time (us) using the timer frequency.
>> > >
>> >
>> > One thing I noticed here is, since in kvm-unit-tests, printf does not
>> > currently support floating points, we may lose precision in the output
>> > for tests that have smaller cost. For example, in EOI, I saw the min
>> > output on seattle became zero since its cost is smaller than 1us.
>> > Outputting in nanoseconds might be an option but this then requires
>> > type casting floats to unsigned (the timer frequency is around
>> > hundreds of MHz in the hardware I tested) just to print the value.
>>
>> We shouldn't need floats. Just use nanoseconds for calculations and
>> then convert to us.us-fraction manually.
>>
>>  ns      = cycles * frq;
>>  us      = ns / 1000;
>>  us_frac = (ns % 1000) / 100; // one decimal place, use 10 for two
>>  printf("%d.d\n", us, us_frac);
>>
>> >
>> > I wonder if it's ok to output the timer frequency/counts as we used to
>> > have so the users can do the math themselves?
>>
>> If you wish. I won't insist on the doing the conversion in the test,
>> but it seems easy and helpful enough to do.
>>
>
> I think we're asking for trouble if we output the raw timer counts as
> I'm not sure people are aware this can vary across platforms.
>
> What is the argument against scaling the count to a reasonable measure
> of time?

Because we cannot output float from printf directly in kvm-unit-tests,
so I was just wondering whether we should do the extra conversions to
output the measure of time, or should we leave it for the user to
convert the results.

Thanks,
Shih-Wei

>
> Thanks,
> -Christoffer

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-21 16:12         ` Shih-Wei Li
@ 2017-12-21 16:57           ` Andrew Jones
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Jones @ 2017-12-21 16:57 UTC (permalink / raw)
  To: Shih-Wei Li; +Cc: kvm, kvmarm, Christoffer Dall, Marc Zyngier, Paolo Bonzini

On Thu, Dec 21, 2017 at 11:12:51AM -0500, Shih-Wei Li wrote:
> On Thu, Dec 21, 2017 at 4:21 AM, Andrew Jones <drjones@redhat.com> wrote:
> > On Wed, Dec 20, 2017 at 04:41:24PM -0500, Shih-Wei Li wrote:
> >> On Wed, Dec 20, 2017 at 11:41 AM, Andrew Jones <drjones@redhat.com> wrote:
> >> > On Tue, Dec 19, 2017 at 08:34:45PM -0500, Shih-Wei Li wrote:
> >> >> +             while (i < iterations) {
> >> >> +                     sample = test->test_fn();
> >> >> +                     if (sample == 0) {
> >> >> +                             /*
> >> >> +                              * If something went wrong or we had an
> >> >> +                              * overflow, don't count that sample.
> >> >> +                              */
> >> >> +                             printf("cycle count overflow: %lu\n", sample);
> >> >> +                             continue;
> >> >
> >> > Also mentioned in the other mail, we should change this to something like
> >> >
> >> >  if (sample == 0) {
> >> >      if (failures++ > MAX_FAILURES) {
> >> >          printf("%s: Too many cycle count overflows\n", test->name);
> >> >          return;
> >> >      }
> >> >      continue;
> >> >  }
> >> >
> >> > Notice the dropping of the sample parameter (we know it's zero) and
> >> > the addition of the test name.
> >> >
> >>
> >> I think the chance which the timer counter actually overflows ((c1) >
> >> (c2)) should be really really rare (mostly due to that (c1) == (c2)),
> >> and we can just ignore the case if it happens. So I wonder if we can
> >> leave the COUNT macro as the way it is, and change the print info in
> >> the code like the following?
> >>
> >> if (sample == 0) {
> >>    if (failures++ > MAX_FAILURES) {
> >>        printf("%s: Too many overflows, the cost is likely smaller than
> >> a cycle count\n", test->name);
> >>        return;
> >>    }
> >>    continue;
> >> }
> >>
> >> Also, what might be a good value for MAX_FAILURES based on your
> >> experience? The problem in EOI never happened on the machines I used.
> >
> > It's always zero on at least one of the platforms I use. So it won't
> > matter what MAX_FAILURES is set to, we'll always hit it. I think
> > the chance we overflow and get c1 == c2 is almost nil, so we might as
> > well report the sample as either zero or one in that case. But whatever
> > fixes the time outs is fine by me.
> 
> Yes, sounds good to me.
> 
> >
> >>
> >> >> +                     }
> >> >> +                     cycles += sample;
> >> >> +                     if (min == 0 || min > sample)
> >> >> +                             min = sample;
> >> >> +                     if (max < sample)
> >> >> +                             max = sample;
> >> >> +                     ++i;
> >> >> +             }
> >> >> +     } while (cycles < goal);
> >> >> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> >> >> +             test->name, cycles / iterations, min, max);
> >> >
> >> > Please use %<num><type> format parameters in order to put this
> >> > output in columns. Also, I think we should convert the cycle
> >> > counts to time (us) using the timer frequency.
> >> >
> >>
> >> One thing I noticed here is, since in kvm-unit-tests, printf does not
> >> currently support floating points, we may lose precision in the output
> >> for tests that have smaller cost. For example, in EOI, I saw the min
> >> output on seattle became zero since its cost is smaller than 1us.
> >> Outputting in nanoseconds might be an option but this then requires
> >> type casting floats to unsigned (the timer frequency is around
> >> hundreds of MHz in the hardware I tested) just to print the value.
> >
> > We shouldn't need floats. Just use nanoseconds for calculations and
> > then convert to us.us-fraction manually.
> >
> >  ns      = cycles * frq;
> >  us      = ns / 1000;
> >  us_frac = (ns % 1000) / 100; // one decimal place, use 10 for two
> >  printf("%d.d\n", us, us_frac);
> >
> 
> Just to make sure, the frq here is 1 / (get_cntfrq() / 10^9), right?
> On my seattle, get_cntfrq() gives me 250000000 (2.5*10^8), which makes
> the denominator less than 1, so I guess I'll need float to do the math
> first, then cast the result to an unsigned number like the following.
> 
> float frq = 1 / (get_cntfrq() / 10^9);
> ns      =  (unsigned) cycles * frq;
> ....

Yes, I skipped a step in my example above and named frq wrong, it should
be ns_per_cycle. Your calculation is correct, but you still don't need
float because we don't expect get_cntfrq() > ns_per_sec, so we can just
use integer division

 ns_per_cycle = ns_per_sec / cycles_per_sec = 10^9 / get_cntfrq();
 ns = cycles * ns_per_cycle;
 ...

> 
> >>
> >> I wonder if it's ok to output the timer frequency/counts as we used to
> >> have so the users can do the math themselves?
> >
> > If you wish. I won't insist on the doing the conversion in the test,
> > but it seems easy and helpful enough to do.
> >
> >> >> +# Exit tests
> >> >> +[micro-test]
> >> >> +file = micro-test.flat
> >> >> +smp = 2
> >> >> +groups = micro-test
> >> >> +accel = kvm
> >> >
> >> > Unless we come up with some expected values and thresholds for
> >> > these execution times in order to turn the results into PASS/FAIL
> >> > tests, then I'm still thinking we should probably add it to the
> >> > nodefault group in order to save time running the regression tests.
> >>
> >> I'm ok with making the micro test as a nodefault if we cannot find a
> >> generic timeout value that works across the hardware.
> >>
> >
> > OK, let's do that for now.
> 
> Yes. Should I just put a line like "groups = nodefault,micro-test" in
> unittests.cfg?

Yup

Thanks,
drew

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-20  1:34 ` [kvm-unit-tests PATCH V2 2/2] arm64: add micro test Shih-Wei Li
  2017-12-20 16:41   ` Andrew Jones
@ 2017-12-22  6:07   ` Yury Norov
  2017-12-22  7:26     ` Shih-Wei Li
  1 sibling, 1 reply; 14+ messages in thread
From: Yury Norov @ 2017-12-22  6:07 UTC (permalink / raw)
  To: Shih-Wei Li
  Cc: kvm, kvmarm, christoffer.dall, marc.zyngier, drjones, pbonzini

On Tue, Dec 19, 2017 at 08:34:45PM -0500, Shih-Wei Li wrote:
> Here we provide the support for measuring various micro level
> operations on arm64. We iterate each of the tests for millions of
> times and output their average, minimum and maximum cost in timer
> counts. Instruction barriers were used before and after taking
> timestamps to avoid out-of-order execution or pipelining from
> skewing our measurements.

Instruction barriers are used only before, right? Also, I think it
should be the comment to function rather than part of patch
description.

> The operations we currently supported and measured are mostly
> self-explanatory by their function name. For IPI, we measured the
> cost of sending IPI from a source VCPU to a target VCPU, until the
> target VCPU receives the IPI.
 
Please describe changes since v1.

> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> ---
>  arm/Makefile.common |   1 +
>  arm/micro-test.c    | 219 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  arm/unittests.cfg   |   7 ++
>  3 files changed, 227 insertions(+)
>  create mode 100644 arm/micro-test.c
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 0a039cf..c7d5c27 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>  tests-common += $(TEST_DIR)/gic.flat
>  tests-common += $(TEST_DIR)/psci.flat
>  tests-common += $(TEST_DIR)/sieve.flat
> +tests-common += $(TEST_DIR)/micro-test.flat
>  
>  tests-all = $(tests-common) $(tests)
>  all: directories $(tests-all)
> diff --git a/arm/micro-test.c b/arm/micro-test.c
> new file mode 100644
> index 0000000..c31c9ac
> --- /dev/null
> +++ b/arm/micro-test.c
> @@ -0,0 +1,219 @@
> +/*
> + * Measure the cost of micro level operations.
> + *
> + * Copyright Columbia University
> + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
> + * Author: Christoffer Dall <cdall@cs.columbia.edu>

Nit. If Christoffer is author, he should also sing-off the patch.

> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.
> + */
> +#include "libcflat.h"
> +#include <util.h>
> +#include <asm/gic.h>
> +
> +static volatile bool ipi_received;
> +static volatile bool ipi_ready;
> +#define IPI_IRQ		1
> +
> +#define TIMEOUT (1U << 28)
> +
> +/*
> + * The counter may not always start with zero, which means it could
> + * overflow after some time.
> + */

Is this comment relevant? In discussion people say that they have always
c1 == c2 because timer cannot tick even once, not because it ticks
2^64 times. And this is what really occurs. Also, if ci == c2, 
(c2 - c1) is already zero, and you don't need do something special to 
handle it.

> +#define COUNT(c1, c2) \
> +	(((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> +
> +static uint64_t read_cc(void)
> +{
> +	isb();
> +	return read_sysreg(cntpct_el0);
> +}
> +
> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> +{
> +	u32 ack;
> +	ipi_ready = false;
> +	ipi_received = true;
> +	ack = gic_read_iar();
> +	gic_write_eoir(ack);
> +	ipi_ready = true;
> +}
> +
> +static void ipi_test_secondary_entry(void *data __unused)
> +{
> +	enum vector v = EL1H_IRQ;
> +	install_irq_handler(v, ipi_irq_handler);
> +
> +	gic_enable_defaults();
> +
> +	local_irq_enable(); /* Enter small wait-loop */
> +	ipi_ready = true;
> +	while (true);

I remember, it was asked in v1, but anyway, is it intentional here and
in other similar busy-loops not to use wfe/wfi or cpu_relax()? 

Andrew Jones asked you:

>> cpu_relax() here will help get that first-cpu ack on TCG faster. That
>> said, this test doesn't make sense on TCG anyway, other than debugging
>> it. So you could just add 'accel = kvm' to it's unittests.cfg parameter
>> list.
>
>ok, thanks.

Or I misunderstood something?

> +}
> +
> +static int test_init(void)
> +{
> +	if (!gic_init()) {
> +		printf("No supported gic present, skipping tests...\n");
> +		return 0;
> +	}
> +
> +	ipi_ready = false;
> +
> +	gic_enable_defaults();
> +	on_cpu_async(1, ipi_test_secondary_entry, 0);
> +
> +	printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
> +
> +	return 1;
> +}
> +
> +static unsigned long ipi_test(void)
> +{
> +	unsigned int timeout = TIMEOUT;
> +	uint64_t c1, c2;
> +
> +	while (!ipi_ready && timeout--);

Again, this is not timeout. It's nothing about time.
If you need limit the loop by time, it would probably look
like this:

	while (!ipi_ready && read_cc() - start_cc < timeout)
                cpu_relax();

This snippet is repeated too much times in the test to make it a
function, like wait(addr, value).

> +	assert(ipi_ready);
> +
> +	ipi_received = false;
> +
> +	c1 = read_cc();
> +
> +	gic_ipi_send_single(IPI_IRQ, 1);
> +
> +	timeout = TIMEOUT;
> +	while (!ipi_received && timeout--);
> +	assert(ipi_received);
> +
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +

Odd line is still here.

> +static unsigned long hvc_test(void)
> +{
> +	uint64_t c1, c2;
> +
> +	c1 = read_cc();
> +	asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_user(void)
> +{
> +	uint64_t c1, c2;
> +	/*
> +	 * FIXME: Read device-id in virtio mmio here. This address
> +	 * needs to be updated in the future if any relavent

Typo: relevant

> +	 * changes in QEMU test-dev are made.
> +	 */
> +	void *mmio_read_user_addr = (void*) 0x0a000008;

It sounds like function name. Or you mean mmio read-only address?
If no, I'd rename it to mmio_user_addr

> +
> +	c1 = read_cc();
> +	readl(mmio_read_user_addr);
> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long mmio_read_vgic(void)
> +{
> +	uint64_t c1, c2;
> +	int v = gic_version();

I think gic version cannot be changed while test is running,
so you can check it once on init and store at global variable.

> +	void *vgic_dist_addr = NULL;

Useless initialization

> +
> +	if (v == 2)
> +		vgic_dist_addr = gicv2_dist_base();
> +	else if (v == 3)
> +		vgic_dist_addr = gicv3_dist_base();

Again, if there's real chance that gic version will appear not 2 or 3,
you'd take care of it. If no, make else-branch unconditional, like say 
in ipi_clear_active_handler().

> +	c1 = read_cc();
> +	readl(vgic_dist_addr + GICD_IIDR);

Normally here should be smp_rmb(), but it seems there's no writers at
this address. But if you test mmio performance, you'd take into account
the time for synchronization. Is it correct?

> +	c2 = read_cc();
> +	return COUNT(c1, c2);
> +}
> +
> +static unsigned long eoi_test(void)
> +{
> +	uint64_t c1, c2;
> +	int v = gic_version();
> +	void (*write_eoir)(u32 irqstat) = NULL;

Again, useless initialization. If you make else-branch below
unconditional, you wouldn't need to shut the compiler up.

> +
> +	u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> +
> +	/* To avoid measuring assert(..) in gic_write_eoir */

Sorry, I don't understand what this comment means.
To avoid measuring assert(..) in gic_write_eoir, let's segfault here 
by executing at NULL address - like this? 

> +	if (v == 2)
> +		write_eoir = gicv2_write_eoir;
> +	else if (v == 3)
> +		write_eoir = gicv3_write_eoir;

IIUC, you can do it once at init, right?

> +
> +	c1 = read_cc();
> +	write_eoir(val);
> +	c2 = read_cc();
> +
> +	return COUNT(c1, c2);
> +}
> +
> +struct exit_test {
> +	const char *name;
> +	unsigned long (*test_fn)(void);
> +	bool run;
> +};
> +
> +static struct exit_test tests[] = {
> +	{"hvc",                hvc_test,           true},
> +	{"mmio_read_user",     mmio_read_user,     true},
> +	{"mmio_read_vgic",     mmio_read_vgic,     true},
> +	{"eoi",                eoi_test,           true},
> +	{"ipi",                ipi_test,           true},
> +};
> +
> +static void loop_test(struct exit_test *test)
> +{
> +	unsigned long i, iterations = 32;
> +	unsigned long sample, cycles;
> +	unsigned long long min = 0, max = 0;
> +	const unsigned long long goal = (1ULL << 25);
> +
> +	do {
> +		iterations *= 2;
> +		cycles = 0;
> +		i = 0;
> +		while (i < iterations) {
> +			sample = test->test_fn();
> +			if (sample == 0) {
> +				/*
> +				 * If something went wrong or we had an
> +				 * overflow, don't count that sample.
> +				 */
> +				printf("cycle count overflow: %lu\n", sample);

Text doesn't match the comment.

Sample doesn't contain error code and is always zero. what's the point
to print it?

> +				continue;
> +			}
> +			cycles += sample;
> +			if (min == 0 || min > sample)

Hint: initialize min as ULONGLONG_MAX, and drop this (min == 0) check.

> +				min = sample;
> +			if (max < sample)
> +				max = sample;
> +			++i;
> +		}
> +	} while (cycles < goal);
> +	printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> +		test->name, cycles / iterations, min, max);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	int i;
> +
> +	assert(test_init());
> +
> +	for (i = 0; i < ARRAY_SIZE(tests); i++) {
> +		if (!tests[i].run)
> +			continue;
> +		loop_test(&tests[i]);
> +	}
> +
> +	return 0;
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index 44b98cf..a8fb0b3 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -116,3 +116,10 @@ file = timer.flat
>  groups = timer
>  timeout = 2s
>  arch = arm64
> +
> +# Exit tests
> +[micro-test]
> +file = micro-test.flat
> +smp = 2
> +groups = micro-test
> +accel = kvm
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-22  6:07   ` Yury Norov
@ 2017-12-22  7:26     ` Shih-Wei Li
  2017-12-22 12:44       ` Yury Norov
  0 siblings, 1 reply; 14+ messages in thread
From: Shih-Wei Li @ 2017-12-22  7:26 UTC (permalink / raw)
  To: Yury Norov
  Cc: kvm, kvmarm, Christoffer Dall, Marc Zyngier, Andrew Jones, Paolo Bonzini

On Fri, Dec 22, 2017 at 1:07 AM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> On Tue, Dec 19, 2017 at 08:34:45PM -0500, Shih-Wei Li wrote:
>> Here we provide the support for measuring various micro level
>> operations on arm64. We iterate each of the tests for millions of
>> times and output their average, minimum and maximum cost in timer
>> counts. Instruction barriers were used before and after taking
>> timestamps to avoid out-of-order execution or pipelining from
>> skewing our measurements.
>
> Instruction barriers are used only before, right? Also, I think it
> should be the comment to function rather than part of patch
> description.

Yes. Sorry about the confusion in the texts here.

>
>> The operations we currently supported and measured are mostly
>> self-explanatory by their function name. For IPI, we measured the
>> cost of sending IPI from a source VCPU to a target VCPU, until the
>> target VCPU receives the IPI.
>
> Please describe changes since v1.

changes here include simplifying initialization, reducing the
iterations in loop_test, removing noop_test, fixing coding styles
issues and other nits.

>
>> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
>> ---
>>  arm/Makefile.common |   1 +
>>  arm/micro-test.c    | 219 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  arm/unittests.cfg   |   7 ++
>>  3 files changed, 227 insertions(+)
>>  create mode 100644 arm/micro-test.c
>>
>> diff --git a/arm/Makefile.common b/arm/Makefile.common
>> index 0a039cf..c7d5c27 100644
>> --- a/arm/Makefile.common
>> +++ b/arm/Makefile.common
>> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>>  tests-common += $(TEST_DIR)/gic.flat
>>  tests-common += $(TEST_DIR)/psci.flat
>>  tests-common += $(TEST_DIR)/sieve.flat
>> +tests-common += $(TEST_DIR)/micro-test.flat
>>
>>  tests-all = $(tests-common) $(tests)
>>  all: directories $(tests-all)
>> diff --git a/arm/micro-test.c b/arm/micro-test.c
>> new file mode 100644
>> index 0000000..c31c9ac
>> --- /dev/null
>> +++ b/arm/micro-test.c
>> @@ -0,0 +1,219 @@
>> +/*
>> + * Measure the cost of micro level operations.
>> + *
>> + * Copyright Columbia University
>> + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
>> + * Author: Christoffer Dall <cdall@cs.columbia.edu>
>
> Nit. If Christoffer is author, he should also sing-off the patch.
>
>> + *
>> + * This work is licensed under the terms of the GNU LGPL, version 2.
>> + */
>> +#include "libcflat.h"
>> +#include <util.h>
>> +#include <asm/gic.h>
>> +
>> +static volatile bool ipi_received;
>> +static volatile bool ipi_ready;
>> +#define IPI_IRQ              1
>> +
>> +#define TIMEOUT (1U << 28)
>> +
>> +/*
>> + * The counter may not always start with zero, which means it could
>> + * overflow after some time.
>> + */
>
> Is this comment relevant? In discussion people say that they have always
> c1 == c2 because timer cannot tick even once, not because it ticks
> 2^64 times. And this is what really occurs. Also, if ci == c2,
> (c2 - c1) is already zero, and you don't need do something special to
> handle it.

The comment here is to address your question before whether this is
for overflow protection. And yes, we don't need the "((c1) == (c2))"
condition here, thanks for poiting out!

>
>> +#define COUNT(c1, c2) \
>> +     (((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
>> +
>> +static uint64_t read_cc(void)
>> +{
>> +     isb();
>> +     return read_sysreg(cntpct_el0);
>> +}
>> +
>> +static void ipi_irq_handler(struct pt_regs *regs __unused)
>> +{
>> +     u32 ack;
>> +     ipi_ready = false;
>> +     ipi_received = true;
>> +     ack = gic_read_iar();
>> +     gic_write_eoir(ack);
>> +     ipi_ready = true;
>> +}
>> +
>> +static void ipi_test_secondary_entry(void *data __unused)
>> +{
>> +     enum vector v = EL1H_IRQ;
>> +     install_irq_handler(v, ipi_irq_handler);
>> +
>> +     gic_enable_defaults();
>> +
>> +     local_irq_enable(); /* Enter small wait-loop */
>> +     ipi_ready = true;
>> +     while (true);
>
> I remember, it was asked in v1, but anyway, is it intentional here and
> in other similar busy-loops not to use wfe/wfi or cpu_relax()?
>
> Andrew Jones asked you:
>
>>> cpu_relax() here will help get that first-cpu ack on TCG faster. That
>>> said, this test doesn't make sense on TCG anyway, other than debugging
>>> it. So you could just add 'accel = kvm' to it's unittests.cfg parameter
>>> list.
>>
>>ok, thanks.
>
> Or I misunderstood something?

I added accel = kvm in unittest.cfg so the test can never run on TCG.

>
>> +}
>> +
>> +static int test_init(void)
>> +{
>> +     if (!gic_init()) {
>> +             printf("No supported gic present, skipping tests...\n");
>> +             return 0;
>> +     }
>> +
>> +     ipi_ready = false;
>> +
>> +     gic_enable_defaults();
>> +     on_cpu_async(1, ipi_test_secondary_entry, 0);
>> +
>> +     printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
>> +
>> +     return 1;
>> +}
>> +
>> +static unsigned long ipi_test(void)
>> +{
>> +     unsigned int timeout = TIMEOUT;
>> +     uint64_t c1, c2;
>> +
>> +     while (!ipi_ready && timeout--);
>
> Again, this is not timeout. It's nothing about time.
> If you need limit the loop by time, it would probably look
> like this:
>
>         while (!ipi_ready && read_cc() - start_cc < timeout)
>                 cpu_relax();
>
> This snippet is repeated too much times in the test to make it a
> function, like wait(addr, value).

Sorry for the confusion due to bad naming here, Alex Jones mentioned
it's probably better to rename timeout here as tries (because there
are no timing-related checks here). Will fix it in v3.

>
>> +     assert(ipi_ready);
>> +
>> +     ipi_received = false;
>> +
>> +     c1 = read_cc();
>> +
>> +     gic_ipi_send_single(IPI_IRQ, 1);
>> +
>> +     timeout = TIMEOUT;
>> +     while (!ipi_received && timeout--);
>> +     assert(ipi_received);
>> +
>> +     c2 = read_cc();
>> +     return COUNT(c1, c2);
>> +}
>> +
>> +
>
> Odd line is still here.
>

Yes, will fix it.

>> +static unsigned long hvc_test(void)
>> +{
>> +     uint64_t c1, c2;
>> +
>> +     c1 = read_cc();
>> +     asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
>> +     c2 = read_cc();
>> +     return COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_user(void)
>> +{
>> +     uint64_t c1, c2;
>> +     /*
>> +      * FIXME: Read device-id in virtio mmio here. This address
>> +      * needs to be updated in the future if any relavent
>
> Typo: relevant
>
>> +      * changes in QEMU test-dev are made.
>> +      */
>> +     void *mmio_read_user_addr = (void*) 0x0a000008;
>
> It sounds like function name. Or you mean mmio read-only address?
> If no, I'd rename it to mmio_user_addr

I named it as it is because this address is used for the following
mmio-read operation below. I think the address is also read-only
though.

>
>> +
>> +     c1 = read_cc();
>> +     readl(mmio_read_user_addr);
>> +     c2 = read_cc();
>> +     return COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long mmio_read_vgic(void)
>> +{
>> +     uint64_t c1, c2;
>> +     int v = gic_version();
>
> I think gic version cannot be changed while test is running,
> so you can check it once on init and store at global variable.
>
>> +     void *vgic_dist_addr = NULL;
>
> Useless initialization
>
>> +
>> +     if (v == 2)
>> +             vgic_dist_addr = gicv2_dist_base();
>> +     else if (v == 3)
>> +             vgic_dist_addr = gicv3_dist_base();
>
> Again, if there's real chance that gic version will appear not 2 or 3,
> you'd take care of it. If no, make else-branch unconditional, like say
> in ipi_clear_active_handler().

The test will not run if the gic version is not 2 or 3 as it'll
trigger the assert in main().

>
>> +     c1 = read_cc();
>> +     readl(vgic_dist_addr + GICD_IIDR);
>
> Normally here should be smp_rmb(), but it seems there's no writers at
> this address. But if you test mmio performance, you'd take into account
> the time for synchronization. Is it correct?

I think we had an isb in read_cc before getting the counter so to make
sure we read the counter after the MMIO read. Does it address your
concern?

>
>> +     c2 = read_cc();
>> +     return COUNT(c1, c2);
>> +}
>> +
>> +static unsigned long eoi_test(void)
>> +{
>> +     uint64_t c1, c2;
>> +     int v = gic_version();
>> +     void (*write_eoir)(u32 irqstat) = NULL;
>
> Again, useless initialization. If you make else-branch below
> unconditional, you wouldn't need to shut the compiler up.
>
>> +
>> +     u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
>> +
>> +     /* To avoid measuring assert(..) in gic_write_eoir */
>
> Sorry, I don't understand what this comment means.
> To avoid measuring assert(..) in gic_write_eoir, let's segfault here
> by executing at NULL address - like this?

The exiting "gic_write_eoir" function includes an assert(..), which
results to some bias in the cost in EOI, which it's pure cost is
already rather small. So the code here calls to the gicvx related
function directly to avoid the assert().

>
>> +     if (v == 2)
>> +             write_eoir = gicv2_write_eoir;
>> +     else if (v == 3)
>> +             write_eoir = gicv3_write_eoir;
>
> IIUC, you can do it once at init, right?

Yes, I can. I guess either way should work.

>
>> +
>> +     c1 = read_cc();
>> +     write_eoir(val);
>> +     c2 = read_cc();
>> +
>> +     return COUNT(c1, c2);
>> +}
>> +
>> +struct exit_test {
>> +     const char *name;
>> +     unsigned long (*test_fn)(void);
>> +     bool run;
>> +};
>> +
>> +static struct exit_test tests[] = {
>> +     {"hvc",                hvc_test,           true},
>> +     {"mmio_read_user",     mmio_read_user,     true},
>> +     {"mmio_read_vgic",     mmio_read_vgic,     true},
>> +     {"eoi",                eoi_test,           true},
>> +     {"ipi",                ipi_test,           true},
>> +};
>> +
>> +static void loop_test(struct exit_test *test)
>> +{
>> +     unsigned long i, iterations = 32;
>> +     unsigned long sample, cycles;
>> +     unsigned long long min = 0, max = 0;
>> +     const unsigned long long goal = (1ULL << 25);
>> +
>> +     do {
>> +             iterations *= 2;
>> +             cycles = 0;
>> +             i = 0;
>> +             while (i < iterations) {
>> +                     sample = test->test_fn();
>> +                     if (sample == 0) {
>> +                             /*
>> +                              * If something went wrong or we had an
>> +                              * overflow, don't count that sample.
>> +                              */
>> +                             printf("cycle count overflow: %lu\n", sample);
>
> Text doesn't match the comment.

I guess you meant that we should say something like the cost is
smaller than a timer count?

>
> Sample doesn't contain error code and is always zero. what's the point
> to print it?
>
>> +                             continue;
>> +                     }
>> +                     cycles += sample;
>> +                     if (min == 0 || min > sample)
>
> Hint: initialize min as ULONGLONG_MAX, and drop this (min == 0) check.
>
>> +                             min = sample;
>> +                     if (max < sample)
>> +                             max = sample;
>> +                     ++i;
>> +             }
>> +     } while (cycles < goal);
>> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
>> +             test->name, cycles / iterations, min, max);
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> +     int i;
>> +
>> +     assert(test_init());
>> +
>> +     for (i = 0; i < ARRAY_SIZE(tests); i++) {
>> +             if (!tests[i].run)
>> +                     continue;
>> +             loop_test(&tests[i]);
>> +     }
>> +
>> +     return 0;
>> +}
>> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
>> index 44b98cf..a8fb0b3 100644
>> --- a/arm/unittests.cfg
>> +++ b/arm/unittests.cfg
>> @@ -116,3 +116,10 @@ file = timer.flat
>>  groups = timer
>>  timeout = 2s
>>  arch = arm64
>> +
>> +# Exit tests
>> +[micro-test]
>> +file = micro-test.flat
>> +smp = 2
>> +groups = micro-test
>> +accel = kvm
>> --
>> 1.9.1
>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-22  7:26     ` Shih-Wei Li
@ 2017-12-22 12:44       ` Yury Norov
  2018-01-17 21:49         ` Shih-Wei Li
  0 siblings, 1 reply; 14+ messages in thread
From: Yury Norov @ 2017-12-22 12:44 UTC (permalink / raw)
  To: Shih-Wei Li; +Cc: kvm, Marc Zyngier, Paolo Bonzini, kvmarm

On Fri, Dec 22, 2017 at 02:26:32AM -0500, Shih-Wei Li wrote:
> On Fri, Dec 22, 2017 at 1:07 AM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> > On Tue, Dec 19, 2017 at 08:34:45PM -0500, Shih-Wei Li wrote:
> >> Here we provide the support for measuring various micro level
> >> operations on arm64. We iterate each of the tests for millions of
> >> times and output their average, minimum and maximum cost in timer
> >> counts. Instruction barriers were used before and after taking
> >> timestamps to avoid out-of-order execution or pipelining from
> >> skewing our measurements.
> >
> > Instruction barriers are used only before, right? Also, I think it
> > should be the comment to function rather than part of patch
> > description.
> 
> Yes. Sorry about the confusion in the texts here.
> 
> >
> >> The operations we currently supported and measured are mostly
> >> self-explanatory by their function name. For IPI, we measured the
> >> cost of sending IPI from a source VCPU to a target VCPU, until the
> >> target VCPU receives the IPI.
> >
> > Please describe changes since v1.
> 
> changes here include simplifying initialization, reducing the
> iterations in loop_test, removing noop_test, fixing coding styles
> issues and other nits.
> 
> >
> >> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
> >> ---
> >>  arm/Makefile.common |   1 +
> >>  arm/micro-test.c    | 219 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  arm/unittests.cfg   |   7 ++
> >>  3 files changed, 227 insertions(+)
> >>  create mode 100644 arm/micro-test.c
> >>
> >> diff --git a/arm/Makefile.common b/arm/Makefile.common
> >> index 0a039cf..c7d5c27 100644
> >> --- a/arm/Makefile.common
> >> +++ b/arm/Makefile.common
> >> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
> >>  tests-common += $(TEST_DIR)/gic.flat
> >>  tests-common += $(TEST_DIR)/psci.flat
> >>  tests-common += $(TEST_DIR)/sieve.flat
> >> +tests-common += $(TEST_DIR)/micro-test.flat
> >>
> >>  tests-all = $(tests-common) $(tests)
> >>  all: directories $(tests-all)
> >> diff --git a/arm/micro-test.c b/arm/micro-test.c
> >> new file mode 100644
> >> index 0000000..c31c9ac
> >> --- /dev/null
> >> +++ b/arm/micro-test.c
> >> @@ -0,0 +1,219 @@
> >> +/*
> >> + * Measure the cost of micro level operations.
> >> + *
> >> + * Copyright Columbia University
> >> + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
> >> + * Author: Christoffer Dall <cdall@cs.columbia.edu>
> >
> > Nit. If Christoffer is author, he should also sing-off the patch.
> >
> >> + *
> >> + * This work is licensed under the terms of the GNU LGPL, version 2.
> >> + */
> >> +#include "libcflat.h"
> >> +#include <util.h>
> >> +#include <asm/gic.h>
> >> +
> >> +static volatile bool ipi_received;
> >> +static volatile bool ipi_ready;
> >> +#define IPI_IRQ              1
> >> +
> >> +#define TIMEOUT (1U << 28)
> >> +
> >> +/*
> >> + * The counter may not always start with zero, which means it could
> >> + * overflow after some time.
> >> + */
> >
> > Is this comment relevant? In discussion people say that they have always
> > c1 == c2 because timer cannot tick even once, not because it ticks
> > 2^64 times. And this is what really occurs. Also, if ci == c2,
> > (c2 - c1) is already zero, and you don't need do something special to
> > handle it.
> 
> The comment here is to address your question before whether this is
> for overflow protection. And yes, we don't need the "((c1) == (c2))"
> condition here, thanks for poiting out!

And another thing that comes in mind. If individual time measurements 
will be mostly 0, 1 and 2, it is is very suspicious.

What I mean is that tick of system clock is the fastest process on the
chip, and delivering IPI is complicated procedure, especially in
virtualized system. In my IPI test for Linux the typical time to
deliver IPI is 0.3-0.4 us, which gives almmost 1K ticks at 2+ GHZ.
Did you check your prescalers and so on?

> >> +#define COUNT(c1, c2) \
> >> +     (((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
> >> +
> >> +static uint64_t read_cc(void)
> >> +{
> >> +     isb();
> >> +     return read_sysreg(cntpct_el0);
> >> +}
> >> +
> >> +static void ipi_irq_handler(struct pt_regs *regs __unused)
> >> +{
> >> +     u32 ack;
> >> +     ipi_ready = false;
> >> +     ipi_received = true;
> >> +     ack = gic_read_iar();
> >> +     gic_write_eoir(ack);
> >> +     ipi_ready = true;
> >> +}
> >> +
> >> +static void ipi_test_secondary_entry(void *data __unused)
> >> +{
> >> +     enum vector v = EL1H_IRQ;
> >> +     install_irq_handler(v, ipi_irq_handler);
> >> +
> >> +     gic_enable_defaults();
> >> +
> >> +     local_irq_enable(); /* Enter small wait-loop */
> >> +     ipi_ready = true;
> >> +     while (true);
> >
> > I remember, it was asked in v1, but anyway, is it intentional here and
> > in other similar busy-loops not to use wfe/wfi or cpu_relax()?
> >
> > Andrew Jones asked you:
> >
> >>> cpu_relax() here will help get that first-cpu ack on TCG faster. That
> >>> said, this test doesn't make sense on TCG anyway, other than debugging
> >>> it. So you could just add 'accel = kvm' to it's unittests.cfg parameter
> >>> list.
> >>
> >>ok, thanks.
> >
> > Or I misunderstood something?
> 
> I added accel = kvm in unittest.cfg so the test can never run on TCG.

Ah, OK.

> >> +}
> >> +
> >> +static int test_init(void)
> >> +{
> >> +     if (!gic_init()) {
> >> +             printf("No supported gic present, skipping tests...\n");
> >> +             return 0;
> >> +     }
> >> +
> >> +     ipi_ready = false;
> >> +
> >> +     gic_enable_defaults();
> >> +     on_cpu_async(1, ipi_test_secondary_entry, 0);
> >> +
> >> +     printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
> >> +
> >> +     return 1;
> >> +}
> >> +
> >> +static unsigned long ipi_test(void)
> >> +{
> >> +     unsigned int timeout = TIMEOUT;
> >> +     uint64_t c1, c2;
> >> +
> >> +     while (!ipi_ready && timeout--);
> >
> > Again, this is not timeout. It's nothing about time.
> > If you need limit the loop by time, it would probably look
> > like this:
> >
> >         while (!ipi_ready && read_cc() - start_cc < timeout)
> >                 cpu_relax();
> >
> > This snippet is repeated too much times in the test to make it a
> > function, like wait(addr, value).
> 
> Sorry for the confusion due to bad naming here, Alex Jones mentioned
> it's probably better to rename timeout here as tries (because there
> are no timing-related checks here). Will fix it in v3.

I think this is not naming issue. It's OK if you try say turn on
peripheral device for 10 times, and then report something like
device not acceptable. Here you make 2^28 attempts and name it
timeout. What I think is that if you need timeouts, the most
straightforward option for you is to use timeouts.

> >> +     assert(ipi_ready);
> >> +
> >> +     ipi_received = false;
> >> +
> >> +     c1 = read_cc();
> >> +
> >> +     gic_ipi_send_single(IPI_IRQ, 1);
> >> +
> >> +     timeout = TIMEOUT;
> >> +     while (!ipi_received && timeout--);
> >> +     assert(ipi_received);
> >> +
> >> +     c2 = read_cc();
> >> +     return COUNT(c1, c2);
> >> +}
> >> +
> >> +
> >
> > Odd line is still here.
> >
> 
> Yes, will fix it.
> 
> >> +static unsigned long hvc_test(void)
> >> +{
> >> +     uint64_t c1, c2;
> >> +
> >> +     c1 = read_cc();
> >> +     asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
> >> +     c2 = read_cc();
> >> +     return COUNT(c1, c2);
> >> +}
> >> +
> >> +static unsigned long mmio_read_user(void)
> >> +{
> >> +     uint64_t c1, c2;
> >> +     /*
> >> +      * FIXME: Read device-id in virtio mmio here. This address
> >> +      * needs to be updated in the future if any relavent
> >
> > Typo: relevant
> >
> >> +      * changes in QEMU test-dev are made.
> >> +      */
> >> +     void *mmio_read_user_addr = (void*) 0x0a000008;
> >
> > It sounds like function name. Or you mean mmio read-only address?
> > If no, I'd rename it to mmio_user_addr
> 
> I named it as it is because this address is used for the following
> mmio-read operation below. I think the address is also read-only
> though.

For me the most suitable name would be mmio_device_id.

> >> +
> >> +     c1 = read_cc();
> >> +     readl(mmio_read_user_addr);
> >> +     c2 = read_cc();
> >> +     return COUNT(c1, c2);
> >> +}
> >> +
> >> +static unsigned long mmio_read_vgic(void)
> >> +{
> >> +     uint64_t c1, c2;
> >> +     int v = gic_version();
> >
> > I think gic version cannot be changed while test is running,
> > so you can check it once on init and store at global variable.
> >
> >> +     void *vgic_dist_addr = NULL;
> >
> > Useless initialization
> >
> >> +
> >> +     if (v == 2)
> >> +             vgic_dist_addr = gicv2_dist_base();
> >> +     else if (v == 3)
> >> +             vgic_dist_addr = gicv3_dist_base();
> >
> > Again, if there's real chance that gic version will appear not 2 or 3,
> > you'd take care of it. If no, make else-branch unconditional, like say
> > in ipi_clear_active_handler().
> 
> The test will not run if the gic version is not 2 or 3 as it'll
> trigger the assert in main().
> 
> >
> >> +     c1 = read_cc();
> >> +     readl(vgic_dist_addr + GICD_IIDR);
> >
> > Normally here should be smp_rmb(), but it seems there's no writers at
> > this address. But if you test mmio performance, you'd take into account
> > the time for synchronization. Is it correct?
> 
> I think we had an isb in read_cc before getting the counter so to make
> sure we read the counter after the MMIO read. Does it address your
> concern?

Flushing pipeline and I-cache which is isb, and synchronizing D-caches
are different stories.

> >> +     c2 = read_cc();
> >> +     return COUNT(c1, c2);
> >> +}
> >> +
> >> +static unsigned long eoi_test(void)
> >> +{
> >> +     uint64_t c1, c2;
> >> +     int v = gic_version();
> >> +     void (*write_eoir)(u32 irqstat) = NULL;
> >
> > Again, useless initialization. If you make else-branch below
> > unconditional, you wouldn't need to shut the compiler up.
> >
> >> +
> >> +     u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
> >> +
> >> +     /* To avoid measuring assert(..) in gic_write_eoir */
> >
> > Sorry, I don't understand what this comment means.
> > To avoid measuring assert(..) in gic_write_eoir, let's segfault here
> > by executing at NULL address - like this?
> 
> The exiting "gic_write_eoir" function includes an assert(..), which
> results to some bias in the cost in EOI, which it's pure cost is
> already rather small. So the code here calls to the gicvx related
> function directly to avoid the assert().
> 
> >
> >> +     if (v == 2)
> >> +             write_eoir = gicv2_write_eoir;
> >> +     else if (v == 3)
> >> +             write_eoir = gicv3_write_eoir;
> >
> > IIUC, you can do it once at init, right?
> 
> Yes, I can. I guess either way should work.

But in current version you do it billion times in hot path which
probably affects final results.

> >> +
> >> +     c1 = read_cc();
> >> +     write_eoir(val);
> >> +     c2 = read_cc();
> >> +
> >> +     return COUNT(c1, c2);
> >> +}
> >> +
> >> +struct exit_test {
> >> +     const char *name;
> >> +     unsigned long (*test_fn)(void);
> >> +     bool run;
> >> +};
> >> +
> >> +static struct exit_test tests[] = {
> >> +     {"hvc",                hvc_test,           true},
> >> +     {"mmio_read_user",     mmio_read_user,     true},
> >> +     {"mmio_read_vgic",     mmio_read_vgic,     true},
> >> +     {"eoi",                eoi_test,           true},
> >> +     {"ipi",                ipi_test,           true},
> >> +};
> >> +
> >> +static void loop_test(struct exit_test *test)
> >> +{
> >> +     unsigned long i, iterations = 32;
> >> +     unsigned long sample, cycles;
> >> +     unsigned long long min = 0, max = 0;
> >> +     const unsigned long long goal = (1ULL << 25);
> >> +
> >> +     do {
> >> +             iterations *= 2;
> >> +             cycles = 0;
> >> +             i = 0;
> >> +             while (i < iterations) {
> >> +                     sample = test->test_fn();
> >> +                     if (sample == 0) {
> >> +                             /*
> >> +                              * If something went wrong or we had an
> >> +                              * overflow, don't count that sample.
> >> +                              */
> >> +                             printf("cycle count overflow: %lu\n", sample);
> >
> > Text doesn't match the comment.
> 
> I guess you meant that we should say something like the cost is
> smaller than a timer count?

Yep

> > Sample doesn't contain error code and is always zero. what's the point
> > to print it?
> >
> >> +                             continue;
> >> +                     }
> >> +                     cycles += sample;
> >> +                     if (min == 0 || min > sample)
> >
> > Hint: initialize min as ULONGLONG_MAX, and drop this (min == 0) check.
> >
> >> +                             min = sample;
> >> +                     if (max < sample)
> >> +                             max = sample;
> >> +                     ++i;
> >> +             }
> >> +     } while (cycles < goal);
> >> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
> >> +             test->name, cycles / iterations, min, max);
> >> +}
> >> +
> >> +int main(int argc, char **argv)
> >> +{
> >> +     int i;
> >> +
> >> +     assert(test_init());
> >> +
> >> +     for (i = 0; i < ARRAY_SIZE(tests); i++) {
> >> +             if (!tests[i].run)
> >> +                     continue;
> >> +             loop_test(&tests[i]);
> >> +     }
> >> +
> >> +     return 0;
> >> +}
> >> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> >> index 44b98cf..a8fb0b3 100644
> >> --- a/arm/unittests.cfg
> >> +++ b/arm/unittests.cfg
> >> @@ -116,3 +116,10 @@ file = timer.flat
> >>  groups = timer
> >>  timeout = 2s
> >>  arch = arm64
> >> +
> >> +# Exit tests
> >> +[micro-test]
> >> +file = micro-test.flat
> >> +smp = 2

Sidenote. It would be also interesting to benchmark broadcast IPIs with
smp >> 2. 

> >> +groups = micro-test
> >> +accel = kvm
> >> --
> >> 1.9.1
> >>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [kvm-unit-tests PATCH V2 2/2] arm64: add micro test
  2017-12-22 12:44       ` Yury Norov
@ 2018-01-17 21:49         ` Shih-Wei Li
  0 siblings, 0 replies; 14+ messages in thread
From: Shih-Wei Li @ 2018-01-17 21:49 UTC (permalink / raw)
  To: Yury Norov
  Cc: kvm, kvmarm, Christoffer Dall, Marc Zyngier, Andrew Jones, Paolo Bonzini

On Fri, Dec 22, 2017 at 7:44 AM, Yury Norov <ynorov@caviumnetworks.com> wrote:
> On Fri, Dec 22, 2017 at 02:26:32AM -0500, Shih-Wei Li wrote:
>> On Fri, Dec 22, 2017 at 1:07 AM, Yury Norov <ynorov@caviumnetworks.com> wrote:
>> > On Tue, Dec 19, 2017 at 08:34:45PM -0500, Shih-Wei Li wrote:
>> >> Here we provide the support for measuring various micro level
>> >> operations on arm64. We iterate each of the tests for millions of
>> >> times and output their average, minimum and maximum cost in timer
>> >> counts. Instruction barriers were used before and after taking
>> >> timestamps to avoid out-of-order execution or pipelining from
>> >> skewing our measurements.
>> >
>> > Instruction barriers are used only before, right? Also, I think it
>> > should be the comment to function rather than part of patch
>> > description.
>>
>> Yes. Sorry about the confusion in the texts here.
>>
>> >
>> >> The operations we currently supported and measured are mostly
>> >> self-explanatory by their function name. For IPI, we measured the
>> >> cost of sending IPI from a source VCPU to a target VCPU, until the
>> >> target VCPU receives the IPI.
>> >
>> > Please describe changes since v1.
>>
>> changes here include simplifying initialization, reducing the
>> iterations in loop_test, removing noop_test, fixing coding styles
>> issues and other nits.
>>
>> >
>> >> Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
>> >> ---
>> >>  arm/Makefile.common |   1 +
>> >>  arm/micro-test.c    | 219 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>  arm/unittests.cfg   |   7 ++
>> >>  3 files changed, 227 insertions(+)
>> >>  create mode 100644 arm/micro-test.c
>> >>
>> >> diff --git a/arm/Makefile.common b/arm/Makefile.common
>> >> index 0a039cf..c7d5c27 100644
>> >> --- a/arm/Makefile.common
>> >> +++ b/arm/Makefile.common
>> >> @@ -16,6 +16,7 @@ tests-common += $(TEST_DIR)/pmu.flat
>> >>  tests-common += $(TEST_DIR)/gic.flat
>> >>  tests-common += $(TEST_DIR)/psci.flat
>> >>  tests-common += $(TEST_DIR)/sieve.flat
>> >> +tests-common += $(TEST_DIR)/micro-test.flat
>> >>
>> >>  tests-all = $(tests-common) $(tests)
>> >>  all: directories $(tests-all)
>> >> diff --git a/arm/micro-test.c b/arm/micro-test.c
>> >> new file mode 100644
>> >> index 0000000..c31c9ac
>> >> --- /dev/null
>> >> +++ b/arm/micro-test.c
>> >> @@ -0,0 +1,219 @@
>> >> +/*
>> >> + * Measure the cost of micro level operations.
>> >> + *
>> >> + * Copyright Columbia University
>> >> + * Author: Shih-Wei Li <shihwei@cs.columbia.edu>
>> >> + * Author: Christoffer Dall <cdall@cs.columbia.edu>
>> >
>> > Nit. If Christoffer is author, he should also sing-off the patch.
>> >
>> >> + *
>> >> + * This work is licensed under the terms of the GNU LGPL, version 2.
>> >> + */
>> >> +#include "libcflat.h"
>> >> +#include <util.h>
>> >> +#include <asm/gic.h>
>> >> +
>> >> +static volatile bool ipi_received;
>> >> +static volatile bool ipi_ready;
>> >> +#define IPI_IRQ              1
>> >> +
>> >> +#define TIMEOUT (1U << 28)
>> >> +
>> >> +/*
>> >> + * The counter may not always start with zero, which means it could
>> >> + * overflow after some time.
>> >> + */
>> >
>> > Is this comment relevant? In discussion people say that they have always
>> > c1 == c2 because timer cannot tick even once, not because it ticks
>> > 2^64 times. And this is what really occurs. Also, if ci == c2,
>> > (c2 - c1) is already zero, and you don't need do something special to
>> > handle it.
>>
>> The comment here is to address your question before whether this is
>> for overflow protection. And yes, we don't need the "((c1) == (c2))"
>> condition here, thanks for poiting out!
>
> And another thing that comes in mind. If individual time measurements
> will be mostly 0, 1 and 2, it is is very suspicious.
>
> What I mean is that tick of system clock is the fastest process on the
> chip, and delivering IPI is complicated procedure, especially in
> virtualized system. In my IPI test for Linux the typical time to
> deliver IPI is 0.3-0.4 us, which gives almmost 1K ticks at 2+ GHZ.
> Did you check your prescalers and so on?
>
>> >> +#define COUNT(c1, c2) \
>> >> +     (((c1) > (c2) || ((c1) == (c2))) ? 0 : (c2) - (c1))
>> >> +
>> >> +static uint64_t read_cc(void)
>> >> +{
>> >> +     isb();
>> >> +     return read_sysreg(cntpct_el0);
>> >> +}
>> >> +
>> >> +static void ipi_irq_handler(struct pt_regs *regs __unused)
>> >> +{
>> >> +     u32 ack;
>> >> +     ipi_ready = false;
>> >> +     ipi_received = true;
>> >> +     ack = gic_read_iar();
>> >> +     gic_write_eoir(ack);
>> >> +     ipi_ready = true;
>> >> +}
>> >> +
>> >> +static void ipi_test_secondary_entry(void *data __unused)
>> >> +{
>> >> +     enum vector v = EL1H_IRQ;
>> >> +     install_irq_handler(v, ipi_irq_handler);
>> >> +
>> >> +     gic_enable_defaults();
>> >> +
>> >> +     local_irq_enable(); /* Enter small wait-loop */
>> >> +     ipi_ready = true;
>> >> +     while (true);
>> >
>> > I remember, it was asked in v1, but anyway, is it intentional here and
>> > in other similar busy-loops not to use wfe/wfi or cpu_relax()?
>> >
>> > Andrew Jones asked you:
>> >
>> >>> cpu_relax() here will help get that first-cpu ack on TCG faster. That
>> >>> said, this test doesn't make sense on TCG anyway, other than debugging
>> >>> it. So you could just add 'accel = kvm' to it's unittests.cfg parameter
>> >>> list.
>> >>
>> >>ok, thanks.
>> >
>> > Or I misunderstood something?
>>
>> I added accel = kvm in unittest.cfg so the test can never run on TCG.
>
> Ah, OK.
>
>> >> +}
>> >> +
>> >> +static int test_init(void)
>> >> +{
>> >> +     if (!gic_init()) {
>> >> +             printf("No supported gic present, skipping tests...\n");
>> >> +             return 0;
>> >> +     }
>> >> +
>> >> +     ipi_ready = false;
>> >> +
>> >> +     gic_enable_defaults();
>> >> +     on_cpu_async(1, ipi_test_secondary_entry, 0);
>> >> +
>> >> +     printf("Timer Frequency %d Hz (Output in timer count)\n", get_cntfrq());
>> >> +
>> >> +     return 1;
>> >> +}
>> >> +
>> >> +static unsigned long ipi_test(void)
>> >> +{
>> >> +     unsigned int timeout = TIMEOUT;
>> >> +     uint64_t c1, c2;
>> >> +
>> >> +     while (!ipi_ready && timeout--);
>> >
>> > Again, this is not timeout. It's nothing about time.
>> > If you need limit the loop by time, it would probably look
>> > like this:
>> >
>> >         while (!ipi_ready && read_cc() - start_cc < timeout)
>> >                 cpu_relax();
>> >
>> > This snippet is repeated too much times in the test to make it a
>> > function, like wait(addr, value).
>>
>> Sorry for the confusion due to bad naming here, Alex Jones mentioned
>> it's probably better to rename timeout here as tries (because there
>> are no timing-related checks here). Will fix it in v3.
>
> I think this is not naming issue. It's OK if you try say turn on
> peripheral device for 10 times, and then report something like
> device not acceptable. Here you make 2^28 attempts and name it
> timeout. What I think is that if you need timeouts, the most
> straightforward option for you is to use timeouts.

So the problem with using read_cc is, we then will measure the time
included from isb() in read_cc(). I think it's ok to use just a simple
counter here because the purpose of the counter is to exit gracefully
instead of spinning in the loop, once the test fails to complete
within some period of time.

>
>> >> +     assert(ipi_ready);
>> >> +
>> >> +     ipi_received = false;
>> >> +
>> >> +     c1 = read_cc();
>> >> +
>> >> +     gic_ipi_send_single(IPI_IRQ, 1);
>> >> +
>> >> +     timeout = TIMEOUT;
>> >> +     while (!ipi_received && timeout--);
>> >> +     assert(ipi_received);
>> >> +
>> >> +     c2 = read_cc();
>> >> +     return COUNT(c1, c2);
>> >> +}
>> >> +
>> >> +
>> >
>> > Odd line is still here.
>> >
>>
>> Yes, will fix it.
>>
>> >> +static unsigned long hvc_test(void)
>> >> +{
>> >> +     uint64_t c1, c2;
>> >> +
>> >> +     c1 = read_cc();
>> >> +     asm volatile("mov w0, #0x4b000000; hvc #0" ::: "w0");
>> >> +     c2 = read_cc();
>> >> +     return COUNT(c1, c2);
>> >> +}
>> >> +
>> >> +static unsigned long mmio_read_user(void)
>> >> +{
>> >> +     uint64_t c1, c2;
>> >> +     /*
>> >> +      * FIXME: Read device-id in virtio mmio here. This address
>> >> +      * needs to be updated in the future if any relavent
>> >
>> > Typo: relevant
>> >
>> >> +      * changes in QEMU test-dev are made.
>> >> +      */
>> >> +     void *mmio_read_user_addr = (void*) 0x0a000008;
>> >
>> > It sounds like function name. Or you mean mmio read-only address?
>> > If no, I'd rename it to mmio_user_addr
>>
>> I named it as it is because this address is used for the following
>> mmio-read operation below. I think the address is also read-only
>> though.
>
> For me the most suitable name would be mmio_device_id.
>
>> >> +
>> >> +     c1 = read_cc();
>> >> +     readl(mmio_read_user_addr);
>> >> +     c2 = read_cc();
>> >> +     return COUNT(c1, c2);
>> >> +}
>> >> +
>> >> +static unsigned long mmio_read_vgic(void)
>> >> +{
>> >> +     uint64_t c1, c2;
>> >> +     int v = gic_version();
>> >
>> > I think gic version cannot be changed while test is running,
>> > so you can check it once on init and store at global variable.
>> >
>> >> +     void *vgic_dist_addr = NULL;
>> >
>> > Useless initialization
>> >
>> >> +
>> >> +     if (v == 2)
>> >> +             vgic_dist_addr = gicv2_dist_base();
>> >> +     else if (v == 3)
>> >> +             vgic_dist_addr = gicv3_dist_base();
>> >
>> > Again, if there's real chance that gic version will appear not 2 or 3,
>> > you'd take care of it. If no, make else-branch unconditional, like say
>> > in ipi_clear_active_handler().
>>
>> The test will not run if the gic version is not 2 or 3 as it'll
>> trigger the assert in main().
>>
>> >
>> >> +     c1 = read_cc();
>> >> +     readl(vgic_dist_addr + GICD_IIDR);
>> >
>> > Normally here should be smp_rmb(), but it seems there's no writers at
>> > this address. But if you test mmio performance, you'd take into account
>> > the time for synchronization. Is it correct?
>>
>> I think we had an isb in read_cc before getting the counter so to make
>> sure we read the counter after the MMIO read. Does it address your
>> concern?
>
> Flushing pipeline and I-cache which is isb, and synchronizing D-caches
> are different stories.
>
>> >> +     c2 = read_cc();
>> >> +     return COUNT(c1, c2);
>> >> +}
>> >> +
>> >> +static unsigned long eoi_test(void)
>> >> +{
>> >> +     uint64_t c1, c2;
>> >> +     int v = gic_version();
>> >> +     void (*write_eoir)(u32 irqstat) = NULL;
>> >
>> > Again, useless initialization. If you make else-branch below
>> > unconditional, you wouldn't need to shut the compiler up.
>> >
>> >> +
>> >> +     u32 val = 1023; /* spurious IDs, writes to EOI are ignored */
>> >> +
>> >> +     /* To avoid measuring assert(..) in gic_write_eoir */
>> >
>> > Sorry, I don't understand what this comment means.
>> > To avoid measuring assert(..) in gic_write_eoir, let's segfault here
>> > by executing at NULL address - like this?
>>
>> The exiting "gic_write_eoir" function includes an assert(..), which
>> results to some bias in the cost in EOI, which it's pure cost is
>> already rather small. So the code here calls to the gicvx related
>> function directly to avoid the assert().
>>
>> >
>> >> +     if (v == 2)
>> >> +             write_eoir = gicv2_write_eoir;
>> >> +     else if (v == 3)
>> >> +             write_eoir = gicv3_write_eoir;
>> >
>> > IIUC, you can do it once at init, right?
>>
>> Yes, I can. I guess either way should work.
>
> But in current version you do it billion times in hot path which
> probably affects final results.

thanks, I have moved the initialization piece earlier in v3.

>
>> >> +
>> >> +     c1 = read_cc();
>> >> +     write_eoir(val);
>> >> +     c2 = read_cc();
>> >> +
>> >> +     return COUNT(c1, c2);
>> >> +}
>> >> +
>> >> +struct exit_test {
>> >> +     const char *name;
>> >> +     unsigned long (*test_fn)(void);
>> >> +     bool run;
>> >> +};
>> >> +
>> >> +static struct exit_test tests[] = {
>> >> +     {"hvc",                hvc_test,           true},
>> >> +     {"mmio_read_user",     mmio_read_user,     true},
>> >> +     {"mmio_read_vgic",     mmio_read_vgic,     true},
>> >> +     {"eoi",                eoi_test,           true},
>> >> +     {"ipi",                ipi_test,           true},
>> >> +};
>> >> +
>> >> +static void loop_test(struct exit_test *test)
>> >> +{
>> >> +     unsigned long i, iterations = 32;
>> >> +     unsigned long sample, cycles;
>> >> +     unsigned long long min = 0, max = 0;
>> >> +     const unsigned long long goal = (1ULL << 25);
>> >> +
>> >> +     do {
>> >> +             iterations *= 2;
>> >> +             cycles = 0;
>> >> +             i = 0;
>> >> +             while (i < iterations) {
>> >> +                     sample = test->test_fn();
>> >> +                     if (sample == 0) {
>> >> +                             /*
>> >> +                              * If something went wrong or we had an
>> >> +                              * overflow, don't count that sample.
>> >> +                              */
>> >> +                             printf("cycle count overflow: %lu\n", sample);
>> >
>> > Text doesn't match the comment.
>>
>> I guess you meant that we should say something like the cost is
>> smaller than a timer count?
>
> Yep
>
>> > Sample doesn't contain error code and is always zero. what's the point
>> > to print it?
>> >
>> >> +                             continue;
>> >> +                     }
>> >> +                     cycles += sample;
>> >> +                     if (min == 0 || min > sample)
>> >
>> > Hint: initialize min as ULONGLONG_MAX, and drop this (min == 0) check.
>> >
>> >> +                             min = sample;
>> >> +                     if (max < sample)
>> >> +                             max = sample;
>> >> +                     ++i;
>> >> +             }
>> >> +     } while (cycles < goal);
>> >> +     printf("%s:\t avg %lu\t min %llu\t max %llu\n",
>> >> +             test->name, cycles / iterations, min, max);
>> >> +}
>> >> +
>> >> +int main(int argc, char **argv)
>> >> +{
>> >> +     int i;
>> >> +
>> >> +     assert(test_init());
>> >> +
>> >> +     for (i = 0; i < ARRAY_SIZE(tests); i++) {
>> >> +             if (!tests[i].run)
>> >> +                     continue;
>> >> +             loop_test(&tests[i]);
>> >> +     }
>> >> +
>> >> +     return 0;
>> >> +}
>> >> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
>> >> index 44b98cf..a8fb0b3 100644
>> >> --- a/arm/unittests.cfg
>> >> +++ b/arm/unittests.cfg
>> >> @@ -116,3 +116,10 @@ file = timer.flat
>> >>  groups = timer
>> >>  timeout = 2s
>> >>  arch = arm64
>> >> +
>> >> +# Exit tests
>> >> +[micro-test]
>> >> +file = micro-test.flat
>> >> +smp = 2
>
> Sidenote. It would be also interesting to benchmark broadcast IPIs with
> smp >> 2.
>
>> >> +groups = micro-test
>> >> +accel = kvm
>> >> --
>> >> 1.9.1
>> >>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-01-17 21:50 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-20  1:34 [kvm-unit-tests PATCH V2 0/2] Support micro operation measurement on arm64 Shih-Wei Li
2017-12-20  1:34 ` [kvm-unit-tests PATCH V2 1/2] arm/arm64: add GICD_IIDR definition Shih-Wei Li
2017-12-20  1:34 ` [kvm-unit-tests PATCH V2 2/2] arm64: add micro test Shih-Wei Li
2017-12-20 16:41   ` Andrew Jones
2017-12-20 21:41     ` Shih-Wei Li
2017-12-21  9:21       ` Andrew Jones
2017-12-21 11:28         ` Christoffer Dall
2017-12-21 16:26           ` Shih-Wei Li
2017-12-21 16:12         ` Shih-Wei Li
2017-12-21 16:57           ` Andrew Jones
2017-12-22  6:07   ` Yury Norov
2017-12-22  7:26     ` Shih-Wei Li
2017-12-22 12:44       ` Yury Norov
2018-01-17 21:49         ` Shih-Wei Li

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.