* [PATCH 0/7] arm64: Enable access to pmu registers by user-space
@ 2019-06-11 12:53 Raphael Gault
  2019-06-11 12:53 ` [PATCH 1/7] perf: arm64: Compile tests unconditionally Raphael Gault
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Raphael Gault @ 2019-06-11 12:53 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel
  Cc: mingo, peterz, catalin.marinas, will.deacon, acme, mark.rutland,
	Raphael Gault

The perf user-space tool relies on the PMU to monitor events. It offers an
abstraction layer over the hardware counters since the underlying
implementation is cpu-dependent. We want to allow userspace tools to access
the registers storing the hardware counters' values directly.
This specifically targets self-monitoring tasks, in order to reduce the
overhead of reading the counters by accessing the registers directly
instead of going through the kernel.
In order to do this we need to set up the PMU so that it exposes its
registers to userspace access.

The first patch enables the tests for arm64 architecture in the perf
tool to be compiled systematically.

The second patch adds a test to the perf tool so that we can check that
access to the registers works correctly from userspace.

The third patch adds another test, similar to the previous one, but this
time using rseq as the synchronisation mechanism to ensure data correctness.

The fourth patch focuses on the armv8 pmuv3 PMU support and makes sure that
access to the PMU registers is enabled and that userspace has access to the
relevant information in order to use them.

The fifth patch adds a hook to handle faulting accesses to the PMU
registers. This is necessary in order to have a coherent behaviour in a
big.LITTLE environment.

The sixth patch puts in place callbacks to enable access to the hardware
counters from userspace when a compatible event is opened using the perf
API.

The seventh patch documents how userspace can make use of this new
interface.
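
For context, here is a minimal sketch (not part of the series) of the
self-monitoring pattern this enables, simplified from the user-events test
added by patch 2; the seqlock retry loop, the time scaling and most error
handling are omitted:

#include <linux/perf_event.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Direct counter read; the real test expands one case per counter (0-30). */
static uint64_t read_pmevcntr(uint32_t idx)
{
	uint64_t val = 0;

	switch (idx) {
	case 0: asm volatile("mrs %0, pmevcntr0_el0" : "=r" (val)); break;
	case 1: asm volatile("mrs %0, pmevcntr1_el0" : "=r" (val)); break;
	/* ... and so on up to pmevcntr30_el0 ... */
	}
	return val;
}

/* Read the event value without entering the kernel. */
static uint64_t self_read(struct perf_event_mmap_page *pc)
{
	uint32_t idx = pc->index;	/* 0 means direct access is unavailable */
	uint64_t count = pc->offset;

	if (idx)
		count += read_pmevcntr(idx - 1);
	return count;
}

int main(void)
{
	struct perf_event_attr attr = {
		.type = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_INSTRUCTIONS,
		.exclude_kernel = 1,
	};
	int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	struct perf_event_mmap_page *pc;

	if (fd < 0)
		return 1;
	pc = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
	if (pc == MAP_FAILED)
		return 1;

	uint64_t before = self_read(pc);
	/* ... workload being measured ... */
	uint64_t after = self_read(pc);

	return after >= before ? 0 : 1;
}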

Raphael Gault (7):
  perf: arm64: Compile tests unconditionally
  perf: arm64: Add test to check userspace access to hardware counters.
  perf: arm64: Use rseq to test userspace access to pmu counters
  arm64: pmu: Add function implementation to update event index in
    userpage.
  arm64: pmu: Add hook to handle pmu-related undefined instructions
  arm64: perf: Enable pmu counter direct access for perf event on armv8
  Documentation: arm64: Document PMU counters access from userspace

 .../arm64/pmu_counter_user_access.txt         |  42 +++
 arch/arm64/include/asm/mmu.h                  |   6 +
 arch/arm64/include/asm/mmu_context.h          |   2 +
 arch/arm64/include/asm/perf_event.h           |  14 +
 arch/arm64/kernel/cpufeature.c                |   4 +-
 arch/arm64/kernel/perf_event.c                |  76 ++++++
 drivers/perf/arm_pmu.c                        |  38 +++
 include/linux/perf/arm_pmu.h                  |   2 +
 tools/perf/arch/arm64/Build                   |   2 +-
 tools/perf/arch/arm64/include/arch-tests.h    |   9 +
 tools/perf/arch/arm64/include/rseq-arm64.h    | 220 +++++++++++++++
 tools/perf/arch/arm64/tests/Build             |   4 +-
 tools/perf/arch/arm64/tests/arch-tests.c      |  10 +
 tools/perf/arch/arm64/tests/rseq-pmu-events.c | 219 +++++++++++++++
 tools/perf/arch/arm64/tests/user-events.c     | 255 ++++++++++++++++++
 15 files changed, 899 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/arm64/pmu_counter_user_access.txt
 create mode 100644 tools/perf/arch/arm64/include/rseq-arm64.h
 create mode 100644 tools/perf/arch/arm64/tests/rseq-pmu-events.c
 create mode 100644 tools/perf/arch/arm64/tests/user-events.c

-- 
2.17.1



* [PATCH 1/7] perf: arm64: Compile tests unconditionally
  2019-06-11 12:53 [PATCH 0/7] arm64: Enable access to pmu registers by user-space Raphael Gault
@ 2019-06-11 12:53 ` Raphael Gault
  2019-06-11 14:09   ` Mark Rutland
  2019-06-22  6:34   ` [tip:perf/core] perf tests " tip-bot for Raphael Gault
  2019-06-11 12:53 ` [PATCH 2/7] perf: arm64: Add test to check userspace access to hardware counters Raphael Gault
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 21+ messages in thread
From: Raphael Gault @ 2019-06-11 12:53 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel
  Cc: mingo, peterz, catalin.marinas, will.deacon, acme, mark.rutland,
	Raphael Gault

In order to subsequently add more tests for the arm64 architecture
we compile the tests target for arm64 systematically.

Signed-off-by: Raphael Gault <raphael.gault@arm.com>
---
 tools/perf/arch/arm64/Build       | 2 +-
 tools/perf/arch/arm64/tests/Build | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/arch/arm64/Build b/tools/perf/arch/arm64/Build
index 36222e64bbf7..a7dd46a5b678 100644
--- a/tools/perf/arch/arm64/Build
+++ b/tools/perf/arch/arm64/Build
@@ -1,2 +1,2 @@
 perf-y += util/
-perf-$(CONFIG_DWARF_UNWIND) += tests/
+perf-y += tests/
diff --git a/tools/perf/arch/arm64/tests/Build b/tools/perf/arch/arm64/tests/Build
index 41707fea74b3..a61c06bdb757 100644
--- a/tools/perf/arch/arm64/tests/Build
+++ b/tools/perf/arch/arm64/tests/Build
@@ -1,4 +1,4 @@
 perf-y += regs_load.o
-perf-y += dwarf-unwind.o
+perf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o
 
 perf-y += arch-tests.o
-- 
2.17.1



* [PATCH 2/7] perf: arm64: Add test to check userspace access to hardware counters.
  2019-06-11 12:53 [PATCH 0/7] arm64: Enable access to pmu registers by user-space Raphael Gault
  2019-06-11 12:53 ` [PATCH 1/7] perf: arm64: Compile tests unconditionally Raphael Gault
@ 2019-06-11 12:53 ` Raphael Gault
  2019-06-11 12:53 ` [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters Raphael Gault
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Raphael Gault @ 2019-06-11 12:53 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel
  Cc: mingo, peterz, catalin.marinas, will.deacon, acme, mark.rutland,
	Raphael Gault

This test relies on the fact that the PMU registers are accessible
from userspace. It then uses the perf_event_mmap_page to retrieve
the counter index and access the underlying register.

This test uses sched_setaffinity(2) in order to run on all CPUs and thus
check the behaviour of the PMUs of all CPUs in a big.LITTLE environment.

Signed-off-by: Raphael Gault <raphael.gault@arm.com>
---
 tools/perf/arch/arm64/include/arch-tests.h |   6 +
 tools/perf/arch/arm64/tests/Build          |   1 +
 tools/perf/arch/arm64/tests/arch-tests.c   |   4 +
 tools/perf/arch/arm64/tests/user-events.c  | 255 +++++++++++++++++++++
 4 files changed, 266 insertions(+)
 create mode 100644 tools/perf/arch/arm64/tests/user-events.c

diff --git a/tools/perf/arch/arm64/include/arch-tests.h b/tools/perf/arch/arm64/include/arch-tests.h
index 90ec4c8cb880..a9b17ae0560b 100644
--- a/tools/perf/arch/arm64/include/arch-tests.h
+++ b/tools/perf/arch/arm64/include/arch-tests.h
@@ -2,11 +2,17 @@
 #ifndef ARCH_TESTS_H
 #define ARCH_TESTS_H
 
+#define __maybe_unused	__attribute__((unused))
 #ifdef HAVE_DWARF_UNWIND_SUPPORT
 struct thread;
 struct perf_sample;
+int test__arch_unwind_sample(struct perf_sample *sample,
+			     struct thread *thread);
 #endif
 
 extern struct test arch_tests[];
+int test__rd_pmevcntr(struct test *test __maybe_unused,
+		      int subtest __maybe_unused);
+
 
 #endif
diff --git a/tools/perf/arch/arm64/tests/Build b/tools/perf/arch/arm64/tests/Build
index a61c06bdb757..3f9a20c17fc6 100644
--- a/tools/perf/arch/arm64/tests/Build
+++ b/tools/perf/arch/arm64/tests/Build
@@ -1,4 +1,5 @@
 perf-y += regs_load.o
 perf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o
 
+perf-y += user-events.o
 perf-y += arch-tests.o
diff --git a/tools/perf/arch/arm64/tests/arch-tests.c b/tools/perf/arch/arm64/tests/arch-tests.c
index 5b1543c98022..57df9b89dede 100644
--- a/tools/perf/arch/arm64/tests/arch-tests.c
+++ b/tools/perf/arch/arm64/tests/arch-tests.c
@@ -10,6 +10,10 @@ struct test arch_tests[] = {
 		.func = test__dwarf_unwind,
 	},
 #endif
+	{
+		.desc = "User counter access",
+		.func = test__rd_pmevcntr,
+	},
 	{
 		.func = NULL,
 	},
diff --git a/tools/perf/arch/arm64/tests/user-events.c b/tools/perf/arch/arm64/tests/user-events.c
new file mode 100644
index 000000000000..958e4cd000c1
--- /dev/null
+++ b/tools/perf/arch/arm64/tests/user-events.c
@@ -0,0 +1,255 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <asm/bug.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sched.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sys/mman.h>
+#include <sys/sysinfo.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <linux/types.h>
+#include "perf.h"
+#include "debug.h"
+#include "tests/tests.h"
+#include "cloexec.h"
+#include "util.h"
+#include "arch-tests.h"
+
+/*
+ * ARMv8 ARM reserves the following encoding for system registers:
+ * (Ref: ARMv8 ARM, Section: "System instruction class encoding overview",
+ *  C5.2, version:ARM DDI 0487A.f)
+ *      [20-19] : Op0
+ *      [18-16] : Op1
+ *      [15-12] : CRn
+ *      [11-8]  : CRm
+ *      [7-5]   : Op2
+ */
+#define Op0_shift       19
+#define Op0_mask        0x3
+#define Op1_shift       16
+#define Op1_mask        0x7
+#define CRn_shift       12
+#define CRn_mask        0xf
+#define CRm_shift       8
+#define CRm_mask        0xf
+#define Op2_shift       5
+#define Op2_mask        0x7
+
+#define __stringify(x)	#x
+
+#define read_sysreg(r) ({						\
+	u64 __val;							\
+	asm volatile("mrs %0, " __stringify(r) : "=r" (__val));		\
+	__val;								\
+})
+
+#define PMEVCNTR_READ_CASE(idx)					\
+	case idx:						\
+		return read_sysreg(pmevcntr##idx##_el0)
+
+#define PMEVCNTR_CASES(readwrite)		\
+	PMEVCNTR_READ_CASE(0);			\
+	PMEVCNTR_READ_CASE(1);			\
+	PMEVCNTR_READ_CASE(2);			\
+	PMEVCNTR_READ_CASE(3);			\
+	PMEVCNTR_READ_CASE(4);			\
+	PMEVCNTR_READ_CASE(5);			\
+	PMEVCNTR_READ_CASE(6);			\
+	PMEVCNTR_READ_CASE(7);			\
+	PMEVCNTR_READ_CASE(8);			\
+	PMEVCNTR_READ_CASE(9);			\
+	PMEVCNTR_READ_CASE(10);			\
+	PMEVCNTR_READ_CASE(11);			\
+	PMEVCNTR_READ_CASE(12);			\
+	PMEVCNTR_READ_CASE(13);			\
+	PMEVCNTR_READ_CASE(14);			\
+	PMEVCNTR_READ_CASE(15);			\
+	PMEVCNTR_READ_CASE(16);			\
+	PMEVCNTR_READ_CASE(17);			\
+	PMEVCNTR_READ_CASE(18);			\
+	PMEVCNTR_READ_CASE(19);			\
+	PMEVCNTR_READ_CASE(20);			\
+	PMEVCNTR_READ_CASE(21);			\
+	PMEVCNTR_READ_CASE(22);			\
+	PMEVCNTR_READ_CASE(23);			\
+	PMEVCNTR_READ_CASE(24);			\
+	PMEVCNTR_READ_CASE(25);			\
+	PMEVCNTR_READ_CASE(26);			\
+	PMEVCNTR_READ_CASE(27);			\
+	PMEVCNTR_READ_CASE(28);			\
+	PMEVCNTR_READ_CASE(29);			\
+	PMEVCNTR_READ_CASE(30)
+
+/*
+ * Read a value direct from PMEVCNTR<idx>
+ */
+static u64 read_evcnt_direct(int idx)
+{
+	switch (idx) {
+	PMEVCNTR_CASES(READ);
+	default:
+		WARN_ON(1);
+	}
+
+	return 0;
+}
+
+static u64 mmap_read_self(void *addr)
+{
+	struct perf_event_mmap_page *pc = addr;
+	u32 seq, idx, time_mult = 0, time_shift = 0;
+	u64 count, cyc = 0, time_offset = 0, enabled, running, delta;
+
+	do {
+		seq = READ_ONCE(pc->lock);
+		barrier();
+
+		enabled = READ_ONCE(pc->time_enabled);
+		running = READ_ONCE(pc->time_running);
+
+		if (enabled != running) {
+			cyc = read_sysreg(cntvct_el0);
+			time_mult = READ_ONCE(pc->time_mult);
+			time_shift = READ_ONCE(pc->time_shift);
+			time_offset = READ_ONCE(pc->time_offset);
+		}
+
+		idx = READ_ONCE(pc->index);
+		count = READ_ONCE(pc->offset);
+		if (idx)
+			count += read_evcnt_direct(idx - 1);
+
+		barrier();
+	} while (READ_ONCE(pc->lock) != seq);
+
+	if (enabled != running) {
+		u64 quot, rem;
+
+		quot = (cyc >> time_shift);
+		rem = cyc & (((u64)1 << time_shift) - 1);
+		delta = time_offset + quot * time_mult +
+			((rem * time_mult) >> time_shift);
+
+		enabled += delta;
+		if (idx)
+			running += delta;
+
+		quot = count / running;
+		rem = count % running;
+		count = quot * enabled + (rem * enabled) / running;
+	}
+
+	return count;
+}
+
+static int __test__rd_pmevcntr(void)
+{
+	volatile int tmp = 0;
+	u64 i, loops = 1000;
+	int n;
+	int fd;
+	void *addr;
+	struct perf_event_attr attr = {
+		.type = PERF_TYPE_HARDWARE,
+		.config = PERF_COUNT_HW_INSTRUCTIONS,
+		.exclude_kernel = 1,
+	};
+	u64 delta_sum = 0;
+	char sbuf[STRERR_BUFSIZE];
+
+	fd = sys_perf_event_open(&attr, 0, -1, -1,
+				 perf_event_open_cloexec_flag());
+	if (fd < 0) {
+		pr_err("Error: sys_perf_event_open() syscall returned "
+		       "with %d (%s)\n", fd,
+		       str_error_r(errno, sbuf, sizeof(sbuf)));
+		return -1;
+	}
+
+	addr = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+	if (addr == (void *)(-1)) {
+		pr_err("Error: mmap() syscall returned with (%s)\n",
+		       str_error_r(errno, sbuf, sizeof(sbuf)));
+		goto out_close;
+	}
+
+	for (n = 0; n < 6; n++) {
+		u64 stamp, now, delta;
+
+		stamp = mmap_read_self(addr);
+
+		for (i = 0; i < loops; i++)
+			tmp++;
+
+		now = mmap_read_self(addr);
+		loops *= 10;
+
+		delta = now - stamp;
+		pr_debug("%14d: %14llu\n", n, (long long)delta);
+
+		delta_sum += delta;
+	}
+
+	munmap(addr, page_size);
+	pr_debug("   ");
+
+out_close:
+	close(fd);
+
+	if (!delta_sum)
+		return -1;
+
+	return 0;
+}
+
+int test__rd_pmevcntr(struct test __maybe_unused *test,
+		      int __maybe_unused subtest)
+{
+	int status = 0;
+	int wret = 0;
+	int ret = 0;
+	int pid;
+	int cpu;
+	cpu_set_t cpu_set;
+
+	pid = fork();
+	if (pid < 0)
+		return -1;
+
+	if (!pid) {
+		for (cpu = 0; cpu < get_nprocs(); cpu++) {
+			pr_info("setting affinity to cpu: %d\n", cpu);
+			CPU_ZERO(&cpu_set);
+			CPU_SET(cpu, &cpu_set);
+			if (sched_setaffinity(getpid(),
+					      sizeof(cpu_set),
+					      &cpu_set) == -1) {
+				pr_err("Error: impossible to set cpu (%d) affinity\n",
+				       cpu);
+				continue;
+			}
+			ret = __test__rd_pmevcntr();
+		}
+		exit(ret);
+	}
+
+	wret = waitpid(pid, &status, 0);
+	if (wret < 0)
+		return -1;
+
+	if (WIFSIGNALED(status)) {
+		pr_err("Error: the child process was interrupted by a signal\n");
+		return -1;
+	}
+
+	if (WIFEXITED(status) && WEXITSTATUS(status)) {
+		pr_err("Error: the child process exited with: %d\n",
+		       WEXITSTATUS(status));
+		return -1;
+	}
+
+	return 0;
+}
-- 
2.17.1



* [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters
  2019-06-11 12:53 [PATCH 0/7] arm64: Enable access to pmu registers by user-space Raphael Gault
  2019-06-11 12:53 ` [PATCH 1/7] perf: arm64: Compile tests unconditionally Raphael Gault
  2019-06-11 12:53 ` [PATCH 2/7] perf: arm64: Add test to check userspace access to hardware counters Raphael Gault
@ 2019-06-11 12:53 ` Raphael Gault
  2019-06-11 14:33   ` Arnaldo Carvalho de Melo
  2019-06-11 12:53 ` [PATCH 4/7] arm64: pmu: Add function implementation to update event index in userpage Raphael Gault
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Raphael Gault @ 2019-06-11 12:53 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel
  Cc: mingo, peterz, catalin.marinas, will.deacon, acme, mark.rutland,
	Raphael Gault

Add an extra test to check userspace access to PMU hardware counters.
This test doesn't rely on the seqlock as a synchronisation mechanism but
instead uses restartable sequences (rseq) to make sure that the thread is
not interrupted when reading the index of the counter and the associated
PMU register.

In addition to reading the PMU counters, this test is run several times
in order to measure the ratio of failures:
I ran this test on the Juno development platform, which is big.LITTLE
with 4 Cortex A53 and 2 Cortex A57. The results vary quite a lot
(running it with 100 tests is not so long and I did it several times).
I ran it once with 10000 iterations:
`runs: 10000, abort: 62.53%, zero: 34.93%, success: 2.54%`
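
For reference, the operations performed inside the rseq critical section
of mmap_read_self() below boil down to the following, written here as a
plain-C sketch (illustration only, not part of the patch) without the
abort protection that the rseq machinery provides:

#include <stdint.h>

/*
 * If the task is migrated or preempted anywhere in this sequence, the
 * index read from the user page may no longer match the counter that is
 * read; the real test performs these steps in inline assembly registered
 * with rseq so that the kernel aborts the whole sequence instead.
 */
static uint64_t read_counter_unprotected(volatile uint32_t *index_ptr)
{
	uint32_t idx = *index_ptr & 0xff;	/* index from perf_event_mmap_page */
	uint64_t count;

	/* Select PMEVCNTR<idx - 1> and read it through PMXEVCNTR_EL0. */
	asm volatile("msr pmselr_el0, %0" : : "r" ((uint64_t)(idx - 1)));
	asm volatile("isb");
	asm volatile("mrs %0, pmxevcntr_el0" : "=r" (count));

	return count;
}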

Signed-off-by: Raphael Gault <raphael.gault@arm.com>
---
 tools/perf/arch/arm64/include/arch-tests.h    |   5 +-
 tools/perf/arch/arm64/include/rseq-arm64.h    | 220 ++++++++++++++++++
 tools/perf/arch/arm64/tests/Build             |   1 +
 tools/perf/arch/arm64/tests/arch-tests.c      |   6 +
 tools/perf/arch/arm64/tests/rseq-pmu-events.c | 219 +++++++++++++++++
 5 files changed, 450 insertions(+), 1 deletion(-)
 create mode 100644 tools/perf/arch/arm64/include/rseq-arm64.h
 create mode 100644 tools/perf/arch/arm64/tests/rseq-pmu-events.c

diff --git a/tools/perf/arch/arm64/include/arch-tests.h b/tools/perf/arch/arm64/include/arch-tests.h
index a9b17ae0560b..4164762b43c6 100644
--- a/tools/perf/arch/arm64/include/arch-tests.h
+++ b/tools/perf/arch/arm64/include/arch-tests.h
@@ -13,6 +13,9 @@ int test__arch_unwind_sample(struct perf_sample *sample,
 extern struct test arch_tests[];
 int test__rd_pmevcntr(struct test *test __maybe_unused,
 		      int subtest __maybe_unused);
-
+#ifdef CONFIG_RSEQ
+int rseq__rd_pmevcntr(struct test *test __maybe_unused,
+		      int subtest __maybe_unused);
+#endif
 
 #endif
diff --git a/tools/perf/arch/arm64/include/rseq-arm64.h b/tools/perf/arch/arm64/include/rseq-arm64.h
new file mode 100644
index 000000000000..00d6960915a9
--- /dev/null
+++ b/tools/perf/arch/arm64/include/rseq-arm64.h
@@ -0,0 +1,220 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * rseq-arm64.h
+ *
+ * This file is mostly a copy from
+ * tools/testing/selftests/rseq/rseq-arm64.h
+ */
+
+/*
+ * aarch64 -mbig-endian generates mixed endianness code vs data:
+ * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
+ * matches code endianness.
+ */
+#define __rseq_str_1(x)  #x
+#define __rseq_str(x)            __rseq_str_1(x)
+
+#define RSEQ_ACCESS_ONCE(x)	(*(__volatile__  __typeof__(x) *)&(x))
+#define RSEQ_SIG_CODE	0xd428bc00	/* BRK #0x45E0.  */
+
+#ifdef __AARCH64EB__
+#define RSEQ_SIG_DATA	0x00bc28d4	/* BRK #0x45E0.  */
+#else
+#define RSEQ_SIG_DATA	RSEQ_SIG_CODE
+#endif
+
+#define RSEQ_SIG	RSEQ_SIG_DATA
+
+#define rseq_smp_mb()	__asm__ __volatile__ ("dmb ish" ::: "memory")
+#define rseq_smp_rmb()	__asm__ __volatile__ ("dmb ishld" ::: "memory")
+#define rseq_smp_wmb()	__asm__ __volatile__ ("dmb ishst" ::: "memory")
+
+#define rseq_smp_load_acquire(p)						\
+__extension__ ({								\
+	__typeof(*p) ____p1;							\
+	switch (sizeof(*p)) {							\
+	case 1:									\
+		asm volatile ("ldarb %w0, %1"					\
+			: "=r" (*(__u8 *)p)					\
+			: "Q" (*p) : "memory");					\
+		break;								\
+	case 2:									\
+		asm volatile ("ldarh %w0, %1"					\
+			: "=r" (*(__u16 *)p)					\
+			: "Q" (*p) : "memory");					\
+		break;								\
+	case 4:									\
+		asm volatile ("ldar %w0, %1"					\
+			: "=r" (*(__u32 *)p)					\
+			: "Q" (*p) : "memory");					\
+		break;								\
+	case 8:									\
+		asm volatile ("ldar %0, %1"					\
+			: "=r" (*(__u64 *)p)					\
+			: "Q" (*p) : "memory");					\
+		break;								\
+	}									\
+	____p1;									\
+})
+
+#define rseq_smp_acquire__after_ctrl_dep()	rseq_smp_rmb()
+
+#define rseq_smp_store_release(p, v)						\
+do {										\
+	switch (sizeof(*p)) {							\
+	case 1:									\
+		asm volatile ("stlrb %w1, %0"					\
+				: "=Q" (*p)					\
+				: "r" ((__u8)v)					\
+				: "memory");					\
+		break;								\
+	case 2:									\
+		asm volatile ("stlrh %w1, %0"					\
+				: "=Q" (*p)					\
+				: "r" ((__u16)v)				\
+				: "memory");					\
+		break;								\
+	case 4:									\
+		asm volatile ("stlr %w1, %0"					\
+				: "=Q" (*p)					\
+				: "r" ((__u32)v)				\
+				: "memory");					\
+		break;								\
+	case 8:									\
+		asm volatile ("stlr %1, %0"					\
+				: "=Q" (*p)					\
+				: "r" ((__u64)v)				\
+				: "memory");					\
+		break;								\
+	}									\
+} while (0)
+
+#ifdef RSEQ_SKIP_FASTPATH
+#include "rseq-skip.h"
+#else /* !RSEQ_SKIP_FASTPATH */
+
+#define RSEQ_ASM_TMP_REG32	"w15"
+#define RSEQ_ASM_TMP_REG	"x15"
+#define RSEQ_ASM_TMP_REG_2	"x14"
+
+#define INJECT_ASM_REG	RSEQ_ASM_TMP_REG32
+
+#define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, start_ip,		\
+				post_commit_offset, abort_ip)			\
+	"	.pushsection	__rseq_cs, \"aw\"\n"				\
+	"	.balign	32\n"							\
+	__rseq_str(label) ":\n"							\
+	"	.long	" __rseq_str(version) ", " __rseq_str(flags) "\n"	\
+	"	.quad	" __rseq_str(start_ip) ", "				\
+			  __rseq_str(post_commit_offset) ", "			\
+			  __rseq_str(abort_ip) "\n"				\
+	"	.popsection\n\t"						\
+	"	.pushsection __rseq_cs_ptr_array, \"aw\"\n"				\
+	"	.quad " __rseq_str(label) "b\n"					\
+	"	.popsection\n"
+
+#define RSEQ_ASM_DEFINE_TABLE(label, start_ip, post_commit_ip, abort_ip)	\
+	__RSEQ_ASM_DEFINE_TABLE(label, 0x0, 0x0, start_ip,			\
+				(post_commit_ip - start_ip), abort_ip)
+
+/*
+ * Exit points of a rseq critical section consist of all instructions outside
+ * of the critical section where a critical section can either branch to or
+ * reach through the normal course of its execution. The abort IP and the
+ * post-commit IP are already part of the __rseq_cs section and should not be
+ * explicitly defined as additional exit points. Knowing all exit points is
+ * useful to assist debuggers stepping over the critical section.
+ */
+#define RSEQ_ASM_DEFINE_EXIT_POINT(start_ip, exit_ip)				\
+	"	.pushsection __rseq_exit_point_array, \"aw\"\n"			\
+	"	.quad " __rseq_str(start_ip) ", " __rseq_str(exit_ip) "\n"	\
+	"	.popsection\n"
+
+#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs)			\
+	"	adrp	" RSEQ_ASM_TMP_REG ", " __rseq_str(cs_label) "\n"	\
+	"	add	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
+			", :lo12:" __rseq_str(cs_label) "\n"			\
+	"	str	" RSEQ_ASM_TMP_REG ", %[" __rseq_str(rseq_cs) "]\n"	\
+	__rseq_str(label) ":\n"
+
+#define RSEQ_ASM_DEFINE_ABORT(label, abort_label)				\
+	"	b	222f\n"							\
+	"	.inst 	"	__rseq_str(RSEQ_SIG_CODE) "\n"			\
+	__rseq_str(label) ":\n"							\
+	"	b	%l[" __rseq_str(abort_label) "]\n"			\
+	"222:\n"
+
+#define RSEQ_ASM_OP_STORE(value, var)						\
+	"	str	%[" __rseq_str(value) "], %[" __rseq_str(var) "]\n"
+
+#define RSEQ_ASM_OP_STORE_RELEASE(value, var)					\
+	"	stlr	%[" __rseq_str(value) "], %[" __rseq_str(var) "]\n"
+
+#define RSEQ_ASM_OP_FINAL_STORE(value, var, post_commit_label)			\
+	RSEQ_ASM_OP_STORE(value, var)						\
+	__rseq_str(post_commit_label) ":\n"
+
+#define RSEQ_ASM_OP_FINAL_STORE_RELEASE(value, var, post_commit_label)		\
+	RSEQ_ASM_OP_STORE_RELEASE(value, var)					\
+	__rseq_str(post_commit_label) ":\n"
+
+#define RSEQ_ASM_OP_CMPEQ(var, expect, label)					\
+	"	ldr	" RSEQ_ASM_TMP_REG ", %[" __rseq_str(var) "]\n"		\
+	"	sub	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
+			", %[" __rseq_str(expect) "]\n"				\
+	"	cbnz	" RSEQ_ASM_TMP_REG ", " __rseq_str(label) "\n"
+
+#define RSEQ_ASM_OP_CMPEQ32(var, expect, label)					\
+	"	ldr	" RSEQ_ASM_TMP_REG32 ", %[" __rseq_str(var) "]\n"	\
+	"	sub	" RSEQ_ASM_TMP_REG32 ", " RSEQ_ASM_TMP_REG32		\
+			", %w[" __rseq_str(expect) "]\n"			\
+	"	cbnz	" RSEQ_ASM_TMP_REG32 ", " __rseq_str(label) "\n"
+
+#define RSEQ_ASM_OP_CMPNE(var, expect, label)					\
+	"	ldr	" RSEQ_ASM_TMP_REG ", %[" __rseq_str(var) "]\n"		\
+	"	sub	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
+			", %[" __rseq_str(expect) "]\n"				\
+	"	cbz	" RSEQ_ASM_TMP_REG ", " __rseq_str(label) "\n"
+
+#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label)			\
+	RSEQ_ASM_OP_CMPEQ32(current_cpu_id, cpu_id, label)
+
+#define RSEQ_ASM_OP_R_LOAD(var)							\
+	"	ldr	" RSEQ_ASM_TMP_REG ", [%[" __rseq_str(var) "]], #0\n"
+
+#define RSEQ_ASM_OP_R_STORE(var)						\
+	"	str	" RSEQ_ASM_TMP_REG ", %[" __rseq_str(var) "]\n"
+
+#define RSEQ_ASM_OP_R_LOAD_OFF(offset)						\
+	"	ldr	" RSEQ_ASM_TMP_REG ", [" RSEQ_ASM_TMP_REG		\
+			", %[" __rseq_str(offset) "]]\n"
+
+#define RSEQ_ASM_OP_R_ADD(count)						\
+	"	add	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
+			", %[" __rseq_str(count) "]\n"
+
+#define RSEQ_ASM_OP_R_SUB(imm)						\
+	"	sub	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
+			", #" __rseq_str(imm) "\n"
+
+#define RSEQ_ASM_OP_R_AND(mask)							\
+	"	and	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
+			", #" __rseq_str(mask) "\n"
+
+#define RSEQ_ASM_OP_R_FINAL_STORE(var, post_commit_label)			\
+	"	str	" RSEQ_ASM_TMP_REG ", %[" __rseq_str(var) "]\n"		\
+	__rseq_str(post_commit_label) ":\n"
+
+#define RSEQ_ASM_OP_R_BAD_MEMCPY(dst, src, len)					\
+	"	cbz	%[" __rseq_str(len) "], 333f\n"				\
+	"	mov	" RSEQ_ASM_TMP_REG_2 ", %[" __rseq_str(len) "]\n"	\
+	"222:	sub	" RSEQ_ASM_TMP_REG_2 ", " RSEQ_ASM_TMP_REG_2 ", #1\n"	\
+	"	ldrb	" RSEQ_ASM_TMP_REG32 ", [%[" __rseq_str(src) "]"	\
+			", " RSEQ_ASM_TMP_REG_2 "]\n"				\
+	"	strb	" RSEQ_ASM_TMP_REG32 ", [%[" __rseq_str(dst) "]"	\
+			", " RSEQ_ASM_TMP_REG_2 "]\n"				\
+	"	cbnz	" RSEQ_ASM_TMP_REG_2 ", 222b\n"				\
+	"333:\n"
+
+
+#endif /* !RSEQ_SKIP_FASTPATH */
diff --git a/tools/perf/arch/arm64/tests/Build b/tools/perf/arch/arm64/tests/Build
index 3f9a20c17fc6..b53823603066 100644
--- a/tools/perf/arch/arm64/tests/Build
+++ b/tools/perf/arch/arm64/tests/Build
@@ -1,5 +1,6 @@
 perf-y += regs_load.o
 perf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o
+perf-$(CONFIG_RSEQ) += rseq-pmu-events.o
 
 perf-y += user-events.o
 perf-y += arch-tests.o
diff --git a/tools/perf/arch/arm64/tests/arch-tests.c b/tools/perf/arch/arm64/tests/arch-tests.c
index 57df9b89dede..5f08bdc41d89 100644
--- a/tools/perf/arch/arm64/tests/arch-tests.c
+++ b/tools/perf/arch/arm64/tests/arch-tests.c
@@ -14,6 +14,12 @@ struct test arch_tests[] = {
 		.desc = "User counter access",
 		.func = test__rd_pmevcntr,
 	},
+#ifdef CONFIG_RSEQ
+	{
+		.desc = "User counter access with rseq",
+		.func = rseq__rd_pmevcntr,
+	},
+#endif
 	{
 		.func = NULL,
 	},
diff --git a/tools/perf/arch/arm64/tests/rseq-pmu-events.c b/tools/perf/arch/arm64/tests/rseq-pmu-events.c
new file mode 100644
index 000000000000..4d9578d7e056
--- /dev/null
+++ b/tools/perf/arch/arm64/tests/rseq-pmu-events.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <asm/bug.h>
+#include <linux/rseq.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sched.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sys/mman.h>
+#include <sys/sysinfo.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <linux/types.h>
+#include "perf.h"
+#include "debug.h"
+#include "tests/tests.h"
+#include "cloexec.h"
+#include "util.h"
+#include "arch-tests.h"
+#include "rseq-arm64.h"
+
+static __thread volatile struct rseq __rseq_abi;
+
+static int sys_rseq(volatile struct rseq *rseq_abi, uint32_t rseq_len,
+		    int flags, uint32_t sig)
+{
+	return syscall(__NR_rseq, rseq_abi, rseq_len, flags, sig);
+}
+
+static int rseq_register_current_thread(void)
+{
+	return sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
+}
+
+static int rseq_unregister_current_thread(void)
+{
+	return sys_rseq(&__rseq_abi, sizeof(struct rseq),
+			RSEQ_FLAG_UNREGISTER, RSEQ_SIG);
+}
+
+static u64 noinline mmap_read_self(void *addr, int cpu)
+{
+	struct perf_event_mmap_page *pc = addr;
+	u32 idx = 0;
+	u64 count = 0;
+
+	asm volatile goto(
+                     RSEQ_ASM_DEFINE_TABLE(0, 1f, 2f, 3f)
+		     "nop\n"
+                     RSEQ_ASM_STORE_RSEQ_CS(1, 0b, rseq_cs)
+		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
+                     RSEQ_ASM_OP_R_LOAD(pc_idx)
+                     RSEQ_ASM_OP_R_AND(0xFF)
+		     RSEQ_ASM_OP_R_STORE(idx)
+                     RSEQ_ASM_OP_R_SUB(0x1)
+		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
+                     "msr pmselr_el0, " RSEQ_ASM_TMP_REG "\n"
+                     "isb\n"
+		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
+                     "mrs " RSEQ_ASM_TMP_REG ", pmxevcntr_el0\n"
+                     RSEQ_ASM_OP_R_FINAL_STORE(cnt, 2)
+		     "nop\n"
+                     RSEQ_ASM_DEFINE_ABORT(3, abort)
+                     :/* No output operands */
+		     :  [cpu_id] "r" (cpu),
+			[current_cpu_id] "Qo" (__rseq_abi.cpu_id),
+			[rseq_cs] "m" (__rseq_abi.rseq_cs),
+			[cnt] "m" (count),
+			[pc_idx] "r" (&pc->index),
+			[idx] "m" (idx)
+                     :"memory"
+                     :abort
+                    );
+
+	if (idx)
+		count += READ_ONCE(pc->offset);
+
+	return count;
+abort:
+        pr_debug("Abort handler\n");
+        exit(-2);
+}
+
+static int __test__rd_pmevcntr(void)
+{
+	volatile int tmp = 0;
+	u64 i, loops = 1000;
+	int n;
+	int fd;
+	int cpu;
+	void *addr;
+	struct perf_event_attr attr = {
+		.type = PERF_TYPE_HARDWARE,
+		.config = PERF_COUNT_HW_INSTRUCTIONS,
+		.exclude_kernel = 1,
+	};
+	u64 delta_sum = 0;
+	char sbuf[STRERR_BUFSIZE];
+
+	fd = rseq_register_current_thread();
+	if (fd) {
+		pr_err("Error: unable to register current thread "
+		       "return value: %d (%s)\n", fd,
+		       str_error_r(errno, sbuf, sizeof(sbuf)));
+		return -1;
+	}
+	cpu = RSEQ_ACCESS_ONCE(__rseq_abi.cpu_id_start);
+
+	pr_debug("cpu: %d\n", cpu);
+	fd = sys_perf_event_open(&attr, 0, -1, -1,
+				 perf_event_open_cloexec_flag());
+	if (fd < 0) {
+		pr_err("Error: sys_perf_event_open() syscall returned "
+		       "with %d (%s)\n", fd,
+		       str_error_r(errno, sbuf, sizeof(sbuf)));
+		return -1;
+	}
+
+	addr = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+	if (addr == (void *)(-1)) {
+		pr_err("Error: mmap() syscall returned with (%s)\n",
+		       str_error_r(errno, sbuf, sizeof(sbuf)));
+		goto out_close;
+	}
+
+	for (n = 0; n < 6; n++) {
+		u64 stamp, now, delta;
+
+		stamp = mmap_read_self(addr, cpu);
+
+		for (i = 0; i < loops; i++)
+			tmp++;
+
+		now = mmap_read_self(addr, cpu);
+		loops *= 10;
+
+		delta = now - stamp;
+		pr_debug("%14d: %14llu\n", n, (long long)delta);
+
+		delta_sum += delta;
+	}
+
+	munmap(addr, page_size);
+
+	rseq_unregister_current_thread();
+	pr_debug("   ");
+
+out_close:
+	close(fd);
+
+	if (!delta_sum)
+		return -1;
+
+	return 0;
+}
+
+int rseq__rd_pmevcntr(struct test __maybe_unused *test,
+		      int __maybe_unused subtest)
+{
+	int status = 0;
+	int wret = 0;
+	int ret = 0;
+	int i = 0;
+	int pid;
+	int nb_run = 100;
+	unsigned int run = 0;
+	unsigned int aborted = 0, zero = 0, success = 0;
+
+	for (i = 0; i < nb_run; i++) {
+		pid = fork();
+		if (pid < 0)
+			return -1;
+
+		if (!pid) {
+			ret = __test__rd_pmevcntr();
+			exit(ret);
+		}
+
+		wret = waitpid(pid, &status, 0);
+		if (wret < 0)
+			return -1;
+
+		if (WIFSIGNALED(status)) {
+			pr_err("Error: the child process was interrupted by a signal: %d\n", WTERMSIG(status));
+			return -1;
+		}
+
+		if (WIFEXITED(status) && WEXITSTATUS(status)) {
+			pr_err("Error: the child process exited with: %d\n",
+			       WEXITSTATUS(status));
+			/*
+			return -1;
+			*/
+			switch (WEXITSTATUS(status)) {
+			case 0:
+				success++;
+				break;
+			case 255:
+				zero++;
+				break;
+			case 254:
+				aborted++;
+				break;
+			default:
+				break;
+			}
+		}
+		else
+			success++;
+	}
+
+	run = aborted + zero + success;
+
+	pr_info("runs: %u, abort: %.2f%%, zero: %.2f%%, success: %.2f%%\n",
+		run, (aborted / (float)run) * 100,
+		(zero / (float)run) * 100,
+		(success / (float)run) * 100);
+	return 0;
+}
-- 
2.17.1



* [PATCH 4/7] arm64: pmu: Add function implementation to update event index in userpage.
  2019-06-11 12:53 [PATCH 0/7] arm64: Enable access to pmu registers by user-space Raphael Gault
                   ` (2 preceding siblings ...)
  2019-06-11 12:53 ` [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters Raphael Gault
@ 2019-06-11 12:53 ` Raphael Gault
  2019-06-11 12:53 ` [PATCH 5/7] arm64: pmu: Add hook to handle pmu-related undefined instructions Raphael Gault
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Raphael Gault @ 2019-06-11 12:53 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel
  Cc: mingo, peterz, catalin.marinas, will.deacon, acme, mark.rutland,
	Raphael Gault

In order to be able to access the counters directly from userspace,
we need to provide the index of the counter using the userpage.
We thus need to override the event_idx function to retrieve and
convert the perf_event index to the armv8 hardware index.

Since the arm_pmu driver can be used by any implementation, even
if not armv8, two components play a role in making sure the
behaviour is correct and consistent with the PMU capabilities:

* the ARMPMU_EL0_RD_CNTR flag which denotes the capability to access
counters from userspace.
* the event_idx callback, which is implemented and initialized by
the PMU implementation: if no callback is provided, the default
behaviour applies, returning 0 as the index value.
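
For illustration (this is not part of the patch), a userspace reader
would then presumably dispatch on the index exposed through the user page
along these lines, where read_pmevcntr() is a hypothetical helper such as
the read_evcnt_direct() switch used in the user-events test:

#include <stdint.h>

uint64_t read_pmevcntr(uint32_t n);	/* hypothetical: mrs pmevcntr<n>_el0 */

/*
 * Sketch of the index convention described above: 0 means direct access
 * is not granted, 1..31 map to PMEVCNTR<index - 1>_EL0, and 32 is the
 * remapped cycle counter, presumably read from PMCCNTR_EL0.
 */
static uint64_t read_user_counter(uint32_t index)
{
	uint64_t val = 0;

	if (!index)
		return 0;			/* no direct access for this event */

	if (index == 32)			/* remapped cycle counter */
		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
	else
		val = read_pmevcntr(index - 1);	/* PMEVCNTR<index - 1>_EL0 */

	return val;
}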

Signed-off-by: Raphael Gault <raphael.gault@arm.com>
---
 arch/arm64/kernel/perf_event.c | 21 +++++++++++++++++++++
 include/linux/perf/arm_pmu.h   |  2 ++
 2 files changed, 23 insertions(+)

diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 348d12eec566..293e4c365a53 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -829,6 +829,22 @@ static void armv8pmu_clear_event_idx(struct pmu_hw_events *cpuc,
 		clear_bit(idx - 1, cpuc->used_mask);
 }
 
+static int armv8pmu_access_event_idx(struct perf_event *event)
+{
+	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
+		return 0;
+
+	/*
+	 * We remap the cycle counter index to 32 to
+	 * match the offset applied to the rest of
+	 * the counter indices.
+	 */
+	if (event->hw.idx == ARMV8_IDX_CYCLE_COUNTER)
+		return 32;
+
+	return event->hw.idx;
+}
+
 /*
  * Add an event filter to a given event.
  */
@@ -922,6 +938,8 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
 	if (armv8pmu_event_is_64bit(event))
 		event->hw.flags |= ARMPMU_EVT_64BIT;
 
+	event->hw.flags |= ARMPMU_EL0_RD_CNTR;
+
 	/* Only expose micro/arch events supported by this PMU */
 	if ((hw_event_id > 0) && (hw_event_id < ARMV8_PMUV3_MAX_COMMON_EVENTS)
 	    && test_bit(hw_event_id, armpmu->pmceid_bitmap)) {
@@ -1042,6 +1060,8 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu)
 	cpu_pmu->set_event_filter	= armv8pmu_set_event_filter;
 	cpu_pmu->filter_match		= armv8pmu_filter_match;
 
+	cpu_pmu->pmu.event_idx		= armv8pmu_access_event_idx;
+
 	return 0;
 }
 
@@ -1220,6 +1240,7 @@ void arch_perf_update_userpage(struct perf_event *event,
 	 */
 	freq = arch_timer_get_rate();
 	userpg->cap_user_time = 1;
+	userpg->cap_user_rdpmc = !!(event->hw.flags & ARMPMU_EL0_RD_CNTR);
 
 	clocks_calc_mult_shift(&userpg->time_mult, &shift, freq,
 			NSEC_PER_SEC, 0);
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 4641e850b204..3bef390c1069 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -30,6 +30,8 @@
  */
 /* Event uses a 64bit counter */
 #define ARMPMU_EVT_64BIT		1
+/* Allow access to hardware counter from userspace */
+#define ARMPMU_EL0_RD_CNTR		2
 
 #define HW_OP_UNSUPPORTED		0xFFFF
 #define C(_x)				PERF_COUNT_HW_CACHE_##_x
-- 
2.17.1



* [PATCH 5/7] arm64: pmu: Add hook to handle pmu-related undefined instructions
  2019-06-11 12:53 [PATCH 0/7] arm64: Enable access to pmu registers by user-space Raphael Gault
                   ` (3 preceding siblings ...)
  2019-06-11 12:53 ` [PATCH 4/7] arm64: pmu: Add function implementation to update event index in userpage Raphael Gault
@ 2019-06-11 12:53 ` Raphael Gault
  2019-06-11 12:53 ` [PATCH 6/7] arm64: perf: Enable pmu counter direct access for perf event on armv8 Raphael Gault
  2019-06-11 12:53 ` [PATCH 7/7] Documentation: arm64: Document PMU counters access from userspace Raphael Gault
  6 siblings, 0 replies; 21+ messages in thread
From: Raphael Gault @ 2019-06-11 12:53 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel
  Cc: mingo, peterz, catalin.marinas, will.deacon, acme, mark.rutland,
	Raphael Gault

In order to prevent userspace processes which try to access the PMU
registers on a big.LITTLE system from being delivered a fatal signal, we
introduce a hook to handle undefined instructions.

The goal here is to prevent the process from being interrupted by a signal
when the error is caused by the task being rescheduled while accessing
a counter, which makes the counter access invalid. As we are not able
to know efficiently the number of counters physically available on both
PMUs in that context, we consider that any faulting access to a counter
which is architecturally correct should not cause a SIGILL signal if
the permissions are set accordingly.

This commit also modifies the mask of the mrs_hook declared in
arch/arm64/kernel/cpufeature.c, which emulates only feature register
access. This is necessary because that hook's mask was too large and
thus matched any mrs instruction, even those not related to the emulated
registers, which made the PMU emulation inefficient.

Signed-off-by: Raphael Gault <raphael.gault@arm.com>
---
 arch/arm64/kernel/cpufeature.c |  4 +--
 arch/arm64/kernel/perf_event.c | 55 ++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 80babf451519..d9b2be97cc06 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2167,8 +2167,8 @@ static int emulate_mrs(struct pt_regs *regs, u32 insn)
 }
 
 static struct undef_hook mrs_hook = {
-	.instr_mask = 0xfff00000,
-	.instr_val  = 0xd5300000,
+	.instr_mask = 0xffff0000,
+	.instr_val  = 0xd5380000,
 	.pstate_mask = PSR_AA32_MODE_MASK,
 	.pstate_val = PSR_MODE_EL0t,
 	.fn = emulate_mrs,
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 293e4c365a53..93ac24b51d5f 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -19,9 +19,11 @@
  * along with this program.  If not, see <http://www.gnu.org/licenses/>.
  */
 
+#include <asm/cpu.h>
 #include <asm/irq_regs.h>
 #include <asm/perf_event.h>
 #include <asm/sysreg.h>
+#include <asm/traps.h>
 #include <asm/virt.h>
 
 #include <linux/acpi.h>
@@ -1041,6 +1043,59 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
 	return probe.present ? 0 : -ENODEV;
 }
 
+static int emulate_pmu(struct pt_regs *regs, u32 insn)
+{
+	u32 sys_reg, rt;
+	u32 pmuserenr;
+
+	sys_reg = (u32)aarch64_insn_decode_immediate(AARCH64_INSN_IMM_16, insn) << 5;
+	rt = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, insn);
+	pmuserenr = read_sysreg(pmuserenr_el0);
+
+	if ((pmuserenr & (ARMV8_PMU_USERENR_ER|ARMV8_PMU_USERENR_CR)) !=
+	    (ARMV8_PMU_USERENR_ER|ARMV8_PMU_USERENR_CR))
+		return -EINVAL;
+
+
+	/*
+	 * Userspace is expected to only use this in the context of the scheme
+	 * described in the struct perf_event_mmap_page comments.
+	 *
+	 * Given that context, we can only get here if we got migrated between
+	 * getting the register index and doing the MSR read.  This in turn
+	 * implies we'll fail the sequence and retry, so any value returned is
+	 * 'good', all we need is to be non-fatal.
+	 *
+	 * The choice of the value 0 comes from the fact that when
+	 * accessing a register which is not counting events but is accessible,
+	 * we get 0.
+	 */
+	pt_regs_write_reg(regs, rt, 0);
+
+	arm64_skip_faulting_instruction(regs, 4);
+	return 0;
+}
+
+/*
+ * This hook will only be triggered by mrs
+ * instructions on PMU registers. This is mandatory
+ * in order to have a consistent behaviour even on
+ * big.LITTLE systems.
+ */
+static struct undef_hook pmu_hook = {
+	.instr_mask = 0xffff8800,
+	.instr_val  = 0xd53b8800,
+	.fn = emulate_pmu,
+};
+
+static int __init enable_pmu_emulation(void)
+{
+	register_undef_hook(&pmu_hook);
+	return 0;
+}
+
+core_initcall(enable_pmu_emulation);
+
 static int armv8_pmu_init(struct arm_pmu *cpu_pmu)
 {
 	int ret = armv8pmu_probe_pmu(cpu_pmu);
-- 
2.17.1



* [PATCH 6/7] arm64: perf: Enable pmu counter direct access for perf event on armv8
  2019-06-11 12:53 [PATCH 0/7] arm64: Enable access to pmu registers by user-space Raphael Gault
                   ` (4 preceding siblings ...)
  2019-06-11 12:53 ` [PATCH 5/7] arm64: pmu: Add hook to handle pmu-related undefined instructions Raphael Gault
@ 2019-06-11 12:53 ` Raphael Gault
  2019-06-11 12:53 ` [PATCH 7/7] Documentation: arm64: Document PMU counters access from userspace Raphael Gault
  6 siblings, 0 replies; 21+ messages in thread
From: Raphael Gault @ 2019-06-11 12:53 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel
  Cc: mingo, peterz, catalin.marinas, will.deacon, acme, mark.rutland,
	Raphael Gault

Keep track of events opened with direct access to the hardware counters
and modify permissions while they are open.

The strategy used here is the same one x86 uses: every time an event
is mapped, the permissions are set if required. The atomic field added
in the mm_context helps keep track of the different events opened and
deactivates the permissions when all are unmapped.
We also need to update the permissions in the context switch code so
that tasks keep the right permissions.

Signed-off-by: Raphael Gault <raphael.gault@arm.com>
---
 arch/arm64/include/asm/mmu.h         |  6 +++++
 arch/arm64/include/asm/mmu_context.h |  2 ++
 arch/arm64/include/asm/perf_event.h  | 14 ++++++++++
 drivers/perf/arm_pmu.c               | 38 ++++++++++++++++++++++++++++
 4 files changed, 60 insertions(+)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 67ef25d037ea..9de4cf0b17c7 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -29,6 +29,12 @@
 
 typedef struct {
 	atomic64_t	id;
+
+	/*
+	 * non-zero if userspace have access to hardware
+	 * counters directly.
+	 */
+	atomic_t	pmu_direct_access;
 	void		*vdso;
 	unsigned long	flags;
 } mm_context_t;
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 2da3e478fd8f..33494af613d8 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -32,6 +32,7 @@
 #include <asm-generic/mm_hooks.h>
 #include <asm/cputype.h>
 #include <asm/pgtable.h>
+#include <asm/perf_event.h>
 #include <asm/sysreg.h>
 #include <asm/tlbflush.h>
 
@@ -235,6 +236,7 @@ static inline void __switch_mm(struct mm_struct *next)
 	}
 
 	check_and_switch_context(next, cpu);
+	perf_switch_user_access(next);
 }
 
 static inline void
diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
index c593761ba61c..32a6d604bb3b 100644
--- a/arch/arm64/include/asm/perf_event.h
+++ b/arch/arm64/include/asm/perf_event.h
@@ -19,6 +19,7 @@
 
 #include <asm/stack_pointer.h>
 #include <asm/ptrace.h>
+#include <linux/mm_types.h>
 
 #define	ARMV8_PMU_MAX_COUNTERS	32
 #define	ARMV8_PMU_COUNTER_MASK	(ARMV8_PMU_MAX_COUNTERS - 1)
@@ -234,4 +235,17 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
 	(regs)->pstate = PSR_MODE_EL1h;	\
 }
 
+static inline void perf_switch_user_access(struct mm_struct *mm)
+{
+	if (!IS_ENABLED(CONFIG_PERF_EVENTS))
+		return;
+
+	if (atomic_read(&mm->context.pmu_direct_access)) {
+		write_sysreg(ARMV8_PMU_USERENR_ER|ARMV8_PMU_USERENR_CR,
+			     pmuserenr_el0);
+	} else {
+		write_sysreg(0, pmuserenr_el0);
+	}
+}
+
 #endif
diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index 2d06b8095a19..6ae85fcbf297 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -25,6 +25,7 @@
 #include <linux/irqdesc.h>
 
 #include <asm/irq_regs.h>
+#include <asm/mmu_context.h>
 
 static DEFINE_PER_CPU(struct arm_pmu *, cpu_armpmu);
 static DEFINE_PER_CPU(int, cpu_irq);
@@ -778,6 +779,41 @@ static void cpu_pmu_destroy(struct arm_pmu *cpu_pmu)
 					    &cpu_pmu->node);
 }
 
+static void refresh_pmuserenr(void *mm)
+{
+	perf_switch_user_access(mm);
+}
+
+static void armpmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
+{
+	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
+		return;
+
+	/*
+	 * This function relies on not being called concurrently in two
+	 * tasks in the same mm.  Otherwise one task could observe
+	 * pmu_direct_access > 1 and return all the way back to
+	 * userspace with user access disabled while another task is still
+	 * doing on_each_cpu_mask() to enable user access.
+	 *
+	 * For now, this can't happen because all callers hold mmap_sem
+	 * for write.  If this changes, we'll need a different solution.
+	 */
+	lockdep_assert_held_exclusive(&mm->mmap_sem);
+
+	if (atomic_inc_return(&mm->context.pmu_direct_access) == 1)
+		on_each_cpu(refresh_pmuserenr, mm, 1);
+}
+
+static void armpmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
+{
+	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
+		return;
+
+	if (atomic_dec_and_test(&mm->context.pmu_direct_access))
+		on_each_cpu_mask(mm_cpumask(mm), refresh_pmuserenr, NULL, 1);
+}
+
 static struct arm_pmu *__armpmu_alloc(gfp_t flags)
 {
 	struct arm_pmu *pmu;
@@ -799,6 +835,8 @@ static struct arm_pmu *__armpmu_alloc(gfp_t flags)
 		.pmu_enable	= armpmu_enable,
 		.pmu_disable	= armpmu_disable,
 		.event_init	= armpmu_event_init,
+		.event_mapped	= armpmu_event_mapped,
+		.event_unmapped	= armpmu_event_unmapped,
 		.add		= armpmu_add,
 		.del		= armpmu_del,
 		.start		= armpmu_start,
-- 
2.17.1



* [PATCH 7/7] Documentation: arm64: Document PMU counters access from userspace
  2019-06-11 12:53 [PATCH 0/7] arm64: Enable access to pmu registers by user-space Raphael Gault
                   ` (5 preceding siblings ...)
  2019-06-11 12:53 ` [PATCH 6/7] arm64: perf: Enable pmu counter direct access for perf event on armv8 Raphael Gault
@ 2019-06-11 12:53 ` Raphael Gault
  6 siblings, 0 replies; 21+ messages in thread
From: Raphael Gault @ 2019-06-11 12:53 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel
  Cc: mingo, peterz, catalin.marinas, will.deacon, acme, mark.rutland,
	Raphael Gault

Add a documentation file to describe the access to the PMU hardware
counters from userspace.

Signed-off-by: Raphael Gault <raphael.gault@arm.com>
---
 .../arm64/pmu_counter_user_access.txt         | 42 +++++++++++++++++++
 1 file changed, 42 insertions(+)
 create mode 100644 Documentation/arm64/pmu_counter_user_access.txt

diff --git a/Documentation/arm64/pmu_counter_user_access.txt b/Documentation/arm64/pmu_counter_user_access.txt
new file mode 100644
index 000000000000..6788b1107381
--- /dev/null
+++ b/Documentation/arm64/pmu_counter_user_access.txt
@@ -0,0 +1,42 @@
+Access to PMU hardware counter from userspace
+=============================================
+
+Overview
+--------
+The perf user-space tool relies on the PMU to monitor events. It offers an
+abstraction layer over the hardware counters since the underlying
+implementation is cpu-dependent.
+Arm64 allows userspace tools to have access to the registers storing the
+hardware counters' values directly.
+
+This targets specifically self-monitoring tasks in order to reduce the overhead
+by directly accessing the registers without having to go through the kernel.
+
+How-to
+------
+The focus is set on the armv8 pmuv3, which makes sure that access to the PMU
+registers is enabled and that userspace has access to the relevant
+information in order to use them.
+
+In order to have access to the hardware counter it is necessary to open the event
+using the perf_event_open syscall interface: the syscall returns a fd which
+can subsequently be used with the mmap syscall in order to retrieve a page of memory
+containing information about the event.
+The PMU driver uses this page to expose to the user the hardware counter's
+index. Using this index enables the user to access the PMU registers using the
+`mrs` instruction.
+
+Have a look at `tools/perf/arch/arm64/tests/user-events.c` for an example. It can be
+run using the perf tool to check that the access to the registers works
+correctly from userspace:
+
+./perf test -v
+
+About chained events
+--------------------
+When the user requests that an event be counted on 64 bits, two hardware
+counters are used and need to be combined to retrieve the correct value:
+
+val = read_counter(idx);
+if ((event.attr.config1 & 0x1))
+	val = (val << 32) | read_counter(idx - 1);
-- 
2.17.1



* Re: [PATCH 1/7] perf: arm64: Compile tests unconditionally
  2019-06-11 12:53 ` [PATCH 1/7] perf: arm64: Compile tests unconditionally Raphael Gault
@ 2019-06-11 14:09   ` Mark Rutland
  2019-06-11 14:23     ` Arnaldo Carvalho de Melo
  2019-06-22  6:34   ` [tip:perf/core] perf tests " tip-bot for Raphael Gault
  1 sibling, 1 reply; 21+ messages in thread
From: Mark Rutland @ 2019-06-11 14:09 UTC (permalink / raw)
  To: Raphael Gault
  Cc: linux-arm-kernel, linux-kernel, mingo, peterz, catalin.marinas,
	will.deacon, acme

On Tue, Jun 11, 2019 at 01:53:09PM +0100, Raphael Gault wrote:
> In order to subsequently add more tests for the arm64 architecture
> we compile the tests target for arm64 systematically.

Given prior questions regarding this commit, it's probably worth
spelling things out more explicitly, e.g.

  Currently we only build the arm64/tests directory if
  CONFIG_DWARF_UNWIND is selected, which is fine as the only test we
  have is arm64/tests/dwarf-unwind.o.

  So that we can add more tests to the test directory, let's
  unconditionally build the directory, but conditionally build
  dwarf-unwind.o depending on CONFIG_DWARF_UNWIND.

  There should be no functional change as a result of this patch.

> 
> Signed-off-by: Raphael Gault <raphael.gault@arm.com>

Either way, the patch looks good to me:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.

> ---
>  tools/perf/arch/arm64/Build       | 2 +-
>  tools/perf/arch/arm64/tests/Build | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/arch/arm64/Build b/tools/perf/arch/arm64/Build
> index 36222e64bbf7..a7dd46a5b678 100644
> --- a/tools/perf/arch/arm64/Build
> +++ b/tools/perf/arch/arm64/Build
> @@ -1,2 +1,2 @@
>  perf-y += util/
> -perf-$(CONFIG_DWARF_UNWIND) += tests/
> +perf-y += tests/
> diff --git a/tools/perf/arch/arm64/tests/Build b/tools/perf/arch/arm64/tests/Build
> index 41707fea74b3..a61c06bdb757 100644
> --- a/tools/perf/arch/arm64/tests/Build
> +++ b/tools/perf/arch/arm64/tests/Build
> @@ -1,4 +1,4 @@
>  perf-y += regs_load.o
> -perf-y += dwarf-unwind.o
> +perf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o
>  
>  perf-y += arch-tests.o
> -- 
> 2.17.1
> 


* Re: [PATCH 1/7] perf: arm64: Compile tests unconditionally
  2019-06-11 14:09   ` Mark Rutland
@ 2019-06-11 14:23     ` Arnaldo Carvalho de Melo
  2019-06-11 17:01       ` Mark Rutland
  0 siblings, 1 reply; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-06-11 14:23 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Raphael Gault, linux-arm-kernel, linux-kernel, mingo, peterz,
	catalin.marinas, will.deacon

Em Tue, Jun 11, 2019 at 03:09:07PM +0100, Mark Rutland escreveu:
> On Tue, Jun 11, 2019 at 01:53:09PM +0100, Raphael Gault wrote:
> > In order to subsequently add more tests for the arm64 architecture
> > we compile the tests target for arm64 systematically.
> 
> Given prior questions regarding this commit, it's probably worth
> spelling things out more explicitly, e.g.
> 
>   Currently we only build the arm64/tests directory if
>   CONFIG_DWARF_UNWIND is selected, which is fine as the only test we
>   have is arm64/tests/dwarf-unwind.o.
> 
>   So that we can add more tests to the test directory, let's
>   unconditionally build the directory, but conditionally build
>   dwarf-unwind.o depending on CONFIG_DWARF_UNWIND.
> 
>   There should be no functional change as a result of this patch.
> 
> > 
> > Signed-off-by: Raphael Gault <raphael.gault@arm.com>
> 
> Either way, the patch looks good to me:
> 
> Acked-by: Mark Rutland <mark.rutland@arm.com>

I'll update the comment, collect your Acked-by and apply the patch.

- Arnaldo
 
> Mark.
> 
> > ---
> >  tools/perf/arch/arm64/Build       | 2 +-
> >  tools/perf/arch/arm64/tests/Build | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/tools/perf/arch/arm64/Build b/tools/perf/arch/arm64/Build
> > index 36222e64bbf7..a7dd46a5b678 100644
> > --- a/tools/perf/arch/arm64/Build
> > +++ b/tools/perf/arch/arm64/Build
> > @@ -1,2 +1,2 @@
> >  perf-y += util/
> > -perf-$(CONFIG_DWARF_UNWIND) += tests/
> > +perf-y += tests/
> > diff --git a/tools/perf/arch/arm64/tests/Build b/tools/perf/arch/arm64/tests/Build
> > index 41707fea74b3..a61c06bdb757 100644
> > --- a/tools/perf/arch/arm64/tests/Build
> > +++ b/tools/perf/arch/arm64/tests/Build
> > @@ -1,4 +1,4 @@
> >  perf-y += regs_load.o
> > -perf-y += dwarf-unwind.o
> > +perf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o
> >  
> >  perf-y += arch-tests.o
> > -- 
> > 2.17.1
> > 

-- 

- Arnaldo


* Re: [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters
  2019-06-11 12:53 ` [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters Raphael Gault
@ 2019-06-11 14:33   ` Arnaldo Carvalho de Melo
  2019-06-11 16:57     ` Mark Rutland
  0 siblings, 1 reply; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-06-11 14:33 UTC (permalink / raw)
  To: Raphael Gault
  Cc: linux-arm-kernel, linux-kernel, mingo, peterz, catalin.marinas,
	will.deacon, mark.rutland

Em Tue, Jun 11, 2019 at 01:53:11PM +0100, Raphael Gault escreveu:
> Add an extra test to check userspace access to PMU hardware counters.
> This test doesn't rely on the seqlock as a synchronisation mechanism but
> instead uses restartable sequences (rseq) to make sure that the thread is
> not interrupted when reading the index of the counter and the associated
> PMU register.
> 
> In addition to reading the PMU counters, this test is run several times
> in order to measure the ratio of failures:
> I ran this test on the Juno development platform, which is big.LITTLE
> with 4 Cortex A53 and 2 Cortex A57. The results vary quite a lot
> (running it with 100 tests is not so long and I did it several times).
> I ran it once with 10000 iterations:
> `runs: 10000, abort: 62.53%, zero: 34.93%, success: 2.54%`
> 
> Signed-off-by: Raphael Gault <raphael.gault@arm.com>
> ---
>  tools/perf/arch/arm64/include/arch-tests.h    |   5 +-
>  tools/perf/arch/arm64/include/rseq-arm64.h    | 220 ++++++++++++++++++

So, I applied the first patch in this series, but could you please break
this patch into at least two, one introducing the facility
(include/rseq*) and the second adding the test?

We try to enforce this kind of granularity as down the line we may want
to revert one part while the other already has other uses and thus
wouldn't allow a straight revert.

Also, can this go to tools/arch/ instead? Is this really perf specific?
Isn't there any arch/arm64/include files for the kernel that we could
mirror and have it checked for drift in tools/perf/check-headers.sh?

- Arnaldo

>  tools/perf/arch/arm64/tests/Build             |   1 +
>  tools/perf/arch/arm64/tests/arch-tests.c      |   6 +
>  tools/perf/arch/arm64/tests/rseq-pmu-events.c | 219 +++++++++++++++++
>  5 files changed, 450 insertions(+), 1 deletion(-)
>  create mode 100644 tools/perf/arch/arm64/include/rseq-arm64.h
>  create mode 100644 tools/perf/arch/arm64/tests/rseq-pmu-events.c
> 
> diff --git a/tools/perf/arch/arm64/include/arch-tests.h b/tools/perf/arch/arm64/include/arch-tests.h
> index a9b17ae0560b..4164762b43c6 100644
> --- a/tools/perf/arch/arm64/include/arch-tests.h
> +++ b/tools/perf/arch/arm64/include/arch-tests.h
> @@ -13,6 +13,9 @@ int test__arch_unwind_sample(struct perf_sample *sample,
>  extern struct test arch_tests[];
>  int test__rd_pmevcntr(struct test *test __maybe_unused,
>  		      int subtest __maybe_unused);
> -
> +#ifdef CONFIG_RSEQ
> +int rseq__rd_pmevcntr(struct test *test __maybe_unused,
> +		      int subtest __maybe_unused);
> +#endif
>  
>  #endif
> diff --git a/tools/perf/arch/arm64/include/rseq-arm64.h b/tools/perf/arch/arm64/include/rseq-arm64.h
> new file mode 100644
> index 000000000000..00d6960915a9
> --- /dev/null
> +++ b/tools/perf/arch/arm64/include/rseq-arm64.h
> @@ -0,0 +1,220 @@
> +/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
> +/*
> + * rseq-arm64.h
> + *
> + * This file is mostly a copy from
> + * tools/testing/selftests/rseq/rseq-arm64.h
> + */
> +
> +/*
> + * aarch64 -mbig-endian generates mixed endianness code vs data:
> + * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
> + * matches code endianness.
> + */
> +#define __rseq_str_1(x)  #x
> +#define __rseq_str(x)            __rseq_str_1(x)
> +
> +#define RSEQ_ACCESS_ONCE(x)	(*(__volatile__  __typeof__(x) *)&(x))
> +#define RSEQ_SIG_CODE	0xd428bc00	/* BRK #0x45E0.  */
> +
> +#ifdef __AARCH64EB__
> +#define RSEQ_SIG_DATA	0x00bc28d4	/* BRK #0x45E0.  */
> +#else
> +#define RSEQ_SIG_DATA	RSEQ_SIG_CODE
> +#endif
> +
> +#define RSEQ_SIG	RSEQ_SIG_DATA
> +
> +#define rseq_smp_mb()	__asm__ __volatile__ ("dmb ish" ::: "memory")
> +#define rseq_smp_rmb()	__asm__ __volatile__ ("dmb ishld" ::: "memory")
> +#define rseq_smp_wmb()	__asm__ __volatile__ ("dmb ishst" ::: "memory")
> +
> +#define rseq_smp_load_acquire(p)						\
> +__extension__ ({								\
> +	__typeof(*p) ____p1;							\
> +	switch (sizeof(*p)) {							\
> +	case 1:									\
> +		asm volatile ("ldarb %w0, %1"					\
> +			: "=r" (*(__u8 *)p)					\
> +			: "Q" (*p) : "memory");					\
> +		break;								\
> +	case 2:									\
> +		asm volatile ("ldarh %w0, %1"					\
> +			: "=r" (*(__u16 *)p)					\
> +			: "Q" (*p) : "memory");					\
> +		break;								\
> +	case 4:									\
> +		asm volatile ("ldar %w0, %1"					\
> +			: "=r" (*(__u32 *)p)					\
> +			: "Q" (*p) : "memory");					\
> +		break;								\
> +	case 8:									\
> +		asm volatile ("ldar %0, %1"					\
> +			: "=r" (*(__u64 *)p)					\
> +			: "Q" (*p) : "memory");					\
> +		break;								\
> +	}									\
> +	____p1;									\
> +})
> +
> +#define rseq_smp_acquire__after_ctrl_dep()	rseq_smp_rmb()
> +
> +#define rseq_smp_store_release(p, v)						\
> +do {										\
> +	switch (sizeof(*p)) {							\
> +	case 1:									\
> +		asm volatile ("stlrb %w1, %0"					\
> +				: "=Q" (*p)					\
> +				: "r" ((__u8)v)					\
> +				: "memory");					\
> +		break;								\
> +	case 2:									\
> +		asm volatile ("stlrh %w1, %0"					\
> +				: "=Q" (*p)					\
> +				: "r" ((__u16)v)				\
> +				: "memory");					\
> +		break;								\
> +	case 4:									\
> +		asm volatile ("stlr %w1, %0"					\
> +				: "=Q" (*p)					\
> +				: "r" ((__u32)v)				\
> +				: "memory");					\
> +		break;								\
> +	case 8:									\
> +		asm volatile ("stlr %1, %0"					\
> +				: "=Q" (*p)					\
> +				: "r" ((__u64)v)				\
> +				: "memory");					\
> +		break;								\
> +	}									\
> +} while (0)
> +
> +#ifdef RSEQ_SKIP_FASTPATH
> +#include "rseq-skip.h"
> +#else /* !RSEQ_SKIP_FASTPATH */
> +
> +#define RSEQ_ASM_TMP_REG32	"w15"
> +#define RSEQ_ASM_TMP_REG	"x15"
> +#define RSEQ_ASM_TMP_REG_2	"x14"
> +
> +#define INJECT_ASM_REG	RSEQ_ASM_TMP_REG32
> +
> +#define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, start_ip,		\
> +				post_commit_offset, abort_ip)			\
> +	"	.pushsection	__rseq_cs, \"aw\"\n"				\
> +	"	.balign	32\n"							\
> +	__rseq_str(label) ":\n"							\
> +	"	.long	" __rseq_str(version) ", " __rseq_str(flags) "\n"	\
> +	"	.quad	" __rseq_str(start_ip) ", "				\
> +			  __rseq_str(post_commit_offset) ", "			\
> +			  __rseq_str(abort_ip) "\n"				\
> +	"	.popsection\n\t"						\
> +	"	.pushsection __rseq_cs_ptr_array, \"aw\"\n"				\
> +	"	.quad " __rseq_str(label) "b\n"					\
> +	"	.popsection\n"
> +
> +#define RSEQ_ASM_DEFINE_TABLE(label, start_ip, post_commit_ip, abort_ip)	\
> +	__RSEQ_ASM_DEFINE_TABLE(label, 0x0, 0x0, start_ip,			\
> +				(post_commit_ip - start_ip), abort_ip)
> +
> +/*
> + * Exit points of a rseq critical section consist of all instructions outside
> + * of the critical section where a critical section can either branch to or
> + * reach through the normal course of its execution. The abort IP and the
> + * post-commit IP are already part of the __rseq_cs section and should not be
> + * explicitly defined as additional exit points. Knowing all exit points is
> + * useful to assist debuggers stepping over the critical section.
> + */
> +#define RSEQ_ASM_DEFINE_EXIT_POINT(start_ip, exit_ip)				\
> +	"	.pushsection __rseq_exit_point_array, \"aw\"\n"			\
> +	"	.quad " __rseq_str(start_ip) ", " __rseq_str(exit_ip) "\n"	\
> +	"	.popsection\n"
> +
> +#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs)			\
> +	"	adrp	" RSEQ_ASM_TMP_REG ", " __rseq_str(cs_label) "\n"	\
> +	"	add	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
> +			", :lo12:" __rseq_str(cs_label) "\n"			\
> +	"	str	" RSEQ_ASM_TMP_REG ", %[" __rseq_str(rseq_cs) "]\n"	\
> +	__rseq_str(label) ":\n"
> +
> +#define RSEQ_ASM_DEFINE_ABORT(label, abort_label)				\
> +	"	b	222f\n"							\
> +	"	.inst 	"	__rseq_str(RSEQ_SIG_CODE) "\n"			\
> +	__rseq_str(label) ":\n"							\
> +	"	b	%l[" __rseq_str(abort_label) "]\n"			\
> +	"222:\n"
> +
> +#define RSEQ_ASM_OP_STORE(value, var)						\
> +	"	str	%[" __rseq_str(value) "], %[" __rseq_str(var) "]\n"
> +
> +#define RSEQ_ASM_OP_STORE_RELEASE(value, var)					\
> +	"	stlr	%[" __rseq_str(value) "], %[" __rseq_str(var) "]\n"
> +
> +#define RSEQ_ASM_OP_FINAL_STORE(value, var, post_commit_label)			\
> +	RSEQ_ASM_OP_STORE(value, var)						\
> +	__rseq_str(post_commit_label) ":\n"
> +
> +#define RSEQ_ASM_OP_FINAL_STORE_RELEASE(value, var, post_commit_label)		\
> +	RSEQ_ASM_OP_STORE_RELEASE(value, var)					\
> +	__rseq_str(post_commit_label) ":\n"
> +
> +#define RSEQ_ASM_OP_CMPEQ(var, expect, label)					\
> +	"	ldr	" RSEQ_ASM_TMP_REG ", %[" __rseq_str(var) "]\n"		\
> +	"	sub	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
> +			", %[" __rseq_str(expect) "]\n"				\
> +	"	cbnz	" RSEQ_ASM_TMP_REG ", " __rseq_str(label) "\n"
> +
> +#define RSEQ_ASM_OP_CMPEQ32(var, expect, label)					\
> +	"	ldr	" RSEQ_ASM_TMP_REG32 ", %[" __rseq_str(var) "]\n"	\
> +	"	sub	" RSEQ_ASM_TMP_REG32 ", " RSEQ_ASM_TMP_REG32		\
> +			", %w[" __rseq_str(expect) "]\n"			\
> +	"	cbnz	" RSEQ_ASM_TMP_REG32 ", " __rseq_str(label) "\n"
> +
> +#define RSEQ_ASM_OP_CMPNE(var, expect, label)					\
> +	"	ldr	" RSEQ_ASM_TMP_REG ", %[" __rseq_str(var) "]\n"		\
> +	"	sub	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
> +			", %[" __rseq_str(expect) "]\n"				\
> +	"	cbz	" RSEQ_ASM_TMP_REG ", " __rseq_str(label) "\n"
> +
> +#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label)			\
> +	RSEQ_ASM_OP_CMPEQ32(current_cpu_id, cpu_id, label)
> +
> +#define RSEQ_ASM_OP_R_LOAD(var)							\
> +	"	ldr	" RSEQ_ASM_TMP_REG ", [%[" __rseq_str(var) "]], #0\n"
> +
> +#define RSEQ_ASM_OP_R_STORE(var)						\
> +	"	str	" RSEQ_ASM_TMP_REG ", %[" __rseq_str(var) "]\n"
> +
> +#define RSEQ_ASM_OP_R_LOAD_OFF(offset)						\
> +	"	ldr	" RSEQ_ASM_TMP_REG ", [" RSEQ_ASM_TMP_REG		\
> +			", %[" __rseq_str(offset) "]]\n"
> +
> +#define RSEQ_ASM_OP_R_ADD(count)						\
> +	"	add	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
> +			", %[" __rseq_str(count) "]\n"
> +
> +#define RSEQ_ASM_OP_R_SUB(imm)						\
> +	"	sub	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
> +			", #" __rseq_str(imm) "\n"
> +
> +#define RSEQ_ASM_OP_R_AND(mask)							\
> +	"	and	" RSEQ_ASM_TMP_REG ", " RSEQ_ASM_TMP_REG		\
> +			", #" __rseq_str(mask) "\n"
> +
> +#define RSEQ_ASM_OP_R_FINAL_STORE(var, post_commit_label)			\
> +	"	str	" RSEQ_ASM_TMP_REG ", %[" __rseq_str(var) "]\n"		\
> +	__rseq_str(post_commit_label) ":\n"
> +
> +#define RSEQ_ASM_OP_R_BAD_MEMCPY(dst, src, len)					\
> +	"	cbz	%[" __rseq_str(len) "], 333f\n"				\
> +	"	mov	" RSEQ_ASM_TMP_REG_2 ", %[" __rseq_str(len) "]\n"	\
> +	"222:	sub	" RSEQ_ASM_TMP_REG_2 ", " RSEQ_ASM_TMP_REG_2 ", #1\n"	\
> +	"	ldrb	" RSEQ_ASM_TMP_REG32 ", [%[" __rseq_str(src) "]"	\
> +			", " RSEQ_ASM_TMP_REG_2 "]\n"				\
> +	"	strb	" RSEQ_ASM_TMP_REG32 ", [%[" __rseq_str(dst) "]"	\
> +			", " RSEQ_ASM_TMP_REG_2 "]\n"				\
> +	"	cbnz	" RSEQ_ASM_TMP_REG_2 ", 222b\n"				\
> +	"333:\n"
> +
> +
> +#endif /* !RSEQ_SKIP_FASTPATH */
> diff --git a/tools/perf/arch/arm64/tests/Build b/tools/perf/arch/arm64/tests/Build
> index 3f9a20c17fc6..b53823603066 100644
> --- a/tools/perf/arch/arm64/tests/Build
> +++ b/tools/perf/arch/arm64/tests/Build
> @@ -1,5 +1,6 @@
>  perf-y += regs_load.o
>  perf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o
> +perf-$(CONFIG_RSEQ) += rseq-pmu-events.o
>  
>  perf-y += user-events.o
>  perf-y += arch-tests.o
> diff --git a/tools/perf/arch/arm64/tests/arch-tests.c b/tools/perf/arch/arm64/tests/arch-tests.c
> index 57df9b89dede..5f08bdc41d89 100644
> --- a/tools/perf/arch/arm64/tests/arch-tests.c
> +++ b/tools/perf/arch/arm64/tests/arch-tests.c
> @@ -14,6 +14,12 @@ struct test arch_tests[] = {
>  		.desc = "User counter access",
>  		.func = test__rd_pmevcntr,
>  	},
> +#ifdef CONFIG_RSEQ
> +	{
> +		.desc = "User counter access with rseq",
> +		.func = rseq__rd_pmevcntr,
> +	},
> +#endif
>  	{
>  		.func = NULL,
>  	},
> diff --git a/tools/perf/arch/arm64/tests/rseq-pmu-events.c b/tools/perf/arch/arm64/tests/rseq-pmu-events.c
> new file mode 100644
> index 000000000000..4d9578d7e056
> --- /dev/null
> +++ b/tools/perf/arch/arm64/tests/rseq-pmu-events.c
> @@ -0,0 +1,219 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <asm/bug.h>
> +#include <linux/rseq.h>
> +#include <errno.h>
> +#include <unistd.h>
> +#include <sched.h>
> +#include <stdlib.h>
> +#include <signal.h>
> +#include <sys/mman.h>
> +#include <sys/sysinfo.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <linux/types.h>
> +#include "perf.h"
> +#include "debug.h"
> +#include "tests/tests.h"
> +#include "cloexec.h"
> +#include "util.h"
> +#include "arch-tests.h"
> +#include "rseq-arm64.h"
> +
> +static __thread volatile struct rseq __rseq_abi;
> +
> +static int sys_rseq(volatile struct rseq *rseq_abi, uint32_t rseq_len,
> +		    int flags, uint32_t sig)
> +{
> +	return syscall(__NR_rseq, rseq_abi, rseq_len, flags, sig);
> +}
> +
> +static int rseq_register_current_thread(void)
> +{
> +	return sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
> +}
> +
> +static int rseq_unregister_current_thread(void)
> +{
> +	return sys_rseq(&__rseq_abi, sizeof(struct rseq),
> +			RSEQ_FLAG_UNREGISTER, RSEQ_SIG);
> +}
> +
> +static u64 noinline mmap_read_self(void *addr, int cpu)
> +{
> +	struct perf_event_mmap_page *pc = addr;
> +	u32 idx = 0;
> +	u64 count = 0;
> +
> +	asm volatile goto(
> +                     RSEQ_ASM_DEFINE_TABLE(0, 1f, 2f, 3f)
> +		     "nop\n"
> +                     RSEQ_ASM_STORE_RSEQ_CS(1, 0b, rseq_cs)
> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
> +                     RSEQ_ASM_OP_R_LOAD(pc_idx)
> +                     RSEQ_ASM_OP_R_AND(0xFF)
> +		     RSEQ_ASM_OP_R_STORE(idx)
> +                     RSEQ_ASM_OP_R_SUB(0x1)
> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
> +                     "msr pmselr_el0, " RSEQ_ASM_TMP_REG "\n"
> +                     "isb\n"
> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
> +                     "mrs " RSEQ_ASM_TMP_REG ", pmxevcntr_el0\n"
> +                     RSEQ_ASM_OP_R_FINAL_STORE(cnt, 2)
> +		     "nop\n"
> +                     RSEQ_ASM_DEFINE_ABORT(3, abort)
> +                     :/* No output operands */
> +		     :  [cpu_id] "r" (cpu),
> +			[current_cpu_id] "Qo" (__rseq_abi.cpu_id),
> +			[rseq_cs] "m" (__rseq_abi.rseq_cs),
> +			[cnt] "m" (count),
> +			[pc_idx] "r" (&pc->index),
> +			[idx] "m" (idx)
> +                     :"memory"
> +                     :abort
> +                    );
> +
> +	if (idx)
> +		count += READ_ONCE(pc->offset);
> +
> +	return count;
> +abort:
> +        pr_debug("Abort handler\n");
> +        exit(-2);
> +}
> +
> +static int __test__rd_pmevcntr(void)
> +{
> +	volatile int tmp = 0;
> +	u64 i, loops = 1000;
> +	int n;
> +	int fd;
> +	int cpu;
> +	void *addr;
> +	struct perf_event_attr attr = {
> +		.type = PERF_TYPE_HARDWARE,
> +		.config = PERF_COUNT_HW_INSTRUCTIONS,
> +		.exclude_kernel = 1,
> +	};
> +	u64 delta_sum = 0;
> +	char sbuf[STRERR_BUFSIZE];
> +
> +	fd = rseq_register_current_thread();
> +	if (fd) {
> +		pr_err("Error: unable to register current thread "
> +		       "return value: %d (%s)\n", fd,
> +		       str_error_r(errno, sbuf, sizeof(sbuf)));
> +		return -1;
> +	}
> +	cpu = RSEQ_ACCESS_ONCE(__rseq_abi.cpu_id_start);
> +
> +	pr_debug("cpu: %d\n", cpu);
> +	fd = sys_perf_event_open(&attr, 0, -1, -1,
> +				 perf_event_open_cloexec_flag());
> +	if (fd < 0) {
> +		pr_err("Error: sys_perf_event_open() syscall returned "
> +		       "with %d (%s)\n", fd,
> +		       str_error_r(errno, sbuf, sizeof(sbuf)));
> +		return -1;
> +	}
> +
> +	addr = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
> +	if (addr == (void *)(-1)) {
> +		pr_err("Error: mmap() syscall returned with (%s)\n",
> +		       str_error_r(errno, sbuf, sizeof(sbuf)));
> +		goto out_close;
> +	}
> +
> +	for (n = 0; n < 6; n++) {
> +		u64 stamp, now, delta;
> +
> +		stamp = mmap_read_self(addr, cpu);
> +
> +		for (i = 0; i < loops; i++)
> +			tmp++;
> +
> +		now = mmap_read_self(addr, cpu);
> +		loops *= 10;
> +
> +		delta = now - stamp;
> +		pr_debug("%14d: %14llu\n", n, (long long)delta);
> +
> +		delta_sum += delta;
> +	}
> +
> +	munmap(addr, page_size);
> +
> +	rseq_unregister_current_thread();
> +	pr_debug("   ");
> +
> +out_close:
> +	close(fd);
> +
> +	if (!delta_sum)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +int rseq__rd_pmevcntr(struct test __maybe_unused *test,
> +		      int __maybe_unused subtest)
> +{
> +	int status = 0;
> +	int wret = 0;
> +	int ret = 0;
> +	int i = 0;
> +	int pid;
> +	int nb_run = 100;
> +	unsigned int run = 0;
> +	unsigned int aborted = 0, zero = 0, success = 0;
> +
> +	for (i = 0; i < nb_run; i++) {
> +		pid = fork();
> +		if (pid < 0)
> +			return -1;
> +
> +		if (!pid) {
> +			ret = __test__rd_pmevcntr();
> +			exit(ret);
> +		}
> +
> +		wret = waitpid(pid, &status, 0);
> +		if (wret < 0)
> +			return -1;
> +
> +		if (WIFSIGNALED(status)) {
> +			pr_err("Error: the child process was interrupted by a signal: %d\n", WTERMSIG(status));
> +			return -1;
> +		}
> +
> +		if (WIFEXITED(status) && WEXITSTATUS(status)) {
> +			pr_err("Error: the child process exited with: %d\n",
> +			       WEXITSTATUS(status));
> +			/*
> +			return -1;
> +			*/
> +			switch (WEXITSTATUS(status)) {
> +			case 0:
> +				success++;
> +				break;
> +			case 255:
> +				zero++;
> +				break;
> +			case 254:
> +				aborted++;
> +				break;
> +			default:
> +				break;
> +			}
> +		}
> +		else
> +			success++;
> +	}
> +
> +	run = aborted + zero + success;
> +
> +	pr_info("runs: %u, abort: %.2f%%, zero: %.2f%%, success: %.2f%%\n",
> +		run, (aborted / (float)run) * 100,
> +		(zero / (float)run) * 100,
> +		(success / (float)run) * 100);
> +	return 0;
> +}
> -- 
> 2.17.1

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters
  2019-06-11 14:33   ` Arnaldo Carvalho de Melo
@ 2019-06-11 16:57     ` Mark Rutland
  2019-06-11 19:33       ` Mathieu Desnoyers
  2019-06-25 13:29       ` Raphael Gault
  0 siblings, 2 replies; 21+ messages in thread
From: Mark Rutland @ 2019-06-11 16:57 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Raphael Gault, linux-arm-kernel, linux-kernel, mingo, peterz,
	catalin.marinas, will.deacon, mathieu.desnoyers

Hi Arnaldo,

On Tue, Jun 11, 2019 at 11:33:46AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Jun 11, 2019 at 01:53:11PM +0100, Raphael Gault escreveu:
> > Add an extra test to check userspace access to pmu hardware counters.
> > This test doesn't rely on the seqlock as a synchronisation mechanism but
> > instead uses the restartable sequences to make sure that the thread is
> > not interrupted when reading the index of the counter and the associated
> > pmu register.
> > 
> > In addition to reading the pmu counters, this test is run several time
> > in order to measure the ratio of failures:
> > I ran this test on the Juno development platform, which is big.LITTLE
> > with 4 Cortex A53 and 2 Cortex A57. The results vary quite a lot
> > (running it with 100 tests is not so long and I did it several times).
> > I ran it once with 10000 iterations:
> > `runs: 10000, abort: 62.53%, zero: 34.93%, success: 2.54%`
> > 
> > Signed-off-by: Raphael Gault <raphael.gault@arm.com>
> > ---
> >  tools/perf/arch/arm64/include/arch-tests.h    |   5 +-
> >  tools/perf/arch/arm64/include/rseq-arm64.h    | 220 ++++++++++++++++++
> 
> So, I applied the first patch in this series, but could you please break
> this patch into at least two, one introducing the facility
> (include/rseq*) and the second adding the test?
> 
> We try to enforce this kind of granularity as down the line we may want
> to revert one part while the other already has other uses and thus
> wouldn't allow a straight revert.
> 
> Also, can this go to tools/arch/ instead? Is this really perf specific?
> Isn't there any arch/arm64/include files for the kernel that we could
> mirror and have it checked for drift in tools/perf/check-headers.sh?

The rseq bits aren't strictly perf specific, and I think the existing
bits under tools/testing/selftests/rseq/ could be factored out to common
locations under tools/include/ and tools/arch/*/include/.

From a scan, those duplicate barriers and other helpers that already
have definitions under tools/, which seems unfortunate. :/

Comments below are for Raphael and Mathieu.

[...]

> > +static u64 noinline mmap_read_self(void *addr, int cpu)
> > +{
> > +	struct perf_event_mmap_page *pc = addr;
> > +	u32 idx = 0;
> > +	u64 count = 0;
> > +
> > +	asm volatile goto(
> > +                     RSEQ_ASM_DEFINE_TABLE(0, 1f, 2f, 3f)
> > +		     "nop\n"
> > +                     RSEQ_ASM_STORE_RSEQ_CS(1, 0b, rseq_cs)
> > +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
> > +                     RSEQ_ASM_OP_R_LOAD(pc_idx)
> > +                     RSEQ_ASM_OP_R_AND(0xFF)
> > +		     RSEQ_ASM_OP_R_STORE(idx)
> > +                     RSEQ_ASM_OP_R_SUB(0x1)
> > +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
> > +                     "msr pmselr_el0, " RSEQ_ASM_TMP_REG "\n"
> > +                     "isb\n"
> > +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
> > +                     "mrs " RSEQ_ASM_TMP_REG ", pmxevcntr_el0\n"
> > +                     RSEQ_ASM_OP_R_FINAL_STORE(cnt, 2)
> > +		     "nop\n"
> > +                     RSEQ_ASM_DEFINE_ABORT(3, abort)
> > +                     :/* No output operands */
> > +		     :  [cpu_id] "r" (cpu),
> > +			[current_cpu_id] "Qo" (__rseq_abi.cpu_id),
> > +			[rseq_cs] "m" (__rseq_abi.rseq_cs),
> > +			[cnt] "m" (count),
> > +			[pc_idx] "r" (&pc->index),
> > +			[idx] "m" (idx)
> > +                     :"memory"
> > +                     :abort
> > +                    );

While baroque, this doesn't look as scary as I thought it would!

However, I'm very scared that this is modifying input operands without
clobbering them. IIUC this is because we're trying to use asm goto,
which doesn't permit output operands.

I'm very dubious about abusing asm goto in this way. Can we instead use a
regular asm volatile block, and place the abort handler _within_ the
asm? If performance is a concern, we can use .pushsection and
.popsection to move that far away...
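
Roughly, the shape might be something like the below (purely a
hypothetical sketch: the section name and the idea of reporting failure
through an output operand are my own assumptions, nothing ever branches
to the abort label here, and this is not a working rseq sequence, only
the structure of a non-goto asm with an out-of-line abort path):

static inline int critical_section_shape(void)
{
	int ok;

	asm volatile(
	"	mov	%w[ok], #1\n"		/* assume success */
	"1:\n"
	/* ... the rseq critical section body would go here ... */
	"2:\n"
	"	.pushsection	.text.rseq_abort, \"ax\"\n"	/* abort path, out of line */
	"3:	mov	%w[ok], #0\n"
	"	b	2b\n"
	"	.popsection\n"
	: [ok] "=&r" (ok)
	:
	: "memory");

	return ok;	/* caller falls back to a syscall read when !ok */
}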

> > +
> > +	if (idx)
> > +		count += READ_ONCE(pc->offset);

I'm rather scared that from GCC's PoV, idx was initialized to zero, and
not modified above (per the asm constraints). I realise that we've used
an "m" constraint and clobbered memory, but I could well imagine that
GCC can interpret that as needing to place a read-only copy in memory,
but still being permitted to use the original value in a register. That
would permit the above to be optimized away, since GCC knows no
registers were clobbered, and thus idx must still be zero.

> > +
> > +	return count;

... and for similar reasons, always return zero here.

> > +abort:
> > +        pr_debug("Abort handler\n");
> > +        exit(-2);
> > +}

Given the necessary complexity evident above, I'm also fearful that the
sort of folk that want userspace counter access aren't going to bother
with the above.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/7] perf: arm64: Compile tests unconditionally
  2019-06-11 14:23     ` Arnaldo Carvalho de Melo
@ 2019-06-11 17:01       ` Mark Rutland
  0 siblings, 0 replies; 21+ messages in thread
From: Mark Rutland @ 2019-06-11 17:01 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Raphael Gault, linux-arm-kernel, linux-kernel, mingo, peterz,
	catalin.marinas, will.deacon

On Tue, Jun 11, 2019 at 11:23:56AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Jun 11, 2019 at 03:09:07PM +0100, Mark Rutland escreveu:
> > On Tue, Jun 11, 2019 at 01:53:09PM +0100, Raphael Gault wrote:
> > > In order to subsequently add more tests for the arm64 architecture
> > > we compile the tests target for arm64 systematically.
> > 
> > Given prior questions regarding this commit, it's probably worth
> > spelling things out more explicitly, e.g.
> > 
> >   Currently we only build the arm64/tests directory if
> >   CONFIG_DWARF_UNWIND is selected, which is fine as the only test we
> >   have is arm64/tests/dwarf-unwind.o.
> > 
> >   So that we can add more tests to the test directory, let's
> >   unconditionally build the directory, but conditionally build
> >   dwarf-unwind.o depending on CONFIG_DWARF_UNWIND.
> > 
> >   There should be no functional change as a result of this patch.
> > 
> > > 
> > > Signed-off-by: Raphael Gault <raphael.gault@arm.com>
> > 
> > Either way, the patch looks good to me:
> > 
> > Acked-by: Mark Rutland <mark.rutland@arm.com>
> 
> I'll update the comment, collect your Acked-by and apply the patch.

That's great, thanks!

As a heads-up, there are still open ABI discussions to be had on the
rest of the series, so while review would be appreciated, it would be
best to hold off applying the remaining userspace bits for now.

Mark.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters
  2019-06-11 16:57     ` Mark Rutland
@ 2019-06-11 19:33       ` Mathieu Desnoyers
  2019-06-13  7:10         ` Raphael Gault
                           ` (2 more replies)
  2019-06-25 13:29       ` Raphael Gault
  1 sibling, 3 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-06-11 19:33 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Arnaldo Carvalho de Melo, Raphael Gault, linux-arm-kernel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Catalin Marinas,
	Will Deacon

----- On Jun 11, 2019, at 6:57 PM, Mark Rutland mark.rutland@arm.com wrote:

> Hi Arnaldo,
> 
> On Tue, Jun 11, 2019 at 11:33:46AM -0300, Arnaldo Carvalho de Melo wrote:
>> Em Tue, Jun 11, 2019 at 01:53:11PM +0100, Raphael Gault escreveu:
>> > Add an extra test to check userspace access to pmu hardware counters.
>> > This test doesn't rely on the seqlock as a synchronisation mechanism but
>> > instead uses the restartable sequences to make sure that the thread is
>> > not interrupted when reading the index of the counter and the associated
>> > pmu register.
>> > 
>> > In addition to reading the pmu counters, this test is run several time
>> > in order to measure the ratio of failures:
>> > I ran this test on the Juno development platform, which is big.LITTLE
>> > with 4 Cortex A53 and 2 Cortex A57. The results vary quite a lot
>> > (running it with 100 tests is not so long and I did it several times).
>> > I ran it once with 10000 iterations:
>> > `runs: 10000, abort: 62.53%, zero: 34.93%, success: 2.54%`
>> > 
>> > Signed-off-by: Raphael Gault <raphael.gault@arm.com>
>> > ---
>> >  tools/perf/arch/arm64/include/arch-tests.h    |   5 +-
>> >  tools/perf/arch/arm64/include/rseq-arm64.h    | 220 ++++++++++++++++++
>> 
>> So, I applied the first patch in this series, but could you please break
>> this patch into at least two, one introducing the facility
>> (include/rseq*) and the second adding the test?
>> 
>> We try to enforce this kind of granularity as down the line we may want
>> to revert one part while the other already has other uses and thus
>> wouldn't allow a straight revert.
>> 
>> Also, can this go to tools/arch/ instead? Is this really perf specific?
>> Isn't there any arch/arm64/include files for the kernel that we could
>> mirror and have it checked for drift in tools/perf/check-headers.sh?
> 
> The rseq bits aren't strictly perf specific, and I think the existing
> bits under tools/testing/selftests/rseq/ could be factored out to common
> locations under tools/include/ and tools/arch/*/include/.

Hi Mark,

Thanks for CCing me!

Or into a stand-alone librseq project:

https://github.com/compudj/librseq (currently a development branch in
my own github)

I don't see why this user-space code should sit in the kernel tree.
It is not tooling-specific.

> 
> From a scan, those already duplicate barriers and other helpers which
> already have definitions under tools/, which seems unfortunate. :/
> 
> Comments below are for Raphael and Matthieu.
> 
> [...]
> 
>> > +static u64 noinline mmap_read_self(void *addr, int cpu)
>> > +{
>> > +	struct perf_event_mmap_page *pc = addr;
>> > +	u32 idx = 0;
>> > +	u64 count = 0;
>> > +
>> > +	asm volatile goto(
>> > +                     RSEQ_ASM_DEFINE_TABLE(0, 1f, 2f, 3f)
>> > +		     "nop\n"
>> > +                     RSEQ_ASM_STORE_RSEQ_CS(1, 0b, rseq_cs)
>> > +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
>> > +                     RSEQ_ASM_OP_R_LOAD(pc_idx)
>> > +                     RSEQ_ASM_OP_R_AND(0xFF)
>> > +		     RSEQ_ASM_OP_R_STORE(idx)
>> > +                     RSEQ_ASM_OP_R_SUB(0x1)
>> > +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
>> > +                     "msr pmselr_el0, " RSEQ_ASM_TMP_REG "\n"
>> > +                     "isb\n"
>> > +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)

I really don't understand why the cpu_id needs to be compared 3 times
here (?!?)

Explicit comparison of the cpu_id within the rseq critical section
should be done _once_.

If the kernel happens to preempt and migrate the thread while in the
critical section, it's the kernel's job to move user-space execution
to the abort handler.

>> > +                     "mrs " RSEQ_ASM_TMP_REG ", pmxevcntr_el0\n"
>> > +                     RSEQ_ASM_OP_R_FINAL_STORE(cnt, 2)
>> > +		     "nop\n"
>> > +                     RSEQ_ASM_DEFINE_ABORT(3, abort)
>> > +                     :/* No output operands */
>> > +		     :  [cpu_id] "r" (cpu),
>> > +			[current_cpu_id] "Qo" (__rseq_abi.cpu_id),
>> > +			[rseq_cs] "m" (__rseq_abi.rseq_cs),
>> > +			[cnt] "m" (count),
>> > +			[pc_idx] "r" (&pc->index),
>> > +			[idx] "m" (idx)
>> > +                     :"memory"
>> > +                     :abort
>> > +                    );
> 
> While baroque, this doesn't look as scary as I thought it would!

That's good to hear :)

> 
> However, I'm very scared that this is modifying input operands without
> clobbering them. IIUC this is beacause we're trying to use asm goto,
> which doesn't permit output operands.

This is correct. What is wrong with modifying the target of "m" input
operands in an inline asm that has a "memory" clobber?

gcc documentation at https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
states:

"An asm goto statement cannot have outputs. This is due to an internal
restriction of the compiler: control transfer instructions cannot have
outputs. If the assembler code does modify anything, use the "memory"
clobber to force the optimizers to flush all register values to memory
and reload them if necessary after the asm statement."

If there is a problem with this approach, an alternative would be to
pass &__rseq_abi.rseq_cs as an "r" input operand, explicitly dereference
it in the assembly, and use the "memory" clobber to ensure the compiler
knows that there are read/write references to memory.

> I'm very dubious to abusing asm goto in this way. Can we instead use a
> regular asm volatile block, and place the abort handler _within_ the
> asm? If performance is a concern, we can use .pushsection and
> .popsection to move that far away...

Let's dig into what would be needed in order to move the abort into the
asm block.

One approach would be to make that asm block return a nonzero value in
an output register on abort, and zero in that register in the non-abort
case, and then have a conditional in C on that register to decide
whether it needs to branch to the abort. This adds overhead we want
to avoid.

Another alternative would be to perform the entire abort handler in
the same assembly block as the rseq critical section. However, this
prevents us from going back to C to handle the abort, which is unwanted.
For instance, in the use-case of perf counters on aarch64, a good
fallback on abort would be to call the perf system call to read the
value of the performance counter. However, requiring that the abort be
implemented within the rseq assembly block would require that we
re-implement system call invocation in user-space for this, which
is rather annoying.

> 
>> > +
>> > +	if (idx)
>> > +		count += READ_ONCE(pc->offset);
> 
> I'm rather scared that from GCC's PoV, idx was initialized to zero, and
> not modified above (per the asm constraints). I realise that we've used
> an "m" constraint and clobbered memory, but I could well imagine that
> GCC can interpret that as needing to place a read-only copy in memory,
> but still being permitted to use the original value in a register. That
> would permit the above to be optimized away, since GCC knows no
> registers were clobbered, and thus idx must still be zero.

I suspect this is based on a rather conservative interpretation of the
following statement from https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html:

"The "memory" clobber tells the compiler that the assembly code performs memory
reads or writes to items other than those listed in the input and output operands"

Based on the previous sentence, it's tempting to conclude that the content
of "m" input operands is not clobbered by the "memory" clobber.

However, it is followed by this:

"Further, the compiler does not assume that any values read from memory before an
asm remain unchanged after that asm; it reloads them as needed. Using the "memory"
clobber effectively forms a read/write memory barrier for the compiler."

Based on this last sentence, my understanding is that a "memory" clobber would
also clobber the content of any "m" operand.

If use of "m" (var) input-operand-as-output + "memory" clobber ends up being an
issue, we can always fall back to an "r" (&var) input operand + "memory" clobber,
which seems less ambiguous from a documentation standpoint.
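
As a minimal, purely illustrative aarch64 example of the two forms being
compared (an untested sketch, not meant to settle the question):

int store_via_m_operand(void)
{
	int val = 0;

	/* "m" input operand + "memory" clobber; the asm stores through val */
	asm volatile goto("	mov	w8, #1\n"
			  "	str	w8, %[v]\n"
			  : /* asm goto: no output operands allowed */
			  : [v] "m" (val)
			  : "x8", "memory"
			  : out);
out:
	return val;
}

int store_via_r_operand(void)
{
	int val = 0;

	/* "r" (&val) input operand, explicit dereference, "memory" clobber */
	asm volatile goto("	mov	w8, #1\n"
			  "	str	w8, [%[vp]]\n"
			  : /* asm goto: no output operands allowed */
			  : [vp] "r" (&val)
			  : "x8", "memory"
			  : out);
out:
	return val;
}

In both cases the question is the same: after the asm, is the compiler
obliged to reload val from memory rather than reuse a cached zero.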

I'd really like to have an authoritative answer from gcc folks before we start
changing this in all rseq asm for all architectures.

> 
>> > +
>> > +	return count;
> 
> ... and for similar reasons, always return zero here.
> 
>> > +abort:
>> > +        pr_debug("Abort handler\n");
>> > +        exit(-2);
>> > +}
> 
> Given the necessary complexity evident above, I'm also fearful that the
> sort of folk that want userspace counter access aren't going to bother
> with the above.

The abort handler should be implemented in C, simply invoking the perf
system call which lets the kernel perform the perf counter read.
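
For instance, a sketch along these lines (hypothetical helper name, and it
assumes the perf_event_open() fd is plumbed through to the abort path; in
the test above it would sit next to mmap_read_self(), where the needed
headers are already included):

/* Hypothetical fallback: let the kernel read the counter for us. */
static u64 read_counter_fallback(int perf_fd)
{
	u64 value = 0;

	if (read(perf_fd, &value, sizeof(value)) != sizeof(value))
		pr_debug("fallback read() failed\n");

	return value;
}

The abort label would then do "return read_counter_fallback(fd);" instead
of calling exit(-2).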

Thanks,

Mathieu


> 
> Thanks,
> Mark.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters
  2019-06-11 19:33       ` Mathieu Desnoyers
@ 2019-06-13  7:10         ` Raphael Gault
  2019-06-13  8:27           ` Mathieu Desnoyers
  2019-06-13 11:02         ` Raphael Gault
  2019-06-25 13:20         ` Raphael Gault
  2 siblings, 1 reply; 21+ messages in thread
From: Raphael Gault @ 2019-06-13  7:10 UTC (permalink / raw)
  To: Mathieu Desnoyers, Mark Rutland
  Cc: Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Catalin Marinas, Will Deacon

Hi Mathieu,

On 6/11/19 8:33 PM, Mathieu Desnoyers wrote:
> ----- On Jun 11, 2019, at 6:57 PM, Mark Rutland mark.rutland@arm.com wrote:
> 
>> Hi Arnaldo,
>>
>> On Tue, Jun 11, 2019 at 11:33:46AM -0300, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, Jun 11, 2019 at 01:53:11PM +0100, Raphael Gault escreveu:
>>>> Add an extra test to check userspace access to pmu hardware counters.
>>>> This test doesn't rely on the seqlock as a synchronisation mechanism but
>>>> instead uses the restartable sequences to make sure that the thread is
>>>> not interrupted when reading the index of the counter and the associated
>>>> pmu register.
>>>>
>>>> In addition to reading the pmu counters, this test is run several time
>>>> in order to measure the ratio of failures:
>>>> I ran this test on the Juno development platform, which is big.LITTLE
>>>> with 4 Cortex A53 and 2 Cortex A57. The results vary quite a lot
>>>> (running it with 100 tests is not so long and I did it several times).
>>>> I ran it once with 10000 iterations:
>>>> `runs: 10000, abort: 62.53%, zero: 34.93%, success: 2.54%`
>>>>
>>>> Signed-off-by: Raphael Gault <raphael.gault@arm.com>
>>>> ---
>>>>   tools/perf/arch/arm64/include/arch-tests.h    |   5 +-
>>>>   tools/perf/arch/arm64/include/rseq-arm64.h    | 220 ++++++++++++++++++
>>>
>>> So, I applied the first patch in this series, but could you please break
>>> this patch into at least two, one introducing the facility
>>> (include/rseq*) and the second adding the test?
>>>
>>> We try to enforce this kind of granularity as down the line we may want
>>> to revert one part while the other already has other uses and thus
>>> wouldn't allow a straight revert.
>>>
>>> Also, can this go to tools/arch/ instead? Is this really perf specific?
>>> Isn't there any arch/arm64/include files for the kernel that we could
>>> mirror and have it checked for drift in tools/perf/check-headers.sh?
>>
>> The rseq bits aren't strictly perf specific, and I think the existing
>> bits under tools/testing/selftests/rseq/ could be factored out to common
>> locations under tools/include/ and tools/arch/*/include/.
> 
> Hi Mark,
> 
> Thanks for CCing me!
> 
> Or into a stand-alone librseq project:
> 
> https://github.com/compudj/librseq (currently a development branch in
> my own github)
> 
> I don't see why this user-space code should sit in the kernel tree.
> It is not tooling-specific.
> 
>>
>>  From a scan, those already duplicate barriers and other helpers which
>> already have definitions under tools/, which seems unfortunate. :/
>>
>> Comments below are for Raphael and Matthieu.
>>
>> [...]
>>
>>>> +static u64 noinline mmap_read_self(void *addr, int cpu)
>>>> +{
>>>> +	struct perf_event_mmap_page *pc = addr;
>>>> +	u32 idx = 0;
>>>> +	u64 count = 0;
>>>> +
>>>> +	asm volatile goto(
>>>> +                     RSEQ_ASM_DEFINE_TABLE(0, 1f, 2f, 3f)
>>>> +		     "nop\n"
>>>> +                     RSEQ_ASM_STORE_RSEQ_CS(1, 0b, rseq_cs)
>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
>>>> +                     RSEQ_ASM_OP_R_LOAD(pc_idx)
>>>> +                     RSEQ_ASM_OP_R_AND(0xFF)
>>>> +		     RSEQ_ASM_OP_R_STORE(idx)
>>>> +                     RSEQ_ASM_OP_R_SUB(0x1)
>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
>>>> +                     "msr pmselr_el0, " RSEQ_ASM_TMP_REG "\n"
>>>> +                     "isb\n"
>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
> 
> I really don't understand why the cpu_id needs to be compared 3 times
> here (?!?)
> 
> Explicit comparison of the cpu_id within the rseq critical section
> should be done _once_.
> 

I understand, and that's what I thought as well, but I got confused by a
comment in (src)/include/uapi/linux/rseq.h which states:
 > This CPU number value should always be compared
 > against the value of the cpu_id field before performing a rseq
 > commit or returning a value read from a data structure indexed
 > using the cpu_id_start value.

I'll remove the unnecessary comparisons.


> If the kernel happens to preempt and migrate the thread while in the
> critical section, it's the kernel's job to move user-space execution
> to the abort handler.
> 
[...]

Thanks,

-- 
Raphael Gault

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters
  2019-06-13  7:10         ` Raphael Gault
@ 2019-06-13  8:27           ` Mathieu Desnoyers
  0 siblings, 0 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2019-06-13  8:27 UTC (permalink / raw)
  To: Raphael Gault
  Cc: Mark Rutland, Arnaldo Carvalho de Melo, linux-arm-kernel,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Catalin Marinas,
	Will Deacon

----- On Jun 13, 2019, at 9:10 AM, Raphael Gault raphael.gault@arm.com wrote:

> Hi Mathieu,
> 
> On 6/11/19 8:33 PM, Mathieu Desnoyers wrote:
>> ----- On Jun 11, 2019, at 6:57 PM, Mark Rutland mark.rutland@arm.com wrote:
>> 
>>> Hi Arnaldo,
>>>
>>> On Tue, Jun 11, 2019 at 11:33:46AM -0300, Arnaldo Carvalho de Melo wrote:
>>>> Em Tue, Jun 11, 2019 at 01:53:11PM +0100, Raphael Gault escreveu:
>>>>> Add an extra test to check userspace access to pmu hardware counters.
>>>>> This test doesn't rely on the seqlock as a synchronisation mechanism but
>>>>> instead uses the restartable sequences to make sure that the thread is
>>>>> not interrupted when reading the index of the counter and the associated
>>>>> pmu register.
>>>>>
>>>>> In addition to reading the pmu counters, this test is run several time
>>>>> in order to measure the ratio of failures:
>>>>> I ran this test on the Juno development platform, which is big.LITTLE
>>>>> with 4 Cortex A53 and 2 Cortex A57. The results vary quite a lot
>>>>> (running it with 100 tests is not so long and I did it several times).
>>>>> I ran it once with 10000 iterations:
>>>>> `runs: 10000, abort: 62.53%, zero: 34.93%, success: 2.54%`
>>>>>
>>>>> Signed-off-by: Raphael Gault <raphael.gault@arm.com>
>>>>> ---
>>>>>   tools/perf/arch/arm64/include/arch-tests.h    |   5 +-
>>>>>   tools/perf/arch/arm64/include/rseq-arm64.h    | 220 ++++++++++++++++++
>>>>
>>>> So, I applied the first patch in this series, but could you please break
>>>> this patch into at least two, one introducing the facility
>>>> (include/rseq*) and the second adding the test?
>>>>
>>>> We try to enforce this kind of granularity as down the line we may want
>>>> to revert one part while the other already has other uses and thus
>>>> wouldn't allow a straight revert.
>>>>
>>>> Also, can this go to tools/arch/ instead? Is this really perf specific?
>>>> Isn't there any arch/arm64/include files for the kernel that we could
>>>> mirror and have it checked for drift in tools/perf/check-headers.sh?
>>>
>>> The rseq bits aren't strictly perf specific, and I think the existing
>>> bits under tools/testing/selftests/rseq/ could be factored out to common
>>> locations under tools/include/ and tools/arch/*/include/.
>> 
>> Hi Mark,
>> 
>> Thanks for CCing me!
>> 
>> Or into a stand-alone librseq project:
>> 
>> https://github.com/compudj/librseq (currently a development branch in
>> my own github)
>> 
>> I don't see why this user-space code should sit in the kernel tree.
>> It is not tooling-specific.
>> 
>>>
>>>  From a scan, those already duplicate barriers and other helpers which
>>> already have definitions under tools/, which seems unfortunate. :/
>>>
>>> Comments below are for Raphael and Matthieu.
>>>
>>> [...]
>>>
>>>>> +static u64 noinline mmap_read_self(void *addr, int cpu)
>>>>> +{
>>>>> +	struct perf_event_mmap_page *pc = addr;
>>>>> +	u32 idx = 0;
>>>>> +	u64 count = 0;
>>>>> +
>>>>> +	asm volatile goto(
>>>>> +                     RSEQ_ASM_DEFINE_TABLE(0, 1f, 2f, 3f)
>>>>> +		     "nop\n"
>>>>> +                     RSEQ_ASM_STORE_RSEQ_CS(1, 0b, rseq_cs)
>>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
>>>>> +                     RSEQ_ASM_OP_R_LOAD(pc_idx)
>>>>> +                     RSEQ_ASM_OP_R_AND(0xFF)
>>>>> +		     RSEQ_ASM_OP_R_STORE(idx)
>>>>> +                     RSEQ_ASM_OP_R_SUB(0x1)
>>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
>>>>> +                     "msr pmselr_el0, " RSEQ_ASM_TMP_REG "\n"
>>>>> +                     "isb\n"
>>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
>> 
>> I really don't understand why the cpu_id needs to be compared 3 times
>> here (?!?)
>> 
>> Explicit comparison of the cpu_id within the rseq critical section
>> should be done _once_.
>> 
> 
> I understand and that's what I thought as well but I got confused with a
> comment in (src)/include/uapi/linux/rseq.h which states:
> > This CPU number value should always be compared
> > against the value of the cpu_id field before performing a rseq
> > commit or returning a value read from a data structure indexed
> > using the cpu_id_start value.
> 
> I'll remove the unnecessary testing.

It needs to be compared at least once, yes. But once is enough.

Thanks,

Mathieu

> 
> 
>> If the kernel happens to preempt and migrate the thread while in the
>> critical section, it's the kernel's job to move user-space execution
>> to the abort handler.
>> 
> [...]
> 
> Thanks,
> 
> --
> Raphael Gault

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters
  2019-06-11 19:33       ` Mathieu Desnoyers
  2019-06-13  7:10         ` Raphael Gault
@ 2019-06-13 11:02         ` Raphael Gault
  2019-06-25 13:20         ` Raphael Gault
  2 siblings, 0 replies; 21+ messages in thread
From: Raphael Gault @ 2019-06-13 11:02 UTC (permalink / raw)
  To: Mathieu Desnoyers, Mark Rutland
  Cc: Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Catalin Marinas, Will Deacon

Hi Mathieu, Mark,

On 6/11/19 8:33 PM, Mathieu Desnoyers wrote:
> ----- On Jun 11, 2019, at 6:57 PM, Mark Rutland mark.rutland@arm.com wrote:
> 
>> Hi Arnaldo,
>>
>> On Tue, Jun 11, 2019 at 11:33:46AM -0300, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, Jun 11, 2019 at 01:53:11PM +0100, Raphael Gault escreveu:
>>>> Add an extra test to check userspace access to pmu hardware counters.
>>>> This test doesn't rely on the seqlock as a synchronisation mechanism but
>>>> instead uses the restartable sequences to make sure that the thread is
>>>> not interrupted when reading the index of the counter and the associated
>>>> pmu register.
>>>>
>>>> In addition to reading the pmu counters, this test is run several time
>>>> in order to measure the ratio of failures:
>>>> I ran this test on the Juno development platform, which is big.LITTLE
>>>> with 4 Cortex A53 and 2 Cortex A57. The results vary quite a lot
>>>> (running it with 100 tests is not so long and I did it several times).
>>>> I ran it once with 10000 iterations:
>>>> `runs: 10000, abort: 62.53%, zero: 34.93%, success: 2.54%`
>>>>
>>>> Signed-off-by: Raphael Gault <raphael.gault@arm.com>
>>>> ---
>>>>   tools/perf/arch/arm64/include/arch-tests.h    |   5 +-
>>>>   tools/perf/arch/arm64/include/rseq-arm64.h    | 220 ++++++++++++++++++
>>>
>>> So, I applied the first patch in this series, but could you please break
>>> this patch into at least two, one introducing the facility
>>> (include/rseq*) and the second adding the test?
>>>
>>> We try to enforce this kind of granularity as down the line we may want
>>> to revert one part while the other already has other uses and thus
>>> wouldn't allow a straight revert.
>>>
>>> Also, can this go to tools/arch/ instead? Is this really perf specific?
>>> Isn't there any arch/arm64/include files for the kernel that we could
>>> mirror and have it checked for drift in tools/perf/check-headers.sh?
>>
>> The rseq bits aren't strictly perf specific, and I think the existing
>> bits under tools/testing/selftests/rseq/ could be factored out to common
>> locations under tools/include/ and tools/arch/*/include/.
> 
> Hi Mark,
> 
> Thanks for CCing me!
> 
> Or into a stand-alone librseq project:
> 
> https://github.com/compudj/librseq (currently a development branch in
> my own github)
> 
> I don't see why this user-space code should sit in the kernel tree.
> It is not tooling-specific.
> 

I understand your point, but I have to admit that I don't really see how
to make it work together with the test, which requires those definitions.

>>
>>  From a scan, those already duplicate barriers and other helpers which
>> already have definitions under tools/, which seems unfortunate. :/
>>

Also, I realize that this duplicates definitions introduced in the
selftests, but I somewhat simplified the macros I'm using to get rid of
what wasn't useful to me at the moment (mainly the loop labels and
parameter injection in the asm statements).
I understand what both Mark and Arnaldo are saying about moving it out
of perf so that it is not duplicated, but my question is whether that is
a good thing to do as-is, since it is not exactly the same content as
what's in the selftests.

I hope you can understand my concerns and I'd like to hear your opinions 
on that matter.

Thanks,

-- 
Raphael Gault

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [tip:perf/core] perf tests arm64: Compile tests unconditionally
  2019-06-11 12:53 ` [PATCH 1/7] perf: arm64: Compile tests unconditionally Raphael Gault
  2019-06-11 14:09   ` Mark Rutland
@ 2019-06-22  6:34   ` tip-bot for Raphael Gault
  1 sibling, 0 replies; 21+ messages in thread
From: tip-bot for Raphael Gault @ 2019-06-22  6:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mark.rutland, raphael.gault, hpa, acme, tglx, will.deacon,
	linux-kernel, catalin.marinas, mingo, peterz

Commit-ID:  010e3e8fc12b1c13ce19821a11d8930226ebb4b6
Gitweb:     https://git.kernel.org/tip/010e3e8fc12b1c13ce19821a11d8930226ebb4b6
Author:     Raphael Gault <raphael.gault@arm.com>
AuthorDate: Tue, 11 Jun 2019 13:53:09 +0100
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 17 Jun 2019 15:57:16 -0300

perf tests arm64: Compile tests unconditionally

In order to subsequently add more tests for the arm64 architecture we
compile the tests target for arm64 systematically.

Further explanation provided by Mark Rutland:

Given prior questions regarding this commit, it's probably worth
spelling things out more explicitly, e.g.

  Currently we only build the arm64/tests directory if
  CONFIG_DWARF_UNWIND is selected, which is fine as the only test we
  have is arm64/tests/dwarf-unwind.o.

  So that we can add more tests to the test directory, let's
  unconditionally build the directory, but conditionally build
  dwarf-unwind.o depending on CONFIG_DWARF_UNWIND.

  There should be no functional change as a result of this patch.

Signed-off-by: Raphael Gault <raphael.gault@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/20190611125315.18736-2-raphael.gault@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/arch/arm64/Build       | 2 +-
 tools/perf/arch/arm64/tests/Build | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/arch/arm64/Build b/tools/perf/arch/arm64/Build
index 36222e64bbf7..a7dd46a5b678 100644
--- a/tools/perf/arch/arm64/Build
+++ b/tools/perf/arch/arm64/Build
@@ -1,2 +1,2 @@
 perf-y += util/
-perf-$(CONFIG_DWARF_UNWIND) += tests/
+perf-y += tests/
diff --git a/tools/perf/arch/arm64/tests/Build b/tools/perf/arch/arm64/tests/Build
index 41707fea74b3..a61c06bdb757 100644
--- a/tools/perf/arch/arm64/tests/Build
+++ b/tools/perf/arch/arm64/tests/Build
@@ -1,4 +1,4 @@
 perf-y += regs_load.o
-perf-y += dwarf-unwind.o
+perf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o
 
 perf-y += arch-tests.o

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters
  2019-06-11 19:33       ` Mathieu Desnoyers
  2019-06-13  7:10         ` Raphael Gault
  2019-06-13 11:02         ` Raphael Gault
@ 2019-06-25 13:20         ` Raphael Gault
  2 siblings, 0 replies; 21+ messages in thread
From: Raphael Gault @ 2019-06-25 13:20 UTC (permalink / raw)
  To: Mathieu Desnoyers, Mark Rutland
  Cc: Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Catalin Marinas, Will Deacon,
	szabolcs.nagy

Hi Mathieu, Hi Szabolcs,

On 6/11/19 8:33 PM, Mathieu Desnoyers wrote:
> ----- On Jun 11, 2019, at 6:57 PM, Mark Rutland mark.rutland@arm.com wrote:
> 
>> Hi Arnaldo,
>>
>> On Tue, Jun 11, 2019 at 11:33:46AM -0300, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, Jun 11, 2019 at 01:53:11PM +0100, Raphael Gault escreveu:
>>>> Add an extra test to check userspace access to pmu hardware counters.
>>>> This test doesn't rely on the seqlock as a synchronisation mechanism but
>>>> instead uses the restartable sequences to make sure that the thread is
>>>> not interrupted when reading the index of the counter and the associated
>>>> pmu register.
>>>>
>>>> In addition to reading the pmu counters, this test is run several time
>>>> in order to measure the ratio of failures:
>>>> I ran this test on the Juno development platform, which is big.LITTLE
>>>> with 4 Cortex A53 and 2 Cortex A57. The results vary quite a lot
>>>> (running it with 100 tests is not so long and I did it several times).
>>>> I ran it once with 10000 iterations:
>>>> `runs: 10000, abort: 62.53%, zero: 34.93%, success: 2.54%`
>>>>
>>>> Signed-off-by: Raphael Gault <raphael.gault@arm.com>
>>>> ---
>>>>   tools/perf/arch/arm64/include/arch-tests.h    |   5 +-
>>>>   tools/perf/arch/arm64/include/rseq-arm64.h    | 220 ++++++++++++++++++
>>>
>>> So, I applied the first patch in this series, but could you please break
>>> this patch into at least two, one introducing the facility
>>> (include/rseq*) and the second adding the test?
>>>
>>> We try to enforce this kind of granularity as down the line we may want
>>> to revert one part while the other already has other uses and thus
>>> wouldn't allow a straight revert.
>>>
>>> Also, can this go to tools/arch/ instead? Is this really perf specific?
>>> Isn't there any arch/arm64/include files for the kernel that we could
>>> mirror and have it checked for drift in tools/perf/check-headers.sh?
>>
>> The rseq bits aren't strictly perf specific, and I think the existing
>> bits under tools/testing/selftests/rseq/ could be factored out to common
>> locations under tools/include/ and tools/arch/*/include/.
> 
> Hi Mark,
> 
> Thanks for CCing me!
> 
> Or into a stand-alone librseq project:
> 
> https://github.com/compudj/librseq (currently a development branch in
> my own github)
> 
> I don't see why this user-space code should sit in the kernel tree.
> It is not tooling-specific.
> 
>>
>>  From a scan, those already duplicate barriers and other helpers which
>> already have definitions under tools/, which seems unfortunate. :/
>>
>> Comments below are for Raphael and Matthieu.
>>
>> [...]
>>
>>>> +static u64 noinline mmap_read_self(void *addr, int cpu)
>>>> +{
>>>> +	struct perf_event_mmap_page *pc = addr;
>>>> +	u32 idx = 0;
>>>> +	u64 count = 0;
>>>> +
>>>> +	asm volatile goto(
>>>> +                     RSEQ_ASM_DEFINE_TABLE(0, 1f, 2f, 3f)
>>>> +		     "nop\n"
>>>> +                     RSEQ_ASM_STORE_RSEQ_CS(1, 0b, rseq_cs)
>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
>>>> +                     RSEQ_ASM_OP_R_LOAD(pc_idx)
>>>> +                     RSEQ_ASM_OP_R_AND(0xFF)
>>>> +		     RSEQ_ASM_OP_R_STORE(idx)
>>>> +                     RSEQ_ASM_OP_R_SUB(0x1)
>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
>>>> +                     "msr pmselr_el0, " RSEQ_ASM_TMP_REG "\n"
>>>> +                     "isb\n"
>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
> 
> I really don't understand why the cpu_id needs to be compared 3 times
> here (?!?)
> 
> Explicit comparison of the cpu_id within the rseq critical section
> should be done _once_.
> 
> If the kernel happens to preempt and migrate the thread while in the
> critical section, it's the kernel's job to move user-space execution
> to the abort handler.
> 
>>>> +                     "mrs " RSEQ_ASM_TMP_REG ", pmxevcntr_el0\n"
>>>> +                     RSEQ_ASM_OP_R_FINAL_STORE(cnt, 2)
>>>> +		     "nop\n"
>>>> +                     RSEQ_ASM_DEFINE_ABORT(3, abort)
>>>> +                     :/* No output operands */
>>>> +		     :  [cpu_id] "r" (cpu),
>>>> +			[current_cpu_id] "Qo" (__rseq_abi.cpu_id),
>>>> +			[rseq_cs] "m" (__rseq_abi.rseq_cs),
>>>> +			[cnt] "m" (count),
>>>> +			[pc_idx] "r" (&pc->index),
>>>> +			[idx] "m" (idx)
>>>> +                     :"memory"
>>>> +                     :abort
>>>> +                    );
>>
>> While baroque, this doesn't look as scary as I thought it would!
> 
> That's good to hear :)
> 
>>
>> However, I'm very scared that this is modifying input operands without
>> clobbering them. IIUC this is beacause we're trying to use asm goto,
>> which doesn't permit output operands.
> 
> This is correct. What is wrong with modifying the target of "m" input
> operands in an inline asm that has a "memory" clobber ?
> 
> gcc documentation at https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
> states:
> 
> "An asm goto statement cannot have outputs. This is due to an internal
> restriction of the compiler: control transfer instructions cannot have
> outputs. If the assembler code does modify anything, use the "memory"
> clobber to force the optimizers to flush all register values to memory
> and reload them if necessary after the asm statement."
> 
> If there is a problem with this approach, an alternative would be to
> pass &__rseq_abi.rseq.cs as a "r" input operand, explicitly dereference
> it in the assembly, and use the "memory" clobber to ensure the compiler
> knows that there are read/write references to memory.
> 
>> I'm very dubious to abusing asm goto in this way. Can we instead use a
>> regular asm volatile block, and place the abort handler _within_ the
>> asm? If performance is a concern, we can use .pushsection and
>> .popsection to move that far away...
> 
> Let's dig into what would be needed in order to move the abort into the
> asm block.
> 
> One approach would be to make that asm block return a nonzero value in
> an output register, and put zero in that register in the non-abort case,
> and then have a conditional check in C on that register to check
> whether it needs to branch to the abort. This adds overhead we want
> to avoid.
> 
> Another alternative would be to perform the entire abort handler in
> the same assembly block as the rseq critical section. However, this
> prevents us from going back to C to handle the abort, which is unwanted.
> For instance, in the use-case of perf counters on aarch64, a good
> fallback on abort would be to call the perf system call to read the
> value of the performance counter. However, requiring that the abort be
> implemented within the rseq assembly block would require that we
> re-implement system call invocation in user-space for this, which
> is rather annoying.
> 
>>
>>>> +
>>>> +	if (idx)
>>>> +		count += READ_ONCE(pc->offset);
>>
>> I'm rather scared that from GCC's PoV, idx was initialized to zero, and
>> not modified above (per the asm constraints). I realise that we've used
>> an "m" constraint and clobbered memory, but I could well imagine that
>> GCC can interpret that as needing to place a read-only copy in memory,
>> but still being permitted to use the original value in a register. That
>> would permit the above to be optimized away, since GCC knows no
>> registers were clobbered, and thus idx must still be zero.
> 
> I suspect this is based on a rather conservative interpretation of the
> following statement from https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html:
> 
> "The "memory" clobber tells the compiler that the assembly code performs memory
> reads or writes to items other than those listed in the input and output operands"
> 
> Based on the previous sentence, it's tempting to conclude that the "m" input
> operands content is not clobbered by the "memory" clobber.
> 
> however, it is followed by this:
> 
> "Further, the compiler does not assume that any values read from memory before an
> asm remain unchanged after that asm; it reloads them as needed. Using the "memory"
> clobber effectively forms a read/write memory barrier for the compiler."
> 
> Based on this last sentence, my understanding is that a "memory" clobber would
> also clobber the content of any "m" operand.
> 
> If use of "m" (var) input-operand-as-output + "memory" clobber ends up being an
> issue, we can always fall-back to "r" (&var) input operand + "memory" clobber,
> which seems less ambiguous from a documentation standpoint.
> 
> I'd really like to have an authoritative answer from gcc folks before we start
> changing this in all rseq asm for all architectures.
> 

Hi Szabolcs, we would really appreciate your opinion on this matter.

>>
>>>> +
>>>> +	return count;
>>
>> ... and for similar reasons, always return zero here.
>>
>>>> +abort:
>>>> +        pr_debug("Abort handler\n");
>>>> +        exit(-2);
>>>> +}
>>
>> Given the necessary complexity evident above, I'm also fearful that the
>> sort of folk that want userspace counter access aren't going to bother
>> with the above.
> 
> The abort handler should be implemented in C, simply invoking the perf
> system call which lets the kernel perform the perf counter read.
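> 
> A sketch of such a C abort handler (assuming perf_fd is the fd returned
> by perf_event_open() for the event, opened with the default read_format):
> 
> 	static uint64_t fallback_read(int perf_fd)
> 	{
> 		uint64_t count = 0;
> 
> 		/* read() on a perf event fd returns the 64-bit count */
> 		if (read(perf_fd, &count, sizeof(count)) != sizeof(count))
> 			return 0;	/* error handling illustrative only */
> 
> 		return count;
> 	}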


Thanks,

-- 
Raphael Gault

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters
  2019-06-11 16:57     ` Mark Rutland
  2019-06-11 19:33       ` Mathieu Desnoyers
@ 2019-06-25 13:29       ` Raphael Gault
  2019-06-25 13:41         ` Will Deacon
  1 sibling, 1 reply; 21+ messages in thread
From: Raphael Gault @ 2019-06-25 13:29 UTC (permalink / raw)
  To: Mark Rutland, will.deacon
  Cc: Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel, mingo,
	peterz, catalin.marinas, mathieu.desnoyers

Hi Mark, Hi Will,

Now that we have a better idea of what enabling this feature for 
heterogeneous systems would look like (both with and without using 
rseq), it might be worth discussing whether this is in fact desirable 
in terms of the performance-complexity trade-off.

Indeed, while not as scary as first thought, the rseq method might 
still dissuade users from using this feature. It is also worth noting 
that if we only enable this feature on homogeneous systems, the `mrs` 
hook/emulation would not be necessary.
Because of the complexity of the setup, we need to consider whether we 
want to upstream this and then have to maintain it afterwards.
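
For reference, the direct read this series exposes boils down to an mrs 
on the counter register from userspace; a minimal sketch for the cycle 
counter (illustration only, not the test code from this series):

	static inline uint64_t read_pmccntr(void)
	{
		uint64_t val;

		/* only meaningful while the task stays on a CPU whose PMU
		 * the event was programmed on, hence the rseq discussion */
		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
		return val;
	}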

Thanks,

-- 
Raphael Gault

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters
  2019-06-25 13:29       ` Raphael Gault
@ 2019-06-25 13:41         ` Will Deacon
  0 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2019-06-25 13:41 UTC (permalink / raw)
  To: Raphael Gault
  Cc: Mark Rutland, will.deacon, peterz, catalin.marinas,
	Arnaldo Carvalho de Melo, linux-kernel, mingo, mathieu.desnoyers,
	linux-arm-kernel

Hi Raphael,

On Tue, Jun 25, 2019 at 02:29:56PM +0100, Raphael Gault wrote:
> Now that we have a better idea of what enabling this feature for
> heterogeneous systems would look like (both with and without using rseq),
> it might be worth discussing whether this is in fact desirable in terms of
> the performance-complexity trade-off.

Agreed; it's one of those things that I think is *definitely* worth
prototyping, but it's not obviously something we should commit to at this
stage.

> Indeed, while not as scary as first thought, the rseq method might still
> dissuade users from using this feature. It is also worth noting that if
> we only enable this feature on homogeneous systems, the `mrs`
> hook/emulation would not be necessary.
> Because of the complexity of the setup, we need to consider whether we
> want to upstream this and then have to maintain it afterwards.

Given that we don't currently support userspace access to the counters at
all, I think upstreaming support only for homogeneous systems makes sense
initially as long as (a) we fail gracefully on heterogeneous systems and (b)
we don't close the door to using the rseq-based mechanism in future if we
choose to (i.e. it would be nice if this was an extension to the ABI rather
than a break). That also gives us a chance to assess the wider adoption of
rseq before throwing our weight behind it.
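
As a sketch, "failing gracefully" could surface to userspace through the
existing capability bit in the perf mmap page (illustrative, not part of
this series):

	#include <linux/perf_event.h>

	static int direct_read_supported(struct perf_event_mmap_page *pc)
	{
		/* the kernel only sets cap_user_rdpmc when direct counter
		 * access is supported and enabled for this event; callers
		 * fall back to read() on the event fd otherwise */
		return pc->cap_user_rdpmc;
	}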

Sound reasonable?

Will

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-06-25 13:41 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-11 12:53 [PATCH 0/7] arm64: Enable access to pmu registers by user-space Raphael Gault
2019-06-11 12:53 ` [PATCH 1/7] perf: arm64: Compile tests unconditionally Raphael Gault
2019-06-11 14:09   ` Mark Rutland
2019-06-11 14:23     ` Arnaldo Carvalho de Melo
2019-06-11 17:01       ` Mark Rutland
2019-06-22  6:34   ` [tip:perf/core] perf tests " tip-bot for Raphael Gault
2019-06-11 12:53 ` [PATCH 2/7] perf: arm64: Add test to check userspace access to hardware counters Raphael Gault
2019-06-11 12:53 ` [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters Raphael Gault
2019-06-11 14:33   ` Arnaldo Carvalho de Melo
2019-06-11 16:57     ` Mark Rutland
2019-06-11 19:33       ` Mathieu Desnoyers
2019-06-13  7:10         ` Raphael Gault
2019-06-13  8:27           ` Mathieu Desnoyers
2019-06-13 11:02         ` Raphael Gault
2019-06-25 13:20         ` Raphael Gault
2019-06-25 13:29       ` Raphael Gault
2019-06-25 13:41         ` Will Deacon
2019-06-11 12:53 ` [PATCH 4/7] arm64: pmu: Add function implementation to update event index in userpage Raphael Gault
2019-06-11 12:53 ` [PATCH 5/7] arm64: pmu: Add hook to handle pmu-related undefined instructions Raphael Gault
2019-06-11 12:53 ` [PATCH 6/7] arm64: perf: Enable pmu counter direct access for perf event on armv8 Raphael Gault
2019-06-11 12:53 ` [PATCH 7/7] Documentation: arm64: Document PMU counters access from userspace Raphael Gault

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).