linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/9] libperf and arm64 userspace counter access support
@ 2020-10-01 14:01 Rob Herring
  2020-10-01 14:01 ` [PATCH v4 1/9] arm64: pmu: Add function implementation to update event index in userpage Rob Herring
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Rob Herring @ 2020-10-01 14:01 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: linux-arm-kernel, linux-kernel, Alexander Shishkin, Namhyung Kim,
	Raphael Gault, Mark Rutland, Jonathan Cameron, Ian Rogers,
	honnappa.nagarahalli, Itaru Kitayama

This is resurrecting Raphael's series[1] to enable userspace counter
access on arm64. My previous versions are here[2][3][4]. A git branch is
here[5].

Changes in v4:
 - Dropped 'arm64: pmu: Add hook to handle pmu-related undefined instructions'.
   The onus is on userspace to pin itself to a homogeneous subset of CPUs
   and avoid any aborts on heterogeneous systems, so the hook is not needed.
 - Make perf_evsel__mmap() take pages rather than bytes for size
 - Fix building arm64 heterogeneous test.

Changes in v3:
 - Dropped removing x86 rdpmc test until libperf tests can run via 'perf test'
 - Added verbose prints for tests
 - Split adding perf_evsel__mmap() to separate patch


The following changes to the arm64 support have been made compared to
Raphael's last version:

The major change is support for heterogeneous systems with some
restrictions. Specifically, userspace must pin itself to like CPUs, open
a specific PMU by type, and use h/w specific events. The tests have been
reworked to demonstrate this.

Chained events are not supported. The problem with supporting chained
events was there's no way to distinguish between a chained event and a
native 64-bit counter. We could add some flag, but do self monitoring
processes really need that? Native 64-bit counters are supported if the
PMU h/w has support. As there's already an explicit ABI to request 64-bit
counters, userspace can request 64-bit counters and if user
access is not enabled, then it must retry with 32-bit counters.

Prior versions broke the build on arm32 (surprisingly never caught by
0-day). As a result, event_mapped and event_unmapped implementations have
been moved into the arm64 code.

There was a bug in that pmc_width was not set in the user page. The tests
now check for this.

The documentation has been converted to rST. I've added sections on
chained events and heterogeneous.

The tests have been expanded to test the cycle counter access.

Rob

[1] https://lore.kernel.org/linux-arm-kernel/20190822144220.27860-1-raphael.gault@arm.com/
[2] https://lore.kernel.org/linux-arm-kernel/20200707205333.624938-1-robh@kernel.org/
[3] https://lore.kernel.org/linux-arm-kernel/20200828205614.3391252-1-robh@kernel.org/
[4] https://lore.kernel.org/linux-arm-kernel/20200911215118.2887710-1-robh@kernel.org/
[5] git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git user-perf-event-v4

Raphael Gault (3):
  arm64: pmu: Add function implementation to update event index in
    userpage
  arm64: perf: Enable pmu counter direct access for perf event on armv8
  Documentation: arm64: Document PMU counters access from userspace

Rob Herring (6):
  tools/include: Add an initial math64.h
  libperf: Add libperf_evsel__mmap()
  libperf: tests: Add support for verbose printing
  libperf: Add support for user space counter access
  libperf: Add arm64 support to perf_mmap__read_self()
  perf: arm64: Add test for userspace counter access on heterogeneous
    systems

 Documentation/arm64/index.rst                 |   1 +
 .../arm64/perf_counter_user_access.rst        |  56 ++++++
 arch/arm64/include/asm/mmu.h                  |   5 +
 arch/arm64/include/asm/mmu_context.h          |   2 +
 arch/arm64/include/asm/perf_event.h           |  14 ++
 arch/arm64/kernel/perf_event.c                |  62 ++++++
 include/linux/perf/arm_pmu.h                  |   2 +
 tools/include/linux/math64.h                  |  75 +++++++
 tools/lib/perf/Documentation/libperf.txt      |   1 +
 tools/lib/perf/evsel.c                        |  34 ++++
 tools/lib/perf/include/internal/evsel.h       |   2 +
 tools/lib/perf/include/internal/mmap.h        |   3 +
 tools/lib/perf/include/internal/tests.h       |  32 +++
 tools/lib/perf/include/perf/evsel.h           |   2 +
 tools/lib/perf/libperf.map                    |   1 +
 tools/lib/perf/mmap.c                         | 188 ++++++++++++++++++
 tools/lib/perf/tests/Makefile                 |   4 +-
 tools/lib/perf/tests/test-evsel.c             |  63 ++++++
 tools/perf/arch/arm64/include/arch-tests.h    |   7 +
 tools/perf/arch/arm64/tests/Build             |   1 +
 tools/perf/arch/arm64/tests/arch-tests.c      |   4 +
 tools/perf/arch/arm64/tests/user-events.c     | 170 ++++++++++++++++
 22 files changed, 728 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/arm64/perf_counter_user_access.rst
 create mode 100644 tools/include/linux/math64.h
 create mode 100644 tools/perf/arch/arm64/tests/user-events.c

--
2.25.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 1/9] arm64: pmu: Add function implementation to update event index in userpage
  2020-10-01 14:01 [PATCH v4 0/9] libperf and arm64 userspace counter access support Rob Herring
@ 2020-10-01 14:01 ` Rob Herring
  2020-10-01 14:01 ` [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8 Rob Herring
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Rob Herring @ 2020-10-01 14:01 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: linux-arm-kernel, linux-kernel, Alexander Shishkin, Namhyung Kim,
	Raphael Gault, Mark Rutland, Jonathan Cameron, Ian Rogers,
	honnappa.nagarahalli, Itaru Kitayama

From: Raphael Gault <raphael.gault@arm.com>

In order to be able to access the counter directly for userspace,
we need to provide the index of the counter using the userpage.
We thus need to override the event_idx function to retrieve and
convert the perf_event index to armv8 hardware index.

Since the arm_pmu driver can be used by any implementation, even
if not armv8, two components play a role into making sure the
behaviour is correct and consistent with the PMU capabilities:

* the ARMPMU_EL0_RD_CNTR flag which denotes the capability to access
counter from userspace.
* the event_idx call back, which is implemented and initialized by
the PMU implementation: if no callback is provided, the default
behaviour applies, returning 0 as index value.

Signed-off-by: Raphael Gault <raphael.gault@arm.com>
Signed-off-by: Rob Herring <robh@kernel.org>
---
 arch/arm64/kernel/perf_event.c | 21 +++++++++++++++++++++
 include/linux/perf/arm_pmu.h   |  2 ++
 2 files changed, 23 insertions(+)

diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 462f9a9cc44b..e14f360a7883 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -818,6 +818,22 @@ static void armv8pmu_clear_event_idx(struct pmu_hw_events *cpuc,
 		clear_bit(idx - 1, cpuc->used_mask);
 }
 
+static int armv8pmu_access_event_idx(struct perf_event *event)
+{
+	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
+		return 0;
+
+	/*
+	 * We remap the cycle counter index to 32 to
+	 * match the offset applied to the rest of
+	 * the counter indices.
+	 */
+	if (event->hw.idx == ARMV8_IDX_CYCLE_COUNTER)
+		return 32;
+
+	return event->hw.idx;
+}
+
 /*
  * Add an event filter to a given event.
  */
@@ -914,6 +930,9 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
 	if (armv8pmu_event_is_64bit(event))
 		event->hw.flags |= ARMPMU_EVT_64BIT;
 
+	if (!armv8pmu_event_is_chained(event))
+		event->hw.flags |= ARMPMU_EL0_RD_CNTR;
+
 	/* Only expose micro/arch events supported by this PMU */
 	if ((hw_event_id > 0) && (hw_event_id < ARMV8_PMUV3_MAX_COMMON_EVENTS)
 	    && test_bit(hw_event_id, armpmu->pmceid_bitmap)) {
@@ -1038,6 +1057,8 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
 	cpu_pmu->set_event_filter	= armv8pmu_set_event_filter;
 	cpu_pmu->filter_match		= armv8pmu_filter_match;
 
+	cpu_pmu->pmu.event_idx		= armv8pmu_access_event_idx;
+
 	cpu_pmu->name			= name;
 	cpu_pmu->map_event		= map_event;
 	cpu_pmu->attr_groups[ARMPMU_ATTR_GROUP_EVENTS] = events ?
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 5b616dde9a4c..74fbbbd29dc7 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -26,6 +26,8 @@
  */
 /* Event uses a 64bit counter */
 #define ARMPMU_EVT_64BIT		1
+/* Allow access to hardware counter from userspace */
+#define ARMPMU_EL0_RD_CNTR		2
 
 #define HW_OP_UNSUPPORTED		0xFFFF
 #define C(_x)				PERF_COUNT_HW_CACHE_##_x
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8
  2020-10-01 14:01 [PATCH v4 0/9] libperf and arm64 userspace counter access support Rob Herring
  2020-10-01 14:01 ` [PATCH v4 1/9] arm64: pmu: Add function implementation to update event index in userpage Rob Herring
@ 2020-10-01 14:01 ` Rob Herring
  2020-11-13 18:06   ` Mark Rutland
  2020-10-01 14:01 ` [PATCH v4 3/9] tools/include: Add an initial math64.h Rob Herring
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Rob Herring @ 2020-10-01 14:01 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: linux-arm-kernel, linux-kernel, Alexander Shishkin, Namhyung Kim,
	Raphael Gault, Mark Rutland, Jonathan Cameron, Ian Rogers,
	honnappa.nagarahalli, Itaru Kitayama

From: Raphael Gault <raphael.gault@arm.com>

Keep track of event opened with direct access to the hardware counters
and modify permissions while they are open.

The strategy used here is the same which x86 uses: everytime an event
is mapped, the permissions are set if required. The atomic field added
in the mm_context helps keep track of the different event opened and
de-activate the permissions when all are unmapped.
We also need to update the permissions in the context switch code so
that tasks keep the right permissions.

Signed-off-by: Raphael Gault <raphael.gault@arm.com>
Signed-off-by: Rob Herring <robh@kernel.org>
---
v2:
 - Move mapped/unmapped into arm64 code. Fixes arm32.
 - Rebase on cap_user_time_short changes

Changes from Raphael's v4:
  - Drop homogeneous check
  - Disable access for chained counters
  - Set pmc_width in user page
---
 arch/arm64/include/asm/mmu.h         |  5 ++++
 arch/arm64/include/asm/mmu_context.h |  2 ++
 arch/arm64/include/asm/perf_event.h  | 14 ++++++++++
 arch/arm64/kernel/perf_event.c       | 41 ++++++++++++++++++++++++++++
 4 files changed, 62 insertions(+)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index a7a5ecaa2e83..52cfdb676f06 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -19,6 +19,11 @@
 
 typedef struct {
 	atomic64_t	id;
+	/*
+	 * non-zero if userspace have access to hardware
+	 * counters directly.
+	 */
+	atomic_t	pmu_direct_access;
 #ifdef CONFIG_COMPAT
 	void		*sigpage;
 #endif
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index f2d7537d6f83..d24589ecb07a 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -21,6 +21,7 @@
 #include <asm/proc-fns.h>
 #include <asm-generic/mm_hooks.h>
 #include <asm/cputype.h>
+#include <asm/perf_event.h>
 #include <asm/sysreg.h>
 #include <asm/tlbflush.h>
 
@@ -224,6 +225,7 @@ static inline void __switch_mm(struct mm_struct *next)
 	}
 
 	check_and_switch_context(next);
+	perf_switch_user_access(next);
 }
 
 static inline void
diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
index 2c2d7dbe8a02..a025d9595d51 100644
--- a/arch/arm64/include/asm/perf_event.h
+++ b/arch/arm64/include/asm/perf_event.h
@@ -8,6 +8,7 @@
 
 #include <asm/stack_pointer.h>
 #include <asm/ptrace.h>
+#include <linux/mm_types.h>
 
 #define	ARMV8_PMU_MAX_COUNTERS	32
 #define	ARMV8_PMU_COUNTER_MASK	(ARMV8_PMU_MAX_COUNTERS - 1)
@@ -251,4 +252,17 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
 	(regs)->pstate = PSR_MODE_EL1h;	\
 }
 
+static inline void perf_switch_user_access(struct mm_struct *mm)
+{
+	if (!IS_ENABLED(CONFIG_PERF_EVENTS))
+		return;
+
+	if (atomic_read(&mm->context.pmu_direct_access)) {
+		write_sysreg(ARMV8_PMU_USERENR_ER|ARMV8_PMU_USERENR_CR,
+			     pmuserenr_el0);
+	} else {
+		write_sysreg(0, pmuserenr_el0);
+	}
+}
+
 #endif
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index e14f360a7883..8344dc030823 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -834,6 +834,41 @@ static int armv8pmu_access_event_idx(struct perf_event *event)
 	return event->hw.idx;
 }
 
+static void refresh_pmuserenr(void *mm)
+{
+	perf_switch_user_access(mm);
+}
+
+static void armv8pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
+{
+	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
+		return;
+
+	/*
+	 * This function relies on not being called concurrently in two
+	 * tasks in the same mm.  Otherwise one task could observe
+	 * pmu_direct_access > 1 and return all the way back to
+	 * userspace with user access disabled while another task is still
+	 * doing on_each_cpu_mask() to enable user access.
+	 *
+	 * For now, this can't happen because all callers hold mmap_lock
+	 * for write.  If this changes, we'll need a different solution.
+	 */
+	lockdep_assert_held_write(&mm->mmap_lock);
+
+	if (atomic_inc_return(&mm->context.pmu_direct_access) == 1)
+		on_each_cpu(refresh_pmuserenr, mm, 1);
+}
+
+static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
+{
+	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
+		return;
+
+	if (atomic_dec_and_test(&mm->context.pmu_direct_access))
+		on_each_cpu_mask(mm_cpumask(mm), refresh_pmuserenr, NULL, 1);
+}
+
 /*
  * Add an event filter to a given event.
  */
@@ -1058,6 +1093,8 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
 	cpu_pmu->filter_match		= armv8pmu_filter_match;
 
 	cpu_pmu->pmu.event_idx		= armv8pmu_access_event_idx;
+	cpu_pmu->pmu.event_mapped	= armv8pmu_event_mapped;
+	cpu_pmu->pmu.event_unmapped	= armv8pmu_event_unmapped;
 
 	cpu_pmu->name			= name;
 	cpu_pmu->map_event		= map_event;
@@ -1218,6 +1255,10 @@ void arch_perf_update_userpage(struct perf_event *event,
 	userpg->cap_user_time = 0;
 	userpg->cap_user_time_zero = 0;
 	userpg->cap_user_time_short = 0;
+	userpg->cap_user_rdpmc = !!(event->hw.flags & ARMPMU_EL0_RD_CNTR);
+
+	if (userpg->cap_user_rdpmc)
+		userpg->pmc_width = armv8pmu_event_is_64bit(event) ? 64 : 32;
 
 	do {
 		rd = sched_clock_read_begin(&seq);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 3/9] tools/include: Add an initial math64.h
  2020-10-01 14:01 [PATCH v4 0/9] libperf and arm64 userspace counter access support Rob Herring
  2020-10-01 14:01 ` [PATCH v4 1/9] arm64: pmu: Add function implementation to update event index in userpage Rob Herring
  2020-10-01 14:01 ` [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8 Rob Herring
@ 2020-10-01 14:01 ` Rob Herring
  2020-10-01 14:01 ` [PATCH v4 4/9] libperf: Add libperf_evsel__mmap() Rob Herring
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Rob Herring @ 2020-10-01 14:01 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: linux-arm-kernel, linux-kernel, Alexander Shishkin, Namhyung Kim,
	Raphael Gault, Mark Rutland, Jonathan Cameron, Ian Rogers,
	honnappa.nagarahalli, Itaru Kitayama

Add an initial math64.h similar to linux/math64.h with functions
mul_u64_u64_div64() and mul_u64_u32_shr(). This isn't a direct copy of
include/linux/math64.h as that doesn't define mul_u64_u64_div64().

Implementation was written by Peter Zilkstra based on linux/math64.h
and div64.h[1]. The original implementation was not optimal on arm64 as
__int128 division is not optimal with a call out to __udivti3, so I
dropped the __int128 variant of mul_u64_u64_div64().

[1] https://lore.kernel.org/lkml/20200322101848.GF2452@worktop.programming.kicks-ass.net/

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rob Herring <robh@kernel.org>
---
 tools/include/linux/math64.h | 75 ++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)
 create mode 100644 tools/include/linux/math64.h

diff --git a/tools/include/linux/math64.h b/tools/include/linux/math64.h
new file mode 100644
index 000000000000..4ad45d5943dc
--- /dev/null
+++ b/tools/include/linux/math64.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MATH64_H
+#define _LINUX_MATH64_H
+
+#include <linux/types.h>
+
+#ifdef __x86_64__
+static inline u64 mul_u64_u64_div64(u64 a, u64 b, u64 c)
+{
+	u64 q;
+
+	asm ("mulq %2; divq %3" : "=a" (q)
+				: "a" (a), "rm" (b), "rm" (c)
+				: "rdx");
+
+	return q;
+}
+#define mul_u64_u64_div64 mul_u64_u64_div64
+#endif
+
+#ifdef __SIZEOF_INT128__
+static inline u64 mul_u64_u32_shr(u64 a, u32 b, unsigned int shift)
+{
+	return (u64)(((unsigned __int128)a * b) >> shift);
+}
+
+#else
+
+#ifdef __i386__
+static inline u64 mul_u32_u32(u32 a, u32 b)
+{
+	u32 high, low;
+
+	asm ("mull %[b]" : "=a" (low), "=d" (high)
+			 : [a] "a" (a), [b] "rm" (b) );
+
+	return low | ((u64)high) << 32;
+}
+#else
+static inline u64 mul_u32_u32(u32 a, u32 b)
+{
+	return (u64)a * b;
+}
+#endif
+
+static inline u64 mul_u64_u32_shr(u64 a, u32 b, unsigned int shift)
+{
+	u32 ah, al;
+	u64 ret;
+
+	al = a;
+	ah = a >> 32;
+
+	ret = mul_u32_u32(al, b) >> shift;
+	if (ah)
+		ret += mul_u32_u32(ah, b) << (32 - shift);
+
+	return ret;
+}
+
+#endif	/* __SIZEOF_INT128__ */
+
+#ifndef mul_u64_u64_div64
+static inline u64 mul_u64_u64_div64(u64 a, u64 b, u64 c)
+{
+	u64 quot, rem;
+
+	quot = a / c;
+	rem = a % c;
+
+	return quot * b + (rem * b) / c;
+}
+#endif
+
+#endif /* _LINUX_MATH64_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-10-01 14:01 [PATCH v4 0/9] libperf and arm64 userspace counter access support Rob Herring
                   ` (2 preceding siblings ...)
  2020-10-01 14:01 ` [PATCH v4 3/9] tools/include: Add an initial math64.h Rob Herring
@ 2020-10-01 14:01 ` Rob Herring
  2020-10-14 11:05   ` Jiri Olsa
  2020-10-01 14:01 ` [PATCH v4 5/9] libperf: tests: Add support for verbose printing Rob Herring
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Rob Herring @ 2020-10-01 14:01 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: linux-arm-kernel, linux-kernel, Alexander Shishkin, Namhyung Kim,
	Raphael Gault, Mark Rutland, Jonathan Cameron, Ian Rogers,
	honnappa.nagarahalli, Itaru Kitayama

In order to support usersapce access, an event must be mmapped. While
there's already mmap support for evlist, the usecase is a bit different
than the self monitoring with userspace access. So let's add a new
perf_evsel__mmap() function to mmap an evsel. This allows implementing
userspace access as a fastpath for perf_evsel__read().

The mmapped address is returned by perf_evsel__mmap() primarily for
users/tests to check if userspace access is enabled.

Signed-off-by: Rob Herring <robh@kernel.org>
---
v4:
 - Change perf_evsel__mmap size to pages instead of bytes
v3:
 - New patch split out from user access patch

mmap fix
---
 tools/lib/perf/Documentation/libperf.txt |  1 +
 tools/lib/perf/evsel.c                   | 31 ++++++++++++++++++++++++
 tools/lib/perf/include/internal/evsel.h  |  2 ++
 tools/lib/perf/include/perf/evsel.h      |  2 ++
 tools/lib/perf/libperf.map               |  1 +
 5 files changed, 37 insertions(+)

diff --git a/tools/lib/perf/Documentation/libperf.txt b/tools/lib/perf/Documentation/libperf.txt
index 0c74c30ed23a..d2a541e7a7ec 100644
--- a/tools/lib/perf/Documentation/libperf.txt
+++ b/tools/lib/perf/Documentation/libperf.txt
@@ -136,6 +136,7 @@ SYNOPSIS
                        struct perf_thread_map *threads);
   void perf_evsel__close(struct perf_evsel *evsel);
   void perf_evsel__close_cpu(struct perf_evsel *evsel, int cpu);
+  void *perf_evsel__mmap(struct perf_evsel *evsel, int pages);
   int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
                        struct perf_counts_values *count);
   int perf_evsel__enable(struct perf_evsel *evsel);
diff --git a/tools/lib/perf/evsel.c b/tools/lib/perf/evsel.c
index 4dc06289f4c7..42eaf3c18981 100644
--- a/tools/lib/perf/evsel.c
+++ b/tools/lib/perf/evsel.c
@@ -11,10 +11,12 @@
 #include <stdlib.h>
 #include <internal/xyarray.h>
 #include <internal/cpumap.h>
+#include <internal/mmap.h>
 #include <internal/threadmap.h>
 #include <internal/lib.h>
 #include <linux/string.h>
 #include <sys/ioctl.h>
+#include <sys/mman.h>
 
 void perf_evsel__init(struct perf_evsel *evsel, struct perf_event_attr *attr)
 {
@@ -156,6 +158,35 @@ void perf_evsel__close_cpu(struct perf_evsel *evsel, int cpu)
 	perf_evsel__close_fd_cpu(evsel, cpu);
 }
 
+void *perf_evsel__mmap(struct perf_evsel *evsel, int pages)
+{
+	int ret;
+	struct perf_mmap *map;
+	struct perf_mmap_param mp = {
+		.prot = PROT_READ | PROT_WRITE,
+	};
+
+	if (FD(evsel, 0, 0) < 0)
+		return NULL;
+
+	mp.mask = (pages * page_size) - 1;
+
+	map = zalloc(sizeof(*map));
+	if (!map)
+		return NULL;
+
+	perf_mmap__init(map, NULL, false, NULL);
+
+	ret = perf_mmap__mmap(map, &mp, FD(evsel, 0, 0), 0);
+	if (ret) {
+		free(map);
+		return NULL;
+	}
+
+	evsel->mmap = map;
+	return map->base;
+}
+
 int perf_evsel__read_size(struct perf_evsel *evsel)
 {
 	u64 read_format = evsel->attr.read_format;
diff --git a/tools/lib/perf/include/internal/evsel.h b/tools/lib/perf/include/internal/evsel.h
index 1ffd083b235e..a7985dbb68ff 100644
--- a/tools/lib/perf/include/internal/evsel.h
+++ b/tools/lib/perf/include/internal/evsel.h
@@ -9,6 +9,7 @@
 
 struct perf_cpu_map;
 struct perf_thread_map;
+struct perf_mmap;
 struct xyarray;
 
 /*
@@ -40,6 +41,7 @@ struct perf_evsel {
 	struct perf_cpu_map	*cpus;
 	struct perf_cpu_map	*own_cpus;
 	struct perf_thread_map	*threads;
+	struct perf_mmap	*mmap;
 	struct xyarray		*fd;
 	struct xyarray		*sample_id;
 	u64			*id;
diff --git a/tools/lib/perf/include/perf/evsel.h b/tools/lib/perf/include/perf/evsel.h
index c82ec39a4ad0..1b1534439334 100644
--- a/tools/lib/perf/include/perf/evsel.h
+++ b/tools/lib/perf/include/perf/evsel.h
@@ -3,6 +3,7 @@
 #define __LIBPERF_EVSEL_H
 
 #include <stdint.h>
+#include <stddef.h>
 #include <perf/core.h>
 
 struct perf_evsel;
@@ -27,6 +28,7 @@ LIBPERF_API int perf_evsel__open(struct perf_evsel *evsel, struct perf_cpu_map *
 				 struct perf_thread_map *threads);
 LIBPERF_API void perf_evsel__close(struct perf_evsel *evsel);
 LIBPERF_API void perf_evsel__close_cpu(struct perf_evsel *evsel, int cpu);
+LIBPERF_API void *perf_evsel__mmap(struct perf_evsel *evsel, int pages);
 LIBPERF_API int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 				 struct perf_counts_values *count);
 LIBPERF_API int perf_evsel__enable(struct perf_evsel *evsel);
diff --git a/tools/lib/perf/libperf.map b/tools/lib/perf/libperf.map
index 7be1af8a546c..733a0647be8b 100644
--- a/tools/lib/perf/libperf.map
+++ b/tools/lib/perf/libperf.map
@@ -23,6 +23,7 @@ LIBPERF_0.0.1 {
 		perf_evsel__disable;
 		perf_evsel__open;
 		perf_evsel__close;
+		perf_evsel__mmap;
 		perf_evsel__read;
 		perf_evsel__cpus;
 		perf_evsel__threads;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 5/9] libperf: tests: Add support for verbose printing
  2020-10-01 14:01 [PATCH v4 0/9] libperf and arm64 userspace counter access support Rob Herring
                   ` (3 preceding siblings ...)
  2020-10-01 14:01 ` [PATCH v4 4/9] libperf: Add libperf_evsel__mmap() Rob Herring
@ 2020-10-01 14:01 ` Rob Herring
  2020-10-01 14:01 ` [PATCH v4 6/9] libperf: Add support for user space counter access Rob Herring
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Rob Herring @ 2020-10-01 14:01 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: linux-arm-kernel, linux-kernel, Alexander Shishkin, Namhyung Kim,
	Raphael Gault, Mark Rutland, Jonathan Cameron, Ian Rogers,
	honnappa.nagarahalli, Itaru Kitayama

Add __T_VERBOSE() so tests can add verbose output. The verbose output is
enabled with the '-v' command line option.

Signed-off-by: Rob Herring <robh@kernel.org>
---
v3:
 - New patch
---
 tools/lib/perf/include/internal/tests.h | 32 +++++++++++++++++++++++++
 tools/lib/perf/tests/Makefile           |  4 +++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/tools/lib/perf/include/internal/tests.h b/tools/lib/perf/include/internal/tests.h
index 2093e8868a67..27b6e64299e2 100644
--- a/tools/lib/perf/include/internal/tests.h
+++ b/tools/lib/perf/include/internal/tests.h
@@ -3,11 +3,32 @@
 #define __LIBPERF_INTERNAL_TESTS_H
 
 #include <stdio.h>
+#include <unistd.h>
 
 int tests_failed;
+int tests_verbose;
+
+static inline int get_verbose(char **argv, int argc)
+{
+	char c;
+	int verbose = 0;
+
+	while ((c = getopt(argc, argv, "v")) != -1) {
+		switch (c)
+		{
+		case 'v':
+			verbose = 1;
+			break;
+		default:
+			break;
+		}
+	}
+	return verbose;
+}
 
 #define __T_START					\
 do {							\
+	tests_verbose = get_verbose(argv, argc);	\
 	fprintf(stdout, "- running %s...", __FILE__);	\
 	fflush(NULL);					\
 	tests_failed = 0;				\
@@ -30,4 +51,15 @@ do {
 	}                                                                        \
 } while (0)
 
+#define __T_VERBOSE(...)						\
+do {									\
+	if (tests_verbose) {						\
+		if (tests_verbose == 1) {				\
+			fputc('\n', stderr);				\
+			tests_verbose++;				\
+		}							\
+		fprintf(stderr, ##__VA_ARGS__);				\
+	}								\
+} while (0)
+
 #endif /* __LIBPERF_INTERNAL_TESTS_H */
diff --git a/tools/lib/perf/tests/Makefile b/tools/lib/perf/tests/Makefile
index 96841775feaf..9438b385d489 100644
--- a/tools/lib/perf/tests/Makefile
+++ b/tools/lib/perf/tests/Makefile
@@ -5,6 +5,8 @@ TESTS = test-cpumap test-threadmap test-evlist test-evsel
 TESTS_SO := $(addsuffix -so,$(TESTS))
 TESTS_A  := $(addsuffix -a,$(TESTS))
 
+TEST_ARGS := $(if $(V),-v)
+
 # Set compile option CFLAGS
 ifdef EXTRA_CFLAGS
   CFLAGS := $(EXTRA_CFLAGS)
@@ -30,7 +32,7 @@ run:
 	@echo "running static:"
 	@for i in $(TESTS_A); do ./$$i; done
 	@echo "running dynamic:"
-	@for i in $(TESTS_SO); do LD_LIBRARY_PATH=../ ./$$i; done
+	@for i in $(TESTS_SO); do LD_LIBRARY_PATH=../ ./$$i $(TEST_ARGS); done
 
 clean:
 	$(call QUIET_CLEAN, tests)$(RM) $(TESTS_A) $(TESTS_SO)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 6/9] libperf: Add support for user space counter access
  2020-10-01 14:01 [PATCH v4 0/9] libperf and arm64 userspace counter access support Rob Herring
                   ` (4 preceding siblings ...)
  2020-10-01 14:01 ` [PATCH v4 5/9] libperf: tests: Add support for verbose printing Rob Herring
@ 2020-10-01 14:01 ` Rob Herring
  2020-10-01 14:01 ` [PATCH v4 7/9] libperf: Add arm64 support to perf_mmap__read_self() Rob Herring
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Rob Herring @ 2020-10-01 14:01 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: linux-arm-kernel, linux-kernel, Alexander Shishkin, Namhyung Kim,
	Raphael Gault, Mark Rutland, Jonathan Cameron, Ian Rogers,
	honnappa.nagarahalli, Itaru Kitayama

x86 and arm64 can both support direct access of event counters in
userspace. The access sequence is less than trivial and currently exists
in perf test code (tools/perf/arch/x86/tests/rdpmc.c) with copies in
projects such as PAPI and libpfm4.

In order to support usersapce access, an event must be mmapped first
with perf_evsel__mmap(). Then subsequent calls to perf_evsel__read()
will use the fast path (assuming the arch supports it).

Signed-off-by: Rob Herring <robh@kernel.org>
---
v4:
 - Update perf_evsel__mmap size to pages
v3:
 - Split out perf_evsel__mmap() to separate patch
---
 tools/lib/perf/evsel.c                 |  3 +
 tools/lib/perf/include/internal/mmap.h |  3 +
 tools/lib/perf/mmap.c                  | 90 ++++++++++++++++++++++++++
 tools/lib/perf/tests/test-evsel.c      | 63 ++++++++++++++++++
 4 files changed, 159 insertions(+)

diff --git a/tools/lib/perf/evsel.c b/tools/lib/perf/evsel.c
index 42eaf3c18981..e1f18d65f604 100644
--- a/tools/lib/perf/evsel.c
+++ b/tools/lib/perf/evsel.c
@@ -222,6 +222,9 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 	if (FD(evsel, cpu, thread) < 0)
 		return -EINVAL;
 
+	if (evsel->mmap && !perf_mmap__read_self(evsel->mmap, count))
+		return 0;
+
 	if (readn(FD(evsel, cpu, thread), count->values, size) <= 0)
 		return -errno;
 
diff --git a/tools/lib/perf/include/internal/mmap.h b/tools/lib/perf/include/internal/mmap.h
index be7556e0a2b2..5e3422f40ed5 100644
--- a/tools/lib/perf/include/internal/mmap.h
+++ b/tools/lib/perf/include/internal/mmap.h
@@ -11,6 +11,7 @@
 #define PERF_SAMPLE_MAX_SIZE (1 << 16)
 
 struct perf_mmap;
+struct perf_counts_values;
 
 typedef void (*libperf_unmap_cb_t)(struct perf_mmap *map);
 
@@ -52,4 +53,6 @@ void perf_mmap__put(struct perf_mmap *map);
 
 u64 perf_mmap__read_head(struct perf_mmap *map);
 
+int perf_mmap__read_self(struct perf_mmap *map, struct perf_counts_values *count);
+
 #endif /* __LIBPERF_INTERNAL_MMAP_H */
diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
index 79d5ed6c38cc..cb07969cfdbf 100644
--- a/tools/lib/perf/mmap.c
+++ b/tools/lib/perf/mmap.c
@@ -8,9 +8,11 @@
 #include <linux/perf_event.h>
 #include <perf/mmap.h>
 #include <perf/event.h>
+#include <perf/evsel.h>
 #include <internal/mmap.h>
 #include <internal/lib.h>
 #include <linux/kernel.h>
+#include <linux/math64.h>
 #include "internal.h"
 
 void perf_mmap__init(struct perf_mmap *map, struct perf_mmap *prev,
@@ -273,3 +275,91 @@ union perf_event *perf_mmap__read_event(struct perf_mmap *map)
 
 	return event;
 }
+
+#if defined(__i386__) || defined(__x86_64__)
+static u64 read_perf_counter(unsigned int counter)
+{
+	unsigned int low, high;
+
+	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
+
+	return low | ((u64)high) << 32;
+}
+
+static u64 read_timestamp(void)
+{
+	unsigned int low, high;
+
+	asm volatile("rdtsc" : "=a" (low), "=d" (high));
+
+	return low | ((u64)high) << 32;
+}
+#else
+static u64 read_perf_counter(unsigned int counter) { return 0; }
+static u64 read_timestamp(void) { return 0; }
+#endif
+
+int perf_mmap__read_self(struct perf_mmap *map, struct perf_counts_values *count)
+{
+	struct perf_event_mmap_page *pc = map->base;
+	u32 seq, idx, time_mult = 0, time_shift = 0;
+	u64 cnt, cyc = 0, time_offset = 0, time_cycles = 0, time_mask = ~0ULL;
+
+	BUG_ON(!pc);
+
+	if (!pc->cap_user_rdpmc)
+		return -1;
+
+	do {
+		seq = READ_ONCE(pc->lock);
+		barrier();
+
+		count->ena = READ_ONCE(pc->time_enabled);
+		count->run = READ_ONCE(pc->time_running);
+
+		if (pc->cap_user_time && count->ena != count->run) {
+			cyc = read_timestamp();
+			time_mult = READ_ONCE(pc->time_mult);
+			time_shift = READ_ONCE(pc->time_shift);
+			time_offset = READ_ONCE(pc->time_offset);
+
+			if (pc->cap_user_time_short) {
+				time_cycles = READ_ONCE(pc->time_cycles);
+				time_mask = READ_ONCE(pc->time_mask);
+			}
+		}
+
+		idx = READ_ONCE(pc->index);
+		cnt = READ_ONCE(pc->offset);
+		if (pc->cap_user_rdpmc && idx) {
+			u64 evcnt = read_perf_counter(idx - 1);
+			u16 width = READ_ONCE(pc->pmc_width);
+
+			evcnt <<= 64 - width;
+			evcnt >>= 64 - width;
+			cnt += evcnt;
+		} else
+			return -1;
+
+		barrier();
+	} while (READ_ONCE(pc->lock) != seq);
+
+	if (count->ena != count->run) {
+		u64 delta;
+
+		/* Adjust for cap_usr_time_short, a nop if not */
+		cyc = time_cycles + ((cyc - time_cycles) & time_mask);
+
+		delta = time_offset + mul_u64_u32_shr(cyc, time_mult, time_shift);
+
+		count->ena += delta;
+		if (idx)
+			count->run += delta;
+
+		cnt = mul_u64_u64_div64(cnt, count->ena, count->run);
+	}
+
+	count->val = cnt;
+
+	return 0;
+}
diff --git a/tools/lib/perf/tests/test-evsel.c b/tools/lib/perf/tests/test-evsel.c
index 135722ac965b..eeca8203d73d 100644
--- a/tools/lib/perf/tests/test-evsel.c
+++ b/tools/lib/perf/tests/test-evsel.c
@@ -120,6 +120,67 @@ static int test_stat_thread_enable(void)
 	return 0;
 }
 
+static int test_stat_user_read(int event)
+{
+	struct perf_counts_values counts = { .val = 0 };
+	struct perf_thread_map *threads;
+	struct perf_evsel *evsel;
+	struct perf_event_mmap_page *pc;
+	struct perf_event_attr attr = {
+		.type	= PERF_TYPE_HARDWARE,
+		.config	= event,
+	};
+	int err, i;
+
+	threads = perf_thread_map__new_dummy();
+	__T("failed to create threads", threads);
+
+	perf_thread_map__set_pid(threads, 0, 0);
+
+	evsel = perf_evsel__new(&attr);
+	__T("failed to create evsel", evsel);
+
+	err = perf_evsel__open(evsel, NULL, threads);
+	__T("failed to open evsel", err == 0);
+
+	pc = perf_evsel__mmap(evsel, 0);
+	__T("failed to mmap evsel", pc);
+
+#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
+	__T("userspace counter access not supported", pc->cap_user_rdpmc);
+	__T("userspace counter access not enabled", pc->index);
+	__T("userspace counter width not set", pc->pmc_width >= 32);
+#endif
+
+	perf_evsel__read(evsel, 0, 0, &counts);
+	__T("failed to read value for evsel", counts.val != 0);
+
+	for (i = 0; i < 5; i++) {
+		volatile int count = 0x10000 << i;
+		__u64 start, end, last = 0;
+
+		__T_VERBOSE("\tloop = %u, ", count);
+
+		perf_evsel__read(evsel, 0, 0, &counts);
+		start = counts.val;
+
+		while (count--) ;
+
+		perf_evsel__read(evsel, 0, 0, &counts);
+		end = counts.val;
+
+		__T("invalid counter data", (end - start) > last);
+		last = end - start;
+		__T_VERBOSE("count = %llu\n", end - start);
+	}
+
+	perf_evsel__close(evsel);
+	perf_evsel__delete(evsel);
+
+	perf_thread_map__put(threads);
+	return 0;
+}
+
 int main(int argc, char **argv)
 {
 	__T_START;
@@ -129,6 +190,8 @@ int main(int argc, char **argv)
 	test_stat_cpu();
 	test_stat_thread();
 	test_stat_thread_enable();
+	test_stat_user_read(PERF_COUNT_HW_INSTRUCTIONS);
+	test_stat_user_read(PERF_COUNT_HW_CPU_CYCLES);
 
 	__T_END;
 	return 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 7/9] libperf: Add arm64 support to perf_mmap__read_self()
  2020-10-01 14:01 [PATCH v4 0/9] libperf and arm64 userspace counter access support Rob Herring
                   ` (5 preceding siblings ...)
  2020-10-01 14:01 ` [PATCH v4 6/9] libperf: Add support for user space counter access Rob Herring
@ 2020-10-01 14:01 ` Rob Herring
  2020-10-01 14:01 ` [PATCH v4 8/9] perf: arm64: Add test for userspace counter access on heterogeneous systems Rob Herring
  2020-10-01 14:01 ` [PATCH v4 9/9] Documentation: arm64: Document PMU counters access from userspace Rob Herring
  8 siblings, 0 replies; 29+ messages in thread
From: Rob Herring @ 2020-10-01 14:01 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: linux-arm-kernel, linux-kernel, Alexander Shishkin, Namhyung Kim,
	Raphael Gault, Mark Rutland, Jonathan Cameron, Ian Rogers,
	honnappa.nagarahalli, Itaru Kitayama

Add the arm64 variants for read_perf_counter() and read_timestamp().
Unfortunately the counter number is encoded into the instruction, so the
code is a bit verbose to enumerate all possible counters.

Signed-off-by: Rob Herring <robh@kernel.org>
---
 tools/lib/perf/mmap.c | 98 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)

diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
index cb07969cfdbf..6cf939aae6b6 100644
--- a/tools/lib/perf/mmap.c
+++ b/tools/lib/perf/mmap.c
@@ -13,6 +13,7 @@
 #include <internal/lib.h>
 #include <linux/kernel.h>
 #include <linux/math64.h>
+#include <linux/stringify.h>
 #include "internal.h"
 
 void perf_mmap__init(struct perf_mmap *map, struct perf_mmap *prev,
@@ -294,6 +295,103 @@ static u64 read_timestamp(void)
 
 	return low | ((u64)high) << 32;
 }
+#elif defined(__aarch64__)
+#define read_sysreg(r) ({						\
+	u64 __val;							\
+	asm volatile("mrs %0, " __stringify(r) : "=r" (__val));		\
+	__val;								\
+})
+
+static u64 read_pmccntr(void)
+{
+	return read_sysreg(pmccntr_el0);
+}
+
+#define PMEVCNTR_READ(idx)					\
+	static u64 read_pmevcntr_##idx(void) {			\
+		return read_sysreg(pmevcntr##idx##_el0);	\
+	}
+
+PMEVCNTR_READ(0);
+PMEVCNTR_READ(1);
+PMEVCNTR_READ(2);
+PMEVCNTR_READ(3);
+PMEVCNTR_READ(4);
+PMEVCNTR_READ(5);
+PMEVCNTR_READ(6);
+PMEVCNTR_READ(7);
+PMEVCNTR_READ(8);
+PMEVCNTR_READ(9);
+PMEVCNTR_READ(10);
+PMEVCNTR_READ(11);
+PMEVCNTR_READ(12);
+PMEVCNTR_READ(13);
+PMEVCNTR_READ(14);
+PMEVCNTR_READ(15);
+PMEVCNTR_READ(16);
+PMEVCNTR_READ(17);
+PMEVCNTR_READ(18);
+PMEVCNTR_READ(19);
+PMEVCNTR_READ(20);
+PMEVCNTR_READ(21);
+PMEVCNTR_READ(22);
+PMEVCNTR_READ(23);
+PMEVCNTR_READ(24);
+PMEVCNTR_READ(25);
+PMEVCNTR_READ(26);
+PMEVCNTR_READ(27);
+PMEVCNTR_READ(28);
+PMEVCNTR_READ(29);
+PMEVCNTR_READ(30);
+
+/*
+ * Read a value direct from PMEVCNTR<idx>
+ */
+static u64 read_perf_counter(unsigned int counter)
+{
+	static u64 (* const read_f[])(void) = {
+		read_pmevcntr_0,
+		read_pmevcntr_1,
+		read_pmevcntr_2,
+		read_pmevcntr_3,
+		read_pmevcntr_4,
+		read_pmevcntr_5,
+		read_pmevcntr_6,
+		read_pmevcntr_7,
+		read_pmevcntr_8,
+		read_pmevcntr_9,
+		read_pmevcntr_10,
+		read_pmevcntr_11,
+		read_pmevcntr_13,
+		read_pmevcntr_12,
+		read_pmevcntr_14,
+		read_pmevcntr_15,
+		read_pmevcntr_16,
+		read_pmevcntr_17,
+		read_pmevcntr_18,
+		read_pmevcntr_19,
+		read_pmevcntr_20,
+		read_pmevcntr_21,
+		read_pmevcntr_22,
+		read_pmevcntr_23,
+		read_pmevcntr_24,
+		read_pmevcntr_25,
+		read_pmevcntr_26,
+		read_pmevcntr_27,
+		read_pmevcntr_28,
+		read_pmevcntr_29,
+		read_pmevcntr_30,
+		read_pmccntr
+	};
+
+	if (counter < ARRAY_SIZE(read_f))
+		return (read_f[counter])();
+
+	return 0;
+}
+
+static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
+
 #else
 static u64 read_perf_counter(unsigned int counter) { return 0; }
 static u64 read_timestamp(void) { return 0; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 8/9] perf: arm64: Add test for userspace counter access on heterogeneous systems
  2020-10-01 14:01 [PATCH v4 0/9] libperf and arm64 userspace counter access support Rob Herring
                   ` (6 preceding siblings ...)
  2020-10-01 14:01 ` [PATCH v4 7/9] libperf: Add arm64 support to perf_mmap__read_self() Rob Herring
@ 2020-10-01 14:01 ` Rob Herring
  2020-10-01 14:01 ` [PATCH v4 9/9] Documentation: arm64: Document PMU counters access from userspace Rob Herring
  8 siblings, 0 replies; 29+ messages in thread
From: Rob Herring @ 2020-10-01 14:01 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: linux-arm-kernel, linux-kernel, Alexander Shishkin, Namhyung Kim,
	Raphael Gault, Mark Rutland, Jonathan Cameron, Ian Rogers,
	honnappa.nagarahalli, Itaru Kitayama

Userspace counter access only works on heterogeneous systems with some
restrictions. The userspace process must be pinned to a homogeneous
subset of CPUs and must open the corresponding PMU for those CPUs. This
commit adds a test implementing these requirements.

Signed-off-by: Rob Herring <robh@kernel.org>
---
v4:
- Update perf_evsel__mmap params
v2:
- Drop all but heterogeneous test as others covered by libperf tests
- Rework to use libperf
---
 tools/perf/arch/arm64/include/arch-tests.h |   7 +
 tools/perf/arch/arm64/tests/Build          |   1 +
 tools/perf/arch/arm64/tests/arch-tests.c   |   4 +
 tools/perf/arch/arm64/tests/user-events.c  | 170 +++++++++++++++++++++
 4 files changed, 182 insertions(+)
 create mode 100644 tools/perf/arch/arm64/tests/user-events.c

diff --git a/tools/perf/arch/arm64/include/arch-tests.h b/tools/perf/arch/arm64/include/arch-tests.h
index 90ec4c8cb880..380ad34a3f09 100644
--- a/tools/perf/arch/arm64/include/arch-tests.h
+++ b/tools/perf/arch/arm64/include/arch-tests.h
@@ -2,11 +2,18 @@
 #ifndef ARCH_TESTS_H
 #define ARCH_TESTS_H
 
+#include <linux/compiler.h>
+
 #ifdef HAVE_DWARF_UNWIND_SUPPORT
 struct thread;
 struct perf_sample;
+int test__arch_unwind_sample(struct perf_sample *sample,
+			     struct thread *thread);
 #endif
 
 extern struct test arch_tests[];
+int test__rd_pinned(struct test __maybe_unused *test,
+		       int __maybe_unused subtest);
+
 
 #endif
diff --git a/tools/perf/arch/arm64/tests/Build b/tools/perf/arch/arm64/tests/Build
index a61c06bdb757..3f9a20c17fc6 100644
--- a/tools/perf/arch/arm64/tests/Build
+++ b/tools/perf/arch/arm64/tests/Build
@@ -1,4 +1,5 @@
 perf-y += regs_load.o
 perf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o
 
+perf-y += user-events.o
 perf-y += arch-tests.o
diff --git a/tools/perf/arch/arm64/tests/arch-tests.c b/tools/perf/arch/arm64/tests/arch-tests.c
index 5b1543c98022..80ce7bd3c16d 100644
--- a/tools/perf/arch/arm64/tests/arch-tests.c
+++ b/tools/perf/arch/arm64/tests/arch-tests.c
@@ -10,6 +10,10 @@ struct test arch_tests[] = {
 		.func = test__dwarf_unwind,
 	},
 #endif
+	{
+		.desc = "Pinned CPU user counter access",
+		.func = test__rd_pinned,
+	},
 	{
 		.func = NULL,
 	},
diff --git a/tools/perf/arch/arm64/tests/user-events.c b/tools/perf/arch/arm64/tests/user-events.c
new file mode 100644
index 000000000000..46a6b05fe3fd
--- /dev/null
+++ b/tools/perf/arch/arm64/tests/user-events.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <unistd.h>
+#include <sched.h>
+#include <cpumap.h>
+
+#include <perf/core.h>
+#include <perf/threadmap.h>
+#include <perf/evsel.h>
+
+#include "pmu.h"
+#include "debug.h"
+#include "tests/tests.h"
+#include "arch-tests.h"
+
+static int run_test(struct perf_evsel *evsel)
+{
+	int n;
+	volatile int tmp = 0;
+	u64 delta, i, loops = 1000;
+	struct perf_counts_values counts = { .val = 0 };
+
+	for (n = 0; n < 6; n++) {
+		u64 stamp, now;
+
+		perf_evsel__read(evsel, 0, 0, &counts);
+		stamp = counts.val;
+
+		for (i = 0; i < loops; i++)
+			tmp++;
+
+		perf_evsel__read(evsel, 0, 0, &counts);
+		now = counts.val;
+		loops *= 10;
+
+		delta = now - stamp;
+		pr_debug("%14d: %14llu\n", n, (long long)delta);
+
+		if (!delta)
+			break;
+	}
+	return delta ? 0 : -1;
+}
+
+static struct perf_pmu *pmu_for_cpu(int cpu)
+{
+	int acpu, idx;
+	struct perf_pmu *pmu = NULL;
+
+	while ((pmu = perf_pmu__scan(pmu)) != NULL) {
+		if (pmu->is_uncore)
+			continue;
+		perf_cpu_map__for_each_cpu(acpu, idx, pmu->cpus)
+			if (acpu == cpu)
+				return pmu;
+	}
+	return NULL;
+}
+
+static bool pmu_is_homogeneous(void)
+{
+	int core_cnt = 0;
+	struct perf_pmu *pmu = NULL;
+
+	while ((pmu = perf_pmu__scan(pmu)) != NULL) {
+		if (!pmu->is_uncore && !perf_cpu_map__empty(pmu->cpus))
+			core_cnt++;
+	}
+	return core_cnt == 1;
+}
+
+static int libperf_print(enum libperf_print_level level,
+			 const char *fmt, va_list ap)
+{
+	(void)level;
+	return vfprintf(stderr, fmt, ap);
+}
+
+static struct perf_evsel *perf_init(struct perf_event_attr *attr)
+{
+	int err;
+	struct perf_thread_map *threads;
+	struct perf_evsel *evsel;
+
+	libperf_init(libperf_print);
+
+	threads = perf_thread_map__new_dummy();
+	if (!threads) {
+		pr_err("failed to create threads\n");
+		return NULL;
+	}
+
+	perf_thread_map__set_pid(threads, 0, 0);
+
+	evsel = perf_evsel__new(attr);
+	if (!evsel) {
+		pr_err("failed to create evsel\n");
+		goto out_thread;
+	}
+
+	err = perf_evsel__open(evsel, NULL, threads);
+	if (err) {
+		pr_err("failed to open evsel\n");
+		goto out_open;
+	}
+
+	if (!perf_evsel__mmap(evsel, 0)) {
+		pr_err("failed to mmap evsel\n");
+		goto out_mmap;
+	}
+
+	return evsel;
+
+out_mmap:
+	perf_evsel__close(evsel);
+out_open:
+	perf_evsel__delete(evsel);
+out_thread:
+	perf_thread_map__put(threads);
+	return NULL;
+}
+
+int test__rd_pinned(struct test __maybe_unused *test,
+		    int __maybe_unused subtest)
+{
+	int cpu, cputmp, ret = -1;
+	struct perf_evsel *evsel;
+	struct perf_event_attr attr = {
+		.config = 0x8, /* Instruction count */
+		.config1 = 0, /* 32-bit counter */
+		.exclude_kernel = 1,
+	};
+	cpu_set_t cpu_set;
+	struct perf_pmu *pmu;
+
+	if (pmu_is_homogeneous())
+		return TEST_SKIP;
+
+	cpu = sched_getcpu();
+	pmu = pmu_for_cpu(cpu);
+	if (!pmu)
+		return -1;
+	attr.type = pmu->type;
+
+	CPU_ZERO(&cpu_set);
+	perf_cpu_map__for_each_cpu(cpu, cputmp, pmu->cpus)
+		CPU_SET(cpu, &cpu_set);
+	if (sched_setaffinity(0, sizeof(cpu_set), &cpu_set) < 0)
+		pr_err("Could not set affinity\n");
+
+	evsel = perf_init(&attr);
+	if (!evsel)
+		return -1;
+
+	perf_cpu_map__for_each_cpu(cpu, cputmp, pmu->cpus) {
+		CPU_ZERO(&cpu_set);
+		CPU_SET(cpu, &cpu_set);
+		if (sched_setaffinity(0, sizeof(cpu_set), &cpu_set) < 0)
+			pr_err("Could not set affinity\n");
+
+		pr_debug("Running on CPU %d\n", cpu);
+
+		ret = run_test(evsel);
+		if (ret)
+			break;
+	}
+
+	perf_evsel__close(evsel);
+	perf_evsel__delete(evsel);
+	return ret;
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 9/9] Documentation: arm64: Document PMU counters access from userspace
  2020-10-01 14:01 [PATCH v4 0/9] libperf and arm64 userspace counter access support Rob Herring
                   ` (7 preceding siblings ...)
  2020-10-01 14:01 ` [PATCH v4 8/9] perf: arm64: Add test for userspace counter access on heterogeneous systems Rob Herring
@ 2020-10-01 14:01 ` Rob Herring
  8 siblings, 0 replies; 29+ messages in thread
From: Rob Herring @ 2020-10-01 14:01 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: linux-arm-kernel, linux-kernel, Alexander Shishkin, Namhyung Kim,
	Raphael Gault, Mark Rutland, Jonathan Cameron, Ian Rogers,
	honnappa.nagarahalli, Itaru Kitayama

From: Raphael Gault <raphael.gault@arm.com>

Add a documentation file to describe the access to the pmu hardware
counters from userspace

Signed-off-by: Raphael Gault <raphael.gault@arm.com>
Signed-off-by: Rob Herring <robh@kernel.org>
---
v2:
  - Update links to test examples

Changes from Raphael's v4:
  - Convert to rSt
  - Update chained event status
  - Add section for heterogeneous systems
---
 Documentation/arm64/index.rst                 |  1 +
 .../arm64/perf_counter_user_access.rst        | 56 +++++++++++++++++++
 2 files changed, 57 insertions(+)
 create mode 100644 Documentation/arm64/perf_counter_user_access.rst

diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
index d9665d83c53a..c712a08e7627 100644
--- a/Documentation/arm64/index.rst
+++ b/Documentation/arm64/index.rst
@@ -15,6 +15,7 @@ ARM64 Architecture
     legacy_instructions
     memory
     perf
+    perf_counter_user_access
     pointer-authentication
     silicon-errata
     sve
diff --git a/Documentation/arm64/perf_counter_user_access.rst b/Documentation/arm64/perf_counter_user_access.rst
new file mode 100644
index 000000000000..e49e141f10cc
--- /dev/null
+++ b/Documentation/arm64/perf_counter_user_access.rst
@@ -0,0 +1,56 @@
+=============================================
+Access to PMU hardware counter from userspace
+=============================================
+
+Overview
+--------
+The perf userspace tool relies on the PMU to monitor events. It offers an
+abstraction layer over the hardware counters since the underlying
+implementation is cpu-dependent.
+Arm64 allows userspace tools to have access to the registers storing the
+hardware counters' values directly.
+
+This targets specifically self-monitoring tasks in order to reduce the overhead
+by directly accessing the registers without having to go through the kernel.
+
+How-to
+------
+The focus is set on the armv8 pmuv3 which makes sure that the access to the pmu
+registers is enabled and that the userspace has access to the relevant
+information in order to use them.
+
+In order to have access to the hardware counter it is necessary to open the event
+using the perf tool interface: the sys_perf_event_open syscall returns a fd which
+can subsequently be used with the mmap syscall in order to retrieve a page of
+memory containing information about the event.
+The PMU driver uses this page to expose to the user the hardware counter's
+index and other necessary data. Using this index enables the user to access the
+PMU registers using the `mrs` instruction.
+
+The userspace access is supported in libperf using the perf_evsel__mmap()
+and perf_evsel__read() functions. See `tools/lib/perf/tests/test-evsel.c`_ for
+an example.
+
+About heterogeneous systems
+---------------------------
+On heterogeneous systems such as big.LITTLE, userspace PMU counter access can
+only be enabled when the tasks are pinned to a homogeneous subset of cores and
+the corresponding PMU instance is opened by specifying the 'type' attribute.
+The use of generic event types is not supported in this case.
+
+Have a look at `tools/perf/arch/arm64/tests/user-events.c`_ for an example. It
+can be run using the perf tool to check that the access to the registers works
+correctly from userspace:
+
+.. code-block:: sh
+
+  perf test -v user
+
+About chained events
+--------------------
+Chained events are not supported in userspace. If a 64-bit counter is requested,
+userspace access will only be enabled if the underlying counter is 64-bit.
+
+.. Links
+.. _tools/perf/arch/arm64/tests/user-events.c:
+   https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/perf/arch/arm64/tests/user-events.c
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-10-01 14:01 ` [PATCH v4 4/9] libperf: Add libperf_evsel__mmap() Rob Herring
@ 2020-10-14 11:05   ` Jiri Olsa
  2020-10-16 21:39     ` Rob Herring
  0 siblings, 1 reply; 29+ messages in thread
From: Jiri Olsa @ 2020-10-14 11:05 UTC (permalink / raw)
  To: Rob Herring
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, honnappa.nagarahalli,
	Itaru Kitayama

On Thu, Oct 01, 2020 at 09:01:11AM -0500, Rob Herring wrote:

SNIP

>  
> +void *perf_evsel__mmap(struct perf_evsel *evsel, int pages)
> +{
> +	int ret;
> +	struct perf_mmap *map;
> +	struct perf_mmap_param mp = {
> +		.prot = PROT_READ | PROT_WRITE,
> +	};
> +
> +	if (FD(evsel, 0, 0) < 0)
> +		return NULL;
> +
> +	mp.mask = (pages * page_size) - 1;
> +
> +	map = zalloc(sizeof(*map));
> +	if (!map)
> +		return NULL;
> +
> +	perf_mmap__init(map, NULL, false, NULL);
> +
> +	ret = perf_mmap__mmap(map, &mp, FD(evsel, 0, 0), 0);

hum, so you map event for FD(0,0) but later in perf_evsel__read
you allow to read any cpu/thread combination ending up reading
data from FD(0,0) map:

	int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
			     struct perf_counts_values *count)
	{
		size_t size = perf_evsel__read_size(evsel);

		memset(count, 0, sizeof(*count));

		if (FD(evsel, cpu, thread) < 0)
			return -EINVAL;

		if (evsel->mmap && !perf_mmap__read_self(evsel->mmap, count))
			return 0;


I think we should either check cpu == 0, thread == 0, or make it
general and store perf_evsel::mmap in xyarray as we do for fds

thanks,
jirka


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-10-14 11:05   ` Jiri Olsa
@ 2020-10-16 21:39     ` Rob Herring
  2020-10-19 20:15       ` Jiri Olsa
  0 siblings, 1 reply; 29+ messages in thread
From: Rob Herring @ 2020-10-16 21:39 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Wed, Oct 14, 2020 at 6:05 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Thu, Oct 01, 2020 at 09:01:11AM -0500, Rob Herring wrote:
>
> SNIP
>
> >
> > +void *perf_evsel__mmap(struct perf_evsel *evsel, int pages)
> > +{
> > +     int ret;
> > +     struct perf_mmap *map;
> > +     struct perf_mmap_param mp = {
> > +             .prot = PROT_READ | PROT_WRITE,
> > +     };
> > +
> > +     if (FD(evsel, 0, 0) < 0)
> > +             return NULL;
> > +
> > +     mp.mask = (pages * page_size) - 1;
> > +
> > +     map = zalloc(sizeof(*map));
> > +     if (!map)
> > +             return NULL;
> > +
> > +     perf_mmap__init(map, NULL, false, NULL);
> > +
> > +     ret = perf_mmap__mmap(map, &mp, FD(evsel, 0, 0), 0);
>
> hum, so you map event for FD(0,0) but later in perf_evsel__read
> you allow to read any cpu/thread combination ending up reading
> data from FD(0,0) map:
>
>         int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
>                              struct perf_counts_values *count)
>         {
>                 size_t size = perf_evsel__read_size(evsel);
>
>                 memset(count, 0, sizeof(*count));
>
>                 if (FD(evsel, cpu, thread) < 0)
>                         return -EINVAL;
>
>                 if (evsel->mmap && !perf_mmap__read_self(evsel->mmap, count))
>                         return 0;
>
>
> I think we should either check cpu == 0, thread == 0, or make it
> general and store perf_evsel::mmap in xyarray as we do for fds

The mmapped read will actually fail and then we fallback here. My main
concern though is adding more overhead on a feature that's meant to be
low overhead (granted, it's not much). Maybe we could add checks on
the mmap that we've opened the event with pid == 0 and cpu == -1 (so
only 1 FD)?

Rob

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-10-16 21:39     ` Rob Herring
@ 2020-10-19 20:15       ` Jiri Olsa
  2020-10-20 14:38         ` Rob Herring
  0 siblings, 1 reply; 29+ messages in thread
From: Jiri Olsa @ 2020-10-19 20:15 UTC (permalink / raw)
  To: Rob Herring
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Fri, Oct 16, 2020 at 04:39:15PM -0500, Rob Herring wrote:
> On Wed, Oct 14, 2020 at 6:05 AM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > On Thu, Oct 01, 2020 at 09:01:11AM -0500, Rob Herring wrote:
> >
> > SNIP
> >
> > >
> > > +void *perf_evsel__mmap(struct perf_evsel *evsel, int pages)
> > > +{
> > > +     int ret;
> > > +     struct perf_mmap *map;
> > > +     struct perf_mmap_param mp = {
> > > +             .prot = PROT_READ | PROT_WRITE,
> > > +     };
> > > +
> > > +     if (FD(evsel, 0, 0) < 0)
> > > +             return NULL;
> > > +
> > > +     mp.mask = (pages * page_size) - 1;
> > > +
> > > +     map = zalloc(sizeof(*map));
> > > +     if (!map)
> > > +             return NULL;
> > > +
> > > +     perf_mmap__init(map, NULL, false, NULL);
> > > +
> > > +     ret = perf_mmap__mmap(map, &mp, FD(evsel, 0, 0), 0);
> >
> > hum, so you map event for FD(0,0) but later in perf_evsel__read
> > you allow to read any cpu/thread combination ending up reading
> > data from FD(0,0) map:
> >
> >         int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
> >                              struct perf_counts_values *count)
> >         {
> >                 size_t size = perf_evsel__read_size(evsel);
> >
> >                 memset(count, 0, sizeof(*count));
> >
> >                 if (FD(evsel, cpu, thread) < 0)
> >                         return -EINVAL;
> >
> >                 if (evsel->mmap && !perf_mmap__read_self(evsel->mmap, count))
> >                         return 0;
> >
> >
> > I think we should either check cpu == 0, thread == 0, or make it
> > general and store perf_evsel::mmap in xyarray as we do for fds
> 
> The mmapped read will actually fail and then we fallback here. My main
> concern though is adding more overhead on a feature that's meant to be
> low overhead (granted, it's not much). Maybe we could add checks on
> the mmap that we've opened the event with pid == 0 and cpu == -1 (so
> only 1 FD)?

but then you limit this just for single fd.. having mmap as xyarray
would not be that bad and perf_evsel__mmap will call perf_mmap__mmap
for each defined cpu/thread .. so it depends on user how fast this
will be - how many maps needs to be created/mmaped

jirka


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-10-19 20:15       ` Jiri Olsa
@ 2020-10-20 14:38         ` Rob Herring
  2020-10-20 15:35           ` Jiri Olsa
  0 siblings, 1 reply; 29+ messages in thread
From: Rob Herring @ 2020-10-20 14:38 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Mon, Oct 19, 2020 at 3:15 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Fri, Oct 16, 2020 at 04:39:15PM -0500, Rob Herring wrote:
> > On Wed, Oct 14, 2020 at 6:05 AM Jiri Olsa <jolsa@redhat.com> wrote:
> > >
> > > On Thu, Oct 01, 2020 at 09:01:11AM -0500, Rob Herring wrote:
> > >
> > > SNIP
> > >
> > > >
> > > > +void *perf_evsel__mmap(struct perf_evsel *evsel, int pages)
> > > > +{
> > > > +     int ret;
> > > > +     struct perf_mmap *map;
> > > > +     struct perf_mmap_param mp = {
> > > > +             .prot = PROT_READ | PROT_WRITE,
> > > > +     };
> > > > +
> > > > +     if (FD(evsel, 0, 0) < 0)
> > > > +             return NULL;
> > > > +
> > > > +     mp.mask = (pages * page_size) - 1;
> > > > +
> > > > +     map = zalloc(sizeof(*map));
> > > > +     if (!map)
> > > > +             return NULL;
> > > > +
> > > > +     perf_mmap__init(map, NULL, false, NULL);
> > > > +
> > > > +     ret = perf_mmap__mmap(map, &mp, FD(evsel, 0, 0), 0);
> > >
> > > hum, so you map event for FD(0,0) but later in perf_evsel__read
> > > you allow to read any cpu/thread combination ending up reading
> > > data from FD(0,0) map:
> > >
> > >         int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
> > >                              struct perf_counts_values *count)
> > >         {
> > >                 size_t size = perf_evsel__read_size(evsel);
> > >
> > >                 memset(count, 0, sizeof(*count));
> > >
> > >                 if (FD(evsel, cpu, thread) < 0)
> > >                         return -EINVAL;
> > >
> > >                 if (evsel->mmap && !perf_mmap__read_self(evsel->mmap, count))
> > >                         return 0;
> > >
> > >
> > > I think we should either check cpu == 0, thread == 0, or make it
> > > general and store perf_evsel::mmap in xyarray as we do for fds
> >
> > The mmapped read will actually fail and then we fallback here. My main
> > concern though is adding more overhead on a feature that's meant to be
> > low overhead (granted, it's not much). Maybe we could add checks on
> > the mmap that we've opened the event with pid == 0 and cpu == -1 (so
> > only 1 FD)?
>
> but then you limit this just for single fd.. having mmap as xyarray
> would not be that bad and perf_evsel__mmap will call perf_mmap__mmap
> for each defined cpu/thread .. so it depends on user how fast this
> will be - how many maps needs to be created/mmaped

Given userspace access fails for anything other than the calling
thread and all cpus, how would more than 1 mmap be useful here?

If we did want multiple mmaps, wouldn't we just use the evlist API in
that case? It already does all that.

Rob

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-10-20 14:38         ` Rob Herring
@ 2020-10-20 15:35           ` Jiri Olsa
  2020-10-20 17:11             ` Rob Herring
  0 siblings, 1 reply; 29+ messages in thread
From: Jiri Olsa @ 2020-10-20 15:35 UTC (permalink / raw)
  To: Rob Herring
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Tue, Oct 20, 2020 at 09:38:13AM -0500, Rob Herring wrote:
> On Mon, Oct 19, 2020 at 3:15 PM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > On Fri, Oct 16, 2020 at 04:39:15PM -0500, Rob Herring wrote:
> > > On Wed, Oct 14, 2020 at 6:05 AM Jiri Olsa <jolsa@redhat.com> wrote:
> > > >
> > > > On Thu, Oct 01, 2020 at 09:01:11AM -0500, Rob Herring wrote:
> > > >
> > > > SNIP
> > > >
> > > > >
> > > > > +void *perf_evsel__mmap(struct perf_evsel *evsel, int pages)
> > > > > +{
> > > > > +     int ret;
> > > > > +     struct perf_mmap *map;
> > > > > +     struct perf_mmap_param mp = {
> > > > > +             .prot = PROT_READ | PROT_WRITE,
> > > > > +     };
> > > > > +
> > > > > +     if (FD(evsel, 0, 0) < 0)
> > > > > +             return NULL;
> > > > > +
> > > > > +     mp.mask = (pages * page_size) - 1;
> > > > > +
> > > > > +     map = zalloc(sizeof(*map));
> > > > > +     if (!map)
> > > > > +             return NULL;
> > > > > +
> > > > > +     perf_mmap__init(map, NULL, false, NULL);
> > > > > +
> > > > > +     ret = perf_mmap__mmap(map, &mp, FD(evsel, 0, 0), 0);
> > > >
> > > > hum, so you map event for FD(0,0) but later in perf_evsel__read
> > > > you allow to read any cpu/thread combination ending up reading
> > > > data from FD(0,0) map:
> > > >
> > > >         int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
> > > >                              struct perf_counts_values *count)
> > > >         {
> > > >                 size_t size = perf_evsel__read_size(evsel);
> > > >
> > > >                 memset(count, 0, sizeof(*count));
> > > >
> > > >                 if (FD(evsel, cpu, thread) < 0)
> > > >                         return -EINVAL;
> > > >
> > > >                 if (evsel->mmap && !perf_mmap__read_self(evsel->mmap, count))
> > > >                         return 0;
> > > >
> > > >
> > > > I think we should either check cpu == 0, thread == 0, or make it
> > > > general and store perf_evsel::mmap in xyarray as we do for fds
> > >
> > > The mmapped read will actually fail and then we fallback here. My main
> > > concern though is adding more overhead on a feature that's meant to be
> > > low overhead (granted, it's not much). Maybe we could add checks on
> > > the mmap that we've opened the event with pid == 0 and cpu == -1 (so
> > > only 1 FD)?
> >
> > but then you limit this just for single fd.. having mmap as xyarray
> > would not be that bad and perf_evsel__mmap will call perf_mmap__mmap
> > for each defined cpu/thread .. so it depends on user how fast this
> > will be - how many maps needs to be created/mmaped
> 
> Given userspace access fails for anything other than the calling
> thread and all cpus, how would more than 1 mmap be useful here?

I'm not sure what you mean by fail in here.. you need mmap for each
event fd you want to read from

in the example below we read stats from all cpus via perf_evsel__read,
if we insert this call after perf_evsel__open:

  perf_evsel__mmap(cpus, NULL);

that maps page for each event, then perf_evsel__read
could go through the fast code, no?

> 
> If we did want multiple mmaps, wouldn't we just use the evlist API in
> that case? It already does all that.

we could, but I see this as a separate perf_evsel interface, and if
we allow to have perf_evsel__mmap I think it should follow the
cpus/threads it's open for

jirka


---
static int test_stat_cpu(void)
{
        struct perf_cpu_map *cpus;
        struct perf_evsel *evsel;
        struct perf_event_attr attr = {
                .type   = PERF_TYPE_SOFTWARE,
                .config = PERF_COUNT_SW_CPU_CLOCK,
        };
        int err, cpu, tmp;

        cpus = perf_cpu_map__new(NULL);
        __T("failed to create cpus", cpus);

        evsel = perf_evsel__new(&attr);
        __T("failed to create evsel", evsel);

        err = perf_evsel__open(evsel, cpus, NULL);
        __T("failed to open evsel", err == 0);

        perf_cpu_map__for_each_cpu(cpu, tmp, cpus) {
                struct perf_counts_values counts = { .val = 0 };

                perf_evsel__read(evsel, cpu, 0, &counts);
                __T("failed to read value for evsel", counts.val != 0);
        }

        perf_evsel__close(evsel);
        perf_evsel__delete(evsel);

        perf_cpu_map__put(cpus);
        return 0;
}


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-10-20 15:35           ` Jiri Olsa
@ 2020-10-20 17:11             ` Rob Herring
  2020-10-21 11:24               ` Jiri Olsa
  0 siblings, 1 reply; 29+ messages in thread
From: Rob Herring @ 2020-10-20 17:11 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Tue, Oct 20, 2020 at 10:35 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Tue, Oct 20, 2020 at 09:38:13AM -0500, Rob Herring wrote:
> > On Mon, Oct 19, 2020 at 3:15 PM Jiri Olsa <jolsa@redhat.com> wrote:
> > >
> > > On Fri, Oct 16, 2020 at 04:39:15PM -0500, Rob Herring wrote:
> > > > On Wed, Oct 14, 2020 at 6:05 AM Jiri Olsa <jolsa@redhat.com> wrote:
> > > > >
> > > > > On Thu, Oct 01, 2020 at 09:01:11AM -0500, Rob Herring wrote:
> > > > >
> > > > > SNIP
> > > > >
> > > > > >
> > > > > > +void *perf_evsel__mmap(struct perf_evsel *evsel, int pages)
> > > > > > +{
> > > > > > +     int ret;
> > > > > > +     struct perf_mmap *map;
> > > > > > +     struct perf_mmap_param mp = {
> > > > > > +             .prot = PROT_READ | PROT_WRITE,
> > > > > > +     };
> > > > > > +
> > > > > > +     if (FD(evsel, 0, 0) < 0)
> > > > > > +             return NULL;
> > > > > > +
> > > > > > +     mp.mask = (pages * page_size) - 1;
> > > > > > +
> > > > > > +     map = zalloc(sizeof(*map));
> > > > > > +     if (!map)
> > > > > > +             return NULL;
> > > > > > +
> > > > > > +     perf_mmap__init(map, NULL, false, NULL);
> > > > > > +
> > > > > > +     ret = perf_mmap__mmap(map, &mp, FD(evsel, 0, 0), 0);
> > > > >
> > > > > hum, so you map event for FD(0,0) but later in perf_evsel__read
> > > > > you allow to read any cpu/thread combination ending up reading
> > > > > data from FD(0,0) map:
> > > > >
> > > > >         int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
> > > > >                              struct perf_counts_values *count)
> > > > >         {
> > > > >                 size_t size = perf_evsel__read_size(evsel);
> > > > >
> > > > >                 memset(count, 0, sizeof(*count));
> > > > >
> > > > >                 if (FD(evsel, cpu, thread) < 0)
> > > > >                         return -EINVAL;
> > > > >
> > > > >                 if (evsel->mmap && !perf_mmap__read_self(evsel->mmap, count))
> > > > >                         return 0;
> > > > >
> > > > >
> > > > > I think we should either check cpu == 0, thread == 0, or make it
> > > > > general and store perf_evsel::mmap in xyarray as we do for fds
> > > >
> > > > The mmapped read will actually fail and then we fallback here. My main
> > > > concern though is adding more overhead on a feature that's meant to be
> > > > low overhead (granted, it's not much). Maybe we could add checks on
> > > > the mmap that we've opened the event with pid == 0 and cpu == -1 (so
> > > > only 1 FD)?
> > >
> > > but then you limit this just for single fd.. having mmap as xyarray
> > > would not be that bad and perf_evsel__mmap will call perf_mmap__mmap
> > > for each defined cpu/thread .. so it depends on user how fast this
> > > will be - how many maps needs to be created/mmaped
> >
> > Given userspace access fails for anything other than the calling
> > thread and all cpus, how would more than 1 mmap be useful here?
>
> I'm not sure what you mean by fail in here.. you need mmap for each
> event fd you want to read from

Yes, but that's one mmap per event (evsel) which is different than per
cpu/thread.

> in the example below we read stats from all cpus via perf_evsel__read,
> if we insert this call after perf_evsel__open:
>
>   perf_evsel__mmap(cpus, NULL);
>
> that maps page for each event, then perf_evsel__read
> could go through the fast code, no?

No, because we're not self-monitoring (pid == 0 and cpu == -1). With
the following change:

diff --git a/tools/lib/perf/tests/test-evsel.c
b/tools/lib/perf/tests/test-evsel.c
index eeca8203d73d..1fca9c121f7c 100644
--- a/tools/lib/perf/tests/test-evsel.c
+++ b/tools/lib/perf/tests/test-evsel.c
@@ -17,6 +17,7 @@ static int test_stat_cpu(void)
 {
        struct perf_cpu_map *cpus;
        struct perf_evsel *evsel;
+       struct perf_event_mmap_page *pc;
        struct perf_event_attr attr = {
                .type   = PERF_TYPE_SOFTWARE,
                .config = PERF_COUNT_SW_CPU_CLOCK,
@@ -32,6 +33,15 @@ static int test_stat_cpu(void)
        err = perf_evsel__open(evsel, cpus, NULL);
        __T("failed to open evsel", err == 0);

+       pc = perf_evsel__mmap(evsel, 0);
+       __T("failed to mmap evsel", pc);
+
+#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
+       __T("userspace counter access not supported", pc->cap_user_rdpmc);
+       __T("userspace counter access not enabled", pc->index);
+       __T("userspace counter width not set", pc->pmc_width >= 32);
+#endif
+
        perf_cpu_map__for_each_cpu(cpu, tmp, cpus) {
                struct perf_counts_values counts = { .val = 0 };

I get:

- running test-evsel.c...FAILED test-evsel.c:40 userspace counter
access not supported

If I set it to pid==0, userspace counter access is also disabled.

Maybe there is some use for mmap beyond fast path read for
self-monitoring or what evlist mmap does, but I don't know what that
would be.

Note that we could get rid of the mmap API and just do the mmap behind
the scenes whenever we get the magic setup that works. The main
downside with that is you can't check if the fast path is enabled or
not (though we could have a perf_evsel__is_fast_read(evsel, cpu,
thread) instead).

Rob

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-10-20 17:11             ` Rob Herring
@ 2020-10-21 11:24               ` Jiri Olsa
  2020-11-05 16:19                 ` Rob Herring
  0 siblings, 1 reply; 29+ messages in thread
From: Jiri Olsa @ 2020-10-21 11:24 UTC (permalink / raw)
  To: Rob Herring
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Tue, Oct 20, 2020 at 12:11:47PM -0500, Rob Herring wrote:

SNIP

> > > > >
> > > > > The mmapped read will actually fail and then we fallback here. My main
> > > > > concern though is adding more overhead on a feature that's meant to be
> > > > > low overhead (granted, it's not much). Maybe we could add checks on
> > > > > the mmap that we've opened the event with pid == 0 and cpu == -1 (so
> > > > > only 1 FD)?
> > > >
> > > > but then you limit this just for single fd.. having mmap as xyarray
> > > > would not be that bad and perf_evsel__mmap will call perf_mmap__mmap
> > > > for each defined cpu/thread .. so it depends on user how fast this
> > > > will be - how many maps needs to be created/mmaped
> > >
> > > Given userspace access fails for anything other than the calling
> > > thread and all cpus, how would more than 1 mmap be useful here?
> >
> > I'm not sure what you mean by fail in here.. you need mmap for each
> > event fd you want to read from
> 
> Yes, but that's one mmap per event (evsel) which is different than per
> cpu/thread.

right, and you need mmap per fd IIUC

> 
> > in the example below we read stats from all cpus via perf_evsel__read,
> > if we insert this call after perf_evsel__open:
> >
> >   perf_evsel__mmap(cpus, NULL);
> >
> > that maps page for each event, then perf_evsel__read
> > could go through the fast code, no?
> 
> No, because we're not self-monitoring (pid == 0 and cpu == -1). With
> the following change:
> 
> diff --git a/tools/lib/perf/tests/test-evsel.c
> b/tools/lib/perf/tests/test-evsel.c
> index eeca8203d73d..1fca9c121f7c 100644
> --- a/tools/lib/perf/tests/test-evsel.c
> +++ b/tools/lib/perf/tests/test-evsel.c
> @@ -17,6 +17,7 @@ static int test_stat_cpu(void)
>  {
>         struct perf_cpu_map *cpus;
>         struct perf_evsel *evsel;
> +       struct perf_event_mmap_page *pc;
>         struct perf_event_attr attr = {
>                 .type   = PERF_TYPE_SOFTWARE,
>                 .config = PERF_COUNT_SW_CPU_CLOCK,
> @@ -32,6 +33,15 @@ static int test_stat_cpu(void)
>         err = perf_evsel__open(evsel, cpus, NULL);
>         __T("failed to open evsel", err == 0);
> 
> +       pc = perf_evsel__mmap(evsel, 0);
> +       __T("failed to mmap evsel", pc);
> +
> +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> +       __T("userspace counter access not supported", pc->cap_user_rdpmc);
> +       __T("userspace counter access not enabled", pc->index);
> +       __T("userspace counter width not set", pc->pmc_width >= 32);
> +#endif

I'll need to check, I'm surprised this would depend on the way
you open the event

jirka

> +
>         perf_cpu_map__for_each_cpu(cpu, tmp, cpus) {
>                 struct perf_counts_values counts = { .val = 0 };
> 
> I get:
> 
> - running test-evsel.c...FAILED test-evsel.c:40 userspace counter
> access not supported
> 
> If I set it to pid==0, userspace counter access is also disabled.
> 
> Maybe there is some use for mmap beyond fast path read for
> self-monitoring or what evlist mmap does, but I don't know what that
> would be.
> 
> Note that we could get rid of the mmap API and just do the mmap behind
> the scenes whenever we get the magic setup that works. The main
> downside with that is you can't check if the fast path is enabled or
> not (though we could have a perf_evsel__is_fast_read(evsel, cpu,
> thread) instead).
> 
> Rob
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-10-21 11:24               ` Jiri Olsa
@ 2020-11-05 16:19                 ` Rob Herring
  2020-11-05 22:41                   ` Jiri Olsa
  0 siblings, 1 reply; 29+ messages in thread
From: Rob Herring @ 2020-11-05 16:19 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Wed, Oct 21, 2020 at 6:24 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Tue, Oct 20, 2020 at 12:11:47PM -0500, Rob Herring wrote:
>
> SNIP
>
> > > > > >
> > > > > > The mmapped read will actually fail and then we fallback here. My main
> > > > > > concern though is adding more overhead on a feature that's meant to be
> > > > > > low overhead (granted, it's not much). Maybe we could add checks on
> > > > > > the mmap that we've opened the event with pid == 0 and cpu == -1 (so
> > > > > > only 1 FD)?
> > > > >
> > > > > but then you limit this just for single fd.. having mmap as xyarray
> > > > > would not be that bad and perf_evsel__mmap will call perf_mmap__mmap
> > > > > for each defined cpu/thread .. so it depends on user how fast this
> > > > > will be - how many maps needs to be created/mmaped
> > > >
> > > > Given userspace access fails for anything other than the calling
> > > > thread and all cpus, how would more than 1 mmap be useful here?
> > >
> > > I'm not sure what you mean by fail in here.. you need mmap for each
> > > event fd you want to read from
> >
> > Yes, but that's one mmap per event (evsel) which is different than per
> > cpu/thread.
>
> right, and you need mmap per fd IIUC
>
> >
> > > in the example below we read stats from all cpus via perf_evsel__read,
> > > if we insert this call after perf_evsel__open:
> > >
> > >   perf_evsel__mmap(cpus, NULL);
> > >
> > > that maps page for each event, then perf_evsel__read
> > > could go through the fast code, no?
> >
> > No, because we're not self-monitoring (pid == 0 and cpu == -1). With
> > the following change:
> >
> > diff --git a/tools/lib/perf/tests/test-evsel.c
> > b/tools/lib/perf/tests/test-evsel.c
> > index eeca8203d73d..1fca9c121f7c 100644
> > --- a/tools/lib/perf/tests/test-evsel.c
> > +++ b/tools/lib/perf/tests/test-evsel.c
> > @@ -17,6 +17,7 @@ static int test_stat_cpu(void)
> >  {
> >         struct perf_cpu_map *cpus;
> >         struct perf_evsel *evsel;
> > +       struct perf_event_mmap_page *pc;
> >         struct perf_event_attr attr = {
> >                 .type   = PERF_TYPE_SOFTWARE,
> >                 .config = PERF_COUNT_SW_CPU_CLOCK,
> > @@ -32,6 +33,15 @@ static int test_stat_cpu(void)
> >         err = perf_evsel__open(evsel, cpus, NULL);
> >         __T("failed to open evsel", err == 0);
> >
> > +       pc = perf_evsel__mmap(evsel, 0);
> > +       __T("failed to mmap evsel", pc);
> > +
> > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > +       __T("userspace counter access not supported", pc->cap_user_rdpmc);
> > +       __T("userspace counter access not enabled", pc->index);
> > +       __T("userspace counter width not set", pc->pmc_width >= 32);
> > +#endif
>
> I'll need to check, I'm surprised this would depend on the way
> you open the event

Any more thoughts on this?

Rob

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-11-05 16:19                 ` Rob Herring
@ 2020-11-05 22:41                   ` Jiri Olsa
  2020-11-06 21:56                     ` Rob Herring
  0 siblings, 1 reply; 29+ messages in thread
From: Jiri Olsa @ 2020-11-05 22:41 UTC (permalink / raw)
  To: Rob Herring
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Thu, Nov 05, 2020 at 10:19:24AM -0600, Rob Herring wrote:

SNIP

> > > >
> > > > that maps page for each event, then perf_evsel__read
> > > > could go through the fast code, no?
> > >
> > > No, because we're not self-monitoring (pid == 0 and cpu == -1). With
> > > the following change:
> > >
> > > diff --git a/tools/lib/perf/tests/test-evsel.c
> > > b/tools/lib/perf/tests/test-evsel.c
> > > index eeca8203d73d..1fca9c121f7c 100644
> > > --- a/tools/lib/perf/tests/test-evsel.c
> > > +++ b/tools/lib/perf/tests/test-evsel.c
> > > @@ -17,6 +17,7 @@ static int test_stat_cpu(void)
> > >  {
> > >         struct perf_cpu_map *cpus;
> > >         struct perf_evsel *evsel;
> > > +       struct perf_event_mmap_page *pc;
> > >         struct perf_event_attr attr = {
> > >                 .type   = PERF_TYPE_SOFTWARE,
> > >                 .config = PERF_COUNT_SW_CPU_CLOCK,
> > > @@ -32,6 +33,15 @@ static int test_stat_cpu(void)
> > >         err = perf_evsel__open(evsel, cpus, NULL);
> > >         __T("failed to open evsel", err == 0);
> > >
> > > +       pc = perf_evsel__mmap(evsel, 0);
> > > +       __T("failed to mmap evsel", pc);
> > > +
> > > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > > +       __T("userspace counter access not supported", pc->cap_user_rdpmc);
> > > +       __T("userspace counter access not enabled", pc->index);
> > > +       __T("userspace counter width not set", pc->pmc_width >= 32);
> > > +#endif
> >
> > I'll need to check, I'm surprised this would depend on the way
> > you open the event
> 
> Any more thoughts on this?

sry I got stuck with other stuff.. I tried your change
and pc->cap_user_rdpmc is 0 because the test creates
software event, which does not support that

when I change that to:

	.type   = PERF_TYPE_HARDWARE,
	.config = PERF_COUNT_HW_CPU_CYCLES,

I don't see any of those warning you added

jirka


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-11-05 22:41                   ` Jiri Olsa
@ 2020-11-06 21:56                     ` Rob Herring
  2020-11-11 12:00                       ` Jiri Olsa
  0 siblings, 1 reply; 29+ messages in thread
From: Rob Herring @ 2020-11-06 21:56 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Thu, Nov 5, 2020 at 4:41 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Thu, Nov 05, 2020 at 10:19:24AM -0600, Rob Herring wrote:
>
> SNIP
>
> > > > >
> > > > > that maps page for each event, then perf_evsel__read
> > > > > could go through the fast code, no?
> > > >
> > > > No, because we're not self-monitoring (pid == 0 and cpu == -1). With
> > > > the following change:
> > > >
> > > > diff --git a/tools/lib/perf/tests/test-evsel.c
> > > > b/tools/lib/perf/tests/test-evsel.c
> > > > index eeca8203d73d..1fca9c121f7c 100644
> > > > --- a/tools/lib/perf/tests/test-evsel.c
> > > > +++ b/tools/lib/perf/tests/test-evsel.c
> > > > @@ -17,6 +17,7 @@ static int test_stat_cpu(void)
> > > >  {
> > > >         struct perf_cpu_map *cpus;
> > > >         struct perf_evsel *evsel;
> > > > +       struct perf_event_mmap_page *pc;
> > > >         struct perf_event_attr attr = {
> > > >                 .type   = PERF_TYPE_SOFTWARE,
> > > >                 .config = PERF_COUNT_SW_CPU_CLOCK,
> > > > @@ -32,6 +33,15 @@ static int test_stat_cpu(void)
> > > >         err = perf_evsel__open(evsel, cpus, NULL);
> > > >         __T("failed to open evsel", err == 0);
> > > >
> > > > +       pc = perf_evsel__mmap(evsel, 0);
> > > > +       __T("failed to mmap evsel", pc);
> > > > +
> > > > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > > > +       __T("userspace counter access not supported", pc->cap_user_rdpmc);
> > > > +       __T("userspace counter access not enabled", pc->index);
> > > > +       __T("userspace counter width not set", pc->pmc_width >= 32);
> > > > +#endif
> > >
> > > I'll need to check, I'm surprised this would depend on the way
> > > you open the event
> >
> > Any more thoughts on this?
>
> sry I got stuck with other stuff.. I tried your change
> and pc->cap_user_rdpmc is 0 because the test creates
> software event, which does not support that

Sigh, yes, of course.

> when I change that to:
>
>         .type   = PERF_TYPE_HARDWARE,
>         .config = PERF_COUNT_HW_CPU_CYCLES,
>
> I don't see any of those warning you added

So I've now implemented the per fd mmap. It seems to run and get some
data, but for the above case the counts don't look right.

cpu0: count = 0x10883, ena = 0xbf42, run = 0xbf42
cpu1: count = 0x1bc65, ena = 0xa278, run = 0xa278
cpu2: count = 0x1fab2, ena = 0x91ea, run = 0x91ea
cpu3: count = 0x23d61, ena = 0x81ac, run = 0x81ac
cpu4: count = 0x2936a, ena = 0x7149, run = 0x7149
cpu5: count = 0x2cd4e, ena = 0x634f, run = 0x634f
cpu6: count = 0x3139f, ena = 0x53e7, run = 0x53e7
cpu7: count = 0x35350, ena = 0x4690, run = 0x4690

For comparison, this is what I get using the slow path read():
cpu0: count = 0x1c40, ena = 0x188b5, run = 0x188b5
cpu1: count = 0x18e0, ena = 0x1b8f4, run = 0x1b8f4
cpu2: count = 0x745e, ena = 0x1ab9e, run = 0x1ab9e
cpu3: count = 0x2416, ena = 0x1a280, run = 0x1a280
cpu4: count = 0x19c7, ena = 0x19b00, run = 0x19b00
cpu5: count = 0x1737, ena = 0x19262, run = 0x19262
cpu6: count = 0x11d0e, ena = 0x18944, run = 0x18944
cpu7: count = 0x20dbe, ena = 0x181f4, run = 0x181f4

The difference is we get a sequentially increasing count rather than 1
random CPU (the one running the test) with a much higher count. That
seems to me we're just reading the count for the calling process, not
each CPU.

For this to work correctly, cap_user_rdpmc would have to be set only
for the CPU's mmap that matches the calling process's CPU. I'm not
sure whether that can be done. Even if it can, is it really worth
doing so? You're accelerating reading an event on 1 out of N CPUs. And
what do we do on every kernel up til now this won't work on? Another
cap bit?

Rob

P.S. I did find one bug with all this. The shifts by pmc_width in the
read seq need to be a signed count. This test happens to have raw
counter values starting at 2^47.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-11-06 21:56                     ` Rob Herring
@ 2020-11-11 12:00                       ` Jiri Olsa
  2020-11-11 14:50                         ` Rob Herring
  0 siblings, 1 reply; 29+ messages in thread
From: Jiri Olsa @ 2020-11-11 12:00 UTC (permalink / raw)
  To: Rob Herring
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Fri, Nov 06, 2020 at 03:56:11PM -0600, Rob Herring wrote:
> On Thu, Nov 5, 2020 at 4:41 PM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > On Thu, Nov 05, 2020 at 10:19:24AM -0600, Rob Herring wrote:
> >
> > SNIP
> >
> > > > > >
> > > > > > that maps page for each event, then perf_evsel__read
> > > > > > could go through the fast code, no?
> > > > >
> > > > > No, because we're not self-monitoring (pid == 0 and cpu == -1). With
> > > > > the following change:
> > > > >
> > > > > diff --git a/tools/lib/perf/tests/test-evsel.c
> > > > > b/tools/lib/perf/tests/test-evsel.c
> > > > > index eeca8203d73d..1fca9c121f7c 100644
> > > > > --- a/tools/lib/perf/tests/test-evsel.c
> > > > > +++ b/tools/lib/perf/tests/test-evsel.c
> > > > > @@ -17,6 +17,7 @@ static int test_stat_cpu(void)
> > > > >  {
> > > > >         struct perf_cpu_map *cpus;
> > > > >         struct perf_evsel *evsel;
> > > > > +       struct perf_event_mmap_page *pc;
> > > > >         struct perf_event_attr attr = {
> > > > >                 .type   = PERF_TYPE_SOFTWARE,
> > > > >                 .config = PERF_COUNT_SW_CPU_CLOCK,
> > > > > @@ -32,6 +33,15 @@ static int test_stat_cpu(void)
> > > > >         err = perf_evsel__open(evsel, cpus, NULL);
> > > > >         __T("failed to open evsel", err == 0);
> > > > >
> > > > > +       pc = perf_evsel__mmap(evsel, 0);
> > > > > +       __T("failed to mmap evsel", pc);
> > > > > +
> > > > > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > > > > +       __T("userspace counter access not supported", pc->cap_user_rdpmc);
> > > > > +       __T("userspace counter access not enabled", pc->index);
> > > > > +       __T("userspace counter width not set", pc->pmc_width >= 32);
> > > > > +#endif
> > > >
> > > > I'll need to check, I'm surprised this would depend on the way
> > > > you open the event
> > >
> > > Any more thoughts on this?
> >
> > sry I got stuck with other stuff.. I tried your change
> > and pc->cap_user_rdpmc is 0 because the test creates
> > software event, which does not support that
> 
> Sigh, yes, of course.
> 
> > when I change that to:
> >
> >         .type   = PERF_TYPE_HARDWARE,
> >         .config = PERF_COUNT_HW_CPU_CYCLES,
> >
> > I don't see any of those warning you added
> 
> So I've now implemented the per fd mmap. It seems to run and get some
> data, but for the above case the counts don't look right.
> 
> cpu0: count = 0x10883, ena = 0xbf42, run = 0xbf42
> cpu1: count = 0x1bc65, ena = 0xa278, run = 0xa278
> cpu2: count = 0x1fab2, ena = 0x91ea, run = 0x91ea
> cpu3: count = 0x23d61, ena = 0x81ac, run = 0x81ac
> cpu4: count = 0x2936a, ena = 0x7149, run = 0x7149
> cpu5: count = 0x2cd4e, ena = 0x634f, run = 0x634f
> cpu6: count = 0x3139f, ena = 0x53e7, run = 0x53e7
> cpu7: count = 0x35350, ena = 0x4690, run = 0x4690
> 
> For comparison, this is what I get using the slow path read():
> cpu0: count = 0x1c40, ena = 0x188b5, run = 0x188b5
> cpu1: count = 0x18e0, ena = 0x1b8f4, run = 0x1b8f4
> cpu2: count = 0x745e, ena = 0x1ab9e, run = 0x1ab9e
> cpu3: count = 0x2416, ena = 0x1a280, run = 0x1a280
> cpu4: count = 0x19c7, ena = 0x19b00, run = 0x19b00
> cpu5: count = 0x1737, ena = 0x19262, run = 0x19262
> cpu6: count = 0x11d0e, ena = 0x18944, run = 0x18944
> cpu7: count = 0x20dbe, ena = 0x181f4, run = 0x181f4

hum, could you please send/push changes with that test?
I can try it and check

jirka

> 
> The difference is we get a sequentially increasing count rather than 1
> random CPU (the one running the test) with a much higher count. That
> seems to me we're just reading the count for the calling process, not
> each CPU.
> 
> For this to work correctly, cap_user_rdpmc would have to be set only
> for the CPU's mmap that matches the calling process's CPU. I'm not
> sure whether that can be done. Even if it can, is it really worth
> doing so? You're accelerating reading an event on 1 out of N CPUs. And
> what do we do on every kernel up til now this won't work on? Another
> cap bit?
> 
> Rob
> 
> P.S. I did find one bug with all this. The shifts by pmc_width in the
> read seq need to be a signed count. This test happens to have raw
> counter values starting at 2^47.
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
  2020-11-11 12:00                       ` Jiri Olsa
@ 2020-11-11 14:50                         ` Rob Herring
  0 siblings, 0 replies; 29+ messages in thread
From: Rob Herring @ 2020-11-11 14:50 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-arm-kernel, linux-kernel,
	Alexander Shishkin, Namhyung Kim, Raphael Gault, Mark Rutland,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Wed, Nov 11, 2020 at 6:01 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Fri, Nov 06, 2020 at 03:56:11PM -0600, Rob Herring wrote:
> > On Thu, Nov 5, 2020 at 4:41 PM Jiri Olsa <jolsa@redhat.com> wrote:
> > >
> > > On Thu, Nov 05, 2020 at 10:19:24AM -0600, Rob Herring wrote:
> > >
> > > SNIP
> > >
> > > > > > >
> > > > > > > that maps page for each event, then perf_evsel__read
> > > > > > > could go through the fast code, no?
> > > > > >
> > > > > > No, because we're not self-monitoring (pid == 0 and cpu == -1). With
> > > > > > the following change:
> > > > > >
> > > > > > diff --git a/tools/lib/perf/tests/test-evsel.c
> > > > > > b/tools/lib/perf/tests/test-evsel.c
> > > > > > index eeca8203d73d..1fca9c121f7c 100644
> > > > > > --- a/tools/lib/perf/tests/test-evsel.c
> > > > > > +++ b/tools/lib/perf/tests/test-evsel.c
> > > > > > @@ -17,6 +17,7 @@ static int test_stat_cpu(void)
> > > > > >  {
> > > > > >         struct perf_cpu_map *cpus;
> > > > > >         struct perf_evsel *evsel;
> > > > > > +       struct perf_event_mmap_page *pc;
> > > > > >         struct perf_event_attr attr = {
> > > > > >                 .type   = PERF_TYPE_SOFTWARE,
> > > > > >                 .config = PERF_COUNT_SW_CPU_CLOCK,
> > > > > > @@ -32,6 +33,15 @@ static int test_stat_cpu(void)
> > > > > >         err = perf_evsel__open(evsel, cpus, NULL);
> > > > > >         __T("failed to open evsel", err == 0);
> > > > > >
> > > > > > +       pc = perf_evsel__mmap(evsel, 0);
> > > > > > +       __T("failed to mmap evsel", pc);
> > > > > > +
> > > > > > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > > > > > +       __T("userspace counter access not supported", pc->cap_user_rdpmc);
> > > > > > +       __T("userspace counter access not enabled", pc->index);
> > > > > > +       __T("userspace counter width not set", pc->pmc_width >= 32);
> > > > > > +#endif
> > > > >
> > > > > I'll need to check, I'm surprised this would depend on the way
> > > > > you open the event
> > > >
> > > > Any more thoughts on this?
> > >
> > > sry I got stuck with other stuff.. I tried your change
> > > and pc->cap_user_rdpmc is 0 because the test creates
> > > software event, which does not support that
> >
> > Sigh, yes, of course.
> >
> > > when I change that to:
> > >
> > >         .type   = PERF_TYPE_HARDWARE,
> > >         .config = PERF_COUNT_HW_CPU_CYCLES,
> > >
> > > I don't see any of those warning you added
> >
> > So I've now implemented the per fd mmap. It seems to run and get some
> > data, but for the above case the counts don't look right.
> >
> > cpu0: count = 0x10883, ena = 0xbf42, run = 0xbf42
> > cpu1: count = 0x1bc65, ena = 0xa278, run = 0xa278
> > cpu2: count = 0x1fab2, ena = 0x91ea, run = 0x91ea
> > cpu3: count = 0x23d61, ena = 0x81ac, run = 0x81ac
> > cpu4: count = 0x2936a, ena = 0x7149, run = 0x7149
> > cpu5: count = 0x2cd4e, ena = 0x634f, run = 0x634f
> > cpu6: count = 0x3139f, ena = 0x53e7, run = 0x53e7
> > cpu7: count = 0x35350, ena = 0x4690, run = 0x4690
> >
> > For comparison, this is what I get using the slow path read():
> > cpu0: count = 0x1c40, ena = 0x188b5, run = 0x188b5
> > cpu1: count = 0x18e0, ena = 0x1b8f4, run = 0x1b8f4
> > cpu2: count = 0x745e, ena = 0x1ab9e, run = 0x1ab9e
> > cpu3: count = 0x2416, ena = 0x1a280, run = 0x1a280
> > cpu4: count = 0x19c7, ena = 0x19b00, run = 0x19b00
> > cpu5: count = 0x1737, ena = 0x19262, run = 0x19262
> > cpu6: count = 0x11d0e, ena = 0x18944, run = 0x18944
> > cpu7: count = 0x20dbe, ena = 0x181f4, run = 0x181f4
>
> hum, could you please send/push changes with that test?
> I can try it and check

Here you go:

git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git user-perf-event-v5

Just comment out the mmap parts to get original behavior.

Rob

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8
  2020-10-01 14:01 ` [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8 Rob Herring
@ 2020-11-13 18:06   ` Mark Rutland
  2020-11-19 18:35     ` Rob Herring
  2020-11-19 19:15     ` Will Deacon
  0 siblings, 2 replies; 29+ messages in thread
From: Mark Rutland @ 2020-11-13 18:06 UTC (permalink / raw)
  To: Rob Herring, Will Deacon
  Cc: Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-arm-kernel,
	linux-kernel, Alexander Shishkin, Namhyung Kim, Raphael Gault,
	Jonathan Cameron, Ian Rogers, honnappa.nagarahalli,
	Itaru Kitayama

Hi Rob,

Thanks for this, and sorry for the long delay since this was last
reviewed. Overall this is looking pretty good, but I have a couple of
remaining concerns.

Will, I have a query for you below.

On Thu, Oct 01, 2020 at 09:01:09AM -0500, Rob Herring wrote:
> From: Raphael Gault <raphael.gault@arm.com>
> 
> Keep track of event opened with direct access to the hardware counters
> and modify permissions while they are open.
> 
> The strategy used here is the same which x86 uses: everytime an event
> is mapped, the permissions are set if required. The atomic field added
> in the mm_context helps keep track of the different event opened and
> de-activate the permissions when all are unmapped.
> We also need to update the permissions in the context switch code so
> that tasks keep the right permissions.
> 
> Signed-off-by: Raphael Gault <raphael.gault@arm.com>
> Signed-off-by: Rob Herring <robh@kernel.org>
> ---
> v2:
>  - Move mapped/unmapped into arm64 code. Fixes arm32.
>  - Rebase on cap_user_time_short changes
> 
> Changes from Raphael's v4:
>   - Drop homogeneous check
>   - Disable access for chained counters
>   - Set pmc_width in user page
> ---
>  arch/arm64/include/asm/mmu.h         |  5 ++++
>  arch/arm64/include/asm/mmu_context.h |  2 ++
>  arch/arm64/include/asm/perf_event.h  | 14 ++++++++++
>  arch/arm64/kernel/perf_event.c       | 41 ++++++++++++++++++++++++++++
>  4 files changed, 62 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index a7a5ecaa2e83..52cfdb676f06 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -19,6 +19,11 @@
>  
>  typedef struct {
>  	atomic64_t	id;
> +	/*
> +	 * non-zero if userspace have access to hardware
> +	 * counters directly.
> +	 */
> +	atomic_t	pmu_direct_access;
>  #ifdef CONFIG_COMPAT
>  	void		*sigpage;
>  #endif
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index f2d7537d6f83..d24589ecb07a 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -21,6 +21,7 @@
>  #include <asm/proc-fns.h>
>  #include <asm-generic/mm_hooks.h>
>  #include <asm/cputype.h>
> +#include <asm/perf_event.h>
>  #include <asm/sysreg.h>
>  #include <asm/tlbflush.h>
>  
> @@ -224,6 +225,7 @@ static inline void __switch_mm(struct mm_struct *next)
>  	}
>  
>  	check_and_switch_context(next);
> +	perf_switch_user_access(next);
>  }
>  
>  static inline void
> diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
> index 2c2d7dbe8a02..a025d9595d51 100644
> --- a/arch/arm64/include/asm/perf_event.h
> +++ b/arch/arm64/include/asm/perf_event.h
> @@ -8,6 +8,7 @@
>  
>  #include <asm/stack_pointer.h>
>  #include <asm/ptrace.h>
> +#include <linux/mm_types.h>
>  
>  #define	ARMV8_PMU_MAX_COUNTERS	32
>  #define	ARMV8_PMU_COUNTER_MASK	(ARMV8_PMU_MAX_COUNTERS - 1)
> @@ -251,4 +252,17 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
>  	(regs)->pstate = PSR_MODE_EL1h;	\
>  }
>  
> +static inline void perf_switch_user_access(struct mm_struct *mm)
> +{
> +	if (!IS_ENABLED(CONFIG_PERF_EVENTS))
> +		return;
> +
> +	if (atomic_read(&mm->context.pmu_direct_access)) {
> +		write_sysreg(ARMV8_PMU_USERENR_ER|ARMV8_PMU_USERENR_CR,
> +			     pmuserenr_el0);
> +	} else {
> +		write_sysreg(0, pmuserenr_el0);
> +	}
> +}

PMUSERENR.ER gives RW access to PMSELR_EL0. While we no longer use
PMSELR_EL0 in the kernel, we can preempt and migrate userspace tasks
between homogeneous CPUs, and presumably need to context-switch this
with the task (like we do for TPIDR_EL0 and friends), or clear the
register on context-switch to prevent it becoming an unintended covert
channel.

These bits also enable AArch32 access. Is there any way an AArch32 task
can enable this? If so we should probably block that given we do not
support this interface on 32-bit arm.

> +
>  #endif
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index e14f360a7883..8344dc030823 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -834,6 +834,41 @@ static int armv8pmu_access_event_idx(struct perf_event *event)
>  	return event->hw.idx;
>  }
>  
> +static void refresh_pmuserenr(void *mm)
> +{
> +	perf_switch_user_access(mm);
> +}
> +
> +static void armv8pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> +{
> +	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> +		return;
> +
> +	/*
> +	 * This function relies on not being called concurrently in two
> +	 * tasks in the same mm.  Otherwise one task could observe
> +	 * pmu_direct_access > 1 and return all the way back to
> +	 * userspace with user access disabled while another task is still
> +	 * doing on_each_cpu_mask() to enable user access.
> +	 *
> +	 * For now, this can't happen because all callers hold mmap_lock
> +	 * for write.  If this changes, we'll need a different solution.
> +	 */
> +	lockdep_assert_held_write(&mm->mmap_lock);
> +
> +	if (atomic_inc_return(&mm->context.pmu_direct_access) == 1)
> +		on_each_cpu(refresh_pmuserenr, mm, 1);
> +}
> +
> +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> +{
> +	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> +		return;
> +
> +	if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> +		on_each_cpu_mask(mm_cpumask(mm), refresh_pmuserenr, NULL, 1);
> +}

I didn't think we kept our mm_cpumask() up-to-date in all cases on
arm64, so I'm not sure we can use it like this.

Will, can you confirm either way?

Thanks,
Mark.

> +
>  /*
>   * Add an event filter to a given event.
>   */
> @@ -1058,6 +1093,8 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
>  	cpu_pmu->filter_match		= armv8pmu_filter_match;
>  
>  	cpu_pmu->pmu.event_idx		= armv8pmu_access_event_idx;
> +	cpu_pmu->pmu.event_mapped	= armv8pmu_event_mapped;
> +	cpu_pmu->pmu.event_unmapped	= armv8pmu_event_unmapped;
>  
>  	cpu_pmu->name			= name;
>  	cpu_pmu->map_event		= map_event;
> @@ -1218,6 +1255,10 @@ void arch_perf_update_userpage(struct perf_event *event,
>  	userpg->cap_user_time = 0;
>  	userpg->cap_user_time_zero = 0;
>  	userpg->cap_user_time_short = 0;
> +	userpg->cap_user_rdpmc = !!(event->hw.flags & ARMPMU_EL0_RD_CNTR);
> +
> +	if (userpg->cap_user_rdpmc)
> +		userpg->pmc_width = armv8pmu_event_is_64bit(event) ? 64 : 32;
>  
>  	do {
>  		rd = sched_clock_read_begin(&seq);
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8
  2020-11-13 18:06   ` Mark Rutland
@ 2020-11-19 18:35     ` Rob Herring
  2020-11-19 19:15     ` Will Deacon
  1 sibling, 0 replies; 29+ messages in thread
From: Rob Herring @ 2020-11-19 18:35 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Will Deacon, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-arm-kernel,
	linux-kernel, Alexander Shishkin, Namhyung Kim, Raphael Gault,
	Jonathan Cameron, Ian Rogers, Honnappa Nagarahalli,
	Itaru Kitayama

On Fri, Nov 13, 2020 at 12:06 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Hi Rob,
>
> Thanks for this, and sorry for the long delay since this was last
> reviewed. Overall this is looking pretty good, but I have a couple of
> remaining concerns.
>
> Will, I have a query for you below.
>
> On Thu, Oct 01, 2020 at 09:01:09AM -0500, Rob Herring wrote:
> > From: Raphael Gault <raphael.gault@arm.com>
> >
> > Keep track of event opened with direct access to the hardware counters
> > and modify permissions while they are open.
> >
> > The strategy used here is the same which x86 uses: everytime an event
> > is mapped, the permissions are set if required. The atomic field added
> > in the mm_context helps keep track of the different event opened and
> > de-activate the permissions when all are unmapped.
> > We also need to update the permissions in the context switch code so
> > that tasks keep the right permissions.
> >
> > Signed-off-by: Raphael Gault <raphael.gault@arm.com>
> > Signed-off-by: Rob Herring <robh@kernel.org>
> > ---
> > v2:
> >  - Move mapped/unmapped into arm64 code. Fixes arm32.
> >  - Rebase on cap_user_time_short changes
> >
> > Changes from Raphael's v4:
> >   - Drop homogeneous check
> >   - Disable access for chained counters
> >   - Set pmc_width in user page
> > ---
> >  arch/arm64/include/asm/mmu.h         |  5 ++++
> >  arch/arm64/include/asm/mmu_context.h |  2 ++
> >  arch/arm64/include/asm/perf_event.h  | 14 ++++++++++
> >  arch/arm64/kernel/perf_event.c       | 41 ++++++++++++++++++++++++++++
> >  4 files changed, 62 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> > index a7a5ecaa2e83..52cfdb676f06 100644
> > --- a/arch/arm64/include/asm/mmu.h
> > +++ b/arch/arm64/include/asm/mmu.h
> > @@ -19,6 +19,11 @@
> >
> >  typedef struct {
> >       atomic64_t      id;
> > +     /*
> > +      * non-zero if userspace have access to hardware
> > +      * counters directly.
> > +      */
> > +     atomic_t        pmu_direct_access;
> >  #ifdef CONFIG_COMPAT
> >       void            *sigpage;
> >  #endif
> > diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> > index f2d7537d6f83..d24589ecb07a 100644
> > --- a/arch/arm64/include/asm/mmu_context.h
> > +++ b/arch/arm64/include/asm/mmu_context.h
> > @@ -21,6 +21,7 @@
> >  #include <asm/proc-fns.h>
> >  #include <asm-generic/mm_hooks.h>
> >  #include <asm/cputype.h>
> > +#include <asm/perf_event.h>
> >  #include <asm/sysreg.h>
> >  #include <asm/tlbflush.h>
> >
> > @@ -224,6 +225,7 @@ static inline void __switch_mm(struct mm_struct *next)
> >       }
> >
> >       check_and_switch_context(next);
> > +     perf_switch_user_access(next);
> >  }
> >
> >  static inline void
> > diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
> > index 2c2d7dbe8a02..a025d9595d51 100644
> > --- a/arch/arm64/include/asm/perf_event.h
> > +++ b/arch/arm64/include/asm/perf_event.h
> > @@ -8,6 +8,7 @@
> >
> >  #include <asm/stack_pointer.h>
> >  #include <asm/ptrace.h>
> > +#include <linux/mm_types.h>
> >
> >  #define      ARMV8_PMU_MAX_COUNTERS  32
> >  #define      ARMV8_PMU_COUNTER_MASK  (ARMV8_PMU_MAX_COUNTERS - 1)
> > @@ -251,4 +252,17 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
> >       (regs)->pstate = PSR_MODE_EL1h; \
> >  }
> >
> > +static inline void perf_switch_user_access(struct mm_struct *mm)
> > +{
> > +     if (!IS_ENABLED(CONFIG_PERF_EVENTS))
> > +             return;
> > +
> > +     if (atomic_read(&mm->context.pmu_direct_access)) {
> > +             write_sysreg(ARMV8_PMU_USERENR_ER|ARMV8_PMU_USERENR_CR,
> > +                          pmuserenr_el0);
> > +     } else {
> > +             write_sysreg(0, pmuserenr_el0);
> > +     }
> > +}
>
> PMUSERENR.ER gives RW access to PMSELR_EL0. While we no longer use
> PMSELR_EL0 in the kernel, we can preempt and migrate userspace tasks
> between homogeneous CPUs, and presumably need to context-switch this
> with the task (like we do for TPIDR_EL0 and friends), or clear the
> register on context-switch to prevent it becoming an unintended covert
> channel.

Humm, now that I've read up on PMSELR_EL0 I'm now wondering if I
should be using PMSELR_EL0 in libperf. If you look at patch 7, the
counter read is pretty ugly because there's 32 possible mrs
instructions. If PMSELR_EL0 is used, we can have a single read path.
It's a msr and mrs vs. a function ptr load, branch/ret, and mrs. I'd
guess there's no guarantees on system reg access times, but I'd guess
typically the former is more optimal? It certainly simplifies the code
which I'd rather have given the limited users.

If I go that route and we don't context switch PMSELR_EL0, reads of
PMXEVCNTR_EL0 could be stale. But does that matter? No, because
reading PMEVCNTR<n>_EL0 can already be stale and the seq counter will
catch that.

> These bits also enable AArch32 access. Is there any way an AArch32 task
> can enable this? If so we should probably block that given we do not
> support this interface on 32-bit arm.

I'd assume this works for AArch32 given we don't do anything here to
prevent it. I suppose we could look at MMCF_AARCH32 flag in
mm_context_t? But is not implemented for arch/arm/ really a reason to
disable?

Rob

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8
  2020-11-13 18:06   ` Mark Rutland
  2020-11-19 18:35     ` Rob Herring
@ 2020-11-19 19:15     ` Will Deacon
  2020-11-20 20:03       ` Rob Herring
  1 sibling, 1 reply; 29+ messages in thread
From: Will Deacon @ 2020-11-19 19:15 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Rob Herring, Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-arm-kernel,
	linux-kernel, Alexander Shishkin, Namhyung Kim, Raphael Gault,
	Jonathan Cameron, Ian Rogers, honnappa.nagarahalli,
	Itaru Kitayama

On Fri, Nov 13, 2020 at 06:06:33PM +0000, Mark Rutland wrote:
> On Thu, Oct 01, 2020 at 09:01:09AM -0500, Rob Herring wrote:
> > +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > +{
> > +	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > +		return;
> > +
> > +	if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > +		on_each_cpu_mask(mm_cpumask(mm), refresh_pmuserenr, NULL, 1);
> > +}
> 
> I didn't think we kept our mm_cpumask() up-to-date in all cases on
> arm64, so I'm not sure we can use it like this.
> 
> Will, can you confirm either way?

We don't update mm_cpumask() as the cost of the atomic showed up in some
benchmarks I did years ago and we've never had any need for the thing anyway
because out TLB invalidation is one or all.

Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8
  2020-11-19 19:15     ` Will Deacon
@ 2020-11-20 20:03       ` Rob Herring
  2020-11-20 22:08         ` Rob Herring
  2020-12-02 14:57         ` Rob Herring
  0 siblings, 2 replies; 29+ messages in thread
From: Rob Herring @ 2020-11-20 20:03 UTC (permalink / raw)
  To: Will Deacon, Mark Rutland
  Cc: Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-arm-kernel,
	linux-kernel, Alexander Shishkin, Namhyung Kim, Raphael Gault,
	Jonathan Cameron, Ian Rogers, honnappa.nagarahalli,
	Itaru Kitayama

On Thu, Nov 19, 2020 at 07:15:15PM +0000, Will Deacon wrote:
> On Fri, Nov 13, 2020 at 06:06:33PM +0000, Mark Rutland wrote:
> > On Thu, Oct 01, 2020 at 09:01:09AM -0500, Rob Herring wrote:
> > > +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > > +{
> > > +	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > +		return;
> > > +
> > > +	if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > > +		on_each_cpu_mask(mm_cpumask(mm), refresh_pmuserenr, NULL, 1);
> > > +}
> > 
> > I didn't think we kept our mm_cpumask() up-to-date in all cases on
> > arm64, so I'm not sure we can use it like this.
> > 
> > Will, can you confirm either way?
> 
> We don't update mm_cpumask() as the cost of the atomic showed up in some
> benchmarks I did years ago and we've never had any need for the thing anyway
> because out TLB invalidation is one or all.

That's good because we're also passing NULL instead of mm which would 
crash. So it must be more than it's not up to date, but it's always 0. 
It looks like event_mapped on x86 uses mm_cpumask(mm) which I guess was 
dropped when copying this code as it didn't work... For reference, the 
x86 version of this originated in commit 7911d3f7af14a6.

I'm not clear on why we need to update pmuserenr_el0 here anyways. To 
get here userspace has to mmap the event and then unmmap it. If we did 
nothing, then counter accesses would not fault until the next context 
switch.

If you all have any ideas, I'm all ears. I'm not a scheduler nor perf 
hacker. ;)

Rob


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8
  2020-11-20 20:03       ` Rob Herring
@ 2020-11-20 22:08         ` Rob Herring
  2020-12-02 14:57         ` Rob Herring
  1 sibling, 0 replies; 29+ messages in thread
From: Rob Herring @ 2020-11-20 22:08 UTC (permalink / raw)
  To: Will Deacon, Mark Rutland
  Cc: Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-arm-kernel,
	linux-kernel, Alexander Shishkin, Namhyung Kim, Raphael Gault,
	Jonathan Cameron, Ian Rogers, honnappa.nagarahalli,
	Itaru Kitayama

On Fri, Nov 20, 2020 at 02:03:45PM -0600, Rob Herring wrote:
> On Thu, Nov 19, 2020 at 07:15:15PM +0000, Will Deacon wrote:
> > On Fri, Nov 13, 2020 at 06:06:33PM +0000, Mark Rutland wrote:
> > > On Thu, Oct 01, 2020 at 09:01:09AM -0500, Rob Herring wrote:
> > > > +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > > > +{
> > > > +	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > > +		return;
> > > > +
> > > > +	if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > > > +		on_each_cpu_mask(mm_cpumask(mm), refresh_pmuserenr, NULL, 1);
> > > > +}
> > > 
> > > I didn't think we kept our mm_cpumask() up-to-date in all cases on
> > > arm64, so I'm not sure we can use it like this.
> > > 
> > > Will, can you confirm either way?
> > 
> > We don't update mm_cpumask() as the cost of the atomic showed up in some
> > benchmarks I did years ago and we've never had any need for the thing anyway
> > because out TLB invalidation is one or all.
> 
> That's good because we're also passing NULL instead of mm which would 
> crash. So it must be more than it's not up to date, but it's always 0. 
> It looks like event_mapped on x86 uses mm_cpumask(mm) which I guess was 
> dropped when copying this code as it didn't work... For reference, the 
> x86 version of this originated in commit 7911d3f7af14a6.
> 
> I'm not clear on why we need to update pmuserenr_el0 here anyways. To 
> get here userspace has to mmap the event and then unmmap it. If we did 
> nothing, then counter accesses would not fault until the next context 
> switch.
> 
> If you all have any ideas, I'm all ears. I'm not a scheduler nor perf 
> hacker. ;)

Here's another issue that needs addressing:

https://lore.kernel.org/lkml/20200821195754.20159-3-kan.liang@linux.intel.com/

Rob

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8
  2020-11-20 20:03       ` Rob Herring
  2020-11-20 22:08         ` Rob Herring
@ 2020-12-02 14:57         ` Rob Herring
  2021-01-07  0:17           ` Rob Herring
  1 sibling, 1 reply; 29+ messages in thread
From: Rob Herring @ 2020-12-02 14:57 UTC (permalink / raw)
  To: Will Deacon, Mark Rutland
  Cc: Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-arm-kernel,
	linux-kernel, Alexander Shishkin, Namhyung Kim, Raphael Gault,
	Jonathan Cameron, Ian Rogers, honnappa.nagarahalli,
	Itaru Kitayama

On Fri, Nov 20, 2020 at 02:03:45PM -0600, Rob Herring wrote:
> On Thu, Nov 19, 2020 at 07:15:15PM +0000, Will Deacon wrote:
> > On Fri, Nov 13, 2020 at 06:06:33PM +0000, Mark Rutland wrote:
> > > On Thu, Oct 01, 2020 at 09:01:09AM -0500, Rob Herring wrote:
> > > > +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > > > +{
> > > > +	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > > +		return;
> > > > +
> > > > +	if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > > > +		on_each_cpu_mask(mm_cpumask(mm), refresh_pmuserenr, NULL, 1);
> > > > +}
> > > 
> > > I didn't think we kept our mm_cpumask() up-to-date in all cases on
> > > arm64, so I'm not sure we can use it like this.
> > > 
> > > Will, can you confirm either way?
> > 
> > We don't update mm_cpumask() as the cost of the atomic showed up in some
> > benchmarks I did years ago and we've never had any need for the thing anyway
> > because out TLB invalidation is one or all.
> 
> That's good because we're also passing NULL instead of mm which would 
> crash. So it must be more than it's not up to date, but it's always 0. 
> It looks like event_mapped on x86 uses mm_cpumask(mm) which I guess was 
> dropped when copying this code as it didn't work... For reference, the 
> x86 version of this originated in commit 7911d3f7af14a6.
> 
> I'm not clear on why we need to update pmuserenr_el0 here anyways. To 
> get here userspace has to mmap the event and then unmmap it. If we did 
> nothing, then counter accesses would not fault until the next context 
> switch.
> 
> If you all have any ideas, I'm all ears. I'm not a scheduler nor perf 
> hacker. ;)

Mark, Will, any thoughts on this?

Rob

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8
  2020-12-02 14:57         ` Rob Herring
@ 2021-01-07  0:17           ` Rob Herring
  0 siblings, 0 replies; 29+ messages in thread
From: Rob Herring @ 2021-01-07  0:17 UTC (permalink / raw)
  To: Will Deacon, Mark Rutland
  Cc: Catalin Marinas, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, linux-arm-kernel,
	linux-kernel, Alexander Shishkin, Namhyung Kim, Raphael Gault,
	Jonathan Cameron, Ian Rogers, honnappa.nagarahalli,
	Itaru Kitayama

On Wed, Dec 02, 2020 at 07:57:47AM -0700, Rob Herring wrote:
> On Fri, Nov 20, 2020 at 02:03:45PM -0600, Rob Herring wrote:
> > On Thu, Nov 19, 2020 at 07:15:15PM +0000, Will Deacon wrote:
> > > On Fri, Nov 13, 2020 at 06:06:33PM +0000, Mark Rutland wrote:
> > > > On Thu, Oct 01, 2020 at 09:01:09AM -0500, Rob Herring wrote:
> > > > > +static void armv8pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > > > > +{
> > > > > +	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > > > +		return;
> > > > > +
> > > > > +	if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > > > > +		on_each_cpu_mask(mm_cpumask(mm), refresh_pmuserenr, NULL, 1);
> > > > > +}

Bump on this again... :)

> > > > 
> > > > I didn't think we kept our mm_cpumask() up-to-date in all cases on
> > > > arm64, so I'm not sure we can use it like this.
> > > > 
> > > > Will, can you confirm either way?
> > > 
> > > We don't update mm_cpumask() as the cost of the atomic showed up in some
> > > benchmarks I did years ago and we've never had any need for the thing anyway
> > > because out TLB invalidation is one or all.
> > 
> > That's good because we're also passing NULL instead of mm which would 
> > crash. So it must be more than it's not up to date, but it's always 0. 
> > It looks like event_mapped on x86 uses mm_cpumask(mm) which I guess was 
> > dropped when copying this code as it didn't work... For reference, the 
> > x86 version of this originated in commit 7911d3f7af14a6.
> > 
> > I'm not clear on why we need to update pmuserenr_el0 here anyways. To 
> > get here userspace has to mmap the event and then unmmap it. If we did 
> > nothing, then counter accesses would not fault until the next context 
> > switch.

Okay, I've come up with a test case where I can trigger this. It's a bit 
convoluted IMO where the thread doing the mmap is different thread from 
reading the counter. Seems like it would be better if we just disabled 
user access if we're not doing calling thread monitoring. We could 
always loosen that restriction later. x86 OTOH was wide open with access 
globally enabled and this hunk of code was part of restricting it some.

> > 
> > If you all have any ideas, I'm all ears. I'm not a scheduler nor perf 
> > hacker. ;)
> 
> Mark, Will, any thoughts on this?

Any reason, this would not work:

static void refresh_pmuserenr(void *mm)
{
	if (mm == current->active_mm)
		perf_switch_user_access(mm);
}

The downside is we'd be doing an IPI on *all* cores for a PMU, not just 
the ones in mm_cpumask() (if that was accurate). But this isn't a fast 
path.

Rob

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2021-01-07  0:18 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-01 14:01 [PATCH v4 0/9] libperf and arm64 userspace counter access support Rob Herring
2020-10-01 14:01 ` [PATCH v4 1/9] arm64: pmu: Add function implementation to update event index in userpage Rob Herring
2020-10-01 14:01 ` [PATCH v4 2/9] arm64: perf: Enable pmu counter direct access for perf event on armv8 Rob Herring
2020-11-13 18:06   ` Mark Rutland
2020-11-19 18:35     ` Rob Herring
2020-11-19 19:15     ` Will Deacon
2020-11-20 20:03       ` Rob Herring
2020-11-20 22:08         ` Rob Herring
2020-12-02 14:57         ` Rob Herring
2021-01-07  0:17           ` Rob Herring
2020-10-01 14:01 ` [PATCH v4 3/9] tools/include: Add an initial math64.h Rob Herring
2020-10-01 14:01 ` [PATCH v4 4/9] libperf: Add libperf_evsel__mmap() Rob Herring
2020-10-14 11:05   ` Jiri Olsa
2020-10-16 21:39     ` Rob Herring
2020-10-19 20:15       ` Jiri Olsa
2020-10-20 14:38         ` Rob Herring
2020-10-20 15:35           ` Jiri Olsa
2020-10-20 17:11             ` Rob Herring
2020-10-21 11:24               ` Jiri Olsa
2020-11-05 16:19                 ` Rob Herring
2020-11-05 22:41                   ` Jiri Olsa
2020-11-06 21:56                     ` Rob Herring
2020-11-11 12:00                       ` Jiri Olsa
2020-11-11 14:50                         ` Rob Herring
2020-10-01 14:01 ` [PATCH v4 5/9] libperf: tests: Add support for verbose printing Rob Herring
2020-10-01 14:01 ` [PATCH v4 6/9] libperf: Add support for user space counter access Rob Herring
2020-10-01 14:01 ` [PATCH v4 7/9] libperf: Add arm64 support to perf_mmap__read_self() Rob Herring
2020-10-01 14:01 ` [PATCH v4 8/9] perf: arm64: Add test for userspace counter access on heterogeneous systems Rob Herring
2020-10-01 14:01 ` [PATCH v4 9/9] Documentation: arm64: Document PMU counters access from userspace Rob Herring

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).