linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
@ 2017-06-05 15:22 Will Deacon
  2017-06-05 15:22 ` [PATCH v4 1/5] genirq: export irq_get_percpu_devid_partition to modules Will Deacon
                   ` (5 more replies)
  0 siblings, 6 replies; 33+ messages in thread
From: Will Deacon @ 2017-06-05 15:22 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: marc.zyngier, mark.rutland, kim.phillips, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel, Will Deacon

Hi all,

This is the sixth posting of the patches previously posted here:

  rfcv1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/476450.html
  rfcv2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/479387.html
     v1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/483684.html
     v2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-April/499938.html
     v3: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-May/507132.html

The main change since v3 is that I have reworked and fixed the CPU hotplug
and notifier bits, in light of review comments from tglx.

The architecture documentation is available here:

  https://developer.arm.com/products/architecture/a-profile/docs/ddi0586/latest/arm-architecture-reference-manual-supplement-statistical-profiling-extension-for-armv8-a

and there's a high-level overview on this official ARM blog:

  https://community.arm.com/processors/b/blog/posts/statistical-profiling-extension-for-armv8-a

All comments welcome,

Will

Will Deacon (5):
  genirq: export irq_get_percpu_devid_partition to modules
  perf/core: Export AUX buffer helpers to modules
  perf/core: Add PERF_AUX_FLAG_COLLISION to report colliding samples
  drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension
  dt-bindings: Document devicetree binding for ARM SPE

 Documentation/devicetree/bindings/arm/spe-pmu.txt |   20 +
 drivers/perf/Kconfig                              |    8 +
 drivers/perf/Makefile                             |    1 +
 drivers/perf/arm_spe_pmu.c                        | 1243 +++++++++++++++++++++
 include/uapi/linux/perf_event.h                   |    1 +
 kernel/events/ring_buffer.c                       |    4 +
 kernel/irq/irqdesc.c                              |    1 +
 7 files changed, 1278 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/spe-pmu.txt
 create mode 100644 drivers/perf/arm_spe_pmu.c

-- 
2.1.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v4 1/5] genirq: export irq_get_percpu_devid_partition to modules
  2017-06-05 15:22 [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Will Deacon
@ 2017-06-05 15:22 ` Will Deacon
  2017-06-05 15:22 ` [PATCH v4 2/5] perf/core: Export AUX buffer helpers " Will Deacon
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2017-06-05 15:22 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: marc.zyngier, mark.rutland, kim.phillips, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel, Will Deacon

Any modular driver using cluster-affine PPIs needs to be able to call
irq_get_percpu_devid_partition so that it can enable the IRQ on the
correct subset of CPUs.

This patch exports the symbol so that it can be called from within a
module.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/irq/irqdesc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 00bb0aeea1d0..1e6ae73eae59 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -856,6 +856,7 @@ int irq_get_percpu_devid_partition(unsigned int irq, struct cpumask *affinity)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(irq_get_percpu_devid_partition);
 
 void kstat_incr_irq_this_cpu(unsigned int irq)
 {
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 2/5] perf/core: Export AUX buffer helpers to modules
  2017-06-05 15:22 [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Will Deacon
  2017-06-05 15:22 ` [PATCH v4 1/5] genirq: export irq_get_percpu_devid_partition to modules Will Deacon
@ 2017-06-05 15:22 ` Will Deacon
  2017-06-05 15:22 ` [PATCH v4 3/5] perf/core: Add PERF_AUX_FLAG_COLLISION to report colliding samples Will Deacon
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2017-06-05 15:22 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: marc.zyngier, mark.rutland, kim.phillips, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel, Will Deacon

Perf PMU drivers using AUX buffers cannot be built as modules unless
the AUX helpers are exported.

This patch exports perf_aux_output_{begin,end,skip} and perf_get_aux to
modules.

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/events/ring_buffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 2831480c63a2..cd5e902a27ac 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -411,6 +411,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(perf_aux_output_begin);
 
 /*
  * Commit the data written by hardware into the ring buffer by adjusting
@@ -470,6 +471,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size)
 	rb_free_aux(rb);
 	ring_buffer_put(rb);
 }
+EXPORT_SYMBOL_GPL(perf_aux_output_end);
 
 /*
  * Skip over a given number of bytes in the AUX buffer, due to, for example,
@@ -498,6 +500,7 @@ int perf_aux_output_skip(struct perf_output_handle *handle, unsigned long size)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(perf_aux_output_skip);
 
 void *perf_get_aux(struct perf_output_handle *handle)
 {
@@ -507,6 +510,7 @@ void *perf_get_aux(struct perf_output_handle *handle)
 
 	return handle->rb->aux_priv;
 }
+EXPORT_SYMBOL_GPL(perf_get_aux);
 
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 3/5] perf/core: Add PERF_AUX_FLAG_COLLISION to report colliding samples
  2017-06-05 15:22 [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Will Deacon
  2017-06-05 15:22 ` [PATCH v4 1/5] genirq: export irq_get_percpu_devid_partition to modules Will Deacon
  2017-06-05 15:22 ` [PATCH v4 2/5] perf/core: Export AUX buffer helpers " Will Deacon
@ 2017-06-05 15:22 ` Will Deacon
  2017-06-05 15:22 ` [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension Will Deacon
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2017-06-05 15:22 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: marc.zyngier, mark.rutland, kim.phillips, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel, Will Deacon

The ARM SPE architecture permits an implementation to ignore a sample
if the sample is due to be taken whilst another sample is already being
produced. In this case, it is desirable to report the collision to
userspace, as they may want to lower the sample period.

This patch adds a PERF_AUX_FLAG_COLLISION flag, so that such events can
be relayed to userspace.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/uapi/linux/perf_event.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b1c0b187acfe..157034597d21 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -916,6 +916,7 @@ enum perf_callchain_context {
 #define PERF_AUX_FLAG_TRUNCATED		0x01	/* record was truncated to fit */
 #define PERF_AUX_FLAG_OVERWRITE		0x02	/* snapshot from overwrite mode */
 #define PERF_AUX_FLAG_PARTIAL		0x04	/* record contains gaps */
+#define PERF_AUX_FLAG_COLLISION		0x08	/* sample collided with another */
 
 #define PERF_FLAG_FD_NO_GROUP		(1UL << 0)
 #define PERF_FLAG_FD_OUTPUT		(1UL << 1)
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension
  2017-06-05 15:22 [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Will Deacon
                   ` (2 preceding siblings ...)
  2017-06-05 15:22 ` [PATCH v4 3/5] perf/core: Add PERF_AUX_FLAG_COLLISION to report colliding samples Will Deacon
@ 2017-06-05 15:22 ` Will Deacon
  2017-06-05 15:55   ` Kim Phillips
                     ` (2 more replies)
  2017-06-05 15:22 ` [PATCH v4 5/5] dt-bindings: Document devicetree binding for ARM SPE Will Deacon
  2017-06-12 11:08 ` [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Mark Rutland
  5 siblings, 3 replies; 33+ messages in thread
From: Will Deacon @ 2017-06-05 15:22 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: marc.zyngier, mark.rutland, kim.phillips, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel, Will Deacon

The ARMv8.2 architecture introduces the optional Statistical Profiling
Extension (SPE).

SPE can be used to profile a population of operations in the CPU pipeline
after instruction decode. These are either architected instructions (i.e.
a dynamic instruction trace) or CPU-specific uops and the choice is fixed
statically in the hardware and advertised to userspace via caps/. Sampling
is controlled using a sampling interval, similar to a regular PMU counter,
but also with an optional random perturbation to avoid falling into patterns
where you continuously profile the same instruction in a hot loop.

After each operation is decoded, the interval counter is decremented. When
it hits zero, an operation is chosen for profiling and tracked within the
pipeline until it retires. Along the way, information such as TLB lookups,
cache misses, time spent to issue etc is captured in the form of a sample.
The sample is then filtered according to certain criteria (e.g. load
latency) that can be specified in the event config (described under
format/) and, if the sample satisfies the filter, it is written out to
memory as a record, otherwise it is discarded. Only one operation can
be sampled at a time.

The in-memory buffer is linear and virtually addressed, raising an
interrupt when it fills up. The PMU driver handles these interrupts to
give the appearance of a ring buffer, as expected by the AUX code.

The in-memory trace-like format is self-describing (though not parseable
in reverse) and written as a series of records, with each record
corresponding to a sample and consisting of a sequence of packets. These
packets are defined by the architecture, although some have CPU-specific
fields for recording information specific to the microarchitecture.

As a simple example, a record generated for a branch instruction may
consist of the following packets:

  0 (Address) : Virtual PC of the branch instruction
  1 (Type)    : Conditional direct branch
  2 (Counter) : Number of cycles taken from Dispatch to Issue
  3 (Address) : Virtual branch target + condition flags
  4 (Counter) : Number of cycles taken from Dispatch to Complete
  5 (Events)  : Mispredicted as not-taken
  6 (END)     : End of record

It is also possible to toggle properties such as timestamp packets in
each record.

This patch adds support for SPE in the form of a new perf driver.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/perf/Kconfig       |    8 +
 drivers/perf/Makefile      |    1 +
 drivers/perf/arm_spe_pmu.c | 1243 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1252 insertions(+)
 create mode 100644 drivers/perf/arm_spe_pmu.c

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index aa587edaf9ea..2e24b9c5744e 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -42,4 +42,12 @@ config XGENE_PMU
         help
           Say y if you want to use APM X-Gene SoC performance monitors.
 
+config ARM_SPE_PMU
+	tristate "Enable support for the ARMv8.2 Statistical Profiling Extension"
+	depends on PERF_EVENTS && ARM64
+	help
+	  Enable perf support for the ARMv8.2 Statistical Profiling
+	  Extension, which provides periodic sampling of operations in
+	  the CPU pipeline and reports this via the perf AUX interface.
+
 endmenu
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 6420bd4394d5..eaee60cf4b1b 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_ARM_PMU_ACPI) += arm_pmu_acpi.o
 obj-$(CONFIG_QCOM_L2_PMU)	+= qcom_l2_pmu.o
 obj-$(CONFIG_QCOM_L3_PMU) += qcom_l3_pmu.o
 obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
+obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
new file mode 100644
index 000000000000..a271300ad27d
--- /dev/null
+++ b/drivers/perf/arm_spe_pmu.c
@@ -0,0 +1,1243 @@
+/*
+ * Perf support for the Statistical Profiling Extension, introduced as
+ * part of ARMv8.2.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ *
+ * Copyright (C) 2016 ARM Limited
+ *
+ * Author: Will Deacon <will.deacon@arm.com>
+ */
+
+#define PMUNAME				"arm_spe"
+#define DRVNAME				PMUNAME "_pmu"
+#define pr_fmt(fmt)			DRVNAME ": " fmt
+
+#include <linux/cpuhotplug.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/module.h>
+#include <linux/of_address.h>
+#include <linux/of_device.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+
+#include <asm/sysreg.h>
+
+/* ID registers */
+#define PMSIDR_EL1			sys_reg(3, 0, 9, 9, 7)
+#define PMSIDR_EL1_FE_SHIFT		0
+#define PMSIDR_EL1_FT_SHIFT		1
+#define PMSIDR_EL1_FL_SHIFT		2
+#define PMSIDR_EL1_ARCHINST_SHIFT	3
+#define PMSIDR_EL1_LDS_SHIFT		4
+#define PMSIDR_EL1_ERND_SHIFT		5
+#define PMSIDR_EL1_INTERVAL_SHIFT	8
+#define PMSIDR_EL1_INTERVAL_MASK	0xfUL
+#define PMSIDR_EL1_MAXSIZE_SHIFT	12
+#define PMSIDR_EL1_MAXSIZE_MASK		0xfUL
+#define PMSIDR_EL1_COUNTSIZE_SHIFT	16
+#define PMSIDR_EL1_COUNTSIZE_MASK	0xfUL
+
+#define PMBIDR_EL1			sys_reg(3, 0, 9, 10, 7)
+#define PMBIDR_EL1_ALIGN_SHIFT		0
+#define PMBIDR_EL1_ALIGN_MASK		0xfU
+#define PMBIDR_EL1_P_SHIFT		4
+#define PMBIDR_EL1_F_SHIFT		5
+
+/* Sampling controls */
+#define PMSCR_EL1			sys_reg(3, 0, 9, 9, 0)
+#define PMSCR_EL1_E0SPE_SHIFT		0
+#define PMSCR_EL1_E1SPE_SHIFT		1
+#define PMSCR_EL1_CX_SHIFT		3
+#define PMSCR_EL1_PA_SHIFT		4
+#define PMSCR_EL1_TS_SHIFT		5
+#define PMSCR_EL1_PCT_SHIFT		6
+
+#define PMSICR_EL1			sys_reg(3, 0, 9, 9, 2)
+
+#define PMSIRR_EL1			sys_reg(3, 0, 9, 9, 3)
+#define PMSIRR_EL1_RND_SHIFT		0
+#define PMSIRR_EL1_IVAL_MASK		0xffUL
+
+/* Filtering controls */
+#define PMSFCR_EL1			sys_reg(3, 0, 9, 9, 4)
+#define PMSFCR_EL1_FE_SHIFT		0
+#define PMSFCR_EL1_FT_SHIFT		1
+#define PMSFCR_EL1_FL_SHIFT		2
+#define PMSFCR_EL1_B_SHIFT		16
+#define PMSFCR_EL1_LD_SHIFT		17
+#define PMSFCR_EL1_ST_SHIFT		18
+
+#define PMSEVFR_EL1			sys_reg(3, 0, 9, 9, 5)
+#define PMSEVFR_EL1_RES0		0x0000ffff00ff0f55UL
+
+#define PMSLATFR_EL1			sys_reg(3, 0, 9, 9, 6)
+#define PMSLATFR_EL1_MINLAT_SHIFT	0
+
+/* Buffer controls */
+#define PMBLIMITR_EL1			sys_reg(3, 0, 9, 10, 0)
+#define PMBLIMITR_EL1_E_SHIFT		0
+#define PMBLIMITR_EL1_FM_SHIFT		1
+#define PMBLIMITR_EL1_FM_MASK		0x3UL
+#define PMBLIMITR_EL1_FM_STOP_IRQ	(0 << PMBLIMITR_EL1_FM_SHIFT)
+
+#define PMBPTR_EL1			sys_reg(3, 0, 9, 10, 1)
+
+/* Buffer error reporting */
+#define PMBSR_EL1			sys_reg(3, 0, 9, 10, 3)
+#define PMBSR_EL1_COLL_SHIFT		16
+#define PMBSR_EL1_S_SHIFT		17
+#define PMBSR_EL1_EA_SHIFT		18
+#define PMBSR_EL1_DL_SHIFT		19
+#define PMBSR_EL1_EC_SHIFT		26
+#define PMBSR_EL1_EC_MASK		0x3fUL
+
+#define PMBSR_EL1_EC_BUF		(0x0UL << PMBSR_EL1_EC_SHIFT)
+#define PMBSR_EL1_EC_FAULT_S1		(0x24UL << PMBSR_EL1_EC_SHIFT)
+#define PMBSR_EL1_EC_FAULT_S2		(0x25UL << PMBSR_EL1_EC_SHIFT)
+
+#define PMBSR_EL1_FAULT_FSC_SHIFT	0
+#define PMBSR_EL1_FAULT_FSC_MASK	0x3fUL
+
+#define PMBSR_EL1_BUF_BSC_SHIFT		0
+#define PMBSR_EL1_BUF_BSC_MASK		0x3fUL
+
+#define PMBSR_EL1_BUF_BSC_FULL		(0x1UL << PMBSR_EL1_BUF_BSC_SHIFT)
+
+#define psb_csync()			asm volatile("hint #17")
+
+struct arm_spe_pmu_buf {
+	int					nr_pages;
+	bool					snapshot;
+	void					*base;
+};
+
+struct arm_spe_pmu {
+	struct pmu				pmu;
+	struct platform_device			*pdev;
+	cpumask_t				supported_cpus;
+	struct hlist_node			hotplug_node;
+
+	int					irq; /* PPI */
+
+	u16					min_period;
+	u16					cnt_width;
+
+#define SPE_PMU_FEAT_FILT_EVT			(1UL << 0)
+#define SPE_PMU_FEAT_FILT_TYP			(1UL << 1)
+#define SPE_PMU_FEAT_FILT_LAT			(1UL << 2)
+#define SPE_PMU_FEAT_ARCH_INST			(1UL << 3)
+#define SPE_PMU_FEAT_LDS			(1UL << 4)
+#define SPE_PMU_FEAT_ERND			(1UL << 5)
+#define SPE_PMU_FEAT_DEV_PROBED			(1UL << 63)
+	u64					features;
+
+	u16					max_record_sz;
+	u16					align;
+	struct perf_output_handle __percpu	*handle;
+};
+
+#define to_spe_pmu(p) (container_of(p, struct arm_spe_pmu, pmu))
+
+/* Convert a free-running index from perf into an SPE buffer offset */
+#define PERF_IDX2OFF(idx, buf)	((idx) & (((buf)->nr_pages << PAGE_SHIFT) - 1))
+
+/* Keep track of our dynamic hotplug state */
+static enum cpuhp_state arm_spe_pmu_online;
+
+/* This sysfs gunk was really good fun to write. */
+enum arm_spe_pmu_capabilities {
+	SPE_PMU_CAP_ARCH_INST = 0,
+	SPE_PMU_CAP_ERND,
+	SPE_PMU_CAP_FEAT_MAX,
+	SPE_PMU_CAP_CNT_SZ = SPE_PMU_CAP_FEAT_MAX,
+	SPE_PMU_CAP_MIN_IVAL,
+};
+
+static int arm_spe_pmu_feat_caps[SPE_PMU_CAP_FEAT_MAX] = {
+	[SPE_PMU_CAP_ARCH_INST]	= SPE_PMU_FEAT_ARCH_INST,
+	[SPE_PMU_CAP_ERND]	= SPE_PMU_FEAT_ERND,
+};
+
+static u32 arm_spe_pmu_cap_get(struct arm_spe_pmu *spe_pmu, int cap)
+{
+	if (cap < SPE_PMU_CAP_FEAT_MAX)
+		return !!(spe_pmu->features & arm_spe_pmu_feat_caps[cap]);
+
+	switch (cap) {
+	case SPE_PMU_CAP_CNT_SZ:
+		return spe_pmu->cnt_width;
+	case SPE_PMU_CAP_MIN_IVAL:
+		return spe_pmu->min_period;
+	default:
+		WARN(1, "unknown cap %d\n", cap);
+	}
+
+	return 0;
+}
+
+static ssize_t arm_spe_pmu_cap_show(struct device *dev,
+				    struct device_attribute *attr,
+				    char *buf)
+{
+	struct platform_device *pdev = to_platform_device(dev);
+	struct arm_spe_pmu *spe_pmu = platform_get_drvdata(pdev);
+	struct dev_ext_attribute *ea =
+		container_of(attr, struct dev_ext_attribute, attr);
+	int cap = (long)ea->var;
+
+	return snprintf(buf, PAGE_SIZE, "%u\n",
+		arm_spe_pmu_cap_get(spe_pmu, cap));
+}
+
+#define SPE_EXT_ATTR_ENTRY(_name, _func, _var)				\
+	&((struct dev_ext_attribute[]) {				\
+		{ __ATTR(_name, S_IRUGO, _func, NULL), (void *)_var }	\
+	})[0].attr.attr
+
+#define SPE_CAP_EXT_ATTR_ENTRY(_name, _var)				\
+	SPE_EXT_ATTR_ENTRY(_name, arm_spe_pmu_cap_show, _var)
+
+static struct attribute *arm_spe_pmu_cap_attr[] = {
+	SPE_CAP_EXT_ATTR_ENTRY(arch_inst, SPE_PMU_CAP_ARCH_INST),
+	SPE_CAP_EXT_ATTR_ENTRY(ernd, SPE_PMU_CAP_ERND),
+	SPE_CAP_EXT_ATTR_ENTRY(count_size, SPE_PMU_CAP_CNT_SZ),
+	SPE_CAP_EXT_ATTR_ENTRY(min_interval, SPE_PMU_CAP_MIN_IVAL),
+	NULL,
+};
+
+static struct attribute_group arm_spe_pmu_cap_group = {
+	.name	= "caps",
+	.attrs	= arm_spe_pmu_cap_attr,
+};
+
+/* User ABI */
+#define ATTR_CFG_FLD_ts_enable_CFG		config	/* PMSCR_EL1.TS */
+#define ATTR_CFG_FLD_ts_enable_LO		0
+#define ATTR_CFG_FLD_ts_enable_HI		0
+#define ATTR_CFG_FLD_pa_enable_CFG		config	/* PMSCR_EL1.PA */
+#define ATTR_CFG_FLD_pa_enable_LO		1
+#define ATTR_CFG_FLD_pa_enable_HI		1
+#define ATTR_CFG_FLD_jitter_CFG			config	/* PMSIRR_EL1.RND */
+#define ATTR_CFG_FLD_jitter_LO			16
+#define ATTR_CFG_FLD_jitter_HI			16
+#define ATTR_CFG_FLD_branch_filter_CFG		config	/* PMSFCR_EL1.B */
+#define ATTR_CFG_FLD_branch_filter_LO		32
+#define ATTR_CFG_FLD_branch_filter_HI		32
+#define ATTR_CFG_FLD_load_filter_CFG		config	/* PMSFCR_EL1.LD */
+#define ATTR_CFG_FLD_load_filter_LO		33
+#define ATTR_CFG_FLD_load_filter_HI		33
+#define ATTR_CFG_FLD_store_filter_CFG		config	/* PMSFCR_EL1.ST */
+#define ATTR_CFG_FLD_store_filter_LO		34
+#define ATTR_CFG_FLD_store_filter_HI		34
+
+#define ATTR_CFG_FLD_event_filter_CFG		config1	/* PMSEVFR_EL1 */
+#define ATTR_CFG_FLD_event_filter_LO		0
+#define ATTR_CFG_FLD_event_filter_HI		63
+
+#define ATTR_CFG_FLD_min_latency_CFG		config2	/* PMSLATFR_EL1.MINLAT */
+#define ATTR_CFG_FLD_min_latency_LO		0
+#define ATTR_CFG_FLD_min_latency_HI		11
+
+/* Why does everything I do descend into this? */
+#define __GEN_PMU_FORMAT_ATTR(cfg, lo, hi)				\
+	(lo) == (hi) ? #cfg ":" #lo "\n" : #cfg ":" #lo "-" #hi
+
+#define _GEN_PMU_FORMAT_ATTR(cfg, lo, hi)				\
+	__GEN_PMU_FORMAT_ATTR(cfg, lo, hi)
+
+#define GEN_PMU_FORMAT_ATTR(name)					\
+	PMU_FORMAT_ATTR(name,						\
+	_GEN_PMU_FORMAT_ATTR(ATTR_CFG_FLD_##name##_CFG,			\
+			     ATTR_CFG_FLD_##name##_LO,			\
+			     ATTR_CFG_FLD_##name##_HI))
+
+#define _ATTR_CFG_GET_FLD(attr, cfg, lo, hi)				\
+	((((attr)->cfg) >> lo) & GENMASK(hi - lo, 0))
+
+#define ATTR_CFG_GET_FLD(attr, name)					\
+	_ATTR_CFG_GET_FLD(attr,						\
+			  ATTR_CFG_FLD_##name##_CFG,			\
+			  ATTR_CFG_FLD_##name##_LO,			\
+			  ATTR_CFG_FLD_##name##_HI)
+
+GEN_PMU_FORMAT_ATTR(ts_enable);
+GEN_PMU_FORMAT_ATTR(pa_enable);
+GEN_PMU_FORMAT_ATTR(jitter);
+GEN_PMU_FORMAT_ATTR(load_filter);
+GEN_PMU_FORMAT_ATTR(store_filter);
+GEN_PMU_FORMAT_ATTR(branch_filter);
+GEN_PMU_FORMAT_ATTR(event_filter);
+GEN_PMU_FORMAT_ATTR(min_latency);
+
+static struct attribute *arm_spe_pmu_formats_attr[] = {
+	&format_attr_ts_enable.attr,
+	&format_attr_pa_enable.attr,
+	&format_attr_jitter.attr,
+	&format_attr_load_filter.attr,
+	&format_attr_store_filter.attr,
+	&format_attr_branch_filter.attr,
+	&format_attr_event_filter.attr,
+	&format_attr_min_latency.attr,
+	NULL,
+};
+
+static struct attribute_group arm_spe_pmu_format_group = {
+	.name	= "format",
+	.attrs	= arm_spe_pmu_formats_attr,
+};
+
+static ssize_t arm_spe_pmu_get_attr_cpumask(struct device *dev,
+					    struct device_attribute *attr,
+					    char *buf)
+{
+	struct platform_device *pdev = to_platform_device(dev);
+	struct arm_spe_pmu *spe_pmu = platform_get_drvdata(pdev);
+
+	return cpumap_print_to_pagebuf(true, buf, &spe_pmu->supported_cpus);
+}
+static DEVICE_ATTR(cpumask, S_IRUGO, arm_spe_pmu_get_attr_cpumask, NULL);
+
+static struct attribute *arm_spe_pmu_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL,
+};
+
+static struct attribute_group arm_spe_pmu_group = {
+	.attrs	= arm_spe_pmu_attrs,
+};
+
+static const struct attribute_group *arm_spe_pmu_attr_groups[] = {
+	&arm_spe_pmu_group,
+	&arm_spe_pmu_cap_group,
+	&arm_spe_pmu_format_group,
+	NULL,
+};
+
+/* Convert between user ABI and register values */
+static u64 arm_spe_event_to_pmscr(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+	u64 reg = 0;
+
+	reg |= ATTR_CFG_GET_FLD(attr, ts_enable) << PMSCR_EL1_TS_SHIFT;
+	reg |= ATTR_CFG_GET_FLD(attr, pa_enable) << PMSCR_EL1_PA_SHIFT;
+
+	if (!attr->exclude_user)
+		reg |= BIT(PMSCR_EL1_E0SPE_SHIFT);
+
+	if (!attr->exclude_kernel)
+		reg |= BIT(PMSCR_EL1_E1SPE_SHIFT);
+
+	if (IS_ENABLED(CONFIG_PID_IN_CONTEXTIDR))
+		reg |= BIT(PMSCR_EL1_CX_SHIFT);
+
+	return reg;
+}
+
+static void arm_spe_event_sanitise_period(struct perf_event *event)
+{
+	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
+	u64 period = event->hw.sample_period & ~PMSIRR_EL1_IVAL_MASK;
+
+	if (period < spe_pmu->min_period)
+		period = spe_pmu->min_period;
+
+	event->hw.sample_period = period;
+}
+
+static u64 arm_spe_event_to_pmsirr(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+	u64 reg = 0;
+
+	arm_spe_event_sanitise_period(event);
+
+	reg |= ATTR_CFG_GET_FLD(attr, jitter) << PMSIRR_EL1_RND_SHIFT;
+	reg |= event->hw.sample_period;
+
+	return reg;
+}
+
+static u64 arm_spe_event_to_pmsfcr(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+	u64 reg = 0;
+
+	reg |= ATTR_CFG_GET_FLD(attr, load_filter) << PMSFCR_EL1_LD_SHIFT;
+	reg |= ATTR_CFG_GET_FLD(attr, store_filter) << PMSFCR_EL1_ST_SHIFT;
+	reg |= ATTR_CFG_GET_FLD(attr, branch_filter) << PMSFCR_EL1_B_SHIFT;
+
+	if (reg)
+		reg |= BIT(PMSFCR_EL1_FT_SHIFT);
+
+	if (ATTR_CFG_GET_FLD(attr, event_filter))
+		reg |= BIT(PMSFCR_EL1_FE_SHIFT);
+
+	if (ATTR_CFG_GET_FLD(attr, min_latency))
+		reg |= BIT(PMSFCR_EL1_FL_SHIFT);
+
+	return reg;
+}
+
+static u64 arm_spe_event_to_pmsevfr(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+	return ATTR_CFG_GET_FLD(attr, event_filter);
+}
+
+static u64 arm_spe_event_to_pmslatfr(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+	return ATTR_CFG_GET_FLD(attr, min_latency) << PMSLATFR_EL1_MINLAT_SHIFT;
+}
+
+static bool arm_spe_pmu_buffer_mgmt_pending(u64 pmbsr)
+{
+	const char *err_str;
+
+	/* Service required? */
+	if (!(pmbsr & BIT(PMBSR_EL1_S_SHIFT)))
+		return false;
+
+	/* We only expect buffer management events */
+	switch (pmbsr & (PMBSR_EL1_EC_MASK << PMBSR_EL1_EC_SHIFT)) {
+	case PMBSR_EL1_EC_BUF:
+		/* Handled below */
+		break;
+	case PMBSR_EL1_EC_FAULT_S1:
+	case PMBSR_EL1_EC_FAULT_S2:
+		err_str = "Unexpected buffer fault";
+		goto out_err;
+	default:
+		err_str = "Unknown error code";
+		goto out_err;
+	}
+
+	/* Buffer management event */
+	switch (pmbsr & (PMBSR_EL1_BUF_BSC_MASK << PMBSR_EL1_BUF_BSC_SHIFT)) {
+	case PMBSR_EL1_BUF_BSC_FULL:
+		/* Ensure new profiling data is visible to the CPU */
+		psb_csync();
+		dsb(nsh);
+		return true;
+	default:
+		err_str = "Unknown buffer status code";
+	}
+
+out_err:
+	pr_err_ratelimited("%s on CPU %d [PMBSR=0x%08llx]\n", err_str,
+			   smp_processor_id(), pmbsr);
+	return false;
+}
+
+static u64 arm_spe_pmu_next_snapshot_off(struct perf_output_handle *handle)
+{
+	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
+	struct arm_spe_pmu *spe_pmu = to_spe_pmu(handle->event->pmu);
+	u64 head = PERF_IDX2OFF(handle->head, buf);
+	u64 limit = buf->nr_pages * PAGE_SIZE;
+
+	/*
+	 * The trace format isn't parseable in reverse, so clamp
+	 * the limit to half of the buffer size in snapshot mode
+	 * so that the worst case is half a buffer of records, as
+	 * opposed to a single record.
+	 */
+	if (head < limit >> 1)
+		limit >>= 1;
+
+	/*
+	 * If we're within max_record_sz of the limit, we must
+	 * pad, move the head index and recompute the limit.
+	 */
+	if (limit - head < spe_pmu->max_record_sz) {
+		memset(buf->base + head, 0, limit - head);
+		handle->head = PERF_IDX2OFF(limit, buf);
+		limit = ((buf->nr_pages * PAGE_SIZE) >> 1) + handle->head;
+	}
+
+	return limit;
+}
+
+static u64 __arm_spe_pmu_next_off(struct perf_output_handle *handle)
+{
+	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
+	u64 head = PERF_IDX2OFF(handle->head, buf);
+	u64 tail = PERF_IDX2OFF(handle->head + handle->size, buf);
+	u64 wakeup = PERF_IDX2OFF(handle->wakeup, buf);
+	u64 limit = buf->nr_pages * PAGE_SIZE;
+
+	/*
+	 * Set the limit pointer to either the watermark or the
+	 * current tail pointer; whichever comes first.
+	 */
+	if (handle->head + handle->size <= handle->wakeup) {
+		/* The tail is next, so check for wrapping */
+		if (tail >= head) {
+			/*
+			 * No wrapping, but need to align downwards to
+			 * avoid corrupting unconsumed data.
+			 */
+			limit = round_down(tail, PAGE_SIZE);
+
+		}
+	} else if (wakeup >= head) {
+		/*
+		 * The wakeup is next and doesn't wrap. Align upwards to
+		 * ensure that we do indeed reach the watermark.
+		 */
+		limit = round_up(wakeup, PAGE_SIZE);
+
+		/*
+		 * If rounding up crosses the tail, then we have to
+		 * round down to avoid corrupting unconsumed data.
+		 * Hopefully the tail will have moved by the time we
+		 * hit the new limit.
+		 */
+		if (wakeup < tail && limit > tail)
+			limit = round_down(wakeup, PAGE_SIZE);
+	}
+
+	/*
+	 * If rounding down crosses the head, then the buffer is full,
+	 * so pad to tail and end the session.
+	 */
+	if (limit <= head) {
+		memset(buf->base + head, 0, handle->size);
+		perf_aux_output_skip(handle, handle->size);
+		perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED);
+		perf_aux_output_end(handle, 0);
+		limit = 0;
+	}
+
+	return limit;
+}
+
+static u64 arm_spe_pmu_next_off(struct perf_output_handle *handle)
+{
+	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
+	struct arm_spe_pmu *spe_pmu = to_spe_pmu(handle->event->pmu);
+	u64 limit = __arm_spe_pmu_next_off(handle);
+	u64 head = PERF_IDX2OFF(handle->head, buf);
+
+	/*
+	 * If the head has come too close to the end of the buffer,
+	 * then pad to the end and recompute the limit.
+	 */
+	if (limit && (limit - head < spe_pmu->max_record_sz)) {
+		memset(buf->base + head, 0, limit - head);
+		perf_aux_output_skip(handle, limit - head);
+		limit = __arm_spe_pmu_next_off(handle);
+	}
+
+	return limit;
+}
+
+static void arm_spe_perf_aux_output_begin(struct perf_output_handle *handle,
+					  struct perf_event *event)
+{
+	u64 base, limit;
+	struct arm_spe_pmu_buf *buf;
+
+	/* Start a new aux session */
+	buf = perf_aux_output_begin(handle, event);
+	if (!buf) {
+		event->hw.state |= PERF_HES_STOPPED;
+		/*
+		 * We still need to clear the limit pointer, since the
+		 * profiler might only be disabled by virtue of a fault.
+		 */
+		limit = 0;
+		goto out_write_limit;
+	}
+
+	limit = buf->snapshot ? arm_spe_pmu_next_snapshot_off(handle)
+			      : arm_spe_pmu_next_off(handle);
+	if (limit)
+		limit |= BIT(PMBLIMITR_EL1_E_SHIFT);
+
+	base = (u64)buf->base + PERF_IDX2OFF(handle->head, buf);
+	write_sysreg_s(base, PMBPTR_EL1);
+	limit += (u64)buf->base;
+
+out_write_limit:
+	write_sysreg_s(limit, PMBLIMITR_EL1);
+}
+
+static bool arm_spe_perf_aux_output_end(struct perf_output_handle *handle,
+					struct perf_event *event,
+					bool resume)
+{
+	u64 pmbptr, pmbsr, offset, size;
+	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
+	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
+	bool truncated;
+
+	/*
+	 * We can be called via IRQ work trying to disable the PMU after
+	 * a buffer full event. In this case, the aux session has already
+	 * been stopped, so there's nothing to do here.
+	 */
+	if (!buf)
+		return false;
+
+	/*
+	 * If there isn't a pending management event and we're not stopping
+	 * the current session, then just leave everything alone.
+	 */
+	pmbsr = read_sysreg_s(PMBSR_EL1);
+	if (!arm_spe_pmu_buffer_mgmt_pending(pmbsr) && resume)
+		return false; /* Spurious IRQ */
+
+	/* Ensure hardware updates to PMBPTR_EL1 are visible */
+	isb();
+
+	/*
+	 * Work out how much data has been written since the last update
+	 * to the head index.
+	 */
+	pmbptr = round_down(read_sysreg_s(PMBPTR_EL1), spe_pmu->align);
+	offset = pmbptr - (u64)buf->base;
+	size = offset - PERF_IDX2OFF(handle->head, buf);
+
+	if (buf->snapshot)
+		handle->head = offset;
+
+	/*
+	 * Either the buffer is full or we're stopping the session. Check
+	 * that we didn't write a partial record, since this can result
+	 * in unparseable trace and we must disable the event.
+	 */
+	if (pmbsr & BIT(PMBSR_EL1_COLL_SHIFT))
+		perf_aux_output_flag(handle, PERF_AUX_FLAG_COLLISION);
+
+	truncated = pmbsr & BIT(PMBSR_EL1_DL_SHIFT);
+	if (truncated)
+		perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED);
+
+	perf_aux_output_end(handle, size);
+
+	/*
+	 * If we're not resuming the session, then we can clear the fault
+	 * and we're done, otherwise we need to start a new session.
+	 */
+	if (!resume)
+		write_sysreg_s(0, PMBSR_EL1);
+	else if (!truncated)
+		arm_spe_perf_aux_output_begin(handle, event);
+
+	return true;
+}
+
+/* IRQ handling */
+static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
+{
+	struct perf_output_handle *handle = dev;
+
+	if (!perf_get_aux(handle))
+		return IRQ_NONE;
+
+	if (!arm_spe_perf_aux_output_end(handle, handle->event, true))
+		return IRQ_NONE;
+
+	irq_work_run();
+	isb(); /* Ensure the buffer is disabled if data loss has occurred */
+	write_sysreg_s(0, PMBSR_EL1);
+	return IRQ_HANDLED;
+}
+
+/* Perf callbacks */
+static int arm_spe_pmu_event_init(struct perf_event *event)
+{
+	u64 reg;
+	struct perf_event_attr *attr = &event->attr;
+	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
+
+	/* This is, of course, deeply driver-specific */
+	if (attr->type != event->pmu->type)
+		return -ENOENT;
+
+	if (event->cpu >= 0 &&
+	    !cpumask_test_cpu(event->cpu, &spe_pmu->supported_cpus))
+		return -ENOENT;
+
+	if (arm_spe_event_to_pmsevfr(event) & PMSEVFR_EL1_RES0)
+		return -EOPNOTSUPP;
+
+	if (event->hw.sample_period < spe_pmu->min_period ||
+	    event->hw.sample_period & PMSIRR_EL1_IVAL_MASK)
+		return -EOPNOTSUPP;
+
+	if (attr->exclude_idle)
+		return -EOPNOTSUPP;
+
+	/*
+	 * Feedback-directed frequency throttling doesn't work when we
+	 * have a buffer of samples. We'd need to manually count the
+	 * samples in the buffer when it fills up and adjust the event
+	 * count to reflect that. Instead, force the user to specify a
+	 * sample period instead.
+	 */
+	if (attr->freq)
+		return -EINVAL;
+
+	reg = arm_spe_event_to_pmsfcr(event);
+	if ((reg & BIT(PMSFCR_EL1_FE_SHIFT)) &&
+	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_EVT))
+		return -EOPNOTSUPP;
+
+	if ((reg & BIT(PMSFCR_EL1_FT_SHIFT)) &&
+	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_TYP))
+		return -EOPNOTSUPP;
+
+	if ((reg & BIT(PMSFCR_EL1_FL_SHIFT)) &&
+	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_LAT))
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
+static void arm_spe_pmu_start(struct perf_event *event, int flags)
+{
+	u64 reg;
+	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	struct perf_output_handle *handle = this_cpu_ptr(spe_pmu->handle);
+
+	hwc->state = 0;
+	arm_spe_perf_aux_output_begin(handle, event);
+	if (hwc->state)
+		return;
+
+	reg = arm_spe_event_to_pmsfcr(event);
+	write_sysreg_s(reg, PMSFCR_EL1);
+
+	reg = arm_spe_event_to_pmsevfr(event);
+	write_sysreg_s(reg, PMSEVFR_EL1);
+
+	reg = arm_spe_event_to_pmslatfr(event);
+	write_sysreg_s(reg, PMSLATFR_EL1);
+
+	if (flags & PERF_EF_RELOAD) {
+		reg = arm_spe_event_to_pmsirr(event);
+		write_sysreg_s(reg, PMSIRR_EL1);
+		isb();
+		reg = local64_read(&hwc->period_left);
+		write_sysreg_s(reg, PMSICR_EL1);
+	}
+
+	reg = arm_spe_event_to_pmscr(event);
+	isb();
+	write_sysreg_s(reg, PMSCR_EL1);
+}
+
+static void arm_spe_pmu_disable_and_drain_local(void)
+{
+	/* Disable profiling at EL0 and EL1 */
+	write_sysreg_s(0, PMSCR_EL1);
+	isb();
+
+	/* Drain any buffered data */
+	psb_csync();
+	dsb(nsh);
+
+	/* Disable the profiling buffer */
+	write_sysreg_s(0, PMBLIMITR_EL1);
+}
+
+static void arm_spe_pmu_stop(struct perf_event *event, int flags)
+{
+	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	struct perf_output_handle *handle = this_cpu_ptr(spe_pmu->handle);
+
+	/* If we're already stopped, then nothing to do */
+	if (hwc->state & PERF_HES_STOPPED)
+		return;
+
+	/* Stop all trace generation */
+	arm_spe_pmu_disable_and_drain_local();
+
+	if (flags & PERF_EF_UPDATE) {
+		arm_spe_perf_aux_output_end(handle, event, false);
+		/*
+		 * This may also contain ECOUNT, but nobody else should
+		 * be looking at period_left, since we forbid frequency
+		 * based sampling.
+		 */
+		local64_set(&hwc->period_left, read_sysreg_s(PMSICR_EL1));
+		hwc->state |= PERF_HES_UPTODATE;
+	}
+
+	hwc->state |= PERF_HES_STOPPED;
+}
+
+static int arm_spe_pmu_add(struct perf_event *event, int flags)
+{
+	int ret = 0;
+	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	int cpu = event->cpu == -1 ? smp_processor_id() : event->cpu;
+
+	if (!cpumask_test_cpu(cpu, &spe_pmu->supported_cpus))
+		return -ENOENT;
+
+	hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+	if (flags & PERF_EF_START) {
+		arm_spe_pmu_start(event, PERF_EF_RELOAD);
+		if (hwc->state & PERF_HES_STOPPED)
+			ret = -EINVAL;
+	}
+
+	return ret;
+}
+
+static void arm_spe_pmu_del(struct perf_event *event, int flags)
+{
+	arm_spe_pmu_stop(event, PERF_EF_UPDATE);
+}
+
+static void arm_spe_pmu_read(struct perf_event *event)
+{
+}
+
+static void *arm_spe_pmu_setup_aux(int cpu, void **pages, int nr_pages,
+				   bool snapshot)
+{
+	int i;
+	struct page **pglist;
+	struct arm_spe_pmu_buf *buf;
+
+	/*
+	 * We require an even number of pages for snapshot mode, so that
+	 * we can effectively treat the buffer as consisting of two equal
+	 * parts and give userspace a fighting chance of getting some
+	 * useful data out of it.
+	 */
+	if (!nr_pages || (snapshot && (nr_pages & 1)))
+		return NULL;
+
+	buf = kzalloc_node(sizeof(*buf), GFP_KERNEL, cpu_to_node(cpu));
+	if (!buf)
+		return NULL;
+
+	pglist = kcalloc(nr_pages, sizeof(*pglist), GFP_KERNEL);
+	if (!pglist)
+		goto out_free_buf;
+
+	for (i = 0; i < nr_pages; ++i) {
+		struct page *page = virt_to_page(pages[i]);
+
+		if (PagePrivate(page)) {
+			pr_warn("unexpected high-order page for auxbuf!");
+			goto out_free_pglist;
+		}
+
+		pglist[i] = virt_to_page(pages[i]);
+	}
+
+	buf->base = vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
+	if (!buf->base)
+		goto out_free_pglist;
+
+	buf->nr_pages	= nr_pages;
+	buf->snapshot	= snapshot;
+
+	kfree(pglist);
+	return buf;
+
+out_free_pglist:
+	kfree(pglist);
+out_free_buf:
+	kfree(buf);
+	return NULL;
+}
+
+static void arm_spe_pmu_free_aux(void *aux)
+{
+	struct arm_spe_pmu_buf *buf = aux;
+
+	vunmap(buf->base);
+	kfree(buf);
+}
+
+/* Initialisation and teardown functions */
+static int arm_spe_pmu_perf_init(struct arm_spe_pmu *spe_pmu)
+{
+	static atomic_t pmu_idx = ATOMIC_INIT(-1);
+
+	int idx;
+	char *name;
+	struct device *dev = &spe_pmu->pdev->dev;
+
+	spe_pmu->pmu = (struct pmu) {
+		.capabilities	= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE,
+		.attr_groups	= arm_spe_pmu_attr_groups,
+		/*
+		 * We hitch a ride on the software context here, so that
+		 * we can support per-task profiling (which is not possible
+		 * with the invalid context as it doesn't get sched callbacks).
+		 * This requires that userspace either uses a dummy event for
+		 * perf_event_open, since the aux buffer is not setup until
+		 * a subsequent mmap, or creates the profiling event in a
+		 * disabled state and explicitly PERF_EVENT_IOC_ENABLEs it
+		 * once the buffer has been created.
+		 */
+		.task_ctx_nr	= perf_sw_context,
+		.event_init	= arm_spe_pmu_event_init,
+		.add		= arm_spe_pmu_add,
+		.del		= arm_spe_pmu_del,
+		.start		= arm_spe_pmu_start,
+		.stop		= arm_spe_pmu_stop,
+		.read		= arm_spe_pmu_read,
+		.setup_aux	= arm_spe_pmu_setup_aux,
+		.free_aux	= arm_spe_pmu_free_aux,
+	};
+
+	idx = atomic_inc_return(&pmu_idx);
+	name = devm_kasprintf(dev, GFP_KERNEL, "%s_%d", PMUNAME, idx);
+	return perf_pmu_register(&spe_pmu->pmu, name, -1);
+}
+
+static void arm_spe_pmu_perf_destroy(struct arm_spe_pmu *spe_pmu)
+{
+	perf_pmu_unregister(&spe_pmu->pmu);
+}
+
+static void __arm_spe_pmu_dev_probe(void *info)
+{
+	int fld;
+	u64 reg;
+	struct arm_spe_pmu *spe_pmu = info;
+	struct device *dev = &spe_pmu->pdev->dev;
+
+	fld = cpuid_feature_extract_unsigned_field(read_cpuid(ID_AA64DFR0_EL1),
+						   ID_AA64DFR0_PMSVER_SHIFT);
+	if (!fld) {
+		dev_err(dev,
+			"unsupported ID_AA64DFR0_EL1.PMSVer [%d] on CPU %d\n",
+			fld, smp_processor_id());
+		return;
+	}
+
+	/* Read PMBIDR first to determine whether or not we have access */
+	reg = read_sysreg_s(PMBIDR_EL1);
+	if (reg & BIT(PMBIDR_EL1_P_SHIFT)) {
+		dev_err(dev,
+			"profiling buffer owned by higher exception level\n");
+		return;
+	}
+
+	/* Minimum alignment. If it's out-of-range, then fail the probe */
+	fld = reg >> PMBIDR_EL1_ALIGN_SHIFT & PMBIDR_EL1_ALIGN_MASK;
+	spe_pmu->align = 1 << fld;
+	if (spe_pmu->align > SZ_2K) {
+		dev_err(dev, "unsupported PMBIDR.Align [%d] on CPU %d\n",
+			fld, smp_processor_id());
+		return;
+	}
+
+	/* It's now safe to read PMSIDR and figure out what we've got */
+	reg = read_sysreg_s(PMSIDR_EL1);
+	if (reg & BIT(PMSIDR_EL1_FE_SHIFT))
+		spe_pmu->features |= SPE_PMU_FEAT_FILT_EVT;
+
+	if (reg & BIT(PMSIDR_EL1_FT_SHIFT))
+		spe_pmu->features |= SPE_PMU_FEAT_FILT_TYP;
+
+	if (reg & BIT(PMSIDR_EL1_FL_SHIFT))
+		spe_pmu->features |= SPE_PMU_FEAT_FILT_LAT;
+
+	if (reg & BIT(PMSIDR_EL1_ARCHINST_SHIFT))
+		spe_pmu->features |= SPE_PMU_FEAT_ARCH_INST;
+
+	if (reg & BIT(PMSIDR_EL1_LDS_SHIFT))
+		spe_pmu->features |= SPE_PMU_FEAT_LDS;
+
+	if (reg & BIT(PMSIDR_EL1_ERND_SHIFT))
+		spe_pmu->features |= SPE_PMU_FEAT_ERND;
+
+	/* This field has a spaced out encoding, so just use a look-up */
+	fld = reg >> PMSIDR_EL1_INTERVAL_SHIFT & PMSIDR_EL1_INTERVAL_MASK;
+	switch (fld) {
+	case 0:
+		spe_pmu->min_period = 256;
+		break;
+	case 2:
+		spe_pmu->min_period = 512;
+		break;
+	case 3:
+		spe_pmu->min_period = 768;
+		break;
+	case 4:
+		spe_pmu->min_period = 1024;
+		break;
+	case 5:
+		spe_pmu->min_period = 1536;
+		break;
+	case 6:
+		spe_pmu->min_period = 2048;
+		break;
+	case 7:
+		spe_pmu->min_period = 3072;
+		break;
+	default:
+		dev_warn(dev, "unknown PMSIDR_EL1.Interval [%d]; assuming 8\n",
+			 fld);
+		/* Fallthrough */
+	case 8:
+		spe_pmu->min_period = 4096;
+	}
+
+	/* Maximum record size. If it's out-of-range, then fail the probe */
+	fld = reg >> PMSIDR_EL1_MAXSIZE_SHIFT & PMSIDR_EL1_MAXSIZE_MASK;
+	spe_pmu->max_record_sz = 1 << fld;
+	if (spe_pmu->max_record_sz > SZ_2K || spe_pmu->max_record_sz < 16) {
+		dev_err(dev, "unsupported PMSIDR_EL1.MaxSize [%d] on CPU %d\n",
+			fld, smp_processor_id());
+		return;
+	}
+
+	fld = reg >> PMSIDR_EL1_COUNTSIZE_SHIFT & PMSIDR_EL1_COUNTSIZE_MASK;
+	switch (fld) {
+	default:
+		dev_warn(dev, "unknown PMSIDR_EL1.CountSize [%d]; assuming 2\n",
+			 fld);
+		/* Fallthrough */
+	case 2:
+		spe_pmu->cnt_width = 12;
+	}
+
+	dev_info(dev,
+		 "probed for CPUs %*pbl [max_record_sz %u, align %u, features 0x%llx]\n",
+		 cpumask_pr_args(&spe_pmu->supported_cpus),
+		 spe_pmu->max_record_sz, spe_pmu->align, spe_pmu->features);
+
+	spe_pmu->features |= SPE_PMU_FEAT_DEV_PROBED;
+	return;
+}
+
+static void __arm_spe_pmu_reset_local(void)
+{
+	/*
+	 * This is probably overkill, as we have no idea where we're
+	 * draining any buffered data to...
+	 */
+	arm_spe_pmu_disable_and_drain_local();
+
+	/* Reset the buffer base pointer */
+	write_sysreg_s(0, PMBPTR_EL1);
+	isb();
+
+	/* Clear any pending management interrupts */
+	write_sysreg_s(0, PMBSR_EL1);
+	isb();
+}
+
+static void __arm_spe_pmu_setup_one(void *info)
+{
+	struct arm_spe_pmu *spe_pmu = info;
+
+	__arm_spe_pmu_reset_local();
+	enable_percpu_irq(spe_pmu->irq, IRQ_TYPE_NONE);
+}
+
+static void __arm_spe_pmu_stop_one(void *info)
+{
+	struct arm_spe_pmu *spe_pmu = info;
+
+	disable_percpu_irq(spe_pmu->irq);
+	__arm_spe_pmu_reset_local();
+}
+
+static int arm_spe_pmu_cpu_startup(unsigned int cpu, struct hlist_node *node)
+{
+	struct arm_spe_pmu *spe_pmu;
+
+	spe_pmu = hlist_entry_safe(node, struct arm_spe_pmu, hotplug_node);
+	if (!cpumask_test_cpu(cpu, &spe_pmu->supported_cpus))
+		return 0;
+
+	__arm_spe_pmu_setup_one(spe_pmu);
+	return 0;
+}
+
+static int arm_spe_pmu_cpu_teardown(unsigned int cpu, struct hlist_node *node)
+{
+	struct arm_spe_pmu *spe_pmu;
+
+	spe_pmu = hlist_entry_safe(node, struct arm_spe_pmu, hotplug_node);
+	if (!cpumask_test_cpu(cpu, &spe_pmu->supported_cpus))
+		return 0;
+
+	__arm_spe_pmu_stop_one(spe_pmu);
+	return 0;
+}
+
+static int arm_spe_pmu_dev_init(struct arm_spe_pmu *spe_pmu)
+{
+	int ret;
+	cpumask_t *mask = &spe_pmu->supported_cpus;
+
+	/* Make sure we probe the hardware on a relevant CPU */
+	ret = smp_call_function_any(mask,  __arm_spe_pmu_dev_probe, spe_pmu, 1);
+	if (ret || !(spe_pmu->features & SPE_PMU_FEAT_DEV_PROBED))
+		return -ENXIO;
+
+	/* Request our PPIs (note that the IRQ is still disabled) */
+	ret = request_percpu_irq(spe_pmu->irq, arm_spe_pmu_irq_handler, DRVNAME,
+				 spe_pmu->handle);
+	if (ret)
+		return ret;
+
+	/*
+	 * Register our hotplug notifier now so we don't miss any events.
+	 * This will enable the IRQ for any supported CPUs that are already
+	 * up.
+	 */
+	ret = cpuhp_state_add_instance(arm_spe_pmu_online,
+				       &spe_pmu->hotplug_node);
+	if (ret)
+		free_percpu_irq(spe_pmu->irq, spe_pmu->handle);
+
+	return ret;
+}
+
+static void arm_spe_pmu_dev_teardown(struct arm_spe_pmu *spe_pmu)
+{
+	cpuhp_state_remove_instance(arm_spe_pmu_online, &spe_pmu->hotplug_node);
+	free_percpu_irq(spe_pmu->irq, spe_pmu->handle);
+}
+
+/* Driver and device probing */
+static int arm_spe_pmu_irq_probe(struct arm_spe_pmu *spe_pmu)
+{
+	struct platform_device *pdev = spe_pmu->pdev;
+	int irq = platform_get_irq(pdev, 0);
+
+	if (irq < 0) {
+		dev_err(&pdev->dev, "failed to get IRQ (%d)\n", irq);
+		return -ENXIO;
+	}
+
+	if (!irq_is_percpu(irq)) {
+		dev_err(&pdev->dev, "expected PPI but got SPI (%d)\n", irq);
+		return -EINVAL;
+	}
+
+	if (irq_get_percpu_devid_partition(irq, &spe_pmu->supported_cpus)) {
+		dev_err(&pdev->dev, "failed to get PPI partition (%d)\n", irq);
+		return -EINVAL;
+	}
+
+	spe_pmu->irq = irq;
+	return 0;
+}
+
+static const struct of_device_id arm_spe_pmu_of_match[] = {
+	{ .compatible = "arm,statistical-profiling-extension-v1", .data = (void *)1 },
+};
+
+static int arm_spe_pmu_device_dt_probe(struct platform_device *pdev)
+{
+	int ret;
+	struct arm_spe_pmu *spe_pmu;
+	struct device *dev = &pdev->dev;
+
+	spe_pmu = devm_kzalloc(dev, sizeof(*spe_pmu), GFP_KERNEL);
+	if (!spe_pmu) {
+		dev_err(dev, "failed to allocate spe_pmu\n");
+		return -ENOMEM;
+	}
+
+	spe_pmu->handle = alloc_percpu(typeof(*spe_pmu->handle));
+	if (!spe_pmu->handle)
+		return -ENOMEM;
+
+	spe_pmu->pdev = pdev;
+	platform_set_drvdata(pdev, spe_pmu);
+
+	ret = arm_spe_pmu_irq_probe(spe_pmu);
+	if (ret)
+		goto out_free_handle;
+
+	ret = arm_spe_pmu_dev_init(spe_pmu);
+	if (ret)
+		goto out_free_handle;
+
+	ret = arm_spe_pmu_perf_init(spe_pmu);
+	if (ret)
+		goto out_teardown_dev;
+
+	return 0;
+
+out_teardown_dev:
+	arm_spe_pmu_dev_teardown(spe_pmu);
+out_free_handle:
+	free_percpu(spe_pmu->handle);
+	return ret;
+}
+
+static int arm_spe_pmu_device_remove(struct platform_device *pdev)
+{
+	struct arm_spe_pmu *spe_pmu = platform_get_drvdata(pdev);
+
+	arm_spe_pmu_perf_destroy(spe_pmu);
+	arm_spe_pmu_dev_teardown(spe_pmu);
+	free_percpu(spe_pmu->handle);
+	return 0;
+}
+
+static struct platform_driver arm_spe_pmu_driver = {
+	.driver	= {
+		.name		= DRVNAME,
+		.of_match_table	= of_match_ptr(arm_spe_pmu_of_match),
+	},
+	.probe	= arm_spe_pmu_device_dt_probe,
+	.remove	= arm_spe_pmu_device_remove,
+};
+
+static int __init arm_spe_pmu_init(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, DRVNAME,
+				      arm_spe_pmu_cpu_startup,
+				      arm_spe_pmu_cpu_teardown);
+	if (ret < 0)
+		return ret;
+	arm_spe_pmu_online = ret;
+
+	ret = platform_driver_register(&arm_spe_pmu_driver);
+	if (ret)
+		cpuhp_remove_multi_state(arm_spe_pmu_online);
+
+	return ret;
+}
+
+static void __exit arm_spe_pmu_exit(void)
+{
+	platform_driver_unregister(&arm_spe_pmu_driver);
+	cpuhp_remove_multi_state(arm_spe_pmu_online);
+}
+
+module_init(arm_spe_pmu_init);
+module_exit(arm_spe_pmu_exit);
+
+MODULE_DESCRIPTION("Perf driver for the ARMv8.2 Statistical Profiling Extension");
+MODULE_AUTHOR("Will Deacon <will.deacon@arm.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v4 5/5] dt-bindings: Document devicetree binding for ARM SPE
  2017-06-05 15:22 [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Will Deacon
                   ` (3 preceding siblings ...)
  2017-06-05 15:22 ` [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension Will Deacon
@ 2017-06-05 15:22 ` Will Deacon
  2017-06-12 11:08 ` [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Mark Rutland
  5 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2017-06-05 15:22 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: marc.zyngier, mark.rutland, kim.phillips, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel, Will Deacon

This patch documents the devicetree binding in use for ARM SPE.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Rob Herring <robh@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 Documentation/devicetree/bindings/arm/spe-pmu.txt | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/spe-pmu.txt

diff --git a/Documentation/devicetree/bindings/arm/spe-pmu.txt b/Documentation/devicetree/bindings/arm/spe-pmu.txt
new file mode 100644
index 000000000000..93372f2a7df9
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/spe-pmu.txt
@@ -0,0 +1,20 @@
+* ARMv8.2 Statistical Profiling Extension (SPE) Performance Monitor Units (PMU)
+
+ARMv8.2 introduces the optional Statistical Profiling Extension for collecting
+performance sample data using an in-memory trace buffer.
+
+** SPE Required properties:
+
+- compatible : should be one of:
+	       "arm,statistical-profiling-extension-v1"
+
+- interrupts : Exactly 1 PPI must be listed. For heterogeneous systems where
+               SPE is only supported on a subset of the CPUs, please consult
+	       the arm,gic-v3 binding for details on describing a PPI partition.
+
+** Example:
+
+spe-pmu {
+        compatible = "arm,statistical-profiling-extension-v1";
+        interrupts = <GIC_PPI 05 IRQ_TYPE_LEVEL_HIGH &part1>;
+};
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension
  2017-06-05 15:22 ` [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension Will Deacon
@ 2017-06-05 15:55   ` Kim Phillips
  2017-06-05 16:11     ` Will Deacon
  2017-06-15 14:57   ` Mark Rutland
  2017-07-03 17:23   ` Mark Rutland
  2 siblings, 1 reply; 33+ messages in thread
From: Kim Phillips @ 2017-06-05 15:55 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, marc.zyngier, mark.rutland, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Mon, 5 Jun 2017 16:22:56 +0100
Will Deacon <will.deacon@arm.com> wrote:

> +/* Perf callbacks */
> +static int arm_spe_pmu_event_init(struct perf_event *event)
> +{
> +	u64 reg;
> +	struct perf_event_attr *attr = &event->attr;
> +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> +
> +	/* This is, of course, deeply driver-specific */
> +	if (attr->type != event->pmu->type)
> +		return -ENOENT;
> +
> +	if (event->cpu >= 0 &&
> +	    !cpumask_test_cpu(event->cpu, &spe_pmu->supported_cpus))
> +		return -ENOENT;
> +
> +	if (arm_spe_event_to_pmsevfr(event) & PMSEVFR_EL1_RES0)
> +		return -EOPNOTSUPP;
> +
> +	if (event->hw.sample_period < spe_pmu->min_period ||
> +	    event->hw.sample_period & PMSIRR_EL1_IVAL_MASK)
> +		return -EOPNOTSUPP;
> +
> +	if (attr->exclude_idle)
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * Feedback-directed frequency throttling doesn't work when we
> +	 * have a buffer of samples. We'd need to manually count the
> +	 * samples in the buffer when it fills up and adjust the event
> +	 * count to reflect that. Instead, force the user to specify a
> +	 * sample period instead.
> +	 */
> +	if (attr->freq)
> +		return -EINVAL;
> +
> +	reg = arm_spe_event_to_pmsfcr(event);
> +	if ((reg & BIT(PMSFCR_EL1_FE_SHIFT)) &&
> +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_EVT))
> +		return -EOPNOTSUPP;
> +
> +	if ((reg & BIT(PMSFCR_EL1_FT_SHIFT)) &&
> +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_TYP))
> +		return -EOPNOTSUPP;
> +
> +	if ((reg & BIT(PMSFCR_EL1_FL_SHIFT)) &&
> +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_LAT))
> +		return -EOPNOTSUPP;
> +
> +	return 0;
> +}

AFAICT, my comments from the last submission have still not been fully
addressed:

http://lists.infradead.org/pipermail/linux-arm-kernel/2017-May/508027.html

Thanks,

Kim

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension
  2017-06-05 15:55   ` Kim Phillips
@ 2017-06-05 16:11     ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2017-06-05 16:11 UTC (permalink / raw)
  To: Kim Phillips
  Cc: linux-arm-kernel, marc.zyngier, mark.rutland, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Mon, Jun 05, 2017 at 10:55:16AM -0500, Kim Phillips wrote:
> On Mon, 5 Jun 2017 16:22:56 +0100
> Will Deacon <will.deacon@arm.com> wrote:
> 
> > +/* Perf callbacks */
> > +static int arm_spe_pmu_event_init(struct perf_event *event)
> > +{
> > +	u64 reg;
> > +	struct perf_event_attr *attr = &event->attr;
> > +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> > +
> > +	/* This is, of course, deeply driver-specific */
> > +	if (attr->type != event->pmu->type)
> > +		return -ENOENT;
> > +
> > +	if (event->cpu >= 0 &&
> > +	    !cpumask_test_cpu(event->cpu, &spe_pmu->supported_cpus))
> > +		return -ENOENT;
> > +
> > +	if (arm_spe_event_to_pmsevfr(event) & PMSEVFR_EL1_RES0)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (event->hw.sample_period < spe_pmu->min_period ||
> > +	    event->hw.sample_period & PMSIRR_EL1_IVAL_MASK)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (attr->exclude_idle)
> > +		return -EOPNOTSUPP;
> > +
> > +	/*
> > +	 * Feedback-directed frequency throttling doesn't work when we
> > +	 * have a buffer of samples. We'd need to manually count the
> > +	 * samples in the buffer when it fills up and adjust the event
> > +	 * count to reflect that. Instead, force the user to specify a
> > +	 * sample period instead.
> > +	 */
> > +	if (attr->freq)
> > +		return -EINVAL;
> > +
> > +	reg = arm_spe_event_to_pmsfcr(event);
> > +	if ((reg & BIT(PMSFCR_EL1_FE_SHIFT)) &&
> > +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_EVT))
> > +		return -EOPNOTSUPP;
> > +
> > +	if ((reg & BIT(PMSFCR_EL1_FT_SHIFT)) &&
> > +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_TYP))
> > +		return -EOPNOTSUPP;
> > +
> > +	if ((reg & BIT(PMSFCR_EL1_FL_SHIFT)) &&
> > +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_LAT))
> > +		return -EOPNOTSUPP;
> > +
> > +	return 0;
> > +}
> 
> AFAICT, my comments from the last submission have still not been fully
> addressed:
> 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2017-May/508027.html

To be frank, I really don't plan to address them and, even if I did, I would
trust Mark to NAK the change. If you're desperate for pr_debug, I'll add it
to keep you happy, but anything more than that needs to come in the form of
a separate patch submission addressing the wider problem of error reporting
from PMU drivers back to userspace. Patches welcome, but I suspect you're
still busy working on the tools code.

Do you have any constructive comments on the patch?

Will

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-05 15:22 [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Will Deacon
                   ` (4 preceding siblings ...)
  2017-06-05 15:22 ` [PATCH v4 5/5] dt-bindings: Document devicetree binding for ARM SPE Will Deacon
@ 2017-06-12 11:08 ` Mark Rutland
  2017-06-12 16:20   ` Kim Phillips
  5 siblings, 1 reply; 33+ messages in thread
From: Mark Rutland @ 2017-06-12 11:08 UTC (permalink / raw)
  To: Will Deacon, kim.phillips
  Cc: linux-arm-kernel, marc.zyngier, tglx, peterz, alexander.shishkin,
	robh, suzuki.poulose, pawel.moll, mathieu.poirier, mingo,
	linux-kernel

On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:
> Hi all,
>
> This is the sixth posting of the patches previously posted here:
>
>   rfcv1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/476450.html
>   rfcv2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/479387.html
>      v1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/483684.html
>      v2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-April/499938.html
>      v3: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-May/507132.html
>
> The main change since v3 is that I have reworked and fixed the CPU hotplug
> and notifier bits, in light of review comments from tglx.
>
> The architecture documentation is available here:
>
>   https://developer.arm.com/products/architecture/a-profile/docs/ddi0586/latest/arm-architecture-reference-manual-supplement-statistical-profiling-extension-for-armv8-a

Kim, do you have any version of the userspace side that we could look
at?

For review, it would be really helpful to have something that can poke
the PMU, even if it's incomplete or lacking polish.

Thanks,
Mark.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-12 11:08 ` [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Mark Rutland
@ 2017-06-12 16:20   ` Kim Phillips
  2017-06-15 15:57     ` Kim Phillips
  0 siblings, 1 reply; 33+ messages in thread
From: Kim Phillips @ 2017-06-12 16:20 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Will Deacon, linux-arm-kernel, marc.zyngier, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Mon, 12 Jun 2017 12:08:23 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:
> > This is the sixth posting of the patches previously posted here:
> > 
> >   rfcv1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/476450.html
> >   rfcv2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/479387.html
> >      v1: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-January/483684.html
> >      v2: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-April/499938.html
> >      v3: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-May/507132.html
> > 
> > The main change since v3 is that I have reworked and fixed the CPU hotplug
> > and notifier bits, in light of review comments from tglx.
> > 
> > The architecture documentation is available here:
> > 
> >   https://developer.arm.com/products/architecture/a-profile/docs/ddi0586/latest/arm-architecture-reference-manual-supplement-statistical-profiling-extension-for-armv8-a
> 
> Kim, do you have any version of the userspace side that we could look
> at?
> 
> For review, it would be really helpful to have something that can poke
> the PMU, even if it's incomplete or lacking polish.

Here's the latest push, based on a a couple of prior versions of this
driver:

http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1

I don't seem to be able to get any SPE data output after rebasing on
this version of the driver.  Still don't know why at the moment...

Kim

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension
  2017-06-05 15:22 ` [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension Will Deacon
  2017-06-05 15:55   ` Kim Phillips
@ 2017-06-15 14:57   ` Mark Rutland
  2017-06-21 15:39     ` Will Deacon
  2017-07-03 17:23   ` Mark Rutland
  2 siblings, 1 reply; 33+ messages in thread
From: Mark Rutland @ 2017-06-15 14:57 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, marc.zyngier, kim.phillips, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

Hi Will,

Sorry for the delay on this review; it's taken me a while to ingest DDI
0586A and get a feel for how profiling PMUs work.

I have a number of comments below.

On Mon, Jun 05, 2017 at 04:22:56PM +0100, Will Deacon wrote:
> +/* ID registers */
> +#define PMSIDR_EL1			sys_reg(3, 0, 9, 9, 7)

Nit: could we please give the sysreg definitions a SYS_ prefix, for
consistency with other (architected) sysregs?

Ideally, as these are architected they'd live in <asm/sysreg.h>, but I'm
happy to factor that out later.

> +#define PMSIDR_EL1_FE_SHIFT		0
> +#define PMSIDR_EL1_FT_SHIFT		1
> +#define PMSIDR_EL1_FL_SHIFT		2
> +#define PMSIDR_EL1_ARCHINST_SHIFT	3
> +#define PMSIDR_EL1_LDS_SHIFT		4
> +#define PMSIDR_EL1_ERND_SHIFT		5
> +#define PMSIDR_EL1_INTERVAL_SHIFT	8
> +#define PMSIDR_EL1_INTERVAL_MASK	0xfUL
> +#define PMSIDR_EL1_MAXSIZE_SHIFT	12
> +#define PMSIDR_EL1_MAXSIZE_MASK		0xfUL
> +#define PMSIDR_EL1_COUNTSIZE_SHIFT	16
> +#define PMSIDR_EL1_COUNTSIZE_MASK	0xfUL
> +
> +#define PMBIDR_EL1			sys_reg(3, 0, 9, 10, 7)
> +#define PMBIDR_EL1_ALIGN_SHIFT		0
> +#define PMBIDR_EL1_ALIGN_MASK		0xfU
> +#define PMBIDR_EL1_P_SHIFT		4
> +#define PMBIDR_EL1_F_SHIFT		5
> +
> +/* Sampling controls */
> +#define PMSCR_EL1			sys_reg(3, 0, 9, 9, 0)
> +#define PMSCR_EL1_E0SPE_SHIFT		0
> +#define PMSCR_EL1_E1SPE_SHIFT		1
> +#define PMSCR_EL1_CX_SHIFT		3
> +#define PMSCR_EL1_PA_SHIFT		4
> +#define PMSCR_EL1_TS_SHIFT		5
> +#define PMSCR_EL1_PCT_SHIFT		6
> +
> +#define PMSICR_EL1			sys_reg(3, 0, 9, 9, 2)
> +
> +#define PMSIRR_EL1			sys_reg(3, 0, 9, 9, 3)
> +#define PMSIRR_EL1_RND_SHIFT		0
> +#define PMSIRR_EL1_IVAL_MASK		0xffUL

This is a little odd. This covers RND and (some) RES0 bits, not
INTERVAL, so it's the opposite polarity to the rest of the masks, and
incomplete (missing the upper 32 RES0 bits).

Could we please have this in the same polarity as the other masks,
precisely covering the INTERVAL field? i.e.

#define PMSIRR_EL1_INTERVAL_SHIFT	8
#define PMSIRR_EL1_INTERVAL_MASK	0xffffffUL

... I've proposed corresponding updates to the usages below.

> +
> +/* Filtering controls */
> +#define PMSFCR_EL1			sys_reg(3, 0, 9, 9, 4)
> +#define PMSFCR_EL1_FE_SHIFT		0
> +#define PMSFCR_EL1_FT_SHIFT		1
> +#define PMSFCR_EL1_FL_SHIFT		2
> +#define PMSFCR_EL1_B_SHIFT		16
> +#define PMSFCR_EL1_LD_SHIFT		17
> +#define PMSFCR_EL1_ST_SHIFT		18
> +
> +#define PMSEVFR_EL1			sys_reg(3, 0, 9, 9, 5)
> +#define PMSEVFR_EL1_RES0		0x0000ffff00ff0f55UL
> +
> +#define PMSLATFR_EL1			sys_reg(3, 0, 9, 9, 6)
> +#define PMSLATFR_EL1_MINLAT_SHIFT	0

I was going to suggest that we need a PMSLATFR_EL1_MINLAT_MASK, but I
see this is handled implicitly by ATTR_CFG_FLD_min_latency_{LO,HI} and
ATTR_CFG_GET_FLD().

> +
> +/* Buffer controls */
> +#define PMBLIMITR_EL1			sys_reg(3, 0, 9, 10, 0)
> +#define PMBLIMITR_EL1_E_SHIFT		0
> +#define PMBLIMITR_EL1_FM_SHIFT		1
> +#define PMBLIMITR_EL1_FM_MASK		0x3UL
> +#define PMBLIMITR_EL1_FM_STOP_IRQ	(0 << PMBLIMITR_EL1_FM_SHIFT)
> +
> +#define PMBPTR_EL1			sys_reg(3, 0, 9, 10, 1)
> +
> +/* Buffer error reporting */
> +#define PMBSR_EL1			sys_reg(3, 0, 9, 10, 3)
> +#define PMBSR_EL1_COLL_SHIFT		16
> +#define PMBSR_EL1_S_SHIFT		17
> +#define PMBSR_EL1_EA_SHIFT		18
> +#define PMBSR_EL1_DL_SHIFT		19
> +#define PMBSR_EL1_EC_SHIFT		26
> +#define PMBSR_EL1_EC_MASK		0x3fUL
> +
> +#define PMBSR_EL1_EC_BUF		(0x0UL << PMBSR_EL1_EC_SHIFT)
> +#define PMBSR_EL1_EC_FAULT_S1		(0x24UL << PMBSR_EL1_EC_SHIFT)
> +#define PMBSR_EL1_EC_FAULT_S2		(0x25UL << PMBSR_EL1_EC_SHIFT)
> +
> +#define PMBSR_EL1_FAULT_FSC_SHIFT	0
> +#define PMBSR_EL1_FAULT_FSC_MASK	0x3fUL

Nit: it might be worth having MSS in the name to show the field
hierarchy, e.g. PMBSR_EL1_MSS_FAULT_FSC_MASK

> +
> +#define PMBSR_EL1_BUF_BSC_SHIFT		0
> +#define PMBSR_EL1_BUF_BSC_MASK		0x3fUL
> +
> +#define PMBSR_EL1_BUF_BSC_FULL		(0x1UL << PMBSR_EL1_BUF_BSC_SHIFT)

Likewise here.

> +
> +#define psb_csync()			asm volatile("hint #17")

Other than my comments above, all the register/field/insn definitions
appear to be correct per DDI 0586A.

> +
> +struct arm_spe_pmu_buf {
> +	int					nr_pages;
> +	bool					snapshot;
> +	void					*base;
> +};
> +
> +struct arm_spe_pmu {
> +	struct pmu				pmu;
> +	struct platform_device			*pdev;
> +	cpumask_t				supported_cpus;
> +	struct hlist_node			hotplug_node;
> +
> +	int					irq; /* PPI */
> +
> +	u16					min_period;
> +	u16					cnt_width;

Elsewhere, we refer to this as the counter size (e.g.
SPE_PMU_CAP_CNT_SZ, and count_size in the sysfs interface).

Could we please pick one of "width" and "size" and use it consistently?

I guess "size" is preferable, given the architectural name is
"CountSize".

> +
> +#define SPE_PMU_FEAT_FILT_EVT			(1UL << 0)
> +#define SPE_PMU_FEAT_FILT_TYP			(1UL << 1)
> +#define SPE_PMU_FEAT_FILT_LAT			(1UL << 2)
> +#define SPE_PMU_FEAT_ARCH_INST			(1UL << 3)
> +#define SPE_PMU_FEAT_LDS			(1UL << 4)
> +#define SPE_PMU_FEAT_ERND			(1UL << 5)
> +#define SPE_PMU_FEAT_DEV_PROBED			(1UL << 63)
> +	u64					features;
> +
> +	u16					max_record_sz;
> +	u16					align;
> +	struct perf_output_handle __percpu	*handle;
> +};
> +
> +#define to_spe_pmu(p) (container_of(p, struct arm_spe_pmu, pmu))
> +
> +/* Convert a free-running index from perf into an SPE buffer offset */
> +#define PERF_IDX2OFF(idx, buf)	((idx) & (((buf)->nr_pages << PAGE_SHIFT) - 1))

The masking logic here assumes nr_pages is a power of two.

That's not always true, as arm_spe_pmu_setup_aux() only ensures that
nr_pages is a multiple of 2 * PAGE_SIZE, and only when using snapshot
mode.

For example, with ten 4K pages:

nr_pages:			0b1010
(nr_pages << PAGE_SHIFT):	0b1010000000000000
(nr_pages << PAGE_SHIFT) - 1:	0b1001111111111111

> +
> +/* Keep track of our dynamic hotplug state */
> +static enum cpuhp_state arm_spe_pmu_online;
> +
> +/* This sysfs gunk was really good fun to write. */
> +enum arm_spe_pmu_capabilities {
> +	SPE_PMU_CAP_ARCH_INST = 0,

No need for the initializer; enums start at zero.

> +	SPE_PMU_CAP_ERND,
> +	SPE_PMU_CAP_FEAT_MAX,
> +	SPE_PMU_CAP_CNT_SZ = SPE_PMU_CAP_FEAT_MAX,
> +	SPE_PMU_CAP_MIN_IVAL,
> +};

IMO, it would be worth having s/CAP/CAP_FEAT/ for the HW features.

We could get rid of the confusing SPE_PMU_CAP_FEAT_MAX definition here,
if we were to:

> +
> +static int arm_spe_pmu_feat_caps[SPE_PMU_CAP_FEAT_MAX] = {
> +	[SPE_PMU_CAP_ARCH_INST]	= SPE_PMU_FEAT_ARCH_INST,
> +	[SPE_PMU_CAP_ERND]	= SPE_PMU_FEAT_ERND,
> +};

.. change this to:

static int arm_spe_pmu_feat_caps[] = {
	...
};

... and:

> +
> +static u32 arm_spe_pmu_cap_get(struct arm_spe_pmu *spe_pmu, int cap)
> +{
> +	if (cap < SPE_PMU_CAP_FEAT_MAX)

... change this to:

	if (cap < ARRAY_SIZE(arm_spe_pmu_feat_caps))

> +		return !!(spe_pmu->features & arm_spe_pmu_feat_caps[cap]);
> +
> +	switch (cap) {
> +	case SPE_PMU_CAP_CNT_SZ:
> +		return spe_pmu->cnt_width;
> +	case SPE_PMU_CAP_MIN_IVAL:
> +		return spe_pmu->min_period;
> +	default:
> +		WARN(1, "unknown cap %d\n", cap);
> +	}
> +
> +	return 0;
> +}
> +
> +static ssize_t arm_spe_pmu_cap_show(struct device *dev,
> +				    struct device_attribute *attr,
> +				    char *buf)
> +{
> +	struct platform_device *pdev = to_platform_device(dev);
> +	struct arm_spe_pmu *spe_pmu = platform_get_drvdata(pdev);
> +	struct dev_ext_attribute *ea =
> +		container_of(attr, struct dev_ext_attribute, attr);
> +	int cap = (long)ea->var;
> +
> +	return snprintf(buf, PAGE_SIZE, "%u\n",
> +		arm_spe_pmu_cap_get(spe_pmu, cap));
> +}
> +
> +#define SPE_EXT_ATTR_ENTRY(_name, _func, _var)				\
> +	&((struct dev_ext_attribute[]) {				\
> +		{ __ATTR(_name, S_IRUGO, _func, NULL), (void *)_var }	\
> +	})[0].attr.attr
> +
> +#define SPE_CAP_EXT_ATTR_ENTRY(_name, _var)				\
> +	SPE_EXT_ATTR_ENTRY(_name, arm_spe_pmu_cap_show, _var)
> +
> +static struct attribute *arm_spe_pmu_cap_attr[] = {
> +	SPE_CAP_EXT_ATTR_ENTRY(arch_inst, SPE_PMU_CAP_ARCH_INST),
> +	SPE_CAP_EXT_ATTR_ENTRY(ernd, SPE_PMU_CAP_ERND),
> +	SPE_CAP_EXT_ATTR_ENTRY(count_size, SPE_PMU_CAP_CNT_SZ),
> +	SPE_CAP_EXT_ATTR_ENTRY(min_interval, SPE_PMU_CAP_MIN_IVAL),
> +	NULL,
> +};

I'd have expected GCC to warn about (integer/enum) _var values being
cast straight to void *, given the size mismatch.

Is that not the case, or do we need an unsigned long cast in
SPE_CAP_EXT_ATTR_ENTRY()?

Maybe GCC only complains the other way around.

[...]

> +/* Convert between user ABI and register values */
> +static u64 arm_spe_event_to_pmscr(struct perf_event *event)
> +{
> +	struct perf_event_attr *attr = &event->attr;
> +	u64 reg = 0;
> +
> +	reg |= ATTR_CFG_GET_FLD(attr, ts_enable) << PMSCR_EL1_TS_SHIFT;
> +	reg |= ATTR_CFG_GET_FLD(attr, pa_enable) << PMSCR_EL1_PA_SHIFT;

We should limit PA access to privileged users.

> +
> +	if (!attr->exclude_user)
> +		reg |= BIT(PMSCR_EL1_E0SPE_SHIFT);
> +
> +	if (!attr->exclude_kernel)
> +		reg |= BIT(PMSCR_EL1_E1SPE_SHIFT);
> +
> +	if (IS_ENABLED(CONFIG_PID_IN_CONTEXTIDR))
> +		reg |= BIT(PMSCR_EL1_CX_SHIFT);

... maybe likewise for CONTEXTIDR, too.

> +
> +	return reg;
> +}
> +
> +static void arm_spe_event_sanitise_period(struct perf_event *event)
> +{
> +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> +	u64 period = event->hw.sample_period & ~PMSIRR_EL1_IVAL_MASK;

... as noted above, the upper 32 bits are RES0 in addition to the low 8
bits, so we need to explicitly check bits 31:8, e.g.

	u64 period = event->hw.sample_period;
	period &= (PMSIRR_EL1_INTERVAL_MASK << PMSIRR_EL1_INTERVAL_SHIFT);

> +
> +	if (period < spe_pmu->min_period)
> +		period = spe_pmu->min_period;

We already verify this in arm_spe_pmu_event_init(), so we don't need to
check this here.

We can drop arm_spe_event_sanitise_period() entirely. Given we validate
the period at event_init() time, there's no need to sanitize the value.

> +
> +	event->hw.sample_period = period;
> +}
> +
> +static u64 arm_spe_event_to_pmsirr(struct perf_event *event)
> +{
> +	struct perf_event_attr *attr = &event->attr;
> +	u64 reg = 0;
> +
> +	arm_spe_event_sanitise_period(event);
> +
> +	reg |= ATTR_CFG_GET_FLD(attr, jitter) << PMSIRR_EL1_RND_SHIFT;
> +	reg |= event->hw.sample_period;
> +
> +	return reg;
> +}

Given the above:

	u64 reg = event->hw.sample_period;
	reg |= ATTR_CFG_GET_FLD(attr, jitter) << PMSIRR_EL1_RND_SHIFT;

	return reg;

[...]

> +static bool arm_spe_pmu_buffer_mgmt_pending(u64 pmbsr)
> +{
> +	const char *err_str;
> +
> +	/* Service required? */
> +	if (!(pmbsr & BIT(PMBSR_EL1_S_SHIFT)))
> +		return false;
> +
> +	/* We only expect buffer management events */
> +	switch (pmbsr & (PMBSR_EL1_EC_MASK << PMBSR_EL1_EC_SHIFT)) {
> +	case PMBSR_EL1_EC_BUF:
> +		/* Handled below */
> +		break;
> +	case PMBSR_EL1_EC_FAULT_S1:
> +	case PMBSR_EL1_EC_FAULT_S2:
> +		err_str = "Unexpected buffer fault";
> +		goto out_err;
> +	default:
> +		err_str = "Unknown error code";
> +		goto out_err;
> +	}

For the error cases, I take it the assumption is that we leave
PMBSR_EL1.S set, so that the HW doesn't start again?

> +
> +	/* Buffer management event */
> +	switch (pmbsr & (PMBSR_EL1_BUF_BSC_MASK << PMBSR_EL1_BUF_BSC_SHIFT)) {
> +	case PMBSR_EL1_BUF_BSC_FULL:
> +		/* Ensure new profiling data is visible to the CPU */
> +		psb_csync();
> +		dsb(nsh);

I think that NSH might not be sufficient, given how this function is
used by callers below. I'll comment specifically in those call-sites.

> +		return true;
> +	default:
> +		err_str = "Unknown buffer status code";
> +	}
> +
> +out_err:
> +	pr_err_ratelimited("%s on CPU %d [PMBSR=0x%08llx]\n", err_str,
> +			   smp_processor_id(), pmbsr);

It might be worth dumping pmbsr with %016lx. The upper 64 bits are
currently RES0, but they do exist.

> +	return false;
> +}
> +

Could we have a comment block here to describe (roughly) what 
we're trying to do for the snapshot case?

> +static u64 arm_spe_pmu_next_snapshot_off(struct perf_output_handle *handle)
> +{
> +	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
> +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(handle->event->pmu);
> +	u64 head = PERF_IDX2OFF(handle->head, buf);
> +	u64 limit = buf->nr_pages * PAGE_SIZE;
> +
> +	/*
> +	 * The trace format isn't parseable in reverse, so clamp
> +	 * the limit to half of the buffer size in snapshot mode
> +	 * so that the worst case is half a buffer of records, as
> +	 * opposed to a single record.
> +	 */
> +	if (head < limit >> 1)
> +		limit >>= 1;

I was going to ask how we ensured nr_pages was 2 * SZ_4K * k, but I see
that arm_spe_pmu_setup_aux() ensures that when using snapshot mode.

> +
> +	/*
> +	 * If we're within max_record_sz of the limit, we must
> +	 * pad, move the head index and recompute the limit.
> +	 */
> +	if (limit - head < spe_pmu->max_record_sz) {
> +		memset(buf->base + head, 0, limit - head);

Could we have a mnemonic for the padding byte, and/or a helper that
wraps memset? e.g.

static void pad_buffer(void *start, u64 size)
{
	/* The padding packet is a single zero byte */
	memset(start, 0, size);
}

> +		handle->head = PERF_IDX2OFF(limit, buf);
> +		limit = ((buf->nr_pages * PAGE_SIZE) >> 1) + handle->head;
> +	}
> +
> +	return limit;
> +}
> +
> +static u64 __arm_spe_pmu_next_off(struct perf_output_handle *handle)
> +{
> +	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
> +	u64 head = PERF_IDX2OFF(handle->head, buf);
> +	u64 tail = PERF_IDX2OFF(handle->head + handle->size, buf);
> +	u64 wakeup = PERF_IDX2OFF(handle->wakeup, buf);
> +	u64 limit = buf->nr_pages * PAGE_SIZE;
> +
> +	/*
> +	 * Set the limit pointer to either the watermark or the
> +	 * current tail pointer; whichever comes first.
> +	 */
> +	if (handle->head + handle->size <= handle->wakeup) {
> +		/* The tail is next, so check for wrapping */
> +		if (tail >= head) {
> +			/*
> +			 * No wrapping, but need to align downwards to
> +			 * avoid corrupting unconsumed data.
> +			 */
> +			limit = round_down(tail, PAGE_SIZE);
> +
> +		}
> +	} else if (wakeup >= head) {

When wakeup == head, do we need to signal a wakeup event somehow?
Currently we'll pad the buffer, signal truncation, and end output, which
seems a little odd, but maybe that's what perf expects.

> +		/*
> +		 * The wakeup is next and doesn't wrap. Align upwards to
> +		 * ensure that we do indeed reach the watermark.
> +		 */
> +		limit = round_up(wakeup, PAGE_SIZE);
> +
> +		/*
> +		 * If rounding up crosses the tail, then we have to
> +		 * round down to avoid corrupting unconsumed data.
> +		 * Hopefully the tail will have moved by the time we
> +		 * hit the new limit.
> +		 */
> +		if (wakeup < tail && limit > tail)
> +			limit = round_down(wakeup, PAGE_SIZE);
> +	}

It took me a while to grok that we must consider the wakeup in
free-running counter space to avoid early wakeups, while we must
consider the tail in ring-buffer offset space to avoid clobbering data.

With that understanding, I think we have an issue here. If wakeup is
more than buffer size in the future, and the buffer is empty, I think we
set the limit too low.

In that case, we'd evaluate:

	handle->head + handle->size <= handle->wakeup

... as true, since size is at most buffer size. Thus we'd go into the
first if block. There we'd evaluate:

	tail >= head

... as true, since when the buffer is empty, head == tail. Thus, we'd
set the limit to:

	round_down(tail, PAGE_SIZE)

... which'll leave us with limit <= head, since head == tail. Thus,
we'll hit the case below:

> +
> +	/*
> +	 * If rounding down crosses the head, then the buffer is full,
> +	 * so pad to tail and end the session.
> +	 */
> +	if (limit <= head) {
> +		memset(buf->base + head, 0, handle->size);
> +		perf_aux_output_skip(handle, handle->size);
> +		perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED);
> +		perf_aux_output_end(handle, 0);
> +		limit = 0;
> +	}
> +
> +	return limit;
> +}

... and end all output, even though the entire buffer was empty, and we
could have returned the end of the buffer as the limit.

It might be that something prevents wakeup from being that far in the
future, but in previous discussions we'd assumed that it could be any
arbitrary value.

I believe we can solve that, and simplify the logic as below. I've left
the wakeup < head and wakeup == head cases as above, ignored and
terminating respectively.

static u64 __arm_spe_pmu_next_off(struct perf_output_handle *handle)
{
	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
	const u64 bufsize = buf->nr_pages * PAGE_SIZE;
	u64 limit = bufsize;
	u64 head = PERF_IDX2OFF(handle->head, buf);
	u64 tail = PERF_IDX2OFF(handle->head + handle->size, buf);
	u64 wakeup = PERF_IDX2OFF(handle->wakeup, buf);

	if (!handle->size)
		goto no_space;
	
	/*
	 * Avoid clobbering unconsumed data. We know we have space, so
	 * if we see head == tail we know that the buffer is empty. If
	 * head > tail, then there's nothing to clobber prior to
	 * wrapping.
	 */
	if (head < tail)
		limit = round_down(tail, PAGE_SIZE);
	
	/*
	 * Wakeup may be arbitrarily far into future. If it's not in the
	 * current generation, either we'll wrap before hitting it, or
	 * it's in the past and has been handled already.
	 *
	 * If there's a wakeup before we wrap, arrange to be woken up by
	 * the page boundary following it. Keep the tail boundary if
	 * that's lower.
	 */
	if ((handle->wakeup / bufsize) == (handle->head / bufsize)) &&
	    head <= wakeup)
		limit = min(limit, round_up(wakeup, PAGE_SIZE));

	if (limit <= head)
		goto no_space;
	
	return limit;

no_space:
	memset(buf->base + head, 0, handle->size);
	perf_aux_output_skip(handle, handle->size);
	perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED);
	perf_aux_output_end(handle, 0);

	return 0;
}

> +
> +static u64 arm_spe_pmu_next_off(struct perf_output_handle *handle)
> +{
> +	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
> +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(handle->event->pmu);
> +	u64 limit = __arm_spe_pmu_next_off(handle);
> +	u64 head = PERF_IDX2OFF(handle->head, buf);
> +
> +	/*
> +	 * If the head has come too close to the end of the buffer,
> +	 * then pad to the end and recompute the limit.
> +	 */
> +	if (limit && (limit - head < spe_pmu->max_record_sz)) {
> +		memset(buf->base + head, 0, limit - head);
> +		perf_aux_output_skip(handle, limit - head);
> +		limit = __arm_spe_pmu_next_off(handle);
> +	}
> +
> +	return limit;
> +}
> +
> +static void arm_spe_perf_aux_output_begin(struct perf_output_handle *handle,
> +					  struct perf_event *event)
> +{
> +	u64 base, limit;
> +	struct arm_spe_pmu_buf *buf;
> +
> +	/* Start a new aux session */
> +	buf = perf_aux_output_begin(handle, event);
> +	if (!buf) {
> +		event->hw.state |= PERF_HES_STOPPED;
> +		/*
> +		 * We still need to clear the limit pointer, since the
> +		 * profiler might only be disabled by virtue of a fault.
> +		 */
> +		limit = 0;
> +		goto out_write_limit;
> +	}
> +
> +	limit = buf->snapshot ? arm_spe_pmu_next_snapshot_off(handle)
> +			      : arm_spe_pmu_next_off(handle);
> +	if (limit)
> +		limit |= BIT(PMBLIMITR_EL1_E_SHIFT);
> +
> +	base = (u64)buf->base + PERF_IDX2OFF(handle->head, buf);
> +	write_sysreg_s(base, PMBPTR_EL1);
> +	limit += (u64)buf->base;
> +

I believe an isb() is necessary here to ensure the write to PMBPTR_EL1
occurs before the write to PMBLIMITR_EL1 enables the PMU. Otherwise, the
CPU could execute those out-of-order.

It's not clear to me whether that is sufficient for the PMU to observe
the new PMBPTR_EL1 before the new PMBLIMITR_EL1 value, though I assume
it must be.

> +out_write_limit:
> +	write_sysreg_s(limit, PMBLIMITR_EL1);
> +}
> +
> +static bool arm_spe_perf_aux_output_end(struct perf_output_handle *handle,
> +					struct perf_event *event,
> +					bool resume)
> +{
> +	u64 pmbptr, pmbsr, offset, size;
> +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> +	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
> +	bool truncated;
> +
> +	/*
> +	 * We can be called via IRQ work trying to disable the PMU after
> +	 * a buffer full event. In this case, the aux session has already
> +	 * been stopped, so there's nothing to do here.
> +	 */
> +	if (!buf)
> +		return false;
> +
> +	/*
> +	 * If there isn't a pending management event and we're not stopping
> +	 * the current session, then just leave everything alone.
> +	 */
> +	pmbsr = read_sysreg_s(PMBSR_EL1);

When we call from arm_spe_pmu_irq_handler(), I think we need
synchronisation before reading PMBSR_EL1.

AFAICT from the spec, a context synchronisation event doesn't ensure
that the PMU's indirect write to PMBSR_EL1 is visible to the PE's direct
read above. I beleive we need a PSB CSYNC (and subsequent ISB) to ensure
that.

The only other caller is from arm_spe_pmu_stop(), which first calls
arm_spe_pmu_disable_and_drain_local(), so I guess the new barriers
should live in arm_spe_pmu_irq_handler(). I'll comment there.

> +	if (!arm_spe_pmu_buffer_mgmt_pending(pmbsr) && resume)
> +		return false; /* Spurious IRQ */
> +
> +	/* Ensure hardware updates to PMBPTR_EL1 are visible */
> +	isb();

Can we please move this into arm_spe_pmu_buffer_mgmt_pending(), after
the associated PSB CSYNC?

Then we can say that arm_spe_pmu_buffer_mgmt_pending() ensures all HW
updates have been synchronised (and made visible) if it returns true,
and it's easier to see that the synchronisation is correct.

> +
> +	/*
> +	 * Work out how much data has been written since the last update
> +	 * to the head index.
> +	 */
> +	pmbptr = round_down(read_sysreg_s(PMBPTR_EL1), spe_pmu->align);

I don't believe we need to align this.

Per the spec, PMBPTR_EL1[M:0] are RES0 in HW unless sync external abort
reporting is present, in which case they're valid. We write these bits
as zero, unless we have a bug elsewhere.

... so either the bits are zero, and we're fine, or an external abourt
has been hit. In the external abort case, we have no idea how far we
need to reverse the base pointer anyhow.

> +	offset = pmbptr - (u64)buf->base;
> +	size = offset - PERF_IDX2OFF(handle->head, buf);
> +
> +	if (buf->snapshot)
> +		handle->head = offset;

It's be worth a /* see arm_spe_pmu_next_snapshot_off() */ comment
or similar to explain what we're going for the snapshot case here.

> +
> +	/*
> +	 * Either the buffer is full or we're stopping the session. Check
> +	 * that we didn't write a partial record, since this can result
> +	 * in unparseable trace and we must disable the event.
> +	 */
> +	if (pmbsr & BIT(PMBSR_EL1_COLL_SHIFT))
> +		perf_aux_output_flag(handle, PERF_AUX_FLAG_COLLISION);
> +
> +	truncated = pmbsr & BIT(PMBSR_EL1_DL_SHIFT);
> +	if (truncated)
> +		perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED);
> +
> +	perf_aux_output_end(handle, size);

The comment block above perf_aux_output_end() says:

  It is the pmu driver's responsibility to observe ordering rules of the
  hardware, so that all the data is externally visible before this is
  called.

... but in arm_spe_pmu_buffer_mgmt_pending() we only ensured that the
data was visible in the current NSH domain (i.e. only to this CPU).

I followed the callchain for updating head:

perf_aux_output_end()
-> perf_event_aux_event()
-> perf_output_end()
-> perf_output_put_handle()

... I see that there's an smp_wmb() (i.e. a DMB ISHST) on that path, but
it's not clear to me if that's sufficient to ensure that the PMU's
writes are made visible to other CPUs.

Given the comment, I'd feel happier if we had something here or in
arm_spe_pmu_buffer_mgmt_pending() to ensure that the PMU's prior writes
are visible to other CPUs.

> +
> +	/*
> +	 * If we're not resuming the session, then we can clear the fault
> +	 * and we're done, otherwise we need to start a new session.
> +	 */
> +	if (!resume)
> +		write_sysreg_s(0, PMBSR_EL1);
> +	else if (!truncated)
> +		arm_spe_perf_aux_output_begin(handle, event);
> +
> +	return true;
> +}
> +
> +/* IRQ handling */
> +static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
> +{
> +	struct perf_output_handle *handle = dev;
> +
> +	if (!perf_get_aux(handle))
> +		return IRQ_NONE;
> +
> +	if (!arm_spe_perf_aux_output_end(handle, handle->event, true))
> +		return IRQ_NONE;

As commented in arm_spe_perf_aux_output_end(), I think we need a
psb_csync(); isb() sequence prior to the read of PMBSR_EL1 in
arm_spe_perf_aux_output_end() to ensure that it is up-to-date w.r.t. the
interrupt.

> +
> +	irq_work_run();
> +	isb(); /* Ensure the buffer is disabled if data loss has occurred */

What exactly are we synchronising here?

AFAICT, when truncation occurs we don't clear PMBLIMITR_EL1.E, so the
buffer is only implicitly disabled by the PMU's indirect write
PMBSR_EL1.S, which we must have already synchronised prior to reading
PMBSR_EL1.

... so I can't see why this is necessary.

> +	write_sysreg_s(0, PMBSR_EL1);

... and regardless, we clear PMBSR_EL1.S here, which'll start the PMU
again, even if truncation occured, which I don't think we want.

Can we have arm_spe_perf_aux_output_end() clear PMBLIMITR_EL1.E when
truncation occurs?

> +	return IRQ_HANDLED;
> +}
> +
> +/* Perf callbacks */
> +static int arm_spe_pmu_event_init(struct perf_event *event)
> +{
> +	u64 reg;
> +	struct perf_event_attr *attr = &event->attr;
> +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> +
> +	/* This is, of course, deeply driver-specific */
> +	if (attr->type != event->pmu->type)
> +		return -ENOENT;
> +
> +	if (event->cpu >= 0 &&
> +	    !cpumask_test_cpu(event->cpu, &spe_pmu->supported_cpus))
> +		return -ENOENT;

We're not rejecting cpu < 0, so I take it we're trying to handle
per-task events?

As I've mentioned before, that case worries me. One thing I've just
realised we need to figure out is what happens if attr.inherit is set.
The core doesn't reject that, and I suspect we may need to here.

> +
> +	if (arm_spe_event_to_pmsevfr(event) & PMSEVFR_EL1_RES0)
> +		return -EOPNOTSUPP;
> +
> +	if (event->hw.sample_period < spe_pmu->min_period ||
> +	    event->hw.sample_period & PMSIRR_EL1_IVAL_MASK)
> +		return -EOPNOTSUPP;

As mentioned in the sysreg comments, we need to check the upper 32 bits
of the PMSIRR value are zero, so we'll need something like:

	if (event->hw.sample_period < spe_pmu->min_period)
		return -EOPNOTSUPP;
	
	if (event->hw.sample_period &
	    ~(PMSIRR_EL1_INTERVAL_MASK << PMSIRR_EL1_INTERVAL_SHIFT))
		return -EOPNOTSUPP;

I think there's a slight miswording in the spec. Page 56 of the spec
(DDI 0586A) says of the PMSIRR_EL1.INTERVAL field:

    Software should set this to a value greater than the minimum
    indicated by PMSIDR_EL1.Interval.

... whereas here we're checking it's *at least* the minimum interval,
which I think is the intent of the spec. That's probably something we
should have clarified spec-side, with s/greater than/at least/.

> +
> +	if (attr->exclude_idle)
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * Feedback-directed frequency throttling doesn't work when we
> +	 * have a buffer of samples. We'd need to manually count the
> +	 * samples in the buffer when it fills up and adjust the event
> +	 * count to reflect that. Instead, force the user to specify a
> +	 * sample period instead.
> +	 */
> +	if (attr->freq)
> +		return -EINVAL;
> +
> +	reg = arm_spe_event_to_pmsfcr(event);
> +	if ((reg & BIT(PMSFCR_EL1_FE_SHIFT)) &&
> +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_EVT))
> +		return -EOPNOTSUPP;
> +
> +	if ((reg & BIT(PMSFCR_EL1_FT_SHIFT)) &&
> +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_TYP))
> +		return -EOPNOTSUPP;
> +
> +	if ((reg & BIT(PMSFCR_EL1_FL_SHIFT)) &&
> +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_LAT))
> +		return -EOPNOTSUPP;

Does anything prevent this event from being added to a group?

Surely we should check that here?

> +
> +	return 0;
> +}
> +
> +static void arm_spe_pmu_start(struct perf_event *event, int flags)
> +{
> +	u64 reg;
> +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> +	struct hw_perf_event *hwc = &event->hw;
> +	struct perf_output_handle *handle = this_cpu_ptr(spe_pmu->handle);
> +
> +	hwc->state = 0;
> +	arm_spe_perf_aux_output_begin(handle, event);
> +	if (hwc->state)
> +		return;

I was expecting we'd do this last, since PMBLIIMITR.E enables profiling.

I understand that we're relying on the PMSCR_EL1 filtering value to
prevent anything being written to tbe buffer until we've vonfigured the
options, but I'd feel a lot happier if we consistently relied upon
PMBLIMITR.E for that.

> +
> +	reg = arm_spe_event_to_pmsfcr(event);
> +	write_sysreg_s(reg, PMSFCR_EL1);
> +
> +	reg = arm_spe_event_to_pmsevfr(event);
> +	write_sysreg_s(reg, PMSEVFR_EL1);
> +
> +	reg = arm_spe_event_to_pmslatfr(event);
> +	write_sysreg_s(reg, PMSLATFR_EL1);
> +
> +	if (flags & PERF_EF_RELOAD) {
> +		reg = arm_spe_event_to_pmsirr(event);
> +		write_sysreg_s(reg, PMSIRR_EL1);
> +		isb();
> +		reg = local64_read(&hwc->period_left);
> +		write_sysreg_s(reg, PMSICR_EL1);
> +	}
> +
> +	reg = arm_spe_event_to_pmscr(event);
> +	isb();
> +	write_sysreg_s(reg, PMSCR_EL1);
> +}
> +
> +static void arm_spe_pmu_disable_and_drain_local(void)
> +{
> +	/* Disable profiling at EL0 and EL1 */
> +	write_sysreg_s(0, PMSCR_EL1);
> +	isb();
> +
> +	/* Drain any buffered data */
> +	psb_csync();
> +	dsb(nsh);
> +
> +	/* Disable the profiling buffer */
> +	write_sysreg_s(0, PMBLIMITR_EL1);

Can't this be done when we clear PMSCR_EL1? Surely buffered data would
be written out regardless?

> +}

[...]

> +static void *arm_spe_pmu_setup_aux(int cpu, void **pages, int nr_pages,
> +				   bool snapshot)
> +{
> +	int i;
> +	struct page **pglist;
> +	struct arm_spe_pmu_buf *buf;
> +
> +	/*
> +	 * We require an even number of pages for snapshot mode, so that
> +	 * we can effectively treat the buffer as consisting of two equal
> +	 * parts and give userspace a fighting chance of getting some
> +	 * useful data out of it.
> +	 */
> +	if (!nr_pages || (snapshot && (nr_pages & 1)))
> +		return NULL;

As noted above, we may need to ensure that this is a pwoer of two.

> +
> +	buf = kzalloc_node(sizeof(*buf), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!buf)
> +		return NULL;
> +
> +	pglist = kcalloc(nr_pages, sizeof(*pglist), GFP_KERNEL);
> +	if (!pglist)
> +		goto out_free_buf;
> +
> +	for (i = 0; i < nr_pages; ++i) {
> +		struct page *page = virt_to_page(pages[i]);
> +
> +		if (PagePrivate(page)) {
> +			pr_warn("unexpected high-order page for auxbuf!");

It looks like the intel-pt driver expects high-order pages.

What prevents us from seeing those?

Why can't we handle those?

How are these pages pinned? Does the core ensure that?

> +			goto out_free_pglist;
> +		}
> +
> +		pglist[i] = virt_to_page(pages[i]);
> +	}
> +
> +	buf->base = vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
> +	if (!buf->base)
> +		goto out_free_pglist;
> +
> +	buf->nr_pages	= nr_pages;
> +	buf->snapshot	= snapshot;
> +
> +	kfree(pglist);
> +	return buf;
> +
> +out_free_pglist:
> +	kfree(pglist);
> +out_free_buf:
> +	kfree(buf);
> +	return NULL;
> +}
> +
> +static void arm_spe_pmu_free_aux(void *aux)
> +{
> +	struct arm_spe_pmu_buf *buf = aux;
> +
> +	vunmap(buf->base);
> +	kfree(buf);
> +}
> +
> +/* Initialisation and teardown functions */
> +static int arm_spe_pmu_perf_init(struct arm_spe_pmu *spe_pmu)
> +{
> +	static atomic_t pmu_idx = ATOMIC_INIT(-1);
> +
> +	int idx;
> +	char *name;
> +	struct device *dev = &spe_pmu->pdev->dev;
> +
> +	spe_pmu->pmu = (struct pmu) {
> +		.capabilities	= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE,
> +		.attr_groups	= arm_spe_pmu_attr_groups,
> +		/*
> +		 * We hitch a ride on the software context here, so that
> +		 * we can support per-task profiling (which is not possible
> +		 * with the invalid context as it doesn't get sched callbacks).
> +		 * This requires that userspace either uses a dummy event for
> +		 * perf_event_open, since the aux buffer is not setup until
> +		 * a subsequent mmap, or creates the profiling event in a
> +		 * disabled state and explicitly PERF_EVENT_IOC_ENABLEs it
> +		 * once the buffer has been created.
> +		 */
> +		.task_ctx_nr	= perf_sw_context,

While other tracing PMUs do this, I think this is a horrible bodge, and
a bad idea, given it violates assumptions made in the core code.

For example, unlike true SW events, add() and start() can fail, so a
tracing event can unexpectedly stop SW events from being scheduled.

AFAICT, we could also try to move a tracing event into a later-created
HW PMU group, which is very worrying.

I really think we should have a separate tracing context for this class
of PMU, or we make it so that the invalid context can receive sched
callbacks.

> +		.event_init	= arm_spe_pmu_event_init,
> +		.add		= arm_spe_pmu_add,
> +		.del		= arm_spe_pmu_del,
> +		.start		= arm_spe_pmu_start,
> +		.stop		= arm_spe_pmu_stop,
> +		.read		= arm_spe_pmu_read,
> +		.setup_aux	= arm_spe_pmu_setup_aux,
> +		.free_aux	= arm_spe_pmu_free_aux,
> +	};
> +
> +	idx = atomic_inc_return(&pmu_idx);
> +	name = devm_kasprintf(dev, GFP_KERNEL, "%s_%d", PMUNAME, idx);
> +	return perf_pmu_register(&spe_pmu->pmu, name, -1);
> +}
> +
> +static void arm_spe_pmu_perf_destroy(struct arm_spe_pmu *spe_pmu)
> +{
> +	perf_pmu_unregister(&spe_pmu->pmu);
> +}
> +
> +static void __arm_spe_pmu_dev_probe(void *info)
> +{
> +	int fld;
> +	u64 reg;
> +	struct arm_spe_pmu *spe_pmu = info;
> +	struct device *dev = &spe_pmu->pdev->dev;
> +
> +	fld = cpuid_feature_extract_unsigned_field(read_cpuid(ID_AA64DFR0_EL1),
> +						   ID_AA64DFR0_PMSVER_SHIFT);
> +	if (!fld) {
> +		dev_err(dev,
> +			"unsupported ID_AA64DFR0_EL1.PMSVer [%d] on CPU %d\n",
> +			fld, smp_processor_id());
> +		return;
> +	}

Given we only bail out when PMSver is zero, surely we can just say:

	dev_err(dev, "SPE not supported on cpu %d", smp_processor_id())

Otherwise, the rest of the probing logic and boilerplate code looked
fine to me.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-12 16:20   ` Kim Phillips
@ 2017-06-15 15:57     ` Kim Phillips
  2017-06-21 15:31       ` Will Deacon
  0 siblings, 1 reply; 33+ messages in thread
From: Kim Phillips @ 2017-06-15 15:57 UTC (permalink / raw)
  To: Mark Rutland, Will Deacon
  Cc: linux-arm-kernel, marc.zyngier, tglx, peterz, alexander.shishkin,
	robh, suzuki.poulose, pawel.moll, mathieu.poirier, mingo,
	linux-kernel

On Mon, 12 Jun 2017 11:20:48 -0500
Kim Phillips <kim.phillips@arm.com> wrote:

> On Mon, 12 Jun 2017 12:08:23 +0100
> Mark Rutland <mark.rutland@arm.com> wrote:
> 
> > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:
> > > This is the sixth posting of the patches previously posted here:
...
> > Kim, do you have any version of the userspace side that we could look
> > at?
> > 
> > For review, it would be really helpful to have something that can poke
> > the PMU, even if it's incomplete or lacking polish.
> 
> Here's the latest push, based on a a couple of prior versions of this
> driver:
> 
> http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1
> 
> I don't seem to be able to get any SPE data output after rebasing on
> this version of the driver.  Still don't know why at the moment...

Bisected to commit e38ba76deef "perf tools: force uncore events to
system wide monitoring".  So, using record with specifying a -C
<cpu> explicitly now produces SPE data, but only a couple of valid
records at the beginning of each buffer; the rest is filled with
PADding (0's).

I see Mark's latest comments have found a possible issue in the perf
aux buffer handling code in the driver, and that the driver does some
memset of padding (0's) itself; could that be responsible for the above
behaviour?

Thanks,

Kim

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-15 15:57     ` Kim Phillips
@ 2017-06-21 15:31       ` Will Deacon
  2017-06-22 15:56         ` Kim Phillips
  0 siblings, 1 reply; 33+ messages in thread
From: Will Deacon @ 2017-06-21 15:31 UTC (permalink / raw)
  To: Kim Phillips
  Cc: Mark Rutland, linux-arm-kernel, marc.zyngier, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote:
> On Mon, 12 Jun 2017 11:20:48 -0500
> Kim Phillips <kim.phillips@arm.com> wrote:
> 
> > On Mon, 12 Jun 2017 12:08:23 +0100
> > Mark Rutland <mark.rutland@arm.com> wrote:
> > 
> > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:
> > > > This is the sixth posting of the patches previously posted here:
> ...
> > > Kim, do you have any version of the userspace side that we could look
> > > at?
> > > 
> > > For review, it would be really helpful to have something that can poke
> > > the PMU, even if it's incomplete or lacking polish.
> > 
> > Here's the latest push, based on a a couple of prior versions of this
> > driver:
> > 
> > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1
> > 
> > I don't seem to be able to get any SPE data output after rebasing on
> > this version of the driver.  Still don't know why at the moment...
> 
> Bisected to commit e38ba76deef "perf tools: force uncore events to
> system wide monitoring".  So, using record with specifying a -C
> <cpu> explicitly now produces SPE data, but only a couple of valid
> records at the beginning of each buffer; the rest is filled with
> PADding (0's).
> 
> I see Mark's latest comments have found a possible issue in the perf
> aux buffer handling code in the driver, and that the driver does some
> memset of padding (0's) itself; could that be responsible for the above
> behaviour?

Possibly. Do you know how big you're mapping the aux buffer and what (if
any) value you're passing as aux_watermark?

Will

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension
  2017-06-15 14:57   ` Mark Rutland
@ 2017-06-21 15:39     ` Will Deacon
  2017-06-27 17:12       ` Mark Rutland
  0 siblings, 1 reply; 33+ messages in thread
From: Will Deacon @ 2017-06-21 15:39 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, marc.zyngier, kim.phillips, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

Hi Mark,

Thanks for the extensive review. Replies inline.

On Thu, Jun 15, 2017 at 03:57:29PM +0100, Mark Rutland wrote:
> On Mon, Jun 05, 2017 at 04:22:56PM +0100, Will Deacon wrote:
> > +/* ID registers */
> > +#define PMSIDR_EL1			sys_reg(3, 0, 9, 9, 7)
> 
> Nit: could we please give the sysreg definitions a SYS_ prefix, for
> consistency with other (architected) sysregs?
> 
> Ideally, as these are architected they'd live in <asm/sysreg.h>, but I'm
> happy to factor that out later.

I don't see the point in adding the prefixes without moving the #defines
into sysreg.h, so I'll look at just doing it in one go.

> > +#define PMSIRR_EL1			sys_reg(3, 0, 9, 9, 3)
> > +#define PMSIRR_EL1_RND_SHIFT		0
> > +#define PMSIRR_EL1_IVAL_MASK		0xffUL
> 
> This is a little odd. This covers RND and (some) RES0 bits, not
> INTERVAL, so it's the opposite polarity to the rest of the masks, and
> incomplete (missing the upper 32 RES0 bits).
> 
> Could we please have this in the same polarity as the other masks,
> precisely covering the INTERVAL field? i.e.
> 
> #define PMSIRR_EL1_INTERVAL_SHIFT	8
> #define PMSIRR_EL1_INTERVAL_MASK	0xffffffUL

Yup, I'll fix that.

> > +#define PMBSR_EL1_EC_BUF		(0x0UL << PMBSR_EL1_EC_SHIFT)
> > +#define PMBSR_EL1_EC_FAULT_S1		(0x24UL << PMBSR_EL1_EC_SHIFT)
> > +#define PMBSR_EL1_EC_FAULT_S2		(0x25UL << PMBSR_EL1_EC_SHIFT)
> > +
> > +#define PMBSR_EL1_FAULT_FSC_SHIFT	0
> > +#define PMBSR_EL1_FAULT_FSC_MASK	0x3fUL
> 
> Nit: it might be worth having MSS in the name to show the field
> hierarchy, e.g. PMBSR_EL1_MSS_FAULT_FSC_MASK

I'd rather avoid growing the name, and I don't find this confusing as-is.

> > +struct arm_spe_pmu {
> > +	struct pmu				pmu;
> > +	struct platform_device			*pdev;
> > +	cpumask_t				supported_cpus;
> > +	struct hlist_node			hotplug_node;
> > +
> > +	int					irq; /* PPI */
> > +
> > +	u16					min_period;
> > +	u16					cnt_width;
> 
> Elsewhere, we refer to this as the counter size (e.g.
> SPE_PMU_CAP_CNT_SZ, and count_size in the sysfs interface).
> 
> Could we please pick one of "width" and "size" and use it consistently?
> 
> I guess "size" is preferable, given the architectural name is
> "CountSize".

Will fix.

> > +/* Convert a free-running index from perf into an SPE buffer offset */
> > +#define PERF_IDX2OFF(idx, buf)	((idx) & (((buf)->nr_pages << PAGE_SHIFT) - 1))
> 
> The masking logic here assumes nr_pages is a power of two.
> 
> That's not always true, as arm_spe_pmu_setup_aux() only ensures that
> nr_pages is a multiple of 2 * PAGE_SIZE, and only when using snapshot
> mode.
> 
> For example, with ten 4K pages:
> 
> nr_pages:			0b1010
> (nr_pages << PAGE_SHIFT):	0b1010000000000000
> (nr_pages << PAGE_SHIFT) - 1:	0b1001111111111111

Nice. Fixed (using a bloody mod operator).

> > +
> > +/* Keep track of our dynamic hotplug state */
> > +static enum cpuhp_state arm_spe_pmu_online;
> > +
> > +/* This sysfs gunk was really good fun to write. */
> > +enum arm_spe_pmu_capabilities {
> > +	SPE_PMU_CAP_ARCH_INST = 0,
> 
> No need for the initializer; enums start at zero.

I know, but I do like to make it explicit that I'm relying on that here.
This is done all over the place in the kernel.

> 
> > +	SPE_PMU_CAP_ERND,
> > +	SPE_PMU_CAP_FEAT_MAX,
> > +	SPE_PMU_CAP_CNT_SZ = SPE_PMU_CAP_FEAT_MAX,
> > +	SPE_PMU_CAP_MIN_IVAL,
> > +};
> 
> IMO, it would be worth having s/CAP/CAP_FEAT/ for the HW features.
> 
> We could get rid of the confusing SPE_PMU_CAP_FEAT_MAX definition here,
> if we were to:
> 
> > +
> > +static int arm_spe_pmu_feat_caps[SPE_PMU_CAP_FEAT_MAX] = {
> > +	[SPE_PMU_CAP_ARCH_INST]	= SPE_PMU_FEAT_ARCH_INST,
> > +	[SPE_PMU_CAP_ERND]	= SPE_PMU_FEAT_ERND,
> > +};
> 
> .. change this to:
> 
> static int arm_spe_pmu_feat_caps[] = {
> 	...
> };
> 
> ... and:
> 
> > +
> > +static u32 arm_spe_pmu_cap_get(struct arm_spe_pmu *spe_pmu, int cap)
> > +{
> > +	if (cap < SPE_PMU_CAP_FEAT_MAX)
> 
> ... change this to:
> 
> 	if (cap < ARRAY_SIZE(arm_spe_pmu_feat_caps))

I quite liked this suggestion at first and I even implemented it, but the
result was IMO less maintainable than the above. There's ABI involved here,
so I want to make it as difficult as possible to break the ABI when adding a
new hardware capability to the driver. The current code does a good job of
that:

  - If you add a boolean feature in the wrong place in arm_spe_capabilities,
    then you'll get a WARN

  - If you add a non-boolean feature in the wrong place then it will be
    reported as non-present, unless you add an entry in
    arm_spe_pmu_feat_caps (at which point you'd realise the mistake)

  - If you only update arm_spe_pmu_feat_caps, then you'll get a build error.

With your change, it's a lot easier to break things subtly, so I'd rather
keep this as-is unless you have non-cosmetic reasons to change it.

> > +#define SPE_EXT_ATTR_ENTRY(_name, _func, _var)				\
> > +	&((struct dev_ext_attribute[]) {				\
> > +		{ __ATTR(_name, S_IRUGO, _func, NULL), (void *)_var }	\
> > +	})[0].attr.attr
> > +
> > +#define SPE_CAP_EXT_ATTR_ENTRY(_name, _var)				\
> > +	SPE_EXT_ATTR_ENTRY(_name, arm_spe_pmu_cap_show, _var)
> > +
> > +static struct attribute *arm_spe_pmu_cap_attr[] = {
> > +	SPE_CAP_EXT_ATTR_ENTRY(arch_inst, SPE_PMU_CAP_ARCH_INST),
> > +	SPE_CAP_EXT_ATTR_ENTRY(ernd, SPE_PMU_CAP_ERND),
> > +	SPE_CAP_EXT_ATTR_ENTRY(count_size, SPE_PMU_CAP_CNT_SZ),
> > +	SPE_CAP_EXT_ATTR_ENTRY(min_interval, SPE_PMU_CAP_MIN_IVAL),
> > +	NULL,
> > +};
> 
> I'd have expected GCC to warn about (integer/enum) _var values being
> cast straight to void *, given the size mismatch.
> 
> Is that not the case, or do we need an unsigned long cast in
> SPE_CAP_EXT_ATTR_ENTRY()?
> 
> Maybe GCC only complains the other way around.

GCC 7 seems to be perfectly happy with this code.

> 
> [...]
> 
> > +/* Convert between user ABI and register values */
> > +static u64 arm_spe_event_to_pmscr(struct perf_event *event)
> > +{
> > +	struct perf_event_attr *attr = &event->attr;
> > +	u64 reg = 0;
> > +
> > +	reg |= ATTR_CFG_GET_FLD(attr, ts_enable) << PMSCR_EL1_TS_SHIFT;
> > +	reg |= ATTR_CFG_GET_FLD(attr, pa_enable) << PMSCR_EL1_PA_SHIFT;
> 
> We should limit PA access to privileged users.
> 
> > +
> > +	if (!attr->exclude_user)
> > +		reg |= BIT(PMSCR_EL1_E0SPE_SHIFT);
> > +
> > +	if (!attr->exclude_kernel)
> > +		reg |= BIT(PMSCR_EL1_E1SPE_SHIFT);
> > +
> > +	if (IS_ENABLED(CONFIG_PID_IN_CONTEXTIDR))
> > +		reg |= BIT(PMSCR_EL1_CX_SHIFT);
> 
> ... maybe likewise for CONTEXTIDR, too.

Yes, agreed. Fixed.

> > +static void arm_spe_event_sanitise_period(struct perf_event *event)
> > +{
> > +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> > +	u64 period = event->hw.sample_period & ~PMSIRR_EL1_IVAL_MASK;
> 
> ... as noted above, the upper 32 bits are RES0 in addition to the low 8
> bits, so we need to explicitly check bits 31:8, e.g.
> 
> 	u64 period = event->hw.sample_period;
> 	period &= (PMSIRR_EL1_INTERVAL_MASK << PMSIRR_EL1_INTERVAL_SHIFT);

Fixed.

> > +
> > +	if (period < spe_pmu->min_period)
> > +		period = spe_pmu->min_period;
> 
> We already verify this in arm_spe_pmu_event_init(), so we don't need to
> check this here.
> 
> We can drop arm_spe_event_sanitise_period() entirely. Given we validate
> the period at event_init() time, there's no need to sanitize the value.

What about PERF_IOC_PERIOD? I don't think that re-inits the event.

> > +static bool arm_spe_pmu_buffer_mgmt_pending(u64 pmbsr)
> > +{
> > +	const char *err_str;
> > +
> > +	/* Service required? */
> > +	if (!(pmbsr & BIT(PMBSR_EL1_S_SHIFT)))
> > +		return false;
> > +
> > +	/* We only expect buffer management events */
> > +	switch (pmbsr & (PMBSR_EL1_EC_MASK << PMBSR_EL1_EC_SHIFT)) {
> > +	case PMBSR_EL1_EC_BUF:
> > +		/* Handled below */
> > +		break;
> > +	case PMBSR_EL1_EC_FAULT_S1:
> > +	case PMBSR_EL1_EC_FAULT_S2:
> > +		err_str = "Unexpected buffer fault";
> > +		goto out_err;
> > +	default:
> > +		err_str = "Unknown error code";
> > +		goto out_err;
> > +	}
> 
> For the error cases, I take it the assumption is that we leave
> PMBSR_EL1.S set, so that the HW doesn't start again?

No, I don't think I actually handle these cases at all. Whilst they're
probably catastrophic (vmapped mappings are faulting!), I should at least
try to park the profiler. Will fix for the next version.

> > +out_err:
> > +	pr_err_ratelimited("%s on CPU %d [PMBSR=0x%08llx]\n", err_str,
> > +			   smp_processor_id(), pmbsr);
> 
> It might be worth dumping pmbsr with %016lx. The upper 64 bits are
> currently RES0, but they do exist.

Ok.

> > +	return false;
> > +}
> > +
> 
> Could we have a comment block here to describe (roughly) what 
> we're trying to do for the snapshot case?

Ok, I'll try to think of something to say.

> > +
> > +	/*
> > +	 * If we're within max_record_sz of the limit, we must
> > +	 * pad, move the head index and recompute the limit.
> > +	 */
> > +	if (limit - head < spe_pmu->max_record_sz) {
> > +		memset(buf->base + head, 0, limit - head);
> 
> Could we have a mnemonic for the padding byte, and/or a helper that
> wraps memset? e.g.
> 
> static void pad_buffer(void *start, u64 size)
> {
> 	/* The padding packet is a single zero byte */
> 	memset(start, 0, size);
> }

Sure.

> > +		handle->head = PERF_IDX2OFF(limit, buf);
> > +		limit = ((buf->nr_pages * PAGE_SIZE) >> 1) + handle->head;
> > +	}
> > +
> > +	return limit;
> > +}
> > +
> > +static u64 __arm_spe_pmu_next_off(struct perf_output_handle *handle)
> > +{
> > +	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
> > +	u64 head = PERF_IDX2OFF(handle->head, buf);
> > +	u64 tail = PERF_IDX2OFF(handle->head + handle->size, buf);
> > +	u64 wakeup = PERF_IDX2OFF(handle->wakeup, buf);
> > +	u64 limit = buf->nr_pages * PAGE_SIZE;
> > +
> > +	/*
> > +	 * Set the limit pointer to either the watermark or the
> > +	 * current tail pointer; whichever comes first.
> > +	 */
> > +	if (handle->head + handle->size <= handle->wakeup) {
> > +		/* The tail is next, so check for wrapping */
> > +		if (tail >= head) {
> > +			/*
> > +			 * No wrapping, but need to align downwards to
> > +			 * avoid corrupting unconsumed data.
> > +			 */
> > +			limit = round_down(tail, PAGE_SIZE);
> > +
> > +		}
> > +	} else if (wakeup >= head) {
> 
> When wakeup == head, do we need to signal a wakeup event somehow?
> Currently we'll pad the buffer, signal truncation, and end output, which
> seems a little odd, but maybe that's what perf expects.

Wakeup can never be equal to head here. We know that the wakeup is next (by
the first if above) and therefore it is within handle->size of head. Since
we're starting a session, then we either have:

  1. Wakeup calculated by perf_aux_output_{skip,end}, or
  2. Wakeup calculated by rb_alloc_aux (initial mmap)

In case (1), the wakeup will have been signalled when it was calculated to
be equal to head, and then the wakeup will have been moved to the next
watermark point. In case (2), the only way it can be equal to head is if
the watermark was 0, but an initial watermark is converted to half the
buffer size by the core.

Admittedly, I'd not realised this at the time (hence the >= check), and it
looks like we're going to rewrite this anyway :)

> > +		/*
> > +		 * The wakeup is next and doesn't wrap. Align upwards to
> > +		 * ensure that we do indeed reach the watermark.
> > +		 */
> > +		limit = round_up(wakeup, PAGE_SIZE);
> > +
> > +		/*
> > +		 * If rounding up crosses the tail, then we have to
> > +		 * round down to avoid corrupting unconsumed data.
> > +		 * Hopefully the tail will have moved by the time we
> > +		 * hit the new limit.
> > +		 */
> > +		if (wakeup < tail && limit > tail)
> > +			limit = round_down(wakeup, PAGE_SIZE);
> > +	}
> 
> It took me a while to grok that we must consider the wakeup in
> free-running counter space to avoid early wakeups, while we must
> consider the tail in ring-buffer offset space to avoid clobbering data.
> 
> With that understanding, I think we have an issue here. If wakeup is
> more than buffer size in the future, and the buffer is empty, I think we
> set the limit too low.
> 
> In that case, we'd evaluate:
> 
> 	handle->head + handle->size <= handle->wakeup
> 
> ... as true, since size is at most buffer size. Thus we'd go into the
> first if block. There we'd evaluate:

If the buffer is empty, then size is exactly buffer size - 1, but I take
your point.

> 
> 	tail >= head
> 
> ... as true, since when the buffer is empty, head == tail. Thus, we'd
> set the limit to:
> 
> 	round_down(tail, PAGE_SIZE)
> 
> ... which'll leave us with limit <= head, since head == tail. Thus,
> we'll hit the case below:
> 
> > +
> > +	/*
> > +	 * If rounding down crosses the head, then the buffer is full,
> > +	 * so pad to tail and end the session.
> > +	 */
> > +	if (limit <= head) {
> > +		memset(buf->base + head, 0, handle->size);
> > +		perf_aux_output_skip(handle, handle->size);
> > +		perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED);
> > +		perf_aux_output_end(handle, 0);
> > +		limit = 0;
> > +	}
> > +
> > +	return limit;
> > +}
> 
> ... and end all output, even though the entire buffer was empty, and we
> could have returned the end of the buffer as the limit.
> 
> It might be that something prevents wakeup from being that far in the
> future, but in previous discussions we'd assumed that it could be any
> arbitrary value.

Yes, I think this case does indeed go wrong. Well spotted!

> I believe we can solve that, and simplify the logic as below. I've left
> the wakeup < head and wakeup == head cases as above, ignored and
> terminating respectively.

I think this mostly works, some suggestions/questions below.

> static u64 __arm_spe_pmu_next_off(struct perf_output_handle *handle)
> {
> 	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
> 	const u64 bufsize = buf->nr_pages * PAGE_SIZE;
> 	u64 limit = bufsize;
> 	u64 head = PERF_IDX2OFF(handle->head, buf);
> 	u64 tail = PERF_IDX2OFF(handle->head + handle->size, buf);
> 	u64 wakeup = PERF_IDX2OFF(handle->wakeup, buf);
> 
> 	if (!handle->size)
> 		goto no_space;

We can avoid the memset/output_skip in this case.

> 	/*
> 	 * Avoid clobbering unconsumed data. We know we have space, so
> 	 * if we see head == tail we know that the buffer is empty. If
> 	 * head > tail, then there's nothing to clobber prior to
> 	 * wrapping.
> 	 */
> 	if (head < tail)
> 		limit = round_down(tail, PAGE_SIZE);
> 	
> 	/*
> 	 * Wakeup may be arbitrarily far into future. If it's not in the
> 	 * current generation, either we'll wrap before hitting it, or
> 	 * it's in the past and has been handled already.
> 	 *
> 	 * If there's a wakeup before we wrap, arrange to be woken up by
> 	 * the page boundary following it. Keep the tail boundary if
> 	 * that's lower.
> 	 */
> 	if ((handle->wakeup / bufsize) == (handle->head / bufsize)) &&

I'd really like to get rid of these divisions, since we're not working with
nice powers of 2 here. Can't you just do:

  handle->wakeup < (handle->head + handle->size)

to establish that they're in the same "generation"?

> 	    head <= wakeup)
> 		limit = min(limit, round_up(wakeup, PAGE_SIZE));
> 
> 	if (limit <= head)
> 		goto no_space;

Does this correctly handle the case where the buffer is full and head ==
tail, but limit == bufsize? AFAICT, we can return a limit of bufsize and
corrupt the whole buffer.

> > +static void arm_spe_perf_aux_output_begin(struct perf_output_handle *handle,
> > +					  struct perf_event *event)
> > +{
> > +	u64 base, limit;
> > +	struct arm_spe_pmu_buf *buf;
> > +
> > +	/* Start a new aux session */
> > +	buf = perf_aux_output_begin(handle, event);
> > +	if (!buf) {
> > +		event->hw.state |= PERF_HES_STOPPED;
> > +		/*
> > +		 * We still need to clear the limit pointer, since the
> > +		 * profiler might only be disabled by virtue of a fault.
> > +		 */
> > +		limit = 0;
> > +		goto out_write_limit;
> > +	}
> > +
> > +	limit = buf->snapshot ? arm_spe_pmu_next_snapshot_off(handle)
> > +			      : arm_spe_pmu_next_off(handle);
> > +	if (limit)
> > +		limit |= BIT(PMBLIMITR_EL1_E_SHIFT);
> > +
> > +	base = (u64)buf->base + PERF_IDX2OFF(handle->head, buf);
> > +	write_sysreg_s(base, PMBPTR_EL1);
> > +	limit += (u64)buf->base;
> > +
> 
> I believe an isb() is necessary here to ensure the write to PMBPTR_EL1
> occurs before the write to PMBLIMITR_EL1 enables the PMU. Otherwise, the
> CPU could execute those out-of-order.

This function is always called in a context where the profiler is disabled
due to some other control (e.g. in PMSCR or because we're in fault context)
so the isb isn't necessary.

> > +out_write_limit:
> > +	write_sysreg_s(limit, PMBLIMITR_EL1);
> > +}
> > +
> > +static bool arm_spe_perf_aux_output_end(struct perf_output_handle *handle,
> > +					struct perf_event *event,
> > +					bool resume)
> > +{
> > +	u64 pmbptr, pmbsr, offset, size;
> > +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> > +	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
> > +	bool truncated;
> > +
> > +	/*
> > +	 * We can be called via IRQ work trying to disable the PMU after
> > +	 * a buffer full event. In this case, the aux session has already
> > +	 * been stopped, so there's nothing to do here.
> > +	 */
> > +	if (!buf)
> > +		return false;
> > +
> > +	/*
> > +	 * If there isn't a pending management event and we're not stopping
> > +	 * the current session, then just leave everything alone.
> > +	 */
> > +	pmbsr = read_sysreg_s(PMBSR_EL1);
> 
> When we call from arm_spe_pmu_irq_handler(), I think we need
> synchronisation before reading PMBSR_EL1.
> 
> AFAICT from the spec, a context synchronisation event doesn't ensure
> that the PMU's indirect write to PMBSR_EL1 is visible to the PE's direct
> read above. I beleive we need a PSB CSYNC (and subsequent ISB) to ensure
> that.

I don't think that's right, but the spec isn't completely clear. PSB CSYNC
is about the profiling data itself, but in this case we've taken an IRQ
already so the PMBSR will be up-to-date. I'll seek clarification anyway.

> The only other caller is from arm_spe_pmu_stop(), which first calls
> arm_spe_pmu_disable_and_drain_local(), so I guess the new barriers
> should live in arm_spe_pmu_irq_handler(). I'll comment there.
> 
> > +	if (!arm_spe_pmu_buffer_mgmt_pending(pmbsr) && resume)
> > +		return false; /* Spurious IRQ */
> > +
> > +	/* Ensure hardware updates to PMBPTR_EL1 are visible */
> > +	isb();
> 
> Can we please move this into arm_spe_pmu_buffer_mgmt_pending(), after
> the associated PSB CSYNC?

Hmmm, I deliberately *didn't* do that because I wanted
arm_spe_pmu_buffer_mgmt_pending to ensure the buffer writes are visible, and
then the caller can decide if it cares about indirect SPE register writes
being visible. In reality, I ended up with a single caller, but let's see
how it looks when I rework it to deal with fatal aborts.

> > +	/*
> > +	 * Work out how much data has been written since the last update
> > +	 * to the head index.
> > +	 */
> > +	pmbptr = round_down(read_sysreg_s(PMBPTR_EL1), spe_pmu->align);
> 
> I don't believe we need to align this.
> 
> Per the spec, PMBPTR_EL1[M:0] are RES0 in HW unless sync external abort
> reporting is present, in which case they're valid. We write these bits
> as zero, unless we have a bug elsewhere.
> 
> ... so either the bits are zero, and we're fine, or an external abourt
> has been hit. In the external abort case, we have no idea how far we
> need to reverse the base pointer anyhow.

As part of the fatal abort handling, I should probably round down to
max_record_sz (keyed off DL==1, which I don't think can ever happen
at the moment).

> > +	offset = pmbptr - (u64)buf->base;
> > +	size = offset - PERF_IDX2OFF(handle->head, buf);
> > +
> > +	if (buf->snapshot)
> > +		handle->head = offset;
> 
> It's be worth a /* see arm_spe_pmu_next_snapshot_off() */ comment
> or similar to explain what we're going for the snapshot case here.

Ok.

> > +
> > +	/*
> > +	 * Either the buffer is full or we're stopping the session. Check
> > +	 * that we didn't write a partial record, since this can result
> > +	 * in unparseable trace and we must disable the event.
> > +	 */
> > +	if (pmbsr & BIT(PMBSR_EL1_COLL_SHIFT))
> > +		perf_aux_output_flag(handle, PERF_AUX_FLAG_COLLISION);
> > +
> > +	truncated = pmbsr & BIT(PMBSR_EL1_DL_SHIFT);
> > +	if (truncated)
> > +		perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED);
> > +
> > +	perf_aux_output_end(handle, size);
> 
> The comment block above perf_aux_output_end() says:
> 
>   It is the pmu driver's responsibility to observe ordering rules of the
>   hardware, so that all the data is externally visible before this is
>   called.
> 
> ... but in arm_spe_pmu_buffer_mgmt_pending() we only ensured that the
> data was visible in the current NSH domain (i.e. only to this CPU).
> 
> I followed the callchain for updating head:
> 
> perf_aux_output_end()
> -> perf_event_aux_event()
> -> perf_output_end()
> -> perf_output_put_handle()
> 
> ... I see that there's an smp_wmb() (i.e. a DMB ISHST) on that path, but
> it's not clear to me if that's sufficient to ensure that the PMU's
> writes are made visible to other CPUs.

With the new memory model, it should be sufficient; the DSB NSH ensures
data is visible to us locally, and then we order that before the update
of the ring buffer.

> Given the comment, I'd feel happier if we had something here or in
> arm_spe_pmu_buffer_mgmt_pending() to ensure that the PMU's prior writes
> are visible to other CPUs.

I could add a comment?

> > +	/*
> > +	 * If we're not resuming the session, then we can clear the fault
> > +	 * and we're done, otherwise we need to start a new session.
> > +	 */
> > +	if (!resume)
> > +		write_sysreg_s(0, PMBSR_EL1);
> > +	else if (!truncated)
> > +		arm_spe_perf_aux_output_begin(handle, event);
> > +
> > +	return true;
> > +}
> > +
> > +/* IRQ handling */
> > +static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
> > +{
> > +	struct perf_output_handle *handle = dev;
> > +
> > +	if (!perf_get_aux(handle))
> > +		return IRQ_NONE;
> > +
> > +	if (!arm_spe_perf_aux_output_end(handle, handle->event, true))
> > +		return IRQ_NONE;
> 
> As commented in arm_spe_perf_aux_output_end(), I think we need a
> psb_csync(); isb() sequence prior to the read of PMBSR_EL1 in
> arm_spe_perf_aux_output_end() to ensure that it is up-to-date w.r.t. the
> interrupt.

As above, I don't agree but am checking this with the architects.

> > +	irq_work_run();
> > +	isb(); /* Ensure the buffer is disabled if data loss has occurred */
> 
> What exactly are we synchronising here?
> 
> AFAICT, when truncation occurs we don't clear PMBLIMITR_EL1.E, so the
> buffer is only implicitly disabled by the PMU's indirect write
> PMBSR_EL1.S, which we must have already synchronised prior to reading
> PMBSR_EL1.

When you report truncation to perf_aux_output_end, it eventually (the
irq_work_run() above) calls back into arm_spe_pmu_stop, and I want to make
sure we've nobbled the limit pointer.

It would probably be clearer just to add an ISB to the end of
arm_spe_pmu_disable_and_drain_local, which goes back to your previous comments
about the mgmt_pending code.

> ... so I can't see why this is necessary.
> 
> > +	write_sysreg_s(0, PMBSR_EL1);
> 
> ... and regardless, we clear PMBSR_EL1.S here, which'll start the PMU
> again, even if truncation occured, which I don't think we want.
> 
> Can we have arm_spe_perf_aux_output_end() clear PMBLIMITR_EL1.E when
> truncation occurs?

Right, that's exactly what happens. It's just highly convoluted.

> 
> > +	return IRQ_HANDLED;
> > +}
> > +
> > +/* Perf callbacks */
> > +static int arm_spe_pmu_event_init(struct perf_event *event)
> > +{
> > +	u64 reg;
> > +	struct perf_event_attr *attr = &event->attr;
> > +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> > +
> > +	/* This is, of course, deeply driver-specific */
> > +	if (attr->type != event->pmu->type)
> > +		return -ENOENT;
> > +
> > +	if (event->cpu >= 0 &&
> > +	    !cpumask_test_cpu(event->cpu, &spe_pmu->supported_cpus))
> > +		return -ENOENT;
> 
> We're not rejecting cpu < 0, so I take it we're trying to handle
> per-task events?

Yes, like intel-pt does.

> As I've mentioned before, that case worries me. One thing I've just
> realised we need to figure out is what happens if attr.inherit is set.
> The core doesn't reject that, and I suspect we may need to here.

That's already rejected by perf_mmap.

> > +
> > +	if (arm_spe_event_to_pmsevfr(event) & PMSEVFR_EL1_RES0)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (event->hw.sample_period < spe_pmu->min_period ||
> > +	    event->hw.sample_period & PMSIRR_EL1_IVAL_MASK)
> > +		return -EOPNOTSUPP;
> 
> As mentioned in the sysreg comments, we need to check the upper 32 bits
> of the PMSIRR value are zero, so we'll need something like:
> 
> 	if (event->hw.sample_period < spe_pmu->min_period)
> 		return -EOPNOTSUPP;
> 	
> 	if (event->hw.sample_period &
> 	    ~(PMSIRR_EL1_INTERVAL_MASK << PMSIRR_EL1_INTERVAL_SHIFT))
> 		return -EOPNOTSUPP;

I wonder if we're actually better off just truncating the interval. That
way, if the interval is extended in the future, then new software won't
get an error on older cores. It feels a bit weird putting the max interval
in sysfs.

> > +	if (attr->exclude_idle)
> > +		return -EOPNOTSUPP;
> > +
> > +	/*
> > +	 * Feedback-directed frequency throttling doesn't work when we
> > +	 * have a buffer of samples. We'd need to manually count the
> > +	 * samples in the buffer when it fills up and adjust the event
> > +	 * count to reflect that. Instead, force the user to specify a
> > +	 * sample period instead.
> > +	 */
> > +	if (attr->freq)
> > +		return -EINVAL;
> > +
> > +	reg = arm_spe_event_to_pmsfcr(event);
> > +	if ((reg & BIT(PMSFCR_EL1_FE_SHIFT)) &&
> > +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_EVT))
> > +		return -EOPNOTSUPP;
> > +
> > +	if ((reg & BIT(PMSFCR_EL1_FT_SHIFT)) &&
> > +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_TYP))
> > +		return -EOPNOTSUPP;
> > +
> > +	if ((reg & BIT(PMSFCR_EL1_FL_SHIFT)) &&
> > +	    !(spe_pmu->features & SPE_PMU_FEAT_FILT_LAT))
> > +		return -EOPNOTSUPP;
> 
> Does anything prevent this event from being added to a group?

PERF_PMU_CAP_EXCLUSIVE should take care of that in the core.

> > +static void arm_spe_pmu_start(struct perf_event *event, int flags)
> > +{
> > +	u64 reg;
> > +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> > +	struct hw_perf_event *hwc = &event->hw;
> > +	struct perf_output_handle *handle = this_cpu_ptr(spe_pmu->handle);
> > +
> > +	hwc->state = 0;
> > +	arm_spe_perf_aux_output_begin(handle, event);
> > +	if (hwc->state)
> > +		return;
> 
> I was expecting we'd do this last, since PMBLIIMITR.E enables profiling.
> 
> I understand that we're relying on the PMSCR_EL1 filtering value to
> prevent anything being written to tbe buffer until we've vonfigured the
> options, but I'd feel a lot happier if we consistently relied upon
> PMBLIMITR.E for that.

I actually prefer failing fast if we can, so I'd rather keep this as-is
given that it works and your objections are down to personal taste.

> > +
> > +	reg = arm_spe_event_to_pmsfcr(event);
> > +	write_sysreg_s(reg, PMSFCR_EL1);
> > +
> > +	reg = arm_spe_event_to_pmsevfr(event);
> > +	write_sysreg_s(reg, PMSEVFR_EL1);
> > +
> > +	reg = arm_spe_event_to_pmslatfr(event);
> > +	write_sysreg_s(reg, PMSLATFR_EL1);
> > +
> > +	if (flags & PERF_EF_RELOAD) {
> > +		reg = arm_spe_event_to_pmsirr(event);
> > +		write_sysreg_s(reg, PMSIRR_EL1);
> > +		isb();
> > +		reg = local64_read(&hwc->period_left);
> > +		write_sysreg_s(reg, PMSICR_EL1);
> > +	}
> > +
> > +	reg = arm_spe_event_to_pmscr(event);
> > +	isb();
> > +	write_sysreg_s(reg, PMSCR_EL1);
> > +}
> > +
> > +static void arm_spe_pmu_disable_and_drain_local(void)
> > +{
> > +	/* Disable profiling at EL0 and EL1 */
> > +	write_sysreg_s(0, PMSCR_EL1);
> > +	isb();
> > +
> > +	/* Drain any buffered data */
> > +	psb_csync();
> > +	dsb(nsh);
> > +
> > +	/* Disable the profiling buffer */
> > +	write_sysreg_s(0, PMBLIMITR_EL1);
> 
> Can't this be done when we clear PMSCR_EL1? Surely buffered data would
> be written out regardless?

I think the buffered data could be silently dropped if you did that.

> > +static void *arm_spe_pmu_setup_aux(int cpu, void **pages, int nr_pages,
> > +				   bool snapshot)
> > +{
> > +	int i;
> > +	struct page **pglist;
> > +	struct arm_spe_pmu_buf *buf;
> > +
> > +	/*
> > +	 * We require an even number of pages for snapshot mode, so that
> > +	 * we can effectively treat the buffer as consisting of two equal
> > +	 * parts and give userspace a fighting chance of getting some
> > +	 * useful data out of it.
> > +	 */
> > +	if (!nr_pages || (snapshot && (nr_pages & 1)))
> > +		return NULL;
> 
> As noted above, we may need to ensure that this is a pwoer of two.

Sorry, I don't follow. The power-of-two bug was in my IDX2OFF macro, which
I've fixed. For snapshot mode, we just need an even number of pages.

> > +	buf = kzalloc_node(sizeof(*buf), GFP_KERNEL, cpu_to_node(cpu));
> > +	if (!buf)
> > +		return NULL;
> > +
> > +	pglist = kcalloc(nr_pages, sizeof(*pglist), GFP_KERNEL);
> > +	if (!pglist)
> > +		goto out_free_buf;
> > +
> > +	for (i = 0; i < nr_pages; ++i) {
> > +		struct page *page = virt_to_page(pages[i]);
> > +
> > +		if (PagePrivate(page)) {
> > +			pr_warn("unexpected high-order page for auxbuf!");
> 
> It looks like the intel-pt driver expects high-order pages.
> 
> What prevents us from seeing those?

The fact that we don't set the PERF_PMU_CAP_AUX_NO_SG capability. The intel
driver uses physical address and scatter/gather lists, whereas ours just
takes virtual addresses for the buffer.

> Why can't we handle those?

We never get them, so we don't need to.

> How are these pages pinned? Does the core ensure that?

Yes, they're GFP_KERNEL pages underneath.

> > +	spe_pmu->pmu = (struct pmu) {
> > +		.capabilities	= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE,
> > +		.attr_groups	= arm_spe_pmu_attr_groups,
> > +		/*
> > +		 * We hitch a ride on the software context here, so that
> > +		 * we can support per-task profiling (which is not possible
> > +		 * with the invalid context as it doesn't get sched callbacks).
> > +		 * This requires that userspace either uses a dummy event for
> > +		 * perf_event_open, since the aux buffer is not setup until
> > +		 * a subsequent mmap, or creates the profiling event in a
> > +		 * disabled state and explicitly PERF_EVENT_IOC_ENABLEs it
> > +		 * once the buffer has been created.
> > +		 */
> > +		.task_ctx_nr	= perf_sw_context,
> 
> While other tracing PMUs do this, I think this is a horrible bodge, and
> a bad idea, given it violates assumptions made in the core code.
> 
> For example, unlike true SW events, add() and start() can fail, so a
> tracing event can unexpectedly stop SW events from being scheduled.
> 
> AFAICT, we could also try to move a tracing event into a later-created
> HW PMU group, which is very worrying.
> 
> I really think we should have a separate tracing context for this class
> of PMU, or we make it so that the invalid context can receive sched
> callbacks.

Fair point, and I already have a comment to call this out. Whilst I'm not
against seeing this fixed, I think it should be a separate patch series
given that this is a common idiom amongst system/uncore PMU drivers.

> > +static void __arm_spe_pmu_dev_probe(void *info)
> > +{
> > +	int fld;
> > +	u64 reg;
> > +	struct arm_spe_pmu *spe_pmu = info;
> > +	struct device *dev = &spe_pmu->pdev->dev;
> > +
> > +	fld = cpuid_feature_extract_unsigned_field(read_cpuid(ID_AA64DFR0_EL1),
> > +						   ID_AA64DFR0_PMSVER_SHIFT);
> > +	if (!fld) {
> > +		dev_err(dev,
> > +			"unsupported ID_AA64DFR0_EL1.PMSVer [%d] on CPU %d\n",
> > +			fld, smp_processor_id());
> > +		return;
> > +	}
> 
> Given we only bail out when PMSver is zero, surely we can just say:
> 
> 	dev_err(dev, "SPE not supported on cpu %d", smp_processor_id())

I'd rather leave it like this, so we have the information when we start
supporting additional values of fld.

Will

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-21 15:31       ` Will Deacon
@ 2017-06-22 15:56         ` Kim Phillips
  2017-06-22 18:36           ` Will Deacon
  0 siblings, 1 reply; 33+ messages in thread
From: Kim Phillips @ 2017-06-22 15:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, linux-arm-kernel, marc.zyngier, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Wed, 21 Jun 2017 16:31:09 +0100
Will Deacon <will.deacon@arm.com> wrote:

> On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote:
> > On Mon, 12 Jun 2017 11:20:48 -0500
> > Kim Phillips <kim.phillips@arm.com> wrote:
> > 
> > > On Mon, 12 Jun 2017 12:08:23 +0100
> > > Mark Rutland <mark.rutland@arm.com> wrote:
> > > 
> > > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:
> > > > > This is the sixth posting of the patches previously posted here:
> > ...
> > > > Kim, do you have any version of the userspace side that we could look
> > > > at?
> > > > 
> > > > For review, it would be really helpful to have something that can poke
> > > > the PMU, even if it's incomplete or lacking polish.
> > > 
> > > Here's the latest push, based on a a couple of prior versions of this
> > > driver:
> > > 
> > > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1
> > > 
> > > I don't seem to be able to get any SPE data output after rebasing on
> > > this version of the driver.  Still don't know why at the moment...
> > 
> > Bisected to commit e38ba76deef "perf tools: force uncore events to
> > system wide monitoring".  So, using record with specifying a -C
> > <cpu> explicitly now produces SPE data, but only a couple of valid
> > records at the beginning of each buffer; the rest is filled with
> > PADding (0's).
> > 
> > I see Mark's latest comments have found a possible issue in the perf
> > aux buffer handling code in the driver, and that the driver does some
> > memset of padding (0's) itself; could that be responsible for the above
> > behaviour?
> 
> Possibly. Do you know how big you're mapping the aux buffer

4MiB.

> and what (if any) value you're passing as aux_watermark?

None passed, but it looks like 4KiB was used since the AUXTRACE size
was 4MiB - 4KiB.

I'm not seeing the issue with a simple bts-based version I'm
working on...yet.  We can revisit if I'm able to reproduce again; the
problem could have been on the userspace side.

Meanwhile, when using fvp-base.dtb, my model setup stops booting the
kernel after "smp: Bringing up secondary CPUs ...".  If I however take
the second SPE node from fvp-base.dts and add it to my working device
tree, I get this during the driver probe:

[    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
[    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
[    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)
[    1.043784] arm_spe_pmu: probe of spe-pmu@1 failed with error -16

spe-pmu@0 is useable, but doubt spe-pmu@1 is.  btw, that 16 is EBUSY
"Device or resource busy".

Kim

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-22 15:56         ` Kim Phillips
@ 2017-06-22 18:36           ` Will Deacon
  2017-06-27 21:07             ` Kim Phillips
  0 siblings, 1 reply; 33+ messages in thread
From: Will Deacon @ 2017-06-22 18:36 UTC (permalink / raw)
  To: Kim Phillips
  Cc: Mark Rutland, linux-arm-kernel, marc.zyngier, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Thu, Jun 22, 2017 at 10:56:40AM -0500, Kim Phillips wrote:
> On Wed, 21 Jun 2017 16:31:09 +0100
> Will Deacon <will.deacon@arm.com> wrote:
> 
> > On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote:
> > > On Mon, 12 Jun 2017 11:20:48 -0500
> > > Kim Phillips <kim.phillips@arm.com> wrote:
> > > 
> > > > On Mon, 12 Jun 2017 12:08:23 +0100
> > > > Mark Rutland <mark.rutland@arm.com> wrote:
> > > > 
> > > > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:
> > > > > > This is the sixth posting of the patches previously posted here:
> > > ...
> > > > > Kim, do you have any version of the userspace side that we could look
> > > > > at?
> > > > > 
> > > > > For review, it would be really helpful to have something that can poke
> > > > > the PMU, even if it's incomplete or lacking polish.
> > > > 
> > > > Here's the latest push, based on a a couple of prior versions of this
> > > > driver:
> > > > 
> > > > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1
> > > > 
> > > > I don't seem to be able to get any SPE data output after rebasing on
> > > > this version of the driver.  Still don't know why at the moment...
> > > 
> > > Bisected to commit e38ba76deef "perf tools: force uncore events to
> > > system wide monitoring".  So, using record with specifying a -C
> > > <cpu> explicitly now produces SPE data, but only a couple of valid
> > > records at the beginning of each buffer; the rest is filled with
> > > PADding (0's).
> > > 
> > > I see Mark's latest comments have found a possible issue in the perf
> > > aux buffer handling code in the driver, and that the driver does some
> > > memset of padding (0's) itself; could that be responsible for the above
> > > behaviour?
> > 
> > Possibly. Do you know how big you're mapping the aux buffer
> 
> 4MiB.
> 
> > and what (if any) value you're passing as aux_watermark?
> 
> None passed, but it looks like 4KiB was used since the AUXTRACE size
> was 4MiB - 4KiB.
> 
> I'm not seeing the issue with a simple bts-based version I'm
> working on...yet.  We can revisit if I'm able to reproduce again; the
> problem could have been on the userspace side.
> 
> Meanwhile, when using fvp-base.dtb, my model setup stops booting the
> kernel after "smp: Bringing up secondary CPUs ...".  If I however take
> the second SPE node from fvp-base.dts and add it to my working device
> tree, I get this during the driver probe:
> 
> [    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> [    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> [    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)

Looks like you've screwed up your IRQ partitions, so you are effectively
registering the same device twice, which then blows up due to lack of shared
irqs.

Either remove one of the devices, or use IRQ partitions to restrict them
to unique sets of CPUs.

Will

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension
  2017-06-21 15:39     ` Will Deacon
@ 2017-06-27 17:12       ` Mark Rutland
  0 siblings, 0 replies; 33+ messages in thread
From: Mark Rutland @ 2017-06-27 17:12 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, marc.zyngier, kim.phillips, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

Hi,

On Wed, Jun 21, 2017 at 04:39:33PM +0100, Will Deacon wrote:
> On Thu, Jun 15, 2017 at 03:57:29PM +0100, Mark Rutland wrote:
> > On Mon, Jun 05, 2017 at 04:22:56PM +0100, Will Deacon wrote:

> > > +	SPE_PMU_CAP_ERND,
> > > +	SPE_PMU_CAP_FEAT_MAX,
> > > +	SPE_PMU_CAP_CNT_SZ = SPE_PMU_CAP_FEAT_MAX,
> > > +	SPE_PMU_CAP_MIN_IVAL,
> > > +};

> > We could get rid of the confusing SPE_PMU_CAP_FEAT_MAX definition here,
> > if we were to:
> > 
> > > +static int arm_spe_pmu_feat_caps[SPE_PMU_CAP_FEAT_MAX] = {
> > > +	[SPE_PMU_CAP_ARCH_INST]	= SPE_PMU_FEAT_ARCH_INST,
> > > +	[SPE_PMU_CAP_ERND]	= SPE_PMU_FEAT_ERND,
> > > +};
> > 
> > .. change this to:
> > 
> > static int arm_spe_pmu_feat_caps[] = {
> > 	...
> > };
> > 
> > ... and:
> > 
> > > +static u32 arm_spe_pmu_cap_get(struct arm_spe_pmu *spe_pmu, int cap)
> > > +{
> > > +	if (cap < SPE_PMU_CAP_FEAT_MAX)
> > 
> > ... change this to:
> > 
> > 	if (cap < ARRAY_SIZE(arm_spe_pmu_feat_caps))
> 
> I quite liked this suggestion at first and I even implemented it, but the
> result was IMO less maintainable than the above. There's ABI involved here,
> so I want to make it as difficult as possible to break the ABI when adding a
> new hardware capability to the driver. The current code does a good job of
> that:
> 
>   - If you add a boolean feature in the wrong place in arm_spe_capabilities,
>     then you'll get a WARN
> 
>   - If you add a non-boolean feature in the wrong place then it will be
>     reported as non-present, unless you add an entry in
>     arm_spe_pmu_feat_caps (at which point you'd realise the mistake)
> 
>   - If you only update arm_spe_pmu_feat_caps, then you'll get a build error.
> 
> With your change, it's a lot easier to break things subtly, so I'd rather
> keep this as-is unless you have non-cosmetic reasons to change it.

No objections here; that's a pretty good rationale for keeping this
as-is.

[...]

> > > +static void arm_spe_event_sanitise_period(struct perf_event *event)
> > > +{
> > > +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> > > +	u64 period = event->hw.sample_period & ~PMSIRR_EL1_IVAL_MASK;

> > > +	if (period < spe_pmu->min_period)
> > > +		period = spe_pmu->min_period;
> > 
> > We already verify this in arm_spe_pmu_event_init(), so we don't need to
> > check this here.
> > 
> > We can drop arm_spe_event_sanitise_period() entirely. Given we validate
> > the period at event_init() time, there's no need to sanitize the value.
> 
> What about PERF_IOC_PERIOD? I don't think that re-inits the event.

Ugh; good point.

Given that, it's arguably worth not validating the period at
event_init() time, to have consistent behaviour across opening an event
with a given period, and fiddling with tha via PERF_IOC_PERIOD.

I also think that rather than masking the period with
PMSIRR_EL1_IVAL_MASK we should have something like:

	if (period > SPE_PMU_MAX_PERIOD)
		period = SPE_PMU_MAX_PERIOD;

... as that avoids the user asking for max_period + 1, and getting the
minimum period, and other such weirdness.

[...]

> > > +static bool arm_spe_pmu_buffer_mgmt_pending(u64 pmbsr)
> > > +{
> > > +	const char *err_str;
> > > +
> > > +	/* Service required? */
> > > +	if (!(pmbsr & BIT(PMBSR_EL1_S_SHIFT)))
> > > +		return false;
> > > +
> > > +	/* We only expect buffer management events */
> > > +	switch (pmbsr & (PMBSR_EL1_EC_MASK << PMBSR_EL1_EC_SHIFT)) {
> > > +	case PMBSR_EL1_EC_BUF:
> > > +		/* Handled below */
> > > +		break;
> > > +	case PMBSR_EL1_EC_FAULT_S1:
> > > +	case PMBSR_EL1_EC_FAULT_S2:
> > > +		err_str = "Unexpected buffer fault";
> > > +		goto out_err;
> > > +	default:
> > > +		err_str = "Unknown error code";
> > > +		goto out_err;
> > > +	}
> > 
> > For the error cases, I take it the assumption is that we leave
> > PMBSR_EL1.S set, so that the HW doesn't start again?
> 
> No, I don't think I actually handle these cases at all. Whilst they're
> probably catastrophic (vmapped mappings are faulting!), I should at least
> try to park the profiler. Will fix for the next version.

Great; thanks.

[...]

> > > +		handle->head = PERF_IDX2OFF(limit, buf);
> > > +		limit = ((buf->nr_pages * PAGE_SIZE) >> 1) + handle->head;
> > > +	}
> > > +
> > > +	return limit;
> > > +}
> > > +
> > > +static u64 __arm_spe_pmu_next_off(struct perf_output_handle *handle)
> > > +{
> > > +	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
> > > +	u64 head = PERF_IDX2OFF(handle->head, buf);
> > > +	u64 tail = PERF_IDX2OFF(handle->head + handle->size, buf);
> > > +	u64 wakeup = PERF_IDX2OFF(handle->wakeup, buf);
> > > +	u64 limit = buf->nr_pages * PAGE_SIZE;
> > > +
> > > +	/*
> > > +	 * Set the limit pointer to either the watermark or the
> > > +	 * current tail pointer; whichever comes first.
> > > +	 */
> > > +	if (handle->head + handle->size <= handle->wakeup) {
> > > +		/* The tail is next, so check for wrapping */
> > > +		if (tail >= head) {
> > > +			/*
> > > +			 * No wrapping, but need to align downwards to
> > > +			 * avoid corrupting unconsumed data.
> > > +			 */
> > > +			limit = round_down(tail, PAGE_SIZE);
> > > +
> > > +		}
> > > +	} else if (wakeup >= head) {
> > 
> > When wakeup == head, do we need to signal a wakeup event somehow?
> > Currently we'll pad the buffer, signal truncation, and end output, which
> > seems a little odd, but maybe that's what perf expects.
> 
> Wakeup can never be equal to head here. We know that the wakeup is next (by
> the first if above) and therefore it is within handle->size of head. Since
> we're starting a session, then we either have:
> 
>   1. Wakeup calculated by perf_aux_output_{skip,end}, or
>   2. Wakeup calculated by rb_alloc_aux (initial mmap)
> 
> In case (1), the wakeup will have been signalled when it was calculated to
> be equal to head, and then the wakeup will have been moved to the next
> watermark point. In case (2), the only way it can be equal to head is if
> the watermark was 0, but an initial watermark is converted to half the
> buffer size by the core.
> 
> Admittedly, I'd not realised this at the time (hence the >= check), and it
> looks like we're going to rewrite this anyway :)

Thanks for the explanation; the above is really helpful.

For (2) I agree that after the mmap(), handle->wakeup > handle->head.

For (1), I think that we can hit a case where perf_aux_output_end() will
signal a wakeup, but even after rb->aux_wakeup is moved along, we still
have rb->aux_wakeup <= rb->aux_head.

Rationale for the below; enjoy.

Consider a non-snapshot case where we set up the mmap with aux_watermark
being PAGE_SIZE / 2, and the buffer size being 2 * PAGE_SIZE. Our
initial state of things is:

	rb->aux_head = 0;
	rb->aux_wakeup = 0;
	rb->aux_watermark = PAGE_SIZE / 2;

The first time we start, perf_aux_output_begin() gives us:

	handle->head = 0;
	handle->size = 2 * PAGE_SIZE;
	handle->wakeup = PAGE_SIZE / 2;

Given that, __arm_spe_pmu_next_off() gives us a limit of PAGE_SIZE (both
in your version and mine).

We set SPE off, and it fills PAGE_SIZE worth of buffer, then asserts its
maintenance IRQ. We thus call perf_aux_output_end(handle, PAGE_SIZE),
which adjusts the aux_head, detects we passed wakeup, adjusts the
aux_wakeup, and signals the wakeup:

	local_add(size, &rb->aux_head);
	local_add(rb->aux_watermark, &rb->aux_wakeup);
	...
	perf_output_wakeup(handle);

... leaving us with:

	rb->aux_head = PAGE_SIZE
	rb->aux_wakeup = PAGE_SIZE / 2;
	rb->aux_watermark = PAGE_SIZE / 2;

... so next time we call arm_spe_perf_aux_output_begin() we see:

	handle->head = PAGE_SIZE;
	handle->wakeup = PAGE_SIZE;

So AFAICT, we can see wakeup == head here, though it's arguably the core
code that's at fault. Hopefully I've missed something.

Assuming that above is correct,in either of our __arm_spe_pmu_next_off()
implementations that results in signalling truncation, and calling
perf_aux_output_end(), which will detect another wakeup, and move wakeup
past head. However, as we (spuriously) signalled truncation we end up
stopping.

I think that it would make sense for the perf core to advance the wakeup
beyond head in perf_aux_output_end(), even if this means outputting a
number of wakeup events.

> > > +		/*
> > > +		 * The wakeup is next and doesn't wrap. Align upwards to
> > > +		 * ensure that we do indeed reach the watermark.
> > > +		 */
> > > +		limit = round_up(wakeup, PAGE_SIZE);
> > > +
> > > +		/*
> > > +		 * If rounding up crosses the tail, then we have to
> > > +		 * round down to avoid corrupting unconsumed data.
> > > +		 * Hopefully the tail will have moved by the time we
> > > +		 * hit the new limit.
> > > +		 */
> > > +		if (wakeup < tail && limit > tail)
> > > +			limit = round_down(wakeup, PAGE_SIZE);
> > > +	}
> > 
> > It took me a while to grok that we must consider the wakeup in
> > free-running counter space to avoid early wakeups, while we must
> > consider the tail in ring-buffer offset space to avoid clobbering data.
> > 
> > With that understanding, I think we have an issue here. If wakeup is
> > more than buffer size in the future, and the buffer is empty, I think we
> > set the limit too low.
> > 
> > In that case, we'd evaluate:
> > 
> > 	handle->head + handle->size <= handle->wakeup
> > 
> > ... as true, since size is at most buffer size. Thus we'd go into the
> > first if block. There we'd evaluate:
> 
> If the buffer is empty, then size is exactly buffer size - 1, but I take
> your point.

Heh, I missed that the CIRC_*() helpers  enforced that at least a byte
was always free.

> > 
> > 	tail >= head
> > 
> > ... as true, since when the buffer is empty, head == tail. Thus, we'd
> > set the limit to:
> > 
> > 	round_down(tail, PAGE_SIZE)
> > 
> > ... which'll leave us with limit <= head, since head == tail. Thus,
> > we'll hit the case below:
> > 
> > > +
> > > +	/*
> > > +	 * If rounding down crosses the head, then the buffer is full,
> > > +	 * so pad to tail and end the session.
> > > +	 */
> > > +	if (limit <= head) {
> > > +		memset(buf->base + head, 0, handle->size);
> > > +		perf_aux_output_skip(handle, handle->size);
> > > +		perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED);
> > > +		perf_aux_output_end(handle, 0);
> > > +		limit = 0;
> > > +	}
> > > +
> > > +	return limit;
> > > +}
> > 
> > ... and end all output, even though the entire buffer was empty, and we
> > could have returned the end of the buffer as the limit.
> > 
> > It might be that something prevents wakeup from being that far in the
> > future, but in previous discussions we'd assumed that it could be any
> > arbitrary value.
> 
> Yes, I think this case does indeed go wrong. Well spotted!
> 
> > I believe we can solve that, and simplify the logic as below. I've left
> > the wakeup < head and wakeup == head cases as above, ignored and
> > terminating respectively.
> 
> I think this mostly works, some suggestions/questions below.
> 
> > static u64 __arm_spe_pmu_next_off(struct perf_output_handle *handle)
> > {
> > 	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
> > 	const u64 bufsize = buf->nr_pages * PAGE_SIZE;
> > 	u64 limit = bufsize;
> > 	u64 head = PERF_IDX2OFF(handle->head, buf);
> > 	u64 tail = PERF_IDX2OFF(handle->head + handle->size, buf);
> > 	u64 wakeup = PERF_IDX2OFF(handle->wakeup, buf);
> > 
> > 	if (!handle->size)
> > 		goto no_space;
> 
> We can avoid the memset/output_skip in this case.

True.

We can move the no_space label if we remove the other goto (by inverting
the condition paired with it and returning limit there instead). i.e.

	if (!handle->size)
		goto no_space;

	< rounding down limit, etc >

	if (limit > head)
		return limit;
	
	/* 
	 * The buffer wasn't completely full, but we can't use the
	 * remaining space. Fill the unusable remainder with padding
	 */
no_space:
	perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED);
	perf_aux_output_end(handle, 0);
	return 0;

> > 	/*
> > 	 * Avoid clobbering unconsumed data. We know we have space, so
> > 	 * if we see head == tail we know that the buffer is empty. If
> > 	 * head > tail, then there's nothing to clobber prior to
> > 	 * wrapping.
> > 	 */
> > 	if (head < tail)
> > 		limit = round_down(tail, PAGE_SIZE);
> > 	
> > 	/*
> > 	 * Wakeup may be arbitrarily far into future. If it's not in the
> > 	 * current generation, either we'll wrap before hitting it, or
> > 	 * it's in the past and has been handled already.
> > 	 *
> > 	 * If there's a wakeup before we wrap, arrange to be woken up by
> > 	 * the page boundary following it. Keep the tail boundary if
> > 	 * that's lower.
> > 	 */
> > 	if ((handle->wakeup / bufsize) == (handle->head / bufsize)) &&
> 
> I'd really like to get rid of these divisions, since we're not working with
> nice powers of 2 here. Can't you just do:
> 
>   handle->wakeup < (handle->head + handle->size)
> 
> to establish that they're in the same "generation"?

I initially thought that, but then realised that would also be true if
wakeup was a "generation" behind. When I wrote this, I wasn't sure if
that could happen, and/or what we should do in that case -- as noted
above I'd retained the fact we ignored wakeup in that case.

Extending my argument from earlier, I think that *can* happen for
certain values of aux_watermark, but we have other problems resulting
from that, and we should try to rule that out in the core code.

> > 	    head <= wakeup)
> > 		limit = min(limit, round_up(wakeup, PAGE_SIZE));
> > 
> > 	if (limit <= head)
> > 		goto no_space;
> 
> Does this correctly handle the case where the buffer is full and head
> == tail, but limit == bufsize? AFAICT, we can return a limit of
> bufsize and corrupt the whole buffer.

When head == tail, handle->size == 0, so at the start of the function
we'd have gone straight to no_space.

So AFAICT, we're fine on that front.

[...]

> > > +static void arm_spe_perf_aux_output_begin(struct perf_output_handle *handle,
> > > +					  struct perf_event *event)
> > > +{

> > > +	write_sysreg_s(base, PMBPTR_EL1);
> > > +	limit += (u64)buf->base;
> > > +
> > 
> > I believe an isb() is necessary here to ensure the write to PMBPTR_EL1
> > occurs before the write to PMBLIMITR_EL1 enables the PMU. Otherwise, the
> > CPU could execute those out-of-order.
> 
> This function is always called in a context where the profiler is disabled
> due to some other control (e.g. in PMSCR or because we're in fault context)
> so the isb isn't necessary.
> 
> > > +out_write_limit:
> > > +	write_sysreg_s(limit, PMBLIMITR_EL1);
> > > +}

Ah, I see. Sorry for the noise.

[...]

> > > +static bool arm_spe_perf_aux_output_end(struct perf_output_handle *handle,
> > > +					struct perf_event *event,
> > > +					bool resume)
> > > +{
> > > +	u64 pmbptr, pmbsr, offset, size;
> > > +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> > > +	struct arm_spe_pmu_buf *buf = perf_get_aux(handle);
> > > +	bool truncated;
> > > +
> > > +	/*
> > > +	 * We can be called via IRQ work trying to disable the PMU after
> > > +	 * a buffer full event. In this case, the aux session has already
> > > +	 * been stopped, so there's nothing to do here.
> > > +	 */
> > > +	if (!buf)
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * If there isn't a pending management event and we're not stopping
> > > +	 * the current session, then just leave everything alone.
> > > +	 */
> > > +	pmbsr = read_sysreg_s(PMBSR_EL1);
> > 
> > When we call from arm_spe_pmu_irq_handler(), I think we need
> > synchronisation before reading PMBSR_EL1.
> > 
> > AFAICT from the spec, a context synchronisation event doesn't ensure
> > that the PMU's indirect write to PMBSR_EL1 is visible to the PE's direct
> > read above. I beleive we need a PSB CSYNC (and subsequent ISB) to ensure
> > that.
> 
> I don't think that's right, but the spec isn't completely clear. PSB CSYNC
> is about the profiling data itself, but in this case we've taken an IRQ
> already so the PMBSR will be up-to-date. I'll seek clarification anyway.

Thanks; I'll await this.

> > The only other caller is from arm_spe_pmu_stop(), which first calls
> > arm_spe_pmu_disable_and_drain_local(), so I guess the new barriers
> > should live in arm_spe_pmu_irq_handler(). I'll comment there.
> > 
> > > +	if (!arm_spe_pmu_buffer_mgmt_pending(pmbsr) && resume)
> > > +		return false; /* Spurious IRQ */
> > > +
> > > +	/* Ensure hardware updates to PMBPTR_EL1 are visible */
> > > +	isb();
> > 
> > Can we please move this into arm_spe_pmu_buffer_mgmt_pending(), after
> > the associated PSB CSYNC?
> 
> Hmmm, I deliberately *didn't* do that because I wanted
> arm_spe_pmu_buffer_mgmt_pending to ensure the buffer writes are visible, and
> then the caller can decide if it cares about indirect SPE register writes
> being visible. In reality, I ended up with a single caller, but let's see
> how it looks when I rework it to deal with fatal aborts.

Ok.

> > > +	/*
> > > +	 * Work out how much data has been written since the last update
> > > +	 * to the head index.
> > > +	 */
> > > +	pmbptr = round_down(read_sysreg_s(PMBPTR_EL1), spe_pmu->align);
> > 
> > I don't believe we need to align this.
> > 
> > Per the spec, PMBPTR_EL1[M:0] are RES0 in HW unless sync external abort
> > reporting is present, in which case they're valid. We write these bits
> > as zero, unless we have a bug elsewhere.
> > 
> > ... so either the bits are zero, and we're fine, or an external abourt
> > has been hit. In the external abort case, we have no idea how far we
> > need to reverse the base pointer anyhow.
> 
> As part of the fatal abort handling, I should probably round down to
> max_record_sz (keyed off DL==1, which I don't think can ever happen
> at the moment).

I don't think that's sufficient.

For an async external abort the spec says we cannot assume any valid
data has been written to the buffer when DL==1, so we must throw away
the whole run.

For other cases, it's not clear to me whether PMBPTR_EL1 is guaranteed
to be within max_record_sz of the end of the last complete record. The
spec says that PMBPTR_EL1 might not point at the byte after the last
complete record, and it doesn't say how far beyond this is may be.

Given that the async case is called out specifically, I think the intent
is that it is bounded somehow, but this could do with clarification.

[...]

> > > +	/*
> > > +	 * Either the buffer is full or we're stopping the session. Check
> > > +	 * that we didn't write a partial record, since this can result
> > > +	 * in unparseable trace and we must disable the event.
> > > +	 */
> > > +	if (pmbsr & BIT(PMBSR_EL1_COLL_SHIFT))
> > > +		perf_aux_output_flag(handle, PERF_AUX_FLAG_COLLISION);
> > > +
> > > +	truncated = pmbsr & BIT(PMBSR_EL1_DL_SHIFT);
> > > +	if (truncated)
> > > +		perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED);
> > > +
> > > +	perf_aux_output_end(handle, size);
> > 
> > The comment block above perf_aux_output_end() says:
> > 
> >   It is the pmu driver's responsibility to observe ordering rules of the
> >   hardware, so that all the data is externally visible before this is
> >   called.
> > 
> > ... but in arm_spe_pmu_buffer_mgmt_pending() we only ensured that the
> > data was visible in the current NSH domain (i.e. only to this CPU).
> > 
> > I followed the callchain for updating head:
> > 
> > perf_aux_output_end()
> > -> perf_event_aux_event()
> > -> perf_output_end()
> > -> perf_output_put_handle()
> > 
> > ... I see that there's an smp_wmb() (i.e. a DMB ISHST) on that path, but
> > it's not clear to me if that's sufficient to ensure that the PMU's
> > writes are made visible to other CPUs.
> 
> With the new memory model, it should be sufficient; the DSB NSH ensures
> data is visible to us locally, and then we order that before the update
> of the ring buffer.
> 
> > Given the comment, I'd feel happier if we had something here or in
> > arm_spe_pmu_buffer_mgmt_pending() to ensure that the PMU's prior writes
> > are visible to other CPUs.
> 
> I could add a comment?

A comment would be great.

> > > +	/*
> > > +	 * If we're not resuming the session, then we can clear the fault
> > > +	 * and we're done, otherwise we need to start a new session.
> > > +	 */
> > > +	if (!resume)
> > > +		write_sysreg_s(0, PMBSR_EL1);
> > > +	else if (!truncated)
> > > +		arm_spe_perf_aux_output_begin(handle, event);
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +/* IRQ handling */
> > > +static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
> > > +{
> > > +	struct perf_output_handle *handle = dev;
> > > +
> > > +	if (!perf_get_aux(handle))
> > > +		return IRQ_NONE;
> > > +
> > > +	if (!arm_spe_perf_aux_output_end(handle, handle->event, true))
> > > +		return IRQ_NONE;
> > 
> > As commented in arm_spe_perf_aux_output_end(), I think we need a
> > psb_csync(); isb() sequence prior to the read of PMBSR_EL1 in
> > arm_spe_perf_aux_output_end() to ensure that it is up-to-date w.r.t. the
> > interrupt.
> 
> As above, I don't agree but am checking this with the architects.

Sure. I'm also hoping that you're correct here!

> > > +	irq_work_run();
> > > +	isb(); /* Ensure the buffer is disabled if data loss has occurred */
> > 
> > What exactly are we synchronising here?
> > 
> > AFAICT, when truncation occurs we don't clear PMBLIMITR_EL1.E, so the
> > buffer is only implicitly disabled by the PMU's indirect write
> > PMBSR_EL1.S, which we must have already synchronised prior to reading
> > PMBSR_EL1.
> 
> When you report truncation to perf_aux_output_end, it eventually (the
> irq_work_run() above) calls back into arm_spe_pmu_stop, and I want to make
> sure we've nobbled the limit pointer.
> 
> It would probably be clearer just to add an ISB to the end of
> arm_spe_pmu_disable_and_drain_local, which goes back to your previous comments
> about the mgmt_pending code.

That would work for me.

> > ... so I can't see why this is necessary.
> > 
> > > +	write_sysreg_s(0, PMBSR_EL1);
> > 
> > ... and regardless, we clear PMBSR_EL1.S here, which'll start the PMU
> > again, even if truncation occured, which I don't think we want.
> > 
> > Can we have arm_spe_perf_aux_output_end() clear PMBLIMITR_EL1.E when
> > truncation occurs?
> 
> Right, that's exactly what happens. It's just highly convoluted.

Thanks; I see what's going on now. What a joy.

> > > +	return IRQ_HANDLED;
> > > +}
> > > +
> > > +/* Perf callbacks */
> > > +static int arm_spe_pmu_event_init(struct perf_event *event)
> > > +{
> > > +	u64 reg;
> > > +	struct perf_event_attr *attr = &event->attr;
> > > +	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
> > > +
> > > +	/* This is, of course, deeply driver-specific */
> > > +	if (attr->type != event->pmu->type)
> > > +		return -ENOENT;
> > > +
> > > +	if (event->cpu >= 0 &&
> > > +	    !cpumask_test_cpu(event->cpu, &spe_pmu->supported_cpus))
> > > +		return -ENOENT;
> > 
> > We're not rejecting cpu < 0, so I take it we're trying to handle
> > per-task events?
> 
> Yes, like intel-pt does.
> 
> > As I've mentioned before, that case worries me. One thing I've just
> > realised we need to figure out is what happens if attr.inherit is set.
> > The core doesn't reject that, and I suspect we may need to here.
> 
> That's already rejected by perf_mmap.

Ok. Odd that we even allow those events to be created, though.

> > > +	if (arm_spe_event_to_pmsevfr(event) & PMSEVFR_EL1_RES0)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (event->hw.sample_period < spe_pmu->min_period ||
> > > +	    event->hw.sample_period & PMSIRR_EL1_IVAL_MASK)
> > > +		return -EOPNOTSUPP;
> > 
> > As mentioned in the sysreg comments, we need to check the upper 32 bits
> > of the PMSIRR value are zero, so we'll need something like:
> > 
> > 	if (event->hw.sample_period < spe_pmu->min_period)
> > 		return -EOPNOTSUPP;
> > 	
> > 	if (event->hw.sample_period &
> > 	    ~(PMSIRR_EL1_INTERVAL_MASK << PMSIRR_EL1_INTERVAL_SHIFT))
> > 		return -EOPNOTSUPP;
> 
> I wonder if we're actually better off just truncating the interval. That
> way, if the interval is extended in the future, then new software won't
> get an error on older cores.

As above, given the PERF_EVENT_IOC_INTERVAL case, I agree that it's
better to not validate the interval here, and to truncate it as
necessary when installing the event.

> It feels a bit weird putting the max interval in sysfs.

I have no strong feelings either way. Maybe it's useful for the user to
choose a sensible / optimum userspace buffer size?

[...]

> > Does anything prevent this event from being added to a group?
> 
> PERF_PMU_CAP_EXCLUSIVE should take care of that in the core.

So it does. Phew.

[...]

> > > +static void arm_spe_pmu_disable_and_drain_local(void)
> > > +{
> > > +	/* Disable profiling at EL0 and EL1 */
> > > +	write_sysreg_s(0, PMSCR_EL1);
> > > +	isb();
> > > +
> > > +	/* Drain any buffered data */
> > > +	psb_csync();
> > > +	dsb(nsh);
> > > +
> > > +	/* Disable the profiling buffer */
> > > +	write_sysreg_s(0, PMBLIMITR_EL1);
> > 
> > Can't this be done when we clear PMSCR_EL1? Surely buffered data would
> > be written out regardless?
> 
> I think the buffered data could be silently dropped if you did that.

Having looked over the spec again, I think you're correct. Sorry for the
noise.

I'd missed the distinction between disabling sampling and disabling
the buffer into which samples are fed.

> > > +static void *arm_spe_pmu_setup_aux(int cpu, void **pages, int nr_pages,
> > > +				   bool snapshot)
> > > +{
> > > +	int i;
> > > +	struct page **pglist;
> > > +	struct arm_spe_pmu_buf *buf;
> > > +
> > > +	/*
> > > +	 * We require an even number of pages for snapshot mode, so that
> > > +	 * we can effectively treat the buffer as consisting of two equal
> > > +	 * parts and give userspace a fighting chance of getting some
> > > +	 * useful data out of it.
> > > +	 */
> > > +	if (!nr_pages || (snapshot && (nr_pages & 1)))
> > > +		return NULL;
> > 
> > As noted above, we may need to ensure that this is a pwoer of two.
> 
> Sorry, I don't follow. The power-of-two bug was in my IDX2OFF macro, which
> I've fixed. For snapshot mode, we just need an even number of pages.

Sure; with IDX2OFF() fixed the above is sufficient.

> > > +	buf = kzalloc_node(sizeof(*buf), GFP_KERNEL, cpu_to_node(cpu));
> > > +	if (!buf)
> > > +		return NULL;
> > > +
> > > +	pglist = kcalloc(nr_pages, sizeof(*pglist), GFP_KERNEL);
> > > +	if (!pglist)
> > > +		goto out_free_buf;
> > > +
> > > +	for (i = 0; i < nr_pages; ++i) {
> > > +		struct page *page = virt_to_page(pages[i]);
> > > +
> > > +		if (PagePrivate(page)) {
> > > +			pr_warn("unexpected high-order page for auxbuf!");
> > 
> > It looks like the intel-pt driver expects high-order pages.
> > 
> > What prevents us from seeing those?
> 
> The fact that we don't set the PERF_PMU_CAP_AUX_NO_SG capability. The intel
> driver uses physical address and scatter/gather lists, whereas ours just
> takes virtual addresses for the buffer.
> 
> > Why can't we handle those?
> 
> We never get them, so we don't need to.

Ok.

> > How are these pages pinned? Does the core ensure that?
> 
> Yes, they're GFP_KERNEL pages underneath.

Ok.

> > > +	spe_pmu->pmu = (struct pmu) {
> > > +		.capabilities	= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE,
> > > +		.attr_groups	= arm_spe_pmu_attr_groups,
> > > +		/*
> > > +		 * We hitch a ride on the software context here, so that
> > > +		 * we can support per-task profiling (which is not possible
> > > +		 * with the invalid context as it doesn't get sched callbacks).
> > > +		 * This requires that userspace either uses a dummy event for
> > > +		 * perf_event_open, since the aux buffer is not setup until
> > > +		 * a subsequent mmap, or creates the profiling event in a
> > > +		 * disabled state and explicitly PERF_EVENT_IOC_ENABLEs it
> > > +		 * once the buffer has been created.
> > > +		 */
> > > +		.task_ctx_nr	= perf_sw_context,
> > 
> > While other tracing PMUs do this, I think this is a horrible bodge, and
> > a bad idea, given it violates assumptions made in the core code.
> > 
> > For example, unlike true SW events, add() and start() can fail, so a
> > tracing event can unexpectedly stop SW events from being scheduled.
> > 
> > AFAICT, we could also try to move a tracing event into a later-created
> > HW PMU group, which is very worrying.
> > 
> > I really think we should have a separate tracing context for this class
> > of PMU, or we make it so that the invalid context can receive sched
> > callbacks.
> 
> Fair point, and I already have a comment to call this out. Whilst I'm not
> against seeing this fixed, I think it should be a separate patch series
> given that this is a common idiom amongst system/uncore PMU drivers.

I'll add this to my list of things to look at.

I would like to give this a poke in at least the group attach scenario I
mention above.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-22 18:36           ` Will Deacon
@ 2017-06-27 21:07             ` Kim Phillips
  2017-06-28 11:26               ` Mark Rutland
  0 siblings, 1 reply; 33+ messages in thread
From: Kim Phillips @ 2017-06-27 21:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, linux-arm-kernel, marc.zyngier, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Thu, 22 Jun 2017 19:36:20 +0100
Will Deacon <will.deacon@arm.com> wrote:

> On Thu, Jun 22, 2017 at 10:56:40AM -0500, Kim Phillips wrote:
> > On Wed, 21 Jun 2017 16:31:09 +0100
> > Will Deacon <will.deacon@arm.com> wrote:
> > 
> > > On Thu, Jun 15, 2017 at 10:57:35AM -0500, Kim Phillips wrote:
> > > > On Mon, 12 Jun 2017 11:20:48 -0500
> > > > Kim Phillips <kim.phillips@arm.com> wrote:
> > > > 
> > > > > On Mon, 12 Jun 2017 12:08:23 +0100
> > > > > Mark Rutland <mark.rutland@arm.com> wrote:
> > > > > 
> > > > > > On Mon, Jun 05, 2017 at 04:22:52PM +0100, Will Deacon wrote:
> > > > > > > This is the sixth posting of the patches previously posted here:
> > > > ...
> > > > > > Kim, do you have any version of the userspace side that we could look
> > > > > > at?
> > > > > > 
> > > > > > For review, it would be really helpful to have something that can poke
> > > > > > the PMU, even if it's incomplete or lacking polish.
> > > > > 
> > > > > Here's the latest push, based on a a couple of prior versions of this
> > > > > driver:
> > > > > 
> > > > > http://linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/armspev0.1
> > > > > 
> > > > > I don't seem to be able to get any SPE data output after rebasing on
> > > > > this version of the driver.  Still don't know why at the moment...
> > > > 
> > > > Bisected to commit e38ba76deef "perf tools: force uncore events to
> > > > system wide monitoring".  So, using record with specifying a -C
> > > > <cpu> explicitly now produces SPE data, but only a couple of valid
> > > > records at the beginning of each buffer; the rest is filled with
> > > > PADding (0's).
> > > > 
> > > > I see Mark's latest comments have found a possible issue in the perf
> > > > aux buffer handling code in the driver, and that the driver does some
> > > > memset of padding (0's) itself; could that be responsible for the above
> > > > behaviour?
> > > 
> > > Possibly. Do you know how big you're mapping the aux buffer
> > 
> > 4MiB.
> > 
> > > and what (if any) value you're passing as aux_watermark?
> > 
> > None passed, but it looks like 4KiB was used since the AUXTRACE size
> > was 4MiB - 4KiB.
> > 
> > I'm not seeing the issue with a simple bts-based version I'm
> > working on...yet.  We can revisit if I'm able to reproduce again; the
> > problem could have been on the userspace side.

I'm close to finishing the bts version of userspace, and have been
testing a bit more thoroughly, so now I consistently see the excessive
PADding when recording a CPU that's idle. I.e., when I taskset the perf
record to the same CPU I specify to record's -C (taskset -c n perf
record -C n), I get max. twenty-odd number of PAD bytes at the end of
the AUX buffers in the perf.data file.  If, OTOH, I taskset -c n perf
record -C m, where m != n, I get a couple of valid event records in the
buffer, and the rest of the buffer is filled with PADding.

It wouldn't be a problem except that it's wastes too much space
sometimes.  Here is a good output buffer sample from a --mmap-pages=,12
run, with only 4 PADs tacked onto the end:

0xd190 [0x30]: PERF_RECORD_AUXTRACE size: 0x48  offset: 0  ref: 0xe914f7e3ce  idx: 0  tid: -1  cpu: 2
.
. ... ARM SPE data: size 72 bytes
.  00000000:  4a 01                                           B COND
.  00000002:  b1 00 00 00 00 00 00 00 c0                      TGT 0 el2 ns=1
.  0000000b:  42 42                                           RETIRED NOT-TAKEN 
.  0000000d:  b0 f4 4e 10 08 00 00 ff ff                      PC ff000008104ef4 el3 ns=1
.  00000016:  98 00 00                                        LAT 0 TOT
.  00000019:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645
.  00000022:  4a 02                                           B IND
.  00000024:  b1 54 51 11 08 00 00 ff ff                      TGT ff000008115154 el3 ns=1
.  0000002d:  42 02                                           RETIRED 
.  0000002f:  b0 68 36 11 08 00 00 ff ff                      PC ff000008113668 el3 ns=1
.  00000038:  98 00 00                                        LAT 0 TOT
.  0000003b:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645
.  00000044:  00                                              PAD
.  00000045:  00                                              PAD
.  00000046:  00                                              PAD
.  00000047:  00                                              PAD

whereas this one - from later on in the same run - is over 99% PADs: 

0xd250 [0x30]: PERF_RECORD_AUXTRACE size: 0x5fc0  offset: 0xfffff4ae0044  ref: 0xe91cead1dd  idx: 0  tid: -1  cpu: 2
.
. ... ARM SPE data: size 24512 bytes
.  00000000:  4a 00                                           B
.  00000002:  b1 cc 4e 10 08 00 00 ff ff                      TGT ff000008104ecc el3 ns=1
.  0000000b:  42 02                                           RETIRED 
.  0000000d:  b0 90 4e 10 08 00 00 ff ff                      PC ff000008104e90 el3 ns=1
.  00000016:  98 00 00                                        LAT 0 TOT
.  00000019:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645
.  00000022:  49 01                                           ST
.  00000024:  b2 e0 2e f5 7d 00 80 ff ff                      VA ffff80007df52ee0
.  0000002d:  b3 e0 2e f5 fd 00 00 00 80                      PA fdf52ee0 ns=1
.  00000036:  9a 00 00                                        LAT 0 XLAT
.  00000039:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS 
.  0000003b:  b0 e8 41 39 08 00 00 ff ff                      PC ff0000083941e8 el3 ns=1
.  00000044:  98 00 00                                        LAT 0 TOT
.  00000047:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645
.  00000050:  4a 00                                           B
.  00000052:  b1 58 f2 0f 08 00 00 ff ff                      TGT ff0000080ff258 el3 ns=1
.  0000005b:  42 02                                           RETIRED 
.  0000005d:  b0 90 de 0d 08 00 00 ff ff                      PC ff0000080dde90 el3 ns=1
.  00000066:  98 00 00                                        LAT 0 TOT
.  00000069:  71 8f 4e e1 14 e9 00 00 00                      TS 1001077689999
.  00000072:  48 00                                           INSN-OTHER
.  00000074:  42 02                                           RETIRED 
.  00000076:  b0 f8 16 61 08 00 00 ff ff                      PC ff0000086116f8 el3 ns=1
.  0000007f:  98 00 00                                        LAT 0 TOT
.  00000082:  71 8f 4e e1 14 e9 00 00 00                      TS 1001077689999
.  0000008b:  49 00                                           LD
.  0000008d:  b2 10 34 ba 7b 00 80 ff ff                      VA ffff80007bba3410
.  00000096:  b3 10 34 ba fb 00 00 00 80                      PA fbba3410 ns=1
.  0000009f:  9a 00 00                                        LAT 0 XLAT
.  000000a2:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS 
.  000000a4:  b0 8c be 7a 08 00 00 ff ff                      PC ff0000087abe8c el3 ns=1
.  000000ad:  98 00 00                                        LAT 0 TOT
.  000000b0:  71 8f 4e e1 14 e9 00 00 00                      TS 1001077689999
.  000000b9:  00                                              PAD
...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...
.  00005fbf:  00                                              PAD

So maybe there's an offset counter that isn't being reset properly;
hopefully the parallel discussion with Mark will help find the problem.

FWIW, there is also this one I saw with mmap-pages set to 5
(pages), which gets rounded up to 8 pages: the driver started
memsetting places it shouldn't?:

$ sudo ./perf record -c 512 -C 0 -e arm_spe/branch_filter=0,ts_enable=1,pa_enable=1,event_filter=0,load_filter=0,jitter=1,store_filter=0,min_latency=0/ --mmap-pages=,5 dd if=/dev/urandom of=/dev/null count=10000
rounding mmap pages size to 32K (8 pages)
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB) copied, 1.3391 s, 3.8 MB/s
[ 1885.042803] Unable to handle kernel paging request at virtual address ffff00000adac000
[ 1885.042873] pgd = ffff00000ad48000
[ 1885.042899] [ffff00000adac000] *pgd=00000000fdffe003, *pud=00000000fdffd003, *pmd=00000000fdff8003, *pte=0000000000000000
[ 1885.043083] Internal error: Oops: 96000047 [#1] PREEMPT SMP
[ 1885.043131] Modules linked in:
[ 1885.043200] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.12.0-rc5-00039-gf1d4a187881e #34
[ 1885.043299] Hardware name: FVP_Base_AEMv8A-AEMv8A (DT)
[ 1885.043364] task: ffff000008c21a80 task.stack: ffff000008c10000
[ 1885.043436] PC is at __memset+0x1ac/0x1d0
[ 1885.043499] LR is at __arm_spe_pmu_next_off+0x6c/0xd8
[ 1885.043600] pc : [<ffff00000837dbac>] lr : [<ffff0000086771e4>] pstate: 204001c9
[ 1885.043600] sp : ffff80007df22d10
[ 1885.043733] x29: ffff80007df22d10 x28: ffff000008c21a80 
[ 1885.043819] x27: ffff000008c37768 x26: ffff80007df30280 
[ 1885.043910] x25: ffff80007a109800 x24: 0000001d507d1906 
[ 1885.044012] x23: ffff80007a601018 x22: ffff80007a3ebb00 
[ 1885.044102] x21: ffff80007df36ab0 x20: ffff80007a3ebb00 
[ 1885.044196] x19: ffff80007df36ab0 x18: 0000000000000000 
[ 1885.044253] x17: 0000000000000000 x16: 0000000000000000 
[ 1885.044339] x15: 0000000000000000 x14: ffff000008c21a80 
[ 1885.044456] x13: 000080007532d000 x12: 000000003445d91d 
[ 1885.044557] x11: 0000000000000000 x10: 0000000000001000 
[ 1885.044600] x9 : 0000000000000000 x8 : ffff00000adac000 
[ 1885.044729] x7 : 0000000000000000 x6 : 00000000000003ff 
[ 1885.044800] x5 : 0000000000000400 x4 : 0000000000000000 
[ 1885.044911] x3 : 0000000000000008 x2 : 0000000000003bff 
[ 1885.045000] x1 : 0000000000000000 x0 : ffff00000ada8000 
[ 1885.045100] Process swapper/0 (pid: 0, stack limit = 0xffff000008c10000)
[ 1885.045179] Stack: (0xffff80007df22d10 to 0xffff000008c14000)
[ 1885.045239] Call trace:
[ 1885.045300] Exception stack(0xffff80007df22b40 to 0xffff80007df22c70)
[ 1885.045400] 2b40: ffff80007df36ab0 0001000000000000 ffff80007df22d10 ffff00000837dbac
[ 1885.045505] 2b60: 0000000000000000 0000000000000000 ffff80007bb4b520 ffff00000837eac0
[ 1885.045605] 2b80: ffff80007df22d10 ffff0000080d6b58 0000000100060b21 ffff80007bb4afa8
[ 1885.045712] 2ba0: ffff80007bb4af20 ffff80007bb4af00 0000000000000000 ffff000008c19f04
[ 1885.045815] 2bc0: ffff000008bff000 ffff000008c17000 ffff80007bb53f00 000000000002fe89
[ 1885.045916] 2be0: ffff00000ada8000 0000000000000000 0000000000003bff 0000000000000008
[ 1885.046013] 2c00: 0000000000000000 0000000000000400 00000000000003ff 0000000000000000
[ 1885.046126] 2c20: ffff00000adac000 0000000000000000 0000000000001000 0000000000000000
[ 1885.046224] 2c40: 000000003445d91d 000080007532d000 ffff000008c21a80 0000000000000000
[ 1885.046326] 2c60: 0000000000000000 0000000000000000
[ 1885.046408] [<ffff00000837dbac>] __memset+0x1ac/0x1d0
[ 1885.046499] [<ffff00000867729c>] arm_spe_perf_aux_output_begin+0x4c/0x1b8
[ 1885.046599] [<ffff000008678424>] arm_spe_pmu_start+0x34/0xf0
[ 1885.046695] [<ffff000008678548>] arm_spe_pmu_add+0x68/0x98
[ 1885.046733] [<ffff00000814da54>] event_sched_in.isra.61+0xcc/0x218
[ 1885.046879] [<ffff00000814dc08>] group_sched_in+0x68/0x1a0
[ 1885.046981] [<ffff00000814dfd0>] ctx_sched_in+0x290/0x468
[ 1885.047080] [<ffff00000814e23c>] perf_event_sched_in+0x94/0xa8
[ 1885.047179] [<ffff00000814e2b4>] ctx_resched+0x64/0xb0
[ 1885.047268] [<ffff00000814e500>] __perf_event_enable+0x200/0x238
[ 1885.047366] [<ffff000008147118>] event_function+0x90/0xf0
[ 1885.047452] [<ffff0000081499e8>] remote_function+0x60/0x70
[ 1885.047514] [<ffff0000081194fc>] flush_smp_call_function_queue+0x9c/0x168
[ 1885.047637] [<ffff00000811a2a0>] generic_smp_call_function_single_interrupt+0x10/0x18
[ 1885.047733] [<ffff00000808e928>] handle_IPI+0xc0/0x1d8
[ 1885.047799] [<ffff000008081700>] gic_handle_irq+0x80/0x178
[ 1885.047799] Exception stack(0xffff000008c13d80 to 0xffff000008c13eb0)
[ 1885.047984] 3d80: 0000000000000000 ffff000008c21a80 00000000000003e8 ffff000008651430
[ 1885.048087] 3da0: 000000001999999a 0000000000000020 0000002bedec501b 0000000000000000
[ 1885.048190] 3dc0: 000001b2b5103510 ffff000008081800 0000000000001000 0000000000000000
[ 1885.048300] 3de0: 000000003445d91d 000080007532d000 ffff000008c21a80 0000000000000000
[ 1885.048400] 3e00: 0000000000000000 0000000000000000 0000000000000000 000001b6e54dc796
[ 1885.048505] 3e20: 0000000000000002 ffff80007a983c00 0000000000000002 ffff000008cdc130
[ 1885.048600] 3e40: 000001b6e5424132 ffff000008c21a80 0000000000000000 00000000fef7c684
[ 1885.048716] 3e60: 0000000080b10018 ffff000008c13eb0 ffff00000861014c ffff000008c13eb0
[ 1885.048817] 3e80: ffff00000861018c 0000000040c00149 00000000fef7c684 0000000000000002
[ 1885.048900] 3ea0: ffffffffffffffff ffff00000861014c
[ 1885.048900] [<ffff0000080827f4>] el1_irq+0xb4/0x128
[ 1885.049009] [<ffff00000861018c>] cpuidle_enter_state+0x154/0x200
[ 1885.049126] [<ffff000008610270>] cpuidle_enter+0x18/0x20
[ 1885.049207] [<ffff0000080ddd08>] call_cpuidle+0x18/0x30
[ 1885.049332] [<ffff0000080ddf44>] do_idle+0x19c/0x1d8
[ 1885.049400] [<ffff0000080de114>] cpu_startup_entry+0x24/0x28
[ 1885.049453] [<ffff0000087a6b30>] rest_init+0x80/0x90
[ 1885.049500] [<ffff000008b10b3c>] start_kernel+0x374/0x388
[ 1885.049617] [<ffff000008b101e0>] __primary_switched+0x64/0x6c
[ 1885.049699] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) 
[ 1885.049800] ---[ end trace 31b9a9f27da95f58 ]---
[ 1885.049900] Kernel panic - not syncing: Fatal exception in interrupt
[ 1885.050000] SMP: stopping secondary CPUs
[ 1885.050204] Kernel Offset: disabled
[ 1885.050240] Memory Limit: none
[ 1885.050327] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

It's not easily reproduced.

> > Meanwhile, when using fvp-base.dtb, my model setup stops booting the
> > kernel after "smp: Bringing up secondary CPUs ...".  If I however take
> > the second SPE node from fvp-base.dts and add it to my working device
> > tree, I get this during the driver probe:
> > 
> > [    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> > [    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> > [    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)
> 
> Looks like you've screwed up your IRQ partitions, so you are effectively
> registering the same device twice, which then blows up due to lack of shared
> irqs.
> 
> Either remove one of the devices, or use IRQ partitions to restrict them
> to unique sets of CPUs.

Right, but since I want to get parity with what you're running -
fvp_base.dtb - I tried to debug the hang after "smp: Bringing up
secondary CPUs ..." problem, and could only debug it to the PSCI driver
hitting one of these cases:

case PSCI_RET_INVALID_PARAMS:
case PSCI_RET_INVALID_ADDRESS:

Note: it's yet another place I have to manually instrument the error
path in a kernel driver in lieu of it being more naturally verbose by
itself; I *implore* you to reconsider adding proper user messaging to
arm_spe_pmu_event_init().

I can't tell which part of the fvp-base device tree is not liked by the
firmware; I tried different combinations of the PSCI node, different CPU
enumerations (cpu@100 vs cpu@1), removing idle-states properties...any
hints appreciated.

Thanks,

Kim

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-27 21:07             ` Kim Phillips
@ 2017-06-28 11:26               ` Mark Rutland
  2017-06-28 11:32                 ` Mark Rutland
  2017-06-29  0:59                 ` [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Kim Phillips
  0 siblings, 2 replies; 33+ messages in thread
From: Mark Rutland @ 2017-06-28 11:26 UTC (permalink / raw)
  To: Kim Phillips
  Cc: Will Deacon, linux-arm-kernel, marc.zyngier, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:
> I'm close to finishing the bts version of userspace, and have been
> testing a bit more thoroughly, so now I consistently see the excessive
> PADding when recording a CPU that's idle. I.e., when I taskset the perf
> record to the same CPU I specify to record's -C (taskset -c n perf
> record -C n), I get max. twenty-odd number of PAD bytes at the end of
> the AUX buffers in the perf.data file.  If, OTOH, I taskset -c n perf
> record -C m, where m != n, I get a couple of valid event records in the
> buffer, and the rest of the buffer is filled with PADding.
> 
> It wouldn't be a problem except that it's wastes too much space
> sometimes.  Here is a good output buffer sample from a --mmap-pages=,12
> run, with only 4 PADs tacked onto the end:
> 
> 0xd190 [0x30]: PERF_RECORD_AUXTRACE size: 0x48  offset: 0  ref: 0xe914f7e3ce  idx: 0  tid: -1  cpu: 2
> .
> . ... ARM SPE data: size 72 bytes
> .  00000000:  4a 01                                           B COND

[...]

> .  0000003b:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645
> .  00000044:  00                                              PAD
> .  00000045:  00                                              PAD
> .  00000046:  00                                              PAD
> .  00000047:  00                                              PAD
> 
> whereas this one - from later on in the same run - is over 99% PADs: 
> 
> 0xd250 [0x30]: PERF_RECORD_AUXTRACE size: 0x5fc0  offset: 0xfffff4ae0044  ref: 0xe91cead1dd  idx: 0  tid: -1  cpu: 2
> .
> . ... ARM SPE data: size 24512 bytes
> .  00000000:  4a 00                                           B

[...]

> .  000000b0:  71 8f 4e e1 14 e9 00 00 00                      TS 1001077689999
> .  000000b9:  00                                              PAD
> ...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...
> .  00005fbf:  00                                              PAD

Interesting.

If you cat /proc/interrupts, do you see many more SPE interrupts on CPU
n than on m?

Otherwise, I wonder if this is some odd interaction with idle. Can you
try to forcefully load that other CPU?

e.g. run something like:

	taskset -c <n> sh -c 'while true; do done'

... in parallel with the tracer.

For reference, what was your event sample period (i.e. the value of
perf_event_attr::sample_period)?

Did you modify that at all with PERF_EVENT_IOC_PERIOD?

> So maybe there's an offset counter that isn't being reset properly;
> hopefully the parallel discussion with Mark will help find the problem.
> 
> FWIW, there is also this one I saw with mmap-pages set to 5
> (pages), which gets rounded up to 8 pages:

Sorry, *what* does the rounding upwards? Userspace, perf core, or the
driver? Where?

> the driver started memsetting places it shouldn't?:
>
> $ sudo ./perf record -c 512 -C 0 -e arm_spe/branch_filter=0,ts_enable=1,pa_enable=1,event_filter=0,load_filter=0,jitter=1,store_filter=0,min_latency=0/ --mmap-pages=,5 dd if=/dev/urandom of=/dev/null count=10000
> rounding mmap pages size to 32K (8 pages)
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB) copied, 1.3391 s, 3.8 MB/s
> [ 1885.042803] Unable to handle kernel paging request at virtual address ffff00000adac000
> [ 1885.042873] pgd = ffff00000ad48000
> [ 1885.042899] [ffff00000adac000] *pgd=00000000fdffe003, *pud=00000000fdffd003, *pmd=00000000fdff8003, *pte=0000000000000000
> [ 1885.043083] Internal error: Oops: 96000047 [#1] PREEMPT SMP
> [ 1885.043131] Modules linked in:
> [ 1885.043200] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.12.0-rc5-00039-gf1d4a187881e #34
> [ 1885.043299] Hardware name: FVP_Base_AEMv8A-AEMv8A (DT)
> [ 1885.043364] task: ffff000008c21a80 task.stack: ffff000008c10000
> [ 1885.043436] PC is at __memset+0x1ac/0x1d0
> [ 1885.043499] LR is at __arm_spe_pmu_next_off+0x6c/0xd8
> [ 1885.043600] pc : [<ffff00000837dbac>] lr : [<ffff0000086771e4>] pstate: 204001c9
> [ 1885.043600] sp : ffff80007df22d10
> [ 1885.043733] x29: ffff80007df22d10 x28: ffff000008c21a80 
> [ 1885.043819] x27: ffff000008c37768 x26: ffff80007df30280 
> [ 1885.043910] x25: ffff80007a109800 x24: 0000001d507d1906 
> [ 1885.044012] x23: ffff80007a601018 x22: ffff80007a3ebb00 
> [ 1885.044102] x21: ffff80007df36ab0 x20: ffff80007a3ebb00 
> [ 1885.044196] x19: ffff80007df36ab0 x18: 0000000000000000 
> [ 1885.044253] x17: 0000000000000000 x16: 0000000000000000 
> [ 1885.044339] x15: 0000000000000000 x14: ffff000008c21a80 
> [ 1885.044456] x13: 000080007532d000 x12: 000000003445d91d 
> [ 1885.044557] x11: 0000000000000000 x10: 0000000000001000 
> [ 1885.044600] x9 : 0000000000000000 x8 : ffff00000adac000 
> [ 1885.044729] x7 : 0000000000000000 x6 : 00000000000003ff 
> [ 1885.044800] x5 : 0000000000000400 x4 : 0000000000000000 
> [ 1885.044911] x3 : 0000000000000008 x2 : 0000000000003bff 
> [ 1885.045000] x1 : 0000000000000000 x0 : ffff00000ada8000 
> [ 1885.045100] Process swapper/0 (pid: 0, stack limit = 0xffff000008c10000)
> [ 1885.045179] Stack: (0xffff80007df22d10 to 0xffff000008c14000)
> [ 1885.045239] Call trace:

> [ 1885.046408] [<ffff00000837dbac>] __memset+0x1ac/0x1d0
> [ 1885.046499] [<ffff00000867729c>] arm_spe_perf_aux_output_begin+0x4c/0x1b8
> [ 1885.046599] [<ffff000008678424>] arm_spe_pmu_start+0x34/0xf0
> [ 1885.046695] [<ffff000008678548>] arm_spe_pmu_add+0x68/0x98
> [ 1885.046733] [<ffff00000814da54>] event_sched_in.isra.61+0xcc/0x218
> [ 1885.046879] [<ffff00000814dc08>] group_sched_in+0x68/0x1a0
> [ 1885.046981] [<ffff00000814dfd0>] ctx_sched_in+0x290/0x468
> [ 1885.047080] [<ffff00000814e23c>] perf_event_sched_in+0x94/0xa8
> [ 1885.047179] [<ffff00000814e2b4>] ctx_resched+0x64/0xb0
> [ 1885.047268] [<ffff00000814e500>] __perf_event_enable+0x200/0x238
> [ 1885.047366] [<ffff000008147118>] event_function+0x90/0xf0
> [ 1885.047452] [<ffff0000081499e8>] remote_function+0x60/0x70
> [ 1885.047514] [<ffff0000081194fc>] flush_smp_call_function_queue+0x9c/0x168
> [ 1885.047637] [<ffff00000811a2a0>] generic_smp_call_function_single_interrupt+0x10/0x18
> [ 1885.047733] [<ffff00000808e928>] handle_IPI+0xc0/0x1d8
> [ 1885.047799] [<ffff000008081700>] gic_handle_irq+0x80/0x178

> [ 1885.048900] [<ffff0000080827f4>] el1_irq+0xb4/0x128
> [ 1885.049009] [<ffff00000861018c>] cpuidle_enter_state+0x154/0x200
> [ 1885.049126] [<ffff000008610270>] cpuidle_enter+0x18/0x20
> [ 1885.049207] [<ffff0000080ddd08>] call_cpuidle+0x18/0x30
> [ 1885.049332] [<ffff0000080ddf44>] do_idle+0x19c/0x1d8
> [ 1885.049400] [<ffff0000080de114>] cpu_startup_entry+0x24/0x28
> [ 1885.049453] [<ffff0000087a6b30>] rest_init+0x80/0x90
> [ 1885.049500] [<ffff000008b10b3c>] start_kernel+0x374/0x388
> [ 1885.049617] [<ffff000008b101e0>] __primary_switched+0x64/0x6c
> [ 1885.049699] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) 
> [ 1885.049800] ---[ end trace 31b9a9f27da95f58 ]---
> [ 1885.049900] Kernel panic - not syncing: Fatal exception in interrupt
> [ 1885.050000] SMP: stopping secondary CPUs
> [ 1885.050204] Kernel Offset: disabled
> [ 1885.050240] Memory Limit: none
> [ 1885.050327] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

That's worrying. I'll see if I can reproduce this.

> It's not easily reproduced.
> 
> > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the
> > > kernel after "smp: Bringing up secondary CPUs ...".  If I however take
> > > the second SPE node from fvp-base.dts and add it to my working device
> > > tree, I get this during the driver probe:
> > > 
> > > [    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> > > [    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> > > [    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)
> > 
> > Looks like you've screwed up your IRQ partitions, so you are effectively
> > registering the same device twice, which then blows up due to lack of shared
> > irqs.
> > 
> > Either remove one of the devices, or use IRQ partitions to restrict them
> > to unique sets of CPUs.
> 
> Right, but since I want to get parity with what you're running -
> fvp_base.dtb - I tried to debug the hang after "smp: Bringing up
> secondary CPUs ..." problem, and could only debug it to the PSCI driver
> hitting one of these cases:
> 
> case PSCI_RET_INVALID_PARAMS:
> case PSCI_RET_INVALID_ADDRESS:

Sounds like your DT is describing CPUs that don't exist (or perhaps the
same CPU several times). Certainly, PSCI and the kernel disagree on
which CPUS exist.

What exact DT are you using?

Are you using the bootwrapper, or ATF? I'm guessing you're using the
bootwrapper.

Which version of the bootwrapepr are you using? If it doesn't have
commit:

  ccdc936924b3682d ("Dynamically determine the set of CPUs")

... have you configured it appropriately with --with-cpu-ids?

How is your model configured? Which CPU IDs does it think exist?

> Note: it's yet another place I have to manually instrument the error
> path in a kernel driver in lieu of it being more naturally verbose by
> itself; I *implore* you to reconsider adding proper user messaging to
> arm_spe_pmu_event_init().

Given this is a FW configuration issue (i.e. a system-level error), I'm
more than happy to make the PSCI driver messages more helpful where
possible.

That's completely orthogonal to the SPE debug messages for requests made
by the user.

> I can't tell which part of the fvp-base device tree is not liked by the
> firmware; I tried different combinations of the PSCI node, different CPU
> enumerations (cpu@100 vs cpu@1), removing idle-states properties...any
> hints appreciated.

The bootwrapper doesn't support idle. So no idle-states should be in the
DT.

If you can share your DT, bootwrapper configuration, and model
configuration, I can try to debug this with you.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-28 11:26               ` Mark Rutland
@ 2017-06-28 11:32                 ` Mark Rutland
  2017-06-29  1:16                   ` Kim Phillips
  2017-06-29  0:59                 ` [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Kim Phillips
  1 sibling, 1 reply; 33+ messages in thread
From: Mark Rutland @ 2017-06-28 11:32 UTC (permalink / raw)
  To: Kim Phillips
  Cc: robh, mathieu.poirier, pawel.moll, suzuki.poulose, marc.zyngier,
	Will Deacon, linux-kernel, alexander.shishkin, peterz, mingo,
	tglx, linux-arm-kernel

On Wed, Jun 28, 2017 at 12:26:02PM +0100, Mark Rutland wrote:
> On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:
> > FWIW, there is also this one I saw with mmap-pages set to 5
> > (pages), which gets rounded up to 8 pages:
> 
> Sorry, *what* does the rounding upwards? Userspace, perf core, or the
> driver? Where?
> 
> > the driver started memsetting places it shouldn't?:
> >
> > $ sudo ./perf record -c 512 -C 0 -e arm_spe/branch_filter=0,ts_enable=1,pa_enable=1,event_filter=0,load_filter=0,jitter=1,store_filter=0,min_latency=0/ --mmap-pages=,5 dd if=/dev/urandom of=/dev/null count=10000
> > rounding mmap pages size to 32K (8 pages)
> > 10000+0 records in
> > 10000+0 records out
> > 5120000 bytes (5.1 MB) copied, 1.3391 s, 3.8 MB/s
> > [ 1885.042803] Unable to handle kernel paging request at virtual address ffff00000adac000
> > [ 1885.042873] pgd = ffff00000ad48000
> > [ 1885.042899] [ffff00000adac000] *pgd=00000000fdffe003, *pud=00000000fdffd003, *pmd=00000000fdff8003, *pte=0000000000000000
> > [ 1885.043083] Internal error: Oops: 96000047 [#1] PREEMPT SMP
> > [ 1885.043131] Modules linked in:
> > [ 1885.043200] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.12.0-rc5-00039-gf1d4a187881e #34
> > [ 1885.043299] Hardware name: FVP_Base_AEMv8A-AEMv8A (DT)
> > [ 1885.043364] task: ffff000008c21a80 task.stack: ffff000008c10000
> > [ 1885.043436] PC is at __memset+0x1ac/0x1d0
> > [ 1885.043499] LR is at __arm_spe_pmu_next_off+0x6c/0xd8
> > [ 1885.043600] pc : [<ffff00000837dbac>] lr : [<ffff0000086771e4>] pstate: 204001c9
> > [ 1885.043600] sp : ffff80007df22d10
> > [ 1885.043733] x29: ffff80007df22d10 x28: ffff000008c21a80 
> > [ 1885.043819] x27: ffff000008c37768 x26: ffff80007df30280 
> > [ 1885.043910] x25: ffff80007a109800 x24: 0000001d507d1906 
> > [ 1885.044012] x23: ffff80007a601018 x22: ffff80007a3ebb00 
> > [ 1885.044102] x21: ffff80007df36ab0 x20: ffff80007a3ebb00 
> > [ 1885.044196] x19: ffff80007df36ab0 x18: 0000000000000000 
> > [ 1885.044253] x17: 0000000000000000 x16: 0000000000000000 
> > [ 1885.044339] x15: 0000000000000000 x14: ffff000008c21a80 
> > [ 1885.044456] x13: 000080007532d000 x12: 000000003445d91d 
> > [ 1885.044557] x11: 0000000000000000 x10: 0000000000001000 
> > [ 1885.044600] x9 : 0000000000000000 x8 : ffff00000adac000 
> > [ 1885.044729] x7 : 0000000000000000 x6 : 00000000000003ff 
> > [ 1885.044800] x5 : 0000000000000400 x4 : 0000000000000000 
> > [ 1885.044911] x3 : 0000000000000008 x2 : 0000000000003bff 
> > [ 1885.045000] x1 : 0000000000000000 x0 : ffff00000ada8000 
> > [ 1885.045100] Process swapper/0 (pid: 0, stack limit = 0xffff000008c10000)
> > [ 1885.045179] Stack: (0xffff80007df22d10 to 0xffff000008c14000)
> > [ 1885.045239] Call trace:
> 
> > [ 1885.046408] [<ffff00000837dbac>] __memset+0x1ac/0x1d0
> > [ 1885.046499] [<ffff00000867729c>] arm_spe_perf_aux_output_begin+0x4c/0x1b8
> > [ 1885.046599] [<ffff000008678424>] arm_spe_pmu_start+0x34/0xf0
> > [ 1885.046695] [<ffff000008678548>] arm_spe_pmu_add+0x68/0x98
> > [ 1885.046733] [<ffff00000814da54>] event_sched_in.isra.61+0xcc/0x218
> > [ 1885.046879] [<ffff00000814dc08>] group_sched_in+0x68/0x1a0
> > [ 1885.046981] [<ffff00000814dfd0>] ctx_sched_in+0x290/0x468
> > [ 1885.047080] [<ffff00000814e23c>] perf_event_sched_in+0x94/0xa8
> > [ 1885.047179] [<ffff00000814e2b4>] ctx_resched+0x64/0xb0
> > [ 1885.047268] [<ffff00000814e500>] __perf_event_enable+0x200/0x238
> > [ 1885.047366] [<ffff000008147118>] event_function+0x90/0xf0
> > [ 1885.047452] [<ffff0000081499e8>] remote_function+0x60/0x70
> > [ 1885.047514] [<ffff0000081194fc>] flush_smp_call_function_queue+0x9c/0x168
> > [ 1885.047637] [<ffff00000811a2a0>] generic_smp_call_function_single_interrupt+0x10/0x18
> > [ 1885.047733] [<ffff00000808e928>] handle_IPI+0xc0/0x1d8
> > [ 1885.047799] [<ffff000008081700>] gic_handle_irq+0x80/0x178
> 
> > [ 1885.048900] [<ffff0000080827f4>] el1_irq+0xb4/0x128
> > [ 1885.049009] [<ffff00000861018c>] cpuidle_enter_state+0x154/0x200
> > [ 1885.049126] [<ffff000008610270>] cpuidle_enter+0x18/0x20
> > [ 1885.049207] [<ffff0000080ddd08>] call_cpuidle+0x18/0x30
> > [ 1885.049332] [<ffff0000080ddf44>] do_idle+0x19c/0x1d8
> > [ 1885.049400] [<ffff0000080de114>] cpu_startup_entry+0x24/0x28
> > [ 1885.049453] [<ffff0000087a6b30>] rest_init+0x80/0x90
> > [ 1885.049500] [<ffff000008b10b3c>] start_kernel+0x374/0x388
> > [ 1885.049617] [<ffff000008b101e0>] __primary_switched+0x64/0x6c
> > [ 1885.049699] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) 
> > [ 1885.049800] ---[ end trace 31b9a9f27da95f58 ]---
> > [ 1885.049900] Kernel panic - not syncing: Fatal exception in interrupt
> > [ 1885.050000] SMP: stopping secondary CPUs
> > [ 1885.050204] Kernel Offset: disabled
> > [ 1885.050240] Memory Limit: none
> > [ 1885.050327] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
> 
> That's worrying. I'll see if I can reproduce this.

Actually, this might be down to the IDX2OFF() macro being borked for non
power-of-two buffer sizes.

Do you have Will's latest fixes? In his tree there's a commit:

  4f331cd62531dce2 ("squash! drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension")

... which should fix the IDX2OFF() bug.

It's be good to reproduce the issue if we can, regardless.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-28 11:26               ` Mark Rutland
  2017-06-28 11:32                 ` Mark Rutland
@ 2017-06-29  0:59                 ` Kim Phillips
  2017-06-29 11:11                   ` Mark Rutland
  1 sibling, 1 reply; 33+ messages in thread
From: Kim Phillips @ 2017-06-29  0:59 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Will Deacon, linux-arm-kernel, marc.zyngier, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Wed, 28 Jun 2017 12:26:02 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:
> > I'm close to finishing the bts version of userspace, and have been
> > testing a bit more thoroughly, so now I consistently see the excessive
> > PADding when recording a CPU that's idle. I.e., when I taskset the perf
> > record to the same CPU I specify to record's -C (taskset -c n perf
> > record -C n), I get max. twenty-odd number of PAD bytes at the end of
> > the AUX buffers in the perf.data file.  If, OTOH, I taskset -c n perf
> > record -C m, where m != n, I get a couple of valid event records in the
> > buffer, and the rest of the buffer is filled with PADding.
> > 
> > It wouldn't be a problem except that it's wastes too much space
> > sometimes.  Here is a good output buffer sample from a --mmap-pages=,12
> > run, with only 4 PADs tacked onto the end:
> > 
> > 0xd190 [0x30]: PERF_RECORD_AUXTRACE size: 0x48  offset: 0  ref: 0xe914f7e3ce  idx: 0  tid: -1  cpu: 2
> > .
> > . ... ARM SPE data: size 72 bytes
> > .  00000000:  4a 01                                           B COND
> 
> [...]
> 
> > .  0000003b:  71 a5 39 e1 14 e9 00 00 00                      TS 1001077684645
> > .  00000044:  00                                              PAD
> > .  00000045:  00                                              PAD
> > .  00000046:  00                                              PAD
> > .  00000047:  00                                              PAD
> > 
> > whereas this one - from later on in the same run - is over 99% PADs: 
> > 
> > 0xd250 [0x30]: PERF_RECORD_AUXTRACE size: 0x5fc0  offset: 0xfffff4ae0044  ref: 0xe91cead1dd  idx: 0  tid: -1  cpu: 2
> > .
> > . ... ARM SPE data: size 24512 bytes
> > .  00000000:  4a 00                                           B
> 
> [...]
> 
> > .  000000b0:  71 8f 4e e1 14 e9 00 00 00                      TS 1001077689999
> > .  000000b9:  00                                              PAD
> > ...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...
> > .  00005fbf:  00                                              PAD
> 
> Interesting.
> 
> If you cat /proc/interrupts, do you see many more SPE interrupts on CPU
> n than on m?

When n == m, I see approx. 1 IRQ per SPE buffer full.

When n != m, I see neither CPU n or m incur SPE interrupts; the
workload ran but didn't get recorded, or, rather, 'idleness' got
recorded instead.

> Otherwise, I wonder if this is some odd interaction with idle. Can you
> try to forcefully load that other CPU?
> 
> e.g. run something like:
> 
> 	taskset -c <n> sh -c 'while true; do done'
> 
> ... in parallel with the tracer.

If I do a:

taskset -c 1 sh -c 'while true; do echo blah > /dev/null' & 
taskset -c 0 perf record -C 1 ...

then non-idleness and non-PADdingness get recorded.

> For reference, what was your event sample period (i.e. the value of
> perf_event_attr::sample_period)?
> 
> Did you modify that at all with PERF_EVENT_IOC_PERIOD?

If that's the same as 'perf record -c <period>', then, yes, I set
the period to values such as 512, 1024.

> > > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the
> > > > kernel after "smp: Bringing up secondary CPUs ...".  If I however take
> > > > the second SPE node from fvp-base.dts and add it to my working device
> > > > tree, I get this during the driver probe:
> > > > 
> > > > [    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> > > > [    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> > > > [    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)
> > > 
> > > Looks like you've screwed up your IRQ partitions, so you are effectively
> > > registering the same device twice, which then blows up due to lack of shared
> > > irqs.
> > > 
> > > Either remove one of the devices, or use IRQ partitions to restrict them
> > > to unique sets of CPUs.
> > 
> > Right, but since I want to get parity with what you're running -
> > fvp_base.dtb - I tried to debug the hang after "smp: Bringing up
> > secondary CPUs ..." problem, and could only debug it to the PSCI driver
> > hitting one of these cases:
> > 
> > case PSCI_RET_INVALID_PARAMS:
> > case PSCI_RET_INVALID_ADDRESS:
> 
> Sounds like your DT is describing CPUs that don't exist (or perhaps the
> same CPU several times). Certainly, PSCI and the kernel disagree on
> which CPUS exist.
> 
> What exact DT are you using?

the one this commit to linux-will's perf/spe branch provides:

commit 2a73de57eaf61cdfd61be1e20a46e4a2c326775f
Author: Marc Zyngier <marc.zyngier@arm.com>
Date:   Tue Mar 11 18:14:45 2014 +0000

    arm64: dts: add model device-tree
    
    Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
    Signed-off-by: Will Deacon <will.deacon@arm.com>

> Are you using the bootwrapper, or ATF? I'm guessing you're using the
> bootwrapper.

I'm using the wrapper to wrap arm-trusted-firmware (ATF?) objects, so,
both?  I noticed the wrapper I was using was pretty old, so I updated
it.

arm-trusted-firmware, btw, has just been updated to enable SPE at lower
ELs, so I don't have to use a hacked-up version anymore.

I also updated my BL33 to the latest upstream u-boot
vexpress_aemv8a_dram_defconfig, and at least now the kernel continues
to boot, even though it can't bring up 6 of the 7 secondary CPUs.

> Which version of the bootwrapepr are you using? If it doesn't have
> commit:
> 
>   ccdc936924b3682d ("Dynamically determine the set of CPUs")
> 
> ... have you configured it appropriately with --with-cpu-ids?
> 
> How is your model configured?

CLUSTER0_NUM_CORES=4
CLUSTER1_NUM_CORES=4

> Which CPU IDs does it think exist?

1,2,3,4,0x100,0x101,0x102,0x103

...which are different from the above device tree!:

0,0x100,0x200,0x300,0x10000,0x10100,0x10200,0x10300

So I imagine that's the problem, thanks!

I don't see how to tell the model to put the CPUs at different
addresses, only a lot of GICv3 redistributor switches?  btw, where can
I get updates to the run-model.sh scripts?  Answer off-list?

> > Note: it's yet another place I have to manually instrument the error
> > path in a kernel driver in lieu of it being more naturally verbose by
> > itself; I *implore* you to reconsider adding proper user messaging to
> > arm_spe_pmu_event_init().
> 
> Given this is a FW configuration issue (i.e. a system-level error), I'm
> more than happy to make the PSCI driver messages more helpful where
> possible.
> 
> That's completely orthogonal to the SPE debug messages for requests made
> by the user.

I respectfully disagree, given the current state of the interfaces
involved.

> > I can't tell which part of the fvp-base device tree is not liked by the
> > firmware; I tried different combinations of the PSCI node, different CPU
> > enumerations (cpu@100 vs cpu@1), removing idle-states properties...any
> > hints appreciated.
> 
> The bootwrapper doesn't support idle. So no idle-states should be in the
> DT.
> 
> If you can share your DT, bootwrapper configuration, and model
> configuration, I can try to debug this with you.

I reverted the wrapper's ccdc936924b3682d ("Dynamically determine the
set of CPUs") commit you mentioned above, and specified the cpu-ids
manually, and am now running with 8 CPUs, although linux enumerates
them as 0,1,8,9,10,11,12,13?

Thanks for your continued support,

Kim

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-28 11:32                 ` Mark Rutland
@ 2017-06-29  1:16                   ` Kim Phillips
  2017-06-29  1:43                     ` [PATCH] perf tools: Add ARM Statistical Profiling Extensions (SPE) support Kim Phillips
  0 siblings, 1 reply; 33+ messages in thread
From: Kim Phillips @ 2017-06-29  1:16 UTC (permalink / raw)
  To: Mark Rutland
  Cc: robh, mathieu.poirier, pawel.moll, suzuki.poulose, marc.zyngier,
	Will Deacon, linux-kernel, alexander.shishkin, peterz, mingo,
	tglx, linux-arm-kernel

On Wed, 28 Jun 2017 12:32:50 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> On Wed, Jun 28, 2017 at 12:26:02PM +0100, Mark Rutland wrote:
> > On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:
> > > FWIW, there is also this one I saw with mmap-pages set to 5
> > > (pages), which gets rounded up to 8 pages:
> > 
> > Sorry, *what* does the rounding upwards? Userspace, perf core, or the
> > driver? Where?

SPE implementations may vary from the minimum buffer alignment of the
smallest available page size, so I left the bts userspace tool code's
upwards-rounding code intact for now.  

I'll take this opportunity to submit the SPE perf tool patch in the
form of a reply to this email:  Look for the rounding code in
tools/perf/arch/arm64/util/arm-spe.c:arm_spe_recording_options().

> > That's worrying. I'll see if I can reproduce this.
> 
> Actually, this might be down to the IDX2OFF() macro being borked for non
> power-of-two buffer sizes.
> 
> Do you have Will's latest fixes? In his tree there's a commit:
> 
>   4f331cd62531dce2 ("squash! drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension")
> 
> ... which should fix the IDX2OFF() bug.

yes, I've been running with that squash! commit since a couple of days
after I noticed it over a week ago.

> It's be good to reproduce the issue if we can, regardless.

FWIW, I couldn't the little I tried today.

Kim

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] perf tools: Add ARM Statistical Profiling Extensions (SPE) support
  2017-06-29  1:16                   ` Kim Phillips
@ 2017-06-29  1:43                     ` Kim Phillips
  2017-06-30 14:02                       ` Mark Rutland
  0 siblings, 1 reply; 33+ messages in thread
From: Kim Phillips @ 2017-06-29  1:43 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Mark Rutland, robh, mathieu.poirier,
	pawel.moll, suzuki.poulose, marc.zyngier, Will Deacon,
	linux-kernel, alexander.shishkin, peterz, mingo, tglx,
	linux-arm-kernel
  Cc: Adrian Hunter, Jiri Olsa, Andi Kleen, Wang Nan

'perf record' and 'perf report --dump-raw-trace' supported in this
release.

Example usage:

taskset -c 2 ./perf record -C 2 -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=1/ \
		dd if=/dev/zero of=/dev/null count=10000

perf report --dump-raw-trace

Note that the perf.data file is portable, so the report can be run on another
architecture host if necessary.

Output will contain raw SPE data and its textual representation, such as:

0xc7d0 [0x30]: PERF_RECORD_AUXTRACE size: 0x82f70  offset: 0  ref: 0x1e947e88189  idx: 0  tid: -1  cpu: 2
.
. ... ARM SPE data: size 536432 bytes
.  00000000:  4a 01                                           B COND
.  00000002:  b1 00 00 00 00 00 00 00 80                      TGT 0 el0 ns=1
.  0000000b:  42 42                                           RETIRED NOT-TAKEN
.  0000000d:  b0 20 41 c0 ad ff ff 00 80                      PC ffffadc04120 el0 ns=1
.  00000016:  98 00 00                                        LAT 0 TOT
.  00000019:  71 80 3e f7 46 e9 01 00 00                      TS 2101429616256
.  00000022:  49 01                                           ST
.  00000024:  b2 50 bd ba 73 00 80 ff ff                      VA ffff800073babd50
.  0000002d:  b3 50 bd ba f3 00 00 00 80                      PA f3babd50 ns=1
.  00000036:  9a 00 00                                        LAT 0 XLAT
.  00000039:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS
.  0000003b:  b0 8c b4 1e 08 00 00 ff ff                      PC ff0000081eb48c el3 ns=1
.  00000044:  98 00 00                                        LAT 0 TOT
.  00000047:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868
.  00000050:  48 00                                           INSN-OTHER
.  00000052:  42 02                                           RETIRED
.  00000054:  b0 58 54 1f 08 00 00 ff ff                      PC ff0000081f5458 el3 ns=1
.  0000005d:  98 00 00                                        LAT 0 TOT
.  00000060:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868
...

Other release notes:

- applies to acme's perf/{core,urgent} branches, likely elsewhere

- Record requires Will's SPE driver, currently undergoing upstream review

- the intel-bts implementation was used as a starting point; its
  min/default/max buffer sizes and power of 2 pages granularity need to be
  revisited for ARM SPE

- not been tested on platforms with multiple SPE clusters/domains

- snapshot support (record -S), and conversion to native perf events
  (e.g., via 'perf inject --itrace'), are still in development

- technically both cs-etm and spe can be used simultaneously, however
  disabled for simplicity in this release

Signed-off-by: Kim Phillips <kim.phillips@arm.com>
---
 tools/perf/arch/arm/util/auxtrace.c   |  20 +-
 tools/perf/arch/arm/util/pmu.c        |   3 +
 tools/perf/arch/arm64/util/Build      |   3 +-
 tools/perf/arch/arm64/util/arm-spe.c  | 210 ++++++++++++++++
 tools/perf/util/Build                 |   2 +
 tools/perf/util/arm-spe-pkt-decoder.c | 448 ++++++++++++++++++++++++++++++++++
 tools/perf/util/arm-spe-pkt-decoder.h |  52 ++++
 tools/perf/util/arm-spe.c             | 318 ++++++++++++++++++++++++
 tools/perf/util/arm-spe.h             |  39 +++
 tools/perf/util/auxtrace.c            |   3 +
 tools/perf/util/auxtrace.h            |   1 +
 11 files changed, 1095 insertions(+), 4 deletions(-)
 create mode 100644 tools/perf/arch/arm64/util/arm-spe.c
 create mode 100644 tools/perf/util/arm-spe-pkt-decoder.c
 create mode 100644 tools/perf/util/arm-spe-pkt-decoder.h
 create mode 100644 tools/perf/util/arm-spe.c
 create mode 100644 tools/perf/util/arm-spe.h

diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c
index 8edf2cb71564..ec071609e8ac 100644
--- a/tools/perf/arch/arm/util/auxtrace.c
+++ b/tools/perf/arch/arm/util/auxtrace.c
@@ -22,29 +22,43 @@
 #include "../../util/evlist.h"
 #include "../../util/pmu.h"
 #include "cs-etm.h"
+#include "arm-spe.h"
 
 struct auxtrace_record
 *auxtrace_record__init(struct perf_evlist *evlist, int *err)
 {
-	struct perf_pmu	*cs_etm_pmu;
+	struct perf_pmu	*cs_etm_pmu, *arm_spe_pmu;
 	struct perf_evsel *evsel;
-	bool found_etm = false;
+	bool found_etm = false, found_spe = false;
 
 	cs_etm_pmu = perf_pmu__find(CORESIGHT_ETM_PMU_NAME);
+	arm_spe_pmu = perf_pmu__find(ARM_SPE_PMU_NAME);
 
 	if (evlist) {
 		evlist__for_each_entry(evlist, evsel) {
 			if (cs_etm_pmu &&
 			    evsel->attr.type == cs_etm_pmu->type)
 				found_etm = true;
+			if (arm_spe_pmu &&
+			    evsel->attr.type == arm_spe_pmu->type)
+				found_spe = true;
 		}
 	}
 
+	if (found_etm && found_spe) {
+		pr_err("Concurrent ARM Coresight ETM and SPE operation not currently supported\n");
+		*err = -EOPNOTSUPP;
+		return NULL;
+	}
+
 	if (found_etm)
 		return cs_etm_record_init(err);
 
+	if (found_spe)
+		return arm_spe_recording_init(err);
+
 	/*
-	 * Clear 'err' even if we haven't found a cs_etm event - that way perf
+	 * Clear 'err' even if we haven't found an event - that way perf
 	 * record can still be used even if tracers aren't present.  The NULL
 	 * return value will take care of telling the infrastructure HW tracing
 	 * isn't available.
diff --git a/tools/perf/arch/arm/util/pmu.c b/tools/perf/arch/arm/util/pmu.c
index 98d67399a0d6..71fb8f13b40a 100644
--- a/tools/perf/arch/arm/util/pmu.c
+++ b/tools/perf/arch/arm/util/pmu.c
@@ -20,6 +20,7 @@
 #include <linux/perf_event.h>
 
 #include "cs-etm.h"
+#include "arm-spe.h"
 #include "../../util/pmu.h"
 
 struct perf_event_attr
@@ -31,6 +32,8 @@ struct perf_event_attr
 		pmu->selectable = true;
 		pmu->set_drv_config = cs_etm_set_drv_config;
 	}
+	if (!strcmp(pmu->name, ARM_SPE_PMU_NAME))
+		pmu->selectable = true;
 #endif
 	return NULL;
 }
diff --git a/tools/perf/arch/arm64/util/Build b/tools/perf/arch/arm64/util/Build
index cef6fb38d17e..f9969bb88ccb 100644
--- a/tools/perf/arch/arm64/util/Build
+++ b/tools/perf/arch/arm64/util/Build
@@ -3,4 +3,5 @@ libperf-$(CONFIG_LOCAL_LIBUNWIND) += unwind-libunwind.o
 
 libperf-$(CONFIG_AUXTRACE) += ../../arm/util/pmu.o \
 			      ../../arm/util/auxtrace.o \
-			      ../../arm/util/cs-etm.o
+			      ../../arm/util/cs-etm.o \
+			      arm-spe.o
diff --git a/tools/perf/arch/arm64/util/arm-spe.c b/tools/perf/arch/arm64/util/arm-spe.c
new file mode 100644
index 000000000000..07172764881c
--- /dev/null
+++ b/tools/perf/arch/arm64/util/arm-spe.c
@@ -0,0 +1,210 @@
+/*
+ * ARM Statistical Profiling Extensions (SPE) support
+ * Copyright (c) 2017, ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/bitops.h>
+#include <linux/log2.h>
+
+#include "../../util/cpumap.h"
+#include "../../util/evsel.h"
+#include "../../util/evlist.h"
+#include "../../util/session.h"
+#include "../../util/util.h"
+#include "../../util/pmu.h"
+#include "../../util/debug.h"
+#include "../../util/tsc.h"
+#include "../../util/auxtrace.h"
+#include "../../util/arm-spe.h"
+
+#define KiB(x) ((x) * 1024)
+#define MiB(x) ((x) * 1024 * 1024)
+#define KiB_MASK(x) (KiB(x) - 1)
+#define MiB_MASK(x) (MiB(x) - 1)
+
+struct arm_spe_recording {
+	struct auxtrace_record		itr;
+	struct perf_pmu			*arm_spe_pmu;
+	struct perf_evlist		*evlist;
+};
+
+static size_t
+arm_spe_info_priv_size(struct auxtrace_record *itr __maybe_unused,
+			 struct perf_evlist *evlist __maybe_unused)
+{
+	return ARM_SPE_AUXTRACE_PRIV_SIZE;
+}
+
+static int arm_spe_info_fill(struct auxtrace_record *itr,
+			       struct perf_session *session,
+			       struct auxtrace_info_event *auxtrace_info,
+			       size_t priv_size)
+{
+	struct arm_spe_recording *sper =
+			container_of(itr, struct arm_spe_recording, itr);
+	struct perf_pmu *arm_spe_pmu = sper->arm_spe_pmu;
+
+	if (priv_size != ARM_SPE_AUXTRACE_PRIV_SIZE)
+		return -EINVAL;
+
+	if (!session->evlist->nr_mmaps)
+		return -EINVAL;
+
+	auxtrace_info->type = PERF_AUXTRACE_ARM_SPE;
+	auxtrace_info->priv[ARM_SPE_PMU_TYPE] = arm_spe_pmu->type;
+
+	return 0;
+}
+
+static int arm_spe_recording_options(struct auxtrace_record *itr,
+				       struct perf_evlist *evlist,
+				       struct record_opts *opts)
+{
+	struct arm_spe_recording *sper =
+			container_of(itr, struct arm_spe_recording, itr);
+	struct perf_pmu *arm_spe_pmu = sper->arm_spe_pmu;
+	struct perf_evsel *evsel, *arm_spe_evsel = NULL;
+	const struct cpu_map *cpus = evlist->cpus;
+	bool privileged = geteuid() == 0 || perf_event_paranoid() < 0;
+	struct perf_evsel *tracking_evsel;
+	int err;
+
+	sper->evlist = evlist;
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (evsel->attr.type == arm_spe_pmu->type) {
+			if (arm_spe_evsel) {
+				pr_err("There may be only one " ARM_SPE_PMU_NAME " event\n");
+				return -EINVAL;
+			}
+			evsel->attr.freq = 0;
+			evsel->attr.sample_period = 1;
+			arm_spe_evsel = evsel;
+			opts->full_auxtrace = true;
+		}
+	}
+
+	if (!opts->full_auxtrace)
+		return 0;
+
+	/* We are in full trace mode but '-m,xyz' wasn't specified */
+	if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) {
+		if (privileged) {
+			opts->auxtrace_mmap_pages = MiB(4) / page_size;
+		} else {
+			opts->auxtrace_mmap_pages = KiB(128) / page_size;
+			if (opts->mmap_pages == UINT_MAX)
+				opts->mmap_pages = KiB(256) / page_size;
+		}
+	}
+
+	/* Validate auxtrace_mmap_pages */
+	if (opts->auxtrace_mmap_pages) {
+		size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size;
+		size_t min_sz = KiB(8);
+
+		if (sz < min_sz || !is_power_of_2(sz)) {
+			pr_err("Invalid mmap size for ARM SPE: must be at least %zuKiB and a power of 2\n",
+			       min_sz / 1024);
+			return -EINVAL;
+		}
+	}
+
+	/*
+	 * To obtain the auxtrace buffer file descriptor, the auxtrace event
+	 * must come first.
+	 */
+	perf_evlist__to_front(evlist, arm_spe_evsel);
+
+	/*
+	 * In the case of per-cpu mmaps, we need the CPU on the
+	 * AUX event.
+	 */
+	if (!cpu_map__empty(cpus))
+		perf_evsel__set_sample_bit(arm_spe_evsel, CPU);
+
+	/* Add dummy event to keep tracking */
+	err = parse_events(evlist, "dummy:u", NULL);
+	if (err)
+		return err;
+
+	tracking_evsel = perf_evlist__last(evlist);
+	perf_evlist__set_tracking_event(evlist, tracking_evsel);
+
+	tracking_evsel->attr.freq = 0;
+	tracking_evsel->attr.sample_period = 1;
+
+	/* In per-cpu case, always need the time of mmap events etc */
+	if (!cpu_map__empty(cpus))
+		perf_evsel__set_sample_bit(tracking_evsel, TIME);
+
+	return 0;
+}
+
+static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)
+{
+	u64 ts;
+
+	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
+
+	return ts;
+}
+
+static void arm_spe_recording_free(struct auxtrace_record *itr)
+{
+	struct arm_spe_recording *sper =
+			container_of(itr, struct arm_spe_recording, itr);
+
+       free(sper);
+}
+
+static int arm_spe_read_finish(struct auxtrace_record *itr, int idx)
+{
+	struct arm_spe_recording *sper =
+			container_of(itr, struct arm_spe_recording, itr);
+	struct perf_evsel *evsel;
+
+	evlist__for_each_entry(sper->evlist, evsel) {
+		if (evsel->attr.type == sper->arm_spe_pmu->type)
+			return perf_evlist__enable_event_idx(sper->evlist,
+							     evsel, idx);
+	}
+	return -EINVAL;
+}
+
+struct auxtrace_record *arm_spe_recording_init(int *err)
+{
+	struct perf_pmu *arm_spe_pmu = perf_pmu__find(ARM_SPE_PMU_NAME);
+	struct arm_spe_recording *sper;
+
+	if (!arm_spe_pmu)
+		return NULL;
+
+	sper = zalloc(sizeof(struct arm_spe_recording));
+	if (!sper) {
+		*err = -ENOMEM;
+		return NULL;
+	}
+
+	sper->arm_spe_pmu = arm_spe_pmu;
+	sper->itr.recording_options = arm_spe_recording_options;
+	sper->itr.info_priv_size = arm_spe_info_priv_size;
+	sper->itr.info_fill = arm_spe_info_fill;
+	sper->itr.free = arm_spe_recording_free;
+	sper->itr.reference = arm_spe_reference;
+	sper->itr.read_finish = arm_spe_read_finish;
+	sper->itr.alignment = 0;
+	return &sper->itr;
+}
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 79dea95a7f68..4ed31e88b8ee 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -82,6 +82,8 @@ libperf-$(CONFIG_AUXTRACE) += auxtrace.o
 libperf-$(CONFIG_AUXTRACE) += intel-pt-decoder/
 libperf-$(CONFIG_AUXTRACE) += intel-pt.o
 libperf-$(CONFIG_AUXTRACE) += intel-bts.o
+libperf-$(CONFIG_AUXTRACE) += arm-spe.o
+libperf-$(CONFIG_AUXTRACE) += arm-spe-pkt-decoder.o
 libperf-y += parse-branch-options.o
 libperf-y += dump-insn.o
 libperf-y += parse-regs-options.o
diff --git a/tools/perf/util/arm-spe-pkt-decoder.c b/tools/perf/util/arm-spe-pkt-decoder.c
new file mode 100644
index 000000000000..ca3813d5b91a
--- /dev/null
+++ b/tools/perf/util/arm-spe-pkt-decoder.c
@@ -0,0 +1,448 @@
+/*
+ * ARM Statistical Profiling Extensions (SPE) support
+ * Copyright (c) 2017, ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include <stdio.h>
+#include <string.h>
+#include <endian.h>
+#include <byteswap.h>
+
+#include "arm-spe-pkt-decoder.h"
+
+#define BIT(n)		(1 << (n))
+
+#define BIT61		((uint64_t)1 << 61)
+#define BIT62		((uint64_t)1 << 62)
+#define BIT63		((uint64_t)1 << 63)
+
+#define NS_FLAG		BIT63
+#define EL_FLAG		(BIT62 | BIT61)
+
+#if __BYTE_ORDER == __BIG_ENDIAN
+#define le16_to_cpu bswap_16
+#define le32_to_cpu bswap_32
+#define le64_to_cpu bswap_64
+#define memcpy_le64(d, s, n) do { \
+	memcpy((d), (s), (n));    \
+	*(d) = le64_to_cpu(*(d)); \
+} while (0)
+#else
+#define le16_to_cpu
+#define le32_to_cpu
+#define le64_to_cpu
+#define memcpy_le64 memcpy
+#endif
+
+static const char * const arm_spe_packet_name[] = {
+	[ARM_SPE_PAD]		= "PAD",
+	[ARM_SPE_END]		= "END",
+	[ARM_SPE_TIMESTAMP]	= "TS",
+	[ARM_SPE_ADDRESS]	= "ADDR",
+	[ARM_SPE_COUNTER]	= "LAT",
+	[ARM_SPE_CONTEXT]	= "CONTEXT",
+	[ARM_SPE_INSN_TYPE]	= "INSN-TYPE",
+	[ARM_SPE_EVENTS]	= "EVENTS",
+	[ARM_SPE_DATA_SOURCE]	= "DATA-SOURCE",
+};
+
+const char *arm_spe_pkt_name(enum arm_spe_pkt_type type)
+{
+	return arm_spe_packet_name[type];
+}
+
+/* return ARM SPE payload size from its encoding:
+ * 00 : byte
+ * 01 : halfword (2)
+ * 10 : word (4)
+ * 11 : doubleword (8)
+ */
+static int payloadlen(unsigned char byte)
+{
+	return 1 << ((byte & 0x30) >> 4);
+}
+
+static int arm_spe_get_pad(struct arm_spe_pkt *packet)
+{
+	packet->type = ARM_SPE_PAD;
+	return 1;
+}
+
+static int arm_spe_get_alignment(const unsigned char *buf, size_t len,
+				 struct arm_spe_pkt *packet)
+{
+	unsigned int alignment = 1 << ((buf[0] & 0xf) + 1);
+
+	if (len < alignment)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	packet->type = ARM_SPE_PAD;
+	return alignment - (((uint64_t)buf) & (alignment - 1));
+}
+
+static int arm_spe_get_end(struct arm_spe_pkt *packet)
+{
+	packet->type = ARM_SPE_END;
+	return 1;
+}
+
+static int arm_spe_get_timestamp(const unsigned char *buf, size_t len,
+				 struct arm_spe_pkt *packet)
+{
+	if (len < 8)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	packet->type = ARM_SPE_TIMESTAMP;
+	memcpy_le64(&packet->payload, buf + 1, 8);
+
+	return 1 + 8;
+}
+
+static int arm_spe_get_events(const unsigned char *buf, size_t len,
+			      struct arm_spe_pkt *packet)
+{
+	unsigned int events_len = payloadlen(buf[0]);
+
+	if (len < events_len)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	packet->type = ARM_SPE_EVENTS;
+	packet->index = events_len;
+	switch (events_len) {
+	case 1: packet->payload = *(uint8_t *)(buf + 1); break;
+	case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;
+	case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;
+	case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;
+	default: return ARM_SPE_BAD_PACKET;
+	}
+
+	return 1 + events_len;
+}
+
+static int arm_spe_get_data_source(const unsigned char *buf,
+				   struct arm_spe_pkt *packet)
+{
+	int len = payloadlen(buf[0]);
+
+	packet->type = ARM_SPE_DATA_SOURCE;
+	if (len == 1)
+		packet->payload = buf[1];
+	else if (len == 2)
+		packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1));
+
+	return 1 + len;
+}
+
+static int arm_spe_get_context(const unsigned char *buf, size_t len,
+			       struct arm_spe_pkt *packet)
+{
+	if (len < 4)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	packet->type = ARM_SPE_CONTEXT;
+	packet->index = buf[0] & 0x3;
+	packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1));
+
+	return 1 + 4;
+}
+
+static int arm_spe_get_insn_type(const unsigned char *buf,
+				 struct arm_spe_pkt *packet)
+{
+	packet->type = ARM_SPE_INSN_TYPE;
+	packet->index = buf[0] & 0x3;
+	packet->payload = buf[1];
+
+	return 1 + 1;
+}
+
+static int arm_spe_get_counter(const unsigned char *buf, size_t len,
+			       const unsigned char ext_hdr, struct arm_spe_pkt *packet)
+{
+	if (len < 2)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	packet->type = ARM_SPE_COUNTER;
+	if (ext_hdr)
+		packet->index = ((buf[0] & 0x3) << 3) | (buf[1] & 0x7);
+	else
+		packet->index = buf[0] & 0x7;
+
+	packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1));
+
+	return 1 + ext_hdr + 2;
+}
+
+static int arm_spe_get_addr(const unsigned char *buf, size_t len,
+			    const unsigned char ext_hdr, struct arm_spe_pkt *packet)
+{
+	if (len < 8)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	packet->type = ARM_SPE_ADDRESS;
+	if (ext_hdr)
+		packet->index = ((buf[0] & 0x3) << 3) | (buf[1] & 0x7);
+	else
+		packet->index = buf[0] & 0x7;
+
+	memcpy_le64(&packet->payload, buf + 1, 8);
+
+	return 1 + ext_hdr + 8;
+}
+
+static int arm_spe_do_get_packet(const unsigned char *buf, size_t len,
+				 struct arm_spe_pkt *packet)
+{
+	unsigned int byte;
+
+	memset(packet, 0, sizeof(struct arm_spe_pkt));
+
+	if (!len)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	byte = buf[0];
+	if (byte == 0)
+		return arm_spe_get_pad(packet);
+	else if (byte == 1) /* no timestamp at end of record */
+		return arm_spe_get_end(packet);
+	else if (byte & 0xc0 /* 0y11000000 */) {
+		if (byte & 0x80) {
+			/* 0x38 is 0y00111000 */
+			if ((byte & 0x38) == 0x30) /* address packet (short) */
+				return arm_spe_get_addr(buf, len, 0, packet);
+			if ((byte & 0x38) == 0x18) /* counter packet (short) */
+				return arm_spe_get_counter(buf, len, 0, packet);
+		} else
+			if (byte == 0x71)
+				return arm_spe_get_timestamp(buf, len, packet);
+			else if ((byte & 0xf) == 0x2)
+				return arm_spe_get_events(buf, len, packet);
+			else if ((byte & 0xf) == 0x3)
+				return arm_spe_get_data_source(buf, packet);
+			else if ((byte & 0x3c) == 0x24)
+				return arm_spe_get_context(buf, len, packet);
+			else if ((byte & 0x3c) == 0x8)
+				return arm_spe_get_insn_type(buf, packet);
+	} else if ((byte & 0xe0) == 0x20 /* 0y00100000 */) {
+		/* 16-bit header */
+		byte = buf[1];
+		if (byte == 0)
+			return arm_spe_get_alignment(buf, len, packet);
+		else if ((byte & 0xf8) == 0xb0)
+			return arm_spe_get_addr(buf, len, 1, packet);
+		else if ((byte & 0xf8) == 0x98)
+			return arm_spe_get_counter(buf, len, 1, packet);
+	}
+
+	return ARM_SPE_BAD_PACKET;
+}
+
+int arm_spe_get_packet(const unsigned char *buf, size_t len,
+		       struct arm_spe_pkt *packet)
+{
+	int ret;
+
+	ret = arm_spe_do_get_packet(buf, len, packet);
+	if (ret > 0) {
+		while (ret < 1 && len > (size_t)ret && !buf[ret])
+			ret += 1;
+	}
+	return ret;
+}
+
+int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
+		     size_t buf_len)
+{
+	int ret, ns, el, index = packet->index;
+	unsigned long long payload = packet->payload;
+	const char *name = arm_spe_pkt_name(packet->type);
+
+	switch (packet->type) {
+	case ARM_SPE_BAD:
+	case ARM_SPE_PAD:
+	case ARM_SPE_END:
+		return snprintf(buf, buf_len, "%s", name);
+	case ARM_SPE_EVENTS: {
+		size_t blen = buf_len;
+
+		ret = 0;
+		if (payload & 0x1) {
+			ret = snprintf(buf, buf_len, "EXCEPTION-GEN ");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x2) {
+			ret = snprintf(buf, buf_len, "RETIRED ");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x4) {
+			ret = snprintf(buf, buf_len, "L1D-ACCESS ");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x8) {
+			ret = snprintf(buf, buf_len, "L1D-REFILL ");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x10) {
+			ret = snprintf(buf, buf_len, "TLB-ACCESS ");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x20) {
+			ret = snprintf(buf, buf_len, "TLB-REFILL ");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x40) {
+			ret = snprintf(buf, buf_len, "NOT-TAKEN ");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x80) {
+			ret = snprintf(buf, buf_len, "MISPRED ");
+			buf += ret;
+			blen -= ret;
+		}
+		if (index > 1) {
+			if (payload & 0x100) {
+				ret = snprintf(buf, buf_len, "LLC-ACCESS ");
+				buf += ret;
+				blen -= ret;
+			}
+			if (payload & 0x200) {
+				ret = snprintf(buf, buf_len, "LLC-REFILL ");
+				buf += ret;
+				blen -= ret;
+			}
+			if (payload & 0x400) {
+				ret = snprintf(buf, buf_len, "REMOTE-ACCESS ");
+				buf += ret;
+				blen -= ret;
+			}
+		}
+		if (ret < 0)
+			return ret;
+		blen -= ret;
+		return buf_len - blen;
+	}
+	case ARM_SPE_INSN_TYPE:
+		switch (index) {
+		case 0:	return snprintf(buf, buf_len, "%s", payload & 0x1 ?
+					"COND-SELECT" : "INSN-OTHER");
+		case 1:	{
+			size_t blen = buf_len;
+
+			if (payload & 0x1)
+				ret = snprintf(buf, buf_len, "ST");
+			else
+				ret = snprintf(buf, buf_len, "LD");
+			buf += ret;
+			blen -= ret;
+			if (payload & 0x2) {
+				if (payload & 0x4) {
+					ret = snprintf(buf, buf_len, " AT");
+					buf += ret;
+					blen -= ret;
+				}
+				if (payload & 0x8) {
+					ret = snprintf(buf, buf_len, " EXCL");
+					buf += ret;
+					blen -= ret;
+				}
+				if (payload & 0x10) {
+					ret = snprintf(buf, buf_len, " AR");
+					buf += ret;
+					blen -= ret;
+				}
+			} else if (payload & 0x4) {
+				ret = snprintf(buf, buf_len, " SIMD-FP");
+				buf += ret;
+				blen -= ret;
+			}
+			if (ret < 0)
+				return ret;
+			blen -= ret;
+			return buf_len - blen;
+		}
+		case 2:	{
+			size_t blen = buf_len;
+
+			ret = snprintf(buf, buf_len, "B");
+			buf += ret;
+			blen -= ret;
+			if (payload & 0x1) {
+				ret = snprintf(buf, buf_len, " COND");
+				buf += ret;
+				blen -= ret;
+			}
+			if (payload & 0x2) {
+				ret = snprintf(buf, buf_len, " IND");
+				buf += ret;
+				blen -= ret;
+			}
+			if (ret < 0)
+				return ret;
+			blen -= ret;
+			return buf_len - blen;
+			}
+		default: return 0;
+		}
+	case ARM_SPE_DATA_SOURCE:
+	case ARM_SPE_TIMESTAMP:
+		return snprintf(buf, buf_len, "%s %lld", name, payload);
+	case ARM_SPE_ADDRESS:
+		switch (index) {
+		case 0:
+		case 1: ns = !!(packet->payload & NS_FLAG);
+			el = (packet->payload & EL_FLAG) >> 61;
+			payload &= ~(0xffULL << 56);
+			return snprintf(buf, buf_len, "%s %llx el%d ns=%d",
+				        (index == 1) ? "TGT" : "PC", payload, el, ns);
+		case 2:	return snprintf(buf, buf_len, "VA %llx", payload);
+		case 3:	ns = !!(packet->payload & NS_FLAG);
+			payload &= ~(0xffULL << 56);
+			return snprintf(buf, buf_len, "PA %llx ns=%d",
+					payload, ns);
+		default: return 0;
+		}
+	case ARM_SPE_CONTEXT:
+		return snprintf(buf, buf_len, "%s %lx el%d", name,
+				(unsigned long)payload, index + 1);
+	case ARM_SPE_COUNTER: {
+		size_t blen = buf_len;
+
+		ret = snprintf(buf, buf_len, "%s %d ", name,
+			       (unsigned short)payload);
+		buf += ret;
+		blen -= ret;
+		switch (index) {
+		case 0:	ret = snprintf(buf, buf_len, "TOT"); break;
+		case 1:	ret = snprintf(buf, buf_len, "ISSUE"); break;
+		case 2:	ret = snprintf(buf, buf_len, "XLAT"); break;
+		default: ret = 0;
+		}
+		if (ret < 0)
+			return ret;
+		blen -= ret;
+		return buf_len - blen;
+	}
+	default:
+		break;
+	}
+
+	return snprintf(buf, buf_len, "%s 0x%llx (%d)",
+			name, payload, packet->index);
+}
diff --git a/tools/perf/util/arm-spe-pkt-decoder.h b/tools/perf/util/arm-spe-pkt-decoder.h
new file mode 100644
index 000000000000..793552d8696e
--- /dev/null
+++ b/tools/perf/util/arm-spe-pkt-decoder.h
@@ -0,0 +1,52 @@
+/*
+ * ARM Statistical Profiling Extensions (SPE) support
+ * Copyright (c) 2017, ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#ifndef INCLUDE__ARM_SPE_PKT_DECODER_H__
+#define INCLUDE__ARM_SPE_PKT_DECODER_H__
+
+#include <stddef.h>
+#include <stdint.h>
+
+#define ARM_SPE_PKT_DESC_MAX		256
+
+#define ARM_SPE_NEED_MORE_BYTES		-1
+#define ARM_SPE_BAD_PACKET		-2
+
+enum arm_spe_pkt_type {
+	ARM_SPE_BAD,
+	ARM_SPE_PAD,
+	ARM_SPE_END,
+	ARM_SPE_TIMESTAMP,
+	ARM_SPE_ADDRESS,
+	ARM_SPE_COUNTER,
+	ARM_SPE_CONTEXT,
+	ARM_SPE_INSN_TYPE,
+	ARM_SPE_EVENTS,
+	ARM_SPE_DATA_SOURCE,
+};
+
+struct arm_spe_pkt {
+	enum arm_spe_pkt_type	type;
+	unsigned char		index;
+	uint64_t		payload;
+};
+
+const char *arm_spe_pkt_name(enum arm_spe_pkt_type);
+
+int arm_spe_get_packet(const unsigned char *buf, size_t len,
+			struct arm_spe_pkt *packet);
+
+int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf, size_t len);
+#endif
diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
new file mode 100644
index 000000000000..f3eccd73b54a
--- /dev/null
+++ b/tools/perf/util/arm-spe.c
@@ -0,0 +1,318 @@
+/*
+ * ARM Statistical Profiling Extensions (SPE) support
+ * Copyright (c) 2017, ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include <endian.h>
+#include <errno.h>
+#include <byteswap.h>
+#include <inttypes.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/bitops.h>
+#include <linux/log2.h>
+
+#include "cpumap.h"
+#include "color.h"
+#include "evsel.h"
+#include "evlist.h"
+#include "machine.h"
+#include "session.h"
+#include "util.h"
+#include "thread.h"
+#include "debug.h"
+#include "auxtrace.h"
+#include "arm-spe.h"
+#include "arm-spe-pkt-decoder.h"
+
+struct arm_spe {
+	struct auxtrace			auxtrace;
+	struct auxtrace_queues		queues;
+	struct auxtrace_heap		heap;
+	u32				auxtrace_type;
+	struct perf_session		*session;
+	struct machine			*machine;
+	u32				pmu_type;
+};
+
+struct arm_spe_queue {
+	struct arm_spe		*spe;
+	unsigned int		queue_nr;
+	struct auxtrace_buffer	*buffer;
+	bool			on_heap;
+	bool			done;
+	pid_t			pid;
+	pid_t			tid;
+	int			cpu;
+};
+
+static void arm_spe_dump(struct arm_spe *spe __maybe_unused,
+			   unsigned char *buf, size_t len)
+{
+	struct arm_spe_pkt packet;
+	size_t pos = 0;
+	int ret, pkt_len, i;
+	char desc[ARM_SPE_PKT_DESC_MAX];
+	const char *color = PERF_COLOR_BLUE;
+
+	color_fprintf(stdout, color,
+		      ". ... ARM SPE data: size %zu bytes\n",
+		      len);
+
+	while (len) {
+		ret = arm_spe_get_packet(buf, len, &packet);
+		if (ret > 0)
+			pkt_len = ret;
+		else
+			pkt_len = 1;
+		printf(".");
+		color_fprintf(stdout, color, "  %08x: ", pos);
+		for (i = 0; i < pkt_len; i++)
+			color_fprintf(stdout, color, " %02x", buf[i]);
+		for (; i < 16; i++)
+			color_fprintf(stdout, color, "   ");
+		if (ret > 0) {
+			ret = arm_spe_pkt_desc(&packet, desc,
+					       ARM_SPE_PKT_DESC_MAX);
+			if (ret > 0)
+				color_fprintf(stdout, color, " %s\n", desc);
+		} else {
+			color_fprintf(stdout, color, " Bad packet!\n");
+		}
+		pos += pkt_len;
+		buf += pkt_len;
+		len -= pkt_len;
+	}
+}
+
+static void arm_spe_dump_event(struct arm_spe *spe, unsigned char *buf,
+				 size_t len)
+{
+	printf(".\n");
+	arm_spe_dump(spe, buf, len);
+}
+
+static struct arm_spe_queue *arm_spe_alloc_queue(struct arm_spe *spe,
+						     unsigned int queue_nr)
+{
+	struct arm_spe_queue *speq;
+
+	speq = zalloc(sizeof(struct arm_spe_queue));
+	if (!speq)
+		return NULL;
+
+	speq->spe = spe;
+	speq->queue_nr = queue_nr;
+	speq->pid = -1;
+	speq->tid = -1;
+	speq->cpu = -1;
+
+	return speq;
+}
+
+static int arm_spe_setup_queue(struct arm_spe *spe,
+				 struct auxtrace_queue *queue,
+				 unsigned int queue_nr)
+{
+	struct arm_spe_queue *speq = queue->priv;
+
+	if (list_empty(&queue->head))
+		return 0;
+
+	if (!speq) {
+		speq = arm_spe_alloc_queue(spe, queue_nr);
+		if (!speq)
+			return -ENOMEM;
+		queue->priv = speq;
+
+		if (queue->cpu != -1)
+			speq->cpu = queue->cpu;
+		speq->tid = queue->tid;
+	}
+
+	if (!speq->on_heap && !speq->buffer) {
+		int ret;
+
+		speq->buffer = auxtrace_buffer__next(queue, NULL);
+		if (!speq->buffer)
+			return 0;
+
+		ret = auxtrace_heap__add(&spe->heap, queue_nr,
+					 speq->buffer->reference);
+		if (ret)
+			return ret;
+		speq->on_heap = true;
+	}
+
+	return 0;
+}
+
+static int arm_spe_setup_queues(struct arm_spe *spe)
+{
+	unsigned int i;
+	int ret;
+
+	for (i = 0; i < spe->queues.nr_queues; i++) {
+		ret = arm_spe_setup_queue(spe, &spe->queues.queue_array[i],
+					    i);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static inline int arm_spe_update_queues(struct arm_spe *spe)
+{
+	if (spe->queues.new_data) {
+		spe->queues.new_data = false;
+		return arm_spe_setup_queues(spe);
+	}
+	return 0;
+}
+
+static int arm_spe_process_event(struct perf_session *session __maybe_unused,
+				   union perf_event *event __maybe_unused,
+				   struct perf_sample *sample __maybe_unused,
+				   struct perf_tool *tool __maybe_unused)
+{
+	return 0;
+}
+
+static int arm_spe_process_auxtrace_event(struct perf_session *session,
+					    union perf_event *event,
+					    struct perf_tool *tool __maybe_unused)
+{
+	struct arm_spe *spe = container_of(session->auxtrace, struct arm_spe,
+					     auxtrace);
+	struct auxtrace_buffer *buffer;
+	off_t data_offset;
+	int fd = perf_data_file__fd(session->file);
+	int err;
+
+	if (perf_data_file__is_pipe(session->file)) {
+		data_offset = 0;
+	} else {
+		data_offset = lseek(fd, 0, SEEK_CUR);
+		if (data_offset == -1)
+			return -errno;
+	}
+
+	err = auxtrace_queues__add_event(&spe->queues, session, event,
+					 data_offset, &buffer);
+	if (err)
+		return err;
+
+	/* Dump here now we have copied a piped trace out of the pipe */
+	if (dump_trace) {
+		if (auxtrace_buffer__get_data(buffer, fd)) {
+			arm_spe_dump_event(spe, buffer->data,
+					     buffer->size);
+			auxtrace_buffer__put_data(buffer);
+		}
+	}
+
+	return 0;
+}
+
+static int arm_spe_flush(struct perf_session *session __maybe_unused,
+			   struct perf_tool *tool __maybe_unused)
+{
+	return 0;
+}
+
+static void arm_spe_free_queue(void *priv)
+{
+	struct arm_spe_queue *speq = priv;
+
+	if (!speq)
+		return;
+	free(speq);
+}
+
+static void arm_spe_free_events(struct perf_session *session)
+{
+	struct arm_spe *spe = container_of(session->auxtrace, struct arm_spe,
+					     auxtrace);
+	struct auxtrace_queues *queues = &spe->queues;
+	unsigned int i;
+
+	for (i = 0; i < queues->nr_queues; i++) {
+		arm_spe_free_queue(queues->queue_array[i].priv);
+		queues->queue_array[i].priv = NULL;
+	}
+	auxtrace_queues__free(queues);
+}
+
+static void arm_spe_free(struct perf_session *session)
+{
+	struct arm_spe *spe = container_of(session->auxtrace, struct arm_spe,
+					     auxtrace);
+
+	auxtrace_heap__free(&spe->heap);
+	arm_spe_free_events(session);
+	session->auxtrace = NULL;
+	free(spe);
+}
+
+static const char * const arm_spe_info_fmts[] = {
+	[ARM_SPE_PMU_TYPE]		= "  PMU Type           %"PRId64"\n",
+};
+
+static void arm_spe_print_info(u64 *arr)
+{
+	if (!dump_trace)
+		return;
+
+	fprintf(stdout, arm_spe_info_fmts[ARM_SPE_PMU_TYPE], arr[ARM_SPE_PMU_TYPE]);
+}
+
+int arm_spe_process_auxtrace_info(union perf_event *event,
+				    struct perf_session *session)
+{
+	struct auxtrace_info_event *auxtrace_info = &event->auxtrace_info;
+	size_t min_sz = sizeof(u64) * ARM_SPE_PMU_TYPE;
+	struct arm_spe *spe;
+	int err;
+
+	if (auxtrace_info->header.size < sizeof(struct auxtrace_info_event) +
+					min_sz)
+		return -EINVAL;
+
+	spe = zalloc(sizeof(struct arm_spe));
+	if (!spe)
+		return -ENOMEM;
+
+	err = auxtrace_queues__init(&spe->queues);
+	if (err)
+		goto err_free;
+
+	spe->session = session;
+	spe->machine = &session->machines.host; /* No kvm support */
+	spe->auxtrace_type = auxtrace_info->type;
+	spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
+
+	spe->auxtrace.process_event = arm_spe_process_event;
+	spe->auxtrace.process_auxtrace_event = arm_spe_process_auxtrace_event;
+	spe->auxtrace.flush_events = arm_spe_flush;
+	spe->auxtrace.free_events = arm_spe_free_events;
+	spe->auxtrace.free = arm_spe_free;
+	session->auxtrace = &spe->auxtrace;
+
+	arm_spe_print_info(&auxtrace_info->priv[0]);
+
+	return 0;
+
+err_free:
+	free(spe);
+	return err;
+}
diff --git a/tools/perf/util/arm-spe.h b/tools/perf/util/arm-spe.h
new file mode 100644
index 000000000000..afa4704c5e5e
--- /dev/null
+++ b/tools/perf/util/arm-spe.h
@@ -0,0 +1,39 @@
+/*
+ * ARM Statistical Profiling Extensions (SPE) support
+ * Copyright (c) 2017, ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#ifndef INCLUDE__PERF_ARM_SPE_H__
+#define INCLUDE__PERF_ARM_SPE_H__
+
+#define ARM_SPE_PMU_NAME "arm_spe_0"
+
+enum {
+	ARM_SPE_PMU_TYPE,
+	ARM_SPE_PER_CPU_MMAPS,
+	ARM_SPE_AUXTRACE_PRIV_MAX,
+};
+
+#define ARM_SPE_AUXTRACE_PRIV_SIZE (ARM_SPE_AUXTRACE_PRIV_MAX * sizeof(u64))
+
+struct auxtrace_record;
+struct perf_tool;
+union perf_event;
+struct perf_session;
+
+struct auxtrace_record *arm_spe_recording_init(int *err);
+
+int arm_spe_process_auxtrace_info(union perf_event *event,
+				  struct perf_session *session);
+
+#endif
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 0daf63b9ee3e..f9ccc52a6c8f 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -57,6 +57,7 @@
 
 #include "intel-pt.h"
 #include "intel-bts.h"
+#include "arm-spe.h"
 
 #include "sane_ctype.h"
 #include "symbol/kallsyms.h"
@@ -903,6 +904,8 @@ int perf_event__process_auxtrace_info(struct perf_tool *tool __maybe_unused,
 		return intel_pt_process_auxtrace_info(event, session);
 	case PERF_AUXTRACE_INTEL_BTS:
 		return intel_bts_process_auxtrace_info(event, session);
+	case PERF_AUXTRACE_ARM_SPE:
+		return arm_spe_process_auxtrace_info(event, session);
 	case PERF_AUXTRACE_CS_ETM:
 	case PERF_AUXTRACE_UNKNOWN:
 	default:
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 9f0de72d58e2..db1479b2a428 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -43,6 +43,7 @@ enum auxtrace_type {
 	PERF_AUXTRACE_INTEL_PT,
 	PERF_AUXTRACE_INTEL_BTS,
 	PERF_AUXTRACE_CS_ETM,
+	PERF_AUXTRACE_ARM_SPE,
 };
 
 enum itrace_period_type {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-29  0:59                 ` [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Kim Phillips
@ 2017-06-29 11:11                   ` Mark Rutland
  2017-07-06 17:08                     ` Kim Phillips
  0 siblings, 1 reply; 33+ messages in thread
From: Mark Rutland @ 2017-06-29 11:11 UTC (permalink / raw)
  To: Kim Phillips
  Cc: Will Deacon, linux-arm-kernel, marc.zyngier, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Wed, Jun 28, 2017 at 07:59:53PM -0500, Kim Phillips wrote:
> On Wed, 28 Jun 2017 12:26:02 +0100
> Mark Rutland <mark.rutland@arm.com> wrote:
> > On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:

> > > > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the
> > > > > kernel after "smp: Bringing up secondary CPUs ...".  If I however take
> > > > > the second SPE node from fvp-base.dts and add it to my working device
> > > > > tree, I get this during the driver probe:
> > > > > 
> > > > > [    1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> > > > > [    1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> > > > > [    1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)
> > > > 
> > > > Looks like you've screwed up your IRQ partitions, so you are effectively
> > > > registering the same device twice, which then blows up due to lack of shared
> > > > irqs.
> > > > 
> > > > Either remove one of the devices, or use IRQ partitions to restrict them
> > > > to unique sets of CPUs.
> > > 
> > > Right, but since I want to get parity with what you're running -
> > > fvp_base.dtb - I tried to debug the hang after "smp: Bringing up
> > > secondary CPUs ..." problem, and could only debug it to the PSCI driver
> > > hitting one of these cases:
> > > 
> > > case PSCI_RET_INVALID_PARAMS:
> > > case PSCI_RET_INVALID_ADDRESS:
> > 
> > Sounds like your DT is describing CPUs that don't exist (or perhaps the
> > same CPU several times). Certainly, PSCI and the kernel disagree on
> > which CPUS exist.
> > 
> > What exact DT are you using?
> 
> the one this commit to linux-will's perf/spe branch provides:
> 
> commit 2a73de57eaf61cdfd61be1e20a46e4a2c326775f
> Author: Marc Zyngier <marc.zyngier@arm.com>
> Date:   Tue Mar 11 18:14:45 2014 +0000
> 
>     arm64: dts: add model device-tree
>     
>     Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
>     Signed-off-by: Will Deacon <will.deacon@arm.com>
> 
> > Are you using the bootwrapper, or ATF? I'm guessing you're using the
> > bootwrapper.
> 
> I'm using the wrapper to wrap arm-trusted-firmware (ATF?) objects, so,
> both?  I noticed the wrapper I was using was pretty old, so I updated
> it.

Ok. So what's likely happening is that ATF and the bootwrapper's FDT
disagree as to the set of CPUs. You're using ATF's PSCI implementation,
and not the boot-wrapper's.

I don't know how ATF enumerates CPUs on the model, so I can't offer much
guidance here other than fixing your DT to match whatever ATF believes
exists.

> arm-trusted-firmware, btw, has just been updated to enable SPE at lower
> ELs, so I don't have to use a hacked-up version anymore.
> 
> I also updated my BL33 to the latest upstream u-boot
> vexpress_aemv8a_dram_defconfig, and at least now the kernel continues
> to boot, even though it can't bring up 6 of the 7 secondary CPUs.

Do you mean that you replaced the bootwrapper with u-boot?

I'm a little confused here.

Regardless, it sounds like whatever DT is passed to the kernel still
doesn't match the real model configuration.

> > Which version of the bootwrapepr are you using? If it doesn't have
> > commit:
> > 
> >   ccdc936924b3682d ("Dynamically determine the set of CPUs")
> > 
> > ... have you configured it appropriately with --with-cpu-ids?
> > 
> > How is your model configured?
> 
> CLUSTER0_NUM_CORES=4
> CLUSTER1_NUM_CORES=4
> 
> > Which CPU IDs does it think exist?
> 
> 1,2,3,4,0x100,0x101,0x102,0x103
> 
> ...which are different from the above device tree!:
> 
> 0,0x100,0x200,0x300,0x10000,0x10100,0x10200,0x10300
> 
> So I imagine that's the problem, thanks!

Sounds like it, yes.

> I don't see how to tell the model to put the CPUs at different
> addresses, only a lot of GICv3 redistributor switches? 

I don't know how to do this, sorry.

> btw, where can I get updates to the run-model.sh scripts?  Answer
> off-list?

I don't know which script you're referring to. Contact whoever you got
it from initially?

[...]

> > > I can't tell which part of the fvp-base device tree is not liked by the
> > > firmware; I tried different combinations of the PSCI node, different CPU
> > > enumerations (cpu@100 vs cpu@1), removing idle-states properties...any
> > > hints appreciated.
> > 
> > The bootwrapper doesn't support idle. So no idle-states should be in the
> > DT.
> > 
> > If you can share your DT, bootwrapper configuration, and model
> > configuration, I can try to debug this with you.
> 
> I reverted the wrapper's ccdc936924b3682d ("Dynamically determine the
> set of CPUs") commit you mentioned above, and specified the cpu-ids
> manually, and am now running with 8 CPUs, although linux enumerates
> them as 0,1,8,9,10,11,12,13?

The --with-cpu-ids option *adds* CPU nodes, but leaves the broken ones,
and your CPU phandles (and PPI partitions for the SPE node(s)) will all
be wrong. Linux is still seeing those erroneous CPU nodes (presumably
taking Linux CPU ids 2-7).

Generally, --with-cpu-ids doesn't work as you'd expect, which is why it
got removed in favour of assuming an initally correct DT.

Please fix the DT instead. With a fixed DT, and commit ccdc936924b3682d,
the bootwrapper won't further mangle your DT.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] perf tools: Add ARM Statistical Profiling Extensions (SPE) support
  2017-06-29  1:43                     ` [PATCH] perf tools: Add ARM Statistical Profiling Extensions (SPE) support Kim Phillips
@ 2017-06-30 14:02                       ` Mark Rutland
  2017-07-18  0:48                         ` Kim Phillips
  0 siblings, 1 reply; 33+ messages in thread
From: Mark Rutland @ 2017-06-30 14:02 UTC (permalink / raw)
  To: Kim Phillips
  Cc: Arnaldo Carvalho de Melo, robh, mathieu.poirier, pawel.moll,
	suzuki.poulose, marc.zyngier, Will Deacon, linux-kernel,
	alexander.shishkin, peterz, mingo, tglx, linux-arm-kernel,
	Adrian Hunter, Jiri Olsa, Andi Kleen, Wang Nan

Hi Kim,

On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote:
> 'perf record' and 'perf report --dump-raw-trace' supported in this
> release.
> 
> Example usage:
> 
> taskset -c 2 ./perf record -C 2 -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=1/ \
> 		dd if=/dev/zero of=/dev/null count=10000
> 
> perf report --dump-raw-trace
> 
> Note that the perf.data file is portable, so the report can be run on another
> architecture host if necessary.
> 
> Output will contain raw SPE data and its textual representation, such as:
> 
> 0xc7d0 [0x30]: PERF_RECORD_AUXTRACE size: 0x82f70  offset: 0  ref: 0x1e947e88189  idx: 0  tid: -1  cpu: 2
> .
> . ... ARM SPE data: size 536432 bytes
> .  00000000:  4a 01                                           B COND
> .  00000002:  b1 00 00 00 00 00 00 00 80                      TGT 0 el0 ns=1
> .  0000000b:  42 42                                           RETIRED NOT-TAKEN
> .  0000000d:  b0 20 41 c0 ad ff ff 00 80                      PC ffffadc04120 el0 ns=1
> .  00000016:  98 00 00                                        LAT 0 TOT
> .  00000019:  71 80 3e f7 46 e9 01 00 00                      TS 2101429616256
> .  00000022:  49 01                                           ST
> .  00000024:  b2 50 bd ba 73 00 80 ff ff                      VA ffff800073babd50
> .  0000002d:  b3 50 bd ba f3 00 00 00 80                      PA f3babd50 ns=1
> .  00000036:  9a 00 00                                        LAT 0 XLAT
> .  00000039:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS
> .  0000003b:  b0 8c b4 1e 08 00 00 ff ff                      PC ff0000081eb48c el3 ns=1
> .  00000044:  98 00 00                                        LAT 0 TOT
> .  00000047:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868
> .  00000050:  48 00                                           INSN-OTHER
> .  00000052:  42 02                                           RETIRED
> .  00000054:  b0 58 54 1f 08 00 00 ff ff                      PC ff0000081f5458 el3 ns=1
> .  0000005d:  98 00 00                                        LAT 0 TOT
> .  00000060:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868
> ...
> 
> Other release notes:
> 
> - applies to acme's perf/{core,urgent} branches, likely elsewhere
> 
> - Record requires Will's SPE driver, currently undergoing upstream review
> 
> - the intel-bts implementation was used as a starting point; its
>   min/default/max buffer sizes and power of 2 pages granularity need to be
>   revisited for ARM SPE
> 
> - not been tested on platforms with multiple SPE clusters/domains
> 
> - snapshot support (record -S), and conversion to native perf events
>   (e.g., via 'perf inject --itrace'), are still in development
> 
> - technically both cs-etm and spe can be used simultaneously, however
>   disabled for simplicity in this release
> 
> Signed-off-by: Kim Phillips <kim.phillips@arm.com>
> ---
>  tools/perf/arch/arm/util/auxtrace.c   |  20 +-
>  tools/perf/arch/arm/util/pmu.c        |   3 +
>  tools/perf/arch/arm64/util/Build      |   3 +-
>  tools/perf/arch/arm64/util/arm-spe.c  | 210 ++++++++++++++++
>  tools/perf/util/Build                 |   2 +
>  tools/perf/util/arm-spe-pkt-decoder.c | 448 ++++++++++++++++++++++++++++++++++
>  tools/perf/util/arm-spe-pkt-decoder.h |  52 ++++
>  tools/perf/util/arm-spe.c             | 318 ++++++++++++++++++++++++
>  tools/perf/util/arm-spe.h             |  39 +++
>  tools/perf/util/auxtrace.c            |   3 +
>  tools/perf/util/auxtrace.h            |   1 +
>  11 files changed, 1095 insertions(+), 4 deletions(-)
>  create mode 100644 tools/perf/arch/arm64/util/arm-spe.c
>  create mode 100644 tools/perf/util/arm-spe-pkt-decoder.c
>  create mode 100644 tools/perf/util/arm-spe-pkt-decoder.h
>  create mode 100644 tools/perf/util/arm-spe.c
>  create mode 100644 tools/perf/util/arm-spe.h
> 
> diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c
> index 8edf2cb71564..ec071609e8ac 100644
> --- a/tools/perf/arch/arm/util/auxtrace.c
> +++ b/tools/perf/arch/arm/util/auxtrace.c
> @@ -22,29 +22,43 @@
>  #include "../../util/evlist.h"
>  #include "../../util/pmu.h"
>  #include "cs-etm.h"
> +#include "arm-spe.h"
>  
>  struct auxtrace_record
>  *auxtrace_record__init(struct perf_evlist *evlist, int *err)
>  {
> -	struct perf_pmu	*cs_etm_pmu;
> +	struct perf_pmu	*cs_etm_pmu, *arm_spe_pmu;
>  	struct perf_evsel *evsel;
> -	bool found_etm = false;
> +	bool found_etm = false, found_spe = false;
>  
>  	cs_etm_pmu = perf_pmu__find(CORESIGHT_ETM_PMU_NAME);
> +	arm_spe_pmu = perf_pmu__find(ARM_SPE_PMU_NAME);
>  
>  	if (evlist) {
>  		evlist__for_each_entry(evlist, evsel) {
>  			if (cs_etm_pmu &&
>  			    evsel->attr.type == cs_etm_pmu->type)
>  				found_etm = true;
> +			if (arm_spe_pmu &&
> +			    evsel->attr.type == arm_spe_pmu->type)
> +				found_spe = true;

Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all
SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1"
and so on).

Can we not find all PMUs with a "arm_spe_" prefix?

... or, taking a step back, do we need some sysfs "class" attribute to
identify multi-instance PMUs?

>  		}
>  	}
>  
> +	if (found_etm && found_spe) {
> +		pr_err("Concurrent ARM Coresight ETM and SPE operation not currently supported\n");
> +		*err = -EOPNOTSUPP;
> +		return NULL;
> +	}
> +
>  	if (found_etm)
>  		return cs_etm_record_init(err);
>  
> +	if (found_spe)
> +		return arm_spe_recording_init(err);

... so given the above, this will fail.

AFAICT, this means that perf record opens the event, but doesn't touch
the aux buffer at all.

> +
>  	/*
> -	 * Clear 'err' even if we haven't found a cs_etm event - that way perf
> +	 * Clear 'err' even if we haven't found an event - that way perf
>  	 * record can still be used even if tracers aren't present.  The NULL
>  	 * return value will take care of telling the infrastructure HW tracing
>  	 * isn't available.
> diff --git a/tools/perf/arch/arm/util/pmu.c b/tools/perf/arch/arm/util/pmu.c
> index 98d67399a0d6..71fb8f13b40a 100644
> --- a/tools/perf/arch/arm/util/pmu.c
> +++ b/tools/perf/arch/arm/util/pmu.c
> @@ -20,6 +20,7 @@
>  #include <linux/perf_event.h>
>  
>  #include "cs-etm.h"
> +#include "arm-spe.h"
>  #include "../../util/pmu.h"
>  
>  struct perf_event_attr
> @@ -31,6 +32,8 @@ struct perf_event_attr
>  		pmu->selectable = true;
>  		pmu->set_drv_config = cs_etm_set_drv_config;
>  	}
> +	if (!strcmp(pmu->name, ARM_SPE_PMU_NAME))
> +		pmu->selectable = true;

... likewise I here.

I guess we need an is_arm_spe_pmu() helper for both cases, iterating over
all PMUs.

>  #endif
>  	return NULL;
>  }
> diff --git a/tools/perf/arch/arm64/util/Build b/tools/perf/arch/arm64/util/Build
> index cef6fb38d17e..f9969bb88ccb 100644
> --- a/tools/perf/arch/arm64/util/Build
> +++ b/tools/perf/arch/arm64/util/Build
> @@ -3,4 +3,5 @@ libperf-$(CONFIG_LOCAL_LIBUNWIND) += unwind-libunwind.o
>  
>  libperf-$(CONFIG_AUXTRACE) += ../../arm/util/pmu.o \
>  			      ../../arm/util/auxtrace.o \
> -			      ../../arm/util/cs-etm.o
> +			      ../../arm/util/cs-etm.o \
> +			      arm-spe.o
> diff --git a/tools/perf/arch/arm64/util/arm-spe.c b/tools/perf/arch/arm64/util/arm-spe.c
> new file mode 100644
> index 000000000000..07172764881c
> --- /dev/null
> +++ b/tools/perf/arch/arm64/util/arm-spe.c
> @@ -0,0 +1,210 @@
> +/*
> + * ARM Statistical Profiling Extensions (SPE) support
> + * Copyright (c) 2017, ARM Ltd.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/bitops.h>
> +#include <linux/log2.h>
> +
> +#include "../../util/cpumap.h"
> +#include "../../util/evsel.h"
> +#include "../../util/evlist.h"
> +#include "../../util/session.h"
> +#include "../../util/util.h"
> +#include "../../util/pmu.h"
> +#include "../../util/debug.h"
> +#include "../../util/tsc.h"
> +#include "../../util/auxtrace.h"
> +#include "../../util/arm-spe.h"
> +
> +#define KiB(x) ((x) * 1024)
> +#define MiB(x) ((x) * 1024 * 1024)
> +#define KiB_MASK(x) (KiB(x) - 1)
> +#define MiB_MASK(x) (MiB(x) - 1)

It's a shame we don't have a UAPI version of <linux/sizes.h>, though I
guess it's good to do the same thing as the x86 code.

I was a little worried that the *_MASK() helpers might be used with a
non power-of-two x, but it seems they're unused even for x86. Can we
please delete them?

> +struct arm_spe_recording {
> +	struct auxtrace_record		itr;
> +	struct perf_pmu			*arm_spe_pmu;
> +	struct perf_evlist		*evlist;
> +};

A user may wich to record trace on separate uarches simultaneously, so
having a singleton arm_spe_pmu for the entire evlist doesn't seem right.

e.g. I don't see why we should allow a user to do:

./perf record -c 1024 \
	-e arm_spe_0/ts_enable=1,pa_enable=0/ \
	-e arm_spe_1/ts_enable=1,pa_enable=0/ \
	${WORKLOAD}

... which perf-record seems to accept today, but I don't seem to get any
aux trace, regardless of whether I taskset the entire thing to any
particular CPU.

It also seems that the events aren't task bound, judging by -vv output:

------------------------------------------------------------
perf_event_attr:
  type                             7
  size                             112
  config                           0x1
  { sample_period, sample_freq }   1024
  sample_type                      IP|TID|TIME|CPU|IDENTIFIER
  read_format                      ID
  disabled                         1
  inherit                          1
  enable_on_exec                   1
  sample_id_all                    1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8 = 6
sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8 = 8
------------------------------------------------------------
perf_event_attr:
  type                             8
  size                             112
  config                           0x1
  { sample_period, sample_freq }   1024
  sample_type                      IP|TID|TIME|IDENTIFIER
  read_format                      ID
  disabled                         1
  inherit                          1
  enable_on_exec                   1
  sample_id_all                    1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 4  group_fd -1  flags 0x8 = 9
sys_perf_event_open: pid -1  cpu 5  group_fd -1  flags 0x8 = 10
sys_perf_event_open: pid -1  cpu 6  group_fd -1  flags 0x8 = 11
sys_perf_event_open: pid -1  cpu 7  group_fd -1  flags 0x8 = 12
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  config                           0x9
  { sample_period, sample_freq }   1024
  sample_type                      IP|TID|TIME|IDENTIFIER
  read_format                      ID
  disabled                         1
  inherit                          1
  exclude_kernel                   1
  exclude_hv                       1
  mmap                             1
  comm                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid 2181  cpu 0  group_fd -1  flags 0x8 = 13
sys_perf_event_open: pid 2181  cpu 1  group_fd -1  flags 0x8 = 14
sys_perf_event_open: pid 2181  cpu 2  group_fd -1  flags 0x8 = 15
sys_perf_event_open: pid 2181  cpu 3  group_fd -1  flags 0x8 = 16
sys_perf_event_open: pid 2181  cpu 4  group_fd -1  flags 0x8 = 17
sys_perf_event_open: pid 2181  cpu 5  group_fd -1  flags 0x8 = 18
sys_perf_event_open: pid 2181  cpu 6  group_fd -1  flags 0x8 = 19
sys_perf_event_open: pid 2181  cpu 7  group_fd -1  flags 0x8 = 20



I see something similar (i.e. perf doesn't try to bind the events to the
workload PID) when I try to record with only a single PMU. In that case,
perf-record blows up because it can't handle events on a subset of CPUs
(though it should be able to):

nanook@torsk:~$ ./perf record -vv -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=0/ true
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Problems creating module maps, continuing anyway...
------------------------------------------------------------
perf_event_attr:
  type                             7
  size                             112
  config                           0x1
  { sample_period, sample_freq }   1024
  sample_type                      IP|TID|TIME|CPU|IDENTIFIER
  read_format                      ID
  disabled                         1
  inherit                          1
  enable_on_exec                   1
  sample_id_all                    1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8 = 6
sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8 = 8
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  config                           0x9
  { sample_period, sample_freq }   1024
  sample_type                      IP|TID|TIME|IDENTIFIER
  read_format                      ID
  disabled                         1
  inherit                          1
  exclude_kernel                   1
  exclude_hv                       1
  mmap                             1
  comm                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid 2185  cpu 0  group_fd -1  flags 0x8 = 9
sys_perf_event_open: pid 2185  cpu 1  group_fd -1  flags 0x8 = 10
sys_perf_event_open: pid 2185  cpu 2  group_fd -1  flags 0x8 = 11
sys_perf_event_open: pid 2185  cpu 3  group_fd -1  flags 0x8 = 12
sys_perf_event_open: pid 2185  cpu 4  group_fd -1  flags 0x8 = 13
sys_perf_event_open: pid 2185  cpu 5  group_fd -1  flags 0x8 = 14
sys_perf_event_open: pid 2185  cpu 6  group_fd -1  flags 0x8 = 15
sys_perf_event_open: pid 2185  cpu 7  group_fd -1  flags 0x8 = 16
mmap size 266240B
AUX area mmap length 131072
perf event ring buffer mmapped per cpu
failed to mmap AUX area
failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0xffffc8596038, 512)=22)



... with a SW event, this works as expected, being bound to the workload PID:

nanook@torsk:~$ ./perf record -vvv -e context-switches true
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Problems creating module maps, continuing anyway...
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  config                           0x3
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|PERIOD
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  freq                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid 2220  cpu 0  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid 2220  cpu 1  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid 2220  cpu 2  group_fd -1  flags 0x8 = 6
sys_perf_event_open: pid 2220  cpu 3  group_fd -1  flags 0x8 = 8
sys_perf_event_open: pid 2220  cpu 4  group_fd -1  flags 0x8 = 9
sys_perf_event_open: pid 2220  cpu 5  group_fd -1  flags 0x8 = 10
sys_perf_event_open: pid 2220  cpu 6  group_fd -1  flags 0x8 = 11
sys_perf_event_open: pid 2220  cpu 7  group_fd -1  flags 0x8 = 12
mmap size 528384B
perf event ring buffer mmapped per cpu
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
[ perf record: Woken up 1 times to write data ]
overlapping maps in /lib/aarch64-linux-gnu/ld-2.19.so (disable tui for more info)
overlapping maps in [vdso] (disable tui for more info)
overlapping maps in /tmp/perf-2220.map (disable tui for more info)
Looking at the vmlinux_path (8 entries long)
No kallsyms or vmlinux with build-id cc083c873190ff1254624d3137142c6841c118c3 was found
[kernel.kallsyms] with build id cc083c873190ff1254624d3137142c6841c118c3 not found, continuing without symbols
overlapping maps in /etc/ld.so.cache (disable tui for more info)
overlapping maps in /lib/aarch64-linux-gnu/libc-2.19.so (disable tui for more info)
overlapping maps in /tmp/perf-2220.map (disable tui for more info)
overlapping maps in /tmp/perf-2220.map (disable tui for more info)
overlapping maps in /tmp/perf-2220.map (disable tui for more info)
overlapping maps in /tmp/perf-2220.map (disable tui for more info)
overlapping maps in /lib/aarch64-linux-gnu/libc-2.19.so (disable tui for more info)
overlapping maps in /bin/true (disable tui for more info)
overlapping maps in /lib/aarch64-linux-gnu/ld-2.19.so (disable tui for more info)
failed to write feature HEADER_CPUDESC
failed to write feature HEADER_CPUID
[ perf record: Captured and wrote 0.002 MB perf.data (4 samples) ]



... so I guess this has something to do with the way the tool tries to
use the cpumask, maknig the wrong assumption that this implies
system-wide collection is mandatory / expected.

> +
> +static size_t
> +arm_spe_info_priv_size(struct auxtrace_record *itr __maybe_unused,
> +			 struct perf_evlist *evlist __maybe_unused)
> +{
> +	return ARM_SPE_AUXTRACE_PRIV_SIZE;
> +}
> +
> +static int arm_spe_info_fill(struct auxtrace_record *itr,
> +			       struct perf_session *session,
> +			       struct auxtrace_info_event *auxtrace_info,
> +			       size_t priv_size)
> +{
> +	struct arm_spe_recording *sper =
> +			container_of(itr, struct arm_spe_recording, itr);
> +	struct perf_pmu *arm_spe_pmu = sper->arm_spe_pmu;
> +
> +	if (priv_size != ARM_SPE_AUXTRACE_PRIV_SIZE)
> +		return -EINVAL;
> +
> +	if (!session->evlist->nr_mmaps)
> +		return -EINVAL;
> +
> +	auxtrace_info->type = PERF_AUXTRACE_ARM_SPE;
> +	auxtrace_info->priv[ARM_SPE_PMU_TYPE] = arm_spe_pmu->type;
> +
> +	return 0;
> +}
> +
> +static int arm_spe_recording_options(struct auxtrace_record *itr,
> +				       struct perf_evlist *evlist,
> +				       struct record_opts *opts)
> +{
> +	struct arm_spe_recording *sper =
> +			container_of(itr, struct arm_spe_recording, itr);
> +	struct perf_pmu *arm_spe_pmu = sper->arm_spe_pmu;
> +	struct perf_evsel *evsel, *arm_spe_evsel = NULL;
> +	const struct cpu_map *cpus = evlist->cpus;
> +	bool privileged = geteuid() == 0 || perf_event_paranoid() < 0;
> +	struct perf_evsel *tracking_evsel;
> +	int err;
> +
> +	sper->evlist = evlist;
> +
> +	evlist__for_each_entry(evlist, evsel) {
> +		if (evsel->attr.type == arm_spe_pmu->type) {
> +			if (arm_spe_evsel) {
> +				pr_err("There may be only one " ARM_SPE_PMU_NAME " event\n");
> +				return -EINVAL;
> +			}
> +			evsel->attr.freq = 0;
> +			evsel->attr.sample_period = 1;
> +			arm_spe_evsel = evsel;
> +			opts->full_auxtrace = true;
> +		}
> +	}

Theoretically, we could ask for different events on different CPUs, but
otehrwise, this looks sane.

> +
> +	if (!opts->full_auxtrace)
> +		return 0;
> +
> +	/* We are in full trace mode but '-m,xyz' wasn't specified */
> +	if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) {
> +		if (privileged) {
> +			opts->auxtrace_mmap_pages = MiB(4) / page_size;
> +		} else {
> +			opts->auxtrace_mmap_pages = KiB(128) / page_size;
> +			if (opts->mmap_pages == UINT_MAX)
> +				opts->mmap_pages = KiB(256) / page_size;
> +		}
> +	}
> +
> +	/* Validate auxtrace_mmap_pages */
> +	if (opts->auxtrace_mmap_pages) {
> +		size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size;
> +		size_t min_sz = KiB(8);
> +
> +		if (sz < min_sz || !is_power_of_2(sz)) {
> +			pr_err("Invalid mmap size for ARM SPE: must be at least %zuKiB and a power of 2\n",
> +			       min_sz / 1024);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	/*
> +	 * To obtain the auxtrace buffer file descriptor, the auxtrace event
> +	 * must come first.
> +	 */
> +	perf_evlist__to_front(evlist, arm_spe_evsel);

Huh? *what* needs the auxtrace buffer fd?

This seems really fragile. Can't we store this elsewhere?

> +
> +	/*
> +	 * In the case of per-cpu mmaps, we need the CPU on the
> +	 * AUX event.
> +	 */
> +	if (!cpu_map__empty(cpus))
> +		perf_evsel__set_sample_bit(arm_spe_evsel, CPU);
> +
> +	/* Add dummy event to keep tracking */
> +	err = parse_events(evlist, "dummy:u", NULL);
> +	if (err)
> +		return err;
> +
> +	tracking_evsel = perf_evlist__last(evlist);
> +	perf_evlist__set_tracking_event(evlist, tracking_evsel);
> +
> +	tracking_evsel->attr.freq = 0;
> +	tracking_evsel->attr.sample_period = 1;
> +
> +	/* In per-cpu case, always need the time of mmap events etc */
> +	if (!cpu_map__empty(cpus))
> +		perf_evsel__set_sample_bit(tracking_evsel, TIME);
> +
> +	return 0;
> +}
> +
> +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)
> +{
> +	u64 ts;
> +
> +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
> +
> +	return ts;
> +}

I do not think it's a good idea to read the counter directly like this.

What is this "reference" intended to be meaningful relative to?

Why do we need to do this in userspace?

Can we not ask the kernel to output timestamps instead?

> +
> +static void arm_spe_recording_free(struct auxtrace_record *itr)
> +{
> +	struct arm_spe_recording *sper =
> +			container_of(itr, struct arm_spe_recording, itr);
> +
> +       free(sper);
> +}
> +
> +static int arm_spe_read_finish(struct auxtrace_record *itr, int idx)
> +{
> +	struct arm_spe_recording *sper =
> +			container_of(itr, struct arm_spe_recording, itr);
> +	struct perf_evsel *evsel;
> +
> +	evlist__for_each_entry(sper->evlist, evsel) {
> +		if (evsel->attr.type == sper->arm_spe_pmu->type)
> +			return perf_evlist__enable_event_idx(sper->evlist,
> +							     evsel, idx);
> +	}
> +	return -EINVAL;
> +}
> +
> +struct auxtrace_record *arm_spe_recording_init(int *err)
> +{
> +	struct perf_pmu *arm_spe_pmu = perf_pmu__find(ARM_SPE_PMU_NAME);
> +	struct arm_spe_recording *sper;
> +
> +	if (!arm_spe_pmu)
> +		return NULL;

No need to set *err here?

> +
> +	sper = zalloc(sizeof(struct arm_spe_recording));
> +	if (!sper) {
> +		*err = -ENOMEM;
> +		return NULL;
> +	}

... as we do here?

[...]

> +
> +	sper->arm_spe_pmu = arm_spe_pmu;
> +	sper->itr.recording_options = arm_spe_recording_options;
> +	sper->itr.info_priv_size = arm_spe_info_priv_size;
> +	sper->itr.info_fill = arm_spe_info_fill;
> +	sper->itr.free = arm_spe_recording_free;
> +	sper->itr.reference = arm_spe_reference;
> +	sper->itr.read_finish = arm_spe_read_finish;
> +	sper->itr.alignment = 0;
> +	return &sper->itr;
> +}
> diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> index 79dea95a7f68..4ed31e88b8ee 100644
> --- a/tools/perf/util/Build
> +++ b/tools/perf/util/Build
> @@ -82,6 +82,8 @@ libperf-$(CONFIG_AUXTRACE) += auxtrace.o
>  libperf-$(CONFIG_AUXTRACE) += intel-pt-decoder/
>  libperf-$(CONFIG_AUXTRACE) += intel-pt.o
>  libperf-$(CONFIG_AUXTRACE) += intel-bts.o
> +libperf-$(CONFIG_AUXTRACE) += arm-spe.o
> +libperf-$(CONFIG_AUXTRACE) += arm-spe-pkt-decoder.o
>  libperf-y += parse-branch-options.o
>  libperf-y += dump-insn.o
>  libperf-y += parse-regs-options.o
> diff --git a/tools/perf/util/arm-spe-pkt-decoder.c b/tools/perf/util/arm-spe-pkt-decoder.c
> new file mode 100644
> index 000000000000..ca3813d5b91a
> --- /dev/null
> +++ b/tools/perf/util/arm-spe-pkt-decoder.c
> @@ -0,0 +1,448 @@
> +/*
> + * ARM Statistical Profiling Extensions (SPE) support
> + * Copyright (c) 2017, ARM Ltd.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + */
> +
> +#include <stdio.h>
> +#include <string.h>
> +#include <endian.h>
> +#include <byteswap.h>
> +
> +#include "arm-spe-pkt-decoder.h"
> +
> +#define BIT(n)		(1 << (n))
> +
> +#define BIT61		((uint64_t)1 << 61)
> +#define BIT62		((uint64_t)1 << 62)
> +#define BIT63		((uint64_t)1 << 63)
> +
> +#define NS_FLAG		BIT63
> +#define EL_FLAG		(BIT62 | BIT61)
> +
> +#if __BYTE_ORDER == __BIG_ENDIAN
> +#define le16_to_cpu bswap_16
> +#define le32_to_cpu bswap_32
> +#define le64_to_cpu bswap_64
> +#define memcpy_le64(d, s, n) do { \
> +	memcpy((d), (s), (n));    \
> +	*(d) = le64_to_cpu(*(d)); \
> +} while (0)
> +#else
> +#define le16_to_cpu
> +#define le32_to_cpu
> +#define le64_to_cpu
> +#define memcpy_le64 memcpy
> +#endif
> +
> +static const char * const arm_spe_packet_name[] = {
> +	[ARM_SPE_PAD]		= "PAD",
> +	[ARM_SPE_END]		= "END",
> +	[ARM_SPE_TIMESTAMP]	= "TS",
> +	[ARM_SPE_ADDRESS]	= "ADDR",
> +	[ARM_SPE_COUNTER]	= "LAT",
> +	[ARM_SPE_CONTEXT]	= "CONTEXT",
> +	[ARM_SPE_INSN_TYPE]	= "INSN-TYPE",
> +	[ARM_SPE_EVENTS]	= "EVENTS",
> +	[ARM_SPE_DATA_SOURCE]	= "DATA-SOURCE",
> +};
> +
> +const char *arm_spe_pkt_name(enum arm_spe_pkt_type type)
> +{
> +	return arm_spe_packet_name[type];
> +}
> +
> +/* return ARM SPE payload size from its encoding:
> + * 00 : byte
> + * 01 : halfword (2)
> + * 10 : word (4)
> + * 11 : doubleword (8)
> + */
> +static int payloadlen(unsigned char byte)
> +{
> +	return 1 << ((byte & 0x30) >> 4);
> +}
> +
> +static int arm_spe_get_pad(struct arm_spe_pkt *packet)
> +{
> +	packet->type = ARM_SPE_PAD;
> +	return 1;
> +}
> +
> +static int arm_spe_get_alignment(const unsigned char *buf, size_t len,
> +				 struct arm_spe_pkt *packet)
> +{
> +	unsigned int alignment = 1 << ((buf[0] & 0xf) + 1);
> +
> +	if (len < alignment)
> +		return ARM_SPE_NEED_MORE_BYTES;
> +
> +	packet->type = ARM_SPE_PAD;
> +	return alignment - (((uint64_t)buf) & (alignment - 1));
> +}
> +
> +static int arm_spe_get_end(struct arm_spe_pkt *packet)
> +{
> +	packet->type = ARM_SPE_END;
> +	return 1;
> +}
> +
> +static int arm_spe_get_timestamp(const unsigned char *buf, size_t len,
> +				 struct arm_spe_pkt *packet)
> +{
> +	if (len < 8)
> +		return ARM_SPE_NEED_MORE_BYTES;
> +
> +	packet->type = ARM_SPE_TIMESTAMP;
> +	memcpy_le64(&packet->payload, buf + 1, 8);
> +
> +	return 1 + 8;
> +}
> +
> +static int arm_spe_get_events(const unsigned char *buf, size_t len,
> +			      struct arm_spe_pkt *packet)
> +{
> +	unsigned int events_len = payloadlen(buf[0]);
> +
> +	if (len < events_len)
> +		return ARM_SPE_NEED_MORE_BYTES;

Isn't len the size of the whole buffer? So isn't this failing to account
for the header byte?

> +
> +	packet->type = ARM_SPE_EVENTS;
> +	packet->index = events_len;

Huh? The events packet has no "index" field, so why do we need this?

> +	switch (events_len) {
> +	case 1: packet->payload = *(uint8_t *)(buf + 1); break;
> +	case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;
> +	case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;
> +	case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;
> +	default: return ARM_SPE_BAD_PACKET;
> +	}
> +
> +	return 1 + events_len;
> +}
> +
> +static int arm_spe_get_data_source(const unsigned char *buf,
> +				   struct arm_spe_pkt *packet)
> +{
> +	int len = payloadlen(buf[0]);
> +
> +	packet->type = ARM_SPE_DATA_SOURCE;
> +	if (len == 1)
> +		packet->payload = buf[1];
> +	else if (len == 2)
> +		packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1));
> +
> +	return 1 + len;
> +}

For those packets with a payload, the header has a uniform format
describing the payload size. Given that, can't we make the payload
extraction generic, regardless of the packet type?

e.g. something like:

static int arm_spe_get_payload(const unsigned char *buf, size_t len,
			       struct arm_spe_pkt *packet)
{
	<determine paylaod size>
	<length check>
	<switch>
	<return nr consumed bytes (inc header), or error>
}

static int arm_spe_get_events(const unsigned char *buf, size_t len,
			      struct arm_spe_pkt *packet)
{
	packet->type = ARM_SPE_EVENTS;
	return arm_spe_get_payload(buf, len, packet);
}

static int arm_spe_get_data_source(const unsigned char *buf,
				   struct arm_spe_pkt *packet)
{
	packet->type = ARM_SPE_DATA_SOURCE;
	return arm_spe_get_payload(buf, len, packet);
}

... and so on for the other packets with a payload.

> +static int arm_spe_do_get_packet(const unsigned char *buf, size_t len,
> +				 struct arm_spe_pkt *packet)
> +{
> +	unsigned int byte;
> +
> +	memset(packet, 0, sizeof(struct arm_spe_pkt));
> +
> +	if (!len)
> +		return ARM_SPE_NEED_MORE_BYTES;
> +
> +	byte = buf[0];
> +	if (byte == 0)
> +		return arm_spe_get_pad(packet);
> +	else if (byte == 1) /* no timestamp at end of record */
> +		return arm_spe_get_end(packet);
> +	else if (byte & 0xc0 /* 0y11000000 */) {
> +		if (byte & 0x80) {
> +			/* 0x38 is 0y00111000 */
> +			if ((byte & 0x38) == 0x30) /* address packet (short) */
> +				return arm_spe_get_addr(buf, len, 0, packet);
> +			if ((byte & 0x38) == 0x18) /* counter packet (short) */
> +				return arm_spe_get_counter(buf, len, 0, packet);
> +		} else
> +			if (byte == 0x71)
> +				return arm_spe_get_timestamp(buf, len, packet);
> +			else if ((byte & 0xf) == 0x2)
> +				return arm_spe_get_events(buf, len, packet);
> +			else if ((byte & 0xf) == 0x3)
> +				return arm_spe_get_data_source(buf, packet);
> +			else if ((byte & 0x3c) == 0x24)
> +				return arm_spe_get_context(buf, len, packet);
> +			else if ((byte & 0x3c) == 0x8)
> +				return arm_spe_get_insn_type(buf, packet);

Could we have some mnemonics for these?

e.g.

#define SPE_HEADER0_PAD		0x0
#define SPE_HEADER0_END		0x1

#define SPE_HEADER0_EVENTS	0x42
#define SPE_HEADER0_EVENTS_MASK	0xcf

if (byte == SPE_HEADER0_PAD) { 
	...
} else if (byte == SPE_HEADER0_END) {
	...
} else if ((byte & SPE_HEADER0_EVENTS_MASK) == SPE_HEADER0_EVENTS) {
	...
}

... which could even be turned into something table-driven.

> +	} else if ((byte & 0xe0) == 0x20 /* 0y00100000 */) {
> +		/* 16-bit header */
> +		byte = buf[1];
> +		if (byte == 0)
> +			return arm_spe_get_alignment(buf, len, packet);
> +		else if ((byte & 0xf8) == 0xb0)
> +			return arm_spe_get_addr(buf, len, 1, packet);
> +		else if ((byte & 0xf8) == 0x98)
> +			return arm_spe_get_counter(buf, len, 1, packet);
> +	}
> +
> +	return ARM_SPE_BAD_PACKET;
> +}
> +
> +int arm_spe_get_packet(const unsigned char *buf, size_t len,
> +		       struct arm_spe_pkt *packet)
> +{
> +	int ret;
> +
> +	ret = arm_spe_do_get_packet(buf, len, packet);
> +	if (ret > 0) {
> +		while (ret < 1 && len > (size_t)ret && !buf[ret])
> +			ret += 1;
> +	}

What is this trying to do?

> +	return ret;
> +}
> +
> +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
> +		     size_t buf_len)
> +{
> +	int ret, ns, el, index = packet->index;
> +	unsigned long long payload = packet->payload;
> +	const char *name = arm_spe_pkt_name(packet->type);
> +
> +	switch (packet->type) {
> +	case ARM_SPE_BAD:
> +	case ARM_SPE_PAD:
> +	case ARM_SPE_END:
> +		return snprintf(buf, buf_len, "%s", name);
> +	case ARM_SPE_EVENTS: {

[...]

> +	case ARM_SPE_DATA_SOURCE:
> +	case ARM_SPE_TIMESTAMP:
> +		return snprintf(buf, buf_len, "%s %lld", name, payload);
> +	case ARM_SPE_ADDRESS:
> +		switch (index) {
> +		case 0:
> +		case 1: ns = !!(packet->payload & NS_FLAG);
> +			el = (packet->payload & EL_FLAG) >> 61;
> +			payload &= ~(0xffULL << 56);
> +			return snprintf(buf, buf_len, "%s %llx el%d ns=%d",
> +				        (index == 1) ? "TGT" : "PC", payload, el, ns);

For TTBR1 addresses, this ends up losing the leading 0xff, giving us
invalid addresses, which look odd.

Can we please sign-extend bit 55 so that this gives us valid addresses
regardless of TTBR?

Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so
that things get padded consistently?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension
  2017-06-05 15:22 ` [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension Will Deacon
  2017-06-05 15:55   ` Kim Phillips
  2017-06-15 14:57   ` Mark Rutland
@ 2017-07-03 17:23   ` Mark Rutland
  2 siblings, 0 replies; 33+ messages in thread
From: Mark Rutland @ 2017-07-03 17:23 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, marc.zyngier, kim.phillips, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Mon, Jun 05, 2017 at 04:22:56PM +0100, Will Deacon wrote:
> +static const struct of_device_id arm_spe_pmu_of_match[] = {
> +	{ .compatible = "arm,statistical-profiling-extension-v1", .data = (void *)1 },
> +};

I just noticed that we're missing a sentinel entry here. Please could
you append one for v5?

Somehow we've been getting lucky with subsequent memory being zero, but
KASAN screams when it sees us accessing said memory:

[    5.451190] ==================================================================
[    5.459186] BUG: KASAN: global-out-of-bounds in __of_match_node+0x140/0x158
[    5.466595] Read of size 1 at addr ffff20000a504788 by task swapper/0/1
[    5.473615] 
[    5.475388] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc4-00010-g35b026e #2
[    5.483547] Hardware name: ARM Juno development board (r1) (DT)
[    5.489856] Call trace:
[    5.492629] [<ffff200008094540>] dump_backtrace+0x0/0x560
[    5.498445] [<ffff200008094ac0>] show_stack+0x20/0x30
[    5.503910] [<ffff200008ba4280>] dump_stack+0x11c/0x184
[    5.509555] [<ffff200008546a18>] print_address_description+0x40/0x388
[    5.516446] [<ffff200008547080>] kasan_report+0x138/0x398
[    5.522266] [<ffff2000085472f8>] __asan_report_load1_noabort+0x18/0x20
[    5.529238] [<ffff200009a90f00>] __of_match_node+0x140/0x158
[    5.535312] [<ffff200009a90f54>] of_match_node+0x3c/0x60
[    5.541032] [<ffff200009a95fa4>] of_match_device+0x54/0x98
[    5.546939] [<ffff2000091b39ac>] platform_match+0xc4/0x2e8
[    5.552841] [<ffff2000091ada40>] __driver_attach+0x70/0x218
[    5.558831] [<ffff2000091a7c2c>] bus_for_each_dev+0x13c/0x1d0
[    5.564999] [<ffff2000091ac590>] driver_attach+0x48/0x78
[    5.570720] [<ffff2000091ab514>] bus_add_driver+0x26c/0x5e0
[    5.576712] [<ffff2000091af86c>] driver_register+0x16c/0x398
[    5.582802] [<ffff2000091b33f8>] __platform_driver_register+0xd8/0x128
[    5.589775] [<ffff20000a89a18c>] arm_spe_pmu_init+0x60/0x8c
[    5.595766] [<ffff200008084acc>] do_one_initcall+0xcc/0x370
[    5.601762] [<ffff20000a7e1d3c>] kernel_init_freeable+0x5f8/0x6c4
[    5.608294] [<ffff200009f8f7a8>] kernel_init+0x18/0x190
[    5.613924] [<ffff200008084710>] ret_from_fork+0x10/0x40
[    5.619604] 
[    5.621346] The buggy address belongs to the variable:
[    5.626893]  arm_spe_pmu_of_match+0xc8/0x4c0
[    5.631496] 
[    5.633235] Memory state around the buggy address:
[    5.638413]  ffff20000a504680: 00 01 fa fa fa fa fa fa 00 00 00 00 00 00 00 00
[    5.646235]  ffff20000a504700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    5.654057] >ffff20000a504780: 00 fa fa fa fa fa fa fa 00 fa fa fa fa fa fa fa
[    5.661852]                       ^
[    5.665683]  ffff20000a504800: 07 fa fa fa fa fa fa fa 00 04 fa fa fa fa fa fa
[    5.673506]  ffff20000a504880: 00 05 fa fa fa fa fa fa 00 06 fa fa fa fa fa fa
[    5.681302] ==================================================================

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension
  2017-06-29 11:11                   ` Mark Rutland
@ 2017-07-06 17:08                     ` Kim Phillips
  0 siblings, 0 replies; 33+ messages in thread
From: Kim Phillips @ 2017-07-06 17:08 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Will Deacon, linux-arm-kernel, marc.zyngier, tglx, peterz,
	alexander.shishkin, robh, suzuki.poulose, pawel.moll,
	mathieu.poirier, mingo, linux-kernel

On Thu, 29 Jun 2017 12:11:02 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> On Wed, Jun 28, 2017 at 07:59:53PM -0500, Kim Phillips wrote:
> > arm-trusted-firmware, btw, has just been updated to enable SPE at lower
> > ELs, so I don't have to use a hacked-up version anymore.
> > 
> > I also updated my BL33 to the latest upstream u-boot
> > vexpress_aemv8a_dram_defconfig, and at least now the kernel continues
> > to boot, even though it can't bring up 6 of the 7 secondary CPUs.
> 
> Do you mean that you replaced the bootwrapper with u-boot?

no, sorry, arm-trusted-firmware wants a BL33 image, which u-boot
provides.

Sorry but I guess I'm not using the bootwrapper, and we are launching
the model in completely different manners.

The bootwrapper input is a kernel and a dtb, and it emits a dtb and a
linux-system.axf file, the latter of which I don't see how to launch
the model with:  The model script I'm using uses a kernel, dtb, and an
fip.bin and bl1.bin.

Can you share how you invoke the model, presumably with the .axf file?  

> The --with-cpu-ids option *adds* CPU nodes, but leaves the broken ones,
> and your CPU phandles (and PPI partitions for the SPE node(s)) will all
> be wrong. Linux is still seeing those erroneous CPU nodes (presumably
> taking Linux CPU ids 2-7).
> 
> Generally, --with-cpu-ids doesn't work as you'd expect, which is why it
> got removed in favour of assuming an initally correct DT.
> 
> Please fix the DT instead. With a fixed DT, and commit ccdc936924b3682d,
> the bootwrapper won't further mangle your DT.

OK, changing the CPU IDs alone didn't work (kernel didn't even say hi),
but taking what commit ccdc936924b3682d does to the cpu_on/off
properties makes it work for my arm-trusted-firmware (non-boot-wrapper)
invocation, so I have to use the wrapper if I change my DT CPUs for the
time being.

So I'm OK now for at least the two-partition, four CPUs each setup, but
for topologies as described in Marc/Will's fvp-base.dts commit, I don't
see how to run without knowing how to make the axf file work with the
model, i.e., solely with the boot-wrapper.

Thanks,

Kim

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] perf tools: Add ARM Statistical Profiling Extensions (SPE) support
  2017-06-30 14:02                       ` Mark Rutland
@ 2017-07-18  0:48                         ` Kim Phillips
  2017-08-18  3:11                           ` [PATCH v2] " Kim Phillips
  2017-08-18 16:59                           ` [PATCH] " Mark Rutland
  0 siblings, 2 replies; 33+ messages in thread
From: Kim Phillips @ 2017-07-18  0:48 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Arnaldo Carvalho de Melo, robh, mathieu.poirier, pawel.moll,
	suzuki.poulose, marc.zyngier, Will Deacon, linux-kernel,
	alexander.shishkin, peterz, mingo, tglx, linux-arm-kernel,
	Adrian Hunter, Jiri Olsa, Andi Kleen, Wang Nan

On Fri, 30 Jun 2017 15:02:41 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> Hi Kim,

Hi Mark,

> On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote:
<snip>
> >  	if (evlist) {
> >  		evlist__for_each_entry(evlist, evsel) {
> >  			if (cs_etm_pmu &&
> >  			    evsel->attr.type == cs_etm_pmu->type)
> >  				found_etm = true;
> > +			if (arm_spe_pmu &&
> > +			    evsel->attr.type == arm_spe_pmu->type)
> > +				found_spe = true;
> 
> Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all
> SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1"
> and so on).
> 
> Can we not find all PMUs with a "arm_spe_" prefix?
> 
> ... or, taking a step back, do we need some sysfs "class" attribute to
> identify multi-instance PMUs?

Since there is one SPE per core, and it looks like the buffer full
interrupt line is the only difference between the SPE device node
specification in the device tree, I guess I don't understand why the
driver doesn't accept a singular "arm_spe" from the tool, and manage
interrupt handling accordingly.  Also, if a set of CPUs are missing SPE
support, and the user doesn't explicitly define a CPU affinity to
outside that partition, then decline to run, stating why.

This would also obviously help a lot from an ease-of-use perspective.

<snip>

> > +struct arm_spe_recording {
> > +	struct auxtrace_record		itr;
> > +	struct perf_pmu			*arm_spe_pmu;
> > +	struct perf_evlist		*evlist;
> > +};
> 
> A user may wich to record trace on separate uarches simultaneously, so
> having a singleton arm_spe_pmu for the entire evlist doesn't seem right.
> 
> e.g. I don't see why we should allow a user to do:
> 
> ./perf record -c 1024 \
> 	-e arm_spe_0/ts_enable=1,pa_enable=0/ \
> 	-e arm_spe_1/ts_enable=1,pa_enable=0/ \
> 	${WORKLOAD}
> 
> ... which perf-record seems to accept today, but I don't seem to get any
> aux trace, regardless of whether I taskset the entire thing to any
> particular CPU.

The implementation-defined components of the SPE don't affect a
'record'/capture operation, so a single arm_spe should be fine with
separate uarch setups.

> It also seems that the events aren't task bound, judging by -vv output:

<snip>

> I see something similar (i.e. perf doesn't try to bind the events to the
> workload PID) when I try to record with only a single PMU. In that case,
> perf-record blows up because it can't handle events on a subset of CPUs
> (though it should be able to):
> 
> nanook@torsk:~$ ./perf record -vv -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=0/ true

<snip>

> mmap size 266240B
> AUX area mmap length 131072
> perf event ring buffer mmapped per cpu
> failed to mmap AUX area
> failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0xffffc8596038, 512)=22)

FWIW, that INTERNAL ERROR is fixed by this commit, btw:

commit 8a1898db51a3390241cd5fae267dc8aaa9db0f8b
Author: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Date:   Tue Jun 20 12:26:39 2017 +0200

    perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)

So now it should return:

failed to mmap with 95 (Operation not supported)

> ... with a SW event, this works as expected, being bound to the workload PID:

<snip>

> ... so I guess this has something to do with the way the tool tries to
> use the cpumask, maknig the wrong assumption that this implies
> system-wide collection is mandatory / expected.

Right, I'll take a look at it.

> > +	if (!opts->full_auxtrace)
> > +		return 0;
> > +
> > +	/* We are in full trace mode but '-m,xyz' wasn't specified */
> > +	if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) {
> > +		if (privileged) {
> > +			opts->auxtrace_mmap_pages = MiB(4) / page_size;
> > +		} else {
> > +			opts->auxtrace_mmap_pages = KiB(128) / page_size;
> > +			if (opts->mmap_pages == UINT_MAX)
> > +				opts->mmap_pages = KiB(256) / page_size;
> > +		}
> > +	}
> > +
> > +	/* Validate auxtrace_mmap_pages */
> > +	if (opts->auxtrace_mmap_pages) {
> > +		size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size;
> > +		size_t min_sz = KiB(8);
> > +
> > +		if (sz < min_sz || !is_power_of_2(sz)) {
> > +			pr_err("Invalid mmap size for ARM SPE: must be at least %zuKiB and a power of 2\n",
> > +			       min_sz / 1024);
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * To obtain the auxtrace buffer file descriptor, the auxtrace event
> > +	 * must come first.
> > +	 */
> > +	perf_evlist__to_front(evlist, arm_spe_evsel);
> 
> Huh? *what* needs the auxtrace buffer fd?
> 
> This seems really fragile. Can't we store this elsewhere?

It's copied from the bts code, and the other auxtrace record users do
the same; it looks like auxtrace record has implicit dependencies on it?

> > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)
> > +{
> > +	u64 ts;
> > +
> > +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
> > +
> > +	return ts;
> > +}
> 
> I do not think it's a good idea to read the counter directly like this.
> 
> What is this "reference" intended to be meaningful relative to?

AFAICT, it's just a nonce the perf tool uses to track unique events,
and I thought this better than the ETM driver's heavier get-random
implementation.

> Why do we need to do this in userspace?
> 
> Can we not ask the kernel to output timestamps instead?

Why?  This gets the job done faster.

> > +static int arm_spe_get_events(const unsigned char *buf, size_t len,
> > +			      struct arm_spe_pkt *packet)
> > +{
> > +	unsigned int events_len = payloadlen(buf[0]);
> > +
> > +	if (len < events_len)
> > +		return ARM_SPE_NEED_MORE_BYTES;
> 
> Isn't len the size of the whole buffer? So isn't this failing to account
> for the header byte?

well spotted; I changed /events_len/1 + events_len/.

> > +	packet->type = ARM_SPE_EVENTS;
> > +	packet->index = events_len;
> 
> Huh? The events packet has no "index" field, so why do we need this?

To identify Events with a less number of comparisons in arm_spe_pkt_desc():
E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are
identified iff index > 1.

> > +	switch (events_len) {
> > +	case 1: packet->payload = *(uint8_t *)(buf + 1); break;
> > +	case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;
> > +	case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;
> > +	case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;
> > +	default: return ARM_SPE_BAD_PACKET;
> > +	}
> > +
> > +	return 1 + events_len;
> > +}
> > +
> > +static int arm_spe_get_data_source(const unsigned char *buf,
> > +				   struct arm_spe_pkt *packet)
> > +{
> > +	int len = payloadlen(buf[0]);
> > +
> > +	packet->type = ARM_SPE_DATA_SOURCE;
> > +	if (len == 1)
> > +		packet->payload = buf[1];
> > +	else if (len == 2)
> > +		packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1));
> > +
> > +	return 1 + len;
> > +}
> 
> For those packets with a payload, the header has a uniform format
> describing the payload size. Given that, can't we make the payload
> extraction generic, regardless of the packet type?
> 
> e.g. something like:
> 
> static int arm_spe_get_payload(const unsigned char *buf, size_t len,
> 			       struct arm_spe_pkt *packet)
> {
> 	<determine paylaod size>
> 	<length check>
> 	<switch>
> 	<return nr consumed bytes (inc header), or error>
> }
> 
> static int arm_spe_get_events(const unsigned char *buf, size_t len,
> 			      struct arm_spe_pkt *packet)
> {
> 	packet->type = ARM_SPE_EVENTS;
> 	return arm_spe_get_payload(buf, len, packet);
> }
> 
> static int arm_spe_get_data_source(const unsigned char *buf,
> 				   struct arm_spe_pkt *packet)
> {
> 	packet->type = ARM_SPE_DATA_SOURCE;
> 	return arm_spe_get_payload(buf, len, packet);
> }
> 
> ... and so on for the other packets with a payload.

done for TIMESTAMP, EVENTS, DATA_SOURCE, CONTEXT, INSN_TYPE.  It
wouldn't fit ADDR and COUNTER well since they can occur in an
extended-header, and their lengths are encoded differently, and are
fixed anyway.

> > +static int arm_spe_do_get_packet(const unsigned char *buf, size_t len,
> > +				 struct arm_spe_pkt *packet)
> > +{
> > +	unsigned int byte;
> > +
> > +	memset(packet, 0, sizeof(struct arm_spe_pkt));
> > +
> > +	if (!len)
> > +		return ARM_SPE_NEED_MORE_BYTES;
> > +
> > +	byte = buf[0];
> > +	if (byte == 0)
> > +		return arm_spe_get_pad(packet);
> > +	else if (byte == 1) /* no timestamp at end of record */
> > +		return arm_spe_get_end(packet);
> > +	else if (byte & 0xc0 /* 0y11000000 */) {
> > +		if (byte & 0x80) {
> > +			/* 0x38 is 0y00111000 */
> > +			if ((byte & 0x38) == 0x30) /* address packet (short) */
> > +				return arm_spe_get_addr(buf, len, 0, packet);
> > +			if ((byte & 0x38) == 0x18) /* counter packet (short) */
> > +				return arm_spe_get_counter(buf, len, 0, packet);
> > +		} else
> > +			if (byte == 0x71)
> > +				return arm_spe_get_timestamp(buf, len, packet);
> > +			else if ((byte & 0xf) == 0x2)
> > +				return arm_spe_get_events(buf, len, packet);
> > +			else if ((byte & 0xf) == 0x3)
> > +				return arm_spe_get_data_source(buf, packet);
> > +			else if ((byte & 0x3c) == 0x24)
> > +				return arm_spe_get_context(buf, len, packet);
> > +			else if ((byte & 0x3c) == 0x8)
> > +				return arm_spe_get_insn_type(buf, packet);
> 
> Could we have some mnemonics for these?
> 
> e.g.
> 
> #define SPE_HEADER0_PAD		0x0
> #define SPE_HEADER0_END		0x1
> 
> #define SPE_HEADER0_EVENTS	0x42
> #define SPE_HEADER0_EVENTS_MASK	0xcf
> 
> if (byte == SPE_HEADER0_PAD) { 
> 	...
> } else if (byte == SPE_HEADER0_END) {
> 	...
> } else if ((byte & SPE_HEADER0_EVENTS_MASK) == SPE_HEADER0_EVENTS) {
> 	...
> }
> 
> ... which could even be turned into something table-driven.

It'd be a pretty sparse table, so I doubt it'd be faster, but if it is,
I'd just as soon leave that type of space tradeoff decision to the
compiler, given its optimization directives.

I'll take a look at replacing the constants that have named equivalents
with their named versions, even though it was pretty clear already what
they denoted, given the name of the function each branch was calling,
and the comments.

> > +	} else if ((byte & 0xe0) == 0x20 /* 0y00100000 */) {
> > +		/* 16-bit header */
> > +		byte = buf[1];
> > +		if (byte == 0)
> > +			return arm_spe_get_alignment(buf, len, packet);
> > +		else if ((byte & 0xf8) == 0xb0)
> > +			return arm_spe_get_addr(buf, len, 1, packet);
> > +		else if ((byte & 0xf8) == 0x98)
> > +			return arm_spe_get_counter(buf, len, 1, packet);
> > +	}
> > +
> > +	return ARM_SPE_BAD_PACKET;
> > +}
> > +
> > +int arm_spe_get_packet(const unsigned char *buf, size_t len,
> > +		       struct arm_spe_pkt *packet)
> > +{
> > +	int ret;
> > +
> > +	ret = arm_spe_do_get_packet(buf, len, packet);
> > +	if (ret > 0) {
> > +		while (ret < 1 && len > (size_t)ret && !buf[ret])
> > +			ret += 1;
> > +	}
> 
> What is this trying to do?

Nothing!  I've since fixed it to prevent multiple contiguous
PADs from coming out on their own lines, and rather accumulate up to 16
(the width of the raw dump format) on one PAD-labeled line, like so:

.  00007ec9:  00 00 00 00 00 00 00 00 00 00                   PAD

instead of this:

.  00007ec9:  00                                              PAD
.  00007eca:  00                                              PAD
.  00007ecb:  00                                              PAD
.  00007ecc:  00                                              PAD
.  00007ecd:  00                                              PAD
.  00007ece:  00                                              PAD
.  00007ecf:  00                                              PAD
.  00007ed0:  00                                              PAD
.  00007ed1:  00                                              PAD
.  00007ed2:  00                                              PAD

thanks for pointing it out.

> > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
> > +		     size_t buf_len)
> > +{
> > +	int ret, ns, el, index = packet->index;
> > +	unsigned long long payload = packet->payload;
> > +	const char *name = arm_spe_pkt_name(packet->type);
> > +
> > +	switch (packet->type) {
> > +	case ARM_SPE_BAD:
> > +	case ARM_SPE_PAD:
> > +	case ARM_SPE_END:
> > +		return snprintf(buf, buf_len, "%s", name);
> > +	case ARM_SPE_EVENTS: {
> 
> [...]
> 
> > +	case ARM_SPE_DATA_SOURCE:
> > +	case ARM_SPE_TIMESTAMP:
> > +		return snprintf(buf, buf_len, "%s %lld", name, payload);
> > +	case ARM_SPE_ADDRESS:
> > +		switch (index) {
> > +		case 0:
> > +		case 1: ns = !!(packet->payload & NS_FLAG);
> > +			el = (packet->payload & EL_FLAG) >> 61;
> > +			payload &= ~(0xffULL << 56);
> > +			return snprintf(buf, buf_len, "%s %llx el%d ns=%d",
> > +				        (index == 1) ? "TGT" : "PC", payload, el, ns);
> 
> For TTBR1 addresses, this ends up losing the leading 0xff, giving us
> invalid addresses, which look odd.
> 
> Can we please sign-extend bit 55 so that this gives us valid addresses
> regardless of TTBR?

I'll take a look at doing this once I get consistent output from an
implementation.

> Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so
> that things get padded consistently?

I've added the 0x prefix, but prefer to not fix the length to 016: I
don't see any direct benefit, rather see benefits to having the length
vary, for output size control and less obvious reasons, e.g., sorting
address lines by their length to get a sense of address groups caught
during the run.  FWIW, Intel doesn't do the 016 either.

If I've omitted a response to the other comments, it's because they are
being addressed.

Thanks!

Kim

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2] perf tools: Add ARM Statistical Profiling Extensions (SPE) support
  2017-07-18  0:48                         ` Kim Phillips
@ 2017-08-18  3:11                           ` Kim Phillips
  2017-08-18 17:36                             ` Mark Rutland
  2017-08-18 16:59                           ` [PATCH] " Mark Rutland
  1 sibling, 1 reply; 33+ messages in thread
From: Kim Phillips @ 2017-08-18  3:11 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Arnaldo Carvalho de Melo, robh, mathieu.poirier, pawel.moll,
	suzuki.poulose, marc.zyngier, Will Deacon, linux-kernel,
	alexander.shishkin, peterz, mingo, tglx, linux-arm-kernel,
	Adrian Hunter, Jiri Olsa, Andi Kleen, Wang Nan

On Mon, 17 Jul 2017 19:48:22 -0500
Kim Phillips <kim.phillips@arm.com> wrote:

> On Fri, 30 Jun 2017 15:02:41 +0100
> Mark Rutland <mark.rutland@arm.com> wrote:
> 
> > Hi Kim,
> 
> Hi Mark,
> 
> > On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote:
> <snip>
> > >  	if (evlist) {
> > >  		evlist__for_each_entry(evlist, evsel) {
> > >  			if (cs_etm_pmu &&
> > >  			    evsel->attr.type == cs_etm_pmu->type)
> > >  				found_etm = true;
> > > +			if (arm_spe_pmu &&
> > > +			    evsel->attr.type == arm_spe_pmu->type)
> > > +				found_spe = true;
> > 
> > Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all
> > SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1"
> > and so on).
> > 
> > Can we not find all PMUs with a "arm_spe_" prefix?
> > 
> > ... or, taking a step back, do we need some sysfs "class" attribute to
> > identify multi-instance PMUs?
> 
> Since there is one SPE per core, and it looks like the buffer full
> interrupt line is the only difference between the SPE device node
> specification in the device tree, I guess I don't understand why the
> driver doesn't accept a singular "arm_spe" from the tool, and manage
> interrupt handling accordingly.  Also, if a set of CPUs are missing SPE
> support, and the user doesn't explicitly define a CPU affinity to
> outside that partition, then decline to run, stating why.
> 
> This would also obviously help a lot from an ease-of-use perspective.
> 
> <snip>
> 
> > > +struct arm_spe_recording {
> > > +	struct auxtrace_record		itr;
> > > +	struct perf_pmu			*arm_spe_pmu;
> > > +	struct perf_evlist		*evlist;
> > > +};
> > 
> > A user may wich to record trace on separate uarches simultaneously, so
> > having a singleton arm_spe_pmu for the entire evlist doesn't seem right.
> > 
> > e.g. I don't see why we should allow a user to do:
> > 
> > ./perf record -c 1024 \
> > 	-e arm_spe_0/ts_enable=1,pa_enable=0/ \
> > 	-e arm_spe_1/ts_enable=1,pa_enable=0/ \
> > 	${WORKLOAD}
> > 
> > ... which perf-record seems to accept today, but I don't seem to get any
> > aux trace, regardless of whether I taskset the entire thing to any
> > particular CPU.
> 
> The implementation-defined components of the SPE don't affect a
> 'record'/capture operation, so a single arm_spe should be fine with
> separate uarch setups.
> 
> > It also seems that the events aren't task bound, judging by -vv output:
> 
> <snip>
> 
> > I see something similar (i.e. perf doesn't try to bind the events to the
> > workload PID) when I try to record with only a single PMU. In that case,
> > perf-record blows up because it can't handle events on a subset of CPUs
> > (though it should be able to):
> > 
> > nanook@torsk:~$ ./perf record -vv -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=0/ true
> 
> <snip>
> 
> > mmap size 266240B
> > AUX area mmap length 131072
> > perf event ring buffer mmapped per cpu
> > failed to mmap AUX area
> > failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0xffffc8596038, 512)=22)
> 
> FWIW, that INTERNAL ERROR is fixed by this commit, btw:
> 
> commit 8a1898db51a3390241cd5fae267dc8aaa9db0f8b
> Author: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
> Date:   Tue Jun 20 12:26:39 2017 +0200
> 
>     perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)
> 
> So now it should return:
> 
> failed to mmap with 95 (Operation not supported)
> 
> > ... with a SW event, this works as expected, being bound to the workload PID:
> 
> <snip>
> 
> > ... so I guess this has something to do with the way the tool tries to
> > use the cpumask, maknig the wrong assumption that this implies
> > system-wide collection is mandatory / expected.
> 
> Right, I'll take a look at it.
> 
> > > +	if (!opts->full_auxtrace)
> > > +		return 0;
> > > +
> > > +	/* We are in full trace mode but '-m,xyz' wasn't specified */
> > > +	if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) {
> > > +		if (privileged) {
> > > +			opts->auxtrace_mmap_pages = MiB(4) / page_size;
> > > +		} else {
> > > +			opts->auxtrace_mmap_pages = KiB(128) / page_size;
> > > +			if (opts->mmap_pages == UINT_MAX)
> > > +				opts->mmap_pages = KiB(256) / page_size;
> > > +		}
> > > +	}
> > > +
> > > +	/* Validate auxtrace_mmap_pages */
> > > +	if (opts->auxtrace_mmap_pages) {
> > > +		size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size;
> > > +		size_t min_sz = KiB(8);
> > > +
> > > +		if (sz < min_sz || !is_power_of_2(sz)) {
> > > +			pr_err("Invalid mmap size for ARM SPE: must be at least %zuKiB and a power of 2\n",
> > > +			       min_sz / 1024);
> > > +			return -EINVAL;
> > > +		}
> > > +	}
> > > +
> > > +	/*
> > > +	 * To obtain the auxtrace buffer file descriptor, the auxtrace event
> > > +	 * must come first.
> > > +	 */
> > > +	perf_evlist__to_front(evlist, arm_spe_evsel);
> > 
> > Huh? *what* needs the auxtrace buffer fd?
> > 
> > This seems really fragile. Can't we store this elsewhere?
> 
> It's copied from the bts code, and the other auxtrace record users do
> the same; it looks like auxtrace record has implicit dependencies on it?
> 
> > > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)
> > > +{
> > > +	u64 ts;
> > > +
> > > +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
> > > +
> > > +	return ts;
> > > +}
> > 
> > I do not think it's a good idea to read the counter directly like this.
> > 
> > What is this "reference" intended to be meaningful relative to?
> 
> AFAICT, it's just a nonce the perf tool uses to track unique events,
> and I thought this better than the ETM driver's heavier get-random
> implementation.
> 
> > Why do we need to do this in userspace?
> > 
> > Can we not ask the kernel to output timestamps instead?
> 
> Why?  This gets the job done faster.
> 
> > > +static int arm_spe_get_events(const unsigned char *buf, size_t len,
> > > +			      struct arm_spe_pkt *packet)
> > > +{
> > > +	unsigned int events_len = payloadlen(buf[0]);
> > > +
> > > +	if (len < events_len)
> > > +		return ARM_SPE_NEED_MORE_BYTES;
> > 
> > Isn't len the size of the whole buffer? So isn't this failing to account
> > for the header byte?
> 
> well spotted; I changed /events_len/1 + events_len/.
> 
> > > +	packet->type = ARM_SPE_EVENTS;
> > > +	packet->index = events_len;
> > 
> > Huh? The events packet has no "index" field, so why do we need this?
> 
> To identify Events with a less number of comparisons in arm_spe_pkt_desc():
> E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are
> identified iff index > 1.
> 
> > > +	switch (events_len) {
> > > +	case 1: packet->payload = *(uint8_t *)(buf + 1); break;
> > > +	case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;
> > > +	case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;
> > > +	case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;
> > > +	default: return ARM_SPE_BAD_PACKET;
> > > +	}
> > > +
> > > +	return 1 + events_len;
> > > +}
> > > +
> > > +static int arm_spe_get_data_source(const unsigned char *buf,
> > > +				   struct arm_spe_pkt *packet)
> > > +{
> > > +	int len = payloadlen(buf[0]);
> > > +
> > > +	packet->type = ARM_SPE_DATA_SOURCE;
> > > +	if (len == 1)
> > > +		packet->payload = buf[1];
> > > +	else if (len == 2)
> > > +		packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1));
> > > +
> > > +	return 1 + len;
> > > +}
> > 
> > For those packets with a payload, the header has a uniform format
> > describing the payload size. Given that, can't we make the payload
> > extraction generic, regardless of the packet type?
> > 
> > e.g. something like:
> > 
> > static int arm_spe_get_payload(const unsigned char *buf, size_t len,
> > 			       struct arm_spe_pkt *packet)
> > {
> > 	<determine paylaod size>
> > 	<length check>
> > 	<switch>
> > 	<return nr consumed bytes (inc header), or error>
> > }
> > 
> > static int arm_spe_get_events(const unsigned char *buf, size_t len,
> > 			      struct arm_spe_pkt *packet)
> > {
> > 	packet->type = ARM_SPE_EVENTS;
> > 	return arm_spe_get_payload(buf, len, packet);
> > }
> > 
> > static int arm_spe_get_data_source(const unsigned char *buf,
> > 				   struct arm_spe_pkt *packet)
> > {
> > 	packet->type = ARM_SPE_DATA_SOURCE;
> > 	return arm_spe_get_payload(buf, len, packet);
> > }
> > 
> > ... and so on for the other packets with a payload.
> 
> done for TIMESTAMP, EVENTS, DATA_SOURCE, CONTEXT, INSN_TYPE.  It
> wouldn't fit ADDR and COUNTER well since they can occur in an
> extended-header, and their lengths are encoded differently, and are
> fixed anyway.
> 
> > > +static int arm_spe_do_get_packet(const unsigned char *buf, size_t len,
> > > +				 struct arm_spe_pkt *packet)
> > > +{
> > > +	unsigned int byte;
> > > +
> > > +	memset(packet, 0, sizeof(struct arm_spe_pkt));
> > > +
> > > +	if (!len)
> > > +		return ARM_SPE_NEED_MORE_BYTES;
> > > +
> > > +	byte = buf[0];
> > > +	if (byte == 0)
> > > +		return arm_spe_get_pad(packet);
> > > +	else if (byte == 1) /* no timestamp at end of record */
> > > +		return arm_spe_get_end(packet);
> > > +	else if (byte & 0xc0 /* 0y11000000 */) {
> > > +		if (byte & 0x80) {
> > > +			/* 0x38 is 0y00111000 */
> > > +			if ((byte & 0x38) == 0x30) /* address packet (short) */
> > > +				return arm_spe_get_addr(buf, len, 0, packet);
> > > +			if ((byte & 0x38) == 0x18) /* counter packet (short) */
> > > +				return arm_spe_get_counter(buf, len, 0, packet);
> > > +		} else
> > > +			if (byte == 0x71)
> > > +				return arm_spe_get_timestamp(buf, len, packet);
> > > +			else if ((byte & 0xf) == 0x2)
> > > +				return arm_spe_get_events(buf, len, packet);
> > > +			else if ((byte & 0xf) == 0x3)
> > > +				return arm_spe_get_data_source(buf, packet);
> > > +			else if ((byte & 0x3c) == 0x24)
> > > +				return arm_spe_get_context(buf, len, packet);
> > > +			else if ((byte & 0x3c) == 0x8)
> > > +				return arm_spe_get_insn_type(buf, packet);
> > 
> > Could we have some mnemonics for these?
> > 
> > e.g.
> > 
> > #define SPE_HEADER0_PAD		0x0
> > #define SPE_HEADER0_END		0x1
> > 
> > #define SPE_HEADER0_EVENTS	0x42
> > #define SPE_HEADER0_EVENTS_MASK	0xcf
> > 
> > if (byte == SPE_HEADER0_PAD) { 
> > 	...
> > } else if (byte == SPE_HEADER0_END) {
> > 	...
> > } else if ((byte & SPE_HEADER0_EVENTS_MASK) == SPE_HEADER0_EVENTS) {
> > 	...
> > }
> > 
> > ... which could even be turned into something table-driven.
> 
> It'd be a pretty sparse table, so I doubt it'd be faster, but if it is,
> I'd just as soon leave that type of space tradeoff decision to the
> compiler, given its optimization directives.
> 
> I'll take a look at replacing the constants that have named equivalents
> with their named versions, even though it was pretty clear already what
> they denoted, given the name of the function each branch was calling,
> and the comments.
> 
> > > +	} else if ((byte & 0xe0) == 0x20 /* 0y00100000 */) {
> > > +		/* 16-bit header */
> > > +		byte = buf[1];
> > > +		if (byte == 0)
> > > +			return arm_spe_get_alignment(buf, len, packet);
> > > +		else if ((byte & 0xf8) == 0xb0)
> > > +			return arm_spe_get_addr(buf, len, 1, packet);
> > > +		else if ((byte & 0xf8) == 0x98)
> > > +			return arm_spe_get_counter(buf, len, 1, packet);
> > > +	}
> > > +
> > > +	return ARM_SPE_BAD_PACKET;
> > > +}
> > > +
> > > +int arm_spe_get_packet(const unsigned char *buf, size_t len,
> > > +		       struct arm_spe_pkt *packet)
> > > +{
> > > +	int ret;
> > > +
> > > +	ret = arm_spe_do_get_packet(buf, len, packet);
> > > +	if (ret > 0) {
> > > +		while (ret < 1 && len > (size_t)ret && !buf[ret])
> > > +			ret += 1;
> > > +	}
> > 
> > What is this trying to do?
> 
> Nothing!  I've since fixed it to prevent multiple contiguous
> PADs from coming out on their own lines, and rather accumulate up to 16
> (the width of the raw dump format) on one PAD-labeled line, like so:
> 
> .  00007ec9:  00 00 00 00 00 00 00 00 00 00                   PAD
> 
> instead of this:
> 
> .  00007ec9:  00                                              PAD
> .  00007eca:  00                                              PAD
> .  00007ecb:  00                                              PAD
> .  00007ecc:  00                                              PAD
> .  00007ecd:  00                                              PAD
> .  00007ece:  00                                              PAD
> .  00007ecf:  00                                              PAD
> .  00007ed0:  00                                              PAD
> .  00007ed1:  00                                              PAD
> .  00007ed2:  00                                              PAD
> 
> thanks for pointing it out.
> 
> > > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
> > > +		     size_t buf_len)
> > > +{
> > > +	int ret, ns, el, index = packet->index;
> > > +	unsigned long long payload = packet->payload;
> > > +	const char *name = arm_spe_pkt_name(packet->type);
> > > +
> > > +	switch (packet->type) {
> > > +	case ARM_SPE_BAD:
> > > +	case ARM_SPE_PAD:
> > > +	case ARM_SPE_END:
> > > +		return snprintf(buf, buf_len, "%s", name);
> > > +	case ARM_SPE_EVENTS: {
> > 
> > [...]
> > 
> > > +	case ARM_SPE_DATA_SOURCE:
> > > +	case ARM_SPE_TIMESTAMP:
> > > +		return snprintf(buf, buf_len, "%s %lld", name, payload);
> > > +	case ARM_SPE_ADDRESS:
> > > +		switch (index) {
> > > +		case 0:
> > > +		case 1: ns = !!(packet->payload & NS_FLAG);
> > > +			el = (packet->payload & EL_FLAG) >> 61;
> > > +			payload &= ~(0xffULL << 56);
> > > +			return snprintf(buf, buf_len, "%s %llx el%d ns=%d",
> > > +				        (index == 1) ? "TGT" : "PC", payload, el, ns);
> > 
> > For TTBR1 addresses, this ends up losing the leading 0xff, giving us
> > invalid addresses, which look odd.
> > 
> > Can we please sign-extend bit 55 so that this gives us valid addresses
> > regardless of TTBR?
> 
> I'll take a look at doing this once I get consistent output from an
> implementation.
> 
> > Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so
> > that things get padded consistently?
> 
> I've added the 0x prefix, but prefer to not fix the length to 016: I
> don't see any direct benefit, rather see benefits to having the length
> vary, for output size control and less obvious reasons, e.g., sorting
> address lines by their length to get a sense of address groups caught
> during the run.  FWIW, Intel doesn't do the 016 either.
> 
> If I've omitted a response to the other comments, it's because they are
> being addressed.

Hi Mark, I've tried to proceed as much as possible without your
response, so if you still have comments to my above comments, please
comment in-line above, otherwise review the v2 patch below?

Thanks,

Kim

>From 464d943dcac15d946863399001174e4dc4e00594 Mon Sep 17 00:00:00 2001
From: Kim Phillips <kim.phillips@arm.com>
Date: Wed, 8 Feb 2017 17:11:57 -0600
Subject: [PATCH v2] perf tools: Add ARM Statistical Profiling Extensions
 (SPE) support

'perf record' and 'perf report --dump-raw-trace' supported in this release

Example usage:

taskset -c 2 ./perf record -C 2 -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=1/ \
		dd if=/dev/zero of=/dev/null count=10000

perf report --dump-raw-trace

Note that the perf.data file is portable, so the report can be run on another
architecture host if necessary.

Output will contain raw SPE data and its textual representation, such as:

0xc7d0 [0x30]: PERF_RECORD_AUXTRACE size: 0x82f70  offset: 0  ref: 0x1e947e88189  idx: 0  tid: -1  cpu: 2
.
. ... ARM SPE data: size 536432 bytes
.  00000000:  4a 01                                           B COND
.  00000002:  b1 00 00 00 00 00 00 00 80                      TGT 0 el0 ns=1
.  0000000b:  42 42                                           RETIRED NOT-TAKEN
.  0000000d:  b0 20 41 c0 ad ff ff 00 80                      PC ffffadc04120 el0 ns=1
.  00000016:  98 00 00                                        LAT 0 TOT
.  00000019:  71 80 3e f7 46 e9 01 00 00                      TS 2101429616256
.  00000022:  49 01                                           ST
.  00000024:  b2 50 bd ba 73 00 80 ff ff                      VA ffff800073babd50
.  0000002d:  b3 50 bd ba f3 00 00 00 80                      PA f3babd50 ns=1
.  00000036:  9a 00 00                                        LAT 0 XLAT
.  00000039:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS
.  0000003b:  b0 8c b4 1e 08 00 00 ff ff                      PC ff0000081eb48c el3 ns=1
.  00000044:  98 00 00                                        LAT 0 TOT
.  00000047:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868
.  00000050:  48 00                                           INSN-OTHER
.  00000052:  42 02                                           RETIRED
.  00000054:  b0 58 54 1f 08 00 00 ff ff                      PC ff0000081f5458 el3 ns=1
.  0000005d:  98 00 00                                        LAT 0 TOT
.  00000060:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868
...

Other release notes:

- applies to acme's perf/{core,urgent} branches, likely elsewhere

- Record requires Will's SPE driver, currently undergoing upstream review

- the intel-bts implementation was used as a starting point; its
  min/default/max buffer sizes and power of 2 pages granularity need to be
  revisited for ARM SPE

- multiple SPE clusters/domains support pending potential driver changes?

- snapshot support (record -S), and conversion to native perf events
  (e.g., via 'perf inject --itrace'), are still in development

- technically both cs-etm and spe can be used simultaneously, however
  disabled for simplicity in this release

Signed-off-by: Kim Phillips <kim.phillips@arm.com>
---
v2: mostly addressing Mark Rutland's comments as much as possible without his
feedback to my feedback:

- decoder refactored with a get_payload, not extended to with-ext_len ones like
  get_addr,  named the constants

- 0x-ified %x output formats, but decided to not sign extend the addresses in
  the raw dump, rather do so if necessary in the synthesis stage:
  SPE implementations differ in this area, and raw dump should reflect that.

- CPU mask / new record behaviour bisected to commit e3ba76deef23064 "perf
  tools: Force uncore events to system wide monitoring".  Waiting to hear back
  on why driver can't do system wide monitoring, even across PPIs, by e.g.,
  sharing the SPE interrupts in one handler (SPE's don't differ in this record
  regard).

- addressed off-list comment from M. Williams:
  "Instruction Type" packet was renamed as "Operation Type".
   so in the spe packet decoder: INSN_TYPE -> OP_TYPE

- do_get_packet fixed to handle excessive, successive PADding from a new source
  of raw SPE data, so instead of:

	.  000011ae:  00                                              PAD
	.  000011af:  00                                              PAD
	.  000011b0:  00                                              PAD
	.  000011b1:  00                                              PAD
	.  000011b2:  00                                              PAD
	.  000011b3:  00                                              PAD
	.  000011b4:  00                                              PAD
	.  000011b5:  00                                              PAD
	.  000011b6:  00                                              PAD

  we now get:

	.  000011ae:  00 00 00 00 00 00 00 00 00                      PAD

- fixed 52 00 00 decoded with an empty events clause, adding 'EV' for all events
  clauses now.  parser writers can detect for empty event clauses by finding
  nothing after it.

- patch available and rebased on top of linux-will.git/perf/spe's
  latest, including an attempt to use David Howell's prctl work:

	https://patchwork.kernel.org/patch/9786501/

  to make the driver more communicative to users, here:

	http://www.linux-arm.org/git?p=linux-kp.git;a=shortlog;h=refs/heads/spe-prctl
  or 
	git://linux-arm.org/linux-kp.git  # spe-prctl branch

 tools/perf/arch/arm/util/auxtrace.c   |  20 +-
 tools/perf/arch/arm/util/pmu.c        |   3 +
 tools/perf/arch/arm64/util/Build      |   3 +-
 tools/perf/arch/arm64/util/arm-spe.c  | 210 +++++++++++++++
 tools/perf/util/Build                 |   2 +
 tools/perf/util/arm-spe-pkt-decoder.c | 464 ++++++++++++++++++++++++++++++++++
 tools/perf/util/arm-spe-pkt-decoder.h |  52 ++++
 tools/perf/util/arm-spe.c             | 318 +++++++++++++++++++++++
 tools/perf/util/arm-spe.h             |  39 +++
 tools/perf/util/auxtrace.c            |   3 +
 tools/perf/util/auxtrace.h            |   1 +
 11 files changed, 1111 insertions(+), 4 deletions(-)
 create mode 100644 tools/perf/arch/arm64/util/arm-spe.c
 create mode 100644 tools/perf/util/arm-spe-pkt-decoder.c
 create mode 100644 tools/perf/util/arm-spe-pkt-decoder.h
 create mode 100644 tools/perf/util/arm-spe.c
 create mode 100644 tools/perf/util/arm-spe.h

diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c
index 8edf2cb71564..ec071609e8ac 100644
--- a/tools/perf/arch/arm/util/auxtrace.c
+++ b/tools/perf/arch/arm/util/auxtrace.c
@@ -22,29 +22,43 @@
 #include "../../util/evlist.h"
 #include "../../util/pmu.h"
 #include "cs-etm.h"
+#include "arm-spe.h"
 
 struct auxtrace_record
 *auxtrace_record__init(struct perf_evlist *evlist, int *err)
 {
-	struct perf_pmu	*cs_etm_pmu;
+	struct perf_pmu	*cs_etm_pmu, *arm_spe_pmu;
 	struct perf_evsel *evsel;
-	bool found_etm = false;
+	bool found_etm = false, found_spe = false;
 
 	cs_etm_pmu = perf_pmu__find(CORESIGHT_ETM_PMU_NAME);
+	arm_spe_pmu = perf_pmu__find(ARM_SPE_PMU_NAME);
 
 	if (evlist) {
 		evlist__for_each_entry(evlist, evsel) {
 			if (cs_etm_pmu &&
 			    evsel->attr.type == cs_etm_pmu->type)
 				found_etm = true;
+			if (arm_spe_pmu &&
+			    evsel->attr.type == arm_spe_pmu->type)
+				found_spe = true;
 		}
 	}
 
+	if (found_etm && found_spe) {
+		pr_err("Concurrent ARM Coresight ETM and SPE operation not currently supported\n");
+		*err = -EOPNOTSUPP;
+		return NULL;
+	}
+
 	if (found_etm)
 		return cs_etm_record_init(err);
 
+	if (found_spe)
+		return arm_spe_recording_init(err);
+
 	/*
-	 * Clear 'err' even if we haven't found a cs_etm event - that way perf
+	 * Clear 'err' even if we haven't found an event - that way perf
 	 * record can still be used even if tracers aren't present.  The NULL
 	 * return value will take care of telling the infrastructure HW tracing
 	 * isn't available.
diff --git a/tools/perf/arch/arm/util/pmu.c b/tools/perf/arch/arm/util/pmu.c
index 98d67399a0d6..71fb8f13b40a 100644
--- a/tools/perf/arch/arm/util/pmu.c
+++ b/tools/perf/arch/arm/util/pmu.c
@@ -20,6 +20,7 @@
 #include <linux/perf_event.h>
 
 #include "cs-etm.h"
+#include "arm-spe.h"
 #include "../../util/pmu.h"
 
 struct perf_event_attr
@@ -31,6 +32,8 @@ struct perf_event_attr
 		pmu->selectable = true;
 		pmu->set_drv_config = cs_etm_set_drv_config;
 	}
+	if (!strcmp(pmu->name, ARM_SPE_PMU_NAME))
+		pmu->selectable = true;
 #endif
 	return NULL;
 }
diff --git a/tools/perf/arch/arm64/util/Build b/tools/perf/arch/arm64/util/Build
index cef6fb38d17e..f9969bb88ccb 100644
--- a/tools/perf/arch/arm64/util/Build
+++ b/tools/perf/arch/arm64/util/Build
@@ -3,4 +3,5 @@ libperf-$(CONFIG_LOCAL_LIBUNWIND) += unwind-libunwind.o
 
 libperf-$(CONFIG_AUXTRACE) += ../../arm/util/pmu.o \
 			      ../../arm/util/auxtrace.o \
-			      ../../arm/util/cs-etm.o
+			      ../../arm/util/cs-etm.o \
+			      arm-spe.o
diff --git a/tools/perf/arch/arm64/util/arm-spe.c b/tools/perf/arch/arm64/util/arm-spe.c
new file mode 100644
index 000000000000..0b37f364bd62
--- /dev/null
+++ b/tools/perf/arch/arm64/util/arm-spe.c
@@ -0,0 +1,210 @@
+/*
+ * ARM Statistical Profiling Extensions (SPE) support
+ * Copyright (c) 2017, ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/bitops.h>
+#include <linux/log2.h>
+
+#include "../../util/cpumap.h"
+#include "../../util/evsel.h"
+#include "../../util/evlist.h"
+#include "../../util/session.h"
+#include "../../util/util.h"
+#include "../../util/pmu.h"
+#include "../../util/debug.h"
+#include "../../util/tsc.h"
+#include "../../util/auxtrace.h"
+#include "../../util/arm-spe.h"
+
+#define KiB(x) ((x) * 1024)
+#define MiB(x) ((x) * 1024 * 1024)
+
+struct arm_spe_recording {
+	struct auxtrace_record		itr;
+	struct perf_pmu			*arm_spe_pmu;
+	struct perf_evlist		*evlist;
+};
+
+static size_t
+arm_spe_info_priv_size(struct auxtrace_record *itr __maybe_unused,
+			 struct perf_evlist *evlist __maybe_unused)
+{
+	return ARM_SPE_AUXTRACE_PRIV_SIZE;
+}
+
+static int arm_spe_info_fill(struct auxtrace_record *itr,
+			       struct perf_session *session,
+			       struct auxtrace_info_event *auxtrace_info,
+			       size_t priv_size)
+{
+	struct arm_spe_recording *sper =
+			container_of(itr, struct arm_spe_recording, itr);
+	struct perf_pmu *arm_spe_pmu = sper->arm_spe_pmu;
+
+	if (priv_size != ARM_SPE_AUXTRACE_PRIV_SIZE)
+		return -EINVAL;
+
+	if (!session->evlist->nr_mmaps)
+		return -EINVAL;
+
+	auxtrace_info->type = PERF_AUXTRACE_ARM_SPE;
+	auxtrace_info->priv[ARM_SPE_PMU_TYPE] = arm_spe_pmu->type;
+
+	return 0;
+}
+
+static int arm_spe_recording_options(struct auxtrace_record *itr,
+				       struct perf_evlist *evlist,
+				       struct record_opts *opts)
+{
+	struct arm_spe_recording *sper =
+			container_of(itr, struct arm_spe_recording, itr);
+	struct perf_pmu *arm_spe_pmu = sper->arm_spe_pmu;
+	struct perf_evsel *evsel, *arm_spe_evsel = NULL;
+	const struct cpu_map *cpus = evlist->cpus;
+	bool privileged = geteuid() == 0 || perf_event_paranoid() < 0;
+	struct perf_evsel *tracking_evsel;
+	int err;
+
+	sper->evlist = evlist;
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (evsel->attr.type == arm_spe_pmu->type) {
+			if (arm_spe_evsel) {
+				pr_err("There may be only one " ARM_SPE_PMU_NAME " event\n");
+				return -EINVAL;
+			}
+			evsel->attr.freq = 0;
+			evsel->attr.sample_period = 1;
+			arm_spe_evsel = evsel;
+			opts->full_auxtrace = true;
+		}
+	}
+
+	if (!opts->full_auxtrace)
+		return 0;
+
+	/* We are in full trace mode but '-m,xyz' wasn't specified */
+	if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) {
+		if (privileged) {
+			opts->auxtrace_mmap_pages = MiB(4) / page_size;
+		} else {
+			opts->auxtrace_mmap_pages = KiB(128) / page_size;
+			if (opts->mmap_pages == UINT_MAX)
+				opts->mmap_pages = KiB(256) / page_size;
+		}
+	}
+
+	/* Validate auxtrace_mmap_pages */
+	if (opts->auxtrace_mmap_pages) {
+		size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size;
+		size_t min_sz = KiB(8);
+
+		if (sz < min_sz || !is_power_of_2(sz)) {
+			pr_err("Invalid mmap size for ARM SPE: must be at least %zuKiB and a power of 2\n",
+			       min_sz / 1024);
+			return -EINVAL;
+		}
+	}
+
+	/*
+	 * To obtain the auxtrace buffer file descriptor, the auxtrace event
+	 * must come first.
+	 */
+	perf_evlist__to_front(evlist, arm_spe_evsel);
+
+	/*
+	 * In the case of per-cpu mmaps, we need the CPU on the
+	 * AUX event.
+	 */
+	if (!cpu_map__empty(cpus))
+		perf_evsel__set_sample_bit(arm_spe_evsel, CPU);
+
+	/* Add dummy event to keep tracking */
+	err = parse_events(evlist, "dummy:u", NULL);
+	if (err)
+		return err;
+
+	tracking_evsel = perf_evlist__last(evlist);
+	perf_evlist__set_tracking_event(evlist, tracking_evsel);
+
+	tracking_evsel->attr.freq = 0;
+	tracking_evsel->attr.sample_period = 1;
+
+	/* In per-cpu case, always need the time of mmap events etc */
+	if (!cpu_map__empty(cpus))
+		perf_evsel__set_sample_bit(tracking_evsel, TIME);
+
+	return 0;
+}
+
+static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)
+{
+	u64 ts;
+
+	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
+
+	return ts;
+}
+
+static void arm_spe_recording_free(struct auxtrace_record *itr)
+{
+	struct arm_spe_recording *sper =
+			container_of(itr, struct arm_spe_recording, itr);
+
+       free(sper);
+}
+
+static int arm_spe_read_finish(struct auxtrace_record *itr, int idx)
+{
+	struct arm_spe_recording *sper =
+			container_of(itr, struct arm_spe_recording, itr);
+	struct perf_evsel *evsel;
+
+	evlist__for_each_entry(sper->evlist, evsel) {
+		if (evsel->attr.type == sper->arm_spe_pmu->type)
+			return perf_evlist__enable_event_idx(sper->evlist,
+							     evsel, idx);
+	}
+	return -EINVAL;
+}
+
+struct auxtrace_record *arm_spe_recording_init(int *err)
+{
+	struct perf_pmu *arm_spe_pmu = perf_pmu__find(ARM_SPE_PMU_NAME);
+	struct arm_spe_recording *sper;
+
+	if (!arm_spe_pmu) {
+		*err = -ENODEV;
+		return NULL;
+	}
+
+	sper = zalloc(sizeof(struct arm_spe_recording));
+	if (!sper) {
+		*err = -ENOMEM;
+		return NULL;
+	}
+
+	sper->arm_spe_pmu = arm_spe_pmu;
+	sper->itr.recording_options = arm_spe_recording_options;
+	sper->itr.info_priv_size = arm_spe_info_priv_size;
+	sper->itr.info_fill = arm_spe_info_fill;
+	sper->itr.free = arm_spe_recording_free;
+	sper->itr.reference = arm_spe_reference;
+	sper->itr.read_finish = arm_spe_read_finish;
+	sper->itr.alignment = 0;
+	return &sper->itr;
+}
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 79dea95a7f68..4ed31e88b8ee 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -82,6 +82,8 @@ libperf-$(CONFIG_AUXTRACE) += auxtrace.o
 libperf-$(CONFIG_AUXTRACE) += intel-pt-decoder/
 libperf-$(CONFIG_AUXTRACE) += intel-pt.o
 libperf-$(CONFIG_AUXTRACE) += intel-bts.o
+libperf-$(CONFIG_AUXTRACE) += arm-spe.o
+libperf-$(CONFIG_AUXTRACE) += arm-spe-pkt-decoder.o
 libperf-y += parse-branch-options.o
 libperf-y += dump-insn.o
 libperf-y += parse-regs-options.o
diff --git a/tools/perf/util/arm-spe-pkt-decoder.c b/tools/perf/util/arm-spe-pkt-decoder.c
new file mode 100644
index 000000000000..aeae921dd79d
--- /dev/null
+++ b/tools/perf/util/arm-spe-pkt-decoder.c
@@ -0,0 +1,464 @@
+/*
+ * ARM Statistical Profiling Extensions (SPE) support
+ * Copyright (c) 2017, ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include <stdio.h>
+#include <string.h>
+#include <endian.h>
+#include <byteswap.h>
+
+#include "arm-spe-pkt-decoder.h"
+
+#define BIT(n)		(1 << (n))
+
+#define BIT61		((uint64_t)1 << 61)
+#define BIT62		((uint64_t)1 << 62)
+#define BIT63		((uint64_t)1 << 63)
+
+#define NS_FLAG		BIT63
+#define EL_FLAG		(BIT62 | BIT61)
+
+#define SPE_HEADER0_PAD			0x0
+#define SPE_HEADER0_END			0x1
+#define SPE_HEADER0_ADDRESS		0x30 /* address packet (short) */
+#define SPE_HEADER0_ADDRESS_MASK	0x38
+#define SPE_HEADER0_COUNTER		0x18 /* counter packet (short) */
+#define SPE_HEADER0_COUNTER_MASK	0x38
+#define SPE_HEADER0_TIMESTAMP		0x71
+#define SPE_HEADER0_TIMESTAMP		0x71
+#define SPE_HEADER0_EVENTS		0x2
+#define SPE_HEADER0_EVENTS_MASK		0xf
+#define SPE_HEADER0_SOURCE		0x3
+#define SPE_HEADER0_SOURCE_MASK		0xf
+#define SPE_HEADER0_CONTEXT		0x24
+#define SPE_HEADER0_CONTEXT_MASK	0x3c
+#define SPE_HEADER0_OP_TYPE		0x8
+#define SPE_HEADER0_OP_TYPE_MASK	0x3c
+#define SPE_HEADER1_ALIGNMENT		0x0
+#define SPE_HEADER1_ADDRESS		0xb0 /* address packet (extended) */
+#define SPE_HEADER1_ADDRESS_MASK	0xf8
+#define SPE_HEADER1_COUNTER		0x98 /* counter packet (extended) */
+#define SPE_HEADER1_COUNTER_MASK	0xf8
+
+#if __BYTE_ORDER == __BIG_ENDIAN
+#define le16_to_cpu bswap_16
+#define le32_to_cpu bswap_32
+#define le64_to_cpu bswap_64
+#define memcpy_le64(d, s, n) do { \
+	memcpy((d), (s), (n));    \
+	*(d) = le64_to_cpu(*(d)); \
+} while (0)
+#else
+#define le16_to_cpu
+#define le32_to_cpu
+#define le64_to_cpu
+#define memcpy_le64 memcpy
+#endif
+
+static const char * const arm_spe_packet_name[] = {
+	[ARM_SPE_PAD]		= "PAD",
+	[ARM_SPE_END]		= "END",
+	[ARM_SPE_TIMESTAMP]	= "TS",
+	[ARM_SPE_ADDRESS]	= "ADDR",
+	[ARM_SPE_COUNTER]	= "LAT",
+	[ARM_SPE_CONTEXT]	= "CONTEXT",
+	[ARM_SPE_OP_TYPE]	= "OP-TYPE",
+	[ARM_SPE_EVENTS]	= "EVENTS",
+	[ARM_SPE_DATA_SOURCE]	= "DATA-SOURCE",
+};
+
+const char *arm_spe_pkt_name(enum arm_spe_pkt_type type)
+{
+	return arm_spe_packet_name[type];
+}
+
+/* return ARM SPE payload size from its encoding:
+ * 00 : byte
+ * 01 : halfword (2)
+ * 10 : word (4)
+ * 11 : doubleword (8)
+ */
+static int payloadlen(unsigned char byte)
+{
+	return 1 << ((byte & 0x30) >> 4);
+}
+
+static int arm_spe_get_payload(const unsigned char *buf, size_t len,
+			       struct arm_spe_pkt *packet)
+{
+	size_t payload_len = payloadlen(buf[0]);
+
+	if (len < 1 + payload_len)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	switch (payload_len) {
+	case 1: packet->payload = *(uint8_t *)(buf + 1); break;
+	case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;
+	case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;
+	case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;
+	default: return ARM_SPE_BAD_PACKET;
+	}
+
+	return 1 + payload_len;
+}
+
+static int arm_spe_get_pad(struct arm_spe_pkt *packet)
+{
+	packet->type = ARM_SPE_PAD;
+	return 1;
+}
+
+static int arm_spe_get_alignment(const unsigned char *buf, size_t len,
+				 struct arm_spe_pkt *packet)
+{
+	unsigned int alignment = 1 << ((buf[0] & 0xf) + 1);
+
+	if (len < alignment)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	packet->type = ARM_SPE_PAD;
+	return alignment - (((uint64_t)buf) & (alignment - 1));
+}
+
+static int arm_spe_get_end(struct arm_spe_pkt *packet)
+{
+	packet->type = ARM_SPE_END;
+	return 1;
+}
+
+static int arm_spe_get_timestamp(const unsigned char *buf, size_t len,
+				 struct arm_spe_pkt *packet)
+{
+	packet->type = ARM_SPE_TIMESTAMP;
+	return arm_spe_get_payload(buf, len, packet);
+}
+
+static int arm_spe_get_events(const unsigned char *buf, size_t len,
+			      struct arm_spe_pkt *packet)
+{
+	int ret = arm_spe_get_payload(buf, len, packet);
+
+	packet->type = ARM_SPE_EVENTS;
+	packet->index = ret - 1;
+
+	return ret;
+}
+
+static int arm_spe_get_data_source(const unsigned char *buf, size_t len,
+				   struct arm_spe_pkt *packet)
+{
+	packet->type = ARM_SPE_DATA_SOURCE;
+	return arm_spe_get_payload(buf, len, packet);
+}
+
+static int arm_spe_get_context(const unsigned char *buf, size_t len,
+			       struct arm_spe_pkt *packet)
+{
+	packet->type = ARM_SPE_CONTEXT;
+	packet->index = buf[0] & 0x3;
+
+	return arm_spe_get_payload(buf, len, packet);
+}
+
+static int arm_spe_get_op_type(const unsigned char *buf, size_t len,
+			       struct arm_spe_pkt *packet)
+{
+	packet->type = ARM_SPE_OP_TYPE;
+	packet->index = buf[0] & 0x3;
+	return arm_spe_get_payload(buf, len, packet);
+}
+
+static int arm_spe_get_counter(const unsigned char *buf, size_t len,
+			       const unsigned char ext_hdr, struct arm_spe_pkt *packet)
+{
+	if (len < 2)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	packet->type = ARM_SPE_COUNTER;
+	if (ext_hdr)
+		packet->index = ((buf[0] & 0x3) << 3) | (buf[1] & 0x7);
+	else
+		packet->index = buf[0] & 0x7;
+
+	packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1));
+
+	return 1 + ext_hdr + 2;
+}
+
+static int arm_spe_get_addr(const unsigned char *buf, size_t len,
+			    const unsigned char ext_hdr, struct arm_spe_pkt *packet)
+{
+	if (len < 8)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	packet->type = ARM_SPE_ADDRESS;
+	if (ext_hdr)
+		packet->index = ((buf[0] & 0x3) << 3) | (buf[1] & 0x7);
+	else
+		packet->index = buf[0] & 0x7;
+
+	memcpy_le64(&packet->payload, buf + 1, 8);
+
+	return 1 + ext_hdr + 8;
+}
+
+static int arm_spe_do_get_packet(const unsigned char *buf, size_t len,
+				 struct arm_spe_pkt *packet)
+{
+	unsigned int byte;
+
+	memset(packet, 0, sizeof(struct arm_spe_pkt));
+
+	if (!len)
+		return ARM_SPE_NEED_MORE_BYTES;
+
+	byte = buf[0];
+	if (byte == SPE_HEADER0_PAD)
+		return arm_spe_get_pad(packet);
+	else if (byte == SPE_HEADER0_END) /* no timestamp at end of record */
+		return arm_spe_get_end(packet);
+	else if (byte & 0xc0 /* 0y11xxxxxx */) {
+		if (byte & 0x80) {
+			if ((byte & SPE_HEADER0_ADDRESS_MASK) == SPE_HEADER0_ADDRESS)
+				return arm_spe_get_addr(buf, len, 0, packet);
+			if ((byte & SPE_HEADER0_COUNTER_MASK) == SPE_HEADER0_COUNTER)
+				return arm_spe_get_counter(buf, len, 0, packet);
+		} else
+			if (byte == SPE_HEADER0_TIMESTAMP)
+				return arm_spe_get_timestamp(buf, len, packet);
+			else if ((byte & SPE_HEADER0_EVENTS_MASK) == SPE_HEADER0_EVENTS)
+				return arm_spe_get_events(buf, len, packet);
+			else if ((byte & SPE_HEADER0_SOURCE_MASK) == SPE_HEADER0_SOURCE)
+				return arm_spe_get_data_source(buf, len, packet);
+			else if ((byte & SPE_HEADER0_CONTEXT_MASK) == SPE_HEADER0_CONTEXT)
+				return arm_spe_get_context(buf, len, packet);
+			else if ((byte & SPE_HEADER0_OP_TYPE_MASK) == SPE_HEADER0_OP_TYPE)
+				return arm_spe_get_op_type(buf, len, packet);
+	} else if ((byte & 0xe0) == 0x20 /* 0y001xxxxx */) {
+		/* 16-bit header */ 
+		byte = buf[1];
+		if (byte == SPE_HEADER1_ALIGNMENT)
+			return arm_spe_get_alignment(buf, len, packet);
+		else if ((byte & SPE_HEADER1_ADDRESS_MASK) == SPE_HEADER1_ADDRESS)
+			return arm_spe_get_addr(buf, len, 1, packet);
+		else if ((byte & SPE_HEADER1_COUNTER_MASK) == SPE_HEADER1_COUNTER)
+			return arm_spe_get_counter(buf, len, 1, packet);
+	}
+
+	return ARM_SPE_BAD_PACKET;
+}
+
+int arm_spe_get_packet(const unsigned char *buf, size_t len,
+		       struct arm_spe_pkt *packet)
+{
+	int ret;
+
+	ret = arm_spe_do_get_packet(buf, len, packet);
+	if (ret > 0 && packet->type == ARM_SPE_PAD) {
+		while (ret < 16 && len > (size_t)ret && !buf[ret])
+			ret += 1;
+	}
+	return ret;
+}
+
+int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
+		     size_t buf_len)
+{
+	int ret, ns, el, index = packet->index;
+	unsigned long long payload = packet->payload;
+	const char *name = arm_spe_pkt_name(packet->type);
+
+	switch (packet->type) {
+	case ARM_SPE_BAD:
+	case ARM_SPE_PAD:
+	case ARM_SPE_END:
+		return snprintf(buf, buf_len, "%s", name);
+	case ARM_SPE_EVENTS: {
+		size_t blen = buf_len;
+
+		ret = 0;
+		ret = snprintf(buf, buf_len, "EV");
+		buf += ret;
+		blen -= ret;
+		if (payload & 0x1) {
+			ret = snprintf(buf, buf_len, " EXCEPTION-GEN");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x2) {
+			ret = snprintf(buf, buf_len, " RETIRED");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x4) {
+			ret = snprintf(buf, buf_len, " L1D-ACCESS");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x8) {
+			ret = snprintf(buf, buf_len, " L1D-REFILL");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x10) {
+			ret = snprintf(buf, buf_len, " TLB-ACCESS");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x20) {
+			ret = snprintf(buf, buf_len, " TLB-REFILL");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x40) {
+			ret = snprintf(buf, buf_len, " NOT-TAKEN");
+			buf += ret;
+			blen -= ret;
+		}
+		if (payload & 0x80) {
+			ret = snprintf(buf, buf_len, " MISPRED");
+			buf += ret;
+			blen -= ret;
+		}
+		if (index > 1) {
+			if (payload & 0x100) {
+				ret = snprintf(buf, buf_len, " LLC-ACCESS");
+				buf += ret;
+				blen -= ret;
+			}
+			if (payload & 0x200) {
+				ret = snprintf(buf, buf_len, " LLC-REFILL");
+				buf += ret;
+				blen -= ret;
+			}
+			if (payload & 0x400) {
+				ret = snprintf(buf, buf_len, " REMOTE-ACCESS");
+				buf += ret;
+				blen -= ret;
+			}
+		}
+		if (ret < 0)
+			return ret;
+		blen -= ret;
+		return buf_len - blen;
+	}
+	case ARM_SPE_OP_TYPE:
+		switch (index) {
+		case 0:	return snprintf(buf, buf_len, "%s", payload & 0x1 ?
+					"COND-SELECT" : "INSN-OTHER");
+		case 1:	{
+			size_t blen = buf_len;
+
+			if (payload & 0x1)
+				ret = snprintf(buf, buf_len, "ST");
+			else
+				ret = snprintf(buf, buf_len, "LD");
+			buf += ret;
+			blen -= ret;
+			if (payload & 0x2) {
+				if (payload & 0x4) {
+					ret = snprintf(buf, buf_len, " AT");
+					buf += ret;
+					blen -= ret;
+				}
+				if (payload & 0x8) {
+					ret = snprintf(buf, buf_len, " EXCL");
+					buf += ret;
+					blen -= ret;
+				}
+				if (payload & 0x10) {
+					ret = snprintf(buf, buf_len, " AR");
+					buf += ret;
+					blen -= ret;
+				}
+			} else if (payload & 0x4) {
+				ret = snprintf(buf, buf_len, " SIMD-FP");
+				buf += ret;
+				blen -= ret;
+			}
+			if (ret < 0)
+				return ret;
+			blen -= ret;
+			return buf_len - blen;
+		}
+		case 2:	{
+			size_t blen = buf_len;
+
+			ret = snprintf(buf, buf_len, "B");
+			buf += ret;
+			blen -= ret;
+			if (payload & 0x1) {
+				ret = snprintf(buf, buf_len, " COND");
+				buf += ret;
+				blen -= ret;
+			}
+			if (payload & 0x2) {
+				ret = snprintf(buf, buf_len, " IND");
+				buf += ret;
+				blen -= ret;
+			}
+			if (ret < 0)
+				return ret;
+			blen -= ret;
+			return buf_len - blen;
+			}
+		default: return 0;
+		}
+	case ARM_SPE_DATA_SOURCE:
+	case ARM_SPE_TIMESTAMP:
+		return snprintf(buf, buf_len, "%s %lld", name, payload);
+	case ARM_SPE_ADDRESS:
+		switch (index) {
+		case 0:
+		case 1: ns = !!(packet->payload & NS_FLAG);
+			el = (packet->payload & EL_FLAG) >> 61;
+			payload &= ~(0xffULL << 56);
+			return snprintf(buf, buf_len, "%s 0x%llx el%d ns=%d",
+				        (index == 1) ? "TGT" : "PC", payload, el, ns);
+		case 2:	return snprintf(buf, buf_len, "VA 0x%llx", payload);
+		case 3:	ns = !!(packet->payload & NS_FLAG);
+			payload &= ~(0xffULL << 56);
+			return snprintf(buf, buf_len, "PA 0x%llx ns=%d",
+					payload, ns);
+		default: return 0;
+		}
+	case ARM_SPE_CONTEXT:
+		return snprintf(buf, buf_len, "%s 0x%lx el%d", name,
+				(unsigned long)payload, index + 1);
+	case ARM_SPE_COUNTER: {
+		size_t blen = buf_len;
+
+		ret = snprintf(buf, buf_len, "%s %d ", name,
+			       (unsigned short)payload);
+		buf += ret;
+		blen -= ret;
+		switch (index) {
+		case 0:	ret = snprintf(buf, buf_len, "TOT"); break;
+		case 1:	ret = snprintf(buf, buf_len, "ISSUE"); break;
+		case 2:	ret = snprintf(buf, buf_len, "XLAT"); break;
+		default: ret = 0;
+		}
+		if (ret < 0)
+			return ret;
+		blen -= ret;
+		return buf_len - blen;
+	}
+	default:
+		break;
+	}
+
+	return snprintf(buf, buf_len, "%s 0x%llx (%d)",
+			name, payload, packet->index);
+}
+
diff --git a/tools/perf/util/arm-spe-pkt-decoder.h b/tools/perf/util/arm-spe-pkt-decoder.h
new file mode 100644
index 000000000000..f146f4143447
--- /dev/null
+++ b/tools/perf/util/arm-spe-pkt-decoder.h
@@ -0,0 +1,52 @@
+/*
+ * ARM Statistical Profiling Extensions (SPE) support
+ * Copyright (c) 2017, ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#ifndef INCLUDE__ARM_SPE_PKT_DECODER_H__
+#define INCLUDE__ARM_SPE_PKT_DECODER_H__
+
+#include <stddef.h>
+#include <stdint.h>
+
+#define ARM_SPE_PKT_DESC_MAX		256
+
+#define ARM_SPE_NEED_MORE_BYTES		-1
+#define ARM_SPE_BAD_PACKET		-2
+
+enum arm_spe_pkt_type {
+	ARM_SPE_BAD,
+	ARM_SPE_PAD,
+	ARM_SPE_END,
+	ARM_SPE_TIMESTAMP,
+	ARM_SPE_ADDRESS,
+	ARM_SPE_COUNTER,
+	ARM_SPE_CONTEXT,
+	ARM_SPE_OP_TYPE,
+	ARM_SPE_EVENTS,
+	ARM_SPE_DATA_SOURCE,
+};
+
+struct arm_spe_pkt {
+	enum arm_spe_pkt_type	type;
+	unsigned char		index;
+	uint64_t		payload;
+};
+
+const char *arm_spe_pkt_name(enum arm_spe_pkt_type);
+
+int arm_spe_get_packet(const unsigned char *buf, size_t len,
+		       struct arm_spe_pkt *packet);
+
+int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf, size_t len);
+#endif
diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
new file mode 100644
index 000000000000..f3eccd73b54a
--- /dev/null
+++ b/tools/perf/util/arm-spe.c
@@ -0,0 +1,318 @@
+/*
+ * ARM Statistical Profiling Extensions (SPE) support
+ * Copyright (c) 2017, ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include <endian.h>
+#include <errno.h>
+#include <byteswap.h>
+#include <inttypes.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/bitops.h>
+#include <linux/log2.h>
+
+#include "cpumap.h"
+#include "color.h"
+#include "evsel.h"
+#include "evlist.h"
+#include "machine.h"
+#include "session.h"
+#include "util.h"
+#include "thread.h"
+#include "debug.h"
+#include "auxtrace.h"
+#include "arm-spe.h"
+#include "arm-spe-pkt-decoder.h"
+
+struct arm_spe {
+	struct auxtrace			auxtrace;
+	struct auxtrace_queues		queues;
+	struct auxtrace_heap		heap;
+	u32				auxtrace_type;
+	struct perf_session		*session;
+	struct machine			*machine;
+	u32				pmu_type;
+};
+
+struct arm_spe_queue {
+	struct arm_spe		*spe;
+	unsigned int		queue_nr;
+	struct auxtrace_buffer	*buffer;
+	bool			on_heap;
+	bool			done;
+	pid_t			pid;
+	pid_t			tid;
+	int			cpu;
+};
+
+static void arm_spe_dump(struct arm_spe *spe __maybe_unused,
+			   unsigned char *buf, size_t len)
+{
+	struct arm_spe_pkt packet;
+	size_t pos = 0;
+	int ret, pkt_len, i;
+	char desc[ARM_SPE_PKT_DESC_MAX];
+	const char *color = PERF_COLOR_BLUE;
+
+	color_fprintf(stdout, color,
+		      ". ... ARM SPE data: size %zu bytes\n",
+		      len);
+
+	while (len) {
+		ret = arm_spe_get_packet(buf, len, &packet);
+		if (ret > 0)
+			pkt_len = ret;
+		else
+			pkt_len = 1;
+		printf(".");
+		color_fprintf(stdout, color, "  %08x: ", pos);
+		for (i = 0; i < pkt_len; i++)
+			color_fprintf(stdout, color, " %02x", buf[i]);
+		for (; i < 16; i++)
+			color_fprintf(stdout, color, "   ");
+		if (ret > 0) {
+			ret = arm_spe_pkt_desc(&packet, desc,
+					       ARM_SPE_PKT_DESC_MAX);
+			if (ret > 0)
+				color_fprintf(stdout, color, " %s\n", desc);
+		} else {
+			color_fprintf(stdout, color, " Bad packet!\n");
+		}
+		pos += pkt_len;
+		buf += pkt_len;
+		len -= pkt_len;
+	}
+}
+
+static void arm_spe_dump_event(struct arm_spe *spe, unsigned char *buf,
+				 size_t len)
+{
+	printf(".\n");
+	arm_spe_dump(spe, buf, len);
+}
+
+static struct arm_spe_queue *arm_spe_alloc_queue(struct arm_spe *spe,
+						     unsigned int queue_nr)
+{
+	struct arm_spe_queue *speq;
+
+	speq = zalloc(sizeof(struct arm_spe_queue));
+	if (!speq)
+		return NULL;
+
+	speq->spe = spe;
+	speq->queue_nr = queue_nr;
+	speq->pid = -1;
+	speq->tid = -1;
+	speq->cpu = -1;
+
+	return speq;
+}
+
+static int arm_spe_setup_queue(struct arm_spe *spe,
+				 struct auxtrace_queue *queue,
+				 unsigned int queue_nr)
+{
+	struct arm_spe_queue *speq = queue->priv;
+
+	if (list_empty(&queue->head))
+		return 0;
+
+	if (!speq) {
+		speq = arm_spe_alloc_queue(spe, queue_nr);
+		if (!speq)
+			return -ENOMEM;
+		queue->priv = speq;
+
+		if (queue->cpu != -1)
+			speq->cpu = queue->cpu;
+		speq->tid = queue->tid;
+	}
+
+	if (!speq->on_heap && !speq->buffer) {
+		int ret;
+
+		speq->buffer = auxtrace_buffer__next(queue, NULL);
+		if (!speq->buffer)
+			return 0;
+
+		ret = auxtrace_heap__add(&spe->heap, queue_nr,
+					 speq->buffer->reference);
+		if (ret)
+			return ret;
+		speq->on_heap = true;
+	}
+
+	return 0;
+}
+
+static int arm_spe_setup_queues(struct arm_spe *spe)
+{
+	unsigned int i;
+	int ret;
+
+	for (i = 0; i < spe->queues.nr_queues; i++) {
+		ret = arm_spe_setup_queue(spe, &spe->queues.queue_array[i],
+					    i);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static inline int arm_spe_update_queues(struct arm_spe *spe)
+{
+	if (spe->queues.new_data) {
+		spe->queues.new_data = false;
+		return arm_spe_setup_queues(spe);
+	}
+	return 0;
+}
+
+static int arm_spe_process_event(struct perf_session *session __maybe_unused,
+				   union perf_event *event __maybe_unused,
+				   struct perf_sample *sample __maybe_unused,
+				   struct perf_tool *tool __maybe_unused)
+{
+	return 0;
+}
+
+static int arm_spe_process_auxtrace_event(struct perf_session *session,
+					    union perf_event *event,
+					    struct perf_tool *tool __maybe_unused)
+{
+	struct arm_spe *spe = container_of(session->auxtrace, struct arm_spe,
+					     auxtrace);
+	struct auxtrace_buffer *buffer;
+	off_t data_offset;
+	int fd = perf_data_file__fd(session->file);
+	int err;
+
+	if (perf_data_file__is_pipe(session->file)) {
+		data_offset = 0;
+	} else {
+		data_offset = lseek(fd, 0, SEEK_CUR);
+		if (data_offset == -1)
+			return -errno;
+	}
+
+	err = auxtrace_queues__add_event(&spe->queues, session, event,
+					 data_offset, &buffer);
+	if (err)
+		return err;
+
+	/* Dump here now we have copied a piped trace out of the pipe */
+	if (dump_trace) {
+		if (auxtrace_buffer__get_data(buffer, fd)) {
+			arm_spe_dump_event(spe, buffer->data,
+					     buffer->size);
+			auxtrace_buffer__put_data(buffer);
+		}
+	}
+
+	return 0;
+}
+
+static int arm_spe_flush(struct perf_session *session __maybe_unused,
+			   struct perf_tool *tool __maybe_unused)
+{
+	return 0;
+}
+
+static void arm_spe_free_queue(void *priv)
+{
+	struct arm_spe_queue *speq = priv;
+
+	if (!speq)
+		return;
+	free(speq);
+}
+
+static void arm_spe_free_events(struct perf_session *session)
+{
+	struct arm_spe *spe = container_of(session->auxtrace, struct arm_spe,
+					     auxtrace);
+	struct auxtrace_queues *queues = &spe->queues;
+	unsigned int i;
+
+	for (i = 0; i < queues->nr_queues; i++) {
+		arm_spe_free_queue(queues->queue_array[i].priv);
+		queues->queue_array[i].priv = NULL;
+	}
+	auxtrace_queues__free(queues);
+}
+
+static void arm_spe_free(struct perf_session *session)
+{
+	struct arm_spe *spe = container_of(session->auxtrace, struct arm_spe,
+					     auxtrace);
+
+	auxtrace_heap__free(&spe->heap);
+	arm_spe_free_events(session);
+	session->auxtrace = NULL;
+	free(spe);
+}
+
+static const char * const arm_spe_info_fmts[] = {
+	[ARM_SPE_PMU_TYPE]		= "  PMU Type           %"PRId64"\n",
+};
+
+static void arm_spe_print_info(u64 *arr)
+{
+	if (!dump_trace)
+		return;
+
+	fprintf(stdout, arm_spe_info_fmts[ARM_SPE_PMU_TYPE], arr[ARM_SPE_PMU_TYPE]);
+}
+
+int arm_spe_process_auxtrace_info(union perf_event *event,
+				    struct perf_session *session)
+{
+	struct auxtrace_info_event *auxtrace_info = &event->auxtrace_info;
+	size_t min_sz = sizeof(u64) * ARM_SPE_PMU_TYPE;
+	struct arm_spe *spe;
+	int err;
+
+	if (auxtrace_info->header.size < sizeof(struct auxtrace_info_event) +
+					min_sz)
+		return -EINVAL;
+
+	spe = zalloc(sizeof(struct arm_spe));
+	if (!spe)
+		return -ENOMEM;
+
+	err = auxtrace_queues__init(&spe->queues);
+	if (err)
+		goto err_free;
+
+	spe->session = session;
+	spe->machine = &session->machines.host; /* No kvm support */
+	spe->auxtrace_type = auxtrace_info->type;
+	spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
+
+	spe->auxtrace.process_event = arm_spe_process_event;
+	spe->auxtrace.process_auxtrace_event = arm_spe_process_auxtrace_event;
+	spe->auxtrace.flush_events = arm_spe_flush;
+	spe->auxtrace.free_events = arm_spe_free_events;
+	spe->auxtrace.free = arm_spe_free;
+	session->auxtrace = &spe->auxtrace;
+
+	arm_spe_print_info(&auxtrace_info->priv[0]);
+
+	return 0;
+
+err_free:
+	free(spe);
+	return err;
+}
diff --git a/tools/perf/util/arm-spe.h b/tools/perf/util/arm-spe.h
new file mode 100644
index 000000000000..afa4704c5e5e
--- /dev/null
+++ b/tools/perf/util/arm-spe.h
@@ -0,0 +1,39 @@
+/*
+ * ARM Statistical Profiling Extensions (SPE) support
+ * Copyright (c) 2017, ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#ifndef INCLUDE__PERF_ARM_SPE_H__
+#define INCLUDE__PERF_ARM_SPE_H__
+
+#define ARM_SPE_PMU_NAME "arm_spe_0"
+
+enum {
+	ARM_SPE_PMU_TYPE,
+	ARM_SPE_PER_CPU_MMAPS,
+	ARM_SPE_AUXTRACE_PRIV_MAX,
+};
+
+#define ARM_SPE_AUXTRACE_PRIV_SIZE (ARM_SPE_AUXTRACE_PRIV_MAX * sizeof(u64))
+
+struct auxtrace_record;
+struct perf_tool;
+union perf_event;
+struct perf_session;
+
+struct auxtrace_record *arm_spe_recording_init(int *err);
+
+int arm_spe_process_auxtrace_info(union perf_event *event,
+				  struct perf_session *session);
+
+#endif
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 5547457566a7..92da3981a761 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -57,6 +57,7 @@
 
 #include "intel-pt.h"
 #include "intel-bts.h"
+#include "arm-spe.h"
 
 #include "sane_ctype.h"
 #include "symbol/kallsyms.h"
@@ -913,6 +914,8 @@ int perf_event__process_auxtrace_info(struct perf_tool *tool __maybe_unused,
 		return intel_pt_process_auxtrace_info(event, session);
 	case PERF_AUXTRACE_INTEL_BTS:
 		return intel_bts_process_auxtrace_info(event, session);
+	case PERF_AUXTRACE_ARM_SPE:
+		return arm_spe_process_auxtrace_info(event, session);
 	case PERF_AUXTRACE_CS_ETM:
 	case PERF_AUXTRACE_UNKNOWN:
 	default:
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 33b5e6cdf38c..fa1c8deffac4 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -43,6 +43,7 @@ enum auxtrace_type {
 	PERF_AUXTRACE_INTEL_PT,
 	PERF_AUXTRACE_INTEL_BTS,
 	PERF_AUXTRACE_CS_ETM,
+	PERF_AUXTRACE_ARM_SPE,
 };
 
 enum itrace_period_type {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH] perf tools: Add ARM Statistical Profiling Extensions (SPE) support
  2017-07-18  0:48                         ` Kim Phillips
  2017-08-18  3:11                           ` [PATCH v2] " Kim Phillips
@ 2017-08-18 16:59                           ` Mark Rutland
  2017-08-18 22:22                             ` Kim Phillips
  1 sibling, 1 reply; 33+ messages in thread
From: Mark Rutland @ 2017-08-18 16:59 UTC (permalink / raw)
  To: Kim Phillips
  Cc: Arnaldo Carvalho de Melo, robh, mathieu.poirier, pawel.moll,
	suzuki.poulose, marc.zyngier, Will Deacon, linux-kernel,
	alexander.shishkin, peterz, mingo, tglx, linux-arm-kernel,
	Adrian Hunter, Jiri Olsa, Andi Kleen, Wang Nan

Hi Kim,

Sorry for the late reply. I see you've send an updated version. I'm
replying here first so as to answer your queries, and I intend to look
at the updated patch shortly.

On Mon, Jul 17, 2017 at 07:48:22PM -0500, Kim Phillips wrote:
> On Fri, 30 Jun 2017 15:02:41 +0100 Mark Rutland <mark.rutland@arm.com> wrote:
> > On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote:
> <snip>
> > >  	if (evlist) {
> > >  		evlist__for_each_entry(evlist, evsel) {
> > >  			if (cs_etm_pmu &&
> > >  			    evsel->attr.type == cs_etm_pmu->type)
> > >  				found_etm = true;
> > > +			if (arm_spe_pmu &&
> > > +			    evsel->attr.type == arm_spe_pmu->type)
> > > +				found_spe = true;
> > 
> > Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all
> > SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1"
> > and so on).
> > 
> > Can we not find all PMUs with a "arm_spe_" prefix?
> > 
> > ... or, taking a step back, do we need some sysfs "class" attribute to
> > identify multi-instance PMUs?
> 
> Since there is one SPE per core, and it looks like the buffer full
> interrupt line is the only difference between the SPE device node
> specification in the device tree, I guess I don't understand why the
> driver doesn't accept a singular "arm_spe" from the tool, and manage
> interrupt handling accordingly.

There are also differences which can be probed from the device, which
are not described in the DT (but are described in sysfs). Some of these
are exposed under sysfs.

There may be further differences in subsequent revisions of the
architecture, too. So the safest bet is to expose them separately, as we
do for other CPU-affine PMUs in heterogeneous systems.

> Also, if a set of CPUs are missing SPE support, and the user doesn't
> explicitly define a CPU affinity to outside that partition, then
> decline to run, stating why.

It's possible for userspace to do this regardless; look for the set of
SPE PMUs, and then look at their masks.

[...]

> > > +	/*
> > > +	 * To obtain the auxtrace buffer file descriptor, the auxtrace event
> > > +	 * must come first.
> > > +	 */
> > > +	perf_evlist__to_front(evlist, arm_spe_evsel);
> > 
> > Huh? *what* needs the auxtrace buffer fd?
> > 
> > This seems really fragile. Can't we store this elsewhere?
> 
> It's copied from the bts code, and the other auxtrace record users do
> the same; it looks like auxtrace record has implicit dependencies on it?

Is it definitely required? What happens if this isn't done?

> > > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)
> > > +{
> > > +	u64 ts;
> > > +
> > > +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
> > > +
> > > +	return ts;
> > > +}
> > 
> > I do not think it's a good idea to read the counter directly like this.
> > 
> > What is this "reference" intended to be meaningful relative to?
> 
> AFAICT, it's just a nonce the perf tool uses to track unique events,
> and I thought this better than the ETM driver's heavier get-random
> implementation.
> 
> > Why do we need to do this in userspace?
> > 
> > Can we not ask the kernel to output timestamps instead?
> 
> Why?  This gets the job done faster.

I had assumed that this needed to be correlated with the timestamps in
the event.

If this is a nonce, please don't read the counter directly in this way.
It may be trapped/emulated by a higher EL, making it very heavyweight.
The counter is only exposed so that the VDSO can use it, and that will
avoid using it in cases where it is unsafe.

[...]

> > > +static int arm_spe_get_events(const unsigned char *buf, size_t len,
> > > +			      struct arm_spe_pkt *packet)
> > > +{
> > > +	unsigned int events_len = payloadlen(buf[0]);
> > > +
> > > +	if (len < events_len)
> > > +		return ARM_SPE_NEED_MORE_BYTES;
> > 
> > Isn't len the size of the whole buffer? So isn't this failing to account
> > for the header byte?
> 
> well spotted; I changed /events_len/1 + events_len/.
> 
> > > +	packet->type = ARM_SPE_EVENTS;
> > > +	packet->index = events_len;
> > 
> > Huh? The events packet has no "index" field, so why do we need this?
> 
> To identify Events with a less number of comparisons in arm_spe_pkt_desc():
> E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are
> identified iff index > 1.

It would be clearer to do the additional comparisons there.

Does this make a measureable difference in practice?

> > > +	switch (events_len) {
> > > +	case 1: packet->payload = *(uint8_t *)(buf + 1); break;
> > > +	case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;
> > > +	case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;
> > > +	case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;
> > > +	default: return ARM_SPE_BAD_PACKET;
> > > +	}
> > > +
> > > +	return 1 + events_len;
> > > +}
> > > +
> > > +static int arm_spe_get_data_source(const unsigned char *buf,
> > > +				   struct arm_spe_pkt *packet)
> > > +{
> > > +	int len = payloadlen(buf[0]);
> > > +
> > > +	packet->type = ARM_SPE_DATA_SOURCE;
> > > +	if (len == 1)
> > > +		packet->payload = buf[1];
> > > +	else if (len == 2)
> > > +		packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1));
> > > +
> > > +	return 1 + len;
> > > +}
> > 
> > For those packets with a payload, the header has a uniform format
> > describing the payload size. Given that, can't we make the payload
> > extraction generic, regardless of the packet type?
> > 
> > e.g. something like:
> > 
> > static int arm_spe_get_payload(const unsigned char *buf, size_t len,
> > 			       struct arm_spe_pkt *packet)
> > {
> > 	<determine paylaod size>
> > 	<length check>
> > 	<switch>
> > 	<return nr consumed bytes (inc header), or error>
> > }
> > 
> > static int arm_spe_get_events(const unsigned char *buf, size_t len,
> > 			      struct arm_spe_pkt *packet)
> > {
> > 	packet->type = ARM_SPE_EVENTS;
> > 	return arm_spe_get_payload(buf, len, packet);
> > }
> > 
> > static int arm_spe_get_data_source(const unsigned char *buf,
> > 				   struct arm_spe_pkt *packet)
> > {
> > 	packet->type = ARM_SPE_DATA_SOURCE;
> > 	return arm_spe_get_payload(buf, len, packet);
> > }
> > 
> > ... and so on for the other packets with a payload.
> 
> done for TIMESTAMP, EVENTS, DATA_SOURCE, CONTEXT, INSN_TYPE.  It
> wouldn't fit ADDR and COUNTER well since they can occur in an
> extended-header, and their lengths are encoded differently, and are
> fixed anyway.

Ok. That sounds good to me.

[...]

> > > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
> > > +		     size_t buf_len)
> > > +{
> > > +	int ret, ns, el, index = packet->index;
> > > +	unsigned long long payload = packet->payload;
> > > +	const char *name = arm_spe_pkt_name(packet->type);
> > > +
> > > +	switch (packet->type) {

> > > +	case ARM_SPE_ADDRESS:
> > > +		switch (index) {
> > > +		case 0:
> > > +		case 1: ns = !!(packet->payload & NS_FLAG);
> > > +			el = (packet->payload & EL_FLAG) >> 61;
> > > +			payload &= ~(0xffULL << 56);
> > > +			return snprintf(buf, buf_len, "%s %llx el%d ns=%d",
> > > +				        (index == 1) ? "TGT" : "PC", payload, el, ns);

> > Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so
> > that things get padded consistently?
> 
> I've added the 0x prefix, but prefer to not fix the length to 016: I
> don't see any direct benefit, rather see benefits to having the length
> vary, for output size control and less obvious reasons, e.g., sorting
> address lines by their length to get a sense of address groups caught
> during the run.  FWIW, Intel doesn't do the 016 either.

With padding, sorting will also place address groups together, so I'm
not sure I follow.

Padding makes it *much* easier to scan over the output by eye, as
columns of event data will always share the same alignment.

> If I've omitted a response to the other comments, it's because they
> are being addressed.

Cool!

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2] perf tools: Add ARM Statistical Profiling Extensions (SPE) support
  2017-08-18  3:11                           ` [PATCH v2] " Kim Phillips
@ 2017-08-18 17:36                             ` Mark Rutland
  2017-08-21 23:18                               ` Kim Phillips
  0 siblings, 1 reply; 33+ messages in thread
From: Mark Rutland @ 2017-08-18 17:36 UTC (permalink / raw)
  To: Kim Phillips
  Cc: Arnaldo Carvalho de Melo, robh, mathieu.poirier, pawel.moll,
	suzuki.poulose, marc.zyngier, Will Deacon, linux-kernel,
	alexander.shishkin, peterz, mingo, tglx, linux-arm-kernel,
	Adrian Hunter, Jiri Olsa, Andi Kleen, Wang Nan

Hi Kim,

On Thu, Aug 17, 2017 at 10:11:50PM -0500, Kim Phillips wrote:
> Hi Mark, I've tried to proceed as much as possible without your
> response, so if you still have comments to my above comments, please
> comment in-line above, otherwise review the v2 patch below?

Apologies again for the late response, and thanks for the updated patch!

[...]

> From 464d943dcac15d946863399001174e4dc4e00594 Mon Sep 17 00:00:00 2001
> From: Kim Phillips <kim.phillips@arm.com>
> Date: Wed, 8 Feb 2017 17:11:57 -0600
> Subject: [PATCH v2] perf tools: Add ARM Statistical Profiling Extensions
>  (SPE) support
> 
> 'perf record' and 'perf report --dump-raw-trace' supported in this release
> 
> Example usage:
> 
> taskset -c 2 ./perf record -C 2 -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=1/ \
> 		dd if=/dev/zero of=/dev/null count=10000
> 
> perf report --dump-raw-trace
> 
> Note that the perf.data file is portable, so the report can be run on another
> architecture host if necessary.
> 
> Output will contain raw SPE data and its textual representation, such as:
> 
> 0xc7d0 [0x30]: PERF_RECORD_AUXTRACE size: 0x82f70  offset: 0  ref: 0x1e947e88189  idx: 0  tid: -1  cpu: 2
> .
> . ... ARM SPE data: size 536432 bytes
> .  00000000:  4a 01                                           B COND
> .  00000002:  b1 00 00 00 00 00 00 00 80                      TGT 0 el0 ns=1
> .  0000000b:  42 42                                           RETIRED NOT-TAKEN
> .  0000000d:  b0 20 41 c0 ad ff ff 00 80                      PC ffffadc04120 el0 ns=1
> .  00000016:  98 00 00                                        LAT 0 TOT
> .  00000019:  71 80 3e f7 46 e9 01 00 00                      TS 2101429616256
> .  00000022:  49 01                                           ST
> .  00000024:  b2 50 bd ba 73 00 80 ff ff                      VA ffff800073babd50
> .  0000002d:  b3 50 bd ba f3 00 00 00 80                      PA f3babd50 ns=1
> .  00000036:  9a 00 00                                        LAT 0 XLAT
> .  00000039:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS
> .  0000003b:  b0 8c b4 1e 08 00 00 ff ff                      PC ff0000081eb48c el3 ns=1
> .  00000044:  98 00 00                                        LAT 0 TOT
> .  00000047:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868
> .  00000050:  48 00                                           INSN-OTHER
> .  00000052:  42 02                                           RETIRED
> .  00000054:  b0 58 54 1f 08 00 00 ff ff                      PC ff0000081f5458 el3 ns=1
> .  0000005d:  98 00 00                                        LAT 0 TOT
> .  00000060:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868

So FWIW, I think this is a good example of why that padding I requested
last time round matters.

For the first PC packet, I had to count the number of characters to see
that it was a TTBR0 address, which is made much clearer with leading
padding, as 0000ffffadc04120. With the addresses padded, the EL and NS
fields would also be aligned, making it *much* easier to scan by eye.

[...]

> - multiple SPE clusters/domains support pending potential driver changes?

As covered in my other reply, I don't believe that the driver is going
to change in this regard. Userspace will need to handle multiple SPE
instances.

I'll ignore that in the code below for now.

> - CPU mask / new record behaviour bisected to commit e3ba76deef23064 "perf
>   tools: Force uncore events to system wide monitoring".  Waiting to hear back
>   on why driver can't do system wide monitoring, even across PPIs, by e.g.,
>   sharing the SPE interrupts in one handler (SPE's don't differ in this record
>   regard).

Could you elaborate on this? I don't follow the interrupt handler
comments.

[...]

> +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)
> +{
> +	u64 ts;
> +
> +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
> +
> +	return ts;
> +}

As covered in my other reply, please don't use the counter for this.

It sounds like we need a simple/generic function to get a nonce, that
we could share with the ETM code.

[...]

> +#define BIT(n)		(1 << (n))
> +
> +#define BIT61		((uint64_t)1 << 61)
> +#define BIT62		((uint64_t)1 << 62)
> +#define BIT63		((uint64_t)1 << 63)
> +
> +#define NS_FLAG		BIT63
> +#define EL_FLAG		(BIT62 | BIT61)

This would be far simpler as:

#define	BIT(n)	(1UL << (n))

#define NS_FLAG		BIT(63)
#define EL_FLAG		(BIT(62) | BIT(61))

[...]

> +/* return ARM SPE payload size from its encoding:
> + * 00 : byte
> + * 01 : halfword (2)
> + * 10 : word (4)
> + * 11 : doubleword (8)
> + */
> +static int payloadlen(unsigned char byte)
> +{
> +	return 1 << ((byte & 0x30) >> 4);
> +}

It might be worth stating in the comment that this is encoded in bits
5:4 of the byte, since otherwise it looks odd.

> +
> +static int arm_spe_get_payload(const unsigned char *buf, size_t len,
> +			       struct arm_spe_pkt *packet)
> +{
> +	size_t payload_len = payloadlen(buf[0]);
> +
> +	if (len < 1 + payload_len)
> +		return ARM_SPE_NEED_MORE_BYTES;

If you did `buf++` here, you could avoid the `+ 1` in all the cases below.

> +
> +	switch (payload_len) {
> +	case 1: packet->payload = *(uint8_t *)(buf + 1); break;
> +	case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;
> +	case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;
> +	case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;
> +	default: return ARM_SPE_BAD_PACKET;
> +	}
> +
> +	return 1 + payload_len;
> +}

[...]

> +int arm_spe_get_packet(const unsigned char *buf, size_t len,
> +		       struct arm_spe_pkt *packet)
> +{
> +	int ret;
> +
> +	ret = arm_spe_do_get_packet(buf, len, packet);
> +	if (ret > 0 && packet->type == ARM_SPE_PAD) {
> +		while (ret < 16 && len > (size_t)ret && !buf[ret])
> +			ret += 1;
> +	}
> +	return ret;
> +}

What's this doing? Skipping padding? What's the significance of 16?

> +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
> +		     size_t buf_len)
> +{
> +	int ret, ns, el, index = packet->index;
> +	unsigned long long payload = packet->payload;
> +	const char *name = arm_spe_pkt_name(packet->type);
> +
> +	switch (packet->type) {
> +	case ARM_SPE_BAD:
> +	case ARM_SPE_PAD:
> +	case ARM_SPE_END:
> +		return snprintf(buf, buf_len, "%s", name);
> +	case ARM_SPE_EVENTS: {
> +		size_t blen = buf_len;
> +
> +		ret = 0;
> +		ret = snprintf(buf, buf_len, "EV");
> +		buf += ret;
> +		blen -= ret;
> +		if (payload & 0x1) {
> +			ret = snprintf(buf, buf_len, " EXCEPTION-GEN");
> +			buf += ret;
> +			blen -= ret;
> +		}
> +		if (payload & 0x2) {
> +			ret = snprintf(buf, buf_len, " RETIRED");
> +			buf += ret;
> +			blen -= ret;
> +		}
> +		if (payload & 0x4) {
> +			ret = snprintf(buf, buf_len, " L1D-ACCESS");
> +			buf += ret;
> +			blen -= ret;
> +		}
> +		if (payload & 0x8) {
> +			ret = snprintf(buf, buf_len, " L1D-REFILL");
> +			buf += ret;
> +			blen -= ret;
> +		}
> +		if (payload & 0x10) {
> +			ret = snprintf(buf, buf_len, " TLB-ACCESS");
> +			buf += ret;
> +			blen -= ret;
> +		}
> +		if (payload & 0x20) {
> +			ret = snprintf(buf, buf_len, " TLB-REFILL");
> +			buf += ret;
> +			blen -= ret;
> +		}
> +		if (payload & 0x40) {
> +			ret = snprintf(buf, buf_len, " NOT-TAKEN");
> +			buf += ret;
> +			blen -= ret;
> +		}
> +		if (payload & 0x80) {
> +			ret = snprintf(buf, buf_len, " MISPRED");
> +			buf += ret;
> +			blen -= ret;
> +		}
> +		if (index > 1) {
> +			if (payload & 0x100) {
> +				ret = snprintf(buf, buf_len, " LLC-ACCESS");
> +				buf += ret;
> +				blen -= ret;
> +			}
> +			if (payload & 0x200) {
> +				ret = snprintf(buf, buf_len, " LLC-REFILL");
> +				buf += ret;
> +				blen -= ret;
> +			}
> +			if (payload & 0x400) {
> +				ret = snprintf(buf, buf_len, " REMOTE-ACCESS");
> +				buf += ret;
> +				blen -= ret;
> +			}
> +		}
> +		if (ret < 0)
> +			return ret;
> +		blen -= ret;
> +		return buf_len - blen;
> +	}

This looks like it could be turned into another switch, sharing the
repeated logic.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] perf tools: Add ARM Statistical Profiling Extensions (SPE) support
  2017-08-18 16:59                           ` [PATCH] " Mark Rutland
@ 2017-08-18 22:22                             ` Kim Phillips
  0 siblings, 0 replies; 33+ messages in thread
From: Kim Phillips @ 2017-08-18 22:22 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Arnaldo Carvalho de Melo, robh, mathieu.poirier, pawel.moll,
	suzuki.poulose, marc.zyngier, Will Deacon, linux-kernel,
	alexander.shishkin, peterz, mingo, tglx, linux-arm-kernel,
	Adrian Hunter, Jiri Olsa, Andi Kleen, Wang Nan

On Fri, 18 Aug 2017 17:59:25 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> On Mon, Jul 17, 2017 at 07:48:22PM -0500, Kim Phillips wrote:
> > On Fri, 30 Jun 2017 15:02:41 +0100 Mark Rutland <mark.rutland@arm.com> wrote:
> > > On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote:
> > <snip>
> > > >  	if (evlist) {
> > > >  		evlist__for_each_entry(evlist, evsel) {
> > > >  			if (cs_etm_pmu &&
> > > >  			    evsel->attr.type == cs_etm_pmu->type)
> > > >  				found_etm = true;
> > > > +			if (arm_spe_pmu &&
> > > > +			    evsel->attr.type == arm_spe_pmu->type)
> > > > +				found_spe = true;
> > > 
> > > Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all
> > > SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1"
> > > and so on).
> > > 
> > > Can we not find all PMUs with a "arm_spe_" prefix?
> > > 
> > > ... or, taking a step back, do we need some sysfs "class" attribute to
> > > identify multi-instance PMUs?
> > 
> > Since there is one SPE per core, and it looks like the buffer full
> > interrupt line is the only difference between the SPE device node
> > specification in the device tree, I guess I don't understand why the
> > driver doesn't accept a singular "arm_spe" from the tool, and manage
> > interrupt handling accordingly.
> 
> There are also differences which can be probed from the device, which

The only thing I see is PMSIDR fields describing things like minimum
recommended sampling interval.  So if CPU A's SPE has that as 256, and
CPU B's is 512, then deny the user asking for a 256 interval across the
two CPUs.  Or, better yet, issue a warning stating the driver has raised
the interval to the lowest common denominator of all CPU SPEs involved
(512 in the above case).

> are not described in the DT (but are described in sysfs). Some of these
> are exposed under sysfs.
> 
> There may be further differences in subsequent revisions of the
> architecture, too.

Future SPE lowest common denominator rules can be amended
according to the capabilities of the new system.

> So the safest bet is to expose them separately, as we
> do for other CPU-affine PMUs in heterogeneous systems.

Yes, perf is very hard to use on heterogeneous systems for this reason.
Cycles are cycles, it doesn't matter whether they're on an A53 or an
A72.

Meanwhile, this type of driver behaviour - and the fact that the drivers
are mute - hurts usability in heterogeneous environments, and can
easily be avoided.

> > Also, if a set of CPUs are missing SPE support, and the user doesn't
> > explicitly define a CPU affinity to outside that partition, then
> > decline to run, stating why.
> 
> It's possible for userspace to do this regardless; look for the set of
> SPE PMUs, and then look at their masks.

The driver still has to check if what the user is asking for, is
doable.  They also may not be using the perf tool.

> > > > +	/*
> > > > +	 * To obtain the auxtrace buffer file descriptor, the auxtrace event
> > > > +	 * must come first.
> > > > +	 */
> > > > +	perf_evlist__to_front(evlist, arm_spe_evsel);
> > > 
> > > Huh? *what* needs the auxtrace buffer fd?
> > > 
> > > This seems really fragile. Can't we store this elsewhere?
> > 
> > It's copied from the bts code, and the other auxtrace record users do
> > the same; it looks like auxtrace record has implicit dependencies on it?
> 
> Is it definitely required? What happens if this isn't done?

It says it wouldn't obtain the auxtrace buffer file descriptor.

> > > > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)
> > > > +{
> > > > +	u64 ts;
> > > > +
> > > > +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
> > > > +
> > > > +	return ts;
> > > > +}
> > > 
> > > I do not think it's a good idea to read the counter directly like this.
> > > 
> > > What is this "reference" intended to be meaningful relative to?
> > 
> > AFAICT, it's just a nonce the perf tool uses to track unique events,
> > and I thought this better than the ETM driver's heavier get-random
> > implementation.
> > 
> > > Why do we need to do this in userspace?
> > > 
> > > Can we not ask the kernel to output timestamps instead?
> > 
> > Why?  This gets the job done faster.
> 
> I had assumed that this needed to be correlated with the timestamps in
> the event.
> 
> If this is a nonce, please don't read the counter directly in this way.
> It may be trapped/emulated by a higher EL, making it very heavyweight.
> The counter is only exposed so that the VDSO can use it, and that will
> avoid using it in cases where it is unsafe.

Got it, thanks.

> > > > +	packet->type = ARM_SPE_EVENTS;
> > > > +	packet->index = events_len;
> > > 
> > > Huh? The events packet has no "index" field, so why do we need this?
> > 
> > To identify Events with a less number of comparisons in arm_spe_pkt_desc():
> > E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are
> > identified iff index > 1.
> 
> It would be clearer to do the additional comparisons there.
> 
> Does this make a measureable difference in practice?

It should - I'll add a comment.

> > > > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
> > > > +		     size_t buf_len)
> > > > +{
> > > > +	int ret, ns, el, index = packet->index;
> > > > +	unsigned long long payload = packet->payload;
> > > > +	const char *name = arm_spe_pkt_name(packet->type);
> > > > +
> > > > +	switch (packet->type) {
> 
> > > > +	case ARM_SPE_ADDRESS:
> > > > +		switch (index) {
> > > > +		case 0:
> > > > +		case 1: ns = !!(packet->payload & NS_FLAG);
> > > > +			el = (packet->payload & EL_FLAG) >> 61;
> > > > +			payload &= ~(0xffULL << 56);
> > > > +			return snprintf(buf, buf_len, "%s %llx el%d ns=%d",
> > > > +				        (index == 1) ? "TGT" : "PC", payload, el, ns);
> 
> > > Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so
> > > that things get padded consistently?
> > 
> > I've added the 0x prefix, but prefer to not fix the length to 016: I
> > don't see any direct benefit, rather see benefits to having the length
> > vary, for output size control and less obvious reasons, e.g., sorting
> > address lines by their length to get a sense of address groups caught
> > during the run.  FWIW, Intel doesn't do the 016 either.
> 
> With padding, sorting will also place address groups together, so I'm
> not sure I follow.

sorting by *line length* can be done to easily assess the address
groups in a dump:

$ grep -w PC dump | awk '{ print length, $0 }' | sort -nu
77 .  00000080:  b0 00 00 00 00 00 00 00 a0                      PC 0x0 el1 ns=1
82 .  00000000:  b0 94 61 43 00 00 00 00 80                      PC 0x436194 el0 ns=1
88 .  00000000:  b0 50 20 ac a7 ff ff 00 80                      PC 0xffffa7ac2050 el0 ns=1
89 .  00000040:  b0 80 2d 08 08 00 00 01 a0                      PC 0x1000008082d80 el1 ns=1

> Padding makes it *much* easier to scan over the output by eye, as
> columns of event data will always share the same alignment.

Addresses are already technically misaligned by virtue of their being
prepended with "PC" (2 chars) vs. "TGT" (3 chars):

82 .  00000000:  b0 94 61 43 00 00 00 00 80                      PC 0x436194 el0 ns=1
83 .  0000001e:  b1 68 61 43 00 00 00 00 80                      TGT 0x436168 el0 ns=1

89 .  00000040:  b0 80 2d 08 08 00 00 01 a0                      PC 0x1000008082d80 el1 ns=1
91 .  000005de:  b1 ec 9a 92 08 00 00 ff a0                      TGT 0xff000008929aec el1 ns=1

If you're talking about the postpended "elX ns=Y", well, that less
significant given the variable length is more quickly detected by the
eye - giving the astute reader hints of which execution level the
address is in - and can be parsed using variable length delimeters.

OTOH, we can rename the tokens, e.g., 

current PC  -> {NS,SE}EL{0,1,2,3}PC 0xAAAA
current TGT -> {NS,SE}EL{0,1,2,3}BT 0xAAAA

Where "BT" -> "Branch Target", which admittedly is less obvious to the
uninitiated.

So the last sample above would be:

89 .  00000040:  b0 80 2d 08 08 00 00 01 a0                      NSEL1PC 0x1000008082d80
91 .  000005de:  b1 ec 9a 92 08 00 00 ff a0                      NSEL1BT 0xff000008929aec

Is that better though?

Are there others opinionated here?

I'll get to the v2 review comments later.

Thanks for your feedback!

Kim

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2] perf tools: Add ARM Statistical Profiling Extensions (SPE) support
  2017-08-18 17:36                             ` Mark Rutland
@ 2017-08-21 23:18                               ` Kim Phillips
  0 siblings, 0 replies; 33+ messages in thread
From: Kim Phillips @ 2017-08-21 23:18 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Arnaldo Carvalho de Melo, robh, mathieu.poirier, pawel.moll,
	suzuki.poulose, marc.zyngier, Will Deacon, linux-kernel,
	alexander.shishkin, peterz, mingo, tglx, linux-arm-kernel,
	Adrian Hunter, Jiri Olsa, Andi Kleen, Wang Nan

On Fri, 18 Aug 2017 18:36:09 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> Hi Kim,

Hi Mark,

> On Thu, Aug 17, 2017 at 10:11:50PM -0500, Kim Phillips wrote:
> > Hi Mark, I've tried to proceed as much as possible without your
> > response, so if you still have comments to my above comments, please
> > comment in-line above, otherwise review the v2 patch below?
> 
> Apologies again for the late response, and thanks for the updated patch!

Thanks for your prompt response this time around.

> > .
> > . ... ARM SPE data: size 536432 bytes
> > .  00000000:  4a 01                                           B COND
> > .  00000002:  b1 00 00 00 00 00 00 00 80                      TGT 0 el0 ns=1
> > .  0000000b:  42 42                                           RETIRED NOT-TAKEN
> > .  0000000d:  b0 20 41 c0 ad ff ff 00 80                      PC ffffadc04120 el0 ns=1
> > .  00000016:  98 00 00                                        LAT 0 TOT
> > .  00000019:  71 80 3e f7 46 e9 01 00 00                      TS 2101429616256
> > .  00000022:  49 01                                           ST
> > .  00000024:  b2 50 bd ba 73 00 80 ff ff                      VA ffff800073babd50
> > .  0000002d:  b3 50 bd ba f3 00 00 00 80                      PA f3babd50 ns=1
> > .  00000036:  9a 00 00                                        LAT 0 XLAT
> > .  00000039:  42 16                                           RETIRED L1D-ACCESS TLB-ACCESS
> > .  0000003b:  b0 8c b4 1e 08 00 00 ff ff                      PC ff0000081eb48c el3 ns=1
> > .  00000044:  98 00 00                                        LAT 0 TOT
> > .  00000047:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868
> > .  00000050:  48 00                                           INSN-OTHER
> > .  00000052:  42 02                                           RETIRED
> > .  00000054:  b0 58 54 1f 08 00 00 ff ff                      PC ff0000081f5458 el3 ns=1
> > .  0000005d:  98 00 00                                        LAT 0 TOT
> > .  00000060:  71 cc 44 f7 46 e9 01 00 00                      TS 2101429617868
> 
> So FWIW, I think this is a good example of why that padding I requested
> last time round matters.
> 
> For the first PC packet, I had to count the number of characters to see
> that it was a TTBR0 address, which is made much clearer with leading
> padding, as 0000ffffadc04120. With the addresses padded, the EL and NS
> fields would also be aligned, making it *much* easier to scan by eye.

See my response in my prior email.

> > - multiple SPE clusters/domains support pending potential driver changes?
> 
> As covered in my other reply, I don't believe that the driver is going
> to change in this regard. Userspace will need to handle multiple SPE
> instances.
> 
> I'll ignore that in the code below for now.

Please let's continue the discussion in one place, and again in this
case, in the last email.

> > - CPU mask / new record behaviour bisected to commit e3ba76deef23064 "perf
> >   tools: Force uncore events to system wide monitoring".  Waiting to hear back
> >   on why driver can't do system wide monitoring, even across PPIs, by e.g.,
> >   sharing the SPE interrupts in one handler (SPE's don't differ in this record
> >   regard).
> 
> Could you elaborate on this? I don't follow the interrupt handler
> comments.

Would it be possible for the driver to request the IRQs with
IRQF_SHARED, in order to be able to operate across the multiple PPIs?

> > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)
> > +{
> > +	u64 ts;
> > +
> > +	asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
> > +
> > +	return ts;
> > +}
> 
> As covered in my other reply, please don't use the counter for this.
> 
> It sounds like we need a simple/generic function to get a nonce, that
> we could share with the ETM code.

I've switched to using clock_gettime(CLOCK_MONOTONIC_RAW, ...).  The
ETM code uses two rand() calls, which, according to some minor
benchmarking on Juno, is almost twice as slow as clock_gettime. It's
three lines still, so I'll update the ETM code in-place independently
of this patch, and after the gettime implementation is reviewed.

> > +int arm_spe_get_packet(const unsigned char *buf, size_t len,
> > +		       struct arm_spe_pkt *packet)
> > +{
> > +	int ret;
> > +
> > +	ret = arm_spe_do_get_packet(buf, len, packet);
> > +	if (ret > 0 && packet->type == ARM_SPE_PAD) {
> > +		while (ret < 16 && len > (size_t)ret && !buf[ret])
> > +			ret += 1;
> > +	}
> > +	return ret;
> > +}
> 
> What's this doing? Skipping padding? What's the significance of 16?

I'll repeat the relevant part of the v2 changelog here:

- do_get_packet fixed to handle excessive, successive PADding from a new source
  of raw SPE data, so instead of:

	.  000011ae:  00                                              PAD
	.  000011af:  00                                              PAD
	.  000011b0:  00                                              PAD
	.  000011b1:  00                                              PAD
	.  000011b2:  00                                              PAD
	.  000011b3:  00                                              PAD
	.  000011b4:  00                                              PAD
	.  000011b5:  00                                              PAD
	.  000011b6:  00                                              PAD

  we now get:

	.  000011ae:  00 00 00 00 00 00 00 00 00                      PAD

...the 16 is the width of the dump format: max. 16 byte being displayed
per line: I'll add a comment as such.

> > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
> > +		     size_t buf_len)
> > +{
> > +	int ret, ns, el, index = packet->index;
> > +	unsigned long long payload = packet->payload;
> > +	const char *name = arm_spe_pkt_name(packet->type);
> > +
> > +	switch (packet->type) {
> > +	case ARM_SPE_BAD:
> > +	case ARM_SPE_PAD:
> > +	case ARM_SPE_END:
> > +		return snprintf(buf, buf_len, "%s", name);
> > +	case ARM_SPE_EVENTS: {
> > +		size_t blen = buf_len;
> > +
> > +		ret = 0;
> > +		ret = snprintf(buf, buf_len, "EV");
> > +		buf += ret;
> > +		blen -= ret;
> > +		if (payload & 0x1) {
> > +			ret = snprintf(buf, buf_len, " EXCEPTION-GEN");
> > +			buf += ret;
> > +			blen -= ret;
> > +		}
> > +		if (payload & 0x2) {
> > +			ret = snprintf(buf, buf_len, " RETIRED");
> > +			buf += ret;
> > +			blen -= ret;
> > +		}
> > +		if (payload & 0x4) {
> > +			ret = snprintf(buf, buf_len, " L1D-ACCESS");
> > +			buf += ret;
> > +			blen -= ret;
> > +		}
> > +		if (payload & 0x8) {
> > +			ret = snprintf(buf, buf_len, " L1D-REFILL");
> > +			buf += ret;
> > +			blen -= ret;
> > +		}
> > +		if (payload & 0x10) {
> > +			ret = snprintf(buf, buf_len, " TLB-ACCESS");
> > +			buf += ret;
> > +			blen -= ret;
> > +		}
> > +		if (payload & 0x20) {
> > +			ret = snprintf(buf, buf_len, " TLB-REFILL");
> > +			buf += ret;
> > +			blen -= ret;
> > +		}
> > +		if (payload & 0x40) {
> > +			ret = snprintf(buf, buf_len, " NOT-TAKEN");
> > +			buf += ret;
> > +			blen -= ret;
> > +		}
> > +		if (payload & 0x80) {
> > +			ret = snprintf(buf, buf_len, " MISPRED");
> > +			buf += ret;
> > +			blen -= ret;
> > +		}
> > +		if (index > 1) {
> > +			if (payload & 0x100) {
> > +				ret = snprintf(buf, buf_len, " LLC-ACCESS");
> > +				buf += ret;
> > +				blen -= ret;
> > +			}
> > +			if (payload & 0x200) {
> > +				ret = snprintf(buf, buf_len, " LLC-REFILL");
> > +				buf += ret;
> > +				blen -= ret;
> > +			}
> > +			if (payload & 0x400) {
> > +				ret = snprintf(buf, buf_len, " REMOTE-ACCESS");
> > +				buf += ret;
> > +				blen -= ret;
> > +			}
> > +		}
> > +		if (ret < 0)
> > +			return ret;
> > +		blen -= ret;
> > +		return buf_len - blen;
> > +	}
> 
> This looks like it could be turned into another switch, sharing the
> repeated logic.

How, if the payload may have multiple bits set?

I've addressed the rest of your comments and therefore trimmed them
out.  I can post a v3, but would rather shake out the pending issues
first, so please reply to my comments on this and Friday's email (Date:
Fri, 18 Aug 2017 17:22:48 -0500).

Thanks,

Kim

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2017-08-21 23:18 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-05 15:22 [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Will Deacon
2017-06-05 15:22 ` [PATCH v4 1/5] genirq: export irq_get_percpu_devid_partition to modules Will Deacon
2017-06-05 15:22 ` [PATCH v4 2/5] perf/core: Export AUX buffer helpers " Will Deacon
2017-06-05 15:22 ` [PATCH v4 3/5] perf/core: Add PERF_AUX_FLAG_COLLISION to report colliding samples Will Deacon
2017-06-05 15:22 ` [PATCH v4 4/5] drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension Will Deacon
2017-06-05 15:55   ` Kim Phillips
2017-06-05 16:11     ` Will Deacon
2017-06-15 14:57   ` Mark Rutland
2017-06-21 15:39     ` Will Deacon
2017-06-27 17:12       ` Mark Rutland
2017-07-03 17:23   ` Mark Rutland
2017-06-05 15:22 ` [PATCH v4 5/5] dt-bindings: Document devicetree binding for ARM SPE Will Deacon
2017-06-12 11:08 ` [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Mark Rutland
2017-06-12 16:20   ` Kim Phillips
2017-06-15 15:57     ` Kim Phillips
2017-06-21 15:31       ` Will Deacon
2017-06-22 15:56         ` Kim Phillips
2017-06-22 18:36           ` Will Deacon
2017-06-27 21:07             ` Kim Phillips
2017-06-28 11:26               ` Mark Rutland
2017-06-28 11:32                 ` Mark Rutland
2017-06-29  1:16                   ` Kim Phillips
2017-06-29  1:43                     ` [PATCH] perf tools: Add ARM Statistical Profiling Extensions (SPE) support Kim Phillips
2017-06-30 14:02                       ` Mark Rutland
2017-07-18  0:48                         ` Kim Phillips
2017-08-18  3:11                           ` [PATCH v2] " Kim Phillips
2017-08-18 17:36                             ` Mark Rutland
2017-08-21 23:18                               ` Kim Phillips
2017-08-18 16:59                           ` [PATCH] " Mark Rutland
2017-08-18 22:22                             ` Kim Phillips
2017-06-29  0:59                 ` [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension Kim Phillips
2017-06-29 11:11                   ` Mark Rutland
2017-07-06 17:08                     ` Kim Phillips

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).