linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/7] perf: Introduce address range filtering
@ 2016-04-27 15:44 Alexander Shishkin
  2016-04-27 15:44 ` [PATCH v2 1/7] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Alexander Shishkin @ 2016-04-27 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, x86, Borislav Petkov, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier,
	Alexander Shishkin

Hi Peter,

Here's my third round of address range filtering support for perf.

Changes since v1 [1]:

 * got rid of the race filter syncing cross-call if favor of a restart
 sequence and
 * moved all filter synchronization to pmu::start()
 * simplified some paths and structures as a result
 * added some more comments
 * dropped "file"/"kernel" specifier from filter format
 * added a pmu attribute for those pmus that support address filtering
 * moved all PT MSR bit definitions to pt.h.

Changes since v0 [2], notably

 * split the pmu callback in two as we discussed,
 * split the filter itself into 'core' and 'hw' parts,
 * made only parent events eligible for configuring filters (children
   can still have their 'hw' portions),
 * and basically re-wrote the whole thing around that.

Newer version of Intel PT supports address-based filtering, and this
patchset adds support for it to perf core and the PT pmu driver. It
works by configuring a number of address ranges in hardware and
telling it to use these ranges to filter its traces. Similar feature
also exists in ARM Coresight ETM/PTM and it is also taken into account
in this patchset.

Firstly, userspace configures filters via an ioctl(), filters are
formatted as an ascii string. Filters may refer to addresses in object
files for userspace code or kernel addresses. The latter might be
extended in the future to support kernel modules.

For userspace filters, we scan the task's vmas to see if any of them
match the defined filters (inode+offset) and if they do, calculate
memory offsets and program them into hardware. Note that since
different tasks will have different mappings for the same object
files, supporting cpu-wide events would require special tricks to
context-switch filters for userspace code.

Also, we monitor new mmap and exec events to update (or clear) filter
configuration.

[1] http://marc.info/?l=linux-kernel&m=146125264018801
[2] http://marc.info/?l=linux-kernel&m=144984116019143

Alexander Shishkin (7):
  perf: Move set_filter() from behind EVENT_TRACING
  perf/x86/intel/pt: Move MSR bit definitions to a private header
  perf/x86/intel/pt: IP filtering register/cpuid bits
  perf: Extend perf_event_aux_ctx() to optionally iterate through more
    events
  perf: Introduce address range filtering
  perf/x86/intel/pt: Add support for address range filtering in PT
  perf: Let userspace know if pmu supports address filters

 arch/x86/events/intel/pt.c       | 181 +++++++++-
 arch/x86/events/intel/pt.h       |  62 ++++
 arch/x86/include/asm/msr-index.h |  29 +-
 include/linux/perf_event.h       |  98 ++++++
 kernel/events/core.c             | 713 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 1005 insertions(+), 78 deletions(-)

-- 
2.8.0.rc3

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/7] perf: Move set_filter() from behind EVENT_TRACING
  2016-04-27 15:44 [PATCH v2 0/7] perf: Introduce address range filtering Alexander Shishkin
@ 2016-04-27 15:44 ` Alexander Shishkin
  2016-05-05  9:45   ` [tip:perf/core] perf/core: Move set_filter() out of CONFIG_EVENT_TRACING tip-bot for Alexander Shishkin
  2016-04-27 15:44 ` [PATCH v2 2/7] perf/x86/intel/pt: Move MSR bit definitions to a private header Alexander Shishkin
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Alexander Shishkin @ 2016-04-27 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, x86, Borislav Petkov, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier,
	Alexander Shishkin

For instruction trace filtering, namely, for communicating filter
definitions from userspace, I'd like to re-use the SET_FILTER code
that the tracepoints are using currently.

To that end, this patch moves the relevant code from behind EVENT_TRACING
macro.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 45 ++++++++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index eabeb2aec0..aa3733fc76 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7244,24 +7244,6 @@ static inline void perf_tp_register(void)
 	perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
 }
 
-static int perf_event_set_filter(struct perf_event *event, void __user *arg)
-{
-	char *filter_str;
-	int ret;
-
-	if (event->attr.type != PERF_TYPE_TRACEPOINT)
-		return -EINVAL;
-
-	filter_str = strndup_user(arg, PAGE_SIZE);
-	if (IS_ERR(filter_str))
-		return PTR_ERR(filter_str);
-
-	ret = ftrace_profile_set_filter(event, event->attr.config, filter_str);
-
-	kfree(filter_str);
-	return ret;
-}
-
 static void perf_event_free_filter(struct perf_event *event)
 {
 	ftrace_profile_free_filter(event);
@@ -7316,11 +7298,6 @@ static inline void perf_tp_register(void)
 {
 }
 
-static int perf_event_set_filter(struct perf_event *event, void __user *arg)
-{
-	return -ENOENT;
-}
-
 static void perf_event_free_filter(struct perf_event *event)
 {
 }
@@ -7348,6 +7325,28 @@ void perf_bp_event(struct perf_event *bp, void *data)
 }
 #endif
 
+static int perf_event_set_filter(struct perf_event *event, void __user *arg)
+{
+	char *filter_str;
+	int ret = -EINVAL;
+
+	if (event->attr.type != PERF_TYPE_TRACEPOINT ||
+	    !IS_ENABLED(CONFIG_EVENT_TRACING))
+		return -EINVAL;
+
+	filter_str = strndup_user(arg, PAGE_SIZE);
+	if (IS_ERR(filter_str))
+		return PTR_ERR(filter_str);
+
+	if (IS_ENABLED(CONFIG_EVENT_TRACING) &&
+	    event->attr.type == PERF_TYPE_TRACEPOINT)
+		ret = ftrace_profile_set_filter(event, event->attr.config,
+						filter_str);
+
+	kfree(filter_str);
+	return ret;
+}
+
 /*
  * hrtimer based swevent callback
  */
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 2/7] perf/x86/intel/pt: Move MSR bit definitions to a private header
  2016-04-27 15:44 [PATCH v2 0/7] perf: Introduce address range filtering Alexander Shishkin
  2016-04-27 15:44 ` [PATCH v2 1/7] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
@ 2016-04-27 15:44 ` Alexander Shishkin
  2016-05-05  9:45   ` [tip:perf/core] perf/x86/intel/pt: Move PT specific " tip-bot for Alexander Shishkin
  2016-04-27 15:44 ` [PATCH v2 3/7] perf/x86/intel/pt: IP filtering register/cpuid bits Alexander Shishkin
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Alexander Shishkin @ 2016-04-27 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, x86, Borislav Petkov, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier,
	Alexander Shishkin

Nothing outside of the Intel PT driver should ever care about its MSR
bits, so there is no reason to keep them in msr-index.h. This patch
moves them to a pt-local header.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/events/intel/pt.h       | 24 ++++++++++++++++++++++++
 arch/x86/include/asm/msr-index.h | 20 --------------------
 2 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index 336878a5d2..a71046187e 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -20,6 +20,30 @@
 #define __INTEL_PT_H__
 
 /*
+ * PT MSR bit definitions
+ */
+#define RTIT_CTL_TRACEEN		BIT(0)
+#define RTIT_CTL_CYCLEACC		BIT(1)
+#define RTIT_CTL_OS			BIT(2)
+#define RTIT_CTL_USR			BIT(3)
+#define RTIT_CTL_CR3EN			BIT(7)
+#define RTIT_CTL_TOPA			BIT(8)
+#define RTIT_CTL_MTC_EN			BIT(9)
+#define RTIT_CTL_TSC_EN			BIT(10)
+#define RTIT_CTL_DISRETC		BIT(11)
+#define RTIT_CTL_BRANCH_EN		BIT(13)
+#define RTIT_CTL_MTC_RANGE_OFFSET	14
+#define RTIT_CTL_MTC_RANGE		(0x0full << RTIT_CTL_MTC_RANGE_OFFSET)
+#define RTIT_CTL_CYC_THRESH_OFFSET	19
+#define RTIT_CTL_CYC_THRESH		(0x0full << RTIT_CTL_CYC_THRESH_OFFSET)
+#define RTIT_CTL_PSB_FREQ_OFFSET	24
+#define RTIT_CTL_PSB_FREQ      		(0x0full << RTIT_CTL_PSB_FREQ_OFFSET)
+#define RTIT_STATUS_CONTEXTEN		BIT(1)
+#define RTIT_STATUS_TRIGGEREN		BIT(2)
+#define RTIT_STATUS_ERROR		BIT(4)
+#define RTIT_STATUS_STOPPED		BIT(5)
+
+/*
  * Single-entry ToPA: when this close to region boundary, switch
  * buffers to avoid losing data.
  */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 94555b4d85..7193577d8b 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -89,27 +89,7 @@
 #define MSR_PEBS_LD_LAT_THRESHOLD	0x000003f6
 
 #define MSR_IA32_RTIT_CTL		0x00000570
-#define RTIT_CTL_TRACEEN		BIT(0)
-#define RTIT_CTL_CYCLEACC		BIT(1)
-#define RTIT_CTL_OS			BIT(2)
-#define RTIT_CTL_USR			BIT(3)
-#define RTIT_CTL_CR3EN			BIT(7)
-#define RTIT_CTL_TOPA			BIT(8)
-#define RTIT_CTL_MTC_EN			BIT(9)
-#define RTIT_CTL_TSC_EN			BIT(10)
-#define RTIT_CTL_DISRETC		BIT(11)
-#define RTIT_CTL_BRANCH_EN		BIT(13)
-#define RTIT_CTL_MTC_RANGE_OFFSET	14
-#define RTIT_CTL_MTC_RANGE		(0x0full << RTIT_CTL_MTC_RANGE_OFFSET)
-#define RTIT_CTL_CYC_THRESH_OFFSET	19
-#define RTIT_CTL_CYC_THRESH		(0x0full << RTIT_CTL_CYC_THRESH_OFFSET)
-#define RTIT_CTL_PSB_FREQ_OFFSET	24
-#define RTIT_CTL_PSB_FREQ      		(0x0full << RTIT_CTL_PSB_FREQ_OFFSET)
 #define MSR_IA32_RTIT_STATUS		0x00000571
-#define RTIT_STATUS_CONTEXTEN		BIT(1)
-#define RTIT_STATUS_TRIGGEREN		BIT(2)
-#define RTIT_STATUS_ERROR		BIT(4)
-#define RTIT_STATUS_STOPPED		BIT(5)
 #define MSR_IA32_RTIT_CR3_MATCH		0x00000572
 #define MSR_IA32_RTIT_OUTPUT_BASE	0x00000560
 #define MSR_IA32_RTIT_OUTPUT_MASK	0x00000561
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 3/7] perf/x86/intel/pt: IP filtering register/cpuid bits
  2016-04-27 15:44 [PATCH v2 0/7] perf: Introduce address range filtering Alexander Shishkin
  2016-04-27 15:44 ` [PATCH v2 1/7] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
  2016-04-27 15:44 ` [PATCH v2 2/7] perf/x86/intel/pt: Move MSR bit definitions to a private header Alexander Shishkin
@ 2016-04-27 15:44 ` Alexander Shishkin
  2016-05-05  9:46   ` [tip:perf/core] perf/x86/intel/pt: Add IP filtering register/CPUID bits tip-bot for Alexander Shishkin
  2016-04-27 15:44 ` [PATCH v2 4/7] perf: Extend perf_event_aux_ctx() to optionally iterate through more events Alexander Shishkin
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Alexander Shishkin @ 2016-04-27 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, x86, Borislav Petkov, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier,
	Alexander Shishkin

New versions of Intel PT support address range-based filtering. These
are the registers, bit definitions and relevant CPUID bits.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/events/intel/pt.c       |  2 ++
 arch/x86/events/intel/pt.h       | 12 ++++++++++++
 arch/x86/include/asm/msr-index.h |  9 +++++++++
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 1aefd430e7..4eea19e4d9 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -67,11 +67,13 @@ static struct pt_cap_desc {
 	PT_CAP(max_subleaf,		0, CR_EAX, 0xffffffff),
 	PT_CAP(cr3_filtering,		0, CR_EBX, BIT(0)),
 	PT_CAP(psb_cyc,			0, CR_EBX, BIT(1)),
+	PT_CAP(ip_filtering,		0, CR_EBX, BIT(2)),
 	PT_CAP(mtc,			0, CR_EBX, BIT(3)),
 	PT_CAP(topa_output,		0, CR_ECX, BIT(0)),
 	PT_CAP(topa_multiple_entries,	0, CR_ECX, BIT(1)),
 	PT_CAP(single_range_output,	0, CR_ECX, BIT(2)),
 	PT_CAP(payloads_lip,		0, CR_ECX, BIT(31)),
+	PT_CAP(num_address_ranges,	1, CR_EAX, 0x3),
 	PT_CAP(mtc_periods,		1, CR_EAX, 0xffff0000),
 	PT_CAP(cycle_thresholds,	1, CR_EBX, 0xffff),
 	PT_CAP(psb_periods,		1, CR_EBX, 0xffff0000),
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index a71046187e..c5698eb6de 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -38,8 +38,18 @@
 #define RTIT_CTL_CYC_THRESH		(0x0full << RTIT_CTL_CYC_THRESH_OFFSET)
 #define RTIT_CTL_PSB_FREQ_OFFSET	24
 #define RTIT_CTL_PSB_FREQ      		(0x0full << RTIT_CTL_PSB_FREQ_OFFSET)
+#define RTIT_CTL_ADDR0_OFFSET		32
+#define RTIT_CTL_ADDR0      		(0x0full << RTIT_CTL_ADDR0_OFFSET)
+#define RTIT_CTL_ADDR1_OFFSET		36
+#define RTIT_CTL_ADDR1      		(0x0full << RTIT_CTL_ADDR1_OFFSET)
+#define RTIT_CTL_ADDR2_OFFSET		40
+#define RTIT_CTL_ADDR2      		(0x0full << RTIT_CTL_ADDR2_OFFSET)
+#define RTIT_CTL_ADDR3_OFFSET		44
+#define RTIT_CTL_ADDR3      		(0x0full << RTIT_CTL_ADDR3_OFFSET)
+#define RTIT_STATUS_FILTEREN		BIT(0)
 #define RTIT_STATUS_CONTEXTEN		BIT(1)
 #define RTIT_STATUS_TRIGGEREN		BIT(2)
+#define RTIT_STATUS_BUFFOVF		BIT(3)
 #define RTIT_STATUS_ERROR		BIT(4)
 #define RTIT_STATUS_STOPPED		BIT(5)
 
@@ -76,11 +86,13 @@ enum pt_capabilities {
 	PT_CAP_max_subleaf = 0,
 	PT_CAP_cr3_filtering,
 	PT_CAP_psb_cyc,
+	PT_CAP_ip_filtering,
 	PT_CAP_mtc,
 	PT_CAP_topa_output,
 	PT_CAP_topa_multiple_entries,
 	PT_CAP_single_range_output,
 	PT_CAP_payloads_lip,
+	PT_CAP_num_address_ranges,
 	PT_CAP_mtc_periods,
 	PT_CAP_cycle_thresholds,
 	PT_CAP_psb_periods,
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7193577d8b..5a73a9c62c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -90,6 +90,15 @@
 
 #define MSR_IA32_RTIT_CTL		0x00000570
 #define MSR_IA32_RTIT_STATUS		0x00000571
+#define MSR_IA32_RTIT_STATUS		0x00000571
+#define MSR_IA32_RTIT_ADDR0_A		0x00000580
+#define MSR_IA32_RTIT_ADDR0_B		0x00000581
+#define MSR_IA32_RTIT_ADDR1_A		0x00000582
+#define MSR_IA32_RTIT_ADDR1_B		0x00000583
+#define MSR_IA32_RTIT_ADDR2_A		0x00000584
+#define MSR_IA32_RTIT_ADDR2_B		0x00000585
+#define MSR_IA32_RTIT_ADDR3_A		0x00000586
+#define MSR_IA32_RTIT_ADDR3_B		0x00000587
 #define MSR_IA32_RTIT_CR3_MATCH		0x00000572
 #define MSR_IA32_RTIT_OUTPUT_BASE	0x00000560
 #define MSR_IA32_RTIT_OUTPUT_MASK	0x00000561
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 4/7] perf: Extend perf_event_aux_ctx() to optionally iterate through more events
  2016-04-27 15:44 [PATCH v2 0/7] perf: Introduce address range filtering Alexander Shishkin
                   ` (2 preceding siblings ...)
  2016-04-27 15:44 ` [PATCH v2 3/7] perf/x86/intel/pt: IP filtering register/cpuid bits Alexander Shishkin
@ 2016-04-27 15:44 ` Alexander Shishkin
  2016-05-05  9:46   ` [tip:perf/core] perf/core: " tip-bot for Alexander Shishkin
  2016-04-27 15:44 ` [PATCH v2 5/7] perf: Introduce address range filtering Alexander Shishkin
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Alexander Shishkin @ 2016-04-27 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, x86, Borislav Petkov, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier,
	Alexander Shishkin

Trace filtering code needs an iterator that can go through all events in
a context, including inactive and filtered, to be able to update their
filters' address ranges based on mmap or exec events.

This patch changes perf_event_aux_ctx() to optionally do this.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index aa3733fc76..6d335f3878 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5790,15 +5790,18 @@ typedef void (perf_event_aux_output_cb)(struct perf_event *event, void *data);
 static void
 perf_event_aux_ctx(struct perf_event_context *ctx,
 		   perf_event_aux_output_cb output,
-		   void *data)
+		   void *data, bool all)
 {
 	struct perf_event *event;
 
 	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
-		if (event->state < PERF_EVENT_STATE_INACTIVE)
-			continue;
-		if (!event_filter_match(event))
-			continue;
+		if (!all) {
+			if (event->state < PERF_EVENT_STATE_INACTIVE)
+				continue;
+			if (!event_filter_match(event))
+				continue;
+		}
+
 		output(event, data);
 	}
 }
@@ -5809,7 +5812,7 @@ perf_event_aux_task_ctx(perf_event_aux_output_cb output, void *data,
 {
 	rcu_read_lock();
 	preempt_disable();
-	perf_event_aux_ctx(task_ctx, output, data);
+	perf_event_aux_ctx(task_ctx, output, data, false);
 	preempt_enable();
 	rcu_read_unlock();
 }
@@ -5839,13 +5842,13 @@ perf_event_aux(perf_event_aux_output_cb output, void *data,
 		cpuctx = get_cpu_ptr(pmu->pmu_cpu_context);
 		if (cpuctx->unique_pmu != pmu)
 			goto next;
-		perf_event_aux_ctx(&cpuctx->ctx, output, data);
+		perf_event_aux_ctx(&cpuctx->ctx, output, data, false);
 		ctxn = pmu->task_ctx_nr;
 		if (ctxn < 0)
 			goto next;
 		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
 		if (ctx)
-			perf_event_aux_ctx(ctx, output, data);
+			perf_event_aux_ctx(ctx, output, data, false);
 next:
 		put_cpu_ptr(pmu->pmu_cpu_context);
 	}
@@ -5887,10 +5890,10 @@ static int __perf_pmu_output_stop(void *info)
 	};
 
 	rcu_read_lock();
-	perf_event_aux_ctx(&cpuctx->ctx, __perf_event_output_stop, &ro);
+	perf_event_aux_ctx(&cpuctx->ctx, __perf_event_output_stop, &ro, false);
 	if (cpuctx->task_ctx)
 		perf_event_aux_ctx(cpuctx->task_ctx, __perf_event_output_stop,
-				   &ro);
+				   &ro, false);
 	rcu_read_unlock();
 
 	return ro.err;
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 5/7] perf: Introduce address range filtering
  2016-04-27 15:44 [PATCH v2 0/7] perf: Introduce address range filtering Alexander Shishkin
                   ` (3 preceding siblings ...)
  2016-04-27 15:44 ` [PATCH v2 4/7] perf: Extend perf_event_aux_ctx() to optionally iterate through more events Alexander Shishkin
@ 2016-04-27 15:44 ` Alexander Shishkin
  2016-04-28 17:09   ` Peter Zijlstra
                     ` (3 more replies)
  2016-04-27 15:44 ` [PATCH v2 6/7] perf/x86/intel/pt: Add support for address range filtering in PT Alexander Shishkin
  2016-04-27 15:44 ` [PATCH v2 7/7] perf: Let userspace know if pmu supports address filters Alexander Shishkin
  6 siblings, 4 replies; 20+ messages in thread
From: Alexander Shishkin @ 2016-04-27 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, x86, Borislav Petkov, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier,
	Alexander Shishkin

Many instruction trace pmus out there support address range-based
filtering, which would, for example, generate trace data only for a
given range of instruction addresses, which is useful for tracing
individual functions, modules or libraries. Other pmus may also
utilize this functionality to allow filtering to or filtering out
code at certain address ranges.

This patch introduces the interface for userspace to specify these
filters and for the pmu drivers to apply these filters to hardware
configuration.

The user interface is an ascii string that is passed via an ioctl
and specifies (in the form of an ascii string) address ranges within
certain object files or within kernel. There is no special treatment
for kernel modules yet, but it might be a worthy pursuit.

The pmu driver interface basically add two extra callbacks to the
pmu driver structure, one of which validates the filter configuration
proposed by the user against what the hardware is actually capable of
doing and the other one translates hardware-independent filter
configuration into something that can be programmed into the
hardware.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h |  98 +++++++
 kernel/events/core.c       | 623 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 705 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 85749ae8cb..32b2b4866a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -151,6 +151,15 @@ struct hw_perf_event {
 	 */
 	struct task_struct		*target;
 
+	/*
+	 * PMU would store hardware filter configuration
+	 * here.
+	 */
+	void				*addr_filters;
+
+	/* Last sync'ed generation of filters */
+	unsigned long			addr_filters_gen;
+
 /*
  * hw_perf_event::state flags; used to track the PERF_EF_* state.
  */
@@ -240,6 +249,9 @@ struct pmu {
 	int				task_ctx_nr;
 	int				hrtimer_interval_ms;
 
+	/* number of address filters this pmu can do */
+	unsigned int			nr_addr_filters;
+
 	/*
 	 * Fully disable/enable this PMU, can be used to protect from the PMI
 	 * as well as for lazy/batch writing of the MSRs.
@@ -393,12 +405,71 @@ struct pmu {
 	void (*free_aux)		(void *aux); /* optional */
 
 	/*
+	 * Validate address range filters: make sure hw supports the
+	 * requested configuration and number of filters; return 0 if the
+	 * supplied filters are valid, -errno otherwise.
+	 *
+	 * Runs in the context of the ioctl()ing process and is not serialized
+	 * with the rest of the pmu callbacks.
+	 */
+	int (*addr_filters_validate)	(struct list_head *filters);
+					/* optional */
+
+	/*
+	 * Synchronize address range filter configuration:
+	 * translate hw-agnostic filters into hardware configuration in
+	 * event::hw::addr_filters.
+	 *
+	 * Runs as a part of filter sync sequence that is done in ->start()
+	 * callback by calling perf_event_addr_filters_sync().
+	 *
+	 * May (and should) traverse event::addr_filters::list, for which its
+	 * caller provides necessary serialization.
+	 */
+	void (*addr_filters_sync)	(struct perf_event *event);
+					/* optional */
+
+	/*
 	 * Filter events for PMU-specific reasons.
 	 */
 	int (*filter_match)		(struct perf_event *event); /* optional */
 };
 
 /**
+ * struct perf_addr_filter - address range filter definition
+ * @entry:	event's filter list linkage
+ * @inode:	object file's inode for file-based filters
+ * @offset:	filter range offset
+ * @size:	filter range size
+ * @range:	1: range, 0: address
+ * @filter:	1: filter/start, 0: stop
+ *
+ * This is a hardware-agnostic filter configuration as specified by the user.
+ */
+struct perf_addr_filter {
+	struct list_head	entry;
+	struct inode		*inode;
+	unsigned long		offset;
+	unsigned long		size;
+	unsigned int		range	: 1,
+				filter	: 1;
+};
+
+/**
+ * struct perf_addr_filters_head - container for address range filters
+ * @list:	list of filters for this event
+ * @lock:	spinlock that serializes accesses to the @list and event's
+ *		(and its children's) filter generations.
+ *
+ * A child event will use parent's @list (and therefore @lock), so they are
+ * bundled together; see perf_event_addr_filters().
+ */
+struct perf_addr_filters_head {
+	struct list_head	list;
+	raw_spinlock_t		lock;
+};
+
+/**
  * enum perf_event_active_state - the states of a event
  */
 enum perf_event_active_state {
@@ -566,6 +637,12 @@ struct perf_event {
 
 	atomic_t			event_limit;
 
+	/* address range filters */
+	struct perf_addr_filters_head	addr_filters;
+	/* vma address array for file-based filders */
+	unsigned long			*addr_filters_offs;
+	unsigned long			addr_filters_gen;
+
 	void (*destroy)(struct perf_event *);
 	struct rcu_head			rcu_head;
 
@@ -679,6 +756,22 @@ struct perf_output_handle {
 	int				page;
 };
 
+void perf_event_addr_filters_sync(struct perf_event *event);
+
+/*
+ * An inherited event uses parent's filters
+ */
+static inline struct perf_addr_filters_head *
+perf_event_addr_filters(struct perf_event *event)
+{
+	struct perf_addr_filters_head *ifh = &event->addr_filters;
+
+	if (event->parent)
+		ifh = &event->parent->addr_filters;
+
+	return ifh;
+}
+
 #ifdef CONFIG_CGROUP_PERF
 
 /*
@@ -1066,6 +1159,11 @@ static inline bool is_write_backward(struct perf_event *event)
 	return !!event->attr.write_backward;
 }
 
+static inline bool has_addr_filter(struct perf_event *event)
+{
+	return event->pmu->nr_addr_filters;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern int perf_output_begin_forward(struct perf_output_handle *handle,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6d335f3878..606398b62a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -44,6 +44,8 @@
 #include <linux/compat.h>
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/namei.h>
+#include <linux/parser.h>
 
 #include "internal.h"
 
@@ -2364,11 +2366,17 @@ void perf_event_enable(struct perf_event *event)
 }
 EXPORT_SYMBOL_GPL(perf_event_enable);
 
+struct stop_event_data {
+	struct perf_event	*event;
+	unsigned int		restart;
+};
+
 static int __perf_event_stop(void *info)
 {
-	struct perf_event *event = info;
+	struct stop_event_data *sd = info;
+	struct perf_event *event = sd->event;
 
-	/* for AUX events, our job is done if the event is already inactive */
+	/* if it's already INACTIVE, do nothing */
 	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
 		return 0;
 
@@ -2384,9 +2392,86 @@ static int __perf_event_stop(void *info)
 
 	event->pmu->stop(event, PERF_EF_UPDATE);
 
+	/*
+	 * May race with the actual stop (through perf_pmu_output_stop()),
+	 * but it is only used for events with AUX ring buffer, and such
+	 * events will refuse to restart because of rb::aux_mmap_count==0,
+	 * see comments in perf_aux_output_begin().
+	 *
+	 * Since this is happening on a event-local cpu, no trace is lost
+	 * while restarting.
+	 */
+	if (sd->restart)
+		event->pmu->start(event, PERF_EF_START);
+
 	return 0;
 }
 
+static int perf_event_restart(struct perf_event *event)
+{
+	struct stop_event_data sd = {
+		.event		= event,
+		.restart	= 1,
+	};
+	int ret = 0;
+
+	do {
+		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
+			return 0;
+
+		/* matches smp_wmb() in event_sched_in() */
+		smp_rmb();
+
+		/*
+		 * We only want to restart ACTIVE events, so if the event goes
+		 * inactive here (event->oncpu==-1), there's nothing more to do;
+		 * fall through with ret==-ENXIO.
+		 */
+		ret = cpu_function_call(READ_ONCE(event->oncpu),
+					__perf_event_stop, &sd);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+/*
+ * In order to contain the amount of racy and tricky in the address filter
+ * configuration management, it is a two part process:
+ *
+ * (p1) when userspace mappings change as a result of (1) or (2) or (3) below,
+ *      we update the addresses of corresponding vmas in
+ *	event::addr_filters_offs array and bump the event::addr_filters_gen;
+ * (p2) when an event is scheduled in (pmu::add), it calls
+ *      perf_event_addr_filters_sync() which calls pmu::addr_filters_sync()
+ *      if the generation has changed since the previous call.
+ *
+ * If (p1) happens while the event is active, we restart it to force (p2).
+ *
+ * (1) perf_addr_filters_apply(): adjusting filters' offsets based on
+ *     pre-existing mappings, called once when new filters arrive via SET_FILTER
+ *     ioctl;
+ * (2) perf_addr_filters_adjust(): adjusting filters' offsets based on newly
+ *     registered mapping, called for every new mmap(), with mm::mmap_sem down
+ *     for reading;
+ * (3) perf_event_addr_filters_exec(): clearing filters' offsets in the process
+ *     of exec.
+ */
+void perf_event_addr_filters_sync(struct perf_event *event)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+
+	if (!has_addr_filter(event))
+		return;
+
+	raw_spin_lock(&ifh->lock);
+	if (event->addr_filters_gen != event->hw.addr_filters_gen) {
+		event->pmu->addr_filters_sync(event);
+		event->hw.addr_filters_gen = event->addr_filters_gen;
+	}
+	raw_spin_unlock(&ifh->lock);
+}
+EXPORT_SYMBOL_GPL(perf_event_addr_filters_sync);
+
 static int _perf_event_refresh(struct perf_event *event, int refresh)
 {
 	/*
@@ -3236,16 +3321,6 @@ out:
 		put_ctx(clone_ctx);
 }
 
-void perf_event_exec(void)
-{
-	int ctxn;
-
-	rcu_read_lock();
-	for_each_task_context_nr(ctxn)
-		perf_event_enable_on_exec(ctxn);
-	rcu_read_unlock();
-}
-
 struct perf_read_data {
 	struct perf_event *event;
 	bool group;
@@ -3757,6 +3832,9 @@ static bool exclusive_event_installable(struct perf_event *event,
 	return true;
 }
 
+static void perf_addr_filters_splice(struct perf_event *event,
+				       struct list_head *head);
+
 static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending);
@@ -3784,6 +3862,8 @@ static void _free_event(struct perf_event *event)
 	}
 
 	perf_event_free_bpf_prog(event);
+	perf_addr_filters_splice(event, NULL);
+	kfree(event->addr_filters_offs);
 
 	if (event->destroy)
 		event->destroy(event);
@@ -5855,6 +5935,57 @@ next:
 	rcu_read_unlock();
 }
 
+/*
+ * Clear all file-based filters at exec, they'll have to be
+ * re-instated when/if these objects are mmapped again.
+ */
+static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	struct perf_addr_filter *filter;
+	unsigned int restart = 0, count = 0;
+	unsigned long flags;
+
+	if (!has_addr_filter(event))
+		return;
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		if (filter->inode) {
+			event->addr_filters_offs[count] = 0;
+			restart++;
+		}
+
+		count++;
+	}
+
+	if (restart)
+		event->addr_filters_gen++;
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	if (restart)
+		perf_event_restart(event);
+}
+
+void perf_event_exec(void)
+{
+	struct perf_event_context *ctx;
+	int ctxn;
+
+	rcu_read_lock();
+	for_each_task_context_nr(ctxn) {
+		ctx = current->perf_event_ctxp[ctxn];
+		if (!ctx)
+			continue;
+
+		perf_event_enable_on_exec(ctxn);
+
+		perf_event_aux_ctx(ctx, perf_event_addr_filters_exec, NULL,
+				   true);
+	}
+	rcu_read_unlock();
+}
+
 struct remote_output {
 	struct ring_buffer	*rb;
 	int			err;
@@ -5865,6 +5996,9 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
 	struct perf_event *parent = event->parent;
 	struct remote_output *ro = data;
 	struct ring_buffer *rb = ro->rb;
+	struct stop_event_data sd = {
+		.event	= event,
+	};
 
 	if (!has_aux(event))
 		return;
@@ -5877,7 +6011,7 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
 	 * ring-buffer, but it will be the child that's actually using it:
 	 */
 	if (rcu_dereference(parent->rb) == rb)
-		ro->err = __perf_event_stop(event);
+		ro->err = __perf_event_stop(&sd);
 }
 
 static int __perf_pmu_output_stop(void *info)
@@ -6338,6 +6472,87 @@ got_name:
 	kfree(buf);
 }
 
+/*
+ * Whether this @filter depends on a dynamic object which is not loaded
+ * yet or its load addresses are not known.
+ */
+static bool perf_addr_filter_needs_mmap(struct perf_addr_filter *filter)
+{
+	return filter->filter && filter->inode;
+}
+
+/*
+ * Check whether inode and address range match filter criteria.
+ */
+static bool perf_addr_filter_match(struct perf_addr_filter *filter,
+				     struct file *file, unsigned long offset,
+				     unsigned long size)
+{
+	if (filter->inode != file->f_inode)
+		return false;
+
+	if (filter->offset > offset + size)
+		return false;
+
+	if (filter->offset + filter->size < offset)
+		return false;
+
+	return true;
+}
+
+static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	struct vm_area_struct *vma = data;
+	unsigned long off = vma->vm_pgoff << PAGE_SHIFT, flags;
+	struct file *file = vma->vm_file;
+	struct perf_addr_filter *filter;
+	unsigned int restart = 0, count = 0;
+
+	if (!has_addr_filter(event))
+		return;
+
+	if (!file)
+		return;
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		if (perf_addr_filter_match(filter, file, off,
+					     vma->vm_end - vma->vm_start)) {
+			event->addr_filters_offs[count] = vma->vm_start;
+			restart++;
+		}
+
+		count++;
+	}
+
+	if (restart)
+		event->addr_filters_gen++;
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	if (restart)
+		perf_event_restart(event);
+}
+
+/*
+ * Adjust all task's events' filters to the new vma
+ */
+static void perf_addr_filters_adjust(struct vm_area_struct *vma)
+{
+	struct perf_event_context *ctx;
+	int ctxn;
+
+	rcu_read_lock();
+	for_each_task_context_nr(ctxn) {
+		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
+		if (!ctx)
+			continue;
+
+		perf_event_aux_ctx(ctx, __perf_addr_filters_adjust, vma, true);
+	}
+	rcu_read_unlock();
+}
+
 void perf_event_mmap(struct vm_area_struct *vma)
 {
 	struct perf_mmap_event mmap_event;
@@ -6369,6 +6584,7 @@ void perf_event_mmap(struct vm_area_struct *vma)
 		/* .flags (attr_mmap2 only) */
 	};
 
+	perf_addr_filters_adjust(vma);
 	perf_event_mmap_event(&mmap_event);
 }
 
@@ -7328,13 +7544,370 @@ void perf_bp_event(struct perf_event *bp, void *data)
 }
 #endif
 
+/*
+ * Allocate a new address filter
+ */
+static struct perf_addr_filter *
+perf_addr_filter_new(struct perf_event *event, struct list_head *filters)
+{
+	int node = cpu_to_node(event->cpu == -1 ? 0 : event->cpu);
+	struct perf_addr_filter *filter;
+
+	filter = kzalloc_node(sizeof(*filter), GFP_KERNEL, node);
+	if (!filter)
+		return NULL;
+
+	INIT_LIST_HEAD(&filter->entry);
+	list_add_tail(&filter->entry, filters);
+
+	return filter;
+}
+
+static void free_filters_list(struct list_head *filters)
+{
+	struct perf_addr_filter *filter, *iter;
+
+	list_for_each_entry_safe(filter, iter, filters, entry) {
+		if (filter->inode)
+			iput(filter->inode);
+		list_del(&filter->entry);
+		kfree(filter);
+	}
+}
+
+/*
+ * Free existing address filters and optionally install new ones
+ */
+static void perf_addr_filters_splice(struct perf_event *event,
+				     struct list_head *head)
+{
+	unsigned long flags;
+	LIST_HEAD(list);
+
+	if (!has_addr_filter(event))
+		return;
+
+	/* don't bother with children, they don't have their own filters */
+	if (event->parent)
+		return;
+
+	raw_spin_lock_irqsave(&event->addr_filters.lock, flags);
+
+	list_splice_init(&event->addr_filters.list, &list);
+	if (head)
+		list_splice(head, &event->addr_filters.list);
+
+	raw_spin_unlock_irqrestore(&event->addr_filters.lock, flags);
+
+	free_filters_list(&list);
+}
+
+/*
+ * Scan through mm's vmas and see if one of them matches the
+ * @filter; if so, adjust filter's address range.
+ * Called with mm::mmap_sem down for reading.
+ */
+static unsigned long perf_addr_filter_apply(struct perf_addr_filter *filter,
+					    struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	for (vma = mm->mmap; vma->vm_next; vma = vma->vm_next) {
+		struct file *file = vma->vm_file;
+		unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
+		unsigned long vma_size = vma->vm_end - vma->vm_start;
+
+		if (!file)
+			continue;
+
+		if (!perf_addr_filter_match(filter, file, off, vma_size))
+			continue;
+
+		return vma->vm_start;
+	}
+
+	return 0;
+}
+
+/*
+ * Update event's address range filters based on the
+ * task's existing mappings, if any.
+ */
+static void perf_event_addr_filters_apply(struct perf_event *event)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	struct task_struct *task = READ_ONCE(event->ctx->task);
+	struct perf_addr_filter *filter;
+	struct mm_struct *mm = NULL;
+	unsigned int count = 0;
+	unsigned long flags;
+
+	/*
+	 * We may observe TASK_TOMBSTONE, which means that the event tear-down
+	 * will stop on the parent's child_mutex that our caller is also holding
+	 */
+	if (task == TASK_TOMBSTONE)
+		return;
+
+	mm = get_task_mm(event->ctx->task);
+	if (!mm)
+		goto restart;
+
+	down_read(&mm->mmap_sem);
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		event->addr_filters_offs[count] = 0;
+
+		if (perf_addr_filter_needs_mmap(filter))
+			event->addr_filters_offs[count] =
+				perf_addr_filter_apply(filter, mm);
+
+		count++;
+	}
+
+	event->addr_filters_gen++;
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	up_read(&mm->mmap_sem);
+
+	mmput(mm);
+
+restart:
+	perf_event_restart(event);
+}
+
+/*
+ * Address range filtering: limiting the data to certain
+ * instruction address ranges. Filters are ioctl()ed to us from
+ * userspace as ascii strings.
+ *
+ * Filter string format:
+ *
+ * ACTION RANGE_SPEC
+ * where ACTION is one of the
+ *  * "filter": limit the trace to this region
+ *  * "start": start tracing from this address
+ *  * "stop": stop tracing at this address/region;
+ * RANGE_SPEC is
+ *  * for kernel addresses: <start address>[/<size>]
+ *  * for object files:     <start address>[/<size>]@</path/to/object/file>
+ *
+ * if <size> is not specified, the range is treated as a single address.
+ */
+enum {
+	IF_ACT_FILTER,
+	IF_ACT_START,
+	IF_ACT_STOP,
+	IF_SRC_FILE,
+	IF_SRC_KERNEL,
+	IF_SRC_FILEADDR,
+	IF_SRC_KERNELADDR,
+};
+
+enum {
+	IF_STATE_ACTION = 0,
+	IF_STATE_SOURCE,
+	IF_STATE_END,
+};
+
+static const match_table_t if_tokens = {
+	{ IF_ACT_FILTER,	"filter" },
+	{ IF_ACT_START,		"start" },
+	{ IF_ACT_STOP,		"stop" },
+	{ IF_SRC_FILE,		"%u/%u@%s" },
+	{ IF_SRC_KERNEL,	"%u/%u" },
+	{ IF_SRC_FILEADDR,	"%u@%s" },
+	{ IF_SRC_KERNELADDR,	"%u" },
+};
+
+/*
+ * Address filter string parser
+ */
+static int
+perf_event_parse_addr_filter(struct perf_event *event, char *fstr,
+			     struct list_head *filters)
+{
+	struct perf_addr_filter *filter = NULL;
+	char *start, *orig, *filename = NULL;
+	struct path path;
+	substring_t args[MAX_OPT_ARGS];
+	int state = IF_STATE_ACTION, token;
+	unsigned int kernel = 0;
+	int ret = -EINVAL;
+
+	orig = fstr = kstrdup(fstr, GFP_KERNEL);
+	if (!fstr)
+		return -ENOMEM;
+
+	while ((start = strsep(&fstr, " ,\n")) != NULL) {
+		ret = -EINVAL;
+
+		if (!*start)
+			continue;
+
+		/* filter definition begins */
+		if (state == IF_STATE_ACTION) {
+			filter = perf_addr_filter_new(event, filters);
+			if (!filter)
+				goto fail;
+		}
+
+		token = match_token(start, if_tokens, args);
+		switch (token) {
+		case IF_ACT_FILTER:
+		case IF_ACT_START:
+			filter->filter = 1;
+
+		case IF_ACT_STOP:
+			if (state != IF_STATE_ACTION)
+				goto fail;
+
+			state = IF_STATE_SOURCE;
+			break;
+
+		case IF_SRC_KERNELADDR:
+		case IF_SRC_KERNEL:
+			kernel = 1;
+
+		case IF_SRC_FILEADDR:
+		case IF_SRC_FILE:
+			if (state != IF_STATE_SOURCE)
+				goto fail;
+
+			if (token == IF_SRC_FILE || token == IF_SRC_KERNEL)
+				filter->range = 1;
+
+			*args[0].to = 0;
+			ret = kstrtoul(args[0].from, 0, &filter->offset);
+			if (ret)
+				goto fail;
+
+			if (filter->range) {
+				*args[1].to = 0;
+				ret = kstrtoul(args[1].from, 0, &filter->size);
+				if (ret)
+					goto fail;
+			}
+
+			if (token == IF_SRC_FILE) {
+				filename = match_strdup(&args[2]);
+				if (!filename) {
+					ret = -ENOMEM;
+					goto fail;
+				}
+			}
+
+			state = IF_STATE_END;
+			break;
+
+		default:
+			goto fail;
+		}
+
+		/*
+		 * Filter definition is fully parsed, validate and install it.
+		 * Make sure that it doesn't contradict itself or the event's
+		 * attribute.
+		 */
+		if (state == IF_STATE_END) {
+			if (kernel && event->attr.exclude_kernel)
+				goto fail;
+
+			if (!kernel) {
+				if (!filename)
+					goto fail;
+
+				/* look up the path and grab its inode */
+				ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+				if (ret)
+					goto fail_free_name;
+
+				filter->inode = igrab(d_inode(path.dentry));
+				path_put(&path);
+				kfree(filename);
+				filename = NULL;
+
+				ret = -EINVAL;
+				if (!filter->inode ||
+				    !S_ISREG(filter->inode->i_mode))
+					/* free_filters_list() will iput() */
+					goto fail;
+			}
+
+			/* ready to consume more filters */
+			state = IF_STATE_ACTION;
+			filter = NULL;
+		}
+	}
+
+	if (state != IF_STATE_ACTION)
+		goto fail;
+
+	kfree(orig);
+
+	return 0;
+
+fail_free_name:
+	kfree(filename);
+fail:
+	free_filters_list(filters);
+	kfree(orig);
+
+	return ret;
+}
+
+static int
+perf_event_set_addr_filter(struct perf_event *event, char *filter_str)
+{
+	LIST_HEAD(filters);
+	int ret;
+
+	/*
+	 * Since this is called in perf_ioctl() path, we're already holding
+	 * ctx::mutex.
+	 */
+	lockdep_assert_held(&event->ctx->mutex);
+
+	if (WARN_ON_ONCE(event->parent))
+		return -EINVAL;
+
+	/*
+	 * For now, we only support filtering in per-task events; doing so
+	 * for cpu-wide events requires additional context switching trickery,
+	 * since same object code will be mapped at different virtual
+	 * addresses in different processes.
+	 */
+	if (!event->ctx->task)
+		return -EOPNOTSUPP;
+
+	ret = perf_event_parse_addr_filter(event, filter_str, &filters);
+	if (ret)
+		return ret;
+
+	ret = event->pmu->addr_filters_validate(&filters);
+	if (ret) {
+		free_filters_list(&filters);
+		return ret;
+	}
+
+	/* remove existing filters, if any */
+	perf_addr_filters_splice(event, &filters);
+
+	/* install new filters */
+	perf_event_for_each_child(event, perf_event_addr_filters_apply);
+
+	return ret;
+}
+
 static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 {
 	char *filter_str;
 	int ret = -EINVAL;
 
-	if (event->attr.type != PERF_TYPE_TRACEPOINT ||
-	    !IS_ENABLED(CONFIG_EVENT_TRACING))
+	if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
+	    !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
+	    !has_addr_filter(event))
 		return -EINVAL;
 
 	filter_str = strndup_user(arg, PAGE_SIZE);
@@ -7345,6 +7918,8 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 	    event->attr.type == PERF_TYPE_TRACEPOINT)
 		ret = ftrace_profile_set_filter(event, event->attr.config,
 						filter_str);
+	else if (has_addr_filter(event))
+		ret = perf_event_set_addr_filter(event, filter_str);
 
 	kfree(filter_str);
 	return ret;
@@ -8139,6 +8714,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->sibling_list);
 	INIT_LIST_HEAD(&event->rb_entry);
 	INIT_LIST_HEAD(&event->active_entry);
+	INIT_LIST_HEAD(&event->addr_filters.list);
 	INIT_HLIST_NODE(&event->hlist_entry);
 
 
@@ -8146,6 +8722,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	init_irq_work(&event->pending, perf_pending_event);
 
 	mutex_init(&event->mmap_mutex);
+	raw_spin_lock_init(&event->addr_filters.lock);
 
 	atomic_long_set(&event->refcount, 1);
 	event->cpu		= cpu;
@@ -8230,11 +8807,22 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (err)
 		goto err_pmu;
 
+	if (has_addr_filter(event)) {
+		event->addr_filters_offs = kcalloc(pmu->nr_addr_filters,
+						   sizeof(unsigned long),
+						   GFP_KERNEL);
+		if (!event->addr_filters_offs)
+			goto err_per_task;
+
+		/* force hw sync on the address filters */
+		event->addr_filters_gen = 1;
+	}
+
 	if (!event->parent) {
 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
 			err = get_callchain_buffers();
 			if (err)
-				goto err_per_task;
+				goto err_addr_filters;
 		}
 	}
 
@@ -8243,6 +8831,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	return event;
 
+err_addr_filters:
+	kfree(event->addr_filters_offs);
+
 err_per_task:
 	exclusive_event_destroy(event);
 
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 6/7] perf/x86/intel/pt: Add support for address range filtering in PT
  2016-04-27 15:44 [PATCH v2 0/7] perf: Introduce address range filtering Alexander Shishkin
                   ` (4 preceding siblings ...)
  2016-04-27 15:44 ` [PATCH v2 5/7] perf: Introduce address range filtering Alexander Shishkin
@ 2016-04-27 15:44 ` Alexander Shishkin
  2016-05-05  9:47   ` [tip:perf/core] " tip-bot for Alexander Shishkin
  2016-04-27 15:44 ` [PATCH v2 7/7] perf: Let userspace know if pmu supports address filters Alexander Shishkin
  6 siblings, 1 reply; 20+ messages in thread
From: Alexander Shishkin @ 2016-04-27 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, x86, Borislav Petkov, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier,
	Alexander Shishkin

Newer versions of Intel PT support address ranges, which can be used to
define IP address range-based filters or TraceSTOP regions. Number of
ranges in enumerated via cpuid.

This patch implements pmu callbacks and related low-level code to allow
filter validation, configuration and programming into the hardware.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/events/intel/pt.c | 179 ++++++++++++++++++++++++++++++++++++++++++---
 arch/x86/events/intel/pt.h |  26 +++++++
 2 files changed, 194 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 4eea19e4d9..2228ed2fd6 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -253,6 +253,75 @@ static bool pt_event_valid(struct perf_event *event)
  * These all are cpu affine and operate on a local PT
  */
 
+/* Address ranges and their corresponding msr configuration registers */
+static const struct pt_address_range {
+	unsigned long	msr_a;
+	unsigned long	msr_b;
+	unsigned int	reg_off;
+} pt_address_ranges[] = {
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR0_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR0_B,
+		.reg_off = RTIT_CTL_ADDR0_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR1_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR1_B,
+		.reg_off = RTIT_CTL_ADDR1_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR2_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR2_B,
+		.reg_off = RTIT_CTL_ADDR2_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR3_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR3_B,
+		.reg_off = RTIT_CTL_ADDR3_OFFSET,
+	}
+};
+
+static u64 pt_config_filters(struct perf_event *event)
+{
+	struct pt_filters *filters = event->hw.addr_filters;
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	unsigned int range = 0;
+	u64 rtit_ctl = 0;
+
+	if (!filters)
+		return 0;
+
+	perf_event_addr_filters_sync(event);
+
+	for (range = 0; range < filters->nr_filters; range++) {
+		struct pt_filter *filter = &filters->filter[range];
+
+		/*
+		 * Note, if the range has zero start/end addresses due
+		 * to its dynamic object not being loaded yet, we just
+		 * go ahead and program zeroed range, which will simply
+		 * produce no data. Note^2: if executable code at 0x0
+		 * is a concern, we can set up an "invalid" configuration
+		 * such as msr_b < msr_a.
+		 */
+
+		/* avoid redundant msr writes */
+		if (pt->filters.filter[range].msr_a != filter->msr_a) {
+			wrmsrl(pt_address_ranges[range].msr_a, filter->msr_a);
+			pt->filters.filter[range].msr_a = filter->msr_a;
+		}
+
+		if (pt->filters.filter[range].msr_b != filter->msr_b) {
+			wrmsrl(pt_address_ranges[range].msr_b, filter->msr_b);
+			pt->filters.filter[range].msr_b = filter->msr_b;
+		}
+
+		rtit_ctl |= filter->config << pt_address_ranges[range].reg_off;
+	}
+
+	return rtit_ctl;
+}
+
 static void pt_config(struct perf_event *event)
 {
 	u64 reg;
@@ -262,7 +331,8 @@ static void pt_config(struct perf_event *event)
 		wrmsrl(MSR_IA32_RTIT_STATUS, 0);
 	}
 
-	reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
+	reg = pt_config_filters(event);
+	reg |= RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
 
 	if (!event->attr.exclude_kernel)
 		reg |= RTIT_CTL_OS;
@@ -907,6 +977,82 @@ static void pt_buffer_free_aux(void *data)
 	kfree(buf);
 }
 
+static int pt_addr_filters_init(struct perf_event *event)
+{
+	struct pt_filters *filters;
+	int node = event->cpu == -1 ? -1 : cpu_to_node(event->cpu);
+
+	if (!pt_cap_get(PT_CAP_num_address_ranges))
+		return 0;
+
+	filters = kzalloc_node(sizeof(struct pt_filters), GFP_KERNEL, node);
+	if (!filters)
+		return -ENOMEM;
+
+	if (event->parent)
+		memcpy(filters, event->parent->hw.addr_filters,
+		       sizeof(*filters));
+
+	event->hw.addr_filters = filters;
+
+	return 0;
+}
+
+static void pt_addr_filters_fini(struct perf_event *event)
+{
+	kfree(event->hw.addr_filters);
+	event->hw.addr_filters = NULL;
+}
+
+static int pt_event_addr_filters_validate(struct list_head *filters)
+{
+	struct perf_addr_filter *filter;
+	int range = 0;
+
+	list_for_each_entry(filter, filters, entry) {
+		/* PT doesn't support single address triggers */
+		if (!filter->range)
+			return -EOPNOTSUPP;
+
+		if (!filter->inode && !kernel_ip(filter->offset))
+			return -EINVAL;
+
+		if (++range > pt_cap_get(PT_CAP_num_address_ranges))
+			return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static void pt_event_addr_filters_sync(struct perf_event *event)
+{
+	struct perf_addr_filters_head *head = perf_event_addr_filters(event);
+	unsigned long msr_a, msr_b, *offs = event->addr_filters_offs;
+	struct pt_filters *filters = event->hw.addr_filters;
+	struct perf_addr_filter *filter;
+	int range = 0;
+
+	if (!filters)
+		return;
+
+	list_for_each_entry(filter, &head->list, entry) {
+		if (filter->inode && !offs[range]) {
+			msr_a = msr_b = 0;
+		} else {
+			/* apply the offset */
+			msr_a = filter->offset + offs[range];
+			msr_b = filter->size + msr_a;
+		}
+
+		filters->filter[range].msr_a  = msr_a;
+		filters->filter[range].msr_b  = msr_b;
+		filters->filter[range].config = filter->filter ? 1 : 2;
+		range++;
+	}
+
+	filters->nr_filters = range;
+}
+
 /**
  * intel_pt_interrupt() - PT PMI handler
  */
@@ -1075,6 +1221,7 @@ static void pt_event_read(struct perf_event *event)
 
 static void pt_event_destroy(struct perf_event *event)
 {
+	pt_addr_filters_fini(event);
 	x86_del_exclusive(x86_lbr_exclusive_pt);
 }
 
@@ -1089,6 +1236,11 @@ static int pt_event_init(struct perf_event *event)
 	if (x86_add_exclusive(x86_lbr_exclusive_pt))
 		return -EBUSY;
 
+	if (pt_addr_filters_init(event)) {
+		x86_del_exclusive(x86_lbr_exclusive_pt);
+		return -ENOMEM;
+	}
+
 	event->destroy = pt_event_destroy;
 
 	return 0;
@@ -1142,16 +1294,21 @@ static __init int pt_init(void)
 			PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF;
 
 	pt_pmu.pmu.capabilities	|= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE;
-	pt_pmu.pmu.attr_groups	= pt_attr_groups;
-	pt_pmu.pmu.task_ctx_nr	= perf_sw_context;
-	pt_pmu.pmu.event_init	= pt_event_init;
-	pt_pmu.pmu.add		= pt_event_add;
-	pt_pmu.pmu.del		= pt_event_del;
-	pt_pmu.pmu.start	= pt_event_start;
-	pt_pmu.pmu.stop		= pt_event_stop;
-	pt_pmu.pmu.read		= pt_event_read;
-	pt_pmu.pmu.setup_aux	= pt_buffer_setup_aux;
-	pt_pmu.pmu.free_aux	= pt_buffer_free_aux;
+	pt_pmu.pmu.attr_groups		 = pt_attr_groups;
+	pt_pmu.pmu.task_ctx_nr		 = perf_sw_context;
+	pt_pmu.pmu.event_init		 = pt_event_init;
+	pt_pmu.pmu.add			 = pt_event_add;
+	pt_pmu.pmu.del			 = pt_event_del;
+	pt_pmu.pmu.start		 = pt_event_start;
+	pt_pmu.pmu.stop			 = pt_event_stop;
+	pt_pmu.pmu.read			 = pt_event_read;
+	pt_pmu.pmu.setup_aux		 = pt_buffer_setup_aux;
+	pt_pmu.pmu.free_aux		 = pt_buffer_free_aux;
+	pt_pmu.pmu.addr_filters_sync     = pt_event_addr_filters_sync;
+	pt_pmu.pmu.addr_filters_validate = pt_event_addr_filters_validate;
+	pt_pmu.pmu.nr_addr_filters       =
+		pt_cap_get(PT_CAP_num_address_ranges);
+
 	ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
 
 	return ret;
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index c5698eb6de..670d450870 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -139,13 +139,39 @@ struct pt_buffer {
 	struct topa_entry	*topa_index[0];
 };
 
+#define PT_FILTERS_NUM	4
+
+/**
+ * struct pt_filter - IP range filter configuration
+ * @msr_a:	range start, goes to RTIT_ADDRn_A
+ * @msr_b:	range end, goes to RTIT_ADDRn_B
+ * @config:	4-bit field in RTIT_CTL
+ */
+struct pt_filter {
+	unsigned long	msr_a;
+	unsigned long	msr_b;
+	unsigned long	config;
+};
+
+/**
+ * struct pt_filters - IP range filtering context
+ * @filter:	filters defined for this context
+ * @nr_filters:	number of defined filters in the @filter array
+ */
+struct pt_filters {
+	struct pt_filter	filter[PT_FILTERS_NUM];
+	unsigned int		nr_filters;
+};
+
 /**
  * struct pt - per-cpu pt context
  * @handle:	perf output handle
+ * @filters:		last configured filters
  * @handle_nmi:	do handle PT PMI on this cpu, there's an active event
  */
 struct pt {
 	struct perf_output_handle handle;
+	struct pt_filters	filters;
 	int			handle_nmi;
 };
 
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 7/7] perf: Let userspace know if pmu supports address filters
  2016-04-27 15:44 [PATCH v2 0/7] perf: Introduce address range filtering Alexander Shishkin
                   ` (5 preceding siblings ...)
  2016-04-27 15:44 ` [PATCH v2 6/7] perf/x86/intel/pt: Add support for address range filtering in PT Alexander Shishkin
@ 2016-04-27 15:44 ` Alexander Shishkin
  2016-05-05  9:47   ` [tip:perf/core] perf/core: Let userspace know if the PMU " tip-bot for Alexander Shishkin
  6 siblings, 1 reply; 20+ messages in thread
From: Alexander Shishkin @ 2016-04-27 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, x86, Borislav Petkov, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier,
	Alexander Shishkin

Export an additional common attribute for PMUs that support address range
filtering to let the perf userspace identify such PMUs in a uniform way.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 606398b62a..6d07e294a7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8282,6 +8282,20 @@ static void free_pmu_context(struct pmu *pmu)
 out:
 	mutex_unlock(&pmus_lock);
 }
+
+/*
+ * Let userspace know that this PMU supports address range filtering
+ */
+static ssize_t nr_addr_filters_show(struct device *dev,
+				    struct device_attribute *attr,
+				    char *page)
+{
+	struct pmu *pmu = dev_get_drvdata(dev);
+
+	return snprintf(page, PAGE_SIZE - 1, "%d\n", pmu->nr_addr_filters);
+}
+DEVICE_ATTR_RO(nr_addr_filters);
+
 static struct idr pmu_idr;
 
 static ssize_t
@@ -8383,9 +8397,19 @@ static int pmu_dev_alloc(struct pmu *pmu)
 	if (ret)
 		goto free_dev;
 
+	/* for PMUs with address filters, throw in an extra attribute */
+	if (pmu->nr_addr_filters)
+		ret = device_create_file(pmu->dev, &dev_attr_nr_addr_filters);
+
+	if (ret)
+		goto del_dev;
+
 out:
 	return ret;
 
+del_dev:
+	device_del(pmu->dev);
+
 free_dev:
 	put_device(pmu->dev);
 	goto out;
@@ -8521,6 +8545,8 @@ void perf_pmu_unregister(struct pmu *pmu)
 	free_percpu(pmu->pmu_disable_count);
 	if (pmu->type >= PERF_TYPE_MAX)
 		idr_remove(&pmu_idr, pmu->type);
+	if (pmu->nr_addr_filters)
+		device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
 	device_del(pmu->dev);
 	put_device(pmu->dev);
 	free_pmu_context(pmu);
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] perf: Introduce address range filtering
  2016-04-27 15:44 ` [PATCH v2 5/7] perf: Introduce address range filtering Alexander Shishkin
@ 2016-04-28 17:09   ` Peter Zijlstra
  2016-04-29 18:12   ` Mathieu Poirier
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2016-04-28 17:09 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Thomas Gleixner, x86, Borislav Petkov, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier

On Wed, Apr 27, 2016 at 06:44:46PM +0300, Alexander Shishkin wrote:
> +/*
> + * In order to contain the amount of racy and tricky in the address filter
> + * configuration management, it is a two part process:
> + *
> + * (p1) when userspace mappings change as a result of (1) or (2) or (3) below,
> + *      we update the addresses of corresponding vmas in
> + *	event::addr_filters_offs array and bump the event::addr_filters_gen;
> + * (p2) when an event is scheduled in (pmu::add), it calls
> + *      perf_event_addr_filters_sync() which calls pmu::addr_filters_sync()
> + *      if the generation has changed since the previous call.
> + *
> + * If (p1) happens while the event is active, we restart it to force (p2).

Tiny nit; restart only does ::stop(),::start(), which doesn't go through
::add().

> + *
> + * (1) perf_addr_filters_apply(): adjusting filters' offsets based on
> + *     pre-existing mappings, called once when new filters arrive via SET_FILTER
> + *     ioctl;
> + * (2) perf_addr_filters_adjust(): adjusting filters' offsets based on newly
> + *     registered mapping, called for every new mmap(), with mm::mmap_sem down
> + *     for reading;
> + * (3) perf_event_addr_filters_exec(): clearing filters' offsets in the process
> + *     of exec.
> + */

> +static unsigned long perf_addr_filter_apply(struct perf_addr_filter *filter,
> +					    struct mm_struct *mm)
> +{
> +	struct vm_area_struct *vma;
> +
> +	for (vma = mm->mmap; vma->vm_next; vma = vma->vm_next) {

Doesn't this miss the very last vma? That is, I was expecting:

	for (vma = mm->mmap; vma; vma = vma->vm_next) {

> +		struct file *file = vma->vm_file;
> +		unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
> +		unsigned long vma_size = vma->vm_end - vma->vm_start;
> +
> +		if (!file)
> +			continue;
> +
> +		if (!perf_addr_filter_match(filter, file, off, vma_size))
> +			continue;
> +
> +		return vma->vm_start;
> +	}
> +
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] perf: Introduce address range filtering
  2016-04-27 15:44 ` [PATCH v2 5/7] perf: Introduce address range filtering Alexander Shishkin
  2016-04-28 17:09   ` Peter Zijlstra
@ 2016-04-29 18:12   ` Mathieu Poirier
  2016-04-30  4:59     ` Alexander Shishkin
  2016-05-03  8:56   ` Peter Zijlstra
  2016-05-05  9:46   ` [tip:perf/core] perf/core: " tip-bot for Alexander Shishkin
  3 siblings, 1 reply; 20+ messages in thread
From: Mathieu Poirier @ 2016-04-29 18:12 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Thomas Gleixner, x86, Borislav Petkov,
	Ingo Molnar, linux-kernel, vince, Stephane Eranian,
	Arnaldo Carvalho de Melo

On 27 April 2016 at 09:44, Alexander Shishkin
<alexander.shishkin@linux.intel.com> wrote:
> Many instruction trace pmus out there support address range-based
> filtering, which would, for example, generate trace data only for a
> given range of instruction addresses, which is useful for tracing
> individual functions, modules or libraries. Other pmus may also
> utilize this functionality to allow filtering to or filtering out
> code at certain address ranges.
>
> This patch introduces the interface for userspace to specify these
> filters and for the pmu drivers to apply these filters to hardware
> configuration.
>
> The user interface is an ascii string that is passed via an ioctl
> and specifies (in the form of an ascii string) address ranges within
> certain object files or within kernel. There is no special treatment
> for kernel modules yet, but it might be a worthy pursuit.
>
> The pmu driver interface basically add two extra callbacks to the
> pmu driver structure, one of which validates the filter configuration
> proposed by the user against what the hardware is actually capable of
> doing and the other one translates hardware-independent filter
> configuration into something that can be programmed into the
> hardware.
>
> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---
>  include/linux/perf_event.h |  98 +++++++
>  kernel/events/core.c       | 623 +++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 705 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 85749ae8cb..32b2b4866a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -151,6 +151,15 @@ struct hw_perf_event {
>          */
>         struct task_struct              *target;
>
> +       /*
> +        * PMU would store hardware filter configuration
> +        * here.
> +        */
> +       void                            *addr_filters;
> +
> +       /* Last sync'ed generation of filters */
> +       unsigned long                   addr_filters_gen;
> +
>  /*
>   * hw_perf_event::state flags; used to track the PERF_EF_* state.
>   */
> @@ -240,6 +249,9 @@ struct pmu {
>         int                             task_ctx_nr;
>         int                             hrtimer_interval_ms;
>
> +       /* number of address filters this pmu can do */
> +       unsigned int                    nr_addr_filters;
> +
>         /*
>          * Fully disable/enable this PMU, can be used to protect from the PMI
>          * as well as for lazy/batch writing of the MSRs.
> @@ -393,12 +405,71 @@ struct pmu {
>         void (*free_aux)                (void *aux); /* optional */
>
>         /*
> +        * Validate address range filters: make sure hw supports the
> +        * requested configuration and number of filters; return 0 if the
> +        * supplied filters are valid, -errno otherwise.
> +        *
> +        * Runs in the context of the ioctl()ing process and is not serialized
> +        * with the rest of the pmu callbacks.
> +        */
> +       int (*addr_filters_validate)    (struct list_head *filters);
> +                                       /* optional */
> +
> +       /*
> +        * Synchronize address range filter configuration:
> +        * translate hw-agnostic filters into hardware configuration in
> +        * event::hw::addr_filters.
> +        *
> +        * Runs as a part of filter sync sequence that is done in ->start()
> +        * callback by calling perf_event_addr_filters_sync().
> +        *
> +        * May (and should) traverse event::addr_filters::list, for which its
> +        * caller provides necessary serialization.
> +        */
> +       void (*addr_filters_sync)       (struct perf_event *event);
> +                                       /* optional */
> +
> +       /*
>          * Filter events for PMU-specific reasons.
>          */
>         int (*filter_match)             (struct perf_event *event); /* optional */
>  };
>
>  /**
> + * struct perf_addr_filter - address range filter definition
> + * @entry:     event's filter list linkage
> + * @inode:     object file's inode for file-based filters
> + * @offset:    filter range offset
> + * @size:      filter range size
> + * @range:     1: range, 0: address
> + * @filter:    1: filter/start, 0: stop
> + *
> + * This is a hardware-agnostic filter configuration as specified by the user.
> + */
> +struct perf_addr_filter {
> +       struct list_head        entry;
> +       struct inode            *inode;
> +       unsigned long           offset;
> +       unsigned long           size;
> +       unsigned int            range   : 1,
> +                               filter  : 1;
> +};
> +
> +/**
> + * struct perf_addr_filters_head - container for address range filters
> + * @list:      list of filters for this event
> + * @lock:      spinlock that serializes accesses to the @list and event's
> + *             (and its children's) filter generations.
> + *
> + * A child event will use parent's @list (and therefore @lock), so they are
> + * bundled together; see perf_event_addr_filters().
> + */
> +struct perf_addr_filters_head {
> +       struct list_head        list;
> +       raw_spinlock_t          lock;
> +};
> +
> +/**
>   * enum perf_event_active_state - the states of a event
>   */
>  enum perf_event_active_state {
> @@ -566,6 +637,12 @@ struct perf_event {
>
>         atomic_t                        event_limit;
>
> +       /* address range filters */
> +       struct perf_addr_filters_head   addr_filters;
> +       /* vma address array for file-based filders */
> +       unsigned long                   *addr_filters_offs;
> +       unsigned long                   addr_filters_gen;
> +
>         void (*destroy)(struct perf_event *);
>         struct rcu_head                 rcu_head;
>
> @@ -679,6 +756,22 @@ struct perf_output_handle {
>         int                             page;
>  };
>
> +void perf_event_addr_filters_sync(struct perf_event *event);
> +
> +/*
> + * An inherited event uses parent's filters
> + */
> +static inline struct perf_addr_filters_head *
> +perf_event_addr_filters(struct perf_event *event)
> +{
> +       struct perf_addr_filters_head *ifh = &event->addr_filters;
> +
> +       if (event->parent)
> +               ifh = &event->parent->addr_filters;
> +
> +       return ifh;
> +}
> +
>  #ifdef CONFIG_CGROUP_PERF
>
>  /*
> @@ -1066,6 +1159,11 @@ static inline bool is_write_backward(struct perf_event *event)
>         return !!event->attr.write_backward;
>  }
>
> +static inline bool has_addr_filter(struct perf_event *event)
> +{
> +       return event->pmu->nr_addr_filters;
> +}
> +
>  extern int perf_output_begin(struct perf_output_handle *handle,
>                              struct perf_event *event, unsigned int size);
>  extern int perf_output_begin_forward(struct perf_output_handle *handle,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6d335f3878..606398b62a 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -44,6 +44,8 @@
>  #include <linux/compat.h>
>  #include <linux/bpf.h>
>  #include <linux/filter.h>
> +#include <linux/namei.h>
> +#include <linux/parser.h>
>
>  #include "internal.h"
>
> @@ -2364,11 +2366,17 @@ void perf_event_enable(struct perf_event *event)
>  }
>  EXPORT_SYMBOL_GPL(perf_event_enable);
>
> +struct stop_event_data {
> +       struct perf_event       *event;
> +       unsigned int            restart;
> +};
> +
>  static int __perf_event_stop(void *info)
>  {
> -       struct perf_event *event = info;
> +       struct stop_event_data *sd = info;
> +       struct perf_event *event = sd->event;
>
> -       /* for AUX events, our job is done if the event is already inactive */
> +       /* if it's already INACTIVE, do nothing */
>         if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
>                 return 0;
>
> @@ -2384,9 +2392,86 @@ static int __perf_event_stop(void *info)
>
>         event->pmu->stop(event, PERF_EF_UPDATE);
>
> +       /*
> +        * May race with the actual stop (through perf_pmu_output_stop()),
> +        * but it is only used for events with AUX ring buffer, and such
> +        * events will refuse to restart because of rb::aux_mmap_count==0,
> +        * see comments in perf_aux_output_begin().
> +        *
> +        * Since this is happening on a event-local cpu, no trace is lost
> +        * while restarting.
> +        */
> +       if (sd->restart)
> +               event->pmu->start(event, PERF_EF_START);
> +
>         return 0;
>  }
>
> +static int perf_event_restart(struct perf_event *event)
> +{
> +       struct stop_event_data sd = {
> +               .event          = event,
> +               .restart        = 1,
> +       };
> +       int ret = 0;
> +
> +       do {
> +               if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
> +                       return 0;
> +
> +               /* matches smp_wmb() in event_sched_in() */
> +               smp_rmb();
> +
> +               /*
> +                * We only want to restart ACTIVE events, so if the event goes
> +                * inactive here (event->oncpu==-1), there's nothing more to do;
> +                * fall through with ret==-ENXIO.
> +                */
> +               ret = cpu_function_call(READ_ONCE(event->oncpu),
> +                                       __perf_event_stop, &sd);
> +       } while (ret == -EAGAIN);
> +
> +       return ret;
> +}
> +
> +/*
> + * In order to contain the amount of racy and tricky in the address filter
> + * configuration management, it is a two part process:
> + *
> + * (p1) when userspace mappings change as a result of (1) or (2) or (3) below,
> + *      we update the addresses of corresponding vmas in
> + *     event::addr_filters_offs array and bump the event::addr_filters_gen;
> + * (p2) when an event is scheduled in (pmu::add), it calls
> + *      perf_event_addr_filters_sync() which calls pmu::addr_filters_sync()
> + *      if the generation has changed since the previous call.
> + *
> + * If (p1) happens while the event is active, we restart it to force (p2).
> + *
> + * (1) perf_addr_filters_apply(): adjusting filters' offsets based on
> + *     pre-existing mappings, called once when new filters arrive via SET_FILTER
> + *     ioctl;
> + * (2) perf_addr_filters_adjust(): adjusting filters' offsets based on newly
> + *     registered mapping, called for every new mmap(), with mm::mmap_sem down
> + *     for reading;
> + * (3) perf_event_addr_filters_exec(): clearing filters' offsets in the process
> + *     of exec.
> + */
> +void perf_event_addr_filters_sync(struct perf_event *event)
> +{
> +       struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +
> +       if (!has_addr_filter(event))
> +               return;
> +
> +       raw_spin_lock(&ifh->lock);
> +       if (event->addr_filters_gen != event->hw.addr_filters_gen) {
> +               event->pmu->addr_filters_sync(event);
> +               event->hw.addr_filters_gen = event->addr_filters_gen;
> +       }
> +       raw_spin_unlock(&ifh->lock);
> +}
> +EXPORT_SYMBOL_GPL(perf_event_addr_filters_sync);
> +
>  static int _perf_event_refresh(struct perf_event *event, int refresh)
>  {
>         /*
> @@ -3236,16 +3321,6 @@ out:
>                 put_ctx(clone_ctx);
>  }
>
> -void perf_event_exec(void)
> -{
> -       int ctxn;
> -
> -       rcu_read_lock();
> -       for_each_task_context_nr(ctxn)
> -               perf_event_enable_on_exec(ctxn);
> -       rcu_read_unlock();
> -}
> -
>  struct perf_read_data {
>         struct perf_event *event;
>         bool group;
> @@ -3757,6 +3832,9 @@ static bool exclusive_event_installable(struct perf_event *event,
>         return true;
>  }
>
> +static void perf_addr_filters_splice(struct perf_event *event,
> +                                      struct list_head *head);
> +
>  static void _free_event(struct perf_event *event)
>  {
>         irq_work_sync(&event->pending);
> @@ -3784,6 +3862,8 @@ static void _free_event(struct perf_event *event)
>         }
>
>         perf_event_free_bpf_prog(event);
> +       perf_addr_filters_splice(event, NULL);
> +       kfree(event->addr_filters_offs);
>
>         if (event->destroy)
>                 event->destroy(event);
> @@ -5855,6 +5935,57 @@ next:
>         rcu_read_unlock();
>  }
>
> +/*
> + * Clear all file-based filters at exec, they'll have to be
> + * re-instated when/if these objects are mmapped again.
> + */
> +static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
> +{
> +       struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +       struct perf_addr_filter *filter;
> +       unsigned int restart = 0, count = 0;
> +       unsigned long flags;
> +
> +       if (!has_addr_filter(event))
> +               return;
> +
> +       raw_spin_lock_irqsave(&ifh->lock, flags);
> +       list_for_each_entry(filter, &ifh->list, entry) {
> +               if (filter->inode) {
> +                       event->addr_filters_offs[count] = 0;
> +                       restart++;
> +               }
> +
> +               count++;
> +       }
> +
> +       if (restart)
> +               event->addr_filters_gen++;
> +       raw_spin_unlock_irqrestore(&ifh->lock, flags);
> +
> +       if (restart)
> +               perf_event_restart(event);
> +}
> +
> +void perf_event_exec(void)
> +{
> +       struct perf_event_context *ctx;
> +       int ctxn;
> +
> +       rcu_read_lock();
> +       for_each_task_context_nr(ctxn) {
> +               ctx = current->perf_event_ctxp[ctxn];
> +               if (!ctx)
> +                       continue;
> +
> +               perf_event_enable_on_exec(ctxn);
> +
> +               perf_event_aux_ctx(ctx, perf_event_addr_filters_exec, NULL,
> +                                  true);
> +       }
> +       rcu_read_unlock();
> +}
> +
>  struct remote_output {
>         struct ring_buffer      *rb;
>         int                     err;
> @@ -5865,6 +5996,9 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
>         struct perf_event *parent = event->parent;
>         struct remote_output *ro = data;
>         struct ring_buffer *rb = ro->rb;
> +       struct stop_event_data sd = {
> +               .event  = event,
> +       };
>
>         if (!has_aux(event))
>                 return;
> @@ -5877,7 +6011,7 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
>          * ring-buffer, but it will be the child that's actually using it:
>          */
>         if (rcu_dereference(parent->rb) == rb)
> -               ro->err = __perf_event_stop(event);
> +               ro->err = __perf_event_stop(&sd);
>  }
>
>  static int __perf_pmu_output_stop(void *info)
> @@ -6338,6 +6472,87 @@ got_name:
>         kfree(buf);
>  }
>
> +/*
> + * Whether this @filter depends on a dynamic object which is not loaded
> + * yet or its load addresses are not known.
> + */
> +static bool perf_addr_filter_needs_mmap(struct perf_addr_filter *filter)
> +{
> +       return filter->filter && filter->inode;
> +}
> +
> +/*
> + * Check whether inode and address range match filter criteria.
> + */
> +static bool perf_addr_filter_match(struct perf_addr_filter *filter,
> +                                    struct file *file, unsigned long offset,
> +                                    unsigned long size)
> +{
> +       if (filter->inode != file->f_inode)
> +               return false;
> +
> +       if (filter->offset > offset + size)
> +               return false;
> +
> +       if (filter->offset + filter->size < offset)
> +               return false;
> +
> +       return true;
> +}
> +
> +static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
> +{
> +       struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +       struct vm_area_struct *vma = data;
> +       unsigned long off = vma->vm_pgoff << PAGE_SHIFT, flags;
> +       struct file *file = vma->vm_file;
> +       struct perf_addr_filter *filter;
> +       unsigned int restart = 0, count = 0;
> +
> +       if (!has_addr_filter(event))
> +               return;
> +
> +       if (!file)
> +               return;
> +
> +       raw_spin_lock_irqsave(&ifh->lock, flags);
> +       list_for_each_entry(filter, &ifh->list, entry) {
> +               if (perf_addr_filter_match(filter, file, off,
> +                                            vma->vm_end - vma->vm_start)) {
> +                       event->addr_filters_offs[count] = vma->vm_start;
> +                       restart++;
> +               }
> +
> +               count++;
> +       }
> +
> +       if (restart)
> +               event->addr_filters_gen++;
> +       raw_spin_unlock_irqrestore(&ifh->lock, flags);
> +
> +       if (restart)
> +               perf_event_restart(event);
> +}
> +
> +/*
> + * Adjust all task's events' filters to the new vma
> + */
> +static void perf_addr_filters_adjust(struct vm_area_struct *vma)
> +{
> +       struct perf_event_context *ctx;
> +       int ctxn;
> +
> +       rcu_read_lock();
> +       for_each_task_context_nr(ctxn) {
> +               ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
> +               if (!ctx)
> +                       continue;
> +
> +               perf_event_aux_ctx(ctx, __perf_addr_filters_adjust, vma, true);
> +       }
> +       rcu_read_unlock();
> +}
> +
>  void perf_event_mmap(struct vm_area_struct *vma)
>  {
>         struct perf_mmap_event mmap_event;
> @@ -6369,6 +6584,7 @@ void perf_event_mmap(struct vm_area_struct *vma)
>                 /* .flags (attr_mmap2 only) */
>         };
>
> +       perf_addr_filters_adjust(vma);
>         perf_event_mmap_event(&mmap_event);
>  }
>
> @@ -7328,13 +7544,370 @@ void perf_bp_event(struct perf_event *bp, void *data)
>  }
>  #endif
>
> +/*
> + * Allocate a new address filter
> + */
> +static struct perf_addr_filter *
> +perf_addr_filter_new(struct perf_event *event, struct list_head *filters)
> +{
> +       int node = cpu_to_node(event->cpu == -1 ? 0 : event->cpu);
> +       struct perf_addr_filter *filter;
> +
> +       filter = kzalloc_node(sizeof(*filter), GFP_KERNEL, node);
> +       if (!filter)
> +               return NULL;
> +
> +       INIT_LIST_HEAD(&filter->entry);
> +       list_add_tail(&filter->entry, filters);
> +
> +       return filter;
> +}
> +
> +static void free_filters_list(struct list_head *filters)
> +{
> +       struct perf_addr_filter *filter, *iter;
> +
> +       list_for_each_entry_safe(filter, iter, filters, entry) {
> +               if (filter->inode)
> +                       iput(filter->inode);
> +               list_del(&filter->entry);
> +               kfree(filter);
> +       }
> +}
> +
> +/*
> + * Free existing address filters and optionally install new ones
> + */
> +static void perf_addr_filters_splice(struct perf_event *event,
> +                                    struct list_head *head)
> +{
> +       unsigned long flags;
> +       LIST_HEAD(list);
> +
> +       if (!has_addr_filter(event))
> +               return;
> +
> +       /* don't bother with children, they don't have their own filters */
> +       if (event->parent)
> +               return;
> +
> +       raw_spin_lock_irqsave(&event->addr_filters.lock, flags);
> +
> +       list_splice_init(&event->addr_filters.list, &list);
> +       if (head)
> +               list_splice(head, &event->addr_filters.list);
> +
> +       raw_spin_unlock_irqrestore(&event->addr_filters.lock, flags);
> +
> +       free_filters_list(&list);
> +}
> +
> +/*
> + * Scan through mm's vmas and see if one of them matches the
> + * @filter; if so, adjust filter's address range.
> + * Called with mm::mmap_sem down for reading.
> + */
> +static unsigned long perf_addr_filter_apply(struct perf_addr_filter *filter,
> +                                           struct mm_struct *mm)
> +{
> +       struct vm_area_struct *vma;
> +
> +       for (vma = mm->mmap; vma->vm_next; vma = vma->vm_next) {
> +               struct file *file = vma->vm_file;
> +               unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
> +               unsigned long vma_size = vma->vm_end - vma->vm_start;
> +
> +               if (!file)
> +                       continue;
> +
> +               if (!perf_addr_filter_match(filter, file, off, vma_size))
> +                       continue;
> +
> +               return vma->vm_start;
> +       }
> +
> +       return 0;
> +}
> +
> +/*
> + * Update event's address range filters based on the
> + * task's existing mappings, if any.
> + */
> +static void perf_event_addr_filters_apply(struct perf_event *event)
> +{
> +       struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +       struct task_struct *task = READ_ONCE(event->ctx->task);
> +       struct perf_addr_filter *filter;
> +       struct mm_struct *mm = NULL;
> +       unsigned int count = 0;
> +       unsigned long flags;
> +
> +       /*
> +        * We may observe TASK_TOMBSTONE, which means that the event tear-down
> +        * will stop on the parent's child_mutex that our caller is also holding
> +        */
> +       if (task == TASK_TOMBSTONE)
> +               return;
> +
> +       mm = get_task_mm(event->ctx->task);
> +       if (!mm)
> +               goto restart;
> +
> +       down_read(&mm->mmap_sem);
> +
> +       raw_spin_lock_irqsave(&ifh->lock, flags);
> +       list_for_each_entry(filter, &ifh->list, entry) {
> +               event->addr_filters_offs[count] = 0;
> +
> +               if (perf_addr_filter_needs_mmap(filter))
> +                       event->addr_filters_offs[count] =
> +                               perf_addr_filter_apply(filter, mm);
> +
> +               count++;
> +       }
> +
> +       event->addr_filters_gen++;
> +       raw_spin_unlock_irqrestore(&ifh->lock, flags);
> +
> +       up_read(&mm->mmap_sem);
> +
> +       mmput(mm);
> +
> +restart:
> +       perf_event_restart(event);
> +}
> +
> +/*
> + * Address range filtering: limiting the data to certain
> + * instruction address ranges. Filters are ioctl()ed to us from
> + * userspace as ascii strings.
> + *
> + * Filter string format:
> + *
> + * ACTION RANGE_SPEC
> + * where ACTION is one of the
> + *  * "filter": limit the trace to this region
> + *  * "start": start tracing from this address
> + *  * "stop": stop tracing at this address/region;
> + * RANGE_SPEC is
> + *  * for kernel addresses: <start address>[/<size>]
> + *  * for object files:     <start address>[/<size>]@</path/to/object/file>
> + *
> + * if <size> is not specified, the range is treated as a single address.
> + */
> +enum {
> +       IF_ACT_FILTER,
> +       IF_ACT_START,
> +       IF_ACT_STOP,
> +       IF_SRC_FILE,
> +       IF_SRC_KERNEL,
> +       IF_SRC_FILEADDR,
> +       IF_SRC_KERNELADDR,
> +};
> +
> +enum {
> +       IF_STATE_ACTION = 0,
> +       IF_STATE_SOURCE,
> +       IF_STATE_END,
> +};
> +
> +static const match_table_t if_tokens = {
> +       { IF_ACT_FILTER,        "filter" },
> +       { IF_ACT_START,         "start" },
> +       { IF_ACT_STOP,          "stop" },
> +       { IF_SRC_FILE,          "%u/%u@%s" },
> +       { IF_SRC_KERNEL,        "%u/%u" },
> +       { IF_SRC_FILEADDR,      "%u@%s" },
> +       { IF_SRC_KERNELADDR,    "%u" },
> +};
> +
> +/*
> + * Address filter string parser
> + */
> +static int
> +perf_event_parse_addr_filter(struct perf_event *event, char *fstr,
> +                            struct list_head *filters)
> +{
> +       struct perf_addr_filter *filter = NULL;
> +       char *start, *orig, *filename = NULL;
> +       struct path path;
> +       substring_t args[MAX_OPT_ARGS];
> +       int state = IF_STATE_ACTION, token;
> +       unsigned int kernel = 0;
> +       int ret = -EINVAL;
> +
> +       orig = fstr = kstrdup(fstr, GFP_KERNEL);
> +       if (!fstr)
> +               return -ENOMEM;
> +
> +       while ((start = strsep(&fstr, " ,\n")) != NULL) {
> +               ret = -EINVAL;
> +
> +               if (!*start)
> +                       continue;
> +
> +               /* filter definition begins */
> +               if (state == IF_STATE_ACTION) {
> +                       filter = perf_addr_filter_new(event, filters);
> +                       if (!filter)
> +                               goto fail;
> +               }
> +
> +               token = match_token(start, if_tokens, args);
> +               switch (token) {
> +               case IF_ACT_FILTER:
> +               case IF_ACT_START:
> +                       filter->filter = 1;
> +
> +               case IF_ACT_STOP:
> +                       if (state != IF_STATE_ACTION)
> +                               goto fail;
> +
> +                       state = IF_STATE_SOURCE;
> +                       break;
> +
> +               case IF_SRC_KERNELADDR:
> +               case IF_SRC_KERNEL:
> +                       kernel = 1;
> +
> +               case IF_SRC_FILEADDR:
> +               case IF_SRC_FILE:
> +                       if (state != IF_STATE_SOURCE)
> +                               goto fail;
> +
> +                       if (token == IF_SRC_FILE || token == IF_SRC_KERNEL)
> +                               filter->range = 1;
> +
> +                       *args[0].to = 0;
> +                       ret = kstrtoul(args[0].from, 0, &filter->offset);
> +                       if (ret)
> +                               goto fail;
> +
> +                       if (filter->range) {
> +                               *args[1].to = 0;
> +                               ret = kstrtoul(args[1].from, 0, &filter->size);
> +                               if (ret)
> +                                       goto fail;
> +                       }
> +
> +                       if (token == IF_SRC_FILE) {
> +                               filename = match_strdup(&args[2]);
> +                               if (!filename) {
> +                                       ret = -ENOMEM;
> +                                       goto fail;
> +                               }
> +                       }
> +
> +                       state = IF_STATE_END;
> +                       break;
> +
> +               default:
> +                       goto fail;
> +               }
> +
> +               /*
> +                * Filter definition is fully parsed, validate and install it.
> +                * Make sure that it doesn't contradict itself or the event's
> +                * attribute.
> +                */
> +               if (state == IF_STATE_END) {
> +                       if (kernel && event->attr.exclude_kernel)
> +                               goto fail;
> +
> +                       if (!kernel) {
> +                               if (!filename)
> +                                       goto fail;
> +
> +                               /* look up the path and grab its inode */
> +                               ret = kern_path(filename, LOOKUP_FOLLOW, &path);
> +                               if (ret)
> +                                       goto fail_free_name;
> +
> +                               filter->inode = igrab(d_inode(path.dentry));
> +                               path_put(&path);
> +                               kfree(filename);
> +                               filename = NULL;
> +
> +                               ret = -EINVAL;
> +                               if (!filter->inode ||
> +                                   !S_ISREG(filter->inode->i_mode))
> +                                       /* free_filters_list() will iput() */
> +                                       goto fail;
> +                       }
> +
> +                       /* ready to consume more filters */
> +                       state = IF_STATE_ACTION;
> +                       filter = NULL;
> +               }
> +       }
> +
> +       if (state != IF_STATE_ACTION)
> +               goto fail;
> +
> +       kfree(orig);
> +
> +       return 0;
> +
> +fail_free_name:
> +       kfree(filename);
> +fail:
> +       free_filters_list(filters);
> +       kfree(orig);
> +
> +       return ret;
> +}
> +
> +static int
> +perf_event_set_addr_filter(struct perf_event *event, char *filter_str)
> +{
> +       LIST_HEAD(filters);
> +       int ret;
> +
> +       /*
> +        * Since this is called in perf_ioctl() path, we're already holding
> +        * ctx::mutex.
> +        */
> +       lockdep_assert_held(&event->ctx->mutex);
> +
> +       if (WARN_ON_ONCE(event->parent))
> +               return -EINVAL;
> +
> +       /*
> +        * For now, we only support filtering in per-task events; doing so
> +        * for cpu-wide events requires additional context switching trickery,
> +        * since same object code will be mapped at different virtual
> +        * addresses in different processes.
> +        */
> +       if (!event->ctx->task)
> +               return -EOPNOTSUPP;
> +
> +       ret = perf_event_parse_addr_filter(event, filter_str, &filters);
> +       if (ret)
> +               return ret;
> +
> +       ret = event->pmu->addr_filters_validate(&filters);
> +       if (ret) {
> +               free_filters_list(&filters);
> +               return ret;
> +       }
> +
> +       /* remove existing filters, if any */
> +       perf_addr_filters_splice(event, &filters);
> +
> +       /* install new filters */
> +       perf_event_for_each_child(event, perf_event_addr_filters_apply);
> +
> +       return ret;
> +}
> +
>  static int perf_event_set_filter(struct perf_event *event, void __user *arg)
>  {
>         char *filter_str;
>         int ret = -EINVAL;
>
> -       if (event->attr.type != PERF_TYPE_TRACEPOINT ||
> -           !IS_ENABLED(CONFIG_EVENT_TRACING))
> +       if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
> +           !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
> +           !has_addr_filter(event))
>                 return -EINVAL;
>
>         filter_str = strndup_user(arg, PAGE_SIZE);
> @@ -7345,6 +7918,8 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
>             event->attr.type == PERF_TYPE_TRACEPOINT)
>                 ret = ftrace_profile_set_filter(event, event->attr.config,
>                                                 filter_str);
> +       else if (has_addr_filter(event))
> +               ret = perf_event_set_addr_filter(event, filter_str);
>
>         kfree(filter_str);
>         return ret;
> @@ -8139,6 +8714,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>         INIT_LIST_HEAD(&event->sibling_list);
>         INIT_LIST_HEAD(&event->rb_entry);
>         INIT_LIST_HEAD(&event->active_entry);
> +       INIT_LIST_HEAD(&event->addr_filters.list);
>         INIT_HLIST_NODE(&event->hlist_entry);
>
>
> @@ -8146,6 +8722,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>         init_irq_work(&event->pending, perf_pending_event);
>
>         mutex_init(&event->mmap_mutex);
> +       raw_spin_lock_init(&event->addr_filters.lock);
>
>         atomic_long_set(&event->refcount, 1);
>         event->cpu              = cpu;
> @@ -8230,11 +8807,22 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>         if (err)
>                 goto err_pmu;
>
> +       if (has_addr_filter(event)) {
> +               event->addr_filters_offs = kcalloc(pmu->nr_addr_filters,
> +                                                  sizeof(unsigned long),
> +                                                  GFP_KERNEL);
> +               if (!event->addr_filters_offs)
> +                       goto err_per_task;
> +
> +               /* force hw sync on the address filters */
> +               event->addr_filters_gen = 1;
> +       }
> +
>         if (!event->parent) {
>                 if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
>                         err = get_callchain_buffers();
>                         if (err)
> -                               goto err_per_task;
> +                               goto err_addr_filters;
>                 }
>         }
>
> @@ -8243,6 +8831,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>
>         return event;
>
> +err_addr_filters:
> +       kfree(event->addr_filters_offs);
> +
>  err_per_task:
>         exclusive_event_destroy(event);

I see two things in this work:

1) A framework to deal with filters described in user space.
2) An implementation for address filtering that will work for both
Intel and ARM.

This will work well for address filtering (for both PT and CS) but
what happens when we want to introduce new filters?  This is
inevitable and some filters will be architecture agnostic while others
architecture specific.

To me the above is well done and I can work with it, but the framework
to deal with general filters and the function of address filtering
need to be decoupled from the beginning.  80% of the above can be
reused with a simple name change (dropping "addr").  The rest, like
the parser and how to make the "if_tokens" expandable will be harder
deal with.

Thanks,
Mathieu

>
> --
> 2.8.0.rc3
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] perf: Introduce address range filtering
  2016-04-29 18:12   ` Mathieu Poirier
@ 2016-04-30  4:59     ` Alexander Shishkin
  2016-05-02 14:31       ` Mathieu Poirier
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Shishkin @ 2016-04-30  4:59 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: Peter Zijlstra, Thomas Gleixner, x86, Borislav Petkov,
	Ingo Molnar, linux-kernel, vince, Stephane Eranian,
	Arnaldo Carvalho de Melo

Mathieu Poirier <mathieu.poirier@linaro.org> writes:

> I see two things in this work:

[trimmed 700+ lines of context that had no purpose]

> 1) A framework to deal with filters described in user space.
> 2) An implementation for address filtering that will work for both
> Intel and ARM.
>
> This will work well for address filtering (for both PT and CS) but
> what happens when we want to introduce new filters?  This is
> inevitable and some filters will be architecture agnostic while others
> architecture specific.

Haven't we been through this [1] already?

[1] http://marc.info/?l=linux-kernel&m=145013911827358

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] perf: Introduce address range filtering
  2016-04-30  4:59     ` Alexander Shishkin
@ 2016-05-02 14:31       ` Mathieu Poirier
  0 siblings, 0 replies; 20+ messages in thread
From: Mathieu Poirier @ 2016-05-02 14:31 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Thomas Gleixner, x86, Borislav Petkov,
	Ingo Molnar, linux-kernel, vince, Stephane Eranian,
	Arnaldo Carvalho de Melo

On 29 April 2016 at 22:59, Alexander Shishkin
<alexander.shishkin@linux.intel.com> wrote:
> Mathieu Poirier <mathieu.poirier@linaro.org> writes:
>
>> I see two things in this work:
>
> [trimmed 700+ lines of context that had no purpose]
>
>> 1) A framework to deal with filters described in user space.
>> 2) An implementation for address filtering that will work for both
>> Intel and ARM.
>>
>> This will work well for address filtering (for both PT and CS) but
>> what happens when we want to introduce new filters?  This is
>> inevitable and some filters will be architecture agnostic while others
>> architecture specific.
>
> Haven't we been through this [1] already?
>
> [1] http://marc.info/?l=linux-kernel&m=145013911827358

Right, my comment was about improving the current solution and making
future changes less painful.  In my previous answer I should have made
it clearer that the current code will work for CoreSight and can go in
as is.  Now its up to Peter to decide what he wants to do.

With regards to how this solution fits in with instruction tracing for
other architectures, i.e ARM:

Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>

>
> Regards,
> --
> Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] perf: Introduce address range filtering
  2016-04-27 15:44 ` [PATCH v2 5/7] perf: Introduce address range filtering Alexander Shishkin
  2016-04-28 17:09   ` Peter Zijlstra
  2016-04-29 18:12   ` Mathieu Poirier
@ 2016-05-03  8:56   ` Peter Zijlstra
  2016-05-05  9:46   ` [tip:perf/core] perf/core: " tip-bot for Alexander Shishkin
  3 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2016-05-03  8:56 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Thomas Gleixner, x86, Borislav Petkov, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier,
	Michael Kerrisk (man-pages)

On Wed, Apr 27, 2016 at 06:44:46PM +0300, Alexander Shishkin wrote:
> Many instruction trace pmus out there support address range-based
> filtering, which would, for example, generate trace data only for a
> given range of instruction addresses, which is useful for tracing
> individual functions, modules or libraries. Other pmus may also
> utilize this functionality to allow filtering to or filtering out
> code at certain address ranges.
> 
> This patch introduces the interface for userspace to specify these
> filters and for the pmu drivers to apply these filters to hardware
> configuration.
> 
> The user interface is an ascii string that is passed via an ioctl
> and specifies (in the form of an ascii string) address ranges within
> certain object files or within kernel. There is no special treatment
> for kernel modules yet, but it might be a worthy pursuit.
> 
> The pmu driver interface basically add two extra callbacks to the
> pmu driver structure, one of which validates the filter configuration
> proposed by the user against what the hardware is actually capable of
> doing and the other one translates hardware-independent filter
> configuration into something that can be programmed into the
> hardware.

Alexander, could you please write a manpage patch for this new API?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [tip:perf/core] perf/core: Move set_filter() out of CONFIG_EVENT_TRACING
  2016-04-27 15:44 ` [PATCH v2 1/7] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
@ 2016-05-05  9:45   ` tip-bot for Alexander Shishkin
  0 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Alexander Shishkin @ 2016-05-05  9:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: alexander.shishkin, vincent.weaver, hpa, mathieu.poirier, bp,
	acme, linux-kernel, mingo, acme, peterz, eranian, tglx, torvalds,
	jolsa

Commit-ID:  c796bbbe8dccd9c91ebbb99ffef33e0f73ced7bf
Gitweb:     http://git.kernel.org/tip/c796bbbe8dccd9c91ebbb99ffef33e0f73ced7bf
Author:     Alexander Shishkin <alexander.shishkin@linux.intel.com>
AuthorDate: Wed, 27 Apr 2016 18:44:42 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 10:13:55 +0200

perf/core: Move set_filter() out of CONFIG_EVENT_TRACING

For instruction trace filtering, namely, for communicating filter
definitions from userspace, I'd like to re-use the SET_FILTER code
that the tracepoints are using currently.

To that end, move the relevant code out from behind the
CONFIG_EVENT_TRACING dependency.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1461771888-10409-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 45 ++++++++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9de459a..13956ee 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7235,24 +7235,6 @@ static inline void perf_tp_register(void)
 	perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
 }
 
-static int perf_event_set_filter(struct perf_event *event, void __user *arg)
-{
-	char *filter_str;
-	int ret;
-
-	if (event->attr.type != PERF_TYPE_TRACEPOINT)
-		return -EINVAL;
-
-	filter_str = strndup_user(arg, PAGE_SIZE);
-	if (IS_ERR(filter_str))
-		return PTR_ERR(filter_str);
-
-	ret = ftrace_profile_set_filter(event, event->attr.config, filter_str);
-
-	kfree(filter_str);
-	return ret;
-}
-
 static void perf_event_free_filter(struct perf_event *event)
 {
 	ftrace_profile_free_filter(event);
@@ -7307,11 +7289,6 @@ static inline void perf_tp_register(void)
 {
 }
 
-static int perf_event_set_filter(struct perf_event *event, void __user *arg)
-{
-	return -ENOENT;
-}
-
 static void perf_event_free_filter(struct perf_event *event)
 {
 }
@@ -7339,6 +7316,28 @@ void perf_bp_event(struct perf_event *bp, void *data)
 }
 #endif
 
+static int perf_event_set_filter(struct perf_event *event, void __user *arg)
+{
+	char *filter_str;
+	int ret = -EINVAL;
+
+	if (event->attr.type != PERF_TYPE_TRACEPOINT ||
+	    !IS_ENABLED(CONFIG_EVENT_TRACING))
+		return -EINVAL;
+
+	filter_str = strndup_user(arg, PAGE_SIZE);
+	if (IS_ERR(filter_str))
+		return PTR_ERR(filter_str);
+
+	if (IS_ENABLED(CONFIG_EVENT_TRACING) &&
+	    event->attr.type == PERF_TYPE_TRACEPOINT)
+		ret = ftrace_profile_set_filter(event, event->attr.config,
+						filter_str);
+
+	kfree(filter_str);
+	return ret;
+}
+
 /*
  * hrtimer based swevent callback
  */

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [tip:perf/core] perf/x86/intel/pt: Move PT specific MSR bit definitions to a private header
  2016-04-27 15:44 ` [PATCH v2 2/7] perf/x86/intel/pt: Move MSR bit definitions to a private header Alexander Shishkin
@ 2016-05-05  9:45   ` tip-bot for Alexander Shishkin
  0 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Alexander Shishkin @ 2016-05-05  9:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mathieu.poirier, bp, peterz, acme, torvalds, hpa,
	alexander.shishkin, vincent.weaver, acme, eranian, mingo,
	linux-kernel, jolsa, tglx

Commit-ID:  0dd28e2cdaff5319c86cc3ed11d1ca4cf1554046
Gitweb:     http://git.kernel.org/tip/0dd28e2cdaff5319c86cc3ed11d1ca4cf1554046
Author:     Alexander Shishkin <alexander.shishkin@linux.intel.com>
AuthorDate: Wed, 27 Apr 2016 18:44:43 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 10:13:55 +0200

perf/x86/intel/pt: Move PT specific MSR bit definitions to a private header

Nothing outside of the Intel PT driver should ever care about its MSR
bits, so there is no reason to keep them in msr-index.h. This patch
moves them to a pt-local header.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1461771888-10409-3-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/intel/pt.h       | 24 ++++++++++++++++++++++++
 arch/x86/include/asm/msr-index.h | 20 --------------------
 2 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index 3abb5f5..81454fa 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -20,6 +20,30 @@
 #define __INTEL_PT_H__
 
 /*
+ * PT MSR bit definitions
+ */
+#define RTIT_CTL_TRACEEN		BIT(0)
+#define RTIT_CTL_CYCLEACC		BIT(1)
+#define RTIT_CTL_OS			BIT(2)
+#define RTIT_CTL_USR			BIT(3)
+#define RTIT_CTL_CR3EN			BIT(7)
+#define RTIT_CTL_TOPA			BIT(8)
+#define RTIT_CTL_MTC_EN			BIT(9)
+#define RTIT_CTL_TSC_EN			BIT(10)
+#define RTIT_CTL_DISRETC		BIT(11)
+#define RTIT_CTL_BRANCH_EN		BIT(13)
+#define RTIT_CTL_MTC_RANGE_OFFSET	14
+#define RTIT_CTL_MTC_RANGE		(0x0full << RTIT_CTL_MTC_RANGE_OFFSET)
+#define RTIT_CTL_CYC_THRESH_OFFSET	19
+#define RTIT_CTL_CYC_THRESH		(0x0full << RTIT_CTL_CYC_THRESH_OFFSET)
+#define RTIT_CTL_PSB_FREQ_OFFSET	24
+#define RTIT_CTL_PSB_FREQ      		(0x0full << RTIT_CTL_PSB_FREQ_OFFSET)
+#define RTIT_STATUS_CONTEXTEN		BIT(1)
+#define RTIT_STATUS_TRIGGEREN		BIT(2)
+#define RTIT_STATUS_ERROR		BIT(4)
+#define RTIT_STATUS_STOPPED		BIT(5)
+
+/*
  * Single-entry ToPA: when this close to region boundary, switch
  * buffers to avoid losing data.
  */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 94555b4..7193577 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -89,27 +89,7 @@
 #define MSR_PEBS_LD_LAT_THRESHOLD	0x000003f6
 
 #define MSR_IA32_RTIT_CTL		0x00000570
-#define RTIT_CTL_TRACEEN		BIT(0)
-#define RTIT_CTL_CYCLEACC		BIT(1)
-#define RTIT_CTL_OS			BIT(2)
-#define RTIT_CTL_USR			BIT(3)
-#define RTIT_CTL_CR3EN			BIT(7)
-#define RTIT_CTL_TOPA			BIT(8)
-#define RTIT_CTL_MTC_EN			BIT(9)
-#define RTIT_CTL_TSC_EN			BIT(10)
-#define RTIT_CTL_DISRETC		BIT(11)
-#define RTIT_CTL_BRANCH_EN		BIT(13)
-#define RTIT_CTL_MTC_RANGE_OFFSET	14
-#define RTIT_CTL_MTC_RANGE		(0x0full << RTIT_CTL_MTC_RANGE_OFFSET)
-#define RTIT_CTL_CYC_THRESH_OFFSET	19
-#define RTIT_CTL_CYC_THRESH		(0x0full << RTIT_CTL_CYC_THRESH_OFFSET)
-#define RTIT_CTL_PSB_FREQ_OFFSET	24
-#define RTIT_CTL_PSB_FREQ      		(0x0full << RTIT_CTL_PSB_FREQ_OFFSET)
 #define MSR_IA32_RTIT_STATUS		0x00000571
-#define RTIT_STATUS_CONTEXTEN		BIT(1)
-#define RTIT_STATUS_TRIGGEREN		BIT(2)
-#define RTIT_STATUS_ERROR		BIT(4)
-#define RTIT_STATUS_STOPPED		BIT(5)
 #define MSR_IA32_RTIT_CR3_MATCH		0x00000572
 #define MSR_IA32_RTIT_OUTPUT_BASE	0x00000560
 #define MSR_IA32_RTIT_OUTPUT_MASK	0x00000561

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [tip:perf/core] perf/x86/intel/pt: Add IP filtering register/CPUID bits
  2016-04-27 15:44 ` [PATCH v2 3/7] perf/x86/intel/pt: IP filtering register/cpuid bits Alexander Shishkin
@ 2016-05-05  9:46   ` tip-bot for Alexander Shishkin
  0 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Alexander Shishkin @ 2016-05-05  9:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: vincent.weaver, alexander.shishkin, bp, torvalds, hpa,
	mathieu.poirier, mingo, eranian, linux-kernel, jolsa, acme, acme,
	peterz, tglx

Commit-ID:  f127fa098d76444c7a47b2f009356979492d77cd
Gitweb:     http://git.kernel.org/tip/f127fa098d76444c7a47b2f009356979492d77cd
Author:     Alexander Shishkin <alexander.shishkin@linux.intel.com>
AuthorDate: Wed, 27 Apr 2016 18:44:44 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 10:13:56 +0200

perf/x86/intel/pt: Add IP filtering register/CPUID bits

New versions of Intel PT support address range-based filtering. Add
the new registers, bit definitions and relevant CPUID bits.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1461771888-10409-4-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/intel/pt.c       |  2 ++
 arch/x86/events/intel/pt.h       | 12 ++++++++++++
 arch/x86/include/asm/msr-index.h |  9 +++++++++
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 05ef87d..e5bfafe 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -67,11 +67,13 @@ static struct pt_cap_desc {
 	PT_CAP(max_subleaf,		0, CR_EAX, 0xffffffff),
 	PT_CAP(cr3_filtering,		0, CR_EBX, BIT(0)),
 	PT_CAP(psb_cyc,			0, CR_EBX, BIT(1)),
+	PT_CAP(ip_filtering,		0, CR_EBX, BIT(2)),
 	PT_CAP(mtc,			0, CR_EBX, BIT(3)),
 	PT_CAP(topa_output,		0, CR_ECX, BIT(0)),
 	PT_CAP(topa_multiple_entries,	0, CR_ECX, BIT(1)),
 	PT_CAP(single_range_output,	0, CR_ECX, BIT(2)),
 	PT_CAP(payloads_lip,		0, CR_ECX, BIT(31)),
+	PT_CAP(num_address_ranges,	1, CR_EAX, 0x3),
 	PT_CAP(mtc_periods,		1, CR_EAX, 0xffff0000),
 	PT_CAP(cycle_thresholds,	1, CR_EBX, 0xffff),
 	PT_CAP(psb_periods,		1, CR_EBX, 0xffff0000),
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index 81454fa..0ed9000 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -38,8 +38,18 @@
 #define RTIT_CTL_CYC_THRESH		(0x0full << RTIT_CTL_CYC_THRESH_OFFSET)
 #define RTIT_CTL_PSB_FREQ_OFFSET	24
 #define RTIT_CTL_PSB_FREQ      		(0x0full << RTIT_CTL_PSB_FREQ_OFFSET)
+#define RTIT_CTL_ADDR0_OFFSET		32
+#define RTIT_CTL_ADDR0      		(0x0full << RTIT_CTL_ADDR0_OFFSET)
+#define RTIT_CTL_ADDR1_OFFSET		36
+#define RTIT_CTL_ADDR1      		(0x0full << RTIT_CTL_ADDR1_OFFSET)
+#define RTIT_CTL_ADDR2_OFFSET		40
+#define RTIT_CTL_ADDR2      		(0x0full << RTIT_CTL_ADDR2_OFFSET)
+#define RTIT_CTL_ADDR3_OFFSET		44
+#define RTIT_CTL_ADDR3      		(0x0full << RTIT_CTL_ADDR3_OFFSET)
+#define RTIT_STATUS_FILTEREN		BIT(0)
 #define RTIT_STATUS_CONTEXTEN		BIT(1)
 #define RTIT_STATUS_TRIGGEREN		BIT(2)
+#define RTIT_STATUS_BUFFOVF		BIT(3)
 #define RTIT_STATUS_ERROR		BIT(4)
 #define RTIT_STATUS_STOPPED		BIT(5)
 
@@ -76,11 +86,13 @@ enum pt_capabilities {
 	PT_CAP_max_subleaf = 0,
 	PT_CAP_cr3_filtering,
 	PT_CAP_psb_cyc,
+	PT_CAP_ip_filtering,
 	PT_CAP_mtc,
 	PT_CAP_topa_output,
 	PT_CAP_topa_multiple_entries,
 	PT_CAP_single_range_output,
 	PT_CAP_payloads_lip,
+	PT_CAP_num_address_ranges,
 	PT_CAP_mtc_periods,
 	PT_CAP_cycle_thresholds,
 	PT_CAP_psb_periods,
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7193577..5a73a9c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -90,6 +90,15 @@
 
 #define MSR_IA32_RTIT_CTL		0x00000570
 #define MSR_IA32_RTIT_STATUS		0x00000571
+#define MSR_IA32_RTIT_STATUS		0x00000571
+#define MSR_IA32_RTIT_ADDR0_A		0x00000580
+#define MSR_IA32_RTIT_ADDR0_B		0x00000581
+#define MSR_IA32_RTIT_ADDR1_A		0x00000582
+#define MSR_IA32_RTIT_ADDR1_B		0x00000583
+#define MSR_IA32_RTIT_ADDR2_A		0x00000584
+#define MSR_IA32_RTIT_ADDR2_B		0x00000585
+#define MSR_IA32_RTIT_ADDR3_A		0x00000586
+#define MSR_IA32_RTIT_ADDR3_B		0x00000587
 #define MSR_IA32_RTIT_CR3_MATCH		0x00000572
 #define MSR_IA32_RTIT_OUTPUT_BASE	0x00000560
 #define MSR_IA32_RTIT_OUTPUT_MASK	0x00000561

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [tip:perf/core] perf/core: Extend perf_event_aux_ctx() to optionally iterate through more events
  2016-04-27 15:44 ` [PATCH v2 4/7] perf: Extend perf_event_aux_ctx() to optionally iterate through more events Alexander Shishkin
@ 2016-05-05  9:46   ` tip-bot for Alexander Shishkin
  0 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Alexander Shishkin @ 2016-05-05  9:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, acme, mathieu.poirier, tglx, mingo, peterz,
	alexander.shishkin, eranian, bp, acme, vincent.weaver, hpa,
	jolsa, torvalds

Commit-ID:  b73e4fefc18adbe4d12ffb746fb16306674c1ac6
Gitweb:     http://git.kernel.org/tip/b73e4fefc18adbe4d12ffb746fb16306674c1ac6
Author:     Alexander Shishkin <alexander.shishkin@linux.intel.com>
AuthorDate: Wed, 27 Apr 2016 18:44:45 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 10:13:57 +0200

perf/core: Extend perf_event_aux_ctx() to optionally iterate through more events

Trace filtering code needs an iterator that can go through all events in
a context, including inactive and filtered, to be able to update their
filters' address ranges based on mmap or exec events.

This patch changes perf_event_aux_ctx() to optionally do this.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1461771888-10409-5-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 13956ee..2bb7c47 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5781,15 +5781,18 @@ typedef void (perf_event_aux_output_cb)(struct perf_event *event, void *data);
 static void
 perf_event_aux_ctx(struct perf_event_context *ctx,
 		   perf_event_aux_output_cb output,
-		   void *data)
+		   void *data, bool all)
 {
 	struct perf_event *event;
 
 	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
-		if (event->state < PERF_EVENT_STATE_INACTIVE)
-			continue;
-		if (!event_filter_match(event))
-			continue;
+		if (!all) {
+			if (event->state < PERF_EVENT_STATE_INACTIVE)
+				continue;
+			if (!event_filter_match(event))
+				continue;
+		}
+
 		output(event, data);
 	}
 }
@@ -5800,7 +5803,7 @@ perf_event_aux_task_ctx(perf_event_aux_output_cb output, void *data,
 {
 	rcu_read_lock();
 	preempt_disable();
-	perf_event_aux_ctx(task_ctx, output, data);
+	perf_event_aux_ctx(task_ctx, output, data, false);
 	preempt_enable();
 	rcu_read_unlock();
 }
@@ -5830,13 +5833,13 @@ perf_event_aux(perf_event_aux_output_cb output, void *data,
 		cpuctx = get_cpu_ptr(pmu->pmu_cpu_context);
 		if (cpuctx->unique_pmu != pmu)
 			goto next;
-		perf_event_aux_ctx(&cpuctx->ctx, output, data);
+		perf_event_aux_ctx(&cpuctx->ctx, output, data, false);
 		ctxn = pmu->task_ctx_nr;
 		if (ctxn < 0)
 			goto next;
 		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
 		if (ctx)
-			perf_event_aux_ctx(ctx, output, data);
+			perf_event_aux_ctx(ctx, output, data, false);
 next:
 		put_cpu_ptr(pmu->pmu_cpu_context);
 	}
@@ -5878,10 +5881,10 @@ static int __perf_pmu_output_stop(void *info)
 	};
 
 	rcu_read_lock();
-	perf_event_aux_ctx(&cpuctx->ctx, __perf_event_output_stop, &ro);
+	perf_event_aux_ctx(&cpuctx->ctx, __perf_event_output_stop, &ro, false);
 	if (cpuctx->task_ctx)
 		perf_event_aux_ctx(cpuctx->task_ctx, __perf_event_output_stop,
-				   &ro);
+				   &ro, false);
 	rcu_read_unlock();
 
 	return ro.err;

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [tip:perf/core] perf/core: Introduce address range filtering
  2016-04-27 15:44 ` [PATCH v2 5/7] perf: Introduce address range filtering Alexander Shishkin
                     ` (2 preceding siblings ...)
  2016-05-03  8:56   ` Peter Zijlstra
@ 2016-05-05  9:46   ` tip-bot for Alexander Shishkin
  3 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Alexander Shishkin @ 2016-05-05  9:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mathieu.poirier, torvalds, eranian, acme, tglx, jolsa,
	alexander.shishkin, bp, vincent.weaver, acme, hpa, mingo, peterz,
	linux-kernel

Commit-ID:  375637bc524952f1122ea22caf5a8f1fecad8228
Gitweb:     http://git.kernel.org/tip/375637bc524952f1122ea22caf5a8f1fecad8228
Author:     Alexander Shishkin <alexander.shishkin@linux.intel.com>
AuthorDate: Wed, 27 Apr 2016 18:44:46 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 10:13:57 +0200

perf/core: Introduce address range filtering

Many instruction tracing PMUs out there support address range-based
filtering, which would, for example, generate trace data only for a
given range of instruction addresses, which is useful for tracing
individual functions, modules or libraries. Other PMUs may also
utilize this functionality to allow filtering to or filtering out
code at certain address ranges.

This patch introduces the interface for userspace to specify these
filters and for the PMU drivers to apply these filters to hardware
configuration.

The user interface is an ASCII string that is passed via an ioctl()
and specifies (in the form of an ASCII string) address ranges within
certain object files or within kernel. There is no special treatment
for kernel modules yet, but it might be a worthy pursuit.

The PMU driver interface basically adds two extra callbacks to the
PMU driver structure, one of which validates the filter configuration
proposed by the user against what the hardware is actually capable of
doing and the other one translates hardware-independent filter
configuration into something that can be programmed into the
hardware.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1461771888-10409-6-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/perf_event.h |  98 +++++++
 kernel/events/core.c       | 623 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 705 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a090700..c77e4a1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -151,6 +151,15 @@ struct hw_perf_event {
 	 */
 	struct task_struct		*target;
 
+	/*
+	 * PMU would store hardware filter configuration
+	 * here.
+	 */
+	void				*addr_filters;
+
+	/* Last sync'ed generation of filters */
+	unsigned long			addr_filters_gen;
+
 /*
  * hw_perf_event::state flags; used to track the PERF_EF_* state.
  */
@@ -240,6 +249,9 @@ struct pmu {
 	int				task_ctx_nr;
 	int				hrtimer_interval_ms;
 
+	/* number of address filters this PMU can do */
+	unsigned int			nr_addr_filters;
+
 	/*
 	 * Fully disable/enable this PMU, can be used to protect from the PMI
 	 * as well as for lazy/batch writing of the MSRs.
@@ -393,12 +405,71 @@ struct pmu {
 	void (*free_aux)		(void *aux); /* optional */
 
 	/*
+	 * Validate address range filters: make sure the HW supports the
+	 * requested configuration and number of filters; return 0 if the
+	 * supplied filters are valid, -errno otherwise.
+	 *
+	 * Runs in the context of the ioctl()ing process and is not serialized
+	 * with the rest of the PMU callbacks.
+	 */
+	int (*addr_filters_validate)	(struct list_head *filters);
+					/* optional */
+
+	/*
+	 * Synchronize address range filter configuration:
+	 * translate hw-agnostic filters into hardware configuration in
+	 * event::hw::addr_filters.
+	 *
+	 * Runs as a part of filter sync sequence that is done in ->start()
+	 * callback by calling perf_event_addr_filters_sync().
+	 *
+	 * May (and should) traverse event::addr_filters::list, for which its
+	 * caller provides necessary serialization.
+	 */
+	void (*addr_filters_sync)	(struct perf_event *event);
+					/* optional */
+
+	/*
 	 * Filter events for PMU-specific reasons.
 	 */
 	int (*filter_match)		(struct perf_event *event); /* optional */
 };
 
 /**
+ * struct perf_addr_filter - address range filter definition
+ * @entry:	event's filter list linkage
+ * @inode:	object file's inode for file-based filters
+ * @offset:	filter range offset
+ * @size:	filter range size
+ * @range:	1: range, 0: address
+ * @filter:	1: filter/start, 0: stop
+ *
+ * This is a hardware-agnostic filter configuration as specified by the user.
+ */
+struct perf_addr_filter {
+	struct list_head	entry;
+	struct inode		*inode;
+	unsigned long		offset;
+	unsigned long		size;
+	unsigned int		range	: 1,
+				filter	: 1;
+};
+
+/**
+ * struct perf_addr_filters_head - container for address range filters
+ * @list:	list of filters for this event
+ * @lock:	spinlock that serializes accesses to the @list and event's
+ *		(and its children's) filter generations.
+ *
+ * A child event will use parent's @list (and therefore @lock), so they are
+ * bundled together; see perf_event_addr_filters().
+ */
+struct perf_addr_filters_head {
+	struct list_head	list;
+	raw_spinlock_t		lock;
+};
+
+/**
  * enum perf_event_active_state - the states of a event
  */
 enum perf_event_active_state {
@@ -566,6 +637,12 @@ struct perf_event {
 
 	atomic_t			event_limit;
 
+	/* address range filters */
+	struct perf_addr_filters_head	addr_filters;
+	/* vma address array for file-based filders */
+	unsigned long			*addr_filters_offs;
+	unsigned long			addr_filters_gen;
+
 	void (*destroy)(struct perf_event *);
 	struct rcu_head			rcu_head;
 
@@ -1070,6 +1147,27 @@ static inline bool is_write_backward(struct perf_event *event)
 	return !!event->attr.write_backward;
 }
 
+static inline bool has_addr_filter(struct perf_event *event)
+{
+	return event->pmu->nr_addr_filters;
+}
+
+/*
+ * An inherited event uses parent's filters
+ */
+static inline struct perf_addr_filters_head *
+perf_event_addr_filters(struct perf_event *event)
+{
+	struct perf_addr_filters_head *ifh = &event->addr_filters;
+
+	if (event->parent)
+		ifh = &event->parent->addr_filters;
+
+	return ifh;
+}
+
+extern void perf_event_addr_filters_sync(struct perf_event *event);
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern int perf_output_begin_forward(struct perf_output_handle *handle,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2bb7c47..ffdc096 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -44,6 +44,8 @@
 #include <linux/compat.h>
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/namei.h>
+#include <linux/parser.h>
 
 #include "internal.h"
 
@@ -2365,11 +2367,17 @@ void perf_event_enable(struct perf_event *event)
 }
 EXPORT_SYMBOL_GPL(perf_event_enable);
 
+struct stop_event_data {
+	struct perf_event	*event;
+	unsigned int		restart;
+};
+
 static int __perf_event_stop(void *info)
 {
-	struct perf_event *event = info;
+	struct stop_event_data *sd = info;
+	struct perf_event *event = sd->event;
 
-	/* for AUX events, our job is done if the event is already inactive */
+	/* if it's already INACTIVE, do nothing */
 	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
 		return 0;
 
@@ -2385,9 +2393,86 @@ static int __perf_event_stop(void *info)
 
 	event->pmu->stop(event, PERF_EF_UPDATE);
 
+	/*
+	 * May race with the actual stop (through perf_pmu_output_stop()),
+	 * but it is only used for events with AUX ring buffer, and such
+	 * events will refuse to restart because of rb::aux_mmap_count==0,
+	 * see comments in perf_aux_output_begin().
+	 *
+	 * Since this is happening on a event-local CPU, no trace is lost
+	 * while restarting.
+	 */
+	if (sd->restart)
+		event->pmu->start(event, PERF_EF_START);
+
 	return 0;
 }
 
+static int perf_event_restart(struct perf_event *event)
+{
+	struct stop_event_data sd = {
+		.event		= event,
+		.restart	= 1,
+	};
+	int ret = 0;
+
+	do {
+		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
+			return 0;
+
+		/* matches smp_wmb() in event_sched_in() */
+		smp_rmb();
+
+		/*
+		 * We only want to restart ACTIVE events, so if the event goes
+		 * inactive here (event->oncpu==-1), there's nothing more to do;
+		 * fall through with ret==-ENXIO.
+		 */
+		ret = cpu_function_call(READ_ONCE(event->oncpu),
+					__perf_event_stop, &sd);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+/*
+ * In order to contain the amount of racy and tricky in the address filter
+ * configuration management, it is a two part process:
+ *
+ * (p1) when userspace mappings change as a result of (1) or (2) or (3) below,
+ *      we update the addresses of corresponding vmas in
+ *	event::addr_filters_offs array and bump the event::addr_filters_gen;
+ * (p2) when an event is scheduled in (pmu::add), it calls
+ *      perf_event_addr_filters_sync() which calls pmu::addr_filters_sync()
+ *      if the generation has changed since the previous call.
+ *
+ * If (p1) happens while the event is active, we restart it to force (p2).
+ *
+ * (1) perf_addr_filters_apply(): adjusting filters' offsets based on
+ *     pre-existing mappings, called once when new filters arrive via SET_FILTER
+ *     ioctl;
+ * (2) perf_addr_filters_adjust(): adjusting filters' offsets based on newly
+ *     registered mapping, called for every new mmap(), with mm::mmap_sem down
+ *     for reading;
+ * (3) perf_event_addr_filters_exec(): clearing filters' offsets in the process
+ *     of exec.
+ */
+void perf_event_addr_filters_sync(struct perf_event *event)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+
+	if (!has_addr_filter(event))
+		return;
+
+	raw_spin_lock(&ifh->lock);
+	if (event->addr_filters_gen != event->hw.addr_filters_gen) {
+		event->pmu->addr_filters_sync(event);
+		event->hw.addr_filters_gen = event->addr_filters_gen;
+	}
+	raw_spin_unlock(&ifh->lock);
+}
+EXPORT_SYMBOL_GPL(perf_event_addr_filters_sync);
+
 static int _perf_event_refresh(struct perf_event *event, int refresh)
 {
 	/*
@@ -3237,16 +3322,6 @@ out:
 		put_ctx(clone_ctx);
 }
 
-void perf_event_exec(void)
-{
-	int ctxn;
-
-	rcu_read_lock();
-	for_each_task_context_nr(ctxn)
-		perf_event_enable_on_exec(ctxn);
-	rcu_read_unlock();
-}
-
 struct perf_read_data {
 	struct perf_event *event;
 	bool group;
@@ -3748,6 +3823,9 @@ static bool exclusive_event_installable(struct perf_event *event,
 	return true;
 }
 
+static void perf_addr_filters_splice(struct perf_event *event,
+				       struct list_head *head);
+
 static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending);
@@ -3775,6 +3853,8 @@ static void _free_event(struct perf_event *event)
 	}
 
 	perf_event_free_bpf_prog(event);
+	perf_addr_filters_splice(event, NULL);
+	kfree(event->addr_filters_offs);
 
 	if (event->destroy)
 		event->destroy(event);
@@ -5846,6 +5926,57 @@ next:
 	rcu_read_unlock();
 }
 
+/*
+ * Clear all file-based filters at exec, they'll have to be
+ * re-instated when/if these objects are mmapped again.
+ */
+static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	struct perf_addr_filter *filter;
+	unsigned int restart = 0, count = 0;
+	unsigned long flags;
+
+	if (!has_addr_filter(event))
+		return;
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		if (filter->inode) {
+			event->addr_filters_offs[count] = 0;
+			restart++;
+		}
+
+		count++;
+	}
+
+	if (restart)
+		event->addr_filters_gen++;
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	if (restart)
+		perf_event_restart(event);
+}
+
+void perf_event_exec(void)
+{
+	struct perf_event_context *ctx;
+	int ctxn;
+
+	rcu_read_lock();
+	for_each_task_context_nr(ctxn) {
+		ctx = current->perf_event_ctxp[ctxn];
+		if (!ctx)
+			continue;
+
+		perf_event_enable_on_exec(ctxn);
+
+		perf_event_aux_ctx(ctx, perf_event_addr_filters_exec, NULL,
+				   true);
+	}
+	rcu_read_unlock();
+}
+
 struct remote_output {
 	struct ring_buffer	*rb;
 	int			err;
@@ -5856,6 +5987,9 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
 	struct perf_event *parent = event->parent;
 	struct remote_output *ro = data;
 	struct ring_buffer *rb = ro->rb;
+	struct stop_event_data sd = {
+		.event	= event,
+	};
 
 	if (!has_aux(event))
 		return;
@@ -5868,7 +6002,7 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
 	 * ring-buffer, but it will be the child that's actually using it:
 	 */
 	if (rcu_dereference(parent->rb) == rb)
-		ro->err = __perf_event_stop(event);
+		ro->err = __perf_event_stop(&sd);
 }
 
 static int __perf_pmu_output_stop(void *info)
@@ -6329,6 +6463,87 @@ got_name:
 	kfree(buf);
 }
 
+/*
+ * Whether this @filter depends on a dynamic object which is not loaded
+ * yet or its load addresses are not known.
+ */
+static bool perf_addr_filter_needs_mmap(struct perf_addr_filter *filter)
+{
+	return filter->filter && filter->inode;
+}
+
+/*
+ * Check whether inode and address range match filter criteria.
+ */
+static bool perf_addr_filter_match(struct perf_addr_filter *filter,
+				     struct file *file, unsigned long offset,
+				     unsigned long size)
+{
+	if (filter->inode != file->f_inode)
+		return false;
+
+	if (filter->offset > offset + size)
+		return false;
+
+	if (filter->offset + filter->size < offset)
+		return false;
+
+	return true;
+}
+
+static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	struct vm_area_struct *vma = data;
+	unsigned long off = vma->vm_pgoff << PAGE_SHIFT, flags;
+	struct file *file = vma->vm_file;
+	struct perf_addr_filter *filter;
+	unsigned int restart = 0, count = 0;
+
+	if (!has_addr_filter(event))
+		return;
+
+	if (!file)
+		return;
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		if (perf_addr_filter_match(filter, file, off,
+					     vma->vm_end - vma->vm_start)) {
+			event->addr_filters_offs[count] = vma->vm_start;
+			restart++;
+		}
+
+		count++;
+	}
+
+	if (restart)
+		event->addr_filters_gen++;
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	if (restart)
+		perf_event_restart(event);
+}
+
+/*
+ * Adjust all task's events' filters to the new vma
+ */
+static void perf_addr_filters_adjust(struct vm_area_struct *vma)
+{
+	struct perf_event_context *ctx;
+	int ctxn;
+
+	rcu_read_lock();
+	for_each_task_context_nr(ctxn) {
+		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
+		if (!ctx)
+			continue;
+
+		perf_event_aux_ctx(ctx, __perf_addr_filters_adjust, vma, true);
+	}
+	rcu_read_unlock();
+}
+
 void perf_event_mmap(struct vm_area_struct *vma)
 {
 	struct perf_mmap_event mmap_event;
@@ -6360,6 +6575,7 @@ void perf_event_mmap(struct vm_area_struct *vma)
 		/* .flags (attr_mmap2 only) */
 	};
 
+	perf_addr_filters_adjust(vma);
 	perf_event_mmap_event(&mmap_event);
 }
 
@@ -7319,13 +7535,370 @@ void perf_bp_event(struct perf_event *bp, void *data)
 }
 #endif
 
+/*
+ * Allocate a new address filter
+ */
+static struct perf_addr_filter *
+perf_addr_filter_new(struct perf_event *event, struct list_head *filters)
+{
+	int node = cpu_to_node(event->cpu == -1 ? 0 : event->cpu);
+	struct perf_addr_filter *filter;
+
+	filter = kzalloc_node(sizeof(*filter), GFP_KERNEL, node);
+	if (!filter)
+		return NULL;
+
+	INIT_LIST_HEAD(&filter->entry);
+	list_add_tail(&filter->entry, filters);
+
+	return filter;
+}
+
+static void free_filters_list(struct list_head *filters)
+{
+	struct perf_addr_filter *filter, *iter;
+
+	list_for_each_entry_safe(filter, iter, filters, entry) {
+		if (filter->inode)
+			iput(filter->inode);
+		list_del(&filter->entry);
+		kfree(filter);
+	}
+}
+
+/*
+ * Free existing address filters and optionally install new ones
+ */
+static void perf_addr_filters_splice(struct perf_event *event,
+				     struct list_head *head)
+{
+	unsigned long flags;
+	LIST_HEAD(list);
+
+	if (!has_addr_filter(event))
+		return;
+
+	/* don't bother with children, they don't have their own filters */
+	if (event->parent)
+		return;
+
+	raw_spin_lock_irqsave(&event->addr_filters.lock, flags);
+
+	list_splice_init(&event->addr_filters.list, &list);
+	if (head)
+		list_splice(head, &event->addr_filters.list);
+
+	raw_spin_unlock_irqrestore(&event->addr_filters.lock, flags);
+
+	free_filters_list(&list);
+}
+
+/*
+ * Scan through mm's vmas and see if one of them matches the
+ * @filter; if so, adjust filter's address range.
+ * Called with mm::mmap_sem down for reading.
+ */
+static unsigned long perf_addr_filter_apply(struct perf_addr_filter *filter,
+					    struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		struct file *file = vma->vm_file;
+		unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
+		unsigned long vma_size = vma->vm_end - vma->vm_start;
+
+		if (!file)
+			continue;
+
+		if (!perf_addr_filter_match(filter, file, off, vma_size))
+			continue;
+
+		return vma->vm_start;
+	}
+
+	return 0;
+}
+
+/*
+ * Update event's address range filters based on the
+ * task's existing mappings, if any.
+ */
+static void perf_event_addr_filters_apply(struct perf_event *event)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	struct task_struct *task = READ_ONCE(event->ctx->task);
+	struct perf_addr_filter *filter;
+	struct mm_struct *mm = NULL;
+	unsigned int count = 0;
+	unsigned long flags;
+
+	/*
+	 * We may observe TASK_TOMBSTONE, which means that the event tear-down
+	 * will stop on the parent's child_mutex that our caller is also holding
+	 */
+	if (task == TASK_TOMBSTONE)
+		return;
+
+	mm = get_task_mm(event->ctx->task);
+	if (!mm)
+		goto restart;
+
+	down_read(&mm->mmap_sem);
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		event->addr_filters_offs[count] = 0;
+
+		if (perf_addr_filter_needs_mmap(filter))
+			event->addr_filters_offs[count] =
+				perf_addr_filter_apply(filter, mm);
+
+		count++;
+	}
+
+	event->addr_filters_gen++;
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	up_read(&mm->mmap_sem);
+
+	mmput(mm);
+
+restart:
+	perf_event_restart(event);
+}
+
+/*
+ * Address range filtering: limiting the data to certain
+ * instruction address ranges. Filters are ioctl()ed to us from
+ * userspace as ascii strings.
+ *
+ * Filter string format:
+ *
+ * ACTION RANGE_SPEC
+ * where ACTION is one of the
+ *  * "filter": limit the trace to this region
+ *  * "start": start tracing from this address
+ *  * "stop": stop tracing at this address/region;
+ * RANGE_SPEC is
+ *  * for kernel addresses: <start address>[/<size>]
+ *  * for object files:     <start address>[/<size>]@</path/to/object/file>
+ *
+ * if <size> is not specified, the range is treated as a single address.
+ */
+enum {
+	IF_ACT_FILTER,
+	IF_ACT_START,
+	IF_ACT_STOP,
+	IF_SRC_FILE,
+	IF_SRC_KERNEL,
+	IF_SRC_FILEADDR,
+	IF_SRC_KERNELADDR,
+};
+
+enum {
+	IF_STATE_ACTION = 0,
+	IF_STATE_SOURCE,
+	IF_STATE_END,
+};
+
+static const match_table_t if_tokens = {
+	{ IF_ACT_FILTER,	"filter" },
+	{ IF_ACT_START,		"start" },
+	{ IF_ACT_STOP,		"stop" },
+	{ IF_SRC_FILE,		"%u/%u@%s" },
+	{ IF_SRC_KERNEL,	"%u/%u" },
+	{ IF_SRC_FILEADDR,	"%u@%s" },
+	{ IF_SRC_KERNELADDR,	"%u" },
+};
+
+/*
+ * Address filter string parser
+ */
+static int
+perf_event_parse_addr_filter(struct perf_event *event, char *fstr,
+			     struct list_head *filters)
+{
+	struct perf_addr_filter *filter = NULL;
+	char *start, *orig, *filename = NULL;
+	struct path path;
+	substring_t args[MAX_OPT_ARGS];
+	int state = IF_STATE_ACTION, token;
+	unsigned int kernel = 0;
+	int ret = -EINVAL;
+
+	orig = fstr = kstrdup(fstr, GFP_KERNEL);
+	if (!fstr)
+		return -ENOMEM;
+
+	while ((start = strsep(&fstr, " ,\n")) != NULL) {
+		ret = -EINVAL;
+
+		if (!*start)
+			continue;
+
+		/* filter definition begins */
+		if (state == IF_STATE_ACTION) {
+			filter = perf_addr_filter_new(event, filters);
+			if (!filter)
+				goto fail;
+		}
+
+		token = match_token(start, if_tokens, args);
+		switch (token) {
+		case IF_ACT_FILTER:
+		case IF_ACT_START:
+			filter->filter = 1;
+
+		case IF_ACT_STOP:
+			if (state != IF_STATE_ACTION)
+				goto fail;
+
+			state = IF_STATE_SOURCE;
+			break;
+
+		case IF_SRC_KERNELADDR:
+		case IF_SRC_KERNEL:
+			kernel = 1;
+
+		case IF_SRC_FILEADDR:
+		case IF_SRC_FILE:
+			if (state != IF_STATE_SOURCE)
+				goto fail;
+
+			if (token == IF_SRC_FILE || token == IF_SRC_KERNEL)
+				filter->range = 1;
+
+			*args[0].to = 0;
+			ret = kstrtoul(args[0].from, 0, &filter->offset);
+			if (ret)
+				goto fail;
+
+			if (filter->range) {
+				*args[1].to = 0;
+				ret = kstrtoul(args[1].from, 0, &filter->size);
+				if (ret)
+					goto fail;
+			}
+
+			if (token == IF_SRC_FILE) {
+				filename = match_strdup(&args[2]);
+				if (!filename) {
+					ret = -ENOMEM;
+					goto fail;
+				}
+			}
+
+			state = IF_STATE_END;
+			break;
+
+		default:
+			goto fail;
+		}
+
+		/*
+		 * Filter definition is fully parsed, validate and install it.
+		 * Make sure that it doesn't contradict itself or the event's
+		 * attribute.
+		 */
+		if (state == IF_STATE_END) {
+			if (kernel && event->attr.exclude_kernel)
+				goto fail;
+
+			if (!kernel) {
+				if (!filename)
+					goto fail;
+
+				/* look up the path and grab its inode */
+				ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+				if (ret)
+					goto fail_free_name;
+
+				filter->inode = igrab(d_inode(path.dentry));
+				path_put(&path);
+				kfree(filename);
+				filename = NULL;
+
+				ret = -EINVAL;
+				if (!filter->inode ||
+				    !S_ISREG(filter->inode->i_mode))
+					/* free_filters_list() will iput() */
+					goto fail;
+			}
+
+			/* ready to consume more filters */
+			state = IF_STATE_ACTION;
+			filter = NULL;
+		}
+	}
+
+	if (state != IF_STATE_ACTION)
+		goto fail;
+
+	kfree(orig);
+
+	return 0;
+
+fail_free_name:
+	kfree(filename);
+fail:
+	free_filters_list(filters);
+	kfree(orig);
+
+	return ret;
+}
+
+static int
+perf_event_set_addr_filter(struct perf_event *event, char *filter_str)
+{
+	LIST_HEAD(filters);
+	int ret;
+
+	/*
+	 * Since this is called in perf_ioctl() path, we're already holding
+	 * ctx::mutex.
+	 */
+	lockdep_assert_held(&event->ctx->mutex);
+
+	if (WARN_ON_ONCE(event->parent))
+		return -EINVAL;
+
+	/*
+	 * For now, we only support filtering in per-task events; doing so
+	 * for CPU-wide events requires additional context switching trickery,
+	 * since same object code will be mapped at different virtual
+	 * addresses in different processes.
+	 */
+	if (!event->ctx->task)
+		return -EOPNOTSUPP;
+
+	ret = perf_event_parse_addr_filter(event, filter_str, &filters);
+	if (ret)
+		return ret;
+
+	ret = event->pmu->addr_filters_validate(&filters);
+	if (ret) {
+		free_filters_list(&filters);
+		return ret;
+	}
+
+	/* remove existing filters, if any */
+	perf_addr_filters_splice(event, &filters);
+
+	/* install new filters */
+	perf_event_for_each_child(event, perf_event_addr_filters_apply);
+
+	return ret;
+}
+
 static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 {
 	char *filter_str;
 	int ret = -EINVAL;
 
-	if (event->attr.type != PERF_TYPE_TRACEPOINT ||
-	    !IS_ENABLED(CONFIG_EVENT_TRACING))
+	if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
+	    !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
+	    !has_addr_filter(event))
 		return -EINVAL;
 
 	filter_str = strndup_user(arg, PAGE_SIZE);
@@ -7336,6 +7909,8 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 	    event->attr.type == PERF_TYPE_TRACEPOINT)
 		ret = ftrace_profile_set_filter(event, event->attr.config,
 						filter_str);
+	else if (has_addr_filter(event))
+		ret = perf_event_set_addr_filter(event, filter_str);
 
 	kfree(filter_str);
 	return ret;
@@ -8130,6 +8705,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->sibling_list);
 	INIT_LIST_HEAD(&event->rb_entry);
 	INIT_LIST_HEAD(&event->active_entry);
+	INIT_LIST_HEAD(&event->addr_filters.list);
 	INIT_HLIST_NODE(&event->hlist_entry);
 
 
@@ -8137,6 +8713,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	init_irq_work(&event->pending, perf_pending_event);
 
 	mutex_init(&event->mmap_mutex);
+	raw_spin_lock_init(&event->addr_filters.lock);
 
 	atomic_long_set(&event->refcount, 1);
 	event->cpu		= cpu;
@@ -8221,11 +8798,22 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (err)
 		goto err_pmu;
 
+	if (has_addr_filter(event)) {
+		event->addr_filters_offs = kcalloc(pmu->nr_addr_filters,
+						   sizeof(unsigned long),
+						   GFP_KERNEL);
+		if (!event->addr_filters_offs)
+			goto err_per_task;
+
+		/* force hw sync on the address filters */
+		event->addr_filters_gen = 1;
+	}
+
 	if (!event->parent) {
 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
 			err = get_callchain_buffers();
 			if (err)
-				goto err_per_task;
+				goto err_addr_filters;
 		}
 	}
 
@@ -8234,6 +8822,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	return event;
 
+err_addr_filters:
+	kfree(event->addr_filters_offs);
+
 err_per_task:
 	exclusive_event_destroy(event);
 

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [tip:perf/core] perf/x86/intel/pt: Add support for address range filtering in PT
  2016-04-27 15:44 ` [PATCH v2 6/7] perf/x86/intel/pt: Add support for address range filtering in PT Alexander Shishkin
@ 2016-05-05  9:47   ` tip-bot for Alexander Shishkin
  0 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Alexander Shishkin @ 2016-05-05  9:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, linux-kernel, torvalds, jolsa, acme, peterz, mingo,
	alexander.shishkin, mathieu.poirier, vincent.weaver, tglx, acme,
	eranian, bp

Commit-ID:  eadf48cab4b6b0ab8bcd53feb7d52a71e72debd0
Gitweb:     http://git.kernel.org/tip/eadf48cab4b6b0ab8bcd53feb7d52a71e72debd0
Author:     Alexander Shishkin <alexander.shishkin@linux.intel.com>
AuthorDate: Wed, 27 Apr 2016 18:44:47 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 10:13:58 +0200

perf/x86/intel/pt: Add support for address range filtering in PT

Newer versions of Intel PT support address ranges, which can be used to
define IP address range-based filters or TraceSTOP regions. Number of
ranges in enumerated via cpuid.

This patch implements PMU callbacks and related low-level code to allow
filter validation, configuration and programming into the hardware.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1461771888-10409-7-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/intel/pt.c | 179 ++++++++++++++++++++++++++++++++++++++++++---
 arch/x86/events/intel/pt.h |  26 +++++++
 2 files changed, 194 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index e5bfafe..2d1ce2c 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -265,6 +265,75 @@ static bool pt_event_valid(struct perf_event *event)
  * These all are cpu affine and operate on a local PT
  */
 
+/* Address ranges and their corresponding msr configuration registers */
+static const struct pt_address_range {
+	unsigned long	msr_a;
+	unsigned long	msr_b;
+	unsigned int	reg_off;
+} pt_address_ranges[] = {
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR0_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR0_B,
+		.reg_off = RTIT_CTL_ADDR0_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR1_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR1_B,
+		.reg_off = RTIT_CTL_ADDR1_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR2_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR2_B,
+		.reg_off = RTIT_CTL_ADDR2_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR3_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR3_B,
+		.reg_off = RTIT_CTL_ADDR3_OFFSET,
+	}
+};
+
+static u64 pt_config_filters(struct perf_event *event)
+{
+	struct pt_filters *filters = event->hw.addr_filters;
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	unsigned int range = 0;
+	u64 rtit_ctl = 0;
+
+	if (!filters)
+		return 0;
+
+	perf_event_addr_filters_sync(event);
+
+	for (range = 0; range < filters->nr_filters; range++) {
+		struct pt_filter *filter = &filters->filter[range];
+
+		/*
+		 * Note, if the range has zero start/end addresses due
+		 * to its dynamic object not being loaded yet, we just
+		 * go ahead and program zeroed range, which will simply
+		 * produce no data. Note^2: if executable code at 0x0
+		 * is a concern, we can set up an "invalid" configuration
+		 * such as msr_b < msr_a.
+		 */
+
+		/* avoid redundant msr writes */
+		if (pt->filters.filter[range].msr_a != filter->msr_a) {
+			wrmsrl(pt_address_ranges[range].msr_a, filter->msr_a);
+			pt->filters.filter[range].msr_a = filter->msr_a;
+		}
+
+		if (pt->filters.filter[range].msr_b != filter->msr_b) {
+			wrmsrl(pt_address_ranges[range].msr_b, filter->msr_b);
+			pt->filters.filter[range].msr_b = filter->msr_b;
+		}
+
+		rtit_ctl |= filter->config << pt_address_ranges[range].reg_off;
+	}
+
+	return rtit_ctl;
+}
+
 static void pt_config(struct perf_event *event)
 {
 	u64 reg;
@@ -274,7 +343,8 @@ static void pt_config(struct perf_event *event)
 		wrmsrl(MSR_IA32_RTIT_STATUS, 0);
 	}
 
-	reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
+	reg = pt_config_filters(event);
+	reg |= RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
 
 	if (!event->attr.exclude_kernel)
 		reg |= RTIT_CTL_OS;
@@ -921,6 +991,82 @@ static void pt_buffer_free_aux(void *data)
 	kfree(buf);
 }
 
+static int pt_addr_filters_init(struct perf_event *event)
+{
+	struct pt_filters *filters;
+	int node = event->cpu == -1 ? -1 : cpu_to_node(event->cpu);
+
+	if (!pt_cap_get(PT_CAP_num_address_ranges))
+		return 0;
+
+	filters = kzalloc_node(sizeof(struct pt_filters), GFP_KERNEL, node);
+	if (!filters)
+		return -ENOMEM;
+
+	if (event->parent)
+		memcpy(filters, event->parent->hw.addr_filters,
+		       sizeof(*filters));
+
+	event->hw.addr_filters = filters;
+
+	return 0;
+}
+
+static void pt_addr_filters_fini(struct perf_event *event)
+{
+	kfree(event->hw.addr_filters);
+	event->hw.addr_filters = NULL;
+}
+
+static int pt_event_addr_filters_validate(struct list_head *filters)
+{
+	struct perf_addr_filter *filter;
+	int range = 0;
+
+	list_for_each_entry(filter, filters, entry) {
+		/* PT doesn't support single address triggers */
+		if (!filter->range)
+			return -EOPNOTSUPP;
+
+		if (!filter->inode && !kernel_ip(filter->offset))
+			return -EINVAL;
+
+		if (++range > pt_cap_get(PT_CAP_num_address_ranges))
+			return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static void pt_event_addr_filters_sync(struct perf_event *event)
+{
+	struct perf_addr_filters_head *head = perf_event_addr_filters(event);
+	unsigned long msr_a, msr_b, *offs = event->addr_filters_offs;
+	struct pt_filters *filters = event->hw.addr_filters;
+	struct perf_addr_filter *filter;
+	int range = 0;
+
+	if (!filters)
+		return;
+
+	list_for_each_entry(filter, &head->list, entry) {
+		if (filter->inode && !offs[range]) {
+			msr_a = msr_b = 0;
+		} else {
+			/* apply the offset */
+			msr_a = filter->offset + offs[range];
+			msr_b = filter->size + msr_a;
+		}
+
+		filters->filter[range].msr_a  = msr_a;
+		filters->filter[range].msr_b  = msr_b;
+		filters->filter[range].config = filter->filter ? 1 : 2;
+		range++;
+	}
+
+	filters->nr_filters = range;
+}
+
 /**
  * intel_pt_interrupt() - PT PMI handler
  */
@@ -1128,6 +1274,7 @@ static void pt_event_read(struct perf_event *event)
 
 static void pt_event_destroy(struct perf_event *event)
 {
+	pt_addr_filters_fini(event);
 	x86_del_exclusive(x86_lbr_exclusive_pt);
 }
 
@@ -1142,6 +1289,11 @@ static int pt_event_init(struct perf_event *event)
 	if (x86_add_exclusive(x86_lbr_exclusive_pt))
 		return -EBUSY;
 
+	if (pt_addr_filters_init(event)) {
+		x86_del_exclusive(x86_lbr_exclusive_pt);
+		return -ENOMEM;
+	}
+
 	event->destroy = pt_event_destroy;
 
 	return 0;
@@ -1195,16 +1347,21 @@ static __init int pt_init(void)
 			PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF;
 
 	pt_pmu.pmu.capabilities	|= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE;
-	pt_pmu.pmu.attr_groups	= pt_attr_groups;
-	pt_pmu.pmu.task_ctx_nr	= perf_sw_context;
-	pt_pmu.pmu.event_init	= pt_event_init;
-	pt_pmu.pmu.add		= pt_event_add;
-	pt_pmu.pmu.del		= pt_event_del;
-	pt_pmu.pmu.start	= pt_event_start;
-	pt_pmu.pmu.stop		= pt_event_stop;
-	pt_pmu.pmu.read		= pt_event_read;
-	pt_pmu.pmu.setup_aux	= pt_buffer_setup_aux;
-	pt_pmu.pmu.free_aux	= pt_buffer_free_aux;
+	pt_pmu.pmu.attr_groups		 = pt_attr_groups;
+	pt_pmu.pmu.task_ctx_nr		 = perf_sw_context;
+	pt_pmu.pmu.event_init		 = pt_event_init;
+	pt_pmu.pmu.add			 = pt_event_add;
+	pt_pmu.pmu.del			 = pt_event_del;
+	pt_pmu.pmu.start		 = pt_event_start;
+	pt_pmu.pmu.stop			 = pt_event_stop;
+	pt_pmu.pmu.read			 = pt_event_read;
+	pt_pmu.pmu.setup_aux		 = pt_buffer_setup_aux;
+	pt_pmu.pmu.free_aux		 = pt_buffer_free_aux;
+	pt_pmu.pmu.addr_filters_sync     = pt_event_addr_filters_sync;
+	pt_pmu.pmu.addr_filters_validate = pt_event_addr_filters_validate;
+	pt_pmu.pmu.nr_addr_filters       =
+		pt_cap_get(PT_CAP_num_address_ranges);
+
 	ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
 
 	return ret;
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index 0ed9000..ca64599 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -140,14 +140,40 @@ struct pt_buffer {
 	struct topa_entry	*topa_index[0];
 };
 
+#define PT_FILTERS_NUM	4
+
+/**
+ * struct pt_filter - IP range filter configuration
+ * @msr_a:	range start, goes to RTIT_ADDRn_A
+ * @msr_b:	range end, goes to RTIT_ADDRn_B
+ * @config:	4-bit field in RTIT_CTL
+ */
+struct pt_filter {
+	unsigned long	msr_a;
+	unsigned long	msr_b;
+	unsigned long	config;
+};
+
+/**
+ * struct pt_filters - IP range filtering context
+ * @filter:	filters defined for this context
+ * @nr_filters:	number of defined filters in the @filter array
+ */
+struct pt_filters {
+	struct pt_filter	filter[PT_FILTERS_NUM];
+	unsigned int		nr_filters;
+};
+
 /**
  * struct pt - per-cpu pt context
  * @handle:	perf output handle
+ * @filters:		last configured filters
  * @handle_nmi:	do handle PT PMI on this cpu, there's an active event
  * @vmx_on:	1 if VMX is ON on this cpu
  */
 struct pt {
 	struct perf_output_handle handle;
+	struct pt_filters	filters;
 	int			handle_nmi;
 	int			vmx_on;
 };

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [tip:perf/core] perf/core: Let userspace know if the PMU supports address filters
  2016-04-27 15:44 ` [PATCH v2 7/7] perf: Let userspace know if pmu supports address filters Alexander Shishkin
@ 2016-05-05  9:47   ` tip-bot for Alexander Shishkin
  0 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Alexander Shishkin @ 2016-05-05  9:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, eranian, vincent.weaver, tglx, bp, linux-kernel,
	mathieu.poirier, hpa, peterz, acme, jolsa, alexander.shishkin,
	mingo, acme

Commit-ID:  6e855cd4f4b5258016cf707f94f96bfa51c32f32
Gitweb:     http://git.kernel.org/tip/6e855cd4f4b5258016cf707f94f96bfa51c32f32
Author:     Alexander Shishkin <alexander.shishkin@linux.intel.com>
AuthorDate: Wed, 27 Apr 2016 18:44:48 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 May 2016 10:13:58 +0200

perf/core: Let userspace know if the PMU supports address filters

Export an additional common attribute for PMUs that support address range
filtering to let the perf userspace identify such PMUs in a uniform way.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1461771888-10409-8-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index ffdc096..63be654 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8273,6 +8273,20 @@ static void free_pmu_context(struct pmu *pmu)
 out:
 	mutex_unlock(&pmus_lock);
 }
+
+/*
+ * Let userspace know that this PMU supports address range filtering:
+ */
+static ssize_t nr_addr_filters_show(struct device *dev,
+				    struct device_attribute *attr,
+				    char *page)
+{
+	struct pmu *pmu = dev_get_drvdata(dev);
+
+	return snprintf(page, PAGE_SIZE - 1, "%d\n", pmu->nr_addr_filters);
+}
+DEVICE_ATTR_RO(nr_addr_filters);
+
 static struct idr pmu_idr;
 
 static ssize_t
@@ -8374,9 +8388,19 @@ static int pmu_dev_alloc(struct pmu *pmu)
 	if (ret)
 		goto free_dev;
 
+	/* For PMUs with address filters, throw in an extra attribute: */
+	if (pmu->nr_addr_filters)
+		ret = device_create_file(pmu->dev, &dev_attr_nr_addr_filters);
+
+	if (ret)
+		goto del_dev;
+
 out:
 	return ret;
 
+del_dev:
+	device_del(pmu->dev);
+
 free_dev:
 	put_device(pmu->dev);
 	goto out;
@@ -8512,6 +8536,8 @@ void perf_pmu_unregister(struct pmu *pmu)
 	free_percpu(pmu->pmu_disable_count);
 	if (pmu->type >= PERF_TYPE_MAX)
 		idr_remove(&pmu_idr, pmu->type);
+	if (pmu->nr_addr_filters)
+		device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
 	device_del(pmu->dev);
 	put_device(pmu->dev);
 	free_pmu_context(pmu);

^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-05-05  9:48 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-27 15:44 [PATCH v2 0/7] perf: Introduce address range filtering Alexander Shishkin
2016-04-27 15:44 ` [PATCH v2 1/7] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
2016-05-05  9:45   ` [tip:perf/core] perf/core: Move set_filter() out of CONFIG_EVENT_TRACING tip-bot for Alexander Shishkin
2016-04-27 15:44 ` [PATCH v2 2/7] perf/x86/intel/pt: Move MSR bit definitions to a private header Alexander Shishkin
2016-05-05  9:45   ` [tip:perf/core] perf/x86/intel/pt: Move PT specific " tip-bot for Alexander Shishkin
2016-04-27 15:44 ` [PATCH v2 3/7] perf/x86/intel/pt: IP filtering register/cpuid bits Alexander Shishkin
2016-05-05  9:46   ` [tip:perf/core] perf/x86/intel/pt: Add IP filtering register/CPUID bits tip-bot for Alexander Shishkin
2016-04-27 15:44 ` [PATCH v2 4/7] perf: Extend perf_event_aux_ctx() to optionally iterate through more events Alexander Shishkin
2016-05-05  9:46   ` [tip:perf/core] perf/core: " tip-bot for Alexander Shishkin
2016-04-27 15:44 ` [PATCH v2 5/7] perf: Introduce address range filtering Alexander Shishkin
2016-04-28 17:09   ` Peter Zijlstra
2016-04-29 18:12   ` Mathieu Poirier
2016-04-30  4:59     ` Alexander Shishkin
2016-05-02 14:31       ` Mathieu Poirier
2016-05-03  8:56   ` Peter Zijlstra
2016-05-05  9:46   ` [tip:perf/core] perf/core: " tip-bot for Alexander Shishkin
2016-04-27 15:44 ` [PATCH v2 6/7] perf/x86/intel/pt: Add support for address range filtering in PT Alexander Shishkin
2016-05-05  9:47   ` [tip:perf/core] " tip-bot for Alexander Shishkin
2016-04-27 15:44 ` [PATCH v2 7/7] perf: Let userspace know if pmu supports address filters Alexander Shishkin
2016-05-05  9:47   ` [tip:perf/core] perf/core: Let userspace know if the PMU " tip-bot for Alexander Shishkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).