All of lore.kernel.org
* [PATCH v1 0/5] perf: Introduce address range filtering
@ 2016-04-21 15:16 Alexander Shishkin
  2016-04-21 15:16 ` [PATCH v1 1/5] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-21 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

Hi Peter,

This is the second round of my filtering patchset. I've changed quite
a few things since the previous version [1], notably:

 * split the pmu callback in two as we discussed,
 * split the filter itself into 'core' and 'hw' parts,
 * made only parent events eligible for configuring filters (children
   can still have their 'hw' portions),
 * and basically re-wrote the whole thing around that.

I've also dropped the awkward 'itrace' infix, which was just
confusing, and rebranded the feature as "address range filtering",
which better represents what it actually is. I also figured that its
usefulness may not be limited to instruction tracing (or even
hardware) pmus.

Newer versions of Intel PT support address-based filtering, and this
patchset adds support for it to the perf core and the PT pmu driver.
It works by configuring a number of address ranges in hardware and
telling it to use these ranges to filter its traces. A similar feature
also exists in ARM CoreSight ETM/PTM, and it is also taken into
account in this patchset.

Userspace configures filters via an ioctl(); filters are formatted as
an ASCII string. Filters may refer to addresses in object files for
userspace code or to kernel addresses. The latter might be extended
in the future to support kernel modules.
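For illustration, the string format that patch 4 defines ("ACTION SOURCE:RANGE_SPEC", e.g. "filter file:0x1000/0x2000@/bin/ls") can be mirrored by a userspace parser sketch like the one below. All helper names here are made up for the example; the kernel side uses match_token() and this is not the patch's code:

```c
#include <stdio.h>

/*
 * Illustrative userspace mirror of the filter string format from
 * patch 4: ACTION SOURCE:RANGE_SPEC, where ACTION is "filter",
 * "start" or "stop" and SOURCE is "file" or "kernel". Hypothetical
 * helper, not kernel code.
 */
struct demo_filter {
	char		action[8];
	unsigned long	offset;
	unsigned long	size;		/* 0: single address */
	char		path[256];	/* empty for kernel filters */
};

int demo_parse_filter(const char *str, struct demo_filter *f)
{
	f->size = 0;
	f->path[0] = '\0';

	/* file-based range: "filter file:0x1000/0x2000@/bin/ls" */
	if (sscanf(str, "%7s file:%lx/%lx@%255s",
		   f->action, &f->offset, &f->size, f->path) == 4)
		return 0;
	/* kernel range: "stop kernel:0xffff1000/0x100" */
	if (sscanf(str, "%7s kernel:%lx/%lx",
		   f->action, &f->offset, &f->size) == 3)
		return 0;
	/* single-address forms */
	if (sscanf(str, "%7s file:%lx@%255s",
		   f->action, &f->offset, f->path) == 3)
		return 0;
	if (sscanf(str, "%7s kernel:%lx", f->action, &f->offset) == 2)
		return 0;

	return -1;
}
```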

For userspace filters, we scan the task's vmas to see if any of them
match the defined filters (inode+offset) and, if they do, calculate
memory offsets and program them into hardware. Note that since
different tasks will have different mappings for the same object
files, supporting cpu-wide events would require special tricks to
context-switch filters for userspace code.
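To make the inode+offset matching concrete, here is a userspace sketch of the overlap test (mirroring perf_addr_filter_match() from patch 4); the struct and names are simplified stand-ins, not kernel types:

```c
/*
 * Userspace sketch of the matching described above: a filter, given
 * as an (inode, offset, size) triple within an object file, matches
 * a vma if both refer to the same file and their file ranges
 * overlap. Simplified stand-in types, not kernel code.
 */
struct demo_vma {
	unsigned long	ino;		/* inode of the mapped file */
	unsigned long	file_off;	/* vm_pgoff << PAGE_SHIFT */
	unsigned long	start;		/* vm_start */
	unsigned long	len;		/* vm_end - vm_start */
};

int demo_filter_match(unsigned long f_ino, unsigned long f_off,
		      unsigned long f_size, const struct demo_vma *vma)
{
	if (f_ino != vma->ino)
		return 0;
	if (f_off > vma->file_off + vma->len)
		return 0;
	if (f_off + f_size < vma->file_off)
		return 0;
	return 1;
}

/* memory address that would eventually land in hardware */
unsigned long demo_filter_addr(unsigned long f_off,
			       const struct demo_vma *vma)
{
	return vma->start + (f_off - vma->file_off);
}
```

In the patchset itself only vma->vm_start is handed to the pmu (via the offs[] array) and the driver combines it with the filter's offset; the arithmetic above just shows the end result.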

Also, we monitor new mmap and exec events to update (or clear) filter
configuration.

[1] http://marc.info/?l=linux-kernel&m=144984116019143

Alexander Shishkin (5):
  perf: Move set_filter() from behind EVENT_TRACING
  perf/x86/intel/pt: IP filtering register/cpuid bits
  perf: Extend perf_event_aux_ctx() to optionally iterate through more
    events
  perf: Introduce address range filtering
  perf/x86/intel/pt: Add support for address range filtering in PT

 arch/x86/events/intel/pt.c       | 185 ++++++++++-
 arch/x86/events/intel/pt.h       |  30 ++
 arch/x86/include/asm/msr-index.h |  18 ++
 include/linux/perf_event.h       |  95 ++++++
 kernel/events/core.c             | 684 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 969 insertions(+), 43 deletions(-)

-- 
2.8.0.rc3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v1 1/5] perf: Move set_filter() from behind EVENT_TRACING
  2016-04-21 15:16 [PATCH v1 0/5] perf: Introduce address range filtering Alexander Shishkin
@ 2016-04-21 15:16 ` Alexander Shishkin
  2016-04-21 15:17 ` [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits Alexander Shishkin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-21 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

For instruction trace filtering, namely for communicating filter
definitions from userspace, I'd like to re-use the SET_FILTER code
that the tracepoints currently use.

To that end, this patch moves the relevant code out from behind the
EVENT_TRACING macro.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 45 ++++++++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index e66e190717..c73fd59811 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7273,24 +7273,6 @@ static inline void perf_tp_register(void)
 	perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
 }
 
-static int perf_event_set_filter(struct perf_event *event, void __user *arg)
-{
-	char *filter_str;
-	int ret;
-
-	if (event->attr.type != PERF_TYPE_TRACEPOINT)
-		return -EINVAL;
-
-	filter_str = strndup_user(arg, PAGE_SIZE);
-	if (IS_ERR(filter_str))
-		return PTR_ERR(filter_str);
-
-	ret = ftrace_profile_set_filter(event, event->attr.config, filter_str);
-
-	kfree(filter_str);
-	return ret;
-}
-
 static void perf_event_free_filter(struct perf_event *event)
 {
 	ftrace_profile_free_filter(event);
@@ -7345,11 +7327,6 @@ static inline void perf_tp_register(void)
 {
 }
 
-static int perf_event_set_filter(struct perf_event *event, void __user *arg)
-{
-	return -ENOENT;
-}
-
 static void perf_event_free_filter(struct perf_event *event)
 {
 }
@@ -7377,6 +7354,28 @@ void perf_bp_event(struct perf_event *bp, void *data)
 }
 #endif
 
+static int perf_event_set_filter(struct perf_event *event, void __user *arg)
+{
+	char *filter_str;
+	int ret = -EINVAL;
+
+	if (event->attr.type != PERF_TYPE_TRACEPOINT ||
+	    !IS_ENABLED(CONFIG_EVENT_TRACING))
+		return -EINVAL;
+
+	filter_str = strndup_user(arg, PAGE_SIZE);
+	if (IS_ERR(filter_str))
+		return PTR_ERR(filter_str);
+
+	if (IS_ENABLED(CONFIG_EVENT_TRACING) &&
+	    event->attr.type == PERF_TYPE_TRACEPOINT)
+		ret = ftrace_profile_set_filter(event, event->attr.config,
+						filter_str);
+
+	kfree(filter_str);
+	return ret;
+}
+
 /*
  * hrtimer based swevent callback
  */
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits
  2016-04-21 15:16 [PATCH v1 0/5] perf: Introduce address range filtering Alexander Shishkin
  2016-04-21 15:16 ` [PATCH v1 1/5] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
@ 2016-04-21 15:17 ` Alexander Shishkin
  2016-04-21 17:48   ` Borislav Petkov
  2016-04-21 15:17 ` [PATCH v1 3/5] perf: Extend perf_event_aux_ctx() to optionally iterate through more events Alexander Shishkin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-21 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

New versions of Intel PT support address range-based filtering. These
are the registers, bit definitions and relevant CPUID bits.
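For reference, each PT_CAP() entry below stores a cpuid subleaf, a register and a bit mask, and the driver extracts the capability value by masking the register and shifting the field down to the mask's lowest set bit (the kernel uses __ffs(mask) for that). A minimal userspace sketch of the extraction, illustrative only:

```c
/*
 * Sketch of how a PT_CAP() entry's mask is consumed: mask the cpuid
 * register value, then shift the field down to bit 0. Illustrative
 * helper, not the driver's code.
 */
unsigned int demo_pt_cap_get(unsigned int reg, unsigned int mask)
{
	if (!mask)
		return 0;

	while (!(mask & 1)) {
		mask >>= 1;
		reg >>= 1;
	}
	return reg & mask;
}
```

E.g. num_address_ranges (mask 0x3 in CPUID.0x14, subleaf 1, EAX) would come out as 2 on hardware advertising two address range pairs.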

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/events/intel/pt.c       |  2 ++
 arch/x86/events/intel/pt.h       |  2 ++
 arch/x86/include/asm/msr-index.h | 18 ++++++++++++++++++
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 127f58c179..891447dd61 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -67,11 +67,13 @@ static struct pt_cap_desc {
 	PT_CAP(max_subleaf,		0, CR_EAX, 0xffffffff),
 	PT_CAP(cr3_filtering,		0, CR_EBX, BIT(0)),
 	PT_CAP(psb_cyc,			0, CR_EBX, BIT(1)),
+	PT_CAP(ip_filtering,		0, CR_EBX, BIT(2)),
 	PT_CAP(mtc,			0, CR_EBX, BIT(3)),
 	PT_CAP(topa_output,		0, CR_ECX, BIT(0)),
 	PT_CAP(topa_multiple_entries,	0, CR_ECX, BIT(1)),
 	PT_CAP(single_range_output,	0, CR_ECX, BIT(2)),
 	PT_CAP(payloads_lip,		0, CR_ECX, BIT(31)),
+	PT_CAP(num_address_ranges,	1, CR_EAX, 0x3),
 	PT_CAP(mtc_periods,		1, CR_EAX, 0xffff0000),
 	PT_CAP(cycle_thresholds,	1, CR_EBX, 0xffff),
 	PT_CAP(psb_periods,		1, CR_EBX, 0xffff0000),
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index 336878a5d2..6ce8cd20b9 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -52,11 +52,13 @@ enum pt_capabilities {
 	PT_CAP_max_subleaf = 0,
 	PT_CAP_cr3_filtering,
 	PT_CAP_psb_cyc,
+	PT_CAP_ip_filtering,
 	PT_CAP_mtc,
 	PT_CAP_topa_output,
 	PT_CAP_topa_multiple_entries,
 	PT_CAP_single_range_output,
 	PT_CAP_payloads_lip,
+	PT_CAP_num_address_ranges,
 	PT_CAP_mtc_periods,
 	PT_CAP_cycle_thresholds,
 	PT_CAP_psb_periods,
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e0e2f7dfbd..964d7e17a6 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -105,11 +105,29 @@
 #define RTIT_CTL_CYC_THRESH		(0x0full << RTIT_CTL_CYC_THRESH_OFFSET)
 #define RTIT_CTL_PSB_FREQ_OFFSET	24
 #define RTIT_CTL_PSB_FREQ      		(0x0full << RTIT_CTL_PSB_FREQ_OFFSET)
+#define RTIT_CTL_ADDR0_OFFSET		32
+#define RTIT_CTL_ADDR0      		(0x0full << RTIT_CTL_ADDR0_OFFSET)
+#define RTIT_CTL_ADDR1_OFFSET		36
+#define RTIT_CTL_ADDR1      		(0x0full << RTIT_CTL_ADDR1_OFFSET)
+#define RTIT_CTL_ADDR2_OFFSET		40
+#define RTIT_CTL_ADDR2      		(0x0full << RTIT_CTL_ADDR2_OFFSET)
+#define RTIT_CTL_ADDR3_OFFSET		44
+#define RTIT_CTL_ADDR3      		(0x0full << RTIT_CTL_ADDR3_OFFSET)
 #define MSR_IA32_RTIT_STATUS		0x00000571
+#define RTIT_STATUS_FILTEREN		BIT(0)
 #define RTIT_STATUS_CONTEXTEN		BIT(1)
 #define RTIT_STATUS_TRIGGEREN		BIT(2)
+#define RTIT_STATUS_BUFFOVF		BIT(3)
 #define RTIT_STATUS_ERROR		BIT(4)
 #define RTIT_STATUS_STOPPED		BIT(5)
+#define MSR_IA32_RTIT_ADDR0_A		0x00000580
+#define MSR_IA32_RTIT_ADDR0_B		0x00000581
+#define MSR_IA32_RTIT_ADDR1_A		0x00000582
+#define MSR_IA32_RTIT_ADDR1_B		0x00000583
+#define MSR_IA32_RTIT_ADDR2_A		0x00000584
+#define MSR_IA32_RTIT_ADDR2_B		0x00000585
+#define MSR_IA32_RTIT_ADDR3_A		0x00000586
+#define MSR_IA32_RTIT_ADDR3_B		0x00000587
 #define MSR_IA32_RTIT_CR3_MATCH		0x00000572
 #define MSR_IA32_RTIT_OUTPUT_BASE	0x00000560
 #define MSR_IA32_RTIT_OUTPUT_MASK	0x00000561
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v1 3/5] perf: Extend perf_event_aux_ctx() to optionally iterate through more events
  2016-04-21 15:16 [PATCH v1 0/5] perf: Introduce address range filtering Alexander Shishkin
  2016-04-21 15:16 ` [PATCH v1 1/5] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
  2016-04-21 15:17 ` [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits Alexander Shishkin
@ 2016-04-21 15:17 ` Alexander Shishkin
  2016-04-21 15:17 ` [PATCH v1 4/5] perf: Introduce address range filtering Alexander Shishkin
  2016-04-21 15:17 ` [PATCH v1 5/5] perf/x86/intel/pt: Add support for address range filtering in PT Alexander Shishkin
  4 siblings, 0 replies; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-21 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

Trace filtering code needs an iterator that can go through all events
in a context, including inactive and filtered-out ones, so that it can
update their filters' address ranges based on mmap or exec events.

This patch changes perf_event_aux_ctx() to optionally do this.
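As a minimal model of the change (a plain singly linked list stands in for the kernel's RCU-protected event list; all names here are made up for the sketch):

```c
#include <stddef.h>

/*
 * Minimal model of the perf_event_aux_ctx() change: the new 'all'
 * argument bypasses the state check so that inactive events are
 * visited too. Illustrative stand-ins, not kernel structures.
 */
enum demo_state {
	DEMO_STATE_OFF		= -1,
	DEMO_STATE_INACTIVE	= 0,
	DEMO_STATE_ACTIVE	= 1,
};

struct demo_event {
	int			state;
	struct demo_event	*next;
};

typedef void (*demo_output_cb)(struct demo_event *ev, void *data);

void demo_aux_ctx(struct demo_event *list, demo_output_cb output,
		  void *data, int all)
{
	struct demo_event *ev;

	for (ev = list; ev; ev = ev->next) {
		/* old behaviour: skip events below the INACTIVE threshold */
		if (!all && ev->state < DEMO_STATE_INACTIVE)
			continue;
		output(ev, data);
	}
}

/* counting callback for demonstration */
void demo_count_cb(struct demo_event *ev, void *data)
{
	(void)ev;
	++*(int *)data;
}
```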

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c73fd59811..9260a8a073 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5812,15 +5812,18 @@ typedef void (perf_event_aux_output_cb)(struct perf_event *event, void *data);
 static void
 perf_event_aux_ctx(struct perf_event_context *ctx,
 		   perf_event_aux_output_cb output,
-		   void *data)
+		   void *data, bool all)
 {
 	struct perf_event *event;
 
 	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
-		if (event->state < PERF_EVENT_STATE_INACTIVE)
-			continue;
-		if (!event_filter_match(event))
-			continue;
+		if (!all) {
+			if (event->state < PERF_EVENT_STATE_INACTIVE)
+				continue;
+			if (!event_filter_match(event))
+				continue;
+		}
+
 		output(event, data);
 	}
 }
@@ -5831,7 +5834,7 @@ perf_event_aux_task_ctx(perf_event_aux_output_cb output, void *data,
 {
 	rcu_read_lock();
 	preempt_disable();
-	perf_event_aux_ctx(task_ctx, output, data);
+	perf_event_aux_ctx(task_ctx, output, data, false);
 	preempt_enable();
 	rcu_read_unlock();
 }
@@ -5875,7 +5878,7 @@ perf_event_aux(perf_event_aux_output_cb output, void *data,
 	for_each_task_context_nr(ctxn) {
 		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
 		if (ctx)
-			perf_event_aux_ctx(ctx, output, data);
+			perf_event_aux_ctx(ctx, output, data, false);
 	}
 	preempt_enable();
 	rcu_read_unlock();
@@ -5916,10 +5919,10 @@ static int __perf_pmu_output_stop(void *info)
 	};
 
 	rcu_read_lock();
-	perf_event_aux_ctx(&cpuctx->ctx, __perf_event_output_stop, &ro);
+	perf_event_aux_ctx(&cpuctx->ctx, __perf_event_output_stop, &ro, false);
 	if (cpuctx->task_ctx)
 		perf_event_aux_ctx(cpuctx->task_ctx, __perf_event_output_stop,
-				   &ro);
+				   &ro, false);
 	rcu_read_unlock();
 
 	return ro.err;
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-21 15:16 [PATCH v1 0/5] perf: Introduce address range filtering Alexander Shishkin
                   ` (2 preceding siblings ...)
  2016-04-21 15:17 ` [PATCH v1 3/5] perf: Extend perf_event_aux_ctx() to optionally iterate through more events Alexander Shishkin
@ 2016-04-21 15:17 ` Alexander Shishkin
  2016-04-22  7:45   ` Peter Zijlstra
  2016-04-21 15:17 ` [PATCH v1 5/5] perf/x86/intel/pt: Add support for address range filtering in PT Alexander Shishkin
  4 siblings, 1 reply; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-21 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

Many instruction trace pmus support address range-based filtering:
generating trace data only for a given range of instruction
addresses, which is useful for tracing individual functions, modules
or libraries. Other pmus may also use this functionality to filter
in, or filter out, code at certain address ranges.

This patch introduces the interface for userspace to specify these
filters and for the pmu drivers to apply these filters to hardware
configuration.

The user interface is an ASCII string that is passed via an ioctl
and specifies (in a way similar to uprobe) address ranges within
certain object files or within the kernel. There is no special
treatment for kernel modules yet, but it might be a worthy pursuit.

The pmu driver interface adds two callbacks to the pmu structure:
one that validates the filter configuration proposed by the user
against what the hardware is actually capable of doing, and one that
translates it into something that pmu::start can program into
hardware.
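To illustrate the division of labour, here is a sketch of what a driver's validate callback might check, for a hypothetical pmu that supports only range filters and a fixed number of them (like PT's num_address_ranges); the list and error handling are simplified, and the real PT implementation arrives in patch 5:

```c
/*
 * Sketch of an addr_filters_validate-style check for a hypothetical
 * pmu: reject single-address filters and any set larger than what
 * the hardware offers. Simplified list instead of list_head.
 */
struct demo_addr_filter {
	int			range;	/* 1: range, 0: single address */
	struct demo_addr_filter	*next;
};

#define DEMO_EOPNOTSUPP	95

int demo_addr_filters_validate(const struct demo_addr_filter *filters,
			       unsigned int hw_nr_ranges)
{
	const struct demo_addr_filter *f;
	unsigned int count = 0;

	for (f = filters; f; f = f->next) {
		if (!f->range)		/* hw can't do single addresses */
			return -DEMO_EOPNOTSUPP;
		if (++count > hw_nr_ranges)
			return -DEMO_EOPNOTSUPP;
	}
	return 0;
}
```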

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h |  95 +++++++
 kernel/events/core.c       | 622 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 705 insertions(+), 12 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b717902c99..f37c56e3fd 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -151,6 +151,18 @@ struct hw_perf_event {
 	 */
 	struct task_struct		*target;
 
+	/*
+	 * PMU would store hardware filter configuration
+	 * here.
+	 */
+	void				*addr_filters;
+
+	/*
+	 * Array of vma offsets for file-based filters,
+	 * allocated by the pmu.
+	 */
+	unsigned long			*addr_filters_offs;
+
 /*
  * hw_perf_event::state flags; used to track the PERF_EF_* state.
  */
@@ -393,12 +405,72 @@ struct pmu {
 	void (*free_aux)		(void *aux); /* optional */
 
 	/*
+	 * Validate address range filters: make sure hw supports the
+	 * requested configuration and number of filters; return 0 if the
+	 * supplied filters are valid, -errno otherwise.
+	 */
+	int (*addr_filters_validate)	(struct list_head *filters);
+
+	/*
+	 * Configure address range filters:
+	 * translate hw-agnostic filter into hardware configuration in
+	 * event::hw::addr_filters.
+	 * @offs:	array of vma load addresses for file-based filters or
+	 *              NULL: clear previously known file-based filters;
+	 *		ignore for kernel-based filters;
+	 * @flags:	PERF_EF_RELOAD: restart the event, will be set if the
+	 *		event is ACTIVE.
+	 * The @offs array should be one element per filter on the list or
+	 * NULL. Special case @offs[x]==0 means "keep previously known
+	 * configuration for this filter". See __perf_addr_filters_adjust().
+	 */
+	void (*addr_filters_setup)	(struct perf_event *event,
+					 unsigned long *offs, int flags);
+					/* optional */
+
+	/*
 	 * Filter events for PMU-specific reasons.
 	 */
 	int (*filter_match)		(struct perf_event *event); /* optional */
 };
 
 /**
+ * struct perf_addr_filter - address range filter definition
+ * @entry:	event's filter list linkage
+ * @inode:	object file's inode for file-based filters
+ * @offset:	filter range offset
+ * @size:	filter range size
+ * @range:	1: range, 0: address
+ * @filter:	1: filter/start, 0: stop
+ * @kernel:	1: kernel, 0: file-based
+ *
+ * This is a hardware-agnostic filter configuration as specified by the user.
+ */
+struct perf_addr_filter {
+	struct list_head	entry;
+	struct inode		*inode;
+	unsigned long		offset;
+	unsigned long		size;
+	unsigned int		range	: 1,
+				filter	: 1,
+				kernel	: 1;
+};
+
+/**
+ * struct perf_addr_filters_head - container for address range filters
+ * @list:	list of filters for this event
+ * @lock:	spinlock that serializes accesses to the @list and event's
+ *		(and its children's) filter generations.
+ *
+ * A child event will use parent's @list (and therefore @lock), so they are
+ * bundled together; see perf_event_addr_filters().
+ */
+struct perf_addr_filters_head {
+	struct list_head	list;
+	raw_spinlock_t		lock;
+};
+
+/**
  * enum perf_event_active_state - the states of a event
  */
 enum perf_event_active_state {
@@ -571,6 +643,10 @@ struct perf_event {
 
 	atomic_t			event_limit;
 
+	/* address range filters */
+	struct perf_addr_filters_head	addr_filters;
+	unsigned long			addr_filters_gen;
+
 	void (*destroy)(struct perf_event *);
 	struct rcu_head			rcu_head;
 
@@ -685,6 +761,20 @@ struct perf_output_handle {
 	int				page;
 };
 
+/*
+ * An inherited event uses parent's filters
+ */
+static inline struct perf_addr_filters_head *
+perf_event_addr_filters(struct perf_event *event)
+{
+	struct perf_addr_filters_head *ifh = &event->addr_filters;
+
+	if (event->parent)
+		ifh = &event->parent->addr_filters;
+
+	return ifh;
+}
+
 #ifdef CONFIG_CGROUP_PERF
 
 /*
@@ -1063,6 +1153,11 @@ static inline bool is_write_backward(struct perf_event *event)
 	return !!event->attr.write_backward;
 }
 
+static inline bool has_addr_filter(struct perf_event *event)
+{
+	return event->pmu->addr_filters_setup;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern int perf_output_begin_forward(struct perf_output_handle *handle,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9260a8a073..495c801e9b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -44,6 +44,8 @@
 #include <linux/compat.h>
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/namei.h>
+#include <linux/parser.h>
 
 #include "internal.h"
 
@@ -2387,6 +2389,102 @@ static int __perf_event_stop(void *info)
 	return 0;
 }
 
+/*
+ * (Re-)programming address filter configuration into the hardware is a racy
+ * process: there is always a window between computing offset addresses and
+ * running the pmu callback on the target cpu.
+ *
+ * perf_event_addr_filters_setup() is called to update hw event configuration
+ * with the new mmap addresses. This is done in the following cases:
+ * (1) perf_addr_filters_apply(): adjusting filters' offsets based on
+ *     pre-existing mappings, called once when new filters arrive via SET_FILTER
+ *     ioctl;
+ * (2) perf_addr_filters_adjust(): adjusting filters' offsets based on newly
+ *     registered mapping, called for every new mmap(), with mm::mmap_sem down
+ *     for reading;
+ * (3) perf_itrace_exec(): clearing filters' offsets in the process of exec.
+ */
+struct addr_filter_setup_data {
+	struct perf_event	*event;
+	unsigned long		*offs;
+	unsigned long		gen;
+};
+
+static int __perf_event_addr_filters_setup(void *info)
+{
+	struct addr_filter_setup_data *id = info;
+	struct perf_event *event = id->event;
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	unsigned long flags;
+
+	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
+		return -EAGAIN;
+
+	/* matches smp_wmb() in event_sched_in() */
+	smp_rmb();
+
+	/*
+	 * There is a window with interrupts enabled before we get here,
+	 * so we need to check again lest we try to stop another cpu's event.
+	 */
+	if (READ_ONCE(event->oncpu) != smp_processor_id())
+		return -EAGAIN;
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	/*
+	 * In case of generations' mismatch, we don't have to do anything for
+	 * this instance anymore; there will be another one with the *right* gen.
+	 * If called to clear filters, always let it through.
+	 */
+	if (id->gen == event->addr_filters_gen || !id->offs)
+		event->pmu->addr_filters_setup(event, id->offs, PERF_EF_RELOAD);
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	return 0;
+}
+
+static int perf_event_addr_filters_setup(struct perf_event *event,
+					   unsigned long *offs,
+					   unsigned long gen)
+{
+	struct addr_filter_setup_data id = {
+		.event	= event,
+		.offs	= offs,
+		.gen	= gen,
+	};
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	unsigned long flags;
+	int ret = 0;
+
+	/*
+	 * We can't use event_function_call() here, because that would
+	 * require ctx::mutex, but one of our callers is called with
+	 * mm::mmap_sem down, which would cause an inversion, see bullet
+	 * (2) in put_event().
+	 */
+	do {
+		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE) {
+			raw_spin_lock_irqsave(&ifh->lock, flags);
+			/* see __perf_event_addr_filters_setup */
+			if (gen == event->addr_filters_gen || !offs)
+				event->pmu->addr_filters_setup(event, offs, 0);
+			raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+			if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
+				break;
+			/* otherwise, fall through to the cross-call */
+		}
+
+		/* matches smp_wmb() in event_sched_in() */
+		smp_rmb();
+
+		ret = cpu_function_call(READ_ONCE(event->oncpu),
+					__perf_event_addr_filters_setup, &id);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
 static int _perf_event_refresh(struct perf_event *event, int refresh)
 {
 	/*
@@ -3236,16 +3334,6 @@ out:
 		put_ctx(clone_ctx);
 }
 
-void perf_event_exec(void)
-{
-	int ctxn;
-
-	rcu_read_lock();
-	for_each_task_context_nr(ctxn)
-		perf_event_enable_on_exec(ctxn);
-	rcu_read_unlock();
-}
-
 struct perf_read_data {
 	struct perf_event *event;
 	bool group;
@@ -3779,6 +3867,9 @@ static bool exclusive_event_installable(struct perf_event *event,
 	return true;
 }
 
+static void perf_addr_filters_splice(struct perf_event *event,
+				       struct list_head *head);
+
 static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending);
@@ -3806,6 +3897,7 @@ static void _free_event(struct perf_event *event)
 	}
 
 	perf_event_free_bpf_prog(event);
+	perf_addr_filters_splice(event, NULL);
 
 	if (event->destroy)
 		event->destroy(event);
@@ -5884,6 +5976,52 @@ perf_event_aux(perf_event_aux_output_cb output, void *data,
 	rcu_read_unlock();
 }
 
+/*
+ * Clear all file-based filters at exec, they'll have to be
+ * re-instated when/if these objects are mmapped again.
+ */
+static void perf_itrace_exec(struct perf_event *event, void *data)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	struct perf_addr_filter *filter;
+	unsigned int restart = 0;
+	unsigned long flags;
+
+	if (!has_addr_filter(event))
+		return;
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		if (filter->kernel)
+			continue;
+
+		restart++;
+	}
+
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	if (restart)
+		perf_event_addr_filters_setup(event, NULL, 0);
+}
+
+void perf_event_exec(void)
+{
+	struct perf_event_context *ctx;
+	int ctxn;
+
+	rcu_read_lock();
+	for_each_task_context_nr(ctxn) {
+		ctx = current->perf_event_ctxp[ctxn];
+		if (!ctx)
+			continue;
+
+		perf_event_enable_on_exec(ctxn);
+
+		perf_event_aux_ctx(ctx, perf_itrace_exec, NULL, true);
+	}
+	rcu_read_unlock();
+}
+
 struct remote_output {
 	struct ring_buffer	*rb;
 	int			err;
@@ -6367,6 +6505,99 @@ got_name:
 	kfree(buf);
 }
 
+/*
+ * Whether this @filter depends on a dynamic object which is not loaded
+ * yet or its load addresses are not known.
+ */
+static bool perf_addr_filter_needs_mmap(struct perf_addr_filter *filter)
+{
+	return filter->filter && !filter->kernel;
+}
+
+/*
+ * Check whether inode and address range match filter criteria.
+ */
+static bool perf_addr_filter_match(struct perf_addr_filter *filter,
+				     struct file *file, unsigned long offset,
+				     unsigned long size)
+{
+	if (filter->kernel)
+		return false;
+
+	if (filter->inode != file->f_inode)
+		return false;
+
+	if (filter->offset > offset + size)
+		return false;
+
+	if (filter->offset + filter->size < offset)
+		return false;
+
+	return true;
+}
+
+static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
+{
+	struct vm_area_struct *vma = data;
+	unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
+	struct file *file = vma->vm_file;
+	struct perf_addr_filter *filter;
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	unsigned long *offs, flags, gen;
+	unsigned int restart = 0, count = 0;
+
+	if (!has_addr_filter(event))
+		return;
+
+	if (!file)
+		return;
+
+	offs = event->hw.addr_filters_offs;
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		/*
+		 * By default, keep this filter in the previously known state:
+		 * otherwise, for the filters that aren't impacted by this vma,
+		 * we'll have to re-scan the whole address space.
+		 */
+		offs[count] = 0;
+
+		if (perf_addr_filter_match(filter, file, off,
+					     vma->vm_end - vma->vm_start)) {
+			offs[count] = vma->vm_start;
+			restart++;
+		}
+
+		count++;
+	}
+
+	gen = ++event->addr_filters_gen;
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	if (restart)
+		perf_event_addr_filters_setup(event, offs, gen);
+}
+
+/*
+ * Adjust all task's events' filters to the new vma
+ */
+static void perf_addr_filters_adjust(struct vm_area_struct *vma)
+{
+	struct perf_event_context *ctx;
+	int ctxn;
+
+	rcu_read_lock();
+	for_each_task_context_nr(ctxn) {
+		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
+		if (!ctx)
+			continue;
+
+		perf_event_aux_ctx(ctx, __perf_addr_filters_adjust, vma, true);
+	}
+	rcu_read_unlock();
+}
+
 void perf_event_mmap(struct vm_area_struct *vma)
 {
 	struct perf_mmap_event mmap_event;
@@ -6398,6 +6629,7 @@ void perf_event_mmap(struct vm_area_struct *vma)
 		/* .flags (attr_mmap2 only) */
 	};
 
+	perf_addr_filters_adjust(vma);
 	perf_event_mmap_event(&mmap_event);
 }
 
@@ -7357,13 +7589,375 @@ void perf_bp_event(struct perf_event *bp, void *data)
 }
 #endif
 
+/*
+ * Allocate a new address filter
+ */
+static struct perf_addr_filter *
+perf_addr_filter_new(struct perf_event *event, struct list_head *filters)
+{
+	int node = cpu_to_node(event->cpu == -1 ? 0 : event->cpu);
+	struct perf_addr_filter *filter;
+
+	filter = kzalloc_node(sizeof(*filter), GFP_KERNEL, node);
+	if (!filter)
+		return NULL;
+
+	INIT_LIST_HEAD(&filter->entry);
+	list_add_tail(&filter->entry, filters);
+
+	return filter;
+}
+
+static void free_filters_list(struct list_head *filters)
+{
+	struct perf_addr_filter *filter, *iter;
+
+	list_for_each_entry_safe(filter, iter, filters, entry) {
+		if (filter->inode)
+			iput(filter->inode);
+		list_del(&filter->entry);
+		kfree(filter);
+	}
+}
+
+/*
+ * Free existing address filters and optionally install new ones
+ */
+static void perf_addr_filters_splice(struct perf_event *event,
+				       struct list_head *head)
+{
+	unsigned long flags;
+	LIST_HEAD(list);
+
+	if (!has_addr_filter(event))
+		return;
+
+	/* don't bother with children, they don't have their own filters */
+	if (event->parent)
+		return;
+
+	raw_spin_lock_irqsave(&event->addr_filters.lock, flags);
+
+	list_splice_init(&event->addr_filters.list, &list);
+	if (head)
+		list_splice(head, &event->addr_filters.list);
+
+	raw_spin_unlock_irqrestore(&event->addr_filters.lock, flags);
+
+	free_filters_list(&list);
+}
+
+/*
+ * Scan through mm's vmas and see if one of them matches the
+ * @filter; if so, adjust filter's address range.
+ * Called with mm::mmap_sem down for reading.
+ */
+static unsigned long perf_addr_filter_apply(struct perf_addr_filter *filter,
+					      struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		struct file *file = vma->vm_file;
+		unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
+		unsigned long vma_size = vma->vm_end - vma->vm_start;
+
+		if (!file)
+			continue;
+
+		if (!perf_addr_filter_match(filter, file, off,
+					      vma_size))
+			continue;
+
+		return vma->vm_start;
+	}
+
+	return 0;
+}
+
+/*
+ * Calculate event's address filters' ranges based on the
+ * task's existing mappings; if any of the existing mappings
+ * match the filters, update event's hw configuration and
+ * restart it if it's running.
+ */
+static void perf_event_addr_filters_apply(struct perf_event *event)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	struct perf_addr_filter *filter;
+	struct task_struct *task = READ_ONCE(event->ctx->task);
+	struct mm_struct *mm = NULL;
+	unsigned int restart = 0, count = 0;
+	unsigned long *offs, flags, gen;
+
+	offs = event->hw.addr_filters_offs;
+
+	/*
+	 * We may observe TASK_TOMBSTONE, which means that the event tear-down
+	 * will stop on the parent's child_mutex that our caller is also holding
+	 */
+	if (task == TASK_TOMBSTONE)
+		return;
+
+	mm = get_task_mm(event->ctx->task);
+	if (!mm)
+		return;
+
+	/* establish the initial hw configuration for this set of filters */
+	perf_event_addr_filters_setup(event, NULL, 0);
+
+	down_read(&mm->mmap_sem);
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		offs[count] = 0;
+
+		if (perf_addr_filter_needs_mmap(filter)) {
+			offs[count] = perf_addr_filter_apply(filter, mm);
+
+			if (offs[count])
+				restart++;
+		}
+
+		count++;
+	}
+
+	gen = ++event->addr_filters_gen;
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	up_read(&mm->mmap_sem);
+
+	if (restart)
+		perf_event_addr_filters_setup(event, offs, gen);
+
+	mmput(mm);
+}
+
+/*
+ * Address range filtering: limiting the data to certain
+ * instruction address ranges. Filters are ioctl()ed to us from
+ * userspace as ascii strings.
+ *
+ * Filter string format:
+ *
+ * ACTION SOURCE:RANGE_SPEC
+ * where ACTION is one of the
+ *  * "filter": limit the trace to this region
+ *  * "start": start tracing from this address
+ *  * "stop": stop tracing at this address/region;
+ * SOURCE is either "file" or "kernel"
+ * RANGE_SPEC is
+ *  * for "kernel": <start address>[/<size>]
+ *  * for "file":   <start address>[/<size>]@</path/to/object/file>
+ *
+ * if <size> is not specified, the range is treated as a single address.
+ */
+enum {
+	IF_ACT_FILTER,
+	IF_ACT_START,
+	IF_ACT_STOP,
+	IF_SRC_FILE,
+	IF_SRC_KERNEL,
+	IF_SRC_FILEADDR,
+	IF_SRC_KERNELADDR,
+};
+
+enum {
+	IF_STATE_ACTION = 0,
+	IF_STATE_SOURCE,
+	IF_STATE_END,
+};
+
+static const match_table_t if_tokens = {
+	{ IF_ACT_FILTER,	"filter" },
+	{ IF_ACT_START,		"start" },
+	{ IF_ACT_STOP,		"stop" },
+	{ IF_SRC_FILE,		"file:%u/%u@%s" },
+	{ IF_SRC_KERNEL,	"kernel:%u/%u" },
+	{ IF_SRC_FILEADDR,	"file:%u@%s" },
+	{ IF_SRC_KERNELADDR,	"kernel:%u" },
+};
+
+/*
+ * Address filter string parser
+ */
+static int
+perf_event_parse_addr_filter(struct perf_event *event, char *fstr,
+			       struct list_head *filters)
+{
+	struct perf_addr_filter *filter = NULL;
+	char *start, *orig, *filename = NULL;
+	struct path path;
+	substring_t args[MAX_OPT_ARGS];
+	int state = IF_STATE_ACTION, token;
+	int ret = -EINVAL;
+
+	orig = fstr = kstrdup(fstr, GFP_KERNEL);
+	if (!fstr)
+		return -ENOMEM;
+
+	while ((start = strsep(&fstr, " ,\n")) != NULL) {
+		ret = -EINVAL;
+
+		if (!*start)
+			continue;
+
+		/* filter definition begins */
+		if (state == IF_STATE_ACTION) {
+			filter = perf_addr_filter_new(event, filters);
+			if (!filter)
+				goto fail;
+		}
+
+		token = match_token(start, if_tokens, args);
+		switch (token) {
+		case IF_ACT_FILTER:
+		case IF_ACT_START:
+			filter->filter = 1;
+
+		case IF_ACT_STOP:
+			if (state != IF_STATE_ACTION)
+				goto fail;
+
+			state = IF_STATE_SOURCE;
+			break;
+
+		case IF_SRC_KERNELADDR:
+		case IF_SRC_KERNEL:
+			filter->kernel = 1;
+
+		case IF_SRC_FILEADDR:
+		case IF_SRC_FILE:
+			if (state != IF_STATE_SOURCE)
+				goto fail;
+
+			if (token == IF_SRC_FILE || token == IF_SRC_KERNEL)
+				filter->range = 1;
+
+			*args[0].to = 0;
+			ret = kstrtoul(args[0].from, 0, &filter->offset);
+			if (ret)
+				goto fail;
+
+			if (filter->range) {
+				*args[1].to = 0;
+				ret = kstrtoul(args[1].from, 0, &filter->size);
+				if (ret)
+					goto fail;
+			}
+
+			if (token == IF_SRC_FILE) {
+				filename = match_strdup(&args[2]);
+				if (!filename) {
+					ret = -ENOMEM;
+					goto fail;
+				}
+			}
+
+			state = IF_STATE_END;
+			break;
+
+		default:
+			goto fail;
+		}
+
+		/*
+		 * Filter definition is fully parsed, validate and install it.
+		 * Make sure that it doesn't contradict itself or the event's
+		 * attribute.
+		 */
+		if (state == IF_STATE_END) {
+			if (filter->kernel && event->attr.exclude_kernel)
+				goto fail;
+
+			if (!filter->kernel) {
+				if (!filename)
+					goto fail;
+
+				/* look up the path and grab its inode */
+				ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+				if (ret)
+					goto fail_free_name;
+
+				filter->inode = igrab(d_inode(path.dentry));
+				path_put(&path);
+				kfree(filename);
+				filename = NULL;
+			}
+
+			/* ready to consume more filters */
+			state = IF_STATE_ACTION;
+			filter = NULL;
+		}
+	}
+
+	if (state != IF_STATE_ACTION)
+		goto fail;
+
+	kfree(orig);
+
+	return 0;
+
+fail_free_name:
+	kfree(filename);
+fail:
+	free_filters_list(filters);
+	kfree(orig);
+
+	return ret;
+}
+
+static int
+perf_event_set_addr_filter(struct perf_event *event, char *filter_str)
+{
+	LIST_HEAD(filters);
+	int ret;
+
+	/*
+	 * Since this is called in perf_ioctl() path, we're already holding
+	 * ctx::mutex.
+	 */
+	lockdep_assert_held(&event->ctx->mutex);
+
+	if (WARN_ON_ONCE(event->parent))
+		return -EINVAL;
+
+	/*
+	 * For now, we only support filtering in per-task events; doing so
+	 * for cpu-wide events requires additional context switching trickery,
+	 * since the same object code will be mapped at different virtual
+	 * addresses in different processes.
+	 */
+	if (!event->ctx->task)
+		return -EOPNOTSUPP;
+
+	ret = perf_event_parse_addr_filter(event, filter_str, &filters);
+	if (ret)
+		return ret;
+
+	ret = event->pmu->addr_filters_validate(&filters);
+	if (ret) {
+		free_filters_list(&filters);
+		return ret;
+	}
+
+	/* remove existing filters, if any */
+	perf_addr_filters_splice(event, &filters);
+
+	/* install new filters */
+	perf_event_for_each_child(event, perf_event_addr_filters_apply);
+
+	return ret;
+}
+
 static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 {
 	char *filter_str;
 	int ret = -EINVAL;
 
-	if (event->attr.type != PERF_TYPE_TRACEPOINT ||
-	    !IS_ENABLED(CONFIG_EVENT_TRACING))
+	if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
+	    !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
+	    !has_addr_filter(event))
 		return -EINVAL;
 
 	filter_str = strndup_user(arg, PAGE_SIZE);
@@ -7374,6 +7968,8 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 	    event->attr.type == PERF_TYPE_TRACEPOINT)
 		ret = ftrace_profile_set_filter(event, event->attr.config,
 						filter_str);
+	else if (has_addr_filter(event))
+		ret = perf_event_set_addr_filter(event, filter_str);
 
 	kfree(filter_str);
 	return ret;
@@ -8196,6 +8792,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->sibling_list);
 	INIT_LIST_HEAD(&event->rb_entry);
 	INIT_LIST_HEAD(&event->active_entry);
+	INIT_LIST_HEAD(&event->addr_filters.list);
 	INIT_HLIST_NODE(&event->hlist_entry);
 	INIT_LIST_HEAD(&event->sb_list);
 
@@ -8203,6 +8800,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	init_irq_work(&event->pending, perf_pending_event);
 
 	mutex_init(&event->mmap_mutex);
+	raw_spin_lock_init(&event->addr_filters.lock);
 
 	atomic_long_set(&event->refcount, 1);
 	event->cpu		= cpu;
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v1 5/5] perf/x86/intel/pt: Add support for address range filtering in PT
  2016-04-21 15:16 [PATCH v1 0/5] perf: Introduce address range filtering Alexander Shishkin
                   ` (3 preceding siblings ...)
  2016-04-21 15:17 ` [PATCH v1 4/5] perf: Introduce address range filtering Alexander Shishkin
@ 2016-04-21 15:17 ` Alexander Shishkin
  4 siblings, 0 replies; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-21 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

Newer versions of Intel PT support address ranges, which can be used to
define IP address range-based filters or TraceSTOP regions. The number of
ranges is enumerated via CPUID.

This patch implements pmu callbacks and related low-level code to allow
filter validation, configuration and programming into the hardware.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/events/intel/pt.c | 183 ++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/intel/pt.h |  28 +++++++
 2 files changed, 210 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 891447dd61..e577afb74d 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -253,6 +253,73 @@ static bool pt_event_valid(struct perf_event *event)
  * These all are cpu affine and operate on a local PT
  */
 
+/* Address ranges and their corresponding msr configuration registers */
+static const struct pt_address_range {
+	unsigned long	msr_a;
+	unsigned long	msr_b;
+	unsigned int	reg_off;
+} pt_address_ranges[] = {
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR0_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR0_B,
+		.reg_off = RTIT_CTL_ADDR0_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR1_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR1_B,
+		.reg_off = RTIT_CTL_ADDR1_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR2_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR2_B,
+		.reg_off = RTIT_CTL_ADDR2_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR3_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR3_B,
+		.reg_off = RTIT_CTL_ADDR3_OFFSET,
+	}
+};
+
+static u64 pt_config_filters(struct perf_event *event)
+{
+	struct pt_filters *filters = event->hw.addr_filters;
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	unsigned int range = 0;
+	u64 rtit_ctl = 0;
+
+	if (!filters)
+		return 0;
+
+	for (range = 0; range < filters->nr_filters; range++) {
+		struct pt_filter *filter = &filters->filter[range];
+
+		/*
+		 * Note, if the range has zero start/end addresses due
+		 * to its dynamic object not being loaded yet, we just
+		 * go ahead and program a zeroed range, which will simply
+		 * produce no data. Note^2: if executable code at 0x0
+		 * is a concern, we can set up an "invalid" configuration
+		 * such as msr_b < msr_a.
+		 */
+
+		/* avoid redundant msr writes */
+		if (pt->filters.filter[range].msr_a != filter->msr_a) {
+			wrmsrl(pt_address_ranges[range].msr_a, filter->msr_a);
+			pt->filters.filter[range].msr_a = filter->msr_a;
+		}
+
+		if (pt->filters.filter[range].msr_b != filter->msr_b) {
+			wrmsrl(pt_address_ranges[range].msr_b, filter->msr_b);
+			pt->filters.filter[range].msr_b = filter->msr_b;
+		}
+
+		rtit_ctl |= filter->config << pt_address_ranges[range].reg_off;
+	}
+
+	return rtit_ctl;
+}
+
 static void pt_config(struct perf_event *event)
 {
 	u64 reg;
@@ -262,7 +329,8 @@ static void pt_config(struct perf_event *event)
 		wrmsrl(MSR_IA32_RTIT_STATUS, 0);
 	}
 
-	reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
+	reg = pt_config_filters(event);
+	reg |= RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
 
 	if (!event->attr.exclude_kernel)
 		reg |= RTIT_CTL_OS;
@@ -907,6 +975,105 @@ static void pt_buffer_free_aux(void *data)
 	kfree(buf);
 }
 
+static int pt_addr_filters_init(struct perf_event *event)
+{
+	struct pt_filters *filters;
+	int node = event->cpu == -1 ? -1 : cpu_to_node(event->cpu);
+
+	if (!pt_cap_get(PT_CAP_num_address_ranges))
+		return 0;
+
+	filters = kzalloc_node(sizeof(struct pt_filters), GFP_KERNEL, node);
+	if (!filters)
+		return -ENOMEM;
+
+	if (event->parent)
+		memcpy(filters, event->parent->hw.addr_filters,
+		       sizeof(*filters));
+
+	event->hw.addr_filters = filters;
+	event->hw.addr_filters_offs = filters->offs;
+
+	return 0;
+}
+
+static void pt_addr_filters_fini(struct perf_event *event)
+{
+	kfree(event->hw.addr_filters);
+	event->hw.addr_filters = NULL;
+	event->hw.addr_filters_offs = NULL;
+}
+
+static int pt_event_addr_filters_validate(struct list_head *filters)
+{
+	struct perf_addr_filter *filter;
+	int range = 0;
+
+	list_for_each_entry(filter, filters, entry) {
+		/* PT doesn't support single address triggers */
+		if (!filter->range)
+			return -EOPNOTSUPP;
+
+		if (filter->kernel && !kernel_ip(filter->offset))
+			return -EINVAL;
+
+		if (++range > pt_cap_get(PT_CAP_num_address_ranges))
+			return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static void pt_event_start(struct perf_event *event, int mode);
+static void pt_event_stop(struct perf_event *event, int mode);
+
+static void pt_event_addr_filters_setup(struct perf_event *event,
+					  unsigned long *offs, int flags)
+{
+	struct perf_addr_filters_head *head = perf_event_addr_filters(event);
+	struct pt_filters *filters = event->hw.addr_filters;
+	unsigned int reload = flags & PERF_EF_RELOAD;
+	struct perf_addr_filter *filter;
+	unsigned long msr_a, msr_b;
+	int range = 0;
+
+	if (!filters ||
+	    WARN_ON_ONCE(pt_event_addr_filters_validate(&head->list)))
+		return;
+
+	if (reload)
+		pt_event_stop(event, PERF_EF_UPDATE);
+
+	list_for_each_entry(filter, &head->list, entry) {
+		/* kernel-space filters are already configured */
+		if (filter->kernel) {
+			msr_a = filter->offset;
+			msr_b = filter->offset + filter->size;
+		} else if (!offs) {
+			/* clear the range */
+			msr_a = msr_b = 0;
+		} else if (offs[range] == 0) {
+			/* keep the existing range */
+			goto next;
+		} else {
+			/* apply the offset */
+			msr_a = filter->offset + offs[range];
+			msr_b = filter->size + msr_a;
+		}
+
+		filters->filter[range].msr_a  = msr_a;
+		filters->filter[range].msr_b  = msr_b;
+next:
+		filters->filter[range].config = filter->filter ? 1 : 2;
+		range++;
+	}
+
+	filters->nr_filters = range;
+
+	if (reload)
+		pt_event_start(event, PERF_EF_RELOAD);
+}
+
 /**
  * intel_pt_interrupt() - PT PMI handler
  */
@@ -1075,6 +1242,7 @@ static void pt_event_read(struct perf_event *event)
 
 static void pt_event_destroy(struct perf_event *event)
 {
+	pt_addr_filters_fini(event);
 	x86_del_exclusive(x86_lbr_exclusive_pt);
 }
 
@@ -1089,6 +1257,11 @@ static int pt_event_init(struct perf_event *event)
 	if (x86_add_exclusive(x86_lbr_exclusive_pt))
 		return -EBUSY;
 
+	if (pt_addr_filters_init(event)) {
+		x86_del_exclusive(x86_lbr_exclusive_pt);
+		return -ENOMEM;
+	}
+
 	event->destroy = pt_event_destroy;
 
 	return 0;
@@ -1152,6 +1325,14 @@ static __init int pt_init(void)
 	pt_pmu.pmu.read		= pt_event_read;
 	pt_pmu.pmu.setup_aux	= pt_buffer_setup_aux;
 	pt_pmu.pmu.free_aux	= pt_buffer_free_aux;
+
+	if (pt_cap_get(PT_CAP_num_address_ranges)) {
+		pt_pmu.pmu.addr_filters_validate =
+			pt_event_addr_filters_validate;
+		pt_pmu.pmu.addr_filters_setup =
+			pt_event_addr_filters_setup;
+	}
+
 	ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
 
 	return ret;
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index 6ce8cd20b9..27bd3838c5 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -105,13 +105,41 @@ struct pt_buffer {
 	struct topa_entry	*topa_index[0];
 };
 
+#define PT_FILTERS_NUM	4
+
+/**
+ * struct pt_filter - IP range filter configuration
+ * @msr_a:	range start, goes to RTIT_ADDRn_A
+ * @msr_b:	range end, goes to RTIT_ADDRn_B
+ * @config:	4-bit field in RTIT_CTL
+ */
+struct pt_filter {
+	unsigned long	msr_a;
+	unsigned long	msr_b;
+	unsigned long	config;
+};
+
+/**
+ * struct pt_filters - IP range filtering context
+ * @filter:	filters defined for this context
+ * @offs:	array of mapped object's offsets
+ * @nr_filters:	number of defined filters in the @filter array
+ */
+struct pt_filters {
+	struct pt_filter	filter[PT_FILTERS_NUM];
+	unsigned long		offs[PT_FILTERS_NUM];
+	unsigned int		nr_filters;
+};
+
 /**
  * struct pt - per-cpu pt context
  * @handle:	perf output handle
+ * @filters:		last configured filters
  * @handle_nmi:	do handle PT PMI on this cpu, there's an active event
  */
 struct pt {
 	struct perf_output_handle handle;
+	struct pt_filters	filters;
 	int			handle_nmi;
 };
 
-- 
2.8.0.rc3

* Re: [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits
  2016-04-21 15:17 ` [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits Alexander Shishkin
@ 2016-04-21 17:48   ` Borislav Petkov
  2016-04-21 18:55     ` Thomas Gleixner
  0 siblings, 1 reply; 27+ messages in thread
From: Borislav Petkov @ 2016-04-21 17:48 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Thu, Apr 21, 2016 at 06:17:00PM +0300, Alexander Shishkin wrote:
> New versions of Intel PT support address range-based filtering. These
> are the registers, bit definitions and relevant CPUID bits.
> 
> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---
>  arch/x86/events/intel/pt.c       |  2 ++
>  arch/x86/events/intel/pt.h       |  2 ++
>  arch/x86/include/asm/msr-index.h | 18 ++++++++++++++++++
>  3 files changed, 22 insertions(+)
> 
> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
> index 127f58c179..891447dd61 100644
> --- a/arch/x86/events/intel/pt.c
> +++ b/arch/x86/events/intel/pt.c
> @@ -67,11 +67,13 @@ static struct pt_cap_desc {
>  	PT_CAP(max_subleaf,		0, CR_EAX, 0xffffffff),
>  	PT_CAP(cr3_filtering,		0, CR_EBX, BIT(0)),
>  	PT_CAP(psb_cyc,			0, CR_EBX, BIT(1)),
> +	PT_CAP(ip_filtering,		0, CR_EBX, BIT(2)),
>  	PT_CAP(mtc,			0, CR_EBX, BIT(3)),
>  	PT_CAP(topa_output,		0, CR_ECX, BIT(0)),
>  	PT_CAP(topa_multiple_entries,	0, CR_ECX, BIT(1)),
>  	PT_CAP(single_range_output,	0, CR_ECX, BIT(2)),
>  	PT_CAP(payloads_lip,		0, CR_ECX, BIT(31)),
> +	PT_CAP(num_address_ranges,	1, CR_EAX, 0x3),
>  	PT_CAP(mtc_periods,		1, CR_EAX, 0xffff0000),
>  	PT_CAP(cycle_thresholds,	1, CR_EBX, 0xffff),
>  	PT_CAP(psb_periods,		1, CR_EBX, 0xffff0000),
> diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
> index 336878a5d2..6ce8cd20b9 100644
> --- a/arch/x86/events/intel/pt.h
> +++ b/arch/x86/events/intel/pt.h
> @@ -52,11 +52,13 @@ enum pt_capabilities {
>  	PT_CAP_max_subleaf = 0,
>  	PT_CAP_cr3_filtering,
>  	PT_CAP_psb_cyc,
> +	PT_CAP_ip_filtering,
>  	PT_CAP_mtc,
>  	PT_CAP_topa_output,
>  	PT_CAP_topa_multiple_entries,
>  	PT_CAP_single_range_output,
>  	PT_CAP_payloads_lip,
> +	PT_CAP_num_address_ranges,
>  	PT_CAP_mtc_periods,
>  	PT_CAP_cycle_thresholds,
>  	PT_CAP_psb_periods,
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index e0e2f7dfbd..964d7e17a6 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -105,11 +105,29 @@
>  #define RTIT_CTL_CYC_THRESH		(0x0full << RTIT_CTL_CYC_THRESH_OFFSET)
>  #define RTIT_CTL_PSB_FREQ_OFFSET	24
>  #define RTIT_CTL_PSB_FREQ      		(0x0full << RTIT_CTL_PSB_FREQ_OFFSET)
> +#define RTIT_CTL_ADDR0_OFFSET		32
> +#define RTIT_CTL_ADDR0      		(0x0full << RTIT_CTL_ADDR0_OFFSET)
> +#define RTIT_CTL_ADDR1_OFFSET		36
> +#define RTIT_CTL_ADDR1      		(0x0full << RTIT_CTL_ADDR1_OFFSET)
> +#define RTIT_CTL_ADDR2_OFFSET		40
> +#define RTIT_CTL_ADDR2      		(0x0full << RTIT_CTL_ADDR2_OFFSET)
> +#define RTIT_CTL_ADDR3_OFFSET		44
> +#define RTIT_CTL_ADDR3      		(0x0full << RTIT_CTL_ADDR3_OFFSET)
>  #define MSR_IA32_RTIT_STATUS		0x00000571
> +#define RTIT_STATUS_FILTEREN		BIT(0)
>  #define RTIT_STATUS_CONTEXTEN		BIT(1)
>  #define RTIT_STATUS_TRIGGEREN		BIT(2)
> +#define RTIT_STATUS_BUFFOVF		BIT(3)
>  #define RTIT_STATUS_ERROR		BIT(4)
>  #define RTIT_STATUS_STOPPED		BIT(5)
> +#define MSR_IA32_RTIT_ADDR0_A		0x00000580
> +#define MSR_IA32_RTIT_ADDR0_B		0x00000581
> +#define MSR_IA32_RTIT_ADDR1_A		0x00000582
> +#define MSR_IA32_RTIT_ADDR1_B		0x00000583
> +#define MSR_IA32_RTIT_ADDR2_A		0x00000584
> +#define MSR_IA32_RTIT_ADDR2_B		0x00000585
> +#define MSR_IA32_RTIT_ADDR3_A		0x00000586
> +#define MSR_IA32_RTIT_ADDR3_B		0x00000587

So can we not turn msr-index.h a dumping ground for MSRs pls?

If those are only PT-relevant, why not define them all in pt.h?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

* Re: [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits
  2016-04-21 17:48   ` Borislav Petkov
@ 2016-04-21 18:55     ` Thomas Gleixner
  2016-04-21 19:17       ` Peter Zijlstra
  2016-04-21 20:37       ` Borislav Petkov
  0 siblings, 2 replies; 27+ messages in thread
From: Thomas Gleixner @ 2016-04-21 18:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Shishkin, Peter Zijlstra, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier

On Thu, 21 Apr 2016, Borislav Petkov wrote:
> > +#define MSR_IA32_RTIT_ADDR0_A		0x00000580
> > +#define MSR_IA32_RTIT_ADDR0_B		0x00000581
> > +#define MSR_IA32_RTIT_ADDR1_A		0x00000582
> > +#define MSR_IA32_RTIT_ADDR1_B		0x00000583
> > +#define MSR_IA32_RTIT_ADDR2_A		0x00000584
> > +#define MSR_IA32_RTIT_ADDR2_B		0x00000585
> > +#define MSR_IA32_RTIT_ADDR3_A		0x00000586
> > +#define MSR_IA32_RTIT_ADDR3_B		0x00000587
> 
> So can we not turn msr-index.h a dumping ground for MSRs pls?
> 
> If those are only PT-relevant, why not define them all in pt.h?

I have to disagree here. The MSRs itself can really go into msr-index.h while
the bit definitions might go elsewhere. What's wrong with having all MSRs at a
central place?

Thanks,

	tglx

* Re: [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits
  2016-04-21 18:55     ` Thomas Gleixner
@ 2016-04-21 19:17       ` Peter Zijlstra
  2016-04-21 20:39         ` Borislav Petkov
  2016-04-21 20:37       ` Borislav Petkov
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-04-21 19:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Alexander Shishkin, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier

On Thu, Apr 21, 2016 at 08:55:38PM +0200, Thomas Gleixner wrote:
> On Thu, 21 Apr 2016, Borislav Petkov wrote:
> > > +#define MSR_IA32_RTIT_ADDR0_A		0x00000580
> > > +#define MSR_IA32_RTIT_ADDR0_B		0x00000581
> > > +#define MSR_IA32_RTIT_ADDR1_A		0x00000582
> > > +#define MSR_IA32_RTIT_ADDR1_B		0x00000583
> > > +#define MSR_IA32_RTIT_ADDR2_A		0x00000584
> > > +#define MSR_IA32_RTIT_ADDR2_B		0x00000585
> > > +#define MSR_IA32_RTIT_ADDR3_A		0x00000586
> > > +#define MSR_IA32_RTIT_ADDR3_B		0x00000587
> > 
> > So can we not turn msr-index.h a dumping ground for MSRs pls?
> > 
> > If those are only PT-relevant, why not define them all in pt.h?
> 
> I have to disagree here. The MSRs itself can really go into msr-index.h while
> the bit definitions might go elsewhere. What's wrong with having all MSRs at a
> central place?

So I agree with Thomas; the risk of not doing this is that we'll
introduce the same MSR again, in another file, under another name.

That's confusion we can do without.

* Re: [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits
  2016-04-21 18:55     ` Thomas Gleixner
  2016-04-21 19:17       ` Peter Zijlstra
@ 2016-04-21 20:37       ` Borislav Petkov
  2016-04-22  7:58         ` Thomas Gleixner
  1 sibling, 1 reply; 27+ messages in thread
From: Borislav Petkov @ 2016-04-21 20:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alexander Shishkin, Peter Zijlstra, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier

On Thu, Apr 21, 2016 at 08:55:38PM +0200, Thomas Gleixner wrote:
> I have to disagree here. The MSRs itself can really go into msr-index.h while
> the bit definitions might go elsewhere. What's wrong with having all MSRs at a
> central place?

Same reason as for pci_ids.h - to contain only MSRs which are used in
multiple compilation units.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

* Re: [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits
  2016-04-21 19:17       ` Peter Zijlstra
@ 2016-04-21 20:39         ` Borislav Petkov
  0 siblings, 0 replies; 27+ messages in thread
From: Borislav Petkov @ 2016-04-21 20:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Alexander Shishkin, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier

On Thu, Apr 21, 2016 at 09:17:02PM +0200, Peter Zijlstra wrote:
> So I agree with Thomas; the risk of not doing this is that we'll
> introduce the same MSR again, in another file, under another name.

... only if that MSR is useful in other compilation units. If not, then
you're unlikely to need it somewhere else. That's why I meant if it is
PT-relevant only, to define it in its header instead.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-21 15:17 ` [PATCH v1 4/5] perf: Introduce address range filtering Alexander Shishkin
@ 2016-04-22  7:45   ` Peter Zijlstra
  2016-04-22 16:19     ` Alexander Shishkin
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-04-22  7:45 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Thu, Apr 21, 2016 at 06:17:02PM +0300, Alexander Shishkin wrote:
> +struct addr_filter_setup_data {
> +	struct perf_event	*event;
> +	unsigned long		*offs;
> +	unsigned long		gen;
> +};
> +
> +static int __perf_event_addr_filters_setup(void *info)
> +{
> +	struct addr_filter_setup_data *id = info;
> +	struct perf_event *event = id->event;
> +	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +	unsigned long flags;
> +
> +	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
> +		return -EAGAIN;
> +
> +	/* matches smp_wmb() in event_sched_in() */
> +	smp_rmb();
> +
> +	/*
> +	 * There is a window with interrupts enabled before we get here,
> +	 * so we need to check again lest we try to stop another cpu's event.
> +	 */
> +	if (READ_ONCE(event->oncpu) != smp_processor_id())
> +		return -EAGAIN;
> +
> +	raw_spin_lock_irqsave(&ifh->lock, flags);

Since we only ever use this from cpu_function_call() IRQs are guaranteed
off already, you even rely on that in the above ->oncpu test, so the
_irqsave() thing is entirely redundant, no?

> +	/*
> +	 * In case of generations' mismatch, we don't have to do anything for
> +	 * this instance anymore, there will be another one with the *right* gen.
> +	 * If called to clear filters, always let it through.
> +	 */
> +	if (id->gen == event->addr_filters_gen || !id->offs)
> +		event->pmu->addr_filters_setup(event, id->offs, PERF_EF_RELOAD);
> +	raw_spin_unlock_irqrestore(&ifh->lock, flags);
> +
> +	return 0;
> +}
> +
> +static int perf_event_addr_filters_setup(struct perf_event *event,
> +					   unsigned long *offs,
> +					   unsigned long gen)
> +{
> +	struct addr_filter_setup_data id = {
> +		.event	= event,
> +		.offs	= offs,
> +		.gen	= gen,
> +	};
> +	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	/*
> +	 * We can't use event_function_call() here, because that would
> +	 * require ctx::mutex, but one of our callers is called with
> +	 * mm::mmap_sem down, which would cause an inversion, see bullet
> +	 * (2) in put_event().
> +	 */
> +	do {
> +		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE) {
> +			raw_spin_lock_irqsave(&ifh->lock, flags);

And here, like in all the other sites, IRQs must be enabled, no?

> +			/* see __perf_event_addr_filters_setup */

How is this not racy? The event could have gotten enabled between that
test and getting here, right?

> +			if (gen == event->addr_filters_gen || !offs)
> +				event->pmu->addr_filters_setup(event, offs, 0);
> +			raw_spin_unlock_irqrestore(&ifh->lock, flags);
> +
> +			if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
> +				break;

I r confused, calling ->addr_filters_setup() on an active event, from
the wrong cpu, is badness, no?

Is this something the callback has to handle?

> +			/* otherwise, fall through to the cross-call */
> +		}
> +
> +		/* matches smp_wmb() in event_sched_in() */
> +		smp_rmb();
> +
> +		ret = cpu_function_call(READ_ONCE(event->oncpu),
> +					__perf_event_addr_filters_setup, &id);
> +	} while (ret == -EAGAIN);
> +
> +	return ret;
> +}

This whole thing seems rather full of tricky :/


> @@ -6398,6 +6629,7 @@ void perf_event_mmap(struct vm_area_struct *vma)
>  		/* .flags (attr_mmap2 only) */
>  	};
>  
> +	perf_addr_filters_adjust(vma);
>  	perf_event_mmap_event(&mmap_event);
>  }

And this is the 'offending' site that requires all the tricky..

> +/*
> + * Calculate event's address filters' ranges based on the
> + * task's existing mappings; if any of the existing mappings
> + * match the filters, update event's hw configuration and
> + * restart it if it's running.
> + */
> +static void perf_event_addr_filters_apply(struct perf_event *event)
> +{
> +	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +	struct perf_addr_filter *filter;
> +	struct task_struct *task = READ_ONCE(event->ctx->task);
> +	struct mm_struct *mm = NULL;
> +	unsigned int restart = 0, count = 0;
> +	unsigned long *offs, flags, gen;
> +
> +	offs = event->hw.addr_filters_offs;
> +
> +	/*
> +	 * We may observe TASK_TOMBSTONE, which means that the event tear-down
> +	 * will stop on the parent's child_mutex that our caller is also holding
> +	 */
> +	if (task == TASK_TOMBSTONE)
> +		return;
> +
> +	mm = get_task_mm(event->ctx->task);
> +	if (!mm)
> +		return;

kernel threads may not have a filter?

> +
> +	/* establish the initial hw configuration for this set of filters */
> +	perf_event_addr_filters_setup(event, NULL, 0);
> +
> +	down_read(&mm->mmap_sem);
> +
> +	raw_spin_lock_irqsave(&ifh->lock, flags);
> +	list_for_each_entry(filter, &ifh->list, entry) {
> +		offs[count] = 0;
> +
> +		if (perf_addr_filter_needs_mmap(filter)) {
> +			offs[count] = perf_addr_filter_apply(filter, mm);
> +
> +			if (offs[count])
> +				restart++;
> +		}
> +
> +		count++;
> +	}
> +
> +	gen = ++event->addr_filters_gen;
> +	raw_spin_unlock_irqrestore(&ifh->lock, flags);
> +
> +	up_read(&mm->mmap_sem);
> +
> +	if (restart)
> +		perf_event_addr_filters_setup(event, offs, gen);
> +
> +	mmput(mm);
> +}

> +/*
> + * Address range filtering: limiting the data to certain
> + * instruction address ranges. Filters are ioctl()ed to us from
> + * userspace as ascii strings.
> + *
> + * Filter string format:
> + *
> + * ACTION SOURCE:RANGE_SPEC
> + * where ACTION is one of the
> + *  * "filter": limit the trace to this region
> + *  * "start": start tracing from this address
> + *  * "stop": stop tracing at this address/region;
> + * SOURCE is either "file" or "kernel"
> + * RANGE_SPEC is
> + *  * for "kernel": <start address>[/<size>]
> + *  * for "file":   <start address>[/<size>]@</path/to/object/file>
> + *
> + * if <size> is not specified, the range is treated as a single address.
> + */

Scary bit of string manipulation there.. have you tried fuzzing it? ;-)

* Re: [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits
  2016-04-21 20:37       ` Borislav Petkov
@ 2016-04-22  7:58         ` Thomas Gleixner
  2016-04-22  9:34           ` Borislav Petkov
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2016-04-22  7:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Shishkin, Peter Zijlstra, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier

On Thu, 21 Apr 2016, Borislav Petkov wrote:

> On Thu, Apr 21, 2016 at 08:55:38PM +0200, Thomas Gleixner wrote:
> > I have to disagree here. The MSRs itself can really go into msr-index.h while
> > the bit definitions might go elsewhere. What's wrong with having all MSRs at a
> > central place?
> 
> Same reason as for pci_ids.h - to contain only MSRs which are used in
> multiple compilation units.

That's really not the same thing. pci ids are issued by a gazillion of vendors
for a bazillion of different devices. There is no consistent view for them.

MSRs on the other hand are x86 specific registers nicely defined in the
SDM/APM and having at least the MSR defines in a single header makes a lot of
sense.

Thanks,

	tglx

* Re: [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits
  2016-04-22  7:58         ` Thomas Gleixner
@ 2016-04-22  9:34           ` Borislav Petkov
  0 siblings, 0 replies; 27+ messages in thread
From: Borislav Petkov @ 2016-04-22  9:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alexander Shishkin, Peter Zijlstra, Ingo Molnar, linux-kernel,
	vince, eranian, Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Apr 22, 2016 at 09:58:31AM +0200, Thomas Gleixner wrote:
> That's really not the same thing. pci ids are issued by a gazillion of vendors
> for a bazillion of different devices. There is no consistent view for them.

So my reasoning was to not add *every* MSR to that file, especially the
ones which are strictly topical. For example, the MCA MSRs which can
easily live in mce.h as nothing else needs to touch them...

> MSRs on the other hand are x86 specific registers nicely defined in the
> SDM/APM and having at least the MSR defines in a single header makes a lot of
> sense.

... but ok. I see there could be some merit of keeping them all in the
same place. We can always change that if the handling of msr-index.h
starts becoming too unwieldy.

In any case, we'd need to zap

053080a9d1c8 ("x86/msr: Document msr-index.h rule for addition")

now.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-22  7:45   ` Peter Zijlstra
@ 2016-04-22 16:19     ` Alexander Shishkin
  2016-04-25 12:57       ` Peter Zijlstra
                         ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-22 16:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Apr 21, 2016 at 06:17:02PM +0300, Alexander Shishkin wrote:
>> +			/* otherwise, fall through to the cross-call */
>> +		}
>> +
>> +		/* matches smp_wmb() in event_sched_in() */
>> +		smp_rmb();
>> +
>> +		ret = cpu_function_call(READ_ONCE(event->oncpu),
>> +					__perf_event_addr_filters_setup, &id);
>> +	} while (ret == -EAGAIN);
>> +
>> +	return ret;
>> +}
>
> This whole thing seems rather full of tricky :/

Yes, so I got rid of it.

>> +	mm = get_task_mm(event->ctx->task);
>> +	if (!mm)
>> +		return;
>
> kernel threads may not have a filter?

Good catch. They won't have executable vmas, but they can still have
kernel regions, so here we should skip scanning the address space and
proceed to propagating the filters to the pmu.

>> +/*
>> + * Address range filtering: limiting the data to certain
>> + * instruction address ranges. Filters are ioctl()ed to us from
>> + * userspace as ascii strings.
>> + *
>> + * Filter string format:
>> + *
>> + * ACTION SOURCE:RANGE_SPEC
>> + * where ACTION is one of the
>> + *  * "filter": limit the trace to this region
>> + *  * "start": start tracing from this address
>> + *  * "stop": stop tracing at this address/region;
>> + * SOURCE is either "file" or "kernel"
>> + * RANGE_SPEC is
>> + *  * for "kernel": <start address>[/<size>]
>> + *  * for "file":   <start address>[/<size>]@</path/to/object/file>
>> + *
>> + * if <size> is not specified, the range is treated as a single address.
>> + */
>
> Scary bit of string manipulation there.. have you tried fuzzing it? ;-)

Not yet. But you have a point. :)

So below is what I came up with instead of this patch. The 5/5 also
changed as a result, and I pushed the whole thing here [1].

[1] https://git.kernel.org/cgit/linux/kernel/git/ash/linux.git/log/?h=perf-addr-filters

From b96e7feabc4f1679e53d9a4b59b479f47291384d Mon Sep 17 00:00:00 2001
From: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Date: Mon, 13 Jul 2015 17:25:02 +0300
Subject: perf: Introduce address range filtering

Many instruction trace pmus out there support address range-based
filtering, generating trace data only for a given range of instruction
addresses, which is useful for tracing individual functions, modules
or libraries. Other pmus may also use this functionality to filter in
or filter out code at certain address ranges.

This patch introduces the interface for userspace to specify these
filters and for the pmu drivers to apply these filters to hardware
configuration.

The user interface is an ascii string that is passed via an ioctl
and specifies (in a way similar to uprobe) address ranges within
certain object files or within the kernel. There is no special treatment
for kernel modules yet, but it might be a worthy pursuit.

The pmu driver interface basically adds an extra callback to the
pmu driver structure, which validates the filter configuration proposed
by the user against what the hardware is actually capable of doing
and translates it into something that pmu::start can program into
hardware.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h |  91 +++++++
 kernel/events/core.c       | 605 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 680 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b717902c99..4f968d6b96 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -151,6 +151,15 @@ struct hw_perf_event {
 	 */
 	struct task_struct		*target;
 
+	/*
+	 * PMU would store hardware filter configuration
+	 * here.
+	 */
+	void				*addr_filters;
+
+	/* Last sync'ed generation of filters */
+	unsigned long			addr_filters_gen;
+
 /*
  * hw_perf_event::state flags; used to track the PERF_EF_* state.
  */
@@ -240,6 +249,9 @@ struct pmu {
 	int				task_ctx_nr;
 	int				hrtimer_interval_ms;
 
+	/* number of address filters this pmu can do */
+	unsigned int			nr_addr_filters;
+
 	/*
 	 * Fully disable/enable this PMU, can be used to protect from the PMI
 	 * as well as for lazy/batch writing of the MSRs.
@@ -393,12 +405,64 @@ struct pmu {
 	void (*free_aux)		(void *aux); /* optional */
 
 	/*
+	 * Validate address range filters: make sure hw supports the
+	 * requested configuration and number of filters; return 0 if the
+	 * supplied filters are valid, -errno otherwise.
+	 */
+	int (*addr_filters_validate)	(struct list_head *filters);
+					/* optional */
+
+	/*
+	 * Synchronize address range filter configuration:
+	 * translate hw-agnostic filter into hardware configuration in
+	 * event::hw::addr_filters.
+	 */
+	void (*addr_filters_sync)	(struct perf_event *event);
+					/* optional */
+
+	/*
 	 * Filter events for PMU-specific reasons.
 	 */
 	int (*filter_match)		(struct perf_event *event); /* optional */
 };
 
 /**
+ * struct perf_addr_filter - address range filter definition
+ * @entry:	event's filter list linkage
+ * @inode:	object file's inode for file-based filters
+ * @offset:	filter range offset
+ * @size:	filter range size
+ * @range:	1: range, 0: address
+ * @filter:	1: filter/start, 0: stop
+ * @kernel:	1: kernel, 0: file-based
+ *
+ * This is a hardware-agnostic filter configuration as specified by the user.
+ */
+struct perf_addr_filter {
+	struct list_head	entry;
+	struct inode		*inode;
+	unsigned long		offset;
+	unsigned long		size;
+	unsigned int		range	: 1,
+				filter	: 1,
+				kernel	: 1;
+};
+
+/**
+ * struct perf_addr_filters_head - container for address range filters
+ * @list:	list of filters for this event
+ * @lock:	spinlock that serializes accesses to the @list and event's
+ *		(and its children's) filter generations.
+ *
+ * A child event will use parent's @list (and therefore @lock), so they are
+ * bundled together; see perf_event_addr_filters().
+ */
+struct perf_addr_filters_head {
+	struct list_head	list;
+	raw_spinlock_t		lock;
+};
+
+/**
  * enum perf_event_active_state - the states of a event
  */
 enum perf_event_active_state {
@@ -571,6 +635,12 @@ struct perf_event {
 
 	atomic_t			event_limit;
 
+	/* address range filters */
+	struct perf_addr_filters_head	addr_filters;
+	/* vma address array for file-based filters */
+	unsigned long			*addr_filters_offs;
+	unsigned long			addr_filters_gen;
+
 	void (*destroy)(struct perf_event *);
 	struct rcu_head			rcu_head;
 
@@ -685,6 +755,22 @@ struct perf_output_handle {
 	int				page;
 };
 
+void perf_event_addr_filters_sync(struct perf_event *event);
+
+/*
+ * An inherited event uses parent's filters
+ */
+static inline struct perf_addr_filters_head *
+perf_event_addr_filters(struct perf_event *event)
+{
+	struct perf_addr_filters_head *ifh = &event->addr_filters;
+
+	if (event->parent)
+		ifh = &event->parent->addr_filters;
+
+	return ifh;
+}
+
 #ifdef CONFIG_CGROUP_PERF
 
 /*
@@ -1063,6 +1149,11 @@ static inline bool is_write_backward(struct perf_event *event)
 	return !!event->attr.write_backward;
 }
 
+static inline bool has_addr_filter(struct perf_event *event)
+{
+	return event->pmu->nr_addr_filters;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern int perf_output_begin_forward(struct perf_output_handle *handle,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9260a8a073..765c100176 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -44,6 +44,8 @@
 #include <linux/compat.h>
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/namei.h>
+#include <linux/parser.h>
 
 #include "internal.h"
 
@@ -2364,11 +2366,17 @@ void perf_event_enable(struct perf_event *event)
 }
 EXPORT_SYMBOL_GPL(perf_event_enable);
 
+struct stop_event_data {
+	struct perf_event	*event;
+	unsigned int		restart;
+};
+
 static int __perf_event_stop(void *info)
 {
-	struct perf_event *event = info;
+	struct stop_event_data *sd = info;
+	struct perf_event *event = sd->event;
 
-	/* for AUX events, our job is done if the event is already inactive */
+	/* if it's already INACTIVE, do nothing */
 	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
 		return 0;
 
@@ -2383,10 +2391,72 @@ static int __perf_event_stop(void *info)
 		return -EAGAIN;
 
 	event->pmu->stop(event, PERF_EF_UPDATE);
+	if (sd->restart)
+		event->pmu->start(event, PERF_EF_START);
 
 	return 0;
 }
 
+static int perf_event_restart(struct perf_event *event)
+{
+	struct stop_event_data sd = {
+		.event		= event,
+		.restart	= 1,
+	};
+	int ret = 0;
+
+	do {
+		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
+			return 0;
+
+		/* matches smp_wmb() in event_sched_in() */
+		smp_rmb();
+
+		ret = cpu_function_call(READ_ONCE(event->oncpu),
+					__perf_event_stop, &sd);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+/*
+ * In order to contain the amount of raciness and trickiness in the address
+ * filter configuration management, it is a two-part process:
+ *
+ * (p1) when userspace mappings change as a result of (1) or (2) or (3) below,
+ *      we update the addresses of corresponding vmas in
+ *	event::addr_filters_offs array and bump the event::addr_filters_gen;
+ * (p2) when an event is scheduled in (pmu::add), it calls
+ *      perf_event_addr_filters_sync() which calls pmu::addr_filters_sync()
+ *      if the generation has changed since the previous call.
+ *
+ * If (p1) happens while the event is active, we restart it to force (p2).
+ *
+ * (1) perf_addr_filters_apply(): adjusting filters' offsets based on
+ *     pre-existing mappings, called once when new filters arrive via SET_FILTER
+ *     ioctl;
+ * (2) perf_addr_filters_adjust(): adjusting filters' offsets based on newly
+ *     registered mapping, called for every new mmap(), with mm::mmap_sem down
+ *     for reading;
+ * (3) perf_event_addr_filters_exec(): clearing filters' offsets in the process
+ *     of exec.
+ */
+void perf_event_addr_filters_sync(struct perf_event *event)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+
+	if (!has_addr_filter(event))
+		return;
+
+	raw_spin_lock(&ifh->lock);
+	if (event->addr_filters_gen != event->hw.addr_filters_gen) {
+		event->pmu->addr_filters_sync(event);
+		event->hw.addr_filters_gen = event->addr_filters_gen;
+	}
+	raw_spin_unlock(&ifh->lock);
+}
+EXPORT_SYMBOL_GPL(perf_event_addr_filters_sync);
+
 static int _perf_event_refresh(struct perf_event *event, int refresh)
 {
 	/*
@@ -3236,16 +3306,6 @@ out:
 		put_ctx(clone_ctx);
 }
 
-void perf_event_exec(void)
-{
-	int ctxn;
-
-	rcu_read_lock();
-	for_each_task_context_nr(ctxn)
-		perf_event_enable_on_exec(ctxn);
-	rcu_read_unlock();
-}
-
 struct perf_read_data {
 	struct perf_event *event;
 	bool group;
@@ -3779,6 +3839,9 @@ static bool exclusive_event_installable(struct perf_event *event,
 	return true;
 }
 
+static void perf_addr_filters_splice(struct perf_event *event,
+				       struct list_head *head);
+
 static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending);
@@ -3806,6 +3869,8 @@ static void _free_event(struct perf_event *event)
 	}
 
 	perf_event_free_bpf_prog(event);
+	perf_addr_filters_splice(event, NULL);
+	kfree(event->addr_filters_offs);
 
 	if (event->destroy)
 		event->destroy(event);
@@ -5884,6 +5949,57 @@ perf_event_aux(perf_event_aux_output_cb output, void *data,
 	rcu_read_unlock();
 }
 
+/*
+ * Clear all file-based filters at exec; they'll have to be
+ * re-instated when/if these objects are mmapped again.
+ */
+static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	struct perf_addr_filter *filter;
+	unsigned int restart = 0, count = 0;
+	unsigned long flags;
+
+	if (!has_addr_filter(event))
+		return;
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		if (!filter->kernel) {
+			event->addr_filters_offs[count] = 0;
+			restart++;
+		}
+
+		count++;
+	}
+
+	if (restart)
+		event->addr_filters_gen++;
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	if (restart)
+		perf_event_restart(event);
+}
+
+void perf_event_exec(void)
+{
+	struct perf_event_context *ctx;
+	int ctxn;
+
+	rcu_read_lock();
+	for_each_task_context_nr(ctxn) {
+		ctx = current->perf_event_ctxp[ctxn];
+		if (!ctx)
+			continue;
+
+		perf_event_enable_on_exec(ctxn);
+
+		perf_event_aux_ctx(ctx, perf_event_addr_filters_exec, NULL,
+				   true);
+	}
+	rcu_read_unlock();
+}
+
 struct remote_output {
 	struct ring_buffer	*rb;
 	int			err;
@@ -5894,6 +6010,9 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
 	struct perf_event *parent = event->parent;
 	struct remote_output *ro = data;
 	struct ring_buffer *rb = ro->rb;
+	struct stop_event_data sd = {
+		.event	= event,
+	};
 
 	if (!has_aux(event))
 		return;
@@ -5906,7 +6025,7 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
 	 * ring-buffer, but it will be the child that's actually using it:
 	 */
 	if (rcu_dereference(parent->rb) == rb)
-		ro->err = __perf_event_stop(event);
+		ro->err = __perf_event_stop(&sd);
 }
 
 static int __perf_pmu_output_stop(void *info)
@@ -6367,6 +6486,90 @@ got_name:
 	kfree(buf);
 }
 
+/*
+ * Whether this @filter depends on a dynamic object which is not yet
+ * loaded or whose load addresses are not known.
+ */
+static bool perf_addr_filter_needs_mmap(struct perf_addr_filter *filter)
+{
+	return filter->filter && !filter->kernel;
+}
+
+/*
+ * Check whether inode and address range match filter criteria.
+ */
+static bool perf_addr_filter_match(struct perf_addr_filter *filter,
+				     struct file *file, unsigned long offset,
+				     unsigned long size)
+{
+	if (filter->kernel)
+		return false;
+
+	if (filter->inode != file->f_inode)
+		return false;
+
+	if (filter->offset > offset + size)
+		return false;
+
+	if (filter->offset + filter->size < offset)
+		return false;
+
+	return true;
+}
+
+static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	struct vm_area_struct *vma = data;
+	unsigned long off = vma->vm_pgoff << PAGE_SHIFT, flags;
+	struct file *file = vma->vm_file;
+	struct perf_addr_filter *filter;
+	unsigned int restart = 0, count = 0;
+
+	if (!has_addr_filter(event))
+		return;
+
+	if (!file)
+		return;
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		if (perf_addr_filter_match(filter, file, off,
+					     vma->vm_end - vma->vm_start)) {
+			event->addr_filters_offs[count] = vma->vm_start;
+			restart++;
+		}
+
+		count++;
+	}
+
+	if (restart)
+		event->addr_filters_gen++;
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	if (restart)
+		perf_event_restart(event);
+}
+
+/*
+ * Adjust all of the task's events' filters to the new vma
+ */
+static void perf_addr_filters_adjust(struct vm_area_struct *vma)
+{
+	struct perf_event_context *ctx;
+	int ctxn;
+
+	rcu_read_lock();
+	for_each_task_context_nr(ctxn) {
+		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
+		if (!ctx)
+			continue;
+
+		perf_event_aux_ctx(ctx, __perf_addr_filters_adjust, vma, true);
+	}
+	rcu_read_unlock();
+}
+
 void perf_event_mmap(struct vm_area_struct *vma)
 {
 	struct perf_mmap_event mmap_event;
@@ -6398,6 +6601,7 @@ void perf_event_mmap(struct vm_area_struct *vma)
 		/* .flags (attr_mmap2 only) */
 	};
 
+	perf_addr_filters_adjust(vma);
 	perf_event_mmap_event(&mmap_event);
 }
 
@@ -7357,13 +7561,364 @@ void perf_bp_event(struct perf_event *bp, void *data)
 }
 #endif
 
+/*
+ * Allocate a new address filter
+ */
+static struct perf_addr_filter *
+perf_addr_filter_new(struct perf_event *event, struct list_head *filters)
+{
+	int node = cpu_to_node(event->cpu == -1 ? 0 : event->cpu);
+	struct perf_addr_filter *filter;
+
+	filter = kzalloc_node(sizeof(*filter), GFP_KERNEL, node);
+	if (!filter)
+		return NULL;
+
+	INIT_LIST_HEAD(&filter->entry);
+	list_add_tail(&filter->entry, filters);
+
+	return filter;
+}
+
+static void free_filters_list(struct list_head *filters)
+{
+	struct perf_addr_filter *filter, *iter;
+
+	list_for_each_entry_safe(filter, iter, filters, entry) {
+		if (filter->inode)
+			iput(filter->inode);
+		list_del(&filter->entry);
+		kfree(filter);
+	}
+}
+
+/*
+ * Free existing address filters and optionally install new ones
+ */
+static void perf_addr_filters_splice(struct perf_event *event,
+				     struct list_head *head)
+{
+	unsigned long flags;
+	LIST_HEAD(list);
+
+	if (!has_addr_filter(event))
+		return;
+
+	/* don't bother with children, they don't have their own filters */
+	if (event->parent)
+		return;
+
+	raw_spin_lock_irqsave(&event->addr_filters.lock, flags);
+
+	list_splice_init(&event->addr_filters.list, &list);
+	if (head)
+		list_splice(head, &event->addr_filters.list);
+
+	raw_spin_unlock_irqrestore(&event->addr_filters.lock, flags);
+
+	free_filters_list(&list);
+}
+
+/*
+ * Scan through mm's vmas and see if one of them matches the
+ * @filter; if so, adjust filter's address range.
+ * Called with mm::mmap_sem down for reading.
+ */
+static unsigned long perf_addr_filter_apply(struct perf_addr_filter *filter,
+					    struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		struct file *file = vma->vm_file;
+		unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
+		unsigned long vma_size = vma->vm_end - vma->vm_start;
+
+		if (!file)
+			continue;
+
+		if (!perf_addr_filter_match(filter, file, off, vma_size))
+			continue;
+
+		return vma->vm_start;
+	}
+
+	return 0;
+}
+
+/*
+ * Update event's address range filters based on the
+ * task's existing mappings, if any.
+ */
+static void perf_event_addr_filters_apply(struct perf_event *event)
+{
+	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	struct task_struct *task = READ_ONCE(event->ctx->task);
+	struct perf_addr_filter *filter;
+	struct mm_struct *mm = NULL;
+	unsigned int count = 0;
+	unsigned long flags;
+
+	/*
+	 * We may observe TASK_TOMBSTONE, which means that the event tear-down
+	 * will stop on the parent's child_mutex that our caller is also holding
+	 */
+	if (task == TASK_TOMBSTONE)
+		return;
+
+	mm = get_task_mm(event->ctx->task);
+	if (!mm)
+		goto restart;
+
+	down_read(&mm->mmap_sem);
+
+	raw_spin_lock_irqsave(&ifh->lock, flags);
+	list_for_each_entry(filter, &ifh->list, entry) {
+		event->addr_filters_offs[count] = 0;
+
+		if (perf_addr_filter_needs_mmap(filter))
+			event->addr_filters_offs[count] =
+				perf_addr_filter_apply(filter, mm);
+
+		count++;
+	}
+
+	event->addr_filters_gen++;
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
+
+	up_read(&mm->mmap_sem);
+
+	mmput(mm);
+
+restart:
+	perf_event_restart(event);
+}
+
+/*
+ * Address range filtering: limiting the data to certain
+ * instruction address ranges. Filters are ioctl()ed to us from
+ * userspace as ascii strings.
+ *
+ * Filter string format:
+ *
+ * ACTION SOURCE:RANGE_SPEC
+ * where ACTION is one of the
+ *  * "filter": limit the trace to this region
+ *  * "start": start tracing from this address
+ *  * "stop": stop tracing at this address/region;
+ * SOURCE is either "file" or "kernel"
+ * RANGE_SPEC is
+ *  * for "kernel": <start address>[/<size>]
+ *  * for "file":   <start address>[/<size>]@</path/to/object/file>
+ *
+ * if <size> is not specified, the range is treated as a single address.
+ */
+enum {
+	IF_ACT_FILTER,
+	IF_ACT_START,
+	IF_ACT_STOP,
+	IF_SRC_FILE,
+	IF_SRC_KERNEL,
+	IF_SRC_FILEADDR,
+	IF_SRC_KERNELADDR,
+};
+
+enum {
+	IF_STATE_ACTION = 0,
+	IF_STATE_SOURCE,
+	IF_STATE_END,
+};
+
+static const match_table_t if_tokens = {
+	{ IF_ACT_FILTER,	"filter" },
+	{ IF_ACT_START,		"start" },
+	{ IF_ACT_STOP,		"stop" },
+	{ IF_SRC_FILE,		"file:%u/%u@%s" },
+	{ IF_SRC_KERNEL,	"kernel:%u/%u" },
+	{ IF_SRC_FILEADDR,	"file:%u@%s" },
+	{ IF_SRC_KERNELADDR,	"kernel:%u" },
+};
+
+/*
+ * Address filter string parser
+ */
+static int
+perf_event_parse_addr_filter(struct perf_event *event, char *fstr,
+			     struct list_head *filters)
+{
+	struct perf_addr_filter *filter = NULL;
+	char *start, *orig, *filename = NULL;
+	struct path path;
+	substring_t args[MAX_OPT_ARGS];
+	int state = IF_STATE_ACTION, token;
+	int ret = -EINVAL;
+
+	orig = fstr = kstrdup(fstr, GFP_KERNEL);
+	if (!fstr)
+		return -ENOMEM;
+
+	while ((start = strsep(&fstr, " ,\n")) != NULL) {
+		ret = -EINVAL;
+
+		if (!*start)
+			continue;
+
+		/* filter definition begins */
+		if (state == IF_STATE_ACTION) {
+			filter = perf_addr_filter_new(event, filters);
+			if (!filter)
+				goto fail;
+		}
+
+		token = match_token(start, if_tokens, args);
+		switch (token) {
+		case IF_ACT_FILTER:
+		case IF_ACT_START:
+			filter->filter = 1;
+			/* fall through */
+
+		case IF_ACT_STOP:
+			if (state != IF_STATE_ACTION)
+				goto fail;
+
+			state = IF_STATE_SOURCE;
+			break;
+
+		case IF_SRC_KERNELADDR:
+		case IF_SRC_KERNEL:
+			filter->kernel = 1;
+			/* fall through */
+
+		case IF_SRC_FILEADDR:
+		case IF_SRC_FILE:
+			if (state != IF_STATE_SOURCE)
+				goto fail;
+
+			if (token == IF_SRC_FILE || token == IF_SRC_KERNEL)
+				filter->range = 1;
+
+			*args[0].to = 0;
+			ret = kstrtoul(args[0].from, 0, &filter->offset);
+			if (ret)
+				goto fail;
+
+			if (filter->range) {
+				*args[1].to = 0;
+				ret = kstrtoul(args[1].from, 0, &filter->size);
+				if (ret)
+					goto fail;
+			}
+
+			if (token == IF_SRC_FILE) {
+				filename = match_strdup(&args[2]);
+				if (!filename) {
+					ret = -ENOMEM;
+					goto fail;
+				}
+			}
+
+			state = IF_STATE_END;
+			break;
+
+		default:
+			goto fail;
+		}
+
+		/*
+		 * Filter definition is fully parsed, validate and install it.
+		 * Make sure that it doesn't contradict itself or the event's
+		 * attribute.
+		 */
+		if (state == IF_STATE_END) {
+			if (filter->kernel && event->attr.exclude_kernel)
+				goto fail;
+
+			if (!filter->kernel) {
+				if (!filename)
+					goto fail;
+
+				/* look up the path and grab its inode */
+				ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+				if (ret)
+					goto fail_free_name;
+
+				filter->inode = igrab(d_inode(path.dentry));
+				path_put(&path);
+				kfree(filename);
+				filename = NULL;
+			}
+
+			/* ready to consume more filters */
+			state = IF_STATE_ACTION;
+			filter = NULL;
+		}
+	}
+
+	if (state != IF_STATE_ACTION)
+		goto fail;
+
+	kfree(orig);
+
+	return 0;
+
+fail_free_name:
+	kfree(filename);
+fail:
+	free_filters_list(filters);
+	kfree(orig);
+
+	return ret;
+}
+
+static int
+perf_event_set_addr_filter(struct perf_event *event, char *filter_str)
+{
+	LIST_HEAD(filters);
+	int ret;
+
+	/*
+	 * Since this is called in perf_ioctl() path, we're already holding
+	 * ctx::mutex.
+	 */
+	lockdep_assert_held(&event->ctx->mutex);
+
+	if (WARN_ON_ONCE(event->parent))
+		return -EINVAL;
+
+	/*
+	 * For now, we only support filtering in per-task events; doing so
+	 * for cpu-wide events requires additional context switching trickery,
+	 * since the same object code will be mapped at different virtual
+	 * addresses in different processes.
+	 */
+	if (!event->ctx->task)
+		return -EOPNOTSUPP;
+
+	ret = perf_event_parse_addr_filter(event, filter_str, &filters);
+	if (ret)
+		return ret;
+
+	ret = event->pmu->addr_filters_validate(&filters);
+	if (ret) {
+		free_filters_list(&filters);
+		return ret;
+	}
+
+	/* remove existing filters, if any */
+	perf_addr_filters_splice(event, &filters);
+
+	/* install new filters */
+	perf_event_for_each_child(event, perf_event_addr_filters_apply);
+
+	return ret;
+}
+
 static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 {
 	char *filter_str;
 	int ret = -EINVAL;
 
-	if (event->attr.type != PERF_TYPE_TRACEPOINT ||
-	    !IS_ENABLED(CONFIG_EVENT_TRACING))
+	if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
+	    !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
+	    !has_addr_filter(event))
 		return -EINVAL;
 
 	filter_str = strndup_user(arg, PAGE_SIZE);
@@ -7374,6 +7929,8 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 	    event->attr.type == PERF_TYPE_TRACEPOINT)
 		ret = ftrace_profile_set_filter(event, event->attr.config,
 						filter_str);
+	else if (has_addr_filter(event))
+		ret = perf_event_set_addr_filter(event, filter_str);
 
 	kfree(filter_str);
 	return ret;
@@ -8196,6 +8753,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->sibling_list);
 	INIT_LIST_HEAD(&event->rb_entry);
 	INIT_LIST_HEAD(&event->active_entry);
+	INIT_LIST_HEAD(&event->addr_filters.list);
 	INIT_HLIST_NODE(&event->hlist_entry);
 	INIT_LIST_HEAD(&event->sb_list);
 
@@ -8203,6 +8761,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	init_irq_work(&event->pending, perf_pending_event);
 
 	mutex_init(&event->mmap_mutex);
+	raw_spin_lock_init(&event->addr_filters.lock);
 
 	atomic_long_set(&event->refcount, 1);
 	event->cpu		= cpu;
@@ -8287,11 +8846,22 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (err)
 		goto err_pmu;
 
+	if (has_addr_filter(event)) {
+		event->addr_filters_offs = kcalloc(pmu->nr_addr_filters,
+						   sizeof(unsigned long),
+						   GFP_KERNEL);
+		if (!event->addr_filters_offs)
+			goto err_per_task;
+
+		/* force hw sync on the address filters */
+		event->addr_filters_gen = 1;
+	}
+
 	if (!event->parent) {
 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
 			err = get_callchain_buffers();
 			if (err)
-				goto err_per_task;
+				goto err_addr_filters;
 		}
 	}
 
@@ -8300,6 +8870,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	return event;
 
+err_addr_filters:
+	kfree(event->addr_filters_offs);
+
 err_per_task:
 	exclusive_event_destroy(event);
 
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-22 16:19     ` Alexander Shishkin
@ 2016-04-25 12:57       ` Peter Zijlstra
  2016-04-25 15:41         ` Alexander Shishkin
  2016-04-25 13:10       ` Peter Zijlstra
                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-04-25 12:57 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Apr 22, 2016 at 07:19:11PM +0300, Alexander Shishkin wrote:
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index b717902c99..4f968d6b96 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -151,6 +151,15 @@ struct hw_perf_event {
>  	 */
>  	struct task_struct		*target;
>  
> +	/*
> +	 * PMU would store hardware filter configuration
> +	 * here.
> +	 */
> +	void				*addr_filters;
> +
> +	/* Last sync'ed generation of filters */
> +	unsigned long			addr_filters_gen;
> +

should these not go in the itrace struct?

/me continues reading

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-22 16:19     ` Alexander Shishkin
  2016-04-25 12:57       ` Peter Zijlstra
@ 2016-04-25 13:10       ` Peter Zijlstra
  2016-04-25 13:36         ` Peter Zijlstra
  2016-04-25 16:02         ` Alexander Shishkin
  2016-04-25 13:19       ` Peter Zijlstra
                         ` (2 subsequent siblings)
  4 siblings, 2 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-04-25 13:10 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Apr 22, 2016 at 07:19:11PM +0300, Alexander Shishkin wrote:
> +static int perf_event_restart(struct perf_event *event)
> +{
> +	struct stop_event_data sd = {
> +		.event		= event,
> +		.restart	= 1,
> +	};
> +	int ret = 0;
> +
> +	do {
> +		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
> +			return 0;
> +
> +		/* matches smp_wmb() in event_sched_in() */
> +		smp_rmb();
> +
> +		ret = cpu_function_call(READ_ONCE(event->oncpu),
> +					__perf_event_stop, &sd);
> +	} while (ret == -EAGAIN);
> +
> +	return ret;
> +}

A few concerns with this;

 - if I understand things right, this will lose a bit of trace when
   people tickle the ioctl(), right? This might want a note of sorts,

 - you need to handle event->oncpu == -1,

 - can we race with an actual stop, and accidentally re-enable the
   thing? If not, this might be a good place for a comment explaining
   how.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-22 16:19     ` Alexander Shishkin
  2016-04-25 12:57       ` Peter Zijlstra
  2016-04-25 13:10       ` Peter Zijlstra
@ 2016-04-25 13:19       ` Peter Zijlstra
  2016-04-25 16:03         ` Alexander Shishkin
  2016-04-25 14:14       ` Peter Zijlstra
  2016-04-25 14:25       ` Peter Zijlstra
  4 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-04-25 13:19 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Apr 22, 2016 at 07:19:11PM +0300, Alexander Shishkin wrote:
> @@ -393,12 +405,64 @@ struct pmu {
>  	void (*free_aux)		(void *aux); /* optional */
>  
>  	/*
> +	 * Validate address range filters: make sure hw supports the
> +	 * requested configuration and number of filters; return 0 if the
> +	 * supplied filters are valid, -errno otherwise.
> +	 */
> +	int (*addr_filters_validate)	(struct list_head *filters);
> +					/* optional */
> +
> +	/*
> +	 * Synchronize address range filter configuration:
> +	 * translate hw-agnostic filter into hardware configuration in
> +	 * event::hw::addr_filters.
> +	 */
> +	void (*addr_filters_sync)	(struct perf_event *event);
> +					/* optional */

So these two are not serialized the 'normal' way, right? Does that want
more comment?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-25 13:10       ` Peter Zijlstra
@ 2016-04-25 13:36         ` Peter Zijlstra
  2016-04-25 16:02         ` Alexander Shishkin
  1 sibling, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-04-25 13:36 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Mon, Apr 25, 2016 at 03:10:13PM +0200, Peter Zijlstra wrote:
>  - can we race with an actual stop, and accidentally re-enable the
>    thing? If not, this might be a good place for a comment explaining
>    how.

A quick look seems to suggest we might not care; still wants a comment.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-22 16:19     ` Alexander Shishkin
                         ` (2 preceding siblings ...)
  2016-04-25 13:19       ` Peter Zijlstra
@ 2016-04-25 14:14       ` Peter Zijlstra
  2016-04-25 16:07         ` Alexander Shishkin
  2016-04-25 14:25       ` Peter Zijlstra
  4 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-04-25 14:14 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Apr 22, 2016 at 07:19:11PM +0300, Alexander Shishkin wrote:
>  /**
> + * struct perf_addr_filter - address range filter definition
> + * @entry:	event's filter list linkage
> + * @inode:	object file's inode for file-based filters
> + * @offset:	filter range offset
> + * @size:	filter range size
> + * @range:	1: range, 0: address
> + * @filter:	1: filter/start, 0: stop
> + * @kernel:	1: kernel, 0: file-based
> + *
> + * This is a hardware-agnostic filter configuration as specified by the user.
> + */
> +struct perf_addr_filter {
> +	struct list_head	entry;
> +	struct inode		*inode;
> +	unsigned long		offset;
> +	unsigned long		size;
> +	unsigned int		range	: 1,
> +				filter	: 1,
> +				kernel	: 1;
> +};

FWIW, why not have !inode be kernel?


* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-22 16:19     ` Alexander Shishkin
                         ` (3 preceding siblings ...)
  2016-04-25 14:14       ` Peter Zijlstra
@ 2016-04-25 14:25       ` Peter Zijlstra
  2016-04-25 16:14         ` Alexander Shishkin
  2016-04-26 14:35         ` Alexander Shishkin
  4 siblings, 2 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-04-25 14:25 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Apr 22, 2016 at 07:19:11PM +0300, Alexander Shishkin wrote:
> +/*
> + * Address range filtering: limiting the data to certain
> + * instruction address ranges. Filters are ioctl()ed to us from
> + * userspace as ascii strings.
> + *
> + * Filter string format:
> + *
> + * ACTION SOURCE:RANGE_SPEC
> + * where ACTION is one of the
> + *  * "filter": limit the trace to this region
> + *  * "start": start tracing from this address
> + *  * "stop": stop tracing at this address/region;
> + * SOURCE is either "file" or "kernel"
> + * RANGE_SPEC is
> + *  * for "kernel": <start address>[/<size>]
> + *  * for "file":   <start address>[/<size>]@</path/to/object/file>

SOURCE seems entirely redundant

> + *
> + * if <size> is not specified, the range is treated as a single address.

single instruction, right?


* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-25 12:57       ` Peter Zijlstra
@ 2016-04-25 15:41         ` Alexander Shishkin
  0 siblings, 0 replies; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-25 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Apr 22, 2016 at 07:19:11PM +0300, Alexander Shishkin wrote:
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index b717902c99..4f968d6b96 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -151,6 +151,15 @@ struct hw_perf_event {
>>  	 */
>>  	struct task_struct		*target;
>>  
>> +	/*
>> +	 * PMU would store hardware filter configuration
>> +	 * here.
>> +	 */
>> +	void				*addr_filters;
>> +
>> +	/* Last sync'ed generation of filters */
>> +	unsigned long			addr_filters_gen;
>> +
>
> should these not go in the itrace struct?

I wanted to decouple it from the whole 'itrace' thing, because I thought
it might be usable (or at least applicable) in other contexts as well,
like tracepoints, for example. I haven't given it much thought yet,
though.

Regards,
--
Alex


* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-25 13:10       ` Peter Zijlstra
  2016-04-25 13:36         ` Peter Zijlstra
@ 2016-04-25 16:02         ` Alexander Shishkin
  1 sibling, 0 replies; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-25 16:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Apr 22, 2016 at 07:19:11PM +0300, Alexander Shishkin wrote:
>> +static int perf_event_restart(struct perf_event *event)
>> +{
>> +	struct stop_event_data sd = {
>> +		.event		= event,
>> +		.restart	= 1,
>> +	};
>> +	int ret = 0;
>> +
>> +	do {
>> +		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
>> +			return 0;
>> +
>> +		/* matches smp_wmb() in event_sched_in() */
>> +		smp_rmb();
>> +
>> +		ret = cpu_function_call(READ_ONCE(event->oncpu),
>> +					__perf_event_stop, &sd);
>> +	} while (ret == -EAGAIN);
>> +
>> +	return ret;
>> +}
>
> A few concerns with this;

(awesome, and now I see where the comments are actually needed)

>  - if I understand things right, this will lose a bit of trace when
>    people tickle the ioctl(), right? This might want a note of sorts,

I don't actually see how: if they do the ioctl() on an enabled event,
it'll generate unfiltered data until it either gets scheduled out or
receives this cross call, which applies the new filters. The
__perf_event_stop() that restarts the event runs on the target cpu, so
it doesn't race with the workload.

>  - you need to handle event->oncpu == -1,

Ok, we're in practice ignoring this but it deserves a comment. The
cpu_function_call() here should return -ENXIO in this case and we'll
just fall through, which is fine, because that means that the event is
inactive and the filters will apply when it goes active again.

>  - can we race with an actual stop, and accidentally re-enable the
>    thing? If not, this might be a good place for a comment explaining
>    how.

We can (mmap in the workload vs munmap(AUX) in the trace consumer), but
the event's start will fail in perf_aux_output_begin() because
rb::aux_mmap_count is zero. The actual stop is only used in aux
unmapping, so we should be covered.

Thanks,
--
Alex


* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-25 13:19       ` Peter Zijlstra
@ 2016-04-25 16:03         ` Alexander Shishkin
  0 siblings, 0 replies; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-25 16:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Apr 22, 2016 at 07:19:11PM +0300, Alexander Shishkin wrote:
>> @@ -393,12 +405,64 @@ struct pmu {
>>  	void (*free_aux)		(void *aux); /* optional */
>>  
>>  	/*
>> +	 * Validate address range filters: make sure hw supports the
>> +	 * requested configuration and number of filters; return 0 if the
>> +	 * supplied filters are valid, -errno otherwise.
>> +	 */
>> +	int (*addr_filters_validate)	(struct list_head *filters);
>> +					/* optional */
>> +
>> +	/*
>> +	 * Synchronize address range filter configuration:
>> +	 * translate hw-agnostic filter into hardware configuration in
>> +	 * event::hw::addr_filters.
>> +	 */
>> +	void (*addr_filters_sync)	(struct perf_event *event);
>> +					/* optional */
>
> So these two are not serialized the 'normal' way right? Does that want
> more comment?

Yes, will see to it.

Thanks,
--
Alex


* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-25 14:14       ` Peter Zijlstra
@ 2016-04-25 16:07         ` Alexander Shishkin
  0 siblings, 0 replies; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-25 16:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Apr 22, 2016 at 07:19:11PM +0300, Alexander Shishkin wrote:
>>  /**
>> + * struct perf_addr_filter - address range filter definition
>> + * @entry:	event's filter list linkage
>> + * @inode:	object file's inode for file-based filters
>> + * @offset:	filter range offset
>> + * @size:	filter range size
>> + * @range:	1: range, 0: address
>> + * @filter:	1: filter/start, 0: stop
>> + * @kernel:	1: kernel, 0: file-based
>> + *
>> + * This is a hardware-agnostic filter configuration as specified by the user.
>> + */
>> +struct perf_addr_filter {
>> +	struct list_head	entry;
>> +	struct inode		*inode;
>> +	unsigned long		offset;
>> +	unsigned long		size;
>> +	unsigned int		range	: 1,
>> +				filter	: 1,
>> +				kernel	: 1;
>> +};
>
> FWIW, why not have !inode be kernel?

It actually can, you're right.

Regards,
--
Alex


* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-25 14:25       ` Peter Zijlstra
@ 2016-04-25 16:14         ` Alexander Shishkin
  2016-04-26 14:35         ` Alexander Shishkin
  1 sibling, 0 replies; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-25 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Apr 22, 2016 at 07:19:11PM +0300, Alexander Shishkin wrote:
>> +/*
>> + * Address range filtering: limiting the data to certain
>> + * instruction address ranges. Filters are ioctl()ed to us from
>> + * userspace as ascii strings.
>> + *
>> + * Filter string format:
>> + *
>> + * ACTION SOURCE:RANGE_SPEC
>> + * where ACTION is one of the
>> + *  * "filter": limit the trace to this region
>> + *  * "start": start tracing from this address
>> + *  * "stop": stop tracing at this address/region;
>> + * SOURCE is either "file" or "kernel"
>> + * RANGE_SPEC is
>> + *  * for "kernel": <start address>[/<size>]
>> + *  * for "file":   <start address>[/<size>]@</path/to/object/file>
>
> SOURCE seems entirely redundant

In the current state, yes, but if we wanted dedicated support for kernel
modules in filter definitions, it may come in useful.

>> + *
>> + * if <size> is not specified, the range is treated as a single address.
>
> single instruction, right?

Yes, hitting that instruction will trigger tracing to start. PT doesn't
do this at the moment, but Coresight ETMs/PTMs do.

Regards,
--
Alex


* Re: [PATCH v1 4/5] perf: Introduce address range filtering
  2016-04-25 14:25       ` Peter Zijlstra
  2016-04-25 16:14         ` Alexander Shishkin
@ 2016-04-26 14:35         ` Alexander Shishkin
  1 sibling, 0 replies; 27+ messages in thread
From: Alexander Shishkin @ 2016-04-26 14:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Apr 22, 2016 at 07:19:11PM +0300, Alexander Shishkin wrote:
>> +/*
>> + * Address range filtering: limiting the data to certain
>> + * instruction address ranges. Filters are ioctl()ed to us from
>> + * userspace as ascii strings.
>> + *
>> + * Filter string format:
>> + *
>> + * ACTION SOURCE:RANGE_SPEC
>> + * where ACTION is one of the
>> + *  * "filter": limit the trace to this region
>> + *  * "start": start tracing from this address
>> + *  * "stop": stop tracing at this address/region;
>> + * SOURCE is either "file" or "kernel"
>> + * RANGE_SPEC is
>> + *  * for "kernel": <start address>[/<size>]
>> + *  * for "file":   <start address>[/<size>]@</path/to/object/file>
>
> SOURCE seems entirely redundant

So module support as it is done in kprobes doesn't seem very useful for
us, because all it does is symbol name resolution and pinning the module
while the corresponding kprobe is on.

What I had in mind was more like specifying $module:$start/$end, which
would lie dormant till the module is loaded and also go away once the
module is unloaded, but I'm not sure if I can even get away with stuff
like that.

So, for the time being I would like to keep it like so:

...
 * ACTION RANGE_SPEC
 * where ACTION is one of the
 *  * "filter": limit the trace to this region
 *  * "start": start tracing from this address
 *  * "stop": stop tracing at this address/region;
 * RANGE_SPEC is
 *  * for kernel addresses: <start address>[/<size>]
 *  * for object files:     <start address>[/<size>]@</path/to/object/file>
...

iow, dropping the SOURCE like you said and moving on.

Regards,
--
Alex


end of thread, other threads:[~2016-04-26 14:37 UTC | newest]

Thread overview: 27+ messages
-- links below jump to the message on this page --
2016-04-21 15:16 [PATCH v1 0/5] perf: Introduce address range filtering Alexander Shishkin
2016-04-21 15:16 ` [PATCH v1 1/5] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
2016-04-21 15:17 ` [PATCH v1 2/5] perf/x86/intel/pt: IP filtering register/cpuid bits Alexander Shishkin
2016-04-21 17:48   ` Borislav Petkov
2016-04-21 18:55     ` Thomas Gleixner
2016-04-21 19:17       ` Peter Zijlstra
2016-04-21 20:39         ` Borislav Petkov
2016-04-21 20:37       ` Borislav Petkov
2016-04-22  7:58         ` Thomas Gleixner
2016-04-22  9:34           ` Borislav Petkov
2016-04-21 15:17 ` [PATCH v1 3/5] perf: Extend perf_event_aux_ctx() to optionally iterate through more events Alexander Shishkin
2016-04-21 15:17 ` [PATCH v1 4/5] perf: Introduce address range filtering Alexander Shishkin
2016-04-22  7:45   ` Peter Zijlstra
2016-04-22 16:19     ` Alexander Shishkin
2016-04-25 12:57       ` Peter Zijlstra
2016-04-25 15:41         ` Alexander Shishkin
2016-04-25 13:10       ` Peter Zijlstra
2016-04-25 13:36         ` Peter Zijlstra
2016-04-25 16:02         ` Alexander Shishkin
2016-04-25 13:19       ` Peter Zijlstra
2016-04-25 16:03         ` Alexander Shishkin
2016-04-25 14:14       ` Peter Zijlstra
2016-04-25 16:07         ` Alexander Shishkin
2016-04-25 14:25       ` Peter Zijlstra
2016-04-25 16:14         ` Alexander Shishkin
2016-04-26 14:35         ` Alexander Shishkin
2016-04-21 15:17 ` [PATCH v1 5/5] perf/x86/intel/pt: Add support for address range filtering in PT Alexander Shishkin
