linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v0 0/5] perf: Introduce instruction trace filtering
@ 2015-12-11 13:36 Alexander Shishkin
  2015-12-11 13:36 ` [PATCH v0 1/5] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
                   ` (5 more replies)
  0 siblings, 6 replies; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

Hi Peter,

Newer version of Intel PT supports address-based filtering, and this
patchset adds support for it to perf core and the PT pmu driver. It
works by configuring a number of address ranges in hardware and
telling it to use these ranges to filter its traces. Similar feature
also exists in ARM Coresight ETM/PTM and it is also taken into account
in this patchset.

Firstly, userspace configures filters via an ioctl(), filters are
formatted as an ascii string. Filters may refer to addresses in object
files for userspace code or kernel addresses. The latter might be
extended in the future to support kernel modules.

For userspace filters, we scan the task's vmas to see if any of them
match the defined filters (inode+offset) and if they do, calculate
memory offsets and program them into hardware. Note that since
different tasks will have different mappings for the same object
files, supporting cpu-wide events would require special tricks to
context-switch filters for userspace code.

Also, we monitor new mmap and exec events to update (or clear) filter
configuration.

This is based on my perf_mmap_close() patchset from yesterday [1], which
in turn is based on your perf/core queue.

[1] http://marc.info/?l=linux-kernel&m=144976438631073

Alexander Shishkin (5):
  perf: Move set_filter() from behind EVENT_TRACING
  perf: Extend perf_event_aux() to optionally iterate through more
    events
  perf: Introduce instruction trace filtering
  perf/x86/intel/pt: IP filtering register/cpuid bits
  perf/x86/intel/pt: Add support for instruction trace filtering in PT

 arch/x86/include/asm/msr-index.h          |  18 +
 arch/x86/kernel/cpu/intel_pt.h            |  32 +-
 arch/x86/kernel/cpu/perf_event_intel_pt.c | 134 +++++-
 include/linux/perf_event.h                |  40 ++
 kernel/events/core.c                      | 655 ++++++++++++++++++++++++++++--
 5 files changed, 835 insertions(+), 44 deletions(-)

-- 
2.6.2


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v0 1/5] perf: Move set_filter() from behind EVENT_TRACING
  2015-12-11 13:36 [PATCH v0 0/5] perf: Introduce instruction trace filtering Alexander Shishkin
@ 2015-12-11 13:36 ` Alexander Shishkin
  2015-12-11 13:36 ` [PATCH v0 2/5] perf: Extend perf_event_aux() to optionally iterate through more events Alexander Shishkin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

For instruction trace filtering, namely, for communicating filter
definitions from userspace, I'd like to re-use the SET_FILTER code
that the tracepoints are using currently.

To that end, this patch moves the relevant code from behind EVENT_TRACING
macro.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 45 ++++++++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 05a50b4fb9..0b28116dd7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7078,24 +7078,6 @@ static inline void perf_tp_register(void)
 	perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
 }
 
-static int perf_event_set_filter(struct perf_event *event, void __user *arg)
-{
-	char *filter_str;
-	int ret;
-
-	if (event->attr.type != PERF_TYPE_TRACEPOINT)
-		return -EINVAL;
-
-	filter_str = strndup_user(arg, PAGE_SIZE);
-	if (IS_ERR(filter_str))
-		return PTR_ERR(filter_str);
-
-	ret = ftrace_profile_set_filter(event, event->attr.config, filter_str);
-
-	kfree(filter_str);
-	return ret;
-}
-
 static void perf_event_free_filter(struct perf_event *event)
 {
 	ftrace_profile_free_filter(event);
@@ -7150,11 +7132,6 @@ static inline void perf_tp_register(void)
 {
 }
 
-static int perf_event_set_filter(struct perf_event *event, void __user *arg)
-{
-	return -ENOENT;
-}
-
 static void perf_event_free_filter(struct perf_event *event)
 {
 }
@@ -7182,6 +7159,28 @@ void perf_bp_event(struct perf_event *bp, void *data)
 }
 #endif
 
+static int perf_event_set_filter(struct perf_event *event, void __user *arg)
+{
+	char *filter_str;
+	int ret = -EINVAL;
+
+	if (event->attr.type != PERF_TYPE_TRACEPOINT ||
+	    !IS_ENABLED(CONFIG_EVENT_TRACING))
+		return -EINVAL;
+
+	filter_str = strndup_user(arg, PAGE_SIZE);
+	if (IS_ERR(filter_str))
+		return PTR_ERR(filter_str);
+
+	if (IS_ENABLED(CONFIG_EVENT_TRACING) &&
+	    event->attr.type == PERF_TYPE_TRACEPOINT)
+		ret = ftrace_profile_set_filter(event, event->attr.config,
+						filter_str);
+
+	kfree(filter_str);
+	return ret;
+}
+
 /*
  * hrtimer based swevent callback
  */
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v0 2/5] perf: Extend perf_event_aux() to optionally iterate through more events
  2015-12-11 13:36 [PATCH v0 0/5] perf: Introduce instruction trace filtering Alexander Shishkin
  2015-12-11 13:36 ` [PATCH v0 1/5] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
@ 2015-12-11 13:36 ` Alexander Shishkin
  2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

Trace filtering code needs an iterator that can go through all events,
including inactive and filtered, to be able to update their filters'
ranges based on mmap or exec events.

This patch changes perf_event_aux() to optionally do this.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 38 +++++++++++++++++++++-----------------
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0b28116dd7..2bab4af901 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5642,33 +5642,36 @@ typedef void (perf_event_aux_output_cb)(struct perf_event *event, void *data);
 static void
 perf_event_aux_ctx(struct perf_event_context *ctx,
 		   perf_event_aux_output_cb output,
-		   void *data)
+		   void *data, bool all)
 {
 	struct perf_event *event;
 
 	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
-		if (event->state < PERF_EVENT_STATE_INACTIVE)
-			continue;
-		if (!event_filter_match(event))
-			continue;
+		if (!all) {
+			if (event->state < PERF_EVENT_STATE_INACTIVE)
+				continue;
+			if (!event_filter_match(event))
+				continue;
+		}
+
 		output(event, data);
 	}
 }
 
 static void
 perf_event_aux_task_ctx(perf_event_aux_output_cb output, void *data,
-			struct perf_event_context *task_ctx)
+			struct perf_event_context *task_ctx, bool all)
 {
 	rcu_read_lock();
 	preempt_disable();
-	perf_event_aux_ctx(task_ctx, output, data);
+	perf_event_aux_ctx(task_ctx, output, data, all);
 	preempt_enable();
 	rcu_read_unlock();
 }
 
 static void
 perf_event_aux(perf_event_aux_output_cb output, void *data,
-	       struct perf_event_context *task_ctx)
+	       struct perf_event_context *task_ctx, bool all)
 {
 	struct perf_cpu_context *cpuctx;
 	struct perf_event_context *ctx;
@@ -5682,7 +5685,7 @@ perf_event_aux(perf_event_aux_output_cb output, void *data,
 	 * context.
 	 */
 	if (task_ctx) {
-		perf_event_aux_task_ctx(output, data, task_ctx);
+		perf_event_aux_task_ctx(output, data, task_ctx, all);
 		return;
 	}
 
@@ -5691,13 +5694,13 @@ perf_event_aux(perf_event_aux_output_cb output, void *data,
 		cpuctx = get_cpu_ptr(pmu->pmu_cpu_context);
 		if (cpuctx->unique_pmu != pmu)
 			goto next;
-		perf_event_aux_ctx(&cpuctx->ctx, output, data);
+		perf_event_aux_ctx(&cpuctx->ctx, output, data, all);
 		ctxn = pmu->task_ctx_nr;
 		if (ctxn < 0)
 			goto next;
 		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
 		if (ctx)
-			perf_event_aux_ctx(ctx, output, data);
+			perf_event_aux_ctx(ctx, output, data, all);
 next:
 		put_cpu_ptr(pmu->pmu_cpu_context);
 	}
@@ -5725,10 +5728,11 @@ static int __perf_pmu_output_stop(void *info)
 	struct perf_cpu_context *cpuctx = get_cpu_ptr(pmu->pmu_cpu_context);
 
 	rcu_read_lock();
-	perf_event_aux_ctx(&cpuctx->ctx, __perf_event_output_stop, event->rb);
+	perf_event_aux_ctx(&cpuctx->ctx, __perf_event_output_stop, event->rb,
+			   false);
 	if (cpuctx->task_ctx)
 		perf_event_aux_ctx(cpuctx->task_ctx, __perf_event_output_stop,
-				   event->rb);
+				   event->rb, false);
 	rcu_read_unlock();
 
 	return 0;
@@ -5840,7 +5844,7 @@ static void perf_event_task(struct task_struct *task,
 
 	perf_event_aux(perf_event_task_output,
 		       &task_event,
-		       task_ctx);
+		       task_ctx, false);
 }
 
 void perf_event_fork(struct task_struct *task)
@@ -5919,7 +5923,7 @@ static void perf_event_comm_event(struct perf_comm_event *comm_event)
 
 	perf_event_aux(perf_event_comm_output,
 		       comm_event,
-		       NULL);
+		       NULL, false);
 }
 
 void perf_event_comm(struct task_struct *task, bool exec)
@@ -6150,7 +6154,7 @@ got_name:
 
 	perf_event_aux(perf_event_mmap_output,
 		       mmap_event,
-		       NULL);
+		       NULL, false);
 
 	kfree(buf);
 }
@@ -6338,7 +6342,7 @@ static void perf_event_switch(struct task_struct *task,
 
 	perf_event_aux(perf_event_switch_output,
 		       &switch_event,
-		       NULL);
+		       NULL, false);
 }
 
 /*
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 13:36 [PATCH v0 0/5] perf: Introduce instruction trace filtering Alexander Shishkin
  2015-12-11 13:36 ` [PATCH v0 1/5] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
  2015-12-11 13:36 ` [PATCH v0 2/5] perf: Extend perf_event_aux() to optionally iterate through more events Alexander Shishkin
@ 2015-12-11 13:36 ` Alexander Shishkin
  2015-12-11 14:02   ` Peter Zijlstra
                     ` (8 more replies)
  2015-12-11 13:36 ` [PATCH v0 4/5] perf/x86/intel/pt: IP filtering register/cpuid bits Alexander Shishkin
                   ` (2 subsequent siblings)
  5 siblings, 9 replies; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

Many instruction trace pmus out there support address range-based
filtering, which would, for example, generate trace data only for a
given range of instruction addresses, which is useful for tracing
individual functions, modules or libraries.

This patch introduces the interface for userspace to specify these
filters and for the pmu drivers to apply these filters to hardware
configuration.

The user interface is an ascii string that is passed via an ioctl
and specifies (in a way similar to uprobe) address ranges within
certain object files or within kernel. There is no special treatment
for kernel modules yet, but it might be a worthy pursuit.

The pmu driver interface basically adds an extra callback to the
pmu driver structure, which validates the filter configuration proposed
by the user against what the hardware is actually capable of doing
and translates it into something that pmu::start can program into
hardware.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h |  40 ++++
 kernel/events/core.c       | 576 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 613 insertions(+), 3 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f9828a48f1..4ddbedc100 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -127,6 +127,11 @@ struct hw_perf_event {
 		};
 		struct { /* itrace */
 			int			itrace_started;
+			/*
+			 * PMU would store hardware filter configuration
+			 * here.
+			 */
+			void			*itrace_filters;
 		};
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 		struct { /* breakpoint */
@@ -388,12 +393,38 @@ struct pmu {
 	void (*free_aux)		(void *aux); /* optional */
 
 	/*
+	 * Validate instruction tracing filters: make sure hw supports the
+	 * requested configuration and number of filters.
+	 *
+	 * Configure instruction tracing filters: translate hw-agnostic filter
+	 * into hardware configuration in event::hw::itrace_filters
+	 */
+	int (*itrace_filter_setup)	(struct perf_event *event); /* optional */
+
+	/*
 	 * Filter events for PMU-specific reasons.
 	 */
 	int (*filter_match)		(struct perf_event *event); /* optional */
 };
 
 /**
+ * Instruction trace (ITRACE) filter
+ */
+struct perf_itrace_filter {
+	struct list_head	entry;
+	struct rcu_head		rcu_head;
+	struct inode		*inode;
+	struct task_struct	*task;
+	unsigned long		offset;
+	unsigned long		size;
+	unsigned long		start;
+	unsigned long		end;
+	unsigned int		range	: 1, /* 1: range, 0: addr */
+				filter	: 1, /* 1: filter/start, 0: stop */
+				kernel	: 1; /* 1: kernel, 0: object file*/
+};
+
+/**
  * enum perf_event_active_state - the states of a event
  */
 enum perf_event_active_state {
@@ -559,6 +590,10 @@ struct perf_event {
 
 	atomic_t			event_limit;
 
+	/* instruction trace filters */
+	struct list_head		itrace_filters;
+	struct mutex			itrace_filters_mutex;
+
 	void (*destroy)(struct perf_event *);
 	struct rcu_head			rcu_head;
 
@@ -1032,6 +1067,11 @@ static inline bool has_aux(struct perf_event *event)
 	return event->pmu->setup_aux;
 }
 
+static inline bool has_itrace_filter(struct perf_event *event)
+{
+	return event->pmu->itrace_filter_setup;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2bab4af901..28ce173a28 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -44,6 +44,8 @@
 #include <linux/compat.h>
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/namei.h>
+#include <linux/parser.h>
 
 #include "internal.h"
 
@@ -2335,6 +2337,59 @@ static int __perf_event_stop(void *info)
 	return 0;
 }
 
+static int __perf_event_itrace_filters_setup(void *info)
+{
+	struct perf_event *event = info;
+	int ret;
+
+	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
+		return -EAGAIN;
+
+	/* matches smp_wmb() in event_sched_in() */
+	smp_rmb();
+
+	/*
+	 * There is a window with interrupts enabled before we get here,
+	 * so we need to check again lest we try to stop another cpu's event.
+	 */
+	if (READ_ONCE(event->oncpu) != smp_processor_id())
+		return -EAGAIN;
+
+	event->pmu->stop(event, PERF_EF_UPDATE);
+	rcu_read_lock();
+	ret = event->pmu->itrace_filter_setup(event);
+	rcu_read_unlock();
+	event->pmu->start(event, PERF_EF_RELOAD);
+
+	return ret;
+}
+
+static int perf_event_itrace_filters_setup(struct perf_event *event)
+{
+	int ret;
+
+	/*
+	 * We can't use event_function_call() here, because that would
+	 * require ctx::mutex, but one of our callers is called with
+	 * mm::mmap_sem down, which would cause an inversion, see bullet
+	 * (2) in put_event().
+	 */
+	do {
+		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE) {
+			ret = event->pmu->itrace_filter_setup(event);
+			break;
+		}
+
+		/* matches smp_wmb() in event_sched_in() */
+		smp_rmb();
+
+		ret = cpu_function_call(READ_ONCE(event->oncpu),
+					__perf_event_itrace_filters_setup, event);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
 static int _perf_event_refresh(struct perf_event *event, int refresh)
 {
 	/*
@@ -3663,6 +3718,8 @@ static bool exclusive_event_installable(struct perf_event *event,
 	return true;
 }
 
+static void perf_itrace_filters_clear(struct perf_event *event);
+
 static void __free_event(struct perf_event *event)
 {
 	if (!event->parent) {
@@ -3671,6 +3728,7 @@ static void __free_event(struct perf_event *event)
 	}
 
 	perf_event_free_bpf_prog(event);
+	perf_itrace_filters_clear(event);
 
 	if (event->destroy)
 		event->destroy(event);
@@ -5907,6 +5965,37 @@ out:
 	comm_event->event_id.header.size = size;
 }
 
+/*
+ * Clear all dynamic object-based filters at exec, they'll have to be
+ * re-instated when/if these objects are mmapped again.
+ */
+static void perf_itrace_exec(struct perf_event *event, void *data)
+{
+	struct perf_itrace_filter *filter;
+	unsigned int restart = 0;
+
+	if (!has_itrace_filter(event))
+		return;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(filter, &event->itrace_filters, entry) {
+		if (filter->kernel)
+			continue;
+
+		filter->start = filter->end = 0;
+		restart++;
+	}
+
+	rcu_read_unlock();
+
+	/*
+	 * kernel filters, however, will still be valid
+	 */
+	if (restart)
+		perf_event_itrace_filters_setup(event);
+}
+
 static void perf_event_comm_event(struct perf_comm_event *comm_event)
 {
 	char comm[TASK_COMM_LEN];
@@ -5921,6 +6010,8 @@ static void perf_event_comm_event(struct perf_comm_event *comm_event)
 
 	comm_event->event_id.header.size = sizeof(comm_event->event_id) + size;
 
+	if (comm_event->event_id.header.misc == PERF_RECORD_MISC_COMM_EXEC)
+		perf_event_aux(perf_itrace_exec, comm_event, NULL, true);
 	perf_event_aux(perf_event_comm_output,
 		       comm_event,
 		       NULL, false);
@@ -6159,6 +6250,77 @@ got_name:
 	kfree(buf);
 }
 
+/*
+ * Whether this @filter depends on a dynamic object which is not loaded
+ * yet or its load addresses are not known.
+ */
+static bool perf_itrace_filter_needs_mmap(struct perf_itrace_filter *filter)
+{
+	return filter->filter && !filter->kernel && !filter->end;
+}
+
+/*
+ * Check whether inode and address range match filter criteria.
+ */
+static bool perf_itrace_filter_match(struct perf_itrace_filter *filter,
+				     struct file *file, unsigned long offset,
+				     unsigned long size)
+{
+
+	if (filter->inode != file->f_inode)
+		return false;
+
+	if (filter->offset > offset + size)
+		return false;
+
+	if (filter->offset + filter->size < offset)
+		return false;
+
+	return true;
+}
+
+/*
+ * Update event's itrace filters
+ */
+static void perf_itrace_filters_update(struct perf_event *event, void *data)
+{
+	struct perf_mmap_event *mmap_event = data;
+	unsigned long off = mmap_event->vma->vm_pgoff << PAGE_SHIFT;
+	struct file *file = mmap_event->vma->vm_file;
+	struct perf_itrace_filter *filter;
+	unsigned int restart = 0;
+
+	if (!has_itrace_filter(event))
+		return;
+
+	if (!file)
+		return;
+
+	/* we do not modify the list or sleep, no need for the mutex */
+	rcu_read_lock();
+	list_for_each_entry_rcu(filter, &event->itrace_filters, entry) {
+		if (filter->kernel)
+			continue;
+
+		if (filter->task->mm != mmap_event->vma->vm_mm)
+			continue;
+
+		if (!perf_itrace_filter_match(filter, file, off,
+					      mmap_event->event_id.len))
+			continue;
+
+		restart++;
+		filter->start = mmap_event->event_id.start + filter->offset;
+		filter->end = mmap_event->event_id.start + filter->offset +
+			filter->size;
+	}
+	rcu_read_unlock();
+
+	/* reprogram updated filters into hardware */
+	if (restart)
+		perf_event_itrace_filters_setup(event);
+}
+
 void perf_event_mmap(struct vm_area_struct *vma)
 {
 	struct perf_mmap_event mmap_event;
@@ -6190,6 +6352,7 @@ void perf_event_mmap(struct vm_area_struct *vma)
 		/* .flags (attr_mmap2 only) */
 	};
 
+	perf_event_aux(perf_itrace_filters_update, &mmap_event, NULL, true);
 	perf_event_mmap_event(&mmap_event);
 }
 
@@ -7163,13 +7326,405 @@ void perf_bp_event(struct perf_event *bp, void *data)
 }
 #endif
 
+/*
+ * Insert an itrace @filter into @event's list of filters.
+ * @filter is used as a template
+ */
+static int perf_itrace_filter_insert(struct perf_event *event,
+				     struct perf_itrace_filter *src,
+				     struct task_struct *task)
+{
+	int node = cpu_to_node(event->cpu == -1 ? 0 : event->cpu);
+	struct perf_itrace_filter *filter;
+
+	filter = kzalloc_node(sizeof(*filter), GFP_KERNEL, node);
+	if (!filter)
+		return -ENOMEM;
+
+	filter->inode  = src->inode;
+	filter->offset = src->offset;
+	filter->size   = src->size;
+	filter->range  = src->range;
+	filter->filter = src->filter;
+	filter->kernel = src->kernel;
+	/*
+	 * We copy the actual address range too: it will remain the same for
+	 * kernel addresses and user addresses after fork(); in case of exec
+	 * or mmap, it will get cleared or modified by
+	 * perf_itrace_filters_clear()/perf_itrace_filters_update().
+	 */
+	filter->start = src->start;
+	filter->end   = src->end;
+
+	/*
+	 * We're already holding a reference to this task_struct since
+	 * alloc_perf_context() till last put_ctx() in __free_event().
+	 */
+	filter->task = task;
+
+	/*
+	 * If we're called through perf_itrace_filters_clone(), we're already
+	 * holding parent's filter mutex.
+	 */
+	mutex_lock_nested(&event->itrace_filters_mutex, SINGLE_DEPTH_NESTING);
+	list_add_tail_rcu(&filter->entry, &event->itrace_filters);
+	mutex_unlock(&event->itrace_filters_mutex);
+
+	return 0;
+}
+
+static void perf_itrace_filter_free_rcu(struct rcu_head *rcu_head)
+{
+	struct perf_itrace_filter *filter =
+		container_of(rcu_head, struct perf_itrace_filter, rcu_head);
+
+	if (filter->inode)
+		iput(filter->inode);
+	kfree(filter);
+}
+
+/*
+ * we can do this via task_function_call(), as well as setting filters
+ * and maybe event updating them!
+ */
+static void perf_itrace_filters_clear(struct perf_event *event)
+{
+	struct perf_itrace_filter *filter, *iter;
+
+	if (!has_itrace_filter(event))
+		return;
+
+	mutex_lock(&event->itrace_filters_mutex);
+	list_for_each_entry_safe(filter, iter, &event->itrace_filters, entry) {
+		list_del_rcu(&filter->entry);
+		call_rcu(&filter->rcu_head, perf_itrace_filter_free_rcu);
+	}
+
+	perf_event_itrace_filters_setup(event);
+	mutex_unlock(&event->itrace_filters_mutex);
+}
+
+/*
+ * Scan through mm's vmas and see if one of them matches the
+ * @filter; if so, adjust filter's address range.
+ * Called with mm::mmap_sem down for reading.
+ */
+static int perf_itrace_filter_apply(struct perf_event *event,
+				    struct perf_itrace_filter *filter,
+				    struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	for (vma = mm->mmap; vma->vm_next; vma = vma->vm_next) {
+		struct file *file = vma->vm_file;
+		unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
+		unsigned long vma_size = vma->vm_end - vma->vm_start;
+
+		if (!file)
+			continue;
+
+		if (!perf_itrace_filter_match(filter, file, off,
+					      vma_size))
+			continue;
+
+		filter->start = vma->vm_start + filter->offset;
+		filter->end = vma->vm_start + filter->offset +
+			filter->size;
+
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
+ * Adjust event's itrace filters' address ranges based on the
+ * task's existing mappings
+ */
+static void perf_itrace_filters_apply(struct perf_event *event)
+{
+	struct perf_itrace_filter *filter;
+	struct mm_struct *mm = NULL;
+	unsigned int restart = 0;
+
+	lockdep_assert_held(&event->ctx->mutex);
+
+	mm = get_task_mm(event->ctx->task);
+	if (!mm)
+		return;
+
+	down_read(&mm->mmap_sem);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(filter, &event->itrace_filters, entry) {
+		if (!perf_itrace_filter_needs_mmap(filter))
+			continue;
+
+		restart += perf_itrace_filter_apply(event, filter, mm);
+	}
+	rcu_read_unlock();
+
+	up_read(&mm->mmap_sem);
+
+	if (restart)
+		perf_event_itrace_filters_setup(event);
+
+	mmput(mm);
+}
+
+/*
+ * Instruction trace filtering: limiting the trace to certain
+ * instruction address ranges. Filters are ioctl()ed to us from
+ * userspace as ascii strings.
+ *
+ * Filter string format:
+ *
+ * ACTION SOURCE:RANGE_SPEC
+ * where ACTION is one of the
+ *  * "filter": limit the trace to this region
+ *  * "start": start tracing from this address
+ *  * "stop": stop tracing at this address/region;
+ * SOURCE is either "file" or "kernel"
+ * RANGE_SPEC is
+ *  * for "kernel": <start address>[/<size>]
+ *  * for "file":   <start address>[/<size>]@</path/to/object/file>
+ *
+ * if <size> is not specified, the range is treated as a single address.
+ */
+enum {
+	IF_ACT_FILTER,
+	IF_ACT_START,
+	IF_ACT_STOP,
+	IF_SRC_FILE,
+	IF_SRC_KERNEL,
+	IF_SRC_FILEADDR,
+	IF_SRC_KERNELADDR,
+};
+
+enum {
+	IF_STATE_ACTION = 0,
+	IF_STATE_SOURCE,
+	IF_STATE_END,
+};
+
+static const match_table_t if_tokens = {
+	{ IF_ACT_FILTER,	"filter" },
+	{ IF_ACT_START,		"start" },
+	{ IF_ACT_STOP,		"stop" },
+	{ IF_SRC_FILE,		"file:%u/%u@%s" },
+	{ IF_SRC_KERNEL,	"kernel:%u/%u" },
+	{ IF_SRC_FILEADDR,	"file:%u@%s" },
+	{ IF_SRC_KERNELADDR,	"kernel:%u" },
+};
+
+/*
+ * Itrace filter string parser
+ */
+static int
+perf_event_parse_itrace_filter(struct perf_event *event, char *fstr)
+{
+	struct perf_itrace_filter filter;
+	char *start, *orig, *filename = NULL;
+	struct path path;
+	substring_t args[MAX_OPT_ARGS];
+	int state = IF_STATE_ACTION, token;
+	int ret = -EINVAL;
+
+	orig = fstr = kstrdup(fstr, GFP_KERNEL);
+	if (!fstr)
+		return -ENOMEM;
+
+	while ((start = strsep(&fstr, " ,\n")) != NULL) {
+		ret = -EINVAL;
+
+		if (!*start)
+			continue;
+
+		/* filter definition begins */
+		if (state == IF_STATE_ACTION)
+			memset(&filter, 0, sizeof(filter));
+
+		token = match_token(start, if_tokens, args);
+		switch (token) {
+		case IF_ACT_FILTER:
+		case IF_ACT_START:
+			filter.filter = 1;
+
+		case IF_ACT_STOP:
+			if (state != IF_STATE_ACTION)
+				goto fail;
+
+			state = IF_STATE_SOURCE;
+			break;
+
+		case IF_SRC_KERNELADDR:
+		case IF_SRC_KERNEL:
+			filter.kernel = 1;
+
+		case IF_SRC_FILEADDR:
+		case IF_SRC_FILE:
+			if (state != IF_STATE_SOURCE)
+				goto fail;
+
+			if (token == IF_SRC_FILE || token == IF_SRC_KERNEL)
+				filter.range = 1;
+
+			*args[0].to = 0;
+			ret = kstrtoul(args[0].from, 0, &filter.offset);
+			if (ret)
+				goto fail;
+
+			if (filter.range) {
+				*args[1].to = 0;
+				ret = kstrtoul(args[1].from, 0, &filter.size);
+				if (ret)
+					goto fail;
+			}
+
+			if (token == IF_SRC_FILE) {
+				filename = match_strdup(&args[2]);
+				if (!filename) {
+					ret = -ENOMEM;
+					goto fail;
+				}
+			}
+
+			state = IF_STATE_END;
+			break;
+
+		default:
+			goto fail;
+		}
+
+		/*
+		 * Filter definition is fully parsed, validate and install it.
+		 * Make sure that it doesn't contradict itself or the event's
+		 * attribute.
+		 */
+		if (state == IF_STATE_END) {
+			if (filter.kernel && event->attr.exclude_kernel)
+				goto fail;
+
+			if (!filter.kernel) {
+				if (!filename)
+					goto fail;
+
+				/* look up the path and grab its inode */
+				ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+				if (ret)
+					goto fail_free_name;
+
+				filter.inode = igrab(d_inode(path.dentry));
+				path_put(&path);
+				kfree(filename);
+				filename = NULL;
+			}
+
+			ret = perf_itrace_filter_insert(event, &filter,
+							event->ctx->task);
+			if (ret)
+				goto fail;
+
+			/* ready to consume more filters */
+			state = IF_STATE_ACTION;
+		}
+	}
+
+	if (state != IF_STATE_ACTION)
+		goto fail;
+
+	kfree(orig);
+
+	return 0;
+
+fail_free_name:
+	kfree(filename);
+fail:
+	perf_itrace_filters_clear(event);
+	kfree(orig);
+
+	return ret;
+}
+
+/*
+ * Filters are cloned in inherit_event() path to make sure child tracing is
+ * consistent with parent.
+ */
+static int
+perf_itrace_filters_clone(struct perf_event *to, struct perf_event *from,
+			  struct task_struct *task)
+{
+	struct perf_itrace_filter *filter;
+	int ret = -ENOMEM;
+
+	mutex_lock(&from->itrace_filters_mutex);
+	list_for_each_entry_rcu(filter, &from->itrace_filters, entry) {
+		/* parent's filter must hold a reference to this inode */
+		if (WARN_ON_ONCE(!igrab(filter->inode)))
+			goto fail;
+
+		ret = perf_itrace_filter_insert(to, filter, task);
+		if (ret) {
+			iput(filter->inode);
+			goto fail;
+		}
+	}
+
+	ret = 0;
+fail:
+	mutex_unlock(&from->itrace_filters_mutex);
+
+	if (!ret)
+		ret = perf_event_itrace_filters_setup(to);
+	else
+		perf_itrace_filters_clear(to);
+
+	return ret;
+}
+
+static int
+perf_event_set_itrace_filter(struct perf_event *event, char *filter_str)
+{
+	int ret = 0;
+
+	/*
+	 * Since this is called in perf_ioctl() path, we're already holding
+	 * ctx::mutex.
+	 */
+	lockdep_assert_held(&event->ctx->mutex);
+
+	/*
+	 * For now, we only support filtering in per-task events; doing so
+	 * for cpu-wide events requires additional context switching trickery,
+	 * since same object code will be mapped at different virtual
+	 * addresses in different processes.
+	 */
+	if (!event->ctx->task)
+		return -EOPNOTSUPP;
+
+	/* remove existing filters, if any */
+	perf_itrace_filters_clear(event);
+
+	ret = perf_event_parse_itrace_filter(event, filter_str);
+	if (!ret) {
+		perf_itrace_filters_apply(event);
+
+		ret = perf_event_itrace_filters_setup(event);
+		if (ret)
+			perf_itrace_filters_clear(event);
+	}
+
+	return ret;
+}
+
 static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 {
 	char *filter_str;
 	int ret = -EINVAL;
 
-	if (event->attr.type != PERF_TYPE_TRACEPOINT ||
-	    !IS_ENABLED(CONFIG_EVENT_TRACING))
+	if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
+	    !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
+	    !has_itrace_filter(event))
 		return -EINVAL;
 
 	filter_str = strndup_user(arg, PAGE_SIZE);
@@ -7180,6 +7735,8 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 	    event->attr.type == PERF_TYPE_TRACEPOINT)
 		ret = ftrace_profile_set_filter(event, event->attr.config,
 						filter_str);
+	else if (has_itrace_filter(event))
+		ret = perf_event_set_itrace_filter(event, filter_str);
 
 	kfree(filter_str);
 	return ret;
@@ -7921,13 +8478,14 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->sibling_list);
 	INIT_LIST_HEAD(&event->rb_entry);
 	INIT_LIST_HEAD(&event->active_entry);
+	INIT_LIST_HEAD(&event->itrace_filters);
 	INIT_HLIST_NODE(&event->hlist_entry);
 
-
 	init_waitqueue_head(&event->waitq);
 	init_irq_work(&event->pending, perf_pending_event);
 
 	mutex_init(&event->mmap_mutex);
+	mutex_init(&event->itrace_filters_mutex);
 
 	atomic_long_set(&event->refcount, 1);
 	event->cpu		= cpu;
@@ -9063,6 +9621,18 @@ inherit_event(struct perf_event *parent_event,
 	get_ctx(child_ctx);
 
 	/*
+	 * Clone itrace filters from the parent, if any
+	 */
+	if (has_itrace_filter(child_event)) {
+		if (perf_itrace_filters_clone(child_event, parent_event,
+					      child)) {
+			put_ctx(child_ctx);
+			free_event(child_event);
+			return NULL;
+		}
+	}
+
+	/*
 	 * Make the child state follow the state of the parent event,
 	 * not its attr.disabled bit.  We hold the parent's mutex,
 	 * so we won't race with perf_event_{en, dis}able_family.
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v0 4/5] perf/x86/intel/pt: IP filtering register/cpuid bits
  2015-12-11 13:36 [PATCH v0 0/5] perf: Introduce instruction trace filtering Alexander Shishkin
                   ` (2 preceding siblings ...)
  2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
@ 2015-12-11 13:36 ` Alexander Shishkin
  2015-12-11 13:36 ` [PATCH v0 5/5] perf/x86/intel/pt: Add support for instruction trace filtering in PT Alexander Shishkin
  2015-12-11 21:38 ` [PATCH v0 0/5] perf: Introduce instruction trace filtering Mathieu Poirier
  5 siblings, 0 replies; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

New versions of Intel PT support address range-based filtering. These
are the registers, bit definitions and relevant CPUID bits.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/include/asm/msr-index.h          | 18 ++++++++++++++++++
 arch/x86/kernel/cpu/intel_pt.h            |  2 ++
 arch/x86/kernel/cpu/perf_event_intel_pt.c |  2 ++
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 690b4027e1..fbbe21bcca 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -100,11 +100,29 @@
 #define RTIT_CTL_CYC_THRESH		(0x0full << RTIT_CTL_CYC_THRESH_OFFSET)
 #define RTIT_CTL_PSB_FREQ_OFFSET	24
 #define RTIT_CTL_PSB_FREQ      		(0x0full << RTIT_CTL_PSB_FREQ_OFFSET)
+#define RTIT_CTL_ADDR0_OFFSET		32
+#define RTIT_CTL_ADDR0      		(0x0full << RTIT_CTL_ADDR0_OFFSET)
+#define RTIT_CTL_ADDR1_OFFSET		36
+#define RTIT_CTL_ADDR1      		(0x0full << RTIT_CTL_ADDR1_OFFSET)
+#define RTIT_CTL_ADDR2_OFFSET		40
+#define RTIT_CTL_ADDR2      		(0x0full << RTIT_CTL_ADDR2_OFFSET)
+#define RTIT_CTL_ADDR3_OFFSET		44
+#define RTIT_CTL_ADDR3      		(0x0full << RTIT_CTL_ADDR3_OFFSET)
 #define MSR_IA32_RTIT_STATUS		0x00000571
+#define RTIT_STATUS_FILTEREN		BIT(0)
 #define RTIT_STATUS_CONTEXTEN		BIT(1)
 #define RTIT_STATUS_TRIGGEREN		BIT(2)
+#define RTIT_STATUS_BUFFOVF		BIT(3)
 #define RTIT_STATUS_ERROR		BIT(4)
 #define RTIT_STATUS_STOPPED		BIT(5)
+#define MSR_IA32_RTIT_ADDR0_A		0x00000580
+#define MSR_IA32_RTIT_ADDR0_B		0x00000581
+#define MSR_IA32_RTIT_ADDR1_A		0x00000582
+#define MSR_IA32_RTIT_ADDR1_B		0x00000583
+#define MSR_IA32_RTIT_ADDR2_A		0x00000584
+#define MSR_IA32_RTIT_ADDR2_B		0x00000585
+#define MSR_IA32_RTIT_ADDR3_A		0x00000586
+#define MSR_IA32_RTIT_ADDR3_B		0x00000587
 #define MSR_IA32_RTIT_CR3_MATCH		0x00000572
 #define MSR_IA32_RTIT_OUTPUT_BASE	0x00000560
 #define MSR_IA32_RTIT_OUTPUT_MASK	0x00000561
diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
index 336878a5d2..6ce8cd20b9 100644
--- a/arch/x86/kernel/cpu/intel_pt.h
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -52,11 +52,13 @@ enum pt_capabilities {
 	PT_CAP_max_subleaf = 0,
 	PT_CAP_cr3_filtering,
 	PT_CAP_psb_cyc,
+	PT_CAP_ip_filtering,
 	PT_CAP_mtc,
 	PT_CAP_topa_output,
 	PT_CAP_topa_multiple_entries,
 	PT_CAP_single_range_output,
 	PT_CAP_payloads_lip,
+	PT_CAP_num_address_ranges,
 	PT_CAP_mtc_periods,
 	PT_CAP_cycle_thresholds,
 	PT_CAP_psb_periods,
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index daf5f6caf8..2ec25581de 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -67,11 +67,13 @@ static struct pt_cap_desc {
 	PT_CAP(max_subleaf,		0, CR_EAX, 0xffffffff),
 	PT_CAP(cr3_filtering,		0, CR_EBX, BIT(0)),
 	PT_CAP(psb_cyc,			0, CR_EBX, BIT(1)),
+	PT_CAP(ip_filtering,		0, CR_EBX, BIT(2)),
 	PT_CAP(mtc,			0, CR_EBX, BIT(3)),
 	PT_CAP(topa_output,		0, CR_ECX, BIT(0)),
 	PT_CAP(topa_multiple_entries,	0, CR_ECX, BIT(1)),
 	PT_CAP(single_range_output,	0, CR_ECX, BIT(2)),
 	PT_CAP(payloads_lip,		0, CR_ECX, BIT(31)),
+	PT_CAP(num_address_ranges,	1, CR_EAX, 0x3),
 	PT_CAP(mtc_periods,		1, CR_EAX, 0xffff0000),
 	PT_CAP(cycle_thresholds,	1, CR_EBX, 0xffff),
 	PT_CAP(psb_periods,		1, CR_EBX, 0xffff0000),
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v0 5/5] perf/x86/intel/pt: Add support for instruction trace filtering in PT
  2015-12-11 13:36 [PATCH v0 0/5] perf: Introduce instruction trace filtering Alexander Shishkin
                   ` (3 preceding siblings ...)
  2015-12-11 13:36 ` [PATCH v0 4/5] perf/x86/intel/pt: IP filtering register/cpuid bits Alexander Shishkin
@ 2015-12-11 13:36 ` Alexander Shishkin
  2015-12-11 18:06   ` Mathieu Poirier
  2015-12-11 21:38 ` [PATCH v0 0/5] perf: Introduce instruction trace filtering Mathieu Poirier
  5 siblings, 1 reply; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, Alexander Shishkin

Newer versions of Intel PT support address ranges, which can be used to
define IP address range-based filters or TraceSTOP regions. Number of
ranges in enumerated via cpuid.

This patch implements pmu callbacks and related low-level code to allow
filter validation, configuration and programming into the hardware.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/intel_pt.h            |  30 ++++++-
 arch/x86/kernel/cpu/perf_event_intel_pt.c | 132 +++++++++++++++++++++++++++++-
 2 files changed, 159 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
index 6ce8cd20b9..1b36c8529d 100644
--- a/arch/x86/kernel/cpu/intel_pt.h
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -105,13 +105,39 @@ struct pt_buffer {
 	struct topa_entry	*topa_index[0];
 };
 
+#define PT_FILTERS_NUM	4
+
+/**
+ * struct pt_filter - IP range filter configuration
+ * @msr_a:	range start, goes to RTIT_ADDRn_A
+ * @msr_b:	range end, goes to RTIT_ADDRn_B
+ * @config:	4-bit field in RTIT_CTL
+ */
+struct pt_filter {
+	unsigned long	msr_a;
+	unsigned long	msr_b;
+	unsigned long	config;
+};
+
+/**
+ * struct pt_filters - IP range filtering context
+ * @filter:	filters defined for this context
+ * @nr_filters:	number of defined filters in the @filter array
+ */
+struct pt_filters {
+	struct pt_filter	filter[PT_FILTERS_NUM];
+	unsigned int		nr_filters;
+};
+
 /**
  * struct pt - per-cpu pt context
- * @handle:	perf output handle
- * @handle_nmi:	do handle PT PMI on this cpu, there's an active event
+ * @handle:		perf output handle
+ * @filters:		last configured filters
+ * @handle_nmi:		do handle PT PMI on this cpu, there's an active event
  */
 struct pt {
 	struct perf_output_handle handle;
+	struct pt_filters	filters;
 	int			handle_nmi;
 };
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index 2ec25581de..f531a2f0de 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -253,6 +253,73 @@ static bool pt_event_valid(struct perf_event *event)
  * These all are cpu affine and operate on a local PT
  */
 
+/* Address ranges and their corresponding msr configuration registers */
+static const struct pt_address_range {
+	unsigned long	msr_a;
+	unsigned long	msr_b;
+	unsigned int	reg_off;
+} pt_address_ranges[] = {
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR0_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR0_B,
+		.reg_off = RTIT_CTL_ADDR0_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR1_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR1_B,
+		.reg_off = RTIT_CTL_ADDR1_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR2_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR2_B,
+		.reg_off = RTIT_CTL_ADDR2_OFFSET,
+	},
+	{
+		.msr_a	 = MSR_IA32_RTIT_ADDR3_A,
+		.msr_b	 = MSR_IA32_RTIT_ADDR3_B,
+		.reg_off = RTIT_CTL_ADDR3_OFFSET,
+	}
+};
+
+static u64 pt_config_filters(struct perf_event *event)
+{
+	struct pt_filters *filters = event->hw.itrace_filters;
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	unsigned int range = 0;
+	u64 rtit_ctl = 0;
+
+	if (!filters)
+		return 0;
+
+	for (range = 0; range < filters->nr_filters; range++) {
+		struct pt_filter *filter = &filters->filter[range];
+
+		/*
+		 * Note, if the range has zero start/end addresses due
+		 * to its dynamic object not being loaded yet, we just
+		 * go ahead and program zeroed range, which will simply
+		 * produce no data. Note^2: if executable code at 0x0
+		 * is a concern, we can set up an "invalid" configuration
+		 * such as msr_b < msr_a.
+		 */
+
+		/* avoid redundant msr writes */
+		if (pt->filters.filter[range].msr_a != filter->msr_a) {
+			wrmsrl(pt_address_ranges[range].msr_a, filter->msr_a);
+			pt->filters.filter[range].msr_a = filter->msr_a;
+		}
+
+		if (pt->filters.filter[range].msr_b != filter->msr_b) {
+			wrmsrl(pt_address_ranges[range].msr_b, filter->msr_b);
+			pt->filters.filter[range].msr_b = filter->msr_b;
+		}
+
+		rtit_ctl |= filter->config << pt_address_ranges[range].reg_off;
+	}
+
+	return rtit_ctl;
+}
+
 static void pt_config(struct perf_event *event)
 {
 	u64 reg;
@@ -262,7 +329,8 @@ static void pt_config(struct perf_event *event)
 		wrmsrl(MSR_IA32_RTIT_STATUS, 0);
 	}
 
-	reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
+	reg = pt_config_filters(event);
+	reg |= RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
 
 	if (!event->attr.exclude_kernel)
 		reg |= RTIT_CTL_OS;
@@ -907,6 +975,60 @@ static void pt_buffer_free_aux(void *data)
 	kfree(buf);
 }
 
+static int pt_itrace_filters_init(struct perf_event *event)
+{
+	struct pt_filters *filters;
+	int node = event->cpu == -1 ? -1 : cpu_to_node(event->cpu);
+
+	if (!pt_cap_get(PT_CAP_num_address_ranges))
+		return 0;
+
+	filters = kzalloc_node(sizeof(struct pt_filters), GFP_KERNEL, node);
+	if (!filters)
+		return -ENOMEM;
+
+	event->hw.itrace_filters = filters;
+
+	return 0;
+}
+
+static void pt_itrace_filters_fini(struct perf_event *event)
+{
+	kfree(event->hw.itrace_filters);
+	event->hw.itrace_filters = NULL;
+}
+
+static int pt_event_itrace_filter_setup(struct perf_event *event)
+{
+	struct pt_filters *filters = event->hw.itrace_filters;
+	struct perf_itrace_filter *filter;
+	int range = 0;
+
+	if (!filters)
+		return -EOPNOTSUPP;
+
+	list_for_each_entry_rcu(filter, &event->itrace_filters, entry) {
+		/* PT doesn't support single address triggers */
+		if (!filter->range)
+			return -EOPNOTSUPP;
+
+		if (filter->kernel && !kernel_ip(filter->offset))
+			return -EINVAL;
+
+		filters->filter[range].msr_a  = filter->start;
+		filters->filter[range].msr_b  = filter->end;
+		filters->filter[range].config = filter->filter ? 1 : 2;
+
+		if (++range > pt_cap_get(PT_CAP_num_address_ranges))
+			return -EOPNOTSUPP;
+	}
+
+	if (range)
+		filters->nr_filters = range - 1;
+
+	return 0;
+}
+
 /**
  * intel_pt_interrupt() - PT PMI handler
  */
@@ -1075,6 +1197,7 @@ static void pt_event_read(struct perf_event *event)
 
 static void pt_event_destroy(struct perf_event *event)
 {
+	pt_itrace_filters_fini(event);
 	x86_del_exclusive(x86_lbr_exclusive_pt);
 }
 
@@ -1089,6 +1212,11 @@ static int pt_event_init(struct perf_event *event)
 	if (x86_add_exclusive(x86_lbr_exclusive_pt))
 		return -EBUSY;
 
+	if (pt_itrace_filters_init(event)) {
+		x86_del_exclusive(x86_lbr_exclusive_pt);
+		return -ENOMEM;
+	}
+
 	event->destroy = pt_event_destroy;
 
 	return 0;
@@ -1152,6 +1280,8 @@ static __init int pt_init(void)
 	pt_pmu.pmu.read		= pt_event_read;
 	pt_pmu.pmu.setup_aux	= pt_buffer_setup_aux;
 	pt_pmu.pmu.free_aux	= pt_buffer_free_aux;
+	pt_pmu.pmu.itrace_filter_setup =
+		pt_event_itrace_filter_setup;
 	ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
 
 	return ret;
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
@ 2015-12-11 14:02   ` Peter Zijlstra
  2015-12-11 14:20     ` Alexander Shishkin
  2015-12-11 14:23     ` Mark Rutland
  2015-12-11 14:56   ` Peter Zijlstra
                     ` (7 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 14:02 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, mark.rutland

On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:

> The pmu driver interface basically adds an extra callback to the
> pmu driver structure, which validates the filter configuration proposed
> by the user against what the hardware is actually capable of doing
> and translates it into something that pmu::start can program into
> hardware.

> @@ -388,12 +393,38 @@ struct pmu {
>  	void (*free_aux)		(void *aux); /* optional */
>  
>  	/*
> +	 * Validate instruction tracing filters: make sure hw supports the
> +	 * requested configuration and number of filters.
> +	 *
> +	 * Configure instruction tracing filters: translate hw-agnostic filter
> +	 * into hardware configuration in event::hw::itrace_filters
> +	 */
> +	int (*itrace_filter_setup)	(struct perf_event *event); /* optional */
> +
> +	/*
>  	 * Filter events for PMU-specific reasons.
>  	 */
>  	int (*filter_match)		(struct perf_event *event); /* optional */
>  };

Any reason you cannot use pmu::filter_match ?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 14:02   ` Peter Zijlstra
@ 2015-12-11 14:20     ` Alexander Shishkin
  2015-12-11 14:23     ` Mark Rutland
  1 sibling, 0 replies; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier, mark.rutland

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
>
>> The pmu driver interface basically adds an extra callback to the
>> pmu driver structure, which validates the filter configuration proposed
>> by the user against what the hardware is actually capable of doing
>> and translates it into something that pmu::start can program into
>> hardware.
>
>> @@ -388,12 +393,38 @@ struct pmu {
>>  	void (*free_aux)		(void *aux); /* optional */
>>  
>>  	/*
>> +	 * Validate instruction tracing filters: make sure hw supports the
>> +	 * requested configuration and number of filters.
>> +	 *
>> +	 * Configure instruction tracing filters: translate hw-agnostic filter
>> +	 * into hardware configuration in event::hw::itrace_filters
>> +	 */
>> +	int (*itrace_filter_setup)	(struct perf_event *event); /* optional */
>> +
>> +	/*
>>  	 * Filter events for PMU-specific reasons.
>>  	 */
>>  	int (*filter_match)		(struct perf_event *event); /* optional */
>>  };
>
> Any reason you cannot use pmu::filter_match ?

It's for filtering out events that we want to skip over while iterating
various event lists, it's also used in quite a few places (which are
also relevant to our pmus), so I couldn't see how to re-use it.

Thanks,
--
Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 14:02   ` Peter Zijlstra
  2015-12-11 14:20     ` Alexander Shishkin
@ 2015-12-11 14:23     ` Mark Rutland
  2015-12-11 14:52       ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Mark Rutland @ 2015-12-11 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 03:02:01PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
> 
> > The pmu driver interface basically adds an extra callback to the
> > pmu driver structure, which validates the filter configuration proposed
> > by the user against what the hardware is actually capable of doing
> > and translates it into something that pmu::start can program into
> > hardware.
> 
> > @@ -388,12 +393,38 @@ struct pmu {
> >  	void (*free_aux)		(void *aux); /* optional */
> >  
> >  	/*
> > +	 * Validate instruction tracing filters: make sure hw supports the
> > +	 * requested configuration and number of filters.
> > +	 *
> > +	 * Configure instruction tracing filters: translate hw-agnostic filter
> > +	 * into hardware configuration in event::hw::itrace_filters
> > +	 */
> > +	int (*itrace_filter_setup)	(struct perf_event *event); /* optional */
> > +
> > +	/*
> >  	 * Filter events for PMU-specific reasons.
> >  	 */
> >  	int (*filter_match)		(struct perf_event *event); /* optional */
> >  };
> 
> Any reason you cannot use pmu::filter_match ?

Maybe I've misunderstood your point, but the two seem quite different.

We introduced pmu::filter_match to apply a SW filter each time we
installed events from a context. We use that on ARM to avoid programming
big events into little cores and vice-versa.

As far as I can see, itrace_filter_setup is closer in operation to
event_init. It can fail at configuration time (long before scheduling
events to cores), and leaves the actual filtering to the HW.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 14:23     ` Mark Rutland
@ 2015-12-11 14:52       ` Peter Zijlstra
  2015-12-11 15:12         ` Alexander Shishkin
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 14:52 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 02:23:36PM +0000, Mark Rutland wrote:
> On Fri, Dec 11, 2015 at 03:02:01PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
> > 
> > > The pmu driver interface basically adds an extra callback to the
> > > pmu driver structure, which validates the filter configuration proposed
> > > by the user against what the hardware is actually capable of doing
> > > and translates it into something that pmu::start can program into
> > > hardware.
> > 
> > > @@ -388,12 +393,38 @@ struct pmu {
> > >  	void (*free_aux)		(void *aux); /* optional */
> > >  
> > >  	/*
> > > +	 * Validate instruction tracing filters: make sure hw supports the
> > > +	 * requested configuration and number of filters.
> > > +	 *
> > > +	 * Configure instruction tracing filters: translate hw-agnostic filter
> > > +	 * into hardware configuration in event::hw::itrace_filters
> > > +	 */
> > > +	int (*itrace_filter_setup)	(struct perf_event *event); /* optional */
> > > +
> > > +	/*
> > >  	 * Filter events for PMU-specific reasons.
> > >  	 */
> > >  	int (*filter_match)		(struct perf_event *event); /* optional */
> > >  };
> > 
> > Any reason you cannot use pmu::filter_match ?
> 
> Maybe I've misunderstood your point, but the two seem quite different.
> 
> We introduced pmu::filter_match to apply a SW filter each time we
> installed events from a context. We use that on ARM to avoid programming
> big events into little cores and vice-versa.
> 
> As far as I can see, itrace_filter_setup is closer in operation to
> event_init. It can fail at configuration time (long before scheduling
> events to cores), and leaves the actual filtering to the HW.

Ah indeed. I got confused by the similarities in the signature and both
having filter in the name.

Alexander, maybe then add a filter parameter to validate to the
signature, otherwise you have to change the event state in order to
validate it which leaves you with an invalid configured event in case it
doesn't validate.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
  2015-12-11 14:02   ` Peter Zijlstra
@ 2015-12-11 14:56   ` Peter Zijlstra
  2015-12-11 15:14     ` Alexander Shishkin
  2015-12-11 15:00   ` Peter Zijlstra
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 14:56 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
> @@ -5921,6 +6010,8 @@ static void perf_event_comm_event(struct perf_comm_event *comm_event)
>  
>  	comm_event->event_id.header.size = sizeof(comm_event->event_id) + size;
>  
> +	if (comm_event->event_id.header.misc == PERF_RECORD_MISC_COMM_EXEC)
> +		perf_event_aux(perf_itrace_exec, comm_event, NULL, true);

Maybe put this in perf_event_exec() ?

>  	perf_event_aux(perf_event_comm_output,
>  		       comm_event,
>  		       NULL, false);



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
  2015-12-11 14:02   ` Peter Zijlstra
  2015-12-11 14:56   ` Peter Zijlstra
@ 2015-12-11 15:00   ` Peter Zijlstra
  2015-12-11 15:17     ` Alexander Shishkin
  2015-12-11 16:06     ` Alexander Shishkin
  2015-12-11 15:02   ` Peter Zijlstra
                     ` (5 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 15:00 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
> +static int __perf_event_itrace_filters_setup(void *info)
> +{
> +	struct perf_event *event = info;
> +	int ret;
> +
> +	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
> +		return -EAGAIN;
> +
> +	/* matches smp_wmb() in event_sched_in() */
> +	smp_rmb();
> +
> +	/*
> +	 * There is a window with interrupts enabled before we get here,
> +	 * so we need to check again lest we try to stop another cpu's event.
> +	 */
> +	if (READ_ONCE(event->oncpu) != smp_processor_id())
> +		return -EAGAIN;
> +
> +	event->pmu->stop(event, PERF_EF_UPDATE);
> +	rcu_read_lock();
> +	ret = event->pmu->itrace_filter_setup(event);
> +	rcu_read_unlock();
> +	event->pmu->start(event, PERF_EF_RELOAD);

Would it not be more sensible to let the ::itrace_filter_setup() method
do the stop/start-ing if and when needed?

> +
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
                     ` (2 preceding siblings ...)
  2015-12-11 15:00   ` Peter Zijlstra
@ 2015-12-11 15:02   ` Peter Zijlstra
  2015-12-11 15:20     ` Peter Zijlstra
  2015-12-11 15:27     ` Alexander Shishkin
  2015-12-11 15:09   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 15:02 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
> +static int perf_event_itrace_filters_setup(struct perf_event *event)
> +{
> +	int ret;
> +
> +	/*
> +	 * We can't use event_function_call() here, because that would
> +	 * require ctx::mutex, but one of our callers is called with
> +	 * mm::mmap_sem down, which would cause an inversion, see bullet
> +	 * (2) in put_event().
> +	 */
> +	do {
> +		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE) {
> +			ret = event->pmu->itrace_filter_setup(event);
> +			break;

So this is tricky, if its not active it can be any moment, there is
nothing serializing against that.

> +		}
> +
> +		/* matches smp_wmb() in event_sched_in() */
> +		smp_rmb();
> +
> +		ret = cpu_function_call(READ_ONCE(event->oncpu),
> +					__perf_event_itrace_filters_setup, event);

This otoh, running with IRQs disabled on the CPU the thing is active on
guarantees it will not become inactive -- nothing can come in and switch
it off.

> +	} while (ret == -EAGAIN);
> +
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
                     ` (3 preceding siblings ...)
  2015-12-11 15:02   ` Peter Zijlstra
@ 2015-12-11 15:09   ` Peter Zijlstra
  2015-12-11 15:12   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 15:09 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:

> @@ -559,6 +590,10 @@ struct perf_event {
>  
>  	atomic_t			event_limit;
>  
> +	/* instruction trace filters */
> +	struct list_head		itrace_filters;
> +	struct mutex			itrace_filters_mutex;
> +
>  	void (*destroy)(struct perf_event *);
>  	struct rcu_head			rcu_head;
>  

> +static int __perf_event_itrace_filters_setup(void *info)
> +{
> +	struct perf_event *event = info;
> +	int ret;
> +
> +	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
> +		return -EAGAIN;
> +
> +	/* matches smp_wmb() in event_sched_in() */
> +	smp_rmb();
> +
> +	/*
> +	 * There is a window with interrupts enabled before we get here,
> +	 * so we need to check again lest we try to stop another cpu's event.
> +	 */
> +	if (READ_ONCE(event->oncpu) != smp_processor_id())
> +		return -EAGAIN;
> +
> +	event->pmu->stop(event, PERF_EF_UPDATE);
> +	rcu_read_lock();

So you're holding rcu_read_lock() here to ensure the filter list is
observable. However this is still very much racy, nothing stops another
filter being added while we're trying to validate/program the hardware.

The solution we've used for other such places in perf is to use both a
mutex and a spinlock to protect the list. You need to hold both to
modify a list, holding either ensures the list is stable.

That would allow you to hold the spinlock here, and call the pmu method
on a stable list.

> +	ret = event->pmu->itrace_filter_setup(event);
> +	rcu_read_unlock();
> +	event->pmu->start(event, PERF_EF_RELOAD);
> +
> +	return ret;
> +}

> +/*
> + * Insert an itrace @filter into @event's list of filters.
> + * @filter is used as a template
> + */
> +static int perf_itrace_filter_insert(struct perf_event *event,
> +				     struct perf_itrace_filter *src,
> +				     struct task_struct *task)
> +{

> +	/*
> +	 * If we're called through perf_itrace_filters_clone(), we're already
> +	 * holding parent's filter mutex.
> +	 */
> +	mutex_lock_nested(&event->itrace_filters_mutex, SINGLE_DEPTH_NESTING);
> +	list_add_tail_rcu(&filter->entry, &event->itrace_filters);
> +	mutex_unlock(&event->itrace_filters_mutex);
> +
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
                     ` (4 preceding siblings ...)
  2015-12-11 15:09   ` Peter Zijlstra
@ 2015-12-11 15:12   ` Peter Zijlstra
  2015-12-11 15:28   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 15:12 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
> +/*
> + * Scan through mm's vmas and see if one of them matches the
> + * @filter; if so, adjust filter's address range.
> + * Called with mm::mmap_sem down for reading.
> + */
> +static int perf_itrace_filter_apply(struct perf_event *event,
> +				    struct perf_itrace_filter *filter,
> +				    struct mm_struct *mm)
> +{
> +	struct vm_area_struct *vma;

	lockdep_assert_held(&mm->mmap_sem);

If you want to be pedantic we could add:

	lockdep_assert_held_shared()
	lockdep_assert_held_exclusive()

to assert we hold it in any particular state.

> +
> +	for (vma = mm->mmap; vma->vm_next; vma = vma->vm_next) {
> +		struct file *file = vma->vm_file;
> +		unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
> +		unsigned long vma_size = vma->vm_end - vma->vm_start;
> +
> +		if (!file)
> +			continue;
> +
> +		if (!perf_itrace_filter_match(filter, file, off,
> +					      vma_size))
> +			continue;
> +
> +		filter->start = vma->vm_start + filter->offset;
> +		filter->end = vma->vm_start + filter->offset +
> +			filter->size;
> +
> +		return 1;
> +	}
> +
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 14:52       ` Peter Zijlstra
@ 2015-12-11 15:12         ` Alexander Shishkin
  0 siblings, 0 replies; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 15:12 UTC (permalink / raw)
  To: Peter Zijlstra, Mark Rutland
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Dec 11, 2015 at 02:23:36PM +0000, Mark Rutland wrote:
>> On Fri, Dec 11, 2015 at 03:02:01PM +0100, Peter Zijlstra wrote:
>> > On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
>> > 
>> > > The pmu driver interface basically adds an extra callback to the
>> > > pmu driver structure, which validates the filter configuration proposed
>> > > by the user against what the hardware is actually capable of doing
>> > > and translates it into something that pmu::start can program into
>> > > hardware.
>> > 
>> > > @@ -388,12 +393,38 @@ struct pmu {
>> > >  	void (*free_aux)		(void *aux); /* optional */
>> > >  
>> > >  	/*
>> > > +	 * Validate instruction tracing filters: make sure hw supports the
>> > > +	 * requested configuration and number of filters.
>> > > +	 *
>> > > +	 * Configure instruction tracing filters: translate hw-agnostic filter
>> > > +	 * into hardware configuration in event::hw::itrace_filters
>> > > +	 */
>> > > +	int (*itrace_filter_setup)	(struct perf_event *event); /* optional */
>> > > +
>> > > +	/*
>> > >  	 * Filter events for PMU-specific reasons.
>> > >  	 */
>> > >  	int (*filter_match)		(struct perf_event *event); /* optional */
>> > >  };
>> > 
>> > Any reason you cannot use pmu::filter_match ?
>> 
>> Maybe I've misunderstood your point, but the two seem quite different.
>> 
>> We introduced pmu::filter_match to apply a SW filter each time we
>> installed events from a context. We use that on ARM to avoid programming
>> big events into little cores and vice-versa.
>> 
>> As far as I can see, itrace_filter_setup is closer in operation to
>> event_init. It can fail at configuration time (long before scheduling
>> events to cores), and leaves the actual filtering to the HW.
>
> Ah indeed. I got confused by the similarities in the signature and both
> having filter in the name.
>
> Alexander, maybe then add a filter parameter to validate to the
> signature, otherwise you have to change the event state in order to
> validate it which leaves you with an invalid configured event in case it
> doesn't validate.

The validation part also checks that we have not more filters than
hardware supports, so this callback walks through the list and counts
them up, iow, it looks at all the filters in one go.

In case validation fails (and it can only fail in the ioctl() path), the
event itself is still in a valid state though, we just decline the
filters supplied by the user.

I could split this callback in two: one for validation and one for the
stop/update_filters/start cadence, but I don't see if that helps much
either.

Thanks,
--
Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 14:56   ` Peter Zijlstra
@ 2015-12-11 15:14     ` Alexander Shishkin
  0 siblings, 0 replies; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
>> @@ -5921,6 +6010,8 @@ static void perf_event_comm_event(struct perf_comm_event *comm_event)
>>  
>>  	comm_event->event_id.header.size = sizeof(comm_event->event_id) + size;
>>  
>> +	if (comm_event->event_id.header.misc == PERF_RECORD_MISC_COMM_EXEC)
>> +		perf_event_aux(perf_itrace_exec, comm_event, NULL, true);
>
> Maybe put this in perf_event_exec() ?

Good point, I'm not actually even using the 'comm_event' in
perf_itrace_exec() for anything.

Thanks,
--
Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 15:00   ` Peter Zijlstra
@ 2015-12-11 15:17     ` Alexander Shishkin
  2015-12-11 15:34       ` Peter Zijlstra
  2015-12-11 16:06     ` Alexander Shishkin
  1 sibling, 1 reply; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
>> +static int __perf_event_itrace_filters_setup(void *info)
>> +{
>> +	struct perf_event *event = info;
>> +	int ret;
>> +
>> +	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
>> +		return -EAGAIN;
>> +
>> +	/* matches smp_wmb() in event_sched_in() */
>> +	smp_rmb();
>> +
>> +	/*
>> +	 * There is a window with interrupts enabled before we get here,
>> +	 * so we need to check again lest we try to stop another cpu's event.
>> +	 */
>> +	if (READ_ONCE(event->oncpu) != smp_processor_id())
>> +		return -EAGAIN;
>> +
>> +	event->pmu->stop(event, PERF_EF_UPDATE);
>> +	rcu_read_lock();
>> +	ret = event->pmu->itrace_filter_setup(event);
>> +	rcu_read_unlock();
>> +	event->pmu->start(event, PERF_EF_RELOAD);
>
> Would it not be more sensible to let the ::itrace_filter_setup() method
> do the stop/start-ing if and when needed?

I don't have a strong opinion on this, the only question is, are we
comfortable with pmu driver callback doing the
rcu_read_lock/unlock, because it still needs to iterate the filter list.
Other than that it's probably a good idea.

Thanks,
--
Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 15:02   ` Peter Zijlstra
@ 2015-12-11 15:20     ` Peter Zijlstra
  2015-12-11 15:27     ` Alexander Shishkin
  1 sibling, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 15:20 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 04:02:36PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
> > +static int perf_event_itrace_filters_setup(struct perf_event *event)
> > +{
> > +	int ret;
> > +
> > +	/*
> > +	 * We can't use event_function_call() here, because that would
> > +	 * require ctx::mutex, but one of our callers is called with
> > +	 * mm::mmap_sem down, which would cause an inversion, see bullet
> > +	 * (2) in put_event().
> > +	 */
> > +	do {
> > +		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE) {
> > +			ret = event->pmu->itrace_filter_setup(event);
> > +			break;
> 
> So this is tricky, if its not active it can be any moment, there is
> nothing serializing against that.

One solution is to keep a filter generation counter and check that on
pmu::add(), and if mis-matched recompute the hardware setup at that
point.

> > +		}
> > +
> > +		/* matches smp_wmb() in event_sched_in() */
> > +		smp_rmb();
> > +
> > +		ret = cpu_function_call(READ_ONCE(event->oncpu),
> > +					__perf_event_itrace_filters_setup, event);
> 
> This otoh, running with IRQs disabled on the CPU the thing is active on
> guarantees it will not become inactive -- nothing can come in and switch
> it off.
> 
> > +	} while (ret == -EAGAIN);
> > +
> > +	return ret;
> > +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 15:02   ` Peter Zijlstra
  2015-12-11 15:20     ` Peter Zijlstra
@ 2015-12-11 15:27     ` Alexander Shishkin
  2015-12-11 15:33       ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
>> +static int perf_event_itrace_filters_setup(struct perf_event *event)
>> +{
>> +	int ret;
>> +
>> +	/*
>> +	 * We can't use event_function_call() here, because that would
>> +	 * require ctx::mutex, but one of our callers is called with
>> +	 * mm::mmap_sem down, which would cause an inversion, see bullet
>> +	 * (2) in put_event().
>> +	 */
>> +	do {
>> +		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE) {
>> +			ret = event->pmu->itrace_filter_setup(event);
>> +			break;
>
> So this is tricky, if its not active it can be any moment, there is
> nothing serializing against that.

Indeed. But we should be able to call pmu::itrace_filter_setup()
multiple times, so if after this we re-check that the event is still
inactive, we can return, otherwise proceed with the cross-call. Does
this make sense?

>
>> +		}
>> +
>> +		/* matches smp_wmb() in event_sched_in() */
>> +		smp_rmb();
>> +
>> +		ret = cpu_function_call(READ_ONCE(event->oncpu),
>> +					__perf_event_itrace_filters_setup, event);
>
> This otoh, running with IRQs disabled on the CPU the thing is active on
> guarantees it will not become inactive -- nothing can come in and switch
> it off.
>
>> +	} while (ret == -EAGAIN);
>> +
>> +	return ret;
>> +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
                     ` (5 preceding siblings ...)
  2015-12-11 15:12   ` Peter Zijlstra
@ 2015-12-11 15:28   ` Peter Zijlstra
  2015-12-11 17:00     ` Peter Zijlstra
  2015-12-11 16:59   ` Peter Zijlstra
  2015-12-11 18:11   ` Mathieu Poirier
  8 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 15:28 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:

> @@ -9063,6 +9621,18 @@ inherit_event(struct perf_event *parent_event,
>  	get_ctx(child_ctx);
>  
>  	/*
> +	 * Clone itrace filters from the parent, if any
> +	 */
> +	if (has_itrace_filter(child_event)) {
> +		if (perf_itrace_filters_clone(child_event, parent_event,
> +					      child)) {
> +			put_ctx(child_ctx);
> +			free_event(child_event);
> +			return NULL;

So inherit_event()'s return policy is somewhat opaque, there's 3
possible returns:

 1) a valid struct perf_event pointer; the clone was successful
 2) ERR_PTR(err), the clone failed, abort inherit_group, fail fork()
 3) NULL, the clone failed, ignore, continue

We return NULL under two special cases:

 - the original event doesn't exist anymore, we're an orphan, do not make
   more orphans.

 - the parent event is dying


I'm fairly sure this return should be in the 2) category. If we cannot
fully clone the event something bad happened, we should not ignore it.

> +		}
> +	}
> +
> +	/*
>  	 * Make the child state follow the state of the parent event,
>  	 * not its attr.disabled bit.  We hold the parent's mutex,
>  	 * so we won't race with perf_event_{en, dis}able_family.
> -- 
> 2.6.2
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 15:27     ` Alexander Shishkin
@ 2015-12-11 15:33       ` Peter Zijlstra
  2015-12-11 15:48         ` Alexander Shishkin
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 15:33 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 05:27:22PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
> >> +static int perf_event_itrace_filters_setup(struct perf_event *event)
> >> +{
> >> +	int ret;
> >> +
> >> +	/*
> >> +	 * We can't use event_function_call() here, because that would
> >> +	 * require ctx::mutex, but one of our callers is called with
> >> +	 * mm::mmap_sem down, which would cause an inversion, see bullet
> >> +	 * (2) in put_event().
> >> +	 */
> >> +	do {
> >> +		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE) {
> >> +			ret = event->pmu->itrace_filter_setup(event);
> >> +			break;
> >
> > So this is tricky, if its not active it can be any moment, there is
> > nothing serializing against that.
> 
> Indeed. But we should be able to call pmu::itrace_filter_setup()
> multiple times, so if after this we re-check that the event is still
> inactive, we can return, otherwise proceed with the cross-call. Does
> this make sense?

Dunno, I worry :-)

What if:

	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE) {
		// we were INACTIVE, but now the event gets scheduled in
		// on _another_ CPU
		event->pmu->itrace_filter_setup() := {
			if (event->state == PERF_EVENT_STATE_ACTIVE) {
				/* muck with hardware */
			}
		}
	}

Here too I feel a strict validation vs programming split would make sense.

We can always call the validation thing, we must not call the program
thing !ACTIVE is a clear and simple rule.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 15:17     ` Alexander Shishkin
@ 2015-12-11 15:34       ` Peter Zijlstra
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 15:34 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 05:17:25PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
> >> +static int __perf_event_itrace_filters_setup(void *info)
> >> +{
> >> +	struct perf_event *event = info;
> >> +	int ret;
> >> +
> >> +	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
> >> +		return -EAGAIN;
> >> +
> >> +	/* matches smp_wmb() in event_sched_in() */
> >> +	smp_rmb();
> >> +
> >> +	/*
> >> +	 * There is a window with interrupts enabled before we get here,
> >> +	 * so we need to check again lest we try to stop another cpu's event.
> >> +	 */
> >> +	if (READ_ONCE(event->oncpu) != smp_processor_id())
> >> +		return -EAGAIN;
> >> +
> >> +	event->pmu->stop(event, PERF_EF_UPDATE);
> >> +	rcu_read_lock();
> >> +	ret = event->pmu->itrace_filter_setup(event);
> >> +	rcu_read_unlock();
> >> +	event->pmu->start(event, PERF_EF_RELOAD);
> >
> > Would it not be more sensible to let the ::itrace_filter_setup() method
> > do the stop/start-ing if and when needed?
> 
> I don't have a strong opinion on this, the only question is, are we
> comfortable with pmu driver callback doing the
> rcu_read_lock/unlock, because it still needs to iterate the filter list.
> Other than that it's probably a good idea.

See another email; I'm not sure RCU works for this. You can observe more
events than you set out for, and if you do multiple iterations of the
list they need not match.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 15:33       ` Peter Zijlstra
@ 2015-12-11 15:48         ` Alexander Shishkin
  2015-12-11 16:01           ` Peter Zijlstra
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 15:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Dec 11, 2015 at 05:27:22PM +0200, Alexander Shishkin wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
>> >> +static int perf_event_itrace_filters_setup(struct perf_event *event)
>> >> +{
>> >> +	int ret;
>> >> +
>> >> +	/*
>> >> +	 * We can't use event_function_call() here, because that would
>> >> +	 * require ctx::mutex, but one of our callers is called with
>> >> +	 * mm::mmap_sem down, which would cause an inversion, see bullet
>> >> +	 * (2) in put_event().
>> >> +	 */
>> >> +	do {
>> >> +		if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE) {
>> >> +			ret = event->pmu->itrace_filter_setup(event);
>> >> +			break;
>> >
>> > So this is tricky, if its not active it can be any moment, there is
>> > nothing serializing against that.
>> 
>> Indeed. But we should be able to call pmu::itrace_filter_setup()
>> multiple times, so if after this we re-check that the event is still
>> inactive, we can return, otherwise proceed with the cross-call. Does
>> this make sense?
>
> Dunno, I worry :-)
>
> What if:
>
> 	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE) {
> 		// we were INACTIVE, but now the event gets scheduled in
> 		// on _another_ CPU
> 		event->pmu->itrace_filter_setup() := {
> 			if (event->state == PERF_EVENT_STATE_ACTIVE) {
> 				/* muck with hardware */
> 			}
> 		}
> 	}
>
> Here too I feel a strict validation vs programming split would make sense.
>
> We can always call the validation thing, we must not call the program
> thing !ACTIVE is a clear and simple rule.

Ah, but pmu::itrace_filter_setup() does not touch the hardware,
pmu::start() does. The former keeps an array of, say, MSR values ready
for programming in event::hw and the latter actually writes the MSRs. So
the above example should not be a problem.

So in a way validation and programming are split already. And PT, for
example, won't have it any other way, you can only program stuff into
the registers while tracing is disabled.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 15:48         ` Alexander Shishkin
@ 2015-12-11 16:01           ` Peter Zijlstra
  2015-12-11 17:02             ` Peter Zijlstra
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 16:01 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 05:48:03PM +0200, Alexander Shishkin wrote:

> > We can always call the validation thing, we must not call the program
> > thing !ACTIVE is a clear and simple rule.
> 
> Ah, but pmu::itrace_filter_setup() does not touch the hardware,
> pmu::start() does. The former keeps an array of, say, MSR values ready
> for programming in event::hw and the latter actually writes the MSRs. So
> the above example should not be a problem.
> 
> So in a way validation and programming are split already. And PT, for
> example, won't have it any other way, you can only program stuff into
> the registers while tracing is disabled.

Yes, I just read that in the last patch. If however we fold the whole
stop/start bits into it, that fails again.

Hmm.. lemme ponder a bit.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 15:00   ` Peter Zijlstra
  2015-12-11 15:17     ` Alexander Shishkin
@ 2015-12-11 16:06     ` Alexander Shishkin
  1 sibling, 0 replies; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
>> +static int __perf_event_itrace_filters_setup(void *info)
>> +{
>> +	struct perf_event *event = info;
>> +	int ret;
>> +
>> +	if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
>> +		return -EAGAIN;
>> +
>> +	/* matches smp_wmb() in event_sched_in() */
>> +	smp_rmb();
>> +
>> +	/*
>> +	 * There is a window with interrupts enabled before we get here,
>> +	 * so we need to check again lest we try to stop another cpu's event.
>> +	 */
>> +	if (READ_ONCE(event->oncpu) != smp_processor_id())
>> +		return -EAGAIN;
>> +
>> +	event->pmu->stop(event, PERF_EF_UPDATE);
>> +	rcu_read_lock();
>> +	ret = event->pmu->itrace_filter_setup(event);
>> +	rcu_read_unlock();
>> +	event->pmu->start(event, PERF_EF_RELOAD);
>
> Would it not be more sensible to let the ::itrace_filter_setup() method
> do the stop/start-ing if and when needed?

Ok, so keeping in mind the other mails, if I add "int flags" to the
signature and have it do the stop/start when called like

  event->pmu->itrace_filter_setup(event, PERF_EF_RELOAD);

and not otherwise?

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
                     ` (6 preceding siblings ...)
  2015-12-11 15:28   ` Peter Zijlstra
@ 2015-12-11 16:59   ` Peter Zijlstra
  2015-12-11 17:15     ` Alexander Shishkin
  2015-12-11 18:11   ` Mathieu Poirier
  8 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 16:59 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
> +static int
> +perf_event_set_itrace_filter(struct perf_event *event, char *filter_str)
> +{
> +	int ret = 0;
> +
> +	/*
> +	 * Since this is called in perf_ioctl() path, we're already holding
> +	 * ctx::mutex.
> +	 */
> +	lockdep_assert_held(&event->ctx->mutex);
> +
> +	/*
> +	 * For now, we only support filtering in per-task events; doing so
> +	 * for cpu-wide events requires additional context switching trickery,
> +	 * since same object code will be mapped at different virtual
> +	 * addresses in different processes.
> +	 */
> +	if (!event->ctx->task)
> +		return -EOPNOTSUPP;
> +
> +	/* remove existing filters, if any */
> +	perf_itrace_filters_clear(event);
> +
> +	ret = perf_event_parse_itrace_filter(event, filter_str);
> +	if (!ret) {
> +		perf_itrace_filters_apply(event);
> +
> +		ret = perf_event_itrace_filters_setup(event);
> +		if (ret)
> +			perf_itrace_filters_clear(event);

This is what I meant, if you try and set a 'wrong' filter while it
already has filters set, you'll not only error out, you'll also wipe the
current state.

This seems wrong.

> +	}
> +
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 15:28   ` Peter Zijlstra
@ 2015-12-11 17:00     ` Peter Zijlstra
  2015-12-11 17:13       ` Alexander Shishkin
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 17:00 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 04:28:54PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
> 
> > @@ -9063,6 +9621,18 @@ inherit_event(struct perf_event *parent_event,
> >  	get_ctx(child_ctx);
> >  
> >  	/*
> > +	 * Clone itrace filters from the parent, if any
> > +	 */
> > +	if (has_itrace_filter(child_event)) {
> > +		if (perf_itrace_filters_clone(child_event, parent_event,
> > +					      child)) {
> > +			put_ctx(child_ctx);
> > +			free_event(child_event);
> > +			return NULL;
> 
> So inherit_event()'s return policy is somewhat opaque, there's 3
> possible returns:
> 
>  1) a valid struct perf_event pointer; the clone was successful
>  2) ERR_PTR(err), the clone failed, abort inherit_group, fail fork()
>  3) NULL, the clone failed, ignore, continue
> 
> We return NULL under two special cases:
> 
>  - the original event doesn't exist anymore, we're an orphan, do not make
>    more orphans.
> 
>  - the parent event is dying
> 
> 
> I'm fairly sure this return should be in the 2) category. If we cannot
> fully clone the event something bad happened, we should not ignore it.

On second thought; we should not inherit the filters at all.

We should always use event->parent (if exists) for filters. Otherwise
inherited events will get different filters if you change the filter
after clone.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 16:01           ` Peter Zijlstra
@ 2015-12-11 17:02             ` Peter Zijlstra
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 17:02 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 05:01:36PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 05:48:03PM +0200, Alexander Shishkin wrote:
> 
> > > We can always call the validation thing, we must not call the program
> > > thing !ACTIVE is a clear and simple rule.
> > 
> > Ah, but pmu::itrace_filter_setup() does not touch the hardware,
> > pmu::start() does. The former keeps an array of, say, MSR values ready
> > for programming in event::hw and the latter actually writes the MSRs. So
> > the above example should not be a problem.
> > 
> > So in a way validation and programming are split already. And PT, for
> > example, won't have it any other way, you can only program stuff into
> > the registers while tracing is disabled.
> 
> Yes, I just read that in the last patch. If however we fold the whole
> stop/start bits into it, that fails again.
> 
> Hmm.. lemme ponder a bit.

Nope, it doesn't matter. Either way around you need serialization.

Because while, as proposed, pmu::itrace_filter_setup() does not modify
the hardware state, it does modify event state. So it needs to be
serialized against concurrent pmu::add().



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 17:00     ` Peter Zijlstra
@ 2015-12-11 17:13       ` Alexander Shishkin
  2015-12-11 22:39         ` Peter Zijlstra
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Dec 11, 2015 at 04:28:54PM +0100, Peter Zijlstra wrote:
>> On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
>> 
>> > @@ -9063,6 +9621,18 @@ inherit_event(struct perf_event *parent_event,
>> >  	get_ctx(child_ctx);
>> >  
>> >  	/*
>> > +	 * Clone itrace filters from the parent, if any
>> > +	 */
>> > +	if (has_itrace_filter(child_event)) {
>> > +		if (perf_itrace_filters_clone(child_event, parent_event,
>> > +					      child)) {
>> > +			put_ctx(child_ctx);
>> > +			free_event(child_event);
>> > +			return NULL;
>> 
>> So inherit_event()'s return policy is somewhat opaque, there's 3
>> possible returns:
>> 
>>  1) a valid struct perf_event pointer; the clone was successful
>>  2) ERR_PTR(err), the clone failed, abort inherit_group, fail fork()
>>  3) NULL, the clone failed, ignore, continue
>> 
>> We return NULL under two special cases:
>> 
>>  - the original event doesn't exist anymore, we're an orphan, do not make
>>    more orphans.
>> 
>>  - the parent event is dying
>> 
>> 
>> I'm fairly sure this return should be in the 2) category. If we cannot
>> fully clone the event something bad happened, we should not ignore it.
>
> On second thought; we should not inherit the filters at all.
>
> We should always use event->parent (if exists) for filters. Otherwise
> inherited events will get different filters if you change the filter
> after clone.

But children will have different mappings, so the actual filter
configurations will still differ between parents and children. I guess I
could split the filter in two parts: one that's defined by the user and
one that we calculated from vma addresses, that we later program into
hardware.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 16:59   ` Peter Zijlstra
@ 2015-12-11 17:15     ` Alexander Shishkin
  0 siblings, 0 replies; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-11 17:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Dec 11, 2015 at 03:36:36PM +0200, Alexander Shishkin wrote:
>> +static int
>> +perf_event_set_itrace_filter(struct perf_event *event, char *filter_str)
>> +{
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * Since this is called in perf_ioctl() path, we're already holding
>> +	 * ctx::mutex.
>> +	 */
>> +	lockdep_assert_held(&event->ctx->mutex);
>> +
>> +	/*
>> +	 * For now, we only support filtering in per-task events; doing so
>> +	 * for cpu-wide events requires additional context switching trickery,
>> +	 * since same object code will be mapped at different virtual
>> +	 * addresses in different processes.
>> +	 */
>> +	if (!event->ctx->task)
>> +		return -EOPNOTSUPP;
>> +
>> +	/* remove existing filters, if any */
>> +	perf_itrace_filters_clear(event);
>> +
>> +	ret = perf_event_parse_itrace_filter(event, filter_str);
>> +	if (!ret) {
>> +		perf_itrace_filters_apply(event);
>> +
>> +		ret = perf_event_itrace_filters_setup(event);
>> +		if (ret)
>> +			perf_itrace_filters_clear(event);
>
> This is what I meant, if you try and set a 'wrong' filter while it
> already has filters set, you'll not only error out, you'll also wipe the
> current state.
>
> This seems wrong.

Ah, now I see what you mean. Yes, you're right.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 5/5] perf/x86/intel/pt: Add support for instruction trace filtering in PT
  2015-12-11 13:36 ` [PATCH v0 5/5] perf/x86/intel/pt: Add support for instruction trace filtering in PT Alexander Shishkin
@ 2015-12-11 18:06   ` Mathieu Poirier
  0 siblings, 0 replies; 38+ messages in thread
From: Mathieu Poirier @ 2015-12-11 18:06 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, vince,
	Stephane Eranian, Arnaldo Carvalho de Melo

On 11 December 2015 at 06:36, Alexander Shishkin
<alexander.shishkin@linux.intel.com> wrote:
> Newer versions of Intel PT support address ranges, which can be used to
> define IP address range-based filters or TraceSTOP regions. Number of
> ranges in enumerated via cpuid.
>
> This patch implements pmu callbacks and related low-level code to allow
> filter validation, configuration and programming into the hardware.
>
> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/intel_pt.h            |  30 ++++++-
>  arch/x86/kernel/cpu/perf_event_intel_pt.c | 132 +++++++++++++++++++++++++++++-
>  2 files changed, 159 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
> index 6ce8cd20b9..1b36c8529d 100644
> --- a/arch/x86/kernel/cpu/intel_pt.h
> +++ b/arch/x86/kernel/cpu/intel_pt.h
> @@ -105,13 +105,39 @@ struct pt_buffer {
>         struct topa_entry       *topa_index[0];
>  };
>
> +#define PT_FILTERS_NUM 4
> +
> +/**
> + * struct pt_filter - IP range filter configuration
> + * @msr_a:     range start, goes to RTIT_ADDRn_A
> + * @msr_b:     range end, goes to RTIT_ADDRn_B
> + * @config:    4-bit field in RTIT_CTL
> + */
> +struct pt_filter {
> +       unsigned long   msr_a;
> +       unsigned long   msr_b;
> +       unsigned long   config;
> +};
> +
> +/**
> + * struct pt_filters - IP range filtering context
> + * @filter:    filters defined for this context
> + * @nr_filters:        number of defined filters in the @filter array
> + */
> +struct pt_filters {
> +       struct pt_filter        filter[PT_FILTERS_NUM];
> +       unsigned int            nr_filters;
> +};
> +
>  /**
>   * struct pt - per-cpu pt context
> - * @handle:    perf output handle
> - * @handle_nmi:        do handle PT PMI on this cpu, there's an active event
> + * @handle:            perf output handle
> + * @filters:           last configured filters
> + * @handle_nmi:                do handle PT PMI on this cpu, there's an active event
>   */
>  struct pt {
>         struct perf_output_handle handle;
> +       struct pt_filters       filters;
>         int                     handle_nmi;
>  };
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
> index 2ec25581de..f531a2f0de 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
> @@ -253,6 +253,73 @@ static bool pt_event_valid(struct perf_event *event)
>   * These all are cpu affine and operate on a local PT
>   */
>
> +/* Address ranges and their corresponding msr configuration registers */
> +static const struct pt_address_range {
> +       unsigned long   msr_a;
> +       unsigned long   msr_b;
> +       unsigned int    reg_off;
> +} pt_address_ranges[] = {
> +       {
> +               .msr_a   = MSR_IA32_RTIT_ADDR0_A,
> +               .msr_b   = MSR_IA32_RTIT_ADDR0_B,
> +               .reg_off = RTIT_CTL_ADDR0_OFFSET,
> +       },
> +       {
> +               .msr_a   = MSR_IA32_RTIT_ADDR1_A,
> +               .msr_b   = MSR_IA32_RTIT_ADDR1_B,
> +               .reg_off = RTIT_CTL_ADDR1_OFFSET,
> +       },
> +       {
> +               .msr_a   = MSR_IA32_RTIT_ADDR2_A,
> +               .msr_b   = MSR_IA32_RTIT_ADDR2_B,
> +               .reg_off = RTIT_CTL_ADDR2_OFFSET,
> +       },
> +       {
> +               .msr_a   = MSR_IA32_RTIT_ADDR3_A,
> +               .msr_b   = MSR_IA32_RTIT_ADDR3_B,
> +               .reg_off = RTIT_CTL_ADDR3_OFFSET,
> +       }
> +};
> +
> +static u64 pt_config_filters(struct perf_event *event)
> +{
> +       struct pt_filters *filters = event->hw.itrace_filters;
> +       struct pt *pt = this_cpu_ptr(&pt_ctx);
> +       unsigned int range = 0;
> +       u64 rtit_ctl = 0;
> +
> +       if (!filters)
> +               return 0;
> +
> +       for (range = 0; range < filters->nr_filters; range++) {
> +               struct pt_filter *filter = &filters->filter[range];
> +
> +               /*
> +                * Note, if the range has zero start/end addresses due
> +                * to its dynamic object not being loaded yet, we just
> +                * go ahead and program zeroed range, which will simply
> +                * produce no data. Note^2: if executable code at 0x0
> +                * is a concern, we can set up an "invalid" configuration
> +                * such as msr_b < msr_a.
> +                */
> +
> +               /* avoid redundant msr writes */
> +               if (pt->filters.filter[range].msr_a != filter->msr_a) {
> +                       wrmsrl(pt_address_ranges[range].msr_a, filter->msr_a);
> +                       pt->filters.filter[range].msr_a = filter->msr_a;
> +               }
> +
> +               if (pt->filters.filter[range].msr_b != filter->msr_b) {
> +                       wrmsrl(pt_address_ranges[range].msr_b, filter->msr_b);
> +                       pt->filters.filter[range].msr_b = filter->msr_b;
> +               }

You don't need checks to make sure the address range is correct?  On
the CS side a < b must be respected or the tracer will produced
invalid results.

> +
> +               rtit_ctl |= filter->config << pt_address_ranges[range].reg_off;

I understand what you're doing here but it is probably a good idea to
make it clear with a comment.

> +       }
> +
> +       return rtit_ctl;
> +}
> +
>  static void pt_config(struct perf_event *event)
>  {
>         u64 reg;
> @@ -262,7 +329,8 @@ static void pt_config(struct perf_event *event)
>                 wrmsrl(MSR_IA32_RTIT_STATUS, 0);
>         }
>
> -       reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
> +       reg = pt_config_filters(event);
> +       reg |= RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
>
>         if (!event->attr.exclude_kernel)
>                 reg |= RTIT_CTL_OS;
> @@ -907,6 +975,60 @@ static void pt_buffer_free_aux(void *data)
>         kfree(buf);
>  }
>
> +static int pt_itrace_filters_init(struct perf_event *event)
> +{
> +       struct pt_filters *filters;
> +       int node = event->cpu == -1 ? -1 : cpu_to_node(event->cpu);
> +
> +       if (!pt_cap_get(PT_CAP_num_address_ranges))
> +               return 0;
> +
> +       filters = kzalloc_node(sizeof(struct pt_filters), GFP_KERNEL, node);
> +       if (!filters)
> +               return -ENOMEM;
> +
> +       event->hw.itrace_filters = filters;
> +
> +       return 0;
> +}
> +
> +static void pt_itrace_filters_fini(struct perf_event *event)
> +{
> +       kfree(event->hw.itrace_filters);
> +       event->hw.itrace_filters = NULL;
> +}
> +
> +static int pt_event_itrace_filter_setup(struct perf_event *event)
> +{
> +       struct pt_filters *filters = event->hw.itrace_filters;
> +       struct perf_itrace_filter *filter;
> +       int range = 0;
> +
> +       if (!filters)
> +               return -EOPNOTSUPP;
> +
> +       list_for_each_entry_rcu(filter, &event->itrace_filters, entry) {
> +               /* PT doesn't support single address triggers */
> +               if (!filter->range)
> +                       return -EOPNOTSUPP;
> +
> +               if (filter->kernel && !kernel_ip(filter->offset))
> +                       return -EINVAL;
> +
> +               filters->filter[range].msr_a  = filter->start;
> +               filters->filter[range].msr_b  = filter->end;
> +               filters->filter[range].config = filter->filter ? 1 : 2;
> +
> +               if (++range > pt_cap_get(PT_CAP_num_address_ranges))
> +                       return -EOPNOTSUPP;
> +       }
> +
> +       if (range)
> +               filters->nr_filters = range - 1;
> +
> +       return 0;
> +}
> +
>  /**
>   * intel_pt_interrupt() - PT PMI handler
>   */
> @@ -1075,6 +1197,7 @@ static void pt_event_read(struct perf_event *event)
>
>  static void pt_event_destroy(struct perf_event *event)
>  {
> +       pt_itrace_filters_fini(event);
>         x86_del_exclusive(x86_lbr_exclusive_pt);
>  }
>
> @@ -1089,6 +1212,11 @@ static int pt_event_init(struct perf_event *event)
>         if (x86_add_exclusive(x86_lbr_exclusive_pt))
>                 return -EBUSY;
>
> +       if (pt_itrace_filters_init(event)) {
> +               x86_del_exclusive(x86_lbr_exclusive_pt);
> +               return -ENOMEM;
> +       }
> +
>         event->destroy = pt_event_destroy;
>
>         return 0;
> @@ -1152,6 +1280,8 @@ static __init int pt_init(void)
>         pt_pmu.pmu.read         = pt_event_read;
>         pt_pmu.pmu.setup_aux    = pt_buffer_setup_aux;
>         pt_pmu.pmu.free_aux     = pt_buffer_free_aux;
> +       pt_pmu.pmu.itrace_filter_setup =
> +               pt_event_itrace_filter_setup;
>         ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
>
>         return ret;
> --
> 2.6.2
>

I've been scratching my head for a while now on how we could convey
address ranges to the tracers.  My initial thought was to extend the
-e cs_etm/.../ to take strings that could be parsed.  I was going to
send an RFC email to the list in January but you beat me to the punch
- the method you are putting forward is better.

I had comments about some possible race conditions in the core but
Peter got to those first.

Other than the above comment and the suggestion in 4/5, for the
portion of the work concerned with IntelPT:

Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
                     ` (7 preceding siblings ...)
  2015-12-11 16:59   ` Peter Zijlstra
@ 2015-12-11 18:11   ` Mathieu Poirier
  2015-12-11 22:41     ` Peter Zijlstra
  8 siblings, 1 reply; 38+ messages in thread
From: Mathieu Poirier @ 2015-12-11 18:11 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, vince,
	Stephane Eranian, Arnaldo Carvalho de Melo

On 11 December 2015 at 06:36, Alexander Shishkin
<alexander.shishkin@linux.intel.com> wrote:
> Many instruction trace pmus out there support address range-based
> filtering, which would, for example, generate trace data only for a
> given range of instruction addresses, which is useful for tracing
> individual functions, modules or libraries.
>
> This patch introduces the interface for userspace to specify these
> filters and for the pmu drivers to apply these filters to hardware
> configuration.
>
> The user interface is an ascii string that is passed via an ioctl
> and specifies (in a way similar to uprobe) address ranges within
> certain object files or within kernel. There is no special treatment
> for kernel modules yet, but it might be a worthy pursuit.
>
> The pmu driver interface basically adds an extra callback to the
> pmu driver structure, which validates the filter configuration proposed
> by the user against what the hardware is actually capable of doing
> and translates it into something that pmu::start can program into
> hardware.
>
> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---
>  include/linux/perf_event.h |  40 ++++
>  kernel/events/core.c       | 576 ++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 613 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index f9828a48f1..4ddbedc100 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -127,6 +127,11 @@ struct hw_perf_event {
>                 };
>                 struct { /* itrace */
>                         int                     itrace_started;
> +                       /*
> +                        * PMU would store hardware filter configuration
> +                        * here.
> +                        */
> +                       void                    *itrace_filters;
>                 };
>  #ifdef CONFIG_HAVE_HW_BREAKPOINT
>                 struct { /* breakpoint */
> @@ -388,12 +393,38 @@ struct pmu {
>         void (*free_aux)                (void *aux); /* optional */
>
>         /*
> +        * Validate instruction tracing filters: make sure hw supports the
> +        * requested configuration and number of filters.
> +        *
> +        * Configure instruction tracing filters: translate hw-agnostic filter
> +        * into hardware configuration in event::hw::itrace_filters
> +        */
> +       int (*itrace_filter_setup)      (struct perf_event *event); /* optional */
> +
> +       /*
>          * Filter events for PMU-specific reasons.
>          */
>         int (*filter_match)             (struct perf_event *event); /* optional */
>  };
>
>  /**
> + * Instruction trace (ITRACE) filter
> + */
> +struct perf_itrace_filter {
> +       struct list_head        entry;
> +       struct rcu_head         rcu_head;
> +       struct inode            *inode;
> +       struct task_struct      *task;
> +       unsigned long           offset;
> +       unsigned long           size;
> +       unsigned long           start;
> +       unsigned long           end;
> +       unsigned int            range   : 1, /* 1: range, 0: addr */
> +                               filter  : 1, /* 1: filter/start, 0: stop */
> +                               kernel  : 1; /* 1: kernel, 0: object file*/

I've seen a rant on the list before about bit fields like these... Do
we gain anything with having them?  I personally avoid them to favour
bool types but that will be for Peter to decide.  Given its
importance, I also think this structure could you more documentation.

> +};
> +
> +/**
>   * enum perf_event_active_state - the states of a event
>   */
>  enum perf_event_active_state {
> @@ -559,6 +590,10 @@ struct perf_event {
>
>         atomic_t                        event_limit;
>
> +       /* instruction trace filters */
> +       struct list_head                itrace_filters;
> +       struct mutex                    itrace_filters_mutex;
> +
>         void (*destroy)(struct perf_event *);
>         struct rcu_head                 rcu_head;
>
> @@ -1032,6 +1067,11 @@ static inline bool has_aux(struct perf_event *event)
>         return event->pmu->setup_aux;
>  }
>
> +static inline bool has_itrace_filter(struct perf_event *event)
> +{
> +       return event->pmu->itrace_filter_setup;
> +}
> +
>  extern int perf_output_begin(struct perf_output_handle *handle,
>                              struct perf_event *event, unsigned int size);
>  extern void perf_output_end(struct perf_output_handle *handle);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 2bab4af901..28ce173a28 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -44,6 +44,8 @@
>  #include <linux/compat.h>
>  #include <linux/bpf.h>
>  #include <linux/filter.h>
> +#include <linux/namei.h>
> +#include <linux/parser.h>
>
>  #include "internal.h"
>
> @@ -2335,6 +2337,59 @@ static int __perf_event_stop(void *info)
>         return 0;
>  }
>
> +static int __perf_event_itrace_filters_setup(void *info)
> +{
> +       struct perf_event *event = info;
> +       int ret;
> +
> +       if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
> +               return -EAGAIN;
> +
> +       /* matches smp_wmb() in event_sched_in() */
> +       smp_rmb();
> +
> +       /*
> +        * There is a window with interrupts enabled before we get here,
> +        * so we need to check again lest we try to stop another cpu's event.
> +        */
> +       if (READ_ONCE(event->oncpu) != smp_processor_id())
> +               return -EAGAIN;
> +
> +       event->pmu->stop(event, PERF_EF_UPDATE);
> +       rcu_read_lock();
> +       ret = event->pmu->itrace_filter_setup(event);
> +       rcu_read_unlock();
> +       event->pmu->start(event, PERF_EF_RELOAD);
> +
> +       return ret;
> +}
> +
> +static int perf_event_itrace_filters_setup(struct perf_event *event)
> +{
> +       int ret;
> +
> +       /*
> +        * We can't use event_function_call() here, because that would
> +        * require ctx::mutex, but one of our callers is called with
> +        * mm::mmap_sem down, which would cause an inversion, see bullet
> +        * (2) in put_event().
> +        */
> +       do {
> +               if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE) {
> +                       ret = event->pmu->itrace_filter_setup(event);
> +                       break;
> +               }
> +
> +               /* matches smp_wmb() in event_sched_in() */
> +               smp_rmb();
> +
> +               ret = cpu_function_call(READ_ONCE(event->oncpu),
> +                                       __perf_event_itrace_filters_setup, event);
> +       } while (ret == -EAGAIN);
> +
> +       return ret;
> +}
> +
>  static int _perf_event_refresh(struct perf_event *event, int refresh)
>  {
>         /*
> @@ -3663,6 +3718,8 @@ static bool exclusive_event_installable(struct perf_event *event,
>         return true;
>  }
>
> +static void perf_itrace_filters_clear(struct perf_event *event);
> +
>  static void __free_event(struct perf_event *event)
>  {
>         if (!event->parent) {
> @@ -3671,6 +3728,7 @@ static void __free_event(struct perf_event *event)
>         }
>
>         perf_event_free_bpf_prog(event);
> +       perf_itrace_filters_clear(event);
>
>         if (event->destroy)
>                 event->destroy(event);
> @@ -5907,6 +5965,37 @@ out:
>         comm_event->event_id.header.size = size;
>  }
>
> +/*
> + * Clear all dynamic object-based filters at exec, they'll have to be
> + * re-instated when/if these objects are mmapped again.
> + */
> +static void perf_itrace_exec(struct perf_event *event, void *data)
> +{
> +       struct perf_itrace_filter *filter;
> +       unsigned int restart = 0;
> +
> +       if (!has_itrace_filter(event))
> +               return;
> +
> +       rcu_read_lock();
> +
> +       list_for_each_entry_rcu(filter, &event->itrace_filters, entry) {
> +               if (filter->kernel)
> +                       continue;
> +
> +               filter->start = filter->end = 0;
> +               restart++;
> +       }
> +
> +       rcu_read_unlock();
> +
> +       /*
> +        * kernel filters, however, will still be valid
> +        */
> +       if (restart)
> +               perf_event_itrace_filters_setup(event);
> +}
> +
>  static void perf_event_comm_event(struct perf_comm_event *comm_event)
>  {
>         char comm[TASK_COMM_LEN];
> @@ -5921,6 +6010,8 @@ static void perf_event_comm_event(struct perf_comm_event *comm_event)
>
>         comm_event->event_id.header.size = sizeof(comm_event->event_id) + size;
>
> +       if (comm_event->event_id.header.misc == PERF_RECORD_MISC_COMM_EXEC)
> +               perf_event_aux(perf_itrace_exec, comm_event, NULL, true);
>         perf_event_aux(perf_event_comm_output,
>                        comm_event,
>                        NULL, false);
> @@ -6159,6 +6250,77 @@ got_name:
>         kfree(buf);
>  }
>
> +/*
> + * Whether this @filter depends on a dynamic object which is not loaded
> + * yet or its load addresses are not known.
> + */
> +static bool perf_itrace_filter_needs_mmap(struct perf_itrace_filter *filter)
> +{
> +       return filter->filter && !filter->kernel && !filter->end;
> +}
> +
> +/*
> + * Check whether inode and address range match filter criteria.
> + */
> +static bool perf_itrace_filter_match(struct perf_itrace_filter *filter,
> +                                    struct file *file, unsigned long offset,
> +                                    unsigned long size)
> +{
> +

Unneeded white space - I'm suprised checkpatch didn't catch it.

> +       if (filter->inode != file->f_inode)
> +               return false;
> +
> +       if (filter->offset > offset + size)
> +               return false;
> +
> +       if (filter->offset + filter->size < offset)
> +               return false;
> +
> +       return true;
> +}
> +
> +/*
> + * Update event's itrace filters
> + */
> +static void perf_itrace_filters_update(struct perf_event *event, void *data)
> +{
> +       struct perf_mmap_event *mmap_event = data;
> +       unsigned long off = mmap_event->vma->vm_pgoff << PAGE_SHIFT;
> +       struct file *file = mmap_event->vma->vm_file;
> +       struct perf_itrace_filter *filter;
> +       unsigned int restart = 0;
> +
> +       if (!has_itrace_filter(event))
> +               return;
> +
> +       if (!file)
> +               return;
> +
> +       /* we do not modify the list or sleep, no need for the mutex */
> +       rcu_read_lock();
> +       list_for_each_entry_rcu(filter, &event->itrace_filters, entry) {
> +               if (filter->kernel)
> +                       continue;
> +
> +               if (filter->task->mm != mmap_event->vma->vm_mm)
> +                       continue;
> +
> +               if (!perf_itrace_filter_match(filter, file, off,
> +                                             mmap_event->event_id.len))
> +                       continue;
> +
> +               restart++;
> +               filter->start = mmap_event->event_id.start + filter->offset;
> +               filter->end = mmap_event->event_id.start + filter->offset +
> +                       filter->size;
> +       }
> +       rcu_read_unlock();
> +
> +       /* reprogram updated filters into hardware */
> +       if (restart)
> +               perf_event_itrace_filters_setup(event);
> +}
> +
>  void perf_event_mmap(struct vm_area_struct *vma)
>  {
>         struct perf_mmap_event mmap_event;
> @@ -6190,6 +6352,7 @@ void perf_event_mmap(struct vm_area_struct *vma)
>                 /* .flags (attr_mmap2 only) */
>         };
>
> +       perf_event_aux(perf_itrace_filters_update, &mmap_event, NULL, true);
>         perf_event_mmap_event(&mmap_event);
>  }
>
> @@ -7163,13 +7326,405 @@ void perf_bp_event(struct perf_event *bp, void *data)
>  }
>  #endif
>
> +/*
> + * Insert an itrace @filter into @event's list of filters.
> + * @filter is used as a template
> + */
> +static int perf_itrace_filter_insert(struct perf_event *event,
> +                                    struct perf_itrace_filter *src,
> +                                    struct task_struct *task)
> +{
> +       int node = cpu_to_node(event->cpu == -1 ? 0 : event->cpu);
> +       struct perf_itrace_filter *filter;
> +
> +       filter = kzalloc_node(sizeof(*filter), GFP_KERNEL, node);
> +       if (!filter)
> +               return -ENOMEM;
> +
> +       filter->inode  = src->inode;
> +       filter->offset = src->offset;
> +       filter->size   = src->size;
> +       filter->range  = src->range;
> +       filter->filter = src->filter;
> +       filter->kernel = src->kernel;
> +       /*
> +        * We copy the actual address range too: it will remain the same for
> +        * kernel addresses and user addresses after fork(); in case of exec
> +        * or mmap, it will get cleared or modified by
> +        * perf_itrace_filters_clear()/perf_itrace_filters_update().
> +        */
> +       filter->start = src->start;
> +       filter->end   = src->end;
> +
> +       /*
> +        * We're already holding a reference to this task_struct since
> +        * alloc_perf_context() till last put_ctx() in __free_event().
> +        */
> +       filter->task = task;
> +
> +       /*
> +        * If we're called through perf_itrace_filters_clone(), we're already
> +        * holding parent's filter mutex.
> +        */
> +       mutex_lock_nested(&event->itrace_filters_mutex, SINGLE_DEPTH_NESTING);
> +       list_add_tail_rcu(&filter->entry, &event->itrace_filters);
> +       mutex_unlock(&event->itrace_filters_mutex);
> +
> +       return 0;
> +}
> +
> +static void perf_itrace_filter_free_rcu(struct rcu_head *rcu_head)
> +{
> +       struct perf_itrace_filter *filter =
> +               container_of(rcu_head, struct perf_itrace_filter, rcu_head);
> +
> +       if (filter->inode)
> +               iput(filter->inode);
> +       kfree(filter);
> +}
> +
> +/*
> + * we can do this via task_function_call(), as well as setting filters
> + * and maybe event updating them!
> + */
> +static void perf_itrace_filters_clear(struct perf_event *event)
> +{
> +       struct perf_itrace_filter *filter, *iter;
> +
> +       if (!has_itrace_filter(event))
> +               return;
> +
> +       mutex_lock(&event->itrace_filters_mutex);
> +       list_for_each_entry_safe(filter, iter, &event->itrace_filters, entry) {
> +               list_del_rcu(&filter->entry);
> +               call_rcu(&filter->rcu_head, perf_itrace_filter_free_rcu);
> +       }
> +
> +       perf_event_itrace_filters_setup(event);
> +       mutex_unlock(&event->itrace_filters_mutex);
> +}
> +
> +/*
> + * Scan through mm's vmas and see if one of them matches the
> + * @filter; if so, adjust filter's address range.
> + * Called with mm::mmap_sem down for reading.
> + */
> +static int perf_itrace_filter_apply(struct perf_event *event,
> +                                   struct perf_itrace_filter *filter,
> +                                   struct mm_struct *mm)
> +{
> +       struct vm_area_struct *vma;
> +
> +       for (vma = mm->mmap; vma->vm_next; vma = vma->vm_next) {
> +               struct file *file = vma->vm_file;
> +               unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
> +               unsigned long vma_size = vma->vm_end - vma->vm_start;
> +
> +               if (!file)
> +                       continue;
> +
> +               if (!perf_itrace_filter_match(filter, file, off,
> +                                             vma_size))
> +                       continue;
> +
> +               filter->start = vma->vm_start + filter->offset;
> +               filter->end = vma->vm_start + filter->offset +
> +                       filter->size;
> +
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +/*
> + * Adjust event's itrace filters' address ranges based on the
> + * task's existing mappings
> + */
> +static void perf_itrace_filters_apply(struct perf_event *event)
> +{
> +       struct perf_itrace_filter *filter;
> +       struct mm_struct *mm = NULL;
> +       unsigned int restart = 0;
> +
> +       lockdep_assert_held(&event->ctx->mutex);
> +
> +       mm = get_task_mm(event->ctx->task);
> +       if (!mm)
> +               return;
> +
> +       down_read(&mm->mmap_sem);
> +
> +       rcu_read_lock();
> +       list_for_each_entry_rcu(filter, &event->itrace_filters, entry) {
> +               if (!perf_itrace_filter_needs_mmap(filter))
> +                       continue;
> +
> +               restart += perf_itrace_filter_apply(event, filter, mm);
> +       }
> +       rcu_read_unlock();
> +
> +       up_read(&mm->mmap_sem);
> +
> +       if (restart)
> +               perf_event_itrace_filters_setup(event);
> +
> +       mmput(mm);
> +}
> +
> +/*
> + * Instruction trace filtering: limiting the trace to certain
> + * instruction address ranges. Filters are ioctl()ed to us from
> + * userspace as ascii strings.
> + *
> + * Filter string format:
> + *
> + * ACTION SOURCE:RANGE_SPEC
> + * where ACTION is one of the
> + *  * "filter": limit the trace to this region
> + *  * "start": start tracing from this address
> + *  * "stop": stop tracing at this address/region;
> + * SOURCE is either "file" or "kernel"
> + * RANGE_SPEC is
> + *  * for "kernel": <start address>[/<size>]
> + *  * for "file":   <start address>[/<size>]@</path/to/object/file>
> + *
> + * if <size> is not specified, the range is treated as a single address.
> + */
> +enum {
> +       IF_ACT_FILTER,
> +       IF_ACT_START,
> +       IF_ACT_STOP,
> +       IF_SRC_FILE,
> +       IF_SRC_KERNEL,
> +       IF_SRC_FILEADDR,
> +       IF_SRC_KERNELADDR,
> +};
> +
> +enum {
> +       IF_STATE_ACTION = 0,
> +       IF_STATE_SOURCE,
> +       IF_STATE_END,
> +};
> +
> +static const match_table_t if_tokens = {
> +       { IF_ACT_FILTER,        "filter" },
> +       { IF_ACT_START,         "start" },
> +       { IF_ACT_STOP,          "stop" },
> +       { IF_SRC_FILE,          "file:%u/%u@%s" },
> +       { IF_SRC_KERNEL,        "kernel:%u/%u" },
> +       { IF_SRC_FILEADDR,      "file:%u@%s" },
> +       { IF_SRC_KERNELADDR,    "kernel:%u" },
> +};
> +
> +/*
> + * Itrace filter string parser
> + */
> +static int
> +perf_event_parse_itrace_filter(struct perf_event *event, char *fstr)
> +{
> +       struct perf_itrace_filter filter;
> +       char *start, *orig, *filename = NULL;
> +       struct path path;
> +       substring_t args[MAX_OPT_ARGS];
> +       int state = IF_STATE_ACTION, token;
> +       int ret = -EINVAL;
> +
> +       orig = fstr = kstrdup(fstr, GFP_KERNEL);
> +       if (!fstr)
> +               return -ENOMEM;
> +
> +       while ((start = strsep(&fstr, " ,\n")) != NULL) {
> +               ret = -EINVAL;
> +
> +               if (!*start)
> +                       continue;
> +
> +               /* filter definition begins */
> +               if (state == IF_STATE_ACTION)
> +                       memset(&filter, 0, sizeof(filter));
> +
> +               token = match_token(start, if_tokens, args);
> +               switch (token) {
> +               case IF_ACT_FILTER:
> +               case IF_ACT_START:
> +                       filter.filter = 1;
> +
> +               case IF_ACT_STOP:
> +                       if (state != IF_STATE_ACTION)
> +                               goto fail;
> +
> +                       state = IF_STATE_SOURCE;
> +                       break;
> +
> +               case IF_SRC_KERNELADDR:
> +               case IF_SRC_KERNEL:
> +                       filter.kernel = 1;
> +
> +               case IF_SRC_FILEADDR:
> +               case IF_SRC_FILE:
> +                       if (state != IF_STATE_SOURCE)
> +                               goto fail;
> +
> +                       if (token == IF_SRC_FILE || token == IF_SRC_KERNEL)
> +                               filter.range = 1;
> +
> +                       *args[0].to = 0;
> +                       ret = kstrtoul(args[0].from, 0, &filter.offset);
> +                       if (ret)
> +                               goto fail;
> +
> +                       if (filter.range) {
> +                               *args[1].to = 0;
> +                               ret = kstrtoul(args[1].from, 0, &filter.size);
> +                               if (ret)
> +                                       goto fail;
> +                       }
> +
> +                       if (token == IF_SRC_FILE) {
> +                               filename = match_strdup(&args[2]);
> +                               if (!filename) {
> +                                       ret = -ENOMEM;
> +                                       goto fail;
> +                               }
> +                       }
> +
> +                       state = IF_STATE_END;
> +                       break;
> +
> +               default:
> +                       goto fail;
> +               }
> +
> +               /*
> +                * Filter definition is fully parsed, validate and install it.
> +                * Make sure that it doesn't contradict itself or the event's
> +                * attribute.
> +                */
> +               if (state == IF_STATE_END) {
> +                       if (filter.kernel && event->attr.exclude_kernel)
> +                               goto fail;
> +
> +                       if (!filter.kernel) {
> +                               if (!filename)
> +                                       goto fail;
> +
> +                               /* look up the path and grab its inode */
> +                               ret = kern_path(filename, LOOKUP_FOLLOW, &path);
> +                               if (ret)
> +                                       goto fail_free_name;
> +
> +                               filter.inode = igrab(d_inode(path.dentry));
> +                               path_put(&path);
> +                               kfree(filename);
> +                               filename = NULL;
> +                       }
> +
> +                       ret = perf_itrace_filter_insert(event, &filter,
> +                                                       event->ctx->task);
> +                       if (ret)
> +                               goto fail;
> +
> +                       /* ready to consume more filters */
> +                       state = IF_STATE_ACTION;
> +               }
> +       }
> +
> +       if (state != IF_STATE_ACTION)
> +               goto fail;
> +
> +       kfree(orig);
> +
> +       return 0;
> +
> +fail_free_name:
> +       kfree(filename);
> +fail:
> +       perf_itrace_filters_clear(event);
> +       kfree(orig);
> +
> +       return ret;
> +}
> +
> +/*
> + * Filters are cloned in inherit_event() path to make sure child tracing is
> + * consistent with parent.
> + */
> +static int
> +perf_itrace_filters_clone(struct perf_event *to, struct perf_event *from,
> +                         struct task_struct *task)
> +{
> +       struct perf_itrace_filter *filter;
> +       int ret = -ENOMEM;
> +
> +       mutex_lock(&from->itrace_filters_mutex);
> +       list_for_each_entry_rcu(filter, &from->itrace_filters, entry) {
> +               /* parent's filter must hold a reference to this inode */
> +               if (WARN_ON_ONCE(!igrab(filter->inode)))
> +                       goto fail;
> +
> +               ret = perf_itrace_filter_insert(to, filter, task);
> +               if (ret) {
> +                       iput(filter->inode);
> +                       goto fail;
> +               }
> +       }
> +
> +       ret = 0;
> +fail:
> +       mutex_unlock(&from->itrace_filters_mutex);
> +
> +       if (!ret)
> +               ret = perf_event_itrace_filters_setup(to);
> +       else
> +               perf_itrace_filters_clear(to);
> +
> +       return ret;
> +}
> +
> +static int
> +perf_event_set_itrace_filter(struct perf_event *event, char *filter_str)
> +{
> +       int ret = 0;
> +
> +       /*
> +        * Since this is called in perf_ioctl() path, we're already holding
> +        * ctx::mutex.
> +        */
> +       lockdep_assert_held(&event->ctx->mutex);
> +
> +       /*
> +        * For now, we only support filtering in per-task events; doing so
> +        * for cpu-wide events requires additional context switching trickery,
> +        * since same object code will be mapped at different virtual
> +        * addresses in different processes.
> +        */
> +       if (!event->ctx->task)
> +               return -EOPNOTSUPP;
> +
> +       /* remove existing filters, if any */
> +       perf_itrace_filters_clear(event);
> +
> +       ret = perf_event_parse_itrace_filter(event, filter_str);
> +       if (!ret) {
> +               perf_itrace_filters_apply(event);
> +
> +               ret = perf_event_itrace_filters_setup(event);
> +               if (ret)
> +                       perf_itrace_filters_clear(event);
> +       }
> +
> +       return ret;
> +}
> +
>  static int perf_event_set_filter(struct perf_event *event, void __user *arg)
>  {
>         char *filter_str;
>         int ret = -EINVAL;
>
> -       if (event->attr.type != PERF_TYPE_TRACEPOINT ||
> -           !IS_ENABLED(CONFIG_EVENT_TRACING))
> +       if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
> +           !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
> +           !has_itrace_filter(event))
>                 return -EINVAL;
>
>         filter_str = strndup_user(arg, PAGE_SIZE);
> @@ -7180,6 +7735,8 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
>             event->attr.type == PERF_TYPE_TRACEPOINT)
>                 ret = ftrace_profile_set_filter(event, event->attr.config,
>                                                 filter_str);
> +       else if (has_itrace_filter(event))
> +               ret = perf_event_set_itrace_filter(event, filter_str);
>
>         kfree(filter_str);
>         return ret;
> @@ -7921,13 +8478,14 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>         INIT_LIST_HEAD(&event->sibling_list);
>         INIT_LIST_HEAD(&event->rb_entry);
>         INIT_LIST_HEAD(&event->active_entry);
> +       INIT_LIST_HEAD(&event->itrace_filters);
>         INIT_HLIST_NODE(&event->hlist_entry);
>
> -
>         init_waitqueue_head(&event->waitq);
>         init_irq_work(&event->pending, perf_pending_event);
>
>         mutex_init(&event->mmap_mutex);
> +       mutex_init(&event->itrace_filters_mutex);
>
>         atomic_long_set(&event->refcount, 1);
>         event->cpu              = cpu;
> @@ -9063,6 +9621,18 @@ inherit_event(struct perf_event *parent_event,
>         get_ctx(child_ctx);
>
>         /*
> +        * Clone itrace filters from the parent, if any
> +        */
> +       if (has_itrace_filter(child_event)) {
> +               if (perf_itrace_filters_clone(child_event, parent_event,
> +                                             child)) {
> +                       put_ctx(child_ctx);
> +                       free_event(child_event);
> +                       return NULL;
> +               }
> +       }
> +
> +       /*
>          * Make the child state follow the state of the parent event,
>          * not its attr.disabled bit.  We hold the parent's mutex,
>          * so we won't race with perf_event_{en, dis}able_family.
> --
> 2.6.2
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 0/5] perf: Introduce instruction trace filtering
  2015-12-11 13:36 [PATCH v0 0/5] perf: Introduce instruction trace filtering Alexander Shishkin
                   ` (4 preceding siblings ...)
  2015-12-11 13:36 ` [PATCH v0 5/5] perf/x86/intel/pt: Add support for instruction trace filtering in PT Alexander Shishkin
@ 2015-12-11 21:38 ` Mathieu Poirier
  2015-12-14  8:50   ` Alexander Shishkin
  5 siblings, 1 reply; 38+ messages in thread
From: Mathieu Poirier @ 2015-12-11 21:38 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, vince,
	Stephane Eranian, Arnaldo Carvalho de Melo

On 11 December 2015 at 06:36, Alexander Shishkin
<alexander.shishkin@linux.intel.com> wrote:
> Hi Peter,
>
> Newer version of Intel PT supports address-based filtering, and this
> patchset adds support for it to perf core and the PT pmu driver. It
> works by configuring a number of address ranges in hardware and
> telling it to use these ranges to filter its traces. Similar feature
> also exists in ARM Coresight ETM/PTM and it is also taken into account
> in this patchset.
>
> Firstly, userspace configures filters via an ioctl(), filters are
> formatted as an ascii string. Filters may refer to addresses in object
> files for userspace code or kernel addresses. The latter might be
> extended in the future to support kernel modules.
>
> For userspace filters, we scan the task's vmas to see if any of them
> match the defined filters (inode+offset) and if they do, calculate
> memory offsets and program them into hardware. Note that since
> different tasks will have different mappings for the same object
> files, supporting cpu-wide events would require special tricks to
> context-switch filters for userspace code.
>
> Also, we monitor new mmap and exec events to update (or clear) filter
> configuration.
>
> This is based on my perf_mmap_close() patchset from yesterday [1], which
> in turn is based on your perf/core queue.
>
> [1] http://marc.info/?l=linux-kernel&m=144976438631073
>
> Alexander Shishkin (5):
>   perf: Move set_filter() from behind EVENT_TRACING
>   perf: Extend perf_event_aux() to optionally iterate through more
>     events
>   perf: Introduce instruction trace filtering
>   perf/x86/intel/pt: IP filtering register/cpuid bits
>   perf/x86/intel/pt: Add support for instruction trace filtering in PT
>
>  arch/x86/include/asm/msr-index.h          |  18 +
>  arch/x86/kernel/cpu/intel_pt.h            |  32 +-
>  arch/x86/kernel/cpu/perf_event_intel_pt.c | 134 +++++-
>  include/linux/perf_event.h                |  40 ++
>  kernel/events/core.c                      | 655 ++++++++++++++++++++++++++++--
>  5 files changed, 835 insertions(+), 44 deletions(-)
>
> --
> 2.6.2
>

Alex, Peter and al,

As I mentioned in a previous reply I think this patchset is aiming in
the right direction.  Here we are dealing with address range
filtering, something that is common to both IntelPT and CS, but what
happens when we want introduce options that aren't generic to all
tracers and still want to us the ioctl method?

Can we make the current scheme more extensible or generic so that
adding more architecture specific option is easily feasible?

Thanks for the consideration,
Mathieu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 17:13       ` Alexander Shishkin
@ 2015-12-11 22:39         ` Peter Zijlstra
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 22:39 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Mathieu Poirier

On Fri, Dec 11, 2015 at 07:13:10PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > On second thought; we should not inherit the filters at all.
> >
> > We should always use event->parent (if exists) for filters. Otherwise
> > inherited events will get different filters if you change the filter
> > after clone.
> 
> But children will have different mappings,

_can_ have.

> so the actual filter
> configurations will still differ between parents and children. I guess I
> could split the filter in two parts: one that's defined by the user and
> one that we calculated from vma addresses, that we later program into
> hardware.

/me confused, isn't that what you already do?

In any case, since inherited counters are uncontrollable (they have no
filedesc of their own) and you cannot a priory tell what a child will go
do, let alone a child of a child. It really makes no sense to have
different filters on different parts of the inherited tree.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 3/5] perf: Introduce instruction trace filtering
  2015-12-11 18:11   ` Mathieu Poirier
@ 2015-12-11 22:41     ` Peter Zijlstra
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2015-12-11 22:41 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince,
	Stephane Eranian, Arnaldo Carvalho de Melo

On Fri, Dec 11, 2015 at 11:11:48AM -0700, Mathieu Poirier wrote:
> On 11 December 2015 at 06:36, Alexander Shishkin

> >  /**
> > + * Instruction trace (ITRACE) filter
> > + */
> > +struct perf_itrace_filter {
> > +       struct list_head        entry;
> > +       struct rcu_head         rcu_head;
> > +       struct inode            *inode;
> > +       struct task_struct      *task;
> > +       unsigned long           offset;
> > +       unsigned long           size;
> > +       unsigned long           start;
> > +       unsigned long           end;
> > +       unsigned int            range   : 1, /* 1: range, 0: addr */
> > +                               filter  : 1, /* 1: filter/start, 0: stop */
> > +                               kernel  : 1; /* 1: kernel, 0: object file*/
> 
> I've seen a rant on the list before about bit fields like these... Do
> we gain anything with having them?  I personally avoid them to favour
> bool types but that will be for Peter to decide.  Given its
> importance, I also think this structure could you more documentation.

Bool doesn't have a well defined storage type (although the x86_64 ABI
defines one, the language itself does not), so I tend to shy away from
using it in structures since its very hard to tell what the layout will
be.

Agreed on the comments though.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 0/5] perf: Introduce instruction trace filtering
  2015-12-11 21:38 ` [PATCH v0 0/5] perf: Introduce instruction trace filtering Mathieu Poirier
@ 2015-12-14  8:50   ` Alexander Shishkin
  2015-12-15  0:25     ` Mathieu Poirier
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Shishkin @ 2015-12-14  8:50 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, vince,
	Stephane Eranian, Arnaldo Carvalho de Melo

Mathieu Poirier <mathieu.poirier@linaro.org> writes:

> On 11 December 2015 at 06:36, Alexander Shishkin
> <alexander.shishkin@linux.intel.com> wrote:
> Alex, Peter and al,
>
> As I mentioned in a previous reply I think this patchset is aiming in
> the right direction.  Here we are dealing with address range
> filtering, something that is common to both IntelPT and CS, but what
> happens when we want introduce options that aren't generic to all
> tracers and still want to us the ioctl method?

Are we still talking about filtering options or more like event
configuration? First, we need to understand if one particular feature or
an option should be enabled for the entire lifetime of an event or if it
should be configurable on the fly. For the former, you can use
attr.config and friends, for the latter it makes sense to use an ioctl.

If we're still talking about address range filtering, there is one
difference in coresight that I'm aware of, which is specifying
individual addresses as start/stop triggers (as opposed to enable
ranges) and that is already taken care of in the current parser, I
should probably write a comment to make it more apparent.

So my approach was to not consider them to be architecture specific
features, but simply features that either are or aren't supported by a
particular pmu.

> Can we make the current scheme more extensible or generic so that
> adding more architecture specific option is easily feasible?

Well, the bigger question is, how do you represent a very architecture
specific option in the core structures. I have one solution to that that
is described above. It shouldn't take much architecture-specific code to
handle each new option, unless it something that really only makes sense
for one architecture/pmu.

All that said, one could still extend the current code quite easily to
fit completely non-generic things. In the default clause, in the parser
function, one could add:

  if (state == IF_STATE_ACTION)
      if (event->pmu->itrace_filter_parse(event, &filter))
          goto fail;

if one really had to.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v0 0/5] perf: Introduce instruction trace filtering
  2015-12-14  8:50   ` Alexander Shishkin
@ 2015-12-15  0:25     ` Mathieu Poirier
  0 siblings, 0 replies; 38+ messages in thread
From: Mathieu Poirier @ 2015-12-15  0:25 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, vince,
	Stephane Eranian, Arnaldo Carvalho de Melo

On 14 December 2015 at 01:50, Alexander Shishkin
<alexander.shishkin@linux.intel.com> wrote:
> Mathieu Poirier <mathieu.poirier@linaro.org> writes:
>
>> On 11 December 2015 at 06:36, Alexander Shishkin
>> <alexander.shishkin@linux.intel.com> wrote:
>> Alex, Peter and al,
>>
>> As I mentioned in a previous reply I think this patchset is aiming in
>> the right direction.  Here we are dealing with address range
>> filtering, something that is common to both IntelPT and CS, but what
>> happens when we want introduce options that aren't generic to all
>> tracers and still want to us the ioctl method?
>
> Are we still talking about filtering options or more like event
> configuration? First, we need to understand if one particular feature or
> an option should be enabled for the entire lifetime of an event or if it
> should be configurable on the fly. For the former, you can use
> attr.config and friends, for the latter it makes sense to use an ioctl.
> If we're still talking about address range filtering, there is one
> difference in coresight that I'm aware of, which is specifying
> individual addresses as start/stop triggers (as opposed to enable
> ranges) and that is already taken care of in the current parser, I
> should probably write a comment to make it more apparent.

Yes, I saw that option.

>
> So my approach was to not consider them to be architecture specific
> features, but simply features that either are or aren't supported by a
> particular pmu.
>
>> Can we make the current scheme more extensible or generic so that
>> adding more architecture specific option is easily feasible?
>
> Well, the bigger question is, how do you represent a very architecture
> specific option in the core structures. I have one solution to that that
> is described above. It shouldn't take much architecture-specific code to
> handle each new option, unless it something that really only makes sense
> for one architecture/pmu.
>
> All that said, one could still extend the current code quite easily to
> fit completely non-generic things. In the default clause, in the parser
> function, one could add:
>
>   if (state == IF_STATE_ACTION)
>       if (event->pmu->itrace_filter_parse(event, &filter))
>           goto fail;
>
> if one really had to.

What I had in mind is more advanced tracing features like state
machines and counters but in hindsight I have no plans to play with
those any time soon.

What's more pressing is the need to be able to select the sink that
will gather the trace data from the perf cmd line tool.  At this time
it is a two step process that needs to be fixed.  At first I was
thinking about something like:

perf record -e cs_etm/sink:XYZ.tmc/

But from what I gathered it won't be easily feasible without changing
a lot of things - I think your ioctl() mechanism is much better for
something like that.  And sink selection a one time thing that doesn't
change throughout the trace session.

Based on what I found on the internet one would use ioctl() like this:

perf record -e intel_pt// --filter kernel:0x80000/0x1000

As such and thinking along the same lines I could fix my sink
enablement problem like this:

perf record -e intel_pt// --filter sink:XYZ.tmc

But a sink isn't a filter.  Maybe using the PERF_EVENT_IOC_SET_OUTPUT
would be a better choice but then again, a sink isn't a file.  So I'm
a little puzzled here.  Maybe Peter or Arnaldo would be able to advise
here...

Long story short my request to make things more generic can be put to rest.

Thanks,
Mathieu

>
>
> Regards,
> --
> Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2015-12-15  0:25 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-11 13:36 [PATCH v0 0/5] perf: Introduce instruction trace filtering Alexander Shishkin
2015-12-11 13:36 ` [PATCH v0 1/5] perf: Move set_filter() from behind EVENT_TRACING Alexander Shishkin
2015-12-11 13:36 ` [PATCH v0 2/5] perf: Extend perf_event_aux() to optionally iterate through more events Alexander Shishkin
2015-12-11 13:36 ` [PATCH v0 3/5] perf: Introduce instruction trace filtering Alexander Shishkin
2015-12-11 14:02   ` Peter Zijlstra
2015-12-11 14:20     ` Alexander Shishkin
2015-12-11 14:23     ` Mark Rutland
2015-12-11 14:52       ` Peter Zijlstra
2015-12-11 15:12         ` Alexander Shishkin
2015-12-11 14:56   ` Peter Zijlstra
2015-12-11 15:14     ` Alexander Shishkin
2015-12-11 15:00   ` Peter Zijlstra
2015-12-11 15:17     ` Alexander Shishkin
2015-12-11 15:34       ` Peter Zijlstra
2015-12-11 16:06     ` Alexander Shishkin
2015-12-11 15:02   ` Peter Zijlstra
2015-12-11 15:20     ` Peter Zijlstra
2015-12-11 15:27     ` Alexander Shishkin
2015-12-11 15:33       ` Peter Zijlstra
2015-12-11 15:48         ` Alexander Shishkin
2015-12-11 16:01           ` Peter Zijlstra
2015-12-11 17:02             ` Peter Zijlstra
2015-12-11 15:09   ` Peter Zijlstra
2015-12-11 15:12   ` Peter Zijlstra
2015-12-11 15:28   ` Peter Zijlstra
2015-12-11 17:00     ` Peter Zijlstra
2015-12-11 17:13       ` Alexander Shishkin
2015-12-11 22:39         ` Peter Zijlstra
2015-12-11 16:59   ` Peter Zijlstra
2015-12-11 17:15     ` Alexander Shishkin
2015-12-11 18:11   ` Mathieu Poirier
2015-12-11 22:41     ` Peter Zijlstra
2015-12-11 13:36 ` [PATCH v0 4/5] perf/x86/intel/pt: IP filtering register/cpuid bits Alexander Shishkin
2015-12-11 13:36 ` [PATCH v0 5/5] perf/x86/intel/pt: Add support for instruction trace filtering in PT Alexander Shishkin
2015-12-11 18:06   ` Mathieu Poirier
2015-12-11 21:38 ` [PATCH v0 0/5] perf: Introduce instruction trace filtering Mathieu Poirier
2015-12-14  8:50   ` Alexander Shishkin
2015-12-15  0:25     ` Mathieu Poirier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).