* [RFC PATCH 0/6] perf: Add AUX data sampling
@ 2016-09-23 11:27 Alexander Shishkin
  2016-09-23 11:27 ` [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation Alexander Shishkin
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-23 11:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak, Alexander Shishkin

Hi Peter,

This is an RFC, I'm not sending the tooling bits in this series,
although they can be found here [1].

This series introduces AUX data sampling for perf events, which in
case of our instruction/branch tracing PMUs like Intel PT, BTS, CS
ETM means execution flow history leading up to a perf event's
overflow.

The bulk of the code is in 4/6, which adds attribute fields, creates
kernel events to generate the AUX data, takes samples and takes care
of all the tricky details. 1/6 and 6/6 may also be considered separately from
this series. In particular, I suspect that 6/6 applies today to the
architectures that deliver PMIs as IRQs. Mathieu?

[1] https://git.kernel.org/cgit/linux/kernel/git/ash/linux.git/log/?h=perf-aux-sampling

Alexander Shishkin (6):
  perf: Move mlock accounting to ring buffer allocation
  perf: Add api to (de-)allocate AUX buffers for kernel counters
  perf: Add a helper for looking up pmus by type
  perf: Add infrastructure for using AUX data in perf samples
  perf: Disable PMU around address filter adjustment
  perf: Disable IRQs in address filter sync path

 include/linux/perf_event.h      |  12 ++
 include/uapi/linux/perf_event.h |  16 +-
 kernel/events/core.c            | 419 +++++++++++++++++++++++++++++++++-------
 kernel/events/internal.h        |  24 ++-
 kernel/events/ring_buffer.c     | 210 ++++++++++++++++++--
 5 files changed, 598 insertions(+), 83 deletions(-)

-- 
2.9.3


* [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation
  2016-09-23 11:27 [RFC PATCH 0/6] perf: Add AUX data sampling Alexander Shishkin
@ 2016-09-23 11:27 ` Alexander Shishkin
  2016-09-23 12:14   ` Peter Zijlstra
  2016-09-23 11:27 ` [RFC PATCH 2/6] perf: Add api to (de-)allocate AUX buffers for kernel counters Alexander Shishkin
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-23 11:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak, Alexander Shishkin

In order to be able to allocate perf ring buffers in non-mmap path, we
need to make sure we can still account the memory to the user and that
they don't exceed their mlock limit.

This patch moves ring buffer memory accounting down the rb_alloc() path
so that its callers won't have to worry about it. This also serves the
additional purpose of slightly cleaning up perf_mmap().

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c        |  67 +++--------------------
 kernel/events/internal.h    |   5 +-
 kernel/events/ring_buffer.c | 127 ++++++++++++++++++++++++++++++++++++++------
 3 files changed, 123 insertions(+), 76 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7c0d263f6b..2e8a0e389b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4933,6 +4933,8 @@ void ring_buffer_put(struct ring_buffer *rb)
 	if (!atomic_dec_and_test(&rb->refcount))
 		return;
 
+	ring_buffer_unaccount(rb, false);
+
 	WARN_ON_ONCE(!list_empty(&rb->event_list));
 
 	call_rcu(&rb->rcu_head, rb_free_rcu);
@@ -4967,9 +4969,6 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	struct perf_event *event = vma->vm_file->private_data;
 
 	struct ring_buffer *rb = ring_buffer_get(event);
-	struct user_struct *mmap_user = rb->mmap_user;
-	int mmap_locked = rb->mmap_locked;
-	unsigned long size = perf_data_size(rb);
 
 	if (event->pmu->event_unmapped)
 		event->pmu->event_unmapped(event);
@@ -4989,11 +4988,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 		 */
 		perf_pmu_output_stop(event);
 
-		/* now it's safe to free the pages */
-		atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
-		vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
-
-		/* this has to be the last one */
+		/* now it's safe to free the pages; ought to be the last one */
 		rb_free_aux(rb);
 		WARN_ON_ONCE(atomic_read(&rb->aux_refcount));
 
@@ -5054,19 +5049,6 @@ again:
 	}
 	rcu_read_unlock();
 
-	/*
-	 * It could be there's still a few 0-ref events on the list; they'll
-	 * get cleaned up by free_event() -- they'll also still have their
-	 * ref on the rb and will free it whenever they are done with it.
-	 *
-	 * Aside from that, this buffer is 'fully' detached and unmapped,
-	 * undo the VM accounting.
-	 */
-
-	atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
-	vma->vm_mm->pinned_vm -= mmap_locked;
-	free_uid(mmap_user);
-
 out_put:
 	ring_buffer_put(rb); /* could be last */
 }
@@ -5081,13 +5063,9 @@ static const struct vm_operations_struct perf_mmap_vmops = {
 static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct perf_event *event = file->private_data;
-	unsigned long user_locked, user_lock_limit;
-	struct user_struct *user = current_user();
-	unsigned long locked, lock_limit;
 	struct ring_buffer *rb = NULL;
 	unsigned long vma_size;
 	unsigned long nr_pages;
-	long user_extra = 0, extra = 0;
 	int ret = 0, flags = 0;
 
 	/*
@@ -5158,7 +5136,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		}
 
 		atomic_set(&rb->aux_mmap_count, 1);
-		user_extra = nr_pages;
 
 		goto accounting;
 	}
@@ -5195,49 +5172,24 @@ again:
 		goto unlock;
 	}
 
-	user_extra = nr_pages + 1;
-
 accounting:
-	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
-
-	/*
-	 * Increase the limit linearly with more CPUs:
-	 */
-	user_lock_limit *= num_online_cpus();
-
-	user_locked = atomic_long_read(&user->locked_vm) + user_extra;
-
-	if (user_locked > user_lock_limit)
-		extra = user_locked - user_lock_limit;
-
-	lock_limit = rlimit(RLIMIT_MEMLOCK);
-	lock_limit >>= PAGE_SHIFT;
-	locked = vma->vm_mm->pinned_vm + extra;
-
-	if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
-		!capable(CAP_IPC_LOCK)) {
-		ret = -EPERM;
-		goto unlock;
-	}
-
 	WARN_ON(!rb && event->rb);
 
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
 
 	if (!rb) {
-		rb = rb_alloc(nr_pages,
+		rb = rb_alloc(vma->vm_mm, nr_pages,
 			      event->attr.watermark ? event->attr.wakeup_watermark : 0,
 			      event->cpu, flags);
 
-		if (!rb) {
-			ret = -ENOMEM;
+		if (IS_ERR_OR_NULL(rb)) {
+			ret = PTR_ERR(rb);
+			rb = NULL;
 			goto unlock;
 		}
 
 		atomic_set(&rb->mmap_count, 1);
-		rb->mmap_user = get_current_user();
-		rb->mmap_locked = extra;
 
 		ring_buffer_attach(event, rb);
 
@@ -5246,15 +5198,10 @@ accounting:
 	} else {
 		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
 				   event->attr.aux_watermark, flags);
-		if (!ret)
-			rb->aux_mmap_locked = extra;
 	}
 
 unlock:
 	if (!ret) {
-		atomic_long_add(user_extra, &user->locked_vm);
-		vma->vm_mm->pinned_vm += extra;
-
 		atomic_inc(&event->mmap_count);
 	} else if (rb) {
 		atomic_dec(&rb->mmap_count);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 486fd78eb8..a7ce82b670 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -36,6 +36,7 @@ struct ring_buffer {
 	atomic_t			mmap_count;
 	unsigned long			mmap_locked;
 	struct user_struct		*mmap_user;
+	struct mm_struct		*mmap_mapping;
 
 	/* AUX area */
 	local_t				aux_head;
@@ -56,6 +57,7 @@ struct ring_buffer {
 };
 
 extern void rb_free(struct ring_buffer *rb);
+extern void ring_buffer_unaccount(struct ring_buffer *rb, bool aux);
 
 static inline void rb_free_rcu(struct rcu_head *rcu_head)
 {
@@ -74,7 +76,8 @@ static inline void rb_toggle_paused(struct ring_buffer *rb, bool pause)
 }
 
 extern struct ring_buffer *
-rb_alloc(int nr_pages, long watermark, int cpu, int flags);
+rb_alloc(struct mm_struct *mm, int nr_pages, long watermark, int cpu,
+	 int flags);
 extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, long watermark, int flags);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 257fa460b8..484ce09d96 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -496,6 +496,88 @@ void *perf_get_aux(struct perf_output_handle *handle)
 	return handle->rb->aux_priv;
 }
 
+/*
+ * Check if the current user can afford @nr_pages, considering the
+ * perf_event_mlock sysctl and their mlock limit. If the former is exceeded,
+ * pin the remainder on their mm, if the latter is not sufficient either,
+ * error out. Otherwise, keep track of the pages used in the ring_buffer so
+ * that the accounting can be undone when the pages are freed.
+ */
+static int ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
+			       unsigned long nr_pages, bool aux)
+{
+	unsigned long total, limit, pinned;
+
+	if (!mm)
+		mm = rb->mmap_mapping;
+
+	rb->mmap_user = current_user();
+
+	limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
+
+	/*
+	 * Increase the limit linearly with more CPUs:
+	 */
+	limit *= num_online_cpus();
+
+	total = atomic_long_read(&rb->mmap_user->locked_vm) + nr_pages;
+
+	pinned = 0;
+	if (total > limit) {
+		/*
+		 * Everything that's over the sysctl_perf_event_mlock
+		 * limit needs to be accounted to the consumer's mm.
+		 */
+		if (!mm)
+			return -EPERM;
+
+		pinned = total - limit;
+
+		limit = rlimit(RLIMIT_MEMLOCK);
+		limit >>= PAGE_SHIFT;
+		total = mm->pinned_vm + pinned;
+
+		if ((total > limit) && perf_paranoid_tracepoint_raw() &&
+		    !capable(CAP_IPC_LOCK)) {
+			return -EPERM;
+		}
+
+		if (aux)
+			rb->aux_mmap_locked = pinned;
+		else
+			rb->mmap_locked = pinned;
+
+		mm->pinned_vm += pinned;
+	}
+
+	if (!rb->mmap_mapping)
+		rb->mmap_mapping = mm;
+
+	/* account for user page */
+	if (!aux)
+		nr_pages++;
+
+	rb->mmap_user = get_current_user();
+	atomic_long_add(nr_pages, &rb->mmap_user->locked_vm);
+
+	return 0;
+}
+
+/*
+ * Undo the mlock pages accounting done in ring_buffer_account().
+ */
+void ring_buffer_unaccount(struct ring_buffer *rb, bool aux)
+{
+	unsigned long nr_pages = aux ? rb->aux_nr_pages : rb->nr_pages + 1;
+	unsigned long pinned = aux ? rb->aux_mmap_locked : rb->mmap_locked;
+
+	atomic_long_sub(nr_pages, &rb->mmap_user->locked_vm);
+	if (rb->mmap_mapping)
+		rb->mmap_mapping->pinned_vm -= pinned;
+
+	free_uid(rb->mmap_user);
+}
+
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
 static struct page *rb_alloc_aux_page(int node, int order)
@@ -565,11 +647,16 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 {
 	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
 	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
-	int ret = -ENOMEM, max_order = 0;
+	int ret, max_order = 0;
 
 	if (!has_aux(event))
 		return -ENOTSUPP;
 
+	ret = ring_buffer_account(rb, NULL, nr_pages, true);
+	if (ret)
+		return ret;
+
+	ret = -ENOMEM;
 	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
 		/*
 		 * We need to start with the max_order that fits in nr_pages,
@@ -655,8 +742,11 @@ out:
 
 void rb_free_aux(struct ring_buffer *rb)
 {
-	if (atomic_dec_and_test(&rb->aux_refcount))
+	if (atomic_dec_and_test(&rb->aux_refcount)) {
+		ring_buffer_unaccount(rb, true);
+
 		__rb_free_aux(rb);
+	}
 }
 
 #ifndef CONFIG_PERF_USE_VMALLOC
@@ -690,19 +780,22 @@ static void *perf_mmap_alloc_page(int cpu)
 	return page_address(page);
 }
 
-struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
+struct ring_buffer *rb_alloc(struct mm_struct *mm, int nr_pages, long watermark,
+			     int cpu, int flags)
 {
+	unsigned long size = offsetof(struct ring_buffer, data_pages[nr_pages]);
 	struct ring_buffer *rb;
-	unsigned long size;
-	int i;
-
-	size = sizeof(struct ring_buffer);
-	size += nr_pages * sizeof(void *);
+	int i, ret = -ENOMEM;
 
 	rb = kzalloc(size, GFP_KERNEL);
 	if (!rb)
 		goto fail;
 
+	ret = ring_buffer_account(rb, mm, nr_pages, false);
+	if (ret)
+		goto fail;
+
+	ret = -ENOMEM;
 	rb->user_page = perf_mmap_alloc_page(cpu);
 	if (!rb->user_page)
 		goto fail_user_page;
@@ -729,7 +822,7 @@ fail_user_page:
 	kfree(rb);
 
 fail:
-	return NULL;
+	return ERR_PTR(ret);
 }
 
 static void perf_mmap_free_page(unsigned long addr)
@@ -796,19 +889,23 @@ void rb_free(struct ring_buffer *rb)
 	schedule_work(&rb->work);
 }
 
-struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
+struct ring_buffer *rb_alloc(struct mm_struct *mm, int nr_pages, long watermark,
+			     int cpu, int flags)
 {
+	unsigned long size = offsetof(struct ring_buffer, data_pages[1]);
 	struct ring_buffer *rb;
-	unsigned long size;
 	void *all_buf;
-
-	size = sizeof(struct ring_buffer);
-	size += sizeof(void *);
+	int ret = -ENOMEM;
 
 	rb = kzalloc(size, GFP_KERNEL);
 	if (!rb)
 		goto fail;
 
+	ret = ring_buffer_account(rb, mm, nr_pages, false);
+	if (ret)
+		goto fail;
+
+	ret = -ENOMEM;
 	INIT_WORK(&rb->work, rb_free_work);
 
 	all_buf = vmalloc_user((nr_pages + 1) * PAGE_SIZE);
@@ -830,7 +927,7 @@ fail_all_buf:
 	kfree(rb);
 
 fail:
-	return NULL;
+	return ERR_PTR(ret);
 }
 
 #endif
-- 
2.9.3


* [RFC PATCH 2/6] perf: Add api to (de-)allocate AUX buffers for kernel counters
  2016-09-23 11:27 [RFC PATCH 0/6] perf: Add AUX data sampling Alexander Shishkin
  2016-09-23 11:27 ` [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation Alexander Shishkin
@ 2016-09-23 11:27 ` Alexander Shishkin
  2016-09-23 11:27 ` [RFC PATCH 3/6] perf: Add a helper for looking up pmus by type Alexander Shishkin
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-23 11:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak, Alexander Shishkin

Several use cases for AUX data require different parts of the kernel to be
able to configure hardware to produce AUX data and to collect it when it is
needed, namely: event sampling (including a piece of AUX data in a perf
event sample, so that the user can get, for example, instruction traces
leading up to a certain event like a breakpoint or a hardware event),
process core dumps (providing the user with a history of a process'
instruction flow leading up to a crash), system crash dumps, and storing
AUX data in pstore across a reboot (to facilitate post-mortem investigation
of a system crash).

Luckily, there is already an API for in-kernel perf events, which has several
users. This proposal is to extend that API to allow in-kernel users to
allocate AUX buffers for kernel counters. Such users will call
rb_alloc_kernel() to allocate what they want and later copy the data out to
other backends, e.g. a sample in another event's ring buffer or a core dump
file. These buffers are never mapped to userspace.

There are no additional constraints or requirements on the PMU drivers.

A typical user of this interface will first create a kernel counter with a
call to perf_event_create_kernel_counter() and then allocate a ring buffer
for it with rb_alloc_kernel(). Data can then be copied out from the AUX
buffer using rb_output_aux(), which is passed a callback that will write
chunks of AUX data into the desired destination, such as perf_output_copy()
or dump_emit(). The caller needs to use perf_event_disable() to make sure
that the counter is not active while the data is copied out.
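
For illustration, a minimal sketch of the flow described above (the rb_*
API is only declared in kernel/events/internal.h, so this assumes code that
can include it; the destination structure, the callback and the page counts
are made up, and locking/error handling are trimmed):

#include <linux/err.h>
#include <linux/string.h>
#include <linux/perf_event.h>
#include "internal.h"	/* rb_alloc_kernel(), rb_output_aux(), rb_free_kernel() */

struct aux_dump {
	void		*buf;	/* destination, large enough for the AUX data */
	unsigned long	pos;
};

/* aux_copyfn: return the number of bytes that could NOT be copied; 0 == success */
static unsigned long copy_to_buf(void *data, const void *src, unsigned long len)
{
	struct aux_dump *d = data;

	memcpy(d->buf + d->pos, src, len);
	d->pos += len;

	return 0;
}

static int aux_dump_example(struct perf_event_attr *attr, int cpu, void *buf)
{
	struct aux_dump d = { .buf = buf };
	struct perf_event *event;
	struct ring_buffer *rb;
	long copied;
	int ret;

	event = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
	if (IS_ERR(event))
		return PTR_ERR(event);

	/* 0 data pages, 16 AUX pages; both numbers are arbitrary here */
	ret = rb_alloc_kernel(event, 0, 16);
	if (ret)
		goto out_release;

	/* ... let the counter run and collect AUX data ... */

	/* the counter must not be active while the data is copied out */
	perf_event_disable(event);

	rb = event->rb;
	copied = rb_output_aux(rb, 0, local_read(&rb->aux_head),
			       copy_to_buf, &d);
	if (copied < 0)
		ret = copied;

	rb_free_kernel(rb, event);
out_release:
	perf_event_release_kernel(event);
	return ret;
}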

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/internal.h    | 19 +++++++++++
 kernel/events/ring_buffer.c | 83 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 102 insertions(+)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index a7ce82b670..4ae300ee02 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -56,6 +56,9 @@ struct ring_buffer {
 	void				*data_pages[0];
 };
 
+typedef unsigned long (*aux_copyfn)(void *data, const void *src,
+				    unsigned long len);
+
 extern void rb_free(struct ring_buffer *rb);
 extern void ring_buffer_unaccount(struct ring_buffer *rb, bool aux);
 
@@ -82,6 +85,11 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb);
+extern long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+			  unsigned long to, aux_copyfn copyfn, void *data);
+extern int rb_alloc_kernel(struct perf_event *event, int nr_pages,
+			   int aux_nr_pages);
+extern void rb_free_kernel(struct ring_buffer *rb, struct perf_event *event);
 extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
 extern void ring_buffer_put(struct ring_buffer *rb);
 
@@ -126,6 +134,17 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb)
 	return rb->aux_nr_pages << PAGE_SHIFT;
 }
 
+static inline bool kernel_rb_event(struct perf_event *event)
+{
+	/*
+	 * Having a ring buffer and not being on any ring buffers' wakeup
+	 * list means it was attached by rb_alloc_kernel() and not
+	 * ring_buffer_attach(). It's the only case when these two
+	 * conditions take place at the same time.
+	 */
+	return event->rb && list_empty(&event->rb_entry);
+}
+
 #define __DEFINE_OUTPUT_COPY_BODY(advance_buf, memcpy_func, ...)	\
 {									\
 	unsigned long size, written;					\
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 484ce09d96..9244a4fa9b 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -578,6 +578,40 @@ void ring_buffer_unaccount(struct ring_buffer *rb, bool aux)
 	free_uid(rb->mmap_user);
 }
 
+/*
+ * Copy out AUX data from a ring_buffer using a supplied callback.
+ */
+long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+		   unsigned long to, aux_copyfn copyfn, void *data)
+{
+	unsigned long tocopy, remainder, len = 0;
+	void *addr;
+
+	from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	to &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+
+	do {
+		tocopy = PAGE_SIZE - offset_in_page(from);
+		if (to > from)
+			tocopy = min(tocopy, to - from);
+		if (!tocopy)
+			break;
+
+		addr = rb->aux_pages[from >> PAGE_SHIFT];
+		addr += offset_in_page(from);
+
+		remainder = copyfn(data, addr, tocopy);
+		if (remainder)
+			return -EFAULT;
+
+		len += tocopy;
+		from += tocopy;
+		from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	} while (to != from);
+
+	return len;
+}
+
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
 static struct page *rb_alloc_aux_page(int node, int order)
@@ -749,6 +783,55 @@ void rb_free_aux(struct ring_buffer *rb)
 	}
 }
 
+/*
+ * Allocate a ring_buffer for a kernel event and attach it to this event.
+ * This ring_buffer will not participate in mmap operations and set_output,
+ * so ring_buffer_attach() and related complications do not apply.
+ */
+int rb_alloc_kernel(struct perf_event *event, int nr_pages, int aux_nr_pages)
+{
+	struct ring_buffer *rb;
+	int ret, pgoff = nr_pages + 1;
+
+	/*
+	 * Use overwrite mode (!RING_BUFFER_WRITABLE) for both data and aux
+	 * areas as we don't want wakeups or interrupts.
+	 */
+	rb = rb_alloc(NULL, nr_pages, 0, event->cpu, 0);
+	if (IS_ERR(rb))
+		return PTR_ERR(rb);
+
+	ret = rb_alloc_aux(rb, event, pgoff, aux_nr_pages, 0, 0);
+	if (ret) {
+		rb_free(rb);
+		return ret;
+	}
+
+	/*
+	 * These buffers never get mmapped; so the only use of the
+	 * aux_mmap_count is to enable AUX transactions
+	 * (see perf_aux_output_begin()).
+	 */
+	atomic_set(&rb->aux_mmap_count, 1);
+
+	/*
+	 * Kernel counters don't need ring buffer wakeups, therefore we don't
+	 * use ring_buffer_attach() here and event->rb_entry stays empty.
+	 */
+	rcu_assign_pointer(event->rb, rb);
+
+	return 0;
+}
+
+void rb_free_kernel(struct ring_buffer *rb, struct perf_event *event)
+{
+	WARN_ON_ONCE(atomic_read(&rb->refcount) != 1);
+	atomic_set(&rb->aux_mmap_count, 0);
+	rcu_assign_pointer(event->rb, NULL);
+	rb_free_aux(rb);
+	rb_free(rb);
+}
+
 #ifndef CONFIG_PERF_USE_VMALLOC
 
 /*
-- 
2.9.3


* [RFC PATCH 3/6] perf: Add a helper for looking up pmus by type
  2016-09-23 11:27 [RFC PATCH 0/6] perf: Add AUX data sampling Alexander Shishkin
  2016-09-23 11:27 ` [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation Alexander Shishkin
  2016-09-23 11:27 ` [RFC PATCH 2/6] perf: Add api to (de-)allocate AUX buffers for kernel counters Alexander Shishkin
@ 2016-09-23 11:27 ` Alexander Shishkin
  2016-09-23 11:27 ` [RFC PATCH 4/6] perf: Add infrastructure for using AUX data in perf samples Alexander Shishkin
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-23 11:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak, Alexander Shishkin

This patch adds a helper for looking up a registered pmu by its type.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2e8a0e389b..b64a5c611f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8780,6 +8780,18 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 	return ret;
 }
 
+/* call under pmus_srcu */
+static struct pmu *__perf_find_pmu(u32 type)
+{
+	struct pmu *pmu;
+
+	rcu_read_lock();
+	pmu = idr_find(&pmu_idr, type);
+	rcu_read_unlock();
+
+	return pmu;
+}
+
 static struct pmu *perf_init_event(struct perf_event *event)
 {
 	struct pmu *pmu = NULL;
@@ -8788,9 +8800,7 @@ static struct pmu *perf_init_event(struct perf_event *event)
 
 	idx = srcu_read_lock(&pmus_srcu);
 
-	rcu_read_lock();
-	pmu = idr_find(&pmu_idr, event->attr.type);
-	rcu_read_unlock();
+	pmu = __perf_find_pmu(event->attr.type);
 	if (pmu) {
 		ret = perf_try_init_event(pmu, event);
 		if (ret)
-- 
2.9.3


* [RFC PATCH 4/6] perf: Add infrastructure for using AUX data in perf samples
  2016-09-23 11:27 [RFC PATCH 0/6] perf: Add AUX data sampling Alexander Shishkin
                   ` (2 preceding siblings ...)
  2016-09-23 11:27 ` [RFC PATCH 3/6] perf: Add a helper for looking up pmus by type Alexander Shishkin
@ 2016-09-23 11:27 ` Alexander Shishkin
  2016-09-23 11:27 ` [RFC PATCH 5/6] perf: Disable PMU around address filter adjustment Alexander Shishkin
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-23 11:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak, Alexander Shishkin

AUX data can be used to annotate perf events such as performance counters
or tracepoints/breakpoints by including it in sample records when the
PERF_SAMPLE_AUX flag is set. Such samples would be instrumental in debugging
and profiling by providing, for example, a history of the instruction flow
leading up to the event's overflow.

To facilitate this, this patch adds code to create a kernel counter with a
ring buffer to track and collect AUX data that is then copied out into the
sampled events' perf data stream as samples.

The user interface is extended to allow for this; new attribute fields are
added:

  * aux_sample_type: specifies the PMU on which the AUX-data-generating
                     event is created;
  * aux_sample_config: event config (maps to the attribute's config field);
  * aux_sample_size: size of the sample to be written.

This kernel counter is configured similarly to the event that is being
annotated with regard to filtering (exclude_{hv,idle,user,kernel}) and
enabled state (disabled, enable_on_exec) to make sure that the sampler
is not tracking any out-of-context activity. One sampler can be used
for multiple events.
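
For illustration, a hypothetical userspace sketch of filling in the new
fields (it assumes the patched uapi header; the intel_pt PMU type would be
read from /sys/bus/event_source/devices/intel_pt/type at runtime, and the
config, period and sample size values are arbitrary):

#include <linux/perf_event.h>	/* patched header with the aux_sample_* fields */
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_sampled_event(__u32 intel_pt_type)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size              = sizeof(attr);	/* PERF_ATTR_SIZE_VER6 */
	attr.type              = PERF_TYPE_HARDWARE;
	attr.config            = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period     = 100000;
	attr.sample_type       = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_AUX;
	attr.aux_sample_type   = intel_pt_type;	/* pmu::type of the AUX PMU */
	attr.aux_sample_config = 0;		/* becomes the sampler's attr.config */
	attr.aux_sample_size   = 8192;		/* bytes of AUX data per sample */

	/* monitor the calling task on any CPU */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}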

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |  12 ++
 include/uapi/linux/perf_event.h |  16 +-
 kernel/events/core.c            | 315 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 341 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5c5362584a..7121cf7b5c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -101,6 +101,12 @@ struct perf_branch_stack {
 	struct perf_branch_entry	entries[0];
 };
 
+struct perf_aux_record {
+	u64		size;
+	unsigned long	from;
+	unsigned long	to;
+};
+
 struct task_struct;
 
 /*
@@ -532,6 +538,7 @@ struct swevent_hlist {
 #define PERF_ATTACH_GROUP	0x02
 #define PERF_ATTACH_TASK	0x04
 #define PERF_ATTACH_TASK_DATA	0x08
+#define PERF_ATTACH_SAMPLING	0x10
 
 struct perf_cgroup;
 struct ring_buffer;
@@ -691,6 +698,9 @@ struct perf_event {
 	perf_overflow_handler_t		overflow_handler;
 	void				*overflow_handler_context;
 
+	struct perf_event		*aux_sampler;
+	atomic_long_t			aux_samplees_count;
+
 #ifdef CONFIG_EVENT_TRACING
 	struct trace_event_call		*tp_event;
 	struct event_filter		*filter;
@@ -888,6 +898,7 @@ struct perf_sample_data {
 	 */
 	u64				addr;
 	struct perf_raw_record		*raw;
+	struct perf_aux_record		aux;
 	struct perf_branch_stack	*br_stack;
 	u64				period;
 	u64				weight;
@@ -937,6 +948,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	/* remaining struct members initialized in perf_prepare_sample() */
 	data->addr = addr;
 	data->raw  = NULL;
+	data->aux.from = data->aux.to = data->aux.size = 0;
 	data->br_stack = NULL;
 	data->period = period;
 	data->weight = 0;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c66a485a24..1bf3f2c358 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -139,8 +139,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_REGS_INTR			= 1U << 18,
+	PERF_SAMPLE_AUX				= 1U << 19,
 
-	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
 };
 
 /*
@@ -273,6 +274,9 @@ enum perf_event_read_format {
 					/* add: sample_stack_user */
 #define PERF_ATTR_SIZE_VER4	104	/* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5	112	/* add: aux_watermark */
+#define PERF_ATTR_SIZE_VER6	136	/* add: aux_sample_type */
+					/* add: aux_sample_config */
+					/* add: aux_sample_size */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -390,6 +394,14 @@ struct perf_event_attr {
 	__u32	aux_watermark;
 	__u16	sample_max_stack;
 	__u16	__reserved_2;	/* align to __u64 */
+
+	/*
+	 * AUX area sampling configuration
+	 */
+	__u64	aux_sample_config;	/* event config for AUX sampling */
+	__u64	aux_sample_size;	/* desired sample size */
+	__u32	aux_sample_type;	/* pmu::type of an AUX PMU */
+	__u32	__reserved_3;		/* align to __u64 */
 };
 
 #define perf_flags(attr)	(*(&(attr)->read_format + 1))
@@ -773,6 +785,8 @@ enum perf_event_type {
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			abi; # enum perf_sample_regs_abi
 	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+	 *	{ u64			size;
+	 *	  char			data[size]; } && PERF_SAMPLE_AUX
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b64a5c611f..fdb20fdeb1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2422,6 +2422,25 @@ static void _perf_event_enable(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
 
+	if (event->aux_sampler) {
+		struct perf_event_context *sctx = event->aux_sampler->ctx;
+
+		lockdep_assert_held(&ctx->mutex);
+
+		if (sctx != ctx) {
+			sctx = perf_event_ctx_lock_nested(event->aux_sampler,
+							  SINGLE_DEPTH_NESTING);
+			if (WARN_ON_ONCE(!sctx))
+				goto done;
+		}
+
+		_perf_event_enable(event->aux_sampler);
+
+		if (sctx != ctx)
+			perf_event_ctx_unlock(event->aux_sampler, sctx);
+	}
+
+done:
 	raw_spin_lock_irq(&ctx->lock);
 	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
 	    event->state <  PERF_EVENT_STATE_ERROR) {
@@ -3855,6 +3874,8 @@ static void unaccount_freq_event(void)
 		atomic_dec(&nr_freq_events);
 }
 
+static void perf_aux_sampler_fini(struct perf_event *event);
+
 static void unaccount_event(struct perf_event *event)
 {
 	bool dec = false;
@@ -3886,6 +3907,9 @@ static void unaccount_event(struct perf_event *event)
 			schedule_delayed_work(&perf_sched_work, HZ);
 	}
 
+	if ((event->attr.sample_type & PERF_SAMPLE_AUX))
+		perf_aux_sampler_fini(event);
+
 	unaccount_event_cpu(event, event->cpu);
 
 	unaccount_pmu_sb_event(event);
@@ -3993,6 +4017,23 @@ static void _free_event(struct perf_event *event)
 
 	unaccount_event(event);
 
+	if (kernel_rb_event(event)) {
+		struct perf_event_context *ctx = event->ctx;
+		unsigned long flags;
+
+		/*
+		 * This event may not be explicitly freed by
+		 * perf_event_release_kernel(), we still need to remove it
+		 * from its context.
+		 */
+		raw_spin_lock_irqsave(&ctx->lock, flags);
+		list_del_event(event, ctx);
+		raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+		ring_buffer_unaccount(event->rb, false);
+		rb_free_kernel(event->rb, event);
+	}
+
 	if (event->rb) {
 		/*
 		 * Can happen when we close an event with re-directed output.
@@ -5455,6 +5496,232 @@ perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
 	}
 }
 
+struct perf_event *__find_sampling_counter(struct perf_event_context *ctx,
+					   struct perf_event *event,
+					   struct task_struct *task)
+{
+	struct perf_event *sampler = NULL;
+
+	list_for_each_entry(sampler, &ctx->event_list, event_entry) {
+		if (kernel_rb_event(sampler) &&
+		    sampler->cpu                  == event->cpu &&
+		    sampler->attr.type            == event->attr.aux_sample_type &&
+		    sampler->attr.config          == event->attr.aux_sample_config &&
+		    sampler->attr.exclude_hv      == event->attr.exclude_hv &&
+		    sampler->attr.exclude_idle    == event->attr.exclude_idle &&
+		    sampler->attr.exclude_user    == event->attr.exclude_user &&
+		    sampler->attr.exclude_kernel  == event->attr.exclude_kernel &&
+		    sampler->attr.aux_sample_size >= event->attr.aux_sample_size &&
+		    atomic_long_inc_not_zero(&sampler->refcount))
+			return sampler;
+	}
+
+	return NULL;
+}
+
+struct perf_event *find_sampling_counter(struct pmu *pmu,
+					 struct perf_event *event,
+					 struct task_struct *task)
+{
+	struct perf_event *sampler = NULL;
+	struct perf_cpu_context *cpuctx;
+	struct perf_event_context *ctx;
+	unsigned long flags;
+
+	if (!task) {
+		if (!cpu_online(event->cpu))
+			return NULL;
+
+		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, event->cpu);
+		ctx = &cpuctx->ctx;
+		raw_spin_lock_irqsave(&ctx->lock, flags);
+	} else {
+		ctx = perf_lock_task_context(task, pmu->task_ctx_nr, &flags);
+
+		if (!ctx)
+			return NULL;
+	}
+
+	sampler = __find_sampling_counter(ctx, event, task);
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return sampler;
+}
+
+/*
+ * Sampling AUX data in perf events is done by means of a kernel event that
+ * collects data to its own ring_buffer. This data gets copied out into sampled
+ * event's SAMPLE_AUX records every time the sampled event overflows. One such
+ * kernel event (sampler) can be used to provide samples for multiple events
+ * (samplees) on the same context if their attributes match. Each samplee
+ * holds a reference to the sampler event; the last one out frees the sampler;
+ * perf_event_exit_task() is instructed not to free samplers directly.
+ */
+static int perf_aux_sampler_init(struct perf_event *event,
+				 struct task_struct *task,
+				 struct pmu *pmu)
+{
+	struct perf_event_attr attr;
+	struct perf_event *sampler;
+	unsigned long nr_pages;
+	int ret;
+
+	if (!pmu || !pmu->setup_aux)
+		return -ENOTSUPP;
+
+	sampler = find_sampling_counter(pmu, event, task);
+	if (!sampler) {
+		memset(&attr, 0, sizeof(attr));
+		attr.type            = pmu->type;
+		attr.config          = event->attr.aux_sample_config;
+		attr.disabled        = 1; /* see below */
+		attr.enable_on_exec  = event->attr.enable_on_exec;
+		attr.exclude_hv      = event->attr.exclude_hv;
+		attr.exclude_idle    = event->attr.exclude_idle;
+		attr.exclude_user    = event->attr.exclude_user;
+		attr.exclude_kernel  = event->attr.exclude_kernel;
+		attr.aux_sample_size = event->attr.aux_sample_size;
+
+		sampler = perf_event_create_kernel_counter(&attr, event->cpu,
+							   task, NULL, NULL);
+		if (IS_ERR(sampler))
+			return PTR_ERR(sampler);
+
+		nr_pages = 1ul << __get_order(event->attr.aux_sample_size);
+
+		ret = rb_alloc_kernel(sampler, 0, nr_pages);
+		if (ret) {
+			perf_event_release_kernel(sampler);
+			return ret;
+		}
+
+		/*
+		 * This event will be freed by the last exiting samplee;
+		 * perf_event_exit_task() should skip it over.
+		 */
+		sampler->attach_state |= PERF_ATTACH_SAMPLING;
+	}
+
+	event->aux_sampler = sampler;
+
+	if (!atomic_long_inc_return(&sampler->aux_samplees_count)) {
+		/*
+		 * enable the sampler here unless the original event wants
+		 * to stay disabled
+		 */
+		if (!event->attr.disabled)
+			perf_event_enable(sampler);
+	}
+
+	return 0;
+}
+
+static void perf_aux_sampler_fini(struct perf_event *event)
+{
+	struct perf_event *sampler = event->aux_sampler;
+
+	if (!sampler)
+		return;
+
+	/*
+	 * We're holding a reference to the sampler, so it's always
+	 * valid here.
+	 */
+	if (atomic_long_dec_and_test(&sampler->aux_samplees_count))
+		perf_event_disable(sampler);
+
+	/* can be last */
+	put_event(sampler);
+
+	event->aux_sampler = NULL;
+}
+
+static unsigned long perf_aux_sampler_trace(struct perf_event *event,
+					    struct perf_sample_data *data)
+{
+	struct perf_event *sampler = event->aux_sampler;
+	struct ring_buffer *rb;
+	int *disable_count;
+
+	data->aux.size = 0;
+
+	if (!sampler || READ_ONCE(sampler->state) != PERF_EVENT_STATE_ACTIVE)
+		goto out;
+
+	if (READ_ONCE(sampler->oncpu) != smp_processor_id())
+		goto out;
+
+	/*
+	 * Non-zero disable count here means that we, being the NMI
+	 * context, are racing with pmu::add or pmu::del, both of which
+	 * may lead to a dangling hardware event and all manner of mayhem.
+	 */
+	disable_count = this_cpu_ptr(sampler->pmu->pmu_disable_count);
+	if (*disable_count)
+		goto out;
+
+	perf_pmu_disable(sampler->pmu);
+
+	rb = ring_buffer_get(sampler);
+	if (!rb) {
+		perf_pmu_enable(sampler->pmu);
+		goto out;
+	}
+
+	sampler->pmu->stop(sampler, PERF_EF_UPDATE);
+
+	data->aux.to = local_read(&rb->aux_head);
+
+	if (data->aux.to < sampler->attr.aux_sample_size)
+		data->aux.from = rb->aux_nr_pages * PAGE_SIZE +
+			data->aux.to - sampler->attr.aux_sample_size;
+	else
+		data->aux.from = data->aux.to -
+			sampler->attr.aux_sample_size;
+	data->aux.size = ALIGN(sampler->attr.aux_sample_size, sizeof(u64));
+	ring_buffer_put(rb);
+
+out:
+	return data->aux.size;
+}
+
+static void perf_aux_sampler_output(struct perf_event *event,
+				    struct perf_output_handle *handle,
+				    struct perf_sample_data *data)
+{
+	struct perf_event *sampler = event->aux_sampler;
+	struct ring_buffer *rb;
+	unsigned long pad;
+	int ret;
+
+	if (WARN_ON_ONCE(!sampler || !data->aux.size))
+		goto out_enable;
+
+	rb = ring_buffer_get(sampler);
+	if (WARN_ON_ONCE(!rb))
+		goto out_enable;
+
+	ret = rb_output_aux(rb, data->aux.from, data->aux.to,
+			    (aux_copyfn)perf_output_copy, handle);
+	if (ret < 0) {
+		pr_warn_ratelimited("failed to copy trace data\n");
+		goto out;
+	}
+
+	pad = data->aux.size - ret;
+	if (pad) {
+		u64 p = 0;
+
+		perf_output_copy(handle, &p, pad);
+	}
+out:
+	ring_buffer_put(rb);
+	sampler->pmu->start(sampler, 0);
+
+out_enable:
+	perf_pmu_enable(sampler->pmu);
+}
+
 static void __perf_event_header__init_id(struct perf_event_header *header,
 					 struct perf_sample_data *data,
 					 struct perf_event *event)
@@ -5774,6 +6041,13 @@ void perf_output_sample(struct perf_output_handle *handle,
 		}
 	}
 
+	if (sample_type & PERF_SAMPLE_AUX) {
+		perf_output_put(handle, data->aux.size);
+
+		if (data->aux.size)
+			perf_aux_sampler_output(event, handle, data);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -5907,6 +6181,14 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_AUX) {
+		u64 size = sizeof(u64);
+
+		size += perf_aux_sampler_trace(event, data);
+
+		header->size += size;
+	}
 }
 
 static void __always_inline
@@ -6109,6 +6391,8 @@ static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
 		event->addr_filters_gen++;
 	raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
+	perf_pmu_enable(event->pmu);
+
 	if (restart)
 		perf_event_stop(event, 1);
 }
@@ -6673,6 +6957,8 @@ static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
 		event->addr_filters_gen++;
 	raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
+	perf_pmu_enable(event->pmu);
+
 	if (restart)
 		perf_event_stop(event, 1);
 }
@@ -9076,10 +9362,27 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	}
 
 	if (!event->parent) {
+		if (event->attr.sample_type & PERF_SAMPLE_AUX) {
+			struct pmu *aux_pmu;
+			int idx;
+
+			err = -EINVAL;
+
+			idx = srcu_read_lock(&pmus_srcu);
+			aux_pmu = __perf_find_pmu(event->attr.aux_sample_type);
+			if (aux_pmu)
+				err = perf_aux_sampler_init(event, task,
+							    aux_pmu);
+			srcu_read_unlock(&pmus_srcu, idx);
+
+			if (err)
+				goto err_addr_filters;
+		}
+
 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
 			err = get_callchain_buffers(attr->sample_max_stack);
 			if (err)
-				goto err_addr_filters;
+				goto err_aux_sampler;
 		}
 	}
 
@@ -9088,6 +9391,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	return event;
 
+err_aux_sampler:
+	perf_aux_sampler_fini(event);
+
 err_addr_filters:
 	kfree(event->addr_filters_offs);
 
@@ -9917,6 +10223,13 @@ perf_event_exit_event(struct perf_event *child_event,
 	struct perf_event *parent_event = child_event->parent;
 
 	/*
+	 * Skip over samplers, they are released by the last holder
+	 * of their reference.
+	 */
+	if (child_event->attach_state & PERF_ATTACH_SAMPLING)
+		return;
+
+	/*
 	 * Do not destroy the 'original' grouping; because of the context
 	 * switch optimization the original events could've ended up in a
 	 * random child task.
-- 
2.9.3


* [RFC PATCH 5/6] perf: Disable PMU around address filter adjustment
  2016-09-23 11:27 [RFC PATCH 0/6] perf: Add AUX data sampling Alexander Shishkin
                   ` (3 preceding siblings ...)
  2016-09-23 11:27 ` [RFC PATCH 4/6] perf: Add infrastructure for using AUX data in perf samples Alexander Shishkin
@ 2016-09-23 11:27 ` Alexander Shishkin
  2016-09-23 11:27 ` [RFC PATCH 6/6] perf: Disable IRQs in address filter sync path Alexander Shishkin
  2016-09-23 11:49 ` [RFC PATCH 0/6] perf: Add AUX data sampling Peter Zijlstra
  6 siblings, 0 replies; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-23 11:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak, Alexander Shishkin

If the PMU is used for AUX data sampling, the hardware event that triggers
it may come in during a critical section of the address filter adjustment (if
it's delivered as an NMI) and in turn call the address filter sync code,
causing spinlock recursion.

To prevent this, disable the PMU around these critical sections, so that
AUX sampling won't take place there.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index fdb20fdeb1..f6582df1c9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6377,6 +6377,12 @@ static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
 	if (!has_addr_filter(event))
 		return;
 
+	/*
+	 * if sampling kicks in in the critical section,
+	 * we risk spinlock recursion on the ifh::lock
+	 */
+	perf_pmu_disable(event->pmu);
+
 	raw_spin_lock_irqsave(&ifh->lock, flags);
 	list_for_each_entry(filter, &ifh->list, entry) {
 		if (filter->inode) {
@@ -6387,6 +6393,8 @@ static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
 		count++;
 	}
 
+	perf_pmu_enable(event->pmu);
+
 	if (restart)
 		event->addr_filters_gen++;
 	raw_spin_unlock_irqrestore(&ifh->lock, flags);
@@ -6942,6 +6950,8 @@ static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
 	if (!file)
 		return;
 
+	perf_pmu_disable(event->pmu);
+
 	raw_spin_lock_irqsave(&ifh->lock, flags);
 	list_for_each_entry(filter, &ifh->list, entry) {
 		if (perf_addr_filter_match(filter, file, off,
@@ -6953,6 +6963,8 @@ static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
 		count++;
 	}
 
+	perf_pmu_enable(event->pmu);
+
 	if (restart)
 		event->addr_filters_gen++;
 	raw_spin_unlock_irqrestore(&ifh->lock, flags);
@@ -8126,6 +8138,8 @@ static void perf_event_addr_filters_apply(struct perf_event *event)
 
 	down_read(&mm->mmap_sem);
 
+	perf_pmu_disable(event->pmu);
+
 	raw_spin_lock_irqsave(&ifh->lock, flags);
 	list_for_each_entry(filter, &ifh->list, entry) {
 		event->addr_filters_offs[count] = 0;
@@ -8144,6 +8158,8 @@ static void perf_event_addr_filters_apply(struct perf_event *event)
 	event->addr_filters_gen++;
 	raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
+	perf_pmu_enable(event->pmu);
+
 	up_read(&mm->mmap_sem);
 
 	mmput(mm);
-- 
2.9.3


* [RFC PATCH 6/6] perf: Disable IRQs in address filter sync path
  2016-09-23 11:27 [RFC PATCH 0/6] perf: Add AUX data sampling Alexander Shishkin
                   ` (4 preceding siblings ...)
  2016-09-23 11:27 ` [RFC PATCH 5/6] perf: Disable PMU around address filter adjustment Alexander Shishkin
@ 2016-09-23 11:27 ` Alexander Shishkin
  2016-09-26 16:18   ` Alexander Shishkin
  2016-09-23 11:49 ` [RFC PATCH 0/6] perf: Add AUX data sampling Peter Zijlstra
  6 siblings, 1 reply; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-23 11:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak, Alexander Shishkin

If PMU callbacks are executed in hardirq context, the address filter
sync code needs to disable interrupts when taking its spinlock to be
consistent with the rest of its users. This may happen if the PMU is
used in AUX sampling.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f6582df1c9..047c495c94 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2568,16 +2568,17 @@ static int perf_event_stop(struct perf_event *event, int restart)
 void perf_event_addr_filters_sync(struct perf_event *event)
 {
 	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	unsigned long flags;
 
 	if (!has_addr_filter(event))
 		return;
 
-	raw_spin_lock(&ifh->lock);
+	raw_spin_lock_irqsave(&ifh->lock, flags);
 	if (event->addr_filters_gen != event->hw.addr_filters_gen) {
 		event->pmu->addr_filters_sync(event);
 		event->hw.addr_filters_gen = event->addr_filters_gen;
 	}
-	raw_spin_unlock(&ifh->lock);
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
 }
 EXPORT_SYMBOL_GPL(perf_event_addr_filters_sync);
 
-- 
2.9.3


* Re: [RFC PATCH 0/6] perf: Add AUX data sampling
  2016-09-23 11:27 [RFC PATCH 0/6] perf: Add AUX data sampling Alexander Shishkin
                   ` (5 preceding siblings ...)
  2016-09-23 11:27 ` [RFC PATCH 6/6] perf: Disable IRQs in address filter sync path Alexander Shishkin
@ 2016-09-23 11:49 ` Peter Zijlstra
  2016-09-23 17:19   ` Andi Kleen
  6 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2016-09-23 11:49 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak

On Fri, Sep 23, 2016 at 02:27:20PM +0300, Alexander Shishkin wrote:
> Hi Peter,
> 
> This is an RFC, I'm not sending the tooling bits in this series,
> although they can be found here [1].
> 
> This series introduces AUX data sampling for perf events, which in
> case of our instruction/branch tracing PMUs like Intel PT, BTS, CS
> ETM means execution flow history leading up to a perf event's
> overflow.

This fails to explain _WHY_ this is a good thing to have. What kind of
analysis does this enable, and is that fully implemented in [1] (I
didn't look).

Please provide a bit more justification.


* Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation
  2016-09-23 11:27 ` [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation Alexander Shishkin
@ 2016-09-23 12:14   ` Peter Zijlstra
  2016-09-23 14:27     ` Alexander Shishkin
  2016-09-23 17:26     ` Andi Kleen
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2016-09-23 12:14 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak

On Fri, Sep 23, 2016 at 02:27:21PM +0300, Alexander Shishkin wrote:
> In order to be able to allocate perf ring buffers in non-mmap path, we
> need to make sure we can still account the memory to the user and that
> they don't exceed their mlock limit.
> 
> This patch moves ring buffer memory accounting down the rb_alloc() path
> so that its callers won't have to worry about it. This also serves the
> additional purpose of slightly cleaning up perf_mmap().

While I like a cleanup of that code (it really can use one), I'm not a
big fan of hidden buffers like this. Why is this needed?

A quick look through the patches also leaves me wondering on the design
and interface of this thing. A few words explaining the overall design
would be nice.

Afaict there's no actual need to hide the AUX buffer for this sampling
stuff; the user knows about all this and can simply mmap() the AUX part.
The sample could either point to locations in the AUX buffer, or (as I
think this code does) memcpy bits out.

Ideally we'd pass the AUX-event into the syscall, that way you avoid all
the find_aux_event crud. I'm not sure we want to overload the group_fd
thing more (its already very hard to create counter groups in a cgroup
for example) ..

Coredump was mentioned somewhere, but I'm not sure I've seen
code/interfaces for that. How was that envisioned to work?


* Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation
  2016-09-23 12:14   ` Peter Zijlstra
@ 2016-09-23 14:27     ` Alexander Shishkin
  2016-09-23 15:27       ` Peter Zijlstra
  2016-09-23 17:26     ` Andi Kleen
  1 sibling, 1 reply; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-23 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Sep 23, 2016 at 02:27:21PM +0300, Alexander Shishkin wrote:
>> In order to be able to allocate perf ring buffers in non-mmap path, we
>> need to make sure we can still account the memory to the user and that
>> they don't exceed their mlock limit.
>> 
>> This patch moves ring buffer memory accounting down the rb_alloc() path
>> so that its callers won't have to worry about it. This also serves the
>> additional purpose of slightly cleaning up perf_mmap().
>
> While I like a cleanup of that code (it really can use one), I'm not a
> big fan of hidden buffers like this. Why is this needed?

So what I wanted is an interface similar to call stack sampling or pretty
much any other kind of sampling that we have at the moment. The user would
ask for AUX samples of, say, intel_pt, and would get a sample with PT
stuff right in the perf buffer every time their main event overflows.

They don't *need* to know that we have a kernel event with a ring buffer
under the hood. This was one of the use cases of 'hidden' ring
buffers. The other two are process core dump and system core dump ([1]
tried to do it without involving perf at all, for reference).

> A quick look through the patches also leaves me wondering on the design
> and interface of this thing. A few words explaining the overall design
> would be nice.

Right; here goes. PERF_SAMPLE_AUX is set in the attr.sample_type of the
event that you want to sample. Then, using that event's
attr.aux_sample_type as the PMU 'type' and attr.aux_sample_config as
'config' we create a kernel event. For this kernel event, we then
allocate a ring buffer with 0 data pages and as many aux pages as would
fit the attr.aux_sample_size.

Then, we hook into the perf_prepare_sample()/perf_output_sample() path
so that, when the original event goes off, we first stop the kernel
event, then memcpy the data from the 'hidden' aux buffer into the
original event's perf buffer under PERF_SAMPLE_AUX and then restart the
kernel event. This all is happening on the local cpu. The 'hidden' aux
buffer is running in overwrite mode, so we copy attr.aux_sample_size
bytes every time, which means there may be overlaps between samples, but
the tooling has logic to handle this.

This is about it. Before creating a new counter we first look for an
existing one that fits the bill wrt filtering bits; if there is one, we
grab its reference and use it instead. This is so that one could do
things like

$ perf record -Aintel_pt -e 'cycles,instructions,branch-misses' ls

or

$ perf record -Aintel_pt -e 'sched:*' -a sleep 10

> Afaict there's no actual need to hide the AUX buffer for this sampling
> stuff; the user knows about all this and can simply mmap() the AUX part.

Yes, you're right here. We could also re-use the AUX record, adding a
new flag for this. It may be even better if I can work out the
inheritance (the current code doesn't handle inheritance at the moment
in case we decide to scrap it).

> The sample could either point to locations in the AUX buffer, or (as I
> think this code does) memcpy bits out.

Yes and yes, it does.

> Ideally we'd pass the AUX-event into the syscall, that way you avoid all
> the find_aux_event crud. I'm not sure we want to overload the group_fd
> thing more (its already very hard to create counter groups in a cgroup
> for example) ..

It can also be stuffed into the attribute or ioctl()ed. The latter is
probably the best.

> Coredump was mentioned somewhere, but I'm not sure I've seen
> code/interfaces for that. How was that envisioned to work?

Ok, so what I have is a new RLIMIT_PERF, which is set to the size of the
aux data sample to be included in the [process] core dump. At
prlimit(RLIMIT_PERF) time, given that RLIMIT_CORE is also nonzero, I
create a kernel event with a 'hidden' buffer. The PMU for this event is,
in this scenario, a system-wide setting, which is a tad iffy, seeing as
we now have 2 PMUs in the system that can be used for this, but which
are mutually exclusive.
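
(Purely for illustration, a tiny userspace sketch of that opt-in; RLIMIT_PERF
stands for the proposed, not-yet-posted resource constant and the sample size
is arbitrary:)

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/resource.h>

/* ask for 64k of AUX data to be appended to 'pid's core dump */
static int enable_aux_coredump(pid_t pid)
{
	struct rlimit rl = { .rlim_cur = 65536, .rlim_max = 65536 };

	/* RLIMIT_CORE must also be nonzero for this to take effect */
	return prlimit(pid, RLIMIT_PERF, &rl, NULL);
}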

Now, when the core dump is written, we check if there's such an event on
the task's perf context and if there is, we dump_emit() data from the
hidden buffer into the file. The difference with sampling is that this
kernel event is also inheritable, so that when the task fork()s, a new
event is created. The memory is counted against
sysctl_perf_event_mlock+user's RLIMIT_MEMLOCK (just like the rest of
perf buffers), so when the user is out of it, no new events are created.

The rlimit as an interface to enable this seems weirder the more I look at
it, which is also the reason why I haven't sent it out yet. Another idea
I had for this was a prctl(), which would be more straightforward and
would also allow specifying the PMU, but, unlike prlimit(), would only
work on the current process. Yet another way would be to go through
perf_event_open() and then somehow feed the event into the ether instead
of polling it.

The last one that can use the hidden buffer is system core dumps, which
would be either retrieved by kdump or stored in a pstore/EFI capsule. I
don't have the code for this yet, but the general idea is that per-cpu
AUX events would start at boot time in overwrite mode and just hang in
there till things go south.

[1] http://marc.info/?l=linux-kernel&m=143814616805933

Thanks,
--
Alex


* Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation
  2016-09-23 14:27     ` Alexander Shishkin
@ 2016-09-23 15:27       ` Peter Zijlstra
  2016-09-23 15:58         ` Alexander Shishkin
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2016-09-23 15:27 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak

On Fri, Sep 23, 2016 at 05:27:22PM +0300, Alexander Shishkin wrote:
> > Afaict there's no actual need to hide the AUX buffer for this sampling
> > stuff; the user knows about all this and can simply mmap() the AUX part.
> 
> Yes, you're right here. We could also re-use the AUX record, adding a
> new flag for this. It may be even better if I can work out the
> inheritance (the current code doesn't handle inheritance at the moment
> in case we decide to scrap it).

What is the exact problem with inheritance? You can inherit PT (and
other) events just fine, and their output redirects to the original
(AUX) buffer too.

Is the problem untangling which part of the AUX buffer belongs to which
task upon sample?


* Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation
  2016-09-23 15:27       ` Peter Zijlstra
@ 2016-09-23 15:58         ` Alexander Shishkin
  0 siblings, 0 replies; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-23 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Sep 23, 2016 at 05:27:22PM +0300, Alexander Shishkin wrote:
>> > Afaict there's no actual need to hide the AUX buffer for this sampling
>> > stuff; the user knows about all this and can simply mmap() the AUX part.
>> 
>> Yes, you're right here. We could also re-use the AUX record, adding a
>> new flag for this. It may be even better if I can work out the
>> inheritance (the current code doesn't handle inheritance at the moment
>> in case we decide to scrap it).
>
> What is the exact problem with inheritance? You can inherit PT (and
> other) events just fine, and their output redirects to the original
> (AUX) buffer too.
>
> Is the problem untangling which part of the AUX buffer belongs to which
> task upon sample?

Tasks we can figure out from id samples on RECORD_AUX (assuming we're
using those), but which event (if there are multiple) a sample belongs
to is trickier.

Cutting out samples becomes more interesting: normally RECORD_AUX
records don't overlap, and we can keep it that way, so the samples will
naturally be non-overlapping, but they will all be different sizes. And
there is the question of waking up the consumer often enough to copy out
all the samples before the buffer overwrites itself. Let me think a bit.

Regards,
--
Alex


* Re: [RFC PATCH 0/6] perf: Add AUX data sampling
  2016-09-23 11:49 ` [RFC PATCH 0/6] perf: Add AUX data sampling Peter Zijlstra
@ 2016-09-23 17:19   ` Andi Kleen
  2016-09-23 20:35     ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2016-09-23 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx

On Fri, Sep 23, 2016 at 01:49:17PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 23, 2016 at 02:27:20PM +0300, Alexander Shishkin wrote:
> > Hi Peter,
> > 
> > This is an RFC, I'm not sending the tooling bits in this series,
> > although they can be found here [1].
> > 
> > This series introduces AUX data sampling for perf events, which in
> > case of our instruction/branch tracing PMUs like Intel PT, BTS, CS
> > ETM means execution flow history leading up to a perf event's
> > overflow.
> 
> This fails to explain _WHY_ this is a good thing to have. What kind of
> analysis does this enable, and is that fully implemented in [1] (I
> didn't look).

Think of it as a super LBR. (Near) all things LBR can do, PT can do
with many more branches for each sample.

Also, long-term execution recording of PT normally doesn't work well because the
sustained bandwidth is too high for perf and the disk to keep up.

Currently the main solution we have for that is the snapshot mode, but it
requires explicit instrumentation for someone to trigger snapshots.

Sampling PT is an alternative that works for many use cases, and does
not rely on instrumentation.

-Andi

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation
  2016-09-23 12:14   ` Peter Zijlstra
  2016-09-23 14:27     ` Alexander Shishkin
@ 2016-09-23 17:26     ` Andi Kleen
  2016-09-23 20:28       ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2016-09-23 17:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx

> Afaict there's no actual need to hide the AUX buffer for this sampling
> stuff; the user knows about all this and can simply mmap() the AUX part.
> The sample could either point to locations in the AUX buffer, or (as I
> think this code does) memcpy bits out.

This would work for perf, but not for the core dump case below.
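
(For context, the mmap() path referred to above exists today; a minimal
sketch of the user-space side, with placeholder page counts and no error
handling:)

#include <sys/mman.h>
#include <unistd.h>
#include <linux/perf_event.h>

/* fd is a perf_event_open() file descriptor for an AUX-capable event */
static void *map_aux(int fd, size_t *aux_len)
{
	size_t psize = sysconf(_SC_PAGESIZE);
	size_t data_len = (1 + 64) * psize;	/* user page + 64 data pages */
	struct perf_event_mmap_page *pc;

	/* map the regular ring buffer first */
	pc = mmap(NULL, data_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* tell the kernel where/how big the AUX area should be, then map it */
	pc->aux_offset = data_len;
	pc->aux_size = *aux_len = 1024 * psize;
	return mmap(NULL, pc->aux_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, pc->aux_offset);
}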

> Ideally we'd pass the AUX-event into the syscall, that way you avoid all
> the find_aux_event crud. I'm not sure we want to overload the group_fd
> thing more (its already very hard to create counter groups in a cgroup
> for example) ..
> 
> Coredump was mentioned somewhere, but I'm not sure I've seen
> code/interfaces for that. How was that envisioned to work?

The idea was to have an rlimit that enables PT running as a ring buffer
in the background. If something crashes, the ring buffer is dumped
as part of the core dump, and then gdb can tell you how you crashed.
This extends what gdb already does explicitly today using perf
API calls.
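
(Roughly, that explicit path today boils down to opening a PT event through
the dynamically assigned PMU type; a minimal sketch, error handling omitted:)

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int open_pt_event(void)
{
	struct perf_event_attr attr;
	unsigned int type = 0;
	FILE *f = fopen("/sys/bus/event_source/devices/intel_pt/type", "r");

	fscanf(f, "%u", &type);		/* dynamically assigned PMU type */
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.exclude_kernel = 1;

	/* pid == 0: trace the calling thread */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}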

-Andi

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation
  2016-09-23 17:26     ` Andi Kleen
@ 2016-09-23 20:28       ` Peter Zijlstra
  2016-09-26  8:27         ` Alexander Shishkin
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2016-09-23 20:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx

On Fri, Sep 23, 2016 at 10:26:15AM -0700, Andi Kleen wrote:
> > Afaict there's no actual need to hide the AUX buffer for this sampling
> > stuff; the user knows about all this and can simply mmap() the AUX part.
> > The sample could either point to locations in the AUX buffer, or (as I
> > think this code does) memcpy bits out.
> 
> This would work for perf, but not for the core dump case below.
> 
> > Ideally we'd pass the AUX-event into the syscall, that way you avoid all
> > the find_aux_event crud. I'm not sure we want to overload the group_fd
> > thing more (its already very hard to create counter groups in a cgroup
> > for example) ..
> > 
> > Coredump was mentioned somewhere, but I'm not sure I've seen
> > code/interfaces for that. How was that envisioned to work?
> 
> The idea was to have an rlimit that enables PT running as a ring buffer
> in the background. If something crashes, the ring buffer is dumped
> as part of the core dump, and then gdb can tell you how you crashed.
> This extends what gdb already does explicitly today using perf
> API calls.

Well, we could 'force' inject a VMA into the process's address space, we
do that for a few other things as well. It also makes for less
exceptions with the actual core dumping.

But the worry I have is the total amount of pinned memory. If you want
to inherit this on fork(), as is a reasonable expectation, then it's
possible to quickly exceed the total amount of pinnable memory.

At which point we _should_ start failing fork(), which is a somewhat
unexpected, and undesirable side-effect.

Ideally we'd unpin the old buffers and repin the new buffers on context
switch, but that's impossible since faulting needs scheduling,
recursion, we lose.

I really want to see something sensible before we go do that.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/6] perf: Add AUX data sampling
  2016-09-23 17:19   ` Andi Kleen
@ 2016-09-23 20:35     ` Peter Zijlstra
  2016-09-23 22:34       ` Andi Kleen
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2016-09-23 20:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx

On Fri, Sep 23, 2016 at 10:19:43AM -0700, Andi Kleen wrote:
> On Fri, Sep 23, 2016 at 01:49:17PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 23, 2016 at 02:27:20PM +0300, Alexander Shishkin wrote:
> > > Hi Peter,
> > > 
> > > This is an RFC, I'm not sending the tooling bits in this series,
> > > although they can be found here [1].
> > > 
> > > This series introduces AUX data sampling for perf events, which in
> > > case of our instruction/branch tracing PMUs like Intel PT, BTS, CS
> > > ETM means execution flow history leading up to a perf event's
> > > overflow.
> > 
> > This fails to explain _WHY_ this is a good thing to have. What kind of
> > analysis does this enable, and is that fully implemented in [1] (I
> > didn't look).
> 
> Think of it as a super LBR. (Near) all things LBR can do, PT can do
> > with many more branches for each sample.

Clarify the 'near'? Should we then not expose it as a BRANCH_STACK?
Expand on the down-sides of that.

> > Also, long-term execution recording of PT normally doesn't work well because the
> > sustained bandwidth is too high for perf and the disk to keep up.
> 
> Currently the main solution we have for that is the snapshot mode, but it
> requires explicit instrumentation for someone to trigger snapshots.
> 
> Sampling PT is an alternative that works for many use cases, and does
> not rely on instrumentation.

List a few use-cases on either side of that divide?


This really isn't rocket science; patches should come with
justification, so try and sell this stuff. Don't try and skimp on that.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/6] perf: Add AUX data sampling
  2016-09-23 20:35     ` Peter Zijlstra
@ 2016-09-23 22:34       ` Andi Kleen
  0 siblings, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2016-09-23 22:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx

On Fri, Sep 23, 2016 at 10:35:27PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 23, 2016 at 10:19:43AM -0700, Andi Kleen wrote:
> > On Fri, Sep 23, 2016 at 01:49:17PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 23, 2016 at 02:27:20PM +0300, Alexander Shishkin wrote:
> > > > Hi Peter,
> > > > 
> > > > This is an RFC, I'm not sending the tooling bits in this series,
> > > > although they can be found here [1].
> > > > 
> > > > This series introduces AUX data sampling for perf events, which in
> > > > case of our instruction/branch tracing PMUs like Intel PT, BTS, CS
> > > > ETM means execution flow history leading up to a perf event's
> > > > overflow.
> > > 
> > > This fails to explain _WHY_ this is a good thing to have. What kind of
> > > analysis does this enable, and is that fully implemented in [1] (I
> > > didn't look).
> > 
> > Think of it as a super LBR. (Near) all things LBR can do, PT can do
> > with many more branches for each sample.
> 
> Clarify the 'near'? Should we then not expose it as a BRANCH_STACK?

- Exposing it as a branch stack would need a PT decoder in kernel space.
A PT decoder is quite complicated and needs a lot of infrastructure.
- Putting a PT decoder in kernel space is bad because the decoder is
much slower than the execution and runs much better offline than online.
It would cause a lot more data loss.
- LBR has some features which are not in PT (but also the other way round),
like mispredict indication, call stack mode or individual basic-block-level
timing. It also has practically no runtime overhead.
- Also, BTW, the PT decoder in user space already supports exposing
PT as a virtual LBR. It is all done without kernel help,
after decoding.

> > Also, long-term execution recording of PT normally doesn't work well because the
> > sustained bandwidth is too high for perf and the disk to keep up.
> > 
> > Currently the main solution we have for that is the snapshot mode, but it
> > requires explicit instrumentation for someone to trigger snapshots.
> > 
> > Sampling PT is an alternative that works for many use cases, and does
> > not rely on instrumentation.
> 
> > List a few use-cases on either side of that divide?

- Snapshot mode is good for targeted performance debugging. You're looking
for something specific and can instrument for it.
- Sample mode is good for generic data collection. You don't know yet 
what you're looking for, but want to see hot paths in your application.
- Sample mode also works for automated data collection, like using
it for compiler profile feedback.

-Andi

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation
  2016-09-23 20:28       ` Peter Zijlstra
@ 2016-09-26  8:27         ` Alexander Shishkin
  2016-09-26  9:03           ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-26  8:27 UTC (permalink / raw)
  To: Peter Zijlstra, Andi Kleen
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx

Peter Zijlstra <peterz@infradead.org> writes:

> Well, we could 'force' inject a VMA into the process's address space, we
> do that for a few other things as well. It also makes for less
> exceptions with the actual core dumping.

Threads then will end up with the same buffer (through sharing the mm),
but they can't really share trace buffers.

Also, system core dump is still a problem.

> But the worry I have is the total amount of pinned memory. If you want
> > to inherit this on fork(), as is a reasonable expectation, then it's
> possible to quickly exceed the total amount of pinnable memory.
>
> At which point we _should_ start failing fork(), which is a somewhat
> unexpected, and undesirable side-effect.

I'm not sure I see why we should fail fork() when we run out of pinned
memory.

> Ideally we'd unpin the old buffers and repin the new buffers on context
> switch, but that's impossible since faulting needs scheduling,
> > recursion, we lose.

Or we can have per-cpu buffers for all user's tasks, record where each
task starts and ends in each buffer and cut out only bits relevant to
the task(s) that dump core.
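
Purely as an illustration (none of these names exist anywhere, it's just the
bookkeeping such a scheme would need): per-cpu AUX buffers plus a log of
which task owned which range, so the core dump can carve out only the
relevant pieces.

struct aux_task_range {
	pid_t		tid;
	unsigned long	from;	/* offsets into the per-cpu AUX buffer */
	unsigned long	to;
};

struct percpu_aux {
	struct ring_buffer	*rb;			/* the shared per-cpu AUX buffer */
	struct aux_task_range	log[AUX_LOG_ENTRIES];	/* AUX_LOG_ENTRIES: made-up bound */
	unsigned int		head;			/* wraps, like the buffer itself */
};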

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation
  2016-09-26  8:27         ` Alexander Shishkin
@ 2016-09-26  9:03           ` Peter Zijlstra
  2016-09-26 12:39             ` Alexander Shishkin
  2016-09-26 16:13             ` Alexander Shishkin
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2016-09-26  9:03 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Andi Kleen, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx

On Mon, Sep 26, 2016 at 11:27:08AM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > Well, we could 'force' inject a VMA into the process's address space, we
> > do that for a few other things as well. It also makes for less
> > exceptions with the actual core dumping.
> 
> Threads then will end up with the same buffer (through sharing the mm),
> but they can't really share trace buffers.
> 
> Also, system core dump is still a problem.

Hurm, true on both counts.

> > But the worry I have is the total amount of pinned memory. If you want
>> > to inherit this on fork(), as is a reasonable expectation, then it's
> > possible to quickly exceed the total amount of pinnable memory.
> >
> > At which point we _should_ start failing fork(), which is a somewhat
> > unexpected, and undesirable side-effect.
> 
> I'm not sure I see why we should fail fork() when we run out of pinned
> memory.

Well, we cannot fully honour the inherit, what other option do we have?
Silently malfunctioning? That's far worse.

> > Ideally we'd unpin the old buffers and repin the new buffers on context
> > switch, but that's impossible since faulting needs scheduling,
>> > recursion, we lose.
> 
> Or we can have per-cpu buffers for all user's tasks, record where each
> task starts and ends in each buffer and cut out only bits relevant to
> the task(s) that dump core.

Which gets you the problem that when a task dumps core there might not
be any state in the buffer, because the previous task flushed it all out
:/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation
  2016-09-26  9:03           ` Peter Zijlstra
@ 2016-09-26 12:39             ` Alexander Shishkin
  2016-09-26 16:13             ` Alexander Shishkin
  1 sibling, 0 replies; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-26 12:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Sep 26, 2016 at 11:27:08AM +0300, Alexander Shishkin wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> > At which point we _should_ start failing fork(), which is a somewhat
>> > unexpected, and undesirable side-effect.
>> 
>> I'm not sure I see why we should fail fork() when we run out of pinned
>> memory.
>
> Well, we cannot fully honour the inherit, what other option do we have?
> Silently malfunctioning? That's far worse.

We can still put a note there saying that we tried. The user will know
to adjust their buffer size requirement or the RLIMIT_MEMLOCK.
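
(Purely hypothetical, just to make the "note" idea concrete; no such record
type exists anywhere:)

/* a made-up record noting that the inherited AUX buffer came up short */
struct {
	struct perf_event_header	header;		/* some new PERF_RECORD_* type */
	u64				requested;	/* bytes we were asked to pin */
	u64				pinned;		/* bytes we actually could pin */
	struct sample_id		sample_id;
};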

>> > Ideally we'd unpin the old buffers and repin the new buffers on context
>> > switch, but that's impossible since faulting needs scheduling,
>> > recursion, we lose.
>> 
>> Or we can have per-cpu buffers for all user's tasks, record where each
>> task starts and ends in each buffer and cut out only bits relevant to
>> the task(s) that dump core.
>
> Which gets you the problem that when a task dumps core there might not
> be any state in the buffer, because the previous task flushed it all out
> :/

Well, there's going to be at least something that leads up to the core
dump if this task is the last one to schedule in for this buffer. It's a
bit more of a gamble, though.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation
  2016-09-26  9:03           ` Peter Zijlstra
  2016-09-26 12:39             ` Alexander Shishkin
@ 2016-09-26 16:13             ` Alexander Shishkin
  1 sibling, 0 replies; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-26 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Sep 26, 2016 at 11:27:08AM +0300, Alexander Shishkin wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > Well, we could 'force' inject a VMA into the process's address space, we
>> > do that for a few other things as well. It also makes for less
>> > exceptions with the actual core dumping.
>> 
>> Threads then will end up with the same buffer (through sharing the mm),
>> but they can't really share trace buffers.
>> 
>> Also, system core dump is still a problem.
>
> Hurm, true on both counts.

OTOH, system core dump buffers don't need inheritance or memlock
accounting.

>> Or we can have per-cpu buffers for all user's tasks, record where each
>> task starts and ends in each buffer and cut out only bits relevant to
>> the task(s) that dump core.
>
> Which gets you the problem that when a task dumps core there might not
> be any state in the buffer, because the previous task flushed it all out
> :/

And it also won't work with PMUs that don't generate PMIs, like ETMs.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 6/6] perf: Disable IRQs in address filter sync path
  2016-09-23 11:27 ` [RFC PATCH 6/6] perf: Disable IRQs in address filter sync path Alexander Shishkin
@ 2016-09-26 16:18   ` Alexander Shishkin
  2016-10-04 16:49     ` Mathieu Poirier
  0 siblings, 1 reply; 23+ messages in thread
From: Alexander Shishkin @ 2016-09-26 16:18 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, tglx, ak

Alexander Shishkin <alexander.shishkin@linux.intel.com> writes:

> If PMU callbacks are executed in hardirq context, the address filter
> sync code needs to disable interrupts when taking its spinlock to be
> consistent with the rest of its users. This may happen if the PMU is
> used in AUX sampling.

Hi Mathieu,

I've been meaning to CC you on this series and forgot. My concern was
that on PMUs that run PMIs in hardirq context this patch should already
be required today. Is this the case for you?

>
> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---
>  kernel/events/core.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f6582df1c9..047c495c94 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2568,16 +2568,17 @@ static int perf_event_stop(struct perf_event *event, int restart)
>  void perf_event_addr_filters_sync(struct perf_event *event)
>  {
>  	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +	unsigned long flags;
>  
>  	if (!has_addr_filter(event))
>  		return;
>  
> -	raw_spin_lock(&ifh->lock);
> +	raw_spin_lock_irqsave(&ifh->lock, flags);
>  	if (event->addr_filters_gen != event->hw.addr_filters_gen) {
>  		event->pmu->addr_filters_sync(event);
>  		event->hw.addr_filters_gen = event->addr_filters_gen;
>  	}
> -	raw_spin_unlock(&ifh->lock);
> +	raw_spin_unlock_irqrestore(&ifh->lock, flags);
>  }
>  EXPORT_SYMBOL_GPL(perf_event_addr_filters_sync);
>  
> -- 
> 2.9.3

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 6/6] perf: Disable IRQs in address filter sync path
  2016-09-26 16:18   ` Alexander Shishkin
@ 2016-10-04 16:49     ` Mathieu Poirier
  0 siblings, 0 replies; 23+ messages in thread
From: Mathieu Poirier @ 2016-10-04 16:49 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Vince Weaver,
	Stephane Eranian, Arnaldo Carvalho de Melo, Thomas Gleixner,
	Andi Kleen

On 26 September 2016 at 10:18, Alexander Shishkin
<alexander.shishkin@linux.intel.com> wrote:
> Alexander Shishkin <alexander.shishkin@linux.intel.com> writes:
>
>> If PMU callbacks are executed in hardirq context, the address filter
>> sync code needs to disable interrupts when taking its spinlock to be
>> consistent with the rest of its users. This may happen if the PMU is
>> used in AUX sampling.
>
> Hi Mathieu,
>
> I've been meaning to CC you on this series and forgot. My concern was
> that on PMUs that run PMIs in hardirq context this patch should already
> be required today. Is this the case for you?

At this time cross triggers haven't been implemented and, as such, PMIs
aren't an issue. On the flip side, I can see the value of your code
when we get around to doing the implementation.

Mathieu

>
>>
>> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> ---
>>  kernel/events/core.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index f6582df1c9..047c495c94 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -2568,16 +2568,17 @@ static int perf_event_stop(struct perf_event *event, int restart)
>>  void perf_event_addr_filters_sync(struct perf_event *event)
>>  {
>>       struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
>> +     unsigned long flags;
>>
>>       if (!has_addr_filter(event))
>>               return;
>>
>> -     raw_spin_lock(&ifh->lock);
>> +     raw_spin_lock_irqsave(&ifh->lock, flags);
>>       if (event->addr_filters_gen != event->hw.addr_filters_gen) {
>>               event->pmu->addr_filters_sync(event);
>>               event->hw.addr_filters_gen = event->addr_filters_gen;
>>       }
>> -     raw_spin_unlock(&ifh->lock);
>> +     raw_spin_unlock_irqrestore(&ifh->lock, flags);
>>  }
>>  EXPORT_SYMBOL_GPL(perf_event_addr_filters_sync);
>>
>> --
>> 2.9.3

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2016-10-04 16:49 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-23 11:27 [RFC PATCH 0/6] perf: Add AUX data sampling Alexander Shishkin
2016-09-23 11:27 ` [RFC PATCH 1/6] perf: Move mlock accounting to ring buffer allocation Alexander Shishkin
2016-09-23 12:14   ` Peter Zijlstra
2016-09-23 14:27     ` Alexander Shishkin
2016-09-23 15:27       ` Peter Zijlstra
2016-09-23 15:58         ` Alexander Shishkin
2016-09-23 17:26     ` Andi Kleen
2016-09-23 20:28       ` Peter Zijlstra
2016-09-26  8:27         ` Alexander Shishkin
2016-09-26  9:03           ` Peter Zijlstra
2016-09-26 12:39             ` Alexander Shishkin
2016-09-26 16:13             ` Alexander Shishkin
2016-09-23 11:27 ` [RFC PATCH 2/6] perf: Add api to (de-)allocate AUX buffers for kernel counters Alexander Shishkin
2016-09-23 11:27 ` [RFC PATCH 3/6] perf: Add a helper for looking up pmus by type Alexander Shishkin
2016-09-23 11:27 ` [RFC PATCH 4/6] perf: Add infrastructure for using AUX data in perf samples Alexander Shishkin
2016-09-23 11:27 ` [RFC PATCH 5/6] perf: Disable PMU around address filter adjustment Alexander Shishkin
2016-09-23 11:27 ` [RFC PATCH 6/6] perf: Disable IRQs in address filter sync path Alexander Shishkin
2016-09-26 16:18   ` Alexander Shishkin
2016-10-04 16:49     ` Mathieu Poirier
2016-09-23 11:49 ` [RFC PATCH 0/6] perf: Add AUX data sampling Peter Zijlstra
2016-09-23 17:19   ` Andi Kleen
2016-09-23 20:35     ` Peter Zijlstra
2016-09-23 22:34       ` Andi Kleen
