linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT
@ 2014-08-11  5:19 Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 01/23] perf: Add data_{offset,size} to user_page Alexander Shishkin
                   ` (23 more replies)
  0 siblings, 24 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Hi Peter and all,

Here's a new version of the PT support patchset, this time including the
PT and BTS drivers and a few more tweaks to the core. I've still left out
some bits, such as core dump support, for now. Tooling support is not
included in this series so that the kernel bits are easier to review; I
suppose it's best to send it separately, since it's quite a huge patchset
of its own.

Alexander Shishkin (22):
  perf: Add data_{offset,size} to user_page
  perf: Support high-order allocations for AUX space
  perf: Add a capability for AUX_NO_SG pmus to do software double
    buffering
  perf: Add a pmu capability for "exclusive" events
  perf: Redirect output from inherited events to parents
  perf: Add api for pmus to write to AUX space
  perf: Add AUX record
  perf: Support overwrite mode for AUX area
  perf: Add wakeup watermark control to AUX area
  perf: Add itrace_config to the event attribute
  perf: add ITRACE_START record to indicate that tracing has started
  x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection
  x86: perf: Intel PT and LBR/BTS are mutually exclusive
  x86: perf: intel_pt: Intel PT PMU driver
  x86: perf: intel_bts: Add BTS PMU driver
  perf: Add rb_{alloc,free}_kernel api
  perf: Add a helper to copy AUX data in the kernel
  perf: Add a helper for looking up pmus by type
  perf: itrace: Infrastructure for sampling instruction flow traces
  perf: Allocate ring buffers for inherited per-task kernel events
  perf: itrace: Allow itrace sampling for multiple events
  perf: itrace: Allow sampling of inherited events

Peter Zijlstra (1):
  perf: Add AUX area to ring buffer for raw data streams

 arch/x86/include/asm/cpufeature.h          |   1 +
 arch/x86/include/uapi/asm/msr-index.h      |  18 +
 arch/x86/kernel/cpu/Makefile               |   1 +
 arch/x86/kernel/cpu/intel_pt.h             | 129 ++++
 arch/x86/kernel/cpu/perf_event.h           |  14 +
 arch/x86/kernel/cpu/perf_event_intel.c     |  14 +-
 arch/x86/kernel/cpu/perf_event_intel_bts.c | 496 +++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |  11 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   9 +-
 arch/x86/kernel/cpu/perf_event_intel_pt.c  | 952 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/scattered.c            |   1 +
 include/linux/itrace.h                     |  45 ++
 include/linux/perf_event.h                 |  61 +-
 include/uapi/linux/perf_event.h            |  69 ++-
 kernel/events/Makefile                     |   2 +-
 kernel/events/core.c                       | 348 +++++++++--
 kernel/events/internal.h                   |  54 ++
 kernel/events/itrace.c                     | 234 +++++++
 kernel/events/ring_buffer.c                | 313 +++++++++-
 19 files changed, 2724 insertions(+), 48 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_pt.h
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c
 create mode 100644 include/linux/itrace.h
 create mode 100644 kernel/events/itrace.c

-- 
2.1.0.rc1



* [PATCH v3 01/23] perf: Add data_{offset,size} to user_page
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 02/23] perf: Add AUX area to ring buffer for raw data streams Alexander Shishkin
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Currently, the actual perf ring buffer starts one page into the mmap
area, following the user page, and userspace follows this convention.
This patch adds data_{offset,size} fields to the user_page that
userspace can use instead to locate the perf data in the mmap area.
This is also helpful when mapping existing or shared buffers whose size
is not known in advance.

Right now, it is made to follow the existing convention that

	data_offset == PAGE_SIZE and
	data_offset + data_size == mmap_size.
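
For illustration only (not part of the patch), a minimal userspace sketch
of how a tool could use these fields to locate the data area instead of
hard-coding the one-page offset; base is assumed to be the start of the
mmap()ed area and barriers/error handling are omitted:

	struct perf_event_mmap_page *pc = base;	/* base = mmap(..., fd, 0) */
	void *data = (char *)base + pc->data_offset;
	__u64 head, tail;

	head = pc->data_head;	/* kernel-written head */
	tail = pc->data_tail;	/* consumer-written tail */

	/* unread records live in data[] between tail and head,
	   modulo pc->data_size */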

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h | 5 +++++
 kernel/events/core.c            | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9269de2548..f7d18c2cb7 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -489,9 +489,14 @@ struct perf_event_mmap_page {
 	 * In this case the kernel will not over-write unread data.
 	 *
 	 * See perf_output_put_handle() for the data ordering.
+	 *
+	 * data_{offset,size} indicate the location and size of the perf record
+	 * buffer within the mmapped area.
 	 */
 	__u64   data_head;		/* head in the data section */
 	__u64	data_tail;		/* user-space written tail */
+	__u64	data_offset;		/* where the buffer starts */
+	__u64	data_size;		/* data buffer size */
 };
 
 #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1cf24b3e42..bf58d40a26 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3780,6 +3780,8 @@ static void perf_event_init_userpage(struct perf_event *event)
 	/* Allow new userspace to detect that bit 0 is deprecated */
 	userpg->cap_bit0_is_deprecated = 1;
 	userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
+	userpg->data_offset = PAGE_SIZE;
+	userpg->data_size = perf_data_size(rb);
 
 unlock:
 	rcu_read_unlock();
-- 
2.1.0.rc1



* [PATCH v3 02/23] perf: Add AUX area to ring buffer for raw data streams
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 01/23] perf: Add data_{offset,size} to user_page Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 03/23] perf: Support high-order allocations for AUX space Alexander Shishkin
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Peter Zijlstra, Alexander Shishkin

From: Peter Zijlstra <peterz@infradead.org>

This patch introduces "AUX space" in the perf mmap buffer, intended for
exporting high bandwidth data streams to userspace, such as instruction
flow traces.

AUX space is a ring buffer, defined by aux_{offset,size} fields in the
user_page structure, and read/write pointers aux_{head,tail}, which abide
by the same rules as data_* counterparts of the main perf buffer.

In order to allocate/mmap the AUX area, userspace needs to set aux_offset
to an offset greater than data_offset + data_size, and aux_size to the
desired buffer size. Both need to be page aligned. The same aux_offset
and aux_size should then be passed to the mmap() call, and if everything
adds up, the result is an AUX buffer.

Pages that are mapped into this buffer are also charged against the
user's mlock rlimit plus the perf_event_mlock_kb allowance.
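
As an illustration only (not part of the patch), the sequence described
above might look roughly like this in userspace, assuming fd comes from
perf_event_open(), page_size is the system page size, and error handling
is omitted:

	/* map the user page + 2^n data pages at offset 0 first */
	size_t data_len = (1 + 64) * page_size;
	void *base = mmap(NULL, data_len, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);
	struct perf_event_mmap_page *pc = base;

	/* place the AUX area right after the data area, both page aligned */
	size_t aux_len = 256 * page_size;	/* power-of-two number of pages */
	pc->aux_offset = data_len;
	pc->aux_size   = aux_len;

	void *aux = mmap(NULL, aux_len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, pc->aux_offset);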

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |  17 +++++
 include/uapi/linux/perf_event.h |  16 +++++
 kernel/events/core.c            | 140 +++++++++++++++++++++++++++++++++-------
 kernel/events/internal.h        |  21 ++++++
 kernel/events/ring_buffer.c     |  86 ++++++++++++++++++++++--
 5 files changed, 251 insertions(+), 29 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 707617a8c0..cf62338421 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -262,6 +262,18 @@ struct pmu {
 	 * flush branch stack on context-switches (needed in cpu-wide mode)
 	 */
 	void (*flush_branch_stack)	(void);
+
+	/*
+	 * Set up pmu-private data structures for an AUX area
+	 */
+	void *(*setup_aux)		(int cpu, void **pages,
+					 int nr_pages, bool overwrite);
+					/* optional */
+
+	/*
+	 * Free pmu-private AUX data structures
+	 */
+	void (*free_aux)		(void *aux); /* optional */
 };
 
 /**
@@ -770,6 +782,11 @@ static inline bool has_branch_stack(struct perf_event *event)
 	return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
 }
 
+static inline bool has_aux(struct perf_event *event)
+{
+	return event->pmu->setup_aux;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index f7d18c2cb7..7e0967c0f5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -497,6 +497,22 @@ struct perf_event_mmap_page {
 	__u64	data_tail;		/* user-space written tail */
 	__u64	data_offset;		/* where the buffer starts */
 	__u64	data_size;		/* data buffer size */
+
+	/*
+	 * AUX area is defined by aux_{offset,size} fields that should be set
+	 * by the userspace, so that
+	 *
+	 *   aux_offset >= data_offset + data_size
+	 *
+	 * prior to mmap()ing it. Size of the mmap()ed area should be aux_size.
+	 *
+	 * Ring buffer pointers aux_{head,tail} have the same semantics as
+	 * data_{head,tail} and same ordering rules apply.
+	 */
+	__u64	aux_head;
+	__u64	aux_tail;
+	__u64	aux_offset;
+	__u64	aux_size;
 };
 
 #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bf58d40a26..c07768d0a0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3979,6 +3979,8 @@ static void perf_mmap_open(struct vm_area_struct *vma)
 
 	atomic_inc(&event->mmap_count);
 	atomic_inc(&event->rb->mmap_count);
+	if (vma->vm_pgoff)
+		atomic_inc(&event->rb->aux_mmap_count);
 }
 
 /*
@@ -3998,6 +4000,20 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	int mmap_locked = rb->mmap_locked;
 	unsigned long size = perf_data_size(rb);
 
+	/*
+	 * rb->aux_mmap_count will always drop before rb->mmap_count and
+	 * event->mmap_count, so it is ok to use event->mmap_mutex to
+	 * serialize with perf_mmap here.
+	 */
+	if (rb_has_aux(rb) && vma->vm_pgoff == rb->aux_pgoff &&
+	    atomic_dec_and_mutex_lock(&rb->aux_mmap_count, &event->mmap_mutex)) {
+		atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
+		vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
+
+		rb_free_aux(rb, event);
+		mutex_unlock(&event->mmap_mutex);
+	}
+
 	atomic_dec(&rb->mmap_count);
 
 	if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
@@ -4071,7 +4087,7 @@ out_put:
 
 static const struct vm_operations_struct perf_mmap_vmops = {
 	.open		= perf_mmap_open,
-	.close		= perf_mmap_close,
+	.close		= perf_mmap_close, /* non mergable */
 	.fault		= perf_mmap_fault,
 	.page_mkwrite	= perf_mmap_fault,
 };
@@ -4082,10 +4098,10 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	unsigned long user_locked, user_lock_limit;
 	struct user_struct *user = current_user();
 	unsigned long locked, lock_limit;
-	struct ring_buffer *rb;
+	struct ring_buffer *rb = NULL;
 	unsigned long vma_size;
 	unsigned long nr_pages;
-	long user_extra, extra;
+	long user_extra = 0, extra = 0;
 	int ret = 0, flags = 0;
 
 	/*
@@ -4100,7 +4116,66 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	vma_size = vma->vm_end - vma->vm_start;
-	nr_pages = (vma_size / PAGE_SIZE) - 1;
+
+	if (vma->vm_pgoff == 0) {
+		nr_pages = (vma_size / PAGE_SIZE) - 1;
+	} else {
+		/*
+		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
+		 * mapped, all subsequent mappings should have the same size
+		 * and offset. Must be above the normal perf buffer.
+		 */
+		u64 aux_offset, aux_size;
+
+		if (!event->rb)
+			return -EINVAL;
+
+		nr_pages = vma_size / PAGE_SIZE;
+
+		mutex_lock(&event->mmap_mutex);
+		ret = -EINVAL;
+
+		rb = event->rb;
+		if (!rb)
+			goto aux_unlock;
+
+		aux_offset = ACCESS_ONCE(rb->user_page->aux_offset);
+		aux_size = ACCESS_ONCE(rb->user_page->aux_size);
+
+		if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
+			goto aux_unlock;
+
+		if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
+			goto aux_unlock;
+
+		/* already mapped with a different offset */
+		if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
+			goto aux_unlock;
+
+		if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
+			goto aux_unlock;
+
+		/* already mapped with a different size */
+		if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
+			goto aux_unlock;
+
+		if (!is_power_of_2(nr_pages))
+			goto aux_unlock;
+
+		if (!atomic_inc_not_zero(&rb->mmap_count))
+			goto aux_unlock;
+
+		if (rb_has_aux(rb)) {
+			atomic_inc(&rb->aux_mmap_count);
+			ret = 0;
+			goto unlock;
+		}
+
+		atomic_set(&rb->aux_mmap_count, 1);
+		user_extra = nr_pages;
+
+		goto accounting;
+	}
 
 	/*
 	 * If we have rb pages ensure they're a power-of-two number, so we
@@ -4112,9 +4187,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma_size != PAGE_SIZE * (1 + nr_pages))
 		return -EINVAL;
 
-	if (vma->vm_pgoff != 0)
-		return -EINVAL;
-
 	WARN_ON_ONCE(event->ctx->parent_ctx);
 again:
 	mutex_lock(&event->mmap_mutex);
@@ -4138,6 +4210,8 @@ again:
 	}
 
 	user_extra = nr_pages + 1;
+
+accounting:
 	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
 
 	/*
@@ -4147,7 +4221,6 @@ again:
 
 	user_locked = atomic_long_read(&user->locked_vm) + user_extra;
 
-	extra = 0;
 	if (user_locked > user_lock_limit)
 		extra = user_locked - user_lock_limit;
 
@@ -4161,35 +4234,45 @@ again:
 		goto unlock;
 	}
 
-	WARN_ON(event->rb);
+	WARN_ON(!rb && event->rb);
 
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
 
-	rb = rb_alloc(nr_pages, 
-		event->attr.watermark ? event->attr.wakeup_watermark : 0,
-		event->cpu, flags);
-
 	if (!rb) {
-		ret = -ENOMEM;
-		goto unlock;
-	}
+		rb = rb_alloc(nr_pages,
+			      event->attr.watermark ? event->attr.wakeup_watermark : 0,
+			      event->cpu, flags);
 
-	atomic_set(&rb->mmap_count, 1);
-	rb->mmap_locked = extra;
-	rb->mmap_user = get_current_user();
+		if (!rb) {
+			ret = -ENOMEM;
+			goto unlock;
+		}
 
-	atomic_long_add(user_extra, &user->locked_vm);
-	vma->vm_mm->pinned_vm += extra;
+		atomic_set(&rb->mmap_count, 1);
+		rb->mmap_user = get_current_user();
+		rb->mmap_locked = extra;
 
-	ring_buffer_attach(event, rb);
+		ring_buffer_attach(event, rb);
 
-	perf_event_init_userpage(event);
-	perf_event_update_userpage(event);
+		perf_event_init_userpage(event);
+		perf_event_update_userpage(event);
+	} else {
+		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+		if (ret)
+			atomic_dec(&rb->mmap_count);
+		else
+			rb->aux_mmap_locked = extra;
+	}
 
 unlock:
-	if (!ret)
+	if (!ret) {
+		atomic_long_add(user_extra, &user->locked_vm);
+		vma->vm_mm->pinned_vm += extra;
+
 		atomic_inc(&event->mmap_count);
+	}
+aux_unlock:
 	mutex_unlock(&event->mmap_mutex);
 
 	/*
@@ -7036,6 +7119,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 	if (output_event->cpu == -1 && output_event->ctx != event->ctx)
 		goto out;
 
+	/*
+	 * If both events generate aux data, they must be on the same PMU
+	 */
+	if (has_aux(event) && has_aux(output_event) &&
+	    event->pmu != output_event->pmu)
+		goto out;
+
 set:
 	mutex_lock(&event->mmap_mutex);
 	/* Can't redirect output if we've got an active mmap() */
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b218782..e5374030b1 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -35,6 +35,14 @@ struct ring_buffer {
 	unsigned long			mmap_locked;
 	struct user_struct		*mmap_user;
 
+	/* AUX area */
+	unsigned long			aux_pgoff;
+	int				aux_nr_pages;
+	atomic_t			aux_mmap_count;
+	unsigned long			aux_mmap_locked;
+	void				**aux_pages;
+	void				*aux_priv;
+
 	struct perf_event_mmap_page	*user_page;
 	void				*data_pages[0];
 };
@@ -43,6 +51,14 @@ extern void rb_free(struct ring_buffer *rb);
 extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
 extern void perf_event_wakeup(struct perf_event *event);
+extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+			pgoff_t pgoff, int nr_pages, int flags);
+extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
+
+static inline bool rb_has_aux(struct ring_buffer *rb)
+{
+	return !!rb->aux_nr_pages;
+}
 
 extern void
 perf_event_header__init_id(struct perf_event_header *header,
@@ -81,6 +97,11 @@ static inline unsigned long perf_data_size(struct ring_buffer *rb)
 	return rb->nr_pages << (PAGE_SHIFT + page_order(rb));
 }
 
+static inline unsigned long perf_aux_size(struct ring_buffer *rb)
+{
+	return rb->aux_nr_pages << PAGE_SHIFT;
+}
+
 #define DEFINE_OUTPUT_COPY(func_name, memcpy_func)			\
 static inline unsigned long						\
 func_name(struct perf_output_handle *handle,				\
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 146a5792b1..00708d5916 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,14 +242,76 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	spin_lock_init(&rb->event_lock);
 }
 
+int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+		 pgoff_t pgoff, int nr_pages, int flags)
+{
+	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
+	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
+	int ret = -ENOMEM;
+
+	if (!has_aux(event))
+		return -ENOTSUPP;
+
+	rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
+	if (!rb->aux_pages)
+		return -ENOMEM;
+
+	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
+	     rb->aux_nr_pages++) {
+		struct page *page;
+
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+		if (!page)
+			goto out;
+
+		rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+	}
+
+	rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
+					     overwrite);
+	if (rb->aux_priv)
+		ret = 0;
+
+out:
+	if (!ret)
+		rb->aux_pgoff = pgoff;
+	else
+		rb_free_aux(rb, event);
+
+	return ret;
+}
+
+void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
+{
+	struct perf_event *iter;
+	int pg;
+
+	if (rb->aux_priv) {
+		/* disable all potential writers before freeing */
+		rcu_read_lock();
+		list_for_each_entry_rcu(iter, &rb->event_list, rb_entry)
+			perf_event_disable(iter);
+		rcu_read_unlock();
+
+		event->pmu->free_aux(rb->aux_priv);
+		rb->aux_priv = NULL;
+	}
+
+	for (pg = 0; pg < rb->aux_nr_pages; pg++)
+		free_page((unsigned long)rb->aux_pages[pg]);
+
+	kfree(rb->aux_pages);
+	rb->aux_nr_pages = 0;
+}
+
 #ifndef CONFIG_PERF_USE_VMALLOC
 
 /*
  * Back perf_mmap() with regular GFP_KERNEL-0 pages.
  */
 
-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
 {
 	if (pgoff > rb->nr_pages)
 		return NULL;
@@ -339,8 +401,8 @@ static int data_page_nr(struct ring_buffer *rb)
 	return rb->nr_pages << page_order(rb);
 }
 
-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
 {
 	/* The '>' counts in the user page. */
 	if (pgoff > data_page_nr(rb))
@@ -415,3 +477,19 @@ fail:
 }
 
 #endif
+
+struct page *
+perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+{
+	if (rb->aux_nr_pages) {
+		/* above AUX space */
+		if (pgoff > rb->aux_pgoff + rb->aux_nr_pages)
+			return NULL;
+
+		/* AUX space */
+		if (pgoff >= rb->aux_pgoff)
+			return virt_to_page(rb->aux_pages[pgoff - rb->aux_pgoff]);
+	}
+
+	return __perf_mmap_to_page(rb, pgoff);
+}
-- 
2.1.0.rc1



* [PATCH v3 03/23] perf: Support high-order allocations for AUX space
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 01/23] perf: Add data_{offset,size} to user_page Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 02/23] perf: Add AUX area to ring buffer for raw data streams Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 04/23] perf: Add a capability for AUX_NO_SG pmus to do software double buffering Alexander Shishkin
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Some pmus (such as BTS or Intel PT without the multiple-entry ToPA
capability) don't support scatter-gather and prefer larger contiguous
areas for their output regions.

This patch adds a new pmu capability for requesting higher-order allocations.
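
To illustrate how a driver might consume such an allocation (a sketch,
not code from this series; the xyz_* name is made up): the chunk size is
communicated via the page private field, so the driver's setup_aux()
callback can walk the page array and recover each contiguous region:

	static void *xyz_pmu_setup_aux(int cpu, void **pages, int nr_pages,
				       bool overwrite)
	{
		int pg, order = 0;

		for (pg = 0; pg < nr_pages; pg += 1 << order) {
			struct page *page = virt_to_page(pages[pg]);

			/* order is 0 unless the allocator marked a larger chunk */
			order = PagePrivate(page) ? page_private(page) : 0;
			/* pages[pg]..pages[pg + (1 << order) - 1] are contiguous */
		}

		return pages;	/* stand-in for a driver-private descriptor */
	}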

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  |  1 +
 kernel/events/ring_buffer.c | 51 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index cf62338421..4d9ede200f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -170,6 +170,7 @@ struct perf_event;
  * pmu::capabilities flags
  */
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
+#define PERF_PMU_CAP_AUX_NO_SG			0x02
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 00708d5916..d10919ca42 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,29 +242,68 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	spin_lock_init(&rb->event_lock);
 }
 
+#define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
+
+static struct page *rb_alloc_aux_page(int node, int order)
+{
+	struct page *page;
+
+	if (order > MAX_ORDER)
+		order = MAX_ORDER;
+
+	do {
+		page = alloc_pages_node(node, PERF_AUX_GFP, order);
+	} while (!page && order--);
+
+	if (page && order) {
+		/*
+		 * Communicate the allocation size to the driver
+		 */
+		split_page(page, order);
+		SetPagePrivate(page);
+		set_page_private(page, order);
+	}
+
+	return page;
+}
+
+static void rb_free_aux_page(struct ring_buffer *rb, int idx)
+{
+	struct page *page = virt_to_page(rb->aux_pages[idx]);
+
+	ClearPagePrivate(page);
+	page->mapping = NULL;
+	__free_page(page);
+}
+
 int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 		 pgoff_t pgoff, int nr_pages, int flags)
 {
 	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
 	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
-	int ret = -ENOMEM;
+	int ret = -ENOMEM, order = 0;
 
 	if (!has_aux(event))
 		return -ENOTSUPP;
 
+	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+		order = get_order(nr_pages * PAGE_SIZE);
+
 	rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
 	if (!rb->aux_pages)
 		return -ENOMEM;
 
-	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
-	     rb->aux_nr_pages++) {
+	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;) {
 		struct page *page;
+		int last;
 
-		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+		page = rb_alloc_aux_page(node, order);
 		if (!page)
 			goto out;
 
-		rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+		for (last = rb->aux_nr_pages + (1 << page_private(page));
+		     last > rb->aux_nr_pages; rb->aux_nr_pages++)
+			rb->aux_pages[rb->aux_nr_pages] = page_address(page++);
 	}
 
 	rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
@@ -298,7 +337,7 @@ void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
 	}
 
 	for (pg = 0; pg < rb->aux_nr_pages; pg++)
-		free_page((unsigned long)rb->aux_pages[pg]);
+		rb_free_aux_page(rb, pg);
 
 	kfree(rb->aux_pages);
 	rb->aux_nr_pages = 0;
-- 
2.1.0.rc1



* [PATCH v3 04/23] perf: Add a capability for AUX_NO_SG pmus to do software double buffering
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (2 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 03/23] perf: Support high-order allocations for AUX space Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 05/23] perf: Add a pmu capability for "exclusive" events Alexander Shishkin
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

For pmus that don't support scatter-gather for AUX data in hardware, it
might still make sense to implement software double buffering to avoid
losing data while the user is reading it out. For this purpose, add a
pmu capability that guarantees multiple high-order chunks for the AUX
buffer, so that the pmu driver can do switchover tricks.
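
For instance (illustration only, with a made-up driver name), a driver
that wants both behaviours would advertise them when registering its pmu:

	static struct pmu xyz_pmu = {
		/* ... */
		.capabilities	= PERF_PMU_CAP_AUX_NO_SG |
				  PERF_PMU_CAP_AUX_SW_DOUBLEBUF,
		.setup_aux	= xyz_pmu_setup_aux,
		.free_aux	= xyz_pmu_free_aux,
	};

With both flags set and a non-overwrite mapping, rb_alloc_aux() drops the
allocation order by one, so the AUX buffer consists of at least two
contiguous chunks that the driver can switch between.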

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  |  1 +
 kernel/events/ring_buffer.c | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4d9ede200f..71be27169c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -171,6 +171,7 @@ struct perf_event;
  */
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
 #define PERF_PMU_CAP_AUX_NO_SG			0x02
+#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF		0x04
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index d10919ca42..f5ee3669f8 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -286,9 +286,22 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 	if (!has_aux(event))
 		return -ENOTSUPP;
 
-	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
 		order = get_order(nr_pages * PAGE_SIZE);
 
+		/*
+		 * PMU requests more than one contiguous chunks of memory
+		 * for SW double buffering
+		 */
+		if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) &&
+		    !overwrite) {
+			if (!order)
+				return -EINVAL;
+
+			order--;
+		}
+	}
+
 	rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
 	if (!rb->aux_pages)
 		return -ENOMEM;
-- 
2.1.0.rc1



* [PATCH v3 05/23] perf: Add a pmu capability for "exclusive" events
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (3 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 04/23] perf: Add a capability for AUX_NO_SG pmus to do software double buffering Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 06/23] perf: Redirect output from inherited events to parents Alexander Shishkin
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Usually, pmus that do instruction tracing, for example, would only ever
be able to have one event per task per cpu (or per perf_context). For such
pmus it makes sense to disallow the creation of conflicting events early
on, so as to provide consistent behavior for the user.

This patch adds a pmu capability that indicates such constraint on event
creation.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 71be27169c..b8902ebcb7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -172,6 +172,7 @@ struct perf_event;
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
 #define PERF_PMU_CAP_AUX_NO_SG			0x02
 #define PERF_PMU_CAP_AUX_SW_DOUBLEBUF		0x04
+#define PERF_PMU_CAP_EXCLUSIVE			0x08
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c07768d0a0..361a45eee2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7149,6 +7149,32 @@ out:
 	return ret;
 }
 
+static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2)
+{
+	if ((e1->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) &&
+	    (e1->cpu == e2->cpu ||
+	     e1->cpu == -1 ||
+	     e2->cpu == -1))
+		return true;
+	return false;
+}
+
+static bool exclusive_event_ok(struct perf_event *event,
+			      struct perf_event_context *ctx)
+{
+	struct perf_event *iter_event;
+
+	if (!(event->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE))
+		return true;
+
+	list_for_each_entry(iter_event, &ctx->event_list, event_entry) {
+		if (exclusive_event_match(iter_event, event))
+			return false;
+	}
+
+	return true;
+}
+
 /**
  * sys_perf_event_open - open a performance event, associate it to a task/cpu
  *
@@ -7300,6 +7326,11 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_alloc;
 	}
 
+	if ((pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) && group_leader) {
+		err = -EBUSY;
+		goto err_context;
+	}
+
 	if (task) {
 		put_task_struct(task);
 		task = NULL;
@@ -7385,6 +7416,12 @@ SYSCALL_DEFINE5(perf_event_open,
 		}
 	}
 
+	if (!exclusive_event_ok(event, ctx)) {
+		mutex_unlock(&ctx->mutex);
+		fput(event_file);
+		goto err_context;
+	}
+
 	perf_install_in_context(ctx, event, event->cpu);
 	perf_unpin_context(ctx);
 	mutex_unlock(&ctx->mutex);
@@ -7468,6 +7505,14 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 
 	WARN_ON_ONCE(ctx->parent_ctx);
 	mutex_lock(&ctx->mutex);
+	if (!exclusive_event_ok(event, ctx)) {
+		mutex_unlock(&ctx->mutex);
+		perf_unpin_context(ctx);
+		put_ctx(ctx);
+		err = -EBUSY;
+		goto err_free;
+	}
+
 	perf_install_in_context(ctx, event, cpu);
 	perf_unpin_context(ctx);
 	mutex_unlock(&ctx->mutex);
-- 
2.1.0.rc1



* [PATCH v3 06/23] perf: Redirect output from inherited events to parents
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (4 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 05/23] perf: Add a pmu capability for "exclusive" events Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 07/23] perf: Add api for pmus to write to AUX space Alexander Shishkin
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

In order to collect AUX data from an inherited event, we can redirect its
output to the parent's ring buffer where possible (they must be cpu
affine). This patch adds set_output() to the inheritance path.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 361a45eee2..55cd524564 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7859,6 +7859,12 @@ inherit_event(struct perf_event *parent_event,
 		= parent_event->overflow_handler_context;
 
 	/*
+	 * Direct child's output to parent's ring buffer (if any)
+	 */
+	if (parent_event->cpu != -1)
+		(void)perf_event_set_output(child_event, parent_event);
+
+	/*
 	 * Precalculate sample_data sizes
 	 */
 	perf_event__header_size(child_event);
-- 
2.1.0.rc1



* [PATCH v3 07/23] perf: Add api for pmus to write to AUX space
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (5 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 06/23] perf: Redirect output from inherited events to parents Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 08/23] perf: Add AUX record Alexander Shishkin
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

For pmus that wish to write data to the AUX space, provide
perf_aux_output_{begin,end}() calls to initiate/commit data writes,
similarly to perf_output_{begin,end}. These also use the same output
handle structure.

After perf_aux_output_begin() returns successfully, handle->size is set
to the maximum amount of data that can be written with respect to the
aux_tail pointer, so that no data the user hasn't seen yet will be
overwritten.

The PMU driver should pass the actual amount of data written as a
parameter to perf_aux_output_end().

Nested writers are forbidden and guards are in place to catch such
attempts.
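
Roughly, a driver's start/interrupt path would use these as in the
following sketch (illustration only; the xyz_* helper is made up):

	struct perf_output_handle handle;
	unsigned long written;
	void *buf;

	/* returns the pmu-private AUX context created by ->setup_aux() */
	buf = perf_aux_output_begin(&handle, event);
	if (!buf)
		return;		/* no AUX buffer, no room, or a nested writer */

	/*
	 * At most handle.size bytes may be produced without overwriting
	 * data that userspace has not consumed yet.
	 */
	written = xyz_hw_read_trace(buf, handle.size);

	/* commit what actually landed; pass true if data had to be dropped */
	perf_aux_output_end(&handle, written, false);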

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  | 23 ++++++++++++-
 kernel/events/core.c        |  5 ++-
 kernel/events/internal.h    |  4 +++
 kernel/events/ring_buffer.c | 81 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 109 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b8902ebcb7..94961c73e0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -550,12 +550,22 @@ struct perf_output_handle {
 	struct ring_buffer		*rb;
 	unsigned long			wakeup;
 	unsigned long			size;
-	void				*addr;
+	union {
+		void			*addr;
+		unsigned long		head;
+	};
 	int				page;
 };
 
 #ifdef CONFIG_PERF_EVENTS
 
+extern void *perf_aux_output_begin(struct perf_output_handle *handle,
+				   struct perf_event *event);
+extern void perf_aux_output_end(struct perf_output_handle *handle,
+				unsigned long size, bool truncated);
+extern int perf_aux_output_skip(struct perf_output_handle *handle,
+				unsigned long size);
+extern void *perf_get_aux(struct perf_output_handle *handle);
 extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
 extern void perf_pmu_unregister(struct pmu *pmu);
 
@@ -805,6 +815,17 @@ extern void perf_event_disable(struct perf_event *event);
 extern int __perf_event_disable(void *info);
 extern void perf_event_task_tick(void);
 #else /* !CONFIG_PERF_EVENTS: */
+static inline void *
+perf_aux_output_begin(struct perf_output_handle *handle,
+		      struct perf_event *event)				{ return NULL; }
+static inline void
+perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+		    bool truncated)					{ }
+static inline int
+perf_aux_output_skip(struct perf_output_handle *handle,
+		     unsigned long size)				{ return -EINVAL; }
+static inline void *
+perf_get_aux(struct perf_output_handle *handle)				{ return NULL; }
 static inline void
 perf_event_task_sched_in(struct task_struct *prev,
 			 struct task_struct *task)			{ }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 55cd524564..848f2af576 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3212,7 +3212,6 @@ static void free_event_rcu(struct rcu_head *head)
 	kfree(event);
 }
 
-static void ring_buffer_put(struct ring_buffer *rb);
 static void ring_buffer_attach(struct perf_event *event,
 			       struct ring_buffer *rb);
 
@@ -3948,7 +3947,7 @@ static void rb_free_rcu(struct rcu_head *rcu_head)
 	rb_free(rb);
 }
 
-static struct ring_buffer *ring_buffer_get(struct perf_event *event)
+struct ring_buffer *ring_buffer_get(struct perf_event *event)
 {
 	struct ring_buffer *rb;
 
@@ -3963,7 +3962,7 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
 	return rb;
 }
 
-static void ring_buffer_put(struct ring_buffer *rb)
+void ring_buffer_put(struct ring_buffer *rb)
 {
 	if (!atomic_dec_and_test(&rb->refcount))
 		return;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index e5374030b1..b8f6c193ea 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -36,6 +36,8 @@ struct ring_buffer {
 	struct user_struct		*mmap_user;
 
 	/* AUX area */
+	local_t				aux_head;
+	local_t				aux_nest;
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
 	atomic_t			aux_mmap_count;
@@ -54,6 +56,8 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, int flags);
 extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
+extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
+extern void ring_buffer_put(struct ring_buffer *rb);
 
 static inline bool rb_has_aux(struct ring_buffer *rb)
 {
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index f5ee3669f8..feee52077f 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,6 +242,87 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	spin_lock_init(&rb->event_lock);
 }
 
+void *perf_aux_output_begin(struct perf_output_handle *handle,
+			    struct perf_event *event)
+{
+	unsigned long aux_head, aux_tail;
+	struct ring_buffer *rb;
+
+	rb = ring_buffer_get(event);
+	if (!rb)
+		return NULL;
+
+	if (!rb_has_aux(rb))
+		goto err;
+
+	/*
+	 * Nesting is not supported for AUX area, make sure nested
+	 * writers are caught early
+	 */
+	if (WARN_ON_ONCE(local_xchg(&rb->aux_nest, 1)))
+		goto err;
+
+	aux_head = local_read(&rb->aux_head);
+	aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+
+	handle->rb = rb;
+	handle->event = event;
+	handle->head = aux_head;
+	handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
+
+	if (!handle->size) {
+		event->pending_disable = 1;
+		event->hw.state = PERF_HES_STOPPED;
+		perf_output_wakeup(handle);
+		local_set(&rb->aux_nest, 0);
+		goto err;
+	}
+
+	return handle->rb->aux_priv;
+
+err:
+	ring_buffer_put(rb);
+	handle->event = NULL;
+
+	return NULL;
+}
+
+void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+			 bool truncated)
+{
+	struct ring_buffer *rb = handle->rb;
+
+	local_add(size, &rb->aux_head);
+
+	smp_wmb();
+	rb->user_page->aux_head = local_read(&rb->aux_head);
+
+	perf_output_wakeup(handle);
+	handle->event = NULL;
+
+	local_set(&rb->aux_nest, 0);
+	ring_buffer_put(rb);
+}
+
+int perf_aux_output_skip(struct perf_output_handle *handle, unsigned long size)
+{
+	struct ring_buffer *rb = handle->rb;
+
+	if (size > handle->size)
+		return -ENOSPC;
+
+	local_add(size, &rb->aux_head);
+	handle->head = local_read(&rb->aux_head);
+	handle->size -= size;
+
+	return 0;
+}
+
+void *perf_get_aux(struct perf_output_handle *handle)
+{
+	return handle->rb->aux_priv;
+}
+
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
 static struct page *rb_alloc_aux_page(int node, int order)
-- 
2.1.0.rc1



* [PATCH v3 08/23] perf: Add AUX record
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (6 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 07/23] perf: Add api for pmus to write to AUX space Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 09/23] perf: Support overwrite mode for AUX area Alexander Shishkin
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

When there's new data in the AUX space, output a record indicating its
offset and size and whether it was truncated to fit in the ring buffer.
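
To illustrate the consumer side (a userspace sketch, not part of this
patch; wrap-around handling and the barrier before the aux_tail store are
omitted), a tool would copy the new data out of the AUX mapping and then
advance aux_tail:

	/*
	 * rec points at a PERF_RECORD_AUX laid out as above, aux at the
	 * mmap()ed AUX area, pc at the perf_event_mmap_page.
	 */
	__u64 off = rec->aux_offset & (pc->aux_size - 1);

	memcpy(out, (char *)aux + off, rec->aux_size);
	if (rec->truncated)
		fprintf(stderr, "AUX data truncated, the trace will have gaps\n");

	/* release the space back to the kernel */
	pc->aux_tail = rec->aux_offset + rec->aux_size;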

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h | 16 ++++++++++++++++
 kernel/events/core.c            | 39 +++++++++++++++++++++++++++++++++++++++
 kernel/events/internal.h        |  3 +++
 kernel/events/ring_buffer.c     |  1 +
 4 files changed, 59 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7e0967c0f5..c022c3d756 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -733,6 +733,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * Records that new data landed in the AUX buffer part.
+	 *
+	 * struct {
+	 * 	struct perf_event_header	header;
+	 *
+	 * 	u64				aux_offset;
+	 * 	u64				aux_size;
+	 *	u8				truncated;
+	 *	u8				reserved[7];
+	 *	u64				id;
+	 *	u64				stream_id;
+	 * };
+	 */
+	PERF_RECORD_AUX				= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 848f2af576..25aad70812 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5443,6 +5443,45 @@ void perf_event_mmap(struct vm_area_struct *vma)
 	perf_event_mmap_event(&mmap_event);
 }
 
+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+			  unsigned long size, bool truncated)
+{
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	struct perf_aux_event {
+		struct perf_event_header	header;
+		u64				offset;
+		u64				size;
+		u8				truncated;
+		u8				reserved[7];
+		u64				id;
+		u64				stream_id;
+	} rec = {
+		.header = {
+			.type = PERF_RECORD_AUX,
+			.misc = 0,
+			.size = sizeof(rec),
+		},
+		.offset		= head,
+		.size		= size,
+		.truncated	= truncated,
+		.id		= primary_event_id(event),
+		.stream_id	= event->id,
+	};
+	int ret;
+
+	perf_event_header__init_id(&rec.header, &sample, event);
+	ret = perf_output_begin(&handle, event, rec.header.size);
+
+	if (ret)
+		return;
+
+	perf_output_put(&handle, rec);
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+}
+
 /*
  * IRQ throttle logging
  */
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index b8f6c193ea..c6b2987afe 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -64,6 +64,9 @@ static inline bool rb_has_aux(struct ring_buffer *rb)
 	return !!rb->aux_nr_pages;
 }
 
+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+			  unsigned long size, bool truncated);
+
 extern void
 perf_event_header__init_id(struct perf_event_header *header,
 			   struct perf_sample_data *data,
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index feee52077f..598c02a555 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -293,6 +293,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
 	struct ring_buffer *rb = handle->rb;
 
 	local_add(size, &rb->aux_head);
+	perf_event_aux_event(handle->event, aux_head, size, truncated);
 
 	smp_wmb();
 	rb->user_page->aux_head = local_read(&rb->aux_head);
-- 
2.1.0.rc1



* [PATCH v3 09/23] perf: Support overwrite mode for AUX area
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (7 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 08/23] perf: Add AUX record Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 10/23] perf: Add wakeup watermark control to " Alexander Shishkin
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

This adds support for overwrite mode in the AUX area, which means "keep
collecting data till you're stopped". It does not depend on the data
buffer's overwrite mode, so that no sideband data, which is instrumental
for processing the AUX data, is lost.
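
Whether a given AUX area is in overwrite mode is decided by the mapping
permissions: rb_alloc_aux() treats an AUX area that was not mapped
writable as an overwrite buffer. In userspace terms (illustration only):

	/* normal mode: the kernel honours aux_tail and stops before unread data */
	aux = mmap(NULL, aux_len, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, pc->aux_offset);

	/* overwrite mode: keep collecting until stopped, aux_tail is ignored */
	aux = mmap(NULL, aux_len, PROT_READ, MAP_SHARED, fd, pc->aux_offset);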

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/internal.h    |  1 +
 kernel/events/ring_buffer.c | 40 +++++++++++++++++++++++++++++-----------
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index c6b2987afe..4607742be8 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -40,6 +40,7 @@ struct ring_buffer {
 	local_t				aux_nest;
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
+	int				aux_overwrite;
 	atomic_t			aux_mmap_count;
 	unsigned long			aux_mmap_locked;
 	void				**aux_pages;
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 598c02a555..4ee7723d87 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -263,19 +263,23 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 		goto err;
 
 	aux_head = local_read(&rb->aux_head);
-	aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
 
 	handle->rb = rb;
 	handle->event = event;
 	handle->head = aux_head;
-	handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
-
-	if (!handle->size) {
-		event->pending_disable = 1;
-		event->hw.state = PERF_HES_STOPPED;
-		perf_output_wakeup(handle);
-		local_set(&rb->aux_nest, 0);
-		goto err;
+	if (!rb->aux_overwrite) {
+		aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+		handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
+
+		if (!handle->size) {
+			event->pending_disable = 1;
+			event->hw.state = PERF_HES_STOPPED;
+			perf_output_wakeup(handle);
+			local_set(&rb->aux_nest, 0);
+			goto err;
+		}
+	} else {
+		handle->size = 0;
 	}
 
 	return handle->rb->aux_priv;
@@ -291,9 +295,22 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
 			 bool truncated)
 {
 	struct ring_buffer *rb = handle->rb;
+	unsigned long aux_head;
 
-	local_add(size, &rb->aux_head);
-	perf_event_aux_event(handle->event, aux_head, size, truncated);
+	aux_head = local_read(&rb->aux_head);
+
+	if (rb->aux_overwrite) {
+		local_set(&rb->aux_head, size);
+
+		/*
+		 * Send a RECORD_AUX with size==0 to communicate aux_head
+		 * of this snapshot to userspace
+		 */
+		perf_event_aux_event(handle->event, size, 0, truncated);
+	} else {
+		local_add(size, &rb->aux_head);
+		perf_event_aux_event(handle->event, aux_head, size, truncated);
+	}
 
 	smp_wmb();
 	rb->user_page->aux_head = local_read(&rb->aux_head);
@@ -405,6 +422,7 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 					     overwrite);
 	if (rb->aux_priv)
 		ret = 0;
+	rb->aux_overwrite = overwrite;
 
 out:
 	if (!ret)
-- 
2.1.0.rc1



* [PATCH v3 10/23] perf: Add wakeup watermark control to AUX area
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (8 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 09/23] perf: Support overwrite mode for AUX area Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 11/23] perf: Add itrace_config to the event attribute Alexander Shishkin
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

When the AUX area gets a certain amount of new data, we want to wake up
userspace to collect it. This adds a new control to specify how much
data will cause a wakeup.

We repurpose __reserved_2 in the event attribute for this. Even though
it was never checked to be zero before, aux_watermark will only matter
to new AUX-aware code, so old code should still be fine.
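
For example (illustrative values only), a tool asking for a wakeup
roughly every 32KiB of new AUX data would do:

	struct perf_event_attr attr = {
		.type		= PERF_TYPE_HARDWARE,	/* or an itrace pmu's type */
		.size		= sizeof(attr),
		.aux_watermark	= 32 * 1024,
	};
	int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

If aux_watermark is left at zero and the AUX area is not in overwrite
mode, rb_alloc_aux() defaults it to half of the AUX buffer.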

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h |  7 +++++--
 kernel/events/core.c            |  3 ++-
 kernel/events/internal.h        |  4 +++-
 kernel/events/ring_buffer.c     | 17 ++++++++++++++---
 4 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c022c3d756..507b5e1f5b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -238,6 +238,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER2	80	/* add: branch_sample_type */
 #define PERF_ATTR_SIZE_VER3	96	/* add: sample_regs_user */
 					/* add: sample_stack_user */
+					/* add: aux_watermark */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -332,8 +333,10 @@ struct perf_event_attr {
 	 */
 	__u32	sample_stack_user;
 
-	/* Align to u64. */
-	__u32	__reserved_2;
+	/*
+	 * Wakeup watermark for AUX area
+	 */
+	__u32	aux_watermark;
 };
 
 #define perf_flags(attr)	(*(&(attr)->read_format + 1))
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 25aad70812..2de7d40cb6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4257,7 +4257,8 @@ accounting:
 		perf_event_init_userpage(event);
 		perf_event_update_userpage(event);
 	} else {
-		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
+				   event->attr.aux_watermark, flags);
 		if (ret)
 			atomic_dec(&rb->mmap_count);
 		else
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 4607742be8..4f99987bc3 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -27,6 +27,7 @@ struct ring_buffer {
 	local_t				lost;		/* nr records lost   */
 
 	long				watermark;	/* wakeup watermark  */
+	long				aux_watermark;
 	/* poll crap */
 	spinlock_t			event_lock;
 	struct list_head		event_list;
@@ -38,6 +39,7 @@ struct ring_buffer {
 	/* AUX area */
 	local_t				aux_head;
 	local_t				aux_nest;
+	local_t				aux_wakeup;
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
 	int				aux_overwrite;
@@ -55,7 +57,7 @@ extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
 extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
-			pgoff_t pgoff, int nr_pages, int flags);
+			pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
 extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
 extern void ring_buffer_put(struct ring_buffer *rb);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 4ee7723d87..85858d201c 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -269,8 +269,12 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 	handle->head = aux_head;
 	if (!rb->aux_overwrite) {
 		aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+		handle->wakeup = local_read(&rb->aux_wakeup);
 		handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
 
+		if (rb->aux_watermark && handle->size > rb->aux_watermark)
+			handle->size = rb->aux_watermark;
+
 		if (!handle->size) {
 			event->pending_disable = 1;
 			event->hw.state = PERF_HES_STOPPED;
@@ -313,9 +317,12 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
 	}
 
 	smp_wmb();
-	rb->user_page->aux_head = local_read(&rb->aux_head);
+	aux_head = rb->user_page->aux_head = local_read(&rb->aux_head);
 
-	perf_output_wakeup(handle);
+	if (aux_head - local_read(&rb->aux_wakeup) > rb->aux_watermark) {
+		perf_output_wakeup(handle);
+		local_add(rb->aux_watermark, &rb->aux_wakeup);
+	}
 	handle->event = NULL;
 
 	local_set(&rb->aux_nest, 0);
@@ -376,7 +383,7 @@ static void rb_free_aux_page(struct ring_buffer *rb, int idx)
 }
 
 int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
-		 pgoff_t pgoff, int nr_pages, int flags)
+		 pgoff_t pgoff, int nr_pages, long watermark, int flags)
 {
 	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
 	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
@@ -423,6 +430,10 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 	if (rb->aux_priv)
 		ret = 0;
 	rb->aux_overwrite = overwrite;
+	rb->aux_watermark = watermark;
+
+	if (!rb->aux_watermark && !rb->aux_overwrite)
+		rb->aux_watermark = nr_pages << (PAGE_SHIFT - 1);
 
 out:
 	if (!ret)
-- 
2.1.0.rc1



* [PATCH v3 11/23] perf: Add itrace_config to the event attribute
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (9 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 10/23] perf: Add wakeup watermark control to " Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 12/23] perf: add ITRACE_START record to indicate that tracing has started Alexander Shishkin
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

To configure itrace events, we use a separate config field in the
attribute, which can be used by both normal and sampling counters. The
latter will use config to specify the actual event and itrace_config for
the itrace configuration that will be used for sampling.
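
For instance (a sketch; the config value is a placeholder, since the bits
are pmu-specific), an event on an itrace pmu would carry its trace
configuration like this:

	struct perf_event_attr attr = {
		.type		= itrace_pmu_type,	/* from the pmu's sysfs 'type' file */
		.size		= sizeof(attr),		/* >= PERF_ATTR_SIZE_VER4 */
		.itrace_config	= 0x1,			/* pmu-specific trace options, placeholder */
	};

A sampling counter on a regular pmu, by contrast, keeps its own event
selection in config and puts the trace options in itrace_config.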

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      | 1 +
 include/uapi/linux/perf_event.h | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 94961c73e0..cb8c92e041 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -173,6 +173,7 @@ struct perf_event;
 #define PERF_PMU_CAP_AUX_NO_SG			0x02
 #define PERF_PMU_CAP_AUX_SW_DOUBLEBUF		0x04
 #define PERF_PMU_CAP_EXCLUSIVE			0x08
+#define PERF_PMU_CAP_ITRACE			0x10
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 507b5e1f5b..b14b1f57c1 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -239,6 +239,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER3	96	/* add: sample_regs_user */
 					/* add: sample_stack_user */
 					/* add: aux_watermark */
+#define PERF_ATTR_SIZE_VER4	104	/* add: itrace_config */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -337,6 +338,11 @@ struct perf_event_attr {
 	 * Wakeup watermark for AUX area
 	 */
 	__u32	aux_watermark;
+
+	/*
+	 * Itrace pmus' event config
+	 */
+	__u64	itrace_config;
 };
 
 #define perf_flags(attr)	(*(&(attr)->read_format + 1))
-- 
2.1.0.rc1



* [PATCH v3 12/23] perf: add ITRACE_START record to indicate that tracing has started
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (10 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 11/23] perf: Add itrace_config to the event attribute Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 13/23] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

For events such as instruction tracing, it is useful for the decoder
to know which task is running when the event is first scheduled in,
before the first sched_switch.
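
For reference, this is the shape of the record a decoder sees (a sketch; the
trailing sample_id fields follow the usual sample_id_all rules):

    struct {
        struct perf_event_header  header;  /* type == PERF_RECORD_ITRACE_START */
        u32                       pid;     /* task running when tracing started */
        u32                       tid;
        /* struct sample_id       sample_id;  -- if sample_id_all is set */
    };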

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |  3 +++
 include/uapi/linux/perf_event.h | 11 +++++++++++
 kernel/events/core.c            | 41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index cb8c92e041..46137cb4d6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -126,6 +126,9 @@ struct hw_perf_event {
 			/* for tp_event->class */
 			struct list_head	tp_list;
 		};
+		struct { /* itrace */
+			int			itrace_started;
+		};
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 		struct { /* breakpoint */
 			/*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b14b1f57c1..500e18b8e9 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -758,6 +758,17 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX				= 11,
 
+	/*
+	 * Indicates that instruction trace has started
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u32				pid;
+	 *	u32				tid;
+	 * };
+	 */
+	PERF_RECORD_ITRACE_START		= 12,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2de7d40cb6..d4b5e33b74 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1671,6 +1671,7 @@ static void perf_set_shadow_time(struct perf_event *event,
 #define MAX_INTERRUPTS (~0ULL)
 
 static void perf_log_throttle(struct perf_event *event, int enable);
+static void perf_log_itrace_start(struct perf_event *event);
 
 static int
 event_sched_in(struct perf_event *event,
@@ -1705,6 +1706,8 @@ event_sched_in(struct perf_event *event,
 
 	perf_pmu_disable(event->pmu);
 
+	perf_log_itrace_start(event);
+
 	if (event->pmu->add(event, PERF_EF_START)) {
 		event->state = PERF_EVENT_STATE_INACTIVE;
 		event->oncpu = -1;
@@ -5524,6 +5527,44 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 	perf_output_end(&handle);
 }
 
+static void perf_log_itrace_start(struct perf_event *event)
+{
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	struct perf_aux_event {
+		struct perf_event_header        header;
+		u32				pid;
+		u32				tid;
+	} rec;
+	int ret;
+
+	if (event->parent)
+		event = event->parent;
+
+	if (!(event->pmu->capabilities & PERF_PMU_CAP_ITRACE) ||
+	    event->hw.itrace_started)
+		return;
+
+	event->hw.itrace_started = 1;
+
+	rec.header.type	= PERF_RECORD_ITRACE_START;
+	rec.header.misc	= 0;
+	rec.header.size	= sizeof(rec);
+	rec.pid	= perf_event_pid(event, current);
+	rec.tid	= perf_event_tid(event, current);
+
+	perf_event_header__init_id(&rec.header, &sample, event);
+	ret = perf_output_begin(&handle, event, rec.header.size);
+
+	if (ret)
+		return;
+
+	perf_output_put(&handle, rec);
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+}
+
 /*
  * Generic event overflow handling, sampling.
  */
-- 
2.1.0.rc1



* [PATCH v3 13/23] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (11 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 12/23] perf: add ITRACE_START record to indicate that tracing has started Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 14/23] x86: perf: Intel PT and LBR/BTS are mutually exclusive Alexander Shishkin
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Intel Processor Trace is an architecture extension that allows for program
flow tracing.
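
Code that wants PT can then simply gate on the new feature bit, as the PT
driver later in this series does:

    if (!test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT))
        return -ENODEV;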

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/include/asm/cpufeature.h | 1 +
 arch/x86/kernel/cpu/scattered.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index e265ff95d1..53710d028a 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -186,6 +186,7 @@
 #define X86_FEATURE_DTHERM	(7*32+ 7) /* Digital Thermal Sensor */
 #define X86_FEATURE_HW_PSTATE	(7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK (7*32+ 9) /* AMD ProcFeedbackInterface */
+#define X86_FEATURE_INTEL_PT	(7*32+10) /* Intel Processor Trace */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW  (8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index b6f794aa16..726e6a376b 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -36,6 +36,7 @@ void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
 		{ X86_FEATURE_ARAT,		CR_EAX, 2, 0x00000006, 0 },
 		{ X86_FEATURE_PLN,		CR_EAX, 4, 0x00000006, 0 },
 		{ X86_FEATURE_PTS,		CR_EAX, 6, 0x00000006, 0 },
+		{ X86_FEATURE_INTEL_PT,		CR_EBX,25, 0x00000007, 0 },
 		{ X86_FEATURE_APERFMPERF,	CR_ECX, 0, 0x00000006, 0 },
 		{ X86_FEATURE_EPB,		CR_ECX, 3, 0x00000006, 0 },
 		{ X86_FEATURE_XSAVEOPT,		CR_EAX,	0, 0x0000000d, 1 },
-- 
2.1.0.rc1



* [PATCH v3 14/23] x86: perf: Intel PT and LBR/BTS are mutually exclusive
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (12 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 13/23] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 15/23] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Intel PT cannot be used at the same time as LBR or BTS; using them together
causes a general protection fault. To avoid having to fix up #GPs in the
fast path, use per-cpu flags to indicate that one of these facilities is in
use, so that the other one avoids the MSR accesses altogether.
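
The pattern is symmetric and boils down to checking the other facility's
per-cpu flag before touching the shared MSRs, roughly:

    struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

    if (cpuc->pt_enabled)   /* PT owns the hardware, skip the BTS MSR writes */
        return;
    cpuc->bts_enabled = 1;  /* PT's ->add() path checks this flag in the same way */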

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           | 6 ++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  | 8 +++++++-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 9 +++++----
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 8ade93111e..da07b41496 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -146,6 +146,7 @@ struct cpu_hw_events {
 	 * Intel DebugStore bits
 	 */
 	struct debug_store	*ds;
+	unsigned int		bts_enabled;
 	u64			pebs_enabled;
 
 	/*
@@ -159,6 +160,11 @@ struct cpu_hw_events {
 	u64				br_sel;
 
 	/*
+	 * Intel Processor Trace
+	 */
+	unsigned int			pt_enabled;
+
+	/*
 	 * Intel host/guest exclude bits
 	 */
 	u64				intel_ctrl_guest_mask;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 696ade311d..b8a7f9315f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -456,8 +456,13 @@ struct event_constraint bts_constraint =
 
 void intel_pmu_enable_bts(u64 config)
 {
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	unsigned long debugctlmsr;
 
+	if (cpuc->pt_enabled)
+		return;
+
+	cpuc->bts_enabled = 1;
 	debugctlmsr = get_debugctlmsr();
 
 	debugctlmsr |= DEBUGCTLMSR_TR;
@@ -478,9 +483,10 @@ void intel_pmu_disable_bts(void)
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	unsigned long debugctlmsr;
 
-	if (!cpuc->ds)
+	if (!cpuc->ds || cpuc->pt_enabled)
 		return;
 
+	cpuc->bts_enabled = 0;
 	debugctlmsr = get_debugctlmsr();
 
 	debugctlmsr &=
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 9dd2459a4c..516e52d0ac 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -172,7 +172,9 @@ static void intel_pmu_lbr_reset_64(void)
 
 void intel_pmu_lbr_reset(void)
 {
-	if (!x86_pmu.lbr_nr)
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (!x86_pmu.lbr_nr || cpuc->pt_enabled)
 		return;
 
 	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
@@ -185,7 +187,7 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 
-	if (!x86_pmu.lbr_nr)
+	if (!x86_pmu.lbr_nr || cpuc->pt_enabled)
 		return;
 
 	/*
@@ -205,11 +207,10 @@ void intel_pmu_lbr_disable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 
-	if (!x86_pmu.lbr_nr)
+	if (!x86_pmu.lbr_nr || !cpuc->lbr_users || cpuc->pt_enabled)
 		return;
 
 	cpuc->lbr_users--;
-	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
 	if (cpuc->enabled && !cpuc->lbr_users) {
 		__intel_pmu_lbr_disable();
-- 
2.1.0.rc1



* [PATCH v3 15/23] x86: perf: intel_pt: Intel PT PMU driver
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (13 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 14/23] x86: perf: Intel PT and LBR/BTS are mutually exclusive Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 16/23] x86: perf: intel_bts: Add BTS " Alexander Shishkin
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Add support for Intel Processor Trace (PT) to the kernel's perf events.
PT is an extension of Intel Architecture that collects information about
software execution, such as control flow, execution modes and timings, and
formats it into highly compressed binary packets. Even though they are
compressed, these packets are generated at hundreds of megabytes per second
per core, which makes it impractical to decode them on the fly in the kernel.

This driver exports trace data through the AUX area of the perf ring
buffer, which is zero-copy mapped into userspace for faster data retrieval.
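
As a rough userspace sketch (not part of this patch, and assuming the
aux_{offset,size} user page fields and mmap() conventions introduced by the
AUX patches earlier in this series), the trace is obtained via a second
mmap() on the same event fd:

    struct perf_event_mmap_page *pc = base;  /* from the first (data) mmap() */

    pc->aux_offset = aux_offset;    /* page aligned, past the data area */
    pc->aux_size   = aux_size;      /* multiple of the page size */
    /* a writable mapping is assumed to select the normal (non-overwrite) mode */
    aux = mmap(NULL, aux_size, PROT_READ | PROT_WRITE, MAP_SHARED,
               fd, aux_offset);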

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/include/uapi/asm/msr-index.h     |  18 +
 arch/x86/kernel/cpu/Makefile              |   1 +
 arch/x86/kernel/cpu/intel_pt.h            | 129 ++++
 arch/x86/kernel/cpu/perf_event.h          |   2 +
 arch/x86/kernel/cpu/perf_event_intel.c    |   8 +
 arch/x86/kernel/cpu/perf_event_intel_pt.c | 952 ++++++++++++++++++++++++++++++
 6 files changed, 1110 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/intel_pt.h
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index fcf2b3ae1b..b34434e652 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -74,6 +74,24 @@
 #define MSR_IA32_PERF_CAPABILITIES	0x00000345
 #define MSR_PEBS_LD_LAT_THRESHOLD	0x000003f6
 
+#define MSR_IA32_RTIT_CTL		0x00000570
+#define RTIT_CTL_TRACEEN		BIT(0)
+#define RTIT_CTL_OS			BIT(2)
+#define RTIT_CTL_USR			BIT(3)
+#define RTIT_CTL_CR3EN			BIT(7)
+#define RTIT_CTL_TOPA			BIT(8)
+#define RTIT_CTL_TSC_EN			BIT(10)
+#define RTIT_CTL_DISRETC		BIT(11)
+#define RTIT_CTL_BRANCH_EN		BIT(13)
+#define MSR_IA32_RTIT_STATUS		0x00000571
+#define RTIT_STATUS_CONTEXTEN		BIT(1)
+#define RTIT_STATUS_TRIGGEREN		BIT(2)
+#define RTIT_STATUS_ERROR		BIT(4)
+#define RTIT_STATUS_STOPPED		BIT(5)
+#define MSR_IA32_RTIT_CR3_MATCH		0x00000572
+#define MSR_IA32_RTIT_OUTPUT_BASE	0x00000560
+#define MSR_IA32_RTIT_OUTPUT_MASK	0x00000561
+
 #define MSR_MTRRfix64K_00000		0x00000250
 #define MSR_MTRRfix16K_80000		0x00000258
 #define MSR_MTRRfix16K_A0000		0x00000259
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 7fd54f09b0..f1f38e1caf 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -37,6 +37,7 @@ endif
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_rapl.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
new file mode 100644
index 0000000000..58af62daf7
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -0,0 +1,129 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#ifndef __INTEL_PT_H__
+#define __INTEL_PT_H__
+
+/*
+ * Single-entry ToPA: when this close to region boundary, switch
+ * buffers to avoid losing data.
+ */
+#define TOPA_PMI_MARGIN 512
+
+/*
+ * Table of Physical Addresses bits
+ */
+enum topa_sz {
+	TOPA_4K	= 0,
+	TOPA_8K,
+	TOPA_16K,
+	TOPA_32K,
+	TOPA_64K,
+	TOPA_128K,
+	TOPA_256K,
+	TOPA_512K,
+	TOPA_1MB,
+	TOPA_2MB,
+	TOPA_4MB,
+	TOPA_8MB,
+	TOPA_16MB,
+	TOPA_32MB,
+	TOPA_64MB,
+	TOPA_128MB,
+	TOPA_SZ_END,
+};
+
+static inline unsigned int sizes(enum topa_sz tsz)
+{
+	return 1 << (tsz + 12);
+};
+
+struct topa_entry {
+	u64	end	: 1;
+	u64	rsvd0	: 1;
+	u64	intr	: 1;
+	u64	rsvd1	: 1;
+	u64	stop	: 1;
+	u64	rsvd2	: 1;
+	u64	size	: 4;
+	u64	rsvd3	: 2;
+	u64	base	: 36;
+	u64	rsvd4	: 16;
+};
+
+#define TOPA_SHIFT 12
+#define PT_CPUID_LEAVES 2
+
+enum pt_capabilities {
+	PT_CAP_max_subleaf = 0,
+	PT_CAP_cr3_filtering,
+	PT_CAP_topa_output,
+	PT_CAP_topa_multiple_entries,
+	PT_CAP_payloads_lip,
+};
+
+struct pt_pmu {
+	struct pmu		pmu;
+	u32			caps[4 * PT_CPUID_LEAVES];
+};
+
+/**
+ * struct pt_buffer - buffer configuration; one buffer per task_struct or
+ * cpu, depending on perf event configuration
+ * @tables: list of ToPA tables in this buffer
+ * @first, @last: shorthands for first and last topa tables
+ * @cur: current topa table
+ * @nr_pages: buffer size in pages
+ * @cur_idx: current output region's index within @cur table
+ * @output_off: offset within the current output region
+ * @data_size: running total of the amount of data in this buffer
+ * @lost: if data was lost/truncated
+ * @head: logical write offset inside the buffer
+ * @snapshot: if this is for a snapshot/overwrite counter
+ * @stop_pos, @intr_pos: STOP and INT topa entries in the buffer
+ * @data_pages: array of pages from perf
+ * @topa_index: table of topa entries indexed by page offset
+ */
+struct pt_buffer {
+	/* hint for allocation */
+	int			cpu;
+	/* list of ToPA tables */
+	struct list_head	tables;
+	/* top-level table */
+	struct topa		*first, *last, *cur;
+	unsigned int		cur_idx;
+	size_t			output_off;
+	unsigned long		nr_pages;
+	local_t			data_size;
+	local_t			lost;
+	local64_t		head;
+	bool			snapshot;
+	unsigned long		stop_pos, intr_pos;
+	void			**data_pages;
+	struct topa_entry	*topa_index[0];
+};
+
+/**
+ * struct pt - per-cpu pt
+ */
+struct pt {
+	raw_spinlock_t		lock;
+	struct perf_output_handle handle;
+};
+
+#endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index da07b41496..8b6e4e79fe 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -713,6 +713,8 @@ void intel_pmu_lbr_init_snb(void);
 
 int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
+void intel_pt_interrupt(void);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 2502d0d9d2..16d63b91b3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1399,6 +1399,14 @@ again:
 	}
 
 	/*
+	 * Intel PT
+	 */
+	if (__test_and_clear_bit(55, (unsigned long *)&status)) {
+		handled++;
+		intel_pt_interrupt();
+	}
+
+	/*
 	 * Checkpointed counters can lead to 'spurious' PMIs because the
 	 * rollback caused by the PMI will have cleared the overflow status
 	 * bit. Therefore always force probe these counters.
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
new file mode 100644
index 0000000000..a8ad034e4f
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -0,0 +1,952 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#undef DEBUG
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/device.h>
+
+#include <asm/perf_event.h>
+#include <asm/insn.h>
+
+#include "perf_event.h"
+#include "intel_pt.h"
+
+static DEFINE_PER_CPU(struct pt, pt_ctx) = {
+	.lock	= __RAW_SPIN_LOCK_UNLOCKED(pt_ctx.lock),
+};
+
+static struct pt_pmu pt_pmu;
+
+enum cpuid_regs {
+	CR_EAX = 0,
+	CR_ECX,
+	CR_EDX,
+	CR_EBX
+};
+
+/*
+ * Capabilities of Intel PT hardware, such as number of address bits or
+ * supported output schemes, are cached and exported to userspace as "caps"
+ * attribute group of pt pmu device
+ * (/sys/bus/event_source/devices/intel_pt/caps/) so that userspace can store
+ * relevant bits together with intel_pt traces.
+ */
+#define PT_CAP(_n, _l, _r, _m)						\
+	[PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l,	\
+			    .reg = _r, .mask = _m }
+
+static struct pt_cap_desc {
+	const char	*name;
+	u32		leaf;
+	u8		reg;
+	u32		mask;
+} pt_caps[] = {
+	PT_CAP(max_subleaf,		0, CR_EAX, 0xffffffff),
+	PT_CAP(cr3_filtering,		0, CR_EBX, BIT(0)),
+	PT_CAP(topa_output,		0, CR_ECX, BIT(0)),
+	PT_CAP(topa_multiple_entries,	0, CR_ECX, BIT(1)),
+	PT_CAP(payloads_lip,		0, CR_ECX, BIT(31)),
+};
+
+static u32 pt_cap_get(enum pt_capabilities cap)
+{
+	struct pt_cap_desc *cd = &pt_caps[cap];
+	u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
+	unsigned int shift = __ffs(cd->mask);
+
+	return (c & cd->mask) >> shift;
+}
+
+static ssize_t pt_cap_show(struct device *cdev,
+			   struct device_attribute *attr,
+			   char *buf)
+{
+	struct dev_ext_attribute *ea =
+		container_of(attr, struct dev_ext_attribute, attr);
+	enum pt_capabilities cap = (long)ea->var;
+
+	return snprintf(buf, PAGE_SIZE, "%x\n", pt_cap_get(cap));
+}
+
+static struct attribute_group pt_cap_group = {
+	.name	= "caps",
+};
+
+PMU_FORMAT_ATTR(tsc,		"itrace_config:10"	);
+PMU_FORMAT_ATTR(noretcomp,	"itrace_config:11"	);
+
+static struct attribute *pt_formats_attr[] = {
+	&format_attr_tsc.attr,
+	&format_attr_noretcomp.attr,
+	NULL,
+};
+
+static struct attribute_group pt_format_group = {
+	.name	= "format",
+	.attrs	= pt_formats_attr,
+};
+
+static const struct attribute_group *pt_attr_groups[] = {
+	&pt_cap_group,
+	&pt_format_group,
+	NULL,
+};
+
+static int __init pt_pmu_hw_init(void)
+{
+	struct dev_ext_attribute *de_attrs;
+	struct attribute **attrs;
+	size_t size;
+	long i;
+
+	if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
+		for (i = 0; i < PT_CPUID_LEAVES; i++)
+			cpuid_count(20, i,
+				    &pt_pmu.caps[CR_EAX + i * 4],
+				    &pt_pmu.caps[CR_EBX + i * 4],
+				    &pt_pmu.caps[CR_ECX + i * 4],
+				    &pt_pmu.caps[CR_EDX + i * 4]);
+	} else
+		return -ENODEV;
+
+	size = sizeof(struct attribute *) * (ARRAY_SIZE(pt_caps) + 1);
+	attrs = kzalloc(size, GFP_KERNEL);
+	if (!attrs)
+		goto err_attrs;
+
+	size = sizeof(struct dev_ext_attribute) * (ARRAY_SIZE(pt_caps) + 1);
+	de_attrs = kzalloc(size, GFP_KERNEL);
+	if (!de_attrs)
+		goto err_de_attrs;
+
+	for (i = 0; i < ARRAY_SIZE(pt_caps); i++) {
+		de_attrs[i].attr.attr.name = pt_caps[i].name;
+
+		sysfs_attr_init(&de_attrs[i].attr.attr);
+		de_attrs[i].attr.attr.mode = S_IRUGO;
+		de_attrs[i].attr.show = pt_cap_show;
+		de_attrs[i].var = (void *)i;
+		attrs[i] = &de_attrs[i].attr.attr;
+	}
+
+	pt_cap_group.attrs = attrs;
+	return 0;
+
+err_de_attrs:
+	kfree(de_attrs);
+err_attrs:
+	kfree(attrs);
+
+	return -ENOMEM;
+}
+
+#define PT_CONFIG_MASK (RTIT_CTL_TSC_EN | RTIT_CTL_DISRETC)
+/* bits 57:14 are reserved for packet enables */
+#define PT_BYPASS_MASK 0x03ffffffffffc000ull
+
+static bool pt_event_valid(struct perf_event *event)
+{
+	u64 itrace_config = event->attr.itrace_config;
+
+	/* admin can set any packet generation parameters */
+	if (capable(CAP_SYS_ADMIN) &&
+	    (itrace_config & PT_BYPASS_MASK) == itrace_config)
+		return true;
+
+	if ((itrace_config & PT_CONFIG_MASK) != itrace_config)
+		return false;
+
+	return true;
+}
+
+/*
+ * PT configuration helpers
+ * These all are cpu affine and operate on a local PT
+ */
+
+static bool pt_is_running(void)
+{
+	u64 ctl;
+
+	rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+	return !!(ctl & RTIT_CTL_TRACEEN);
+}
+
+static int pt_config(struct perf_event *event)
+{
+	u64 reg;
+
+	reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN;
+
+	if (!event->attr.exclude_kernel)
+		reg |= RTIT_CTL_OS;
+	if (!event->attr.exclude_user)
+		reg |= RTIT_CTL_USR;
+
+	reg |= (event->attr.itrace_config & PT_CONFIG_MASK);
+
+	wrmsrl(MSR_IA32_RTIT_CTL, reg);
+
+	return 0;
+}
+
+static void pt_config_start(bool start)
+{
+	u64 ctl;
+
+	rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+	if (start)
+		ctl |= RTIT_CTL_TRACEEN;
+	else
+		ctl &= ~RTIT_CTL_TRACEEN;
+	wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+}
+
+static void pt_config_buffer(void *buf, unsigned int topa_idx,
+			     unsigned int output_off)
+{
+	u64 reg;
+
+	wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, virt_to_phys(buf));
+
+	reg = 0x7f | ((u64)topa_idx << 7) | ((u64)output_off << 32);
+
+	wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, reg);
+}
+
+/*
+ * Keep ToPA table-related metadata on the same page as the actual table,
+ * taking up a few words from the top
+ */
+
+#define TENTS_PER_PAGE (((PAGE_SIZE - 40) / sizeof(struct topa_entry)) - 1)
+
+struct topa {
+	struct topa_entry	table[TENTS_PER_PAGE];
+	struct list_head	list;
+	u64			phys;
+	u64			offset;
+	size_t			size;
+	int			last;
+};
+
+/* make negative table index stand for the last table entry */
+#define TOPA_ENTRY(t, i) ((i) == -1 ? &(t)->table[(t)->last] : &(t)->table[(i)])
+
+/*
+ * allocate page-sized ToPA table
+ */
+static struct topa *topa_alloc(int cpu, gfp_t gfp)
+{
+	int node = cpu_to_node(cpu);
+	struct topa *topa;
+	struct page *p;
+
+	p = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
+	if (!p)
+		return NULL;
+
+	topa = page_address(p);
+	topa->last = 0;
+	topa->phys = page_to_phys(p);
+
+	/*
+	 * In case of single-entry ToPA, always put the self-referencing END
+	 * link as the 2nd entry in the table
+	 */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(topa, 1)->base = topa->phys >> TOPA_SHIFT;
+		TOPA_ENTRY(topa, 1)->end = 1;
+	}
+
+	return topa;
+}
+
+static void topa_free(struct topa *topa)
+{
+	free_page((unsigned long)topa);
+}
+
+/**
+ * topa_insert_table - insert a ToPA table into a buffer
+ * @buf - pt buffer that's being extended
+ * @topa - new topa table to be inserted
+ *
+ * If it's the first table in this buffer, set up buffer's pointers
+ * accordingly; otherwise, add a END=1 link entry to @topa to the current
+ * "last" table and adjust the last table pointer to @topa.
+ */
+static void topa_insert_table(struct pt_buffer *buf, struct topa *topa)
+{
+	struct topa *last = buf->last;
+
+	list_add_tail(&topa->list, &buf->tables);
+
+	if (!buf->first) {
+		buf->first = buf->last = buf->cur = topa;
+		return;
+	}
+
+	topa->offset = last->offset + last->size;
+	buf->last = topa;
+
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		return;
+
+	BUG_ON(last->last != TENTS_PER_PAGE - 1);
+
+	TOPA_ENTRY(last, -1)->base = topa->phys >> TOPA_SHIFT;
+	TOPA_ENTRY(last, -1)->end = 1;
+}
+
+static bool topa_table_full(struct topa *topa)
+{
+	/* single-entry ToPA is a special case */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		return !!topa->last;
+
+	return topa->last == TENTS_PER_PAGE - 1;
+}
+
+static int topa_insert_pages(struct pt_buffer *buf, gfp_t gfp)
+{
+	struct topa *topa = buf->last;
+	int order = 0;
+	struct page *p;
+
+	p = virt_to_page(buf->data_pages[buf->nr_pages]);
+	if (PagePrivate(p))
+		order = page_private(p);
+
+	if (topa_table_full(topa)) {
+		topa = topa_alloc(buf->cpu, gfp);
+		if (!topa)
+			return -ENOMEM;
+
+		topa_insert_table(buf, topa);
+	}
+
+	TOPA_ENTRY(topa, -1)->base = page_to_phys(p) >> TOPA_SHIFT;
+	TOPA_ENTRY(topa, -1)->size = order;
+	if (!buf->snapshot && !pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(topa, -1)->intr = 1;
+		TOPA_ENTRY(topa, -1)->stop = 1;
+	}
+
+	topa->last++;
+	topa->size += sizes(order);
+
+	buf->nr_pages += 1ul << order;
+
+	return 0;
+}
+
+static void pt_topa_dump(struct pt_buffer *buf)
+{
+	struct topa *topa;
+
+	list_for_each_entry(topa, &buf->tables, list) {
+		int i;
+
+		pr_debug("# table @%p (%p), off %llx size %zx\n", topa->table,
+			 (void *)topa->phys, topa->offset, topa->size);
+		for (i = 0; i < TENTS_PER_PAGE; i++) {
+			pr_debug("# entry @%p (%lx sz %u %c%c%c) raw=%16llx\n",
+				 &topa->table[i],
+				 (unsigned long)topa->table[i].base << TOPA_SHIFT,
+				 sizes(topa->table[i].size),
+				 topa->table[i].end ?  'E' : ' ',
+				 topa->table[i].intr ? 'I' : ' ',
+				 topa->table[i].stop ? 'S' : ' ',
+				 *(u64 *)&topa->table[i]);
+			if ((pt_cap_get(PT_CAP_topa_multiple_entries)
+			     && topa->table[i].stop)
+			    || topa->table[i].end)
+				break;
+		}
+	}
+}
+
+/* advance to the next output region */
+static void pt_buffer_advance(struct pt_buffer *buf)
+{
+	buf->output_off = 0;
+	buf->cur_idx++;
+
+	if (buf->cur_idx == buf->cur->last) {
+		if (buf->cur == buf->last)
+			buf->cur = buf->first;
+		else
+			buf->cur = list_entry(buf->cur->list.next, struct topa,
+					      list);
+		buf->cur_idx = 0;
+	}
+}
+
+static void pt_update_head(struct pt *pt)
+{
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	u64 topa_idx, base, old;
+
+	/* offset of the first region in this table from the beginning of buf */
+	base = buf->cur->offset + buf->output_off;
+
+	/* offset of the current output region within this table */
+	for (topa_idx = 0; topa_idx < buf->cur_idx; topa_idx++)
+		base += sizes(buf->cur->table[topa_idx].size);
+
+	if (buf->snapshot) {
+		local_set(&buf->data_size, base);
+	} else {
+		old = (local64_xchg(&buf->head, base) &
+		       ((buf->nr_pages << PAGE_SHIFT) - 1));
+		if (base < old)
+			base += buf->nr_pages << PAGE_SHIFT;
+
+		local_add(base - old, &buf->data_size);
+	}
+}
+
+static void *pt_buffer_region(struct pt_buffer *buf)
+{
+	return phys_to_virt(buf->cur->table[buf->cur_idx].base << TOPA_SHIFT);
+}
+
+static size_t pt_buffer_region_size(struct pt_buffer *buf)
+{
+	return sizes(buf->cur->table[buf->cur_idx].size);
+}
+
+/**
+ * pt_handle_status - take care of possible status conditions
+ * @pt: per-cpu pt handle
+ */
+static void pt_handle_status(struct pt *pt)
+{
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	int advance = 0;
+	u64 status;
+
+	rdmsrl(MSR_IA32_RTIT_STATUS, status);
+
+	if (status & RTIT_STATUS_ERROR) {
+		pr_err_ratelimited("ToPA ERROR encountered, trying to recover\n");
+		pt_topa_dump(buf);
+		status &= ~RTIT_STATUS_ERROR;
+		wrmsrl(MSR_IA32_RTIT_STATUS, status);
+	}
+
+	if (status & RTIT_STATUS_STOPPED) {
+		status &= ~RTIT_STATUS_STOPPED;
+		wrmsrl(MSR_IA32_RTIT_STATUS, status);
+
+		/*
+		 * On systems that only do single-entry ToPA, hitting STOP
+		 * means we are already losing data; need to let the decoder
+		 * know.
+		 */
+		if (!pt_cap_get(PT_CAP_topa_multiple_entries) ||
+		    buf->output_off == sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size)) {
+			local_inc(&buf->lost);
+			advance++;
+		}
+	}
+
+	/*
+	 * Also on single-entry ToPA implementations, interrupt will come
+	 * before the output reaches its output region's boundary.
+	 */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries) && !buf->snapshot &&
+	    pt_buffer_region_size(buf) - buf->output_off <= TOPA_PMI_MARGIN) {
+		void *head = pt_buffer_region(buf);
+
+		/* everything within this margin needs to be zeroed out */
+		memset(head + buf->output_off, 0,
+		       pt_buffer_region_size(buf) -
+		       buf->output_off);
+		advance++;
+	}
+
+	if (advance)
+		pt_buffer_advance(buf);
+}
+
+static void pt_read_offset(struct pt_buffer *buf)
+{
+	u64 offset, base_topa;
+
+	rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, base_topa);
+	buf->cur = phys_to_virt(base_topa);
+
+	rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, offset);
+	/* offset within current output region */
+	buf->output_off = offset >> 32;
+	/* index of current output region within this table */
+	buf->cur_idx = (offset & 0xffffff80) >> 7;
+}
+
+/**
+ * pt_buffer_fini_topa() - deallocate ToPA structure of a buffer
+ * @buf: pt buffer
+ */
+static void pt_buffer_fini_topa(struct pt_buffer *buf)
+{
+	struct topa *topa, *iter;
+
+	list_for_each_entry_safe(topa, iter, &buf->tables, list) {
+		list_del(&topa->list);
+		topa_free(topa);
+	}
+}
+
+static unsigned int pt_topa_next_entry(struct pt_buffer *buf, unsigned int pg)
+{
+	struct topa_entry *te = buf->topa_index[pg];
+
+	if (buf->first == buf->last && buf->first->last == 1)
+		return pg;
+
+	do {
+		pg++;
+		pg &= buf->nr_pages - 1;
+	} while (buf->topa_index[pg] == te);
+
+	return pg;
+}
+
+static void pt_buffer_reset_markers(struct pt_buffer *buf, unsigned long off)
+
+{
+	unsigned long idx, npages, end;
+
+	if (buf->snapshot)
+		return;
+
+	/* single entry ToPA is handled by marking all regions STOP=1 INT=1 */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		return;
+
+	/* clear STOP and INT from current entry */
+	buf->topa_index[buf->stop_pos]->stop = 0;
+	buf->topa_index[buf->intr_pos]->intr = 0;
+
+	if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		npages = (off + 1) >> PAGE_SHIFT;
+		end = (local64_read(&buf->head) >> PAGE_SHIFT) + npages;
+		idx = end & (buf->nr_pages - 1);
+		buf->stop_pos = idx;
+		idx = (local64_read(&buf->head) >> PAGE_SHIFT) + npages / 2;
+		idx &= buf->nr_pages - 1;
+		buf->intr_pos = idx;
+	}
+
+	buf->topa_index[buf->stop_pos]->stop = 1;
+	buf->topa_index[buf->intr_pos]->intr = 1;
+}
+
+static void pt_buffer_setup_topa_index(struct pt_buffer *buf)
+{
+	struct topa *cur = buf->first, *prev = buf->last;
+	struct topa_entry *te_cur = TOPA_ENTRY(cur, 0),
+		*te_prev = TOPA_ENTRY(prev, prev->last - 1);
+	int pg = 0, idx = 0, ntopa = 0;
+
+	while (pg < buf->nr_pages) {
+		int tidx;
+
+		/* pages within one topa entry */
+		for (tidx = 0; tidx < 1 << te_cur->size; tidx++, pg++)
+			buf->topa_index[pg] = te_prev;
+
+		te_prev = te_cur;
+
+		if (idx == cur->last - 1) {
+			/* advance to next topa table */
+			idx = 0;
+			cur = list_entry(cur->list.next, struct topa, list);
+			ntopa++;
+		} else
+			idx++;
+		te_cur = TOPA_ENTRY(cur, idx);
+	}
+
+}
+
+static void pt_buffer_reset_offsets(struct pt_buffer *buf, unsigned long head)
+{
+	int pg;
+
+	if (buf->snapshot)
+		head &= (buf->nr_pages << PAGE_SHIFT) - 1;
+
+	pg = (head >> PAGE_SHIFT) & (buf->nr_pages - 1);
+	pg = pt_topa_next_entry(buf, pg);
+
+	buf->cur = (struct topa *)((unsigned long)buf->topa_index[pg] & PAGE_MASK);
+	buf->cur_idx = ((unsigned long)buf->topa_index[pg] -
+			(unsigned long)buf->cur) / sizeof(struct topa_entry);
+	buf->output_off = head & (sizes(buf->cur->table[buf->cur_idx].size) - 1);
+
+	local64_set(&buf->head, head);
+	local_set(&buf->data_size, 0);
+}
+
+/**
+ * pt_buffer_init_topa() - initialize ToPA table for pt buffer
+ * @buf: pt buffer
+ * @size: total size of all regions within this ToPA
+ * @gfp: allocation flags
+ */
+static int pt_buffer_init_topa(struct pt_buffer *buf, unsigned long nr_pages,
+			       gfp_t gfp)
+{
+	struct topa *topa;
+	int err;
+
+	topa = topa_alloc(buf->cpu, gfp);
+	if (!topa)
+		return -ENOMEM;
+
+	topa_insert_table(buf, topa);
+
+	while (buf->nr_pages < nr_pages) {
+		err = topa_insert_pages(buf, gfp);
+		if (err) {
+			pt_buffer_fini_topa(buf);
+			return -ENOMEM;
+		}
+	}
+
+	pt_buffer_setup_topa_index(buf);
+
+	/* link last table to the first one, unless we're double buffering */
+	if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(buf->last, -1)->base = buf->first->phys >> TOPA_SHIFT;
+		TOPA_ENTRY(buf->last, -1)->end = 1;
+	}
+
+	pt_topa_dump(buf);
+	return 0;
+}
+
+/**
+ * pt_buffer_setup_aux() - set up topa tables for a PT buffer
+ * @cpu: cpu on which to allocate, -1 means current
+ * @pages: array of pointers to buffer pages passed from perf core
+ * @nr_pages: number of pages in the buffer
+ * @snapshot: if this is a snapshot/overwrite counter
+ */
+static void *
+pt_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool snapshot)
+{
+	struct pt_buffer *buf;
+	int node, ret;
+
+	if (!nr_pages)
+		return NULL;
+
+	if (cpu == -1)
+		cpu = raw_smp_processor_id();
+	node = cpu_to_node(cpu);
+
+	buf = kzalloc_node(offsetof(struct pt_buffer, topa_index[nr_pages]),
+			   GFP_KERNEL, node);
+	if (!buf)
+		return NULL;
+
+	buf->cpu = cpu;
+	buf->snapshot = snapshot;
+	buf->data_pages = pages;
+
+	INIT_LIST_HEAD(&buf->tables);
+
+	ret = pt_buffer_init_topa(buf, nr_pages, GFP_KERNEL);
+	if (ret) {
+		kfree(buf);
+		return NULL;
+	}
+
+	return buf;
+}
+
+/**
+ * pt_buffer_free() - dispose of pt buffer
+ * @data: pt buffer
+ */
+static void pt_buffer_free_aux(void *data)
+{
+	struct pt_buffer *buf = data;
+
+	pt_buffer_fini_topa(buf);
+
+	kfree(buf);
+}
+
+/**
+ * pt_buffer_is_full - check if the buffer is full
+ * @buf: pt buffer
+ * @pt: per-cpu pt handle
+ * If the user hasn't read data from the output region that aux_head
+ * points to, the buffer is considered full: the user needs to read at
+ * least this region and update aux_tail to point past it.
+ */
+static bool pt_buffer_is_full(struct pt_buffer *buf, struct pt *pt)
+{
+	if (buf->snapshot)
+		return false;
+
+	if (local_read(&buf->data_size) >= pt->handle.size)
+		return true;
+
+	return false;
+}
+
+void intel_pt_interrupt(void)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf;
+	struct perf_event *event = pt->handle.event;
+
+	pt_config_start(false);
+
+	if (!event)
+		return;
+
+	buf = perf_get_aux(&pt->handle);
+	if (!buf)
+		return;
+
+	pt_read_offset(buf);
+
+	pt_handle_status(pt);
+
+	pt_update_head(pt);
+
+	perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
+			    local_xchg(&buf->lost, 0));
+
+	if (!event->hw.state) {
+		int ret = pt_config(event);
+
+		if (ret) {
+			event->hw.state = PERF_HES_STOPPED;
+			return;
+		}
+
+		buf = perf_aux_output_begin(&pt->handle, event);
+		if (!buf) {
+			event->hw.state = PERF_HES_STOPPED;
+			return;
+		}
+
+		pt_buffer_reset_offsets(buf, pt->handle.head);
+		pt_buffer_reset_markers(buf, pt->handle.size);
+		pt_config_buffer(buf->cur->table, buf->cur_idx,
+				 buf->output_off);
+		wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+		pt_config_start(true);
+	}
+}
+
+static void pt_event_start(struct perf_event *event, int mode)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+
+	if (pt_is_running() || !buf || pt_buffer_is_full(buf, pt) ||
+	    pt_config(event)) {
+		event->hw.state = PERF_HES_STOPPED;
+		return;
+	}
+
+	event->hw.state = 0;
+
+	pt_config_buffer(buf->cur->table, buf->cur_idx,
+			 buf->output_off);
+	wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+	pt_config_start(true);
+}
+
+static void pt_event_stop(struct perf_event *event, int mode)
+{
+	if (event->hw.state == PERF_HES_STOPPED)
+		return;
+
+	event->hw.state = PERF_HES_STOPPED;
+
+	pt_config_start(false);
+
+	if (mode & PERF_EF_UPDATE) {
+		struct pt *pt = this_cpu_ptr(&pt_ctx);
+		struct pt_buffer *buf = perf_get_aux(&pt->handle);
+
+		if (!buf || !pt->handle.event)
+			return;
+
+		if (WARN_ON_ONCE(pt->handle.event != event))
+			return;
+		pt_read_offset(buf);
+
+		pt_handle_status(pt);
+
+		pt_update_head(pt);
+	}
+}
+
+static void pt_event_del(struct perf_event *event, int mode)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	unsigned long flags;
+
+	pt_event_stop(event, PERF_EF_UPDATE);
+
+	cpuc->pt_enabled = 0;
+
+	raw_spin_lock_irqsave(&pt->lock, flags);
+	if (pt->handle.event)
+		perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
+				    local_xchg(&buf->lost, 0));
+	raw_spin_unlock_irqrestore(&pt->lock, flags);
+}
+
+static int pt_event_add(struct perf_event *event, int mode)
+{
+	struct pt_buffer *buf;
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct hw_perf_event *hwc = &event->hw;
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	unsigned long flags;
+	int ret = -EBUSY;
+
+	if (cpuc->lbr_users || cpuc->bts_enabled)
+		goto out;
+
+	ret = pt_config(event);
+	if (ret)
+		goto out;
+
+	raw_spin_lock_irqsave(&pt->lock, flags);
+	if (pt->handle.event) {
+		raw_spin_unlock_irqrestore(&pt->lock, flags);
+		ret = -EBUSY;
+		goto out;
+	}
+
+	buf = perf_aux_output_begin(&pt->handle, event);
+	if (!buf) {
+		raw_spin_unlock_irqrestore(&pt->lock, flags);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	pt_buffer_reset_offsets(buf, pt->handle.head);
+	if (!buf->snapshot)
+		pt_buffer_reset_markers(buf, pt->handle.size);
+
+	raw_spin_unlock_irqrestore(&pt->lock, flags);
+
+	if (mode & PERF_EF_START) {
+		pt_event_start(event, 0);
+		if (hwc->state == PERF_HES_STOPPED) {
+			pt_event_del(event, 0);
+			ret = -EBUSY;
+		}
+	} else {
+		hwc->state = PERF_HES_STOPPED;
+	}
+
+out:
+
+	if (ret)
+		hwc->state = PERF_HES_STOPPED;
+	else
+		cpuc->pt_enabled = 1;
+
+	return ret;
+}
+
+static void pt_event_read(struct perf_event *event)
+{
+}
+
+static int pt_event_init(struct perf_event *event)
+{
+	if (event->attr.type != pt_pmu.pmu.type)
+		return -ENOENT;
+
+	if (!pt_event_valid(event))
+		return -EINVAL;
+
+	return 0;
+}
+
+static __init int pt_init(void)
+{
+	int ret, cpu, prior_warn = 0;
+
+	BUILD_BUG_ON(sizeof(struct topa) > PAGE_SIZE);
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		u64 ctl;
+
+		ret = rdmsrl_safe_on_cpu(cpu, MSR_IA32_RTIT_CTL, &ctl);
+		if (!ret && (ctl & RTIT_CTL_TRACEEN))
+			prior_warn++;
+	}
+	put_online_cpus();
+
+	ret = pt_pmu_hw_init();
+	if (ret)
+		return ret;
+
+	if (!pt_cap_get(PT_CAP_topa_output)) {
+		pr_warn("ToPA output is not supported on this CPU\n");
+		return -ENODEV;
+	}
+
+	if (prior_warn)
+		pr_warn("PT is enabled at boot time, traces may be empty\n");
+
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		pt_pmu.pmu.capabilities =
+			PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF;
+
+	pt_pmu.pmu.capabilities	|= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE;
+	pt_pmu.pmu.attr_groups	= pt_attr_groups;
+	pt_pmu.pmu.task_ctx_nr	= perf_hw_context;
+	pt_pmu.pmu.event_init	= pt_event_init;
+	pt_pmu.pmu.add		= pt_event_add;
+	pt_pmu.pmu.del		= pt_event_del;
+	pt_pmu.pmu.start	= pt_event_start;
+	pt_pmu.pmu.stop		= pt_event_stop;
+	pt_pmu.pmu.read		= pt_event_read;
+	pt_pmu.pmu.setup_aux	= pt_buffer_setup_aux;
+	pt_pmu.pmu.free_aux	= pt_buffer_free_aux;
+	ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
+
+	return ret;
+}
+
+module_init(pt_init);
-- 
2.1.0.rc1



* [PATCH v3 16/23] x86: perf: intel_bts: Add BTS PMU driver
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (14 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 15/23] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 17/23] perf: Add rb_{alloc,free}_kernel api Alexander Shishkin
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Add support for Branch Trace Store (BTS) via the kernel perf event
infrastructure. The difference from the existing implementation of BTS
support is that this one is a separate PMU that exports events' trace
buffers to userspace by means of the AUX area of the perf buffer, which is
zero-copy mapped into userspace.

The immediate benefits are that the buffer can be much bigger, resulting in
fewer interrupts, that no kernel-side copying is involved, and that little
to no trace data is lost. Also, kernel code can be traced with this driver.

The old way of collecting BTS traces still works.
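
A hedged usage sketch (not part of this patch): intel_bts registers as a
dynamic PMU, so a tool would look up its type in sysfs and open an ordinary
event on it; "intel_bts_type" below is a hypothetical variable holding that
value:

    /* type read from /sys/bus/event_source/devices/intel_bts/type */
    struct perf_event_attr attr = {
        .size           = sizeof(attr),
        .type           = intel_bts_type,
        .exclude_kernel = 0,    /* kernel code can now be traced too */
    };

    fd = syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0);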

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/Makefile               |   2 +-
 arch/x86/kernel/cpu/perf_event.h           |   6 +
 arch/x86/kernel/cpu/perf_event_intel.c     |   6 +-
 arch/x86/kernel/cpu/perf_event_intel_bts.c | 496 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   3 +-
 5 files changed, 510 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index f1f38e1caf..47492ca121 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -37,7 +37,7 @@ endif
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_rapl.o
-obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o perf_event_intel_bts.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 8b6e4e79fe..17e6348f29 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -715,6 +715,12 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
 void intel_pt_interrupt(void);
 
+int intel_bts_interrupt(void);
+
+void intel_bts_enable_local(void);
+
+void intel_bts_disable_local(void);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 16d63b91b3..ceb8232ae9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1051,6 +1051,8 @@ static void intel_pmu_disable_all(void)
 
 	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
 		intel_pmu_disable_bts();
+	else
+		intel_bts_disable_local();
 
 	intel_pmu_pebs_disable_all();
 	intel_pmu_lbr_disable_all();
@@ -1073,7 +1075,8 @@ static void intel_pmu_enable_all(int added)
 			return;
 
 		intel_pmu_enable_bts(event->hw.config);
-	}
+	} else
+		intel_bts_enable_local();
 }
 
 /*
@@ -1359,6 +1362,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 		apic_write(APIC_LVTPC, APIC_DM_NMI);
 	intel_pmu_disable_all();
 	handled = intel_pmu_drain_bts_buffer();
+	handled += intel_bts_interrupt();
 	status = intel_pmu_get_status();
 	if (!status)
 		goto done;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_bts.c b/arch/x86/kernel/cpu/perf_event_intel_bts.c
new file mode 100644
index 0000000000..e1a7ff35e9
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_bts.c
@@ -0,0 +1,496 @@
+/*
+ * BTS PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#undef DEBUG
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/debugfs.h>
+#include <linux/device.h>
+#include <linux/coredump.h>
+
+#include <asm-generic/sizes.h>
+#include <asm/perf_event.h>
+
+#include "perf_event.h"
+
+struct bts_ctx {
+	raw_spinlock_t			lock;
+	struct perf_output_handle	handle;
+	struct debug_store		ds_back;
+};
+
+static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
+
+#define BTS_RECORD_SIZE		24
+#define BTS_SAFETY_MARGIN	4080
+
+struct bts_phys {
+	struct page	*page;
+	unsigned long	size;
+	unsigned long	offset;
+	unsigned long	displacement;
+};
+
+struct bts_buffer {
+	size_t		real_size;	/* multiple of BTS_RECORD_SIZE */
+	unsigned int	nr_pages;
+	unsigned int	nr_bufs;
+	unsigned int	cur_buf;
+	unsigned long	index;
+	bool		snapshot;
+	local_t		data_size;
+	local_t		lost;
+	local_t		head;
+	unsigned long	end;
+	void		**data_pages;
+	struct bts_phys	buf[0];
+};
+
+struct pmu bts_pmu;
+
+void intel_pmu_enable_bts(u64 config);
+void intel_pmu_disable_bts(void);
+
+static size_t buf_size(struct page *page)
+{
+	return 1 << (PAGE_SHIFT + page_private(page));
+}
+
+static void *
+bts_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool overwrite)
+{
+	struct bts_buffer *buf;
+	struct page *page;
+	int node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+	unsigned long offset;
+	size_t size = nr_pages << PAGE_SHIFT;
+	int pg, nbuf, pad;
+
+	/* count all the high order buffers */
+	for (pg = 0, nbuf = 0; pg < nr_pages;) {
+		page = virt_to_page(pages[pg]);
+		if (WARN_ON_ONCE(!PagePrivate(page) && nr_pages > 1))
+			return NULL;
+		pg += 1 << page_private(page);
+		nbuf++;
+	}
+
+	/*
+	 * to avoid interrupts in overwrite mode, only allow one physical
+	 */
+	if (overwrite && nbuf > 1)
+		return NULL;
+
+	buf = kzalloc_node(offsetof(struct bts_buffer, buf[nbuf]), GFP_KERNEL, node);
+	if (!buf)
+		return NULL;
+
+	buf->nr_pages = nr_pages;
+	buf->nr_bufs = nbuf;
+	buf->snapshot = overwrite;
+	buf->data_pages = pages;
+	buf->real_size = size - size % BTS_RECORD_SIZE;
+
+	for (pg = 0, nbuf = 0, offset = 0, pad = 0; nbuf < buf->nr_bufs; nbuf++) {
+		unsigned int __nr_pages;
+
+		page = virt_to_page(pages[pg]);
+		__nr_pages = PagePrivate(page) ? 1 << page_private(page) : 1;
+		buf->buf[nbuf].page = page;
+		buf->buf[nbuf].offset = offset;
+		buf->buf[nbuf].displacement = (pad ? BTS_RECORD_SIZE - pad : 0);
+		buf->buf[nbuf].size = buf_size(page) - buf->buf[nbuf].displacement;
+		pad = buf->buf[nbuf].size % BTS_RECORD_SIZE;
+		buf->buf[nbuf].size -= pad;
+
+		pg += __nr_pages;
+		offset += __nr_pages << PAGE_SHIFT;
+	}
+
+	return buf;
+}
+
+static void bts_buffer_free_aux(void *data)
+{
+	kfree(data);
+}
+
+static unsigned long bts_buffer_offset(struct bts_buffer *buf, unsigned int idx)
+{
+	return buf->buf[idx].offset + buf->buf[idx].displacement;
+}
+
+static unsigned long
+bts_buffer_advance(struct bts_buffer *buf, unsigned long head)
+{
+	buf->cur_buf++;
+	if (buf->cur_buf == buf->nr_bufs) {
+		buf->cur_buf = 0;
+		head = 0;
+	} else {
+		head = bts_buffer_offset(buf, buf->cur_buf);
+	}
+
+	return head;
+}
+
+static void
+bts_config_buffer(struct bts_buffer *buf)
+{
+	int cpu = raw_smp_processor_id();
+	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	struct bts_phys *phys = &buf->buf[buf->cur_buf];
+	unsigned long index, thresh = 0, end = phys->size;
+	struct page *page = phys->page;
+
+	index = local_read(&buf->head);
+
+	if (!buf->snapshot) {
+		if (buf->end < phys->offset + buf_size(page))
+			end = buf->end - phys->offset - phys->displacement;
+
+		index -= phys->offset + phys->displacement;
+
+		thresh = end - BTS_RECORD_SIZE;
+		if (end - index > BTS_SAFETY_MARGIN)
+			thresh -= BTS_SAFETY_MARGIN;
+	}
+
+	ds->bts_buffer_base = (u64)page_address(page) + phys->displacement;
+	ds->bts_index = ds->bts_buffer_base + index;
+	ds->bts_absolute_maximum = ds->bts_buffer_base + end;
+	ds->bts_interrupt_threshold = !buf->snapshot
+		? ds->bts_buffer_base + thresh
+		: ds->bts_absolute_maximum + BTS_RECORD_SIZE;
+}
+
+static bool bts_buffer_is_full(struct bts_buffer *buf, struct bts_ctx *bts)
+{
+	if (buf->snapshot)
+		return false;
+
+	if (local_read(&buf->data_size) >= bts->handle.size ||
+	    bts->handle.size - local_read(&buf->data_size) < BTS_RECORD_SIZE)
+		return true;
+
+	return false;
+}
+
+static void bts_update(struct bts_ctx *bts)
+{
+	int cpu = raw_smp_processor_id();
+	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	struct bts_buffer *buf = perf_get_aux(&bts->handle);
+	unsigned long index = ds->bts_index - ds->bts_buffer_base, old, head;
+
+	if (!buf)
+		return;
+
+	head = index + bts_buffer_offset(buf, buf->cur_buf);
+
+	if (!buf->snapshot) {
+		struct bts_phys *phys = &buf->buf[buf->cur_buf];
+		int advance = 0;
+
+		if (phys->size - index < BTS_RECORD_SIZE) {
+			advance++;
+			local_inc(&buf->lost);
+		} else if (phys->size - index < BTS_SAFETY_MARGIN) {
+			advance++;
+			memset((void *)ds->bts_index, 0, phys->size - index);
+		}
+
+		if (advance)
+			head = bts_buffer_advance(buf, head);
+
+		old = local_xchg(&buf->head, head);
+		if (!head)
+		  head = buf->real_size;
+
+		local_add(head - old, &buf->data_size);
+	} else {
+		local_set(&buf->data_size, head);
+	}
+}
+
+static void bts_event_start(struct perf_event *event, int flags)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct bts_buffer *buf = perf_get_aux(&bts->handle);
+	u64 config = 0;
+
+	if (!buf || bts_buffer_is_full(buf, bts)) {
+		event->hw.state = PERF_HES_STOPPED;
+		return;
+	}
+
+	event->hw.state = 0;
+
+	if (!buf->snapshot)
+		config |= ARCH_PERFMON_EVENTSEL_INT;
+	if (!event->attr.exclude_kernel)
+		config |= ARCH_PERFMON_EVENTSEL_OS;
+	if (!event->attr.exclude_user)
+		config |= ARCH_PERFMON_EVENTSEL_USR;
+
+	bts_config_buffer(buf);
+
+	wmb();
+
+	intel_pmu_enable_bts(config);
+}
+
+static void bts_event_stop(struct perf_event *event, int flags)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	if (event->hw.state == PERF_HES_STOPPED)
+		return;
+
+	event->hw.state = PERF_HES_STOPPED;
+	intel_pmu_disable_bts();
+
+	if (flags & PERF_EF_UPDATE)
+		bts_update(bts);
+}
+
+void intel_bts_enable_local(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	if (bts->handle.event)
+		bts_event_start(bts->handle.event, 0);
+}
+
+void intel_bts_disable_local(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	if (bts->handle.event)
+		bts_event_stop(bts->handle.event, 0);
+}
+
+static int
+bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
+{
+	unsigned long pad, size;
+	struct bts_phys *phys;
+	int ret, little_room = 0;
+
+	handle->head &= ((buf->nr_pages << PAGE_SHIFT) - 1);
+	pad = (buf->nr_pages << PAGE_SHIFT) - handle->head;
+	if (pad > BTS_RECORD_SIZE) {
+		pad = handle->head % BTS_RECORD_SIZE;
+		if (pad)
+			pad = BTS_RECORD_SIZE - pad;
+	}
+
+	if (pad) {
+		ret = perf_aux_output_skip(handle, pad);
+		if (ret)
+			return ret;
+		handle->head &= ((buf->nr_pages << PAGE_SHIFT) - 1);
+	}
+
+	size = (handle->size / BTS_RECORD_SIZE) * BTS_RECORD_SIZE;
+	if (size < BTS_SAFETY_MARGIN)
+		little_room++;
+
+	/* figure out index offset in the current buffer */
+	for (buf->cur_buf = 0, phys = &buf->buf[buf->cur_buf];
+	     handle->head >= phys->offset + phys->displacement + phys->size;
+	     phys = &buf->buf[++buf->cur_buf])
+		;
+	if (WARN_ON_ONCE(buf->cur_buf == buf->nr_bufs))
+		return -EINVAL;
+
+	pad = phys->offset + phys->displacement + phys->size - handle->head;
+	if (!little_room && pad < BTS_SAFETY_MARGIN) {
+		memset(page_address(phys->page) + phys->displacement + handle->head, 0, pad);
+		ret = perf_aux_output_skip(handle, pad);
+		if (ret)
+			return ret;
+		handle->head = bts_buffer_advance(buf, handle->head);
+	}
+
+	local_set(&buf->data_size, 0);
+	local_set(&buf->head, handle->head);
+	buf->end = min(handle->head + size, buf->real_size);
+
+	return 0;
+}
+
+int intel_bts_interrupt(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct perf_event *event = bts->handle.event;
+	struct bts_buffer *buf;
+	s64 old_head;
+	int err;
+
+	if (!event)
+		return 0;
+
+	buf = perf_get_aux(&bts->handle);
+	/*
+	 * Skip snapshot counters: they don't use the interrupt, but
+	 * there's no other way of telling, because the pointer will
+	 * keep moving
+	 */
+	if (!buf || buf->snapshot)
+		return 0;
+
+	old_head = local_read(&buf->head);
+	bts_update(bts);
+
+	perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
+			    !!local_xchg(&buf->lost, 0));
+	if (old_head == local_read(&buf->head))
+		return 0;
+
+	buf = perf_aux_output_begin(&bts->handle, event);
+	if (!buf) {
+		event->hw.state = PERF_HES_STOPPED;
+		return 1;
+	}
+
+	err = bts_buffer_reset(buf, &bts->handle);
+	if (err) {
+		event->hw.state = PERF_HES_STOPPED;
+		perf_aux_output_end(&bts->handle, 0, true);
+	}
+
+	return 1;
+}
+
+static void bts_event_del(struct perf_event *event, int mode)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct bts_buffer *buf = perf_get_aux(&bts->handle);
+	unsigned long flags;
+
+	bts_event_stop(event, PERF_EF_UPDATE);
+
+	raw_spin_lock_irqsave(&bts->lock, flags);
+	if (bts->handle.event)
+		perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
+				    !!local_xchg(&buf->lost, 0));
+	cpuc->ds->bts_index = bts->ds_back.bts_buffer_base;
+	cpuc->ds->bts_buffer_base = bts->ds_back.bts_buffer_base;
+	cpuc->ds->bts_absolute_maximum = bts->ds_back.bts_absolute_maximum;
+	cpuc->ds->bts_interrupt_threshold = bts->ds_back.bts_interrupt_threshold;
+	raw_spin_unlock_irqrestore(&bts->lock, flags);
+}
+
+static int bts_event_add(struct perf_event *event, int mode)
+{
+	struct bts_buffer *buf;
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct hw_perf_event *hwc = &event->hw;
+	unsigned long flags;
+	int ret = -EBUSY;
+
+	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
+		goto err;
+
+	if (cpuc->pt_enabled)
+		goto err;
+
+	raw_spin_lock_irqsave(&bts->lock, flags);
+	if (bts->handle.event)
+		goto err_unlock;
+
+	buf = perf_aux_output_begin(&bts->handle, event);
+	if (!buf) {
+		ret = -EINVAL;
+		goto err_unlock;
+	}
+
+	ret = bts_buffer_reset(buf, &bts->handle);
+	if (ret) {
+		perf_aux_output_end(&bts->handle, 0, true);
+		goto err_unlock;
+	}
+
+	bts->ds_back.bts_buffer_base = cpuc->ds->bts_buffer_base;
+	bts->ds_back.bts_absolute_maximum = cpuc->ds->bts_absolute_maximum;
+	bts->ds_back.bts_interrupt_threshold = cpuc->ds->bts_interrupt_threshold;
+	raw_spin_unlock_irqrestore(&bts->lock, flags);
+
+	hwc->state = !(mode & PERF_EF_START);
+	if (!hwc->state) {
+		bts_event_start(event, 0);
+		if (hwc->state == PERF_HES_STOPPED) {
+			bts_event_del(event, 0);
+			ret = -EBUSY;
+			goto err;
+		}
+	}
+
+	return 0;
+
+err_unlock:
+	raw_spin_unlock_irqrestore(&bts->lock, flags);
+err:
+	hwc->state = PERF_HES_STOPPED;
+
+	return ret;
+}
+
+static int bts_event_init(struct perf_event *event)
+{
+	if (event->attr.type != bts_pmu.type)
+		return -ENOENT;
+
+	return 0;
+}
+
+static void bts_event_read(struct perf_event *event)
+{
+}
+
+static __init int bts_init(void)
+{
+	int cpu;
+
+	if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
+		return -ENODEV;
+
+	get_online_cpus();
+	for_each_possible_cpu(cpu) {
+		raw_spin_lock_init(&per_cpu(bts_ctx, cpu).lock);
+	}
+	put_online_cpus();
+
+	bts_pmu.capabilities	= PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE;
+	bts_pmu.task_ctx_nr	= perf_hw_context;
+	bts_pmu.event_init	= bts_event_init;
+	bts_pmu.add		= bts_event_add;
+	bts_pmu.del		= bts_event_del;
+	bts_pmu.start		= bts_event_start;
+	bts_pmu.stop		= bts_event_stop;
+	bts_pmu.read		= bts_event_read;
+	bts_pmu.setup_aux	= bts_buffer_setup_aux;
+	bts_pmu.free_aux	= bts_buffer_free_aux;
+
+	return perf_pmu_register(&bts_pmu, "intel_bts", -1);
+}
+
+module_init(bts_init);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index b8a7f9315f..e44b1727f7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -467,7 +467,8 @@ void intel_pmu_enable_bts(u64 config)
 
 	debugctlmsr |= DEBUGCTLMSR_TR;
 	debugctlmsr |= DEBUGCTLMSR_BTS;
-	debugctlmsr |= DEBUGCTLMSR_BTINT;
+	if (config & ARCH_PERFMON_EVENTSEL_INT)
+		debugctlmsr |= DEBUGCTLMSR_BTINT;
 
 	if (!(config & ARCH_PERFMON_EVENTSEL_OS))
 		debugctlmsr |= DEBUGCTLMSR_BTS_OFF_OS;
-- 
2.1.0.rc1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 17/23] perf: Add rb_{alloc,free}_kernel api
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (15 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 16/23] x86: perf: intel_bts: Add BTS " Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 18/23] perf: Add a helper to copy AUX data in the kernel Alexander Shishkin
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Events that generate AUX data can also be created by the kernel. In this
case, some in-kernel infrastructure is needed to store and copy this data.

This patch adds an API for ring buffer (de-)allocation that can be used by
kernel code.
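
A minimal sketch of how an in-kernel user could pair these calls around a
kernel counter (illustration only: the function names and attr handling are
made up and the snippet assumes access to kernel/events/internal.h; only
rb_alloc_kernel()/rb_free_kernel() and perf_event_create_kernel_counter()
are from the actual API):

/* Hypothetical in-kernel user; attr setup and locking are elided. */
static struct perf_event *aux_event;
static struct ring_buffer *aux_rb;

static int example_aux_counter_init(struct perf_event_attr *attr, int cpu,
				    int aux_nr_pages)
{
	aux_event = perf_event_create_kernel_counter(attr, cpu, NULL,
						     NULL, NULL);
	if (IS_ERR(aux_event))
		return PTR_ERR(aux_event);

	/* no regular data pages, only AUX pages for the trace stream */
	aux_rb = rb_alloc_kernel(aux_event, 0, aux_nr_pages);
	if (!aux_rb) {
		perf_event_release_kernel(aux_event);
		return -ENOMEM;
	}

	return 0;
}

static void example_aux_counter_exit(void)
{
	rb_free_kernel(aux_rb, aux_event);
	perf_event_release_kernel(aux_event);
}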

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/internal.h    |  3 +++
 kernel/events/ring_buffer.c | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 4f99987bc3..f4cfa4cabb 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -59,6 +59,9 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
+extern struct ring_buffer *
+rb_alloc_kernel(struct perf_event *event, int nr_pages, int aux_nr_pages);
+extern void rb_free_kernel(struct ring_buffer *rb, struct perf_event *event);
 extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
 extern void ring_buffer_put(struct ring_buffer *rb);
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 85858d201c..43c4cc1892 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -467,6 +467,39 @@ void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
 	rb->aux_nr_pages = 0;
 }
 
+struct ring_buffer *
+rb_alloc_kernel(struct perf_event *event, int nr_pages, int aux_nr_pages)
+{
+	struct ring_buffer *rb;
+	int ret, pgoff = nr_pages + 1;
+
+	rb = rb_alloc(nr_pages, 0, event->cpu, 0);
+	if (!rb)
+		return NULL;
+
+	ret = rb_alloc_aux(rb, event, pgoff, aux_nr_pages, 0, 0);
+	if (ret) {
+		rb_free(rb);
+		return NULL;
+	}
+
+	/*
+	 * Kernel counters don't need ring buffer wakeups, therefore we don't
+	 * use ring_buffer_attach() here and event->rb_entry stays empty
+	 */
+	rcu_assign_pointer(event->rb, rb);
+
+	return rb;
+}
+
+void rb_free_kernel(struct ring_buffer *rb, struct perf_event *event)
+{
+	BUG_ON(atomic_read(&rb->refcount) != 1);
+	rcu_assign_pointer(event->rb, NULL);
+	rb_free_aux(rb, event);
+	rb_free(rb);
+}
+
 #ifndef CONFIG_PERF_USE_VMALLOC
 
 /*
-- 
2.1.0.rc1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 18/23] perf: Add a helper to copy AUX data in the kernel
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (16 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 17/23] perf: Add rb_{alloc,free}_kernel api Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 19/23] perf: Add a helper for looking up pmus by type Alexander Shishkin
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

This patch adds a helper that lets kernel counters which generate AUX data
copy this data around, for example to output it to the perf ring buffer
as a sample record or to write it to a core dump file.
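
For illustration, the callback contract is that an aux_copyfn returns the
number of bytes it could not copy, 0 on success. A made-up callback that
appends AUX data to a flat kernel buffer, e.g. for the core dump case
mentioned above, could look roughly like this:

/* Hypothetical destination and callback; names are made up. */
struct flat_buf {
	void		*base;
	unsigned long	pos;
	unsigned long	size;
};

static unsigned long flat_buf_copy(void *data, const void *src,
				   unsigned long len)
{
	struct flat_buf *fb = data;

	if (fb->pos + len > fb->size)
		return len;	/* nothing copied */

	memcpy(fb->base + fb->pos, src, len);
	fb->pos += len;

	return 0;		/* everything copied */
}

/* usage: copied = rb_output_aux(rb, from, to, flat_buf_copy, &fb); */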

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/internal.h    |  5 +++++
 kernel/events/ring_buffer.c | 31 +++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index f4cfa4cabb..bb035dd645 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -52,6 +52,9 @@ struct ring_buffer {
 	void				*data_pages[0];
 };
 
+typedef unsigned long (*aux_copyfn)(void *data, const void *src,
+				    unsigned long len);
+
 extern void rb_free(struct ring_buffer *rb);
 extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
@@ -59,6 +62,8 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
+extern long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+			  unsigned long to, aux_copyfn copyfn, void *data);
 extern struct ring_buffer *
 rb_alloc_kernel(struct perf_event *event, int nr_pages, int aux_nr_pages);
 extern void rb_free_kernel(struct ring_buffer *rb, struct perf_event *event);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 43c4cc1892..3434bd145c 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -348,6 +348,37 @@ void *perf_get_aux(struct perf_output_handle *handle)
 	return handle->rb->aux_priv;
 }
 
+long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+		   unsigned long to, aux_copyfn copyfn, void *data)
+{
+	unsigned long tocopy, remainder, len = 0;
+	void *addr;
+
+	from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	to &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+
+	do {
+		tocopy = PAGE_SIZE - offset_in_page(from);
+		if (to > from)
+			tocopy = min(tocopy, to - from);
+		if (!tocopy)
+			break;
+
+		addr = rb->aux_pages[from >> PAGE_SHIFT];
+		addr += offset_in_page(from);
+
+		remainder = copyfn(data, addr, tocopy);
+		if (remainder)
+			return -EFAULT;
+
+		len += tocopy;
+		from += tocopy;
+		from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	} while (to != from);
+
+	return len;
+}
+
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
 static struct page *rb_alloc_aux_page(int node, int order)
-- 
2.1.0.rc1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 19/23] perf: Add a helper for looking up pmus by type
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (17 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 18/23] perf: Add a helper to copy AUX data in the kernel Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 20/23] perf: itrace: Infrastructure for sampling instruction flow traces Alexander Shishkin
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

This patch adds a helper for looking up a registered pmu by its type.
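
For illustration only (the helper stays static in kernel/events/core.c), a
hypothetical caller would keep the lookup under pmus_srcu and pin the pmu's
module before leaving the read-side section, mirroring what perf_init_event()
does below:

/* Illustration only: look up a pmu by type and pin its module. */
static struct pmu *example_find_pmu_get(u32 type)
{
	struct pmu *pmu;
	int idx;

	idx = srcu_read_lock(&pmus_srcu);
	pmu = __perf_find_pmu(type);
	if (pmu && !try_module_get(pmu->module))
		pmu = NULL;
	srcu_read_unlock(&pmus_srcu, idx);

	return pmu;
}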

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index d4b5e33b74..c0f05f8748 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6831,6 +6831,18 @@ void perf_pmu_unregister(struct pmu *pmu)
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 
+/* call under pmus_srcu */
+static struct pmu *__perf_find_pmu(u32 type)
+{
+	struct pmu *pmu;
+
+	rcu_read_lock();
+	pmu = idr_find(&pmu_idr, type);
+	rcu_read_unlock();
+
+	return pmu;
+}
+
 struct pmu *perf_init_event(struct perf_event *event)
 {
 	struct pmu *pmu = NULL;
@@ -6839,9 +6851,7 @@ struct pmu *perf_init_event(struct perf_event *event)
 
 	idx = srcu_read_lock(&pmus_srcu);
 
-	rcu_read_lock();
-	pmu = idr_find(&pmu_idr, event->attr.type);
-	rcu_read_unlock();
+	pmu = __perf_find_pmu(event->attr.type);
 	if (pmu) {
 		if (!try_module_get(pmu->module)) {
 			pmu = ERR_PTR(-ENODEV);
-- 
2.1.0.rc1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 20/23] perf: itrace: Infrastructure for sampling instruction flow traces
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (18 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 19/23] perf: Add a helper for looking up pmus by type Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 21/23] perf: Allocate ring buffers for inherited per-task kernel events Alexander Shishkin
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Instruction tracing PMUs are capable of recording a log of instruction
execution flow on a cpu core, which can be useful for profiling and crash
analysis. This patch adds itrace infrastructure for perf events and the
rest of the kernel to use.

This trace data can be used to annotate other perf events by including it
in sample records when the PERF_SAMPLE_ITRACE flag is set. In that case, a
kernel counter is created for each such event, and the trace data is
retrieved from it and stored in the perf data stream.
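
For context, a hypothetical userspace sketch of how the new attribute fields
below fit together (tooling support is posted separately; the PMU type value
would come from sysfs and the sizes here are made up):

/* Hypothetical consumer of the new ABI fields; illustration only. */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_itrace_annotated_event(__u32 itrace_pmu_type)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);		/* PERF_ATTR_SIZE_VER4 */
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ITRACE;
	attr.itrace_sample_type = itrace_pmu_type; /* pmu->type of the itrace PMU */
	attr.itrace_sample_size = 4096;		/* bytes of trace per sample */

	/* measure the current task on any cpu */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}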

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/itrace.h          |  45 ++++++++++++
 include/linux/perf_event.h      |  14 ++++
 include/uapi/linux/perf_event.h |  14 +++-
 kernel/events/Makefile          |   2 +-
 kernel/events/core.c            |  38 ++++++++++
 kernel/events/itrace.c          | 159 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 269 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/itrace.h
 create mode 100644 kernel/events/itrace.c

diff --git a/include/linux/itrace.h b/include/linux/itrace.h
new file mode 100644
index 0000000000..c6c0674092
--- /dev/null
+++ b/include/linux/itrace.h
@@ -0,0 +1,45 @@
+/*
+ * Instruction flow trace unit infrastructure
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_ITRACE_H
+#define _LINUX_ITRACE_H
+
+#include <linux/perf_event.h>
+
+#ifdef CONFIG_PERF_EVENTS
+extern int itrace_sampler_init(struct perf_event *event,
+			       struct task_struct *task,
+			       struct pmu *pmu);
+extern void itrace_sampler_fini(struct perf_event *event);
+extern unsigned long itrace_sampler_trace(struct perf_event *event,
+					  struct perf_sample_data *data);
+extern void itrace_sampler_output(struct perf_event *event,
+				  struct perf_output_handle *handle,
+				  struct perf_sample_data *data);
+#else
+static inline int itrace_sampler_init(struct perf_event *event,
+				      struct task_struct *task,
+				      struct pmu *pmu)		{ return -EINVAL; }
+static inline void
+itrace_sampler_fini(struct perf_event *event)			{}
+static inline unsigned long
+itrace_sampler_trace(struct perf_event *event,
+		     struct perf_sample_data *data)		{ return 0; }
+static inline void
+itrace_sampler_output(struct perf_event *event,
+		      struct perf_output_handle *handle,
+		      struct perf_sample_data *data)		{}
+#endif
+
+#endif /* _LINUX_ITRACE_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 46137cb4d6..94e667a530 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -83,6 +83,12 @@ struct perf_regs_user {
 	struct pt_regs	*regs;
 };
 
+struct perf_trace_record {
+	u64		size;
+	unsigned long	from;
+	unsigned long	to;
+};
+
 struct task_struct;
 
 /*
@@ -456,6 +462,7 @@ struct perf_event {
 	perf_overflow_handler_t		overflow_handler;
 	void				*overflow_handler_context;
 
+	struct perf_event		*trace_event;
 #ifdef CONFIG_EVENT_TRACING
 	struct ftrace_event_call	*tp_event;
 	struct event_filter		*filter;
@@ -623,6 +630,7 @@ struct perf_sample_data {
 	union  perf_mem_data_src	data_src;
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
+	struct perf_trace_record	trace;
 	struct perf_branch_stack	*br_stack;
 	struct perf_regs_user		regs_user;
 	u64				stack_user_size;
@@ -643,6 +651,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->period = period;
 	data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
 	data->regs_user.regs = NULL;
+	data->trace.from = data->trace.to = data->trace.size = 0;
 	data->stack_user_size = 0;
 	data->weight = 0;
 	data->data_src.val = 0;
@@ -804,6 +813,11 @@ static inline bool has_aux(struct perf_event *event)
 	return event->pmu->setup_aux;
 }
 
+static inline bool is_itrace_event(struct perf_event *event)
+{
+	return !!(event->pmu->capabilities & PERF_PMU_CAP_ITRACE);
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 500e18b8e9..fbc2b51ad1 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -137,8 +137,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_DATA_SRC			= 1U << 15,
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
+	PERF_SAMPLE_ITRACE			= 1U << 18,
 
-	PERF_SAMPLE_MAX = 1U << 18,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
 };
 
 /*
@@ -239,7 +240,9 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER3	96	/* add: sample_regs_user */
 					/* add: sample_stack_user */
 					/* add: aux_watermark */
-#define PERF_ATTR_SIZE_VER4	104	/* add: itrace_config */
+#define PERF_ATTR_SIZE_VER4	120	/* add: itrace_config */
+					/* add: itrace_sample_size */
+					/* add: itrace_sample_type */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -343,6 +346,11 @@ struct perf_event_attr {
 	 * Itrace pmus' event config
 	 */
 	__u64	itrace_config;
+	__u64	itrace_sample_size;
+	__u32	itrace_sample_type;	/* pmu->type of the itrace PMU */
+
+	/* Align to u64. */
+	__u32	__reserved_2;
 };
 
 #define perf_flags(attr)	(*(&(attr)->read_format + 1))
@@ -716,6 +724,8 @@ enum perf_event_type {
 	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
+	 *	{ u64			size;
+	 *	  char			data[size]; } && PERF_SAMPLE_ITRACE
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 103f5d147b..46a37708d0 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -2,7 +2,7 @@ ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_core.o = -pg
 endif
 
-obj-y := core.o ring_buffer.o callchain.o
+obj-y := core.o ring_buffer.o callchain.o itrace.o
 
 obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
 obj-$(CONFIG_UPROBES) += uprobes.o
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c0f05f8748..7a3ffda1c0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -41,6 +41,7 @@
 #include <linux/cgroup.h>
 #include <linux/module.h>
 #include <linux/mman.h>
+#include <linux/itrace.h>
 
 #include "internal.h"
 
@@ -1595,6 +1596,9 @@ void perf_event_disable(struct perf_event *event)
 	struct perf_event_context *ctx = event->ctx;
 	struct task_struct *task = ctx->task;
 
+	if (event->trace_event)
+		perf_event_disable(event->trace_event);
+
 	if (!task) {
 		/*
 		 * Disable the event on the cpu that it's on
@@ -2094,6 +2098,8 @@ void perf_event_enable(struct perf_event *event)
 	struct perf_event_context *ctx = event->ctx;
 	struct task_struct *task = ctx->task;
 
+	if (event->trace_event)
+		perf_event_enable(event->trace_event);
 	if (!task) {
 		/*
 		 * Enable the event on the cpu that it's on
@@ -3250,6 +3256,8 @@ static void unaccount_event(struct perf_event *event)
 		static_key_slow_dec_deferred(&perf_sched_events);
 	if (has_branch_stack(event))
 		static_key_slow_dec_deferred(&perf_sched_events);
+	if ((event->attr.sample_type & PERF_SAMPLE_ITRACE))
+		itrace_sampler_fini(event);
 
 	unaccount_event_cpu(event, event->cpu);
 }
@@ -4781,6 +4789,13 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		perf_output_put(handle, data->txn);
 
+	if (sample_type & PERF_SAMPLE_ITRACE) {
+		perf_output_put(handle, data->trace.size);
+
+		if (data->trace.size)
+			itrace_sampler_output(event, handle, data);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -4888,6 +4903,14 @@ void perf_prepare_sample(struct perf_event_header *header,
 		data->stack_user_size = stack_size;
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_ITRACE) {
+		u64 size = sizeof(u64);
+
+		size += itrace_sampler_trace(event, data);
+
+		header->size += size;
+	}
 }
 
 static void perf_event_output(struct perf_event *event,
@@ -7040,6 +7063,21 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 			if (err)
 				goto err_pmu;
 		}
+
+		if (event->attr.sample_type & PERF_SAMPLE_ITRACE) {
+			struct pmu *itrace_pmu;
+			int idx;
+
+			idx = srcu_read_lock(&pmus_srcu);
+			itrace_pmu = __perf_find_pmu(event->attr.itrace_sample_type);
+			err = itrace_sampler_init(event, task, itrace_pmu);
+			srcu_read_unlock(&pmus_srcu, idx);
+
+			if (err) {
+				put_callchain_buffers();
+				goto err_pmu;
+			}
+		}
 	}
 
 	return event;
diff --git a/kernel/events/itrace.c b/kernel/events/itrace.c
new file mode 100644
index 0000000000..f57b2ab31e
--- /dev/null
+++ b/kernel/events/itrace.c
@@ -0,0 +1,159 @@
+/*
+ * Instruction flow trace unit infrastructure
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#undef DEBUG
+
+#include <linux/kernel.h>
+#include <linux/perf_event.h>
+#include <linux/itrace.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+
+#include "internal.h"
+
+static void itrace_event_destroy(struct perf_event *event)
+{
+	struct ring_buffer *rb = event->rb;
+
+	if (!rb)
+		return;
+
+	ring_buffer_put(rb); /* can be last */
+}
+
+/*
+ * Trace sample annotation
+ * For events that have attr.sample_type & PERF_SAMPLE_ITRACE, perf calls here
+ * to configure and obtain itrace samples.
+ */
+
+int itrace_sampler_init(struct perf_event *event, struct task_struct *task,
+			struct pmu *pmu)
+{
+	struct perf_event_attr attr;
+	struct perf_event *tevt;
+	struct ring_buffer *rb;
+	unsigned long nr_pages;
+
+	if (!pmu || !(pmu->capabilities & PERF_PMU_CAP_ITRACE))
+		return -ENOTSUPP;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.type = pmu->type;
+	attr.config = 0;
+	attr.sample_type = 0;
+	attr.exclude_user = event->attr.exclude_user;
+	attr.exclude_kernel = event->attr.exclude_kernel;
+	attr.itrace_sample_size = event->attr.itrace_sample_size;
+	attr.itrace_config = event->attr.itrace_config;
+
+	tevt = perf_event_create_kernel_counter(&attr, event->cpu, task, NULL, NULL);
+	if (IS_ERR(tevt))
+		return PTR_ERR(tevt);
+
+	nr_pages = 1ul << __get_order(event->attr.itrace_sample_size);
+
+	rb = rb_alloc_kernel(tevt, 0, nr_pages);
+	if (!rb) {
+		perf_event_release_kernel(tevt);
+		return -ENOMEM;
+	}
+
+	event->trace_event = tevt;
+	tevt->destroy = itrace_event_destroy;
+	if (event->state != PERF_EVENT_STATE_OFF)
+		perf_event_enable(event->trace_event);
+
+	return 0;
+}
+
+void itrace_sampler_fini(struct perf_event *event)
+{
+	struct perf_event *tevt = event->trace_event;
+
+	/* might get free'd from event->destroy() path */
+	if (!tevt)
+		return;
+
+	perf_event_release_kernel(tevt);
+
+	event->trace_event = NULL;
+}
+
+unsigned long itrace_sampler_trace(struct perf_event *event,
+				   struct perf_sample_data *data)
+{
+	struct perf_event *tevt = event->trace_event;
+	struct ring_buffer *rb;
+
+	if (!tevt || tevt->state != PERF_EVENT_STATE_ACTIVE) {
+		data->trace.size = 0;
+		goto out;
+	}
+
+	rb = ring_buffer_get(tevt);
+	if (!rb) {
+		data->trace.size = 0;
+		goto out;
+	}
+
+	tevt->pmu->del(tevt, 0);
+
+	data->trace.to = local_read(&rb->aux_head);
+
+	if (data->trace.to < tevt->attr.itrace_sample_size)
+		data->trace.from = rb->aux_nr_pages * PAGE_SIZE +
+			data->trace.to - tevt->attr.itrace_sample_size;
+	else
+		data->trace.from = data->trace.to -
+			tevt->attr.itrace_sample_size;
+	data->trace.size = ALIGN(tevt->attr.itrace_sample_size, sizeof(u64));
+	ring_buffer_put(rb);
+
+out:
+	return data->trace.size;
+}
+
+void itrace_sampler_output(struct perf_event *event,
+			   struct perf_output_handle *handle,
+			   struct perf_sample_data *data)
+{
+	struct perf_event *tevt = event->trace_event;
+	struct ring_buffer *rb;
+	unsigned long pad;
+	int ret;
+
+	if (WARN_ON_ONCE(!tevt || !data->trace.size))
+		return;
+
+	rb = ring_buffer_get(tevt);
+	if (WARN_ON_ONCE(!rb))
+		return;
+	ret = rb_output_aux(rb, data->trace.from, data->trace.to,
+			    (aux_copyfn)perf_output_copy, handle);
+	if (ret < 0) {
+		pr_warn_ratelimited("failed to copy trace data\n");
+		goto out;
+	}
+
+	pad = data->trace.size - ret;
+	if (pad) {
+		u64 p = 0;
+
+		perf_output_copy(handle, &p, pad);
+	}
+out:
+	ring_buffer_put(rb);
+	tevt->pmu->add(tevt, PERF_EF_START);
+}
-- 
2.1.0.rc1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 21/23] perf: Allocate ring buffers for inherited per-task kernel events
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (19 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 20/23] perf: itrace: Infrastructure for sampling instruction flow traces Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 22/23] perf: itrace: Allow itrace sampling for multiple events Alexander Shishkin
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

When a new event is inherited from a per-task kernel event that has a
ring buffer, allocate a new buffer for this event so that data from the
child task is collected and can later be retrieved for sample annotation
or similar uses.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c     |  9 +++++++++
 kernel/events/internal.h | 11 +++++++++++
 2 files changed, 20 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7a3ffda1c0..c2f02ea6d7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7993,6 +7993,15 @@ inherit_event(struct perf_event *parent_event,
 		(void)perf_event_set_output(child_event, parent_event);
 
 	/*
+	 * For per-task kernel events with ring buffers, set_output doesn't
+	 * make sense, but we can allocate a new buffer here.
+	 */
+	if (parent_event->cpu == -1 && kernel_rb_event(parent_event)) {
+		(void)rb_alloc_kernel(child_event, parent_event->rb->nr_pages,
+				      parent_event->rb->aux_nr_pages);
+	}
+
+	/*
 	 * Precalculate sample_data sizes
 	 */
 	perf_event__header_size(child_event);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index bb035dd645..b306bc9307 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -120,6 +120,17 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb)
 	return rb->aux_nr_pages << PAGE_SHIFT;
 }
 
+static inline bool kernel_rb_event(struct perf_event *event)
+{
+	/*
+	 * Having a ring buffer and not being on any ring buffers' wakeup
+	 * list means it was attached by rb_alloc_kernel() and not
+	 * ring_buffer_attach(). It's the only case when these two
+	 * conditions take place at the same time.
+	 */
+	return event->rb && list_empty(&event->rb_entry);
+}
+
 #define DEFINE_OUTPUT_COPY(func_name, memcpy_func)			\
 static inline unsigned long						\
 func_name(struct perf_output_handle *handle,				\
-- 
2.1.0.rc1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 22/23] perf: itrace: Allow itrace sampling for multiple events
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (20 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 21/23] perf: Allocate ring buffers for inherited per-task kernel events Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:19 ` [PATCH v3 23/23] perf: itrace: Allow sampling of inherited events Alexander Shishkin
  2014-08-11  5:46 ` [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Ingo Molnar
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Right now, only one perf event can be annotated with itrace data; however,
it should be possible to annotate several events that share a similar
configuration (wrt exclude_{hv,idle,user,kernel}, itrace_config, etc).

So, before a kernel counter is created for itrace sampling, we first look
for an existing counter with a suitable configuration and use it to annotate
the new event as well.
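
For readability, the configuration check that this patch open-codes in
itrace_sampler_init() boils down to something like the following
(illustration only, not a helper added by the patch):

/* Condensed form of the in-line check below; illustration only. */
static bool example_sampler_matches(struct perf_event *sampler,
				    struct perf_event *event)
{
	return itrace_event_match(sampler, event) &&
	       sampler->attr.exclude_hv     == event->attr.exclude_hv &&
	       sampler->attr.exclude_idle   == event->attr.exclude_idle &&
	       sampler->attr.exclude_user   == event->attr.exclude_user &&
	       sampler->attr.exclude_kernel == event->attr.exclude_kernel &&
	       sampler->attr.itrace_config  == event->attr.itrace_config &&
	       sampler->attr.type == event->attr.itrace_sample_type &&
	       sampler->attr.itrace_sample_size >= event->attr.itrace_sample_size;
}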

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c     |  6 ++--
 kernel/events/internal.h |  4 +++
 kernel/events/itrace.c   | 94 +++++++++++++++++++++++++++++++++++-------------
 3 files changed, 77 insertions(+), 27 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c2f02ea6d7..89d61178df 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -891,7 +891,7 @@ static void get_ctx(struct perf_event_context *ctx)
 	WARN_ON(!atomic_inc_not_zero(&ctx->refcount));
 }
 
-static void put_ctx(struct perf_event_context *ctx)
+void put_ctx(struct perf_event_context *ctx)
 {
 	if (atomic_dec_and_test(&ctx->refcount)) {
 		if (ctx->parent_ctx)
@@ -3130,7 +3130,7 @@ errout:
 /*
  * Returns a matching context with refcount and pincount.
  */
-static struct perf_event_context *
+struct perf_event_context *
 find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
 {
 	struct perf_event_context *ctx;
@@ -3324,7 +3324,7 @@ static void free_event(struct perf_event *event)
 /*
  * Called when the last reference to the file is gone.
  */
-static void put_event(struct perf_event *event)
+void put_event(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
 	struct task_struct *owner;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index b306bc9307..4cea5578b9 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -55,6 +55,10 @@ struct ring_buffer {
 typedef unsigned long (*aux_copyfn)(void *data, const void *src,
 				    unsigned long len);
 
+extern struct perf_event_context *
+find_get_context(struct pmu *pmu, struct task_struct *task, int cpu);
+extern void put_ctx(struct perf_event_context *ctx);
+void put_event(struct perf_event *event);
 extern void rb_free(struct ring_buffer *rb);
 extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
diff --git a/kernel/events/itrace.c b/kernel/events/itrace.c
index f57b2ab31e..eae85cf578 100644
--- a/kernel/events/itrace.c
+++ b/kernel/events/itrace.c
@@ -32,6 +32,16 @@ static void itrace_event_destroy(struct perf_event *event)
 	ring_buffer_put(rb); /* can be last */
 }
 
+static bool itrace_event_match(struct perf_event *e1, struct perf_event *e2)
+{
+	if (is_itrace_event(e1) &&
+	    (e1->cpu == e2->cpu ||
+	     e1->cpu == -1 ||
+	     e2->cpu == -1))
+		return true;
+	return false;
+}
+
 /*
  * Trace sample annotation
  * For events that have attr.sample_type & PERF_SAMPLE_ITRACE, perf calls here
@@ -41,39 +51,74 @@ static void itrace_event_destroy(struct perf_event *event)
 int itrace_sampler_init(struct perf_event *event, struct task_struct *task,
 			struct pmu *pmu)
 {
+	struct perf_event_context *ctx;
 	struct perf_event_attr attr;
-	struct perf_event *tevt;
+	struct perf_event *tevt = NULL;
 	struct ring_buffer *rb;
-	unsigned long nr_pages;
+	unsigned long nr_pages, flags;
 
 	if (!pmu || !(pmu->capabilities & PERF_PMU_CAP_ITRACE))
 		return -ENOTSUPP;
 
-	memset(&attr, 0, sizeof(attr));
-	attr.type = pmu->type;
-	attr.config = 0;
-	attr.sample_type = 0;
-	attr.exclude_user = event->attr.exclude_user;
-	attr.exclude_kernel = event->attr.exclude_kernel;
-	attr.itrace_sample_size = event->attr.itrace_sample_size;
-	attr.itrace_config = event->attr.itrace_config;
-
-	tevt = perf_event_create_kernel_counter(&attr, event->cpu, task, NULL, NULL);
-	if (IS_ERR(tevt))
-		return PTR_ERR(tevt);
-
-	nr_pages = 1ul << __get_order(event->attr.itrace_sample_size);
+	ctx = find_get_context(pmu, task, event->cpu);
+	if (ctx) {
+		raw_spin_lock_irqsave(&ctx->lock, flags);
+		list_for_each_entry(tevt, &ctx->event_list, event_entry) {
+			/*
+			 * event is not an itrace event, but all the relevant
+			 * bits should match
+			 */
+			if (itrace_event_match(tevt, event) &&
+			    tevt->attr.exclude_hv == event->attr.exclude_hv &&
+			    tevt->attr.exclude_idle == event->attr.exclude_idle &&
+			    tevt->attr.exclude_user == event->attr.exclude_user &&
+			    tevt->attr.exclude_kernel == event->attr.exclude_kernel &&
+			    tevt->attr.itrace_config == event->attr.itrace_config &&
+			    tevt->attr.type == event->attr.itrace_sample_type &&
+			    tevt->attr.itrace_sample_size >= event->attr.itrace_sample_size &&
+			    atomic_long_inc_not_zero(&tevt->refcount))
+				goto got_event;
+		}
+
+		tevt = NULL;
+
+got_event:
+		--ctx->pin_count;
+		put_ctx(ctx);
+		raw_spin_unlock_irqrestore(&ctx->lock, flags);
+	}
 
-	rb = rb_alloc_kernel(tevt, 0, nr_pages);
-	if (!rb) {
-		perf_event_release_kernel(tevt);
-		return -ENOMEM;
+	if (!tevt) {
+		memset(&attr, 0, sizeof(attr));
+		attr.type = pmu->type;
+		attr.config = 0;
+		attr.sample_type = 0;
+		attr.exclude_hv = event->attr.exclude_hv;
+		attr.exclude_idle = event->attr.exclude_idle;
+		attr.exclude_user = event->attr.exclude_user;
+		attr.exclude_kernel = event->attr.exclude_kernel;
+		attr.itrace_sample_size = event->attr.itrace_sample_size;
+		attr.itrace_config = event->attr.itrace_config;
+
+		tevt = perf_event_create_kernel_counter(&attr, event->cpu, task,
+							NULL, NULL);
+		if (IS_ERR(tevt))
+			return PTR_ERR(tevt);
+
+		nr_pages = 1ul << __get_order(event->attr.itrace_sample_size);
+
+		rb = rb_alloc_kernel(tevt, 0, nr_pages);
+		if (!rb) {
+			perf_event_release_kernel(tevt);
+			return -ENOMEM;
+		}
+
+		tevt->destroy = itrace_event_destroy;
+		if (event->state != PERF_EVENT_STATE_OFF)
+			perf_event_enable(tevt);
 	}
 
 	event->trace_event = tevt;
-	tevt->destroy = itrace_event_destroy;
-	if (event->state != PERF_EVENT_STATE_OFF)
-		perf_event_enable(event->trace_event);
 
 	return 0;
 }
@@ -86,7 +131,7 @@ void itrace_sampler_fini(struct perf_event *event)
 	if (!tevt)
 		return;
 
-	perf_event_release_kernel(tevt);
+	put_event(tevt);
 
 	event->trace_event = NULL;
 }
@@ -97,6 +142,7 @@ unsigned long itrace_sampler_trace(struct perf_event *event,
 	struct perf_event *tevt = event->trace_event;
 	struct ring_buffer *rb;
 
+	/* Don't go further if the event is being scheduled out */
 	if (!tevt || tevt->state != PERF_EVENT_STATE_ACTIVE) {
 		data->trace.size = 0;
 		goto out;
-- 
2.1.0.rc1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 23/23] perf: itrace: Allow sampling of inherited events
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (21 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 22/23] perf: itrace: Allow itrace sampling for multiple events Alexander Shishkin
@ 2014-08-11  5:19 ` Alexander Shishkin
  2014-08-11  5:46 ` [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Ingo Molnar
  23 siblings, 0 replies; 25+ messages in thread
From: Alexander Shishkin @ 2014-08-11  5:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Try to find an itrace sampler event for the current event if none is linked.
This is useful when these events are allocated via the inheritance path,
independently of one another.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/itrace.c | 91 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 60 insertions(+), 31 deletions(-)

diff --git a/kernel/events/itrace.c b/kernel/events/itrace.c
index eae85cf578..7dbac8ac63 100644
--- a/kernel/events/itrace.c
+++ b/kernel/events/itrace.c
@@ -48,46 +48,66 @@ static bool itrace_event_match(struct perf_event *e1, struct perf_event *e2)
  * to configure and obtain itrace samples.
  */
 
+struct perf_event *__find_sampling_counter(struct perf_event_context *ctx,
+					   struct perf_event *event,
+					   struct task_struct *task)
+{
+	struct perf_event *sampler = NULL;
+
+	list_for_each_entry(sampler, &ctx->event_list, event_entry) {
+		/*
+		 * event is not an itrace event, but all the relevant
+		 * bits should match
+		 */
+		if (itrace_event_match(sampler, event) &&
+		    kernel_rb_event(sampler) &&
+		    sampler->attr.exclude_hv == event->attr.exclude_hv &&
+		    sampler->attr.exclude_idle == event->attr.exclude_idle &&
+		    sampler->attr.exclude_user == event->attr.exclude_user &&
+		    sampler->attr.exclude_kernel == event->attr.exclude_kernel &&
+		    sampler->attr.itrace_config == event->attr.itrace_config &&
+		    sampler->attr.type == event->attr.itrace_sample_type &&
+		    sampler->attr.itrace_sample_size >= event->attr.itrace_sample_size &&
+		    atomic_long_inc_not_zero(&sampler->refcount))
+			return sampler;
+	}
+
+	return NULL;
+}
+
+struct perf_event *find_sampling_counter(struct pmu *pmu,
+					 struct perf_event *event,
+					 struct task_struct *task)
+{
+	struct perf_event_context *ctx;
+	struct perf_event *sampler = NULL;
+	unsigned long flags;
+
+	ctx = find_get_context(pmu, task, event->cpu);
+	if (!ctx)
+		return NULL;
+
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+	sampler = __find_sampling_counter(ctx, event, task);
+	--ctx->pin_count;
+	put_ctx(ctx);
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return sampler;
+}
+
 int itrace_sampler_init(struct perf_event *event, struct task_struct *task,
 			struct pmu *pmu)
 {
-	struct perf_event_context *ctx;
 	struct perf_event_attr attr;
 	struct perf_event *tevt = NULL;
 	struct ring_buffer *rb;
-	unsigned long nr_pages, flags;
+	unsigned long nr_pages;
 
 	if (!pmu || !(pmu->capabilities & PERF_PMU_CAP_ITRACE))
 		return -ENOTSUPP;
 
-	ctx = find_get_context(pmu, task, event->cpu);
-	if (ctx) {
-		raw_spin_lock_irqsave(&ctx->lock, flags);
-		list_for_each_entry(tevt, &ctx->event_list, event_entry) {
-			/*
-			 * event is not an itrace event, but all the relevant
-			 * bits should match
-			 */
-			if (itrace_event_match(tevt, event) &&
-			    tevt->attr.exclude_hv == event->attr.exclude_hv &&
-			    tevt->attr.exclude_idle == event->attr.exclude_idle &&
-			    tevt->attr.exclude_user == event->attr.exclude_user &&
-			    tevt->attr.exclude_kernel == event->attr.exclude_kernel &&
-			    tevt->attr.itrace_config == event->attr.itrace_config &&
-			    tevt->attr.type == event->attr.itrace_sample_type &&
-			    tevt->attr.itrace_sample_size >= event->attr.itrace_sample_size &&
-			    atomic_long_inc_not_zero(&tevt->refcount))
-				goto got_event;
-		}
-
-		tevt = NULL;
-
-got_event:
-		--ctx->pin_count;
-		put_ctx(ctx);
-		raw_spin_unlock_irqrestore(&ctx->lock, flags);
-	}
-
+	tevt = find_sampling_counter(pmu, event, task);
 	if (!tevt) {
 		memset(&attr, 0, sizeof(attr));
 		attr.type = pmu->type;
@@ -139,9 +159,18 @@ void itrace_sampler_fini(struct perf_event *event)
 unsigned long itrace_sampler_trace(struct perf_event *event,
 				   struct perf_sample_data *data)
 {
-	struct perf_event *tevt = event->trace_event;
+	struct perf_event *tevt;
 	struct ring_buffer *rb;
 
+	if (!event->trace_event) {
+		/*
+		 * down this path, event->ctx is already locked IF it's the
+		 * same context
+		 */
+		event->trace_event = __find_sampling_counter(event->ctx, event, event->ctx->task);
+	}
+
+	tevt = event->trace_event;
 	/* Don't go further if the event is being scheduled out */
 	if (!tevt || tevt->state != PERF_EVENT_STATE_ACTIVE) {
 		data->trace.size = 0;
-- 
2.1.0.rc1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT
  2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (22 preceding siblings ...)
  2014-08-11  5:19 ` [PATCH v3 23/23] perf: itrace: Allow sampling of inherited events Alexander Shishkin
@ 2014-08-11  5:46 ` Ingo Molnar
  23 siblings, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2014-08-11  5:46 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, kan.liang


* Alexander Shishkin <alexander.shishkin@linux.intel.com> wrote:

> Hi Peter and all,
> 
> Here's a new version of PT support patchset, this time including PT and
> BTS drivers and a few more tweaks to the core. I still left out some bits
> like core dump support for now. Tooling support is not included in this
> series so that it's easier to review the kernel bits, I suppose it's best
> to send it separately, since it's quite a huge patchset of its own.

I know what this series is about, but pretty please, always include a 
complete introduction, or at least a link to a good introduction, 
which spells out the whole problem space, the proposed solution, its 
various design trade-offs (if any), links to tooling for people who 
want to have a look at both sides, list of items not yet properly 
implemented [or a clear statement that it's all ready to go in your 
view], etc., in as much detail as possible!

The reason is that many reviewers will skip early versions of patch 
series, especially if they see something trivially objectionable in 
it. Why spend effort on reviewing something that might never see the 
light of the day? But if later re-sends of the series skip essential 
information it's harder to 'jump in' and offer useful feedback...

If you worry about being overly verbose in your patch series 
announcement, in my experience as a kernel maintainer it's literally 
impossible for kernel developers to over-do the description of a new 
feature. (Steve Rostedt tries on occasion but even he never succeeded 
in the past. I claim this is due to software developer brain 
structure, but I digress.)

It's also absolutely fine to cut & paste an old 0/N description over 
and over again, and to update it only as far as the patches have 
changed.

A good rule of thumb is to have at least as many sentences in the 0/N 
description as there are patches in the series, i.e. 23 sentences in 
this instance. Preferably scaled up by complexity.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2014-08-11  5:46 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-11  5:19 [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 01/23] perf: Add data_{offset,size} to user_page Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 02/23] perf: Add AUX area to ring buffer for raw data streams Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 03/23] perf: Support high-order allocations for AUX space Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 04/23] perf: Add a capability for AUX_NO_SG pmus to do software double buffering Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 05/23] perf: Add a pmu capability for "exclusive" events Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 06/23] perf: Redirect output from inherited events to parents Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 07/23] perf: Add api for pmus to write to AUX space Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 08/23] perf: Add AUX record Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 09/23] perf: Support overwrite mode for AUX area Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 10/23] perf: Add wakeup watermark control to " Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 11/23] perf: Add itrace_config to the event attribute Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 12/23] perf: add ITRACE_START record to indicate that tracing has started Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 13/23] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 14/23] x86: perf: Intel PT and LBR/BTS are mutually exclusive Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 15/23] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 16/23] x86: perf: intel_bts: Add BTS " Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 17/23] perf: Add rb_{alloc,free}_kernel api Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 18/23] perf: Add a helper to copy AUX data in the kernel Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 19/23] perf: Add a helper for looking up pmus by type Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 20/23] perf: itrace: Infrastructure for sampling instruction flow traces Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 21/23] perf: Allocate ring buffers for inherited per-task kernel events Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 22/23] perf: itrace: Allow itrace sampling for multiple events Alexander Shishkin
2014-08-11  5:19 ` [PATCH v3 23/23] perf: itrace: Allow sampling of inherited events Alexander Shishkin
2014-08-11  5:46 ` [PATCH v3 00/23] perf: Add infrastructure and support for Intel PT Ingo Molnar
