* [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT
From: Alexander Shishkin @ 2014-08-20 12:35 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Hi Peter and all,

This patchset adds support for the Intel Processor Trace (PT)
extension [1] of the Intel architecture to the perf kernel
infrastructure. PT allows capturing information about software
execution flow.

The single most notable thing is that while PT outputs trace data in a
compressed binary format, it will still generate hundreds of megabytes
of trace data per second per core. Decoding this binary stream takes
2-3 orders of magnitude more cpu time than it takes to generate it.
These considerations make it impossible to carry out decoding in
kernel space. Therefore, the trace data is exported to userspace as a
zero-copy mapping that userspace can collect and store for later
decoding. To address this, this patchset extends the perf ring buffer
with an "AUX space", which is allocated for hardware blocks such as PT
to export their trace data with minimal overhead. This space can be
configured via the buffer's user page and mmapped from the same file
descriptor at a given offset. Data can then be collected from it by
reading the aux_head (write) pointer from the user page and updating
the aux_tail (read) pointer, similarly to data_{head,tail} of the
traditional perf buffer. An API is provided between the perf core and
pmu drivers that wish to make use of this AUX space to export their
data.
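
For illustration, a minimal sketch of the consumer side (assuming the
AUX area has already been mmapped; "up" points at the user page, "aux"
at the AUX mapping, and collect() is a placeholder for whatever stores
the trace):

	__u64 head, tail;

	tail = up->aux_tail;
	head = *(volatile __u64 *)&up->aux_head;
	rmb();			/* same ordering rules as data_{head,tail} */

	/* wraparound handling omitted for brevity */
	collect(aux + (tail % aux_size), head - tail);

	mb();			/* finish reading before freeing the space */
	up->aux_tail = head;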

For tracing blocks that don't support hardware scatter-gather tables,
we provide high-order physically contiguous allocations, which
minimize the overhead needed for software double buffering and reduce
PMI pressure.

This way we get a normal perf data stream that provides the sideband
information required to decode the trace, such as MMAP and COMM
records, plus the actual trace in its own logical space.

If the trace buffer is mapped writable, the driver will stop tracing
when it fills up (aux_head approaches aux_tail), until the data is
read out, the aux_tail pointer is moved forward and an ioctl() is
issued to re-enable tracing. If the trace buffer is mapped read only,
tracing will continue, overwriting older data, so that the buffer
always contains the most recent data. Tracing can be stopped with an
ioctl() and restarted once the data is collected.
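
A sketch of that flow, assuming the standard perf ioctls apply to
these events:

	/* writable mapping: tracing stops when the buffer fills up */
	/* ... read data, move up->aux_tail forward ... */
	ioctl(perf_fd, PERF_EVENT_IOC_ENABLE, 0);	/* resume tracing */

	/* read-only mapping: snapshot the most recent data */
	ioctl(perf_fd, PERF_EVENT_IOC_DISABLE, 0);
	/* ... copy out the data ... */
	ioctl(perf_fd, PERF_EVENT_IOC_ENABLE, 0);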

Another use case is annotating samples of other perf events: setting
PERF_SAMPLE_AUX requests attr.aux_sample_size bytes of trace to be
included in each event's sample.
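
For example (a sketch; the size here is arbitrary):

	attr.sample_type |= PERF_SAMPLE_AUX;
	attr.aux_sample_size = 1024;	/* bytes of trace per sample */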

This patchset consists of the necessary changes to the perf kernel
infrastructure plus the PT and BTS pmu drivers. The tooling support is
not included in this series; it can be found in my github tree [2].

This version changes the way watermarks are handled for the AUX area
and gets rid of the notion of "itrace" both in the core and in the
perf interface (the event attribute), which makes things more logical.

[1] http://software.intel.com/en-us/intel-isa-extensions
[2] http://github.com/virtuoso/linux-perf/tree/intel_pt

Alexander Shishkin (21):
  perf: Add data_{offset,size} to user_page
  perf: Support high-order allocations for AUX space
  perf: Add a capability for AUX_NO_SG pmus to do software double
    buffering
  perf: Add a pmu capability for "exclusive" events
  perf: Redirect output from inherited events to parents
  perf: Add api for pmus to write to AUX space
  perf: Add AUX record
  perf: Support overwrite mode for AUX area
  perf: Add wakeup watermark control to AUX area
  perf: add ITRACE_START record to indicate that tracing has started
  x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection
  x86: perf: Intel PT and LBR/BTS are mutually exclusive
  x86: perf: intel_pt: Intel PT PMU driver
  x86: perf: intel_bts: Add BTS PMU driver
  perf: Add rb_{alloc,free}_kernel api
  perf: Add a helper to copy AUX data in the kernel
  perf: Add a helper for looking up pmus by type
  perf: Add infrastructure for using AUX data in perf samples
  perf: Allocate ring buffers for inherited per-task kernel events
  perf: Allow AUX sampling for multiple events
  perf: Allow sampling of inherited events

Peter Zijlstra (1):
  perf: Add AUX area to ring buffer for raw data streams

 arch/x86/include/asm/cpufeature.h          |   1 +
 arch/x86/include/uapi/asm/msr-index.h      |  18 +
 arch/x86/kernel/cpu/Makefile               |   1 +
 arch/x86/kernel/cpu/intel_pt.h             | 129 ++++
 arch/x86/kernel/cpu/perf_event.h           |  14 +
 arch/x86/kernel/cpu/perf_event_intel.c     |  14 +-
 arch/x86/kernel/cpu/perf_event_intel_bts.c | 501 +++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |  11 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   9 +-
 arch/x86/kernel/cpu/perf_event_intel_pt.c  | 973 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/scattered.c            |   1 +
 include/linux/perf_event.h                 |  56 +-
 include/uapi/linux/perf_event.h            |  69 +-
 kernel/events/core.c                       | 545 +++++++++++++++-
 kernel/events/internal.h                   |  50 ++
 kernel/events/ring_buffer.c                | 310 ++++++++-
 16 files changed, 2658 insertions(+), 44 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_pt.h
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c

-- 
2.1.0



* [PATCH v4 01/22] perf: Add data_{offset,size} to user_page
From: Alexander Shishkin @ 2014-08-20 12:35 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Currently, the actual perf ring buffer starts one page into the mmap
area, following the user page, and userspace relies on this
convention. This patch adds data_{offset,size} fields to the user_page
that userspace can use instead to locate the perf data within the mmap
area. This is also helpful when mapping existing or shared buffers
whose size is not known in advance.

Right now, it is made to follow the existing convention that

	data_offset == PAGE_SIZE and
	data_offset + data_size == mmap_size.
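
For example, userspace that currently assumes

	data = base + PAGE_SIZE;

(where "base" is the start of the mmapped area) can instead use

	struct perf_event_mmap_page *up = base;
	void *data = base + up->data_offset;	/* up->data_size bytes */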

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h | 5 +++++
 kernel/events/core.c            | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9269de2548..f7d18c2cb7 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -489,9 +489,14 @@ struct perf_event_mmap_page {
 	 * In this case the kernel will not over-write unread data.
 	 *
 	 * See perf_output_put_handle() for the data ordering.
+	 *
+	 * data_{offset,size} indicate the location and size of the perf record
+	 * buffer within the mmapped area.
 	 */
 	__u64   data_head;		/* head in the data section */
 	__u64	data_tail;		/* user-space written tail */
+	__u64	data_offset;		/* where the buffer starts */
+	__u64	data_size;		/* data buffer size */
 };
 
 #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2d7363adf6..1e208bfe89 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3879,6 +3879,8 @@ static void perf_event_init_userpage(struct perf_event *event)
 	/* Allow new userspace to detect that bit 0 is deprecated */
 	userpg->cap_bit0_is_deprecated = 1;
 	userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
+	userpg->data_offset = PAGE_SIZE;
+	userpg->data_size = perf_data_size(rb);
 
 unlock:
 	rcu_read_unlock();
-- 
2.1.0



* [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams
From: Alexander Shishkin @ 2014-08-20 12:35 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Peter Zijlstra, Alexander Shishkin

From: Peter Zijlstra <peterz@infradead.org>

This patch introduces "AUX space" in the perf mmap buffer, intended for
exporting high bandwidth data streams to userspace, such as instruction
flow traces.

AUX space is a ring buffer, defined by the aux_{offset,size} fields in
the user_page structure, and read/write pointers aux_{head,tail},
which abide by the same rules as their data_* counterparts in the main
perf buffer.

In order to allocate/mmap AUX, userspace needs to set aux_offset to an
offset of at least data_offset + data_size and aux_size to the desired
buffer size. Both need to be page aligned. The same aux_offset and
aux_size should then be passed to the mmap() call and, if everything
adds up, the result is an AUX buffer.
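
A minimal sketch of this setup, where "up" points at the user page of
an already mmapped perf buffer:

	up->aux_offset = up->data_offset + up->data_size;
	up->aux_size   = aux_size;	/* page aligned */

	aux = mmap(NULL, aux_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   perf_fd, up->aux_offset);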

Pages that are mapped into this buffer are also charged against the
user's mlock rlimit, on top of the perf_event_mlock_kb allowance.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |  17 +++++
 include/uapi/linux/perf_event.h |  16 +++++
 kernel/events/core.c            | 140 +++++++++++++++++++++++++++++++++-------
 kernel/events/internal.h        |  21 ++++++
 kernel/events/ring_buffer.c     |  86 ++++++++++++++++++++++--
 5 files changed, 251 insertions(+), 29 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f0a1036b19..fd7b32876c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -263,6 +263,18 @@ struct pmu {
 	 * flush branch stack on context-switches (needed in cpu-wide mode)
 	 */
 	void (*flush_branch_stack)	(void);
+
+	/*
+	 * Set up pmu-private data structures for an AUX area
+	 */
+	void *(*setup_aux)		(int cpu, void **pages,
+					 int nr_pages, bool overwrite);
+					/* optional */
+
+	/*
+	 * Free pmu-private AUX data structures
+	 */
+	void (*free_aux)		(void *aux); /* optional */
 };
 
 /**
@@ -781,6 +793,11 @@ static inline bool has_branch_stack(struct perf_event *event)
 	return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
 }
 
+static inline bool has_aux(struct perf_event *event)
+{
+	return event->pmu->setup_aux;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index f7d18c2cb7..7e0967c0f5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -497,6 +497,22 @@ struct perf_event_mmap_page {
 	__u64	data_tail;		/* user-space written tail */
 	__u64	data_offset;		/* where the buffer starts */
 	__u64	data_size;		/* data buffer size */
+
+	/*
+	 * AUX area is defined by aux_{offset,size} fields that should be set
+	 * by the userspace, so that
+	 *
+	 *   aux_offset >= data_offset + data_size
+	 *
+	 * prior to mmap()ing it. Size of the mmap()ed area should be aux_size.
+	 *
+	 * Ring buffer pointers aux_{head,tail} have the same semantics as
+	 * data_{head,tail} and same ordering rules apply.
+	 */
+	__u64	aux_head;
+	__u64	aux_tail;
+	__u64	aux_offset;
+	__u64	aux_size;
 };
 
 #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1e208bfe89..63d98d6998 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4078,6 +4078,8 @@ static void perf_mmap_open(struct vm_area_struct *vma)
 
 	atomic_inc(&event->mmap_count);
 	atomic_inc(&event->rb->mmap_count);
+	if (vma->vm_pgoff)
+		atomic_inc(&event->rb->aux_mmap_count);
 }
 
 /*
@@ -4097,6 +4099,20 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	int mmap_locked = rb->mmap_locked;
 	unsigned long size = perf_data_size(rb);
 
+	/*
+	 * rb->aux_mmap_count will always drop before rb->mmap_count and
+	 * event->mmap_count, so it is ok to use event->mmap_mutex to
+	 * serialize with perf_mmap here.
+	 */
+	if (rb_has_aux(rb) && vma->vm_pgoff == rb->aux_pgoff &&
+	    atomic_dec_and_mutex_lock(&rb->aux_mmap_count, &event->mmap_mutex)) {
+		atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
+		vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
+
+		rb_free_aux(rb, event);
+		mutex_unlock(&event->mmap_mutex);
+	}
+
 	atomic_dec(&rb->mmap_count);
 
 	if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
@@ -4170,7 +4186,7 @@ out_put:
 
 static const struct vm_operations_struct perf_mmap_vmops = {
 	.open		= perf_mmap_open,
-	.close		= perf_mmap_close,
+	.close		= perf_mmap_close, /* non mergable */
 	.fault		= perf_mmap_fault,
 	.page_mkwrite	= perf_mmap_fault,
 };
@@ -4181,10 +4197,10 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	unsigned long user_locked, user_lock_limit;
 	struct user_struct *user = current_user();
 	unsigned long locked, lock_limit;
-	struct ring_buffer *rb;
+	struct ring_buffer *rb = NULL;
 	unsigned long vma_size;
 	unsigned long nr_pages;
-	long user_extra, extra;
+	long user_extra = 0, extra = 0;
 	int ret = 0, flags = 0;
 
 	/*
@@ -4199,7 +4215,66 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	vma_size = vma->vm_end - vma->vm_start;
-	nr_pages = (vma_size / PAGE_SIZE) - 1;
+
+	if (vma->vm_pgoff == 0) {
+		nr_pages = (vma_size / PAGE_SIZE) - 1;
+	} else {
+		/*
+		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
+		 * mapped, all subsequent mappings should have the same size
+		 * and offset. Must be above the normal perf buffer.
+		 */
+		u64 aux_offset, aux_size;
+
+		if (!event->rb)
+			return -EINVAL;
+
+		nr_pages = vma_size / PAGE_SIZE;
+
+		mutex_lock(&event->mmap_mutex);
+		ret = -EINVAL;
+
+		rb = event->rb;
+		if (!rb)
+			goto aux_unlock;
+
+		aux_offset = ACCESS_ONCE(rb->user_page->aux_offset);
+		aux_size = ACCESS_ONCE(rb->user_page->aux_size);
+
+		if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
+			goto aux_unlock;
+
+		if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
+			goto aux_unlock;
+
+		/* already mapped with a different offset */
+		if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
+			goto aux_unlock;
+
+		if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
+			goto aux_unlock;
+
+		/* already mapped with a different size */
+		if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
+			goto aux_unlock;
+
+		if (!is_power_of_2(nr_pages))
+			goto aux_unlock;
+
+		if (!atomic_inc_not_zero(&rb->mmap_count))
+			goto aux_unlock;
+
+		if (rb_has_aux(rb)) {
+			atomic_inc(&rb->aux_mmap_count);
+			ret = 0;
+			goto unlock;
+		}
+
+		atomic_set(&rb->aux_mmap_count, 1);
+		user_extra = nr_pages;
+
+		goto accounting;
+	}
 
 	/*
 	 * If we have rb pages ensure they're a power-of-two number, so we
@@ -4211,9 +4286,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma_size != PAGE_SIZE * (1 + nr_pages))
 		return -EINVAL;
 
-	if (vma->vm_pgoff != 0)
-		return -EINVAL;
-
 	WARN_ON_ONCE(event->ctx->parent_ctx);
 again:
 	mutex_lock(&event->mmap_mutex);
@@ -4237,6 +4309,8 @@ again:
 	}
 
 	user_extra = nr_pages + 1;
+
+accounting:
 	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
 
 	/*
@@ -4246,7 +4320,6 @@ again:
 
 	user_locked = atomic_long_read(&user->locked_vm) + user_extra;
 
-	extra = 0;
 	if (user_locked > user_lock_limit)
 		extra = user_locked - user_lock_limit;
 
@@ -4260,35 +4333,45 @@ again:
 		goto unlock;
 	}
 
-	WARN_ON(event->rb);
+	WARN_ON(!rb && event->rb);
 
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
 
-	rb = rb_alloc(nr_pages, 
-		event->attr.watermark ? event->attr.wakeup_watermark : 0,
-		event->cpu, flags);
-
 	if (!rb) {
-		ret = -ENOMEM;
-		goto unlock;
-	}
+		rb = rb_alloc(nr_pages,
+			      event->attr.watermark ? event->attr.wakeup_watermark : 0,
+			      event->cpu, flags);
 
-	atomic_set(&rb->mmap_count, 1);
-	rb->mmap_locked = extra;
-	rb->mmap_user = get_current_user();
+		if (!rb) {
+			ret = -ENOMEM;
+			goto unlock;
+		}
 
-	atomic_long_add(user_extra, &user->locked_vm);
-	vma->vm_mm->pinned_vm += extra;
+		atomic_set(&rb->mmap_count, 1);
+		rb->mmap_user = get_current_user();
+		rb->mmap_locked = extra;
 
-	ring_buffer_attach(event, rb);
+		ring_buffer_attach(event, rb);
 
-	perf_event_init_userpage(event);
-	perf_event_update_userpage(event);
+		perf_event_init_userpage(event);
+		perf_event_update_userpage(event);
+	} else {
+		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+		if (ret)
+			atomic_dec(&rb->mmap_count);
+		else
+			rb->aux_mmap_locked = extra;
+	}
 
 unlock:
-	if (!ret)
+	if (!ret) {
+		atomic_long_add(user_extra, &user->locked_vm);
+		vma->vm_mm->pinned_vm += extra;
+
 		atomic_inc(&event->mmap_count);
+	}
+aux_unlock:
 	mutex_unlock(&event->mmap_mutex);
 
 	/*
@@ -7135,6 +7218,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 	if (output_event->cpu == -1 && output_event->ctx != event->ctx)
 		goto out;
 
+	/*
+	 * If both events generate aux data, they must be on the same PMU
+	 */
+	if (has_aux(event) && has_aux(output_event) &&
+	    event->pmu != output_event->pmu)
+		goto out;
+
 set:
 	mutex_lock(&event->mmap_mutex);
 	/* Can't redirect output if we've got an active mmap() */
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b218782..e5374030b1 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -35,6 +35,14 @@ struct ring_buffer {
 	unsigned long			mmap_locked;
 	struct user_struct		*mmap_user;
 
+	/* AUX area */
+	unsigned long			aux_pgoff;
+	int				aux_nr_pages;
+	atomic_t			aux_mmap_count;
+	unsigned long			aux_mmap_locked;
+	void				**aux_pages;
+	void				*aux_priv;
+
 	struct perf_event_mmap_page	*user_page;
 	void				*data_pages[0];
 };
@@ -43,6 +51,14 @@ extern void rb_free(struct ring_buffer *rb);
 extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
 extern void perf_event_wakeup(struct perf_event *event);
+extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+			pgoff_t pgoff, int nr_pages, int flags);
+extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
+
+static inline bool rb_has_aux(struct ring_buffer *rb)
+{
+	return !!rb->aux_nr_pages;
+}
 
 extern void
 perf_event_header__init_id(struct perf_event_header *header,
@@ -81,6 +97,11 @@ static inline unsigned long perf_data_size(struct ring_buffer *rb)
 	return rb->nr_pages << (PAGE_SHIFT + page_order(rb));
 }
 
+static inline unsigned long perf_aux_size(struct ring_buffer *rb)
+{
+	return rb->aux_nr_pages << PAGE_SHIFT;
+}
+
 #define DEFINE_OUTPUT_COPY(func_name, memcpy_func)			\
 static inline unsigned long						\
 func_name(struct perf_output_handle *handle,				\
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 146a5792b1..00708d5916 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,14 +242,76 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	spin_lock_init(&rb->event_lock);
 }
 
+int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+		 pgoff_t pgoff, int nr_pages, int flags)
+{
+	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
+	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
+	int ret = -ENOMEM;
+
+	if (!has_aux(event))
+		return -ENOTSUPP;
+
+	rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
+	if (!rb->aux_pages)
+		return -ENOMEM;
+
+	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
+	     rb->aux_nr_pages++) {
+		struct page *page;
+
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+		if (!page)
+			goto out;
+
+		rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+	}
+
+	rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
+					     overwrite);
+	if (rb->aux_priv)
+		ret = 0;
+
+out:
+	if (!ret)
+		rb->aux_pgoff = pgoff;
+	else
+		rb_free_aux(rb, event);
+
+	return ret;
+}
+
+void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
+{
+	struct perf_event *iter;
+	int pg;
+
+	if (rb->aux_priv) {
+		/* disable all potential writers before freeing */
+		rcu_read_lock();
+		list_for_each_entry_rcu(iter, &rb->event_list, rb_entry)
+			perf_event_disable(iter);
+		rcu_read_unlock();
+
+		event->pmu->free_aux(rb->aux_priv);
+		rb->aux_priv = NULL;
+	}
+
+	for (pg = 0; pg < rb->aux_nr_pages; pg++)
+		free_page((unsigned long)rb->aux_pages[pg]);
+
+	kfree(rb->aux_pages);
+	rb->aux_nr_pages = 0;
+}
+
 #ifndef CONFIG_PERF_USE_VMALLOC
 
 /*
  * Back perf_mmap() with regular GFP_KERNEL-0 pages.
  */
 
-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
 {
 	if (pgoff > rb->nr_pages)
 		return NULL;
@@ -339,8 +401,8 @@ static int data_page_nr(struct ring_buffer *rb)
 	return rb->nr_pages << page_order(rb);
 }
 
-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
 {
 	/* The '>' counts in the user page. */
 	if (pgoff > data_page_nr(rb))
@@ -415,3 +477,19 @@ fail:
 }
 
 #endif
+
+struct page *
+perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+{
+	if (rb->aux_nr_pages) {
+		/* above AUX space */
+		if (pgoff > rb->aux_pgoff + rb->aux_nr_pages)
+			return NULL;
+
+		/* AUX space */
+		if (pgoff >= rb->aux_pgoff)
+			return virt_to_page(rb->aux_pages[pgoff - rb->aux_pgoff]);
+	}
+
+	return __perf_mmap_to_page(rb, pgoff);
+}
-- 
2.1.0



* [PATCH v4 03/22] perf: Support high-order allocations for AUX space
From: Alexander Shishkin @ 2014-08-20 12:36 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability)
don't support scatter-gather and will prefer larger contiguous areas for
their output regions.

This patch adds a new pmu capability to request higher order allocations.
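
On the driver side, the chunk size can be recovered from the first
page of each chunk, along these lines (a sketch based on how
rb_alloc_aux_page() below marks the pages):

	struct page *page = virt_to_page(pages[i]);
	int order = PagePrivate(page) ? page_private(page) : 0;

	/* this chunk spans (1 << order) physically contiguous pages */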

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  |  1 +
 kernel/events/ring_buffer.c | 51 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fd7b32876c..fe10bf6f94 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -171,6 +171,7 @@ struct perf_event;
  * pmu::capabilities flags
  */
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
+#define PERF_PMU_CAP_AUX_NO_SG			0x02
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 00708d5916..d10919ca42 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,29 +242,68 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	spin_lock_init(&rb->event_lock);
 }
 
+#define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
+
+static struct page *rb_alloc_aux_page(int node, int order)
+{
+	struct page *page;
+
+	if (order > MAX_ORDER)
+		order = MAX_ORDER;
+
+	do {
+		page = alloc_pages_node(node, PERF_AUX_GFP, order);
+	} while (!page && order--);
+
+	if (page && order) {
+		/*
+		 * Communicate the allocation size to the driver
+		 */
+		split_page(page, order);
+		SetPagePrivate(page);
+		set_page_private(page, order);
+	}
+
+	return page;
+}
+
+static void rb_free_aux_page(struct ring_buffer *rb, int idx)
+{
+	struct page *page = virt_to_page(rb->aux_pages[idx]);
+
+	ClearPagePrivate(page);
+	page->mapping = NULL;
+	__free_page(page);
+}
+
 int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 		 pgoff_t pgoff, int nr_pages, int flags)
 {
 	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
 	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
-	int ret = -ENOMEM;
+	int ret = -ENOMEM, order = 0;
 
 	if (!has_aux(event))
 		return -ENOTSUPP;
 
+	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+		order = get_order(nr_pages * PAGE_SIZE);
+
 	rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
 	if (!rb->aux_pages)
 		return -ENOMEM;
 
-	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
-	     rb->aux_nr_pages++) {
+	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;) {
 		struct page *page;
+		int last;
 
-		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+		page = rb_alloc_aux_page(node, order);
 		if (!page)
 			goto out;
 
-		rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+		for (last = rb->aux_nr_pages + (1 << page_private(page));
+		     last > rb->aux_nr_pages; rb->aux_nr_pages++)
+			rb->aux_pages[rb->aux_nr_pages] = page_address(page++);
 	}
 
 	rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
@@ -298,7 +337,7 @@ void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
 	}
 
 	for (pg = 0; pg < rb->aux_nr_pages; pg++)
-		free_page((unsigned long)rb->aux_pages[pg]);
+		rb_free_aux_page(rb, pg);
 
 	kfree(rb->aux_pages);
 	rb->aux_nr_pages = 0;
-- 
2.1.0



* [PATCH v4 04/22] perf: Add a capability for AUX_NO_SG pmus to do software double buffering
From: Alexander Shishkin @ 2014-08-20 12:36 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

For pmus that don't support scatter-gather for AUX data in hardware,
it might still make sense to implement software double buffering to
avoid losing data while the user is reading it out. For this purpose,
add a pmu capability that guarantees multiple high-order chunks for
the AUX buffer, so that the pmu driver can do switchover tricks.
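
A driver that needs physically contiguous AUX memory and wants at
least two chunks for double buffering would then advertise something
like (sketch):

	pmu->capabilities = PERF_PMU_CAP_AUX_NO_SG |
			    PERF_PMU_CAP_AUX_SW_DOUBLEBUF;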

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  |  1 +
 kernel/events/ring_buffer.c | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fe10bf6f94..1e7b659b49 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -172,6 +172,7 @@ struct perf_event;
  */
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
 #define PERF_PMU_CAP_AUX_NO_SG			0x02
+#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF		0x04
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index d10919ca42..f5ee3669f8 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -286,9 +286,22 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 	if (!has_aux(event))
 		return -ENOTSUPP;
 
-	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
 		order = get_order(nr_pages * PAGE_SIZE);
 
+		/*
+		 * PMU requests more than one contiguous chunks of memory
+		 * for SW double buffering
+		 */
+		if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) &&
+		    !overwrite) {
+			if (!order)
+				return -EINVAL;
+
+			order--;
+		}
+	}
+
 	rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
 	if (!rb->aux_pages)
 		return -ENOMEM;
-- 
2.1.0



* [PATCH v4 05/22] perf: Add a pmu capability for "exclusive" events
From: Alexander Shishkin @ 2014-08-20 12:36 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Usually, pmus that do, for example, instruction tracing, would only ever
be able to have one event per task per cpu (or per perf_context). For such
pmus it makes sense to disallow creating conflicting events early on, so
as to provide consistent behavior for the user.

This patch adds a pmu capability that indicates such constraint on event
creation.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1e7b659b49..6bd3e743b1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -173,6 +173,7 @@ struct perf_event;
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
 #define PERF_PMU_CAP_AUX_NO_SG			0x02
 #define PERF_PMU_CAP_AUX_SW_DOUBLEBUF		0x04
+#define PERF_PMU_CAP_EXCLUSIVE			0x08
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 63d98d6998..67f857ab56 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7248,6 +7248,32 @@ out:
 	return ret;
 }
 
+static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2)
+{
+	if ((e1->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) &&
+	    (e1->cpu == e2->cpu ||
+	     e1->cpu == -1 ||
+	     e2->cpu == -1))
+		return true;
+	return false;
+}
+
+static bool exclusive_event_ok(struct perf_event *event,
+			      struct perf_event_context *ctx)
+{
+	struct perf_event *iter_event;
+
+	if (!(event->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE))
+		return true;
+
+	list_for_each_entry(iter_event, &ctx->event_list, event_entry) {
+		if (exclusive_event_match(iter_event, event))
+			return false;
+	}
+
+	return true;
+}
+
 /**
  * sys_perf_event_open - open a performance event, associate it to a task/cpu
  *
@@ -7399,6 +7425,11 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_alloc;
 	}
 
+	if ((pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) && group_leader) {
+		err = -EBUSY;
+		goto err_context;
+	}
+
 	if (task) {
 		put_task_struct(task);
 		task = NULL;
@@ -7484,6 +7515,12 @@ SYSCALL_DEFINE5(perf_event_open,
 		}
 	}
 
+	if (!exclusive_event_ok(event, ctx)) {
+		mutex_unlock(&ctx->mutex);
+		fput(event_file);
+		goto err_context;
+	}
+
 	perf_install_in_context(ctx, event, event->cpu);
 	perf_unpin_context(ctx);
 	mutex_unlock(&ctx->mutex);
@@ -7570,6 +7607,14 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 
 	WARN_ON_ONCE(ctx->parent_ctx);
 	mutex_lock(&ctx->mutex);
+	if (!exclusive_event_ok(event, ctx)) {
+		mutex_unlock(&ctx->mutex);
+		perf_unpin_context(ctx);
+		put_ctx(ctx);
+		err = -EBUSY;
+		goto err_free;
+	}
+
 	perf_install_in_context(ctx, event, cpu);
 	perf_unpin_context(ctx);
 	mutex_unlock(&ctx->mutex);
-- 
2.1.0



* [PATCH v4 06/22] perf: Redirect output from inherited events to parents
From: Alexander Shishkin @ 2014-08-20 12:36 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

In order to collect AUX data from an inherited event, we can redirect
its output to the parent's ring buffer if possible (they must be cpu
affine). This patch adds set_output() to the inheritance path.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 67f857ab56..e36478564c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7962,6 +7962,12 @@ inherit_event(struct perf_event *parent_event,
 		= parent_event->overflow_handler_context;
 
 	/*
+	 * Direct child's output to parent's ring buffer (if any)
+	 */
+	if (parent_event->cpu != -1)
+		(void)perf_event_set_output(child_event, parent_event);
+
+	/*
 	 * Precalculate sample_data sizes
 	 */
 	perf_event__header_size(child_event);
-- 
2.1.0



* [PATCH v4 07/22] perf: Add api for pmus to write to AUX space
From: Alexander Shishkin @ 2014-08-20 12:36 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

For pmus that wish to write data to AUX space, provide
perf_aux_output_{begin,end}() calls to initiate/commit data writes,
similarly to perf_output_{begin,end}. These also use the same output
handle structure.

After perf_aux_output_begin() returns successfully, handle->size is
set to the maximum amount of data that can be written with respect to
the aux_tail pointer, so that no data the user hasn't seen yet will be
overwritten.

The PMU driver should pass the actual amount of data written as a
parameter to perf_aux_output_end().

Nested writers are forbidden and guards are in place to catch such
attempts.
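
A sketch of the intended usage in a pmu driver's start/interrupt path,
where hw_read_trace() is a placeholder for the hardware-specific part:

	struct perf_output_handle handle;
	unsigned long size;
	void *buf;

	buf = perf_aux_output_begin(&handle, event);
	if (!buf)
		return;	/* no space left: the event has been disabled */

	size = hw_read_trace(buf, handle.size);	/* placeholder */
	perf_aux_output_end(&handle, size, false);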

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  | 23 ++++++++++++-
 kernel/events/core.c        |  5 ++-
 kernel/events/internal.h    |  4 +++
 kernel/events/ring_buffer.c | 84 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 112 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6bd3e743b1..63016a0e32 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -554,12 +554,22 @@ struct perf_output_handle {
 	struct ring_buffer		*rb;
 	unsigned long			wakeup;
 	unsigned long			size;
-	void				*addr;
+	union {
+		void			*addr;
+		unsigned long		head;
+	};
 	int				page;
 };
 
 #ifdef CONFIG_PERF_EVENTS
 
+extern void *perf_aux_output_begin(struct perf_output_handle *handle,
+				   struct perf_event *event);
+extern void perf_aux_output_end(struct perf_output_handle *handle,
+				unsigned long size, bool truncated);
+extern int perf_aux_output_skip(struct perf_output_handle *handle,
+				unsigned long size);
+extern void *perf_get_aux(struct perf_output_handle *handle);
 extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
 extern void perf_pmu_unregister(struct pmu *pmu);
 
@@ -816,6 +826,17 @@ extern void perf_event_disable(struct perf_event *event);
 extern int __perf_event_disable(void *info);
 extern void perf_event_task_tick(void);
 #else /* !CONFIG_PERF_EVENTS: */
+static inline void *
+perf_aux_output_begin(struct perf_output_handle *handle,
+		      struct perf_event *event)				{ return NULL; }
+static inline void
+perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+		    bool truncated)					{ }
+static inline int
+perf_aux_output_skip(struct perf_output_handle *handle,
+		     unsigned long size)				{ return -EINVAL; }
+static inline void *
+perf_get_aux(struct perf_output_handle *handle)				{ return NULL; }
 static inline void
 perf_event_task_sched_in(struct task_struct *prev,
 			 struct task_struct *task)			{ }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e36478564c..9fc9a7583b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3267,7 +3267,6 @@ static void free_event_rcu(struct rcu_head *head)
 	kfree(event);
 }
 
-static void ring_buffer_put(struct ring_buffer *rb);
 static void ring_buffer_attach(struct perf_event *event,
 			       struct ring_buffer *rb);
 
@@ -4047,7 +4046,7 @@ static void rb_free_rcu(struct rcu_head *rcu_head)
 	rb_free(rb);
 }
 
-static struct ring_buffer *ring_buffer_get(struct perf_event *event)
+struct ring_buffer *ring_buffer_get(struct perf_event *event)
 {
 	struct ring_buffer *rb;
 
@@ -4062,7 +4061,7 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
 	return rb;
 }
 
-static void ring_buffer_put(struct ring_buffer *rb)
+void ring_buffer_put(struct ring_buffer *rb)
 {
 	if (!atomic_dec_and_test(&rb->refcount))
 		return;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index e5374030b1..b8f6c193ea 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -36,6 +36,8 @@ struct ring_buffer {
 	struct user_struct		*mmap_user;
 
 	/* AUX area */
+	local_t				aux_head;
+	local_t				aux_nest;
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
 	atomic_t			aux_mmap_count;
@@ -54,6 +56,8 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, int flags);
 extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
+extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
+extern void ring_buffer_put(struct ring_buffer *rb);
 
 static inline bool rb_has_aux(struct ring_buffer *rb)
 {
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index f5ee3669f8..3b3a915767 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,6 +242,90 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	spin_lock_init(&rb->event_lock);
 }
 
+void *perf_aux_output_begin(struct perf_output_handle *handle,
+			    struct perf_event *event)
+{
+	unsigned long aux_head, aux_tail;
+	struct ring_buffer *rb;
+
+	rb = ring_buffer_get(event);
+	if (!rb)
+		return NULL;
+
+	if (!rb_has_aux(rb))
+		goto err;
+
+	/*
+	 * Nesting is not supported for AUX area, make sure nested
+	 * writers are caught early
+	 */
+	if (WARN_ON_ONCE(local_xchg(&rb->aux_nest, 1)))
+		goto err;
+
+	aux_head = local_read(&rb->aux_head);
+	aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+
+	handle->rb = rb;
+	handle->event = event;
+	handle->head = aux_head;
+	if (aux_head - aux_tail < perf_aux_size(rb))
+		handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
+	else
+		handle->size = 0;
+
+	if (!handle->size) {
+		event->pending_disable = 1;
+		event->hw.state = PERF_HES_STOPPED;
+		perf_output_wakeup(handle);
+		local_set(&rb->aux_nest, 0);
+		goto err;
+	}
+
+	return handle->rb->aux_priv;
+
+err:
+	ring_buffer_put(rb);
+	handle->event = NULL;
+
+	return NULL;
+}
+
+void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+			 bool truncated)
+{
+	struct ring_buffer *rb = handle->rb;
+
+	local_add(size, &rb->aux_head);
+
+	smp_wmb();
+	rb->user_page->aux_head = local_read(&rb->aux_head);
+
+	perf_output_wakeup(handle);
+	handle->event = NULL;
+
+	local_set(&rb->aux_nest, 0);
+	ring_buffer_put(rb);
+}
+
+int perf_aux_output_skip(struct perf_output_handle *handle, unsigned long size)
+{
+	struct ring_buffer *rb = handle->rb;
+
+	if (size > handle->size)
+		return -ENOSPC;
+
+	local_add(size, &rb->aux_head);
+	handle->head = local_read(&rb->aux_head);
+	handle->size -= size;
+
+	return 0;
+}
+
+void *perf_get_aux(struct perf_output_handle *handle)
+{
+	return handle->rb->aux_priv;
+}
+
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
 static struct page *rb_alloc_aux_page(int node, int order)
-- 
2.1.0



* [PATCH v4 08/22] perf: Add AUX record
From: Alexander Shishkin @ 2014-08-20 12:36 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

When there's new data in the AUX space, output a record indicating its
offset and size and whether it was truncated to fit in the ring
buffer.
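
Userspace can then use this record to locate the new data in the AUX
mapping, roughly like so (a sketch, using the record layout below):

	if (hdr->type == PERF_RECORD_AUX) {
		void *trace = aux + (rec->aux_offset % aux_size);
		/* rec->aux_size bytes of new trace; rec->truncated
		 * signals that some data didn't fit */
	}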

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h | 16 ++++++++++++++++
 kernel/events/core.c            | 39 +++++++++++++++++++++++++++++++++++++++
 kernel/events/internal.h        |  3 +++
 kernel/events/ring_buffer.c     |  1 +
 4 files changed, 59 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7e0967c0f5..c022c3d756 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -733,6 +733,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * Records that new data landed in the AUX buffer part.
+	 *
+	 * struct {
+	 * 	struct perf_event_header	header;
+	 *
+	 * 	u64				aux_offset;
+	 * 	u64				aux_size;
+	 *	u8				truncated;
+	 *	u8				reserved[7];
+	 *	u64				id;
+	 *	u64				stream_id;
+	 * };
+	 */
+	PERF_RECORD_AUX				= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9fc9a7583b..0251983018 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5542,6 +5542,45 @@ void perf_event_mmap(struct vm_area_struct *vma)
 	perf_event_mmap_event(&mmap_event);
 }
 
+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+			  unsigned long size, bool truncated)
+{
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	struct perf_aux_event {
+		struct perf_event_header	header;
+		u64				offset;
+		u64				size;
+		u8				truncated;
+		u8				reserved[7];
+		u64				id;
+		u64				stream_id;
+	} rec = {
+		.header = {
+			.type = PERF_RECORD_AUX,
+			.misc = 0,
+			.size = sizeof(rec),
+		},
+		.offset		= head,
+		.size		= size,
+		.truncated	= truncated,
+		.id		= primary_event_id(event),
+		.stream_id	= event->id,
+	};
+	int ret;
+
+	perf_event_header__init_id(&rec.header, &sample, event);
+	ret = perf_output_begin(&handle, event, rec.header.size);
+
+	if (ret)
+		return;
+
+	perf_output_put(&handle, rec);
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+}
+
 /*
  * IRQ throttle logging
  */
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index b8f6c193ea..c6b2987afe 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -64,6 +64,9 @@ static inline bool rb_has_aux(struct ring_buffer *rb)
 	return !!rb->aux_nr_pages;
 }
 
+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+			  unsigned long size, bool truncated);
+
 extern void
 perf_event_header__init_id(struct perf_event_header *header,
 			   struct perf_sample_data *data,
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 3b3a915767..925f369947 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -296,6 +296,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
 	struct ring_buffer *rb = handle->rb;
 
 	local_add(size, &rb->aux_head);
+	perf_event_aux_event(handle->event, aux_head, size, truncated);
 
 	smp_wmb();
 	rb->user_page->aux_head = local_read(&rb->aux_head);
-- 
2.1.0



* [PATCH v4 09/22] perf: Support overwrite mode for AUX area
From: Alexander Shishkin @ 2014-08-20 12:36 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

This adds support for overwrite mode in the AUX area, which means
"keep collecting data till you're stopped". It does not depend on the
data buffer's overwrite mode, so that no sideband data, which is
instrumental for processing AUX data, is lost.
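
Overwrite mode is selected by mapping the AUX area read only, which
clears RING_BUFFER_WRITABLE for it; a sketch:

	aux = mmap(NULL, aux_size, PROT_READ, MAP_SHARED,
		   perf_fd, up->aux_offset);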

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/internal.h    |  1 +
 kernel/events/ring_buffer.c | 43 +++++++++++++++++++++++++++++--------------
 2 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index c6b2987afe..4607742be8 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -40,6 +40,7 @@ struct ring_buffer {
 	local_t				aux_nest;
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
+	int				aux_overwrite;
 	atomic_t			aux_mmap_count;
 	unsigned long			aux_mmap_locked;
 	void				**aux_pages;
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 925f369947..5006caba63 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -263,22 +263,23 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 		goto err;
 
 	aux_head = local_read(&rb->aux_head);
-	aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
 
 	handle->rb = rb;
 	handle->event = event;
 	handle->head = aux_head;
-	if (aux_head - aux_tail < perf_aux_size(rb))
-		handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
-	else
-		handle->size = 0;
-
-	if (!handle->size) {
-		event->pending_disable = 1;
-		event->hw.state = PERF_HES_STOPPED;
-		perf_output_wakeup(handle);
-		local_set(&rb->aux_nest, 0);
-		goto err;
+	handle->size = 0;
+	if (!rb->aux_overwrite) {
+		aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+		if (aux_head - aux_tail < perf_aux_size(rb))
+			handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
+
+		if (!handle->size) {
+			event->pending_disable = 1;
+			event->hw.state = PERF_HES_STOPPED;
+			perf_output_wakeup(handle);
+			local_set(&rb->aux_nest, 0);
+			goto err;
+		}
 	}
 
 	return handle->rb->aux_priv;
@@ -294,9 +295,22 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
 			 bool truncated)
 {
 	struct ring_buffer *rb = handle->rb;
+	unsigned long aux_head;
 
-	local_add(size, &rb->aux_head);
-	perf_event_aux_event(handle->event, aux_head, size, truncated);
+	aux_head = local_read(&rb->aux_head);
+
+	if (rb->aux_overwrite) {
+		local_set(&rb->aux_head, size);
+
+		/*
+		 * Send a RECORD_AUX with size==0 to communicate aux_head
+		 * of this snapshot to userspace
+		 */
+		perf_event_aux_event(handle->event, size, 0, truncated);
+	} else {
+		local_add(size, &rb->aux_head);
+		perf_event_aux_event(handle->event, aux_head, size, truncated);
+	}
 
 	smp_wmb();
 	rb->user_page->aux_head = local_read(&rb->aux_head);
@@ -408,6 +422,7 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 					     overwrite);
 	if (rb->aux_priv)
 		ret = 0;
+	rb->aux_overwrite = overwrite;
 
 out:
 	if (!ret)
-- 
2.1.0



* [PATCH v4 10/22] perf: Add wakeup watermark control to AUX area
From: Alexander Shishkin @ 2014-08-20 12:36 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

When the AUX area gets a certain amount of new data, we want to wake
up userspace to collect it. This adds a new control to specify how
much data will cause a wakeup. It is then passed down to pmu drivers
via the output handle's "wakeup" field, so that the driver can find
the nearest point where it can generate an interrupt.

We repurpose __reserved_2 in the event attribute for this. Even though
it was never checked to be zero before, aux_watermark will only matter
for new AUX-aware code, so the old code should still be fine.
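
For example (a sketch):

	attr.aux_watermark = 512 * 1024;	/* wake up every 512k of trace */

If aux_watermark is left zero and the AUX area is not in overwrite
mode, it defaults to half the buffer, as set up in rb_alloc_aux().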

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h |  7 +++++--
 kernel/events/core.c            |  3 ++-
 kernel/events/internal.h        |  4 +++-
 kernel/events/ring_buffer.c     | 14 +++++++++++---
 4 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c022c3d756..507b5e1f5b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -238,6 +238,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER2	80	/* add: branch_sample_type */
 #define PERF_ATTR_SIZE_VER3	96	/* add: sample_regs_user */
 					/* add: sample_stack_user */
+					/* add: aux_watermark */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -332,8 +333,10 @@ struct perf_event_attr {
 	 */
 	__u32	sample_stack_user;
 
-	/* Align to u64. */
-	__u32	__reserved_2;
+	/*
+	 * Wakeup watermark for AUX area
+	 */
+	__u32	aux_watermark;
 };
 
 #define perf_flags(attr)	(*(&(attr)->read_format + 1))
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0251983018..c4551ac324 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4356,7 +4356,8 @@ accounting:
 		perf_event_init_userpage(event);
 		perf_event_update_userpage(event);
 	} else {
-		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
+				   event->attr.aux_watermark, flags);
 		if (ret)
 			atomic_dec(&rb->mmap_count);
 		else
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 4607742be8..4f99987bc3 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -27,6 +27,7 @@ struct ring_buffer {
 	local_t				lost;		/* nr records lost   */
 
 	long				watermark;	/* wakeup watermark  */
+	long				aux_watermark;
 	/* poll crap */
 	spinlock_t			event_lock;
 	struct list_head		event_list;
@@ -38,6 +39,7 @@ struct ring_buffer {
 	/* AUX area */
 	local_t				aux_head;
 	local_t				aux_nest;
+	local_t				aux_wakeup;
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
 	int				aux_overwrite;
@@ -55,7 +57,7 @@ extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
 extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
-			pgoff_t pgoff, int nr_pages, int flags);
+			pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
 extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
 extern void ring_buffer_put(struct ring_buffer *rb);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5006caba63..27d1dd7cfa 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -270,6 +270,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 	handle->size = 0;
 	if (!rb->aux_overwrite) {
 		aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+		handle->wakeup = local_read(&rb->aux_wakeup) + rb->aux_watermark;
 		if (aux_head - aux_tail < perf_aux_size(rb))
 			handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
 
@@ -313,9 +314,12 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
 	}
 
 	smp_wmb();
-	rb->user_page->aux_head = local_read(&rb->aux_head);
+	aux_head = rb->user_page->aux_head = local_read(&rb->aux_head);
 
-	perf_output_wakeup(handle);
+	if (aux_head - local_read(&rb->aux_wakeup) >= rb->aux_watermark) {
+		perf_output_wakeup(handle);
+		local_add(rb->aux_watermark, &rb->aux_wakeup);
+	}
 	handle->event = NULL;
 
 	local_set(&rb->aux_nest, 0);
@@ -376,7 +380,7 @@ static void rb_free_aux_page(struct ring_buffer *rb, int idx)
 }
 
 int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
-		 pgoff_t pgoff, int nr_pages, int flags)
+		 pgoff_t pgoff, int nr_pages, long watermark, int flags)
 {
 	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
 	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
@@ -423,6 +427,10 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 	if (rb->aux_priv)
 		ret = 0;
 	rb->aux_overwrite = overwrite;
+	rb->aux_watermark = watermark;
+
+	if (!rb->aux_watermark && !rb->aux_overwrite)
+		rb->aux_watermark = nr_pages << (PAGE_SHIFT - 1);
 
 out:
 	if (!ret)
-- 
2.1.0
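
To make the new semantics concrete: perf_aux_output_end() now wakes the
consumer only once aux_head has advanced at least aux_watermark bytes past
the previous wakeup point, and a zero watermark defaults to half the AUX
buffer (the nr_pages << (PAGE_SHIFT - 1) above). A minimal consumer-side
sketch, assuming a hypothetical pt_pmu_type value read from sysfs:

	struct perf_event_attr attr = {
		.size		= sizeof(attr),	/* PERF_ATTR_SIZE_VER3; aux_watermark replaces __reserved_2 */
		.type		= pt_pmu_type,	/* hypothetical: dynamic PMU type from sysfs */
		.aux_watermark	= 64 * 1024,	/* one wakeup per 64KiB of new AUX data */
	};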


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 11/22] perf: add ITRACE_START record to indicate that tracing has started
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (9 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 10/22] perf: Add wakeup watermark control to " Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-09-09  9:08   ` Peter Zijlstra
  2014-08-20 12:36 ` [PATCH v4 12/22] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

For events such as instruction tracing, it is useful for the decoder
to know which tasks are running when the event is first scheduled in,
before the first sched_switch.

To single out such instruction tracing PMUs, this patch also introduces
an ITRACE PMU capability.
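
For illustration, a decoder walking the perf data stream could match the
new record roughly as in the sketch below; the layout mirrors the uapi
documentation added by this patch, with any requested sample_id fields
following tid:

	struct itrace_start_event {
		struct perf_event_header	header;	/* type == PERF_RECORD_ITRACE_START */
		__u32				pid;
		__u32				tid;
	};

	if (hdr->type == PERF_RECORD_ITRACE_START) {
		struct itrace_start_event *rec = (void *)hdr;
		/* decoding for (rec->pid, rec->tid) can begin here,
		 * before the first context switch is seen */
	}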

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |  4 ++++
 include/uapi/linux/perf_event.h | 11 +++++++++++
 kernel/events/core.c            | 41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 56 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 63016a0e32..bcfd7a9d84 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -127,6 +127,9 @@ struct hw_perf_event {
 			/* for tp_event->class */
 			struct list_head	tp_list;
 		};
+		struct { /* itrace */
+			int			itrace_started;
+		};
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 		struct { /* breakpoint */
 			/*
@@ -174,6 +177,7 @@ struct perf_event;
 #define PERF_PMU_CAP_AUX_NO_SG			0x02
 #define PERF_PMU_CAP_AUX_SW_DOUBLEBUF		0x04
 #define PERF_PMU_CAP_EXCLUSIVE			0x08
+#define PERF_PMU_CAP_ITRACE			0x10
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 507b5e1f5b..349c261f93 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -752,6 +752,17 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX				= 11,
 
+	/*
+	 * Indicates that instruction trace has started
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u32				pid;
+	 *	u32				tid;
+	 * };
+	 */
+	PERF_RECORD_ITRACE_START		= 12,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c4551ac324..b82392911a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1722,6 +1722,7 @@ static void perf_set_shadow_time(struct perf_event *event,
 #define MAX_INTERRUPTS (~0ULL)
 
 static void perf_log_throttle(struct perf_event *event, int enable);
+static void perf_log_itrace_start(struct perf_event *event);
 
 static int
 event_sched_in(struct perf_event *event,
@@ -1756,6 +1757,8 @@ event_sched_in(struct perf_event *event,
 
 	perf_pmu_disable(event->pmu);
 
+	perf_log_itrace_start(event);
+
 	if (event->pmu->add(event, PERF_EF_START)) {
 		event->state = PERF_EVENT_STATE_INACTIVE;
 		event->oncpu = -1;
@@ -5623,6 +5626,44 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 	perf_output_end(&handle);
 }
 
+static void perf_log_itrace_start(struct perf_event *event)
+{
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	struct perf_aux_event {
+		struct perf_event_header        header;
+		u32				pid;
+		u32				tid;
+	} rec;
+	int ret;
+
+	if (event->parent)
+		event = event->parent;
+
+	if (!(event->pmu->capabilities & PERF_PMU_CAP_ITRACE) ||
+	    event->hw.itrace_started)
+		return;
+
+	event->hw.itrace_started = 1;
+
+	rec.header.type	= PERF_RECORD_ITRACE_START;
+	rec.header.misc	= 0;
+	rec.header.size	= sizeof(rec);
+	rec.pid	= perf_event_pid(event, current);
+	rec.tid	= perf_event_tid(event, current);
+
+	perf_event_header__init_id(&rec.header, &sample, event);
+	ret = perf_output_begin(&handle, event, rec.header.size);
+
+	if (ret)
+		return;
+
+	perf_output_put(&handle, rec);
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+}
+
 /*
  * Generic event overflow handling, sampling.
  */
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 12/22] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (10 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 11/22] perf: add ITRACE_START record to indicate that tracing has started Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-08-20 12:36 ` [PATCH v4 13/22] x86: perf: Intel PT and LBR/BTS are mutually exclusive Alexander Shishkin
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Intel Processor Trace is an architecture extension that allows for program
flow tracing.
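
The feature bit lives in CPUID leaf 7 (subleaf 0), EBX bit 25, which is
what the scattered.c entry below encodes; the equivalent raw check, as a
sketch:

	unsigned int eax, ebx, ecx, edx;

	cpuid_count(0x00000007, 0, &eax, &ebx, &ecx, &edx);
	if (ebx & BIT(25))
		pr_info("Intel PT supported\n");	/* X86_FEATURE_INTEL_PT will be set */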

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/include/asm/cpufeature.h | 1 +
 arch/x86/kernel/cpu/scattered.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index bb9b258d60..db2debe5bb 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -185,6 +185,7 @@
 #define X86_FEATURE_DTHERM	( 7*32+ 7) /* Digital Thermal Sensor */
 #define X86_FEATURE_HW_PSTATE	( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
+#define X86_FEATURE_INTEL_PT	( 7*32+10) /* Intel Processor Trace */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW  ( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 4a8013d559..42f5fa953f 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -36,6 +36,7 @@ void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
 		{ X86_FEATURE_ARAT,		CR_EAX, 2, 0x00000006, 0 },
 		{ X86_FEATURE_PLN,		CR_EAX, 4, 0x00000006, 0 },
 		{ X86_FEATURE_PTS,		CR_EAX, 6, 0x00000006, 0 },
+		{ X86_FEATURE_INTEL_PT,		CR_EBX, 25, 0x00000007, 0 },
 		{ X86_FEATURE_APERFMPERF,	CR_ECX, 0, 0x00000006, 0 },
 		{ X86_FEATURE_EPB,		CR_ECX, 3, 0x00000006, 0 },
 		{ X86_FEATURE_HW_PSTATE,	CR_EDX, 7, 0x80000007, 0 },
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 13/22] x86: perf: Intel PT and LBR/BTS are mutually exclusive
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (11 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 12/22] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-08-20 12:36 ` [PATCH v4 14/22] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Intel PT cannot be used at the same time as LBR or BTS and will cause a
general protection fault if they are used together. To avoid having to
fix up #GPs in the fast path, we instead use flags to indicate that one
of these is in use, so that the other avoids MSR access altogether.
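
The resulting pattern is symmetric, as a sketch (the BTS/LBR half is in
this patch; the complementary PT-side check lives in the PT PMU driver
added later in this series):

	/* BTS/LBR enable paths back off while PT owns the hardware */
	if (cpuc->pt_enabled)
		return;

	/* and the PT driver's add path backs off while BTS/LBR are in use */
	if (cpuc->lbr_users || cpuc->bts_enabled)
		return -EBUSY;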

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           | 6 ++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  | 8 +++++++-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 9 +++++----
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index fc5eb390b3..a542fa8a1d 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -148,6 +148,7 @@ struct cpu_hw_events {
 	 * Intel DebugStore bits
 	 */
 	struct debug_store	*ds;
+	unsigned int		bts_enabled;
 	u64			pebs_enabled;
 
 	/*
@@ -161,6 +162,11 @@ struct cpu_hw_events {
 	u64				br_sel;
 
 	/*
+	 * Intel Processor Trace
+	 */
+	unsigned int			pt_enabled;
+
+	/*
 	 * Intel host/guest exclude bits
 	 */
 	u64				intel_ctrl_guest_mask;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 9dc4199917..5ae212af23 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -455,8 +455,13 @@ struct event_constraint bts_constraint =
 
 void intel_pmu_enable_bts(u64 config)
 {
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	unsigned long debugctlmsr;
 
+	if (cpuc->pt_enabled)
+		return;
+
+	cpuc->bts_enabled = 1;
 	debugctlmsr = get_debugctlmsr();
 
 	debugctlmsr |= DEBUGCTLMSR_TR;
@@ -477,9 +482,10 @@ void intel_pmu_disable_bts(void)
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	unsigned long debugctlmsr;
 
-	if (!cpuc->ds)
+	if (!cpuc->ds || cpuc->pt_enabled)
 		return;
 
+	cpuc->bts_enabled = 0;
 	debugctlmsr = get_debugctlmsr();
 
 	debugctlmsr &=
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 9dd2459a4c..516e52d0ac 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -172,7 +172,9 @@ static void intel_pmu_lbr_reset_64(void)
 
 void intel_pmu_lbr_reset(void)
 {
-	if (!x86_pmu.lbr_nr)
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (!x86_pmu.lbr_nr || cpuc->pt_enabled)
 		return;
 
 	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
@@ -185,7 +187,7 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 
-	if (!x86_pmu.lbr_nr)
+	if (!x86_pmu.lbr_nr || cpuc->pt_enabled)
 		return;
 
 	/*
@@ -205,11 +207,10 @@ void intel_pmu_lbr_disable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 
-	if (!x86_pmu.lbr_nr)
+	if (!x86_pmu.lbr_nr || !cpuc->lbr_users || cpuc->pt_enabled)
 		return;
 
 	cpuc->lbr_users--;
-	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
 	if (cpuc->enabled && !cpuc->lbr_users) {
 		__intel_pmu_lbr_disable();
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 14/22] x86: perf: intel_pt: Intel PT PMU driver
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (12 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 13/22] x86: perf: Intel PT and LBR/BTS are mutually exclusive Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-08-20 12:36 ` [PATCH v4 15/22] x86: perf: intel_bts: Add BTS " Alexander Shishkin
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Add support for Intel Processor Trace (PT) to the kernel's perf events.
PT is an extension of Intel Architecture that collects information about
software execution, such as control flow, execution modes and timings,
and formats it into highly compressed binary packets. Even compressed,
these packets are generated at hundreds of megabytes per second per core,
which makes it impractical to decode them on the fly in the kernel.

This driver exports trace data through the AUX space in the perf ring
buffer, which is zero-copy mapped into userspace for faster data retrieval.
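
As a usage sketch: the PMU type is assigned dynamically at registration,
and the config bit layout follows the format attributes exported by the
driver (tsc at config bit 10, noretcomp at bit 11); read_sysfs_int() is a
hypothetical helper:

	int type = read_sysfs_int("/sys/bus/event_source/devices/intel_pt/type");

	struct perf_event_attr attr = {
		.size		= sizeof(attr),
		.type		= type,
		.config		= 1ULL << 10,	/* tsc: request timing packets */
		.exclude_kernel	= 1,		/* unprivileged: trace userspace only */
	};
	/* perf_event_open(&attr, ...), mmap the data area, then mmap the
	 * AUX area from the same fd at the offset given in the user page */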

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/include/uapi/asm/msr-index.h     |  18 +
 arch/x86/kernel/cpu/Makefile              |   1 +
 arch/x86/kernel/cpu/intel_pt.h            | 129 ++++
 arch/x86/kernel/cpu/perf_event.h          |   2 +
 arch/x86/kernel/cpu/perf_event_intel.c    |   8 +
 arch/x86/kernel/cpu/perf_event_intel_pt.c | 973 ++++++++++++++++++++++++++++++
 6 files changed, 1131 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/intel_pt.h
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index eac9e92fe1..206a4c487f 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -74,6 +74,24 @@
 #define MSR_IA32_PERF_CAPABILITIES	0x00000345
 #define MSR_PEBS_LD_LAT_THRESHOLD	0x000003f6
 
+#define MSR_IA32_RTIT_CTL		0x00000570
+#define RTIT_CTL_TRACEEN		BIT(0)
+#define RTIT_CTL_OS			BIT(2)
+#define RTIT_CTL_USR			BIT(3)
+#define RTIT_CTL_CR3EN			BIT(7)
+#define RTIT_CTL_TOPA			BIT(8)
+#define RTIT_CTL_TSC_EN			BIT(10)
+#define RTIT_CTL_DISRETC		BIT(11)
+#define RTIT_CTL_BRANCH_EN		BIT(13)
+#define MSR_IA32_RTIT_STATUS		0x00000571
+#define RTIT_STATUS_CONTEXTEN		BIT(1)
+#define RTIT_STATUS_TRIGGEREN		BIT(2)
+#define RTIT_STATUS_ERROR		BIT(4)
+#define RTIT_STATUS_STOPPED		BIT(5)
+#define MSR_IA32_RTIT_CR3_MATCH		0x00000572
+#define MSR_IA32_RTIT_OUTPUT_BASE	0x00000560
+#define MSR_IA32_RTIT_OUTPUT_MASK	0x00000561
+
 #define MSR_MTRRfix64K_00000		0x00000250
 #define MSR_MTRRfix16K_80000		0x00000258
 #define MSR_MTRRfix16K_A0000		0x00000259
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 7e1fd4e085..00d40f889d 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -39,6 +39,7 @@ obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o per
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_uncore_snb.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore_snbep.o perf_event_intel_uncore_nhmex.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
new file mode 100644
index 0000000000..58af62daf7
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -0,0 +1,129 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#ifndef __INTEL_PT_H__
+#define __INTEL_PT_H__
+
+/*
+ * Single-entry ToPA: when this close to region boundary, switch
+ * buffers to avoid losing data.
+ */
+#define TOPA_PMI_MARGIN 512
+
+/*
+ * Table of Physical Addresses bits
+ */
+enum topa_sz {
+	TOPA_4K	= 0,
+	TOPA_8K,
+	TOPA_16K,
+	TOPA_32K,
+	TOPA_64K,
+	TOPA_128K,
+	TOPA_256K,
+	TOPA_512K,
+	TOPA_1MB,
+	TOPA_2MB,
+	TOPA_4MB,
+	TOPA_8MB,
+	TOPA_16MB,
+	TOPA_32MB,
+	TOPA_64MB,
+	TOPA_128MB,
+	TOPA_SZ_END,
+};
+
+static inline unsigned int sizes(enum topa_sz tsz)
+{
+	return 1 << (tsz + 12);
+}
+
+struct topa_entry {
+	u64	end	: 1;
+	u64	rsvd0	: 1;
+	u64	intr	: 1;
+	u64	rsvd1	: 1;
+	u64	stop	: 1;
+	u64	rsvd2	: 1;
+	u64	size	: 4;
+	u64	rsvd3	: 2;
+	u64	base	: 36;
+	u64	rsvd4	: 16;
+};
+
+#define TOPA_SHIFT 12
+#define PT_CPUID_LEAVES 2
+
+enum pt_capabilities {
+	PT_CAP_max_subleaf = 0,
+	PT_CAP_cr3_filtering,
+	PT_CAP_topa_output,
+	PT_CAP_topa_multiple_entries,
+	PT_CAP_payloads_lip,
+};
+
+struct pt_pmu {
+	struct pmu		pmu;
+	u32			caps[4 * PT_CPUID_LEAVES];
+};
+
+/**
+ * struct pt_buffer - buffer configuration; one buffer per task_struct or
+ * cpu, depending on perf event configuration
+ * @tables: list of ToPA tables in this buffer
+ * @first, @last: shorthands for first and last topa tables
+ * @cur: current topa table
+ * @nr_pages: buffer size in pages
+ * @cur_idx: current output region's index within @cur table
+ * @output_off: offset within the current output region
+ * @data_size: running total of the amount of data in this buffer
+ * @lost: if data was lost/truncated
+ * @head: logical write offset inside the buffer
+ * @snapshot: if this is for a snapshot/overwrite counter
+ * @stop_pos, @intr_pos: STOP and INT topa entries in the buffer
+ * @data_pages: array of pages from perf
+ * @topa_index: table of topa entries indexed by page offset
+ */
+struct pt_buffer {
+	/* hint for allocation */
+	int			cpu;
+	/* list of ToPA tables */
+	struct list_head	tables;
+	/* top-level table */
+	struct topa		*first, *last, *cur;
+	unsigned int		cur_idx;
+	size_t			output_off;
+	unsigned long		nr_pages;
+	local_t			data_size;
+	local_t			lost;
+	local64_t		head;
+	bool			snapshot;
+	unsigned long		stop_pos, intr_pos;
+	void			**data_pages;
+	struct topa_entry	*topa_index[0];
+};
+
+/**
+ * struct pt - per-cpu pt
+ */
+struct pt {
+	raw_spinlock_t		lock;
+	struct perf_output_handle handle;
+};
+
+#endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index a542fa8a1d..2f41d0db42 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -749,6 +749,8 @@ void intel_pmu_lbr_init_snb(void);
 
 int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
+void intel_pt_interrupt(void);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 89bc750efc..6f8025053d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1399,6 +1399,14 @@ again:
 	}
 
 	/*
+	 * Intel PT
+	 */
+	if (__test_and_clear_bit(55, (unsigned long *)&status)) {
+		handled++;
+		intel_pt_interrupt();
+	}
+
+	/*
 	 * Checkpointed counters can lead to 'spurious' PMIs because the
 	 * rollback caused by the PMI will have cleared the overflow status
 	 * bit. Therefore always force probe these counters.
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
new file mode 100644
index 0000000000..eb0dca59ad
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -0,0 +1,973 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#undef DEBUG
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/device.h>
+
+#include <asm/perf_event.h>
+#include <asm/insn.h>
+
+#include "perf_event.h"
+#include "intel_pt.h"
+
+static DEFINE_PER_CPU(struct pt, pt_ctx) = {
+	.lock	= __RAW_SPIN_LOCK_UNLOCKED(pt_ctx.lock),
+};
+
+static struct pt_pmu pt_pmu;
+
+enum cpuid_regs {
+	CR_EAX = 0,
+	CR_ECX,
+	CR_EDX,
+	CR_EBX
+};
+
+/*
+ * Capabilities of Intel PT hardware, such as number of address bits or
+ * supported output schemes, are cached and exported to userspace as "caps"
+ * attribute group of pt pmu device
+ * (/sys/bus/event_source/devices/intel_pt/caps/) so that userspace can store
+ * relevant bits together with intel_pt traces.
+ */
+#define PT_CAP(_n, _l, _r, _m)						\
+	[PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l,	\
+			    .reg = _r, .mask = _m }
+
+static struct pt_cap_desc {
+	const char	*name;
+	u32		leaf;
+	u8		reg;
+	u32		mask;
+} pt_caps[] = {
+	PT_CAP(max_subleaf,		0, CR_EAX, 0xffffffff),
+	PT_CAP(cr3_filtering,		0, CR_EBX, BIT(0)),
+	PT_CAP(topa_output,		0, CR_ECX, BIT(0)),
+	PT_CAP(topa_multiple_entries,	0, CR_ECX, BIT(1)),
+	PT_CAP(payloads_lip,		0, CR_ECX, BIT(31)),
+};
+
+static u32 pt_cap_get(enum pt_capabilities cap)
+{
+	struct pt_cap_desc *cd = &pt_caps[cap];
+	u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
+	unsigned int shift = __ffs(cd->mask);
+
+	return (c & cd->mask) >> shift;
+}
+
+static ssize_t pt_cap_show(struct device *cdev,
+			   struct device_attribute *attr,
+			   char *buf)
+{
+	struct dev_ext_attribute *ea =
+		container_of(attr, struct dev_ext_attribute, attr);
+	enum pt_capabilities cap = (long)ea->var;
+
+	return snprintf(buf, PAGE_SIZE, "%x\n", pt_cap_get(cap));
+}
+
+static struct attribute_group pt_cap_group = {
+	.name	= "caps",
+};
+
+PMU_FORMAT_ATTR(tsc,		"config:10"	);
+PMU_FORMAT_ATTR(noretcomp,	"config:11"	);
+
+static struct attribute *pt_formats_attr[] = {
+	&format_attr_tsc.attr,
+	&format_attr_noretcomp.attr,
+	NULL,
+};
+
+static struct attribute_group pt_format_group = {
+	.name	= "format",
+	.attrs	= pt_formats_attr,
+};
+
+static const struct attribute_group *pt_attr_groups[] = {
+	&pt_cap_group,
+	&pt_format_group,
+	NULL,
+};
+
+static int __init pt_pmu_hw_init(void)
+{
+	struct dev_ext_attribute *de_attrs;
+	struct attribute **attrs;
+	size_t size;
+	long i;
+
+	if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
+		for (i = 0; i < PT_CPUID_LEAVES; i++)
+			cpuid_count(20, i,
+				    &pt_pmu.caps[CR_EAX + i * 4],
+				    &pt_pmu.caps[CR_EBX + i * 4],
+				    &pt_pmu.caps[CR_ECX + i * 4],
+				    &pt_pmu.caps[CR_EDX + i * 4]);
+	} else
+		return -ENODEV;
+
+	size = sizeof(struct attribute *) * (ARRAY_SIZE(pt_caps) + 1);
+	attrs = kzalloc(size, GFP_KERNEL);
+	if (!attrs)
+		goto err_attrs;
+
+	size = sizeof(struct dev_ext_attribute) * (ARRAY_SIZE(pt_caps) + 1);
+	de_attrs = kzalloc(size, GFP_KERNEL);
+	if (!de_attrs)
+		goto err_de_attrs;
+
+	for (i = 0; i < ARRAY_SIZE(pt_caps); i++) {
+		de_attrs[i].attr.attr.name = pt_caps[i].name;
+
+		sysfs_attr_init(&de_attrs[i].attr.attr);
+		de_attrs[i].attr.attr.mode = S_IRUGO;
+		de_attrs[i].attr.show = pt_cap_show;
+		de_attrs[i].var = (void *)i;
+		attrs[i] = &de_attrs[i].attr.attr;
+	}
+
+	pt_cap_group.attrs = attrs;
+	return 0;
+
+err_de_attrs:
+	kfree(de_attrs);
+err_attrs:
+	kfree(attrs);
+
+	return -ENOMEM;
+}
+
+#define PT_CONFIG_MASK (RTIT_CTL_TSC_EN | RTIT_CTL_DISRETC)
+/* bits 57:14 are reserved for packet enables */
+#define PT_BYPASS_MASK 0x03ffffffffffc000ull
+
+static bool pt_event_valid(struct perf_event *event)
+{
+	u64 config = event->attr.config;
+
+	/* admin can set any packet generation parameters */
+	if (capable(CAP_SYS_ADMIN) && (config & PT_BYPASS_MASK) == config)
+		return true;
+
+	if ((config & PT_CONFIG_MASK) != config)
+		return false;
+
+	return true;
+}
+
+/*
+ * PT configuration helpers
+ * These all are cpu affine and operate on a local PT
+ */
+
+static bool pt_is_running(void)
+{
+	u64 ctl;
+
+	rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+	return !!(ctl & RTIT_CTL_TRACEEN);
+}
+
+static int pt_config(struct perf_event *event)
+{
+	u64 reg;
+
+	reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN;
+
+	if (!event->attr.exclude_kernel)
+		reg |= RTIT_CTL_OS;
+	if (!event->attr.exclude_user)
+		reg |= RTIT_CTL_USR;
+
+	reg |= (event->attr.config & PT_CONFIG_MASK);
+
+	wrmsrl(MSR_IA32_RTIT_CTL, reg);
+
+	return 0;
+}
+
+static void pt_config_start(bool start)
+{
+	u64 ctl;
+
+	rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+	if (start)
+		ctl |= RTIT_CTL_TRACEEN;
+	else
+		ctl &= ~RTIT_CTL_TRACEEN;
+	wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+}
+
+static void pt_config_buffer(void *buf, unsigned int topa_idx,
+			     unsigned int output_off)
+{
+	u64 reg;
+
+	wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, virt_to_phys(buf));
+
+	reg = 0x7f | ((u64)topa_idx << 7) | ((u64)output_off << 32);
+
+	wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, reg);
+}
+
+/*
+ * Keep ToPA table-related metadata on the same page as the actual table,
+ * taking up a few words from the top
+ */
+
+#define TENTS_PER_PAGE (((PAGE_SIZE - 40) / sizeof(struct topa_entry)) - 1)
+
+struct topa {
+	struct topa_entry	table[TENTS_PER_PAGE];
+	struct list_head	list;
+	u64			phys;
+	u64			offset;
+	size_t			size;
+	int			last;
+};
+
+/* make negative table index stand for the last table entry */
+#define TOPA_ENTRY(t, i) ((i) == -1 ? &(t)->table[(t)->last] : &(t)->table[(i)])
+
+/*
+ * allocate page-sized ToPA table
+ */
+static struct topa *topa_alloc(int cpu, gfp_t gfp)
+{
+	int node = cpu_to_node(cpu);
+	struct topa *topa;
+	struct page *p;
+
+	p = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
+	if (!p)
+		return NULL;
+
+	topa = page_address(p);
+	topa->last = 0;
+	topa->phys = page_to_phys(p);
+
+	/*
+	 * In case of single-entry ToPA, always put the self-referencing END
+	 * link as the 2nd entry in the table
+	 */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(topa, 1)->base = topa->phys >> TOPA_SHIFT;
+		TOPA_ENTRY(topa, 1)->end = 1;
+	}
+
+	return topa;
+}
+
+static void topa_free(struct topa *topa)
+{
+	free_page((unsigned long)topa);
+}
+
+/**
+ * topa_insert_table() - insert a ToPA table into a buffer
+ * @buf - pt buffer that's being extended
+ * @topa - new topa table to be inserted
+ *
+ * If it's the first table in this buffer, set up buffer's pointers
+ * accordingly; otherwise, add an END=1 link entry pointing to @topa in the current
+ * "last" table and adjust the last table pointer to @topa.
+ */
+static void topa_insert_table(struct pt_buffer *buf, struct topa *topa)
+{
+	struct topa *last = buf->last;
+
+	list_add_tail(&topa->list, &buf->tables);
+
+	if (!buf->first) {
+		buf->first = buf->last = buf->cur = topa;
+		return;
+	}
+
+	topa->offset = last->offset + last->size;
+	buf->last = topa;
+
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		return;
+
+	BUG_ON(last->last != TENTS_PER_PAGE - 1);
+
+	TOPA_ENTRY(last, -1)->base = topa->phys >> TOPA_SHIFT;
+	TOPA_ENTRY(last, -1)->end = 1;
+}
+
+static bool topa_table_full(struct topa *topa)
+{
+	/* single-entry ToPA is a special case */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		return !!topa->last;
+
+	return topa->last == TENTS_PER_PAGE - 1;
+}
+
+static int topa_insert_pages(struct pt_buffer *buf, gfp_t gfp)
+{
+	struct topa *topa = buf->last;
+	int order = 0;
+	struct page *p;
+
+	p = virt_to_page(buf->data_pages[buf->nr_pages]);
+	if (PagePrivate(p))
+		order = page_private(p);
+
+	if (topa_table_full(topa)) {
+		topa = topa_alloc(buf->cpu, gfp);
+		if (!topa)
+			return -ENOMEM;
+
+		topa_insert_table(buf, topa);
+	}
+
+	TOPA_ENTRY(topa, -1)->base = page_to_phys(p) >> TOPA_SHIFT;
+	TOPA_ENTRY(topa, -1)->size = order;
+	if (!buf->snapshot && !pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(topa, -1)->intr = 1;
+		TOPA_ENTRY(topa, -1)->stop = 1;
+	}
+
+	topa->last++;
+	topa->size += sizes(order);
+
+	buf->nr_pages += 1ul << order;
+
+	return 0;
+}
+
+static void pt_topa_dump(struct pt_buffer *buf)
+{
+	struct topa *topa;
+
+	list_for_each_entry(topa, &buf->tables, list) {
+		int i;
+
+		pr_debug("# table @%p (%p), off %llx size %zx\n", topa->table,
+			 (void *)topa->phys, topa->offset, topa->size);
+		for (i = 0; i < TENTS_PER_PAGE; i++) {
+			pr_debug("# entry @%p (%lx sz %u %c%c%c) raw=%16llx\n",
+				 &topa->table[i],
+				 (unsigned long)topa->table[i].base << TOPA_SHIFT,
+				 sizes(topa->table[i].size),
+				 topa->table[i].end ?  'E' : ' ',
+				 topa->table[i].intr ? 'I' : ' ',
+				 topa->table[i].stop ? 'S' : ' ',
+				 *(u64 *)&topa->table[i]);
+			if ((pt_cap_get(PT_CAP_topa_multiple_entries)
+			     && topa->table[i].stop)
+			    || topa->table[i].end)
+				break;
+		}
+	}
+}
+
+/* advance to the next output region */
+static void pt_buffer_advance(struct pt_buffer *buf)
+{
+	buf->output_off = 0;
+	buf->cur_idx++;
+
+	if (buf->cur_idx == buf->cur->last) {
+		if (buf->cur == buf->last)
+			buf->cur = buf->first;
+		else
+			buf->cur = list_entry(buf->cur->list.next, struct topa,
+					      list);
+		buf->cur_idx = 0;
+	}
+}
+
+static void pt_update_head(struct pt *pt)
+{
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	u64 topa_idx, base, old;
+
+	/* offset of the first region in this table from the beginning of buf */
+	base = buf->cur->offset + buf->output_off;
+
+	/* offset of the current output region within this table */
+	for (topa_idx = 0; topa_idx < buf->cur_idx; topa_idx++)
+		base += sizes(buf->cur->table[topa_idx].size);
+
+	if (buf->snapshot) {
+		local_set(&buf->data_size, base);
+	} else {
+		old = (local64_xchg(&buf->head, base) &
+		       ((buf->nr_pages << PAGE_SHIFT) - 1));
+		if (base < old)
+			base += buf->nr_pages << PAGE_SHIFT;
+
+		local_add(base - old, &buf->data_size);
+	}
+}
+
+static void *pt_buffer_region(struct pt_buffer *buf)
+{
+	return phys_to_virt(buf->cur->table[buf->cur_idx].base << TOPA_SHIFT);
+}
+
+static size_t pt_buffer_region_size(struct pt_buffer *buf)
+{
+	return sizes(buf->cur->table[buf->cur_idx].size);
+}
+
+/**
+ * pt_handle_status() - take care of possible status conditions
+ * @pt: per-cpu pt handle
+ */
+static void pt_handle_status(struct pt *pt)
+{
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	int advance = 0;
+	u64 status;
+
+	rdmsrl(MSR_IA32_RTIT_STATUS, status);
+
+	if (status & RTIT_STATUS_ERROR) {
+		pr_err_ratelimited("ToPA ERROR encountered, trying to recover\n");
+		pt_topa_dump(buf);
+		status &= ~RTIT_STATUS_ERROR;
+		wrmsrl(MSR_IA32_RTIT_STATUS, status);
+	}
+
+	if (status & RTIT_STATUS_STOPPED) {
+		status &= ~RTIT_STATUS_STOPPED;
+		wrmsrl(MSR_IA32_RTIT_STATUS, status);
+
+		/*
+		 * On systems that only do single-entry ToPA, hitting STOP
+		 * means we are already losing data; need to let the decoder
+		 * know.
+		 */
+		if (!pt_cap_get(PT_CAP_topa_multiple_entries) ||
+		    buf->output_off == sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size)) {
+			local_inc(&buf->lost);
+			advance++;
+		}
+	}
+
+	/*
+	 * Also, on single-entry ToPA implementations, the interrupt will come
+	 * before the output reaches its output region's boundary.
+	 */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries) && !buf->snapshot &&
+	    pt_buffer_region_size(buf) - buf->output_off <= TOPA_PMI_MARGIN) {
+		void *head = pt_buffer_region(buf);
+
+		/* everything within this margin needs to be zeroed out */
+		memset(head + buf->output_off, 0,
+		       pt_buffer_region_size(buf) -
+		       buf->output_off);
+		advance++;
+	}
+
+	if (advance)
+		pt_buffer_advance(buf);
+}
+
+static void pt_read_offset(struct pt_buffer *buf)
+{
+	u64 offset, base_topa;
+
+	rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, base_topa);
+	buf->cur = phys_to_virt(base_topa);
+
+	rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, offset);
+	/* offset within current output region */
+	buf->output_off = offset >> 32;
+	/* index of current output region within this table */
+	buf->cur_idx = (offset & 0xffffff80) >> 7;
+}
+
+/**
+ * pt_buffer_fini_topa() - deallocate ToPA structure of a buffer
+ * @buf: pt buffer
+ */
+static void pt_buffer_fini_topa(struct pt_buffer *buf)
+{
+	struct topa *topa, *iter;
+
+	list_for_each_entry_safe(topa, iter, &buf->tables, list) {
+		list_del(&topa->list);
+		topa_free(topa);
+	}
+}
+
+static unsigned int pt_topa_next_entry(struct pt_buffer *buf, unsigned int pg)
+{
+	struct topa_entry *te = buf->topa_index[pg];
+
+	if (buf->first == buf->last && buf->first->last == 1)
+		return pg;
+
+	do {
+		pg++;
+		pg &= buf->nr_pages - 1;
+	} while (buf->topa_index[pg] == te);
+
+	return pg;
+}
+
+static int pt_buffer_reset_markers(struct pt_buffer *buf,
+				   struct perf_output_handle *handle)
+
+{
+	unsigned long idx, npages, end;
+
+	if (buf->snapshot)
+		return 0;
+
+	/* can't stop in the middle of an output region */
+	if (buf->output_off + handle->size + 1 <
+	    sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size))
+		return -EINVAL;
+
+
+	/* single entry ToPA is handled by marking all regions STOP=1 INT=1 */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		return 0;
+
+	/* clear STOP and INT from current entry */
+	buf->topa_index[buf->stop_pos]->stop = 0;
+	buf->topa_index[buf->intr_pos]->intr = 0;
+
+	if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		npages = (handle->size + 1) >> PAGE_SHIFT;
+		end = (local64_read(&buf->head) >> PAGE_SHIFT) + npages;
+		if (end > handle->wakeup >> PAGE_SHIFT)
+			end = handle->wakeup >> PAGE_SHIFT;
+		idx = end & (buf->nr_pages - 1);
+		buf->stop_pos = idx;
+		idx = (local64_read(&buf->head) >> PAGE_SHIFT) + npages / 2;
+		idx &= buf->nr_pages - 1;
+		buf->intr_pos = idx;
+	}
+
+	buf->topa_index[buf->stop_pos]->stop = 1;
+	buf->topa_index[buf->intr_pos]->intr = 1;
+
+	return 0;
+}
+
+static void pt_buffer_setup_topa_index(struct pt_buffer *buf)
+{
+	struct topa *cur = buf->first, *prev = buf->last;
+	struct topa_entry *te_cur = TOPA_ENTRY(cur, 0),
+		*te_prev = TOPA_ENTRY(prev, prev->last - 1);
+	int pg = 0, idx = 0, ntopa = 0;
+
+	while (pg < buf->nr_pages) {
+		int tidx;
+
+		/* pages within one topa entry */
+		for (tidx = 0; tidx < 1 << te_cur->size; tidx++, pg++)
+			buf->topa_index[pg] = te_prev;
+
+		te_prev = te_cur;
+
+		if (idx == cur->last - 1) {
+			/* advance to next topa table */
+			idx = 0;
+			cur = list_entry(cur->list.next, struct topa, list);
+			ntopa++;
+		} else
+			idx++;
+		te_cur = TOPA_ENTRY(cur, idx);
+	}
+
+}
+
+static void pt_buffer_reset_offsets(struct pt_buffer *buf, unsigned long head)
+{
+	int pg;
+
+	if (buf->snapshot)
+		head &= (buf->nr_pages << PAGE_SHIFT) - 1;
+
+	pg = (head >> PAGE_SHIFT) & (buf->nr_pages - 1);
+	pg = pt_topa_next_entry(buf, pg);
+
+	buf->cur = (struct topa *)((unsigned long)buf->topa_index[pg] & PAGE_MASK);
+	buf->cur_idx = ((unsigned long)buf->topa_index[pg] -
+			(unsigned long)buf->cur) / sizeof(struct topa_entry);
+	buf->output_off = head & (sizes(buf->cur->table[buf->cur_idx].size) - 1);
+
+	local64_set(&buf->head, head);
+	local_set(&buf->data_size, 0);
+}
+
+/**
+ * pt_buffer_init_topa() - initialize ToPA table for pt buffer
+ * @buf: pt buffer
+ * @size: total size of all regions within this ToPA
+ * @gfp: allocation flags
+ */
+static int pt_buffer_init_topa(struct pt_buffer *buf, unsigned long nr_pages,
+			       gfp_t gfp)
+{
+	struct topa *topa;
+	int err;
+
+	topa = topa_alloc(buf->cpu, gfp);
+	if (!topa)
+		return -ENOMEM;
+
+	topa_insert_table(buf, topa);
+
+	while (buf->nr_pages < nr_pages) {
+		err = topa_insert_pages(buf, gfp);
+		if (err) {
+			pt_buffer_fini_topa(buf);
+			return -ENOMEM;
+		}
+	}
+
+	pt_buffer_setup_topa_index(buf);
+
+	/* link last table to the first one, unless we're double buffering */
+	if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(buf->last, -1)->base = buf->first->phys >> TOPA_SHIFT;
+		TOPA_ENTRY(buf->last, -1)->end = 1;
+	}
+
+	pt_topa_dump(buf);
+	return 0;
+}
+
+/**
+ * pt_buffer_setup_aux() - set up topa tables for a PT buffer
+ * @cpu: cpu on which to allocate, -1 means current
+ * @pages: array of pointers to buffer pages passed from perf core
+ * @nr_pages: number of pages in the buffer
+ * @snapshot: if this is a snapshot/overwrite counter
+ */
+static void *
+pt_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool snapshot)
+{
+	struct pt_buffer *buf;
+	int node, ret;
+
+	if (!nr_pages)
+		return NULL;
+
+	if (cpu == -1)
+		cpu = raw_smp_processor_id();
+	node = cpu_to_node(cpu);
+
+	buf = kzalloc_node(offsetof(struct pt_buffer, topa_index[nr_pages]),
+			   GFP_KERNEL, node);
+	if (!buf)
+		return NULL;
+
+	buf->cpu = cpu;
+	buf->snapshot = snapshot;
+	buf->data_pages = pages;
+
+	INIT_LIST_HEAD(&buf->tables);
+
+	ret = pt_buffer_init_topa(buf, nr_pages, GFP_KERNEL);
+	if (ret) {
+		kfree(buf);
+		return NULL;
+	}
+
+	return buf;
+}
+
+/**
+ * pt_buffer_free() - dispose of pt buffer
+ * @data: pt buffer
+ */
+static void pt_buffer_free_aux(void *data)
+{
+	struct pt_buffer *buf = data;
+
+	pt_buffer_fini_topa(buf);
+
+	kfree(buf);
+}
+
+/**
+ * pt_buffer_is_full() - check if the buffer is full
+ * @buf: pt buffer
+ * @pt: per-cpu pt handle
+ * If the user hasn't read data from the output region that aux_head
+ * points to, the buffer is considered full: the user needs to read at
+ * least this region and update aux_tail to point past it.
+ */
+static bool pt_buffer_is_full(struct pt_buffer *buf, struct pt *pt)
+{
+	if (buf->snapshot)
+		return false;
+
+	if (local_read(&buf->data_size) >= pt->handle.size)
+		return true;
+
+	return false;
+}
+
+void intel_pt_interrupt(void)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf;
+	struct perf_event *event = pt->handle.event;
+
+	pt_config_start(false);
+
+	if (!event)
+		return;
+
+	buf = perf_get_aux(&pt->handle);
+	if (!buf)
+		return;
+
+	pt_read_offset(buf);
+
+	pt_handle_status(pt);
+
+	pt_update_head(pt);
+
+	perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
+			    local_xchg(&buf->lost, 0));
+
+	if (!event->hw.state) {
+		int ret = pt_config(event);
+
+		if (ret) {
+			event->hw.state = PERF_HES_STOPPED;
+			return;
+		}
+
+		buf = perf_aux_output_begin(&pt->handle, event);
+		if (!buf) {
+			event->hw.state = PERF_HES_STOPPED;
+			return;
+		}
+
+		pt_buffer_reset_offsets(buf, pt->handle.head);
+		ret = pt_buffer_reset_markers(buf, &pt->handle);
+		if (ret) {
+			perf_aux_output_end(&pt->handle, 0, true);
+			return;
+		}
+
+		pt_config_buffer(buf->cur->table, buf->cur_idx,
+				 buf->output_off);
+		wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+		pt_config_start(true);
+	}
+}
+
+static void pt_event_start(struct perf_event *event, int mode)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+
+	if (pt_is_running() || !buf || pt_buffer_is_full(buf, pt) ||
+	    pt_config(event)) {
+		event->hw.state = PERF_HES_STOPPED;
+		return;
+	}
+
+	event->hw.state = 0;
+
+	pt_config_buffer(buf->cur->table, buf->cur_idx,
+			 buf->output_off);
+	wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+	pt_config_start(true);
+}
+
+static void pt_event_stop(struct perf_event *event, int mode)
+{
+	if (event->hw.state == PERF_HES_STOPPED)
+		return;
+
+	event->hw.state = PERF_HES_STOPPED;
+
+	pt_config_start(false);
+
+	if (mode & PERF_EF_UPDATE) {
+		struct pt *pt = this_cpu_ptr(&pt_ctx);
+		struct pt_buffer *buf = perf_get_aux(&pt->handle);
+
+		if (!buf || !pt->handle.event)
+			return;
+
+		if (WARN_ON_ONCE(pt->handle.event != event))
+			return;
+		pt_read_offset(buf);
+
+		pt_handle_status(pt);
+
+		pt_update_head(pt);
+	}
+}
+
+static void pt_event_del(struct perf_event *event, int mode)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	unsigned long flags;
+
+	pt_event_stop(event, PERF_EF_UPDATE);
+
+	cpuc->pt_enabled = 0;
+
+	raw_spin_lock_irqsave(&pt->lock, flags);
+	if (pt->handle.event)
+		perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
+				    local_xchg(&buf->lost, 0));
+	raw_spin_unlock_irqrestore(&pt->lock, flags);
+}
+
+static int pt_event_add(struct perf_event *event, int mode)
+{
+	struct pt_buffer *buf;
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct hw_perf_event *hwc = &event->hw;
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	unsigned long flags;
+	int ret = -EBUSY;
+
+	if (cpuc->lbr_users || cpuc->bts_enabled)
+		goto out;
+
+	ret = pt_config(event);
+	if (ret)
+		goto out;
+
+	raw_spin_lock_irqsave(&pt->lock, flags);
+	if (pt->handle.event) {
+		raw_spin_unlock_irqrestore(&pt->lock, flags);
+		ret = -EBUSY;
+		goto out;
+	}
+
+	buf = perf_aux_output_begin(&pt->handle, event);
+	if (!buf) {
+		raw_spin_unlock_irqrestore(&pt->lock, flags);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	pt_buffer_reset_offsets(buf, pt->handle.head);
+	if (!buf->snapshot) {
+		ret = pt_buffer_reset_markers(buf, &pt->handle);
+		if (ret) {
+			perf_aux_output_end(&pt->handle, 0, true);
+			raw_spin_unlock_irqrestore(&pt->lock, flags);
+			goto out;
+		}
+	}
+
+	raw_spin_unlock_irqrestore(&pt->lock, flags);
+
+	if (mode & PERF_EF_START) {
+		pt_event_start(event, 0);
+		if (hwc->state == PERF_HES_STOPPED) {
+			pt_event_del(event, 0);
+			ret = -EBUSY;
+		}
+	} else {
+		hwc->state = PERF_HES_STOPPED;
+	}
+
+out:
+
+	if (ret)
+		hwc->state = PERF_HES_STOPPED;
+	else
+		cpuc->pt_enabled = 1;
+
+	return ret;
+}
+
+static void pt_event_read(struct perf_event *event)
+{
+}
+
+static int pt_event_init(struct perf_event *event)
+{
+	if (event->attr.type != pt_pmu.pmu.type)
+		return -ENOENT;
+
+	if (!pt_event_valid(event))
+		return -EINVAL;
+
+	return 0;
+}
+
+static __init int pt_init(void)
+{
+	int ret, cpu, prior_warn = 0;
+
+	BUILD_BUG_ON(sizeof(struct topa) > PAGE_SIZE);
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		u64 ctl;
+
+		ret = rdmsrl_safe_on_cpu(cpu, MSR_IA32_RTIT_CTL, &ctl);
+		if (!ret && (ctl & RTIT_CTL_TRACEEN))
+			prior_warn++;
+	}
+	put_online_cpus();
+
+	ret = pt_pmu_hw_init();
+	if (ret)
+		return ret;
+
+	if (!pt_cap_get(PT_CAP_topa_output)) {
+		pr_warn("ToPA output is not supported on this CPU\n");
+		return -ENODEV;
+	}
+
+	if (prior_warn)
+		pr_warn("PT is enabled at boot time, traces may be empty\n");
+
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		pt_pmu.pmu.capabilities =
+			PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF;
+
+	pt_pmu.pmu.capabilities	|= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE;
+	pt_pmu.pmu.attr_groups	= pt_attr_groups;
+	pt_pmu.pmu.task_ctx_nr	= perf_hw_context;
+	pt_pmu.pmu.event_init	= pt_event_init;
+	pt_pmu.pmu.add		= pt_event_add;
+	pt_pmu.pmu.del		= pt_event_del;
+	pt_pmu.pmu.start	= pt_event_start;
+	pt_pmu.pmu.stop		= pt_event_stop;
+	pt_pmu.pmu.read		= pt_event_read;
+	pt_pmu.pmu.setup_aux	= pt_buffer_setup_aux;
+	pt_pmu.pmu.free_aux	= pt_buffer_free_aux;
+	ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
+
+	return ret;
+}
+
+module_init(pt_init);
-- 
2.1.0
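
A note on the ToPA geometry above: each struct topa_entry covers a
power-of-two region of 2^(size + 12) bytes, from 4KiB (TOPA_4K) up to
128MiB (TOPA_128MB), which is what the driver's sizes() helper computes;
equivalently, as a sketch:

	/* bytes covered by one ToPA entry; equivalent to sizes() above */
	static inline unsigned long topa_entry_bytes(unsigned int size_field)
	{
		return 4096UL << size_field;
	}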


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 15/22] x86: perf: intel_bts: Add BTS PMU driver
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (13 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 14/22] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-08-20 12:36 ` [PATCH v4 16/22] perf: Add rb_{alloc,free}_kernel api Alexander Shishkin
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Add support for Branch Trace Store (BTS) via the kernel perf event
infrastructure. The difference from the existing implementation of BTS
support is that this one is a separate PMU that exports events' trace
buffers to userspace by means of the AUX area of the perf buffer, which
is zero-copy mapped into userspace.

The immediate benefits are that the buffer can be much bigger, resulting
in fewer interrupts; that no kernel-side copying is involved; and that
there is little to no trace data loss. Also, kernel code can be traced
with this driver.

The old way of collecting BTS traces still works.
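
The AUX data itself is the raw debug store format: a stream of 24-byte
records (BTS_RECORD_SIZE in the driver below), each holding branch source,
branch target and flags. A decoder-side sketch, assuming that layout:

	struct bts_record {
		__u64 from;	/* branch source linear address */
		__u64 to;	/* branch target linear address */
		__u64 flags;	/* implementation-specific flags */
	};
	/* a consumer walks records from aux_tail up to aux_head,
	 * BTS_RECORD_SIZE bytes at a time */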

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/Makefile               |   2 +-
 arch/x86/kernel/cpu/perf_event.h           |   6 +
 arch/x86/kernel/cpu/perf_event_intel.c     |   6 +-
 arch/x86/kernel/cpu/perf_event_intel_bts.c | 501 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   3 +-
 5 files changed, 515 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 00d40f889d..387223c716 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -39,7 +39,7 @@ obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o per
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_uncore_snb.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore_snbep.o perf_event_intel_uncore_nhmex.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o
-obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o perf_event_intel_bts.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 2f41d0db42..73424d3139 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -751,6 +751,12 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
 void intel_pt_interrupt(void);
 
+int intel_bts_interrupt(void);
+
+void intel_bts_enable_local(void);
+
+void intel_bts_disable_local(void);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 6f8025053d..2f54bcfd6f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1051,6 +1051,8 @@ static void intel_pmu_disable_all(void)
 
 	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
 		intel_pmu_disable_bts();
+	else
+		intel_bts_disable_local();
 
 	intel_pmu_pebs_disable_all();
 	intel_pmu_lbr_disable_all();
@@ -1073,7 +1075,8 @@ static void intel_pmu_enable_all(int added)
 			return;
 
 		intel_pmu_enable_bts(event->hw.config);
-	}
+	} else
+		intel_bts_enable_local();
 }
 
 /*
@@ -1359,6 +1362,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 		apic_write(APIC_LVTPC, APIC_DM_NMI);
 	intel_pmu_disable_all();
 	handled = intel_pmu_drain_bts_buffer();
+	handled += intel_bts_interrupt();
 	status = intel_pmu_get_status();
 	if (!status)
 		goto done;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_bts.c b/arch/x86/kernel/cpu/perf_event_intel_bts.c
new file mode 100644
index 0000000000..d0b749b719
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_bts.c
@@ -0,0 +1,501 @@
+/*
+ * BTS PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#undef DEBUG
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/debugfs.h>
+#include <linux/device.h>
+#include <linux/coredump.h>
+
+#include <asm-generic/sizes.h>
+#include <asm/perf_event.h>
+
+#include "perf_event.h"
+
+struct bts_ctx {
+	raw_spinlock_t			lock;
+	struct perf_output_handle	handle;
+	struct debug_store		ds_back;
+};
+
+static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
+
+#define BTS_RECORD_SIZE		24
+#define BTS_SAFETY_MARGIN	4080
+
+struct bts_phys {
+	struct page	*page;
+	unsigned long	size;
+	unsigned long	offset;
+	unsigned long	displacement;
+};
+
+struct bts_buffer {
+	size_t		real_size;	/* multiple of BTS_RECORD_SIZE */
+	unsigned int	nr_pages;
+	unsigned int	nr_bufs;
+	unsigned int	cur_buf;
+	unsigned long	index;
+	bool		snapshot;
+	local_t		data_size;
+	local_t		lost;
+	local_t		head;
+	unsigned long	end;
+	void		**data_pages;
+	struct bts_phys	buf[0];
+};
+
+struct pmu bts_pmu;
+
+void intel_pmu_enable_bts(u64 config);
+void intel_pmu_disable_bts(void);
+
+static size_t buf_size(struct page *page)
+{
+	return 1 << (PAGE_SHIFT + page_private(page));
+}
+
+static void *
+bts_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool overwrite)
+{
+	struct bts_buffer *buf;
+	struct page *page;
+	int node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+	unsigned long offset;
+	size_t size = nr_pages << PAGE_SHIFT;
+	int pg, nbuf, pad;
+
+	/* count all the high order buffers */
+	for (pg = 0, nbuf = 0; pg < nr_pages;) {
+		page = virt_to_page(pages[pg]);
+		if (WARN_ON_ONCE(!PagePrivate(page) && nr_pages > 1))
+			return NULL;
+		pg += 1 << page_private(page);
+		nbuf++;
+	}
+
+	/*
+	 * to avoid interrupts in overwrite mode, only allow one physical buffer
+	 */
+	if (overwrite && nbuf > 1)
+		return NULL;
+
+	buf = kzalloc_node(offsetof(struct bts_buffer, buf[nbuf]), GFP_KERNEL, node);
+	if (!buf)
+		return NULL;
+
+	buf->nr_pages = nr_pages;
+	buf->nr_bufs = nbuf;
+	buf->snapshot = overwrite;
+	buf->data_pages = pages;
+	buf->real_size = size - size % BTS_RECORD_SIZE;
+
+	for (pg = 0, nbuf = 0, offset = 0, pad = 0; nbuf < buf->nr_bufs; nbuf++) {
+		unsigned int __nr_pages;
+
+		page = virt_to_page(pages[pg]);
+		__nr_pages = PagePrivate(page) ? 1 << page_private(page) : 1;
+		buf->buf[nbuf].page = page;
+		buf->buf[nbuf].offset = offset;
+		buf->buf[nbuf].displacement = (pad ? BTS_RECORD_SIZE - pad : 0);
+		buf->buf[nbuf].size = buf_size(page) - buf->buf[nbuf].displacement;
+		pad = buf->buf[nbuf].size % BTS_RECORD_SIZE;
+		buf->buf[nbuf].size -= pad;
+
+		pg += __nr_pages;
+		offset += __nr_pages << PAGE_SHIFT;
+	}
+
+	return buf;
+}
+
+static void bts_buffer_free_aux(void *data)
+{
+	kfree(data);
+}
+
+static unsigned long bts_buffer_offset(struct bts_buffer *buf, unsigned int idx)
+{
+	return buf->buf[idx].offset + buf->buf[idx].displacement;
+}
+
+static unsigned long
+bts_buffer_advance(struct bts_buffer *buf, unsigned long head)
+{
+	buf->cur_buf++;
+	if (buf->cur_buf == buf->nr_bufs) {
+		buf->cur_buf = 0;
+		head = 0;
+	} else {
+		head = bts_buffer_offset(buf, buf->cur_buf);
+	}
+
+	return head;
+}
+
+static void
+bts_config_buffer(struct bts_buffer *buf)
+{
+	int cpu = raw_smp_processor_id();
+	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	struct bts_phys *phys = &buf->buf[buf->cur_buf];
+	unsigned long index, thresh = 0, end = phys->size;
+	struct page *page = phys->page;
+
+	index = local_read(&buf->head);
+
+	if (!buf->snapshot) {
+		if (buf->end < phys->offset + buf_size(page))
+			end = buf->end - phys->offset - phys->displacement;
+
+		index -= phys->offset + phys->displacement;
+
+		thresh = end - BTS_RECORD_SIZE;
+		if (end - index > BTS_SAFETY_MARGIN)
+			thresh -= BTS_SAFETY_MARGIN;
+	}
+
+	ds->bts_buffer_base = (u64)page_address(page) + phys->displacement;
+	ds->bts_index = ds->bts_buffer_base + index;
+	ds->bts_absolute_maximum = ds->bts_buffer_base + end;
+	ds->bts_interrupt_threshold = !buf->snapshot
+		? ds->bts_buffer_base + thresh
+		: ds->bts_absolute_maximum + BTS_RECORD_SIZE;
+}
+
+static bool bts_buffer_is_full(struct bts_buffer *buf, struct bts_ctx *bts)
+{
+	if (buf->snapshot)
+		return false;
+
+	if (local_read(&buf->data_size) >= bts->handle.size ||
+	    bts->handle.size - local_read(&buf->data_size) < BTS_RECORD_SIZE)
+		return true;
+
+	return false;
+}
+
+static void bts_update(struct bts_ctx *bts)
+{
+	int cpu = raw_smp_processor_id();
+	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	struct bts_buffer *buf = perf_get_aux(&bts->handle);
+	unsigned long index = ds->bts_index - ds->bts_buffer_base, old, head;
+
+	if (!buf)
+		return;
+
+	head = index + bts_buffer_offset(buf, buf->cur_buf);
+
+	if (!buf->snapshot) {
+		struct bts_phys *phys = &buf->buf[buf->cur_buf];
+		int advance = 0;
+
+		if (phys->size - index < BTS_RECORD_SIZE) {
+			advance++;
+			local_inc(&buf->lost);
+		} else if (phys->size - index < BTS_SAFETY_MARGIN) {
+			advance++;
+			memset((void *)ds->bts_index, 0, phys->size - index);
+		}
+
+		if (advance)
+			head = bts_buffer_advance(buf, head);
+
+		old = local_xchg(&buf->head, head);
+		if (!head)
+			head = buf->real_size;
+
+		local_add(head - old, &buf->data_size);
+	} else {
+		local_set(&buf->data_size, head);
+	}
+}
+
+static void bts_event_start(struct perf_event *event, int flags)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct bts_buffer *buf = perf_get_aux(&bts->handle);
+	u64 config = 0;
+
+	if (!buf || bts_buffer_is_full(buf, bts)) {
+		event->hw.state = PERF_HES_STOPPED;
+		return;
+	}
+
+	event->hw.state = 0;
+
+	if (!buf->snapshot)
+		config |= ARCH_PERFMON_EVENTSEL_INT;
+	if (!event->attr.exclude_kernel)
+		config |= ARCH_PERFMON_EVENTSEL_OS;
+	if (!event->attr.exclude_user)
+		config |= ARCH_PERFMON_EVENTSEL_USR;
+
+	bts_config_buffer(buf);
+
+	wmb();
+
+	intel_pmu_enable_bts(config);
+}
+
+static void bts_event_stop(struct perf_event *event, int flags)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	if (event->hw.state == PERF_HES_STOPPED)
+		return;
+
+	event->hw.state = PERF_HES_STOPPED;
+	intel_pmu_disable_bts();
+
+	if (flags & PERF_EF_UPDATE)
+		bts_update(bts);
+}
+
+void intel_bts_enable_local(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	if (bts->handle.event)
+		bts_event_start(bts->handle.event, 0);
+}
+
+void intel_bts_disable_local(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	if (bts->handle.event)
+		bts_event_stop(bts->handle.event, 0);
+}
+
+static int
+bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
+{
+	unsigned long pad, size;
+	struct bts_phys *phys;
+	int ret, little_room = 0;
+
+	handle->head &= ((buf->nr_pages << PAGE_SHIFT) - 1);
+	pad = (buf->nr_pages << PAGE_SHIFT) - handle->head;
+	if (pad > BTS_RECORD_SIZE) {
+		pad = handle->head % BTS_RECORD_SIZE;
+		if (pad)
+			pad = BTS_RECORD_SIZE - pad;
+	}
+
+	if (pad) {
+		ret = perf_aux_output_skip(handle, pad);
+		if (ret)
+			return ret;
+		handle->head &= ((buf->nr_pages << PAGE_SHIFT) - 1);
+	}
+
+	if (handle->wakeup - handle->head < BTS_RECORD_SIZE)
+		size = handle->size;
+	else
+		size = handle->wakeup - handle->head;
+	size = (size / BTS_RECORD_SIZE) * BTS_RECORD_SIZE;
+
+	if (size < BTS_SAFETY_MARGIN)
+		little_room++;
+
+	/* figure out index offset in the current buffer */
+	for (buf->cur_buf = 0, phys = &buf->buf[buf->cur_buf];
+	     handle->head >= phys->offset + phys->displacement + phys->size;
+	     phys = &buf->buf[++buf->cur_buf])
+		;
+	if (WARN_ON_ONCE(buf->cur_buf == buf->nr_bufs))
+		return -EINVAL;
+
+	pad = phys->offset + phys->displacement + phys->size - handle->head;
+	if (!little_room && pad < BTS_SAFETY_MARGIN) {
+		memset(page_address(phys->page) + phys->displacement + handle->head, 0, pad);
+		ret = perf_aux_output_skip(handle, pad);
+		if (ret)
+			return ret;
+		handle->head = bts_buffer_advance(buf, handle->head);
+	}
+
+	local_set(&buf->data_size, 0);
+	local_set(&buf->head, handle->head);
+	buf->end = min(handle->head + size, buf->real_size);
+
+	return 0;
+}
+
+int intel_bts_interrupt(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct perf_event *event = bts->handle.event;
+	struct bts_buffer *buf;
+	s64 old_head;
+	int err;
+
+	if (!event)
+		return 0;
+
+	buf = perf_get_aux(&bts->handle);
+	/*
+	 * Skip snapshot counters: they don't use the interrupt, and
+	 * there's no other way of telling it was meant for us, because
+	 * the pointer will keep moving
+	 */
+	if (!buf || buf->snapshot)
+		return 0;
+
+	old_head = local_read(&buf->head);
+	bts_update(bts);
+
+	perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
+			    !!local_xchg(&buf->lost, 0));
+	if (old_head == local_read(&buf->head))
+		return 0;
+
+	buf = perf_aux_output_begin(&bts->handle, event);
+	if (!buf) {
+		event->hw.state = PERF_HES_STOPPED;
+		return 1;
+	}
+
+	err = bts_buffer_reset(buf, &bts->handle);
+	if (err) {
+		event->hw.state = PERF_HES_STOPPED;
+		perf_aux_output_end(&bts->handle, 0, true);
+	}
+
+	return 1;
+}
+
+static void bts_event_del(struct perf_event *event, int mode)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct bts_buffer *buf = perf_get_aux(&bts->handle);
+	unsigned long flags;
+
+	bts_event_stop(event, PERF_EF_UPDATE);
+
+	raw_spin_lock_irqsave(&bts->lock, flags);
+	if (bts->handle.event)
+		perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
+				    !!local_xchg(&buf->lost, 0));
+	cpuc->ds->bts_index = bts->ds_back.bts_buffer_base;
+	cpuc->ds->bts_buffer_base = bts->ds_back.bts_buffer_base;
+	cpuc->ds->bts_absolute_maximum = bts->ds_back.bts_absolute_maximum;
+	cpuc->ds->bts_interrupt_threshold = bts->ds_back.bts_interrupt_threshold;
+	raw_spin_unlock_irqrestore(&bts->lock, flags);
+}
+
+static int bts_event_add(struct perf_event *event, int mode)
+{
+	struct bts_buffer *buf;
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct hw_perf_event *hwc = &event->hw;
+	unsigned long flags;
+	int ret = -EBUSY;
+
+	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
+		goto err;
+
+	if (cpuc->pt_enabled)
+		goto err;
+
+	raw_spin_lock_irqsave(&bts->lock, flags);
+	if (bts->handle.event)
+		goto err_unlock;
+
+	buf = perf_aux_output_begin(&bts->handle, event);
+	if (!buf) {
+		ret = -EINVAL;
+		goto err_unlock;
+	}
+
+	ret = bts_buffer_reset(buf, &bts->handle);
+	if (ret) {
+		perf_aux_output_end(&bts->handle, 0, true);
+		goto err_unlock;
+	}
+
+	bts->ds_back.bts_buffer_base = cpuc->ds->bts_buffer_base;
+	bts->ds_back.bts_absolute_maximum = cpuc->ds->bts_absolute_maximum;
+	bts->ds_back.bts_interrupt_threshold = cpuc->ds->bts_interrupt_threshold;
+	raw_spin_unlock_irqrestore(&bts->lock, flags);
+
+	hwc->state = !(mode & PERF_EF_START);
+	if (!hwc->state) {
+		bts_event_start(event, 0);
+		if (hwc->state == PERF_HES_STOPPED) {
+			bts_event_del(event, 0);
+			ret = -EBUSY;
+			goto err;
+		}
+	}
+
+	return 0;
+
+err_unlock:
+	raw_spin_unlock_irqrestore(&bts->lock, flags);
+err:
+	hwc->state = PERF_HES_STOPPED;
+
+	return ret;
+}
+
+static int bts_event_init(struct perf_event *event)
+{
+	if (event->attr.type != bts_pmu.type)
+		return -ENOENT;
+
+	return 0;
+}
+
+static void bts_event_read(struct perf_event *event)
+{
+}
+
+static __init int bts_init(void)
+{
+	int cpu;
+
+	if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
+		return -ENODEV;
+
+	get_online_cpus();
+	for_each_possible_cpu(cpu) {
+		raw_spin_lock_init(&per_cpu(bts_ctx, cpu).lock);
+	}
+	put_online_cpus();
+
+	bts_pmu.capabilities	= PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE;
+	bts_pmu.task_ctx_nr	= perf_hw_context;
+	bts_pmu.event_init	= bts_event_init;
+	bts_pmu.add		= bts_event_add;
+	bts_pmu.del		= bts_event_del;
+	bts_pmu.start		= bts_event_start;
+	bts_pmu.stop		= bts_event_stop;
+	bts_pmu.read		= bts_event_read;
+	bts_pmu.setup_aux	= bts_buffer_setup_aux;
+	bts_pmu.free_aux	= bts_buffer_free_aux;
+
+	return perf_pmu_register(&bts_pmu, "intel_bts", -1);
+}
+
+module_init(bts_init);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 5ae212af23..f069f1f536 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -466,7 +466,8 @@ void intel_pmu_enable_bts(u64 config)
 
 	debugctlmsr |= DEBUGCTLMSR_TR;
 	debugctlmsr |= DEBUGCTLMSR_BTS;
-	debugctlmsr |= DEBUGCTLMSR_BTINT;
+	if (config & ARCH_PERFMON_EVENTSEL_INT)
+		debugctlmsr |= DEBUGCTLMSR_BTINT;
 
 	if (!(config & ARCH_PERFMON_EVENTSEL_OS))
 		debugctlmsr |= DEBUGCTLMSR_BTS_OFF_OS;
-- 
2.1.0
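
For reference, the BTS_RECORD_SIZE alignment that bts_buffer_reset()
enforces above follows from the fixed layout of the records the CPU
writes into the DS area. A minimal sketch, assuming the three-field
layout the existing DS code uses (24 bytes on 64-bit):

	struct bts_record {
		u64	from;	/* branch source linear address */
		u64	to;	/* branch destination linear address */
		u64	flags;	/* e.g. prediction information */
	};

Keeping aux_head aligned to this size means a decoder never sees a
record torn across a chunk boundary.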


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 16/22] perf: Add rb_{alloc,free}_kernel api
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (14 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 15/22] x86: perf: intel_bts: Add BTS " Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-09-09  9:09   ` Peter Zijlstra
  2014-08-20 12:36 ` [PATCH v4 17/22] perf: Add a helper to copy AUX data in the kernel Alexander Shishkin
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Events that generate AUX data can also be created by the kernel. In this
case, some in-kernel infrastructure is needed to store and copy this data.

This patch adds an API for ring buffer (de-)allocation that can be used
by kernel code.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/internal.h    |  3 +++
 kernel/events/ring_buffer.c | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 4f99987bc3..f4cfa4cabb 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -59,6 +59,9 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
+extern struct ring_buffer *
+rb_alloc_kernel(struct perf_event *event, int nr_pages, int aux_nr_pages);
+extern void rb_free_kernel(struct ring_buffer *rb, struct perf_event *event);
 extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
 extern void ring_buffer_put(struct ring_buffer *rb);
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 27d1dd7cfa..7887601aa5 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -464,6 +464,39 @@ void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
 	rb->aux_nr_pages = 0;
 }
 
+struct ring_buffer *
+rb_alloc_kernel(struct perf_event *event, int nr_pages, int aux_nr_pages)
+{
+	struct ring_buffer *rb;
+	int ret, pgoff = nr_pages + 1;
+
+	rb = rb_alloc(nr_pages, 0, event->cpu, 0);
+	if (!rb)
+		return NULL;
+
+	ret = rb_alloc_aux(rb, event, pgoff, aux_nr_pages, 0, 0);
+	if (ret) {
+		rb_free(rb);
+		return NULL;
+	}
+
+	/*
+	 * Kernel counters don't need ring buffer wakeups, therefore we don't
+	 * use ring_buffer_attach() here and event->rb_entry stays empty
+	 */
+	rcu_assign_pointer(event->rb, rb);
+
+	return rb;
+}
+
+void rb_free_kernel(struct ring_buffer *rb, struct perf_event *event)
+{
+	BUG_ON(atomic_read(&rb->refcount) != 1);
+	rcu_assign_pointer(event->rb, NULL);
+	rb_free_aux(rb, event);
+	rb_free(rb);
+}
+
 #ifndef CONFIG_PERF_USE_VMALLOC
 
 /*
-- 
2.1.0
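
A hedged usage sketch (not part of this patch) of how an in-kernel user
might pair the two helpers around a counter obtained from
perf_event_create_kernel_counter(); the AUX page count is an arbitrary
example value, and zero data pages mirrors the AUX sampling user later
in this series:

	struct ring_buffer *rb;

	/* 0 data pages, 16 AUX pages for the kernel counter 'event' */
	rb = rb_alloc_kernel(event, 0, 16);
	if (!rb)
		return -ENOMEM;

	/* ... the PMU fills the AUX area while the event runs ... */

	/* drops the buffer and clears event->rb */
	rb_free_kernel(rb, event);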


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 17/22] perf: Add a helper to copy AUX data in the kernel
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (15 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 16/22] perf: Add rb_{alloc,free}_kernel api Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-08-20 12:36 ` [PATCH v4 18/22] perf: Add a helper for looking up pmus by type Alexander Shishkin
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

This patch adds a helper that lets kernel counters generating AUX data
copy that data around, for example, to output it to the perf ring buffer
as a sample record or to write it to a core dump file.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/internal.h    |  5 +++++
 kernel/events/ring_buffer.c | 31 +++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index f4cfa4cabb..bb035dd645 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -52,6 +52,9 @@ struct ring_buffer {
 	void				*data_pages[0];
 };
 
+typedef unsigned long (*aux_copyfn)(void *data, const void *src,
+				    unsigned long len);
+
 extern void rb_free(struct ring_buffer *rb);
 extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
@@ -59,6 +62,8 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb, struct perf_event *event);
+extern long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+			  unsigned long to, aux_copyfn copyfn, void *data);
 extern struct ring_buffer *
 rb_alloc_kernel(struct perf_event *event, int nr_pages, int aux_nr_pages);
 extern void rb_free_kernel(struct ring_buffer *rb, struct perf_event *event);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 7887601aa5..4cf5a1cd0a 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -345,6 +345,37 @@ void *perf_get_aux(struct perf_output_handle *handle)
 	return handle->rb->aux_priv;
 }
 
+long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+		   unsigned long to, aux_copyfn copyfn, void *data)
+{
+	unsigned long tocopy, remainder, len = 0;
+	void *addr;
+
+	from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	to &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+
+	do {
+		tocopy = PAGE_SIZE - offset_in_page(from);
+		if (to > from)
+			tocopy = min(tocopy, to - from);
+		if (!tocopy)
+			break;
+
+		addr = rb->aux_pages[from >> PAGE_SHIFT];
+		addr += offset_in_page(from);
+
+		remainder = copyfn(data, addr, tocopy);
+		if (remainder)
+			return -EFAULT;
+
+		len += tocopy;
+		from += tocopy;
+		from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	} while (to != from);
+
+	return len;
+}
+
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
 static struct page *rb_alloc_aux_page(int node, int order)
-- 
2.1.0
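
A hedged sketch (not from the series) of a matching aux_copyfn; the
'struct aux_sink' here is hypothetical. The contract mirrors the
memcpy-style output copy functions: return the number of bytes NOT
copied, so any non-zero return makes rb_output_aux() bail out with
-EFAULT:

	struct aux_sink {
		void		*buf;
		unsigned long	pos;
	};

	static unsigned long sink_copyfn(void *data, const void *src,
					 unsigned long len)
	{
		struct aux_sink *sink = data;

		memcpy(sink->buf + sink->pos, src, len);
		sink->pos += len;

		return 0;	/* everything was copied */
	}

	/* copy the [from, to) window out of the AUX area: */
	long copied = rb_output_aux(rb, from, to, sink_copyfn, &sink);

Patch 19 uses exactly this shape, passing perf_output_copy() as the
copy function to splice AUX data into a sample record.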


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 18/22] perf: Add a helper for looking up pmus by type
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (16 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 17/22] perf: Add a helper to copy AUX data in the kernel Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-08-20 12:36 ` [PATCH v4 19/22] perf: Add infrastructure for using AUX data in perf samples Alexander Shishkin
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

This patch adds a helper for looking up a registered pmu by its type.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index b82392911a..550c22a2b7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6930,6 +6930,18 @@ void perf_pmu_unregister(struct pmu *pmu)
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 
+/* call under pmus_srcu */
+static struct pmu *__perf_find_pmu(u32 type)
+{
+	struct pmu *pmu;
+
+	rcu_read_lock();
+	pmu = idr_find(&pmu_idr, type);
+	rcu_read_unlock();
+
+	return pmu;
+}
+
 struct pmu *perf_init_event(struct perf_event *event)
 {
 	struct pmu *pmu = NULL;
@@ -6938,9 +6950,7 @@ struct pmu *perf_init_event(struct perf_event *event)
 
 	idx = srcu_read_lock(&pmus_srcu);
 
-	rcu_read_lock();
-	pmu = idr_find(&pmu_idr, event->attr.type);
-	rcu_read_unlock();
+	pmu = __perf_find_pmu(event->attr.type);
 	if (pmu) {
 		if (!try_module_get(pmu->module)) {
 			pmu = ERR_PTR(-ENODEV);
-- 
2.1.0
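
A hedged sketch of the calling convention the comment above mandates;
any new caller (such as the AUX sampling code later in this series)
would look roughly like:

	int idx;
	struct pmu *pmu;

	idx = srcu_read_lock(&pmus_srcu);
	pmu = __perf_find_pmu(type);
	/* ... use pmu under the srcu read lock ... */
	srcu_read_unlock(&pmus_srcu, idx);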


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 19/22] perf: Add infrastructure for using AUX data in perf samples
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (17 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 18/22] perf: Add a helper for looking up pmus by type Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-09-09  9:11   ` Peter Zijlstra
  2014-08-20 12:36 ` [PATCH v4 20/22] perf: Allocate ring buffers for inherited per-task kernel events Alexander Shishkin
                   ` (4 subsequent siblings)
  23 siblings, 1 reply; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

AUX data can be used to annotate other perf events by including it in
sample records when the PERF_SAMPLE_AUX flag is set. In this case, a
kernel counter is created for each such event, and trace data is
retrieved from it and stored in the perf data stream.

To this end, new attribute fields are added:
  * aux_sample_type: specifies the PMU on which the AUX data generating
                     event is created;
  * aux_sample_config: event config (maps to the attribute's config
                       field);
  * aux_sample_size: size of the sample to be written.

This kernel counter is configured similarly to its "main" event with
regard to filtering (exclude_{hv,idle,user,kernel}) and enabled state
(disabled, enable_on_exec) to make sure that we don't get
out-of-context AUX traces.
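
As a hedged illustration (not part of the patch), a userspace consumer
could request AUX-annotated samples roughly as follows; 'pt_type'
stands for the value read from the AUX PMU's sysfs 'type' file, and the
config and size values are examples only:

	struct perf_event_attr attr = {
		.size			= sizeof(attr), /* PERF_ATTR_SIZE_VER4 */
		.type			= PERF_TYPE_HARDWARE,
		.config			= PERF_COUNT_HW_CPU_CYCLES,
		.sample_period		= 100000,
		.sample_type		= PERF_SAMPLE_IP | PERF_SAMPLE_AUX,
		.aux_sample_type	= pt_type,	/* pmu->type of the AUX PMU */
		.aux_sample_config	= 0,		/* its event config */
		.aux_sample_size	= 4096,		/* bytes of trace per sample */
	};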

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |   9 +++
 include/uapi/linux/perf_event.h |  18 ++++-
 kernel/events/core.c            | 172 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 198 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bcfd7a9d84..8731325405 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -84,6 +84,12 @@ struct perf_regs_user {
 	struct pt_regs	*regs;
 };
 
+struct perf_aux_record {
+	u64		size;
+	unsigned long	from;
+	unsigned long	to;
+};
+
 struct task_struct;
 
 /*
@@ -457,6 +463,7 @@ struct perf_event {
 	perf_overflow_handler_t		overflow_handler;
 	void				*overflow_handler_context;
 
+	struct perf_event		*sampler;
 #ifdef CONFIG_EVENT_TRACING
 	struct ftrace_event_call	*tp_event;
 	struct event_filter		*filter;
@@ -627,6 +634,7 @@ struct perf_sample_data {
 	union  perf_mem_data_src	data_src;
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
+	struct perf_aux_record		aux;
 	struct perf_branch_stack	*br_stack;
 	struct perf_regs_user		regs_user;
 	u64				stack_user_size;
@@ -654,6 +662,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->period = period;
 	data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
 	data->regs_user.regs = NULL;
+	data->aux.from = data->aux.to = data->aux.size = 0;
 	data->stack_user_size = 0;
 	data->weight = 0;
 	data->data_src.val = PERF_MEM_NA;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 349c261f93..b24f170abf 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -137,8 +137,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_DATA_SRC			= 1U << 15,
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
+	PERF_SAMPLE_AUX				= 1U << 18,
 
-	PERF_SAMPLE_MAX = 1U << 18,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
 };
 
 /*
@@ -239,6 +240,9 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER3	96	/* add: sample_regs_user */
 					/* add: sample_stack_user */
 					/* add: aux_watermark */
+#define PERF_ATTR_SIZE_VER4	120	/* add: aux_sample_config */
+					/* add: aux_sample_size */
+					/* add: aux_sample_type */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -337,6 +341,16 @@ struct perf_event_attr {
 	 * Wakeup watermark for AUX area
 	 */
 	__u32	aux_watermark;
+
+	/*
+	 * Itrace pmus' event config
+	 */
+	__u64	aux_sample_config;	/* event config for AUX sampling */
+	__u64	aux_sample_size;	/* desired sample size */
+	__u32	aux_sample_type;	/* pmu->type of an AUX PMU */
+
+	/* Align to u64. */
+	__u32	__reserved_2;
 };
 
 #define perf_flags(attr)	(*(&(attr)->read_format + 1))
@@ -710,6 +724,8 @@ enum perf_event_type {
 	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
+	 *	{ u64			size;
+	 *	  char			data[size]; } && PERF_SAMPLE_AUX
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 550c22a2b7..3b1550fd0e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1646,6 +1646,9 @@ void perf_event_disable(struct perf_event *event)
 	struct perf_event_context *ctx = event->ctx;
 	struct task_struct *task = ctx->task;
 
+	if (event->sampler)
+		perf_event_disable(event->sampler);
+
 	if (!task) {
 		/*
 		 * Disable the event on the cpu that it's on
@@ -2148,6 +2151,8 @@ void perf_event_enable(struct perf_event *event)
 	struct perf_event_context *ctx = event->ctx;
 	struct task_struct *task = ctx->task;
 
+	if (event->sampler)
+		perf_event_enable(event->sampler);
 	if (!task) {
 		/*
 		 * Enable the event on the cpu that it's on
@@ -3286,6 +3291,8 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
 		atomic_dec(&per_cpu(perf_cgroup_events, cpu));
 }
 
+static void perf_aux_sampler_fini(struct perf_event *event);
+
 static void unaccount_event(struct perf_event *event)
 {
 	if (event->parent)
@@ -3305,6 +3312,8 @@ static void unaccount_event(struct perf_event *event)
 		static_key_slow_dec_deferred(&perf_sched_events);
 	if (has_branch_stack(event))
 		static_key_slow_dec_deferred(&perf_sched_events);
+	if ((event->attr.sample_type & PERF_SAMPLE_AUX))
+		perf_aux_sampler_fini(event);
 
 	unaccount_event_cpu(event, event->cpu);
 }
@@ -4594,6 +4603,139 @@ perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
 	}
 }
 
+static void perf_aux_sampler_destroy(struct perf_event *event)
+{
+	struct ring_buffer *rb = event->rb;
+
+	if (!rb)
+		return;
+
+	ring_buffer_put(rb); /* can be last */
+}
+
+static int perf_aux_sampler_init(struct perf_event *event,
+				 struct task_struct *task,
+				 struct pmu *pmu)
+{
+	struct perf_event_attr attr;
+	struct perf_event *sampler;
+	struct ring_buffer *rb;
+	unsigned long nr_pages;
+
+	if (!pmu || !(pmu->setup_aux))
+		return -ENOTSUPP;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.type = pmu->type;
+	attr.config = event->attr.aux_sample_config;
+	attr.sample_type = 0;
+	attr.disabled = event->attr.disabled;
+	attr.enable_on_exec = event->attr.enable_on_exec;
+	attr.exclude_hv = event->attr.exclude_hv;
+	attr.exclude_idle = event->attr.exclude_idle;
+	attr.exclude_user = event->attr.exclude_user;
+	attr.exclude_kernel = event->attr.exclude_kernel;
+	attr.aux_sample_size = event->attr.aux_sample_size;
+
+	sampler = perf_event_create_kernel_counter(&attr, event->cpu, task,
+						   NULL, NULL);
+	if (IS_ERR(sampler))
+		return PTR_ERR(sampler);
+
+	nr_pages = 1ul << __get_order(event->attr.aux_sample_size);
+
+	rb = rb_alloc_kernel(sampler, 0, nr_pages);
+	if (!rb) {
+		perf_event_release_kernel(sampler);
+		return -ENOMEM;
+	}
+
+	event->sampler = sampler;
+	sampler->destroy = perf_aux_sampler_destroy;
+
+	return 0;
+}
+
+static void perf_aux_sampler_fini(struct perf_event *event)
+{
+	struct perf_event *sampler = event->sampler;
+
+	/* might get free'd from event->destroy() path */
+	if (!sampler)
+		return;
+
+	perf_event_release_kernel(sampler);
+
+	event->sampler = NULL;
+}
+
+static unsigned long perf_aux_sampler_trace(struct perf_event *event,
+					    struct perf_sample_data *data)
+{
+	struct perf_event *sampler = event->sampler;
+	struct ring_buffer *rb;
+
+	if (!sampler || sampler->state != PERF_EVENT_STATE_ACTIVE) {
+		data->aux.size = 0;
+		goto out;
+	}
+
+	rb = ring_buffer_get(sampler);
+	if (!rb) {
+		data->aux.size = 0;
+		goto out;
+	}
+
+	sampler->pmu->del(sampler, 0);
+
+	data->aux.to = local_read(&rb->aux_head);
+
+	if (data->aux.to < sampler->attr.aux_sample_size)
+		data->aux.from = rb->aux_nr_pages * PAGE_SIZE +
+			data->aux.to - sampler->attr.aux_sample_size;
+	else
+		data->aux.from = data->aux.to -
+			sampler->attr.aux_sample_size;
+	data->aux.size = ALIGN(sampler->attr.aux_sample_size, sizeof(u64));
+	ring_buffer_put(rb);
+
+out:
+	return data->aux.size;
+}
+
+static void perf_aux_sampler_output(struct perf_event *event,
+				    struct perf_output_handle *handle,
+				    struct perf_sample_data *data)
+{
+	struct perf_event *sampler = event->sampler;
+	struct ring_buffer *rb;
+	unsigned long pad;
+	int ret;
+
+	if (WARN_ON_ONCE(!sampler || !data->aux.size))
+		return;
+
+	rb = ring_buffer_get(sampler);
+	if (WARN_ON_ONCE(!rb))
+		return;
+	ret = rb_output_aux(rb, data->aux.from, data->aux.to,
+			    (aux_copyfn)perf_output_copy, handle);
+	if (ret < 0) {
+		pr_warn_ratelimited("failed to copy trace data\n");
+		goto out;
+	}
+
+	pad = data->aux.size - ret;
+	if (pad) {
+		u64 p = 0;
+
+		perf_output_copy(handle, &p, pad);
+	}
+out:
+	ring_buffer_put(rb);
+	sampler->pmu->add(sampler, PERF_EF_START);
+}
+
 static void __perf_event_header__init_id(struct perf_event_header *header,
 					 struct perf_sample_data *data,
 					 struct perf_event *event)
@@ -4880,6 +5022,13 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		perf_output_put(handle, data->txn);
 
+	if (sample_type & PERF_SAMPLE_AUX) {
+		perf_output_put(handle, data->aux.size);
+
+		if (data->aux.size)
+			perf_aux_sampler_output(event, handle, data);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -4987,6 +5136,14 @@ void perf_prepare_sample(struct perf_event_header *header,
 		data->stack_user_size = stack_size;
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_AUX) {
+		u64 size = sizeof(u64);
+
+		size += perf_aux_sampler_trace(event, data);
+
+		header->size += size;
+	}
 }
 
 static void perf_event_output(struct perf_event *event,
@@ -7139,6 +7296,21 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 			if (err)
 				goto err_pmu;
 		}
+
+		if (event->attr.sample_type & PERF_SAMPLE_AUX) {
+			struct pmu *aux_pmu;
+			int idx;
+
+			idx = srcu_read_lock(&pmus_srcu);
+			aux_pmu = __perf_find_pmu(event->attr.aux_sample_type);
+			err = perf_aux_sampler_init(event, task, aux_pmu);
+			srcu_read_unlock(&pmus_srcu, idx);
+
+			if (err) {
+				put_callchain_buffers();
+				goto err_pmu;
+			}
+		}
 	}
 
 	return event;
-- 
2.1.0
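
To make the sampling-window arithmetic in perf_aux_sampler_trace()
concrete, a worked example with illustrative numbers: an 8192-byte
(2-page) AUX buffer and aux_sample_size = 1024.

	aux_head (to) = 4096:	from = 4096 - 1024 = 3072
				-> window is [3072, 4096)
	aux_head (to) =  512:	from = 8192 + 512 - 1024 = 7680
				-> window wraps: [7680, 8192) + [0, 512)

rb_output_aux() handles the wrap by masking both offsets with
(aux_nr_pages * PAGE_SIZE - 1) and copying page by page until 'from'
catches up with 'to'.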


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 20/22] perf: Allocate ring buffers for inherited per-task kernel events
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (18 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 19/22] perf: Add infrastructure for using AUX data in perf samples Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-09-09  9:12   ` Peter Zijlstra
  2014-08-20 12:36 ` [PATCH v4 21/22] perf: Allow AUX sampling for multiple events Alexander Shishkin
                   ` (3 subsequent siblings)
  23 siblings, 1 reply; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

When a new event is inherited from a per-task kernel event that has a
ring buffer, allocate a new buffer for this event so that data from the
child task is collected and can later be retrieved for sample annotation
or whatnot.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c     |  9 +++++++++
 kernel/events/internal.h | 11 +++++++++++
 2 files changed, 20 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3b1550fd0e..9d7f6086d5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8230,6 +8230,15 @@ inherit_event(struct perf_event *parent_event,
 		(void)perf_event_set_output(child_event, parent_event);
 
 	/*
+	 * For per-task kernel events with ring buffers, set_output doesn't
+	 * make sense, but we can allocate a new buffer here.
+	 */
+	if (parent_event->cpu == -1 && kernel_rb_event(parent_event)) {
+		(void)rb_alloc_kernel(child_event, parent_event->rb->nr_pages,
+				      parent_event->rb->aux_nr_pages);
+	}
+
+	/*
 	 * Precalculate sample_data sizes
 	 */
 	perf_event__header_size(child_event);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index bb035dd645..b306bc9307 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -120,6 +120,17 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb)
 	return rb->aux_nr_pages << PAGE_SHIFT;
 }
 
+static inline bool kernel_rb_event(struct perf_event *event)
+{
+	/*
+	 * Having a ring buffer and not being on any ring buffers' wakeup
+	 * list means it was attached by rb_alloc_kernel() and not
+	 * ring_buffer_attach(). It's the only case in which these two
+	 * conditions hold at the same time.
+	 */
+	return event->rb && list_empty(&event->rb_entry);
+}
+
 #define DEFINE_OUTPUT_COPY(func_name, memcpy_func)			\
 static inline unsigned long						\
 func_name(struct perf_output_handle *handle,				\
-- 
2.1.0
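
A hedged illustration of the distinction kernel_rb_event() relies on,
as this series leaves things:

	/*
	 * mmap'ed event:  event->rb set, event->rb_entry linked into
	 *                 rb->event_list by ring_buffer_attach()
	 * kernel event:   event->rb set, event->rb_entry left empty,
	 *                 because rb_alloc_kernel() skips the attach
	 */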


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 21/22] perf: Allow AUX sampling for multiple events
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (19 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 20/22] perf: Allocate ring buffers for inherited per-task kernel events Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-08-20 12:36 ` [PATCH v4 22/22] perf: Allow sampling of inherited events Alexander Shishkin
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Right now, only one perf event can be annotated with AUX data. However,
it should be possible to annotate several events that share a similar
configuration (wrt exclude_{hv,idle,user,kernel}, config and other event
attribute fields).

So, before a kernel counter is created for AUX sampling, we first look
for an existing counter with a suitable configuration and, if one exists,
use it to annotate the new event as well.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 93 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 66 insertions(+), 27 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9d7f6086d5..73f6f5a5b7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4613,45 +4613,84 @@ static void perf_aux_sampler_destroy(struct perf_event *event)
 	ring_buffer_put(rb); /* can be last */
 }
 
+static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2);
+
+static bool perf_aux_event_match(struct perf_event *e1, struct perf_event *e2)
+{
+	return has_aux(e1) && exclusive_event_match(e1, e2);
+}
+
 static int perf_aux_sampler_init(struct perf_event *event,
 				 struct task_struct *task,
 				 struct pmu *pmu)
 {
+	struct perf_event_context *ctx;
 	struct perf_event_attr attr;
-	struct perf_event *sampler;
+	struct perf_event *sampler = NULL;
 	struct ring_buffer *rb;
-	unsigned long nr_pages;
+	unsigned long nr_pages, flags;
 
 	if (!pmu || !(pmu->setup_aux))
 		return -ENOTSUPP;
 
-	memset(&attr, 0, sizeof(attr));
-	attr.type = pmu->type;
-	attr.config = event->attr.aux_sample_config;
-	attr.sample_type = 0;
-	attr.disabled = event->attr.disabled;
-	attr.enable_on_exec = event->attr.enable_on_exec;
-	attr.exclude_hv = event->attr.exclude_hv;
-	attr.exclude_idle = event->attr.exclude_idle;
-	attr.exclude_user = event->attr.exclude_user;
-	attr.exclude_kernel = event->attr.exclude_kernel;
-	attr.aux_sample_size = event->attr.aux_sample_size;
-
-	sampler = perf_event_create_kernel_counter(&attr, event->cpu, task,
-						   NULL, NULL);
-	if (IS_ERR(sampler))
-		return PTR_ERR(sampler);
-
-	nr_pages = 1ul << __get_order(event->attr.aux_sample_size);
-
-	rb = rb_alloc_kernel(sampler, 0, nr_pages);
-	if (!rb) {
-		perf_event_release_kernel(sampler);
-		return -ENOMEM;
+	ctx = find_get_context(pmu, task, event->cpu);
+	if (ctx) {
+		raw_spin_lock_irqsave(&ctx->lock, flags);
+		list_for_each_entry(sampler, &ctx->event_list, event_entry) {
+			/*
+			 * event is not an aux event, but all the relevant
+			 * bits should match
+			 */
+			if (perf_aux_event_match(sampler, event) &&
+			    sampler->attr.type == event->attr.aux_sample_type &&
+			    sampler->attr.config == event->attr.aux_sample_config &&
+			    sampler->attr.exclude_hv == event->attr.exclude_hv &&
+			    sampler->attr.exclude_idle == event->attr.exclude_idle &&
+			    sampler->attr.exclude_user == event->attr.exclude_user &&
+			    sampler->attr.exclude_kernel == event->attr.exclude_kernel &&
+			    sampler->attr.aux_sample_size >= event->attr.aux_sample_size &&
+			    atomic_long_inc_not_zero(&sampler->refcount))
+				goto got_event;
+		}
+
+		sampler = NULL;
+
+got_event:
+		--ctx->pin_count;
+		put_ctx(ctx);
+		raw_spin_unlock_irqrestore(&ctx->lock, flags);
+	}
+
+	if (!sampler) {
+		memset(&attr, 0, sizeof(attr));
+		attr.type = pmu->type;
+		attr.config = event->attr.aux_sample_config;
+		attr.sample_type = 0;
+		attr.disabled = event->attr.disabled;
+		attr.enable_on_exec = event->attr.enable_on_exec;
+		attr.exclude_hv = event->attr.exclude_hv;
+		attr.exclude_idle = event->attr.exclude_idle;
+		attr.exclude_user = event->attr.exclude_user;
+		attr.exclude_kernel = event->attr.exclude_kernel;
+		attr.aux_sample_size = event->attr.aux_sample_size;
+
+		sampler = perf_event_create_kernel_counter(&attr, event->cpu,
+							   task, NULL, NULL);
+		if (IS_ERR(sampler))
+			return PTR_ERR(sampler);
+
+		nr_pages = 1ul << __get_order(event->attr.aux_sample_size);
+
+		rb = rb_alloc_kernel(sampler, 0, nr_pages);
+		if (!rb) {
+			perf_event_release_kernel(sampler);
+			return -ENOMEM;
+		}
+
+		sampler->destroy = perf_aux_sampler_destroy;
 	}
 
 	event->sampler = sampler;
-	sampler->destroy = perf_aux_sampler_destroy;
 
 	return 0;
 }
@@ -4664,7 +4703,7 @@ static void perf_aux_sampler_fini(struct perf_event *event)
 	if (!sampler)
 		return;
 
-	perf_event_release_kernel(sampler);
+	put_event(sampler);
 
 	event->sampler = NULL;
 }
-- 
2.1.0
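
A hedged example of the matching rule above: given two user events E1
(aux_sample_size = 2048) and E2 (aux_sample_size = 1024) that agree on
aux_sample_{type,config} and the exclude_* bits, E1 creates the kernel
sampler and E2 then reuses it, since 2048 >= 1024. The reuse takes a
reference via atomic_long_inc_not_zero(&sampler->refcount), which the
put_event() in perf_aux_sampler_fini() drops later.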


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 22/22] perf: Allow sampling of inherited events
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (20 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 21/22] perf: Allow AUX sampling for multiple events Alexander Shishkin
@ 2014-08-20 12:36 ` Alexander Shishkin
  2014-08-25  6:21 ` [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Adrian Hunter
  2014-09-01 16:30 ` Peter Zijlstra
  23 siblings, 0 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-08-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Alexander Shishkin

Try to find an AUX sampler event for the current event if none is linked.
This is useful when these events are allocated via the inheritance path,
independently of one another.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 94 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 62 insertions(+), 32 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 73f6f5a5b7..c59a596c8f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4620,47 +4620,67 @@ static bool perf_aux_event_match(struct perf_event *e1, struct perf_event *e2)
 	return has_aux(e1) && exclusive_event_match(e1, e2);
 }
 
+struct perf_event *__find_sampling_counter(struct perf_event_context *ctx,
+					   struct perf_event *event,
+					   struct task_struct *task)
+{
+	struct perf_event *sampler = NULL;
+
+	list_for_each_entry(sampler, &ctx->event_list, event_entry) {
+		/*
+		 * event is not an itrace event, but all the relevant
+		 * bits should match
+		 */
+		if (perf_aux_event_match(sampler, event) &&
+		    kernel_rb_event(sampler) &&
+		    sampler->attr.type == event->attr.aux_sample_type &&
+		    sampler->attr.config == event->attr.aux_sample_config &&
+		    sampler->attr.exclude_hv == event->attr.exclude_hv &&
+		    sampler->attr.exclude_idle == event->attr.exclude_idle &&
+		    sampler->attr.exclude_user == event->attr.exclude_user &&
+		    sampler->attr.exclude_kernel == event->attr.exclude_kernel &&
+		    sampler->attr.aux_sample_size >= event->attr.aux_sample_size &&
+		    atomic_long_inc_not_zero(&sampler->refcount))
+			return sampler;
+	}
+
+	return NULL;
+}
+
+struct perf_event *find_sampling_counter(struct pmu *pmu,
+					 struct perf_event *event,
+					 struct task_struct *task)
+{
+	struct perf_event_context *ctx;
+	struct perf_event *sampler = NULL;
+	unsigned long flags;
+
+	ctx = find_get_context(pmu, task, event->cpu);
+	if (!ctx)
+		return NULL;
+
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+	sampler = __find_sampling_counter(ctx, event, task);
+	--ctx->pin_count;
+	put_ctx(ctx);
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return sampler;
+}
+
 static int perf_aux_sampler_init(struct perf_event *event,
 				 struct task_struct *task,
 				 struct pmu *pmu)
 {
-	struct perf_event_context *ctx;
 	struct perf_event_attr attr;
-	struct perf_event *sampler = NULL;
+	struct perf_event *sampler;
 	struct ring_buffer *rb;
-	unsigned long nr_pages, flags;
+	unsigned long nr_pages;
 
 	if (!pmu || !(pmu->setup_aux))
 		return -ENOTSUPP;
 
-	ctx = find_get_context(pmu, task, event->cpu);
-	if (ctx) {
-		raw_spin_lock_irqsave(&ctx->lock, flags);
-		list_for_each_entry(sampler, &ctx->event_list, event_entry) {
-			/*
-			 * event is not an aux event, but all the relevant
-			 * bits should match
-			 */
-			if (perf_aux_event_match(sampler, event) &&
-			    sampler->attr.type == event->attr.aux_sample_type &&
-			    sampler->attr.config == event->attr.aux_sample_config &&
-			    sampler->attr.exclude_hv == event->attr.exclude_hv &&
-			    sampler->attr.exclude_idle == event->attr.exclude_idle &&
-			    sampler->attr.exclude_user == event->attr.exclude_user &&
-			    sampler->attr.exclude_kernel == event->attr.exclude_kernel &&
-			    sampler->attr.aux_sample_size >= event->attr.aux_sample_size &&
-			    atomic_long_inc_not_zero(&sampler->refcount))
-				goto got_event;
-		}
-
-		sampler = NULL;
-
-got_event:
-		--ctx->pin_count;
-		put_ctx(ctx);
-		raw_spin_unlock_irqrestore(&ctx->lock, flags);
-	}
-
+	sampler = find_sampling_counter(pmu, event, task);
 	if (!sampler) {
 		memset(&attr, 0, sizeof(attr));
 		attr.type = pmu->type;
@@ -4711,9 +4731,19 @@ static void perf_aux_sampler_fini(struct perf_event *event)
 static unsigned long perf_aux_sampler_trace(struct perf_event *event,
 					    struct perf_sample_data *data)
 {
-	struct perf_event *sampler = event->sampler;
+	struct perf_event *sampler;
 	struct ring_buffer *rb;
 
+	if (!event->sampler) {
+		/*
+		 * down this path, event->ctx is already locked IF it's the
+		 * same context
+		 */
+		event->sampler = __find_sampling_counter(event->ctx, event,
+							 event->ctx->task);
+	}
+
+	sampler = event->sampler;
 	if (!sampler || sampler->state != PERF_EVENT_STATE_ACTIVE) {
 		data->aux.size = 0;
 		goto out;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (21 preceding siblings ...)
  2014-08-20 12:36 ` [PATCH v4 22/22] perf: Allow sampling of inherited events Alexander Shishkin
@ 2014-08-25  6:21 ` Adrian Hunter
  2014-09-01 16:21   ` Peter Zijlstra
  2014-09-01 16:30 ` Peter Zijlstra
  23 siblings, 1 reply; 61+ messages in thread
From: Adrian Hunter @ 2014-08-25  6:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, kan.liang

On 08/20/2014 03:35 PM, Alexander Shishkin wrote:
> This patchset adds support for Intel Processor Trace (PT) extension [1] of
> Intel Architecture that allows the capture of information about software
> execution flow, to the perf kernel infrastructure.
> 

Alex is away, so I would like to know if anyone (Peter?) plans to review or
comment on these patches.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT
  2014-08-25  6:21 ` [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Adrian Hunter
@ 2014-09-01 16:21   ` Peter Zijlstra
  0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-01 16:21 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, kan.liang

On Mon, Aug 25, 2014 at 09:21:23AM +0300, Adrian Hunter wrote:
> On 08/20/2014 03:35 PM, Alexander Shishkin wrote:
> > This patchset adds support for Intel Processor Trace (PT) extension [1] of
> > Intel Architecture that allows the capture of information about software
> > execution flow, to the perf kernel infrastructure.
> > 
> 
> Alex is away, so I would like to know if anyone (Peter?) plans to review or
> comment on these patches.

I'll get to it (eventually); I had a week of down-time after KS/Linuxcon
and have now returned to a copiously overflowing inbox.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT
  2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (22 preceding siblings ...)
  2014-08-25  6:21 ` [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Adrian Hunter
@ 2014-09-01 16:30 ` Peter Zijlstra
  2014-09-01 17:17   ` Pawel Moll
  23 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-01 16:30 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Pawel.Moll, Michael.Williams, ralf

On Wed, Aug 20, 2014 at 03:35:57PM +0300, Alexander Shishkin wrote:
> Hi Peter and all,
> 
> This patchset adds support for Intel Processor Trace (PT) extension [1] of
> Intel Architecture that allows the capture of information about software
> execution flow, to the perf kernel infrastructure.
> 
> The single most notable thing is that while PT outputs trace data in a
> compressed binary format, it will still generate hundreds of megabytes
> of trace data per second per core. Decoding this binary stream takes
> 2-3 orders of magnitude the cpu time that it takes to generate
> it. These considerations make it impossible to carry out decoding in
> kernel space. Therefore, the trace data is exported to userspace as a
> zero-copy mapping that userspace can collect and store for later
> decoding. To address this, this patchset extends perf ring buffer with
> an "AUX space", which is allocated for hardware blocks such as PT to
> export their trace data with minimal overhead. This space can be
> configured via buffer's user page and mmapped from the same file
> descriptor with a given offset. Data can then be collected from it
> by reading the aux_head (write) pointer from the user page and updating
> aux_tail (read) pointer similarly to data_{head,tail} of the
> traditional perf buffer. There is an api between perf core and pmu
> drivers that wish to make use of this AUX space to export their data.
> 
> For tracing blocks that don't support hardware scatter-gather tables,
> we provide high-order physically contiguous allocations to minimize
> the overhead needed for software double buffering and PMI pressure.
> 
> This way we get a normal perf data stream that provides sideband
> information that is required to decode the trace data, such as MMAPs,
> COMMs etc, plus the actual trace in its own logical space.
> 
> If the trace buffer is mapped writable, the driver will stop tracing
> when it fills up (aux_head approaches aux_tail), till data is read,
> aux_tail pointer is moved forward and an ioctl() is issued to
> re-enable tracing. If the trace buffer is mapped read only, the
> tracing will continue, overwriting older data, so that the buffer
> always contains the most recent data. Tracing can be stopped with an
> ioctl() and restarted once the data is collected.
> 
> Another use case is annotating samples of other perf events: setting
> PERF_SAMPLE_AUX requests attr.aux_sample_size bytes of trace to be
> included in each event's sample.
> 
> This patchset consists of necessary changes to the perf kernel
> infrastructure, and PT and BTS pmu drivers. The tooling support is not
> included in this series, however, it can be found in my github tree [2].
> 
> This version changes the way watermarks are handled for AUX area and
> gets rid of the notion of "itrace" both in the core and in the perf
> interface (event attribute), which makes it more logical.
> 
> [1] http://software.intel.com/en-us/intel-isa-extensions
> [2] http://github.com/virtuoso/linux-perf/tree/intel_pt

It would also be good if some other archs can comment on this (the
generic parts obviously). There is the ARM CoreSight stuff and ISTR that
MIPS also has something like this, although I'm not entirely sure who to
poke on that, Ralf?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT
  2014-09-01 16:30 ` Peter Zijlstra
@ 2014-09-01 17:17   ` Pawel Moll
       [not found]     ` <CANLsYky0vuwo7MwKbiGXypkLkrX7k6BOEf2uej3-Z3-HZHKd7w@mail.gmail.com>
  0 siblings, 1 reply; 61+ messages in thread
From: Pawel Moll @ 2014-09-01 17:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, kan.liang, Michael Williams, ralf,
	Mathieu Poirier

On Mon, 2014-09-01 at 17:30 +0100, Peter Zijlstra wrote:
> On Wed, Aug 20, 2014 at 03:35:57PM +0300, Alexander Shishkin wrote:
> > This patchset adds support for Intel Processor Trace (PT) extension [1] of
> > Intel Architecture that allows the capture of information about software
> > execution flow, to the perf kernel infrastructure.

I'm particularly excited about the "to the perf kernel
infrastructure" :-)

> It would also be good if some other archs can comment on this (the
> generic parts obviously). There is the ARM CoreSight stuff 

We've got Mathieu Poirier at Linaro working on the drivers for CoreSight
components, with the latest (v5) version here:

http://thread.gmane.org/gmane.linux.ports.arm.kernel/351654

but it was mostly intended to have the processor trace decoded using a
separate tool in userspace. I must say that the idea of getting it all
into perf sounds really interesting.

I've copied Mathieu so he can have a look as well. As for me personally,
I'd like to have 48 hours to digest it all. I've just got back from a
very distant time zone and am suffering from monumental jet lag... 

Paweł


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT
       [not found]     ` <CANLsYky0vuwo7MwKbiGXypkLkrX7k6BOEf2uej3-Z3-HZHKd7w@mail.gmail.com>
@ 2014-09-04  8:26       ` Peter Zijlstra
  2014-09-05 13:34         ` Mathieu Poirier
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-04  8:26 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: Pawel Moll, Alexander Shishkin, Ingo Molnar, linux-kernel,
	Robert Richter, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, kan.liang,
	Michael Williams, ralf

On Tue, Sep 02, 2014 at 02:18:16PM -0600, Mathieu Poirier wrote:
> Pawel, many thanks for looping me in.
> 
> I am definitely not a perf-internal guru and as such won't be able to
> comment on the implementation.  On the flip side it is easy for me to see
> how the work on coresight done at Linaro can be made to tie in with what
> Alexander is proposing.  Albeit not at the top of the priority list at this
> time, integration with perf (and ftrace) is definitely on the roadmap.
> 
> Pawel is correct in his statement that Linaro's work in HW trace decoding
> is (currently) mainly focused on processor tracing but that will change
> when we have the basic infrastructure upstreamed.
> 
> Last but not least it would be interesting to have more information on the
> "sideband data".  With coresight we have something called "metadata", also
> related to how the trace source was configured and instrumental to proper
> trace decoding.  I'm pretty sure we are facing the same problems.

So we use the sideband or AUX data stream to export the 'big' data
stream generated by the CPU in an opaque manner. For every AUX data
block 'posted' we issue an event into the regular data buffer that
describes it.

I was assuming that both ARM and MIPS would generate a single data
stream as well. So please do tell more about your meta-data; is that a
one time thing or a second continuous stream of data, albeit smaller
than the main stream?

The way I read your explanation, it is a one-time blob generated once you
set up the hardware. I suppose we could either dump it once into the
normal data stream or maybe dump it once every time we generate an AUX
buffer event into the normal data stream -- if it's not too big.

In any case, can you point us to public documentation of the ARM
CoreSight stuff and maybe provide a short summary for the tl;dr crowd?
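
(For the tl;dr crowd on the perf side, a hedged sketch of what such a
"describes it" record could carry -- the field names are illustrative,
not a committed ABI:

	struct perf_record_aux {
		struct perf_event_header	header;	/* a new AUX record type */
		u64	aux_offset;	/* where in the AUX area the data starts */
		u64	aux_size;	/* how much data was posted */
		u64	flags;		/* e.g. truncation/overwrite status */
	};

A one-time metadata blob could then plausibly ride along as one more
sideband record of its own.)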

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT
  2014-09-04  8:26       ` Peter Zijlstra
@ 2014-09-05 13:34         ` Mathieu Poirier
  2014-09-08 11:55           ` Alexander Shishkin
  0 siblings, 1 reply; 61+ messages in thread
From: Mathieu Poirier @ 2014-09-05 13:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pawel Moll, Alexander Shishkin, Ingo Molnar, linux-kernel,
	Robert Richter, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, kan.liang,
	Michael Williams, ralf, Al Grant, Deepak Saxena

On 4 September 2014 02:26, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 02, 2014 at 02:18:16PM -0600, Mathieu Poirier wrote:
>> Pawel, many thanks for looping me in.
>>
>> I am definitely not a perf-internal guru and as such won't be able to
>> comment on the implementation.  On the flip side it is easy for me to see
>> how the work on coresight done at Linaro can be made to tie in with what
>> Alexander is proposing.  Albeit not at the top of the priority list at this
>> time, integration with perf (and ftrace) is definitely on the roadmap.
>>
>> Pawel is correct in his statement that Linaro's work in HW trace decoding
>> is (currently) mainly focused on processor tracing but that will change
>> when we have the basic infrastructure upstreamed.
>>
>> Last but not least it would be interesting to have more information on the
>> "sideband data".  With coresight we have something called "metadata", also
>> related to how the trace source was configured and instrumental to proper
>> trace decoding.  I'm pretty sure we are facing the same problems.
>
> So we use the sideband or AUX data stream to export the 'big' data
> stream generated by the CPU in an opaque manner. For every AUX data
> block 'posted' we issue an event into the regular data buffer that
> describes it.

In the context of "describe it" written above, what kind of
information one would typically find in that description?

>
> I was assuming that both ARM and MIPS would generate a single data
> stream as well. So please do tell more about your meta-data; is that a
> one time thing or a second continuous stream of data, albeit smaller
> than the main stream?

Coresight does indeed generate a single stream of compressed data.
Depending on the tracing specifics that were requested by the use case
(trace engine configuration), the format of the packets in the
compressed stream will change.  Since the compressed stream itself
doesn't carry clues about the formatting information, knowledge of how
the trace engine was configured is mandatory for the proper decoding
of the trace stream.

Metadata refers to exactly that - the configuration of the trace
engine.  It has to be somehow lumped in with the trace stream for
off-target analysis.

>
> The way I read your explanation, it is a one-time blob generated once you
> set up the hardware.

Correct.

> I suppose we could either dump it once into the
> normal data stream or maybe dump it once every time we generate an AUX
> buffer event into the normal data stream -- if it's not too big.

Right, there is a set of meta-data to be generated with each trace
run.  With the current implementation, a "trace run" pertains to all
the information collected between the beginning and end of a trace
scenario.  Future work involves triggering a DMA transfer of the full
coresight buffer to a kernel memory area, something that is probably
close to the "buffer event" you are referring to.

When we get there I also agree that metadata should be lumped with
each buffer event, removing the need to have the first buffer in the
set to be able to decode all the other buffers.

>
> In any case, can you point us to public documentation of the ARM
> CoreSight stuff and maybe provide a short summary for the tl;dr crowd?

Absolutely.

Information on metadata: [1]
Current coresight patchset: [2]
Quick summary on coresight: [3]
Technical details on Coresight: [4]

Best regards,
Mathieu


[1]. http://people.linaro.org/~mathieu.poirier/coresight/cs-decode.pdf
[2]. https://lkml.org/lkml/2014/8/27/479
[3]. https://lkml.org/lkml/2014/8/27/467
[4]. http://infocenter.arm.com/help/index.jsp (section "CoreSight
on-chip trace and debug")

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams
  2014-08-20 12:35 ` [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams Alexander Shishkin
@ 2014-09-08  7:02   ` Peter Zijlstra
  2014-09-08 11:16     ` Alexander Shishkin
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-08  7:02 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Wed, Aug 20, 2014 at 03:35:59PM +0300, Alexander Shishkin wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> This patch introduces "AUX space" in the perf mmap buffer, intended for
> exporting high bandwidth data streams to userspace, such as instruction
> flow traces.
> 
> AUX space is a ring buffer, defined by aux_{offset,size} fields in the
> user_page structure, and read/write pointers aux_{head,tail}, which abide
> by the same rules as data_* counterparts of the main perf buffer.
> 
> In order to allocate/mmap AUX, userspace needs to set up aux_offset to
> such an offset that will be greater than data_offset+data_size and
> aux_size to be the desired buffer size. Both need to be page aligned.
> Then, same aux_offset and aux_size should be passed to mmap() call and
> if everything adds up, you should have an AUX buffer as a result.
> 
> Pages that are mapped into this buffer also come out of user's mlock
> rlimit plus perf_event_mlock_kb allowance.
> 
> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---

> +void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
> +{
> +	struct perf_event *iter;
> +	int pg;
> +
> +	if (rb->aux_priv) {
> +		/* disable all potential writers before freeing */
> +		rcu_read_lock();
> +		list_for_each_entry_rcu(iter, &rb->event_list, rb_entry)
> +			perf_event_disable(iter);
> +		rcu_read_unlock();

Hmm, I cannot remember this from last time, and it's not explained
why this was added.

This would make the munmap() semantics differ between buffers with and
without AUX bits in them.

> +
> +		event->pmu->free_aux(rb->aux_priv);
> +		rb->aux_priv = NULL;
> +	}
> +
> +	for (pg = 0; pg < rb->aux_nr_pages; pg++)
> +		free_page((unsigned long)rb->aux_pages[pg]);
> +
> +	kfree(rb->aux_pages);
> +	rb->aux_nr_pages = 0;
> +}


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 04/22] perf: Add a capability for AUX_NO_SG pmus to do software double buffering
  2014-08-20 12:36 ` [PATCH v4 04/22] perf: Add a capability for AUX_NO_SG pmus to do software double buffering Alexander Shishkin
@ 2014-09-08  7:17   ` Peter Zijlstra
  2014-09-08 11:07     ` Alexander Shishkin
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-08  7:17 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Wed, Aug 20, 2014 at 03:36:01PM +0300, Alexander Shishkin wrote:
> For pmus that don't support scatter-gather for AUX data in hardware, it
> might still make sense to implement software double buffering to avoid
> losing data while the user is reading data out. For this purpose, add
> a pmu capability that guarantees multiple high-order chunks for AUX buffer,
> so that the pmu driver can do switchover tricks.

Please expand this with more detail on how to use this.

> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---
>  include/linux/perf_event.h  |  1 +
>  kernel/events/ring_buffer.c | 15 ++++++++++++++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index fe10bf6f94..1e7b659b49 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -172,6 +172,7 @@ struct perf_event;
>   */
>  #define PERF_PMU_CAP_NO_INTERRUPT		0x01
>  #define PERF_PMU_CAP_AUX_NO_SG			0x02
> +#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF		0x04
>  
>  /**
>   * struct pmu - generic performance monitoring unit
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index d10919ca42..f5ee3669f8 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -286,9 +286,22 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
>  	if (!has_aux(event))
>  		return -ENOTSUPP;
>  
> -	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
> +	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
>  		order = get_order(nr_pages * PAGE_SIZE);
>  
> +		/*
> +		 * PMU requests more than one contiguous chunks of memory
> +		 * for SW double buffering
> +		 */
> +		if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) &&
> +		    !overwrite) {
> +			if (!order)
> +				return -EINVAL;
> +
> +			order--;
> +		}
> +	}

In particular this looks like it will allocate double the total number
of pages and 'lose' half of them. There is no corresponding code in the
free path to collect them.


* Re: [PATCH v4 04/22] perf: Add a capability for AUX_NO_SG pmus to do software double buffering
  2014-09-08  7:17   ` Peter Zijlstra
@ 2014-09-08 11:07     ` Alexander Shishkin
  2014-09-08 11:31       ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Alexander Shishkin @ 2014-09-08 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Aug 20, 2014 at 03:36:01PM +0300, Alexander Shishkin wrote:
>> For pmus that don't support scatter-gather for AUX data in hardware, it
>> might still make sense to implement software double buffering to avoid
>> losing data while the user is reading data out. For this purpose, add
>> a pmu capability that guarantees multiple high-order chunks for the AUX buffer,
>> so that the pmu driver can do switchover tricks.
>
> Please expand this with more detail on how to use this.

Sure.

>> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> ---
>>  include/linux/perf_event.h  |  1 +
>>  kernel/events/ring_buffer.c | 15 ++++++++++++++-
>>  2 files changed, 15 insertions(+), 1 deletion(-)
>> 
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index fe10bf6f94..1e7b659b49 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -172,6 +172,7 @@ struct perf_event;
>>   */
>>  #define PERF_PMU_CAP_NO_INTERRUPT		0x01
>>  #define PERF_PMU_CAP_AUX_NO_SG			0x02
>> +#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF		0x04
>>  
>>  /**
>>   * struct pmu - generic performance monitoring unit
>> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>> index d10919ca42..f5ee3669f8 100644
>> --- a/kernel/events/ring_buffer.c
>> +++ b/kernel/events/ring_buffer.c
>> @@ -286,9 +286,22 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
>>  	if (!has_aux(event))
>>  		return -ENOTSUPP;
>>  
>> -	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
>> +	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
>>  		order = get_order(nr_pages * PAGE_SIZE);
>>  
>> +		/*
>> +		 * PMU requests more than one contiguous chunk of memory
>> +		 * for SW double buffering
>> +		 */
>> +		if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) &&
>> +		    !overwrite) {
>> +			if (!order)
>> +				return -EINVAL;
>> +
>> +			order--;
>> +		}
>> +	}
>
> In particular this looks like it will allocate double the total number
> of pages and 'lose' half of them. There is no corresponding code in the
> free path to collect them.

This code caps the biggest high-order allocation at half of the total
requested size. Then, when I allocate the high-order chunks, I do a
split_page() on them, and everywhere else in the code they are treated
as individual pages, including on the free path. So this patch has no
implications for freeing. Is this your concern?
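
Roughly, the allocation path then looks like this (a simplified sketch,
not the patch code verbatim; rb_alloc_aux_page() and max_order are
illustrative names):

static struct page *rb_alloc_aux_page(int node, int order, int max_order)
{
	struct page *page;

	/* no single chunk larger than half of the total request */
	if (order > max_order)
		order = max_order;

	do {
		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, order);
	} while (!page && order--);	/* fall back to smaller chunks */

	/*
	 * Split the high-order chunk into individual pages, so that the
	 * free path can treat every page uniformly with free_page(),
	 * no matter what order it was originally allocated with.
	 */
	if (page && order > 0)
		split_page(page, order);

	return page;
}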

Regards,
--
Alex

* Re: [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams
  2014-09-08  7:02   ` Peter Zijlstra
@ 2014-09-08 11:16     ` Alexander Shishkin
  2014-09-08 11:34       ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Alexander Shishkin @ 2014-09-08 11:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Aug 20, 2014 at 03:35:59PM +0300, Alexander Shishkin wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>> 
>> This patch introduces "AUX space" in the perf mmap buffer, intended for
>> exporting high bandwidth data streams to userspace, such as instruction
>> flow traces.
>> 
>> AUX space is a ring buffer, defined by aux_{offset,size} fields in the
>> user_page structure, and read/write pointers aux_{head,tail}, which abide
>> by the same rules as data_* counterparts of the main perf buffer.
>> 
>> In order to allocate/mmap AUX, userspace needs to set up aux_offset to
>> such an offset that will be greater than data_offset+data_size and
>> aux_size to be the desired buffer size. Both need to be page aligned.
>> Then, same aux_offset and aux_size should be passed to mmap() call and
>> if everything adds up, you should have an AUX buffer as a result.
>> 
>> Pages that are mapped into this buffer also come out of user's mlock
>> rlimit plus perf_event_mlock_kb allowance.
>> 
>> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> ---
>
>> +void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
>> +{
>> +	struct perf_event *iter;
>> +	int pg;
>> +
>> +	if (rb->aux_priv) {
>> +		/* disable all potential writers before freeing */
>> +		rcu_read_lock();
>> +		list_for_each_entry_rcu(iter, &rb->event_list, rb_entry)
>> +			perf_event_disable(iter);
>> +		rcu_read_unlock();
>
> Hmm, I cannot remember this from the last time; and it's not explained
> why this was added.
>
> This would change semantics between munmap of a buffer with and without
> AUX bits in. 

Indeed, but this seems fair: if an event is generating AUX data while we
unmap the AUX buffer, there is no reason to keep it on, because chances
are, AUX data is the only thing this event is generating. The
alternative would be to have a separate callback for this, but then
again, that would race with other pmu callbacks, since it would be
called from munmap() path. I should add an elaborate comment about this.

Does this sound reasonable?

Regards,
--
Alex
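
For reference, the allocation/mmap sequence described in the quoted
changelog looks roughly like this from the userspace side (a sketch;
DATA_PAGES and AUX_PAGES are illustrative constants, attr setup and
error handling are omitted):

	struct perf_event_attr attr = { /* ... event configuration ... */ };
	size_t psz = sysconf(_SC_PAGESIZE);
	int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

	/* the user page and the regular data buffer come first */
	struct perf_event_mmap_page *up =
		mmap(NULL, (1 + DATA_PAGES) * psz, PROT_READ | PROT_WRITE,
		     MAP_SHARED, fd, 0);

	/* AUX area: offset past data_offset+data_size, page aligned,
	 * then mmap()ed from the same fd at that same offset */
	up->aux_offset = (1 + DATA_PAGES) * psz;
	up->aux_size   = AUX_PAGES * psz;
	void *aux = mmap(NULL, up->aux_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, up->aux_offset);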

* Re: [PATCH v4 04/22] perf: Add a capability for AUX_NO_SG pmus to do software double buffering
  2014-09-08 11:07     ` Alexander Shishkin
@ 2014-09-08 11:31       ` Peter Zijlstra
  0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-08 11:31 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Mon, Sep 08, 2014 at 02:07:22PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > In particular this looks like it will allocate double the total number
> > of pages and 'lose' half of them. There is no corresponding code in the
> > free path to collect them.
> 
> This code caps the biggest high-order allocation at half of the total
> requested size. Then, when I allocate the high-order chunks, I do a
> split_page() on them, and everywhere else in the code they are treated
> as individual pages, including on the free path. So this patch has no
> implications for freeing. Is this your concern?

Ah right. I completely misread the patch. Sorry for the noise.

* Re: [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams
  2014-09-08 11:16     ` Alexander Shishkin
@ 2014-09-08 11:34       ` Peter Zijlstra
  2014-09-08 12:55         ` Alexander Shishkin
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-08 11:34 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Mon, Sep 08, 2014 at 02:16:56PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Wed, Aug 20, 2014 at 03:35:59PM +0300, Alexander Shishkin wrote:
> >> From: Peter Zijlstra <peterz@infradead.org>
> >> 
> >> This patch introduces "AUX space" in the perf mmap buffer, intended for
> >> exporting high bandwidth data streams to userspace, such as instruction
> >> flow traces.
> >> 
> >> AUX space is a ring buffer, defined by aux_{offset,size} fields in the
> >> user_page structure, and read/write pointers aux_{head,tail}, which abide
> >> by the same rules as data_* counterparts of the main perf buffer.
> >> 
> >> In order to allocate/mmap AUX, userspace needs to set up aux_offset to
> >> such an offset that will be greater than data_offset+data_size and
> >> aux_size to be the desired buffer size. Both need to be page aligned.
> >> Then, same aux_offset and aux_size should be passed to mmap() call and
> >> if everything adds up, you should have an AUX buffer as a result.
> >> 
> >> Pages that are mapped into this buffer also come out of user's mlock
> >> rlimit plus perf_event_mlock_kb allowance.
> >> 
> >> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> >> ---
> >
> >> +void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
> >> +{
> >> +	struct perf_event *iter;
> >> +	int pg;
> >> +
> >> +	if (rb->aux_priv) {
> >> +		/* disable all potential writers before freeing */
> >> +		rcu_read_lock();
> >> +		list_for_each_entry_rcu(iter, &rb->event_list, rb_entry)
> >> +			perf_event_disable(iter);
> >> +		rcu_read_unlock();
> >
> > Hmm, I cannot remember this from the last time; and it's not explained
> > why this was added.
> >
> > This would change semantics between munmap of a buffer with and without
> > AUX bits in. 
> 
> Indeed, but this seems fair: if an event is generating AUX data while we
> unmap the AUX buffer, there is no reason to keep it on, because chances
> are, AUX data is the only thing this event is generating. The
> alternative would be to have a separate callback for this, but then
> again, that would race with other pmu callbacks, since it would be
> called from munmap() path. I should add an elaborate comment about this.
> 
> Does this sound reasonable?

Well, the same is true for regular buffers: if you munmap() them while
in use, the events are pointless.

Still we keep them enabled, we simply skip generating output.

I would much like not to create deviating behaviour if you don't
absolutely have to. In both cases it's an explicit user request, so you
don't need to go hold hands. He did it, he gets to keep whatever pieces
it generates.

* Re: [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT
  2014-09-05 13:34         ` Mathieu Poirier
@ 2014-09-08 11:55           ` Alexander Shishkin
  2014-09-08 13:08             ` Michael Williams
  2014-09-08 13:29             ` Al Grant
  0 siblings, 2 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-09-08 11:55 UTC (permalink / raw)
  To: Mathieu Poirier, Peter Zijlstra
  Cc: Pawel Moll, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, kan.liang, Michael Williams, ralf,
	Al Grant, Deepak Saxena

Mathieu Poirier <mathieu.poirier@linaro.org> writes:

> On 4 September 2014 02:26, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, Sep 02, 2014 at 02:18:16PM -0600, Mathieu Poirier wrote:
>>> Pawel, many thanks for looping me in.
>>>
>>> I am definitely not a perf-internal guru and as such won't be able to
>>> comment on the implementation.  On the flip side it is easy for me to see
>>> how the work on coresight done at Linaro can be made to tie in with what
>>> Alexander is proposing.  Albeit not at the top of the priority list at this
>>> time, integration with perf (and ftrace) is definitely on the roadmap.
>>>
>>> Pawel is correct in his statement that Linaro's work in HW trace decoding
>>> is (currently) mainly focused on processor tracing but that will change
>>> when we have the basic infrastructure upstreamed.
>>>
>>> Last but not least it would be interesting to have more information on the
>>> "sideband data".  With coresight we have something called "metadata", also
>>> related to how the trace source was configured and instrumental to proper
>>> trace decoding.  I'm pretty sure we are facing the same problems.
>>
>> So we use the sideband or AUX data stream to export the 'big' data
>> stream generated by the CPU in an opaque manner. For every AUX data
>> block 'posted' we issue an event into the regular data buffer that
>> describes it.
>
> In the context of "describe it" written above, what kind of
> information one would typically find in that description?

It's like "got a chunk of AUX data in the AUX buffer, starting at offset
$X, length $Y".

>>
>> I was assuming that both ARM and MIPS would generate a single data
>> stream as well. So please do tell more about your meta-data; is that a
>> one time thing or a second continuous stream of data, albeit smaller
>> than the main stream?
>
> Coresight does indeed generate a single stream of compressed data.
> Depending on the tracing specifics that were requested by the use case
> (trace engine configuration) the format of the packets in the
> compressed stream will change.  Since the compressed stream itself
> doesn't carry clues about the formatting information, knowledge of how
> the trace engine was configured is mandatory for the proper decoding
> of the trace stream.

Ok, in perf the trace configuration would be part of 'session'
information, so the way the tracing was configured by userspace will be
saved to the resulting trace file (perf.data) by the userspace.
We have that with Intel PT as well.

> Metadata refer to exactly that - the configuration of the trace
> engine.  It has to be somehow lumped with the trace stream for off
> target analysis.

What we call sideband data in our code is more like runtime metadata,
such as executable mappings (so that you know what is mapped to which
addresses, iirc ETM/PTM also deals in virtual addresses, so you'll need
this information to make sense of the trace, right?) and context
switches.

One of the great things about perf here is that it provides all this
information practically for free.

>>
>> The way I read your explanation it is a one time blob generated once you
>> setup the hardware.
>
> Correct.
>
>> I suppose we could either dump it once into the
>> normal data stream or maybe dump it once every time we generate an AUX
>> buffer event into the normal data stream -- if its not too big.
>
> Right, there is a set of meta-data to be generated with each trace
> run.  With the current implementation a "trace run" pertains to all
> the information collected between the beginning and end of a trace
> scenario.  Future work involve triggering a DMA transfer of the full
> coresight buffer to a kernel memory area, something that is probably
> close to the "buffer event" you are referring to.

Again correct me if I'm wrong, but the TMC(?) controller can be
configured to direct ETM/PTM output right into system memory by means of
a scatter-gather table. This is what we call AUX area, it's basically a
circular buffer with trace data. Trace output is sent to the system
memory, which is also mmap()ed to the userspace tracing tool (perf), so
that it can capture it in real time. Well, that's one of the scenarios.

Regards,
--
Alex

* Re: [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams
  2014-09-08 11:34       ` Peter Zijlstra
@ 2014-09-08 12:55         ` Alexander Shishkin
  2014-09-08 13:12           ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Alexander Shishkin @ 2014-09-08 12:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Sep 08, 2014 at 02:16:56PM +0300, Alexander Shishkin wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Wed, Aug 20, 2014 at 03:35:59PM +0300, Alexander Shishkin wrote:
>> >> From: Peter Zijlstra <peterz@infradead.org>
>> >> 
>> >> This patch introduces "AUX space" in the perf mmap buffer, intended for
>> >> exporting high bandwidth data streams to userspace, such as instruction
>> >> flow traces.
>> >> 
>> >> AUX space is a ring buffer, defined by aux_{offset,size} fields in the
>> >> user_page structure, and read/write pointers aux_{head,tail}, which abide
>> >> by the same rules as data_* counterparts of the main perf buffer.
>> >> 
>> >> In order to allocate/mmap AUX, userspace needs to set up aux_offset to
>> >> such an offset that will be greater than data_offset+data_size and
>> >> aux_size to be the desired buffer size. Both need to be page aligned.
>> >> Then, same aux_offset and aux_size should be passed to mmap() call and
>> >> if everything adds up, you should have an AUX buffer as a result.
>> >> 
>> >> Pages that are mapped into this buffer also come out of user's mlock
>> >> rlimit plus perf_event_mlock_kb allowance.
>> >> 
>> >> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> >> ---
>> >
>> >> +void rb_free_aux(struct ring_buffer *rb, struct perf_event *event)
>> >> +{
>> >> +	struct perf_event *iter;
>> >> +	int pg;
>> >> +
>> >> +	if (rb->aux_priv) {
>> >> +		/* disable all potential writers before freeing */
>> >> +		rcu_read_lock();
>> >> +		list_for_each_entry_rcu(iter, &rb->event_list, rb_entry)
>> >> +			perf_event_disable(iter);
>> >> +		rcu_read_unlock();
>> >
>> > Hmm, I cannot remember this from the last time; and it's not explained
>> > why this was added.
>> >
>> > This would change semantics between munmap of a buffer with and without
>> > AUX bits in. 
>> 
>> Indeed, but this seems fair: if an event is generating AUX data while we
>> unmap the AUX buffer, there is no reason to keep it on, because chances
>> are, AUX data is the only thing this event is generating. The
>> alternative would be to have a separate callback for this, but then
>> again, that would race with other pmu callbacks, since it would be
>> called from munmap() path. I should add an elaborate comment about this.
>> 
>> Does this sound reasonable?
>
> Well, the same is true for regular buffers: if you munmap() them while
> in use, the events are pointless.
>
> Still we keep them enabled, we simply skip generating output.
>
> I would much like not to create deviating behaviour if you don't
> absolutely have to. In both cases it's an explicit user request, so you
> don't need to go hold hands. He did it, he gets to keep whatever pieces
> it generates.

Fair enough. Then I'd like to disable the ACTIVE ones before freeing the AUX
stuff and then re-enable them, since perf_event_{en,dis}able() already
provide the convenient cross-cpu calls, which would also avoid
concurrency between pmu::{add,del} callbacks and this unmap path. Makes
sense?

Regards,
--
Alex

* RE: [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT
  2014-09-08 11:55           ` Alexander Shishkin
@ 2014-09-08 13:08             ` Michael Williams
  2014-09-08 13:29             ` Al Grant
  1 sibling, 0 replies; 61+ messages in thread
From: Michael Williams @ 2014-09-08 13:08 UTC (permalink / raw)
  To: Alexander Shishkin, Mathieu Poirier, Peter Zijlstra
  Cc: Pawel Moll, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, kan.liang, ralf, Al Grant,
	Deepak Saxena


Alexander Shishkin wrote:
> Mathieu Poirier <mathieu.poirier@linaro.org> writes:
>
>> On 4 September 2014 02:26, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Tue, Sep 02, 2014 at 02:18:16PM -0600, Mathieu Poirier wrote:
>>>> Pawel, many thanks for looping me in.
>>>>
>>>> I am definitely not a perf-internal guru and as such won't be able to
>>>> comment on the implementation.  On the flip side it is easy for me to
>>>> see how the work on coresight done at Linaro can be made to tie in
>>>> with what Alexander is proposing.  Albeit not at the top of the priority
>>>> list at this time, integration with perf (and ftrace) is definitely
>>>> on the roadmap.
>>>>
>>>> Pawel is correct in his statement that Linaro's work in HW trace
>>>> decoding is (currently) mainly focused on processor tracing but that
>>>> will change when we have the basic infrastructure upstreamed.
>>>>
>>>> Last but not least it would be interesting to have more information
>>>> on the "sideband data".  With coresight we have something called
>>>> "metadata", also related to how the trace source was configured and
>>>> instrumental to proper trace decoding.  I'm pretty sure we are facing
>>>> the same problems.
>>>
>>> So we use the sideband or AUX data stream to export the 'big' data
>>> stream generated by the CPU in an opaque manner. For every AUX data
>>> block 'posted' we issue an event into the regular data buffer that
>>> describes it.
>>
>> In the context of "describe it" written above, what kind of
>> information one would typically find in that description?
>
> It's like "got a chunk of AUX data in the AUX buffer, starting at offset
> $X, length $Y".
>
>>>
>>> I was assuming that both ARM and MIPS would generate a single data
>>> stream as well. So please do tell more about your meta-data; is that
>>> a one time thing or a second continuous stream of data, albeit
>>> smaller than the main stream?
>>
>> Coresight does indeed generate a single stream of compressed data.
>> Depending on the tracing specifics that were requested by the use case
>> (trace engine configuration) the format of the packets in the
>> compressed stream will change.  Since the compressed stream itself
>> doesn't carry clues about the formatting information, knowledge of how
>> the trace engine was configured is mandatory for the proper decoding
>> of the trace stream.
>
> Ok, in perf the trace configuration would be part of 'session'
> information, so the way the tracing was configured by userspace will be
> saved to the resulting trace file (perf.data) by the userspace.
> We have that with Intel PT as well.
>
>> Metadata refer to exactly that - the configuration of the trace
>> engine.  It has to be somehow lumped with the trace stream for off
>> target analysis.
>
> What we call sideband data in our code is more like runtime metadata,
> such as executable mappings (so that you know what is mapped to which
> addresses, iirc ETM/PTM also deals in virtual addresses, so you'll need
> this information to make sense of the trace, right?) and context
> switches.
>
> One of the great things about perf here is that it provides all this
> information practically for free.
>
>>>
>>> The way I read your explanation it is a one time blob generated once
>>> you setup the hardware.
>>
>> Correct.
>>
>>> I suppose we could either dump it once into the normal data stream or
>>> maybe dump it once every time we generate an AUX buffer event into
>>> the normal data stream -- if its not too big.
>>
>> Right, there is a set of meta-data to be generated with each trace
>> run.  With the current implementation a "trace run" pertains to all
>> the information collected between the beginning and end of a trace
>> scenario.  Future work involve triggering a DMA transfer of the full
>> coresight buffer to a kernel memory area, something that is probably
>> close to the "buffer event" you are referring to.
>
> Again correct me if I'm wrong, but the TMC(?) controller can be
> configured to direct ETM/PTM output right into system memory by means of
> a scatter-gather table. This is what we call AUX area, it's basically a
> circular buffer with trace data. Trace output is sent to the system
> memory, which is also mmap()ed to the userspace tracing tool (perf), so
> that it can capture it in real time. Well, that's one of the scenarios.

Correct. However there are two provisos:

Firstly, not all systems will have the Trace Memory Controller (TMC) in the Embedded Trace Router (ETR) configuration that can write directly to system memory. Many systems have a dedicated Embedded Trace Buffer (ETB). This is a dedicated SRAM for collecting trace that is not memory-mapped. It can only be accessed by the driver. Getting the data out is quick compared to reconstructing the trace. Systems can have a combination of multiple ETRs, ETBs, ...

Secondly, ETR/ETB etc. might contain multiple traces from different sources and different processors multiplexed together with the Trace Wrapping Protocol (TWP). For security reasons you might decide to unwrap this in privileged mode code, not in userspace. Again this is quick compared to reconstructing the trace.

So you might need to support copying the data out into a userspace buffer.

Mike.
--
Michael Williams, Principal Engineer, ARM Limited, Cambridge UK
ARM: The Architecture for the Digital World http://www.arm.com



* Re: [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams
  2014-09-08 12:55         ` Alexander Shishkin
@ 2014-09-08 13:12           ` Peter Zijlstra
  2014-10-06  9:08             ` Alexander Shishkin
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-08 13:12 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Mon, Sep 08, 2014 at 03:55:11PM +0300, Alexander Shishkin wrote:

> Fair enough. Then I'd like to disable the ACTIVE ones before freeing AUX
> stuff and then re-enabling them since perf_event_{en,dis}able() already
> provide the convenient cross-cpu calls, which would also avoid
> concurrency between pmu::{add,del} callbacks and this unmap path. Makes
> sense?

But why? The buffer stuff is RCU-freed, so if the hardware observes
pages and does get_page_unless_zero() on them it's good. The memory will
not be freed from underneath the hardware writer because of the
get_page().

Then when the buffer is full and we 'swap', we'll find there is no next
buffer. At that point we can decline to provide a new buffer, effectively
stopping the hardware writes, and release the old buffer, freeing the
memory.



* RE: [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT
  2014-09-08 11:55           ` Alexander Shishkin
  2014-09-08 13:08             ` Michael Williams
@ 2014-09-08 13:29             ` Al Grant
  1 sibling, 0 replies; 61+ messages in thread
From: Al Grant @ 2014-09-08 13:29 UTC (permalink / raw)
  To: Alexander Shishkin, Mathieu Poirier, Peter Zijlstra
  Cc: Pawel Moll, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, kan.liang, Michael Williams, ralf,
	Deepak Saxena

> Ok, in perf the trace configuration would be part of 'session'
> information, so the way the tracing was configured by userspace will
> be
> saved to the resulting trace file (perf.data) by the userspace.
> We have that with Intel PT as well.

For ETM, in principle, those aspects of the ETM configuration that
affect packet-level decode might change during a trace session (you'd
have to disable the ETM while you did that, but you might still be
capturing trace).  The decoder would then need to switch its own
model of the configuration at exactly the right time.  The most
robust way I could think of for doing that (given that timestamps
might not be precise enough) was to switch the ETM to use a new
trace source identifier.  But there also needs to be a sideband
message that the decoder can listen to - a "trace configuration
changed" message.  That could go in perf.data somewhere.

I don't know if you have that issue for PT.

You might decide that reconfiguring a trace source during a trace
capture is not a valid use case...


> > Metadata refer to exactly that - the configuration of the trace
> > engine.  It has to be somehow lumped with the trace stream for off
> > target analysis.
>
> What we call sideband data in our code is more like runtime
> metadata,
> such as executable mappings (so that you know what is mapped to
> which
> addresses, iirc ETM/PTM also deals in virtual addresses, so you'll
> need
> this information to make sense of the trace, right?) and context
> switches.

Yes, that's exactly the information ETM/PTM needs too - a dynamic
mapping from virtual address to opcodes.

For context switches, if the kernel is set up to use the hardware
context id register (CONTEXTIDR), it might not be necessary to have
a message on every context switch, only on changes to the address
space maps.  So you get less invasive tracing (because you're not
generating a perf event every context switch) at the expense of a
slight fixed increase in context switch time due to the kernel
updating CONTEXTIDR.

There's a lot more about this in

http://people.linaro.org/~mathieu.poirier/coresight/cs-decode.pdf

where I've used the term "static metadata" for the basic trace
configuration vs. "dynamic metadata" for the data that changes as
the OS executes whatever workload you're tracing.  Both are sideband
data as far as ETM is concerned as they aren't in the ETM stream.


> One of the great things about perf here is that it provides all this
> information practically for free.

That's good to know!

Tracing of JITted code also creates challenges and it would be
interesting to hear how PT and perf solve that problem.
The trace decoder needs the address->opcode mapping, but as the
opcodes have only a transient lifetime now, they need to be
captured somewhere - either in the perf stream itself or in a
side channel.  Hopefully, you only have to do that when you JIT
a block, and not every time you execute the cached code block,
so it's not so invasive as to negate the benefits of JIT.

Al



* Re: [PATCH v4 06/22] perf: Redirect output from inherited events to parents
  2014-08-20 12:36 ` [PATCH v4 06/22] perf: Redirect output from inherited events to parents Alexander Shishkin
@ 2014-09-08 15:26   ` Peter Zijlstra
  2014-09-09  9:54     ` Alexander Shishkin
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-08 15:26 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Wed, Aug 20, 2014 at 03:36:03PM +0300, Alexander Shishkin wrote:
> In order to collect AUX data from an inherited event, we can redirect its
> output to parent's ring buffer if possible (they must be cpu affine). This
> patch adds set_output() to the inheritance path.
> 
> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---
>  kernel/events/core.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 67f857ab56..e36478564c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7962,6 +7962,12 @@ inherit_event(struct perf_event *parent_event,
>  		= parent_event->overflow_handler_context;
>  
>  	/*
> +	 * Direct child's output to parent's ring buffer (if any)
> +	 */
> +	if (parent_event->cpu != -1)
> +		(void)perf_event_set_output(child_event, parent_event);
> +
> +	/*
>  	 * Precalculate sample_data sizes
>  	 */
>  	perf_event__header_size(child_event);

Uhm, nope, see perf_output_begin(), it always redirects output to parent
events.

* Re: [PATCH v4 07/22] perf: Add api for pmus to write to AUX space
  2014-08-20 12:36 ` [PATCH v4 07/22] perf: Add api for pmus to write to AUX space Alexander Shishkin
@ 2014-09-08 16:06   ` Peter Zijlstra
  2014-09-08 16:18     ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-08 16:06 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Wed, Aug 20, 2014 at 03:36:04PM +0300, Alexander Shishkin wrote:
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index f5ee3669f8..3b3a915767 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -242,6 +242,90 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
>  	spin_lock_init(&rb->event_lock);
>  }
>  
> +void *perf_aux_output_begin(struct perf_output_handle *handle,
> +			    struct perf_event *event)
> +{
> +	unsigned long aux_head, aux_tail;
> +	struct ring_buffer *rb;
> +
> +	rb = ring_buffer_get(event);
> +	if (!rb)
> +		return NULL;

Yeah, no need to muck with ring_buffer_get() here, do as
perf_output_begin()/end() and keep the RCU section over the entire
output. That avoids the atomic and allows you to always use the parent
event.

> +
> +	if (!rb_has_aux(rb))
> +		goto err;
> +
> +	/*
> +	 * Nesting is not supported for AUX area, make sure nested
> +	 * writers are caught early
> +	 */
> +	if (WARN_ON_ONCE(local_xchg(&rb->aux_nest, 1)))
> +		goto err;
> +
> +	aux_head = local_read(&rb->aux_head);
> +	aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
> +
> +	handle->rb = rb;
> +	handle->event = event;
> +	handle->head = aux_head;
> +	if (aux_head - aux_tail < perf_aux_size(rb))
> +		handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
> +	else
> +		handle->size = 0;
> +
> +	if (!handle->size) {
> +		event->pending_disable = 1;
> +		event->hw.state = PERF_HES_STOPPED;
> +		perf_output_wakeup(handle);
> +		local_set(&rb->aux_nest, 0);
> +		goto err;
> +	}

This needs a comment on the /* A */ barrier; see the comments in
perf_output_put_handle() and perf_output_begin(). 

I'm not sure we can use the same control dependency that we do for the
normal buffers since its the hardware doing the stores, not the regular
instruction stream.

Please document the order in which the hardware writes vs this software
setup and explain the ordering guarantees provided by the hardware wrt
regular software.

> +	return handle->rb->aux_priv;
> +
> +err:
> +	ring_buffer_put(rb);
> +	handle->event = NULL;
> +
> +	return NULL;
> +}
> +
> +void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
> +			 bool truncated)
> +{
> +	struct ring_buffer *rb = handle->rb;
> +
> +	local_add(size, &rb->aux_head);
> +
> +	smp_wmb();

An uncommented barrier is a bug.

> +	rb->user_page->aux_head = local_read(&rb->aux_head);
> +
> +	perf_output_wakeup(handle);
> +	handle->event = NULL;
> +
> +	local_set(&rb->aux_nest, 0);
> +	ring_buffer_put(rb);
> +}

Also, should perf_aux_output_end() not generate an event into the
regular buffer?

* Re: [PATCH v4 07/22] perf: Add api for pmus to write to AUX space
  2014-09-08 16:06   ` Peter Zijlstra
@ 2014-09-08 16:18     ` Peter Zijlstra
  0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-08 16:18 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Mon, Sep 08, 2014 at 06:06:24PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 20, 2014 at 03:36:04PM +0300, Alexander Shishkin wrote:
> > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> > index f5ee3669f8..3b3a915767 100644
> > --- a/kernel/events/ring_buffer.c
> > +++ b/kernel/events/ring_buffer.c
> > @@ -242,6 +242,90 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
> >  	spin_lock_init(&rb->event_lock);
> >  }
> >  
> > +void *perf_aux_output_begin(struct perf_output_handle *handle,
> > +			    struct perf_event *event)
> > +{
> > +	unsigned long aux_head, aux_tail;
> > +	struct ring_buffer *rb;
> > +
> > +	rb = ring_buffer_get(event);
> > +	if (!rb)
> > +		return NULL;
> 
> Yeah, no need to muck with ring_buffer_get() here, do as
> perf_output_begin()/end() and keep the RCU section over the entire
> output. That avoids the atomic and allows you to always use the parent
> event.

Ah, I see what you were trying to do. You were thinking of doing that
perf_aux_output_begin() when setting up the hardware buffer. Then have
the hardware 'fill' it and then call perf_aux_output_end() to commit.

That way you cannot have the RCU section open across the whole thing,
and you do indeed need that refcount -- maybe.

> > +
> > +	if (!rb_has_aux(rb))
> > +		goto err;
> > +
> > +	/*
> > +	 * Nesting is not supported for AUX area, make sure nested
> > +	 * writers are caught early
> > +	 */
> > +	if (WARN_ON_ONCE(local_xchg(&rb->aux_nest, 1)))
> > +		goto err;
> > +
> > +	aux_head = local_read(&rb->aux_head);
> > +	aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
> > +
> > +	handle->rb = rb;
> > +	handle->event = event;
> > +	handle->head = aux_head;
> > +	if (aux_head - aux_tail < perf_aux_size(rb))
> > +		handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
> > +	else
> > +		handle->size = 0;
> > +
> > +	if (!handle->size) {
> > +		event->pending_disable = 1;
> > +		event->hw.state = PERF_HES_STOPPED;
> > +		perf_output_wakeup(handle);
> > +		local_set(&rb->aux_nest, 0);
> > +		goto err;
> > +	}
> 
> This needs a comment on the /* A */ barrier; see the comments in
> perf_output_put_handle() and perf_output_begin(). 
> 
> I'm not sure we can use the same control dependency that we do for the
> normal buffers since its the hardware doing the stores, not the regular
> instruction stream.
> 
> Please document the order in which the hardware writes vs this software
> setup and explain the ordering guarantees provided by the hardware wrt
> regular software.

So given that the hardware will simply not have a pointer to write to
before this completes I think we can say the control dependency is good
enough.

> > +	return handle->rb->aux_priv;
> > +
> > +err:
> > +	ring_buffer_put(rb);
> > +	handle->event = NULL;
> > +
> > +	return NULL;
> > +}
> > +
> > +void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
> > +			 bool truncated)
> > +{
> > +	struct ring_buffer *rb = handle->rb;
> > +
> > +	local_add(size, &rb->aux_head);
> > +
> > +	smp_wmb();
> 
> An uncommented barrier is a bug.

Still true, and we need clarification on whether this wmb (a NOP on
x86) is sufficient to order against the hardware or if we need something
stronger someplace, like a full MFENCE or even SYNC.

Arguably we'd want to place that in the driver before calling this, but
we need to get that clarified etc..
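
For the record, the intended pairing follows the same scheme as
data_{head,tail} of the regular buffer (a sketch; it assumes the
hardware's stores are globally visible by the time the driver calls
perf_aux_output_end(), and userspace would use the equivalent compiler
and CPU barriers):

	/* kernel writer, after the hardware has stopped writing */
	local_add(size, &rb->aux_head);
	smp_wmb();			/* B: order trace data before head */
	rb->user_page->aux_head = local_read(&rb->aux_head);

	/* userspace reader */
	head = ACCESS_ONCE(up->aux_head);
	smp_rmb();			/* C: pairs with B above */
	/* ... consume trace bytes in [tail, head) ... */
	smp_mb();			/* D: finish reads before freeing space */
	up->aux_tail = head;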

* Re: [PATCH v4 08/22] perf: Add AUX record
  2014-08-20 12:36 ` [PATCH v4 08/22] perf: Add AUX record Alexander Shishkin
@ 2014-09-09  8:20   ` Peter Zijlstra
  0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-09  8:20 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, Arnaldo Carvalho de Melo

On Wed, Aug 20, 2014 at 03:36:05PM +0300, Alexander Shishkin wrote:
> When there's new data in the AUX space, output a record indicating its
> offset and size, and whether it was truncated to fit in the ring buffer.

This patch is too late; it should have been before the patch adding
perf_aux_output_*().

I also added acme to cc, he might have wants/needs for the data format I
suppose.

> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---
>  include/uapi/linux/perf_event.h | 16 ++++++++++++++++
>  kernel/events/core.c            | 39 +++++++++++++++++++++++++++++++++++++++
>  kernel/events/internal.h        |  3 +++
>  kernel/events/ring_buffer.c     |  1 +
>  4 files changed, 59 insertions(+)
> 
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 7e0967c0f5..c022c3d756 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -733,6 +733,22 @@ enum perf_event_type {
>  	 */
>  	PERF_RECORD_MMAP2			= 10,
>  
> +	/*
> +	 * Records that new data landed in the AUX buffer part.
> +	 *
> +	 * struct {
> +	 * 	struct perf_event_header	header;
> +	 *
> +	 * 	u64				aux_offset;
> +	 * 	u64				aux_size;
> +	 *	u8				truncated;
> +	 *	u8				reserved[7];

Creative.. do we want a u64 flags instead? Is there any chance at all
we're going to fill out these other bits?

> +	 *	u64				id;
> +	 *	u64				stream_id;

You probably should have included a struct sample_id there instead.

> +	 * };
> +	 */
> +	PERF_RECORD_AUX				= 11,
> +
>  	PERF_RECORD_MAX,			/* non-ABI */
>  };
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 9fc9a7583b..0251983018 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5542,6 +5542,45 @@ void perf_event_mmap(struct vm_area_struct *vma)
>  	perf_event_mmap_event(&mmap_event);
>  }
>  
> +void perf_event_aux_event(struct perf_event *event, unsigned long head,
> +			  unsigned long size, bool truncated)
> +{
> +	struct perf_output_handle handle;
> +	struct perf_sample_data sample;
> +	struct perf_aux_event {
> +		struct perf_event_header	header;
> +		u64				offset;
> +		u64				size;
> +		u8				truncated;
> +		u8				reserved[7];
> +		u64				id;
> +		u64				stream_id;
> +	} rec = {
> +		.header = {
> +			.type = PERF_RECORD_AUX,
> +			.misc = 0,
> +			.size = sizeof(rec),
> +		},
> +		.offset		= head,
> +		.size		= size,
> +		.truncated	= truncated,
> +		.id		= primary_event_id(event),
> +		.stream_id	= event->id,
> +	};
> +	int ret;
> +
> +	perf_event_header__init_id(&rec.header, &sample, event);

Oh hey, you do actually do the struct sample_id here, so why then also
include the id/stream_id again?

> +	ret = perf_output_begin(&handle, event, rec.header.size);
> +
> +	if (ret)
> +		return;
> +
> +	perf_output_put(&handle, rec);
> +	perf_event__output_id_sample(event, &handle, &sample);
> +
> +	perf_output_end(&handle);
> +}

* Re: [PATCH v4 09/22] perf: Support overwrite mode for AUX area
  2014-08-20 12:36 ` [PATCH v4 09/22] perf: Support overwrite mode for AUX area Alexander Shishkin
@ 2014-09-09  8:33   ` Peter Zijlstra
  2014-09-09  8:44   ` Peter Zijlstra
  1 sibling, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-09  8:33 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Wed, Aug 20, 2014 at 03:36:06PM +0300, Alexander Shishkin wrote:
> This adds support for overwrite mode in the AUX area, which means "keep
> collecting data till you're stopped". It does not depend on data buffer's
> overwrite mode, so that it doesn't lose sideband data that is instrumental
> for processing AUX data.

This patch is out of order as well; it would've made sense right after
the other AUX buffer patch. Also, it's not specified (and I can't find it
in the patch) how it _is_ set. The Changelog above only says how it is
not.


* Re: [PATCH v4 09/22] perf: Support overwrite mode for AUX area
  2014-08-20 12:36 ` [PATCH v4 09/22] perf: Support overwrite mode for AUX area Alexander Shishkin
  2014-09-09  8:33   ` Peter Zijlstra
@ 2014-09-09  8:44   ` Peter Zijlstra
  2014-09-09  9:40     ` Alexander Shishkin
  1 sibling, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-09  8:44 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Wed, Aug 20, 2014 at 03:36:06PM +0300, Alexander Shishkin wrote:

> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 925f369947..5006caba63 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c

> @@ -294,9 +295,22 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
>  			 bool truncated)
>  {
>  	struct ring_buffer *rb = handle->rb;
> +	unsigned long aux_head;
>  
> +	aux_head = local_read(&rb->aux_head);
> +
> +	if (rb->aux_overwrite) {
> +		local_set(&rb->aux_head, size);
> +
> +		/*
> +		 * Send a RECORD_AUX with size==0 to communicate aux_head
> +		 * of this snapshot to userspace
> +		 */
> +		perf_event_aux_event(handle->event, size, 0, truncated);

Humm.. why not write a 'normal' AUX record?

Also, you didn't mention this in your Changelog _at_all_.

> +	} else {
> +		local_add(size, &rb->aux_head);
> +		perf_event_aux_event(handle->event, aux_head, size, truncated);
> +	}
>  
>  	smp_wmb();
>  	rb->user_page->aux_head = local_read(&rb->aux_head);

* Re: [PATCH v4 11/22] perf: add ITRACE_START record to indicate that tracing has started
  2014-08-20 12:36 ` [PATCH v4 11/22] perf: add ITRACE_START record to indicate that tracing has started Alexander Shishkin
@ 2014-09-09  9:08   ` Peter Zijlstra
  2014-09-09  9:33     ` Alexander Shishkin
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-09  9:08 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Wed, Aug 20, 2014 at 03:36:08PM +0300, Alexander Shishkin wrote:
> For counters such as instruction tracing, it is useful for the decoder
> to know which tasks are running when the event is first scheduled in,
> before the first sched_switch.
> 
> To single out such instruction tracing pmus, this patch also introduces
> an ITRACE PMU capability.

You forgot to tell why. Also explain why.

The next context switch event will also tell you the previous task. So
this seems like superfluous information. Also if you really want this
you could simply emit an empty AUX record I suppose, no need to create
yet another record type.

* Re: [PATCH v4 16/22] perf: Add rb_{alloc,free}_kernel api
  2014-08-20 12:36 ` [PATCH v4 16/22] perf: Add rb_{alloc,free}_kernel api Alexander Shishkin
@ 2014-09-09  9:09   ` Peter Zijlstra
  0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-09  9:09 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Wed, Aug 20, 2014 at 03:36:13PM +0300, Alexander Shishkin wrote:
> Events that generate AUX data can also be created by the kernel. In this
> case, some in-kernel infrastructure is needed to store and copy this data.

Oh, do tell.

You really need to work on these changelogs..

* Re: [PATCH v4 19/22] perf: Add infrastructure for using AUX data in perf samples
  2014-08-20 12:36 ` [PATCH v4 19/22] perf: Add infrastructure for using AUX data in perf samples Alexander Shishkin
@ 2014-09-09  9:11   ` Peter Zijlstra
  0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-09  9:11 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Wed, Aug 20, 2014 at 03:36:16PM +0300, Alexander Shishkin wrote:
> AUX data can be used to annotate other perf events by including it in
> sample records when PERF_SAMPLE_AUX flag is set. In this case, a kernel
> counter is created for each such event and trace data is retrieved
> from it and stored in the perf data stream.
> 
> To this end, new attribute fields are added:
>   * aux_sample_type: specify PMU on which the AUX data generating event
>                      is created;
>   * aux_sample_config: event config (maps to attribute's config field),
>   * aux_sample_size: size of the sample to be written.
> 
> This kernel counter is configured similarly to its "main" event with
> regards to filtering (exclude_{hv,idle,user,kernel}) and enabled state
> (disabled, enable_on_exec) to make sure that we don't get out of context
> AUX traces.
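
(For concreteness, the quoted fields would be filled in roughly like so;
a sketch based only on the descriptions above, with pt_pmu_type standing
in for the AUX-generating PMU's dynamic type id and the size purely
illustrative:)

	struct perf_event_attr attr = { /* ... main event setup ... */ };

	attr.sample_type      |= PERF_SAMPLE_AUX;	/* annotate samples */
	attr.aux_sample_type   = pt_pmu_type;		/* PMU of the AUX source */
	attr.aux_sample_config = 0;			/* that PMU's config */
	attr.aux_sample_size   = 4096;			/* trace bytes per sample */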

WHY? Why would we want to do this? This doesn't tell me why I should
want this or even consider it.

Convince me I want to read the patch.

* Re: [PATCH v4 20/22] perf: Allocate ring buffers for inherited per-task kernel events
  2014-08-20 12:36 ` [PATCH v4 20/22] perf: Allocate ring buffers for inherited per-task kernel events Alexander Shishkin
@ 2014-09-09  9:12   ` Peter Zijlstra
  0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-09  9:12 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Wed, Aug 20, 2014 at 03:36:17PM +0300, Alexander Shishkin wrote:
> When a new event is inherited from a per-task kernel event that has a
> ring buffer, allocate a new buffer for this event so that data from the
> child task is collected and can later be retrieved for sample annotation
> or whatnot.
> 

No.

* Re: [PATCH v4 11/22] perf: add ITRACE_START record to indicate that tracing has started
  2014-09-09  9:08   ` Peter Zijlstra
@ 2014-09-09  9:33     ` Alexander Shishkin
  0 siblings, 0 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-09-09  9:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, acme

Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Aug 20, 2014 at 03:36:08PM +0300, Alexander Shishkin wrote:
>> For counters such as instruction tracing, it is useful for the decoder
>> to know which tasks are running when the event is first scheduled in,
>> before the first sched_switch.
>> 
>> To single out such instruction tracing pmus, this patch also introduces
>> an ITRACE PMU capability.
>
> You forgot to tell why. Also explain why.
>
> The next context switch event will also tell you the previous task. So
> this seems like superfluous information. Also if you really want this
> you could simply emit an empty AUX record I suppose, no need to create
> yet another record type.

Before the first sched_switch we don't actually know what pid/tid is
running, so we can't make sense of the beginning of the trace without
some event reordering trickery. We can use AUX for that as well; I guess
this might be another use case for those flags in the AUX
record. Changelog, got it.

Adrian and Arnaldo cc'ed.

Cheers,
--
Alex

* Re: [PATCH v4 09/22] perf: Support overwrite mode for AUX area
  2014-09-09  8:44   ` Peter Zijlstra
@ 2014-09-09  9:40     ` Alexander Shishkin
  2014-09-09 10:55       ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Alexander Shishkin @ 2014-09-09  9:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Aug 20, 2014 at 03:36:06PM +0300, Alexander Shishkin wrote:
>
>> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>> index 925f369947..5006caba63 100644
>> --- a/kernel/events/ring_buffer.c
>> +++ b/kernel/events/ring_buffer.c
>
>> @@ -294,9 +295,22 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
>>  			 bool truncated)
>>  {
>>  	struct ring_buffer *rb = handle->rb;
>> +	unsigned long aux_head;
>>  
>> +	aux_head = local_read(&rb->aux_head);
>> +
>> +	if (rb->aux_overwrite) {
>> +		local_set(&rb->aux_head, size);
>> +
>> +		/*
>> +		 * Send a RECORD_AUX with size==0 to communicate aux_head
>> +		 * of this snapshot to userspace
>> +		 */
>> +		perf_event_aux_event(handle->event, size, 0, truncated);
>
> Humm.. why not write a 'normal' AUX record?

In this mode the hardware runs as a circular buffer, overwriting old
data, so we don't actually know the size of the snapshot; userspace has
to figure it out later on (based on timestamps, for example). I didn't
want to configure the PMI for this mode to avoid the overhead, but with
the PMI we could keep track of the overwrites and try to infer the
actual snapshot size in the kernel, at least for Intel PT. As far as I
can tell, ARM's scatter-gather trace-to-memory storing block does not
generate interrupts at all.
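
To illustrate, collecting a snapshot in this mode from userspace could
look roughly like this (a sketch with illustrative names; tracing is
assumed to be stopped while we copy):

    #include <stdint.h>
    #include <string.h>

    /*
     * Copy one snapshot out of the circular AUX area. 'aux_head' is the
     * write pointer reported for this snapshot; the buffer may have
     * wrapped, so copy the older part first, then the newer part.
     */
    static void copy_aux_snapshot(const uint8_t *aux_buf, size_t buf_size,
                                  uint64_t aux_head, uint8_t *out)
    {
        size_t wrap = aux_head % buf_size;

        memcpy(out, aux_buf + wrap, buf_size - wrap); /* older data */
        memcpy(out + buf_size - wrap, aux_buf, wrap); /* newer data */
    }

The decoder then works out from the timestamps in the stream where the
valid data actually begins.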

> Also, you didn't mention this in your Changelog _at_all_.

Will do.

Regards,
--
Alex


* Re: [PATCH v4 06/22] perf: Redirect output from inherited events to parents
  2014-09-08 15:26   ` Peter Zijlstra
@ 2014-09-09  9:54     ` Alexander Shishkin
  0 siblings, 0 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-09-09  9:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Aug 20, 2014 at 03:36:03PM +0300, Alexander Shishkin wrote:
>> In order to collect AUX data from an inherited event, we can redirect its
>> output to the parent's ring buffer if possible (they must be CPU affine).
>> This patch adds set_output() to the inheritance path.
>> 
>> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> ---
>>  kernel/events/core.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>> 
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 67f857ab56..e36478564c 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -7962,6 +7962,12 @@ inherit_event(struct perf_event *parent_event,
>>  		= parent_event->overflow_handler_context;
>>  
>>  	/*
>> +	 * Direct child's output to parent's ring buffer (if any)
>> +	 */
>> +	if (parent_event->cpu != -1)
>> +		(void)perf_event_set_output(child_event, parent_event);
>> +
>> +	/*
>>  	 * Precalculate sample_data sizes
>>  	 */
>>  	perf_event__header_size(child_event);
>
> Uhm, nope, see perf_output_begin(), it always redirects output to parent
> events.

Ouch, indeed. We'll just do the same in perf_aux_output_begin().
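
Something along these lines (a sketch, mirroring the redirection that
perf_output_begin() already does for the normal buffer):

    void *perf_aux_output_begin(struct perf_output_handle *handle,
                                struct perf_event *event)
    {
        /*
         * Inherited events have no AUX buffer of their own; resolve
         * to the parent event's ring buffer before writing.
         */
        if (event->parent)
            event = event->parent;

        /* ... then look up event->rb and its AUX area as before ... */
    }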

Regards,
--
Alex


* Re: [PATCH v4 09/22] perf: Support overwrite mode for AUX area
  2014-09-09  9:40     ` Alexander Shishkin
@ 2014-09-09 10:55       ` Peter Zijlstra
  2014-09-09 11:53         ` Alexander Shishkin
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-09 10:55 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang


On Tue, Sep 09, 2014 at 12:40:39PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Wed, Aug 20, 2014 at 03:36:06PM +0300, Alexander Shishkin wrote:
> >
> >> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> >> index 925f369947..5006caba63 100644
> >> --- a/kernel/events/ring_buffer.c
> >> +++ b/kernel/events/ring_buffer.c
> >
> >> @@ -294,9 +295,22 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
> >>  			 bool truncated)
> >>  {
> >>  	struct ring_buffer *rb = handle->rb;
> >> +	unsigned long aux_head;
> >>  
> >> +	aux_head = local_read(&rb->aux_head);
> >> +
> >> +	if (rb->aux_overwrite) {
> >> +		local_set(&rb->aux_head, size);
> >> +
> >> +		/*
> >> +		 * Send a RECORD_AUX with size==0 to communicate aux_head
> >> +		 * of this snapshot to userspace
> >> +		 */
> >> +		perf_event_aux_event(handle->event, size, 0, truncated);
> >
> > Humm.. why not write a 'normal' AUX record?
> 
> In this mode the hardware runs as a circular buffer, overwriting old
> data, so we don't actually know the size of the snapshot; userspace has
> to figure it out later on (based on timestamps, for example). I didn't
> want to configure the PMI for this mode to avoid the overhead, but with
> the PMI we could keep track of the overwrites and try to infer the
> actual snapshot size in the kernel, at least for Intel PT. As far as I
> can tell, ARM's scatter-gather trace-to-memory storing block does not
> generate interrupts at all.

Well, wouldn't the 'size' basically be the entire buffer? All you then
have to provide is the head pointer. Ideally you would also provide a
tail pointer so you know when to stop, but I suppose you can infer that
from the data stream itself? If you can provide the tail, you can indeed
compute the size etc., at which point you don't have to rely on parsing
the stream.





* Re: [PATCH v4 09/22] perf: Support overwrite mode for AUX area
  2014-09-09 10:55       ` Peter Zijlstra
@ 2014-09-09 11:53         ` Alexander Shishkin
  2014-09-09 12:43           ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Alexander Shishkin @ 2014-09-09 11:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Sep 09, 2014 at 12:40:39PM +0300, Alexander Shishkin wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Wed, Aug 20, 2014 at 03:36:06PM +0300, Alexander Shishkin wrote:
>> >
>> >> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>> >> index 925f369947..5006caba63 100644
>> >> --- a/kernel/events/ring_buffer.c
>> >> +++ b/kernel/events/ring_buffer.c
>> >
>> >> @@ -294,9 +295,22 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
>> >>  			 bool truncated)
>> >>  {
>> >>  	struct ring_buffer *rb = handle->rb;
>> >> +	unsigned long aux_head;
>> >>  
>> >> +	aux_head = local_read(&rb->aux_head);
>> >> +
>> >> +	if (rb->aux_overwrite) {
>> >> +		local_set(&rb->aux_head, size);
>> >> +
>> >> +		/*
>> >> +		 * Send a RECORD_AUX with size==0 to communicate aux_head
>> >> +		 * of this snapshot to userspace
>> >> +		 */
>> >> +		perf_event_aux_event(handle->event, size, 0, truncated);
>> >
>> > Humm.. why not write a 'normal' AUX record?
>> 
>> In this mode the hardware runs as a circular buffer, overwriting old
>> data, so we don't actually know the size of the snapshot; userspace has
>> to figure it out later on (based on timestamps, for example). I didn't
>> want to configure the PMI for this mode to avoid the overhead, but with
>> the PMI we could keep track of the overwrites and try to infer the
>> actual snapshot size in the kernel, at least for Intel PT. As far as I
>> can tell, ARM's scatter-gather trace-to-memory storing block does not
>> generate interrupts at all.
>
> Well, wouldn't the 'size' basically be the entire buffer? All you then
> have to provide is the head pointer.

Yes, that's what the code above is doing. We can replace size==0 with
size==$buffer_size to mean the same thing.

> Ideally you would also provide a
> tail pointer so you know when to stop, but I suppose you can infer that
> from the data stream itself?

The tail pointer is the problem I mentioned above, because it's either

  - where we stopped the previous time
  - head+1, if old data is overwritten

and in order to tell the difference, we need an interrupt.
We can infer where the new data starts from the timestamps in the trace
stream, so the decoder can take care of it (and that's how it's done at
the moment).

> If you can provide the tail, you can indeed
> compute the size etc., at which point you don't have to rely on parsing
> the stream.

Ideally there wouldn't be too many adjacent snapshots, so it wouldn't
even be a problem. But yes, if we want to know the tail/size reliably,
we need an interrupt.

Regards,
--
Alex


* Re: [PATCH v4 09/22] perf: Support overwrite mode for AUX area
  2014-09-09 11:53         ` Alexander Shishkin
@ 2014-09-09 12:43           ` Peter Zijlstra
  2014-09-09 13:00             ` Alexander Shishkin
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-09-09 12:43 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang


On Tue, Sep 09, 2014 at 02:53:42PM +0300, Alexander Shishkin wrote:

> We can infer where the new data starts from the timestamps in the trace
> stream, so the decoder can take care of it (and that's how it's done at
> the moment).

So that means the data stream can be read from arbitrary locations,
right? I can imagine not all data streams are always readable like that
(for instance the perf data stream is not).

Does it make sense to have the driver provide head,tail for this mode?
In your case you can simply provide whatever, but drivers where it
matters can ensure consistent data such that the stream remains
recoverable.
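
Concretely, the record could carry both ends of the snapshot, something
like this (field names are only a suggestion):

    struct aux_event {
        struct perf_event_header header; /* PERF_RECORD_AUX */
        __u64 aux_offset; /* tail: where the snapshot starts */
        __u64 aux_size;   /* length; head == aux_offset + aux_size */
        __u64 flags;      /* e.g. truncated/overwrite status */
    };

With both ends known, userspace can compute the snapshot extent without
having to parse the stream first.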



* Re: [PATCH v4 09/22] perf: Support overwrite mode for AUX area
  2014-09-09 12:43           ` Peter Zijlstra
@ 2014-09-09 13:00             ` Alexander Shishkin
  0 siblings, 0 replies; 61+ messages in thread
From: Alexander Shishkin @ 2014-09-09 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Sep 09, 2014 at 02:53:42PM +0300, Alexander Shishkin wrote:
>
>> We can infer where the new data starts from the timestamps in the trace
>> stream, so the decoder can take care of it (and that's how it's done at
>> the moment).
>
> So that means the data stream can be read from arbitrary locations,
> right? I can imagine not all data streams are always readable like that
> (for instance the perf data stream is not).
>
> Does it make sense to have the driver provide head,tail for this mode?
> In your case you can simply provide whatever, but drivers where it
> matters can ensure consistent data such that the stream remains
> recoverable.

Yes, there's no reason not to, and some HW might indeed benefit from
this.

Regards,
--
Alex


* Re: [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams
  2014-09-08 13:12           ` Peter Zijlstra
@ 2014-10-06  9:08             ` Alexander Shishkin
  2014-10-06 16:20               ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Alexander Shishkin @ 2014-10-06  9:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Sep 08, 2014 at 03:55:11PM +0300, Alexander Shishkin wrote:
>
>> Fair enough. Then I'd like to disable the ACTIVE ones before freeing AUX
>> stuff and then re-enabling them since perf_event_{en,dis}able() already
>> provide the convenient cross-cpu calls, which would also avoid
>> concurrency between pmu::{add,del} callbacks and this unmap path. Makes
>> sense?
>
> But why? The buffer stuff is RCU freed, so if the hardware observes
> pages and does get_page_unless_zero() on them its good. The memory will
> not be freed from underneath the hardware writer because of the
> get_page().
>
> Then when the buffer is full and we 'swap', we'll find there is no next
> buffer. At that point we can not provide a new buffer, effectively
> stopping the hardware writes and release the old buffer, freeing the
> memory.

There are several problems with this. Firstly, AUX buffers can be quite
large, which means that we would have to do get_page() on thousands of
pages on every pmu::add, which is a hot path, and put_page() again in
pmu::del.

Secondly, all the sg bookkeeping that the driver keeps in aux_priv needs
to be refcounted. Right now, in the mmap_close path we just free
everything. But if we want to free the aux_pages in pmu::del, we need to
keep a list of these pages around after mmap_close(), and the same goes
for the actual sg tables. I can see a way of doing that on the ring
buffer side (as opposed to the driver side), but are you quite sure we
should go down this road?

Regards,
--
Alex


* Re: [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams
  2014-10-06  9:08             ` Alexander Shishkin
@ 2014-10-06 16:20               ` Peter Zijlstra
  2014-10-06 21:52                 ` Alexander Shishkin
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2014-10-06 16:20 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Mon, Oct 06, 2014 at 12:08:19PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Mon, Sep 08, 2014 at 03:55:11PM +0300, Alexander Shishkin wrote:
> >
> >> Fair enough. Then I'd like to disable the ACTIVE ones before freeing AUX
> >> stuff and then re-enabling them since perf_event_{en,dis}able() already
> >> provide the convenient cross-cpu calls, which would also avoid
> >> concurrency between pmu::{add,del} callbacks and this unmap path. Makes
> >> sense?
> >
> > But why? The buffer stuff is RCU freed, so if the hardware observes
> > pages and does get_page_unless_zero() on them its good. The memory will
> > not be freed from underneath the hardware writer because of the
> > get_page().
> >
> > Then when the buffer is full and we 'swap', we'll find there is no next
> > buffer. At that point we can not provide a new buffer, effectively
> > stopping the hardware writes and release the old buffer, freeing the
> > memory.
> 
> There are several problems with this. Firstly, AUX buffers can be quite
> large, which means that we would have to do get_page() on thousands of
> pages on every pmu::add, which is a hot path, and put_page() again in
> pmu::del.
> 
> Secondly, all the sg bookkeeping that the driver keeps in aux_priv needs
> to be refcounted. Right now, in the mmap_close path we just free
> everything. But if we want to free the aux_pages in pmu::del, we need to
> keep a list of these pages around after mmap_close(), and the same goes
> for the actual sg tables. I can see a way of doing that on the ring
> buffer side (as opposed to the driver side), but are you quite sure we
> should go down this road?

No, and I think I realized this after sending that email, but at the
time I figured there was another way to do it. Of course, now that
several weeks have passed I cannot for the life of me remember what it
was.

Lemme go over these patches again to refresh my mind and maybe I'll
remember.


* Re: [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams
  2014-10-06 16:20               ` Peter Zijlstra
@ 2014-10-06 21:52                 ` Alexander Shishkin
  2014-10-07 15:15                   ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Alexander Shishkin @ 2014-10-06 21:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Oct 06, 2014 at 12:08:19PM +0300, Alexander Shishkin wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Mon, Sep 08, 2014 at 03:55:11PM +0300, Alexander Shishkin wrote:
>> >
>> >> Fair enough. Then I'd like to disable the ACTIVE ones before freeing AUX
>> >> stuff and then re-enabling them since perf_event_{en,dis}able() already
>> >> provide the convenient cross-cpu calls, which would also avoid
>> >> concurrency between pmu::{add,del} callbacks and this unmap path. Makes
>> >> sense?
>> >
>> > But why? The buffer stuff is RCU freed, so if the hardware observes
>> > pages and does get_page_unless_zero() on them its good. The memory will
>> > not be freed from underneath the hardware writer because of the
>> > get_page().
>> >
>> > Then when the buffer is full and we 'swap', we'll find there is no next
>> > buffer. At that point we can not provide a new buffer, effectively
>> > stopping the hardware writes and release the old buffer, freeing the
>> > memory.
>> 
>> There are several problems with this. Firstly, AUX buffers can be quite
>> large, which means that we would have to do get_page() on thousands of
>> pages on every pmu::add, which is a hot path, and put_page() again in
>> pmu::del.
>> 
>> Secondly, all the sg bookkeeping that the driver keeps in aux_priv needs
>> to be refcounted. Right now, in the mmap_close path we just free
>> everything. But if we want to free the aux_pages in pmu::del, we need to
>> keep a list of these pages around after mmap_close(), and the same goes
>> for the actual sg tables. I can see a way of doing that on the ring
>> buffer side (as opposed to the driver side), but are you quite sure we
>> should go down this road?
>
> No, and I think I realized this after sending that email, but at the
> time I figured there was another way to do it. Of course, now that
> several weeks have passed I cannot for the life of me remember what it
> was.
>
> Lemme go over these patches again to refresh my mind and maybe I'll
> remember.

Yes, the ring buffer can keep a refcount for the aux_priv object, which
is grabbed once at mmap and once at perf_aux_output_begin(), and released
accordingly; whichever side drops the refcount to zero calls
pmu::free_aux. There's no need to grab page->_count, and the driver gets
a bit simpler.
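
Roughly like this (a sketch with assumed names):

    /*
     * The ring buffer tracks users of the AUX area: the mmap path and
     * perf_aux_output_begin() each take a reference.
     */
    static void *rb_get_aux(struct ring_buffer *rb)
    {
        if (atomic_inc_not_zero(&rb->aux_refcount))
            return rb->aux_priv;

        return NULL;
    }

    static void rb_put_aux(struct ring_buffer *rb, struct pmu *pmu)
    {
        /*
         * The last user out (munmap or the pmu's writer) frees the
         * driver's sg bookkeeping via the pmu callback.
         */
        if (atomic_dec_and_test(&rb->aux_refcount))
            pmu->free_aux(rb->aux_priv);
    }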

Regards,
--
Alex


* Re: [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams
  2014-10-06 21:52                 ` Alexander Shishkin
@ 2014-10-07 15:15                   ` Peter Zijlstra
  0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2014-10-07 15:15 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang

On Tue, Oct 07, 2014 at 12:52:49AM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:

> > No, and I think I realized this after sending that email, but at the
> > time I figured there was another way to do it. Of course, now that
> > several weeks have passed I cannot for the life of me remember what it
> > was.
> >
> > Lemme go over these patches again to refresh my mind and maybe I'll
> > remember.
> 
> Yes, the ring buffer can keep a refcount for the aux_priv object, which
> is grabbed once at mmap and once at perf_aux_output_begin(), and released
> accordingly; whichever side drops the refcount to zero calls
> pmu::free_aux. There's no need to grab page->_count, and the driver gets
> a bit simpler.

Yes, I suppose that'll work just fine.


end of thread, other threads:[~2014-10-07 15:15 UTC | newest]

Thread overview: 61+ messages
2014-08-20 12:35 [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Alexander Shishkin
2014-08-20 12:35 ` [PATCH v4 01/22] perf: Add data_{offset,size} to user_page Alexander Shishkin
2014-08-20 12:35 ` [PATCH v4 02/22] perf: Add AUX area to ring buffer for raw data streams Alexander Shishkin
2014-09-08  7:02   ` Peter Zijlstra
2014-09-08 11:16     ` Alexander Shishkin
2014-09-08 11:34       ` Peter Zijlstra
2014-09-08 12:55         ` Alexander Shishkin
2014-09-08 13:12           ` Peter Zijlstra
2014-10-06  9:08             ` Alexander Shishkin
2014-10-06 16:20               ` Peter Zijlstra
2014-10-06 21:52                 ` Alexander Shishkin
2014-10-07 15:15                   ` Peter Zijlstra
2014-08-20 12:36 ` [PATCH v4 03/22] perf: Support high-order allocations for AUX space Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 04/22] perf: Add a capability for AUX_NO_SG pmus to do software double buffering Alexander Shishkin
2014-09-08  7:17   ` Peter Zijlstra
2014-09-08 11:07     ` Alexander Shishkin
2014-09-08 11:31       ` Peter Zijlstra
2014-08-20 12:36 ` [PATCH v4 05/22] perf: Add a pmu capability for "exclusive" events Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 06/22] perf: Redirect output from inherited events to parents Alexander Shishkin
2014-09-08 15:26   ` Peter Zijlstra
2014-09-09  9:54     ` Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 07/22] perf: Add api for pmus to write to AUX space Alexander Shishkin
2014-09-08 16:06   ` Peter Zijlstra
2014-09-08 16:18     ` Peter Zijlstra
2014-08-20 12:36 ` [PATCH v4 08/22] perf: Add AUX record Alexander Shishkin
2014-09-09  8:20   ` Peter Zijlstra
2014-08-20 12:36 ` [PATCH v4 09/22] perf: Support overwrite mode for AUX area Alexander Shishkin
2014-09-09  8:33   ` Peter Zijlstra
2014-09-09  8:44   ` Peter Zijlstra
2014-09-09  9:40     ` Alexander Shishkin
2014-09-09 10:55       ` Peter Zijlstra
2014-09-09 11:53         ` Alexander Shishkin
2014-09-09 12:43           ` Peter Zijlstra
2014-09-09 13:00             ` Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 10/22] perf: Add wakeup watermark control to " Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 11/22] perf: add ITRACE_START record to indicate that tracing has started Alexander Shishkin
2014-09-09  9:08   ` Peter Zijlstra
2014-09-09  9:33     ` Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 12/22] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 13/22] x86: perf: Intel PT and LBR/BTS are mutually exclusive Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 14/22] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 15/22] x86: perf: intel_bts: Add BTS " Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 16/22] perf: Add rb_{alloc,free}_kernel api Alexander Shishkin
2014-09-09  9:09   ` Peter Zijlstra
2014-08-20 12:36 ` [PATCH v4 17/22] perf: Add a helper to copy AUX data in the kernel Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 18/22] perf: Add a helper for looking up pmus by type Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 19/22] perf: Add infrastructure for using AUX data in perf samples Alexander Shishkin
2014-09-09  9:11   ` Peter Zijlstra
2014-08-20 12:36 ` [PATCH v4 20/22] perf: Allocate ring buffers for inherited per-task kernel events Alexander Shishkin
2014-09-09  9:12   ` Peter Zijlstra
2014-08-20 12:36 ` [PATCH v4 21/22] perf: Allow AUX sampling for multiple events Alexander Shishkin
2014-08-20 12:36 ` [PATCH v4 22/22] perf: Allow sampling of inherited events Alexander Shishkin
2014-08-25  6:21 ` [PATCH v4 00/22] perf: Add infrastructure and support for Intel PT Adrian Hunter
2014-09-01 16:21   ` Peter Zijlstra
2014-09-01 16:30 ` Peter Zijlstra
2014-09-01 17:17   ` Pawel Moll
     [not found]     ` <CANLsYky0vuwo7MwKbiGXypkLkrX7k6BOEf2uej3-Z3-HZHKd7w@mail.gmail.com>
2014-09-04  8:26       ` Peter Zijlstra
2014-09-05 13:34         ` Mathieu Poirier
2014-09-08 11:55           ` Alexander Shishkin
2014-09-08 13:08             ` Michael Williams
2014-09-08 13:29             ` Al Grant
