* [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT
@ 2014-11-14 13:43 Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 01/14] perf: Add data_{offset,size} to user_page Alexander Shishkin
                   ` (14 more replies)
  0 siblings, 15 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

Hi Peter and all,

Since there weren't too many review comments on the two previous
versions, here's another one with a couple more fixes.

[full description below the changelog]

Changes since v7:
  * fixed a bug in the aux allocation logic that caused more pages
    than requested to be allocated,
  * fixed pt_event_add() for snapshot mode, which would return an
    error even on successful completion.

Changes since v6:
  * replaced kref with atomic in aux refcounting this time for real,
  * added a small comment on capabilities/"caps" attribute group in PT
    pmu.

Changes since v5:

  * dropped "bypass mask", bits that a privileged user could set in
    RTIT_CTL,
  * merged configuring and enabling RTIT_CTL writes,
  * fixed kernel-doc comments,
  * if LBR/BTS events are present in the system, PT event creation is
    disallowed and vice versa,
  * fixed a bug with high order allocation under pressure,
  * fixed undoing vm_locked accounting in case of AUX allocation
    failure,
  * left out kernel counters for a separate series to keep this one
    concise.

This patchset adds support to the perf kernel infrastructure for the Intel
Processor Trace (PT) extension [1] of the Intel architecture, which allows
capturing information about software execution flow.

The single most notable thing is that while PT outputs trace data in a
compressed binary format, it will still generate hundreds of megabytes
of trace data per second per core. Decoding this binary stream takes
2-3 orders of magnitude more cpu time than generating it, which makes
it impractical to carry out decoding in kernel space. Therefore, the
trace data is exported to userspace as a zero-copy mapping that
userspace can collect and store for later decoding. To this end, this
patchset extends the perf ring buffer with an "AUX space", which is
allocated for hardware blocks such as PT to export their trace data
with minimal overhead. This space can be configured via the buffer's
user page and mmapped from the same file descriptor at a given offset.
Data can then be collected from it by reading the aux_head (write)
pointer from the user page and updating the aux_tail (read) pointer,
similarly to data_{head,tail} of the traditional perf buffer. An api
is provided between perf core and pmu drivers that wish to make use of
this AUX space to export their data.

For tracing blocks that don't support hardware scatter-gather tables,
we provide high-order physically contiguous allocations to minimize
the overhead of software double buffering and to reduce PMI pressure.

This way we get a normal perf data stream that provides the sideband
information required to decode the trace, such as MMAP and COMM records,
plus the actual trace in its own logical space.

If the trace buffer is mapped writable, the driver will stop tracing
when it fills up (aux_head approaches aux_tail), until the data is read,
the aux_tail pointer is moved forward and an ioctl() is issued to
re-enable tracing. If the trace buffer is mapped read-only, tracing
will continue, overwriting older data, so that the buffer always
contains the most recent data. Tracing can be stopped with an ioctl()
and restarted once the data is collected.
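
For illustration only (this is not part of the series; drain_aux_data() is a
made-up helper, and the memory barriers a real consumer needs around
aux_head/aux_tail are omitted), draining a writable AUX mapping could look
roughly like this:

#include <sys/ioctl.h>
#include <linux/perf_event.h>

extern void drain_aux_data(const void *buf, __u64 len);	/* hypothetical */

/*
 * "fd" is the perf event fd, "pc" the mmapped user page, "aux"/"aux_size"
 * describe the already-mmapped AUX area.
 */
static void drain_aux(int fd, struct perf_event_mmap_page *pc,
		      const unsigned char *aux, __u64 aux_size)
{
	__u64 head = pc->aux_head;	/* written by the kernel */
	__u64 tail = pc->aux_tail;	/* written by userspace */

	while (tail < head) {
		__u64 off = tail % aux_size;
		__u64 len = head - tail;

		if (len > aux_size - off)
			len = aux_size - off;

		drain_aux_data(aux + off, len);
		tail += len;
	}

	pc->aux_tail = tail;
	/* the driver stops tracing when the buffer fills up; kick it again */
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
}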

This patchset consists of the necessary changes to the perf kernel
infrastructure plus the PT and BTS pmu drivers. The tooling support is not
included in this series; it can be found in my github tree [2].

[1] chapter 36 of
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
[2] http://github.com/virtuoso/linux-perf/tree/intel_pt

Alexander Shishkin (13):
  perf: Add data_{offset,size} to user_page
  perf: Support high-order allocations for AUX space
  perf: Add a capability for AUX_NO_SG pmus to do software double
    buffering
  perf: Add a pmu capability for "exclusive" events
  perf: Add AUX record
  perf: Add api for pmus to write to AUX area
  perf: Support overwrite mode for AUX area
  perf: Add wakeup watermark control to AUX area
  x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection
  x86: perf: Intel PT and LBR/BTS are mutually exclusive
  x86: perf: intel_pt: Intel PT PMU driver
  x86: perf: intel_bts: Add BTS PMU driver
  perf: add ITRACE_START record to indicate that tracing has started

Peter Zijlstra (1):
  perf: Add AUX area to ring buffer for raw data streams

 arch/x86/include/asm/cpufeature.h          |    1 +
 arch/x86/include/uapi/asm/msr-index.h      |   18 +
 arch/x86/kernel/cpu/Makefile               |    1 +
 arch/x86/kernel/cpu/intel_pt.h             |  131 ++++
 arch/x86/kernel/cpu/perf_event.c           |   43 ++
 arch/x86/kernel/cpu/perf_event.h           |   49 ++
 arch/x86/kernel/cpu/perf_event_intel.c     |   40 +-
 arch/x86/kernel/cpu/perf_event_intel_bts.c |  525 ++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |    3 +-
 arch/x86/kernel/cpu/perf_event_intel_pt.c  | 1082 ++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/scattered.c            |    1 +
 include/linux/perf_event.h                 |   47 +-
 include/uapi/linux/perf_event.h            |   59 +-
 kernel/events/core.c                       |  268 ++++++-
 kernel/events/internal.h                   |   33 +
 kernel/events/ring_buffer.c                |  327 ++++++++-
 16 files changed, 2577 insertions(+), 51 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_pt.h
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c

-- 
2.1.1



* [PATCH v8 01/14] perf: Add data_{offset,size} to user_page
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams Alexander Shishkin
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

Currently, the actual perf ring buffer starts one page into the mmap area,
following the user page, and userspace relies on this convention. This
patch adds data_{offset,size} fields to the user_page that userspace can
use instead to locate the perf data in the mmap area. This is also
helpful when mapping existing or shared buffers whose size is not
known in advance.

Right now, it is made to follow the existing convention that

	data_offset == PAGE_SIZE and
	data_offset + data_size == mmap_size.
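
For illustration only (not part of this patch; the userpg->size/version
checks a robust consumer would do are omitted), userspace could locate the
data area like this, falling back to the old convention on kernels that
leave the new fields zero:

#include <stddef.h>
#include <linux/perf_event.h>

/* "base" is the start of the perf mmap, "mmap_size" its total size */
static void *locate_data(void *base, size_t mmap_size, size_t page_size,
			 size_t *data_size)
{
	struct perf_event_mmap_page *pc = base;

	if (pc->data_offset) {
		/* new kernels advertise the data area explicitly */
		*data_size = pc->data_size;
		return (char *)base + pc->data_offset;
	}

	/* old convention: data immediately follows the user page */
	*data_size = mmap_size - page_size;
	return (char *)base + page_size;
}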

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h | 5 +++++
 kernel/events/core.c            | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9d845404d8..dbb50c31c7 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -489,9 +489,14 @@ struct perf_event_mmap_page {
 	 * In this case the kernel will not over-write unread data.
 	 *
 	 * See perf_output_put_handle() for the data ordering.
+	 *
+	 * data_{offset,size} indicate the location and size of the perf record
+	 * buffer within the mmapped area.
 	 */
 	__u64   data_head;		/* head in the data section */
 	__u64	data_tail;		/* user-space written tail */
+	__u64	data_offset;		/* where the buffer starts */
+	__u64	data_size;		/* data buffer size */
 };
 
 #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2b02c9fda7..96105bb6b1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3942,6 +3942,8 @@ static void perf_event_init_userpage(struct perf_event *event)
 	/* Allow new userspace to detect that bit 0 is deprecated */
 	userpg->cap_bit0_is_deprecated = 1;
 	userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
+	userpg->data_offset = PAGE_SIZE;
+	userpg->data_size = perf_data_size(rb);
 
 unlock:
 	rcu_read_unlock();
-- 
2.1.1



* [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 01/14] perf: Add data_{offset,size} to user_page Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-17  9:33   ` Metzger, Markus T
  2014-11-17 21:24   ` Sukadev Bhattiprolu
  2014-11-14 13:43 ` [PATCH v8 03/14] perf: Support high-order allocations for AUX space Alexander Shishkin
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Peter Zijlstra, Alexander Shishkin

From: Peter Zijlstra <peterz@infradead.org>

This patch introduces "AUX space" in the perf mmap buffer, intended for
exporting high-bandwidth data streams to userspace, such as instruction
flow traces.

AUX space is a ring buffer, defined by aux_{offset,size} fields in the
user_page structure, and read/write pointers aux_{head,tail}, which abide
by the same rules as data_* counterparts of the main perf buffer.

In order to allocate/mmap the AUX area, userspace needs to set aux_offset
to an offset greater than data_offset + data_size, and aux_size to the
desired buffer size. Both need to be page aligned. The same aux_offset
and aux_size should then be passed to the mmap() call, and if everything
adds up, an AUX buffer is mapped as a result.

Pages that are mapped into this buffer are also charged against the
user's mlock rlimit plus the perf_event_mlock_kb allowance.
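
As an illustration (not part of this patch; DATA_PAGES and AUX_PAGES are
assumed power-of-two constants and error handling is omitted), the two-step
mapping could look like this:

#include <unistd.h>
#include <sys/mman.h>
#include <linux/perf_event.h>

#define DATA_PAGES	64	/* assumption: power of two */
#define AUX_PAGES	1024	/* assumption: power of two */

/* "fd" is an event fd for an AUX-capable pmu */
static void *map_aux(int fd, void **aux)
{
	size_t psize = sysconf(_SC_PAGESIZE);
	void *base = mmap(NULL, (DATA_PAGES + 1) * psize,
			  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	struct perf_event_mmap_page *pc = base;

	/* tell the kernel where the AUX area will live and how big it is */
	pc->aux_offset = (DATA_PAGES + 1) * psize;
	pc->aux_size   = AUX_PAGES * psize;

	/* second mmap() of the same fd, at aux_offset, maps the AUX area */
	*aux = mmap(NULL, pc->aux_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    fd, pc->aux_offset);

	return base;
}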

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |  17 +++++
 include/uapi/linux/perf_event.h |  16 +++++
 kernel/events/core.c            | 140 +++++++++++++++++++++++++++++++++-------
 kernel/events/internal.h        |  23 +++++++
 kernel/events/ring_buffer.c     |  97 ++++++++++++++++++++++++++--
 5 files changed, 264 insertions(+), 29 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0798..344058c71d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -263,6 +263,18 @@ struct pmu {
 	 * flush branch stack on context-switches (needed in cpu-wide mode)
 	 */
 	void (*flush_branch_stack)	(void);
+
+	/*
+	 * Set up pmu-private data structures for an AUX area
+	 */
+	void *(*setup_aux)		(int cpu, void **pages,
+					 int nr_pages, bool overwrite);
+					/* optional */
+
+	/*
+	 * Free pmu-private AUX data structures
+	 */
+	void (*free_aux)		(void *aux); /* optional */
 };
 
 /**
@@ -782,6 +794,11 @@ static inline bool has_branch_stack(struct perf_event *event)
 	return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
 }
 
+static inline bool has_aux(struct perf_event *event)
+{
+	return event->pmu->setup_aux;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index dbb50c31c7..9a13e5fded 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -497,6 +497,22 @@ struct perf_event_mmap_page {
 	__u64	data_tail;		/* user-space written tail */
 	__u64	data_offset;		/* where the buffer starts */
 	__u64	data_size;		/* data buffer size */
+
+	/*
+	 * AUX area is defined by aux_{offset,size} fields that should be set
+	 * by the userspace, so that
+	 *
+	 *   aux_offset >= data_offset + data_size
+	 *
+	 * prior to mmap()ing it. Size of the mmap()ed area should be aux_size.
+	 *
+	 * Ring buffer pointers aux_{head,tail} have the same semantics as
+	 * data_{head,tail} and same ordering rules apply.
+	 */
+	__u64	aux_head;
+	__u64	aux_tail;
+	__u64	aux_offset;
+	__u64	aux_size;
 };
 
 #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 96105bb6b1..ac7bfb43fc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4141,6 +4141,8 @@ static void perf_mmap_open(struct vm_area_struct *vma)
 
 	atomic_inc(&event->mmap_count);
 	atomic_inc(&event->rb->mmap_count);
+	if (vma->vm_pgoff)
+		atomic_inc(&event->rb->aux_mmap_count);
 }
 
 /*
@@ -4160,6 +4162,20 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	int mmap_locked = rb->mmap_locked;
 	unsigned long size = perf_data_size(rb);
 
+	/*
+	 * rb->aux_mmap_count will always drop before rb->mmap_count and
+	 * event->mmap_count, so it is ok to use event->mmap_mutex to
+	 * serialize with perf_mmap here.
+	 */
+	if (rb_has_aux(rb) && vma->vm_pgoff == rb->aux_pgoff &&
+	    atomic_dec_and_mutex_lock(&rb->aux_mmap_count, &event->mmap_mutex)) {
+		atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
+		vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
+
+		rb_free_aux(rb);
+		mutex_unlock(&event->mmap_mutex);
+	}
+
 	atomic_dec(&rb->mmap_count);
 
 	if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
@@ -4233,7 +4249,7 @@ out_put:
 
 static const struct vm_operations_struct perf_mmap_vmops = {
 	.open		= perf_mmap_open,
-	.close		= perf_mmap_close,
+	.close		= perf_mmap_close, /* non mergable */
 	.fault		= perf_mmap_fault,
 	.page_mkwrite	= perf_mmap_fault,
 };
@@ -4244,10 +4260,10 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	unsigned long user_locked, user_lock_limit;
 	struct user_struct *user = current_user();
 	unsigned long locked, lock_limit;
-	struct ring_buffer *rb;
+	struct ring_buffer *rb = NULL;
 	unsigned long vma_size;
 	unsigned long nr_pages;
-	long user_extra, extra;
+	long user_extra = 0, extra = 0;
 	int ret = 0, flags = 0;
 
 	/*
@@ -4262,7 +4278,66 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	vma_size = vma->vm_end - vma->vm_start;
-	nr_pages = (vma_size / PAGE_SIZE) - 1;
+
+	if (vma->vm_pgoff == 0) {
+		nr_pages = (vma_size / PAGE_SIZE) - 1;
+	} else {
+		/*
+		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
+		 * mapped, all subsequent mappings should have the same size
+		 * and offset. Must be above the normal perf buffer.
+		 */
+		u64 aux_offset, aux_size;
+
+		if (!event->rb)
+			return -EINVAL;
+
+		nr_pages = vma_size / PAGE_SIZE;
+
+		mutex_lock(&event->mmap_mutex);
+		ret = -EINVAL;
+
+		rb = event->rb;
+		if (!rb)
+			goto aux_unlock;
+
+		aux_offset = ACCESS_ONCE(rb->user_page->aux_offset);
+		aux_size = ACCESS_ONCE(rb->user_page->aux_size);
+
+		if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
+			goto aux_unlock;
+
+		if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
+			goto aux_unlock;
+
+		/* already mapped with a different offset */
+		if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
+			goto aux_unlock;
+
+		if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
+			goto aux_unlock;
+
+		/* already mapped with a different size */
+		if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
+			goto aux_unlock;
+
+		if (!is_power_of_2(nr_pages))
+			goto aux_unlock;
+
+		if (!atomic_inc_not_zero(&rb->mmap_count))
+			goto aux_unlock;
+
+		if (rb_has_aux(rb)) {
+			atomic_inc(&rb->aux_mmap_count);
+			ret = 0;
+			goto unlock;
+		}
+
+		atomic_set(&rb->aux_mmap_count, 1);
+		user_extra = nr_pages;
+
+		goto accounting;
+	}
 
 	/*
 	 * If we have rb pages ensure they're a power-of-two number, so we
@@ -4274,9 +4349,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma_size != PAGE_SIZE * (1 + nr_pages))
 		return -EINVAL;
 
-	if (vma->vm_pgoff != 0)
-		return -EINVAL;
-
 	WARN_ON_ONCE(event->ctx->parent_ctx);
 again:
 	mutex_lock(&event->mmap_mutex);
@@ -4300,6 +4372,8 @@ again:
 	}
 
 	user_extra = nr_pages + 1;
+
+accounting:
 	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
 
 	/*
@@ -4309,7 +4383,6 @@ again:
 
 	user_locked = atomic_long_read(&user->locked_vm) + user_extra;
 
-	extra = 0;
 	if (user_locked > user_lock_limit)
 		extra = user_locked - user_lock_limit;
 
@@ -4323,35 +4396,45 @@ again:
 		goto unlock;
 	}
 
-	WARN_ON(event->rb);
+	WARN_ON(!rb && event->rb);
 
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
 
-	rb = rb_alloc(nr_pages, 
-		event->attr.watermark ? event->attr.wakeup_watermark : 0,
-		event->cpu, flags);
-
 	if (!rb) {
-		ret = -ENOMEM;
-		goto unlock;
-	}
+		rb = rb_alloc(nr_pages,
+			      event->attr.watermark ? event->attr.wakeup_watermark : 0,
+			      event->cpu, flags);
 
-	atomic_set(&rb->mmap_count, 1);
-	rb->mmap_locked = extra;
-	rb->mmap_user = get_current_user();
+		if (!rb) {
+			ret = -ENOMEM;
+			goto unlock;
+		}
 
-	atomic_long_add(user_extra, &user->locked_vm);
-	vma->vm_mm->pinned_vm += extra;
+		atomic_set(&rb->mmap_count, 1);
+		rb->mmap_user = get_current_user();
+		rb->mmap_locked = extra;
 
-	ring_buffer_attach(event, rb);
+		ring_buffer_attach(event, rb);
 
-	perf_event_init_userpage(event);
-	perf_event_update_userpage(event);
+		perf_event_init_userpage(event);
+		perf_event_update_userpage(event);
+	} else {
+		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+		if (!ret)
+			rb->aux_mmap_locked = extra;
+	}
 
 unlock:
-	if (!ret)
+	if (!ret) {
+		atomic_long_add(user_extra, &user->locked_vm);
+		vma->vm_mm->pinned_vm += extra;
+
 		atomic_inc(&event->mmap_count);
+	} else if (rb) {
+		atomic_dec(&rb->mmap_count);
+	}
+aux_unlock:
 	mutex_unlock(&event->mmap_mutex);
 
 	/*
@@ -7185,6 +7268,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 	if (output_event->cpu == -1 && output_event->ctx != event->ctx)
 		goto out;
 
+	/*
+	 * If both events generate aux data, they must be on the same PMU
+	 */
+	if (has_aux(event) && has_aux(output_event) &&
+	    event->pmu != output_event->pmu)
+		goto out;
+
 set:
 	mutex_lock(&event->mmap_mutex);
 	/* Can't redirect output if we've got an active mmap() */
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b218782..0f6d080159 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -35,6 +35,16 @@ struct ring_buffer {
 	unsigned long			mmap_locked;
 	struct user_struct		*mmap_user;
 
+	/* AUX area */
+	unsigned long			aux_pgoff;
+	int				aux_nr_pages;
+	atomic_t			aux_mmap_count;
+	unsigned long			aux_mmap_locked;
+	void				(*free_aux)(void *);
+	atomic_t			aux_refcount;
+	void				**aux_pages;
+	void				*aux_priv;
+
 	struct perf_event_mmap_page	*user_page;
 	void				*data_pages[0];
 };
@@ -43,6 +53,14 @@ extern void rb_free(struct ring_buffer *rb);
 extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
 extern void perf_event_wakeup(struct perf_event *event);
+extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+			pgoff_t pgoff, int nr_pages, int flags);
+extern void rb_free_aux(struct ring_buffer *rb);
+
+static inline bool rb_has_aux(struct ring_buffer *rb)
+{
+	return !!rb->aux_nr_pages;
+}
 
 extern void
 perf_event_header__init_id(struct perf_event_header *header,
@@ -81,6 +99,11 @@ static inline unsigned long perf_data_size(struct ring_buffer *rb)
 	return rb->nr_pages << (PAGE_SHIFT + page_order(rb));
 }
 
+static inline unsigned long perf_aux_size(struct ring_buffer *rb)
+{
+	return rb->aux_nr_pages << PAGE_SHIFT;
+}
+
 #define DEFINE_OUTPUT_COPY(func_name, memcpy_func)			\
 static inline unsigned long						\
 func_name(struct perf_output_handle *handle,				\
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 146a5792b1..c6fdce4ea0 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,14 +242,87 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	spin_lock_init(&rb->event_lock);
 }
 
+int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
+		 pgoff_t pgoff, int nr_pages, int flags)
+{
+	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
+	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
+	int ret = -ENOMEM;
+
+	if (!has_aux(event))
+		return -ENOTSUPP;
+
+	rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
+	if (!rb->aux_pages)
+		return -ENOMEM;
+
+	rb->free_aux = event->pmu->free_aux;
+	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
+	     rb->aux_nr_pages++) {
+		struct page *page;
+
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+		if (!page)
+			goto out;
+
+		rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+	}
+
+	rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
+					     overwrite);
+	if (!rb->aux_priv)
+		goto out;
+
+	ret = 0;
+
+	/*
+	 * aux_pages (and pmu driver's private data, aux_priv) will be
+	 * referenced in both producer's and consumer's contexts, thus
+	 * we keep a refcount here to make sure either of the two can
+	 * reference them safely.
+	 */
+	atomic_set(&rb->aux_refcount, 1);
+
+out:
+	if (!ret)
+		rb->aux_pgoff = pgoff;
+	else
+		rb_free_aux(rb);
+
+	return ret;
+}
+
+static void __rb_free_aux(struct ring_buffer *rb)
+{
+	int pg;
+
+	if (rb->aux_priv) {
+		rb->free_aux(rb->aux_priv);
+		rb->free_aux = NULL;
+		rb->aux_priv = NULL;
+	}
+
+	for (pg = 0; pg < rb->aux_nr_pages; pg++)
+		free_page((unsigned long)rb->aux_pages[pg]);
+
+	kfree(rb->aux_pages);
+	rb->aux_nr_pages = 0;
+}
+
+void rb_free_aux(struct ring_buffer *rb)
+{
+	if (atomic_dec_and_test(&rb->aux_refcount))
+		__rb_free_aux(rb);
+}
+
 #ifndef CONFIG_PERF_USE_VMALLOC
 
 /*
  * Back perf_mmap() with regular GFP_KERNEL-0 pages.
  */
 
-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
 {
 	if (pgoff > rb->nr_pages)
 		return NULL;
@@ -339,8 +412,8 @@ static int data_page_nr(struct ring_buffer *rb)
 	return rb->nr_pages << page_order(rb);
 }
 
-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+static struct page *
+__perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
 {
 	/* The '>' counts in the user page. */
 	if (pgoff > data_page_nr(rb))
@@ -415,3 +488,19 @@ fail:
 }
 
 #endif
+
+struct page *
+perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+{
+	if (rb->aux_nr_pages) {
+		/* above AUX space */
+		if (pgoff > rb->aux_pgoff + rb->aux_nr_pages)
+			return NULL;
+
+		/* AUX space */
+		if (pgoff >= rb->aux_pgoff)
+			return virt_to_page(rb->aux_pages[pgoff - rb->aux_pgoff]);
+	}
+
+	return __perf_mmap_to_page(rb, pgoff);
+}
-- 
2.1.1



* [PATCH v8 03/14] perf: Support high-order allocations for AUX space
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 01/14] perf: Add data_{offset,size} to user_page Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 04/14] perf: Add a capability for AUX_NO_SG pmus to do software double buffering Alexander Shishkin
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability)
don't support scatter-gather and will prefer larger contiguous areas for
their output regions.

This patch adds a new pmu capability to request higher-order allocations.
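
To sketch how this is consumed on the driver side (this is not from the
series; struct mydrv_buffer and mydrv_add_output_region() are made-up
names), a PERF_PMU_CAP_AUX_NO_SG driver could recover the size of each
contiguous chunk from the page_private() hint set by the allocator below:

#include <linux/mm.h>
#include <linux/slab.h>

struct mydrv_buffer {				/* hypothetical */
	int	nr_regions;
};

void mydrv_add_output_region(struct mydrv_buffer *buf, void *va,
			     int nr_pages);	/* hypothetical */

static void *mydrv_setup_aux(int cpu, void **pages, int nr_pages,
			     bool overwrite)
{
	struct mydrv_buffer *buf = kzalloc(sizeof(*buf), GFP_KERNEL);
	int pg;

	if (!buf)
		return NULL;

	for (pg = 0; pg < nr_pages;) {
		struct page *page = virt_to_page(pages[pg]);
		/* order of this contiguous chunk; 0 if it is a single page */
		int order = PagePrivate(page) ? page_private(page) : 0;

		mydrv_add_output_region(buf, pages[pg], 1 << order);
		pg += 1 << order;
	}

	return buf;
}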

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  |  1 +
 kernel/events/ring_buffer.c | 56 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 344058c71d..71df948b4d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -171,6 +171,7 @@ struct perf_event;
  * pmu::capabilities flags
  */
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
+#define PERF_PMU_CAP_AUX_NO_SG			0x02
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index c6fdce4ea0..6e700e8fd6 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,30 +242,74 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	spin_lock_init(&rb->event_lock);
 }
 
+#define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
+
+static struct page *rb_alloc_aux_page(int node, int order)
+{
+	struct page *page;
+
+	if (order > MAX_ORDER)
+		order = MAX_ORDER;
+
+	do {
+		page = alloc_pages_node(node, PERF_AUX_GFP, order);
+	} while (!page && order--);
+
+	if (page && order) {
+		/*
+		 * Communicate the allocation size to the driver
+		 */
+		split_page(page, order);
+		SetPagePrivate(page);
+		set_page_private(page, order);
+	}
+
+	return page;
+}
+
+static void rb_free_aux_page(struct ring_buffer *rb, int idx)
+{
+	struct page *page = virt_to_page(rb->aux_pages[idx]);
+
+	ClearPagePrivate(page);
+	page->mapping = NULL;
+	__free_page(page);
+}
+
 int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 		 pgoff_t pgoff, int nr_pages, int flags)
 {
 	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
 	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
-	int ret = -ENOMEM;
+	int ret = -ENOMEM, max_order = 0;
 
 	if (!has_aux(event))
 		return -ENOTSUPP;
 
+	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+		/*
+		 * We need to start with the max_order that fits in nr_pages,
+		 * not the other way around, hence ilog2() and not get_order.
+		 */
+		max_order = ilog2(nr_pages);
+
 	rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
 	if (!rb->aux_pages)
 		return -ENOMEM;
 
 	rb->free_aux = event->pmu->free_aux;
-	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;
-	     rb->aux_nr_pages++) {
+	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;) {
 		struct page *page;
+		int last, order;
 
-		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+		order = min(max_order, ilog2(nr_pages - rb->aux_nr_pages));
+		page = rb_alloc_aux_page(node, order);
 		if (!page)
 			goto out;
 
-		rb->aux_pages[rb->aux_nr_pages] = page_address(page);
+		for (last = rb->aux_nr_pages + (1 << page_private(page));
+		     last > rb->aux_nr_pages; rb->aux_nr_pages++)
+			rb->aux_pages[rb->aux_nr_pages] = page_address(page++);
 	}
 
 	rb->aux_priv = event->pmu->setup_aux(event->cpu, rb->aux_pages, nr_pages,
@@ -303,7 +347,7 @@ static void __rb_free_aux(struct ring_buffer *rb)
 	}
 
 	for (pg = 0; pg < rb->aux_nr_pages; pg++)
-		free_page((unsigned long)rb->aux_pages[pg]);
+		rb_free_aux_page(rb, pg);
 
 	kfree(rb->aux_pages);
 	rb->aux_nr_pages = 0;
-- 
2.1.1



* [PATCH v8 04/14] perf: Add a capability for AUX_NO_SG pmus to do software double buffering
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (2 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 03/14] perf: Support high-order allocations for AUX space Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 05/14] perf: Add a pmu capability for "exclusive" events Alexander Shishkin
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

For pmus that don't support scatter-gather for AUX data in hardware, it
might still make sense to implement software double buffering to avoid
losing data while the user is reading it out. For this purpose, add a
pmu capability that guarantees multiple high-order chunks for the AUX
buffer, so that the pmu driver can do switchover tricks.

To make use of this feature, add PERF_PMU_CAP_AUX_SW_DOUBLEBUF to your
pmu's capability mask. This will make the ring buffer AUX allocation code
ensure that the biggest high-order allocation for the AUX buffer pages is
no bigger than half of the total requested buffer size, thus making sure
that the buffer has at least two high-order allocations.
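
A hypothetical driver opting in (not from this series; the mydrv_* names
are made up) would simply combine the flags in its pmu's capability mask:

static struct pmu mydrv_pmu = {
	.capabilities	= PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF,
	.setup_aux	= mydrv_setup_aux,
	.free_aux	= mydrv_free_aux,
	/* plus the usual pmu callbacks */
};

With both bits set, rb_alloc_aux() caps max_order at half the buffer, so
setup_aux() is guaranteed at least two chunks to flip between.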

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  |  1 +
 kernel/events/ring_buffer.c | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 71df948b4d..e7a15f5c3f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -172,6 +172,7 @@ struct perf_event;
  */
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
 #define PERF_PMU_CAP_AUX_NO_SG			0x02
+#define PERF_PMU_CAP_AUX_SW_DOUBLEBUF		0x04
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 6e700e8fd6..608c0d67ad 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -286,13 +286,26 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 	if (!has_aux(event))
 		return -ENOTSUPP;
 
-	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG)
+	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
 		/*
 		 * We need to start with the max_order that fits in nr_pages,
 		 * not the other way around, hence ilog2() and not get_order.
 		 */
 		max_order = ilog2(nr_pages);
 
+		/*
+		 * PMU requests more than one contiguous chunks of memory
+		 * for SW double buffering
+		 */
+		if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) &&
+		    !overwrite) {
+			if (!max_order)
+				return -EINVAL;
+
+			max_order--;
+		}
+	}
+
 	rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
 	if (!rb->aux_pages)
 		return -ENOMEM;
-- 
2.1.1



* [PATCH v8 05/14] perf: Add a pmu capability for "exclusive" events
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (3 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 04/14] perf: Add a capability for AUX_NO_SG pmus to do software double buffering Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 06/14] perf: Add AUX record Alexander Shishkin
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

Usually, pmus that do, for example, instruction tracing would only ever
be able to have one event per task per cpu (or per perf_event_context).
For such pmus it makes sense to disallow creating conflicting events early
on, so as to provide consistent behavior for the user.

This patch adds a pmu capability that indicates such constraint on event
creation.
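
For illustration (not part of this patch; attr.type/config for an
exclusive-capable pmu are assumed, and the exact errno may differ), the
effect seen from userspace is that a second, conflicting event fails to
open:

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
	struct perf_event_attr attr = {
		.size	= sizeof(attr),
		/* .type/.config for a PERF_PMU_CAP_EXCLUSIVE pmu go here */
	};
	int fd1 = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	int fd2 = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

	/* fd1 succeeds; fd2 is expected to fail for the conflicting event */
	return (fd1 >= 0 && fd2 < 0) ? 0 : 1;
}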

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e7a15f5c3f..5c3dee021c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -173,6 +173,7 @@ struct perf_event;
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
 #define PERF_PMU_CAP_AUX_NO_SG			0x02
 #define PERF_PMU_CAP_AUX_SW_DOUBLEBUF		0x04
+#define PERF_PMU_CAP_EXCLUSIVE			0x08
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ac7bfb43fc..2399c12e49 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7298,6 +7298,32 @@ out:
 	return ret;
 }
 
+static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2)
+{
+	if ((e1->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) &&
+	    (e1->cpu == e2->cpu ||
+	     e1->cpu == -1 ||
+	     e2->cpu == -1))
+		return true;
+	return false;
+}
+
+static bool exclusive_event_ok(struct perf_event *event,
+			      struct perf_event_context *ctx)
+{
+	struct perf_event *iter_event;
+
+	if (!(event->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE))
+		return true;
+
+	list_for_each_entry(iter_event, &ctx->event_list, event_entry) {
+		if (exclusive_event_match(iter_event, event))
+			return false;
+	}
+
+	return true;
+}
+
 /**
  * sys_perf_event_open - open a performance event, associate it to a task/cpu
  *
@@ -7449,6 +7475,11 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_alloc;
 	}
 
+	if ((pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) && group_leader) {
+		err = -EBUSY;
+		goto err_context;
+	}
+
 	if (task) {
 		put_task_struct(task);
 		task = NULL;
@@ -7534,6 +7565,12 @@ SYSCALL_DEFINE5(perf_event_open,
 		}
 	}
 
+	if (!exclusive_event_ok(event, ctx)) {
+		mutex_unlock(&ctx->mutex);
+		fput(event_file);
+		goto err_context;
+	}
+
 	perf_install_in_context(ctx, event, event->cpu);
 	perf_unpin_context(ctx);
 	mutex_unlock(&ctx->mutex);
@@ -7620,6 +7657,14 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 
 	WARN_ON_ONCE(ctx->parent_ctx);
 	mutex_lock(&ctx->mutex);
+	if (!exclusive_event_ok(event, ctx)) {
+		mutex_unlock(&ctx->mutex);
+		perf_unpin_context(ctx);
+		put_ctx(ctx);
+		err = -EBUSY;
+		goto err_free;
+	}
+
 	perf_install_in_context(ctx, event, cpu);
 	perf_unpin_context(ctx);
 	mutex_unlock(&ctx->mutex);
-- 
2.1.1



* [PATCH v8 06/14] perf: Add AUX record
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (4 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 05/14] perf: Add a pmu capability for "exclusive" events Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 07/14] perf: Add api for pmus to write to AUX area Alexander Shishkin
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

When there's new data in the AUX space, output a record indicating its
offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, which
means the described data was truncated to fit in the ring buffer.
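
For illustration (not part of this patch; consume_aux_data() is a made-up
helper and sample_id parsing is omitted), a consumer of the perf data
stream could handle the new record like this:

#include <stdio.h>
#include <linux/perf_event.h>

struct perf_aux_record {
	struct perf_event_header header;	/* header.type == PERF_RECORD_AUX */
	__u64	aux_offset;			/* where the new data starts */
	__u64	aux_size;			/* how much new data there is */
	__u64	flags;				/* PERF_AUX_FLAG_* */
	/* followed by struct sample_id, if requested */
};

void consume_aux_data(__u64 offset, __u64 size);	/* hypothetical */

static void handle_aux_record(const struct perf_aux_record *rec)
{
	if (rec->flags & PERF_AUX_FLAG_TRUNCATED)
		fprintf(stderr, "AUX data was truncated\n");

	consume_aux_data(rec->aux_offset, rec->aux_size);
}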

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
---
 include/uapi/linux/perf_event.h | 19 +++++++++++++++++++
 kernel/events/core.c            | 34 ++++++++++++++++++++++++++++++++++
 kernel/events/internal.h        |  3 +++
 3 files changed, 56 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9a13e5fded..272258380e 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -733,6 +733,20 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * Records that new data landed in the AUX buffer part.
+	 *
+	 * struct {
+	 * 	struct perf_event_header	header;
+	 *
+	 * 	u64				aux_offset;
+	 * 	u64				aux_size;
+	 *	u64				flags;
+	 * 	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_AUX				= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -750,6 +764,11 @@ enum perf_callchain_context {
 	PERF_CONTEXT_MAX		= (__u64)-4095,
 };
 
+/**
+ * PERF_RECORD_AUX::flags bits
+ */
+#define PERF_AUX_FLAG_TRUNCATED		0x01	/* record was truncated to fit */
+
 #define PERF_FLAG_FD_NO_GROUP		(1UL << 0)
 #define PERF_FLAG_FD_OUTPUT		(1UL << 1)
 #define PERF_FLAG_PID_CGROUP		(1UL << 2) /* pid=cgroup id, per-cpu mode only */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2399c12e49..779f60d300 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5606,6 +5606,40 @@ void perf_event_mmap(struct vm_area_struct *vma)
 	perf_event_mmap_event(&mmap_event);
 }
 
+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+			  unsigned long size, u64 flags)
+{
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	struct perf_aux_event {
+		struct perf_event_header	header;
+		u64				offset;
+		u64				size;
+		u64				flags;
+	} rec = {
+		.header = {
+			.type = PERF_RECORD_AUX,
+			.misc = 0,
+			.size = sizeof(rec),
+		},
+		.offset		= head,
+		.size		= size,
+		.flags		= flags,
+	};
+	int ret;
+
+	perf_event_header__init_id(&rec.header, &sample, event);
+	ret = perf_output_begin(&handle, event, rec.header.size);
+
+	if (ret)
+		return;
+
+	perf_output_put(&handle, rec);
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+}
+
 /*
  * IRQ throttle logging
  */
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 0f6d080159..4d117a9814 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -62,6 +62,9 @@ static inline bool rb_has_aux(struct ring_buffer *rb)
 	return !!rb->aux_nr_pages;
 }
 
+void perf_event_aux_event(struct perf_event *event, unsigned long head,
+			  unsigned long size, u64 flags);
+
 extern void
 perf_event_header__init_id(struct perf_event_header *header,
 			   struct perf_sample_data *data,
-- 
2.1.1



* [PATCH v8 07/14] perf: Add api for pmus to write to AUX area
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (5 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 06/14] perf: Add AUX record Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 08/14] perf: Support overwrite mode for " Alexander Shishkin
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

For pmus that wish to write data to the ring buffer's AUX area, provide
perf_aux_output_{begin,end}() calls to initiate/commit data writes,
similarly to perf_output_{begin,end}. These also use the same output
handle structure. Also, similarly to their software counterparts, these
will direct inherited events' output to their parents' ring buffers.

After perf_aux_output_begin() returns successfully, handle->size is set
to the maximum amount of data that can be written with respect to the
aux_tail pointer, so that no data the user hasn't seen yet will be
overwritten; therefore it should always be called before hardware writing
is enabled. On success, it returns the pointer to the pmu driver's
private structure allocated for this AUX area by pmu::setup_aux. The same
pointer can also be retrieved using perf_get_aux() while hardware
writing is enabled.

The pmu driver should pass the actual amount of data written as a
parameter to perf_aux_output_end(). All hardware writes should be
completed and visible before it is called.

Additionally, perf_aux_output_skip() will adjust output handle and
aux_head in case some part of the buffer has to be skipped over to
maintain hardware's alignment constraints.

Nested writers are forbidden and guards are in place to catch such
attempts.
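
To sketch the intended calling sequence (this is not from the series;
mydrv_hw_start()/mydrv_hw_stop() and struct mydrv_buffer are made-up names,
and a real driver would keep the output handle in per-cpu state rather
than a single static variable), a pmu driver could wrap hardware
start/stop like this:

#include <linux/perf_event.h>

struct mydrv_buffer;					/* hypothetical */
void mydrv_hw_start(struct mydrv_buffer *buf, unsigned long head,
		    unsigned long size);		/* hypothetical */
unsigned long mydrv_hw_stop(bool *truncated);		/* hypothetical */

static struct perf_output_handle mydrv_handle;

static void mydrv_event_start(struct perf_event *event, int flags)
{
	struct mydrv_buffer *buf;

	buf = perf_aux_output_begin(&mydrv_handle, event);
	if (!buf)
		return;	/* no room: the event has been marked for disabling */

	/* program hardware to write at most mydrv_handle.size bytes,
	 * starting at offset mydrv_handle.head in the AUX buffer */
	mydrv_hw_start(buf, mydrv_handle.head, mydrv_handle.size);
}

static void mydrv_event_stop(struct perf_event *event, int flags)
{
	bool truncated;
	unsigned long written = mydrv_hw_stop(&truncated);

	/* all hardware stores must be globally visible before this call */
	perf_aux_output_end(&mydrv_handle, written, truncated);
}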

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  |  23 +++++++-
 kernel/events/core.c        |   5 +-
 kernel/events/internal.h    |   4 ++
 kernel/events/ring_buffer.c | 139 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 167 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5c3dee021c..f126eb89e6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -555,12 +555,22 @@ struct perf_output_handle {
 	struct ring_buffer		*rb;
 	unsigned long			wakeup;
 	unsigned long			size;
-	void				*addr;
+	union {
+		void			*addr;
+		unsigned long		head;
+	};
 	int				page;
 };
 
 #ifdef CONFIG_PERF_EVENTS
 
+extern void *perf_aux_output_begin(struct perf_output_handle *handle,
+				   struct perf_event *event);
+extern void perf_aux_output_end(struct perf_output_handle *handle,
+				unsigned long size, bool truncated);
+extern int perf_aux_output_skip(struct perf_output_handle *handle,
+				unsigned long size);
+extern void *perf_get_aux(struct perf_output_handle *handle);
 extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
 extern void perf_pmu_unregister(struct pmu *pmu);
 
@@ -817,6 +827,17 @@ extern void perf_event_disable(struct perf_event *event);
 extern int __perf_event_disable(void *info);
 extern void perf_event_task_tick(void);
 #else /* !CONFIG_PERF_EVENTS: */
+static inline void *
+perf_aux_output_begin(struct perf_output_handle *handle,
+		      struct perf_event *event)				{ return NULL; }
+static inline void
+perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+		    bool truncated)					{ }
+static inline int
+perf_aux_output_skip(struct perf_output_handle *handle,
+		     unsigned long size)				{ return -EINVAL; }
+static inline void *
+perf_get_aux(struct perf_output_handle *handle)				{ return NULL; }
 static inline void
 perf_event_task_sched_in(struct task_struct *prev,
 			 struct task_struct *task)			{ }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 779f60d300..48b8dc0d36 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3293,7 +3293,6 @@ static void free_event_rcu(struct rcu_head *head)
 	kfree(event);
 }
 
-static void ring_buffer_put(struct ring_buffer *rb);
 static void ring_buffer_attach(struct perf_event *event,
 			       struct ring_buffer *rb);
 
@@ -4110,7 +4109,7 @@ static void rb_free_rcu(struct rcu_head *rcu_head)
 	rb_free(rb);
 }
 
-static struct ring_buffer *ring_buffer_get(struct perf_event *event)
+struct ring_buffer *ring_buffer_get(struct perf_event *event)
 {
 	struct ring_buffer *rb;
 
@@ -4125,7 +4124,7 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
 	return rb;
 }
 
-static void ring_buffer_put(struct ring_buffer *rb)
+void ring_buffer_put(struct ring_buffer *rb)
 {
 	if (!atomic_dec_and_test(&rb->refcount))
 		return;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 4d117a9814..b701ebc325 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -36,6 +36,8 @@ struct ring_buffer {
 	struct user_struct		*mmap_user;
 
 	/* AUX area */
+	local_t				aux_head;
+	local_t				aux_nest;
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
 	atomic_t			aux_mmap_count;
@@ -56,6 +58,8 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, int flags);
 extern void rb_free_aux(struct ring_buffer *rb);
+extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
+extern void ring_buffer_put(struct ring_buffer *rb);
 
 static inline bool rb_has_aux(struct ring_buffer *rb)
 {
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 608c0d67ad..49d6d8f410 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -242,6 +242,145 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	spin_lock_init(&rb->event_lock);
 }
 
+/*
+ * This is called before hardware starts writing to the AUX area to
+ * obtain an output handle and make sure there's room in the buffer.
+ * When the capture completes, call perf_aux_output_end() to commit
+ * the recorded data to the buffer.
+ *
+ * The ordering is similar to that of perf_output_{begin,end}, with
+ * the exception of (B), which should be taken care of by the pmu
+ * driver, since ordering rules will differ depending on hardware.
+ */
+void *perf_aux_output_begin(struct perf_output_handle *handle,
+			    struct perf_event *event)
+{
+	struct perf_event *output_event = event;
+	unsigned long aux_head, aux_tail;
+	struct ring_buffer *rb;
+
+	if (output_event->parent)
+		output_event = output_event->parent;
+
+	/*
+	 * Since this will typically be open across pmu::add/pmu::del, we
+	 * grab ring_buffer's refcount instead of holding rcu read lock
+	 * to make sure it doesn't disappear under us.
+	 */
+	rb = ring_buffer_get(output_event);
+	if (!rb)
+		return NULL;
+
+	if (!rb_has_aux(rb) || !atomic_inc_not_zero(&rb->aux_refcount))
+		goto err;
+
+	/*
+	 * Nesting is not supported for AUX area, make sure nested
+	 * writers are caught early
+	 */
+	if (WARN_ON_ONCE(local_xchg(&rb->aux_nest, 1)))
+		goto err_put;
+
+	aux_head = local_read(&rb->aux_head);
+	aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+
+	handle->rb = rb;
+	handle->event = event;
+	handle->head = aux_head;
+	if (aux_head - aux_tail < perf_aux_size(rb))
+		handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
+	else
+		handle->size = 0;
+
+	/*
+	 * handle->size computation depends on aux_tail load; this forms a
+	 * control dependency barrier separating aux_tail load from aux data
+	 * store that will be enabled on successful return
+	 */
+	if (!handle->size) { /* A, matches D */
+		event->pending_disable = 1;
+		perf_output_wakeup(handle);
+		local_set(&rb->aux_nest, 0);
+		goto err_put;
+	}
+
+	return handle->rb->aux_priv;
+
+err_put:
+	rb_free_aux(rb);
+
+err:
+	ring_buffer_put(rb);
+	handle->event = NULL;
+
+	return NULL;
+}
+
+/*
+ * Commit the data written by hardware into the ring buffer by adjusting
+ * aux_head and posting a PERF_RECORD_AUX into the perf buffer. It is the
+ * pmu driver's responsibility to observe ordering rules of the hardware,
+ * so that all the data is externally visible before this is called.
+ */
+void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
+			 bool truncated)
+{
+	struct ring_buffer *rb = handle->rb;
+	unsigned long aux_head = local_read(&rb->aux_head);
+	u64 flags = 0;
+
+	if (truncated)
+		flags |= PERF_AUX_FLAG_TRUNCATED;
+
+	local_add(size, &rb->aux_head);
+
+	if (size || flags) {
+		/*
+		 * Only send RECORD_AUX if we have something useful to communicate
+		 */
+
+		perf_event_aux_event(handle->event, aux_head, size, flags);
+	}
+
+	rb->user_page->aux_head = local_read(&rb->aux_head);
+
+	perf_output_wakeup(handle);
+	handle->event = NULL;
+
+	local_set(&rb->aux_nest, 0);
+	rb_free_aux(rb);
+	ring_buffer_put(rb);
+}
+
+/*
+ * Skip over a given number of bytes in the AUX buffer, due to, for example,
+ * hardware's alignment constraints.
+ */
+int perf_aux_output_skip(struct perf_output_handle *handle, unsigned long size)
+{
+	struct ring_buffer *rb = handle->rb;
+	unsigned long aux_head;
+
+	if (size > handle->size)
+		return -ENOSPC;
+
+	local_add(size, &rb->aux_head);
+	aux_head = local_read(&rb->aux_head);
+	handle->head = aux_head;
+	handle->size -= size;
+
+	return 0;
+}
+
+void *perf_get_aux(struct perf_output_handle *handle)
+{
+	/* this is only valid between perf_aux_output_begin and *_end */
+	if (!handle->event)
+		return NULL;
+
+	return handle->rb->aux_priv;
+}
+
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
 static struct page *rb_alloc_aux_page(int node, int order)
-- 
2.1.1



* [PATCH v8 08/14] perf: Support overwrite mode for AUX area
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (6 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 07/14] perf: Add api for pmus to write to AUX area Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 09/14] perf: Add wakeup watermark control to " Alexander Shishkin
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

This adds support for overwrite mode in the AUX area, which means "keep
collecting data until you're stopped", turning the AUX area into a circular
buffer, where new data overwrites old data. It does not depend on the data
buffer's overwrite mode, so that it doesn't lose the sideband data that is
instrumental for processing the AUX data.

Overwrite mode is enabled by mapping the AUX area read-only. Even though
aux_tail in the buffer's user page might be user writable, it will be
ignored in this mode.

A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf
data stream every time an event writes new data to the AUX area. The pmu
driver might not be able to infer the exact beginning of the new data in
each snapshot; some drivers will only provide the tail, which is
aux_offset + aux_size in the AUX record. The consumer has to be able to
tell the new data from the old, for example by means of timestamps, if
such are provided in the trace.

The consumer is also responsible for disabling any events that might write
to the AUX area (thus potentially racing with the consumer) before
collecting the data.
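
For illustration (not part of this patch; save_snapshot() is a made-up
helper), overwrite mode and the snapshot sequence described above could
look like this from userspace:

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/perf_event.h>

void save_snapshot(const void *aux, __u64 size, __u64 head);	/* hypothetical */

static void take_snapshot(int fd, struct perf_event_mmap_page *pc)
{
	/* a read-only AUX mapping selects overwrite mode */
	void *aux = mmap(NULL, pc->aux_size, PROT_READ, MAP_SHARED,
			 fd, pc->aux_offset);

	/* stop anything that might still be writing before reading */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	save_snapshot(aux, pc->aux_size, pc->aux_head);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
}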

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h |  1 +
 kernel/events/internal.h        |  1 +
 kernel/events/ring_buffer.c     | 48 ++++++++++++++++++++++++++++-------------
 3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 272258380e..f36aac2ac5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -768,6 +768,7 @@ enum perf_callchain_context {
  * PERF_RECORD_AUX::flags bits
  */
 #define PERF_AUX_FLAG_TRUNCATED		0x01	/* record was truncated to fit */
+#define PERF_AUX_FLAG_OVERWRITE		0x02	/* snapshot from overwrite mode */
 
 #define PERF_FLAG_FD_NO_GROUP		(1UL << 0)
 #define PERF_FLAG_FD_OUTPUT		(1UL << 1)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index b701ebc325..ffd51d9f59 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -40,6 +40,7 @@ struct ring_buffer {
 	local_t				aux_nest;
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
+	int				aux_overwrite;
 	atomic_t			aux_mmap_count;
 	unsigned long			aux_mmap_locked;
 	void				(*free_aux)(void *);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 49d6d8f410..8bff8c85eb 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -282,26 +282,33 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 		goto err_put;
 
 	aux_head = local_read(&rb->aux_head);
-	aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
 
 	handle->rb = rb;
 	handle->event = event;
 	handle->head = aux_head;
-	if (aux_head - aux_tail < perf_aux_size(rb))
-		handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
-	else
-		handle->size = 0;
+	handle->size = 0;
 
 	/*
-	 * handle->size computation depends on aux_tail load; this forms a
-	 * control dependency barrier separating aux_tail load from aux data
-	 * store that will be enabled on successful return
+	 * In overwrite mode, AUX data stores do not depend on aux_tail,
+	 * therefore (A) control dependency barrier does not exist. The
+	 * (B) <-> (C) ordering is still observed by the pmu driver.
 	 */
-	if (!handle->size) { /* A, matches D */
-		event->pending_disable = 1;
-		perf_output_wakeup(handle);
-		local_set(&rb->aux_nest, 0);
-		goto err_put;
+	if (!rb->aux_overwrite) {
+		aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+		if (aux_head - aux_tail < perf_aux_size(rb))
+			handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
+
+		/*
+		 * handle->size computation depends on aux_tail load; this forms a
+		 * control dependency barrier separating aux_tail load from aux data
+		 * store that will be enabled on successful return
+		 */
+		if (!handle->size) { /* A, matches D */
+			event->pending_disable = 1;
+			perf_output_wakeup(handle);
+			local_set(&rb->aux_nest, 0);
+			goto err_put;
+		}
 	}
 
 	return handle->rb->aux_priv;
@@ -326,13 +333,22 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
 			 bool truncated)
 {
 	struct ring_buffer *rb = handle->rb;
-	unsigned long aux_head = local_read(&rb->aux_head);
+	unsigned long aux_head;
 	u64 flags = 0;
 
 	if (truncated)
 		flags |= PERF_AUX_FLAG_TRUNCATED;
 
-	local_add(size, &rb->aux_head);
+	/* in overwrite mode, driver provides aux_head via handle */
+	if (rb->aux_overwrite) {
+		flags |= PERF_AUX_FLAG_OVERWRITE;
+
+		aux_head = handle->head;
+		local_set(&rb->aux_head, aux_head);
+	} else {
+		aux_head = local_read(&rb->aux_head);
+		local_add(size, &rb->aux_head);
+	}
 
 	if (size || flags) {
 		/*
@@ -479,6 +495,8 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 	 */
 	atomic_set(&rb->aux_refcount, 1);
 
+	rb->aux_overwrite = overwrite;
+
 out:
 	if (!ret)
 		rb->aux_pgoff = pgoff;
-- 
2.1.1



* [PATCH v8 09/14] perf: Add wakeup watermark control to AUX area
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (7 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 08/14] perf: Support overwrite mode for " Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 10/14] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

When the AUX area gets a certain amount of new data, we want to wake up
userspace to collect it. This adds a new control to specify how much
data will cause a wakeup. This is then passed down to pmu drivers via
the output handle's "wakeup" field, so that the driver can find the
nearest point where it can generate an interrupt.

We repurpose __reserved_2 in the event attribute for this. Even though
it was never checked to be zero before, aux_watermark will only matter
for new AUX-aware code, so old code should still be fine.
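
For illustration (not part of this patch; the 512 KiB value and the missing
type/config setup are assumptions), a consumer asking for periodic wakeups
could do:

#include <poll.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_with_watermark(struct perf_event_attr *attr)
{
	attr->size	    = sizeof(*attr);
	attr->aux_watermark = 512 * 1024;	/* wake up per 512 KiB of AUX data */

	return syscall(__NR_perf_event_open, attr, 0, -1, -1, 0);
}

/* the consumer then blocks in poll() and drains the AUX area on POLLIN */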

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h |  7 +++++--
 kernel/events/core.c            |  3 ++-
 kernel/events/internal.h        |  4 +++-
 kernel/events/ring_buffer.c     | 22 +++++++++++++++++++---
 4 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index f36aac2ac5..ffd52c0861 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -238,6 +238,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER2	80	/* add: branch_sample_type */
 #define PERF_ATTR_SIZE_VER3	96	/* add: sample_regs_user */
 					/* add: sample_stack_user */
+					/* add: aux_watermark */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -332,8 +333,10 @@ struct perf_event_attr {
 	 */
 	__u32	sample_stack_user;
 
-	/* Align to u64. */
-	__u32	__reserved_2;
+	/*
+	 * Wakeup watermark for AUX area
+	 */
+	__u32	aux_watermark;
 };
 
 #define perf_flags(attr)	(*(&(attr)->read_format + 1))
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 48b8dc0d36..da2e44aedd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4419,7 +4419,8 @@ accounting:
 		perf_event_init_userpage(event);
 		perf_event_update_userpage(event);
 	} else {
-		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
+		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
+				   event->attr.aux_watermark, flags);
 		if (!ret)
 			rb->aux_mmap_locked = extra;
 	}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index ffd51d9f59..9f6ce9ba4a 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -27,6 +27,7 @@ struct ring_buffer {
 	local_t				lost;		/* nr records lost   */
 
 	long				watermark;	/* wakeup watermark  */
+	long				aux_watermark;
 	/* poll crap */
 	spinlock_t			event_lock;
 	struct list_head		event_list;
@@ -38,6 +39,7 @@ struct ring_buffer {
 	/* AUX area */
 	local_t				aux_head;
 	local_t				aux_nest;
+	local_t				aux_wakeup;
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
 	int				aux_overwrite;
@@ -57,7 +59,7 @@ extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
 extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
-			pgoff_t pgoff, int nr_pages, int flags);
+			pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb);
 extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
 extern void ring_buffer_put(struct ring_buffer *rb);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 8bff8c85eb..8d959f24ab 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -295,6 +295,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 	 */
 	if (!rb->aux_overwrite) {
 		aux_tail = ACCESS_ONCE(rb->user_page->aux_tail);
+		handle->wakeup = local_read(&rb->aux_wakeup) + rb->aux_watermark;
 		if (aux_head - aux_tail < perf_aux_size(rb))
 			handle->size = CIRC_SPACE(aux_head, aux_tail, perf_aux_size(rb));
 
@@ -358,9 +359,12 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size,
 		perf_event_aux_event(handle->event, aux_head, size, flags);
 	}
 
-	rb->user_page->aux_head = local_read(&rb->aux_head);
+	aux_head = rb->user_page->aux_head = local_read(&rb->aux_head);
 
-	perf_output_wakeup(handle);
+	if (aux_head - local_read(&rb->aux_wakeup) >= rb->aux_watermark) {
+		perf_output_wakeup(handle);
+		local_add(rb->aux_watermark, &rb->aux_wakeup);
+	}
 	handle->event = NULL;
 
 	local_set(&rb->aux_nest, 0);
@@ -382,6 +386,14 @@ int perf_aux_output_skip(struct perf_output_handle *handle, unsigned long size)
 
 	local_add(size, &rb->aux_head);
 
+	aux_head = rb->user_page->aux_head = local_read(&rb->aux_head);
+	if (aux_head - local_read(&rb->aux_wakeup) >= rb->aux_watermark) {
+		perf_output_wakeup(handle);
+		local_add(rb->aux_watermark, &rb->aux_wakeup);
+		handle->wakeup = local_read(&rb->aux_wakeup) +
+				 rb->aux_watermark;
+	}
+
 	handle->head = aux_head;
 	handle->size -= size;
 
@@ -432,7 +444,7 @@ static void rb_free_aux_page(struct ring_buffer *rb, int idx)
 }
 
 int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
-		 pgoff_t pgoff, int nr_pages, int flags)
+		 pgoff_t pgoff, int nr_pages, long watermark, int flags)
 {
 	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
 	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
@@ -496,6 +508,10 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 	atomic_set(&rb->aux_refcount, 1);
 
 	rb->aux_overwrite = overwrite;
+	rb->aux_watermark = watermark;
+
+	if (!rb->aux_watermark && !rb->aux_overwrite)
+		rb->aux_watermark = nr_pages << (PAGE_SHIFT - 1);
 
 out:
 	if (!ret)
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v8 10/14] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (8 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 09/14] perf: Add wakeup watermark control to " Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 11/14] x86: perf: Intel PT and LBR/BTS are mutually exclusive Alexander Shishkin
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

Intel Processor Trace is an architecture extension that allows for program
flow tracing.
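
For reference, the same CPUID bit that the hunk below wires up (leaf 0x7,
subleaf 0, EBX bit 25) can also be probed from userspace; a small sketch,
assuming a compiler that provides __get_cpuid_count():

	#include <stdio.h>
	#include <cpuid.h>

	/* check CPUID.(EAX=07H, ECX=0):EBX[25], the Intel PT feature bit */
	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (!__get_cpuid_count(0x7, 0, &eax, &ebx, &ecx, &edx))
			return 1;

		printf("Intel PT %ssupported\n", (ebx & (1u << 25)) ? "" : "not ");
		return 0;
	}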

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/include/asm/cpufeature.h | 1 +
 arch/x86/kernel/cpu/scattered.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 0bb1335313..86aa081a0d 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -189,6 +189,7 @@
 #define X86_FEATURE_DTHERM	( 7*32+ 7) /* Digital Thermal Sensor */
 #define X86_FEATURE_HW_PSTATE	( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
+#define X86_FEATURE_INTEL_PT	( 7*32+10) /* Intel Processor Trace */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW  ( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 4a8013d559..42f5fa953f 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -36,6 +36,7 @@ void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
 		{ X86_FEATURE_ARAT,		CR_EAX, 2, 0x00000006, 0 },
 		{ X86_FEATURE_PLN,		CR_EAX, 4, 0x00000006, 0 },
 		{ X86_FEATURE_PTS,		CR_EAX, 6, 0x00000006, 0 },
+		{ X86_FEATURE_INTEL_PT,		CR_EBX,25, 0x00000007, 0 },
 		{ X86_FEATURE_APERFMPERF,	CR_ECX, 0, 0x00000006, 0 },
 		{ X86_FEATURE_EPB,		CR_ECX, 3, 0x00000006, 0 },
 		{ X86_FEATURE_HW_PSTATE,	CR_EDX, 7, 0x80000007, 0 },
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v8 11/14] x86: perf: Intel PT and LBR/BTS are mutually exclusive
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (9 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 10/14] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

Intel PT cannot be used at the same time as LBR or BTS and will cause a
general protection fault if they are used together. In order to avoid
fixing up GPs in the fast path, we instead disallow creating LBR/BTS
events when PT events are present, and vice versa.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.c       | 43 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event.h       | 40 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel.c | 26 ++++++++++----------
 3 files changed, 95 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 143e5f5dc8..1cca1ac9b6 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -259,6 +259,14 @@ static void hw_perf_event_destroy(struct perf_event *event)
 	}
 }
 
+void hw_perf_lbr_event_destroy(struct perf_event *event)
+{
+	hw_perf_event_destroy(event);
+
+	/* undo the lbr/bts event accounting */
+	x86_del_exclusive(x86_lbr_exclusive_lbr);
+}
+
 static inline int x86_pmu_initialized(void)
 {
 	return x86_pmu.handle_irq != NULL;
@@ -298,6 +306,35 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
 	return x86_pmu_extra_regs(val, event);
 }
 
+/*
+ * Check if we can create event of a certain type (that no conflicting events
+ * are present).
+ */
+int x86_add_exclusive(unsigned int what)
+{
+	int ret = -EBUSY, i;
+
+	if (atomic_inc_not_zero(&x86_pmu.lbr_exclusive[what]))
+		return 0;
+
+	mutex_lock(&pmc_reserve_mutex);
+	for (i = 0; i < ARRAY_SIZE(x86_pmu.lbr_exclusive); i++)
+		if (i != what && atomic_read(&x86_pmu.lbr_exclusive[i]))
+			goto out;
+
+	atomic_inc(&x86_pmu.lbr_exclusive[what]);
+	ret = 0;
+
+out:
+	mutex_unlock(&pmc_reserve_mutex);
+	return ret;
+}
+
+void x86_del_exclusive(unsigned int what)
+{
+	atomic_dec(&x86_pmu.lbr_exclusive[what]);
+}
+
 int x86_setup_perfctr(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
@@ -342,6 +379,12 @@ int x86_setup_perfctr(struct perf_event *event)
 		/* BTS is currently only allowed for user-mode. */
 		if (!attr->exclude_kernel)
 			return -EOPNOTSUPP;
+
+		/* disallow bts if conflicting events are present */
+		if (x86_add_exclusive(x86_lbr_exclusive_lbr))
+			return -EBUSY;
+
+		event->destroy = hw_perf_lbr_event_destroy;
 	}
 
 	hwc->config |= config;
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index fc5eb390b3..620612d311 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -402,6 +402,12 @@ union x86_pmu_config {
 
 #define X86_CONFIG(args...) ((union x86_pmu_config){.bits = {args}}).value
 
+enum {
+	x86_lbr_exclusive_lbr,
+	x86_lbr_exclusive_pt,
+	x86_lbr_exclusive_max,
+};
+
 /*
  * struct x86_pmu - generic x86 pmu
  */
@@ -498,6 +504,11 @@ struct x86_pmu {
 	bool		lbr_double_abort;	   /* duplicated lbr aborts */
 
 	/*
+	 * Intel PT/LBR/BTS are exclusive
+	 */
+	atomic_t	lbr_exclusive[x86_lbr_exclusive_max];
+
+	/*
 	 * Extra registers for events
 	 */
 	struct extra_reg *extra_regs;
@@ -582,6 +593,12 @@ static inline int x86_pmu_rdpmc_index(int index)
 	return x86_pmu.rdpmc_index ? x86_pmu.rdpmc_index(index) : index;
 }
 
+int x86_add_exclusive(unsigned int what);
+
+void x86_del_exclusive(unsigned int what);
+
+void hw_perf_lbr_event_destroy(struct perf_event *event);
+
 int x86_setup_perfctr(struct perf_event *event);
 
 int x86_pmu_hw_config(struct perf_event *event);
@@ -668,6 +685,29 @@ static inline int amd_pmu_init(void)
 
 #ifdef CONFIG_CPU_SUP_INTEL
 
+static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
+{
+	/* user explicitly requested branch sampling */
+	if (has_branch_stack(event))
+		return true;
+
+	/* implicit branch sampling to correct PEBS skid */
+	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
+	    x86_pmu.intel_cap.pebs_format < 2)
+		return true;
+
+	return false;
+}
+
+static inline bool intel_pmu_has_bts(struct perf_event *event)
+{
+	if (event->attr.config == PERF_COUNT_HW_BRANCH_INSTRUCTIONS &&
+	    !event->attr.freq && event->hw.sample_period == 1)
+		return true;
+
+	return false;
+}
+
 int intel_pmu_save_and_restart(struct perf_event *event);
 
 struct event_constraint *
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 944bf019b7..c64e97a1cb 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1029,20 +1029,6 @@ static __initconst const u64 slm_hw_cache_event_ids
  },
 };
 
-static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
-{
-	/* user explicitly requested branch sampling */
-	if (has_branch_stack(event))
-		return true;
-
-	/* implicit branch sampling to correct PEBS skid */
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
-	    x86_pmu.intel_cap.pebs_format < 2)
-		return true;
-
-	return false;
-}
-
 static void intel_pmu_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1751,6 +1737,18 @@ static int intel_pmu_hw_config(struct perf_event *event)
 		ret = intel_pmu_setup_lbr_filter(event);
 		if (ret)
 			return ret;
+
+		/*
+		 * bts is set up earlier in this path, so don't account
+		 * twice
+		 */
+		if (!intel_pmu_has_bts(event)) {
+			/* disallow lbr if conflicting events are present */
+			if (x86_add_exclusive(x86_lbr_exclusive_lbr))
+				return -EBUSY;
+
+			event->destroy = hw_perf_lbr_event_destroy;
+		}
 	}
 
 	if (event->attr.type != PERF_TYPE_RAW)
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (10 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 11/14] x86: perf: Intel PT and LBR/BTS are mutually exclusive Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2015-01-09 12:48   ` Peter Zijlstra
                     ` (2 more replies)
  2014-11-14 13:43 ` [PATCH v8 13/14] x86: perf: intel_bts: Add BTS " Alexander Shishkin
                   ` (2 subsequent siblings)
  14 siblings, 3 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

Add support for Intel Processor Trace (PT) to the kernel's perf events.
PT is an extension of Intel Architecture that collects information about
software execution, such as control flow, execution modes and timings,
and formats it into highly compressed binary packets. Even though they
are compressed, these packets are generated at hundreds of megabytes per
second per core, which makes it impractical to decode them on the fly
in the kernel.

This driver exports trace data through the AUX space in the perf ring
buffer, which is zero-copy mapped into userspace for faster data retrieval.
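
For context, a minimal userspace sketch of how that mapping is set up
(assuming the aux_{offset,size} user page fields introduced by this
series; the helper name is made up and error handling is omitted):

	#include <linux/perf_event.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* illustration only; error handling omitted */
	static void *map_aux_area(int fd, size_t data_pages, size_t aux_pages)
	{
		size_t page = sysconf(_SC_PAGESIZE);
		struct perf_event_mmap_page *up;

		/* user page plus the regular data buffer */
		up = mmap(NULL, (data_pages + 1) * page, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);

		/* tell the kernel where the AUX area lives and how big it is */
		up->aux_offset = (data_pages + 1) * page;
		up->aux_size   = aux_pages * page;

		/* map the AUX area itself from the same fd at that offset */
		return mmap(NULL, up->aux_size, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, up->aux_offset);
	}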

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/include/uapi/asm/msr-index.h     |   18 +
 arch/x86/kernel/cpu/Makefile              |    1 +
 arch/x86/kernel/cpu/intel_pt.h            |  131 ++++
 arch/x86/kernel/cpu/perf_event.h          |    2 +
 arch/x86/kernel/cpu/perf_event_intel.c    |    8 +
 arch/x86/kernel/cpu/perf_event_intel_pt.c | 1082 +++++++++++++++++++++++++++++
 6 files changed, 1242 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/intel_pt.h
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index e21331ce36..caae296ab7 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -74,6 +74,24 @@
 #define MSR_IA32_PERF_CAPABILITIES	0x00000345
 #define MSR_PEBS_LD_LAT_THRESHOLD	0x000003f6
 
+#define MSR_IA32_RTIT_CTL		0x00000570
+#define RTIT_CTL_TRACEEN		BIT(0)
+#define RTIT_CTL_OS			BIT(2)
+#define RTIT_CTL_USR			BIT(3)
+#define RTIT_CTL_CR3EN			BIT(7)
+#define RTIT_CTL_TOPA			BIT(8)
+#define RTIT_CTL_TSC_EN			BIT(10)
+#define RTIT_CTL_DISRETC		BIT(11)
+#define RTIT_CTL_BRANCH_EN		BIT(13)
+#define MSR_IA32_RTIT_STATUS		0x00000571
+#define RTIT_STATUS_CONTEXTEN		BIT(1)
+#define RTIT_STATUS_TRIGGEREN		BIT(2)
+#define RTIT_STATUS_ERROR		BIT(4)
+#define RTIT_STATUS_STOPPED		BIT(5)
+#define MSR_IA32_RTIT_CR3_MATCH		0x00000572
+#define MSR_IA32_RTIT_OUTPUT_BASE	0x00000560
+#define MSR_IA32_RTIT_OUTPUT_MASK	0x00000561
+
 #define MSR_MTRRfix64K_00000		0x00000250
 #define MSR_MTRRfix16K_80000		0x00000258
 #define MSR_MTRRfix16K_A0000		0x00000259
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index e27b49d7c9..4bc29fa23e 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -40,6 +40,7 @@ endif
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o
 
 obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE)	+= perf_event_intel_uncore.o \
 					   perf_event_intel_uncore_snb.o \
diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
new file mode 100644
index 0000000000..1c338b0eba
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -0,0 +1,131 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#ifndef __INTEL_PT_H__
+#define __INTEL_PT_H__
+
+/*
+ * Single-entry ToPA: when this close to region boundary, switch
+ * buffers to avoid losing data.
+ */
+#define TOPA_PMI_MARGIN 512
+
+/*
+ * Table of Physical Addresses bits
+ */
+enum topa_sz {
+	TOPA_4K	= 0,
+	TOPA_8K,
+	TOPA_16K,
+	TOPA_32K,
+	TOPA_64K,
+	TOPA_128K,
+	TOPA_256K,
+	TOPA_512K,
+	TOPA_1MB,
+	TOPA_2MB,
+	TOPA_4MB,
+	TOPA_8MB,
+	TOPA_16MB,
+	TOPA_32MB,
+	TOPA_64MB,
+	TOPA_128MB,
+	TOPA_SZ_END,
+};
+
+static inline unsigned int sizes(enum topa_sz tsz)
+{
+	return 1 << (tsz + 12);
+};
+
+struct topa_entry {
+	u64	end	: 1;
+	u64	rsvd0	: 1;
+	u64	intr	: 1;
+	u64	rsvd1	: 1;
+	u64	stop	: 1;
+	u64	rsvd2	: 1;
+	u64	size	: 4;
+	u64	rsvd3	: 2;
+	u64	base	: 36;
+	u64	rsvd4	: 16;
+};
+
+#define TOPA_SHIFT 12
+#define PT_CPUID_LEAVES 2
+
+enum pt_capabilities {
+	PT_CAP_max_subleaf = 0,
+	PT_CAP_cr3_filtering,
+	PT_CAP_topa_output,
+	PT_CAP_topa_multiple_entries,
+	PT_CAP_payloads_lip,
+};
+
+struct pt_pmu {
+	struct pmu		pmu;
+	u32			caps[4 * PT_CPUID_LEAVES];
+};
+
+/**
+ * struct pt_buffer - buffer configuration; one buffer per task_struct or
+ *		cpu, depending on perf event configuration
+ * @cpu:	cpu for per-cpu allocation
+ * @tables:	list of ToPA tables in this buffer
+ * @first:	shorthand for first topa table
+ * @last:	shorthand for last topa table
+ * @cur:	current topa table
+ * @nr_pages:	buffer size in pages
+ * @cur_idx:	current output region's index within @cur table
+ * @output_off:	offset within the current output region
+ * @data_size:	running total of the amount of data in this buffer
+ * @lost:	if data was lost/truncated
+ * @head:	logical write offset inside the buffer
+ * @snapshot:	if this is for a snapshot/overwrite counter
+ * @stop_pos:	STOP topa entry in the buffer
+ * @intr_pos:	INT topa entry in the buffer
+ * @data_pages:	array of pages from perf
+ * @topa_index:	table of topa entries indexed by page offset
+ */
+struct pt_buffer {
+	int			cpu;
+	struct list_head	tables;
+	struct topa		*first, *last, *cur;
+	unsigned int		cur_idx;
+	size_t			output_off;
+	unsigned long		nr_pages;
+	local_t			data_size;
+	local_t			lost;
+	local64_t		head;
+	bool			snapshot;
+	unsigned long		stop_pos, intr_pos;
+	void			**data_pages;
+	struct topa_entry	*topa_index[0];
+};
+
+/**
+ * struct pt - per-cpu pt context
+ * @handle:	perf output handle
+ * @handle_nmi:	do handle PT PMI on this cpu, there's an active event
+ */
+struct pt {
+	struct perf_output_handle handle;
+	int			handle_nmi;
+};
+
+#endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 620612d311..9ab6c59851 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -783,6 +783,8 @@ void intel_pmu_lbr_init_snb(void);
 
 int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
+void intel_pt_interrupt(void);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index c64e97a1cb..5a29ea59d6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1385,6 +1385,14 @@ again:
 	}
 
 	/*
+	 * Intel PT
+	 */
+	if (__test_and_clear_bit(55, (unsigned long *)&status)) {
+		handled++;
+		intel_pt_interrupt();
+	}
+
+	/*
 	 * Checkpointed counters can lead to 'spurious' PMIs because the
 	 * rollback caused by the PMI will have cleared the overflow status
 	 * bit. Therefore always force probe these counters.
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
new file mode 100644
index 0000000000..f2fda8f038
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -0,0 +1,1082 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#undef DEBUG
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/device.h>
+
+#include <asm/perf_event.h>
+#include <asm/insn.h>
+
+#include "perf_event.h"
+#include "intel_pt.h"
+
+static DEFINE_PER_CPU(struct pt, pt_ctx);
+
+static struct pt_pmu pt_pmu;
+
+enum cpuid_regs {
+	CR_EAX = 0,
+	CR_ECX,
+	CR_EDX,
+	CR_EBX
+};
+
+/*
+ * Capabilities of Intel PT hardware, such as number of address bits or
+ * supported output schemes, are cached and exported to userspace as "caps"
+ * attribute group of pt pmu device
+ * (/sys/bus/event_source/devices/intel_pt/caps/) so that userspace can store
+ * relevant bits together with intel_pt traces.
+ *
+ * These are necessary for both trace decoding (payloads_lip, contains address
+ * width encoded in IP-related packets), and event configuration (bitmasks with
+ * permitted values for certain bit fields).
+ */
+#define PT_CAP(_n, _l, _r, _m)						\
+	[PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l,	\
+			    .reg = _r, .mask = _m }
+
+static struct pt_cap_desc {
+	const char	*name;
+	u32		leaf;
+	u8		reg;
+	u32		mask;
+} pt_caps[] = {
+	PT_CAP(max_subleaf,		0, CR_EAX, 0xffffffff),
+	PT_CAP(cr3_filtering,		0, CR_EBX, BIT(0)),
+	PT_CAP(topa_output,		0, CR_ECX, BIT(0)),
+	PT_CAP(topa_multiple_entries,	0, CR_ECX, BIT(1)),
+	PT_CAP(payloads_lip,		0, CR_ECX, BIT(31)),
+};
+
+static u32 pt_cap_get(enum pt_capabilities cap)
+{
+	struct pt_cap_desc *cd = &pt_caps[cap];
+	u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
+	unsigned int shift = __ffs(cd->mask);
+
+	return (c & cd->mask) >> shift;
+}
+
+static ssize_t pt_cap_show(struct device *cdev,
+			   struct device_attribute *attr,
+			   char *buf)
+{
+	struct dev_ext_attribute *ea =
+		container_of(attr, struct dev_ext_attribute, attr);
+	enum pt_capabilities cap = (long)ea->var;
+
+	return snprintf(buf, PAGE_SIZE, "%x\n", pt_cap_get(cap));
+}
+
+static struct attribute_group pt_cap_group = {
+	.name	= "caps",
+};
+
+PMU_FORMAT_ATTR(tsc,		"config:10"	);
+PMU_FORMAT_ATTR(noretcomp,	"config:11"	);
+
+static struct attribute *pt_formats_attr[] = {
+	&format_attr_tsc.attr,
+	&format_attr_noretcomp.attr,
+	NULL,
+};
+
+static struct attribute_group pt_format_group = {
+	.name	= "format",
+	.attrs	= pt_formats_attr,
+};
+
+static const struct attribute_group *pt_attr_groups[] = {
+	&pt_cap_group,
+	&pt_format_group,
+	NULL,
+};
+
+static int __init pt_pmu_hw_init(void)
+{
+	struct dev_ext_attribute *de_attrs;
+	struct attribute **attrs;
+	size_t size;
+	long i;
+
+	if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
+		for (i = 0; i < PT_CPUID_LEAVES; i++)
+			cpuid_count(20, i,
+				    &pt_pmu.caps[CR_EAX + i * 4],
+				    &pt_pmu.caps[CR_EBX + i * 4],
+				    &pt_pmu.caps[CR_ECX + i * 4],
+				    &pt_pmu.caps[CR_EDX + i * 4]);
+	} else {
+		return -ENODEV;
+	}
+
+	size = sizeof(struct attribute *) * (ARRAY_SIZE(pt_caps) + 1);
+	attrs = kzalloc(size, GFP_KERNEL);
+	if (!attrs)
+		goto err_attrs;
+
+	size = sizeof(struct dev_ext_attribute) * (ARRAY_SIZE(pt_caps) + 1);
+	de_attrs = kzalloc(size, GFP_KERNEL);
+	if (!de_attrs)
+		goto err_de_attrs;
+
+	for (i = 0; i < ARRAY_SIZE(pt_caps); i++) {
+		de_attrs[i].attr.attr.name = pt_caps[i].name;
+
+		sysfs_attr_init(&de_attrs[i].attr.attr);
+		de_attrs[i].attr.attr.mode = S_IRUGO;
+		de_attrs[i].attr.show = pt_cap_show;
+		de_attrs[i].var = (void *)i;
+		attrs[i] = &de_attrs[i].attr.attr;
+	}
+
+	pt_cap_group.attrs = attrs;
+	return 0;
+
+err_de_attrs:
+	kfree(de_attrs);
+err_attrs:
+	kfree(attrs);
+
+	return -ENOMEM;
+}
+
+#define PT_CONFIG_MASK (RTIT_CTL_TSC_EN | RTIT_CTL_DISRETC)
+
+static bool pt_event_valid(struct perf_event *event)
+{
+	u64 config = event->attr.config;
+
+	if ((config & PT_CONFIG_MASK) != config)
+		return false;
+
+	return true;
+}
+
+/*
+ * PT configuration helpers
+ * These all are cpu affine and operate on a local PT
+ */
+
+static bool pt_is_running(void)
+{
+	u64 ctl;
+
+	rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+	return !!(ctl & RTIT_CTL_TRACEEN);
+}
+
+static void pt_config(struct perf_event *event)
+{
+	u64 reg;
+
+	reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN | RTIT_CTL_TRACEEN;
+
+	if (!event->attr.exclude_kernel)
+		reg |= RTIT_CTL_OS;
+	if (!event->attr.exclude_user)
+		reg |= RTIT_CTL_USR;
+
+	reg |= (event->attr.config & PT_CONFIG_MASK);
+
+	wrmsrl(MSR_IA32_RTIT_CTL, reg);
+}
+
+static void pt_config_start(bool start)
+{
+	u64 ctl;
+
+	rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+	if (start)
+		ctl |= RTIT_CTL_TRACEEN;
+	else
+		ctl &= ~RTIT_CTL_TRACEEN;
+	wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+	/*
+	 * A wrmsr that disables trace generation serializes other PT
+	 * registers and causes all data packets to be written to memory,
+	 * but a fence is required for the data to become globally visible.
+	 *
+	 * The below WMB, separating data store and aux_head store matches
+	 * the consumer's RMB that separates aux_head load and data load.
+	 */
+	if (!start)
+		wmb();
+}
+
+static void pt_config_buffer(void *buf, unsigned int topa_idx,
+			     unsigned int output_off)
+{
+	u64 reg;
+
+	wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, virt_to_phys(buf));
+
+	reg = 0x7f | ((u64)topa_idx << 7) | ((u64)output_off << 32);
+
+	wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, reg);
+}
+
+/*
+ * Keep ToPA table-related metadata on the same page as the actual table,
+ * taking up a few words from the top
+ */
+
+#define TENTS_PER_PAGE (((PAGE_SIZE - 40) / sizeof(struct topa_entry)) - 1)
+
+/**
+ * struct topa - page-sized ToPA table with metadata at the top
+ * @table:	actual ToPA table entries, as understood by PT hardware
+ * @list:	linkage to struct pt_buffer's list of tables
+ * @phys:	physical address of this page
+ * @offset:	offset of the first entry in this table in the buffer
+ * @size:	total size of all entries in this table
+ * @last:	index of the last initialized entry in this table
+ */
+struct topa {
+	struct topa_entry	table[TENTS_PER_PAGE];
+	struct list_head	list;
+	u64			phys;
+	u64			offset;
+	size_t			size;
+	int			last;
+};
+
+/* make -1 stand for the last table entry */
+#define TOPA_ENTRY(t, i) ((i) == -1 ? &(t)->table[(t)->last] : &(t)->table[(i)])
+
+/**
+ * topa_alloc() - allocate page-sized ToPA table
+ * @cpu:	CPU on which to allocate.
+ * @gfp:	Allocation flags.
+ *
+ * Return:	On success, return the pointer to ToPA table page.
+ */
+static struct topa *topa_alloc(int cpu, gfp_t gfp)
+{
+	int node = cpu_to_node(cpu);
+	struct topa *topa;
+	struct page *p;
+
+	p = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
+	if (!p)
+		return NULL;
+
+	topa = page_address(p);
+	topa->last = 0;
+	topa->phys = page_to_phys(p);
+
+	/*
+	 * In case of single-entry ToPA, always put the self-referencing END
+	 * link as the 2nd entry in the table
+	 */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(topa, 1)->base = topa->phys >> TOPA_SHIFT;
+		TOPA_ENTRY(topa, 1)->end = 1;
+	}
+
+	return topa;
+}
+
+/**
+ * topa_free() - free a page-sized ToPA table
+ * @topa:	Table to deallocate.
+ */
+static void topa_free(struct topa *topa)
+{
+	free_page((unsigned long)topa);
+}
+
+/**
+ * topa_insert_table() - insert a ToPA table into a buffer
+ * @buf:	 PT buffer that's being extended.
+ * @topa:	 New topa table to be inserted.
+ *
+ * If it's the first table in this buffer, set up buffer's pointers
+ * accordingly; otherwise, add a END=1 link entry to @topa to the current
+ * "last" table and adjust the last table pointer to @topa.
+ */
+static void topa_insert_table(struct pt_buffer *buf, struct topa *topa)
+{
+	struct topa *last = buf->last;
+
+	list_add_tail(&topa->list, &buf->tables);
+
+	if (!buf->first) {
+		buf->first = buf->last = buf->cur = topa;
+		return;
+	}
+
+	topa->offset = last->offset + last->size;
+	buf->last = topa;
+
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		return;
+
+	BUG_ON(last->last != TENTS_PER_PAGE - 1);
+
+	TOPA_ENTRY(last, -1)->base = topa->phys >> TOPA_SHIFT;
+	TOPA_ENTRY(last, -1)->end = 1;
+}
+
+/**
+ * topa_table_full() - check if a ToPA table is filled up
+ * @topa:	ToPA table.
+ */
+static bool topa_table_full(struct topa *topa)
+{
+	/* single-entry ToPA is a special case */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		return !!topa->last;
+
+	return topa->last == TENTS_PER_PAGE - 1;
+}
+
+/**
+ * topa_insert_pages() - create a list of ToPA tables
+ * @buf:	PT buffer being initialized.
+ * @gfp:	Allocation flags.
+ *
+ * This initializes a list of ToPA tables with entries from
+ * the data_pages provided by rb_alloc_aux().
+ *
+ * Return:	0 on success or error code.
+ */
+static int topa_insert_pages(struct pt_buffer *buf, gfp_t gfp)
+{
+	struct topa *topa = buf->last;
+	int order = 0;
+	struct page *p;
+
+	p = virt_to_page(buf->data_pages[buf->nr_pages]);
+	if (PagePrivate(p))
+		order = page_private(p);
+
+	if (topa_table_full(topa)) {
+		topa = topa_alloc(buf->cpu, gfp);
+		if (!topa)
+			return -ENOMEM;
+
+		topa_insert_table(buf, topa);
+	}
+
+	TOPA_ENTRY(topa, -1)->base = page_to_phys(p) >> TOPA_SHIFT;
+	TOPA_ENTRY(topa, -1)->size = order;
+	if (!buf->snapshot && !pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(topa, -1)->intr = 1;
+		TOPA_ENTRY(topa, -1)->stop = 1;
+	}
+
+	topa->last++;
+	topa->size += sizes(order);
+
+	buf->nr_pages += 1ul << order;
+
+	return 0;
+}
+
+/**
+ * pt_topa_dump() - print ToPA tables and their entries
+ * @buf:	PT buffer.
+ */
+static void pt_topa_dump(struct pt_buffer *buf)
+{
+	struct topa *topa;
+
+	list_for_each_entry(topa, &buf->tables, list) {
+		int i;
+
+		pr_debug("# table @%p (%p), off %llx size %zx\n", topa->table,
+			 (void *)topa->phys, topa->offset, topa->size);
+		for (i = 0; i < TENTS_PER_PAGE; i++) {
+			pr_debug("# entry @%p (%lx sz %u %c%c%c) raw=%16llx\n",
+				 &topa->table[i],
+				 (unsigned long)topa->table[i].base << TOPA_SHIFT,
+				 sizes(topa->table[i].size),
+				 topa->table[i].end ?  'E' : ' ',
+				 topa->table[i].intr ? 'I' : ' ',
+				 topa->table[i].stop ? 'S' : ' ',
+				 *(u64 *)&topa->table[i]);
+			if ((pt_cap_get(PT_CAP_topa_multiple_entries) &&
+			     topa->table[i].stop) ||
+			    topa->table[i].end)
+				break;
+		}
+	}
+}
+
+/**
+ * pt_buffer_advance() - advance to the next output region
+ * @buf:	PT buffer.
+ *
+ * Advance the current pointers in the buffer to the next ToPA entry.
+ */
+static void pt_buffer_advance(struct pt_buffer *buf)
+{
+	buf->output_off = 0;
+	buf->cur_idx++;
+
+	if (buf->cur_idx == buf->cur->last) {
+		if (buf->cur == buf->last)
+			buf->cur = buf->first;
+		else
+			buf->cur = list_entry(buf->cur->list.next, struct topa,
+					      list);
+		buf->cur_idx = 0;
+	}
+}
+
+/**
+ * pt_update_head() - calculate current offsets and sizes
+ * @pt:		Per-cpu pt context.
+ *
+ * Update buffer's current write pointer position and data size.
+ */
+static void pt_update_head(struct pt *pt)
+{
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	u64 topa_idx, base, old;
+
+	/* offset of the first region in this table from the beginning of buf */
+	base = buf->cur->offset + buf->output_off;
+
+	/* offset of the current output region within this table */
+	for (topa_idx = 0; topa_idx < buf->cur_idx; topa_idx++)
+		base += sizes(buf->cur->table[topa_idx].size);
+
+	if (buf->snapshot) {
+		local_set(&buf->data_size, base);
+	} else {
+		old = (local64_xchg(&buf->head, base) &
+		       ((buf->nr_pages << PAGE_SHIFT) - 1));
+		if (base < old)
+			base += buf->nr_pages << PAGE_SHIFT;
+
+		local_add(base - old, &buf->data_size);
+	}
+}
+
+/**
+ * pt_buffer_region() - obtain current output region's address
+ * @buf:	PT buffer.
+ */
+static void *pt_buffer_region(struct pt_buffer *buf)
+{
+	return phys_to_virt(buf->cur->table[buf->cur_idx].base << TOPA_SHIFT);
+}
+
+/**
+ * pt_buffer_region_size() - obtain current output region's size
+ * @buf:	PT buffer.
+ */
+static size_t pt_buffer_region_size(struct pt_buffer *buf)
+{
+	return sizes(buf->cur->table[buf->cur_idx].size);
+}
+
+/**
+ * pt_handle_status() - take care of possible status conditions
+ * @pt:		Per-cpu pt context.
+ */
+static void pt_handle_status(struct pt *pt)
+{
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	int advance = 0;
+	u64 status;
+
+	rdmsrl(MSR_IA32_RTIT_STATUS, status);
+
+	if (status & RTIT_STATUS_ERROR) {
+		pr_err_ratelimited("ToPA ERROR encountered, trying to recover\n");
+		pt_topa_dump(buf);
+		status &= ~RTIT_STATUS_ERROR;
+	}
+
+	if (status & RTIT_STATUS_STOPPED) {
+		status &= ~RTIT_STATUS_STOPPED;
+
+		/*
+		 * On systems that only do single-entry ToPA, hitting STOP
+		 * means we are already losing data; need to let the decoder
+		 * know.
+		 */
+		if (!pt_cap_get(PT_CAP_topa_multiple_entries) ||
+		    buf->output_off == sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size)) {
+			local_inc(&buf->lost);
+			advance++;
+		}
+	}
+
+	/*
+	 * Also on single-entry ToPA implementations, interrupt will come
+	 * before the output reaches its output region's boundary.
+	 */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries) && !buf->snapshot &&
+	    pt_buffer_region_size(buf) - buf->output_off <= TOPA_PMI_MARGIN) {
+		void *head = pt_buffer_region(buf);
+
+		/* everything within this margin needs to be zeroed out */
+		memset(head + buf->output_off, 0,
+		       pt_buffer_region_size(buf) -
+		       buf->output_off);
+		advance++;
+	}
+
+	if (advance)
+		pt_buffer_advance(buf);
+
+	wrmsrl(MSR_IA32_RTIT_STATUS, status);
+}
+
+/**
+ * pt_read_offset() - translate registers into buffer pointers
+ * @buf:	PT buffer.
+ *
+ * Set buffer's output pointers from MSR values.
+ */
+static void pt_read_offset(struct pt_buffer *buf)
+{
+	u64 offset, base_topa;
+
+	rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, base_topa);
+	buf->cur = phys_to_virt(base_topa);
+
+	rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, offset);
+	/* offset within current output region */
+	buf->output_off = offset >> 32;
+	/* index of current output region within this table */
+	buf->cur_idx = (offset & 0xffffff80) >> 7;
+}
+
+/**
+ * pt_topa_next_entry() - obtain index of the first page in the next ToPA entry
+ * @buf:	PT buffer.
+ * @pg:		Page offset in the buffer.
+ *
+ * When advancing to the next output region (ToPA entry), given a page offset
+ * into the buffer, we need to find the offset of the first page in the next
+ * region.
+ */
+static unsigned int pt_topa_next_entry(struct pt_buffer *buf, unsigned int pg)
+{
+	struct topa_entry *te = buf->topa_index[pg];
+
+	/* one region */
+	if (buf->first == buf->last && buf->first->last == 1)
+		return pg;
+
+	do {
+		pg++;
+		pg &= buf->nr_pages - 1;
+	} while (buf->topa_index[pg] == te);
+
+	return pg;
+}
+
+/**
+ * pt_buffer_reset_markers() - place interrupt and stop bits in the buffer
+ * @buf:	PT buffer.
+ * @handle:	Current output handle.
+ *
+ * Place INT and STOP marks to prevent overwriting old data that the consumer
+ * hasn't yet collected.
+ */
+static int pt_buffer_reset_markers(struct pt_buffer *buf,
+				   struct perf_output_handle *handle)
+
+{
+	unsigned long idx, npages, end;
+
+	if (buf->snapshot)
+		return 0;
+
+	/* can't stop in the middle of an output region */
+	if (buf->output_off + handle->size + 1 <
+	    sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size))
+		return -EINVAL;
+
+
+	/* single entry ToPA is handled by marking all regions STOP=1 INT=1 */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		return 0;
+
+	/* clear STOP and INT from current entry */
+	buf->topa_index[buf->stop_pos]->stop = 0;
+	buf->topa_index[buf->intr_pos]->intr = 0;
+
+	if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		npages = (handle->size + 1) >> PAGE_SHIFT;
+		end = (local64_read(&buf->head) >> PAGE_SHIFT) + npages;
+		/*if (end > handle->wakeup >> PAGE_SHIFT)
+		  end = handle->wakeup >> PAGE_SHIFT;*/
+		idx = end & (buf->nr_pages - 1);
+		buf->stop_pos = idx;
+		idx = (local64_read(&buf->head) >> PAGE_SHIFT) + npages - 1;
+		idx &= buf->nr_pages - 1;
+		buf->intr_pos = idx;
+	}
+
+	buf->topa_index[buf->stop_pos]->stop = 1;
+	buf->topa_index[buf->intr_pos]->intr = 1;
+
+	return 0;
+}
+
+/**
+ * pt_buffer_setup_topa_index() - build topa_index[] table of regions
+ * @buf:	PT buffer.
+ *
+ * topa_index[] references output regions indexed by offset into the
+ * buffer for purposes of quick reverse lookup.
+ */
+static void pt_buffer_setup_topa_index(struct pt_buffer *buf)
+{
+	struct topa *cur = buf->first, *prev = buf->last;
+	struct topa_entry *te_cur = TOPA_ENTRY(cur, 0),
+		*te_prev = TOPA_ENTRY(prev, prev->last - 1);
+	int pg = 0, idx = 0, ntopa = 0;
+
+	while (pg < buf->nr_pages) {
+		int tidx;
+
+		/* pages within one topa entry */
+		for (tidx = 0; tidx < 1 << te_cur->size; tidx++, pg++)
+			buf->topa_index[pg] = te_prev;
+
+		te_prev = te_cur;
+
+		if (idx == cur->last - 1) {
+			/* advance to next topa table */
+			idx = 0;
+			cur = list_entry(cur->list.next, struct topa, list);
+			ntopa++;
+		} else
+			idx++;
+		te_cur = TOPA_ENTRY(cur, idx);
+	}
+
+}
+
+/**
+ * pt_buffer_reset_offsets() - adjust buffer's write pointers from aux_head
+ * @buf:	PT buffer.
+ * @head:	Write pointer (aux_head) from AUX buffer.
+ *
+ * Find the ToPA table and entry corresponding to given @head and set buffer's
+ * "current" pointers accordingly.
+ */
+static void pt_buffer_reset_offsets(struct pt_buffer *buf, unsigned long head)
+{
+	int pg;
+
+	if (buf->snapshot)
+		head &= (buf->nr_pages << PAGE_SHIFT) - 1;
+
+	pg = (head >> PAGE_SHIFT) & (buf->nr_pages - 1);
+	pg = pt_topa_next_entry(buf, pg);
+
+	buf->cur = (struct topa *)((unsigned long)buf->topa_index[pg] & PAGE_MASK);
+	buf->cur_idx = ((unsigned long)buf->topa_index[pg] -
+			(unsigned long)buf->cur) / sizeof(struct topa_entry);
+	buf->output_off = head & (sizes(buf->cur->table[buf->cur_idx].size) - 1);
+
+	local64_set(&buf->head, head);
+	local_set(&buf->data_size, 0);
+}
+
+/**
+ * pt_buffer_fini_topa() - deallocate ToPA structure of a buffer
+ * @buf:	PT buffer.
+ */
+static void pt_buffer_fini_topa(struct pt_buffer *buf)
+{
+	struct topa *topa, *iter;
+
+	list_for_each_entry_safe(topa, iter, &buf->tables, list) {
+		/*
+		 * right now, this is in free_aux() path only, so
+		 * no need to unlink this table from the list
+		 */
+		topa_free(topa);
+	}
+}
+
+/**
+ * pt_buffer_init_topa() - initialize ToPA table for pt buffer
+ * @buf:	PT buffer.
+ * @size:	Total size of all regions within this ToPA.
+ * @gfp:	Allocation flags.
+ */
+static int pt_buffer_init_topa(struct pt_buffer *buf, unsigned long nr_pages,
+			       gfp_t gfp)
+{
+	struct topa *topa;
+	int err;
+
+	topa = topa_alloc(buf->cpu, gfp);
+	if (!topa)
+		return -ENOMEM;
+
+	topa_insert_table(buf, topa);
+
+	while (buf->nr_pages < nr_pages) {
+		err = topa_insert_pages(buf, gfp);
+		if (err) {
+			pt_buffer_fini_topa(buf);
+			return -ENOMEM;
+		}
+	}
+
+	pt_buffer_setup_topa_index(buf);
+
+	/* link last table to the first one, unless we're double buffering */
+	if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(buf->last, -1)->base = buf->first->phys >> TOPA_SHIFT;
+		TOPA_ENTRY(buf->last, -1)->end = 1;
+	}
+
+	pt_topa_dump(buf);
+	return 0;
+}
+
+/**
+ * pt_buffer_setup_aux() - set up topa tables for a PT buffer
+ * @cpu:	Cpu on which to allocate, -1 means current.
+ * @pages:	Array of pointers to buffer pages passed from perf core.
+ * @nr_pages:	Number of pages in the buffer.
+ * @snapshot:	If this is a snapshot/overwrite counter.
+ *
+ * This is a pmu::setup_aux callback that sets up ToPA tables and all the
+ * bookkeeping for an AUX buffer.
+ *
+ * Return:	Our private PT buffer structure.
+ */
+static void *
+pt_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool snapshot)
+{
+	struct pt_buffer *buf;
+	int node, ret;
+
+	if (!nr_pages)
+		return NULL;
+
+	if (cpu == -1)
+		cpu = raw_smp_processor_id();
+	node = cpu_to_node(cpu);
+
+	buf = kzalloc_node(offsetof(struct pt_buffer, topa_index[nr_pages]),
+			   GFP_KERNEL, node);
+	if (!buf)
+		return NULL;
+
+	buf->cpu = cpu;
+	buf->snapshot = snapshot;
+	buf->data_pages = pages;
+
+	INIT_LIST_HEAD(&buf->tables);
+
+	ret = pt_buffer_init_topa(buf, nr_pages, GFP_KERNEL);
+	if (ret) {
+		kfree(buf);
+		return NULL;
+	}
+
+	return buf;
+}
+
+/**
+ * pt_buffer_free_aux() - perf AUX deallocation path callback
+ * @data:	PT buffer.
+ */
+static void pt_buffer_free_aux(void *data)
+{
+	struct pt_buffer *buf = data;
+
+	pt_buffer_fini_topa(buf);
+	kfree(buf);
+}
+
+/**
+ * pt_buffer_is_full() - check if the buffer is full
+ * @buf:	PT buffer.
+ * @pt:		Per-cpu pt handle.
+ *
+ * If the user hasn't read data from the output region that aux_head
+ * points to, the buffer is considered full: the user needs to read at
+ * least this region and update aux_tail to point past it.
+ */
+static bool pt_buffer_is_full(struct pt_buffer *buf, struct pt *pt)
+{
+	if (buf->snapshot)
+		return false;
+
+	if (local_read(&buf->data_size) >= pt->handle.size)
+		return true;
+
+	return false;
+}
+
+/**
+ * intel_pt_interrupt() - PT PMI handler
+ */
+void intel_pt_interrupt(void)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf;
+	struct perf_event *event = pt->handle.event;
+
+	if (!ACCESS_ONCE(pt->handle_nmi))
+		return;
+
+	pt_config_start(false);
+
+	if (!event)
+		return;
+
+	buf = perf_get_aux(&pt->handle);
+	if (!buf)
+		return;
+
+	pt_read_offset(buf);
+
+	pt_handle_status(pt);
+
+	pt_update_head(pt);
+
+	perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
+			    local_xchg(&buf->lost, 0));
+
+	if (!event->hw.state) {
+		int ret;
+
+		buf = perf_aux_output_begin(&pt->handle, event);
+		if (!buf) {
+			event->hw.state = PERF_HES_STOPPED;
+			return;
+		}
+
+		pt_buffer_reset_offsets(buf, pt->handle.head);
+		ret = pt_buffer_reset_markers(buf, &pt->handle);
+		if (ret) {
+			perf_aux_output_end(&pt->handle, 0, true);
+			return;
+		}
+
+		pt_config_buffer(buf->cur->table, buf->cur_idx,
+				 buf->output_off);
+		wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+		pt_config(event);
+	}
+}
+
+/*
+ * PMU callbacks
+ */
+
+static void pt_event_start(struct perf_event *event, int mode)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+
+	if (pt_is_running() || !buf || pt_buffer_is_full(buf, pt)) {
+		event->hw.state = PERF_HES_STOPPED;
+		return;
+	}
+
+	ACCESS_ONCE(pt->handle_nmi) = 1;
+	event->hw.state = 0;
+
+	pt_config_buffer(buf->cur->table, buf->cur_idx,
+			 buf->output_off);
+	wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+	pt_config(event);
+}
+
+static void pt_event_stop(struct perf_event *event, int mode)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+
+	ACCESS_ONCE(pt->handle_nmi) = 0;
+	pt_config_start(false);
+
+	if (event->hw.state == PERF_HES_STOPPED)
+		return;
+
+	event->hw.state = PERF_HES_STOPPED;
+
+	if (mode & PERF_EF_UPDATE) {
+		struct pt *pt = this_cpu_ptr(&pt_ctx);
+		struct pt_buffer *buf = perf_get_aux(&pt->handle);
+
+		if (!buf)
+			return;
+
+		if (WARN_ON_ONCE(pt->handle.event != event))
+			return;
+
+		pt_read_offset(buf);
+
+		pt_handle_status(pt);
+
+		pt_update_head(pt);
+	}
+}
+
+static void pt_event_del(struct perf_event *event, int mode)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf;
+
+	pt_event_stop(event, PERF_EF_UPDATE);
+
+	buf = perf_get_aux(&pt->handle);
+
+	if (buf) {
+		if (buf->snapshot)
+			pt->handle.head =
+				local_xchg(&buf->data_size,
+					   buf->nr_pages << PAGE_SHIFT);
+		perf_aux_output_end(&pt->handle, local_xchg(&buf->data_size, 0),
+				    local_xchg(&buf->lost, 0));
+	}
+}
+
+static int pt_event_add(struct perf_event *event, int mode)
+{
+	struct pt_buffer *buf;
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct hw_perf_event *hwc = &event->hw;
+	int ret = -EBUSY;
+
+	if (pt->handle.event)
+		goto out;
+
+	buf = perf_aux_output_begin(&pt->handle, event);
+	if (!buf) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	pt_buffer_reset_offsets(buf, pt->handle.head);
+	if (!buf->snapshot) {
+		ret = pt_buffer_reset_markers(buf, &pt->handle);
+		if (ret) {
+			perf_aux_output_end(&pt->handle, 0, true);
+			goto out;
+		}
+	}
+
+	if (mode & PERF_EF_START) {
+		pt_event_start(event, 0);
+		if (hwc->state == PERF_HES_STOPPED) {
+			pt_event_del(event, 0);
+			ret = -EBUSY;
+		}
+	} else {
+		hwc->state = PERF_HES_STOPPED;
+	}
+
+	ret = 0;
+out:
+
+	if (ret)
+		hwc->state = PERF_HES_STOPPED;
+
+	return ret;
+}
+
+static void pt_event_read(struct perf_event *event)
+{
+}
+
+static void pt_event_destroy(struct perf_event *event)
+{
+	x86_del_exclusive(x86_lbr_exclusive_pt);
+}
+
+static int pt_event_init(struct perf_event *event)
+{
+	if (event->attr.type != pt_pmu.pmu.type)
+		return -ENOENT;
+
+	if (!pt_event_valid(event))
+		return -EINVAL;
+
+	if (x86_add_exclusive(x86_lbr_exclusive_pt))
+		return -EBUSY;
+
+	event->destroy = pt_event_destroy;
+
+	return 0;
+}
+
+static __init int pt_init(void)
+{
+	int ret, cpu, prior_warn = 0;
+
+	BUILD_BUG_ON(sizeof(struct topa) > PAGE_SIZE);
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		u64 ctl;
+
+		ret = rdmsrl_safe_on_cpu(cpu, MSR_IA32_RTIT_CTL, &ctl);
+		if (!ret && (ctl & RTIT_CTL_TRACEEN))
+			prior_warn++;
+	}
+	put_online_cpus();
+
+	ret = pt_pmu_hw_init();
+	if (ret)
+		return ret;
+
+	if (!pt_cap_get(PT_CAP_topa_output)) {
+		pr_warn("ToPA output is not supported on this CPU\n");
+		return -ENODEV;
+	}
+
+	if (prior_warn)
+		pr_warn("PT is enabled at boot time, traces may be empty\n");
+
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		pt_pmu.pmu.capabilities =
+			PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF;
+
+	pt_pmu.pmu.capabilities	|= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE;
+	pt_pmu.pmu.attr_groups	= pt_attr_groups;
+	pt_pmu.pmu.task_ctx_nr	= perf_hw_context;
+	pt_pmu.pmu.event_init	= pt_event_init;
+	pt_pmu.pmu.add		= pt_event_add;
+	pt_pmu.pmu.del		= pt_event_del;
+	pt_pmu.pmu.start	= pt_event_start;
+	pt_pmu.pmu.stop		= pt_event_stop;
+	pt_pmu.pmu.read		= pt_event_read;
+	pt_pmu.pmu.setup_aux	= pt_buffer_setup_aux;
+	pt_pmu.pmu.free_aux	= pt_buffer_free_aux;
+	ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
+
+	return ret;
+}
+
+module_init(pt_init);
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v8 13/14] x86: perf: intel_bts: Add BTS PMU driver
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (11 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2014-11-14 13:43 ` [PATCH v8 14/14] perf: add ITRACE_START record to indicate that tracing has started Alexander Shishkin
  2014-12-17 14:06 ` [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
  14 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

Add support for Branch Trace Store (BTS) via the kernel perf event
infrastructure. The difference from the existing implementation of BTS
support is that this one is a separate PMU that exports events' trace
buffers to userspace by means of the AUX area of the perf buffer, which
is zero-copy mapped into userspace.

The immediate benefit is that the buffer size can be much bigger, resulting
in fewer interrupts; no kernel-side copying is involved, and there is little
to no trace data loss. Also, kernel code can be traced with this driver.

The old way of collecting BTS traces still works.
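
As with the PT driver, the consumer drains the AUX buffer by reading
aux_head and advancing aux_tail in the user page. A minimal, hypothetical
sketch of that loop (field names assumed from this series; consume() is a
made-up callback and __sync_synchronize() stands in for the appropriate
barriers):

	#include <linux/perf_event.h>
	#include <stdint.h>
	#include <stddef.h>

	/* illustration only; consume() is a made-up callback */
	static void drain_aux(struct perf_event_mmap_page *up, char *aux_buf,
			      size_t aux_size, void (*consume)(void *, size_t))
	{
		uint64_t head = up->aux_head;	/* written by the kernel */
		uint64_t tail = up->aux_tail;	/* owned by the consumer */

		__sync_synchronize();		/* aux_head load before data loads */

		while (tail < head) {
			size_t off = tail % aux_size;
			size_t len = head - tail;

			if (off + len > aux_size)	/* wrap around buffer end */
				len = aux_size - off;

			consume(aux_buf + off, len);
			tail += len;
		}

		__sync_synchronize();		/* data loads before aux_tail store */
		up->aux_tail = tail;
	}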

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/Makefile               |   2 +-
 arch/x86/kernel/cpu/perf_event.h           |   7 +
 arch/x86/kernel/cpu/perf_event_intel.c     |   6 +-
 arch/x86/kernel/cpu/perf_event_intel_bts.c | 525 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   3 +-
 5 files changed, 540 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4bc29fa23e..a595246a47 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -40,7 +40,7 @@ endif
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o
-obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o perf_event_intel_bts.o
 
 obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE)	+= perf_event_intel_uncore.o \
 					   perf_event_intel_uncore_snb.o \
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 9ab6c59851..a30c812f22 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -404,6 +404,7 @@ union x86_pmu_config {
 
 enum {
 	x86_lbr_exclusive_lbr,
+	x86_lbr_exclusive_bts,
 	x86_lbr_exclusive_pt,
 	x86_lbr_exclusive_max,
 };
@@ -785,6 +786,12 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
 void intel_pt_interrupt(void);
 
+int intel_bts_interrupt(void);
+
+void intel_bts_enable_local(void);
+
+void intel_bts_disable_local(void);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 5a29ea59d6..a2fc1bf04c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1037,6 +1037,8 @@ static void intel_pmu_disable_all(void)
 
 	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
 		intel_pmu_disable_bts();
+	else
+		intel_bts_disable_local();
 
 	intel_pmu_pebs_disable_all();
 	intel_pmu_lbr_disable_all();
@@ -1059,7 +1061,8 @@ static void intel_pmu_enable_all(int added)
 			return;
 
 		intel_pmu_enable_bts(event->hw.config);
-	}
+	} else
+		intel_bts_enable_local();
 }
 
 /*
@@ -1345,6 +1348,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 		apic_write(APIC_LVTPC, APIC_DM_NMI);
 	intel_pmu_disable_all();
 	handled = intel_pmu_drain_bts_buffer();
+	handled += intel_bts_interrupt();
 	status = intel_pmu_get_status();
 	if (!status)
 		goto done;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_bts.c b/arch/x86/kernel/cpu/perf_event_intel_bts.c
new file mode 100644
index 0000000000..af26a88052
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_bts.c
@@ -0,0 +1,525 @@
+/*
+ * BTS PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#undef DEBUG
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/debugfs.h>
+#include <linux/device.h>
+#include <linux/coredump.h>
+
+#include <asm-generic/sizes.h>
+#include <asm/perf_event.h>
+
+#include "perf_event.h"
+
+struct bts_ctx {
+	struct perf_output_handle	handle;
+	struct debug_store		ds_back;
+	int				started;
+};
+
+static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
+
+#define BTS_RECORD_SIZE		24
+#define BTS_SAFETY_MARGIN	4080
+
+struct bts_phys {
+	struct page	*page;
+	unsigned long	size;
+	unsigned long	offset;
+	unsigned long	displacement;
+};
+
+struct bts_buffer {
+	size_t		real_size;	/* multiple of BTS_RECORD_SIZE */
+	unsigned int	nr_pages;
+	unsigned int	nr_bufs;
+	unsigned int	cur_buf;
+	bool		snapshot;
+	local_t		data_size;
+	local_t		lost;
+	local_t		head;
+	unsigned long	end;
+	void		**data_pages;
+	struct bts_phys	buf[0];
+};
+
+struct pmu bts_pmu;
+
+void intel_pmu_enable_bts(u64 config);
+void intel_pmu_disable_bts(void);
+
+static size_t buf_size(struct page *page)
+{
+	return 1 << (PAGE_SHIFT + page_private(page));
+}
+
+static void *
+bts_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool overwrite)
+{
+	struct bts_buffer *buf;
+	struct page *page;
+	int node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+	unsigned long offset;
+	size_t size = nr_pages << PAGE_SHIFT;
+	int pg, nbuf, pad;
+
+	/* count all the high order buffers */
+	for (pg = 0, nbuf = 0; pg < nr_pages;) {
+		page = virt_to_page(pages[pg]);
+		if (WARN_ON_ONCE(!PagePrivate(page) && nr_pages > 1))
+			return NULL;
+		pg += 1 << page_private(page);
+		nbuf++;
+	}
+
+	/*
+	 * to avoid interrupts in overwrite mode, only allow one physical
+	 */
+	if (overwrite && nbuf > 1)
+		return NULL;
+
+	buf = kzalloc_node(offsetof(struct bts_buffer, buf[nbuf]), GFP_KERNEL, node);
+	if (!buf)
+		return NULL;
+
+	buf->nr_pages = nr_pages;
+	buf->nr_bufs = nbuf;
+	buf->snapshot = overwrite;
+	buf->data_pages = pages;
+	buf->real_size = size - size % BTS_RECORD_SIZE;
+
+	for (pg = 0, nbuf = 0, offset = 0, pad = 0; nbuf < buf->nr_bufs; nbuf++) {
+		unsigned int __nr_pages;
+
+		page = virt_to_page(pages[pg]);
+		__nr_pages = PagePrivate(page) ? 1 << page_private(page) : 1;
+		buf->buf[nbuf].page = page;
+		buf->buf[nbuf].offset = offset;
+		buf->buf[nbuf].displacement = (pad ? BTS_RECORD_SIZE - pad : 0);
+		buf->buf[nbuf].size = buf_size(page) - buf->buf[nbuf].displacement;
+		pad = buf->buf[nbuf].size % BTS_RECORD_SIZE;
+		buf->buf[nbuf].size -= pad;
+
+		pg += __nr_pages;
+		offset += __nr_pages << PAGE_SHIFT;
+	}
+
+	return buf;
+}
+
+static void bts_buffer_free_aux(void *data)
+{
+	kfree(data);
+}
+
+static unsigned long bts_buffer_offset(struct bts_buffer *buf, unsigned int idx)
+{
+	return buf->buf[idx].offset + buf->buf[idx].displacement;
+}
+
+static void
+bts_config_buffer(struct bts_buffer *buf)
+{
+	int cpu = raw_smp_processor_id();
+	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	struct bts_phys *phys = &buf->buf[buf->cur_buf];
+	unsigned long index, thresh = 0, end = phys->size;
+	struct page *page = phys->page;
+
+	index = local_read(&buf->head);
+
+	if (!buf->snapshot) {
+		if (buf->end < phys->offset + buf_size(page))
+			end = buf->end - phys->offset - phys->displacement;
+
+		index -= phys->offset + phys->displacement;
+
+		if (end - index > BTS_SAFETY_MARGIN)
+			thresh = end - BTS_SAFETY_MARGIN;
+		else if (end - index > BTS_RECORD_SIZE)
+			thresh = end - BTS_RECORD_SIZE;
+		else
+			thresh = end;
+	}
+
+	ds->bts_buffer_base = (u64)page_address(page) + phys->displacement;
+	ds->bts_index = ds->bts_buffer_base + index;
+	ds->bts_absolute_maximum = ds->bts_buffer_base + end;
+	ds->bts_interrupt_threshold = !buf->snapshot
+		? ds->bts_buffer_base + thresh
+		: ds->bts_absolute_maximum + BTS_RECORD_SIZE;
+}
+
+static void bts_buffer_pad_out(struct bts_phys *phys, unsigned long head)
+{
+	unsigned long index = head - phys->offset;
+
+	memset(page_address(phys->page) + index, 0, phys->size - index);
+}
+
+static bool bts_buffer_is_full(struct bts_buffer *buf, struct bts_ctx *bts)
+{
+	if (buf->snapshot)
+		return false;
+
+	if (local_read(&buf->data_size) >= bts->handle.size ||
+	    bts->handle.size - local_read(&buf->data_size) < BTS_RECORD_SIZE)
+		return true;
+
+	return false;
+}
+
+static void bts_update(struct bts_ctx *bts)
+{
+	int cpu = raw_smp_processor_id();
+	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	struct bts_buffer *buf = perf_get_aux(&bts->handle);
+	unsigned long index = ds->bts_index - ds->bts_buffer_base, old, head;
+
+	if (!buf)
+		return;
+
+	head = index + bts_buffer_offset(buf, buf->cur_buf);
+	old = local_xchg(&buf->head, head);
+
+	if (!buf->snapshot) {
+		if (old == head)
+			return;
+
+		if (ds->bts_index >= ds->bts_absolute_maximum)
+			local_inc(&buf->lost);
+
+		/*
+		 * old and head are always in the same physical buffer, so we
+		 * can subtract them to get the data size.
+		 */
+		local_add(head - old, &buf->data_size);
+	} else {
+		local_set(&buf->data_size, head);
+	}
+}
+
+static void __bts_event_start(struct perf_event *event)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct bts_buffer *buf = perf_get_aux(&bts->handle);
+	u64 config = 0;
+
+	if (!buf || bts_buffer_is_full(buf, bts))
+		return;
+
+	event->hw.state = 0;
+
+	if (!buf->snapshot)
+		config |= ARCH_PERFMON_EVENTSEL_INT;
+	if (!event->attr.exclude_kernel)
+		config |= ARCH_PERFMON_EVENTSEL_OS;
+	if (!event->attr.exclude_user)
+		config |= ARCH_PERFMON_EVENTSEL_USR;
+
+	bts_config_buffer(buf);
+
+	/*
+	 * local barrier to make sure that ds configuration made it
+	 * before we enable BTS
+	 */
+	wmb();
+
+	intel_pmu_enable_bts(config);
+}
+
+static void bts_event_start(struct perf_event *event, int flags)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	__bts_event_start(event);
+
+	/* PMI handler: this counter is running and likely generating PMIs */
+	ACCESS_ONCE(bts->started) = 1;
+}
+
+static void __bts_event_stop(struct perf_event *event)
+{
+	/*
+	 * No extra synchronization is mandated by the documentation to have
+	 * BTS data stores globally visible.
+	 */
+	intel_pmu_disable_bts();
+
+	if (event->hw.state & PERF_HES_STOPPED)
+		return;
+
+	ACCESS_ONCE(event->hw.state) |= PERF_HES_STOPPED;
+}
+
+static void bts_event_stop(struct perf_event *event, int flags)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	/* PMI handler: don't restart this counter */
+	ACCESS_ONCE(bts->started) = 0;
+
+	__bts_event_stop(event);
+
+	if (flags & PERF_EF_UPDATE)
+		bts_update(bts);
+}
+
+void intel_bts_enable_local(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	if (bts->handle.event && bts->started)
+		__bts_event_start(bts->handle.event);
+}
+
+void intel_bts_disable_local(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	if (bts->handle.event)
+		__bts_event_stop(bts->handle.event);
+}
+
+static int
+bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
+{
+	unsigned long head, space, next_space, pad, gap, skip, wakeup;
+	unsigned int next_buf;
+	struct bts_phys *phys, *next_phys;
+	int ret;
+
+	if (buf->snapshot)
+		return 0;
+
+	head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
+	if (WARN_ON_ONCE(head != local_read(&buf->head)))
+		return -EINVAL;
+
+	phys = &buf->buf[buf->cur_buf];
+	space = phys->offset + phys->displacement + phys->size - head;
+	pad = space;
+	if (space > handle->size) {
+		space = handle->size;
+		space -= space % BTS_RECORD_SIZE;
+	}
+	if (space <= BTS_SAFETY_MARGIN) {
+		/* See if next phys buffer has more space */
+		next_buf = buf->cur_buf + 1;
+		if (next_buf >= buf->nr_bufs)
+			next_buf = 0;
+		next_phys = &buf->buf[next_buf];
+		gap = buf_size(phys->page) - phys->displacement - phys->size +
+		      next_phys->displacement;
+		skip = pad + gap;
+		if (handle->size >= skip) {
+			next_space = next_phys->size;
+			if (next_space + skip > handle->size) {
+				next_space = handle->size - skip;
+				next_space -= next_space % BTS_RECORD_SIZE;
+			}
+			if (next_space > space || !space) {
+				if (pad)
+					bts_buffer_pad_out(phys, head);
+				ret = perf_aux_output_skip(handle, skip);
+				if (ret)
+					return ret;
+				/* Advance to next phys buffer */
+				phys = next_phys;
+				space = next_space;
+				head = phys->offset + phys->displacement;
+				/*
+				 * After this, cur_buf and head won't match ds
+				 * anymore, so we must not be racing with
+				 * bts_update().
+				 */
+				buf->cur_buf = next_buf;
+				local_set(&buf->head, head);
+			}
+		}
+	}
+
+	/* Don't go far beyond wakeup watermark */
+	wakeup = BTS_SAFETY_MARGIN + BTS_RECORD_SIZE + handle->wakeup -
+		 handle->head;
+	if (space > wakeup) {
+		space = wakeup;
+		space -= space % BTS_RECORD_SIZE;
+	}
+
+	buf->end = head + space;
+
+	/*
+	 * If we have no space, the lost notification would have been sent when
+	 * we hit absolute_maximum - see bts_update()
+	 */
+	if (!space)
+		return -ENOSPC;
+
+	return 0;
+}
+
+int intel_bts_interrupt(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct perf_event *event = bts->handle.event;
+	struct bts_buffer *buf;
+	s64 old_head;
+	int err;
+
+	if (!event || !bts->started)
+		return 0;
+
+	buf = perf_get_aux(&bts->handle);
+	/*
+	 * Skip snapshot counters: they don't use the interrupt, but
+	 * there's no other way of telling, because the pointer will
+	 * keep moving
+	 */
+	if (!buf || buf->snapshot)
+		return 0;
+
+	old_head = local_read(&buf->head);
+	bts_update(bts);
+
+	/* no new data */
+	if (old_head == local_read(&buf->head))
+		return 0;
+
+	perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
+			    !!local_xchg(&buf->lost, 0));
+
+	buf = perf_aux_output_begin(&bts->handle, event);
+	if (!buf)
+		return 1;
+
+	err = bts_buffer_reset(buf, &bts->handle);
+	if (err)
+		perf_aux_output_end(&bts->handle, 0, false);
+
+	return 1;
+}
+
+static void bts_event_del(struct perf_event *event, int mode)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct bts_buffer *buf = perf_get_aux(&bts->handle);
+
+	bts_event_stop(event, PERF_EF_UPDATE);
+
+	if (buf) {
+		if (buf->snapshot)
+			bts->handle.head =
+				local_xchg(&buf->data_size,
+					   buf->nr_pages << PAGE_SHIFT);
+		perf_aux_output_end(&bts->handle, local_xchg(&buf->data_size, 0),
+				    !!local_xchg(&buf->lost, 0));
+	}
+
+	cpuc->ds->bts_index = bts->ds_back.bts_buffer_base;
+	cpuc->ds->bts_buffer_base = bts->ds_back.bts_buffer_base;
+	cpuc->ds->bts_absolute_maximum = bts->ds_back.bts_absolute_maximum;
+	cpuc->ds->bts_interrupt_threshold = bts->ds_back.bts_interrupt_threshold;
+}
+
+static int bts_event_add(struct perf_event *event, int mode)
+{
+	struct bts_buffer *buf;
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct hw_perf_event *hwc = &event->hw;
+	int ret = -EBUSY;
+
+	event->hw.state = PERF_HES_STOPPED;
+
+	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
+		return -EBUSY;
+
+	if (bts->handle.event)
+		return -EBUSY;
+
+	buf = perf_aux_output_begin(&bts->handle, event);
+	if (!buf)
+		return -EINVAL;
+
+	ret = bts_buffer_reset(buf, &bts->handle);
+	if (ret) {
+		perf_aux_output_end(&bts->handle, 0, false);
+		return ret;
+	}
+
+	bts->ds_back.bts_buffer_base = cpuc->ds->bts_buffer_base;
+	bts->ds_back.bts_absolute_maximum = cpuc->ds->bts_absolute_maximum;
+	bts->ds_back.bts_interrupt_threshold = cpuc->ds->bts_interrupt_threshold;
+
+	if (mode & PERF_EF_START) {
+		bts_event_start(event, 0);
+		if (hwc->state & PERF_HES_STOPPED) {
+			bts_event_del(event, 0);
+			return -EBUSY;
+		}
+	}
+
+	return 0;
+}
+
+static void bts_event_destroy(struct perf_event *event)
+{
+	x86_del_exclusive(x86_lbr_exclusive_bts);
+}
+
+static int bts_event_init(struct perf_event *event)
+{
+	if (event->attr.type != bts_pmu.type)
+		return -ENOENT;
+
+	if (x86_add_exclusive(x86_lbr_exclusive_bts))
+		return -EBUSY;
+
+	event->destroy = bts_event_destroy;
+
+	return 0;
+}
+
+static void bts_event_read(struct perf_event *event)
+{
+}
+
+static __init int bts_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
+		return -ENODEV;
+
+	bts_pmu.capabilities	= PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE;
+	bts_pmu.task_ctx_nr	= perf_hw_context;
+	bts_pmu.event_init	= bts_event_init;
+	bts_pmu.add		= bts_event_add;
+	bts_pmu.del		= bts_event_del;
+	bts_pmu.start		= bts_event_start;
+	bts_pmu.stop		= bts_event_stop;
+	bts_pmu.read		= bts_event_read;
+	bts_pmu.setup_aux	= bts_buffer_setup_aux;
+	bts_pmu.free_aux	= bts_buffer_free_aux;
+
+	return perf_pmu_register(&bts_pmu, "intel_bts", -1);
+}
+
+module_init(bts_init);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 46211bcc81..0b5d97e0a2 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -461,7 +461,8 @@ void intel_pmu_enable_bts(u64 config)
 
 	debugctlmsr |= DEBUGCTLMSR_TR;
 	debugctlmsr |= DEBUGCTLMSR_BTS;
-	debugctlmsr |= DEBUGCTLMSR_BTINT;
+	if (config & ARCH_PERFMON_EVENTSEL_INT)
+		debugctlmsr |= DEBUGCTLMSR_BTINT;
 
 	if (!(config & ARCH_PERFMON_EVENTSEL_OS))
 		debugctlmsr |= DEBUGCTLMSR_BTS_OFF_OS;
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v8 14/14] perf: add ITRACE_START record to indicate that tracing has started
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (12 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 13/14] x86: perf: intel_bts: Add BTS " Alexander Shishkin
@ 2014-11-14 13:43 ` Alexander Shishkin
  2015-01-09 14:12   ` Peter Zijlstra
  2014-12-17 14:06 ` [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
  14 siblings, 1 reply; 37+ messages in thread
From: Alexander Shishkin @ 2014-11-14 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme, Alexander Shishkin

For counters that generate AUX data that is bound to the context of a
running task, such as instruction tracing, the decoder needs to know
exactly which task is running when the event is first scheduled in,
before the first sched_switch. The decoder needs this because
instruction flow trace decoding will almost always require the
program's object code in order to reconstruct said flow, and for
that we need at least its pid/tid in the perf stream.

To single out such instruction tracing pmus, this patch introduces
the ITRACE PMU capability. The reason this is not part of the
RECORD_AUX record is that not all pmus capable of generating AUX data
need this, and the opposite is *probably* also true.

While sched_switch covers most cases, there are two problems with it:
the consumer will need to process events out of order (that is, having
found RECORD_AUX, it will have to skip forward to the nearest sched_switch
to figure out which task it was, then go back to the actual trace to
decode it), and it completely misses the case when tracing is enabled
and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |  4 ++++
 include/uapi/linux/perf_event.h | 11 +++++++++++
 kernel/events/core.c            | 41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 56 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f126eb89e6..282721b2df 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -127,6 +127,9 @@ struct hw_perf_event {
 			/* for tp_event->class */
 			struct list_head	tp_list;
 		};
+		struct { /* itrace */
+			int			itrace_started;
+		};
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 		struct { /* breakpoint */
 			/*
@@ -174,6 +177,7 @@ struct perf_event;
 #define PERF_PMU_CAP_AUX_NO_SG			0x02
 #define PERF_PMU_CAP_AUX_SW_DOUBLEBUF		0x04
 #define PERF_PMU_CAP_EXCLUSIVE			0x08
+#define PERF_PMU_CAP_ITRACE			0x10
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ffd52c0861..78017276fe 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -750,6 +750,17 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX				= 11,
 
+	/*
+	 * Indicates that instruction trace has started
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u32				pid;
+	 *	u32				tid;
+	 * };
+	 */
+	PERF_RECORD_ITRACE_START		= 12,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index da2e44aedd..988ef2380a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1733,6 +1733,7 @@ static void perf_set_shadow_time(struct perf_event *event,
 #define MAX_INTERRUPTS (~0ULL)
 
 static void perf_log_throttle(struct perf_event *event, int enable);
+static void perf_log_itrace_start(struct perf_event *event);
 
 static int
 event_sched_in(struct perf_event *event,
@@ -1767,6 +1768,8 @@ event_sched_in(struct perf_event *event,
 
 	perf_pmu_disable(event->pmu);
 
+	perf_log_itrace_start(event);
+
 	if (event->pmu->add(event, PERF_EF_START)) {
 		event->state = PERF_EVENT_STATE_INACTIVE;
 		event->oncpu = -1;
@@ -5681,6 +5684,44 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 	perf_output_end(&handle);
 }
 
+static void perf_log_itrace_start(struct perf_event *event)
+{
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	struct perf_aux_event {
+		struct perf_event_header        header;
+		u32				pid;
+		u32				tid;
+	} rec;
+	int ret;
+
+	if (event->parent)
+		event = event->parent;
+
+	if (!(event->pmu->capabilities & PERF_PMU_CAP_ITRACE) ||
+	    event->hw.itrace_started)
+		return;
+
+	event->hw.itrace_started = 1;
+
+	rec.header.type	= PERF_RECORD_ITRACE_START;
+	rec.header.misc	= 0;
+	rec.header.size	= sizeof(rec);
+	rec.pid	= perf_event_pid(event, current);
+	rec.tid	= perf_event_tid(event, current);
+
+	perf_event_header__init_id(&rec.header, &sample, event);
+	ret = perf_output_begin(&handle, event, rec.header.size);
+
+	if (ret)
+		return;
+
+	perf_output_put(&handle, rec);
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+}
+
 /*
  * Generic event overflow handling, sampling.
  */
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* RE: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams
  2014-11-14 13:43 ` [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams Alexander Shishkin
@ 2014-11-17  9:33   ` Metzger, Markus T
  2015-01-09 15:18     ` Peter Zijlstra
  2014-11-17 21:24   ` Sukadev Bhattiprolu
  1 sibling, 1 reply; 37+ messages in thread
From: Metzger, Markus T @ 2014-11-17  9:33 UTC (permalink / raw)
  To: Alexander Shishkin, Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	Liang, Kan, Hunter, Adrian, mathieu.poirier, acme,
	Peter Zijlstra

> -----Original Message-----
> From: Alexander Shishkin [mailto:alexander.shishkin@linux.intel.com]
> Sent: Friday, November 14, 2014 2:44 PM
> To: Peter Zijlstra
> Cc: Ingo Molnar; linux-kernel@vger.kernel.org; Robert Richter; Frederic
> Weisbecker; Mike Galbraith; Paul Mackerras; Stephane Eranian; Andi Kleen;
> Liang, Kan; Hunter, Adrian; Metzger, Markus T; mathieu.poirier@linaro.org;
> acme@infradead.org; Peter Zijlstra; Alexander Shishkin
> Subject: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data
> streams
> 
> From: Peter Zijlstra <peterz@infradead.org>
> 
> This patch introduces "AUX space" in the perf mmap buffer, intended for
> exporting high bandwidth data streams to userspace, such as instruction
> flow traces.
> 
> AUX space is a ring buffer, defined by aux_{offset,size} fields in the
> user_page structure, and read/write pointers aux_{head,tail}, which abide
> by the same rules as data_* counterparts of the main perf buffer.
> 
> In order to allocate/mmap AUX, userspace needs to set up aux_offset to
> such an offset that will be greater than data_offset+data_size and
> aux_size to be the desired buffer size. Both need to be page aligned.
> Then, same aux_offset and aux_size should be passed to mmap() call and
> if everything adds up, you should have an AUX buffer as a result.
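
To make the sequence above concrete, here is a minimal userspace sketch of
that protocol; the fd is assumed to come from perf_event_open(), the page
counts are arbitrary illustrative values, and error handling is mostly
omitted:

#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/perf_event.h>

#define DATA_PAGES	8	/* arbitrary; 1 + 2^n pages for the data map */
#define AUX_PAGES	128	/* arbitrary; page aligned by construction */

static void *map_aux_area(int fd, void **aux)
{
	size_t page_size = sysconf(_SC_PAGESIZE);
	size_t data_len = (1 + DATA_PAGES) * page_size;
	struct perf_event_mmap_page *up;

	/* map the user page plus the normal data buffer first, as before */
	up = mmap(NULL, data_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (up == MAP_FAILED)
		return NULL;

	/* advertise where the AUX area lives and how big it is */
	up->aux_offset = data_len;		/* page aligned, past data_offset + data_size */
	up->aux_size   = AUX_PAGES * page_size;	/* page aligned */

	/* the same offset and size are then passed to mmap() for the AUX area */
	*aux = mmap(NULL, up->aux_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    fd, up->aux_offset);
	return *aux == MAP_FAILED ? NULL : up;
}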

Why would userspace need to fill in aux_offset and aux_size?  Can't the kernel
do this when userspace mmaps the aux buffer like it does for the respective
data_* fields?

How would userspace fill in those fields if it mapped the data buffer read-only?

Regards,
Markus.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams
  2014-11-14 13:43 ` [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams Alexander Shishkin
  2014-11-17  9:33   ` Metzger, Markus T
@ 2014-11-17 21:24   ` Sukadev Bhattiprolu
  2014-11-17 21:45     ` Andi Kleen
  1 sibling, 1 reply; 37+ messages in thread
From: Sukadev Bhattiprolu @ 2014-11-17 21:24 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, kan.liang, adrian.hunter,
	markus.t.metzger, mathieu.poirier, acme, Peter Zijlstra,
	Michael Ellerman

Alexander Shishkin [alexander.shishkin@linux.intel.com] wrote:
| From: Peter Zijlstra <peterz@infradead.org>
| 
| This patch introduces "AUX space" in the perf mmap buffer, intended for
| exporting high bandwidth data streams to userspace, such as instruction
| flow traces.
| 
| AUX space is a ring buffer, defined by aux_{offset,size} fields in the
| user_page structure, and read/write pointers aux_{head,tail}, which abide
| by the same rules as data_* counterparts of the main perf buffer.

The "format" of this raw stream of data can differ across processors
(and architectures), correct? I.e. the perf tool's processing of this
aux data will also differ across processors?

Power8 processors support what we call 24x7 counters, which collect info
on a large number of events. The current 24x7 support
(arch/powerpc/perf/hv-24x7.c) reads one counter at a
time using the pmu->read() interface.

We are looking for ways to export a much larger number of counters through
one pmu->read() or another call. I am trying to see if the hv-24x7 pmu
could use this aux interface and then implement a helper in the perf
tool to extract the counter values.

Sukadev


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams
  2014-11-17 21:24   ` Sukadev Bhattiprolu
@ 2014-11-17 21:45     ` Andi Kleen
  0 siblings, 0 replies; 37+ messages in thread
From: Andi Kleen @ 2014-11-17 21:45 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Alexander Shishkin, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Robert Richter, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, kan.liang, adrian.hunter,
	markus.t.metzger, mathieu.poirier, acme, Peter Zijlstra,
	Michael Ellerman

On Mon, Nov 17, 2014 at 01:24:00PM -0800, Sukadev Bhattiprolu wrote:
> Alexander Shishkin [alexander.shishkin@linux.intel.com] wrote:
> | From: Peter Zijlstra <peterz@infradead.org>
> | 
> | This patch introduces "AUX space" in the perf mmap buffer, intended for
> | exporting high bandwidth data streams to userspace, such as instruction
> | flow traces.
> | 
> | AUX space is a ring buffer, defined by aux_{offset,size} fields in the
> | user_page structure, and read/write pointers aux_{head,tail}, which abide
> | by the same rules as data_* counterparts of the main perf buffer.
> 
> The "format" of this raw stream of data can differ across processors
> (and architecutres) correct ? i.e the perf tool processing of this
> aux data will also differ across processors ?

Yes.

The main caveat is that you need special handling code in perf.

> 
> Power8 processors support what we call 24x7 counters that collect info
> on a large number of events. The current 24x7 support in
> (arch/powerpc/perf/hv-24x7.c) currently uses reading one counter at a
> time using the pmu->read() interface.
> 
> We are looking for ways to export much larger number of counters through
> one pmu->read() or another call. I am trying to see if the hv-24x7 pmu
> could use this aux interface and then have/implement a helper in perf
> tool to extract the counter values.

It's probably only worth it if your amount of data is extremely
high and has to be compressed (like in PT -- no choice, cannot do it
otherwise).

If your data amounts are more reasonable it will be a lot
simpler to stay with the current perf interfaces, even
though they have a bit higher overhead.
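
(For contrast, the "current interface" path is just a read() on the event
fd -- a trivial sketch, assuming event_fd comes from perf_event_open()
without PERF_FORMAT_GROUP, and use_counter_value() is a hypothetical
consumer:)

	uint64_t count;

	/* one value per read(); no mmap() and no AUX machinery involved */
	if (read(event_fd, &count, sizeof(count)) == sizeof(count))
		use_counter_value(count);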

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT
  2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
                   ` (13 preceding siblings ...)
  2014-11-14 13:43 ` [PATCH v8 14/14] perf: add ITRACE_START record to indicate that tracing has started Alexander Shishkin
@ 2014-12-17 14:06 ` Alexander Shishkin
  2015-01-07  9:32   ` Alexander Shishkin
  14 siblings, 1 reply; 37+ messages in thread
From: Alexander Shishkin @ 2014-12-17 14:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

Alexander Shishkin <alexander.shishkin@linux.intel.com> writes:

> Hi Peter and all,

Hi again,

> Since there weren't too many review comments on the two previous
> versions, here's another one with a couple more fixes.

This patchset is one month old now; I thought somebody might want to
take a look before I send a rebased new version.

Cheers,
--
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT
  2014-12-17 14:06 ` [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
@ 2015-01-07  9:32   ` Alexander Shishkin
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2015-01-07  9:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

Alexander Shishkin <alexander.shishkin@linux.intel.com> writes:

> Alexander Shishkin <alexander.shishkin@linux.intel.com> writes:
>
>> Hi Peter and all,
>
> Hi again,

Ping^2

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver
  2014-11-14 13:43 ` [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
@ 2015-01-09 12:48   ` Peter Zijlstra
  2015-01-12 12:19     ` Alexander Shishkin
  2015-01-13 15:09     ` Alexander Shishkin
  2015-01-09 13:10   ` Peter Zijlstra
  2015-01-09 14:09   ` Peter Zijlstra
  2 siblings, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2015-01-09 12:48 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

On Fri, Nov 14, 2014 at 03:43:45PM +0200, Alexander Shishkin wrote:
> +static void pt_event_stop(struct perf_event *event, int mode)
> +{
> +	struct pt *pt = this_cpu_ptr(&pt_ctx);
> +
> +	ACCESS_ONCE(pt->handle_nmi) = 0;

Why is this needed? Will the hardware still generate interrupts if you
stop the PT thing?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver
  2014-11-14 13:43 ` [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
  2015-01-09 12:48   ` Peter Zijlstra
@ 2015-01-09 13:10   ` Peter Zijlstra
  2015-01-12 12:45     ` Alexander Shishkin
  2015-01-09 14:09   ` Peter Zijlstra
  2 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2015-01-09 13:10 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

On Fri, Nov 14, 2014 at 03:43:45PM +0200, Alexander Shishkin wrote:
> +static void pt_handle_status(struct pt *pt)
> +{
> +	struct pt_buffer *buf = perf_get_aux(&pt->handle);
> +	int advance = 0;
> +	u64 status;
> +
> +	rdmsrl(MSR_IA32_RTIT_STATUS, status);
> +
> +	if (status & RTIT_STATUS_ERROR) {
> +		pr_err_ratelimited("ToPA ERROR encountered, trying to recover\n");
> +		pt_topa_dump(buf);
> +		status &= ~RTIT_STATUS_ERROR;
> +	}
> +
> +	if (status & RTIT_STATUS_STOPPED) {
> +		status &= ~RTIT_STATUS_STOPPED;
> +
> +		/*
> +		 * On systems that only do single-entry ToPA, hitting STOP
> +		 * means we are already losing data; need to let the decoder
> +		 * know.
> +		 */
> +		if (!pt_cap_get(PT_CAP_topa_multiple_entries) ||
> +		    buf->output_off == sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size)) {
> +			local_inc(&buf->lost);
> +			advance++;
> +		}
> +	}
> +
> +	/*
> +	 * Also on single-entry ToPA implementations, interrupt will come
> +	 * before the output reaches its output region's boundary.
> +	 */

My understanding was that, while yes it will attempt to generate the
interrupt early, there is absolutely no guarantee it will in fact arrive
in time or, even if it does, that there is buffer left by the
time you're ready to read it.

> +	if (!pt_cap_get(PT_CAP_topa_multiple_entries) && !buf->snapshot &&
> +	    pt_buffer_region_size(buf) - buf->output_off <= TOPA_PMI_MARGIN) {

Now this check appears to indeed cater for this scenario where the
buffer is overran, right?

> +		void *head = pt_buffer_region(buf);
> +
> +		/* everything within this margin needs to be zeroed out */
> +		memset(head + buf->output_off, 0,
> +		       pt_buffer_region_size(buf) -
> +		       buf->output_off);
> +		advance++;
> +	}
> +
> +	if (advance)
> +		pt_buffer_advance(buf);
> +
> +	wrmsrl(MSR_IA32_RTIT_STATUS, status);
> +}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver
  2014-11-14 13:43 ` [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
  2015-01-09 12:48   ` Peter Zijlstra
  2015-01-09 13:10   ` Peter Zijlstra
@ 2015-01-09 14:09   ` Peter Zijlstra
  2015-01-12 12:53     ` Alexander Shishkin
  2015-01-12 16:37     ` Alexander Shishkin
  2 siblings, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2015-01-09 14:09 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

On Fri, Nov 14, 2014 at 03:43:45PM +0200, Alexander Shishkin wrote:
> +static __init int pt_init(void)
> +{
> +	int ret, cpu, prior_warn = 0;
> +
> +	BUILD_BUG_ON(sizeof(struct topa) > PAGE_SIZE);
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		u64 ctl;
> +
> +		ret = rdmsrl_safe_on_cpu(cpu, MSR_IA32_RTIT_CTL, &ctl);
> +		if (!ret && (ctl & RTIT_CTL_TRACEEN))
> +			prior_warn++;
> +	}
> +	put_online_cpus();
> +
> +	ret = pt_pmu_hw_init();
> +	if (ret)
> +		return ret;
> +
> +	if (!pt_cap_get(PT_CAP_topa_output)) {
> +		pr_warn("ToPA output is not supported on this CPU\n");
> +		return -ENODEV;
> +	}
> +
> +	if (prior_warn)
> +		pr_warn("PT is enabled at boot time, traces may be empty\n");

Should we not also add_exclusive(pt) here?

Also, if it's already enabled, should we not return ENODEV as well? There's
no saying who or what programmed it; we should not be touching it.

> +	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
> +		pt_pmu.pmu.capabilities =
> +			PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF;
> +
> +	pt_pmu.pmu.capabilities	|= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE;
> +	pt_pmu.pmu.attr_groups	= pt_attr_groups;
> +	pt_pmu.pmu.task_ctx_nr	= perf_hw_context;
> +	pt_pmu.pmu.event_init	= pt_event_init;
> +	pt_pmu.pmu.add		= pt_event_add;
> +	pt_pmu.pmu.del		= pt_event_del;
> +	pt_pmu.pmu.start	= pt_event_start;
> +	pt_pmu.pmu.stop		= pt_event_stop;
> +	pt_pmu.pmu.read		= pt_event_read;
> +	pt_pmu.pmu.setup_aux	= pt_buffer_setup_aux;
> +	pt_pmu.pmu.free_aux	= pt_buffer_free_aux;
> +	ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
> +
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 14/14] perf: add ITRACE_START record to indicate that tracing has started
  2014-11-14 13:43 ` [PATCH v8 14/14] perf: add ITRACE_START record to indicate that tracing has started Alexander Shishkin
@ 2015-01-09 14:12   ` Peter Zijlstra
  2015-01-09 14:13     ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2015-01-09 14:12 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

On Fri, Nov 14, 2014 at 03:43:47PM +0200, Alexander Shishkin wrote:
> +++ b/include/uapi/linux/perf_event.h
> @@ -750,6 +750,17 @@ enum perf_event_type {
>  	 */
>  	PERF_RECORD_AUX				= 11,
>  
> +	/*
> +	 * Indicates that instruction trace has started
> +	 *
> +	 * struct {
> +	 *	struct perf_event_header	header;
> +	 *	u32				pid;
> +	 *	u32				tid;

The below function suggests we should have:

		struct sample_id		sample_id;

> +	 * };
> +	 */
> +	PERF_RECORD_ITRACE_START		= 12,
> +
>  	PERF_RECORD_MAX,			/* non-ABI */
>  };

> +static void perf_log_itrace_start(struct perf_event *event)
> +{
> +	struct perf_output_handle handle;
> +	struct perf_sample_data sample;
> +	struct perf_aux_event {
> +		struct perf_event_header        header;
> +		u32				pid;
> +		u32				tid;
> +	} rec;
> +	int ret;
> +
> +	if (event->parent)
> +		event = event->parent;
> +
> +	if (!(event->pmu->capabilities & PERF_PMU_CAP_ITRACE) ||
> +	    event->hw.itrace_started)
> +		return;
> +
> +	event->hw.itrace_started = 1;
> +
> +	rec.header.type	= PERF_RECORD_ITRACE_START;
> +	rec.header.misc	= 0;
> +	rec.header.size	= sizeof(rec);
> +	rec.pid	= perf_event_pid(event, current);
> +	rec.tid	= perf_event_tid(event, current);
> +
> +	perf_event_header__init_id(&rec.header, &sample, event);
> +	ret = perf_output_begin(&handle, event, rec.header.size);
> +
> +	if (ret)
> +		return;
> +
> +	perf_output_put(&handle, rec);
> +	perf_event__output_id_sample(event, &handle, &sample);
> +
> +	perf_output_end(&handle);
> +}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 14/14] perf: add ITRACE_START record to indicate that tracing has started
  2015-01-09 14:12   ` Peter Zijlstra
@ 2015-01-09 14:13     ` Peter Zijlstra
  2015-01-12  9:30       ` Adrian Hunter
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2015-01-09 14:13 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

On Fri, Jan 09, 2015 at 03:12:48PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 14, 2014 at 03:43:47PM +0200, Alexander Shishkin wrote:
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -750,6 +750,17 @@ enum perf_event_type {
> >  	 */
> >  	PERF_RECORD_AUX				= 11,
> >  
> > +	/*
> > +	 * Indicates that instruction trace has started
> > +	 *
> > +	 * struct {
> > +	 *	struct perf_event_header	header;
> > +	 *	u32				pid;
> > +	 *	u32				tid;
> 
> The below function suggests we should have:
> 
> 		struct sample_id		sample_id;
> 
> > +	 * };
> > +	 */
> > +	PERF_RECORD_ITRACE_START		= 12,

This also raises the question; why not use PERF_RECORD_COMM?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams
  2014-11-17  9:33   ` Metzger, Markus T
@ 2015-01-09 15:18     ` Peter Zijlstra
  2015-01-12 13:12       ` Alexander Shishkin
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2015-01-09 15:18 UTC (permalink / raw)
  To: Metzger, Markus T
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, Liang, Kan, Hunter, Adrian,
	mathieu.poirier, acme

On Mon, Nov 17, 2014 at 09:33:00AM +0000, Metzger, Markus T wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > This patch introduces "AUX space" in the perf mmap buffer, intended for
> > exporting high bandwidth data streams to userspace, such as instruction
> > flow traces.
> > 
> > AUX space is a ring buffer, defined by aux_{offset,size} fields in the
> > user_page structure, and read/write pointers aux_{head,tail}, which abide
> > by the same rules as data_* counterparts of the main perf buffer.
> > 
> > In order to allocate/mmap AUX, userspace needs to set up aux_offset to
> > such an offset that will be greater than data_offset+data_size and
> > aux_size to be the desired buffer size. Both need to be page aligned.
> > Then, same aux_offset and aux_size should be passed to mmap() call and
> > if everything adds up, you should have an AUX buffer as a result.
> 
> Why would userspace need to fill in aux_offset and aux_size?  Can't the kernel
> do this when userspace mmaps the aux buffer like it does for the respective
> data_* fields?

I suppose we could; I'm trying to remember why I did it like this, I'm
failing to remember much past yesterday atm :/ 

> How would userspace fill in those fields if it mapped the data buffer read-only?

Good point, its not used much atm. But yes. That also reminds me; we
should probably finish this:

  https://lkml.org/lkml/2013/7/8/154

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 14/14] perf: add ITRACE_START record to indicate that tracing has started
  2015-01-09 14:13     ` Peter Zijlstra
@ 2015-01-12  9:30       ` Adrian Hunter
  0 siblings, 0 replies; 37+ messages in thread
From: Adrian Hunter @ 2015-01-12  9:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, kan.liang, markus.t.metzger,
	mathieu.poirier, acme

On 09/01/15 16:13, Peter Zijlstra wrote:
> On Fri, Jan 09, 2015 at 03:12:48PM +0100, Peter Zijlstra wrote:
>> On Fri, Nov 14, 2014 at 03:43:47PM +0200, Alexander Shishkin wrote:
>>> +++ b/include/uapi/linux/perf_event.h
>>> @@ -750,6 +750,17 @@ enum perf_event_type {
>>>  	 */
>>>  	PERF_RECORD_AUX				= 11,
>>>  
>>> +	/*
>>> +	 * Indicates that instruction trace has started
>>> +	 *
>>> +	 * struct {
>>> +	 *	struct perf_event_header	header;
>>> +	 *	u32				pid;
>>> +	 *	u32				tid;
>>
>> The below function suggests we should have:
>>
>> 		struct sample_id		sample_id;
>>
>>> +	 * };
>>> +	 */
>>> +	PERF_RECORD_ITRACE_START		= 12,
> 
> This also raises the question; why not use PERF_RECORD_COMM?

They mean slightly different things.

PERF_RECORD_COMM means the comm string has been set. We could infer that tid
was the currently running task, but we don't at the moment, and it would be a
constraint that should be documented if we are going to rely on it.
It might also mess up perf tools a little: they call the idle thread
"swapper", whereas the kernel calls it "swapper/n" on SMP, where n is the cpu number.

PERF_RECORD_ITRACE_START just means that pid/tid was the currently running task.
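
(A consumer-side sketch of that distinction; the ITRACE_START layout is the
one from the patch, while the struct name and the two helpers are made up
for illustration, assuming <linux/perf_event.h> is included:)

struct itrace_start_event {
	struct perf_event_header	header;
	__u32				pid;
	__u32				tid;
	/* followed by struct sample_id, if sample_id_all is set */
};

static void handle_record(struct perf_event_header *hdr)
{
	struct itrace_start_event *start;

	switch (hdr->type) {
	case PERF_RECORD_COMM:
		/* a comm string was set; says nothing about what is on-cpu */
		update_comm(hdr);			/* hypothetical helper */
		break;
	case PERF_RECORD_ITRACE_START:
		/* pid/tid is what was running when the trace started */
		start = (struct itrace_start_event *)hdr;
		set_decoder_task(start->pid, start->tid);	/* hypothetical helper */
		break;
	}
}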


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver
  2015-01-09 12:48   ` Peter Zijlstra
@ 2015-01-12 12:19     ` Alexander Shishkin
  2015-01-13 15:09     ` Alexander Shishkin
  1 sibling, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2015-01-12 12:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Nov 14, 2014 at 03:43:45PM +0200, Alexander Shishkin wrote:
>> +static void pt_event_stop(struct perf_event *event, int mode)
>> +{
>> +	struct pt *pt = this_cpu_ptr(&pt_ctx);
>> +
>> +	ACCESS_ONCE(pt->handle_nmi) = 0;
>
> Why is this needed? Will the hardware still generate interrupts if you
> stop the PT thing?

It's meant to serialize the nmi handler against the pmu:: callbacks, but now
that you mention it, it looks very much redundant.

Cheers,
--
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver
  2015-01-09 13:10   ` Peter Zijlstra
@ 2015-01-12 12:45     ` Alexander Shishkin
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2015-01-12 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Nov 14, 2014 at 03:43:45PM +0200, Alexander Shishkin wrote:
>> +static void pt_handle_status(struct pt *pt)
>> +{
>> +	struct pt_buffer *buf = perf_get_aux(&pt->handle);
>> +	int advance = 0;
>> +	u64 status;
>> +
>> +	rdmsrl(MSR_IA32_RTIT_STATUS, status);
>> +
>> +	if (status & RTIT_STATUS_ERROR) {
>> +		pr_err_ratelimited("ToPA ERROR encountered, trying to recover\n");
>> +		pt_topa_dump(buf);
>> +		status &= ~RTIT_STATUS_ERROR;
>> +	}
>> +
>> +	if (status & RTIT_STATUS_STOPPED) {
>> +		status &= ~RTIT_STATUS_STOPPED;
>> +
>> +		/*
>> +		 * On systems that only do single-entry ToPA, hitting STOP
>> +		 * means we are already losing data; need to let the decoder
>> +		 * know.
>> +		 */
>> +		if (!pt_cap_get(PT_CAP_topa_multiple_entries) ||
>> +		    buf->output_off == sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size)) {
>> +			local_inc(&buf->lost);
>> +			advance++;
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * Also on single-entry ToPA implementations, interrupt will come
>> +	 * before the output reaches its output region's boundary.
>> +	 */
>
> My understanding was that, while yes it will attempt to generate the
> interrupt early, there is absolutely no guarantee it will in fact arrive
> in time or even it if does, guarantee that there is buffer left by the
> time you're ready to read it.

That's correct. If the PMI didn't arrive on time and the buffer got
filled up, the STOPPED bit will be set in the STATUS register, which is
what we're handling above.

>> +	if (!pt_cap_get(PT_CAP_topa_multiple_entries) && !buf->snapshot &&
>> +	    pt_buffer_region_size(buf) - buf->output_off <= TOPA_PMI_MARGIN) {
>
> Now this check appears to indeed cater for this scenario where the
> buffer is overran, right?

The check above is for the case when the PMI did make it before the end
of the physical buffer was reached.

>
>> +		void *head = pt_buffer_region(buf);
>> +
>> +		/* everything within this margin needs to be zeroed out */
>> +		memset(head + buf->output_off, 0,
>> +		       pt_buffer_region_size(buf) -
>> +		       buf->output_off);
>> +		advance++;
>> +	}
>> +
>> +	if (advance)
>> +		pt_buffer_advance(buf);
>> +
>> +	wrmsrl(MSR_IA32_RTIT_STATUS, status);
>> +}

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver
  2015-01-09 14:09   ` Peter Zijlstra
@ 2015-01-12 12:53     ` Alexander Shishkin
  2015-01-12 16:37     ` Alexander Shishkin
  1 sibling, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2015-01-12 12:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Nov 14, 2014 at 03:43:45PM +0200, Alexander Shishkin wrote:
>> +static __init int pt_init(void)
>> +{
>> +	int ret, cpu, prior_warn = 0;
>> +
>> +	BUILD_BUG_ON(sizeof(struct topa) > PAGE_SIZE);
>> +	get_online_cpus();
>> +	for_each_online_cpu(cpu) {
>> +		u64 ctl;
>> +
>> +		ret = rdmsrl_safe_on_cpu(cpu, MSR_IA32_RTIT_CTL, &ctl);
>> +		if (!ret && (ctl & RTIT_CTL_TRACEEN))
>> +			prior_warn++;
>> +	}
>> +	put_online_cpus();
>> +
>> +	ret = pt_pmu_hw_init();
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (!pt_cap_get(PT_CAP_topa_output)) {
>> +		pr_warn("ToPA output is not supported on this CPU\n");
>> +		return -ENODEV;
>> +	}
>> +
>> +	if (prior_warn)
>> +		pr_warn("PT is enabled at boot time, traces may be empty\n");
>
> Should we not also add_exclusive(pt) here?

Good point.

> Also, if its already enabled, should we not return ENODEV as well, no
> saying who or what programmed it, we should not be touching it.

Indeed we should, although I'd also like to have a force-override boot
time option for this.

Thanks,
--
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams
  2015-01-09 15:18     ` Peter Zijlstra
@ 2015-01-12 13:12       ` Alexander Shishkin
  2015-01-12 13:38         ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Shishkin @ 2015-01-12 13:12 UTC (permalink / raw)
  To: Peter Zijlstra, Metzger, Markus T
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	Liang, Kan, Hunter, Adrian, mathieu.poirier, acme

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Nov 17, 2014 at 09:33:00AM +0000, Metzger, Markus T wrote:
>> > From: Peter Zijlstra <peterz@infradead.org>
>> > 
>> > This patch introduces "AUX space" in the perf mmap buffer, intended for
>> > exporting high bandwidth data streams to userspace, such as instruction
>> > flow traces.
>> > 
>> > AUX space is a ring buffer, defined by aux_{offset,size} fields in the
>> > user_page structure, and read/write pointers aux_{head,tail}, which abide
>> > by the same rules as data_* counterparts of the main perf buffer.
>> > 
>> > In order to allocate/mmap AUX, userspace needs to set up aux_offset to
>> > such an offset that will be greater than data_offset+data_size and
>> > aux_size to be the desired buffer size. Both need to be page aligned.
>> > Then, same aux_offset and aux_size should be passed to mmap() call and
>> > if everything adds up, you should have an AUX buffer as a result.
>> 
>> Why would userspace need to fill in aux_offset and aux_size?  Can't the kernel
>> do this when userspace mmaps the aux buffer like it does for the respective
>> data_* fields?
>
> I suppose we could; I'm trying to remember why I did it like this, I'm
> failing to remember much past yesterday atm :/ 

Well, right now we can only map things in one order: user page + perf
buffer, then aux buffer. Ideally, we can reduce it to mapping the user
page first (always R/W), setting up the desired offsets for data and aux
buffers and then mapping those in whatever order. And we can also have
holes between them if we want (not sure if that's useful, though).

If we want to stick to the current scheme of things where user page is
mapped together with data, we should set up aux_{offset,size} in
kernel's perf_mmap() path, similarly to data_{offset,size}.

There is another problem with it, though: it makes it impossible to run
the aux buffer in "full trace mode" (where we move the tail as we read
the data out), while running the data buffer in overwrite mode, because
the latter requires the user_page to be mapped RO => no way to update
aux_tail for the user.

>> How would userspace fill in those fields if it mapped the data buffer read-only?
>
> Good point, its not used much atm. But yes. That also reminds me; we
> should probably finish this:
>
>   https://lkml.org/lkml/2013/7/8/154

Indeed, it's been on my todo list for some time.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams
  2015-01-12 13:12       ` Alexander Shishkin
@ 2015-01-12 13:38         ` Peter Zijlstra
  2015-01-12 14:00           ` Alexander Shishkin
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2015-01-12 13:38 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Metzger, Markus T, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, Liang, Kan, Hunter, Adrian,
	mathieu.poirier, acme

On Mon, Jan 12, 2015 at 03:12:58PM +0200, Alexander Shishkin wrote:
> > I suppose we could; I'm trying to remember why I did it like this, I'm
> > failing to remember much past yesterday atm :/ 
> 
> Well, right now we can only map things in one order: user page + perf
> buffer, then aux buffer. Ideally, we can reduce it to mapping the user
> page first (always R/W), setting up the desired offsets for data and aux
> buffers and then mapping those in whatever order. And we can also have
> holes between them if we want (not sure if that's useful, though).

Indeed, over the weekend I came up with a similar argument. Suppose we want
to extend the thing with yet another area; at that point we simply do
not know which segment is meant to be mmap()ed.

Separating the control page from the data buffer is indeed a sane
proposal.

> If we want to stick to the current scheme of things where user page is
> mapped together with data, we should set up aux_{offset,size} in
> kernel's perf_mmap() path, similarly to data_{offset,size}.
> 
> There is another problem with it, though: it makes it impossible to run
> the aux buffer in "full trace mode" (where we move the tail as we read
> the data out), while running the data buffer in overwrite mode, because
> the latter requires the user_page to be mapped RO => no way to update
> aux_tail for the user.

ISTR aux bits relying on their own mmap(PROT_WRITE) bit, but yes-ish.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams
  2015-01-12 13:38         ` Peter Zijlstra
@ 2015-01-12 14:00           ` Alexander Shishkin
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Shishkin @ 2015-01-12 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Metzger, Markus T, Ingo Molnar, linux-kernel, Robert Richter,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, Liang, Kan, Hunter, Adrian,
	mathieu.poirier, acme

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Jan 12, 2015 at 03:12:58PM +0200, Alexander Shishkin wrote:
>> > I suppose we could; I'm trying to remember why I did it like this, I'm
>> > failing to remember much past yesterday atm :/ 
>> 
>> Well, right now we can only map things in one order: user page + perf
>> buffer, then aux buffer. Ideally, we can reduce it to mapping the user
>> page first (always R/W), setting up the desired offsets for data and aux
>> buffers and then mapping those in whatever order. And we can also have
>> holes between them if we want (not sure if that's useful, though).
>
> Indeed, over the weekend I came up a similar argument. Suppose we want
> to extend the thing with yet another area, at that point we simply do
> not know which segment is meant to be mapp()ed.
>
> Breaking up the control page from the data buffer is indeed a sane
> proposal.
>
>> If we want to stick to the current scheme of things where user page is
>> mapped together with data, we should set up aux_{offset,size} in
>> kernel's perf_mmap() path, similarly to data_{offset,size}.
>> 
>> There is another problem with it, though: it makes it impossible to run
>> the aux buffer in "full trace mode" (where we move the tail as we read
>> the data out), while running the data buffer in overwrite mode, because
>> the latter requires the user_page to be mapped RO => no way to update
>> aux_tail for the user.
>
> ISTR aux bits relying on their own mmap(PROT_WRITE) bit, but yes-ish.

It is. Now, if the data buffer is made to behave the same way, this problem
goes away. Let me see if I can come up with a patch for this, bearing in
mind backwards compatibility with older perfs, etc.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver
  2015-01-09 14:09   ` Peter Zijlstra
  2015-01-12 12:53     ` Alexander Shishkin
@ 2015-01-12 16:37     ` Alexander Shishkin
  2015-01-12 16:40       ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Alexander Shishkin @ 2015-01-12 16:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

Peter Zijlstra <peterz@infradead.org> writes:

> Also, if its already enabled, should we not return ENODEV as well, no
> saying who or what programmed it, we should not be touching it.

Or EBUSY?

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver
  2015-01-12 16:37     ` Alexander Shishkin
@ 2015-01-12 16:40       ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2015-01-12 16:40 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

On Mon, Jan 12, 2015 at 06:37:36PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > Also, if its already enabled, should we not return ENODEV as well, no
> > saying who or what programmed it, we should not be touching it.
> 
> Or EBUSY?

Sure, same end result I imagine; the driver doesn't load.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver
  2015-01-09 12:48   ` Peter Zijlstra
  2015-01-12 12:19     ` Alexander Shishkin
@ 2015-01-13 15:09     ` Alexander Shishkin
  2015-01-13 16:27       ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Alexander Shishkin @ 2015-01-13 15:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Nov 14, 2014 at 03:43:45PM +0200, Alexander Shishkin wrote:
>> +static void pt_event_stop(struct perf_event *event, int mode)
>> +{
>> +	struct pt *pt = this_cpu_ptr(&pt_ctx);
>> +
>> +	ACCESS_ONCE(pt->handle_nmi) = 0;
>
> Why is this needed? Will the hardware still generate interrupts if you
> stop the PT thing?

Actually, it turns out that the interrupt condition can race with the
wrmsr that disables tracing, so the PT bit remains set in the global
status register, causing the next PMI to go into the PT PMI handler,
which may then decide to re-enable a counter that should stay disabled.
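
(Something along these lines in the PMI handler, say -- a sketch against
the quoted driver code, not the actual patch:)

	struct pt *pt = this_cpu_ptr(&pt_ctx);

	/*
	 * pt_event_stop() clears handle_nmi before the wrmsr that disables
	 * tracing; that wrmsr can race with an already pending PT interrupt,
	 * leaving the PT bit set in the global status MSR, so the next PMI
	 * may land here after the event has been stopped.  Bail out instead
	 * of re-enabling a stopped event.
	 */
	if (!ACCESS_ONCE(pt->handle_nmi))
		return 0;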

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver
  2015-01-13 15:09     ` Alexander Shishkin
@ 2015-01-13 16:27       ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2015-01-13 16:27 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Robert Richter, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen,
	kan.liang, adrian.hunter, markus.t.metzger, mathieu.poirier,
	acme

On Tue, Jan 13, 2015 at 05:09:58PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Fri, Nov 14, 2014 at 03:43:45PM +0200, Alexander Shishkin wrote:
> >> +static void pt_event_stop(struct perf_event *event, int mode)
> >> +{
> >> +	struct pt *pt = this_cpu_ptr(&pt_ctx);
> >> +
> >> +	ACCESS_ONCE(pt->handle_nmi) = 0;
> >
> > Why is this needed? Will the hardware still generate interrupts if you
> > stop the PT thing?
> 
> Actually, it turns out that the interrupt condition can race with the
> wrmsr that disables tracing and the PT bit remains set in the global
> status register, causing the next PMI to go into the PT PMI handler,
> which then may decide to re-enable the counter that should be disabled.

OK, please add this information in a suitably placed comment.

Thanks!

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2015-01-13 16:27 UTC | newest]

Thread overview: 37+ messages
2014-11-14 13:43 [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 01/14] perf: Add data_{offset,size} to user_page Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 02/14] perf: Add AUX area to ring buffer for raw data streams Alexander Shishkin
2014-11-17  9:33   ` Metzger, Markus T
2015-01-09 15:18     ` Peter Zijlstra
2015-01-12 13:12       ` Alexander Shishkin
2015-01-12 13:38         ` Peter Zijlstra
2015-01-12 14:00           ` Alexander Shishkin
2014-11-17 21:24   ` Sukadev Bhattiprolu
2014-11-17 21:45     ` Andi Kleen
2014-11-14 13:43 ` [PATCH v8 03/14] perf: Support high-order allocations for AUX space Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 04/14] perf: Add a capability for AUX_NO_SG pmus to do software double buffering Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 05/14] perf: Add a pmu capability for "exclusive" events Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 06/14] perf: Add AUX record Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 07/14] perf: Add api for pmus to write to AUX area Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 08/14] perf: Support overwrite mode for " Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 09/14] perf: Add wakeup watermark control to " Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 10/14] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 11/14] x86: perf: Intel PT and LBR/BTS are mutually exclusive Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 12/14] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
2015-01-09 12:48   ` Peter Zijlstra
2015-01-12 12:19     ` Alexander Shishkin
2015-01-13 15:09     ` Alexander Shishkin
2015-01-13 16:27       ` Peter Zijlstra
2015-01-09 13:10   ` Peter Zijlstra
2015-01-12 12:45     ` Alexander Shishkin
2015-01-09 14:09   ` Peter Zijlstra
2015-01-12 12:53     ` Alexander Shishkin
2015-01-12 16:37     ` Alexander Shishkin
2015-01-12 16:40       ` Peter Zijlstra
2014-11-14 13:43 ` [PATCH v8 13/14] x86: perf: intel_bts: Add BTS " Alexander Shishkin
2014-11-14 13:43 ` [PATCH v8 14/14] perf: add ITRACE_START record to indicate that tracing has started Alexander Shishkin
2015-01-09 14:12   ` Peter Zijlstra
2015-01-09 14:13     ` Peter Zijlstra
2015-01-12  9:30       ` Adrian Hunter
2014-12-17 14:06 ` [PATCH v8 00/14] perf: Add infrastructure and support for Intel PT Alexander Shishkin
2015-01-07  9:32   ` Alexander Shishkin
