* [PATCH v1 00/11] perf: Add support for Intel Processor Trace
@ 2014-02-06 10:50 Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 01/11] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
                   ` (10 more replies)
  0 siblings, 11 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

Hi Peter and all,

Here's the second attempt at the Intel PT support patchset; this time I
only include the kernel part, since it requires more scrutiny. The
whole patchset, including the userspace tooling, can currently be found
in my github repo [1]. Major changes since the previous version are:

 * the magic mmap() offset got replaced with a separate file descriptor,
 which refers to a second ring buffer attached to the same event; this
 way, the first ring buffer (the perf stream) receives trace buffer
 related events, such as the one that signals that trace data has been
 lost (ITRACE_LOST), in addition to the normal sideband data,
 * added a driver for BTS per Ingo's request; now BTS can be used via
 the same interface as Intel PT, illustrating the capabilities of the
 "itrace" framework to those who are interested,
 * the massive patches got split into more digestible ones for the
 benefit of the reviewers,
 * added support for multiple itrace pmus (since we have to accommodate
 both PT and BTS now),
 * various small changes.

This patchset adds support to the perf kernel and userspace infrastructure
for the Intel Processor Trace (PT) extension [2] of Intel Architecture,
which allows capturing information about software execution flow. We
provide an abstraction for it called "itrace" for "instruction
trace" ([3]).

The single most notable thing is that while PT outputs trace data in a
compressed binary format, it will still generate hundreds of megabytes
of trace data per second per core. Decoding this binary stream takes
2-3 orders of magnitude more cpu time than it takes to generate
it. These considerations make it impossible to carry out decoding in
kernel space. Therefore, the trace data is exported to userspace as a
zero-copy mapping that userspace can collect and store for later
decoding. To that end, perf is extended to support an additional ring
buffer per event, which exports the trace data. This ring buffer
is mapped from a file descriptor, which is derived from the event's
file descriptor. The ring buffer has its own user page with data_head
and data_tail (in case the buffer is mapped writable) pointers, used as
read/write pointers into the buffer.

This way we get a normal perf data stream that provides sideband
information that is required to decode the trace data, such as MMAPs,
COMMs etc, plus the actual trace in a separate buffer.

If the trace buffer is mapped writable, the driver will stop tracing
when it fills up (data_head approaches data_tail), until the data is
read, the data_tail pointer is moved forward, and an ioctl() is issued
to re-enable tracing. If the trace buffer is mapped read-only, tracing
will continue, overwriting older data, so that the buffer always
contains the most recent data. Tracing can be stopped with an ioctl()
and restarted once the data is collected.
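
To make the second-buffer flow above concrete, here is a minimal
userspace sketch (hypothetical, written against the patched uapi
headers): the itrace fd is derived from the event fd with
PERF_FLAG_FD_ITRACE as introduced later in this series, store_chunk()
is a placeholder, and PERF_EVENT_IOC_ENABLE is my assumption for "an
ioctl() to re-enable tracing":

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/perf_event.h>	/* patched: PERF_FLAG_FD_ITRACE */

/* Sketch: drain the itrace buffer of an existing perf event fd. */
static int drain_itrace(int event_fd, size_t data_size)
{
	long page = sysconf(_SC_PAGESIZE);
	struct perf_event_mmap_page *up;
	char *data;
	int itrace_fd;

	/* the itrace fd is derived from the event's fd */
	itrace_fd = syscall(__NR_perf_event_open, NULL, -1, -1,
			    event_fd, PERF_FLAG_FD_ITRACE);
	if (itrace_fd < 0)
		return -1;

	/* writable mapping: the driver stops tracing when the buffer fills */
	up = mmap(NULL, page + data_size, PROT_READ | PROT_WRITE,
		  MAP_SHARED, itrace_fd, 0);
	if (up == MAP_FAILED)
		return -1;
	data = (char *)up + page;

	for (;;) {
		uint64_t head = up->data_head;
		uint64_t tail = up->data_tail;

		__sync_synchronize();	/* read trace data only after data_head */

		while (tail < head) {
			size_t off = tail % data_size;
			size_t len = head - tail;

			if (len > data_size - off)
				len = data_size - off;
			/* store_chunk(data + off, len): save for later decoding */
			tail += len;
		}

		up->data_tail = tail;	/* hand the space back to the driver */
		/* assumed re-enable ioctl; tracing stopped when the buffer filled */
		ioctl(event_fd, PERF_EVENT_IOC_ENABLE, 0);
	}
}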

Another use case is annotating samples of other perf events: if you
set PERF_SAMPLE_ITRACE, attr.itrace_sample_size bytes of trace will be
included in each event's sample.

Also, itrace data can be included in process core dumps, which can be
enabled with a new rlimit -- RLIMIT_ITRACE.

[1] https://github.com/virtuoso/linux-perf/tree/intel_pt
[2] http://download-software.intel.com/sites/default/files/managed/50/1a/319433-018.pdf
[3] http://events.linuxfoundation.org/sites/events/files/slides/lcna13_kleen.pdf

Alexander Shishkin (11):
  x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection
  perf: Abstract ring_buffer backing store operations
  perf: Allow for multiple ring buffers per event
  itrace: Infrastructure for instruction flow tracing units
  itrace: Add functionality to include traces in perf event samples
  itrace: Add functionality to include traces in process core dumps
  x86: perf: intel_pt: Intel PT PMU driver
  x86: perf: intel_pt: Add sampling functionality
  x86: perf: intel_pt: Add core dump functionality
  x86: perf: intel_bts: Add BTS PMU driver
  x86: perf: intel_bts: Add core dump related functionality

 arch/x86/include/asm/cpufeature.h          |    1 +
 arch/x86/include/uapi/asm/msr-index.h      |   18 +
 arch/x86/kernel/cpu/Makefile               |    1 +
 arch/x86/kernel/cpu/intel_pt.h             |  129 +++
 arch/x86/kernel/cpu/perf_event.c           |    4 +
 arch/x86/kernel/cpu/perf_event.h           |    6 +
 arch/x86/kernel/cpu/perf_event_intel.c     |   16 +-
 arch/x86/kernel/cpu/perf_event_intel_bts.c |  500 ++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |    3 +-
 arch/x86/kernel/cpu/perf_event_intel_pt.c  | 1180 ++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/scattered.c            |    1 +
 fs/binfmt_elf.c                            |    6 +
 fs/proc/base.c                             |    1 +
 include/asm-generic/resource.h             |    1 +
 include/linux/itrace.h                     |  162 ++++
 include/linux/perf_event.h                 |   34 +-
 include/uapi/asm-generic/resource.h        |    3 +-
 include/uapi/linux/elf.h                   |    1 +
 include/uapi/linux/perf_event.h            |   22 +-
 kernel/events/Makefile                     |    2 +-
 kernel/events/core.c                       |  341 +++++---
 kernel/events/internal.h                   |   39 +-
 kernel/events/itrace.c                     |  705 +++++++++++++++++
 kernel/events/ring_buffer.c                |  178 +++--
 kernel/exit.c                              |    3 +
 kernel/sys.c                               |    5 +
 26 files changed, 3189 insertions(+), 173 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_pt.h
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c
 create mode 100644 include/linux/itrace.h
 create mode 100644 kernel/events/itrace.c

-- 
1.8.5.2



* [PATCH v1 01/11] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection
  2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
@ 2014-02-06 10:50 ` Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 02/11] perf: Abstract ring_buffer backing store operations Alexander Shishkin
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

Intel Processor Trace is an architecture extension that allows for program
flow tracing.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/include/asm/cpufeature.h | 1 +
 arch/x86/kernel/cpu/scattered.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 89270b4..cb9864f 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -186,6 +186,7 @@
 #define X86_FEATURE_DTHERM	(7*32+ 7) /* Digital Thermal Sensor */
 #define X86_FEATURE_HW_PSTATE	(7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK (7*32+ 9) /* AMD ProcFeedbackInterface */
+#define X86_FEATURE_INTEL_PT	(7*32+10) /* Intel Processor Trace */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW  (8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index b6f794a..726e6a3 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -36,6 +36,7 @@ void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
 		{ X86_FEATURE_ARAT,		CR_EAX, 2, 0x00000006, 0 },
 		{ X86_FEATURE_PLN,		CR_EAX, 4, 0x00000006, 0 },
 		{ X86_FEATURE_PTS,		CR_EAX, 6, 0x00000006, 0 },
+		{ X86_FEATURE_INTEL_PT,		CR_EBX,25, 0x00000007, 0 },
 		{ X86_FEATURE_APERFMPERF,	CR_ECX, 0, 0x00000006, 0 },
 		{ X86_FEATURE_EPB,		CR_ECX, 3, 0x00000006, 0 },
 		{ X86_FEATURE_XSAVEOPT,		CR_EAX,	0, 0x0000000d, 1 },
-- 
1.8.5.2



* [PATCH v1 02/11] perf: Abstract ring_buffer backing store operations
  2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 01/11] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
@ 2014-02-06 10:50 ` Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 03/11] perf: Allow for multiple ring buffers per event Alexander Shishkin
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

This patch extends perf's ring_buffer code so that buffers with different
backing stores can be allocated through the same rb_alloc() interface. This
allows the ring_buffer code to be reused for exporting hardware-written
trace buffers (such as those of Intel PT) to userspace.
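
As a hypothetical illustration of the resulting interface (the callback
names are the ones added to kernel/events/internal.h below; the foo_*
functions are made up), a driver-specific backing store would be
plugged in roughly like this:

/* Sketch only; would live next to kernel/events/ring_buffer.c. */
#include <linux/mm.h>
#include <linux/slab.h>
#include "internal.h"

static unsigned long foo_rb_get_size(int nr_pages)
{
	return sizeof(struct ring_buffer) + sizeof(void *) * nr_pages;
}

/* No alloc_user_page: rb_alloc() calls this once with the full nr_pages,
 * and it is responsible for filling rb->user_page, rb->data_pages[] and
 * rb->nr_pages with hardware-suitable memory. */
static int foo_rb_alloc_data_pages(struct ring_buffer *rb, int cpu,
				   int nr_pages, int flags)
{
	return -ENOMEM;	/* stub: a real driver allocates here and returns 0 */
}

static void foo_rb_free(struct ring_buffer *rb)
{
	/* release whatever foo_rb_alloc_data_pages() set up, then ... */
	kfree(rb);
}

static struct page *foo_rb_mmap_to_page(struct ring_buffer *rb,
					unsigned long pgoff)
{
	return virt_to_page(pgoff ? rb->data_pages[pgoff - 1]
				  : rb->user_page);
}

static struct ring_buffer_ops foo_rb_ops = {
	.get_size		= foo_rb_get_size,
	.alloc_data_page	= foo_rb_alloc_data_pages,
	.free_buffer		= foo_rb_free,
	.mmap_to_page		= foo_rb_mmap_to_page,
};

/*
 * At mmap time, a driver would then pass &foo_rb_ops instead of NULL:
 *   rb = rb_alloc(event, nr_pages, 0, event->cpu, flags, &foo_rb_ops);
 */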

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c        |   4 +-
 kernel/events/internal.h    |  32 +++++++-
 kernel/events/ring_buffer.c | 176 +++++++++++++++++++++++++++-----------------
 3 files changed, 143 insertions(+), 69 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 56003c6..6899741 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4105,9 +4105,9 @@ again:
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
 
-	rb = rb_alloc(nr_pages, 
+	rb = rb_alloc(event, nr_pages,
 		event->attr.watermark ? event->attr.wakeup_watermark : 0,
-		event->cpu, flags);
+		event->cpu, flags, NULL);
 
 	if (!rb) {
 		ret = -ENOMEM;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b2187..6cb208f 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -6,6 +6,33 @@
 
 /* Buffer handling */
 
+struct ring_buffer;
+
+struct ring_buffer_ops {
+	/*
+	 * How much memory should be allocated for struct ring_buffer, taking into
+	 * account data_pages[] array.
+	 */
+	unsigned long	(*get_size)(int);
+	/*
+	 * Allocate user_page for this buffer, can be NULL, in which case it is
+	 * allocated by alloc_data_page().
+	 */
+	int		(*alloc_user_page)(struct ring_buffer *, int, int);
+	/*
+	 * Allocate data_pages for this buffer.
+	 */
+	int		(*alloc_data_page)(struct ring_buffer *, int, int, int);
+	/*
+	 * Free the buffer.
+	 */
+	void		(*free_buffer)(struct ring_buffer *);
+	/*
+	 * Get a struct page for a given page index in the buffer.
+	 */
+	struct page	*(*mmap_to_page)(struct ring_buffer *, unsigned long);
+};
+
 #define RING_BUFFER_WRITABLE		0x01
 
 struct ring_buffer {
@@ -15,6 +42,8 @@ struct ring_buffer {
 	struct work_struct		work;
 	int				page_order;	/* allocation order  */
 #endif
+	struct ring_buffer_ops		*ops;
+	struct perf_event		*event;
 	int				nr_pages;	/* nr of data pages  */
 	int				overwrite;	/* can overwrite itself */
 
@@ -41,7 +70,8 @@ struct ring_buffer {
 
 extern void rb_free(struct ring_buffer *rb);
 extern struct ring_buffer *
-rb_alloc(int nr_pages, long watermark, int cpu, int flags);
+rb_alloc(struct perf_event *event, int nr_pages, long watermark, int cpu,
+	 int flags, struct ring_buffer_ops *rb_ops);
 extern void perf_event_wakeup(struct perf_event *event);
 
 extern void
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 146a579..161a676 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -248,18 +248,6 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
  * Back perf_mmap() with regular GFP_KERNEL-0 pages.
  */
 
-struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
-{
-	if (pgoff > rb->nr_pages)
-		return NULL;
-
-	if (pgoff == 0)
-		return virt_to_page(rb->user_page);
-
-	return virt_to_page(rb->data_pages[pgoff - 1]);
-}
-
 static void *perf_mmap_alloc_page(int cpu)
 {
 	struct page *page;
@@ -273,46 +261,31 @@ static void *perf_mmap_alloc_page(int cpu)
 	return page_address(page);
 }
 
-struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
+static int perf_mmap_alloc_user_page(struct ring_buffer *rb, int cpu,
+				     int flags)
 {
-	struct ring_buffer *rb;
-	unsigned long size;
-	int i;
-
-	size = sizeof(struct ring_buffer);
-	size += nr_pages * sizeof(void *);
-
-	rb = kzalloc(size, GFP_KERNEL);
-	if (!rb)
-		goto fail;
-
 	rb->user_page = perf_mmap_alloc_page(cpu);
 	if (!rb->user_page)
-		goto fail_user_page;
-
-	for (i = 0; i < nr_pages; i++) {
-		rb->data_pages[i] = perf_mmap_alloc_page(cpu);
-		if (!rb->data_pages[i])
-			goto fail_data_pages;
-	}
+		return -ENOMEM;
 
-	rb->nr_pages = nr_pages;
-
-	ring_buffer_init(rb, watermark, flags);
+	return 0;
+}
 
-	return rb;
+static int perf_mmap_alloc_data_page(struct ring_buffer *rb, int cpu,
+				     int nr_pages, int flags)
+{
+	void *data;
 
-fail_data_pages:
-	for (i--; i >= 0; i--)
-		free_page((unsigned long)rb->data_pages[i]);
+	if (nr_pages != 1)
+		return -EINVAL;
 
-	free_page((unsigned long)rb->user_page);
+	data = perf_mmap_alloc_page(cpu);
+	if (!data)
+		return -ENOMEM;
 
-fail_user_page:
-	kfree(rb);
+	rb->data_pages[rb->nr_pages] = data;
 
-fail:
-	return NULL;
+	return 0;
 }
 
 static void perf_mmap_free_page(unsigned long addr)
@@ -323,24 +296,51 @@ static void perf_mmap_free_page(unsigned long addr)
 	__free_page(page);
 }
 
-void rb_free(struct ring_buffer *rb)
+static void perf_mmap_gfp0_free(struct ring_buffer *rb)
 {
 	int i;
 
-	perf_mmap_free_page((unsigned long)rb->user_page);
+	if (rb->user_page)
+		perf_mmap_free_page((unsigned long)rb->user_page);
 	for (i = 0; i < rb->nr_pages; i++)
 		perf_mmap_free_page((unsigned long)rb->data_pages[i]);
 	kfree(rb);
 }
 
+struct page *
+perf_mmap_gfp0_to_page(struct ring_buffer *rb, unsigned long pgoff)
+{
+	if (pgoff > rb->nr_pages)
+		return NULL;
+
+	if (pgoff == 0)
+		return virt_to_page(rb->user_page);
+
+	return virt_to_page(rb->data_pages[pgoff - 1]);
+}
+
+static unsigned long perf_mmap_gfp0_get_size(int nr_pages)
+{
+	return sizeof(struct ring_buffer) + sizeof(void *) * nr_pages;
+}
+
+struct ring_buffer_ops perf_rb_ops = {
+	.get_size		= perf_mmap_gfp0_get_size,
+	.alloc_user_page	= perf_mmap_alloc_user_page,
+	.alloc_data_page	= perf_mmap_alloc_data_page,
+	.free_buffer		= perf_mmap_gfp0_free,
+	.mmap_to_page		= perf_mmap_gfp0_to_page,
+};
+
 #else
+
 static int data_page_nr(struct ring_buffer *rb)
 {
 	return rb->nr_pages << page_order(rb);
 }
 
 struct page *
-perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+perf_mmap_vmalloc_to_page(struct ring_buffer *rb, unsigned long pgoff)
 {
 	/* The '>' counts in the user page. */
 	if (pgoff > data_page_nr(rb))
@@ -349,14 +349,14 @@ perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
 	return vmalloc_to_page((void *)rb->user_page + pgoff * PAGE_SIZE);
 }
 
-static void perf_mmap_unmark_page(void *addr)
+static void perf_mmap_vmalloc_unmark_page(void *addr)
 {
 	struct page *page = vmalloc_to_page(addr);
 
 	page->mapping = NULL;
 }
 
-static void rb_free_work(struct work_struct *work)
+static void perf_mmap_vmalloc_free_work(struct work_struct *work)
 {
 	struct ring_buffer *rb;
 	void *base;
@@ -368,50 +368,94 @@ static void rb_free_work(struct work_struct *work)
 	base = rb->user_page;
 	/* The '<=' counts in the user page. */
 	for (i = 0; i <= nr; i++)
-		perf_mmap_unmark_page(base + (i * PAGE_SIZE));
+		perf_mmap_vmalloc_unmark_page(base + (i * PAGE_SIZE));
 
 	vfree(base);
 	kfree(rb);
 }
 
-void rb_free(struct ring_buffer *rb)
+static void perf_mmap_vmalloc_free(struct ring_buffer *rb)
 {
 	schedule_work(&rb->work);
 }
 
-struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
+static int perf_mmap_vmalloc_data_pages(struct ring_buffer *rb, int cpu,
+					int nr_pages, int flags)
 {
-	struct ring_buffer *rb;
-	unsigned long size;
 	void *all_buf;
 
-	size = sizeof(struct ring_buffer);
-	size += sizeof(void *);
-
-	rb = kzalloc(size, GFP_KERNEL);
-	if (!rb)
-		goto fail;
-
-	INIT_WORK(&rb->work, rb_free_work);
+	INIT_WORK(&rb->work, perf_mmap_vmalloc_free_work);
 
 	all_buf = vmalloc_user((nr_pages + 1) * PAGE_SIZE);
 	if (!all_buf)
-		goto fail_all_buf;
+		return -ENOMEM;
 
 	rb->user_page = all_buf;
 	rb->data_pages[0] = all_buf + PAGE_SIZE;
 	rb->page_order = ilog2(nr_pages);
 	rb->nr_pages = !!nr_pages;
 
+	return 0;
+}
+
+static unsigned long perf_mmap_vmalloc_get_size(int nr_pages)
+{
+	return sizeof(struct ring_buffer) + sizeof(void *);
+}
+
+struct ring_buffer_ops perf_rb_ops = {
+	.get_size		= perf_mmap_vmalloc_get_size,
+	.alloc_data_page	= perf_mmap_vmalloc_data_pages,
+	.free_buffer		= perf_mmap_vmalloc_free,
+	.mmap_to_page		= perf_mmap_vmalloc_to_page,
+};
+
+#endif
+
+struct ring_buffer *rb_alloc(struct perf_event *event, int nr_pages,
+			     long watermark, int cpu, int flags,
+			     struct ring_buffer_ops *rb_ops)
+{
+	struct ring_buffer *rb;
+	int i;
+
+	if (!rb_ops)
+		rb_ops = &perf_rb_ops;
+
+	rb = kzalloc(rb_ops->get_size(nr_pages), GFP_KERNEL);
+	if (!rb)
+		return NULL;
+
+	rb->event = event;
+	rb->ops = rb_ops;
+	if (rb->ops->alloc_user_page) {
+		if (rb->ops->alloc_user_page(rb, cpu, flags))
+			goto fail;
+
+		for (i = 0; i < nr_pages; i++, rb->nr_pages++)
+			if (rb->ops->alloc_data_page(rb, cpu, 1, flags))
+				goto fail;
+	} else {
+		if (rb->ops->alloc_data_page(rb, cpu, nr_pages, flags))
+			goto fail;
+	}
+
 	ring_buffer_init(rb, watermark, flags);
 
 	return rb;
 
-fail_all_buf:
-	kfree(rb);
-
 fail:
+	rb->ops->free_buffer(rb);
 	return NULL;
 }
 
-#endif
+void rb_free(struct ring_buffer *rb)
+{
+	rb->ops->free_buffer(rb);
+}
+
+struct page *
+perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+{
+	return rb->ops->mmap_to_page(rb, pgoff);
+}
-- 
1.8.5.2



* [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 01/11] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 02/11] perf: Abstract ring_buffer backing store operations Alexander Shishkin
@ 2014-02-06 10:50 ` Alexander Shishkin
  2014-02-17 14:33   ` Peter Zijlstra
  2014-05-07 15:26   ` Peter Zijlstra
  2014-02-06 10:50 ` [PATCH v1 04/11] itrace: Infrastructure for instruction flow tracing units Alexander Shishkin
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

Currently, a perf event can have one ring buffer associated with it, which
is used for the perf record stream. However, some PMUs, such as instruction
tracing units, will generate binary streams of their own, for which it is
convenient to reuse the ring buffer code to export such streams to
userspace. So, this patch extends the perf code to support more than one
ring buffer per event. All the existing functionality defaults to using
the main ring buffer for everything, and only the main buffer is exported
to userspace at this point.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  |  11 ++-
 kernel/events/core.c        | 186 ++++++++++++++++++++++++--------------------
 kernel/events/internal.h    |   7 ++
 kernel/events/ring_buffer.c |   2 +-
 4 files changed, 118 insertions(+), 88 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e56b07f..93cefb6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -289,6 +289,11 @@ struct swevent_hlist {
 struct perf_cgroup;
 struct ring_buffer;
 
+enum perf_event_rb {
+	PERF_RB_MAIN = 0,
+	PERF_NR_RB,
+};
+
 /**
  * struct perf_event - performance event kernel representation:
  */
@@ -398,10 +403,10 @@ struct perf_event {
 
 	/* mmap bits */
 	struct mutex			mmap_mutex;
-	atomic_t			mmap_count;
+	atomic_t			mmap_count[PERF_NR_RB];
 
-	struct ring_buffer		*rb;
-	struct list_head		rb_entry;
+	struct ring_buffer		*rb[PERF_NR_RB];
+	struct list_head		rb_entry[PERF_NR_RB];
 
 	/* poll related */
 	wait_queue_head_t		waitq;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6899741..533230c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3175,9 +3175,6 @@ static void free_event_rcu(struct rcu_head *head)
 	kfree(event);
 }
 
-static void ring_buffer_put(struct ring_buffer *rb);
-static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);
-
 static void unaccount_event_cpu(struct perf_event *event, int cpu)
 {
 	if (event->parent)
@@ -3231,28 +3228,31 @@ static void __free_event(struct perf_event *event)
 }
 static void free_event(struct perf_event *event)
 {
+	int rbx;
+
 	irq_work_sync(&event->pending);
 
 	unaccount_event(event);
 
-	if (event->rb) {
-		struct ring_buffer *rb;
+	for (rbx = PERF_RB_MAIN; rbx < PERF_NR_RB; rbx++)
+		if (event->rb[rbx]) {
+			struct ring_buffer *rb;
 
-		/*
-		 * Can happen when we close an event with re-directed output.
-		 *
-		 * Since we have a 0 refcount, perf_mmap_close() will skip
-		 * over us; possibly making our ring_buffer_put() the last.
-		 */
-		mutex_lock(&event->mmap_mutex);
-		rb = event->rb;
-		if (rb) {
-			rcu_assign_pointer(event->rb, NULL);
-			ring_buffer_detach(event, rb);
-			ring_buffer_put(rb); /* could be last */
+			/*
+			 * Can happen when we close an event with re-directed output.
+			 *
+			 * Since we have a 0 refcount, perf_mmap_close() will skip
+			 * over us; possibly making our ring_buffer_put() the last.
+			 */
+			mutex_lock(&event->mmap_mutex);
+			rb = event->rb[rbx];
+			if (rb) {
+				rcu_assign_pointer(event->rb[rbx], NULL);
+				ring_buffer_detach(event, rb);
+				ring_buffer_put(rb); /* could be last */
+			}
+			mutex_unlock(&event->mmap_mutex);
 		}
-		mutex_unlock(&event->mmap_mutex);
-	}
 
 	if (is_cgroup_event(event))
 		perf_detach_cgroup(event);
@@ -3481,21 +3481,24 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
 {
 	struct perf_event *event = file->private_data;
 	struct ring_buffer *rb;
-	unsigned int events = POLL_HUP;
+	unsigned int events = 0;
+	int rbx;
 
 	/*
 	 * Pin the event->rb by taking event->mmap_mutex; otherwise
 	 * perf_event_set_output() can swizzle our rb and make us miss wakeups.
 	 */
 	mutex_lock(&event->mmap_mutex);
-	rb = event->rb;
-	if (rb)
-		events = atomic_xchg(&rb->poll, 0);
+	for (rbx = PERF_RB_MAIN; rbx < PERF_NR_RB; rbx++) {
+		rb = event->rb[rbx];
+		if (rb)
+			events |= atomic_xchg(&rb->poll, 0);
+	}
 	mutex_unlock(&event->mmap_mutex);
 
 	poll_wait(file, &event->waitq, wait);
 
-	return events;
+	return events ? events : POLL_HUP;
 }
 
 static void perf_event_reset(struct perf_event *event)
@@ -3726,7 +3729,7 @@ static void perf_event_init_userpage(struct perf_event *event)
 	struct ring_buffer *rb;
 
 	rcu_read_lock();
-	rb = rcu_dereference(event->rb);
+	rb = rcu_dereference(event->rb[PERF_RB_MAIN]);
 	if (!rb)
 		goto unlock;
 
@@ -3756,7 +3759,7 @@ void perf_event_update_userpage(struct perf_event *event)
 	u64 enabled, running, now;
 
 	rcu_read_lock();
-	rb = rcu_dereference(event->rb);
+	rb = rcu_dereference(event->rb[PERF_RB_MAIN]);
 	if (!rb)
 		goto unlock;
 
@@ -3812,7 +3815,7 @@ static int perf_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 
 	rcu_read_lock();
-	rb = rcu_dereference(event->rb);
+	rb = rcu_dereference(event->rb[PERF_RB_MAIN]);
 	if (!rb)
 		goto unlock;
 
@@ -3834,29 +3837,31 @@ unlock:
 	return ret;
 }
 
-static void ring_buffer_attach(struct perf_event *event,
-			       struct ring_buffer *rb)
+void ring_buffer_attach(struct perf_event *event,
+			struct ring_buffer *rb)
 {
+	struct list_head *head = &event->rb_entry[PERF_RB_MAIN];
 	unsigned long flags;
 
-	if (!list_empty(&event->rb_entry))
+	if (!list_empty(head))
 		return;
 
 	spin_lock_irqsave(&rb->event_lock, flags);
-	if (list_empty(&event->rb_entry))
-		list_add(&event->rb_entry, &rb->event_list);
+	if (list_empty(head))
+		list_add(head, &rb->event_list);
 	spin_unlock_irqrestore(&rb->event_lock, flags);
 }
 
-static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
+void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
 {
+	struct list_head *head = &event->rb_entry[PERF_RB_MAIN];
 	unsigned long flags;
 
-	if (list_empty(&event->rb_entry))
+	if (list_empty(head))
 		return;
 
 	spin_lock_irqsave(&rb->event_lock, flags);
-	list_del_init(&event->rb_entry);
+	list_del_init(head);
 	wake_up_all(&event->waitq);
 	spin_unlock_irqrestore(&rb->event_lock, flags);
 }
@@ -3864,12 +3869,16 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
 static void ring_buffer_wakeup(struct perf_event *event)
 {
 	struct ring_buffer *rb;
+	struct perf_event *iter;
+	int rbx;
 
 	rcu_read_lock();
-	rb = rcu_dereference(event->rb);
-	if (rb) {
-		list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
-			wake_up_all(&event->waitq);
+	for (rbx = PERF_RB_MAIN; rbx < PERF_NR_RB; rbx++) {
+		rb = rcu_dereference(event->rb[rbx]);
+		if (rb) {
+			list_for_each_entry_rcu(iter, &rb->event_list, rb_entry[rbx])
+				wake_up_all(&iter->waitq);
+		}
 	}
 	rcu_read_unlock();
 }
@@ -3882,12 +3891,12 @@ static void rb_free_rcu(struct rcu_head *rcu_head)
 	rb_free(rb);
 }
 
-static struct ring_buffer *ring_buffer_get(struct perf_event *event)
+struct ring_buffer *ring_buffer_get(struct perf_event *event, int rbx)
 {
 	struct ring_buffer *rb;
 
 	rcu_read_lock();
-	rb = rcu_dereference(event->rb);
+	rb = rcu_dereference(event->rb[rbx]);
 	if (rb) {
 		if (!atomic_inc_not_zero(&rb->refcount))
 			rb = NULL;
@@ -3897,7 +3906,7 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
 	return rb;
 }
 
-static void ring_buffer_put(struct ring_buffer *rb)
+void ring_buffer_put(struct ring_buffer *rb)
 {
 	if (!atomic_dec_and_test(&rb->refcount))
 		return;
@@ -3911,8 +3920,8 @@ static void perf_mmap_open(struct vm_area_struct *vma)
 {
 	struct perf_event *event = vma->vm_file->private_data;
 
-	atomic_inc(&event->mmap_count);
-	atomic_inc(&event->rb->mmap_count);
+	atomic_inc(&event->mmap_count[PERF_RB_MAIN]);
+	atomic_inc(&event->rb[PERF_RB_MAIN]->mmap_count);
 }
 
 /*
@@ -3926,19 +3935,20 @@ static void perf_mmap_open(struct vm_area_struct *vma)
 static void perf_mmap_close(struct vm_area_struct *vma)
 {
 	struct perf_event *event = vma->vm_file->private_data;
-
-	struct ring_buffer *rb = event->rb;
+	int rbx = PERF_RB_MAIN;
+	struct ring_buffer *rb = event->rb[rbx];
 	struct user_struct *mmap_user = rb->mmap_user;
 	int mmap_locked = rb->mmap_locked;
 	unsigned long size = perf_data_size(rb);
 
 	atomic_dec(&rb->mmap_count);
 
-	if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
+	if (!atomic_dec_and_mutex_lock(&event->mmap_count[rbx],
+				       &event->mmap_mutex))
 		return;
 
 	/* Detach current event from the buffer. */
-	rcu_assign_pointer(event->rb, NULL);
+	rcu_assign_pointer(event->rb[rbx], NULL);
 	ring_buffer_detach(event, rb);
 	mutex_unlock(&event->mmap_mutex);
 
@@ -3955,7 +3965,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	 */
 again:
 	rcu_read_lock();
-	list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
+	list_for_each_entry_rcu(event, &rb->event_list, rb_entry[rbx]) {
 		if (!atomic_long_inc_not_zero(&event->refcount)) {
 			/*
 			 * This event is en-route to free_event() which will
@@ -3976,8 +3986,8 @@ again:
 		 * still restart the iteration to make sure we're not now
 		 * iterating the wrong list.
 		 */
-		if (event->rb == rb) {
-			rcu_assign_pointer(event->rb, NULL);
+		if (event->rb[rbx] == rb) {
+			rcu_assign_pointer(event->rb[rbx], NULL);
 			ring_buffer_detach(event, rb);
 			ring_buffer_put(rb); /* can't be last, we still have one */
 		}
@@ -4026,6 +4036,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	unsigned long nr_pages;
 	long user_extra, extra;
 	int ret = 0, flags = 0;
+	int rbx = PERF_RB_MAIN;
 
 	/*
 	 * Don't allow mmap() of inherited per-task counters. This would
@@ -4039,6 +4050,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	vma_size = vma->vm_end - vma->vm_start;
+
 	nr_pages = (vma_size / PAGE_SIZE) - 1;
 
 	/*
@@ -4057,13 +4069,14 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	WARN_ON_ONCE(event->ctx->parent_ctx);
 again:
 	mutex_lock(&event->mmap_mutex);
-	if (event->rb) {
-		if (event->rb->nr_pages != nr_pages) {
+	rb = event->rb[rbx];
+	if (rb) {
+		if (rb->nr_pages != nr_pages) {
 			ret = -EINVAL;
 			goto unlock;
 		}
 
-		if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
+		if (!atomic_inc_not_zero(&rb->mmap_count)) {
 			/*
 			 * Raced against perf_mmap_close() through
 			 * perf_event_set_output(). Try again, hope for better
@@ -4100,7 +4113,7 @@ again:
 		goto unlock;
 	}
 
-	WARN_ON(event->rb);
+	WARN_ON(event->rb[rbx]);
 
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
@@ -4122,14 +4135,14 @@ again:
 	vma->vm_mm->pinned_vm += extra;
 
 	ring_buffer_attach(event, rb);
-	rcu_assign_pointer(event->rb, rb);
+	rcu_assign_pointer(event->rb[rbx], rb);
 
 	perf_event_init_userpage(event);
 	perf_event_update_userpage(event);
 
 unlock:
 	if (!ret)
-		atomic_inc(&event->mmap_count);
+		atomic_inc(&event->mmap_count[rbx]);
 	mutex_unlock(&event->mmap_mutex);
 
 	/*
@@ -6661,6 +6674,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	struct perf_event *event;
 	struct hw_perf_event *hwc;
 	long err = -EINVAL;
+	int rbx;
 
 	if ((unsigned)cpu >= nr_cpu_ids) {
 		if (!task || cpu != -1)
@@ -6684,7 +6698,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->group_entry);
 	INIT_LIST_HEAD(&event->event_entry);
 	INIT_LIST_HEAD(&event->sibling_list);
-	INIT_LIST_HEAD(&event->rb_entry);
+	for (rbx = PERF_RB_MAIN; rbx < PERF_NR_RB; rbx++)
+		INIT_LIST_HEAD(&event->rb_entry[rbx]);
 	INIT_LIST_HEAD(&event->active_entry);
 	INIT_HLIST_NODE(&event->hlist_entry);
 
@@ -6912,8 +6927,7 @@ err_size:
 static int
 perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 {
-	struct ring_buffer *rb = NULL, *old_rb = NULL;
-	int ret = -EINVAL;
+	int ret = -EINVAL, rbx;
 
 	if (!output_event)
 		goto set;
@@ -6936,39 +6950,43 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 
 set:
 	mutex_lock(&event->mmap_mutex);
-	/* Can't redirect output if we've got an active mmap() */
-	if (atomic_read(&event->mmap_count))
-		goto unlock;
 
-	old_rb = event->rb;
+	for (rbx = PERF_RB_MAIN; rbx < PERF_NR_RB; rbx++) {
+		struct ring_buffer *rb = NULL, *old_rb = NULL;
 
-	if (output_event) {
-		/* get the rb we want to redirect to */
-		rb = ring_buffer_get(output_event);
-		if (!rb)
-			goto unlock;
-	}
+		/* Can't redirect output if we've got an active mmap() */
+		if (atomic_read(&event->mmap_count[rbx]))
+			continue;
 
-	if (old_rb)
-		ring_buffer_detach(event, old_rb);
+		old_rb = event->rb[rbx];
 
-	if (rb)
-		ring_buffer_attach(event, rb);
+		if (output_event) {
+			/* get the rb we want to redirect to */
+			rb = ring_buffer_get(output_event, rbx);
+			if (!rb)
+				continue;
+		}
 
-	rcu_assign_pointer(event->rb, rb);
+		if (old_rb)
+			ring_buffer_detach(event, old_rb);
 
-	if (old_rb) {
-		ring_buffer_put(old_rb);
-		/*
-		 * Since we detached before setting the new rb, so that we
-		 * could attach the new rb, we could have missed a wakeup.
-		 * Provide it now.
-		 */
-		wake_up_all(&event->waitq);
+		if (rb)
+			ring_buffer_attach(event, rb);
+
+		rcu_assign_pointer(event->rb[rbx], rb);
+
+		if (old_rb) {
+			ring_buffer_put(old_rb);
+			/*
+			 * Since we detached before setting the new rb, so that we
+			 * could attach the new rb, we could have missed a wakeup.
+			 * Provide it now.
+			 */
+			wake_up_all(&event->waitq);
+		}
 	}
 
 	ret = 0;
-unlock:
 	mutex_unlock(&event->mmap_mutex);
 
 out:
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 6cb208f..841f7c4 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -63,6 +63,7 @@ struct ring_buffer {
 	atomic_t			mmap_count;
 	unsigned long			mmap_locked;
 	struct user_struct		*mmap_user;
+	void				*priv;
 
 	struct perf_event_mmap_page	*user_page;
 	void				*data_pages[0];
@@ -73,6 +74,12 @@ extern struct ring_buffer *
 rb_alloc(struct perf_event *event, int nr_pages, long watermark, int cpu,
 	 int flags, struct ring_buffer_ops *rb_ops);
 extern void perf_event_wakeup(struct perf_event *event);
+extern struct ring_buffer *ring_buffer_get(struct perf_event *event, int rbx);
+extern void ring_buffer_put(struct ring_buffer *rb);
+extern void ring_buffer_attach(struct perf_event *event,
+			       struct ring_buffer *rb);
+extern void ring_buffer_detach(struct perf_event *event,
+			       struct ring_buffer *rb);
 
 extern void
 perf_event_header__init_id(struct perf_event_header *header,
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 161a676..232d7de 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -120,7 +120,7 @@ int perf_output_begin(struct perf_output_handle *handle,
 	if (event->parent)
 		event = event->parent;
 
-	rb = rcu_dereference(event->rb);
+	rb = rcu_dereference(event->rb[PERF_RB_MAIN]);
 	if (unlikely(!rb))
 		goto out;
 
-- 
1.8.5.2



* [PATCH v1 04/11] itrace: Infrastructure for instruction flow tracing units
  2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
                   ` (2 preceding siblings ...)
  2014-02-06 10:50 ` [PATCH v1 03/11] perf: Allow for multiple ring buffers per event Alexander Shishkin
@ 2014-02-06 10:50 ` Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 05/11] itrace: Add functionality to include traces in perf event samples Alexander Shishkin
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

Instruction tracing PMUs are capable of recording a log of instruction
execution flow on a cpu core, which can be useful for profiling and crash
analysis. This patch adds itrace infrastructure for perf events and the
rest of the kernel to use.

Since such PMUs can produce copious amounts of trace data, at a rate of
hundreds of megabytes per second per core, it may be impractical to process
it inside the kernel in real time; instead, raw trace streams are exported
to userspace for subsequent analysis. Thus, itrace PMUs may export their
trace buffers, which can be mmap()ed to userspace from a special file
descriptor that can be obtained from the perf_event_open() syscall by using
the PERF_FLAG_FD_ITRACE flag together with the original perf event
descriptor.

This infrastructure should also be useful for ARM ETM/PTM and other program
flow tracing units that can potentially generate a lot of trace data very
fast.
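
As a hypothetical sketch of how a driver plugs into this (the hooks are
the ones declared in include/linux/itrace.h below; all foo_* names are
made up), registration would look roughly like:

#include <linux/init.h>
#include <linux/itrace.h>
#include <linux/perf_event.h>

static void *foo_alloc_buffer(int cpu, int nr_pages, bool overwrite,
			      void **pages,
			      struct perf_event_mmap_page **user_page)
{
	/* Allocate nr_pages of hardware-writable memory, fill pages[] and
	 * *user_page, and return a driver-private descriptor; it ends up
	 * in the itrace ring_buffer's priv and comes back to the driver
	 * via itrace_priv()/itrace_event_get_priv(). */
	return NULL;	/* stub */
}

static void foo_free_buffer(void *buffer)
{
	/* undo foo_alloc_buffer() */
}

static int foo_event_init(struct perf_event *event)
{
	/* validate attr.itrace_config and friends */
	return 0;
}

static struct itrace_pmu foo_ipmu = {
	.pmu = {
		.event_init	= foo_event_init,
		/* .add/.del/.start/.stop/.read as for any other pmu */
	},
	.alloc_buffer	= foo_alloc_buffer,
	.free_buffer	= foo_free_buffer,
	.name		= "foo_trace",
};

static int __init foo_itrace_init(void)
{
	/* wraps perf_pmu_register() and hooks event_init */
	return itrace_pmu_register(&foo_ipmu);
}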

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/itrace.h          |  89 +++++++++++++
 include/linux/perf_event.h      |   5 +
 include/uapi/linux/perf_event.h |  17 +++
 kernel/events/Makefile          |   2 +-
 kernel/events/core.c            | 130 ++++++++++++++++---
 kernel/events/itrace.c          | 271 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 495 insertions(+), 19 deletions(-)
 create mode 100644 include/linux/itrace.h
 create mode 100644 kernel/events/itrace.c

diff --git a/include/linux/itrace.h b/include/linux/itrace.h
new file mode 100644
index 0000000..735baaf4
--- /dev/null
+++ b/include/linux/itrace.h
@@ -0,0 +1,89 @@
+/*
+ * Instruction flow trace unit infrastructure
+ * Copyright (c) 2013, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ */
+
+#ifndef _LINUX_ITRACE_H
+#define _LINUX_ITRACE_H
+
+#include <linux/perf_event.h>
+#include <linux/file.h>
+
+extern struct ring_buffer_ops itrace_rb_ops;
+
+static inline bool is_itrace_vma(struct vm_area_struct *vma)
+{
+	if (vma->vm_file) {
+		struct perf_event *event = vma->vm_file->private_data;
+		if (event->hw.itrace_file == vma->vm_file)
+			return true;
+	}
+
+	return false;
+}
+
+void *itrace_priv(struct perf_event *event);
+
+void *itrace_event_get_priv(struct perf_event *event);
+void itrace_event_put(struct perf_event *event);
+
+struct itrace_pmu {
+	struct pmu		pmu;
+	struct list_head	entry;
+	/*
+	 * Allocate/free ring_buffer backing store
+	 */
+	void			*(*alloc_buffer)(int cpu, int nr_pages, bool overwrite,
+						 void **pages,
+						 struct perf_event_mmap_page **user_page);
+	void			(*free_buffer)(void *buffer);
+
+	int			(*event_init)(struct perf_event *event);
+
+	char			*name;
+};
+
+#define to_itrace_pmu(x) container_of((x), struct itrace_pmu, pmu)
+
+#ifdef CONFIG_PERF_EVENTS
+extern int itrace_inherit_event(struct perf_event *event,
+				struct task_struct *task);
+extern void itrace_lost_data(struct perf_event *event, u64 offset);
+extern int itrace_pmu_register(struct itrace_pmu *ipmu);
+
+extern int itrace_event_installable(struct perf_event *event,
+				    struct perf_event_context *ctx);
+
+extern void itrace_wake_up(struct perf_event *event);
+
+extern bool is_itrace_event(struct perf_event *event);
+
+#else
+static int itrace_inherit_event(struct perf_event *event,
+				struct task_struct *task)	{ return 0; }
+static inline void
+itrace_lost_data(struct perf_event *event, u64 offset)		{}
+static inline int itrace_pmu_register(struct itrace_pmu *ipmu)	{ return -EINVAL; }
+
+static inline int
+itrace_event_installable(struct perf_event *event,
+			 struct perf_event_context *ctx)	{ return -EINVAL; }
+static inline void itrace_wake_up(struct perf_event *event)	{}
+static inline bool is_itrace_event(struct perf_event *event)	{ return false; }
+#endif
+
+#endif /* _LINUX_PERF_EVENT_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 93cefb6..b0147e0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -126,6 +126,10 @@ struct hw_perf_event {
 			/* for tp_event->class */
 			struct list_head	tp_list;
 		};
+		struct { /* itrace */
+			struct file		*itrace_file;
+			struct task_struct	*itrace_target;
+		};
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 		struct { /* breakpoint */
 			/*
@@ -291,6 +295,7 @@ struct ring_buffer;
 
 enum perf_event_rb {
 	PERF_RB_MAIN = 0,
+	PERF_RB_ITRACE,
 	PERF_NR_RB,
 };
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index e244ed4..2dd57db 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -237,6 +237,10 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER2	80	/* add: branch_sample_type */
 #define PERF_ATTR_SIZE_VER3	96	/* add: sample_regs_user */
 					/* add: sample_stack_user */
+#define PERF_ATTR_SIZE_VER4	120	/* add: itrace_config */
+					/* add: itrace_watermark */
+					/* add: itrace_sample_type */
+					/* add: itrace_sample_size */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -333,6 +337,11 @@ struct perf_event_attr {
 
 	/* Align to u64. */
 	__u32	__reserved_2;
+
+	__u64	itrace_config;
+	__u32	itrace_watermark;	/* wakeup every n pages */
+	__u32	itrace_sample_type;	/* pmu->type of the itrace PMU */
+	__u64	itrace_sample_size;
 };
 
 #define perf_flags(attr)	(*(&(attr)->read_format + 1))
@@ -705,6 +714,13 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * struct {
+	 *   u64 offset;
+	 * }
+	 */
+	PERF_RECORD_ITRACE_LOST			= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -726,6 +742,7 @@ enum perf_callchain_context {
 #define PERF_FLAG_FD_OUTPUT		(1U << 1)
 #define PERF_FLAG_PID_CGROUP		(1U << 2) /* pid=cgroup id, per-cpu mode only */
 #define PERF_FLAG_FD_CLOEXEC		(1U << 3) /* O_CLOEXEC */
+#define PERF_FLAG_FD_ITRACE		(1U << 4) /* request itrace fd */
 
 union perf_mem_data_src {
 	__u64 val;
diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 103f5d1..46a3770 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -2,7 +2,7 @@ ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_core.o = -pg
 endif
 
-obj-y := core.o ring_buffer.o callchain.o
+obj-y := core.o ring_buffer.o callchain.o itrace.o
 
 obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
 obj-$(CONFIG_UPROBES) += uprobes.o
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 533230c..ff6e286 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -39,6 +39,7 @@
 #include <linux/hw_breakpoint.h>
 #include <linux/mm_types.h>
 #include <linux/cgroup.h>
+#include <linux/itrace.h>
 
 #include "internal.h"
 
@@ -120,7 +121,8 @@ static int cpu_function_call(int cpu, int (*func) (void *info), void *info)
 #define PERF_FLAG_ALL (PERF_FLAG_FD_NO_GROUP |\
 		       PERF_FLAG_FD_OUTPUT  |\
 		       PERF_FLAG_PID_CGROUP |\
-		       PERF_FLAG_FD_CLOEXEC)
+		       PERF_FLAG_FD_CLOEXEC |\
+		       PERF_FLAG_FD_ITRACE)
 
 /*
  * branch priv levels that need permission checks
@@ -3339,7 +3341,12 @@ static void put_event(struct perf_event *event)
 
 static int perf_release(struct inode *inode, struct file *file)
 {
-	put_event(file->private_data);
+	struct perf_event *event = file->private_data;
+
+	if (is_itrace_event(event) && event->hw.itrace_file == file)
+		event->hw.itrace_file = NULL;
+
+	put_event(event);
 	return 0;
 }
 
@@ -3806,7 +3813,10 @@ static int perf_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct perf_event *event = vma->vm_file->private_data;
 	struct ring_buffer *rb;
-	int ret = VM_FAULT_SIGBUS;
+	int ret = VM_FAULT_SIGBUS, rbx = PERF_RB_MAIN;
+
+	if (is_itrace_event(event) && is_itrace_vma(vma))
+		rbx = PERF_RB_ITRACE;
 
 	if (vmf->flags & FAULT_FLAG_MKWRITE) {
 		if (vmf->pgoff == 0)
@@ -3815,7 +3825,7 @@ static int perf_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 
 	rcu_read_lock();
-	rb = rcu_dereference(event->rb[PERF_RB_MAIN]);
+	rb = rcu_dereference(event->rb[rbx]);
 	if (!rb)
 		goto unlock;
 
@@ -3840,7 +3850,8 @@ unlock:
 void ring_buffer_attach(struct perf_event *event,
 			struct ring_buffer *rb)
 {
-	struct list_head *head = &event->rb_entry[PERF_RB_MAIN];
+	int rbx = rb->priv ? PERF_RB_ITRACE : PERF_RB_MAIN;
+	struct list_head *head = &event->rb_entry[rbx];
 	unsigned long flags;
 
 	if (!list_empty(head))
@@ -3854,7 +3865,8 @@ void ring_buffer_attach(struct perf_event *event,
 
 void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
 {
-	struct list_head *head = &event->rb_entry[PERF_RB_MAIN];
+	int rbx = rb->priv ? PERF_RB_ITRACE : PERF_RB_MAIN;
+	struct list_head *head = &event->rb_entry[rbx];
 	unsigned long flags;
 
 	if (list_empty(head))
@@ -3919,9 +3931,10 @@ void ring_buffer_put(struct ring_buffer *rb)
 static void perf_mmap_open(struct vm_area_struct *vma)
 {
 	struct perf_event *event = vma->vm_file->private_data;
+	int rbx = is_itrace_vma(vma) ? PERF_RB_ITRACE : PERF_RB_MAIN;
 
-	atomic_inc(&event->mmap_count[PERF_RB_MAIN]);
-	atomic_inc(&event->rb[PERF_RB_MAIN]->mmap_count);
+	atomic_inc(&event->mmap_count[rbx]);
+	atomic_inc(&event->rb[rbx]->mmap_count);
 }
 
 /*
@@ -3935,7 +3948,7 @@ static void perf_mmap_open(struct vm_area_struct *vma)
 static void perf_mmap_close(struct vm_area_struct *vma)
 {
 	struct perf_event *event = vma->vm_file->private_data;
-	int rbx = PERF_RB_MAIN;
+	int rbx = is_itrace_vma(vma) ? PERF_RB_ITRACE : PERF_RB_MAIN;
 	struct ring_buffer *rb = event->rb[rbx];
 	struct user_struct *mmap_user = rb->mmap_user;
 	int mmap_locked = rb->mmap_locked;
@@ -4051,13 +4064,16 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 
 	vma_size = vma->vm_end - vma->vm_start;
 
+	if (is_itrace_event(event) && is_itrace_vma(vma))
+		rbx = PERF_RB_ITRACE;
+
 	nr_pages = (vma_size / PAGE_SIZE) - 1;
 
 	/*
 	 * If we have rb pages ensure they're a power-of-two number, so we
 	 * can do bitmasks instead of modulo.
 	 */
-	if (nr_pages != 0 && !is_power_of_2(nr_pages))
+	if (!rbx && nr_pages != 0 && !is_power_of_2(nr_pages))
 		return -EINVAL;
 
 	if (vma_size != PAGE_SIZE * (1 + nr_pages))
@@ -4120,7 +4136,7 @@ again:
 
 	rb = rb_alloc(event, nr_pages,
 		event->attr.watermark ? event->attr.wakeup_watermark : 0,
-		event->cpu, flags, NULL);
+		event->cpu, flags, rbx ? &itrace_rb_ops : NULL);
 
 	if (!rb) {
 		ret = -ENOMEM;
@@ -6728,6 +6744,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 		if (attr->type == PERF_TYPE_TRACEPOINT)
 			event->hw.tp_target = task;
+		else if (is_itrace_event(event))
+			event->hw.itrace_target = task;
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 		/*
 		 * hw_breakpoint is a bit difficult here..
@@ -6947,6 +6965,17 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 	 */
 	if (output_event->cpu == -1 && output_event->ctx != event->ctx)
 		goto out;
+	/*
+	 * Both itrace events must be on a same PMU; itrace events can
+	 * be only redirected to other itrace events.
+	 */
+	if (is_itrace_event(event)) {
+		if (!is_itrace_event(output_event))
+			goto out;
+
+		if (event->attr.type != output_event->attr.type)
+			goto out;
+	}
 
 set:
 	mutex_lock(&event->mmap_mutex);
@@ -6993,6 +7022,46 @@ out:
 	return ret;
 }
 
+static int do_perf_get_itrace_fd(int group_fd, int f_flags)
+{
+	struct fd group = {NULL, 0};
+	struct perf_event *event;
+	struct file *file = NULL;
+	int fd, err;
+
+	fd = get_unused_fd_flags(f_flags);
+	if (fd < 0)
+		return fd;
+
+	err = perf_fget_light(group_fd, &group);
+	if (err)
+		goto err_fd;
+
+	event = group.file->private_data;
+	if (!is_itrace_event(event)) {
+		err = -EINVAL;
+		goto err_group_fd;
+	}
+
+	file = anon_inode_getfile("[itrace]", &perf_fops, event, f_flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_group_fd;
+	}
+
+	event->hw.itrace_file = file;
+
+	fdput(group);
+	fd_install(fd, file);
+	return fd;
+
+err_group_fd:
+	fdput(group);
+err_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
 /**
  * sys_perf_event_open - open a performance event, associate it to a task/cpu
  *
@@ -7022,6 +7091,18 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (flags & ~PERF_FLAG_ALL)
 		return -EINVAL;
 
+	if (flags & PERF_FLAG_FD_CLOEXEC)
+		f_flags |= O_CLOEXEC;
+
+	if (flags & PERF_FLAG_FD_ITRACE) {
+		/* only allowed to specify group_fd with this flag */
+		if (group_fd == -1 || attr_uptr || cpu != -1 || pid != -1 ||
+		    (flags & ~(PERF_FLAG_FD_ITRACE | PERF_FLAG_FD_CLOEXEC)))
+			return -EINVAL;
+
+		return do_perf_get_itrace_fd(group_fd, f_flags);
+	}
+
 	err = perf_copy_attr(attr_uptr, &attr);
 	if (err)
 		return err;
@@ -7045,9 +7126,6 @@ SYSCALL_DEFINE5(perf_event_open,
 	if ((flags & PERF_FLAG_PID_CGROUP) && (pid == -1 || cpu == -1))
 		return -EINVAL;
 
-	if (flags & PERF_FLAG_FD_CLOEXEC)
-		f_flags |= O_CLOEXEC;
-
 	event_fd = get_unused_fd_flags(f_flags);
 	if (event_fd < 0)
 		return event_fd;
@@ -7128,6 +7206,10 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_alloc;
 	}
 
+	err = itrace_event_installable(event, ctx);
+	if (err)
+		goto err_alloc;
+
 	if (task) {
 		put_task_struct(task);
 		task = NULL;
@@ -7293,6 +7375,10 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 		goto err_free;
 	}
 
+	err = itrace_event_installable(event, ctx);
+	if (err)
+		goto err_free;
+
 	WARN_ON_ONCE(ctx->parent_ctx);
 	mutex_lock(&ctx->mutex);
 	perf_install_in_context(ctx, event, cpu);
@@ -7583,6 +7669,7 @@ inherit_event(struct perf_event *parent_event,
 {
 	struct perf_event *child_event;
 	unsigned long flags;
+	int err;
 
 	/*
 	 * Instead of creating recursive hierarchies of events,
@@ -7601,10 +7688,12 @@ inherit_event(struct perf_event *parent_event,
 	if (IS_ERR(child_event))
 		return child_event;
 
-	if (!atomic_long_inc_not_zero(&parent_event->refcount)) {
-		free_event(child_event);
-		return NULL;
-	}
+	err = itrace_inherit_event(child_event, child);
+	if (err)
+		goto err_alloc;
+
+	if (!atomic_long_inc_not_zero(&parent_event->refcount))
+		goto err_alloc;
 
 	get_ctx(child_ctx);
 
@@ -7655,6 +7744,11 @@ inherit_event(struct perf_event *parent_event,
 	mutex_unlock(&parent_event->child_mutex);
 
 	return child_event;
+
+err_alloc:
+	free_event(child_event);
+
+	return NULL;
 }
 
 static int inherit_group(struct perf_event *parent_event,
diff --git a/kernel/events/itrace.c b/kernel/events/itrace.c
new file mode 100644
index 0000000..ec26373
--- /dev/null
+++ b/kernel/events/itrace.c
@@ -0,0 +1,271 @@
+/*
+ * Instruction flow trace unit infrastructure
+ * Copyright (c) 2013, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ */
+
+#undef DEBUG
+
+#include <linux/kernel.h>
+#include <linux/perf_event.h>
+#include <linux/itrace.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+
+#include "internal.h"
+
+static LIST_HEAD(itrace_pmus);
+static DEFINE_MUTEX(itrace_pmus_mutex);
+
+struct static_key_deferred itrace_core_events __read_mostly;
+
+struct itrace_lost_record {
+	struct perf_event_header	header;
+	u64				offset;
+};
+
+/*
+ * In the worst case, perf buffer might be full and we're not able to output
+ * this record, so the decoder won't know that the data was lost. However,
+ * it will still see inconsistency in the trace IP.
+ */
+void itrace_lost_data(struct perf_event *event, u64 offset)
+{
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	struct itrace_lost_record rec = {
+		.header = {
+			.type = PERF_RECORD_ITRACE_LOST,
+			.misc = 0,
+			.size = sizeof(rec),
+		},
+		.offset = offset
+	};
+	int ret;
+
+	perf_event_header__init_id(&rec.header, &sample, event);
+	ret = perf_output_begin(&handle, event, rec.header.size);
+
+	if (ret)
+		return;
+
+	perf_output_put(&handle, rec);
+	perf_event__output_id_sample(event, &handle, &sample);
+	perf_output_end(&handle);
+}
+
+static struct itrace_pmu *itrace_pmu_find(int type)
+{
+	struct itrace_pmu *ipmu;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ipmu, &itrace_pmus, entry) {
+		if (ipmu->pmu.type == type)
+			goto out;
+	}
+
+	ipmu = NULL;
+out:
+	rcu_read_unlock();
+
+	return ipmu;
+}
+
+bool is_itrace_event(struct perf_event *event)
+{
+	return !!itrace_pmu_find(event->attr.type);
+}
+
+int itrace_event_installable(struct perf_event *event,
+			     struct perf_event_context *ctx)
+{
+	struct perf_event *iter_event;
+
+	if (!is_itrace_event(event))
+		return 0;
+
+	/*
+	 * the context is locked and pinned and won't change under us,
+	 * also we don't care if it's a cpu or task context at this point
+	 */
+	list_for_each_entry(iter_event, &ctx->event_list, event_entry) {
+		if (is_itrace_event(iter_event) &&
+		    (iter_event->cpu == event->cpu ||
+		     iter_event->cpu == -1 ||
+		     event->cpu == -1))
+			return -EEXIST;
+	}
+
+	return 0;
+}
+
+static int itrace_event_init(struct perf_event *event)
+{
+	struct itrace_pmu *ipmu = to_itrace_pmu(event->pmu);
+
+	return ipmu->event_init(event);
+}
+
+static unsigned long itrace_rb_get_size(int nr_pages)
+{
+	return sizeof(struct ring_buffer) + sizeof(void *) * nr_pages;
+}
+
+static int itrace_alloc_data_pages(struct ring_buffer *rb, int cpu,
+				   int nr_pages, int flags)
+{
+	struct itrace_pmu *ipmu = to_itrace_pmu(rb->event->pmu);
+	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
+
+	rb->priv = ipmu->alloc_buffer(cpu, nr_pages, overwrite,
+				      rb->data_pages, &rb->user_page);
+	if (!rb->priv)
+		return -ENOMEM;
+	rb->nr_pages = nr_pages;
+
+	return 0;
+}
+
+static void itrace_free(struct ring_buffer *rb)
+{
+	struct itrace_pmu *ipmu = to_itrace_pmu(rb->event->pmu);
+
+	if (rb->priv)
+		ipmu->free_buffer(rb->priv);
+}
+
+struct page *
+itrace_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
+{
+	if (pgoff > rb->nr_pages)
+		return NULL;
+
+	if (pgoff == 0)
+		return virt_to_page(rb->user_page);
+
+	return virt_to_page(rb->data_pages[pgoff - 1]);
+}
+
+struct ring_buffer_ops itrace_rb_ops = {
+	.get_size		= itrace_rb_get_size,
+	.alloc_data_page	= itrace_alloc_data_pages,
+	.free_buffer		= itrace_free,
+	.mmap_to_page		= itrace_mmap_to_page,
+};
+
+void *itrace_priv(struct perf_event *event)
+{
+	if (!event->rb[PERF_RB_ITRACE])
+		return NULL;
+
+	return event->rb[PERF_RB_ITRACE]->priv;
+}
+
+void *itrace_event_get_priv(struct perf_event *event)
+{
+	struct ring_buffer *rb = ring_buffer_get(event, PERF_RB_ITRACE);
+
+	return rb ? rb->priv : NULL;
+}
+
+void itrace_event_put(struct perf_event *event)
+{
+	struct ring_buffer *rb;
+
+	rcu_read_lock();
+	rb = rcu_dereference(event->rb[PERF_RB_ITRACE]);
+	if (rb)
+		ring_buffer_put(rb);
+	rcu_read_unlock();
+}
+
+static void itrace_set_output(struct perf_event *event,
+			      struct perf_event *output_event)
+{
+	struct ring_buffer *rb;
+
+	mutex_lock(&event->mmap_mutex);
+
+	if (atomic_read(&event->mmap_count[PERF_RB_ITRACE]) ||
+	    event->rb[PERF_RB_ITRACE])
+		goto out;
+
+	rb = ring_buffer_get(output_event, PERF_RB_ITRACE);
+	if (!rb)
+		goto out;
+
+	ring_buffer_attach(event, rb);
+	rcu_assign_pointer(event->rb[PERF_RB_ITRACE], rb);
+
+out:
+	mutex_unlock(&event->mmap_mutex);
+}
+
+int itrace_inherit_event(struct perf_event *event, struct task_struct *task)
+{
+	struct perf_event *parent = event->parent;
+	struct itrace_pmu *ipmu;
+
+	if (!is_itrace_event(event))
+		return 0;
+
+	ipmu = to_itrace_pmu(event->pmu);
+
+	/*
+	 * inherited user's counters should inherit buffers IF
+	 * they aren't cpu==-1
+	 */
+	if (parent->cpu == -1)
+		return -EINVAL;
+
+	itrace_set_output(event, parent);
+
+	return 0;
+}
+
+void itrace_wake_up(struct perf_event *event)
+{
+	struct ring_buffer *rb;
+
+	rcu_read_lock();
+	rb = rcu_dereference(event->rb[PERF_RB_ITRACE]);
+	if (rb) {
+		atomic_set(&rb->poll, POLL_IN);
+		irq_work_queue(&event->pending);
+	}
+	rcu_read_unlock();
+}
+
+int itrace_pmu_register(struct itrace_pmu *ipmu)
+{
+	int ret;
+
+	if (!ipmu->alloc_buffer || !ipmu->free_buffer)
+		return -EINVAL;
+
+	ipmu->event_init = ipmu->pmu.event_init;
+	ipmu->pmu.event_init = itrace_event_init;
+
+	ret = perf_pmu_register(&ipmu->pmu, ipmu->name, -1);
+	if (ret)
+		return ret;
+
+	mutex_lock(&itrace_pmus_mutex);
+	list_add_tail_rcu(&ipmu->entry, &itrace_pmus);
+	mutex_unlock(&itrace_pmus_mutex);
+
+	return ret;
+}
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v1 05/11] itrace: Add functionality to include traces in perf event samples
  2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
                   ` (3 preceding siblings ...)
  2014-02-06 10:50 ` [PATCH v1 04/11] itrace: Infrastructure for instruction flow tracing units Alexander Shishkin
@ 2014-02-06 10:50 ` Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 06/11] itrace: Add functionality to include traces in process core dumps Alexander Shishkin
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

Trace data from itrace PMUs can be used to annotate other perf events
by including it in sample records when the PERF_SAMPLE_ITRACE flag is
set. In this case, a kernel counter is created on the requested itrace
PMU for each such event; trace data is retrieved from it and stored in
the perf data stream of the event being annotated.
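
To illustrate the intended usage (not part of this patch): a tool opens
a regular sampling event and asks for trace annotation via the
itrace_sample_* attribute fields introduced earlier in the series. A
rough userspace sketch, built against the updated uapi headers, where
"itrace_pmu_type" stands for the trace PMU's type as read from sysfs
(e.g. /sys/bus/event_source/devices/intel_pt/type):

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <string.h>
  #include <unistd.h>

  static int open_annotated_event(unsigned int itrace_pmu_type)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CPU_CYCLES;
          attr.sample_period = 100000;
          attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                             PERF_SAMPLE_ITRACE;
          /* fields introduced by this series */
          attr.itrace_sample_type = itrace_pmu_type;
          attr.itrace_sample_size = 8192;   /* trace bytes per sample */

          /* sample the calling thread on any cpu */
          return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  }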

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/itrace.h          |  37 +++++++++
 include/linux/perf_event.h      |  15 ++++
 include/uapi/linux/perf_event.h |   5 +-
 kernel/events/core.c            |  35 +++++++++
 kernel/events/itrace.c          | 169 ++++++++++++++++++++++++++++++++++++++--
 5 files changed, 252 insertions(+), 9 deletions(-)

diff --git a/include/linux/itrace.h b/include/linux/itrace.h
index 735baaf4..6adbb32 100644
--- a/include/linux/itrace.h
+++ b/include/linux/itrace.h
@@ -54,12 +54,27 @@ struct itrace_pmu {
 
 	int			(*event_init)(struct perf_event *event);
 
+	/*
+	 * Calculate the size of a sample to be written out
+	 */
+	unsigned long		(*sample_trace)(struct perf_event *event,
+						struct perf_sample_data *data);
+
+	/*
+	 * Write out a trace sample to the given output handle
+	 */
+	void			(*sample_output)(struct perf_event *event,
+						 struct perf_output_handle *handle,
+						 struct perf_sample_data *data);
 	char			*name;
 };
 
 #define to_itrace_pmu(x) container_of((x), struct itrace_pmu, pmu)
 
 #ifdef CONFIG_PERF_EVENTS
+
+extern int itrace_kernel_event(struct perf_event *event,
+			       struct task_struct *task);
 extern int itrace_inherit_event(struct perf_event *event,
 				struct task_struct *task);
 extern void itrace_lost_data(struct perf_event *event, u64 offset);
@@ -72,7 +87,17 @@ extern void itrace_wake_up(struct perf_event *event);
 
 extern bool is_itrace_event(struct perf_event *event);
 
+extern int itrace_sampler_init(struct perf_event *event,
+			       struct task_struct *task);
+extern void itrace_sampler_fini(struct perf_event *event);
+extern unsigned long itrace_sampler_trace(struct perf_event *event,
+					  struct perf_sample_data *data);
+extern void itrace_sampler_output(struct perf_event *event,
+				  struct perf_output_handle *handle,
+				  struct perf_sample_data *data);
 #else
+static int itrace_kernel_event(struct perf_event *event,
+			       struct task_struct *task)	{ return 0; }
 static int itrace_inherit_event(struct perf_event *event,
 				struct task_struct *task)	{ return 0; }
 static inline void
@@ -84,6 +109,18 @@ itrace_event_installable(struct perf_event *event,
 			 struct perf_event_context *ctx)	{ return -EINVAL; }
 static inline void itrace_wake_up(struct perf_event *event)	{}
 static inline bool is_itrace_event(struct perf_event *event)	{ return false; }
+
+static inline int itrace_sampler_init(struct perf_event *event,
+				      struct task_struct *task)	{ return 0; }
+static inline void
+itrace_sampler_fini(struct perf_event *event)			{}
+static inline unsigned long
+itrace_sampler_trace(struct perf_event *event,
+		     struct perf_sample_data *data)		{ return 0; }
+static inline void
+itrace_sampler_output(struct perf_event *event,
+		      struct perf_output_handle *handle,
+		      struct perf_sample_data *data)		{}
 #endif
 
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b0147e0..11eb133 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -83,6 +83,12 @@ struct perf_regs_user {
 	struct pt_regs	*regs;
 };
 
+struct perf_trace_record {
+	u64		size;
+	unsigned long	from;
+	unsigned long	to;
+};
+
 struct task_struct;
 
 /*
@@ -97,6 +103,11 @@ struct hw_perf_event_extra {
 
 struct event_constraint;
 
+enum perf_itrace_counter_type {
+	PERF_ITRACE_USER	= BIT(1),
+	PERF_ITRACE_SAMPLING	= BIT(2),
+};
+
 /**
  * struct hw_perf_event - performance event hardware details:
  */
@@ -129,6 +140,7 @@ struct hw_perf_event {
 		struct { /* itrace */
 			struct file		*itrace_file;
 			struct task_struct	*itrace_target;
+			unsigned int		counter_type;
 		};
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 		struct { /* breakpoint */
@@ -434,6 +446,7 @@ struct perf_event {
 	perf_overflow_handler_t		overflow_handler;
 	void				*overflow_handler_context;
 
+	struct perf_event		*trace_event;
 #ifdef CONFIG_EVENT_TRACING
 	struct ftrace_event_call	*tp_event;
 	struct event_filter		*filter;
@@ -591,6 +604,7 @@ struct perf_sample_data {
 	union  perf_mem_data_src	data_src;
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
+	struct perf_trace_record	trace;
 	struct perf_branch_stack	*br_stack;
 	struct perf_regs_user		regs_user;
 	u64				stack_user_size;
@@ -611,6 +625,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->period = period;
 	data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
 	data->regs_user.regs = NULL;
+	data->trace.from = data->trace.to = data->trace.size = 0;
 	data->stack_user_size = 0;
 	data->weight = 0;
 	data->data_src.val = 0;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 2dd57db..a06cf4b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -137,8 +137,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_DATA_SRC			= 1U << 15,
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
+	PERF_SAMPLE_ITRACE			= 1U << 18,
 
-	PERF_SAMPLE_MAX = 1U << 18,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
 };
 
 /*
@@ -689,6 +690,8 @@ enum perf_event_type {
 	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
+	 *	{ u64			size;
+	 *	  char			data[size]; } && PERF_SAMPLE_ITRACE
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ff6e286..e1388a5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1576,6 +1576,9 @@ void perf_event_disable(struct perf_event *event)
 	struct perf_event_context *ctx = event->ctx;
 	struct task_struct *task = ctx->task;
 
+	if (event->trace_event)
+		perf_event_disable(event->trace_event);
+
 	if (!task) {
 		/*
 		 * Disable the event on the cpu that it's on
@@ -2070,6 +2073,8 @@ void perf_event_enable(struct perf_event *event)
 	struct perf_event_context *ctx = event->ctx;
 	struct task_struct *task = ctx->task;
 
+	if (event->trace_event)
+		perf_event_enable(event->trace_event);
 	if (!task) {
 		/*
 		 * Enable the event on the cpu that it's on
@@ -3209,6 +3214,8 @@ static void unaccount_event(struct perf_event *event)
 		static_key_slow_dec_deferred(&perf_sched_events);
 	if (has_branch_stack(event))
 		static_key_slow_dec_deferred(&perf_sched_events);
+	if ((event->attr.sample_type & PERF_SAMPLE_ITRACE) && event->trace_event)
+		itrace_sampler_fini(event);
 
 	unaccount_event_cpu(event, event->cpu);
 }
@@ -4664,6 +4671,13 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		perf_output_put(handle, data->txn);
 
+	if (sample_type & PERF_SAMPLE_ITRACE) {
+		perf_output_put(handle, data->trace.size);
+
+		if (data->trace.size)
+			itrace_sampler_output(event, handle, data);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -4771,6 +4785,14 @@ void perf_prepare_sample(struct perf_event_header *header,
 		data->stack_user_size = stack_size;
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_ITRACE) {
+		u64 size = sizeof(u64);
+
+		size += itrace_sampler_trace(event, data);
+
+		header->size += size;
+	}
 }
 
 static void perf_event_output(struct perf_event *event,
@@ -6795,6 +6817,15 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 			if (err)
 				goto err_pmu;
 		}
+
+		if (event->attr.sample_type & PERF_SAMPLE_ITRACE) {
+			err = itrace_sampler_init(event, task);
+			if (err) {
+				/* XXX: either clean up callchain buffers too
+				   or forbid them to go together */
+				goto err_pmu;
+			}
+		}
 	}
 
 	return event;
@@ -7369,6 +7400,10 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 
 	account_event(event);
 
+	err = itrace_kernel_event(event, task);
+	if (err)
+		goto err_free;
+
 	ctx = find_get_context(event->pmu, task, cpu);
 	if (IS_ERR(ctx)) {
 		err = PTR_ERR(ctx);
diff --git a/kernel/events/itrace.c b/kernel/events/itrace.c
index ec26373..f003530 100644
--- a/kernel/events/itrace.c
+++ b/kernel/events/itrace.c
@@ -89,6 +89,22 @@ bool is_itrace_event(struct perf_event *event)
 	return !!itrace_pmu_find(event->attr.type);
 }
 
+static void itrace_event_destroy(struct perf_event *event)
+{
+	struct ring_buffer *rb = event->rb[PERF_RB_ITRACE];
+
+	if (!rb)
+		return;
+
+	if (event->hw.counter_type != PERF_ITRACE_USER) {
+		atomic_dec(&rb->mmap_count);
+		atomic_dec(&event->mmap_count[PERF_RB_ITRACE]);
+		ring_buffer_detach(event, rb);
+		rcu_assign_pointer(event->rb[PERF_RB_ITRACE], NULL);
+		ring_buffer_put(rb); /* should be last */
+	}
+}
+
 int itrace_event_installable(struct perf_event *event,
 			     struct perf_event_context *ctx)
 {
@@ -115,8 +131,16 @@ int itrace_event_installable(struct perf_event *event,
 static int itrace_event_init(struct perf_event *event)
 {
 	struct itrace_pmu *ipmu = to_itrace_pmu(event->pmu);
+	int ret;
 
-	return ipmu->event_init(event);
+	ret = ipmu->event_init(event);
+	if (ret)
+		return ret;
+
+	event->destroy = itrace_event_destroy;
+	event->hw.counter_type = PERF_ITRACE_USER;
+
+	return 0;
 }
 
 static unsigned long itrace_rb_get_size(int nr_pages)
@@ -214,9 +238,16 @@ out:
 	mutex_unlock(&event->mmap_mutex);
 }
 
+static size_t roundup_buffer_size(u64 size)
+{
+	return 1ul << (__get_order(size) + PAGE_SHIFT);
+}
+
 int itrace_inherit_event(struct perf_event *event, struct task_struct *task)
 {
+	size_t size = event->attr.itrace_sample_size;
 	struct perf_event *parent = event->parent;
+	struct ring_buffer *rb;
 	struct itrace_pmu *ipmu;
 
 	if (!is_itrace_event(event))
@@ -224,14 +255,59 @@ int itrace_inherit_event(struct perf_event *event, struct task_struct *task)
 
 	ipmu = to_itrace_pmu(event->pmu);
 
-	/*
-	 * inherited user's counters should inherit buffers IF
-	 * they aren't cpu==-1
-	 */
-	if (parent->cpu == -1)
-		return -EINVAL;
+	if (parent->hw.counter_type == PERF_ITRACE_USER) {
+		/*
+		 * inherited user's counters should inherit buffers IF
+		 * they aren't cpu==-1
+		 */
+		if (parent->cpu == -1)
+			return -EINVAL;
+
+		itrace_set_output(event, parent);
+		return 0;
+	}
+
+	event->hw.counter_type = parent->hw.counter_type;
+
+	size = roundup_buffer_size(size);
+	rb = rb_alloc(event, size >> PAGE_SHIFT, 0, event->cpu, 0,
+		      &itrace_rb_ops);
+	if (!rb)
+		return -ENOMEM;
+
+	ring_buffer_attach(event, rb);
+	rcu_assign_pointer(event->rb[PERF_RB_ITRACE], rb);
+	atomic_set(&rb->mmap_count, 1);
+	atomic_set(&event->mmap_count[PERF_RB_ITRACE], 1);
+
+	return 0;
+}
+
+int itrace_kernel_event(struct perf_event *event, struct task_struct *task)
+{
+	struct itrace_pmu *ipmu;
+	struct ring_buffer *rb;
+	size_t size;
+
+	if (!is_itrace_event(event))
+		return 0;
 
-	itrace_set_output(event, parent);
+	ipmu = to_itrace_pmu(event->pmu);
+
+	if (!event->attr.itrace_sample_size)
+		return 0;
+
+	size = roundup_buffer_size(event->attr.itrace_sample_size);
+
+	rb = rb_alloc(event, size >> PAGE_SHIFT, 0, event->cpu, 0,
+		      &itrace_rb_ops);
+	if (!rb)
+		return -ENOMEM;
+
+	ring_buffer_attach(event, rb);
+	rcu_assign_pointer(event->rb[PERF_RB_ITRACE], rb);
+	atomic_set(&rb->mmap_count, 1);
+	atomic_set(&event->mmap_count[PERF_RB_ITRACE], 1);
 
 	return 0;
 }
@@ -269,3 +345,80 @@ int itrace_pmu_register(struct itrace_pmu *ipmu)
 
 	return ret;
 }
+
+/*
+ * Trace sample annotation
+ * For events that have attr.sample_type & PERF_SAMPLE_ITRACE, perf calls here
+ * to configure and obtain itrace samples.
+ */
+
+int itrace_sampler_init(struct perf_event *event, struct task_struct *task)
+{
+	struct perf_event_attr attr;
+	struct perf_event *tevt;
+	struct itrace_pmu *ipmu;
+
+	ipmu = itrace_pmu_find(event->attr.itrace_sample_type);
+	if (!ipmu || !ipmu->sample_trace || !ipmu->sample_output)
+		return -ENOTSUPP;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.type = ipmu->pmu.type;
+	attr.config = 0;
+	attr.sample_type = 0;
+	attr.exclude_user = event->attr.exclude_user;
+	attr.exclude_kernel = event->attr.exclude_kernel;
+	attr.itrace_sample_size = event->attr.itrace_sample_size;
+	attr.itrace_config = event->attr.itrace_config;
+
+	tevt = perf_event_create_kernel_counter(&attr, event->cpu, task, NULL, NULL);
+	if (IS_ERR(tevt))
+		return PTR_ERR(tevt);
+
+	if (!itrace_priv(tevt)) {
+		perf_event_release_kernel(tevt);
+		return -EINVAL;
+	}
+
+	event->trace_event = tevt;
+	tevt->hw.counter_type = PERF_ITRACE_SAMPLING;
+	if (event->state != PERF_EVENT_STATE_OFF)
+		perf_event_enable(event->trace_event);
+
+	return 0;
+}
+
+void itrace_sampler_fini(struct perf_event *event)
+{
+	struct perf_event *tevt = event->trace_event;
+
+	perf_event_release_kernel(tevt);
+	event->trace_event = NULL;
+}
+
+unsigned long itrace_sampler_trace(struct perf_event *event,
+				   struct perf_sample_data *data)
+{
+	struct perf_event *tevt = event->trace_event;
+	struct itrace_pmu *ipmu;
+
+	if (!tevt)
+		return 0;
+
+	ipmu = to_itrace_pmu(tevt->pmu);
+	return ipmu->sample_trace(tevt, data);
+}
+
+void itrace_sampler_output(struct perf_event *event,
+			   struct perf_output_handle *handle,
+			   struct perf_sample_data *data)
+{
+	struct perf_event *tevt = event->trace_event;
+	struct itrace_pmu *ipmu;
+
+	if (!tevt || !data->trace.size)
+		return;
+
+	ipmu = to_itrace_pmu(tevt->pmu);
+	ipmu->sample_output(tevt, handle, data);
+}
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v1 06/11] itrace: Add functionality to include traces in process core dumps
  2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
                   ` (4 preceding siblings ...)
  2014-02-06 10:50 ` [PATCH v1 05/11] itrace: Add functionality to include traces in perf event samples Alexander Shishkin
@ 2014-02-06 10:50 ` Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

Per-thread trace data provided by itrace PMUs can be included in process
core dumps; this is controlled via a new rlimit parameter,
RLIMIT_ITRACE. It is implemented with a per-thread kernel counter that
is created when RLIMIT_ITRACE is set to a non-zero value.

The value of RLIMIT_ITRACE determines both the size of the per-thread
ELF note in a core dump and the size of the buffer used to collect the
corresponding trace.
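
As an illustration (not part of this patch), a debugging tool or user
opts a process into trace collection for core dumps with a plain
setrlimit() call, which ends up in update_itrace_rlimit() via
do_prlimit(). A minimal sketch, assuming the RLIMIT_ITRACE number from
the uapi change below:

  #include <sys/resource.h>

  #ifndef RLIMIT_ITRACE
  #define RLIMIT_ITRACE 16      /* from include/uapi/asm-generic/resource.h */
  #endif

  static int enable_itrace_coredump(unsigned long bytes)
  {
          struct rlimit rl = {
                  .rlim_cur = bytes,    /* per-thread trace buffer and note size */
                  .rlim_max = bytes,
          };

          return setrlimit(RLIMIT_ITRACE, &rl);
  }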

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 fs/binfmt_elf.c                     |   6 +
 fs/proc/base.c                      |   1 +
 include/asm-generic/resource.h      |   1 +
 include/linux/itrace.h              |  36 +++++
 include/linux/perf_event.h          |   3 +
 include/uapi/asm-generic/resource.h |   3 +-
 include/uapi/linux/elf.h            |   1 +
 kernel/events/itrace.c              | 289 +++++++++++++++++++++++++++++++++++-
 kernel/exit.c                       |   3 +
 kernel/sys.c                        |   5 +
 10 files changed, 343 insertions(+), 5 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 571a423..c7fcd49 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -34,6 +34,7 @@
 #include <linux/utsname.h>
 #include <linux/coredump.h>
 #include <linux/sched.h>
+#include <linux/itrace.h>
 #include <asm/uaccess.h>
 #include <asm/param.h>
 #include <asm/page.h>
@@ -1576,6 +1577,8 @@ static int fill_thread_core_info(struct elf_thread_core_info *t,
 		}
 	}
 
+	*total += itrace_elf_note_size(t->task);
+
 	return 1;
 }
 
@@ -1608,6 +1611,7 @@ static int fill_note_info(struct elfhdr *elf, int phdrs,
 	for (i = 0; i < view->n; ++i)
 		if (view->regsets[i].core_note_type != 0)
 			++info->thread_notes;
+	info->thread_notes++; /* ITRACE */
 
 	/*
 	 * Sanity check.  We rely on regset 0 being in NT_PRSTATUS,
@@ -1710,6 +1714,8 @@ static int write_note_info(struct elf_note_info *info,
 			    !writenote(&t->notes[i], cprm))
 				return 0;
 
+		itrace_elf_note_write(cprm, t->task);
+
 		first = 0;
 		t = t->next;
 	} while (t);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 03c8d74..69935a9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -471,6 +471,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
 	[RLIMIT_NICE] = {"Max nice priority", NULL},
 	[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
 	[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
+	[RLIMIT_ITRACE] = {"Max ITRACE buffer size", "bytes"},
 };
 
 /* Display limits for a process */
diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index b4ea8f5..e6e5657 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -25,6 +25,7 @@
 	[RLIMIT_NICE]		= { 0, 0 },				\
 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
 	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
+	[RLIMIT_ITRACE]		= {              0,  RLIM_INFINITY },	\
 }
 
 #endif
diff --git a/include/linux/itrace.h b/include/linux/itrace.h
index 6adbb32..c1eb6d3 100644
--- a/include/linux/itrace.h
+++ b/include/linux/itrace.h
@@ -22,6 +22,7 @@
 
 #include <linux/perf_event.h>
 #include <linux/file.h>
+#include <linux/coredump.h>
 
 extern struct ring_buffer_ops itrace_rb_ops;
 
@@ -66,6 +67,19 @@ struct itrace_pmu {
 	void			(*sample_output)(struct perf_event *event,
 						 struct perf_output_handle *handle,
 						 struct perf_sample_data *data);
+
+	/*
+	 * Get the PMU-specific part of a core dump note
+	 */
+	size_t			(*core_size)(struct perf_event *event);
+
+	/*
+	 * Write out the core dump note
+	 */
+	void			(*core_output)(struct coredump_params *cprm,
+					       struct perf_event *event,
+					       unsigned long len);
+	u64			coredump_config;
 	char			*name;
 };
 
@@ -95,6 +109,17 @@ extern unsigned long itrace_sampler_trace(struct perf_event *event,
 extern void itrace_sampler_output(struct perf_event *event,
 				  struct perf_output_handle *handle,
 				  struct perf_sample_data *data);
+
+extern int update_itrace_rlimit(struct task_struct *, unsigned long);
+extern void exit_itrace(struct task_struct *);
+
+struct itrace_note {
+	u64	itrace_config;
+};
+
+extern size_t itrace_elf_note_size(struct task_struct *tsk);
+extern void itrace_elf_note_write(struct coredump_params *cprm,
+				  struct task_struct *task);
 #else
 static int itrace_kernel_event(struct perf_event *event,
 			       struct task_struct *task)	{ return 0; }
@@ -121,6 +146,17 @@ static inline void
 itrace_sampler_output(struct perf_event *event,
 		      struct perf_output_handle *handle,
 		      struct perf_sample_data *data)		{}
+
+static inline int
+update_itrace_rlimit(struct task_struct *tsk, unsigned long rlim) { return -EINVAL; }
+static inline void exit_itrace(struct task_struct *tsk)	{}
+
+static inline size_t
+itrace_elf_note_size(struct task_struct *tsk)			{ return 0; }
+static inline void
+itrace_elf_note_write(struct coredump_params *cprm,
+		      struct task_struct *task)			{}
+
 #endif
 
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 11eb133..8353d7f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -106,6 +106,9 @@ struct event_constraint;
 enum perf_itrace_counter_type {
 	PERF_ITRACE_USER	= BIT(1),
 	PERF_ITRACE_SAMPLING	= BIT(2),
+	PERF_ITRACE_COREDUMP	= BIT(3),
+	PERF_ITRACE_KERNEL	= (PERF_ITRACE_SAMPLING | PERF_ITRACE_COREDUMP),
+	PERF_ITRACE_ANY		= (PERF_ITRACE_KERNEL | PERF_ITRACE_USER),
 };
 
 /**
diff --git a/include/uapi/asm-generic/resource.h b/include/uapi/asm-generic/resource.h
index f863428..073f413 100644
--- a/include/uapi/asm-generic/resource.h
+++ b/include/uapi/asm-generic/resource.h
@@ -45,7 +45,8 @@
 					   0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
 #define RLIMIT_RTTIME		15	/* timeout for RT tasks in us */
-#define RLIM_NLIMITS		16
+#define RLIMIT_ITRACE		16	/* max itrace size */
+#define RLIM_NLIMITS		17
 
 /*
  * SuS says limits have to be unsigned.
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index ef6103b..4bfbf66 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -369,6 +369,7 @@ typedef struct elf64_shdr {
 #define NT_PRPSINFO	3
 #define NT_TASKSTRUCT	4
 #define NT_AUXV		6
+#define NT_ITRACE	7
 /*
  * Note to userspace developers: size of NT_SIGINFO note may increase
  * in the future to accomodate more fields, don't assume it is fixed!
diff --git a/kernel/events/itrace.c b/kernel/events/itrace.c
index f003530..1cc9a36 100644
--- a/kernel/events/itrace.c
+++ b/kernel/events/itrace.c
@@ -20,15 +20,21 @@
 #undef DEBUG
 
 #include <linux/kernel.h>
+#include <linux/sched.h>
 #include <linux/perf_event.h>
 #include <linux/itrace.h>
 #include <linux/sizes.h>
+#include <linux/elf.h>
+#include <linux/coredump.h>
 #include <linux/slab.h>
 
 #include "internal.h"
 
 static LIST_HEAD(itrace_pmus);
 static DEFINE_MUTEX(itrace_pmus_mutex);
+static struct itrace_pmu *itrace_pmu_coredump;
+
+#define CORE_OWNER "ITRACE"
 
 struct static_key_deferred itrace_core_events __read_mostly;
 
@@ -91,8 +97,12 @@ bool is_itrace_event(struct perf_event *event)
 
 static void itrace_event_destroy(struct perf_event *event)
 {
+	struct task_struct *task = event->hw.itrace_target;
 	struct ring_buffer *rb = event->rb[PERF_RB_ITRACE];
 
+	if (task && event->hw.counter_type == PERF_ITRACE_COREDUMP)
+		static_key_slow_dec_deferred(&itrace_core_events);
+
 	if (!rb)
 		return;
 
@@ -268,6 +278,10 @@ int itrace_inherit_event(struct perf_event *event, struct task_struct *task)
 	}
 
 	event->hw.counter_type = parent->hw.counter_type;
+	if (event->hw.counter_type == PERF_ITRACE_COREDUMP) {
+		static_key_slow_inc(&itrace_core_events.key);
+		size = task_rlimit(task, RLIMIT_ITRACE);
+	}
 
 	size = roundup_buffer_size(size);
 	rb = rb_alloc(event, size >> PAGE_SHIFT, 0, event->cpu, 0,
@@ -294,10 +308,10 @@ int itrace_kernel_event(struct perf_event *event, struct task_struct *task)
 
 	ipmu = to_itrace_pmu(event->pmu);
 
-	if (!event->attr.itrace_sample_size)
-		return 0;
-
-	size = roundup_buffer_size(event->attr.itrace_sample_size);
+	if (event->attr.itrace_sample_size)
+		size = roundup_buffer_size(event->attr.itrace_sample_size);
+	else
+		size = task_rlimit(task, RLIMIT_ITRACE);
 
 	rb = rb_alloc(event, size >> PAGE_SHIFT, 0, event->cpu, 0,
 		      &itrace_rb_ops);
@@ -325,6 +339,104 @@ void itrace_wake_up(struct perf_event *event)
 	rcu_read_unlock();
 }
 
+static ssize_t
+coredump_show(struct device *dev,
+	      struct device_attribute *attr,
+	      char *page)
+{
+	struct pmu *pmu = dev_get_drvdata(dev);
+	struct itrace_pmu *ipmu = to_itrace_pmu(pmu);
+	int ret;
+
+	mutex_lock(&itrace_pmus_mutex);
+	ret = itrace_pmu_coredump == ipmu;
+	mutex_unlock(&itrace_pmus_mutex);
+
+	return snprintf(page, PAGE_SIZE-1, "%d\n", ret);
+}
+
+static ssize_t
+coredump_store(struct device *dev,
+	       struct device_attribute *attr,
+	       const char *buf, size_t count)
+{
+	struct pmu *pmu = dev_get_drvdata(dev);
+	struct itrace_pmu *ipmu = to_itrace_pmu(pmu);
+
+	mutex_lock(&itrace_pmus_mutex);
+	if (ipmu->core_size && ipmu->core_output)
+		itrace_pmu_coredump = ipmu;
+	mutex_unlock(&itrace_pmus_mutex);
+
+	return count;
+}
+static DEVICE_ATTR_RW(coredump);
+
+static ssize_t
+coredump_config_show(struct device *dev,
+		     struct device_attribute *attr,
+		     char *page)
+{
+	struct pmu *pmu = dev_get_drvdata(dev);
+	struct itrace_pmu *ipmu = to_itrace_pmu(pmu);
+
+	return snprintf(page, PAGE_SIZE-1, "%016llx\n", ipmu->coredump_config);
+}
+
+static ssize_t
+coredump_config_store(struct device *dev,
+		      struct device_attribute *attr,
+		      const char *buf, size_t count)
+{
+	struct pmu *pmu = dev_get_drvdata(dev);
+	struct itrace_pmu *ipmu = to_itrace_pmu(pmu);
+	u64 config;
+	int ret;
+
+	ret = kstrtou64(buf, 0, &config);
+	if (ret)
+		return ret;
+
+	ipmu->coredump_config = config;
+
+	return count;
+}
+static DEVICE_ATTR_RW(coredump_config);
+
+static struct attribute *itrace_attrs[] = {
+	&dev_attr_coredump.attr,
+	&dev_attr_coredump_config.attr,
+	NULL,
+};
+
+struct attribute_group itrace_group = {
+	.attrs	= itrace_attrs,
+};
+
+static const struct attribute_group **
+itrace_get_attr_groups(const struct attribute_group **pgroups)
+{
+	const struct attribute_group **groups;
+	int i, ngroups;
+	size_t size;
+
+	for (i = 0, ngroups = 2; pgroups[i]; i++, ngroups++)
+		;
+
+	size = sizeof(struct attribute_group *) * ngroups;
+	groups = kzalloc(size, GFP_KERNEL);
+	if (!groups)
+		goto out;
+
+	for (i = 0; pgroups[i]; i++)
+		groups[i] = pgroups[i];
+
+	groups[i] = &itrace_group;
+
+out:
+	return groups;
+}
+
 int itrace_pmu_register(struct itrace_pmu *ipmu)
 {
 	int ret;
@@ -334,6 +446,7 @@ int itrace_pmu_register(struct itrace_pmu *ipmu)
 
 	ipmu->event_init = ipmu->pmu.event_init;
 	ipmu->pmu.event_init = itrace_event_init;
+	ipmu->pmu.attr_groups = itrace_get_attr_groups(ipmu->pmu.attr_groups);
 
 	ret = perf_pmu_register(&ipmu->pmu, ipmu->name, -1);
 	if (ret)
@@ -341,6 +454,8 @@ int itrace_pmu_register(struct itrace_pmu *ipmu)
 
 	mutex_lock(&itrace_pmus_mutex);
 	list_add_tail_rcu(&ipmu->entry, &itrace_pmus);
+	if (ipmu->core_size && ipmu->core_output)
+		itrace_pmu_coredump = ipmu;
 	mutex_unlock(&itrace_pmus_mutex);
 
 	return ret;
@@ -422,3 +537,169 @@ void itrace_sampler_output(struct perf_event *event,
 	ipmu = to_itrace_pmu(tevt->pmu);
 	ipmu->sample_output(tevt, handle, data);
 }
+
+/*
+ * Core dump bits
+ *
+ * Various parts of the kernel will call here:
+ *   + do_prlimit(): to tell us that the user is trying to set RLIMIT_ITRACE
+ *   + various places in bitfmt_elf.c: to write out itrace notes
+ *   + do_exit(): to destroy the first core dump counter
+ *   + the rest (copy_process()/do_exit()) is taken care of by perf for us
+ */
+
+static struct perf_event *
+itrace_find_task_event(struct task_struct *task, unsigned type)
+{
+	struct perf_event_context *ctx;
+	struct perf_event *event = NULL;
+
+	rcu_read_lock();
+	ctx = rcu_dereference(task->perf_event_ctxp[perf_hw_context]);
+	if (!ctx)
+		goto out;
+
+	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
+		if (is_itrace_event(event) &&
+		    event->cpu == -1 &&
+		    !!(event->hw.counter_type & type))
+			goto out;
+	}
+
+	event = NULL;
+out:
+	rcu_read_unlock();
+
+	return event;
+}
+
+int update_itrace_rlimit(struct task_struct *task, unsigned long rlim)
+{
+	struct perf_event_attr attr;
+	struct perf_event *event;
+
+	event = itrace_find_task_event(task, PERF_ITRACE_ANY);
+	if (event) {
+		if (event->hw.counter_type != PERF_ITRACE_COREDUMP)
+			return -EINVAL;
+
+		perf_event_release_kernel(event);
+		static_key_slow_dec_deferred(&itrace_core_events);
+	}
+
+	if (!rlim)
+		return 0;
+
+	memset(&attr, 0, sizeof(attr));
+
+	mutex_lock(&itrace_pmus_mutex);
+	if (!itrace_pmu_coredump) {
+		mutex_unlock(&itrace_pmus_mutex);
+		return -ENOTSUPP;
+	}
+
+	attr.type = itrace_pmu_coredump->pmu.type;
+	attr.config = 0;
+	attr.sample_type = 0;
+	attr.exclude_kernel = 1;
+	attr.inherit = 1;
+	attr.itrace_config = itrace_pmu_coredump->coredump_config;
+
+	event = perf_event_create_kernel_counter(&attr, -1, task, NULL, NULL);
+	mutex_unlock(&itrace_pmus_mutex);
+
+	if (IS_ERR(event))
+		return PTR_ERR(event);
+
+	static_key_slow_inc(&itrace_core_events.key);
+
+	event->hw.counter_type = PERF_ITRACE_COREDUMP;
+	perf_event_enable(event);
+
+	return 0;
+}
+
+static void itrace_pmu_exit_task(struct task_struct *task)
+{
+	struct perf_event *event;
+
+	event = itrace_find_task_event(task, PERF_ITRACE_COREDUMP);
+
+	/*
+	 * here we are only interested in kernel counters created by
+	 * update_itrace_rlimit(), inherited ones should be taken care of by
+	 * perf_event_exit_task(), sampling ones are taken care of by
+	 * itrace_sampler_fini().
+	 */
+	if (!event)
+		return;
+
+	if (!event->parent)
+		perf_event_release_kernel(event);
+}
+
+void exit_itrace(struct task_struct *task)
+{
+	if (static_key_false(&itrace_core_events.key))
+		itrace_pmu_exit_task(task);
+}
+
+size_t itrace_elf_note_size(struct task_struct *task)
+{
+	struct itrace_pmu *ipmu;
+	struct perf_event *event = NULL;
+	size_t size = 0;
+
+	event = itrace_find_task_event(task, PERF_ITRACE_COREDUMP);
+	if (event) {
+		perf_event_disable(event);
+
+		ipmu = to_itrace_pmu(event->pmu);
+		size = ipmu->core_size(event);
+		size += task_rlimit(task, RLIMIT_ITRACE);
+		size = roundup(size + strlen(ipmu->name) + 1, 4);
+		size += sizeof(struct itrace_note) + sizeof(struct elf_note);
+		size += roundup(sizeof(CORE_OWNER), 4);
+	}
+
+	return size;
+}
+
+void itrace_elf_note_write(struct coredump_params *cprm,
+			   struct task_struct *task)
+{
+	struct perf_event *event;
+	struct itrace_note note;
+	struct itrace_pmu *ipmu;
+	struct elf_note en;
+	unsigned long rlim;
+	size_t pmu_len;
+
+	event = itrace_find_task_event(task, PERF_ITRACE_COREDUMP);
+	if (!event)
+		return;
+
+	ipmu = to_itrace_pmu(event->pmu);
+	pmu_len = strlen(ipmu->name) + 1;
+
+	rlim = task_rlimit(task, RLIMIT_ITRACE);
+
+	/* Elf note with name */
+	en.n_namesz = strlen(CORE_OWNER);
+	en.n_descsz = roundup(ipmu->core_size(event) + rlim + sizeof(note) +
+			      pmu_len, 4);
+	en.n_type = NT_ITRACE;
+	dump_emit(cprm, &en, sizeof(en));
+	dump_align(cprm, 4);
+	dump_emit(cprm, CORE_OWNER, sizeof(CORE_OWNER));
+	dump_align(cprm, 4);
+
+	/* ITRACE header */
+	note.itrace_config = event->attr.itrace_config;
+	dump_emit(cprm, &note, sizeof(note));
+	dump_emit(cprm, ipmu->name, pmu_len);
+
+	/* ITRACE PMU header + payload */
+	ipmu->core_output(cprm, event, rlim);
+	dump_align(cprm, 4);
+}
diff --git a/kernel/exit.c b/kernel/exit.c
index a949819..28138ef 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -48,6 +48,7 @@
 #include <linux/fs_struct.h>
 #include <linux/init_task.h>
 #include <linux/perf_event.h>
+#include <linux/itrace.h>
 #include <trace/events/sched.h>
 #include <linux/hw_breakpoint.h>
 #include <linux/oom.h>
@@ -788,6 +789,8 @@ void do_exit(long code)
 	check_stack_usage();
 	exit_thread();
 
+	exit_itrace(tsk);
+
 	/*
 	 * Flush inherited counters to the parent - before the parent
 	 * gets woken up by child-exit notifications.
diff --git a/kernel/sys.c b/kernel/sys.c
index c723113..7651d6f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -14,6 +14,7 @@
 #include <linux/fs.h>
 #include <linux/kmod.h>
 #include <linux/perf_event.h>
+#include <linux/itrace.h>
 #include <linux/resource.h>
 #include <linux/kernel.h>
 #include <linux/workqueue.h>
@@ -1402,6 +1403,10 @@ int do_prlimit(struct task_struct *tsk, unsigned int resource,
 		update_rlimit_cpu(tsk, new_rlim->rlim_cur);
 out:
 	read_unlock(&tasklist_lock);
+
+	if (!retval && new_rlim && resource == RLIMIT_ITRACE)
+		retval = update_itrace_rlimit(tsk, new_rlim->rlim_cur);
+
 	return retval;
 }
 
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver
  2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
                   ` (5 preceding siblings ...)
  2014-02-06 10:50 ` [PATCH v1 06/11] itrace: Add functionality to include traces in process core dumps Alexander Shishkin
@ 2014-02-06 10:50 ` Alexander Shishkin
  2014-02-06 20:29   ` Andi Kleen
                     ` (2 more replies)
  2014-02-06 10:50 ` [PATCH v1 08/11] x86: perf: intel_pt: Add sampling functionality Alexander Shishkin
                   ` (3 subsequent siblings)
  10 siblings, 3 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

Add support for Intel Processor Trace (PT) to the kernel's perf/itrace
events. PT is an extension of Intel Architecture that collects
information about software execution, such as control flow, execution
modes and timings, and formats it into highly compressed binary packets.
Even though they are compressed, these packets are generated at hundreds
of megabytes per second per core, which makes it impractical to decode
them on the fly in the kernel. Thus, buffers containing this binary
stream are zero-copy mapped to the debug tools in userspace for
subsequent decoding and analysis.
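
For reference (not part of this patch), the capabilities that
pt_pmu_hw_init() below caches from CPUID leaf 0x14 and exports through
the "caps" sysfs group can also be probed directly from userspace; a
small sketch, assuming a GCC-style <cpuid.h>:

  #include <stdio.h>
  #include <cpuid.h>

  int main(void)
  {
          unsigned int a, b, c, d;

          /* CPUID.(EAX=07H,ECX=0):EBX[25] enumerates Intel PT */
          __cpuid_count(0x07, 0, a, b, c, d);
          if (!(b & (1 << 25))) {
                  puts("Intel PT not supported");
                  return 1;
          }

          /* leaf 0x14, subleaf 0: the bits mirrored in pt_caps[] */
          __cpuid_count(0x14, 0, a, b, c, d);
          printf("cr3_filtering:          %u\n", b & 1);
          printf("topa_output:            %u\n", c & 1);
          printf("topa_multiple_entries:  %u\n", (c >> 1) & 1);

          return 0;
  }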

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/include/uapi/asm/msr-index.h     |  18 +
 arch/x86/kernel/cpu/Makefile              |   1 +
 arch/x86/kernel/cpu/intel_pt.h            | 127 ++++
 arch/x86/kernel/cpu/perf_event.c          |   4 +
 arch/x86/kernel/cpu/perf_event_intel.c    |  10 +
 arch/x86/kernel/cpu/perf_event_intel_pt.c | 991 ++++++++++++++++++++++++++++++
 6 files changed, 1151 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/intel_pt.h
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index 37813b5..38979e7 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -74,6 +74,24 @@
 #define MSR_IA32_PERF_CAPABILITIES	0x00000345
 #define MSR_PEBS_LD_LAT_THRESHOLD	0x000003f6
 
+#define MSR_IA32_RTIT_CTL		0x00000570
+#define RTIT_CTL_TRACEEN		BIT(0)
+#define RTIT_CTL_OS			BIT(2)
+#define RTIT_CTL_USR			BIT(3)
+#define RTIT_CTL_CR3EN			BIT(7)
+#define RTIT_CTL_TOPA			BIT(8)
+#define RTIT_CTL_TSC_EN			BIT(10)
+#define RTIT_CTL_DISRETC		BIT(11)
+#define RTIT_CTL_BRANCH_EN		BIT(13)
+#define MSR_IA32_RTIT_STATUS		0x00000571
+#define RTIT_STATUS_CONTEXTEN		BIT(1)
+#define RTIT_STATUS_TRIGGEREN		BIT(2)
+#define RTIT_STATUS_ERROR		BIT(4)
+#define RTIT_STATUS_STOPPED		BIT(5)
+#define MSR_IA32_RTIT_CR3_MATCH		0x00000572
+#define MSR_IA32_RTIT_OUTPUT_BASE	0x00000560
+#define MSR_IA32_RTIT_OUTPUT_MASK	0x00000561
+
 #define MSR_MTRRfix64K_00000		0x00000250
 #define MSR_MTRRfix16K_80000		0x00000258
 #define MSR_MTRRfix16K_A0000		0x00000259
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 6359506..cb69de3 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -37,6 +37,7 @@ endif
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_rapl.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
new file mode 100644
index 0000000..dd69092
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -0,0 +1,127 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ */
+
+#ifndef __INTEL_PT_H__
+#define __INTEL_PT_H__
+
+#include <linux/radix-tree.h>
+#include <linux/itrace.h>
+
+/*
+ * Single-entry ToPA: when this close to region boundary, switch
+ * buffers to avoid losing data.
+ */
+#define TOPA_PMI_MARGIN 512
+
+/*
+ * Table of Physical Addresses bits
+ */
+enum topa_sz {
+	TOPA_4K	= 0,
+	TOPA_8K,
+	TOPA_16K,
+	TOPA_32K,
+	TOPA_64K,
+	TOPA_128K,
+	TOPA_256K,
+	TOPA_512K,
+	TOPA_1MB,
+	TOPA_2MB,
+	TOPA_4MB,
+	TOPA_8MB,
+	TOPA_16MB,
+	TOPA_32MB,
+	TOPA_64MB,
+	TOPA_128MB,
+	TOPA_SZ_END,
+};
+
+static inline unsigned int sizes(enum topa_sz tsz)
+{
+	return 1 << (tsz + 12);
+}
+
+struct topa_entry {
+	u64	end	: 1;
+	u64	rsvd0	: 1;
+	u64	intr	: 1;
+	u64	rsvd1	: 1;
+	u64	stop	: 1;
+	u64	rsvd2	: 1;
+	u64	size	: 4;
+	u64	rsvd3	: 2;
+	u64	base	: 36;
+	u64	rsvd4	: 16;
+};
+
+#define TOPA_SHIFT 12
+#define PT_CPUID_LEAVES 2
+
+enum pt_capabilities {
+	PT_CAP_max_subleaf = 0,
+	PT_CAP_cr3_filtering,
+	PT_CAP_topa_output,
+	PT_CAP_topa_multiple_entries,
+	PT_CAP_payloads_lip,
+};
+
+struct pt_pmu {
+	struct itrace_pmu	itrace;
+	u32			caps[4 * PT_CPUID_LEAVES];
+};
+
+/**
+ * struct pt_buffer - buffer configuration; one buffer per task_struct or
+ * cpu, depending on perf event configuration
+ * @tables: list of ToPA tables in this buffer
+ * @first, @last: shorthands for first and last topa tables
+ * @cur: current topa table
+ * @size: total size of all output regions within this buffer
+ * @cur_idx: current output region's index within @cur table
+ * @output_off: offset within the current output region
+ */
+struct pt_buffer {
+	/* hint for allocation */
+	int			cpu;
+	/* list of ToPA tables */
+	struct list_head	tables;
+	/* top-level table */
+	struct topa		*first, *last, *cur;
+	unsigned long		round;
+	unsigned int		cur_idx;
+	size_t			output_off;
+	unsigned long		size;
+	local64_t		head;
+	unsigned long		watermark;
+	bool			snapshot;
+	struct perf_event_mmap_page *user_page;
+	void			**data_pages;
+};
+
+/**
+ * struct pt - per-cpu pt
+ */
+struct pt {
+	raw_spinlock_t		lock;
+	struct perf_event	*event;
+};
+
+void intel_pt_interrupt(void);
+
+#endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8e13293..9125797 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -385,6 +385,10 @@ static inline int precise_br_compat(struct perf_event *event)
 
 int x86_pmu_hw_config(struct perf_event *event)
 {
+	if (event->attr.sample_type & PERF_SAMPLE_ITRACE &&
+	    event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK)
+		return -EINVAL;
+
 	if (event->attr.precise_ip) {
 		int precise = 0;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 0fa4f24..28b5023 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1312,6 +1312,8 @@ int intel_pmu_save_and_restart(struct perf_event *event)
 	return x86_perf_event_set_period(event);
 }
 
+void intel_pt_interrupt(void);
+
 static void intel_pmu_reset(void)
 {
 	struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
@@ -1393,6 +1395,14 @@ again:
 	}
 
 	/*
+	 * Intel PT
+	 */
+	if (__test_and_clear_bit(55, (unsigned long *)&status)) {
+		handled++;
+		intel_pt_interrupt();
+	}
+
+	/*
 	 * Checkpointed counters can lead to 'spurious' PMIs because the
 	 * rollback caused by the PMI will have cleared the overflow status
 	 * bit. Therefore always force probe these counters.
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
new file mode 100644
index 0000000..b6b1a84
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -0,0 +1,991 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ */
+
+#undef DEBUG
+
+#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/debugfs.h>
+#include <linux/device.h>
+
+#include <asm-generic/sizes.h>
+#include <asm/perf_event.h>
+#include <asm/insn.h>
+
+#include "perf_event.h"
+#include "intel_pt.h"
+
+static DEFINE_PER_CPU(struct pt, pt_ctx);
+
+static struct pt_pmu pt_pmu;
+
+enum cpuid_regs {
+	CR_EAX = 0,
+	CR_ECX,
+	CR_EDX,
+	CR_EBX
+};
+
+/*
+ * Capabilities of Intel PT hardware, such as number of address bits or
+ * supported output schemes, are cached and exported to userspace as "caps"
+ * attribute group of pt pmu device
+ * (/sys/bus/event_source/devices/intel_pt/caps/) so that userspace can store
+ * relevant bits together with intel_pt traces.
+ *
+ * Currently, for debugging purposes, these attributes are also writable; this
+ * should be removed in the final version.
+ */
+#define PT_CAP(_n, _l, _r, _m)						\
+	[PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l,	\
+			    .reg = _r, .mask = _m }
+
+static struct pt_cap_desc {
+	const char	*name;
+	u32		leaf;
+	u8		reg;
+	u32		mask;
+} pt_caps[] = {
+	PT_CAP(max_subleaf,		0, CR_EAX, 0xffffffff),
+	PT_CAP(cr3_filtering,		0, CR_EBX, BIT(0)),
+	PT_CAP(topa_output,		0, CR_ECX, BIT(0)),
+	PT_CAP(topa_multiple_entries,	0, CR_ECX, BIT(1)),
+	PT_CAP(payloads_lip,		0, CR_ECX, BIT(31)),
+};
+
+static u32 pt_cap_get(enum pt_capabilities cap)
+{
+	struct pt_cap_desc *cd = &pt_caps[cap];
+	u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
+	unsigned int shift = __ffs(cd->mask);
+
+	return (c & cd->mask) >> shift;
+}
+
+static void pt_cap_set(enum pt_capabilities cap, u32 val)
+{
+	struct pt_cap_desc *cd = &pt_caps[cap];
+	unsigned int idx = cd->leaf * 4 + cd->reg;
+	unsigned int shift = __ffs(cd->mask);
+
+	pt_pmu.caps[idx] = (val << shift) & cd->mask;
+}
+
+static ssize_t pt_cap_show(struct device *cdev,
+			   struct device_attribute *attr,
+			   char *buf)
+{
+	struct dev_ext_attribute *ea =
+		container_of(attr, struct dev_ext_attribute, attr);
+	enum pt_capabilities cap = (long)ea->var;
+
+	return snprintf(buf, PAGE_SIZE, "%x\n", pt_cap_get(cap));
+}
+
+static ssize_t pt_cap_store(struct device *cdev,
+			    struct device_attribute *attr,
+			    const char *buf, size_t size)
+{
+	struct dev_ext_attribute *ea =
+		container_of(attr, struct dev_ext_attribute, attr);
+	enum pt_capabilities cap = (long)ea->var;
+	unsigned long new;
+	char *end;
+
+	new = simple_strtoul(buf, &end, 0);
+	if (end == buf)
+		return -EINVAL;
+
+	pt_cap_set(cap, new);
+	return size;
+}
+
+static struct attribute_group pt_cap_group = {
+	.name	= "caps",
+};
+
+PMU_FORMAT_ATTR(tsc,		"itrace_config:10"	);
+PMU_FORMAT_ATTR(noretcomp,	"itrace_config:11"	);
+
+static struct attribute *pt_formats_attr[] = {
+	&format_attr_tsc.attr,
+	&format_attr_noretcomp.attr,
+	NULL,
+};
+
+static struct attribute_group pt_format_group = {
+	.name	= "format",
+	.attrs	= pt_formats_attr,
+};
+
+static const struct attribute_group *pt_attr_groups[] = {
+	&pt_cap_group,
+	&pt_format_group,
+	NULL,
+};
+
+static int __init pt_pmu_hw_init(void)
+{
+	struct dev_ext_attribute *de_attrs;
+	struct attribute **attrs;
+	size_t size;
+	long i;
+
+	if (test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT)) {
+		for (i = 0; i < PT_CPUID_LEAVES; i++)
+			cpuid_count(20, i,
+				    &pt_pmu.caps[CR_EAX + i * 4],
+				    &pt_pmu.caps[CR_EBX + i * 4],
+				    &pt_pmu.caps[CR_ECX + i * 4],
+				    &pt_pmu.caps[CR_EDX + i * 4]);
+	} else
+		return -ENODEV;
+
+	size = sizeof(struct attribute *) * (ARRAY_SIZE(pt_caps) + 1);
+	attrs = kzalloc(size, GFP_KERNEL);
+	if (!attrs)
+		goto err_attrs;
+
+	size = sizeof(struct dev_ext_attribute) * (ARRAY_SIZE(pt_caps) + 1);
+	de_attrs = kzalloc(size, GFP_KERNEL);
+	if (!de_attrs)
+		goto err_de_attrs;
+
+	for (i = 0; i < ARRAY_SIZE(pt_caps); i++) {
+		de_attrs[i].attr.attr.name = pt_caps[i].name;
+
+		sysfs_attr_init(&de_attrs[i].attr.attr);
+		de_attrs[i].attr.attr.mode = S_IRUGO | S_IWUSR;
+		de_attrs[i].attr.show = pt_cap_show;
+		de_attrs[i].attr.store = pt_cap_store;
+		de_attrs[i].var = (void *)i;
+		attrs[i] = &de_attrs[i].attr.attr;
+	}
+
+	pt_cap_group.attrs = attrs;
+	return 0;
+
+err_de_attrs:
+	kfree(de_attrs);
+err_attrs:
+	kfree(attrs);
+
+	return -ENOMEM;
+}
+
+#define PT_CONFIG_MASK (RTIT_CTL_TSC_EN | RTIT_CTL_DISRETC)
+
+static bool pt_event_valid(struct perf_event *event)
+{
+	u64 itrace_config = event->attr.itrace_config;
+
+	if ((itrace_config & PT_CONFIG_MASK) != itrace_config)
+		return false;
+
+	return true;
+}
+
+/*
+ * PT configuration helpers
+ * These all are cpu affine and operate on a local PT
+ */
+
+static int pt_config(struct perf_event *event)
+{
+	u64 reg;
+
+	reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN;
+
+	if (!event->attr.exclude_kernel)
+		reg |= RTIT_CTL_OS;
+	if (!event->attr.exclude_user)
+		reg |= RTIT_CTL_USR;
+
+	reg |= (event->attr.itrace_config & PT_CONFIG_MASK);
+
+	if (wrmsr_safe(MSR_IA32_RTIT_CTL, reg, 0) < 0) {
+		pr_warn("Failed to enable PT on cpu %d\n", event->cpu);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void pt_config_start(bool start)
+{
+	u64 ctl;
+
+	rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+	if (start)
+		ctl |= RTIT_CTL_TRACEEN;
+	else
+		ctl &= ~RTIT_CTL_TRACEEN;
+	wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+}
+
+static void pt_config_buffer(void *buf, unsigned int topa_idx,
+			     unsigned int output_off)
+{
+	u64 reg;
+
+	wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, virt_to_phys(buf));
+
+	reg = 0x7f | ((u64)topa_idx << 7) | ((u64)output_off << 32);
+
+	wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, reg);
+}
+
+#define TENTS_PER_PAGE (((PAGE_SIZE - 40) / sizeof(struct topa_entry)) - 1)
+
+struct topa {
+	struct topa_entry	table[TENTS_PER_PAGE];
+	struct list_head	list;
+	u64			phys;
+	u64			offset;
+	size_t			size;
+	int			last;
+};
+
+/* make negative table index stand for the last table entry */
+#define TOPA_ENTRY(t, i) ((i) == -1 ? &(t)->table[(t)->last] : &(t)->table[(i)])
+
+/*
+ * allocate page-sized ToPA table
+ */
+static struct topa *topa_alloc(int cpu, gfp_t gfp)
+{
+	int node = cpu_to_node(cpu);
+	struct topa *topa;
+	struct page *p;
+
+	p = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
+	if (!p)
+		return NULL;
+
+	topa = page_address(p);
+	topa->last = 0;
+	topa->phys = page_to_phys(p);
+
+	/*
+	 * In case of single-entry ToPA, always put the self-referencing END
+	 * link as the 2nd entry in the table
+	 */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(topa, 1)->base = topa->phys >> TOPA_SHIFT;
+		TOPA_ENTRY(topa, 1)->end = 1;
+	}
+
+	return topa;
+}
+
+static void topa_free(struct topa *topa)
+{
+	free_page((unsigned long)topa);
+}
+
+static void topa_free_pages(struct pt_buffer *buf, struct topa *topa, int idx)
+{
+	size_t size = sizes(TOPA_ENTRY(topa, idx)->size);
+	void *base = phys_to_virt(TOPA_ENTRY(topa, idx)->base << TOPA_SHIFT);
+	unsigned long pn;
+
+	for (pn = 0; pn < size; pn += PAGE_SIZE) {
+		struct page *page = virt_to_page(base + pn);
+
+		page->mapping = NULL;
+		__free_page(page);
+	}
+}
+
+/**
+ * topa_insert_table - insert a ToPA table into a buffer
+ * @buf - pt buffer that's being extended
+ * @topa - new topa table to be inserted
+ *
+ * If it's the first table in this buffer, set up buffer's pointers
+ * accordingly; otherwise, add an END=1 link entry pointing to @topa in the
+ * current "last" table and adjust the last table pointer to @topa.
+ */
+static void topa_insert_table(struct pt_buffer *buf, struct topa *topa)
+{
+	struct topa *last = buf->last;
+
+	list_add_tail(&topa->list, &buf->tables);
+
+	if (!buf->first) {
+		buf->first = buf->last = buf->cur = topa;
+		return;
+	}
+
+	topa->offset = last->offset + last->size;
+	buf->last = topa;
+
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		return;
+
+	BUG_ON(last->last != TENTS_PER_PAGE - 1);
+
+	TOPA_ENTRY(last, -1)->base = topa->phys >> TOPA_SHIFT;
+	TOPA_ENTRY(last, -1)->end = 1;
+}
+
+static bool topa_table_full(struct topa *topa)
+{
+	/* single-entry ToPA is a special case */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
+		return !!topa->last;
+
+	return topa->last == TENTS_PER_PAGE - 1;
+}
+
+static bool pt_buffer_needs_watermark(struct pt_buffer *buf, unsigned long offset)
+{
+	if (buf->snapshot)
+		return false;
+
+	return !(offset % (buf->watermark << PAGE_SHIFT));
+}
+
+static int topa_insert_pages(struct pt_buffer *buf, gfp_t gfp,
+			     enum topa_sz sz)
+{
+	struct topa *topa = buf->last;
+	int node = cpu_to_node(buf->cpu);
+	int order = get_order(sizes(sz));
+	struct page *p;
+	unsigned long pn;
+
+	p = alloc_pages_node(node, gfp | GFP_USER | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY, order);
+	if (!p)
+		return -ENOMEM;
+
+	split_page(p, order);
+
+	if (topa_table_full(topa)) {
+		topa = topa_alloc(buf->cpu, gfp);
+
+		if (!topa) {
+			free_pages((unsigned long)page_address(p), order);
+			return -ENOMEM;
+		}
+
+		topa_insert_table(buf, topa);
+	}
+
+	TOPA_ENTRY(topa, -1)->base = page_to_phys(p) >> TOPA_SHIFT;
+	TOPA_ENTRY(topa, -1)->size = sz;
+	if (!buf->snapshot && !pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(topa, -1)->intr = 1;
+		TOPA_ENTRY(topa, -1)->stop = 1;
+	}
+	if (pt_buffer_needs_watermark(buf, buf->size))
+		TOPA_ENTRY(topa, -1)->intr = 1;
+
+	topa->last++;
+	topa->size += sizes(sz);
+	for (pn = 0; pn < sizes(sz); pn += PAGE_SIZE, buf->size += PAGE_SIZE)
+		buf->data_pages[buf->size >> PAGE_SHIFT] = page_address(p) + pn;
+
+	return 0;
+}
+
+static void pt_topa_dump(struct pt_buffer *buf)
+{
+	struct topa *topa;
+
+	list_for_each_entry(topa, &buf->tables, list) {
+		int i;
+
+		pr_debug("# table @%p (%p), off %llx size %lx\n", topa->table,
+			 (void *)topa->phys, topa->offset, topa->size);
+		for (i = 0; i < TENTS_PER_PAGE; i++) {
+			pr_debug("# entry @%p (%lx sz %u %c%c%c) raw=%16llx\n",
+				 &topa->table[i],
+				 (unsigned long)topa->table[i].base << TOPA_SHIFT,
+				 sizes(topa->table[i].size),
+				 topa->table[i].end ?  'E' : ' ',
+				 topa->table[i].intr ? 'I' : ' ',
+				 topa->table[i].stop ? 'S' : ' ',
+				 *(u64 *)&topa->table[i]);
+			if ((pt_cap_get(PT_CAP_topa_multiple_entries) && topa->table[i].stop)
+			    || topa->table[i].end)
+				break;
+		}
+	}
+}
+
+/* advance to the next output region */
+static void pt_buffer_advance(struct pt_buffer *buf)
+{
+	buf->output_off = 0;
+	buf->cur_idx++;
+
+	if (buf->cur_idx == buf->cur->last) {
+		if (buf->cur == buf->last)
+			buf->cur = buf->first;
+		else
+			buf->cur = list_entry(buf->cur->list.next, struct topa, list);
+		buf->cur_idx = 0;
+	}
+}
+
+static void pt_update_head(struct pt_buffer *buf)
+{
+	u64 topa_idx, base;
+
+	/* offset of the first region in this table from the beginning of buf */
+	base = buf->cur->offset + buf->output_off;
+
+	/* offset of the current output region within this table */
+	for (topa_idx = 0; topa_idx < buf->cur_idx; topa_idx++)
+		base += sizes(buf->cur->table[topa_idx].size);
+
+	/* data_head always increases when buffer pointer wraps */
+	base += buf->size * buf->round;
+
+	local64_set(&buf->head, base);
+	if (!buf->user_page)
+		return;
+
+	buf->user_page->data_head = base;
+	smp_wmb();
+}
+
+static void *pt_buffer_region(struct pt_buffer *buf)
+{
+	return phys_to_virt(buf->cur->table[buf->cur_idx].base << TOPA_SHIFT);
+}
+
+static size_t pt_buffer_region_size(struct pt_buffer *buf)
+{
+	return sizes(buf->cur->table[buf->cur_idx].size);
+}
+
+/**
+ * pt_handle_status - take care of possible status conditions
+ * @event: currently active PT event
+ */
+static void pt_handle_status(struct perf_event *event)
+{
+	struct pt_buffer *buf = itrace_priv(event);
+	int advance = 0;
+	u64 status;
+
+	rdmsrl(MSR_IA32_RTIT_STATUS, status);
+
+	if (status & RTIT_STATUS_ERROR) {
+		pr_err("ToPA ERROR encountered, trying to recover\n");
+		pt_topa_dump(buf);
+		status &= ~RTIT_STATUS_ERROR;
+		wrmsrl(MSR_IA32_RTIT_STATUS, status);
+	}
+
+	if (status & RTIT_STATUS_STOPPED) {
+		status &= ~RTIT_STATUS_STOPPED;
+		wrmsrl(MSR_IA32_RTIT_STATUS, status);
+
+		/*
+		 * On systems that only do single-entry ToPA, hitting STOP
+		 * means we are already losing data; need to let the decoder
+		 * know.
+		 */
+		if (!pt_cap_get(PT_CAP_topa_multiple_entries) ||
+		    buf->output_off == sizes(TOPA_ENTRY(buf->cur, buf->cur_idx)->size)) {
+			pt_update_head(buf);
+			itrace_lost_data(event, local64_read(&buf->head));
+			advance++;
+		}
+	}
+
+	/*
+	 * Also on single-entry ToPA implementations, interrupt will come
+	 * before the output reaches its output region's boundary.
+	 */
+	if (!pt_cap_get(PT_CAP_topa_multiple_entries) && !buf->snapshot &&
+	    pt_buffer_region_size(buf) - buf->output_off <= TOPA_PMI_MARGIN) {
+		void *head = pt_buffer_region(buf);
+
+		/* everything within this margin needs to be zeroed out */
+		memset(head + buf->output_off, 0,
+		       pt_buffer_region_size(buf) -
+		       buf->output_off);
+		advance++;
+	}
+
+	if (advance) {
+		/* check if the pointer has wrapped */
+		if (!buf->snapshot &&
+		    buf->cur == buf->last &&
+		    buf->cur_idx == buf->cur->last - 1)
+			buf->round++;
+		pt_buffer_advance(buf);
+	}
+}
+
+static void pt_read_offset(struct pt_buffer *buf)
+{
+	u64 offset, base_topa;
+
+	rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, base_topa);
+	buf->cur = phys_to_virt(base_topa);
+
+	rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, offset);
+	/* offset within current output region */
+	buf->output_off = offset >> 32;
+	/* index of current output region within this table */
+	buf->cur_idx = (offset & 0xffffff80) >> 7;
+}
+
+/**
+ * pt_buffer_fini_topa() - deallocate ToPA structure of a buffer
+ * @buf: pt buffer
+ */
+static void pt_buffer_fini_topa(struct pt_buffer *buf)
+{
+	struct topa *topa, *iter;
+
+	list_for_each_entry_safe(topa, iter, &buf->tables, list) {
+		int i;
+
+		for (i = 0; i < topa->last; i++)
+			topa_free_pages(buf, topa, i);
+
+		list_del(&topa->list);
+		topa_free(topa);
+	}
+}
+
+/**
+ * pt_get_topa_region_size - calculate one output region's size
+ * @snapshot: if the counter is a snapshot counter
+ * @size: overall requested allocation size
+ * returns topa region size or error
+ */
+static int pt_get_topa_region_size(bool snapshot, size_t size)
+{
+	unsigned int factor = snapshot ? 1 : 2;
+
+	if (pt_cap_get(PT_CAP_topa_multiple_entries))
+		return TOPA_4K;
+
+	if (size < SZ_4K * factor)
+		return -EINVAL;
+
+	if (!is_power_of_2(size))
+		return -EINVAL;
+
+	if (size >= SZ_128M)
+		return TOPA_128MB;
+
+	return get_order(size / factor);
+}
+
+/**
+ * pt_buffer_init_topa() - initialize ToPA table for pt buffer
+ * @buf: pt buffer
+ * @size: total size of all regions within this ToPA
+ * @gfp: allocation flags
+ */
+static int pt_buffer_init_topa(struct pt_buffer *buf, size_t size, gfp_t gfp)
+{
+	struct topa *topa;
+	int err, region_size;
+
+	topa = topa_alloc(buf->cpu, gfp);
+	if (!topa)
+		return -ENOMEM;
+
+	topa_insert_table(buf, topa);
+
+	region_size = pt_get_topa_region_size(buf->snapshot, size);
+	if (region_size < 0) {
+		pt_buffer_fini_topa(buf);
+		return region_size;
+	}
+
+	while (region_size && get_order(sizes(region_size)) > MAX_ORDER)
+		region_size--;
+
+	/* fixup watermark in case of higher order allocations */
+	if (buf->watermark < (sizes(region_size) >> PAGE_SHIFT))
+		buf->watermark = sizes(region_size) >> PAGE_SHIFT;
+
+	while (buf->size < size) {
+		err = topa_insert_pages(buf, gfp, region_size);
+		if (err) {
+			if (region_size) {
+				region_size--;
+				continue;
+			}
+			pt_buffer_fini_topa(buf);
+			return -ENOMEM;
+		}
+	}
+
+	/* link last table to the first one, unless we're double buffering */
+	if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+		TOPA_ENTRY(buf->last, -1)->base = buf->first->phys >> TOPA_SHIFT;
+		TOPA_ENTRY(buf->last, -1)->end = 1;
+	}
+
+	pt_topa_dump(buf);
+	return 0;
+}
+
+/**
+ * pt_buffer_alloc() - make a buffer for pt data
+ * @cpu: cpu on which to allocate, -1 means current
+ * @size: desired buffer size, should be a multiple of the page size
+ * @watermark: place interrupt flags every @watermark pages, 0 == disable
+ * @snapshot: if this is a snapshot counter
+ * @gfp: allocation flags
+ * @pages: array of pointers to this buffer's data pages
+ */
+static struct pt_buffer *pt_buffer_alloc(int cpu, size_t size,
+					 unsigned long watermark,
+					 bool snapshot, gfp_t gfp,
+					 void **pages)
+{
+	struct pt_buffer *buf;
+	int node;
+	int ret;
+
+	if (!size || watermark << PAGE_SHIFT > size)
+		return NULL;
+
+	if (cpu == -1)
+		cpu = raw_smp_processor_id();
+	node = cpu_to_node(cpu);
+
+	buf = kzalloc(sizeof(struct pt_buffer), gfp);
+	if (!buf)
+		return NULL;
+
+	buf->cpu = cpu;
+	buf->data_pages = pages;
+	buf->snapshot = snapshot;
+	buf->watermark = watermark;
+	if (!buf->watermark)
+		buf->watermark = (size / 2) >> PAGE_SHIFT;
+
+	INIT_LIST_HEAD(&buf->tables);
+
+	ret = pt_buffer_init_topa(buf, size, gfp);
+	if (ret) {
+		kfree(buf);
+		return NULL;
+	}
+
+	return buf;
+}
+
+/**
+ * pt_buffer_itrace_free() - dispose of pt buffer
+ * @data: pt buffer to be freed
+ */
+static void pt_buffer_itrace_free(void *data)
+{
+	struct pt_buffer *buf = data;
+
+	pt_buffer_fini_topa(buf);
+	if (buf->user_page) {
+		struct page *up = virt_to_page(buf->user_page);
+
+		up->mapping = NULL;
+		__free_page(up);
+	}
+
+	kfree(buf);
+}
+
+static void *
+pt_buffer_itrace_alloc(int cpu, int nr_pages, bool overwrite, void **pages,
+		       struct perf_event_mmap_page **user_page)
+{
+	struct pt_buffer *buf;
+	struct page *up = NULL;
+	int node;
+
+	if (user_page) {
+		*user_page = NULL;
+		node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+		up = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+		if (!up)
+			return NULL;
+	}
+
+	buf = pt_buffer_alloc(cpu, nr_pages << PAGE_SHIFT, 0, overwrite,
+			      GFP_KERNEL, pages);
+	if (user_page && buf) {
+		buf->user_page = page_address(up);
+		*user_page = page_address(up);
+	} else if (up)
+		__free_page(up);
+
+	return buf;
+}
+
+/**
+ * pt_buffer_get_page() - find n'th page in pt buffer
+ * @buf: pt buffer
+ * @idx: page index in the buffer
+ */
+static void *pt_buffer_get_page(struct pt_buffer *buf, unsigned long idx)
+{
+	return buf->data_pages[idx];
+}
+
+/**
+ * pt_buffer_is_full() - check if the buffer is full
+ * @buf: pt buffer
+ * If the user hasn't read data from the output region that data_head
+ * points to, the buffer is considered full: the user needs to read at
+ * least this region and update data_tail to point past it.
+ */
+static bool pt_buffer_is_full(struct pt_buffer *buf)
+{
+	void *tail, *head;
+	unsigned long tailoff, headoff = local64_read(&buf->head);
+
+	if (buf->snapshot)
+		return false;
+
+	tailoff = ACCESS_ONCE(buf->user_page->data_tail);
+	smp_mb();
+
+	if (headoff < tailoff || headoff - tailoff < buf->size / 2)
+		return false;
+
+	tailoff %= buf->size;
+	headoff %= buf->size;
+
+	if (headoff > tailoff)
+		return false;
+
+	/* check if head and tail are in the same output region */
+	tail = pt_buffer_get_page(buf, tailoff >> PAGE_SHIFT);
+	head = pt_buffer_region(buf);
+
+	if (tail >= head && tail < head + pt_buffer_region_size(buf))
+		return true;
+
+	return false;
+}
+
+static void pt_wake_up(struct perf_event *event)
+{
+	struct pt_buffer *buf = itrace_priv(event);
+
+	if (!buf || buf->snapshot)
+		return;
+	if (pt_buffer_is_full(buf)) {
+		event->pending_disable = 1;
+		event->pending_kill = POLL_IN;
+		event->pending_wakeup = 1;
+		event->hw.state = PERF_HES_STOPPED;
+	}
+
+	if (pt_buffer_needs_watermark(buf, local64_read(&buf->head))) {
+		event->pending_wakeup = 1;
+		event->pending_kill = POLL_IN;
+	}
+
+	if (event->pending_disable || event->pending_kill)
+		itrace_wake_up(event);
+}
+
+void intel_pt_interrupt(void)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct perf_event *event = pt->event;
+	struct pt_buffer *buf;
+
+	pt_config_start(false);
+
+	if (!event)
+		return;
+
+	buf = itrace_event_get_priv(event);
+	if (!buf)
+		return;
+
+	pt_read_offset(buf);
+
+	pt_handle_status(event);
+
+	pt_update_head(buf);
+
+	pt_wake_up(event);
+
+	if (!event->hw.state) {
+		pt_config(event);
+		pt_config_buffer(buf->cur->table, buf->cur_idx,
+				 buf->output_off);
+		wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+		pt_config_start(true);
+	}
+
+	itrace_event_put(event);
+}
+
+static void pt_event_start(struct perf_event *event, int flags)
+{
+	struct pt_buffer *buf = itrace_priv(event);
+
+	if (!buf || pt_buffer_is_full(buf) || pt_config(event)) {
+		event->hw.state = PERF_HES_STOPPED;
+		return;
+	}
+
+	event->hw.state = 0;
+
+	pt_config_buffer(buf->cur->table, buf->cur_idx,
+			 buf->output_off);
+	wrmsrl(MSR_IA32_RTIT_STATUS, 0);
+	pt_config_start(true);
+}
+
+static void pt_event_stop(struct perf_event *event, int flags)
+{
+	if (event->hw.state == PERF_HES_STOPPED)
+		return;
+
+	event->hw.state = PERF_HES_STOPPED;
+
+	pt_config_start(false);
+
+	if (flags & PERF_EF_UPDATE) {
+		struct pt_buffer *buf = itrace_priv(event);
+
+		if (WARN_ONCE(!buf, "no buffer\n"))
+			return;
+
+		pt_read_offset(buf);
+
+		pt_handle_status(event);
+
+		pt_update_head(buf);
+
+		pt_wake_up(event);
+	}
+}
+
+static void pt_event_del(struct perf_event *event, int flags)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+
+	pt_event_stop(event, PERF_EF_UPDATE);
+
+	raw_spin_lock(&pt->lock);
+	pt->event = NULL;
+	raw_spin_unlock(&pt->lock);
+
+	itrace_event_put(event);
+}
+
+static int pt_event_add(struct perf_event *event, int flags)
+{
+	struct pt_buffer *buf;
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct hw_perf_event *hwc = &event->hw;
+	int ret = 0;
+
+	ret = pt_config(event);
+	if (ret)
+		return ret;
+
+	buf = itrace_event_get_priv(event);
+	if (!buf) {
+		hwc->state = PERF_HES_STOPPED;
+		return -EINVAL;
+	}
+
+	raw_spin_lock(&pt->lock);
+	if (pt->event) {
+		raw_spin_unlock(&pt->lock);
+		itrace_event_put(event);
+		ret = -EBUSY;
+		event->hw.state = PERF_HES_STOPPED;
+		goto out;
+	}
+
+	pt->event = event;
+	raw_spin_unlock(&pt->lock);
+
+	hwc->state = !(flags & PERF_EF_START);
+	if (!hwc->state) {
+		pt_event_start(event, 0);
+		if (hwc->state == PERF_HES_STOPPED) {
+			pt_event_del(event, 0);
+			pt_wake_up(event);
+			ret = -EBUSY;
+		}
+	}
+
+out:
+	return ret;
+}
+
+static void pt_event_read(struct perf_event *event)
+{
+}
+
+static int pt_event_init(struct perf_event *event)
+{
+	if (event->attr.type != pt_pmu.itrace.pmu.type)
+		return -ENOENT;
+
+	if (!pt_event_valid(event))
+		return -EINVAL;
+
+	return 0;
+}
+
+static __init int pt_init(void)
+{
+	int ret, cpu;
+
+	BUILD_BUG_ON(sizeof(struct topa) > PAGE_SIZE);
+	get_online_cpus();
+	for_each_possible_cpu(cpu) {
+		raw_spin_lock_init(&per_cpu(pt_ctx, cpu).lock);
+	}
+	put_online_cpus();
+
+	ret = pt_pmu_hw_init();
+	if (ret)
+		return ret;
+
+	pt_pmu.itrace.pmu.attr_groups	= pt_attr_groups;
+	pt_pmu.itrace.pmu.task_ctx_nr	= perf_hw_context;
+	pt_pmu.itrace.pmu.event_init	= pt_event_init;
+	pt_pmu.itrace.pmu.add		= pt_event_add;
+	pt_pmu.itrace.pmu.del		= pt_event_del;
+	pt_pmu.itrace.pmu.start		= pt_event_start;
+	pt_pmu.itrace.pmu.stop		= pt_event_stop;
+	pt_pmu.itrace.pmu.read		= pt_event_read;
+	pt_pmu.itrace.alloc_buffer	= pt_buffer_itrace_alloc;
+	pt_pmu.itrace.free_buffer	= pt_buffer_itrace_free;
+	pt_pmu.itrace.name		= "intel_pt";
+	ret = itrace_pmu_register(&pt_pmu.itrace);
+
+	return ret;
+}
+
+module_init(pt_init);
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v1 08/11] x86: perf: intel_pt: Add sampling functionality
  2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
                   ` (6 preceding siblings ...)
  2014-02-06 10:50 ` [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
@ 2014-02-06 10:50 ` Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 09/11] x86: perf: intel_pt: Add core dump functionality Alexander Shishkin
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

Intel Processor Trace (PT) data can be used in perf event samples to annotate
other perf events. This patch implements the itrace sampling related hooks that
configure and output sampling data to the perf stream. Users will need to
include the PERF_SAMPLE_ITRACE flag in the attr.sample_type mask and specify
PT's PMU type in the attr.itrace_sample_type field.
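
For illustration, setting up such a sampling consumer from userspace could look
roughly like the sketch below. Note that PERF_SAMPLE_ITRACE and the
attr.itrace_* fields are introduced by this patchset and are not upstream ABI,
and intel_pt_type stands for the PMU type read from
/sys/bus/event_source/devices/intel_pt/type:

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_annotated_event(unsigned int intel_pt_type)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	/* attach itrace data to every sample of this event */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ITRACE;
	attr.itrace_sample_type = intel_pt_type;	/* PT's PMU type */
	attr.itrace_sample_size = 8192;			/* bytes of trace per sample */

	/* current task, any cpu */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}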

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_pt.c | 115 ++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index b6b1a84..af1482d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -946,17 +946,130 @@ static void pt_event_read(struct perf_event *event)
 {
 }
 
+typedef unsigned int (*pt_copyfn)(void *data, const void *src,
+				  unsigned int len);
+
+/**
+ * pt_buffer_output() - copy part of pt buffer to perf stream
+ * @buf: buffer to copy from
+ * @from: initial offset
+ * @to: final offset
+ * @copyfn: function that copies data out (like perf_output_copy())
+ * @data: data to be passed on to the copy function (like perf_output_handle)
+ */
+static int pt_buffer_output(struct pt_buffer *buf, unsigned long from,
+			    unsigned long to, pt_copyfn copyfn, void *data)
+{
+	unsigned long tocopy;
+	unsigned int len = 0, remainder;
+	void *page;
+
+	do {
+		tocopy = PAGE_SIZE - offset_in_page(from);
+		if (to > from)
+			tocopy = min(tocopy, to - from);
+		if (!tocopy)
+			break;
+
+		page = pt_buffer_get_page(buf, from >> PAGE_SHIFT);
+		if (WARN_ONCE(!page, "no data page for %lx offset\n", from))
+			break;
+
+		page += offset_in_page(from);
+
+		remainder = copyfn(data, page, tocopy);
+		if (remainder)
+			return -EFAULT;
+
+		len += tocopy;
+		from += tocopy;
+		if (from == buf->size)
+			from = 0;
+	} while (to != from);
+	return len;
+}
+
 static int pt_event_init(struct perf_event *event)
 {
 	if (event->attr.type != pt_pmu.itrace.pmu.type)
 		return -ENOENT;
 
+	/* an event can't both be an itrace event and sample itrace data */
+	if (event->attr.sample_type & PERF_SAMPLE_ITRACE)
+		return -ENOENT;
+
 	if (!pt_event_valid(event))
 		return -EINVAL;
 
 	return 0;
 }
 
+static unsigned long pt_trace_sampler_trace(struct perf_event *event,
+					    struct perf_sample_data *data)
+{
+	struct pt_buffer *buf;
+
+	pt_event_stop(event, 0);
+
+	buf = itrace_event_get_priv(event);
+	if (!buf) {
+		data->trace.size = 0;
+		goto out;
+	}
+
+	pt_read_offset(buf);
+	pt_update_head(buf);
+
+	data->trace.to = local64_read(&buf->head);
+
+	if (data->trace.to < event->attr.itrace_sample_size)
+		data->trace.from = buf->size + data->trace.to -
+			event->attr.itrace_sample_size;
+	else
+		data->trace.from = data->trace.to -
+			event->attr.itrace_sample_size;
+	data->trace.size = ALIGN(event->attr.itrace_sample_size, sizeof(u64));
+
+	itrace_event_put(event);
+
+out:
+	if (!data->trace.size)
+		pt_event_start(event, 0);
+
+	return data->trace.size;
+}
+
+static void pt_trace_sampler_output(struct perf_event *event,
+				    struct perf_output_handle *handle,
+				    struct perf_sample_data *data)
+{
+	unsigned long padding;
+	struct pt_buffer *buf;
+	int ret;
+
+	buf = itrace_event_get_priv(event);
+	if (!buf)
+		return;
+
+	ret = pt_buffer_output(buf, data->trace.from, data->trace.to,
+			       (pt_copyfn)perf_output_copy, handle);
+	itrace_event_put(event);
+	if (ret < 0) {
+		pr_warn("%s: failed to copy trace data\n", __func__);
+		goto out;
+	}
+
+	padding = data->trace.size - ret;
+	if (padding) {
+		u64 u = 0;
+
+		perf_output_copy(handle, &u, padding);
+	}
+
+out:
+	pt_event_start(event, 0);
+}
+
 static __init int pt_init(void)
 {
 	int ret, cpu;
@@ -982,6 +1095,8 @@ static __init int pt_init(void)
 	pt_pmu.itrace.pmu.read		= pt_event_read;
 	pt_pmu.itrace.alloc_buffer	= pt_buffer_itrace_alloc;
 	pt_pmu.itrace.free_buffer	= pt_buffer_itrace_free;
+	pt_pmu.itrace.sample_trace	= pt_trace_sampler_trace;
+	pt_pmu.itrace.sample_output	= pt_trace_sampler_output;
 	pt_pmu.itrace.name		= "intel_pt";
 	ret = itrace_pmu_register(&pt_pmu.itrace);
 
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v1 09/11] x86: perf: intel_pt: Add core dump functionality
  2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
                   ` (7 preceding siblings ...)
  2014-02-06 10:50 ` [PATCH v1 08/11] x86: perf: intel_pt: Add sampling functionality Alexander Shishkin
@ 2014-02-06 10:50 ` Alexander Shishkin
  2014-02-06 20:36   ` Andi Kleen
  2014-02-06 23:59   ` Andi Kleen
  2014-02-06 10:50 ` [PATCH v1 10/11] x86: perf: intel_bts: Add BTS PMU driver Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 11/11] x86: perf: intel_bts: Add core dump related functionality Alexander Shishkin
  10 siblings, 2 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

Intel Processor Trace (PT) data can be used in process core dumps. This is
done by implementing itrace core dump related hooks that configure and
output trace data to a core file. The driver will also include the list of
PT capabilities in the itrace core dump notes so that the decoder can make
assumptions about the binary stream.
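
To sketch what a consumer of that note might do (illustrative only; the
capability names are whatever pt_cap_string() below puts into the
comma-separated "name:hexvalue" pairs):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void parse_pt_caps(char *capstr)
{
	char *save = NULL, *pair;

	for (pair = strtok_r(capstr, ",", &save); pair;
	     pair = strtok_r(NULL, ",", &save)) {
		char *colon = strchr(pair, ':');

		if (!colon)
			continue;
		*colon = 0;
		printf("%s = %lx\n", pair, strtoul(colon + 1, NULL, 16));
	}
}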

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/intel_pt.h            |  2 +
 arch/x86/kernel/cpu/perf_event_intel_pt.c | 74 +++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
index dd69092..befde1f 100644
--- a/arch/x86/kernel/cpu/intel_pt.h
+++ b/arch/x86/kernel/cpu/intel_pt.h
@@ -84,6 +84,8 @@ enum pt_capabilities {
 struct pt_pmu {
 	struct itrace_pmu	itrace;
 	u32			caps[4 * PT_CPUID_LEAVES];
+	char			*capstr;
+	unsigned int		caplen;
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index af1482d..cb03594 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -24,6 +24,7 @@
 #include <linux/slab.h>
 #include <linux/debugfs.h>
 #include <linux/device.h>
+#include <linux/coredump.h>
 
 #include <asm-generic/sizes.h>
 #include <asm/perf_event.h>
@@ -88,6 +89,34 @@ static void pt_cap_set(enum pt_capabilities cap, u32 val)
 	pt_pmu.caps[idx] = (val << shift) & cd->mask;
 }
 
+/**
+ * pt_cap_string - format PT capabilities into an ascii string
+ *
+ * We need to include PT capabilities in the core dump note so that the
+ * decoder knows what to expect in the binary stream.
+ */
+static void pt_cap_string(void)
+{
+	char *capstr;
+	int pos, i;
+
+	capstr = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!capstr)
+		return;
+
+	for (i = 0, pos = 0; i < ARRAY_SIZE(pt_caps) && pos < PAGE_SIZE; i++) {
+		pos += snprintf(&capstr[pos], PAGE_SIZE - pos, "%s:%x%c",
+				pt_caps[i].name, pt_cap_get(i),
+				i == ARRAY_SIZE(pt_caps) - 1 ? 0 : ',');
+	}
+
+	if (pt_pmu.capstr)
+		kfree(pt_pmu.capstr);
+
+	pt_pmu.capstr = capstr;
+	pt_pmu.caplen = pos;
+}
+
 static ssize_t pt_cap_show(struct device *cdev,
 			   struct device_attribute *attr,
 			   char *buf)
@@ -114,6 +143,7 @@ static ssize_t pt_cap_store(struct device *cdev,
 		return -EINVAL;
 
 	pt_cap_set(cap, new);
+	pt_cap_string();
 	return size;
 }
 
@@ -179,6 +209,7 @@ static int __init pt_pmu_hw_init(void)
 		attrs[i] = &de_attrs[i].attr.attr;
 	}
 
+	pt_cap_string();
 	pt_cap_group.attrs = attrs;
 	return 0;
 
@@ -1070,6 +1101,46 @@ out:
 	pt_event_start(event, 0);
 }
 
+static size_t pt_trace_core_size(struct perf_event *event)
+{
+	return pt_pmu.caplen;
+}
+
+static unsigned int pt_core_copy(void *data, const void *src,
+				 unsigned int len)
+{
+	struct coredump_params *cprm = data;
+
+	if (dump_emit(cprm, src, len))
+		return 0;
+
+	return len;
+}
+
+static void pt_trace_core_output(struct coredump_params *cprm,
+				 struct perf_event *event,
+				 unsigned long len)
+{
+	struct pt_buffer *buf;
+	u64 from, to;
+	int ret;
+
+	buf = itrace_priv(event);
+
+	if (!dump_emit(cprm, pt_pmu.capstr, pt_pmu.caplen))
+		return;
+
+	to = local64_read(&buf->head);
+	if (to < len)
+		from = buf->size + to - len;
+	else
+		from = to - len;
+
+	ret = pt_buffer_output(buf, from, to, pt_core_copy, cprm);
+	if (ret < 0)
+		pr_warn("%s: failed to copy trace data\n", __func__);
+}
+
 static __init int pt_init(void)
 {
 	int ret, cpu;
@@ -1097,6 +1168,9 @@ static __init int pt_init(void)
 	pt_pmu.itrace.free_buffer	= pt_buffer_itrace_free;
 	pt_pmu.itrace.sample_trace	= pt_trace_sampler_trace;
 	pt_pmu.itrace.sample_output	= pt_trace_sampler_output;
+	pt_pmu.itrace.core_size		= pt_trace_core_size;
+	pt_pmu.itrace.core_output	= pt_trace_core_output;
+	pt_pmu.itrace.coredump_config	= RTIT_CTL_TSC_EN | RTIT_CTL_DISRETC;
 	pt_pmu.itrace.name		= "intel_pt";
 	ret = itrace_pmu_register(&pt_pmu.itrace);
 
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v1 10/11] x86: perf: intel_bts: Add BTS PMU driver
  2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
                   ` (8 preceding siblings ...)
  2014-02-06 10:50 ` [PATCH v1 09/11] x86: perf: intel_pt: Add core dump functionality Alexander Shishkin
@ 2014-02-06 10:50 ` Alexander Shishkin
  2014-02-06 10:50 ` [PATCH v1 11/11] x86: perf: intel_bts: Add core dump related functionality Alexander Shishkin
  10 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

Add support for Branch Trace Store (BTS) via the kernel perf/itrace event
infrastructure. The difference from the existing implementation of BTS
support is that this one is a separate PMU that exports events' trace
buffers to userspace the same way as the Intel PT PMU does. The immediate
benefits are that the buffer size can be much bigger, resulting in fewer
interrupts, and that no kernel-side copying is involved. Also, tracing of
the kernel code is possible. Additionally, it is now possible to include
BTS traces in process core dumps.

The old way of collecting BTS traces still works.
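
For illustration, opening a BTS event through this interface could look
roughly like the sketch below (same includes as any perf_event_open() user);
attr.itrace_config is a field added by this patchset, bit 1 of it corresponds
to the "tsc" format attribute exported by the driver, and intel_bts_type
stands for the PMU type read from
/sys/bus/event_source/devices/intel_bts/type:

static int open_bts_event(unsigned int intel_bts_type)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = intel_bts_type;		/* dynamically assigned PMU type */
	attr.itrace_config = 1UL << 1;		/* "tsc": timestamps for the decoder */
	attr.exclude_kernel = 1;		/* userspace branches only */

	/* current task, any cpu; the trace buffer is mmap()ed separately */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}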

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/Makefile               |   2 +-
 arch/x86/kernel/cpu/perf_event.h           |   6 +
 arch/x86/kernel/cpu/perf_event_intel.c     |   6 +-
 arch/x86/kernel/cpu/perf_event_intel_bts.c | 478 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   3 +-
 5 files changed, 492 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_bts.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index cb69de3..29f7f32 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -37,7 +37,7 @@ endif
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_rapl.o
-obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_pt.o perf_event_intel_bts.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index c1a8618..00b1ffb 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -702,6 +702,12 @@ void intel_pmu_lbr_init_snb(void);
 
 int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
+int intel_bts_interrupt(void);
+
+void intel_bts_enable_local(void);
+
+void intel_bts_disable_local(void);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 28b5023..e447972 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1052,6 +1052,8 @@ static void intel_pmu_disable_all(void)
 
 	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
 		intel_pmu_disable_bts();
+	else
+		intel_bts_disable_local();
 
 	intel_pmu_pebs_disable_all();
 	intel_pmu_lbr_disable_all();
@@ -1074,7 +1076,8 @@ static void intel_pmu_enable_all(int added)
 			return;
 
 		intel_pmu_enable_bts(event->hw.config);
-	}
+	} else
+		intel_bts_enable_local();
 }
 
 /*
@@ -1362,6 +1365,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 		apic_write(APIC_LVTPC, APIC_DM_NMI);
 	intel_pmu_disable_all();
 	handled = intel_pmu_drain_bts_buffer();
+	handled += intel_bts_interrupt();
 	status = intel_pmu_get_status();
 	if (!status) {
 		intel_pmu_enable_all(0);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_bts.c b/arch/x86/kernel/cpu/perf_event_intel_bts.c
new file mode 100644
index 0000000..0a08969
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_bts.c
@@ -0,0 +1,478 @@
+/*
+ * BTS PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ */
+
+#undef DEBUG
+
+#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/debugfs.h>
+#include <linux/device.h>
+#include <linux/coredump.h>
+#include <linux/itrace.h>
+
+#include <asm-generic/sizes.h>
+#include <asm/perf_event.h>
+
+#include "perf_event.h"
+
+static struct dentry *bts_dir_dent;
+static struct dentry *bts_poison_dent;
+
+static u32 poison;
+
+struct bts_ctx {
+	raw_spinlock_t		lock;
+	struct perf_event	*event;
+	struct debug_store	ds_back;
+};
+
+static DEFINE_PER_CPU(struct bts_ctx, bts_ctx);
+
+#define BTS_RECORD_SIZE		24
+
+struct bts_buffer {
+	void		*buf;
+	void		**data_pages;
+	size_t		size;		/* multiple of PAGE_SIZE */
+	size_t		real_size;	/* multiple of BTS_RECORD_SIZE */
+	unsigned long	round;
+	unsigned long	index;
+	unsigned long	watermark;
+	bool		snapshot;
+	local64_t	head;
+	struct perf_event_mmap_page	*user_page;
+};
+
+struct itrace_pmu bts_pmu;
+
+void intel_pmu_enable_bts(u64 config);
+void intel_pmu_disable_bts(void);
+
+/* add tsc to the bts buffer for the benefit of the decoder */
+#define BTS_SYNTH_TSC	BIT(1)
+#define BTS_CONFIG_MASK	BTS_SYNTH_TSC
+
+PMU_FORMAT_ATTR(tsc,		"itrace_config:1"	);
+
+static struct attribute *bts_formats_attr[] = {
+	&format_attr_tsc.attr,
+	NULL,
+};
+
+static struct attribute_group bts_format_group = {
+	.name	= "format",
+	.attrs	= bts_formats_attr,
+};
+
+static const struct attribute_group *bts_attr_groups[] = {
+	&bts_format_group,
+	NULL,
+};
+
+static void *
+bts_buffer_itrace_alloc(int cpu, int nr_pages, bool overwrite, void **pages,
+			struct perf_event_mmap_page **user_page)
+{
+	struct bts_buffer *buf;
+	struct page *up = NULL, *page;
+	int node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+	size_t size = nr_pages << PAGE_SHIFT;
+	int i, order;
+
+	if (!is_power_of_2(nr_pages))
+		return NULL;
+
+	buf = kzalloc(sizeof(struct bts_buffer), GFP_KERNEL);
+	if (!buf)
+		return NULL;
+
+	buf->snapshot = overwrite;
+
+	buf->size = size;
+	buf->real_size = size - size % BTS_RECORD_SIZE;
+	order = get_order(buf->size);
+
+	if (user_page) {
+		*user_page = NULL;
+		up = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+		if (!up)
+			goto err_buf;
+	}
+
+	buf->data_pages = pages;
+
+	page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY, order);
+	if (!page)
+		goto err_up;
+
+	buf->buf = page_address(page);
+	split_page(page, order);
+
+	for (i = 0; i < nr_pages; i++)
+		buf->data_pages[i] = buf->buf + PAGE_SIZE * i;
+
+	if (!overwrite)
+		buf->watermark = buf->real_size / 2;
+	if (user_page) {
+		buf->user_page = page_address(up);
+		*user_page = page_address(up);
+	}
+
+	return buf;
+
+err_up:
+	__free_page(up);
+err_buf:
+	kfree(buf);
+
+	return NULL;
+}
+
+static void bts_buffer_itrace_free(void *data)
+{
+	struct bts_buffer *buf = data;
+	int i;
+
+	for (i = 0; i < buf->size >> PAGE_SHIFT; i++) {
+		struct page *page = virt_to_page(buf->data_pages[i]);
+		page->mapping = NULL;
+		__free_page(page);
+	}
+	if (buf->user_page) {
+		struct page *up = virt_to_page(buf->user_page);
+
+		up->mapping = NULL;
+		__free_page(up);
+	}
+
+	kfree(buf);
+}
+
+static void
+bts_config_buffer(int cpu, void *buf, size_t size, unsigned long thresh,
+		  unsigned long index)
+{
+	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+
+	ds->bts_buffer_base = (u64)buf;
+	ds->bts_index = ds->bts_buffer_base + index;
+	ds->bts_absolute_maximum = ds->bts_buffer_base + size;
+	ds->bts_interrupt_threshold = thresh
+		? ds->bts_buffer_base + thresh - 0x180 /* arbitrary */
+		: ds->bts_absolute_maximum + BTS_RECORD_SIZE;
+}
+
+static bool bts_buffer_is_full(struct bts_buffer *buf)
+{
+	unsigned long tailoff, headoff = local64_read(&buf->head);
+
+	if (buf->snapshot)
+		return false;
+
+	tailoff = ACCESS_ONCE(buf->user_page->data_tail);
+	smp_mb();
+
+	if (headoff <= tailoff || headoff - tailoff < buf->real_size)
+		return false;
+
+	return true;
+}
+
+static void bts_wake_up(struct perf_event *event)
+{
+	struct bts_buffer *buf = itrace_priv(event);
+
+	if (!buf || buf->snapshot)
+		return;
+	if (bts_buffer_is_full(buf)) {
+		event->pending_disable = 1;
+		event->pending_kill = POLL_IN;
+		event->pending_wakeup = 1;
+		event->hw.state = PERF_HES_STOPPED;
+	}
+
+	if (event->pending_disable || event->pending_kill)
+		itrace_wake_up(event);
+}
+
+static void bts_update(struct perf_event *event)
+{
+	int cpu = raw_smp_processor_id();
+	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	struct bts_buffer *buf = itrace_priv(event);
+	unsigned long index = ds->bts_index - ds->bts_buffer_base;
+	int lost = 0;
+
+	if (WARN_ONCE(!buf, "no buffer\n"))
+		return;
+
+	smp_wmb();
+	if (buf->snapshot)
+		local64_set(&buf->head, index);
+	else {
+		if (index >= buf->real_size) {
+			buf->round++;
+			index = 0;
+			lost++;
+		}
+
+		local64_set(&buf->head, buf->round * buf->real_size + index);
+		if (lost)
+			itrace_lost_data(event, local64_read(&buf->head));
+	}
+
+	if (buf->user_page) {
+		buf->user_page->data_head = local64_read(&buf->head);
+		smp_wmb();
+	}
+}
+
+static void bts_timestamp(struct perf_event *event)
+{
+	struct debug_store *ds = __get_cpu_var(cpu_hw_events).ds;
+	u64 tsc, *wp = (void *)ds->bts_index;
+
+	rdtscll(tsc);
+	*wp++ = 0xffffffffull;
+	*wp++ = tsc;
+	*wp++ = 1;
+	ds->bts_index += BTS_RECORD_SIZE;
+	bts_update(event);
+	bts_wake_up(event);
+}
+
+static void bts_event_start(struct perf_event *event, int flags)
+{
+	struct bts_buffer *buf = itrace_priv(event);
+	int cpu = raw_smp_processor_id();
+	unsigned long index, thresh = 0;
+	u64 config = 0;
+
+	if (!buf) {
+		event->hw.state = PERF_HES_STOPPED;
+		return;
+	}
+
+	event->hw.state = 0;
+
+	if (!buf->snapshot)
+		config |= ARCH_PERFMON_EVENTSEL_INT;
+	if (!event->attr.exclude_kernel)
+		config |= ARCH_PERFMON_EVENTSEL_OS;
+	if (!event->attr.exclude_user)
+		config |= ARCH_PERFMON_EVENTSEL_USR;
+
+	index = local64_read(&buf->head) % buf->real_size;
+	if (buf->watermark)
+		thresh = ((index + buf->watermark) / buf->watermark) * buf->watermark;
+	else
+		thresh = buf->real_size;
+
+	bts_config_buffer(cpu, buf->buf, thresh, buf->snapshot ? 0 : thresh,
+			  index);
+
+	if (event->attr.itrace_config & BTS_SYNTH_TSC) {
+		bts_timestamp(event);
+		if (event->hw.state == PERF_HES_STOPPED)
+			return;
+	}
+
+	wmb();
+
+	intel_pmu_enable_bts(config);
+}
+
+static void bts_event_stop(struct perf_event *event, int flags)
+{
+	if (event->hw.state == PERF_HES_STOPPED)
+		return;
+
+	event->hw.state = PERF_HES_STOPPED;
+	intel_pmu_disable_bts();
+
+	if (flags & PERF_EF_UPDATE) {
+		bts_update(event);
+		bts_wake_up(event);
+	}
+}
+
+void intel_bts_enable_local(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	if (bts->event)
+		bts_event_start(bts->event, 0);
+}
+
+void intel_bts_disable_local(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	if (bts->event)
+		bts_event_stop(bts->event, 0);
+}
+
+int intel_bts_interrupt(void)
+{
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct bts_buffer *buf;
+	s64 old_head;
+
+	if (!bts->event)
+		return 0;
+
+	buf = itrace_priv(bts->event);
+	if (WARN_ONCE(!buf, "no buffer"))
+		return 0;
+
+	old_head = local64_read(&buf->head);
+	bts_update(bts->event);
+	if (old_head != local64_read(&buf->head)) {
+		bts_wake_up(bts->event);
+		return 1;
+	}
+
+	return 0;
+}
+
+static void bts_event_del(struct perf_event *event, int flags)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+
+	bts_event_stop(event, PERF_EF_UPDATE);
+
+	raw_spin_lock(&bts->lock);
+	bts->event = NULL;
+	cpuc->ds->bts_index = bts->ds_back.bts_buffer_base;
+	cpuc->ds->bts_buffer_base = bts->ds_back.bts_buffer_base;
+	cpuc->ds->bts_absolute_maximum = bts->ds_back.bts_absolute_maximum;
+	cpuc->ds->bts_interrupt_threshold = bts->ds_back.bts_interrupt_threshold;
+	raw_spin_unlock(&bts->lock);
+
+	itrace_event_put(event);
+}
+
+static int bts_event_add(struct perf_event *event, int flags)
+{
+	struct bts_buffer *buf;
+	struct bts_ctx *bts = this_cpu_ptr(&bts_ctx);
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct hw_perf_event *hwc = &event->hw;
+	int ret = 0;
+
+	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
+		hwc->state = PERF_HES_STOPPED;
+		return -EINVAL;
+	}
+
+	buf = itrace_event_get_priv(event);
+	if (!buf) {
+		hwc->state = PERF_HES_STOPPED;
+		return -EINVAL;
+	}
+
+	raw_spin_lock(&bts->lock);
+	if (bts->event) {
+		raw_spin_unlock(&bts->lock);
+		itrace_event_put(event);
+		ret = -EBUSY;
+		event->hw.state = PERF_HES_STOPPED;
+		goto out;
+	}
+
+	bts->event = event;
+	bts->ds_back.bts_buffer_base = cpuc->ds->bts_buffer_base;
+	bts->ds_back.bts_absolute_maximum = cpuc->ds->bts_absolute_maximum;
+	bts->ds_back.bts_interrupt_threshold = cpuc->ds->bts_interrupt_threshold;
+	raw_spin_unlock(&bts->lock);
+
+	hwc->state = !(flags & PERF_EF_START);
+	if (!hwc->state) {
+		bts_event_start(event, 0);
+		if (hwc->state == PERF_HES_STOPPED) {
+			bts_event_del(event, 0);
+			bts_wake_up(event);
+			ret = -EBUSY;
+		}
+	}
+
+out:
+	return ret;
+}
+
+static int bts_event_init(struct perf_event *event)
+{
+	u64 config = event->attr.itrace_config;
+
+	if (event->attr.type != bts_pmu.pmu.type)
+		return -ENOENT;
+
+	if ((config & BTS_CONFIG_MASK) != config)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void bts_event_read(struct perf_event *event)
+{
+}
+
+static __init int bts_init(void)
+{
+	int ret, cpu;
+
+	if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
+		return -ENODEV;
+
+	get_online_cpus();
+	for_each_possible_cpu(cpu) {
+		raw_spin_lock_init(&per_cpu(bts_ctx, cpu).lock);
+	}
+	put_online_cpus();
+
+	bts_pmu.pmu.attr_groups		= bts_attr_groups;
+	bts_pmu.pmu.task_ctx_nr		= perf_hw_context;
+	bts_pmu.pmu.event_init		= bts_event_init;
+	bts_pmu.pmu.add			= bts_event_add;
+	bts_pmu.pmu.del			= bts_event_del;
+	bts_pmu.pmu.start		= bts_event_start;
+	bts_pmu.pmu.stop		= bts_event_stop;
+	bts_pmu.pmu.read		= bts_event_read;
+	bts_pmu.alloc_buffer		= bts_buffer_itrace_alloc;
+	bts_pmu.free_buffer		= bts_buffer_itrace_free;
+	bts_pmu.name			= "intel_bts";
+
+	ret = itrace_pmu_register(&bts_pmu);
+	if (ret)
+		return ret;
+
+	bts_dir_dent = debugfs_create_dir("intel_bts", NULL);
+	bts_poison_dent = debugfs_create_bool("poison", S_IRUSR | S_IWUSR,
+					      bts_dir_dent, &poison);
+
+	if (IS_ERR(bts_dir_dent) || IS_ERR(bts_poison_dent))
+		pr_warn("Can't create debugfs entries.\n");
+
+	return 0;
+}
+
+module_init(bts_init);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ae96cfa..21f799f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -444,7 +444,8 @@ void intel_pmu_enable_bts(u64 config)
 
 	debugctlmsr |= DEBUGCTLMSR_TR;
 	debugctlmsr |= DEBUGCTLMSR_BTS;
-	debugctlmsr |= DEBUGCTLMSR_BTINT;
+	if (config & ARCH_PERFMON_EVENTSEL_INT)
+		debugctlmsr |= DEBUGCTLMSR_BTINT;
 
 	if (!(config & ARCH_PERFMON_EVENTSEL_OS))
 		debugctlmsr |= DEBUGCTLMSR_BTS_OFF_OS;
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v1 11/11] x86: perf: intel_bts: Add core dump related functionality
  2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
                   ` (9 preceding siblings ...)
  2014-02-06 10:50 ` [PATCH v1 10/11] x86: perf: intel_bts: Add BTS PMU driver Alexander Shishkin
@ 2014-02-06 10:50 ` Alexander Shishkin
  2014-02-06 23:57   ` Andi Kleen
  10 siblings, 1 reply; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-06 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming, Alexander Shishkin

BTS data can be used in process core dumps. This patch implements itrace
core dump related hooks that will configure and output BTS traces into
a core file.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_bts.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_bts.c b/arch/x86/kernel/cpu/perf_event_intel_bts.c
index 0a08969..20a19b2 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_bts.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_bts.c
@@ -436,6 +436,26 @@ static void bts_event_read(struct perf_event *event)
 {
 }
 
+static size_t bts_trace_core_size(struct perf_event *event)
+{
+	return 0;
+}
+
+static void bts_trace_core_output(struct coredump_params *cprm,
+				  struct perf_event *event,
+				  unsigned long len)
+{
+	struct bts_buffer *buf = itrace_event_get_priv(event);
+	u64 head = local64_read(&buf->head);
+
+	if (head < len) {
+		dump_emit(cprm, buf->buf + head, buf->real_size - head);
+		dump_emit(cprm, buf->buf, len - buf->real_size + head);
+	} else
+		dump_emit(cprm, buf->buf + head - len, len);
+	itrace_event_put(event);
+}
+
 static __init int bts_init(void)
 {
 	int ret, cpu;
@@ -459,6 +479,8 @@ static __init int bts_init(void)
 	bts_pmu.pmu.read		= bts_event_read;
 	bts_pmu.alloc_buffer		= bts_buffer_itrace_alloc;
 	bts_pmu.free_buffer		= bts_buffer_itrace_free;
+	bts_pmu.core_size		= bts_trace_core_size;
+	bts_pmu.core_output		= bts_trace_core_output;
 	bts_pmu.name			= "intel_bts";
 
 	ret = itrace_pmu_register(&bts_pmu);
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver
  2014-02-06 10:50 ` [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
@ 2014-02-06 20:29   ` Andi Kleen
  2014-02-17 14:44   ` Peter Zijlstra
  2014-02-17 14:46   ` Peter Zijlstra
  2 siblings, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2014-02-06 20:29 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Adrian Hunter,
	Matt Fleming

> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index 0fa4f24..28b5023 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1312,6 +1312,8 @@ int intel_pmu_save_and_restart(struct perf_event *event)
>  	return x86_perf_event_set_period(event);
>  }
>  
> +void intel_pt_interrupt(void);

Should be in $(pwd)/perf_event.h

> diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
> new file mode 100644
> index 0000000..b6b1a84
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
> @@ -0,0 +1,991 @@
> +/*
> + * Intel(R) Processor Trace PMU driver for perf
> + * Copyright (c) 2013-2014, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc.,
> + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.

Remove the address, and add a pointer to the specification

Similar in the other files.

> +/*
> + * Capabilities of Intel PT hardware, such as number of address bits or
> + * supported output schemes, are cached and exported to userspace as "caps"
> + * attribute group of pt pmu device
> + * (/sys/bus/event_source/devices/intel_pt/caps/) so that userspace can store
> + * relevant bits together with intel_pt traces.
> + *
> + * Currently, for debugging purposes, these attributes are also writable; this
> + * should be removed in the final version.

Already remove that code?

> +{
> +	u64 reg;
> +
> +	reg = RTIT_CTL_TOPA | RTIT_CTL_BRANCH_EN;
> +
> +	if (!event->attr.exclude_kernel)
> +		reg |= RTIT_CTL_OS;
> +	if (!event->attr.exclude_user)
> +		reg |= RTIT_CTL_USR;
> +
> +	reg |= (event->attr.itrace_config & PT_CONFIG_MASK);
> +
> +	if (wrmsr_safe(MSR_IA32_RTIT_CTL, reg, 0) < 0) {
> +		pr_warn("Failed to enable PT on cpu %d\n", event->cpu);

Should rate limit this warning
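
For instance, roughly (using the stock ratelimited printk helper):

	pr_warn_ratelimited("Failed to enable PT on cpu %d\n", event->cpu);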

> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static void pt_config_start(bool start)
> +{
> +	u64 ctl;
> +
> +	rdmsrl(MSR_IA32_RTIT_CTL, ctl);

Should bail out here if someone else already started (e.g. hardware debugger)
The read needs to be moved to before we overwrite other MSRs

> +	if (start)
> +		ctl |= RTIT_CTL_TRACEEN;
> +	else
> +		ctl &= ~RTIT_CTL_TRACEEN;
> +	wrmsrl(MSR_IA32_RTIT_CTL, ctl);


> +
> +/**
> + * pt_handle_status - take care of possible status conditions
> + * @event: currently active PT event
> + */
> +static void pt_handle_status(struct perf_event *event)
> +{
> +	struct pt_buffer *buf = itrace_priv(event);
> +	int advance = 0;
> +	u64 status;
> +
> +	rdmsrl(MSR_IA32_RTIT_STATUS, status);
> +
> +	if (status & RTIT_STATUS_ERROR) {
> +		pr_err("ToPA ERROR encountered, trying to recover\n");

Add perf: prefix here (or better redefine pr_fmt at the beginning) 
Should be rate limited
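
For instance, roughly (prefix choice illustrative):

	/* at the top of the file, before the #includes */
	#define pr_fmt(fmt) "perf: " fmt

and then:

	pr_err_ratelimited("ToPA ERROR encountered, trying to recover\n");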

> +static struct pt_buffer *pt_buffer_alloc(int cpu, size_t size,
> +					 unsigned long watermark,
> +					 bool snapshot, gfp_t gfp,
> +					 void **pages)
> +{
> +	struct pt_buffer *buf;
> +	int node;
> +	int ret;
> +
> +	if (!size || watermark << PAGE_SHIFT > size)
> +		return NULL;
> +
> +	if (cpu == -1)
> +		cpu = raw_smp_processor_id();
> +	node = cpu_to_node(cpu);
> +
> +	buf = kzalloc(sizeof(struct pt_buffer), gfp);
> +	if (!buf)
> +		return NULL;

Should be kzalloc_node() 
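
That is, presumably something along the lines of:

	buf = kzalloc_node(sizeof(struct pt_buffer), gfp, node);
	if (!buf)
		return NULL;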

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 09/11] x86: perf: intel_pt: Add core dump functionality
  2014-02-06 10:50 ` [PATCH v1 09/11] x86: perf: intel_pt: Add core dump functionality Alexander Shishkin
@ 2014-02-06 20:36   ` Andi Kleen
  2014-02-07  9:03     ` Alexander Shishkin
  2014-02-06 23:59   ` Andi Kleen
  1 sibling, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2014-02-06 20:36 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Adrian Hunter,
	Matt Fleming

>  			   char *buf)
> @@ -114,6 +143,7 @@ static ssize_t pt_cap_store(struct device *cdev,
>  		return -EINVAL;
>  
>  	pt_cap_set(cap, new);
> +	pt_cap_string();

Don't we need some lock here? Otherwise it may leak memory with racing writes
and become inconsistent.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 11/11] x86: perf: intel_bts: Add core dump related functionality
  2014-02-06 10:50 ` [PATCH v1 11/11] x86: perf: intel_bts: Add core dump related functionality Alexander Shishkin
@ 2014-02-06 23:57   ` Andi Kleen
  2014-02-07  9:02     ` Alexander Shishkin
  0 siblings, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2014-02-06 23:57 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Adrian Hunter,
	Matt Fleming

Alexander Shishkin <alexander.shishkin@linux.intel.com> writes:

> BTS data can be used in process core dumps. This patch implements itrace
> core dump related hooks that will configure and output BTS traces into
> a core file.

Don't we need a different note number here? 

Otherwise how should the debugger know if it's BTS or PT.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 09/11] x86: perf: intel_pt: Add core dump functionality
  2014-02-06 10:50 ` [PATCH v1 09/11] x86: perf: intel_pt: Add core dump functionality Alexander Shishkin
  2014-02-06 20:36   ` Andi Kleen
@ 2014-02-06 23:59   ` Andi Kleen
  2014-02-07  9:09     ` Alexander Shishkin
  1 sibling, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2014-02-06 23:59 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Adrian Hunter,
	Matt Fleming

Alexander Shishkin <alexander.shishkin@linux.intel.com> writes:
> +
> +static void pt_trace_core_output(struct coredump_params *cprm,
> +				 struct perf_event *event,
> +				 unsigned long len)
> +{
> +	struct pt_buffer *buf;
> +	u64 from, to;
> +	int ret;
> +
> +	buf = itrace_priv(event);
> +
> +	if (!dump_emit(cprm, pt_pmu.capstr, pt_pmu.caplen))
> +		return;

It would be nicer if this was a separate note, instead of just being
concatenated with the rest of the data.

Would make simpler parsing and be cleaner.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 11/11] x86: perf: intel_bts: Add core dump related functionality
  2014-02-06 23:57   ` Andi Kleen
@ 2014-02-07  9:02     ` Alexander Shishkin
  0 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-07  9:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Adrian Hunter,
	Matt Fleming

Andi Kleen <andi@firstfloor.org> writes:

> Alexander Shishkin <alexander.shishkin@linux.intel.com> writes:
>
>> BTS data can be used in process core dumps. This patch implements itrace
>> core dump related hooks that will configure and output BTS traces into
>> a core file.
>
> Don't we need a different note number here? 
>
> Otherwise how should the debugger know if it's BTS or PT.

The pmu name is part of the note as well as its itrace_config.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 09/11] x86: perf: intel_pt: Add core dump functionality
  2014-02-06 20:36   ` Andi Kleen
@ 2014-02-07  9:03     ` Alexander Shishkin
  0 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-07  9:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Adrian Hunter,
	Matt Fleming

Andi Kleen <ak@linux.intel.com> writes:

>>  			   char *buf)
>> @@ -114,6 +143,7 @@ static ssize_t pt_cap_store(struct device *cdev,
>>  		return -EINVAL;
>>  
>>  	pt_cap_set(cap, new);
>> +	pt_cap_string();
>
> Don't we need some lock here? Otherwise it may leak memory with racing writes
> and become inconsistent.

Good point.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 09/11] x86: perf: intel_pt: Add core dump functionality
  2014-02-06 23:59   ` Andi Kleen
@ 2014-02-07  9:09     ` Alexander Shishkin
  0 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-07  9:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Adrian Hunter,
	Matt Fleming

Andi Kleen <andi@firstfloor.org> writes:

> Alexander Shishkin <alexander.shishkin@linux.intel.com> writes:
>> +
>> +static void pt_trace_core_output(struct coredump_params *cprm,
>> +				 struct perf_event *event,
>> +				 unsigned long len)
>> +{
>> +	struct pt_buffer *buf;
>> +	u64 from, to;
>> +	int ret;
>> +
>> +	buf = itrace_priv(event);
>> +
>> +	if (!dump_emit(cprm, pt_pmu.capstr, pt_pmu.caplen))
>> +		return;
>
> It would be nicer if this was a separate note, instead of just being
> concatenated with the rest of the data.
>
> Would make simpler parsing and be cleaner.

As long as we don't have to include traces from two different pmus in
the same core file, this works; otherwise, matching these sections to
their pmus would be another challenge. Doesn't seem like a sensible
scenario, though.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-02-06 10:50 ` [PATCH v1 03/11] perf: Allow for multiple ring buffers per event Alexander Shishkin
@ 2014-02-17 14:33   ` Peter Zijlstra
  2014-02-18  2:36     ` Andi Kleen
  2014-02-19 22:02     ` Dave Hansen
  2014-05-07 15:26   ` Peter Zijlstra
  1 sibling, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2014-02-17 14:33 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming

On Thu, Feb 06, 2014 at 12:50:26PM +0200, Alexander Shishkin wrote:
> Currently, a perf event can have one ring buffer associated with it, that
> is used for perf record stream. However, some pmus, such as instruction
> tracing units, will generate binary streams of their own, for which it is
> convenient to reuse the ring buffer code to export such streams to the
> userspace. So, this patch extends the perf code to support more than one
> ring buffer per event.



No-no-no-no... like I said last time around, 'splice' whatever results
you get into a perf buffer and make it look like perf events.

I'm not convinced it needs to be a PERF_RECORD_SAMPLE; but some
PERF_RECORD_* type for sure. Also it must allow interleaving with other
events.

I understand your use-case wants sideband events in another buffer due
to generation speed and not particularly caring about itrace data that's
lost but wanting a coherent side-band stream.

And that's fine, use two events for this; but that doesn't mean it
shouldn't be possible to mix them.

So for example:

/*
 * struct {
 *	struct perf_event_header	header;
 *	u64				extended_size;
 *	u64				data_offset;
 *	u64				data_size;
 *	struct sample_id		sample_id;
 * }
 */
PERF_RECORD_DATA


Now; suppose your itrace data is 1mb, allocate an event of
1mb+sizeof(PERF_RECORD_DATA)+PAGE_SIZE-1.

Then write the PERF_RECORD_DATA structure into the normal ring-buffer
location; set data_offset to point to the first page boundary, data_size
to 1mb.

Then frob things such that perf_mmap_to_page() for the next 1mb of pages
points to your buffer pages and wipe the page-table entries.

Then we need to somehow shoot down TLBs, and that's tricky, because up
to this point we're in interrupt context (ideally the whole itrace
nonsense gets dropped out of the PMI through an irq_work ASAP, no point
in doing it in NMI context anyhow).

So for TLB shootdown we can do a number of vile-ish things; but I think
the prettiest is relying (and thus mandating) that the consumer wait in
poll()/select()/etc. And either adding something like poll_work() which
gets ran on poll-wakeup on the right task, or doing something ugly with
task-work.

The point being that the consumer only needs to flush the TLBs before
trying to access the buffer and that its clearly not doing so when its
poll()-ing.

Another vile option is shooting down page-table entries and TLBs for the
entire buffer when writing into the control page to update the tail --
that has some other 'fun' issues, but should be possible as well.
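
To illustrate just the arithmetic above (a sketch; "struct perf_record_data"
stands for the hypothetical record layout sketched earlier, and head is
wherever the ring buffer's write cursor currently is):

static u64 record_data_reserve(u64 head, u64 data_size, u64 *data_offset)
{
	/* the payload must start on a page boundary so its pages can be remapped */
	*data_offset = PAGE_ALIGN(head + sizeof(struct perf_record_data));

	/* header + up to PAGE_SIZE-1 bytes of padding + the payload itself */
	return (*data_offset - head) + data_size;
}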

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver
  2014-02-06 10:50 ` [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
  2014-02-06 20:29   ` Andi Kleen
@ 2014-02-17 14:44   ` Peter Zijlstra
  2014-02-17 16:07     ` Andi Kleen
  2014-02-17 14:46   ` Peter Zijlstra
  2 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2014-02-17 14:44 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming

On Thu, Feb 06, 2014 at 12:50:30PM +0200, Alexander Shishkin wrote:
> +static int topa_insert_pages(struct pt_buffer *buf, gfp_t gfp,
> +			     enum topa_sz sz)
> +{
> +	struct topa *topa = buf->last;
> +	int node = cpu_to_node(buf->cpu);
> +	int order = get_order(sizes(sz));
> +	struct page *p;
> +	unsigned long pn;
> +
> +	p = alloc_pages_node(node, gfp | GFP_USER | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY, order);
> +	if (!p)
> +		return -ENOMEM;

> +
> +	return 0;
> +}
> +
> +static int pt_buffer_init_topa(struct pt_buffer *buf, size_t size, gfp_t gfp)
> +{
> +	struct topa *topa;
> +	int err, region_size;
> +
> +	topa = topa_alloc(buf->cpu, gfp);
> +	if (!topa)
> +		return -ENOMEM;
> +
> +	topa_insert_table(buf, topa);
> +
> +	region_size = pt_get_topa_region_size(buf->snapshot, size);
> +	if (region_size < 0) {
> +		pt_buffer_fini_topa(buf);
> +		return region_size;
> +	}
> +
> +	while (region_size && get_order(sizes(region_size)) > MAX_ORDER)
> +		region_size--;
> +
> +	/* fixup watermark in case of higher order allocations */
> +	if (buf->watermark < (sizes(region_size) >> PAGE_SHIFT))
> +		buf->watermark = sizes(region_size) >> PAGE_SHIFT;
> +
> +	while (buf->size < size) {
> +		err = topa_insert_pages(buf, gfp, region_size);
> +		if (err) {
> +			if (region_size) {
> +				region_size--;
> +				continue;
> +			}
> +			pt_buffer_fini_topa(buf);
> +			return -ENOMEM;
> +		}
> +	}

So MAX_ORDER is 11, which gives us (11+12) 23bit or 8M max allocations, right?

Given this TOPA stuff is completely fucked in the first release, that's
about all we can get it, right?

Now given you said 100s MB/s data rate for this itrace stuff, we're at
~0.1s traces. And that's in the very best case where we can actually get
8M.

Is that a workable amount?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver
  2014-02-06 10:50 ` [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
  2014-02-06 20:29   ` Andi Kleen
  2014-02-17 14:44   ` Peter Zijlstra
@ 2014-02-17 14:46   ` Peter Zijlstra
  2014-02-18 12:42     ` Alexander Shishkin
  2 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2014-02-17 14:46 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming

On Thu, Feb 06, 2014 at 12:50:30PM +0200, Alexander Shishkin wrote:
> Add support for Intel Processor Trace (PT) to kernel's perf/itrace events.
> PT is an extension of Intel Architecture that collects information about
> software execuction such as control flow, execution modes and timings and
> formats it into highly compressed binary packets. Even being compressed,
> these packets are generated at hundreds of megabytes per second per core,
> which makes it impractical to decode them on the fly in the kernel. Thus,
> buffers containing this binary stream are zero-copy mapped to the debug
> tools in userspace for subsequent decoding and analysis.
> 
> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---
>  arch/x86/include/uapi/asm/msr-index.h     |  18 +
>  arch/x86/kernel/cpu/Makefile              |   1 +
>  arch/x86/kernel/cpu/intel_pt.h            | 127 ++++
>  arch/x86/kernel/cpu/perf_event.c          |   4 +
>  arch/x86/kernel/cpu/perf_event_intel.c    |  10 +
>  arch/x86/kernel/cpu/perf_event_intel_pt.c | 991 ++++++++++++++++++++++++++++++
>  6 files changed, 1151 insertions(+)
>  create mode 100644 arch/x86/kernel/cpu/intel_pt.h
>  create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c

Andi said that when itrace is enabled the LBR is wrecked; this patch
seems to fail to deal with that.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver
  2014-02-17 14:44   ` Peter Zijlstra
@ 2014-02-17 16:07     ` Andi Kleen
  0 siblings, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2014-02-17 16:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Adrian Hunter, Matt Fleming

> So MAX_ORDER is 11, which gives us (11+12) 23bit or 8M max allocations, right?
> 
> Given this TOPA stuff is completely fucked in the first release, that's
> about all we can get it, right?
> 
> Now given you said 100s MB/s data rate for this itrace stuff, we're at
> ~0.1s traces. And that's in the very best case where we can actually get
> 8M.
> 
> Is that a workable amount

The PMI handler switches to a new buffer to allow longer traces
in memory.  It just costs one PMI.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-02-17 14:33   ` Peter Zijlstra
@ 2014-02-18  2:36     ` Andi Kleen
  2014-03-14 10:38       ` Peter Zijlstra
  2014-02-19 22:02     ` Dave Hansen
  1 sibling, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2014-02-18  2:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Adrian Hunter, Matt Fleming

> I'm not convinced it needs to be a PERF_RECORD_SAMPLE; but some
> PERF_RECORD_* type for sure. 

Adding a header shouldn't be a problem, it's merely wasting 4K.

> Also it must allow interleaving with other > events.

But can you describe a concrete use case where interleaving is better?

I'm not aware of any. Anything that could be usefully interleaved
can just be in the side band stream, and if you want a unified uncompressed 
stream you just run perf inject. The standard tools don't care for 
it as they have to reorder everything anyways to deal with multi
CPU reordering.

Your scheme is very complex and adds a lot of use restrictions 
over the current code, so there should be a good reason for it
at least.

Especially the TLB hac^wproposal sounds horrible to me, compared
to the straightforward zero-copy ring buffer used today.

-Andi


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver
  2014-02-17 14:46   ` Peter Zijlstra
@ 2014-02-18 12:42     ` Alexander Shishkin
  0 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-02-18 12:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming

Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Feb 06, 2014 at 12:50:30PM +0200, Alexander Shishkin wrote:
>> Add support for Intel Processor Trace (PT) to kernel's perf/itrace events.
>> PT is an extension of Intel Architecture that collects information about
>> software execution such as control flow, execution modes and timings and
>> formats it into highly compressed binary packets. Even when compressed,
>> these packets are generated at hundreds of megabytes per second per core,
>> which makes it impractical to decode them on the fly in the kernel. Thus,
>> buffers containing this binary stream are zero-copy mapped to the debug
>> tools in userspace for subsequent decoding and analysis.
>> 
>> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> ---
>>  arch/x86/include/uapi/asm/msr-index.h     |  18 +
>>  arch/x86/kernel/cpu/Makefile              |   1 +
>>  arch/x86/kernel/cpu/intel_pt.h            | 127 ++++
>>  arch/x86/kernel/cpu/perf_event.c          |   4 +
>>  arch/x86/kernel/cpu/perf_event_intel.c    |  10 +
>>  arch/x86/kernel/cpu/perf_event_intel_pt.c | 991 ++++++++++++++++++++++++++++++
>>  6 files changed, 1151 insertions(+)
>>  create mode 100644 arch/x86/kernel/cpu/intel_pt.h
>>  create mode 100644 arch/x86/kernel/cpu/perf_event_intel_pt.c
>
> Andi said that when itrace is enabled the LBR is wrecked; this patch
> seems to fail to deal with that.

True, there needs to be a _safe() MSR access before any configuration is
done; I have a fix for that, but let's first deal with the buffer
management.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-02-17 14:33   ` Peter Zijlstra
  2014-02-18  2:36     ` Andi Kleen
@ 2014-02-19 22:02     ` Dave Hansen
  2014-03-10  9:59       ` Alexander Shishkin
  2014-03-14 10:41       ` Peter Zijlstra
  1 sibling, 2 replies; 45+ messages in thread
From: Dave Hansen @ 2014-02-19 22:02 UTC (permalink / raw)
  To: Peter Zijlstra, Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming

On 02/17/2014 06:33 AM, Peter Zijlstra wrote:
> Then write the PERF_RECORD_DATA structure into the normal ring-buffer
> location; set data_offset to point to the first page boundary, data_size
> to 1mb.
> 
> Then frob things such that perf_mmap_to_page() for the next 1mb of pages
> points to your buffer pages and wipe the page-table entries.

Wouldn't we have to teach a ton of code how to be IRQ safe for this to
work?  Just step one: how do we go modifying page tables safely from an
interrupt?  mm->page_table_lock is a plain non-irq spinlock.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-02-19 22:02     ` Dave Hansen
@ 2014-03-10  9:59       ` Alexander Shishkin
  2014-03-10 17:24         ` Andi Kleen
  2014-03-14 10:41       ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Alexander Shishkin @ 2014-03-10  9:59 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming

Dave Hansen <dave.hansen@intel.com> writes:

> On 02/17/2014 06:33 AM, Peter Zijlstra wrote:
>> Then write the PERF_RECORD_DATA structure into the normal ring-buffer
>> location; set data_offset to point to the first page boundary, data_size
>> to 1mb.
>> 
>> Then frob things such that perf_mmap_to_page() for the next 1mb of pages
>> points to your buffer pages and wipe the page-table entries.
>
> Wouldn't we have to teach a ton of code how to be IRQ safe for this to
> work?  Just step one: how do we go modifying page tables safely from an
> interrupt?  mm->page_table_lock is a plain non-irq spinlock.

Yes, this does look more than just tricky even if we move the bulk of
interrupt code to an irq_work. Peter, are you quite sure this is what we
want to do just for exporting trace buffers to userspace?

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-03-10  9:59       ` Alexander Shishkin
@ 2014-03-10 17:24         ` Andi Kleen
  2014-03-14 10:44           ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2014-03-10 17:24 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Dave Hansen, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Adrian Hunter, Matt Fleming

> > Wouldn't we have to teach a ton of code how to be IRQ safe for this to
> > work?  Just step one: how do we go modifying page tables safely from an
> > interrupt?  mm->page_table_lock is a plain non-irq spinlock.
> 
> Yes, this does look more than just tricky even if we move the bulk of
> interrupt code to an irq_work. Peter, are you quite sure this is what we
> want to do just for exporting trace buffers to userspace?

The other big problem is scalability. Even if it were somehow possible
to make this scheme work, the IPIs for flushing would kill performance
on any multi-threaded client. Granted, perf is not multi-threaded today, but
it doesn't seem a good idea to design the interface assuming no client ever
will be.

-Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-02-18  2:36     ` Andi Kleen
@ 2014-03-14 10:38       ` Peter Zijlstra
  2014-03-14 14:10         ` Andi Kleen
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2014-03-14 10:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Adrian Hunter, Matt Fleming

On Mon, Feb 17, 2014 at 06:36:59PM -0800, Andi Kleen wrote:
> > I'm not convinced it needs to be a PERF_RECORD_SAMPLE; but some
> > PERF_RECORD_* type for sure. 
> 
> Adding a header shouldn't be a problem, it's merely wasting 4K.
> 
> > Also it must allow interleaving with other events.
> 
> But can you describe a concrete use case where interleaving is better?

I really don't want the multi-buffer nonsense proposed. An event gets
_1_ buffer, that's it.

That also means that if someone redirects another event into this buffer,
it needs to just work.

And because it's a perf buffer, people expect it to look like one. So
we've got to 'wrap' it.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-02-19 22:02     ` Dave Hansen
  2014-03-10  9:59       ` Alexander Shishkin
@ 2014-03-14 10:41       ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2014-03-14 10:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, Adrian Hunter, Matt Fleming

On Wed, Feb 19, 2014 at 02:02:36PM -0800, Dave Hansen wrote:
> On 02/17/2014 06:33 AM, Peter Zijlstra wrote:
> > Then write the PERF_RECORD_DATA structure into the normal ring-buffer
> > location; set data_offset to point to the first page boundary, data_size
> > to 1mb.
> > 
> > Then frob things such that perf_mmap_to_page() for the next 1mb of pages
> > points to your buffer pages and wipe the page-table entries.
> 
> Wouldn't we have to teach a ton of code how to be IRQ safe for this to
> work?  Just step one: how do we go modifying page tables safely from an
> interrupt?  mm->page_table_lock is a plain non-irq spinlock.

One could modify existing page tables the same way we do the lockless
lookup for GUP. But instead of doing the get_page() we do a pte
modification.

But I suppose we can push all that to task context by having the polling
task do it before it gets to userspace again.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-03-10 17:24         ` Andi Kleen
@ 2014-03-14 10:44           ` Peter Zijlstra
  2014-03-14 14:13             ` Andi Kleen
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2014-03-14 10:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexander Shishkin, Dave Hansen, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Adrian Hunter, Matt Fleming

On Mon, Mar 10, 2014 at 10:24:40AM -0700, Andi Kleen wrote:
> > > Wouldn't we have to teach a ton of code how to be IRQ safe for this to
> > > work?  Just step one: how do we go modifying page tables safely from an
> > > interrupt?  mm->page_table_lock is a plain non-irq spinlock.
> > 
> > Yes, this does look more than just tricky even if we move the bulk of
> > interrupt code to an irq_work. Peter, are you quite sure this is what we
> > want to do just for exporting trace buffers to userspace?
> 
> The other big problem is scalability. Even if it were somehow possible
> to make this scheme work, the IPIs for flushing would kill performance
> on any multi-threaded client. Granted, perf is not multi-threaded today, but
> it doesn't seem a good idea to design the interface assuming no client ever
> will be.

Well any mmap()ed interface that wants to swap buffers will have this
same problem.

You can restrict the TLB flushing to the threads that poll() on the
relevant events. This just means other threads will see old/partial
data, but that shouldn't be a problem as they shouldn't be looking in
the first place.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-03-14 10:38       ` Peter Zijlstra
@ 2014-03-14 14:10         ` Andi Kleen
  2014-03-18 14:06           ` Alexander Shishkin
  0 siblings, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2014-03-14 14:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Adrian Hunter, Matt Fleming

> I really don't want the multi-buffer nonsense proposed. 

> An event gets
> _1_ buffer, that's it.

But we already have multi-buffer. Just profile multiple CPUs;
then you have one buffer per CPU that needs to be combined.

This just has two buffers per CPU.

> That also means that if someone redirects another event into this buffer,
> it needs to just work.

All the tools already handle multiple buffers (for multi CPUs).
So they don't need it.

> And because it's a perf buffer, people expect it to look like one. So
> we've got to 'wrap' it.

Flushing TLBs from NMIs or irq work or any interrupt context is just
a non-starter.

The approach of starting/stopping the hardware and creating gigantic gaps
was also pretty bad, and it would completely change the perf format too.

It seems to me you're trying to solve a non-problem.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-03-14 10:44           ` Peter Zijlstra
@ 2014-03-14 14:13             ` Andi Kleen
  0 siblings, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2014-03-14 14:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Dave Hansen, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Adrian Hunter, Matt Fleming

> > The other big problem is scalability. Even if it were somehow possible
> > to make this scheme work, the IPIs for flushing would kill performance
> > on any multi-threaded client. Granted, perf is not multi-threaded today, but
> > it doesn't seem a good idea to design the interface assuming no client ever
> > will be.
> 
> Well any mmap()ed interface that wants to swap buffers will have this
> same problem.

There's no need to swap buffers in a sane design. Neither the perf ring buffer
nor the ftrace buffer needs this. There's no need for a PT buffer
to do so either.

> 
> You can restrict the TLB flushing to the threads that poll() on the
> relevant events. This just means other threads will see old/partial
> data, but that shouldn't be a problem as they shouldn't be looking in
> the first place.

Then we get incoherent processes. You're not serious about that, are you?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-03-14 14:10         ` Andi Kleen
@ 2014-03-18 14:06           ` Alexander Shishkin
  0 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-03-18 14:06 UTC (permalink / raw)
  To: Andi Kleen, Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Adrian Hunter, Matt Fleming

Andi Kleen <ak@linux.intel.com> writes:

>> I really don't want the multi-buffer nonsense proposed. 
>
>> An event gets
>> _1_ buffer, that's it.
>
> But we already have multi-buffer. Just profile multiple CPUs;
> then you have one buffer per CPU that needs to be combined.
>
> This just has two buffers per CPU.

Well, an event still gets *one* *perf* buffer in our implementation,
which is consistent with how things are done now, plus one trace
buffer. We could also export the trace buffer as a device node or
something, so that no software would expect to see perf headers in that
buffer.

>> That also means that if someone redirects another event into this buffer,
>> it needs to just work.
>
> All the tools already handle multiple buffers (for multi CPUs).
> So they don't need it.
>
>> And because it's a perf buffer, people expect it to look like one. So
>> we've got to 'wrap' it.
>
> Flushing TLBs from NMIs or irq work or any interrupt context is just
> a non-starter.
>
> The approach of starting/stopping the hardware and creating gigantic gaps
> was also pretty bad, and it would completely change the perf format too.
>
> It seems to me you're trying to solve a non-problem.

Look at it this way: if the only way for it to be part of perf is by
wrapping trace data in perf headers, the perf framework is simply not
suitable for instruction tracing. Therefore, it seems logical to have a
standalone driver or, considering ETM/PTM and others, a standalone
framework for exporting instruction trace data to userspace as a plain
mmap interface, without the overhead of overwriting userspace ptes,
flushing TLBs or having inconsistent mappings across threads, and one
that would still work for hardware that doesn't support sg lists.

What do you think?

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-02-06 10:50 ` [PATCH v1 03/11] perf: Allow for multiple ring buffers per event Alexander Shishkin
  2014-02-17 14:33   ` Peter Zijlstra
@ 2014-05-07 15:26   ` Peter Zijlstra
  2014-05-07 19:25     ` Ingo Molnar
                       ` (3 more replies)
  1 sibling, 4 replies; 45+ messages in thread
From: Peter Zijlstra @ 2014-05-07 15:26 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming




How about something like this for the itrace thing?

You would mmap() the regular buffer, then write ->aux_{offset,size} in
the control page. After which you can do a second mmap() with the .pgoff
matching the aux_offset you gave and .length matching the aux_size you
gave.

This way the mmap() content still looks like a single linear file (could
be sparse if you leave a hole, although we could require the aux_offset
to match the end of the data section).

And there is still the single event->rb, not more.

Then, when data inside that aux data store changes, they should inject a
PERF_RECORD_AUX to indicate this did happen, which ties it back into the
normal event flow.

With this there should be no difficult page table tricks or anything.

The patch is way incomplete but should sketch enough of the idea..

So the aux_head/tail values should also be in the file space and not
start at 0 again, similar for the offsets in the AUX record.
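
(From the tool side, the sequence would look roughly like the sketch below. The
perf_event_mmap_page fields are the ones added in the diff further down; the fd,
the page counts and the absence of error handling are assumptions made purely
for illustration.)

#include <unistd.h>
#include <sys/mman.h>
#include <linux/perf_event.h>

/* fd: an open perf event fd; data_pages, aux_pages: page counts (assumed). */
static void *map_aux_area(int fd, size_t data_pages, size_t aux_pages)
{
	size_t psz = sysconf(_SC_PAGESIZE);
	struct perf_event_mmap_page *pc;
	void *base, *aux;

	/* 1) map the user page plus the regular data area */
	base = mmap(NULL, (data_pages + 1) * psz, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
		return NULL;
	pc = base;

	/* 2) tell the kernel where the AUX area lives in the "file" */
	pc->aux_offset = (data_pages + 1) * psz;
	pc->aux_size   = aux_pages * psz;

	/* 3) second mmap(): .pgoff and length must match what was written */
	aux = mmap(NULL, pc->aux_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, pc->aux_offset);
	return aux == MAP_FAILED ? NULL : aux;
}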

---
 include/uapi/linux/perf_event.h | 19 +++++++++++++++
 kernel/events/core.c            | 51 +++++++++++++++++++++++++++++++++++++----
 kernel/events/internal.h        |  6 +++++
 kernel/events/ring_buffer.c     |  8 +------
 4 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 853bc1ccb395..adef7c0f1e7c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -491,6 +491,13 @@ struct perf_event_mmap_page {
 	 */
 	__u64   data_head;		/* head in the data section */
 	__u64	data_tail;		/* user-space written tail */
+	__u64	data_offset;
+	__u64	data_size;
+
+	__u64	aux_head;
+	__u64	aux_tail;
+	__u64	aux_offset;
+	__u64	aux_size;
 };
 
 #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
@@ -705,6 +712,18 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * Records that new data landed in the AUX buffer part.
+	 *
+	 * struct {
+	 * 	struct perf_event_header	header;
+	 *
+	 * 	u64				aux_offset;
+	 * 	u64				aux_size;
+	 * };
+	 */
+	PERF_RECORD_AUX				= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5129b1201050..993995a23b73 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4016,7 +4016,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 
 static const struct vm_operations_struct perf_mmap_vmops = {
 	.open		= perf_mmap_open,
-	.close		= perf_mmap_close,
+	.close		= perf_mmap_close, /* non mergable */
 	.fault		= perf_mmap_fault,
 	.page_mkwrite	= perf_mmap_fault,
 };
@@ -4030,6 +4030,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	struct ring_buffer *rb;
 	unsigned long vma_size;
 	unsigned long nr_pages;
+	unsigned long pgoff;
 	long user_extra, extra;
 	int ret = 0, flags = 0;
 
@@ -4045,7 +4046,50 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	vma_size = vma->vm_end - vma->vm_start;
-	nr_pages = (vma_size / PAGE_SIZE) - 1;
+
+	if (vma->vm_pgoff == 0) {
+		nr_pages = (vma_size / PAGE_SIZE) - 1;
+	} else {
+		if (!event->rb)
+			return -EINVAL;
+
+		nr_pages = vma_size / PAGE_SIZE;
+
+		mutex_lock(&event->mmap_mutex);
+		ret = -EINVAL;
+		if (!event->rb)
+			goto err_aux_unlock;
+
+		if (!atomic_inc_not_zero(&event->rb->mmap_count))
+			goto err_aux_unlock;
+
+		if (userpg->aux_offset < userpg->data_offset + userpg->data_size)
+			goto err_aux_unlock;
+
+		pgoff = userpg->aux_offset;
+		if (pgoff & ~PAGE_MASK)
+			goto err_aux_unlock;
+
+		pgoff >>= PAGE_SHIFT;
+		if (pgoff != vma->vm_pgoff)
+			goto err_aux_unlock;
+
+		/* XXX do we want to allow !power_of_2 sizes, for AUX?  */
+		if (nr_pages == 0 || !is_power_of_2(nr_pages))
+			goto err_aux_unlock;
+
+		if (vma_size != PAGE_SIZE * nr_pages)
+			goto err_aux_unlock;
+
+		if (userpg->aux_size != vma_size)
+			goto err_aux_unlock;
+			
+		ret = rb_alloc_aux(event->rb, userpg->aux_offset >> PAGE_SHIFT, nr_pages);
+
+err_aux_unlock:
+		mutex_unlock(&event->mmap_mutex);
+		return ret;
+	}
 
 	/*
 	 * If we have rb pages ensure they're a power-of-two number, so we
@@ -4057,9 +4101,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma_size != PAGE_SIZE * (1 + nr_pages))
 		return -EINVAL;
 
-	if (vma->vm_pgoff != 0)
-		return -EINVAL;
-
 	WARN_ON_ONCE(event->ctx->parent_ctx);
 again:
 	mutex_lock(&event->mmap_mutex);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b218782ad..6258aaa36097 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -36,6 +36,7 @@ struct ring_buffer {
 	struct user_struct		*mmap_user;
 
 	struct perf_event_mmap_page	*user_page;
+	struct radix_tree_root		page_tree;
 	void				*data_pages[0];
 };
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 146a5792b1d2..b82505325df0 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -251,13 +251,7 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 struct page *
 perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
 {
-	if (pgoff > rb->nr_pages)
-		return NULL;
-
-	if (pgoff == 0)
-		return virt_to_page(rb->user_page);
-
-	return virt_to_page(rb->data_pages[pgoff - 1]);
+	return radix_tree_lookup(&rb->page_tree, pgoff);
 }
 
 static void *perf_mmap_alloc_page(int cpu)


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-05-07 15:26   ` Peter Zijlstra
@ 2014-05-07 19:25     ` Ingo Molnar
  2014-05-07 21:08     ` Andi Kleen
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2014-05-07 19:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen, Adrian Hunter, Matt Fleming


* Peter Zijlstra <peterz@infradead.org> wrote:

> How about something like this for the itrace thing?
> 
> You would mmap() the regular buffer, then write ->aux_{offset,size} 
> in the control page. After which you can do a second mmap() with the 
> .pgoff matching the aux_offset you gave and .length matching the 
> aux_size you gave.
> 
> This way the mmap() content still looks like a single linear file 
> (could be sparse if you leave a hole, although we could require the 
> aux_offset to match the end of the data section).
> 
> And there is still the single event->rb, not more.
> 
> Then, when data inside that aux data store changes they should 
> inject an PERF_RECORD_AUX to indicate this did happen, which ties it 
> back into the normal event flow.
> 
> With this there should be no difficult page table tricks or 
> anything.
> 
> The patch is way incomplete but should sketch enough of the idea..
> 
> So the aux_head/tail values should also be in the file space and not 
> start at 0 again, similar for the offsets in the AUX record.

This looks like a pretty good concept to me, to support the buffering 
quirks/constraints that itrace CPUs apparently have.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-05-07 15:26   ` Peter Zijlstra
  2014-05-07 19:25     ` Ingo Molnar
@ 2014-05-07 21:08     ` Andi Kleen
  2014-05-07 21:22       ` Peter Zijlstra
  2014-05-08  4:05     ` Alexander Shishkin
  2014-05-08 12:34     ` Alexander Shishkin
  3 siblings, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2014-05-07 21:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Adrian Hunter, Matt Fleming

> Then, when data inside that aux data store changes they should inject an
> PERF_RECORD_AUX to indicate this did happen, which ties it back into the
> normal event flow.

What happens when the aux buffer wraps? How would the client know
if the data belongs to this _AUX entry or some later one?

May need some extra sequence numbers in the mmap header and the aux
entry to handle this.

-Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-05-07 21:08     ` Andi Kleen
@ 2014-05-07 21:22       ` Peter Zijlstra
  2014-05-08  3:26         ` Alexander Shishkin
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2014-05-07 21:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Adrian Hunter, Matt Fleming

On Wed, May 07, 2014 at 02:08:43PM -0700, Andi Kleen wrote:
> > Then, when data inside that aux data store changes they should inject an
> > PERF_RECORD_AUX to indicate this did happen, which ties it back into the
> > normal event flow.
> 
> What happens when the aux buffer wraps? How would the client know
> if the data belongs to this _AUX entry or some later one?

It belongs to the last one. Rewind them from 'now' until you hit
collisions in AUX space, then you're done.

> May need some extra sequence numbers in the mmap header and the aux
> entry to handle this.

You're thinking of overwrite mode, right? We should update the tail in
that case; I've not thought about how to do that for the AUX buffer.

There have been some patches for the normal buffer, but they stalled;

https://lkml.org/lkml/2013/7/8/154

I'm all for merging that patch (or a fixed one, since it has a fail in it) if
we can show the current !overwrite case doesn't regress.

Also, would anybody want a different mode for the data and aux parts? In
that case we do need to add some extra state to the control page to
indicate such.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-05-07 21:22       ` Peter Zijlstra
@ 2014-05-08  3:26         ` Alexander Shishkin
  0 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-05-08  3:26 UTC (permalink / raw)
  To: Peter Zijlstra, Andi Kleen
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Adrian Hunter, Matt Fleming

Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, May 07, 2014 at 02:08:43PM -0700, Andi Kleen wrote:
>> > Then, when data inside that aux data store changes they should inject an
>> > PERF_RECORD_AUX to indicate this did happen, which ties it back into the
>> > normal event flow.
>> 
>> What happens when the aux buffer wraps? How would the client know
>> if the data belongs to this _AUX entry or some later one?
>
> It belongs to the last one. Rewind them from 'now' until you hit
> collisions in AUX space, then you're done.

I guess the point here is that if we don't want to lose any data in aux
space, we need to stop the perf_event when it fills up. Also, there's a
question whether we need a separate wake-up watermark for the AUX buffer or
whether we simply wake up the poller every time there's new data.

>> May need some extra sequence numbers in the mmap header and the aux
>> entry to handle this.
>
> You're thinking of overwrite mode, right? We should update the tail in
> that case, I've not thought about how to do that for the AUX buffer.

In the overwrite mode we don't have to write out AUX records at all
before we stop the trace; we don't care how many times the data in the AUX
space wraps.

> There have been some patches for the normal buffer, but they stalled;
>
> https://lkml.org/lkml/2013/7/8/154
>
> I'm all for merging that patch (or a fixed on, since it has fail in) if
> we can show the current !overwrite case doesn't regress.
>
> Also, would anybody want different mode for the data and aux parts? In
> that case we do need to add some extra state to the control page to
> indicate such.

For the decoder to make sense of the trace, it needs all the data in the
normal buffer (MMAPs, sched_switches), not just the latest bits, so it's
a good idea to have it.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-05-07 15:26   ` Peter Zijlstra
  2014-05-07 19:25     ` Ingo Molnar
  2014-05-07 21:08     ` Andi Kleen
@ 2014-05-08  4:05     ` Alexander Shishkin
  2014-05-08  9:08       ` Alexander Shishkin
  2014-05-08 12:34     ` Alexander Shishkin
  3 siblings, 1 reply; 45+ messages in thread
From: Alexander Shishkin @ 2014-05-08  4:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming

Peter Zijlstra <peterz@infradead.org> writes:

> How about something like this for the itrace thing?

It's much nicer than the page swizzling draft I was about to send you.

> You would mmap() the regular buffer, then write ->aux_{offset,size} in
> the control page. After which you can do a second mmap() with the .pgoff
> matching the aux_offset you gave and .length matching the aux_size you
> gave.

Why do we need aux_{offset,size} at all, then? Userspace should know how
they mmap()ed it.

> This way the mmap() content still looks like a single linear file (could
> be sparse if you leave a hole, although we could require the aux_offset
> to match the end of the data section).
>
> And there is still the single event->rb, not more.

Fair enough.

> Then, when data inside that aux data store changes they should inject an
> PERF_RECORD_AUX to indicate this did happen, which ties it back into the
> normal event flow.
>
> With this there should be no difficult page table tricks or anything.

True.

> The patch is way incomplete but should sketch enough of the idea..

Can I take it over?

> So the aux_head/tail values should also be in the file space and not
> start at 0 again, similar for the offsets in the AUX record.

With PERF_RECORD_AUX carrying offset and size, we shouldn't need
aux_{head,tail} either, don't you think?

>
> ---
>  include/uapi/linux/perf_event.h | 19 +++++++++++++++
>  kernel/events/core.c            | 51 +++++++++++++++++++++++++++++++++++++----
>  kernel/events/internal.h        |  6 +++++
>  kernel/events/ring_buffer.c     |  8 +------
>  4 files changed, 72 insertions(+), 12 deletions(-)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 853bc1ccb395..adef7c0f1e7c 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -491,6 +491,13 @@ struct perf_event_mmap_page {
>  	 */
>  	__u64   data_head;		/* head in the data section */
>  	__u64	data_tail;		/* user-space written tail */
> +	__u64	data_offset;
> +	__u64	data_size;
> +
> +	__u64	aux_head;
> +	__u64	aux_tail;
> +	__u64	aux_offset;
> +	__u64	aux_size;
>  };
>  
>  #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
> @@ -705,6 +712,18 @@ enum perf_event_type {
>  	 */
>  	PERF_RECORD_MMAP2			= 10,
>  
> +	/*
> +	 * Records that new data landed in the AUX buffer part.
> +	 *
> +	 * struct {
> +	 * 	struct perf_event_header	header;
> +	 *
> +	 * 	u64				aux_offset;
> +	 * 	u64				aux_size;
> +	 * };
> +	 */
> +	PERF_RECORD_AUX				= 11,
> +
>  	PERF_RECORD_MAX,			/* non-ABI */
>  };
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 5129b1201050..993995a23b73 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4016,7 +4016,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
>  
>  static const struct vm_operations_struct perf_mmap_vmops = {
>  	.open		= perf_mmap_open,
> -	.close		= perf_mmap_close,
> +	.close		= perf_mmap_close, /* non mergable */
>  	.fault		= perf_mmap_fault,
>  	.page_mkwrite	= perf_mmap_fault,
>  };
> @@ -4030,6 +4030,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  	struct ring_buffer *rb;
>  	unsigned long vma_size;
>  	unsigned long nr_pages;
> +	unsigned long pgoff;
>  	long user_extra, extra;
>  	int ret = 0, flags = 0;
>  
> @@ -4045,7 +4046,50 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  		return -EINVAL;
>  
>  	vma_size = vma->vm_end - vma->vm_start;
> -	nr_pages = (vma_size / PAGE_SIZE) - 1;
> +
> +	if (vma->vm_pgoff == 0) {
> +		nr_pages = (vma_size / PAGE_SIZE) - 1;
> +	} else {
> +		if (!event->rb)
> +			return -EINVAL;
> +
> +		nr_pages = vma_size / PAGE_SIZE;
> +
> +		mutex_lock(&event->mmap_mutex);
> +		ret = -EINVAL;
> +		if (!event->rb)
> +			goto err_aux_unlock;
> +
> +		if (!atomic_inc_not_zero(&event->rb->mmap_count))
> +			goto err_aux_unlock;
> +
> +		if (userpg->aux_offset < userpg->data_offset + userpg->data_size)
> +			goto err_aux_unlock;

The data_{offset,size} seem to be set only by userspace too; maybe we
can do away with these altogether, unless we want to allow it to be a
sparse file?

> +		pgoff = userpg->aux_offset;

..and simply do a

                pgoff = event->rb->nr_pages + 1;

?

> +		if (pgoff & ~PAGE_MASK)
> +			goto err_aux_unlock;
> +
> +		pgoff >>= PAGE_SHIFT;
> +		if (pgoff != vma->vm_pgoff)
> +			goto err_aux_unlock;
> +
> +		/* XXX do we want to allow !power_of_2 sizes, for AUX?  */
> +		if (nr_pages == 0 || !is_power_of_2(nr_pages))
> +			goto err_aux_unlock;
> +
> +		if (vma_size != PAGE_SIZE * nr_pages)
> +			goto err_aux_unlock;
> +
> +		if (userpg->aux_size != vma_size)
> +			goto err_aux_unlock;
> +			
> +		ret = rb_alloc_aux(event->rb, userpg->aux_offset >> PAGE_SHIFT, nr_pages);
> +
> +err_aux_unlock:
> +		mutex_unlock(&event->mmap_mutex);
> +		return ret;
> +	}
>  
>  	/*
>  	 * If we have rb pages ensure they're a power-of-two number, so we
> @@ -4057,9 +4101,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  	if (vma_size != PAGE_SIZE * (1 + nr_pages))
>  		return -EINVAL;
>  
> -	if (vma->vm_pgoff != 0)
> -		return -EINVAL;
> -
>  	WARN_ON_ONCE(event->ctx->parent_ctx);
>  again:
>  	mutex_lock(&event->mmap_mutex);
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index 569b218782ad..6258aaa36097 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -36,6 +36,7 @@ struct ring_buffer {
>  	struct user_struct		*mmap_user;
>  
>  	struct perf_event_mmap_page	*user_page;
> +	struct radix_tree_root		page_tree;
>  	void				*data_pages[0];
>  };
>  
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 146a5792b1d2..b82505325df0 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -251,13 +251,7 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
>  struct page *
>  perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
>  {
> -	if (pgoff > rb->nr_pages)
> -		return NULL;
> -
> -	if (pgoff == 0)
> -		return virt_to_page(rb->user_page);
> -
> -	return virt_to_page(rb->data_pages[pgoff - 1]);
> +	return radix_tree_lookup(&rb->page_tree, pgoff);

This can instead call into the underlying driver, which will likely
maintain an array similar to data_pages[] anyway.
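
(Something along these lines, say: keep the existing lookup for the user page and
the data pages, and only consult an AUX page array for higher offsets. The
aux_pgoff/aux_nr_pages/aux_pages fields are made-up names for the sake of the
sketch, to be filled in by rb_alloc_aux() or by the pmu driver.)

struct page *
perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff)
{
	if (pgoff == 0)
		return virt_to_page(rb->user_page);

	if (pgoff <= rb->nr_pages)
		return virt_to_page(rb->data_pages[pgoff - 1]);

	/* hypothetical AUX fields; see rb_alloc_aux() in the sketch above */
	if (pgoff >= rb->aux_pgoff && pgoff < rb->aux_pgoff + rb->aux_nr_pages)
		return virt_to_page(rb->aux_pages[pgoff - rb->aux_pgoff]);

	return NULL;
}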

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-05-08  4:05     ` Alexander Shishkin
@ 2014-05-08  9:08       ` Alexander Shishkin
  0 siblings, 0 replies; 45+ messages in thread
From: Alexander Shishkin @ 2014-05-08  9:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming

Alexander Shishkin <alexander.shishkin@linux.intel.com> writes:

> Peter Zijlstra <peterz@infradead.org> writes:
>
>> How about something like this for the itrace thing?
>
> It's much nicer than the page swizzling draft I was about to send you.
>
>> You would mmap() the regular buffer, then write ->aux_{offset,size} in
>> the control page. After which you can do a second mmap() with the .pgoff
>> matching the aux_offset you gave and .length matching the aux_size you
>> gave.
>
> Why do we need aux_{offset,size} at all, then? Userspace should know how
> they mmap()ed it.
>
>> This way the mmap() content still looks like a single linear file (could
>> be sparse if you leave a hole, although we could require the aux_offset
>> to match the end of the data section).
>>
>> And there is still the single event->rb, not more.
>
> Fair enough.
>
>> Then, when data inside that aux data store changes they should inject an
>> PERF_RECORD_AUX to indicate this did happen, which ties it back into the
>> normal event flow.
>>
>> With this there should be no difficult page table tricks or anything.
>
> True.
>
>> The patch is way incomplete but should sketch enough of the idea..
>
> Can I take it over?
>
>> So the aux_head/tail values should also be in the file space and not
>> start at 0 again, similar for the offsets in the AUX record.
>
> With PERF_RECORD_AUX carrying offset and size, we shouldn't need
> aux_{head,tail} either, don't you think?

I take this one back: since perf record doesn't actually parse records from
the buffer, it would still need the pointers.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-05-07 15:26   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2014-05-08  4:05     ` Alexander Shishkin
@ 2014-05-08 12:34     ` Alexander Shishkin
  2014-05-08 12:41       ` Peter Zijlstra
  3 siblings, 1 reply; 45+ messages in thread
From: Alexander Shishkin @ 2014-05-08 12:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming

Peter Zijlstra <peterz@infradead.org> writes:

> So the aux_head/tail values should also be in the file space and not
> start at 0 again, similar for the offsets in the AUX record.

Thinking some more about it: if we keep aux_{offset,size} in userpg,
then it would make sense to have aux_{head,tail} run from 0 to infinity
similar to data_{head,tail}, especially considering that neither data_*
nor aux_* pointers are file offsets anyway.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-05-08 12:34     ` Alexander Shishkin
@ 2014-05-08 12:41       ` Peter Zijlstra
  2014-05-08 12:46         ` Alexander Shishkin
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2014-05-08 12:41 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming


On Thu, May 08, 2014 at 03:34:17PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > So the aux_head/tail values should also be in the file space and not
> > start at 0 again, similar for the offsets in the AUX record.
> 
> Thinking some more about it: if we keep aux_{offset,size} in userpg,
> then it would make sense to have aux_{head,tail} run from 0 to infinity
> similar to data_{head,tail}, especially considering that neither data_*
> nor aux_* pointers are file offsets anyway.

Yeah, this might make it easier. So then the rule is *_offset +
*_{head,tail} is the actual file offset into the buffer.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-05-08 12:41       ` Peter Zijlstra
@ 2014-05-08 12:46         ` Alexander Shishkin
  2014-05-08 14:16           ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Shishkin @ 2014-05-08 12:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming

Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, May 08, 2014 at 03:34:17PM +0300, Alexander Shishkin wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > So the aux_head/tail values should also be in the file space and not
>> > start at 0 again, similar for the offsets in the AUX record.
>> 
>> Thinking some more about it: if we keep aux_{offset,size} in userpg,
>> then it would make sense to have aux_{head,tail} run from 0 to infinity
>> similar to data_{head,tail}, especially considering that neither data_*
>> nor aux_* pointers are file offsets anyway.
>
> Yeah, this might make it easier. So then the rule is *_offset +
> *_{head,tail} is the actual file offset into the buffer.

You mean *_offset + (*_{head,tail} % *_size) ?
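
(In reader terms, something like this tiny helper, assuming the aux_* fields from
the sketch patch end up in struct perf_event_mmap_page as proposed:)

#include <stdint.h>
#include <linux/perf_event.h>

/* Map a free-running aux_head/aux_tail counter onto a file offset. */
static uint64_t aux_file_offset(const struct perf_event_mmap_page *pc,
				uint64_t pos)	/* pc->aux_head or pc->aux_tail */
{
	return pc->aux_offset + (pos % pc->aux_size);
}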

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v1 03/11] perf: Allow for multiple ring buffers per event
  2014-05-08 12:46         ` Alexander Shishkin
@ 2014-05-08 14:16           ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2014-05-08 14:16 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Adrian Hunter,
	Matt Fleming


On Thu, May 08, 2014 at 03:46:36PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Thu, May 08, 2014 at 03:34:17PM +0300, Alexander Shishkin wrote:
> >> Peter Zijlstra <peterz@infradead.org> writes:
> >> 
> >> > So the aux_head/tail values should also be in the file space and not
> >> > start at 0 again, similar for the offsets in the AUX record.
> >> 
> >> Thinking some more about it: if we keep aux_{offset,size} in userpg,
> >> then it would make sense to have aux_{head,tail} run from 0 to infinity
> >> similar to data_{head,tail}, especially considering that neither data_*
> >> nor aux_* pointers are file offsets anyway.
> >
> > Yeah, this might make it easier. So then the rule is *_offset +
> > *_{head,tail} is the actual file offset into the buffer.
> 
> You mean *_offset + (*_{head,tail} % *_size) ?

Indeed I did.. 


^ permalink raw reply	[flat|nested] 45+ messages in thread

Thread overview: 45+ messages
2014-02-06 10:50 [PATCH v1 00/11] perf: Add support for Intel Processor Trace Alexander Shishkin
2014-02-06 10:50 ` [PATCH v1 01/11] x86: Add Intel Processor Trace (INTEL_PT) cpu feature detection Alexander Shishkin
2014-02-06 10:50 ` [PATCH v1 02/11] perf: Abstract ring_buffer backing store operations Alexander Shishkin
2014-02-06 10:50 ` [PATCH v1 03/11] perf: Allow for multiple ring buffers per event Alexander Shishkin
2014-02-17 14:33   ` Peter Zijlstra
2014-02-18  2:36     ` Andi Kleen
2014-03-14 10:38       ` Peter Zijlstra
2014-03-14 14:10         ` Andi Kleen
2014-03-18 14:06           ` Alexander Shishkin
2014-02-19 22:02     ` Dave Hansen
2014-03-10  9:59       ` Alexander Shishkin
2014-03-10 17:24         ` Andi Kleen
2014-03-14 10:44           ` Peter Zijlstra
2014-03-14 14:13             ` Andi Kleen
2014-03-14 10:41       ` Peter Zijlstra
2014-05-07 15:26   ` Peter Zijlstra
2014-05-07 19:25     ` Ingo Molnar
2014-05-07 21:08     ` Andi Kleen
2014-05-07 21:22       ` Peter Zijlstra
2014-05-08  3:26         ` Alexander Shishkin
2014-05-08  4:05     ` Alexander Shishkin
2014-05-08  9:08       ` Alexander Shishkin
2014-05-08 12:34     ` Alexander Shishkin
2014-05-08 12:41       ` Peter Zijlstra
2014-05-08 12:46         ` Alexander Shishkin
2014-05-08 14:16           ` Peter Zijlstra
2014-02-06 10:50 ` [PATCH v1 04/11] itrace: Infrastructure for instruction flow tracing units Alexander Shishkin
2014-02-06 10:50 ` [PATCH v1 05/11] itrace: Add functionality to include traces in perf event samples Alexander Shishkin
2014-02-06 10:50 ` [PATCH v1 06/11] itrace: Add functionality to include traces in process core dumps Alexander Shishkin
2014-02-06 10:50 ` [PATCH v1 07/11] x86: perf: intel_pt: Intel PT PMU driver Alexander Shishkin
2014-02-06 20:29   ` Andi Kleen
2014-02-17 14:44   ` Peter Zijlstra
2014-02-17 16:07     ` Andi Kleen
2014-02-17 14:46   ` Peter Zijlstra
2014-02-18 12:42     ` Alexander Shishkin
2014-02-06 10:50 ` [PATCH v1 08/11] x86: perf: intel_pt: Add sampling functionality Alexander Shishkin
2014-02-06 10:50 ` [PATCH v1 09/11] x86: perf: intel_pt: Add core dump functionality Alexander Shishkin
2014-02-06 20:36   ` Andi Kleen
2014-02-07  9:03     ` Alexander Shishkin
2014-02-06 23:59   ` Andi Kleen
2014-02-07  9:09     ` Alexander Shishkin
2014-02-06 10:50 ` [PATCH v1 10/11] x86: perf: intel_bts: Add BTS PMU driver Alexander Shishkin
2014-02-06 10:50 ` [PATCH v1 11/11] x86: perf: intel_bts: Add core dump related functionality Alexander Shishkin
2014-02-06 23:57   ` Andi Kleen
2014-02-07  9:02     ` Alexander Shishkin
