linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/3] perf: Add AUX data sampling
@ 2019-10-25 14:08 Alexander Shishkin
  2019-10-25 14:08 ` [PATCH v3 1/3] perf: Allow using AUX data in perf samples Alexander Shishkin
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Alexander Shishkin @ 2019-10-25 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa,
	adrian.hunter, mathieu.poirier, mark.rutland, Alexander Shishkin

Hi Peter,

Here's another version of the AUX sampling series, addressing all the
comments on the previous one [5]: fixed a group-leader refcount leak when
both aux_output and aux_sample_size are set, changed aux_sample_size to u32
in the ABI and removed the pointless sample init bit. Also dropped 4/4
from this series; it will be sent separately. This one has a context
dependency on the attr.__reserved_2 fix [6], as it adds one more reserved bit.

Changes since version one [3]: the issues of NMI-safety and of sampling
hardware events are addressed, the former by adding a new PMU callback,
the latter by making use of grouping. The series also depends on the AUX
output stop fix [4] to work correctly. I decided to post that one
separately, because [4] is also a candidate for perf/urgent.

This series introduces AUX data sampling for perf events, which, in the
case of instruction/branch tracing PMUs like Intel PT, BTS and CS ETM,
means execution flow history leading up to a perf event's overflow.

In the case of Intel PT, this can be used as an alternative to LBR, with
virtually as many branches per sample as you like. It doesn't support
some of the LBR features (branch prediction indication, basic-block-level
timing, etc. [1]), and it can't be exposed as BRANCH_STACK, because
that would require decoding the PT stream in kernel space, which is not
practical. Instead, we deliver the PT data to userspace as is, for
offline processing. The PT decoder already supports presenting PT data
as a virtual LBR.

AUX sampling differs from the snapshot mode in that it doesn't require
instrumentation (to decide when to take a snapshot) and is better suited
for generic data collection, when you don't yet know what you are
looking for. It's also useful for automated data collection, for
example, for feedback-driven compiler optimizations.

It also differs from the "full trace mode" in that it produces much
less data and, consequently, takes up less I/O bandwidth and storage
space, and takes less time to decode.

The bulk of the code is in 1/3, which adds the user interface bits and
the code to measure and copy out AUX data. 3/3 adds the PT side support
for sampling. The former 4/4, which implemented a simpler buffer
management scheme for PT's snapshot mode that would also benefit
sampling, has been dropped from this series and will be sent separately.

The tooling support is ready, although I'm not including it here to
save bandwidth. Adrian or I will post it separately. Meanwhile, it can
be found here [2], updated to reflect the ABI change.

[1] https://marc.info/?l=linux-kernel&m=147467007714928&w=2
[2] https://git.kernel.org/cgit/linux/kernel/git/ash/linux.git/log/?h=perf-aux-sampling
[3] https://marc.info/?l=linux-kernel&m=152878999928771
[4] https://marc.info/?l=linux-kernel&m=157172999231707
[5] https://marc.info/?l=linux-kernel&m=157173832302445
[6] https://marc.info/?l=linux-kernel&m=157200581818800

Alexander Shishkin (3):
  perf: Allow using AUX data in perf samples
  perf/x86/intel/pt: Factor out starting the trace
  perf/x86/intel/pt: Add sampling support

 arch/x86/events/intel/pt.c      |  76 ++++++++++++--
 include/linux/perf_event.h      |  19 ++++
 include/uapi/linux/perf_event.h |  10 +-
 kernel/events/core.c            | 172 +++++++++++++++++++++++++++++++-
 kernel/events/internal.h        |   1 +
 kernel/events/ring_buffer.c     |  36 +++++++
 6 files changed, 303 insertions(+), 11 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v3 1/3] perf: Allow using AUX data in perf samples
  2019-10-25 14:08 [PATCH v3 0/3] perf: Add AUX data sampling Alexander Shishkin
@ 2019-10-25 14:08 ` Alexander Shishkin
  2019-10-28 16:27   ` Peter Zijlstra
                     ` (2 more replies)
  2019-10-25 14:08 ` [PATCH v3 2/3] perf/x86/intel/pt: Factor out starting the trace Alexander Shishkin
  2019-10-25 14:08 ` [PATCH v3 3/3] perf/x86/intel/pt: Add sampling support Alexander Shishkin
  2 siblings, 3 replies; 15+ messages in thread
From: Alexander Shishkin @ 2019-10-25 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa,
	adrian.hunter, mathieu.poirier, mark.rutland, Alexander Shishkin

AUX data can be used to annotate perf events such as performance counters
or tracepoints/breakpoints by including it in sample records when the
PERF_SAMPLE_AUX flag is set. Such samples are instrumental in debugging
and profiling by providing, for example, a history of the instruction flow
leading up to the event's overflow.

The implementation groups an AUX event with all the events that wish to
take samples of its AUX data, such that the AUX event is the group leader.
The sampling events must also specify the desired size of the AUX sample
via attr.aux_sample_size.

AUX-capable PMUs need to add support for sampling explicitly, because it
relies on a new callback that takes a snapshot of the buffer without
touching the event's state.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |  19 ++++
 include/uapi/linux/perf_event.h |  10 +-
 kernel/events/core.c            | 172 +++++++++++++++++++++++++++++++-
 kernel/events/internal.h        |   1 +
 kernel/events/ring_buffer.c     |  36 +++++++
 5 files changed, 233 insertions(+), 5 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 587ae4d002f5..446ce0014e89 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -249,6 +249,8 @@ struct perf_event;
 #define PERF_PMU_CAP_NO_EXCLUDE			0x80
 #define PERF_PMU_CAP_AUX_OUTPUT			0x100
 
+struct perf_output_handle;
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -423,6 +425,19 @@ struct pmu {
 	 */
 	void (*free_aux)		(void *aux); /* optional */
 
+	/*
+	 * Take a snapshot of the AUX buffer without touching the event
+	 * state, so that preempting ->start()/->stop() callbacks does
+	 * not interfere with their logic. Called in PMI context.
+	 *
+	 * Returns the size of AUX data copied to the output handle.
+	 *
+	 * Optional.
+	 */
+	long (*snapshot_aux)		(struct perf_event *event,
+					 struct perf_output_handle *handle,
+					 unsigned long size);
+
 	/*
 	 * Validate address range filters: make sure the HW supports the
 	 * requested configuration and number of filters; return 0 if the
@@ -964,6 +979,7 @@ struct perf_sample_data {
 		u32	reserved;
 	}				cpu_entry;
 	struct perf_callchain_entry	*callchain;
+	u64				aux_size;
 
 	/*
 	 * regs_user may point to task_pt_regs or to regs_user_copy, depending
@@ -1353,6 +1369,9 @@ extern unsigned int perf_output_copy(struct perf_output_handle *handle,
 			     const void *buf, unsigned int len);
 extern unsigned int perf_output_skip(struct perf_output_handle *handle,
 				     unsigned int len);
+extern long perf_output_copy_aux(struct perf_output_handle *aux_handle,
+				 struct perf_output_handle *handle,
+				 unsigned long from, unsigned long to);
 extern int perf_swevent_get_recursion_context(void);
 extern void perf_swevent_put_recursion_context(int rctx);
 extern u64 perf_swevent_set_period(struct perf_event *event);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bb7b271397a6..377d794d3105 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -141,8 +141,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_REGS_INTR			= 1U << 18,
 	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
+	PERF_SAMPLE_AUX				= 1U << 20,
 
-	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 21,		/* non-ABI */
 
 	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
 };
@@ -300,6 +301,7 @@ enum perf_event_read_format {
 					/* add: sample_stack_user */
 #define PERF_ATTR_SIZE_VER4	104	/* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5	112	/* add: aux_watermark */
+#define PERF_ATTR_SIZE_VER6	120	/* add: aux_sample_size */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -424,7 +426,9 @@ struct perf_event_attr {
 	 */
 	__u32	aux_watermark;
 	__u16	sample_max_stack;
-	__u16	__reserved_2;	/* align to __u64 */
+	__u16	__reserved_2;
+	__u32	aux_sample_size;
+	__u32	__reserved_3;
 };
 
 /*
@@ -864,6 +868,8 @@ enum perf_event_type {
 	 *	{ u64			abi; # enum perf_sample_regs_abi
 	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
+	 *	{ u64			size;
+	 *	  char			data[size]; } && PERF_SAMPLE_AUX
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0940c8810be0..36c612dbfcb0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1941,6 +1941,11 @@ static void perf_put_aux_event(struct perf_event *event)
 	}
 }
 
+static bool perf_need_aux_event(struct perf_event *event)
+{
+	return !!event->attr.aux_output || !!event->attr.aux_sample_size;
+}
+
 static int perf_get_aux_event(struct perf_event *event,
 			      struct perf_event *group_leader)
 {
@@ -1953,7 +1958,17 @@ static int perf_get_aux_event(struct perf_event *event,
 	if (!group_leader)
 		return 0;
 
-	if (!perf_aux_output_match(event, group_leader))
+	/*
+	 * aux_output and aux_sample_size are mutually exclusive.
+	 */
+	if (event->attr.aux_output && event->attr.aux_sample_size)
+		return 0;
+
+	if (event->attr.aux_output &&
+	    !perf_aux_output_match(event, group_leader))
+		return 0;
+
+	if (event->attr.aux_sample_size && !group_leader->pmu->snapshot_aux)
 		return 0;
 
 	if (!atomic_long_inc_not_zero(&group_leader->refcount))
@@ -6192,6 +6207,121 @@ perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
 	}
 }
 
+static unsigned long perf_aux_sample_size(struct perf_event *event,
+					  struct perf_sample_data *data,
+					  size_t size)
+{
+	struct perf_event *sampler = event->aux_event;
+	struct ring_buffer *rb;
+
+	data->aux_size = 0;
+
+	if (!sampler)
+		goto out;
+
+	if (WARN_ON_ONCE(READ_ONCE(sampler->state) != PERF_EVENT_STATE_ACTIVE))
+		goto out;
+
+	if (WARN_ON_ONCE(READ_ONCE(sampler->oncpu) != smp_processor_id()))
+		goto out;
+
+	rb = ring_buffer_get(sampler->parent ? sampler->parent : sampler);
+	if (!rb)
+		goto out;
+
+	/*
+	 * If this is an NMI hit inside sampling code, don't take
+	 * the sample. See also perf_aux_sample_output().
+	 */
+	if (READ_ONCE(rb->aux_in_sampling)) {
+		data->aux_size = 0;
+	} else {
+		size = min_t(size_t, size, perf_aux_size(rb));
+		data->aux_size = ALIGN(size, sizeof(u64));
+	}
+	ring_buffer_put(rb);
+
+out:
+	return data->aux_size;
+}
+
+long perf_pmu_aux_sample_output(struct perf_event *event,
+				struct perf_output_handle *handle,
+				unsigned long size)
+{
+	unsigned long flags;
+	long ret;
+
+	/*
+	 * Normal ->start()/->stop() callbacks run in IRQ mode in scheduler
+	 * paths. If we start calling them in NMI context, they may race with
+	 * the IRQ ones, that is, for example, re-starting an event that's just
+	 * been stopped, which is why we're using a separate callback that
+	 * doesn't change the event state.
+	 *
+	 * IRQs need to be disabled to prevent IPIs from racing with us.
+	 */
+	local_irq_save(flags);
+
+	ret = event->pmu->snapshot_aux(event, handle, size);
+
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+static void perf_aux_sample_output(struct perf_event *event,
+				   struct perf_output_handle *handle,
+				   struct perf_sample_data *data)
+{
+	struct perf_event *sampler = event->aux_event;
+	unsigned long pad;
+	struct ring_buffer *rb;
+	long size;
+
+	if (WARN_ON_ONCE(!sampler || !data->aux_size))
+		return;
+
+	rb = ring_buffer_get(sampler->parent ? sampler->parent : sampler);
+	if (!rb)
+		return;
+
+	/*
+	 * Guard against NMI hits inside the critical section;
+	 * see also perf_aux_sample_size().
+	 */
+	WRITE_ONCE(rb->aux_in_sampling, 1);
+
+	size = perf_pmu_aux_sample_output(sampler, handle, data->aux_size);
+
+	/*
+	 * An error here means that perf_output_copy() failed (returned a
+	 * non-zero surplus that it didn't copy), which in its current
+	 * enlightened implementation is not possible. If that changes, we'd
+	 * like to know.
+	 */
+	if (WARN_ON_ONCE(size < 0))
+		goto out_clear;
+
+	/*
+	 * The pad comes from ALIGN()ing data->aux_size up to u64 in
+	 * perf_aux_sample_size(), so should not be more than that.
+	 */
+	pad = data->aux_size - size;
+	if (WARN_ON_ONCE(pad >= sizeof(u64)))
+		pad = 8;
+
+	if (pad) {
+		u64 zero = 0;
+		perf_output_copy(handle, &zero, pad);
+	}
+
+out_clear:
+	WRITE_ONCE(rb->aux_in_sampling, 0);
+
+	ring_buffer_put(rb);
+}
+
 static void __perf_event_header__init_id(struct perf_event_header *header,
 					 struct perf_sample_data *data,
 					 struct perf_event *event)
@@ -6511,6 +6641,13 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
 		perf_output_put(handle, data->phys_addr);
 
+	if (sample_type & PERF_SAMPLE_AUX) {
+		perf_output_put(handle, data->aux_size);
+
+		if (data->aux_size)
+			perf_aux_sample_output(event, handle, data);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -6699,6 +6836,35 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
 		data->phys_addr = perf_virt_to_phys(data->addr);
+
+	if (sample_type & PERF_SAMPLE_AUX) {
+		u64 size;
+
+		header->size += sizeof(u64); /* size */
+
+		/*
+		 * Given the 16bit nature of header::size, an AUX sample can
+		 * easily overflow it, what with all the preceding sample bits.
+		 * Make sure this doesn't happen by using up to U16_MAX bytes
+		 * per sample in total (rounded down to 8 byte boundary).
+		 */
+		size = min_t(size_t, U16_MAX - header->size,
+			     event->attr.aux_sample_size);
+		size = rounddown(size, 8);
+		size = perf_aux_sample_size(event, data, size);
+
+		WARN_ON_ONCE(size + header->size > U16_MAX);
+		header->size += size;
+	}
+	/*
+	 * If you're adding more sample types here, you likely need to do
+	 * something about the overflowing header::size, like repurpose the
+	 * lowest 3 bits of size, which should be always zero at the moment.
+	 * This raises a more important question, do we really need 512k sized
+	 * samples and why, so good argumentation is in order for whatever you
+	 * do here next.
+	 */
+	WARN_ON_ONCE(header->size & 7);
 }
 
 static __always_inline int
@@ -10660,7 +10826,7 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 
 	attr->size = size;
 
-	if (attr->__reserved_1 || attr->__reserved_2)
+	if (attr->__reserved_1 || attr->__reserved_2 || attr->__reserved_3)
 		return -EINVAL;
 
 	if (attr->sample_type & ~(PERF_SAMPLE_MAX-1))
@@ -11210,7 +11376,7 @@ SYSCALL_DEFINE5(perf_event_open,
 		}
 	}
 
-	if (event->attr.aux_output && !perf_get_aux_event(event, group_leader))
+	if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader))
 		goto err_locked;
 
 	/*
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 3aef4191798c..747d67f130cb 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -50,6 +50,7 @@ struct ring_buffer {
 	unsigned long			aux_mmap_locked;
 	void				(*free_aux)(void *);
 	refcount_t			aux_refcount;
+	int				aux_in_sampling;
 	void				**aux_pages;
 	void				*aux_priv;
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 246c83ac5643..7ffd5c763f93 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -562,6 +562,42 @@ void *perf_get_aux(struct perf_output_handle *handle)
 }
 EXPORT_SYMBOL_GPL(perf_get_aux);
 
+/*
+ * Copy out AUX data from an AUX handle.
+ */
+long perf_output_copy_aux(struct perf_output_handle *aux_handle,
+			  struct perf_output_handle *handle,
+			  unsigned long from, unsigned long to)
+{
+	unsigned long tocopy, remainder, len = 0;
+	struct ring_buffer *rb = aux_handle->rb;
+	void *addr;
+
+	from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	to &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+
+	do {
+		tocopy = PAGE_SIZE - offset_in_page(from);
+		if (to > from)
+			tocopy = min(tocopy, to - from);
+		if (!tocopy)
+			break;
+
+		addr = rb->aux_pages[from >> PAGE_SHIFT];
+		addr += offset_in_page(from);
+
+		remainder = perf_output_copy(handle, addr, tocopy);
+		if (remainder)
+			return -EFAULT;
+
+		len += tocopy;
+		from += tocopy;
+		from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	} while (to != from);
+
+	return len;
+}
+
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
 static struct page *rb_alloc_aux_page(int node, int order)
-- 
2.23.0



* [PATCH v3 2/3] perf/x86/intel/pt: Factor out starting the trace
  2019-10-25 14:08 [PATCH v3 0/3] perf: Add AUX data sampling Alexander Shishkin
  2019-10-25 14:08 ` [PATCH v3 1/3] perf: Allow using AUX data in perf samples Alexander Shishkin
@ 2019-10-25 14:08 ` Alexander Shishkin
  2019-11-13 10:56   ` [tip: perf/core] perf/x86/intel/pt: Factor out pt_config_start() tip-bot2 for Alexander Shishkin
  2019-10-25 14:08 ` [PATCH v3 3/3] perf/x86/intel/pt: Add sampling support Alexander Shishkin
  2 siblings, 1 reply; 15+ messages in thread
From: Alexander Shishkin @ 2019-10-25 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa,
	adrian.hunter, mathieu.poirier, mark.rutland, Alexander Shishkin

The PT trace is currently enabled at the bottom of the event configuration
function, which takes care of all configuration bits related to a given
event, including the address filter update. This is only needed when
the event configuration changes, that is, in ->add()/->start().

In the interrupt path we can use a lighter version that keeps the
configuration intact, since it hasn't changed, and only flips the
enable bit.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/events/intel/pt.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 05e43d0f430b..170f3b402274 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -397,6 +397,20 @@ static bool pt_event_valid(struct perf_event *event)
  * These all are cpu affine and operate on a local PT
  */
 
+static void pt_config_start(struct perf_event *event)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	u64 ctl = event->hw.config;
+
+	ctl |= RTIT_CTL_TRACEEN;
+	if (READ_ONCE(pt->vmx_on))
+		perf_aux_output_flag(&pt->handle, PERF_AUX_FLAG_PARTIAL);
+	else
+		wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+	WRITE_ONCE(event->hw.config, ctl);
+}
+
 /* Address ranges and their corresponding msr configuration registers */
 static const struct pt_address_range {
 	unsigned long	msr_a;
@@ -468,7 +482,6 @@ static u64 pt_config_filters(struct perf_event *event)
 
 static void pt_config(struct perf_event *event)
 {
-	struct pt *pt = this_cpu_ptr(&pt_ctx);
 	u64 reg;
 
 	/* First round: clear STATUS, in particular the PSB byte counter. */
@@ -501,10 +514,7 @@ static void pt_config(struct perf_event *event)
 	reg |= (event->attr.config & PT_CONFIG_MASK);
 
 	event->hw.config = reg;
-	if (READ_ONCE(pt->vmx_on))
-		perf_aux_output_flag(&pt->handle, PERF_AUX_FLAG_PARTIAL);
-	else
-		wrmsrl(MSR_IA32_RTIT_CTL, reg);
+	pt_config_start(event);
 }
 
 static void pt_config_stop(struct perf_event *event)
@@ -1381,7 +1391,7 @@ void intel_pt_interrupt(void)
 
 		pt_config_buffer(topa_to_page(buf->cur)->table, buf->cur_idx,
 				 buf->output_off);
-		pt_config(event);
+		pt_config_start(event);
 	}
 }
 
-- 
2.23.0



* [PATCH v3 3/3] perf/x86/intel/pt: Add sampling support
  2019-10-25 14:08 [PATCH v3 0/3] perf: Add AUX data sampling Alexander Shishkin
  2019-10-25 14:08 ` [PATCH v3 1/3] perf: Allow using AUX data in perf samples Alexander Shishkin
  2019-10-25 14:08 ` [PATCH v3 2/3] perf/x86/intel/pt: Factor out starting the trace Alexander Shishkin
@ 2019-10-25 14:08 ` Alexander Shishkin
  2019-11-13 10:56   ` [tip: perf/core] " tip-bot2 for Alexander Shishkin
  2 siblings, 1 reply; 15+ messages in thread
From: Alexander Shishkin @ 2019-10-25 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa,
	adrian.hunter, mathieu.poirier, mark.rutland, Alexander Shishkin

Add AUX sampling support to the PT PMU: implement an NMI-safe callback
that takes a snapshot of the buffer without touching the event's state.
This is done for PT events that don't use PMIs, that is, snapshot mode
(RO mapping of the AUX area).

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/events/intel/pt.c | 54 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 170f3b402274..2f20d5a333c1 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1208,6 +1208,13 @@ pt_buffer_setup_aux(struct perf_event *event, void **pages,
 	if (!nr_pages)
 		return NULL;
 
+	/*
+	 * Only support AUX sampling in snapshot mode, where we don't
+	 * generate NMIs.
+	 */
+	if (event->attr.aux_sample_size && !snapshot)
+		return NULL;
+
 	if (cpu == -1)
 		cpu = raw_smp_processor_id();
 	node = cpu_to_node(cpu);
@@ -1506,6 +1513,52 @@ static void pt_event_stop(struct perf_event *event, int mode)
 	}
 }
 
+static long pt_event_snapshot_aux(struct perf_event *event,
+				  struct perf_output_handle *handle,
+				  unsigned long size)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	unsigned long from = 0, to;
+	long ret;
+
+	if (WARN_ON_ONCE(!buf))
+		return 0;
+
+	/*
+	 * Sampling is only allowed on snapshot events;
+	 * see pt_buffer_setup_aux().
+	 */
+	if (WARN_ON_ONCE(!buf->snapshot))
+		return 0;
+
+	/*
+	 * Here, handle_nmi tells us if the tracing is on
+	 */
+	if (READ_ONCE(pt->handle_nmi))
+		pt_config_stop(event);
+
+	pt_read_offset(buf);
+	pt_update_head(pt);
+
+	to = local_read(&buf->data_size);
+	if (to < size)
+		from = buf->nr_pages << PAGE_SHIFT;
+	from += to - size;
+
+	ret = perf_output_copy_aux(&pt->handle, handle, from, to);
+
+	/*
+	 * If the tracing was on when we turned up, restart it.
+	 * Compiler barrier not needed as we couldn't have been
+	 * preempted by anything that touches pt->handle_nmi.
+	 */
+	if (pt->handle_nmi)
+		pt_config_start(event);
+
+	return ret;
+}
+
 static void pt_event_del(struct perf_event *event, int mode)
 {
 	pt_event_stop(event, PERF_EF_UPDATE);
@@ -1625,6 +1678,7 @@ static __init int pt_init(void)
 	pt_pmu.pmu.del			 = pt_event_del;
 	pt_pmu.pmu.start		 = pt_event_start;
 	pt_pmu.pmu.stop			 = pt_event_stop;
+	pt_pmu.pmu.snapshot_aux		 = pt_event_snapshot_aux;
 	pt_pmu.pmu.read			 = pt_event_read;
 	pt_pmu.pmu.setup_aux		 = pt_buffer_setup_aux;
 	pt_pmu.pmu.free_aux		 = pt_buffer_free_aux;
-- 
2.23.0



* Re: [PATCH v3 1/3] perf: Allow using AUX data in perf samples
  2019-10-25 14:08 ` [PATCH v3 1/3] perf: Allow using AUX data in perf samples Alexander Shishkin
@ 2019-10-28 16:27   ` Peter Zijlstra
  2019-10-28 16:28     ` Peter Zijlstra
  2019-10-28 17:08     ` Alexander Shishkin
  2019-11-04 10:16   ` Peter Zijlstra
  2019-11-13 10:56   ` [tip: perf/core] perf/aux: " tip-bot2 for Alexander Shishkin
  2 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2019-10-28 16:27 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa,
	adrian.hunter, mathieu.poirier, mark.rutland

On Fri, Oct 25, 2019 at 05:08:33PM +0300, Alexander Shishkin wrote:
> +static void perf_aux_sample_output(struct perf_event *event,
> +				   struct perf_output_handle *handle,
> +				   struct perf_sample_data *data)
> +{
> +	struct perf_event *sampler = event->aux_event;
> +	unsigned long pad;
> +	struct ring_buffer *rb;
> +	long size;
> +
> +	if (WARN_ON_ONCE(!sampler || !data->aux_size))
> +		return;
> +
> +	rb = ring_buffer_get(sampler->parent ? sampler->parent : sampler);
> +	if (!rb)
> +		return;
> +
> +	/*
> +	 * Guard against NMI hits inside the critical section;
> +	 * see also perf_aux_sample_size().
> +	 */
> +	WRITE_ONCE(rb->aux_in_sampling, 1);
> +
> +	size = perf_pmu_aux_sample_output(sampler, handle, data->aux_size);
> +
> +	/*
> +	 * An error here means that perf_output_copy() failed (returned a
> +	 * non-zero surplus that it didn't copy), which in its current
> +	 * enlightened implementation is not possible. If that changes, we'd
> +	 * like to know.
> +	 */
> +	if (WARN_ON_ONCE(size < 0))
> +		goto out_clear;
> +
> +	/*
> +	 * The pad comes from ALIGN()ing data->aux_size up to u64 in
> +	 * perf_aux_sample_size(), so should not be more than that.
> +	 */
> +	pad = data->aux_size - size;
> +	if (WARN_ON_ONCE(pad >= sizeof(u64)))
> +		pad = 8;
> +
> +	if (pad) {
> +		u64 zero = 0;
> +		perf_output_copy(handle, &zero, pad);
> +	}
> +
> +out_clear:
> +	WRITE_ONCE(rb->aux_in_sampling, 0);
> +
> +	ring_buffer_put(rb);
> +}

I have the below delta on top of this patch.

And while I get why we need recursion protection for pmu::snapshot_aux,
I'm a little puzzled on why it is over the padding, that is, why isn't
the whole of aux_in_sampling inside (the newly minted)
perf_pmu_snapshot_aux() ?

---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6237,7 +6237,7 @@ perf_output_sample_ustack(struct perf_ou
 	}
 }
 
-static unsigned long perf_aux_sample_size(struct perf_event *event,
+static unsigned long perf_prepare_sample_aux(struct perf_event *event,
 					  struct perf_sample_data *data,
 					  size_t size)
 {
@@ -6275,9 +6275,9 @@ static unsigned long perf_aux_sample_siz
 	return data->aux_size;
 }
 
-long perf_pmu_aux_sample_output(struct perf_event *event,
-				struct perf_output_handle *handle,
-				unsigned long size)
+long perf_pmu_snapshot_aux(struct perf_event *event,
+			   struct perf_output_handle *handle,
+			   unsigned long size)
 {
 	unsigned long flags;
 	long ret;
@@ -6318,11 +6318,12 @@ static void perf_aux_sample_output(struc
 
 	/*
 	 * Guard against NMI hits inside the critical section;
-	 * see also perf_aux_sample_size().
+	 * see also perf_prepare_sample_aux().
 	 */
 	WRITE_ONCE(rb->aux_in_sampling, 1);
+	barrier();
 
-	size = perf_pmu_aux_sample_output(sampler, handle, data->aux_size);
+	size = perf_pmu_snapshot_aux(sampler, handle, data->aux_size);
 
 	/*
 	 * An error here means that perf_output_copy() failed (returned a
@@ -6335,7 +6336,7 @@ static void perf_aux_sample_output(struc
 
 	/*
 	 * The pad comes from ALIGN()ing data->aux_size up to u64 in
-	 * perf_aux_sample_size(), so should not be more than that.
+	 * perf_prepare_sample_aux(), so should not be more than that.
 	 */
 	pad = data->aux_size - size;
 	if (WARN_ON_ONCE(pad >= sizeof(u64)))
@@ -6347,6 +6348,7 @@ static void perf_aux_sample_output(struc
 	}
 
 out_clear:
+	barrier();
 	WRITE_ONCE(rb->aux_in_sampling, 0);
 
 	ring_buffer_put(rb);
@@ -6881,7 +6883,7 @@ void perf_prepare_sample(struct perf_eve
 		size = min_t(size_t, U16_MAX - header->size,
 			     event->attr.aux_sample_size);
 		size = rounddown(size, 8);
-		size = perf_aux_sample_size(event, data, size);
+		size = perf_prepare_sample_aux(event, data, size);
 
 		WARN_ON_ONCE(size + header->size > U16_MAX);
 		header->size += size;


* Re: [PATCH v3 1/3] perf: Allow using AUX data in perf samples
  2019-10-28 16:27   ` Peter Zijlstra
@ 2019-10-28 16:28     ` Peter Zijlstra
  2019-10-28 17:10       ` Alexander Shishkin
  2019-10-28 17:08     ` Alexander Shishkin
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2019-10-28 16:28 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa,
	adrian.hunter, mathieu.poirier, mark.rutland

On Mon, Oct 28, 2019 at 05:27:12PM +0100, Peter Zijlstra wrote:
> And while I get why we need recursion protection for pmu::snapshot_aux,
> I'm a little puzzled on why it is over the padding, that is, why isn't
> the whole of aux_in_sampling inside (the newly minted)
> perf_pmu_snapshot_aux() ?

That is, given the previous delta, the below.

---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6292,9 +6292,17 @@ long perf_pmu_snapshot_aux(struct perf_e
 	 * IRQs need to be disabled to prevent IPIs from racing with us.
 	 */
 	local_irq_save(flags);
+	/*
+	 * Guard against NMI hits inside the critical section;
+	 * see also perf_prepare_sample_aux().
+	 */
+	WRITE_ONCE(rb->aux_in_sampling, 1);
+	barrier();
 
 	ret = event->pmu->snapshot_aux(event, handle, size);
 
+	barrier();
+	WRITE_ONCE(rb->aux_in_sampling, 0);
 	local_irq_restore(flags);
 
 	return ret;
@@ -6316,13 +6324,6 @@ static void perf_aux_sample_output(struc
 	if (!rb)
 		return;
 
-	/*
-	 * Guard against NMI hits inside the critical section;
-	 * see also perf_prepare_sample_aux().
-	 */
-	WRITE_ONCE(rb->aux_in_sampling, 1);
-	barrier();
-
 	size = perf_pmu_snapshot_aux(sampler, handle, data->aux_size);
 
 	/*
@@ -6348,9 +6349,6 @@ static void perf_aux_sample_output(struc
 	}
 
 out_clear:
-	barrier();
-	WRITE_ONCE(rb->aux_in_sampling, 0);
-
 	ring_buffer_put(rb);
 }
 


* Re: [PATCH v3 1/3] perf: Allow using AUX data in perf samples
  2019-10-28 16:27   ` Peter Zijlstra
  2019-10-28 16:28     ` Peter Zijlstra
@ 2019-10-28 17:08     ` Alexander Shishkin
  2019-11-04  8:40       ` Peter Zijlstra
  1 sibling, 1 reply; 15+ messages in thread
From: Alexander Shishkin @ 2019-10-28 17:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa,
	adrian.hunter, mathieu.poirier, mark.rutland, alexander.shishkin

Peter Zijlstra <peterz@infradead.org> writes:

> I have the below delta on top of this patch.
>
> And while I get why we need recursion protection for pmu::snapshot_aux,
> I'm a little puzzled on why it is over the padding, that is, why isn't
> the whole of aux_in_sampling inside (the newly minted)
> perf_pmu_snapshot_aux() ?

No reason. Too long staring at that code by myself.

> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6237,7 +6237,7 @@ perf_output_sample_ustack(struct perf_ou
>  	}
>  }
>  
> -static unsigned long perf_aux_sample_size(struct perf_event *event,
> +static unsigned long perf_prepare_sample_aux(struct perf_event *event,
>  					  struct perf_sample_data *data,
>  					  size_t size)
>  {
> @@ -6275,9 +6275,9 @@ static unsigned long perf_aux_sample_siz
>  	return data->aux_size;
>  }
>  
> -long perf_pmu_aux_sample_output(struct perf_event *event,
> -				struct perf_output_handle *handle,
> -				unsigned long size)
> +long perf_pmu_snapshot_aux(struct perf_event *event,
> +			   struct perf_output_handle *handle,
> +			   unsigned long size)

That makes more sense indeed.

>  {
>  	unsigned long flags;
>  	long ret;
> @@ -6318,11 +6318,12 @@ static void perf_aux_sample_output(struc
>  
>  	/*
>  	 * Guard against NMI hits inside the critical section;
> -	 * see also perf_aux_sample_size().
> +	 * see also perf_prepare_sample_aux().
>  	 */
>  	WRITE_ONCE(rb->aux_in_sampling, 1);
> +	barrier();

Isn't WRITE_ONCE() enough of a barrier on its own? My thinking was that
we only need a compiler barrier here, hence the WRITE_ONCE.

Thanks,
--
Alex

* Re: [PATCH v3 1/3] perf: Allow using AUX data in perf samples
  2019-10-28 16:28     ` Peter Zijlstra
@ 2019-10-28 17:10       ` Alexander Shishkin
  0 siblings, 0 replies; 15+ messages in thread
From: Alexander Shishkin @ 2019-10-28 17:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa,
	adrian.hunter, mathieu.poirier, mark.rutland, alexander.shishkin

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Oct 28, 2019 at 05:27:12PM +0100, Peter Zijlstra wrote:
>> And while I get why we need recursion protection for pmu::snapshot_aux,
>> I'm a little puzzled on why it is over the padding, that is, why isn't
>> the whole of aux_in_sampling inside (the newly minted)
>> perf_pmu_snapshot_aux() ?
>
> That is, given the previous delta, the below.
>
> ---
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6292,9 +6292,17 @@ long perf_pmu_snapshot_aux(struct perf_e
>  	 * IRQs need to be disabled to prevent IPIs from racing with us.
>  	 */
>  	local_irq_save(flags);
> +	/*
> +	 * Guard against NMI hits inside the critical section;
> +	 * see also perf_prepare_sample_aux().
> +	 */
> +	WRITE_ONCE(rb->aux_in_sampling, 1);
> +	barrier();
>  
>  	ret = event->pmu->snapshot_aux(event, handle, size);
>  
> +	barrier();
> +	WRITE_ONCE(rb->aux_in_sampling, 0);
>  	local_irq_restore(flags);
>  
>  	return ret;
> @@ -6316,13 +6324,6 @@ static void perf_aux_sample_output(struc
>  	if (!rb)
>  		return;
>  
> -	/*
> -	 * Guard against NMI hits inside the critical section;
> -	 * see also perf_prepare_sample_aux().
> -	 */
> -	WRITE_ONCE(rb->aux_in_sampling, 1);
> -	barrier();
> -
>  	size = perf_pmu_snapshot_aux(sampler, handle, data->aux_size);
>  
>  	/*
> @@ -6348,9 +6349,6 @@ static void perf_aux_sample_output(struc
>  	}
>  
>  out_clear:
> -	barrier();
> -	WRITE_ONCE(rb->aux_in_sampling, 0);
> -
>  	ring_buffer_put(rb);

I can't tell without applying these whether the labels still make sense.
But this one probably becomes "out_put" at this point.

Thanks,
--
Alex

* Re: [PATCH v3 1/3] perf: Allow using AUX data in perf samples
  2019-10-28 17:08     ` Alexander Shishkin
@ 2019-11-04  8:40       ` Peter Zijlstra
  2019-11-04 10:40         ` Alexander Shishkin
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2019-11-04  8:40 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa,
	adrian.hunter, mathieu.poirier, mark.rutland

On Mon, Oct 28, 2019 at 07:08:18PM +0200, Alexander Shishkin wrote:

> > @@ -6318,11 +6318,12 @@ static void perf_aux_sample_output(struc
> >  
> >  	/*
> >  	 * Guard against NMI hits inside the critical section;
> > -	 * see also perf_aux_sample_size().
> > +	 * see also perf_prepare_sample_aux().
> >  	 */
> >  	WRITE_ONCE(rb->aux_in_sampling, 1);
> > +	barrier();
> 
> Isn't WRITE_ONCE() barrier enough on its own? My thinking was that we
> only need a compiler barrier here, hence the WRITE_ONCE.

WRITE_ONCE() is a volatile store and (IIRC) the compiler ensures order
against other volatile things, but not in general.

barrier() OTOH clobbers all of memory and thereby ensures nothing can
get hoisted over it.

Now, the only thing we do inside this region is an indirect call, which
on its own already implies a sync point for as long as the compiler
cannot inline it, so it might be a bit paranoid on my end (I don't think
even LTO can reduce this indirection and cause inlining).

* Re: [PATCH v3 1/3] perf: Allow using AUX data in perf samples
  2019-10-25 14:08 ` [PATCH v3 1/3] perf: Allow using AUX data in perf samples Alexander Shishkin
  2019-10-28 16:27   ` Peter Zijlstra
@ 2019-11-04 10:16   ` Peter Zijlstra
  2019-11-04 12:30     ` Leo Yan
  2019-11-13 10:56   ` [tip: perf/core] perf/aux: " tip-bot2 for Alexander Shishkin
  2 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2019-11-04 10:16 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa,
	adrian.hunter, mathieu.poirier, mark.rutland, leo.yan


Leo Yan,

Since you were helpful in the other CS thread, could you please have a
look to see if this interface will work for you guys?

Thanks!


On Fri, Oct 25, 2019 at 05:08:33PM +0300, Alexander Shishkin wrote:
> AUX data can be used to annotate perf events such as performance counters
> or tracepoints/breakpoints by including it in sample records when
> PERF_SAMPLE_AUX flag is set. Such samples would be instrumental in debugging
> and profiling by providing, for example, a history of instruction flow
> leading up to the event's overflow.
> 
> The implementation makes use of grouping an AUX event with all the events
> that wish to take samples of the AUX data, such that the former is the
> group leader. The samplees should also specify the desired size of the AUX
> sample via attr.aux_sample_size.
> 
> AUX capable PMUs need to explicitly add support for sampling, because it
> relies on a new callback to take a snapshot of the buffer without touching
> the event states.
> 
> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> ---
>  include/linux/perf_event.h      |  19 ++++
>  include/uapi/linux/perf_event.h |  10 +-
>  kernel/events/core.c            | 172 +++++++++++++++++++++++++++++++-
>  kernel/events/internal.h        |   1 +
>  kernel/events/ring_buffer.c     |  36 +++++++
>  5 files changed, 233 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 587ae4d002f5..446ce0014e89 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -249,6 +249,8 @@ struct perf_event;
>  #define PERF_PMU_CAP_NO_EXCLUDE			0x80
>  #define PERF_PMU_CAP_AUX_OUTPUT			0x100
>  
> +struct perf_output_handle;
> +
>  /**
>   * struct pmu - generic performance monitoring unit
>   */
> @@ -423,6 +425,19 @@ struct pmu {
>  	 */
>  	void (*free_aux)		(void *aux); /* optional */
>  
> +	/*
> +	 * Take a snapshot of the AUX buffer without touching the event
> +	 * state, so that preempting ->start()/->stop() callbacks does
> +	 * not interfere with their logic. Called in PMI context.
> +	 *
> +	 * Returns the size of AUX data copied to the output handle.
> +	 *
> +	 * Optional.
> +	 */
> +	long (*snapshot_aux)		(struct perf_event *event,
> +					 struct perf_output_handle *handle,
> +					 unsigned long size);
> +
>  	/*
>  	 * Validate address range filters: make sure the HW supports the
>  	 * requested configuration and number of filters; return 0 if the
> @@ -964,6 +979,7 @@ struct perf_sample_data {
>  		u32	reserved;
>  	}				cpu_entry;
>  	struct perf_callchain_entry	*callchain;
> +	u64				aux_size;
>  
>  	/*
>  	 * regs_user may point to task_pt_regs or to regs_user_copy, depending
> @@ -1353,6 +1369,9 @@ extern unsigned int perf_output_copy(struct perf_output_handle *handle,
>  			     const void *buf, unsigned int len);
>  extern unsigned int perf_output_skip(struct perf_output_handle *handle,
>  				     unsigned int len);
> +extern long perf_output_copy_aux(struct perf_output_handle *aux_handle,
> +				 struct perf_output_handle *handle,
> +				 unsigned long from, unsigned long to);
>  extern int perf_swevent_get_recursion_context(void);
>  extern void perf_swevent_put_recursion_context(int rctx);
>  extern u64 perf_swevent_set_period(struct perf_event *event);
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index bb7b271397a6..377d794d3105 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -141,8 +141,9 @@ enum perf_event_sample_format {
>  	PERF_SAMPLE_TRANSACTION			= 1U << 17,
>  	PERF_SAMPLE_REGS_INTR			= 1U << 18,
>  	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
> +	PERF_SAMPLE_AUX				= 1U << 20,
>  
> -	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
> +	PERF_SAMPLE_MAX = 1U << 21,		/* non-ABI */
>  
>  	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
>  };
> @@ -300,6 +301,7 @@ enum perf_event_read_format {
>  					/* add: sample_stack_user */
>  #define PERF_ATTR_SIZE_VER4	104	/* add: sample_regs_intr */
>  #define PERF_ATTR_SIZE_VER5	112	/* add: aux_watermark */
> +#define PERF_ATTR_SIZE_VER6	120	/* add: aux_sample_size */
>  
>  /*
>   * Hardware event_id to monitor via a performance monitoring event:
> @@ -424,7 +426,9 @@ struct perf_event_attr {
>  	 */
>  	__u32	aux_watermark;
>  	__u16	sample_max_stack;
> -	__u16	__reserved_2;	/* align to __u64 */
> +	__u16	__reserved_2;
> +	__u32	aux_sample_size;
> +	__u32	__reserved_3;
>  };
>  
>  /*
> @@ -864,6 +868,8 @@ enum perf_event_type {
>  	 *	{ u64			abi; # enum perf_sample_regs_abi
>  	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>  	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
> +	 *	{ u64			size;
> +	 *	  char			data[size]; } && PERF_SAMPLE_AUX
>  	 * };
>  	 */
>  	PERF_RECORD_SAMPLE			= 9,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 0940c8810be0..36c612dbfcb0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1941,6 +1941,11 @@ static void perf_put_aux_event(struct perf_event *event)
>  	}
>  }
>  
> +static bool perf_need_aux_event(struct perf_event *event)
> +{
> +	return !!event->attr.aux_output || !!event->attr.aux_sample_size;
> +}
> +
>  static int perf_get_aux_event(struct perf_event *event,
>  			      struct perf_event *group_leader)
>  {
> @@ -1953,7 +1958,17 @@ static int perf_get_aux_event(struct perf_event *event,
>  	if (!group_leader)
>  		return 0;
>  
> -	if (!perf_aux_output_match(event, group_leader))
> +	/*
> +	 * aux_output and aux_sample_size are mutually exclusive.
> +	 */
> +	if (event->attr.aux_output && event->attr.aux_sample_size)
> +		return 0;
> +
> +	if (event->attr.aux_output &&
> +	    !perf_aux_output_match(event, group_leader))
> +		return 0;
> +
> +	if (event->attr.aux_sample_size && !group_leader->pmu->snapshot_aux)
>  		return 0;
>  
>  	if (!atomic_long_inc_not_zero(&group_leader->refcount))
> @@ -6192,6 +6207,121 @@ perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
>  	}
>  }
>  
> +static unsigned long perf_aux_sample_size(struct perf_event *event,
> +					  struct perf_sample_data *data,
> +					  size_t size)
> +{
> +	struct perf_event *sampler = event->aux_event;
> +	struct ring_buffer *rb;
> +
> +	data->aux_size = 0;
> +
> +	if (!sampler)
> +		goto out;
> +
> +	if (WARN_ON_ONCE(READ_ONCE(sampler->state) != PERF_EVENT_STATE_ACTIVE))
> +		goto out;
> +
> +	if (WARN_ON_ONCE(READ_ONCE(sampler->oncpu) != smp_processor_id()))
> +		goto out;
> +
> +	rb = ring_buffer_get(sampler->parent ? sampler->parent : sampler);
> +	if (!rb)
> +		goto out;
> +
> +	/*
> +	 * If this is an NMI hit inside sampling code, don't take
> +	 * the sample. See also perf_aux_sample_output().
> +	 */
> +	if (READ_ONCE(rb->aux_in_sampling)) {
> +		data->aux_size = 0;
> +	} else {
> +		size = min_t(size_t, size, perf_aux_size(rb));
> +		data->aux_size = ALIGN(size, sizeof(u64));
> +	}
> +	ring_buffer_put(rb);
> +
> +out:
> +	return data->aux_size;
> +}
> +
> +long perf_pmu_aux_sample_output(struct perf_event *event,
> +				struct perf_output_handle *handle,
> +				unsigned long size)
> +{
> +	unsigned long flags;
> +	long ret;
> +
> +	/*
> +	 * Normal ->start()/->stop() callbacks run in IRQ mode in scheduler
> +	 * paths. If we start calling them in NMI context, they may race with
> +	 * the IRQ ones, that is, for example, re-starting an event that's just
> +	 * been stopped, which is why we're using a separate callback that
> +	 * doesn't change the event state.
> +	 *
> +	 * IRQs need to be disabled to prevent IPIs from racing with us.
> +	 */
> +	local_irq_save(flags);
> +
> +	ret = event->pmu->snapshot_aux(event, handle, size);
> +
> +	local_irq_restore(flags);
> +
> +	return ret;
> +}
> +
> +static void perf_aux_sample_output(struct perf_event *event,
> +				   struct perf_output_handle *handle,
> +				   struct perf_sample_data *data)
> +{
> +	struct perf_event *sampler = event->aux_event;
> +	unsigned long pad;
> +	struct ring_buffer *rb;
> +	long size;
> +
> +	if (WARN_ON_ONCE(!sampler || !data->aux_size))
> +		return;
> +
> +	rb = ring_buffer_get(sampler->parent ? sampler->parent : sampler);
> +	if (!rb)
> +		return;
> +
> +	/*
> +	 * Guard against NMI hits inside the critical section;
> +	 * see also perf_aux_sample_size().
> +	 */
> +	WRITE_ONCE(rb->aux_in_sampling, 1);
> +
> +	size = perf_pmu_aux_sample_output(sampler, handle, data->aux_size);
> +
> +	/*
> +	 * An error here means that perf_output_copy() failed (returned a
> +	 * non-zero surplus that it didn't copy), which in its current
> +	 * enlightened implementation is not possible. If that changes, we'd
> +	 * like to know.
> +	 */
> +	if (WARN_ON_ONCE(size < 0))
> +		goto out_clear;
> +
> +	/*
> +	 * The pad comes from ALIGN()ing data->aux_size up to u64 in
> +	 * perf_aux_sample_size(), so should not be more than that.
> +	 */
> +	pad = data->aux_size - size;
> +	if (WARN_ON_ONCE(pad >= sizeof(u64)))
> +		pad = 8;
> +
> +	if (pad) {
> +		u64 zero = 0;
> +		perf_output_copy(handle, &zero, pad);
> +	}
> +
> +out_clear:
> +	WRITE_ONCE(rb->aux_in_sampling, 0);
> +
> +	ring_buffer_put(rb);
> +}
> +
>  static void __perf_event_header__init_id(struct perf_event_header *header,
>  					 struct perf_sample_data *data,
>  					 struct perf_event *event)
> @@ -6511,6 +6641,13 @@ void perf_output_sample(struct perf_output_handle *handle,
>  	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
>  		perf_output_put(handle, data->phys_addr);
>  
> +	if (sample_type & PERF_SAMPLE_AUX) {
> +		perf_output_put(handle, data->aux_size);
> +
> +		if (data->aux_size)
> +			perf_aux_sample_output(event, handle, data);
> +	}
> +
>  	if (!event->attr.watermark) {
>  		int wakeup_events = event->attr.wakeup_events;
>  
> @@ -6699,6 +6836,35 @@ void perf_prepare_sample(struct perf_event_header *header,
>  
>  	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
>  		data->phys_addr = perf_virt_to_phys(data->addr);
> +
> +	if (sample_type & PERF_SAMPLE_AUX) {
> +		u64 size;
> +
> +		header->size += sizeof(u64); /* size */
> +
> +		/*
> +		 * Given the 16bit nature of header::size, an AUX sample can
> +		 * easily overflow it, what with all the preceding sample bits.
> +		 * Make sure this doesn't happen by using up to U16_MAX bytes
> +		 * per sample in total (rounded down to 8 byte boundary).
> +		 */
> +		size = min_t(size_t, U16_MAX - header->size,
> +			     event->attr.aux_sample_size);
> +		size = rounddown(size, 8);
> +		size = perf_aux_sample_size(event, data, size);
> +
> +		WARN_ON_ONCE(size + header->size > U16_MAX);
> +		header->size += size;
> +	}
> +	/*
> +	 * If you're adding more sample types here, you likely need to do
> +	 * something about the overflowing header::size, like repurpose the
> +	 * lowest 3 bits of size, which should be always zero at the moment.
> +	 * This raises a more important question, do we really need 512k sized
> +	 * samples and why, so good argumentation is in order for whatever you
> +	 * do here next.
> +	 */
> +	WARN_ON_ONCE(header->size & 7);
>  }
>  
>  static __always_inline int
> @@ -10660,7 +10826,7 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
>  
>  	attr->size = size;
>  
> -	if (attr->__reserved_1 || attr->__reserved_2)
> +	if (attr->__reserved_1 || attr->__reserved_2 || attr->__reserved_3)
>  		return -EINVAL;
>  
>  	if (attr->sample_type & ~(PERF_SAMPLE_MAX-1))
> @@ -11210,7 +11376,7 @@ SYSCALL_DEFINE5(perf_event_open,
>  		}
>  	}
>  
> -	if (event->attr.aux_output && !perf_get_aux_event(event, group_leader))
> +	if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader))
>  		goto err_locked;
>  
>  	/*
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index 3aef4191798c..747d67f130cb 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -50,6 +50,7 @@ struct ring_buffer {
>  	unsigned long			aux_mmap_locked;
>  	void				(*free_aux)(void *);
>  	refcount_t			aux_refcount;
> +	int				aux_in_sampling;
>  	void				**aux_pages;
>  	void				*aux_priv;
>  
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 246c83ac5643..7ffd5c763f93 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -562,6 +562,42 @@ void *perf_get_aux(struct perf_output_handle *handle)
>  }
>  EXPORT_SYMBOL_GPL(perf_get_aux);
>  
> +/*
> + * Copy out AUX data from an AUX handle.
> + */
> +long perf_output_copy_aux(struct perf_output_handle *aux_handle,
> +			  struct perf_output_handle *handle,
> +			  unsigned long from, unsigned long to)
> +{
> +	unsigned long tocopy, remainder, len = 0;
> +	struct ring_buffer *rb = aux_handle->rb;
> +	void *addr;
> +
> +	from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
> +	to &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
> +
> +	do {
> +		tocopy = PAGE_SIZE - offset_in_page(from);
> +		if (to > from)
> +			tocopy = min(tocopy, to - from);
> +		if (!tocopy)
> +			break;
> +
> +		addr = rb->aux_pages[from >> PAGE_SHIFT];
> +		addr += offset_in_page(from);
> +
> +		remainder = perf_output_copy(handle, addr, tocopy);
> +		if (remainder)
> +			return -EFAULT;
> +
> +		len += tocopy;
> +		from += tocopy;
> +		from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
> +	} while (to != from);
> +
> +	return len;
> +}
> +
>  #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
>  
>  static struct page *rb_alloc_aux_page(int node, int order)
> -- 
> 2.23.0
> 

* Re: [PATCH v3 1/3] perf: Allow using AUX data in perf samples
  2019-11-04  8:40       ` Peter Zijlstra
@ 2019-11-04 10:40         ` Alexander Shishkin
  0 siblings, 0 replies; 15+ messages in thread
From: Alexander Shishkin @ 2019-11-04 10:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel, jolsa,
	adrian.hunter, mathieu.poirier, mark.rutland, alexander.shishkin

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Oct 28, 2019 at 07:08:18PM +0200, Alexander Shishkin wrote:
>
>> > @@ -6318,11 +6318,12 @@ static void perf_aux_sample_output(struc
>> >  
>> >  	/*
>> >  	 * Guard against NMI hits inside the critical section;
>> > -	 * see also perf_aux_sample_size().
>> > +	 * see also perf_prepare_sample_aux().
>> >  	 */
>> >  	WRITE_ONCE(rb->aux_in_sampling, 1);
>> > +	barrier();
>> 
>> Isn't WRITE_ONCE() enough of a barrier on its own? My thinking was that
>> we only need a compiler barrier here, hence the WRITE_ONCE.
>
> WRITE_ONCE() is a volatile store and (IIRC) the compiler ensures order
> against other volatile things, but not in general.
>
> barrier() OTOH clobbers all of memory and thereby ensures nothing can
> get hoisted over it.
>
> Now, the only thing we do inside this region is an indirect call, which
> on its own already implies a sync point for as long as the compiler
> cannot inline it, so it might be a bit paranoid on my end (I don't think
> even LTO can reduce this indirection and cause inlining).

I see what you mean. I was only thinking about not having to order the
AUX STOREs vs the rb->aux_in_sampling. Ordering the call itself makes
sense.

Thanks,
--
Alex

* Re: [PATCH v3 1/3] perf: Allow using AUX data in perf samples
  2019-11-04 10:16   ` Peter Zijlstra
@ 2019-11-04 12:30     ` Leo Yan
  0 siblings, 0 replies; 15+ messages in thread
From: Leo Yan @ 2019-11-04 12:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar,
	linux-kernel, jolsa, adrian.hunter, mathieu.poirier,
	mark.rutland

Hi Peter,

On Mon, Nov 04, 2019 at 11:16:58AM +0100, Peter Zijlstra wrote:
> 
> Leo Yan,
> 
> Since you were helpful in the other CS thread, could you please have a
> look to see if this interface will work for you guys?

Thanks a lot for reminding.

After a quick look, this patch set looks very useful; in particular,
given that Arm/Arm64 doesn't have LBR, it will be very helpful for
'virtual' branch recording on Arm/Arm64.

@Mathieu, could you review this patch set if you have bandwidth?
Since you implemented the CoreSight snapshot mode, you have a much
deeper understanding of it than me :)

I will review and test this patch set in this week.

Thanks,
Leo Yan

P.s. one concern is how to make this feature work with CS ETF/ETB;
ETF/ETB may be more suitable for 'virtual' branch recording since
their trace data is only several KiB, which is sufficient for 32 or
64 branch entries.

> > +			  struct perf_output_handle *handle,
> > +			  unsigned long from, unsigned long to)
> > +{
> > +	unsigned long tocopy, remainder, len = 0;
> > +	struct ring_buffer *rb = aux_handle->rb;
> > +	void *addr;
> > +
> > +	from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
> > +	to &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
> > +
> > +	do {
> > +		tocopy = PAGE_SIZE - offset_in_page(from);
> > +		if (to > from)
> > +			tocopy = min(tocopy, to - from);
> > +		if (!tocopy)
> > +			break;
> > +
> > +		addr = rb->aux_pages[from >> PAGE_SHIFT];
> > +		addr += offset_in_page(from);
> > +
> > +		remainder = perf_output_copy(handle, addr, tocopy);
> > +		if (remainder)
> > +			return -EFAULT;
> > +
> > +		len += tocopy;
> > +		from += tocopy;
> > +		from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
> > +	} while (to != from);
> > +
> > +	return len;
> > +}
> > +
> >  #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
> >  
> >  static struct page *rb_alloc_aux_page(int node, int order)
> > -- 
> > 2.23.0
> > 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [tip: perf/core] perf/x86/intel/pt: Add sampling support
  2019-10-25 14:08 ` [PATCH v3 3/3] perf/x86/intel/pt: Add sampling support Alexander Shishkin
@ 2019-11-13 10:56   ` tip-bot2 for Alexander Shishkin
  0 siblings, 0 replies; 15+ messages in thread
From: tip-bot2 for Alexander Shishkin @ 2019-11-13 10:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexander Shishkin, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, David Ahern, Jiri Olsa, Linus Torvalds,
	Mark Rutland, Namhyung Kim, Stephane Eranian, Thomas Gleixner,
	Vince Weaver, adrian.hunter, mathieu.poirier, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     25e8920b301c133aeaa9f57d81295bf4ac78e17b
Gitweb:        https://git.kernel.org/tip/25e8920b301c133aeaa9f57d81295bf4ac78e17b
Author:        Alexander Shishkin <alexander.shishkin@linux.intel.com>
AuthorDate:    Fri, 25 Oct 2019 17:08:35 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 13 Nov 2019 11:06:16 +01:00

perf/x86/intel/pt: Add sampling support

Add AUX sampling support to the PT PMU: implement an NMI-safe callback
that takes a snapshot of the buffer without touching the event states.
This is done for PT events that don't use PMIs, that is, snapshot mode
(RO mapping of the AUX area).

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: mathieu.poirier@linaro.org
Link: https://lkml.kernel.org/r/20191025140835.53665-4-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/intel/pt.c | 54 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 170f3b4..2f20d5a 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1208,6 +1208,13 @@ pt_buffer_setup_aux(struct perf_event *event, void **pages,
 	if (!nr_pages)
 		return NULL;
 
+	/*
+	 * Only support AUX sampling in snapshot mode, where we don't
+	 * generate NMIs.
+	 */
+	if (event->attr.aux_sample_size && !snapshot)
+		return NULL;
+
 	if (cpu == -1)
 		cpu = raw_smp_processor_id();
 	node = cpu_to_node(cpu);
@@ -1506,6 +1513,52 @@ static void pt_event_stop(struct perf_event *event, int mode)
 	}
 }
 
+static long pt_event_snapshot_aux(struct perf_event *event,
+				  struct perf_output_handle *handle,
+				  unsigned long size)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	unsigned long from = 0, to;
+	long ret;
+
+	if (WARN_ON_ONCE(!buf))
+		return 0;
+
+	/*
+	 * Sampling is only allowed on snapshot events;
+	 * see pt_buffer_setup_aux().
+	 */
+	if (WARN_ON_ONCE(!buf->snapshot))
+		return 0;
+
+	/*
+	 * Here, handle_nmi tells us if the tracing is on
+	 */
+	if (READ_ONCE(pt->handle_nmi))
+		pt_config_stop(event);
+
+	pt_read_offset(buf);
+	pt_update_head(pt);
+
+	to = local_read(&buf->data_size);
+	if (to < size)
+		from = buf->nr_pages << PAGE_SHIFT;
+	from += to - size;
+
+	ret = perf_output_copy_aux(&pt->handle, handle, from, to);
+
+	/*
+	 * If the tracing was on when we turned up, restart it.
+	 * Compiler barrier not needed as we couldn't have been
+	 * preempted by anything that touches pt->handle_nmi.
+	 */
+	if (pt->handle_nmi)
+		pt_config_start(event);
+
+	return ret;
+}
+
 static void pt_event_del(struct perf_event *event, int mode)
 {
 	pt_event_stop(event, PERF_EF_UPDATE);
@@ -1625,6 +1678,7 @@ static __init int pt_init(void)
 	pt_pmu.pmu.del			 = pt_event_del;
 	pt_pmu.pmu.start		 = pt_event_start;
 	pt_pmu.pmu.stop			 = pt_event_stop;
+	pt_pmu.pmu.snapshot_aux		 = pt_event_snapshot_aux;
 	pt_pmu.pmu.read			 = pt_event_read;
 	pt_pmu.pmu.setup_aux		 = pt_buffer_setup_aux;
 	pt_pmu.pmu.free_aux		 = pt_buffer_free_aux;



* [tip: perf/core] perf/x86/intel/pt: Factor out pt_config_start()
  2019-10-25 14:08 ` [PATCH v3 2/3] perf/x86/intel/pt: Factor out starting the trace Alexander Shishkin
@ 2019-11-13 10:56   ` tip-bot2 for Alexander Shishkin
  0 siblings, 0 replies; 15+ messages in thread
From: tip-bot2 for Alexander Shishkin @ 2019-11-13 10:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexander Shishkin, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, David Ahern, Jiri Olsa, Linus Torvalds,
	Mark Rutland, Namhyung Kim, Stephane Eranian, Thomas Gleixner,
	Vince Weaver, adrian.hunter, mathieu.poirier, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     8e105a1fc2a02d78698834974083c980d2e5b513
Gitweb:        https://git.kernel.org/tip/8e105a1fc2a02d78698834974083c980d2e5b513
Author:        Alexander Shishkin <alexander.shishkin@linux.intel.com>
AuthorDate:    Fri, 25 Oct 2019 17:08:34 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 13 Nov 2019 11:06:15 +01:00

perf/x86/intel/pt: Factor out pt_config_start()

PT trace is now enabled at the bottom of the event configuration
function that takes care of all configuration bits related to a given
event, including the address filter update. This is only needed where
the event configuration changes, that is, in ->add()/->start().

In the interrupt path we can use a lighter version that keeps the
configuration intact, since it hasn't changed, and only flips the
enable bit.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: mathieu.poirier@linaro.org
Link: https://lkml.kernel.org/r/20191025140835.53665-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/intel/pt.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 05e43d0..170f3b4 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -397,6 +397,20 @@ static bool pt_event_valid(struct perf_event *event)
  * These all are cpu affine and operate on a local PT
  */
 
+static void pt_config_start(struct perf_event *event)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	u64 ctl = event->hw.config;
+
+	ctl |= RTIT_CTL_TRACEEN;
+	if (READ_ONCE(pt->vmx_on))
+		perf_aux_output_flag(&pt->handle, PERF_AUX_FLAG_PARTIAL);
+	else
+		wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+	WRITE_ONCE(event->hw.config, ctl);
+}
+
 /* Address ranges and their corresponding msr configuration registers */
 static const struct pt_address_range {
 	unsigned long	msr_a;
@@ -468,7 +482,6 @@ static u64 pt_config_filters(struct perf_event *event)
 
 static void pt_config(struct perf_event *event)
 {
-	struct pt *pt = this_cpu_ptr(&pt_ctx);
 	u64 reg;
 
 	/* First round: clear STATUS, in particular the PSB byte counter. */
@@ -501,10 +514,7 @@ static void pt_config(struct perf_event *event)
 	reg |= (event->attr.config & PT_CONFIG_MASK);
 
 	event->hw.config = reg;
-	if (READ_ONCE(pt->vmx_on))
-		perf_aux_output_flag(&pt->handle, PERF_AUX_FLAG_PARTIAL);
-	else
-		wrmsrl(MSR_IA32_RTIT_CTL, reg);
+	pt_config_start(event);
 }
 
 static void pt_config_stop(struct perf_event *event)
@@ -1381,7 +1391,7 @@ void intel_pt_interrupt(void)
 
 		pt_config_buffer(topa_to_page(buf->cur)->table, buf->cur_idx,
 				 buf->output_off);
-		pt_config(event);
+		pt_config_start(event);
 	}
 }
 


* [tip: perf/core] perf/aux: Allow using AUX data in perf samples
  2019-10-25 14:08 ` [PATCH v3 1/3] perf: Allow using AUX data in perf samples Alexander Shishkin
  2019-10-28 16:27   ` Peter Zijlstra
  2019-11-04 10:16   ` Peter Zijlstra
@ 2019-11-13 10:56   ` tip-bot2 for Alexander Shishkin
  2 siblings, 0 replies; 15+ messages in thread
From: tip-bot2 for Alexander Shishkin @ 2019-11-13 10:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexander Shishkin, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, David Ahern, Jiri Olsa, Linus Torvalds,
	Mark Rutland, Namhyung Kim, Stephane Eranian, Thomas Gleixner,
	Vince Weaver, adrian.hunter, mathieu.poirier, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     a4faf00d994c40e64f656805ac375c65e324eefb
Gitweb:        https://git.kernel.org/tip/a4faf00d994c40e64f656805ac375c65e324eefb
Author:        Alexander Shishkin <alexander.shishkin@linux.intel.com>
AuthorDate:    Fri, 25 Oct 2019 17:08:33 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 13 Nov 2019 11:06:14 +01:00

perf/aux: Allow using AUX data in perf samples

AUX data can be used to annotate perf events such as performance counters
or tracepoints/breakpoints by including it in sample records when
PERF_SAMPLE_AUX flag is set. Such samples would be instrumental in debugging
and profiling by providing, for example, a history of instruction flow
leading up to the event's overflow.

The implementation makes use of grouping an AUX event with all the events
that wish to take samples of the AUX data, such that the former is the
group leader. The samplees should also specify the desired size of the AUX
sample via attr.aux_sample_size.

AUX capable PMUs need to explicitly add support for sampling, because it
relies on a new callback to take a snapshot of the buffer without touching
the event states.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: mathieu.poirier@linaro.org
Link: https://lkml.kernel.org/r/20191025140835.53665-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/perf_event.h      |  19 +++-
 include/uapi/linux/perf_event.h |  10 +-
 kernel/events/core.c            | 173 ++++++++++++++++++++++++++++++-
 kernel/events/internal.h        |   1 +-
 kernel/events/ring_buffer.c     |  36 ++++++-
 5 files changed, 234 insertions(+), 5 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 011dcbd..34c7c69 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -249,6 +249,8 @@ struct perf_event;
 #define PERF_PMU_CAP_NO_EXCLUDE			0x80
 #define PERF_PMU_CAP_AUX_OUTPUT			0x100
 
+struct perf_output_handle;
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -433,6 +435,19 @@ struct pmu {
 	void (*free_aux)		(void *aux); /* optional */
 
 	/*
+	 * Take a snapshot of the AUX buffer without touching the event
+	 * state, so that preempting ->start()/->stop() callbacks does
+	 * not interfere with their logic. Called in PMI context.
+	 *
+	 * Returns the size of AUX data copied to the output handle.
+	 *
+	 * Optional.
+	 */
+	long (*snapshot_aux)		(struct perf_event *event,
+					 struct perf_output_handle *handle,
+					 unsigned long size);
+
+	/*
 	 * Validate address range filters: make sure the HW supports the
 	 * requested configuration and number of filters; return 0 if the
 	 * supplied filters are valid, -errno otherwise.
@@ -973,6 +988,7 @@ struct perf_sample_data {
 		u32	reserved;
 	}				cpu_entry;
 	struct perf_callchain_entry	*callchain;
+	u64				aux_size;
 
 	/*
 	 * regs_user may point to task_pt_regs or to regs_user_copy, depending
@@ -1362,6 +1378,9 @@ extern unsigned int perf_output_copy(struct perf_output_handle *handle,
 			     const void *buf, unsigned int len);
 extern unsigned int perf_output_skip(struct perf_output_handle *handle,
 				     unsigned int len);
+extern long perf_output_copy_aux(struct perf_output_handle *aux_handle,
+				 struct perf_output_handle *handle,
+				 unsigned long from, unsigned long to);
 extern int perf_swevent_get_recursion_context(void);
 extern void perf_swevent_put_recursion_context(int rctx);
 extern u64 perf_swevent_set_period(struct perf_event *event);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bb7b271..377d794 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -141,8 +141,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_REGS_INTR			= 1U << 18,
 	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
+	PERF_SAMPLE_AUX				= 1U << 20,
 
-	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 21,		/* non-ABI */
 
 	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
 };
@@ -300,6 +301,7 @@ enum perf_event_read_format {
 					/* add: sample_stack_user */
 #define PERF_ATTR_SIZE_VER4	104	/* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5	112	/* add: aux_watermark */
+#define PERF_ATTR_SIZE_VER6	120	/* add: aux_sample_size */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -424,7 +426,9 @@ struct perf_event_attr {
 	 */
 	__u32	aux_watermark;
 	__u16	sample_max_stack;
-	__u16	__reserved_2;	/* align to __u64 */
+	__u16	__reserved_2;
+	__u32	aux_sample_size;
+	__u32	__reserved_3;
 };
 
 /*
@@ -864,6 +868,8 @@ enum perf_event_type {
 	 *	{ u64			abi; # enum perf_sample_regs_abi
 	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
+	 *	{ u64			size;
+	 *	  char			data[size]; } && PERF_SAMPLE_AUX
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8d65e03..16d80ad 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1941,6 +1941,11 @@ static void perf_put_aux_event(struct perf_event *event)
 	}
 }
 
+static bool perf_need_aux_event(struct perf_event *event)
+{
+	return !!event->attr.aux_output || !!event->attr.aux_sample_size;
+}
+
 static int perf_get_aux_event(struct perf_event *event,
 			      struct perf_event *group_leader)
 {
@@ -1953,7 +1958,17 @@ static int perf_get_aux_event(struct perf_event *event,
 	if (!group_leader)
 		return 0;
 
-	if (!perf_aux_output_match(event, group_leader))
+	/*
+	 * aux_output and aux_sample_size are mutually exclusive.
+	 */
+	if (event->attr.aux_output && event->attr.aux_sample_size)
+		return 0;
+
+	if (event->attr.aux_output &&
+	    !perf_aux_output_match(event, group_leader))
+		return 0;
+
+	if (event->attr.aux_sample_size && !group_leader->pmu->snapshot_aux)
 		return 0;
 
 	if (!atomic_long_inc_not_zero(&group_leader->refcount))
@@ -6222,6 +6237,122 @@ perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
 	}
 }
 
+static unsigned long perf_prepare_sample_aux(struct perf_event *event,
+					  struct perf_sample_data *data,
+					  size_t size)
+{
+	struct perf_event *sampler = event->aux_event;
+	struct ring_buffer *rb;
+
+	data->aux_size = 0;
+
+	if (!sampler)
+		goto out;
+
+	if (WARN_ON_ONCE(READ_ONCE(sampler->state) != PERF_EVENT_STATE_ACTIVE))
+		goto out;
+
+	if (WARN_ON_ONCE(READ_ONCE(sampler->oncpu) != smp_processor_id()))
+		goto out;
+
+	rb = ring_buffer_get(sampler->parent ? sampler->parent : sampler);
+	if (!rb)
+		goto out;
+
+	/*
+	 * If this is an NMI hit inside sampling code, don't take
+	 * the sample. See also perf_aux_sample_output().
+	 */
+	if (READ_ONCE(rb->aux_in_sampling)) {
+		data->aux_size = 0;
+	} else {
+		size = min_t(size_t, size, perf_aux_size(rb));
+		data->aux_size = ALIGN(size, sizeof(u64));
+	}
+	ring_buffer_put(rb);
+
+out:
+	return data->aux_size;
+}
+
+long perf_pmu_snapshot_aux(struct ring_buffer *rb,
+			   struct perf_event *event,
+			   struct perf_output_handle *handle,
+			   unsigned long size)
+{
+	unsigned long flags;
+	long ret;
+
+	/*
+	 * Normal ->start()/->stop() callbacks run in IRQ mode in scheduler
+	 * paths. If we start calling them in NMI context, they may race with
+	 * the IRQ ones, that is, for example, re-starting an event that's just
+	 * been stopped, which is why we're using a separate callback that
+	 * doesn't change the event state.
+	 *
+	 * IRQs need to be disabled to prevent IPIs from racing with us.
+	 */
+	local_irq_save(flags);
+	/*
+	 * Guard against NMI hits inside the critical section;
+	 * see also perf_prepare_sample_aux().
+	 */
+	WRITE_ONCE(rb->aux_in_sampling, 1);
+	barrier();
+
+	ret = event->pmu->snapshot_aux(event, handle, size);
+
+	barrier();
+	WRITE_ONCE(rb->aux_in_sampling, 0);
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+static void perf_aux_sample_output(struct perf_event *event,
+				   struct perf_output_handle *handle,
+				   struct perf_sample_data *data)
+{
+	struct perf_event *sampler = event->aux_event;
+	unsigned long pad;
+	struct ring_buffer *rb;
+	long size;
+
+	if (WARN_ON_ONCE(!sampler || !data->aux_size))
+		return;
+
+	rb = ring_buffer_get(sampler->parent ? sampler->parent : sampler);
+	if (!rb)
+		return;
+
+	size = perf_pmu_snapshot_aux(rb, sampler, handle, data->aux_size);
+
+	/*
+	 * An error here means that perf_output_copy() failed (returned a
+	 * non-zero surplus that it didn't copy), which in its current
+	 * enlightened implementation is not possible. If that changes, we'd
+	 * like to know.
+	 */
+	if (WARN_ON_ONCE(size < 0))
+		goto out_put;
+
+	/*
+	 * The pad comes from ALIGN()ing data->aux_size up to u64 in
+	 * perf_prepare_sample_aux(), so should not be more than that.
+	 */
+	pad = data->aux_size - size;
+	if (WARN_ON_ONCE(pad >= sizeof(u64)))
+		pad = 8;
+
+	if (pad) {
+		u64 zero = 0;
+		perf_output_copy(handle, &zero, pad);
+	}
+
+out_put:
+	ring_buffer_put(rb);
+}
+
 static void __perf_event_header__init_id(struct perf_event_header *header,
 					 struct perf_sample_data *data,
 					 struct perf_event *event)
@@ -6541,6 +6672,13 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
 		perf_output_put(handle, data->phys_addr);
 
+	if (sample_type & PERF_SAMPLE_AUX) {
+		perf_output_put(handle, data->aux_size);
+
+		if (data->aux_size)
+			perf_aux_sample_output(event, handle, data);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -6729,6 +6867,35 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
 		data->phys_addr = perf_virt_to_phys(data->addr);
+
+	if (sample_type & PERF_SAMPLE_AUX) {
+		u64 size;
+
+		header->size += sizeof(u64); /* size */
+
+		/*
+		 * Given the 16bit nature of header::size, an AUX sample can
+		 * easily overflow it, what with all the preceding sample bits.
+		 * Make sure this doesn't happen by using up to U16_MAX bytes
+		 * per sample in total (rounded down to 8 byte boundary).
+		 */
+		size = min_t(size_t, U16_MAX - header->size,
+			     event->attr.aux_sample_size);
+		size = rounddown(size, 8);
+		size = perf_prepare_sample_aux(event, data, size);
+
+		WARN_ON_ONCE(size + header->size > U16_MAX);
+		header->size += size;
+	}
+	/*
+	 * If you're adding more sample types here, you likely need to do
+	 * something about the overflowing header::size, like repurpose the
+	 * lowest 3 bits of size, which should be always zero at the moment.
+	 * This raises a more important question, do we really need 512k sized
+	 * samples and why, so good argumentation is in order for whatever you
+	 * do here next.
+	 */
+	WARN_ON_ONCE(header->size & 7);
 }
 
 static __always_inline int
@@ -10727,7 +10894,7 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 
 	attr->size = size;
 
-	if (attr->__reserved_1 || attr->__reserved_2)
+	if (attr->__reserved_1 || attr->__reserved_2 || attr->__reserved_3)
 		return -EINVAL;
 
 	if (attr->sample_type & ~(PERF_SAMPLE_MAX-1))
@@ -11277,7 +11444,7 @@ SYSCALL_DEFINE5(perf_event_open,
 		}
 	}
 
-	if (event->attr.aux_output && !perf_get_aux_event(event, group_leader))
+	if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader))
 		goto err_locked;
 
 	/*
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 3aef419..747d67f 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -50,6 +50,7 @@ struct ring_buffer {
 	unsigned long			aux_mmap_locked;
 	void				(*free_aux)(void *);
 	refcount_t			aux_refcount;
+	int				aux_in_sampling;
 	void				**aux_pages;
 	void				*aux_priv;
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 246c83a..7ffd5c7 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -562,6 +562,42 @@ void *perf_get_aux(struct perf_output_handle *handle)
 }
 EXPORT_SYMBOL_GPL(perf_get_aux);
 
+/*
+ * Copy out AUX data from an AUX handle.
+ */
+long perf_output_copy_aux(struct perf_output_handle *aux_handle,
+			  struct perf_output_handle *handle,
+			  unsigned long from, unsigned long to)
+{
+	unsigned long tocopy, remainder, len = 0;
+	struct ring_buffer *rb = aux_handle->rb;
+	void *addr;
+
+	from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	to &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+
+	do {
+		tocopy = PAGE_SIZE - offset_in_page(from);
+		if (to > from)
+			tocopy = min(tocopy, to - from);
+		if (!tocopy)
+			break;
+
+		addr = rb->aux_pages[from >> PAGE_SHIFT];
+		addr += offset_in_page(from);
+
+		remainder = perf_output_copy(handle, addr, tocopy);
+		if (remainder)
+			return -EFAULT;
+
+		len += tocopy;
+		from += tocopy;
+		from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	} while (to != from);
+
+	return len;
+}
+
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
 static struct page *rb_alloc_aux_page(int node, int order)


end of thread, other threads:[~2019-11-13 10:57 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-25 14:08 [PATCH v3 0/3] perf: Add AUX data sampling Alexander Shishkin
2019-10-25 14:08 ` [PATCH v3 1/3] perf: Allow using AUX data in perf samples Alexander Shishkin
2019-10-28 16:27   ` Peter Zijlstra
2019-10-28 16:28     ` Peter Zijlstra
2019-10-28 17:10       ` Alexander Shishkin
2019-10-28 17:08     ` Alexander Shishkin
2019-11-04  8:40       ` Peter Zijlstra
2019-11-04 10:40         ` Alexander Shishkin
2019-11-04 10:16   ` Peter Zijlstra
2019-11-04 12:30     ` Leo Yan
2019-11-13 10:56   ` [tip: perf/core] perf/aux: " tip-bot2 for Alexander Shishkin
2019-10-25 14:08 ` [PATCH v3 2/3] perf/x86/intel/pt: Factor out starting the trace Alexander Shishkin
2019-11-13 10:56   ` [tip: perf/core] perf/x86/intel/pt: Factor out pt_config_start() tip-bot2 for Alexander Shishkin
2019-10-25 14:08 ` [PATCH v3 3/3] perf/x86/intel/pt: Add sampling support Alexander Shishkin
2019-11-13 10:56   ` [tip: perf/core] " tip-bot2 for Alexander Shishkin
