linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 0/6] perf: Add AUX data sampling
@ 2018-06-12  7:51 Alexander Shishkin
  2018-06-12  7:51 ` [PATCH v1 1/6] perf: Disable PMU around address filter adjustment Alexander Shishkin
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Alexander Shishkin @ 2018-06-12  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, jolsa, Alexander Shishkin

Hi,

This series introduces AUX data sampling for perf events, which, in
the case of instruction/branch tracing PMUs such as Intel PT, BTS and
CS ETM, means an execution flow history leading up to a perf event's
overflow.

In the case of Intel PT, this can be used as an alternative to LBR,
with virtually as many branches per sample as you like. It doesn't
support some of the LBR features (branch prediction indication, basic
block level timing, etc [1]), and it can't be exposed as BRANCH_STACK,
because that would require decoding the PT stream in kernel space,
which is not practical. Instead, we deliver the PT data to userspace
as is, for offline processing. The PT decoder already supports
presenting PT as a virtual LBR.

AUX sampling differs from the snapshot mode in that it doesn't require
instrumentation (to decide when to take a snapshot) and is better
suited to generic data collection, when you don't yet know what you
are looking for. It's also useful for automated data collection, for
example, for feedback-driven compiler optimizations.

It also differs from the "full trace mode" in that it produces much
less data and, consequently, takes up less I/O bandwidth and storage
space, and takes less time to decode.

The bulk of the code is in 4/6, which adds the user interface bits and
the code to measure and copy out AUX data. 1/6 and 2/6 close the races
around the sampling code, 3/6 is a helper, 5/6 kills
PERF_FLAG_FD_OUTPUT, which has been effectively dead for about 8 years
now, and 6/6 allows SET_OUTPUT between SW and HW contexts, which lets
the tool avoid carrying around additional file descriptors when
sampling a hardware event.

I'm not including the tooling patches, which Adrian will post separately.
Meanwhile, they can be found here [2].

Changes since the last time I posted patches to this effect [3]:
 - instead of creating kernel events with AUX data, have the user create
   them and pass on to the events that they wish to see AUX samples on;
 - as a result, dropped most of the code.

[1] https://marc.info/?l=linux-kernel&m=147467007714928&w=2
[2] https://git.kernel.org/cgit/linux/kernel/git/ash/linux.git/log/?h=perf-aux-sampling
[3] https://marc.info/?l=linux-kernel&m=147463007105836

Adrian Hunter (1):
  perf: Allow set-output for task contexts of different types

Alexander Shishkin (5):
  perf: Disable PMU around address filter adjustment
  perf: Disable IRQs in address filter sync path
  perf: Add an iterator for AUX data
  perf: Allow using AUX data in perf samples
  perf: Drop PERF_FLAG_FD_OUTPUT

 include/linux/perf_event.h      |  10 ++
 include/uapi/linux/perf_event.h |   8 +-
 kernel/events/core.c            | 206 +++++++++++++++++++++++++++++---
 kernel/events/internal.h        |   5 +
 kernel/events/ring_buffer.c     |  34 ++++++
 5 files changed, 247 insertions(+), 16 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v1 1/6] perf: Disable PMU around address filter adjustment
  2018-06-12  7:51 [PATCH v1 0/6] perf: Add AUX data sampling Alexander Shishkin
@ 2018-06-12  7:51 ` Alexander Shishkin
  2018-06-12 20:11   ` Peter Zijlstra
  2018-06-12  7:51 ` [PATCH v1 2/6] perf: Disable IRQs in address filter sync path Alexander Shishkin
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Alexander Shishkin @ 2018-06-12  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, jolsa, Alexander Shishkin

If the PMU is used for AUX data sampling, the hardware event that
triggers it may come in during a critical section of address filter
adjustment (if it's delivered as an NMI) and in turn call the address
filter sync code, causing spinlock recursion.

To prevent this, disable the PMU around these critical sections, so that
AUX sampling won't take place there.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4e1a1bf8d867..0f5cd1b08dbd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6299,6 +6299,12 @@ static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
 	if (!has_addr_filter(event))
 		return;
 
+	/*
+	 * if sampling kicks in in the critical section,
+	 * we risk spinlock recursion on the ifh::lock
+	 */
+	perf_pmu_disable(event->pmu);
+
 	raw_spin_lock_irqsave(&ifh->lock, flags);
 	list_for_each_entry(filter, &ifh->list, entry) {
 		if (filter->inode) {
@@ -6313,6 +6319,8 @@ static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
 		event->addr_filters_gen++;
 	raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
+	perf_pmu_enable(event->pmu);
+
 	if (restart)
 		perf_event_stop(event, 1);
 }
@@ -6993,6 +7001,8 @@ static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
 	if (!file)
 		return;
 
+	perf_pmu_disable(event->pmu);
+
 	raw_spin_lock_irqsave(&ifh->lock, flags);
 	list_for_each_entry(filter, &ifh->list, entry) {
 		if (perf_addr_filter_match(filter, file, off,
@@ -7008,6 +7018,8 @@ static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
 		event->addr_filters_gen++;
 	raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
+	perf_pmu_enable(event->pmu);
+
 	if (restart)
 		perf_event_stop(event, 1);
 }
@@ -8261,6 +8273,8 @@ static void perf_event_addr_filters_apply(struct perf_event *event)
 
 	down_read(&mm->mmap_sem);
 
+	perf_pmu_disable(event->pmu);
+
 	raw_spin_lock_irqsave(&ifh->lock, flags);
 	list_for_each_entry(filter, &ifh->list, entry) {
 		event->addr_filters_offs[count] = 0;
@@ -8279,6 +8293,8 @@ static void perf_event_addr_filters_apply(struct perf_event *event)
 	event->addr_filters_gen++;
 	raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
+	perf_pmu_enable(event->pmu);
+
 	up_read(&mm->mmap_sem);
 
 	mmput(mm);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v1 2/6] perf: Disable IRQs in address filter sync path
  2018-06-12  7:51 [PATCH v1 0/6] perf: Add AUX data sampling Alexander Shishkin
  2018-06-12  7:51 ` [PATCH v1 1/6] perf: Disable PMU around address filter adjustment Alexander Shishkin
@ 2018-06-12  7:51 ` Alexander Shishkin
  2018-06-12  7:51 ` [PATCH v1 3/6] perf: Add an iterator for AUX data Alexander Shishkin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Alexander Shishkin @ 2018-06-12  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, jolsa, Alexander Shishkin

If PMU callbacks are executed in hardirq context, the address filter
sync code needs to disable interrupts when taking its spinlock, to be
consistent with the rest of its users. This may happen if the PMU is
used in AUX sampling or the PMI is delivered as a regular interrupt.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0f5cd1b08dbd..e1fce335a42a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2601,16 +2601,17 @@ static int perf_event_stop(struct perf_event *event, int restart)
 void perf_event_addr_filters_sync(struct perf_event *event)
 {
 	struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
+	unsigned long flags;
 
 	if (!has_addr_filter(event))
 		return;
 
-	raw_spin_lock(&ifh->lock);
+	raw_spin_lock_irqsave(&ifh->lock, flags);
 	if (event->addr_filters_gen != event->hw.addr_filters_gen) {
 		event->pmu->addr_filters_sync(event);
 		event->hw.addr_filters_gen = event->addr_filters_gen;
 	}
-	raw_spin_unlock(&ifh->lock);
+	raw_spin_unlock_irqrestore(&ifh->lock, flags);
 }
 EXPORT_SYMBOL_GPL(perf_event_addr_filters_sync);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v1 3/6] perf: Add an iterator for AUX data
  2018-06-12  7:51 [PATCH v1 0/6] perf: Add AUX data sampling Alexander Shishkin
  2018-06-12  7:51 ` [PATCH v1 1/6] perf: Disable PMU around address filter adjustment Alexander Shishkin
  2018-06-12  7:51 ` [PATCH v1 2/6] perf: Disable IRQs in address filter sync path Alexander Shishkin
@ 2018-06-12  7:51 ` Alexander Shishkin
  2018-06-12  7:51 ` [PATCH v1 4/6] perf: Allow using AUX data in perf samples Alexander Shishkin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Alexander Shishkin @ 2018-06-12  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, jolsa, Alexander Shishkin

Add a helper function that sequentially walks the AUX buffer, invoking
a given callback on each chunk. This is useful for copying AUX data
out of its buffer.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/internal.h    |  5 +++++
 kernel/events/ring_buffer.c | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 6dc725a7e7bc..eb6cd6b370f9 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -56,6 +56,9 @@ struct ring_buffer {
 	void				*data_pages[0];
 };
 
+typedef unsigned long (*aux_copyfn)(void *data, const void *src,
+				    unsigned long len);
+
 extern void rb_free(struct ring_buffer *rb);
 
 static inline void rb_free_rcu(struct rcu_head *rcu_head)
@@ -80,6 +83,8 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb);
+extern long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+			  unsigned long to, aux_copyfn copyfn, void *data);
 extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
 extern void ring_buffer_put(struct ring_buffer *rb);
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 141aa2ca8728..d6157143f054 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -519,6 +519,40 @@ void *perf_get_aux(struct perf_output_handle *handle)
 }
 EXPORT_SYMBOL_GPL(perf_get_aux);
 
+/*
+ * Copy out AUX data from a ring_buffer using a supplied callback.
+ */
+long rb_output_aux(struct ring_buffer *rb, unsigned long from,
+		   unsigned long to, aux_copyfn copyfn, void *data)
+{
+	unsigned long tocopy, remainder, len = 0;
+	void *addr;
+
+	from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	to &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+
+	do {
+		tocopy = PAGE_SIZE - offset_in_page(from);
+		if (to > from)
+			tocopy = min(tocopy, to - from);
+		if (!tocopy)
+			break;
+
+		addr = rb->aux_pages[from >> PAGE_SHIFT];
+		addr += offset_in_page(from);
+
+		remainder = copyfn(data, addr, tocopy);
+		if (remainder)
+			return -EFAULT;
+
+		len += tocopy;
+		from += tocopy;
+		from &= (rb->aux_nr_pages << PAGE_SHIFT) - 1;
+	} while (to != from);
+
+	return len;
+}
+
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
 static struct page *rb_alloc_aux_page(int node, int order)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-12  7:51 [PATCH v1 0/6] perf: Add AUX data sampling Alexander Shishkin
                   ` (2 preceding siblings ...)
  2018-06-12  7:51 ` [PATCH v1 3/6] perf: Add an iterator for AUX data Alexander Shishkin
@ 2018-06-12  7:51 ` Alexander Shishkin
  2018-06-14 19:25   ` Peter Zijlstra
                     ` (4 more replies)
  2018-06-12  7:51 ` [PATCH v1 5/6] perf: Drop PERF_FLAG_FD_OUTPUT Alexander Shishkin
  2018-06-12  7:51 ` [PATCH v1 6/6] perf: Allow set-output for task contexts of different types Alexander Shishkin
  5 siblings, 5 replies; 24+ messages in thread
From: Alexander Shishkin @ 2018-06-12  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, jolsa, Alexander Shishkin

AUX data can be used to annotate perf events such as performance
counters or tracepoints/breakpoints by including it in sample records
when the PERF_SAMPLE_AUX flag is set. Such samples would be
instrumental in debugging and profiling by providing, for example, a
history of the instruction flow leading up to the event's overflow.

To do this, the AUX event's file descriptor is passed to the perf
syscall with the PERF_FLAG_FD_SAMPLE flag and the PERF_SAMPLE_AUX bit
set in the sample type. Also, a new attribute field is added to allow
the user to specify the desired size of the AUX sample:
attr.aux_sample_size.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |  10 ++
 include/uapi/linux/perf_event.h |   8 +-
 kernel/events/core.c            | 158 +++++++++++++++++++++++++++++++-
 3 files changed, 174 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7546822a1d74..9f9e341d45cf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -102,6 +102,12 @@ struct perf_branch_stack {
 	struct perf_branch_entry	entries[0];
 };
 
+struct perf_aux_record {
+	u64		size;
+	unsigned long	from;
+	unsigned long	to;
+};
+
 struct task_struct;
 
 /*
@@ -674,6 +680,8 @@ struct perf_event {
 	struct bpf_prog			*prog;
 #endif
 
+	struct perf_event		*sample_event;
+
 #ifdef CONFIG_EVENT_TRACING
 	struct trace_event_call		*tp_event;
 	struct event_filter		*filter;
@@ -882,6 +890,7 @@ struct perf_sample_data {
 	 */
 	u64				addr;
 	struct perf_raw_record		*raw;
+	struct perf_aux_record		aux;
 	struct perf_branch_stack	*br_stack;
 	u64				period;
 	u64				weight;
@@ -933,6 +942,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	/* remaining struct members initialized in perf_prepare_sample() */
 	data->addr = addr;
 	data->raw  = NULL;
+	data->aux.from = data->aux.to = data->aux.size = 0;
 	data->br_stack = NULL;
 	data->period = period;
 	data->weight = 0;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c77c9a2ebbbb..19a22b161e39 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -141,8 +141,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_REGS_INTR			= 1U << 18,
 	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
+	PERF_SAMPLE_AUX				= 1U << 20,
 
-	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 21,		/* non-ABI */
 };
 
 /*
@@ -298,6 +299,7 @@ enum perf_event_read_format {
 					/* add: sample_stack_user */
 #define PERF_ATTR_SIZE_VER4	104	/* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5	112	/* add: aux_watermark */
+#define PERF_ATTR_SIZE_VER6	120	/* add: aux_sample_size */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -416,6 +418,7 @@ struct perf_event_attr {
 	__u32	aux_watermark;
 	__u16	sample_max_stack;
 	__u16	__reserved_2;	/* align to __u64 */
+	__u64	aux_sample_size;
 };
 
 #define perf_flags(attr)	(*(&(attr)->read_format + 1))
@@ -820,6 +823,8 @@ enum perf_event_type {
 	 *	{ u64			abi; # enum perf_sample_regs_abi
 	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
+	 *	{ u64			size;
+	 *	  char			data[size]; } && PERF_SAMPLE_AUX
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
@@ -952,6 +957,7 @@ enum perf_callchain_context {
 #define PERF_FLAG_FD_OUTPUT		(1UL << 1)
 #define PERF_FLAG_PID_CGROUP		(1UL << 2) /* pid=cgroup id, per-cpu mode only */
 #define PERF_FLAG_FD_CLOEXEC		(1UL << 3) /* O_CLOEXEC */
+#define PERF_FLAG_FD_SAMPLE		(1UL << 4) /* use fd event to sample AUX data */
 
 #if defined(__LITTLE_ENDIAN_BITFIELD)
 union perf_mem_data_src {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e1fce335a42a..70918ed33143 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -346,7 +346,8 @@ static void event_function_local(struct perf_event *event, event_f func, void *d
 #define PERF_FLAG_ALL (PERF_FLAG_FD_NO_GROUP |\
 		       PERF_FLAG_FD_OUTPUT  |\
 		       PERF_FLAG_PID_CGROUP |\
-		       PERF_FLAG_FD_CLOEXEC)
+		       PERF_FLAG_FD_CLOEXEC |\
+		       PERF_FLAG_FD_SAMPLE)
 
 /*
  * branch priv levels that need permission checks
@@ -3937,6 +3938,8 @@ static void unaccount_freq_event(void)
 		atomic_dec(&nr_freq_events);
 }
 
+static void put_event(struct perf_event *event);
+
 static void unaccount_event(struct perf_event *event)
 {
 	bool dec = false;
@@ -3970,6 +3973,9 @@ static void unaccount_event(struct perf_event *event)
 			schedule_delayed_work(&perf_sched_work, HZ);
 	}
 
+	if (event->sample_event)
+		put_event(event->sample_event);
+
 	unaccount_event_cpu(event, event->cpu);
 
 	unaccount_pmu_sb_event(event);
@@ -5608,6 +5614,100 @@ perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
 	}
 }
 
+/*
+ * See if we can take an AUX sample. If we can, prepare for writing
+ * the sample and return its size. In this case, perf_aux_sample_output()
+ * will undo the preparations.
+ */
+static unsigned long perf_aux_sample_size(struct perf_event *event,
+					  struct perf_sample_data *data,
+					  size_t size)
+{
+	struct perf_event *sampler = event->sample_event;
+	struct ring_buffer *rb;
+	int *disable_count;
+
+	data->aux.size = 0;
+
+	if (!sampler || READ_ONCE(sampler->state) != PERF_EVENT_STATE_ACTIVE)
+		goto out;
+
+	if (READ_ONCE(sampler->oncpu) != smp_processor_id())
+		goto out;
+
+	/*
+	 * Non-zero disable count here means that we, being the NMI
+	 * context, are racing with pmu::add, pmu::del or address filter
+	 * adjustment, which we want to avoid.
+	 */
+	disable_count = this_cpu_ptr(sampler->pmu->pmu_disable_count);
+	if (*disable_count)
+		goto out;
+
+	/* Re-enabled in perf_aux_sample_output() */
+	perf_pmu_disable(sampler->pmu);
+
+	rb = ring_buffer_get(sampler);
+	if (!rb) {
+		perf_pmu_enable(sampler->pmu);
+		goto out;
+	}
+
+	/* Restarted in perf_aux_sample_output() */
+	sampler->pmu->stop(sampler, PERF_EF_UPDATE);
+	data->aux.to = rb->aux_head;
+
+	size = min(size, perf_aux_size(rb));
+
+	if (data->aux.to < size)
+		data->aux.from = rb->aux_nr_pages * PAGE_SIZE + data->aux.to -
+			size;
+	else
+		data->aux.from = data->aux.to - size;
+	data->aux.size = ALIGN(size, sizeof(u64));
+	ring_buffer_put(rb);
+
+out:
+	return data->aux.size;
+}
+
+static void perf_aux_sample_output(struct perf_event *event,
+				   struct perf_output_handle *handle,
+				   struct perf_sample_data *data)
+{
+	struct perf_event *sampler = event->sample_event;
+	struct ring_buffer *rb;
+	unsigned long pad;
+	int ret;
+
+	if (WARN_ON_ONCE(!sampler || !data->aux.size))
+		goto out_enable;
+
+	rb = ring_buffer_get(sampler);
+	if (WARN_ON_ONCE(!rb))
+		goto out_enable;
+
+	ret = rb_output_aux(rb, data->aux.from, data->aux.to,
+			    (aux_copyfn)perf_output_copy, handle);
+	if (ret < 0) {
+		pr_warn_ratelimited("failed to copy trace data\n");
+		goto out;
+	}
+
+	pad = data->aux.size - ret;
+	if (pad) {
+		u64 p = 0;
+
+		perf_output_copy(handle, &p, pad);
+	}
+out:
+	ring_buffer_put(rb);
+	sampler->pmu->start(sampler, 0);
+
+out_enable:
+	perf_pmu_enable(sampler->pmu);
+}
+
 static void __perf_event_header__init_id(struct perf_event_header *header,
 					 struct perf_sample_data *data,
 					 struct perf_event *event)
@@ -5926,6 +6026,13 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
 		perf_output_put(handle, data->phys_addr);
 
+	if (sample_type & PERF_SAMPLE_AUX) {
+		perf_output_put(handle, data->aux.size);
+
+		if (data->aux.size)
+			perf_aux_sample_output(event, handle, data);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -6112,6 +6219,32 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
 		data->phys_addr = perf_virt_to_phys(data->addr);
+
+	if (sample_type & PERF_SAMPLE_AUX) {
+		u64 size;
+
+		header->size += sizeof(u64); /* size */
+
+		/*
+		 * Given the 16bit nature of header::size, an AUX sample can
+		 * easily overflow it, what with all the preceding sample bits.
+		 * Make sure this doesn't happen by using up to U16_MAX bytes
+		 * per sample in total (rounded down to 8 byte boundary).
+		 */
+		size = min_t(size_t, U16_MAX - header->size,
+			     event->attr.aux_sample_size);
+		size = rounddown(size, 8);
+		size = perf_aux_sample_size(event, data, size);
+
+		WARN_ON_ONCE(size + header->size > U16_MAX);
+		header->size += size;
+	}
+	/*
+	 * If you're adding more sample types here, you likely need to do
+	 * something about the overflowing header::size, like repurpose the
+	 * lowest 3 bits of size, which should be always zero at the moment.
+	 */
+	WARN_ON_ONCE(header->size & 7);
 }
 
 static void __always_inline
@@ -9841,6 +9974,17 @@ __perf_event_ctx_lock_double(struct perf_event *group_leader,
 	return gctx;
 }
 
+static bool
+can_sample_for(struct perf_event *sample_event, struct perf_event *event)
+{
+	if (has_aux(sample_event) &&
+	    sample_event->cpu == event->cpu &&
+	    atomic_long_inc_not_zero(&sample_event->refcount))
+		return true;
+
+	return false;
+}
+
 /**
  * sys_perf_event_open - open a performance event, associate it to a task/cpu
  *
@@ -9854,6 +9998,7 @@ SYSCALL_DEFINE5(perf_event_open,
 		pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
 {
 	struct perf_event *group_leader = NULL, *output_event = NULL;
+	struct perf_event *sample_event = NULL;
 	struct perf_event *event, *sibling;
 	struct perf_event_attr attr;
 	struct perf_event_context *ctx, *uninitialized_var(gctx);
@@ -9924,6 +10069,8 @@ SYSCALL_DEFINE5(perf_event_open,
 		group_leader = group.file->private_data;
 		if (flags & PERF_FLAG_FD_OUTPUT)
 			output_event = group_leader;
+		if (flags & PERF_FLAG_FD_SAMPLE)
+			sample_event = group_leader;
 		if (flags & PERF_FLAG_FD_NO_GROUP)
 			group_leader = NULL;
 	}
@@ -10146,6 +10293,15 @@ SYSCALL_DEFINE5(perf_event_open,
 		}
 	}
 
+	if (sample_event) {
+		/* Grabs sample_event's reference on success */
+		if (!can_sample_for(sample_event, event)) {
+			err = -EINVAL;
+			goto err_locked;
+		}
+
+		event->sample_event = sample_event;
+	}
 
 	/*
 	 * Must be under the same ctx::mutex as perf_install_in_context(),
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v1 5/6] perf: Drop PERF_FLAG_FD_OUTPUT
  2018-06-12  7:51 [PATCH v1 0/6] perf: Add AUX data sampling Alexander Shishkin
                   ` (3 preceding siblings ...)
  2018-06-12  7:51 ` [PATCH v1 4/6] perf: Allow using AUX data in perf samples Alexander Shishkin
@ 2018-06-12  7:51 ` Alexander Shishkin
  2018-06-12  7:51 ` [PATCH v1 6/6] perf: Allow set-output for task contexts of different types Alexander Shishkin
  5 siblings, 0 replies; 24+ messages in thread
From: Alexander Shishkin @ 2018-06-12  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, jolsa, Alexander Shishkin

Since commit:

  ac9721f3f54b ("perf_events: Fix races and clean up perf_event and perf_mmap_data interaction")

... PERF_FLAG_FD_OUTPUT has always returned -EINVAL. This could either
be fixed by supplying enough context information to the set_output
path to validate the request, or the flag can simply be dropped. Given
that in the 8 years since it last worked nobody (except Vince [1])
seems to have noticed, this patch drops it.

[1] https://lkml.org/lkml/2015/1/9/479

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 70918ed33143..9bf5c7abb621 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -344,7 +344,6 @@ static void event_function_local(struct perf_event *event, event_f func, void *d
 }
 
 #define PERF_FLAG_ALL (PERF_FLAG_FD_NO_GROUP |\
-		       PERF_FLAG_FD_OUTPUT  |\
 		       PERF_FLAG_PID_CGROUP |\
 		       PERF_FLAG_FD_CLOEXEC |\
 		       PERF_FLAG_FD_SAMPLE)
@@ -9997,7 +9996,7 @@ SYSCALL_DEFINE5(perf_event_open,
 		struct perf_event_attr __user *, attr_uptr,
 		pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
 {
-	struct perf_event *group_leader = NULL, *output_event = NULL;
+	struct perf_event *group_leader = NULL;
 	struct perf_event *sample_event = NULL;
 	struct perf_event *event, *sibling;
 	struct perf_event_attr attr;
@@ -10067,8 +10066,6 @@ SYSCALL_DEFINE5(perf_event_open,
 		if (err)
 			goto err_fd;
 		group_leader = group.file->private_data;
-		if (flags & PERF_FLAG_FD_OUTPUT)
-			output_event = group_leader;
 		if (flags & PERF_FLAG_FD_SAMPLE)
 			sample_event = group_leader;
 		if (flags & PERF_FLAG_FD_NO_GROUP)
@@ -10223,12 +10220,6 @@ SYSCALL_DEFINE5(perf_event_open,
 			goto err_context;
 	}
 
-	if (output_event) {
-		err = perf_event_set_output(event, output_event);
-		if (err)
-			goto err_context;
-	}
-
 	event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
 					f_flags);
 	if (IS_ERR(event_file)) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v1 6/6] perf: Allow set-output for task contexts of different types
  2018-06-12  7:51 [PATCH v1 0/6] perf: Add AUX data sampling Alexander Shishkin
                   ` (4 preceding siblings ...)
  2018-06-12  7:51 ` [PATCH v1 5/6] perf: Drop PERF_FLAG_FD_OUTPUT Alexander Shishkin
@ 2018-06-12  7:51 ` Alexander Shishkin
  2018-06-14 19:36   ` Peter Zijlstra
  5 siblings, 1 reply; 24+ messages in thread
From: Alexander Shishkin @ 2018-06-12  7:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, linux-kernel, jolsa, Adrian Hunter,
	Alexander Shishkin

From: Adrian Hunter <adrian.hunter@intel.com>

Set-output must be limited to events that cannot be active on
different cpus at the same time.  Thus either the event cpu must be
the same, or the event task must be the same.  The current logic does
not check the task directly but checks whether the perf_event_context
is the same; however, there are separate contexts for hardware and
software events, so in that case the perf_event_context differs even
though the task is the same.  This patch changes the logic to check
the task directly.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9bf5c7abb621..9ab9f21f58da 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9848,9 +9848,21 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 		goto out;
 
 	/*
-	 * If its not a per-cpu rb, it must be the same task.
+	 * If it's not a per-cpu rb, it must be the same task.
+	 *
+	 * Since output_event is a per-task event, ->ctx is stable
+	 * and should be around for as long as the file is around.
+	 *
+	 * The context switch optimization doesn't apply to output_event
+	 * as well, so we can look at its ctx->task, which will be either
+	 * a valid task or TASK_TOMBSTONE.
+	 *
+	 * The source event's task can also be TASK_TOMBSTONE, so look out
+	 * for that also.
 	 */
-	if (output_event->cpu == -1 && output_event->ctx != event->ctx)
+	if (output_event->cpu == -1 &&
+	    (output_event->ctx->task != event->ctx->task ||
+	     output_event->ctx->task == TASK_TOMBSTONE))
 		goto out;
 
 	/*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 1/6] perf: Disable PMU around address filter adjustment
  2018-06-12  7:51 ` [PATCH v1 1/6] perf: Disable PMU around address filter adjustment Alexander Shishkin
@ 2018-06-12 20:11   ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2018-06-12 20:11 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa

On Tue, Jun 12, 2018 at 10:51:12AM +0300, Alexander Shishkin wrote:
> If the PMU is used for AUX data sampling, the hardware event that triggers
> it may come in during a critical section of address filters adjustment (if
> it's delivered as NMI) and in turn call the address filter sync code,
> causing a spinlock recursion.

Can you detail that a little more? How can sampling _ever_ get down that
path?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-12  7:51 ` [PATCH v1 4/6] perf: Allow using AUX data in perf samples Alexander Shishkin
@ 2018-06-14 19:25   ` Peter Zijlstra
  2018-06-14 19:39     ` Peter Zijlstra
  2018-06-14 19:30   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-06-14 19:25 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa

On Tue, Jun 12, 2018 at 10:51:15AM +0300, Alexander Shishkin wrote:
> @@ -882,6 +890,7 @@ struct perf_sample_data {
>  	 */
>  	u64				addr;
>  	struct perf_raw_record		*raw;
> +	struct perf_aux_record		aux;
>  	struct perf_branch_stack	*br_stack;
>  	u64				period;
>  	u64				weight;
> @@ -933,6 +942,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
>  	/* remaining struct members initialized in perf_prepare_sample() */
>  	data->addr = addr;
>  	data->raw  = NULL;
> +	data->aux.from = data->aux.to = data->aux.size = 0;
>  	data->br_stack = NULL;
>  	data->period = period;
>  	data->weight = 0;

You just blew that cacheline...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-12  7:51 ` [PATCH v1 4/6] perf: Allow using AUX data in perf samples Alexander Shishkin
  2018-06-14 19:25   ` Peter Zijlstra
@ 2018-06-14 19:30   ` Peter Zijlstra
  2018-06-14 19:47   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2018-06-14 19:30 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa

On Tue, Jun 12, 2018 at 10:51:15AM +0300, Alexander Shishkin wrote:
> @@ -416,6 +418,7 @@ struct perf_event_attr {
>  	__u32	aux_watermark;
>  	__u16	sample_max_stack;
>  	__u16	__reserved_2;	/* align to __u64 */
> +	__u64	aux_sample_size;
>  };

Do we really need a __u64 for that? I can see how that __u16 hole might
be too small (512K), but do we really need >4G 'samples' ?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 6/6] perf: Allow set-output for task contexts of different types
  2018-06-12  7:51 ` [PATCH v1 6/6] perf: Allow set-output for task contexts of different types Alexander Shishkin
@ 2018-06-14 19:36   ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2018-06-14 19:36 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa,
	Adrian Hunter

On Tue, Jun 12, 2018 at 10:51:17AM +0300, Alexander Shishkin wrote:
> From: Adrian Hunter <adrian.hunter@intel.com>
> 
> Set-output must be limited to events that cannot be active on different
> cpus at the same time.  Thus either the event cpu must be the same, or
> the event task must be the same.

>                                  Current logic does not check the task
> directly but checks whether the perf_event_context is the same; however,
> there are separate contexts for hardware and software events, so in that
> case the perf_event_context is different even though the task is the same.

Thing is, __perf_event_task_sched_out() can lazy switch the different
contexts independently. So if someone breaks clone on either software or
hardware but not both, we'll flip only one ctx around and schedule the
other, completely breaking your assumption above.

> This patch changes the logic to check the task directly.

This Changelog completely and utterly fails to explain why, though.


* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-14 19:25   ` Peter Zijlstra
@ 2018-06-14 19:39     ` Peter Zijlstra
  2018-06-19 10:51       ` Alexander Shishkin
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-06-14 19:39 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa

On Thu, Jun 14, 2018 at 09:25:13PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 12, 2018 at 10:51:15AM +0300, Alexander Shishkin wrote:
> > @@ -882,6 +890,7 @@ struct perf_sample_data {
> >  	 */
> >  	u64				addr;
> >  	struct perf_raw_record		*raw;
> > +	struct perf_aux_record		aux;
> >  	struct perf_branch_stack	*br_stack;
> >  	u64				period;
> >  	u64				weight;
> > @@ -933,6 +942,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
> >  	/* remaining struct members initialized in perf_prepare_sample() */
> >  	data->addr = addr;
> >  	data->raw  = NULL;
> > +	data->aux.from = data->aux.to = data->aux.size = 0;
> >  	data->br_stack = NULL;
> >  	data->period = period;
> >  	data->weight = 0;
> 
> You just blew that cacheline...

And since the whole aux_sample thing seems to be purely prepare/output
driven, I really don't see the need why this needs to be here. AFAICT it
should just work fine in the other part of this structure.

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-12  7:51 ` [PATCH v1 4/6] perf: Allow using AUX data in perf samples Alexander Shishkin
  2018-06-14 19:25   ` Peter Zijlstra
  2018-06-14 19:30   ` Peter Zijlstra
@ 2018-06-14 19:47   ` Peter Zijlstra
  2018-06-19 11:00     ` Alexander Shishkin
  2018-06-14 20:03   ` Peter Zijlstra
  2018-06-14 20:20   ` Peter Zijlstra
  4 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-06-14 19:47 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa

On Tue, Jun 12, 2018 at 10:51:15AM +0300, Alexander Shishkin wrote:
> @@ -6112,6 +6219,32 @@ void perf_prepare_sample(struct perf_event_header *header,
>  
>  	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
>  		data->phys_addr = perf_virt_to_phys(data->addr);
> +
> +	if (sample_type & PERF_SAMPLE_AUX) {
> +		u64 size;
> +
> +		header->size += sizeof(u64); /* size */
> +
> +		/*
> +		 * Given the 16bit nature of header::size, an AUX sample can
> +		 * easily overflow it, what with all the preceding sample bits.
> +		 * Make sure this doesn't happen by using up to U16_MAX bytes
> +		 * per sample in total (rounded down to 8 byte boundary).
> +		 */
> +		size = min_t(size_t, U16_MAX - header->size,
> +			     event->attr.aux_sample_size);
> +		size = rounddown(size, 8);
> +		size = perf_aux_sample_size(event, data, size);
> +
> +		WARN_ON_ONCE(size + header->size > U16_MAX);
> +		header->size += size;
> +	}
> +	/*
> +	 * If you're adding more sample types here, you likely need to do
> +	 * something about the overflowing header::size, like repurpose the
> +	 * lowest 3 bits of size, which should be always zero at the moment.
> +	 */

Bugger yes.. I fairly quickly (but still too late) realized we should've
used that u16 in u64 increments to allow up to 512K instead of 64K
events.

Still, even 64K samples are pretty terrifyingly huge. They'll be
_sloowww_.

In any case, I suppose we can grab one of the attribute bits to rev. the
output format -- a la sample_id_all. Do we really want to do that? 512K
samples.... *shudder*.

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-12  7:51 ` [PATCH v1 4/6] perf: Allow using AUX data in perf samples Alexander Shishkin
                     ` (2 preceding siblings ...)
  2018-06-14 19:47   ` Peter Zijlstra
@ 2018-06-14 20:03   ` Peter Zijlstra
  2018-06-14 20:20   ` Peter Zijlstra
  4 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2018-06-14 20:03 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa

On Tue, Jun 12, 2018 at 10:51:15AM +0300, Alexander Shishkin wrote:
> +	ret = rb_output_aux(rb, data->aux.from, data->aux.to,
> +			    (aux_copyfn)perf_output_copy, handle);

If you look closely, you'll find that perf_output_copy() has 'unsigned
int' as its return type while your aux_copyfn has 'unsigned long'.

I'm thinking that the compiler is entirely in its right to make that
explode at random points in time.

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-12  7:51 ` [PATCH v1 4/6] perf: Allow using AUX data in perf samples Alexander Shishkin
                     ` (3 preceding siblings ...)
  2018-06-14 20:03   ` Peter Zijlstra
@ 2018-06-14 20:20   ` Peter Zijlstra
  2018-06-19 10:47     ` Alexander Shishkin
  4 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-06-14 20:20 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa

On Tue, Jun 12, 2018 at 10:51:15AM +0300, Alexander Shishkin wrote:
> +static unsigned long perf_aux_sample_size(struct perf_event *event,
> +					  struct perf_sample_data *data,
> +					  size_t size)
> +{
> +	struct perf_event *sampler = event->sample_event;
> +	struct ring_buffer *rb;
> +	int *disable_count;
> +
> +	data->aux.size = 0;
> +
> +	if (!sampler || READ_ONCE(sampler->state) != PERF_EVENT_STATE_ACTIVE)
> +		goto out;
> +
> +	if (READ_ONCE(sampler->oncpu) != smp_processor_id())
> +		goto out;

Should those not be WARNs ? If we allow a configuration where that is
possible, we're doing it wrong I think.

> +	/*
> +	 * Non-zero disable count here means that we, being the NMI
> +	 * context, are racing with pmu::add, pmu::del or address filter
> +	 * adjustment, which we want to avoid.
> +	 */
> +	disable_count = this_cpu_ptr(sampler->pmu->pmu_disable_count);
> +	if (*disable_count)
> +		goto out;
> +
> +	/* Re-enabled in perf_aux_sample_output() */
> +	perf_pmu_disable(sampler->pmu);

This is disgusting... and also broken, I think. Imagine what happens
when the NMI hits in the middle of perf_pmu_enable(), right where count
dropped to 0, but we've not yet done pmu->pmu_enable().

Then we end up with a double disable and double enable.

> +
> +	rb = ring_buffer_get(sampler);
> +	if (!rb) {
> +		perf_pmu_enable(sampler->pmu);
> +		goto out;
> +	}
> +
> +	/* Restarted in perf_aux_sample_output() */
> +	sampler->pmu->stop(sampler, PERF_EF_UPDATE);

More yuck...

You really should not be calling these pmu::methods; they're meant to
be used from _interrupt_ not NMI context. Using them like this is asking
for tons of trouble.

Why can't you just snapshot the current location and let the thing
'run' ?

> +	data->aux.to = rb->aux_head;
> +
> +	size = min(size, perf_aux_size(rb));
> +
> +	if (data->aux.to < size)
> +		data->aux.from = rb->aux_nr_pages * PAGE_SIZE + data->aux.to -
> +			size;
> +	else
> +		data->aux.from = data->aux.to - size;
> +	data->aux.size = ALIGN(size, sizeof(u64));
> +	ring_buffer_put(rb);
> +
> +out:
> +	return data->aux.size;
> +}
> +
> +static void perf_aux_sample_output(struct perf_event *event,
> +				   struct perf_output_handle *handle,
> +				   struct perf_sample_data *data)
> +{

> +	ring_buffer_put(rb);
> +	sampler->pmu->start(sampler, 0);
> +
> +out_enable:
> +	perf_pmu_enable(sampler->pmu);

More of that... really yuck stuff.

> +}

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-14 20:20   ` Peter Zijlstra
@ 2018-06-19 10:47     ` Alexander Shishkin
  2018-06-21 20:16       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Alexander Shishkin @ 2018-06-19 10:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, jolsa

On Thu, Jun 14, 2018 at 10:20:22PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 12, 2018 at 10:51:15AM +0300, Alexander Shishkin wrote:
> > +static unsigned long perf_aux_sample_size(struct perf_event *event,
> > +					  struct perf_sample_data *data,
> > +					  size_t size)
> > +{
> > +	struct perf_event *sampler = event->sample_event;
> > +	struct ring_buffer *rb;
> > +	int *disable_count;
> > +
> > +	data->aux.size = 0;
> > +
> > +	if (!sampler || READ_ONCE(sampler->state) != PERF_EVENT_STATE_ACTIVE)
> > +		goto out;
> > +
> > +	if (READ_ONCE(sampler->oncpu) != smp_processor_id())
> > +		goto out;
> 
> Should those not be WARNs ? If we allow a configuration where that is
> possible, we're doing it wrong I think.

Indeed, this shouldn't happen.

> > +	/*
> > +	 * Non-zero disable count here means that we, being the NMI
> > +	 * context, are racing with pmu::add, pmu::del or address filter
> > +	 * adjustment, which we want to avoid.
> > +	 */
> > +	disable_count = this_cpu_ptr(sampler->pmu->pmu_disable_count);
> > +	if (*disable_count)
> > +		goto out;
> > +
> > +	/* Re-enabled in perf_aux_sample_output() */
> > +	perf_pmu_disable(sampler->pmu);
> 
> This is disgusting... and also broken, I think. Imagine what happens
> when the NMI hits in the middle of perf_pmu_enable(), right where count
> dropped to 0, but we've not yet done pmu->pmu_enable().
> 
> Then we end up with a double disable and double enable.

True, we kind of tiptoe around that by not having ->pmu_enable() and
->pmu_disable() for PT.

> > +
> > +	rb = ring_buffer_get(sampler);
> > +	if (!rb) {
> > +		perf_pmu_enable(sampler->pmu);
> > +		goto out;
> > +	}
> > +
> > +	/* Restarted in perf_aux_sample_output() */
> > +	sampler->pmu->stop(sampler, PERF_EF_UPDATE);
> 
> More yuck...
> 
> You really should not be calling these pmu::methods; they're meant to
> be used from _interrupt_ not NMI context. Using them like this is asking
> for tons of trouble.

Right, the SW stuff may then race with event_function_call() stuff. Hmm.
For the HW stuff, I'm hoping that some kind of a sleight of hand may
suffice. Let me think some more.

> Why can't you just snapshot the current location and let the thing
> 'run' ?

Because the buffer will overwrite itself and the location will be useless.
We don't write the AUX data out in this 'mode' at all, only the samples,
which allows for much less data in the resulting perf.data, less work for
the consumer, less IO bandwidth etc, and as a bonus, no AUX-related
interrupts.

But actually, even to snapshot the location we need to stop the event.

> > +	data->aux.to = rb->aux_head;
> > +
> > +	size = min(size, perf_aux_size(rb));
> > +
> > +	if (data->aux.to < size)
> > +		data->aux.from = rb->aux_nr_pages * PAGE_SIZE + data->aux.to -
> > +			size;
> > +	else
> > +		data->aux.from = data->aux.to - size;
> > +	data->aux.size = ALIGN(size, sizeof(u64));
> > +	ring_buffer_put(rb);
> > +
> > +out:
> > +	return data->aux.size;
> > +}
> > +
> > +static void perf_aux_sample_output(struct perf_event *event,
> > +				   struct perf_output_handle *handle,
> > +				   struct perf_sample_data *data)
> > +{
> 
> > +	ring_buffer_put(rb);
> > +	sampler->pmu->start(sampler, 0);
> > +
> > +out_enable:
> > +	perf_pmu_enable(sampler->pmu);
> 
> More of that... really yuck stuff.
> 
> > +}

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-14 19:39     ` Peter Zijlstra
@ 2018-06-19 10:51       ` Alexander Shishkin
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Shishkin @ 2018-06-19 10:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, jolsa

On Thu, Jun 14, 2018 at 09:39:17PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 14, 2018 at 09:25:13PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 12, 2018 at 10:51:15AM +0300, Alexander Shishkin wrote:
> > > @@ -882,6 +890,7 @@ struct perf_sample_data {
> > >  	 */
> > >  	u64				addr;
> > >  	struct perf_raw_record		*raw;
> > > +	struct perf_aux_record		aux;
> > >  	struct perf_branch_stack	*br_stack;
> > >  	u64				period;
> > >  	u64				weight;
> > > @@ -933,6 +942,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
> > >  	/* remaining struct members initialized in perf_prepare_sample() */
> > >  	data->addr = addr;
> > >  	data->raw  = NULL;
> > > +	data->aux.from = data->aux.to = data->aux.size = 0;
> > >  	data->br_stack = NULL;
> > >  	data->period = period;
> > >  	data->weight = 0;
> > 
> > You just blew that cacheline...
> 
> And since the whole aux_sample thing seems to be purely prepare/output
> driven, I really don't see the need why this needs to be here. AFAICT it
> should just work fine in the other part of this structure.

Yes, this wasn't intentional. Will fix in the next round.

Regards,
--
Alex

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-14 19:47   ` Peter Zijlstra
@ 2018-06-19 11:00     ` Alexander Shishkin
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Shishkin @ 2018-06-19 11:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, jolsa

On Thu, Jun 14, 2018 at 09:47:20PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 12, 2018 at 10:51:15AM +0300, Alexander Shishkin wrote:
> > @@ -6112,6 +6219,32 @@ void perf_prepare_sample(struct perf_event_header *header,
> >  
> >  	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
> >  		data->phys_addr = perf_virt_to_phys(data->addr);
> > +
> > +	if (sample_type & PERF_SAMPLE_AUX) {
> > +		u64 size;
> > +
> > +		header->size += sizeof(u64); /* size */
> > +
> > +		/*
> > +		 * Given the 16bit nature of header::size, an AUX sample can
> > +		 * easily overflow it, what with all the preceding sample bits.
> > +		 * Make sure this doesn't happen by using up to U16_MAX bytes
> > +		 * per sample in total (rounded down to 8 byte boundary).
> > +		 */
> > +		size = min_t(size_t, U16_MAX - header->size,
> > +			     event->attr.aux_sample_size);
> > +		size = rounddown(size, 8);
> > +		size = perf_aux_sample_size(event, data, size);
> > +
> > +		WARN_ON_ONCE(size + header->size > U16_MAX);
> > +		header->size += size;
> > +	}
> > +	/*
> > +	 * If you're adding more sample types here, you likely need to do
> > +	 * something about the overflowing header::size, like repurpose the
> > +	 * lowest 3 bits of size, which should be always zero at the moment.
> > +	 */
> 
> Bugger yes.. I fairly quickly (but still too late) realized we should've
> used that u16 in u64 increments to allow up to 512K instead of 64K
> events.
>
> Still, even 64K samples are pretty terrifyingly huge. They'll be
> _sloowww_.
> 
> In any case, I suppose we can grab one of the attribute bits to rev. the
> output format -- a la sample_id_all. Do we really want to do that? 512K
> samples.... *shudder*.

Well, as far as PT goes, even a 4K sample carries thousands of branches;
I can't imagine ever needing more than 64K, but I didn't really want to make
the call. The same goes for the width of aux_sample_size, which should match
what we can theoretically output.

Also, as a thought, we could use perf_adjust_period() to also reduce the
sample size dynamically if it takes too long to copy.

Regards,
--
Alex

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-19 10:47     ` Alexander Shishkin
@ 2018-06-21 20:16       ` Peter Zijlstra
  2018-10-02 14:00         ` Alexander Shishkin
  2019-08-09 12:32         ` Alexander Shishkin
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2018-06-21 20:16 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa

On Tue, Jun 19, 2018 at 01:47:25PM +0300, Alexander Shishkin wrote:
> On Thu, Jun 14, 2018 at 10:20:22PM +0200, Peter Zijlstra wrote:

> > More yuck...
> > 
> > You really should not be calling these pmu::methods; they're meant to
> > be used from _interrupt_ not NMI context. Using them like this is asking
> > for tons of trouble.
> 
> Right, the SW stuff may then race with event_function_call() stuff. Hmm.
> For the HW stuff, I'm hoping that some kind of a sleight of hand may
> suffice. Let me think some more.

I currently don't see how the SW-driven snapshot can ever work; see my
comment on the last patch.

> > Why can't you just snapshot the current location and let the thing
> > 'run' ?
> 
> Because the buffer will overwrite itself and the location will be useless.

Not if it's large enough ;-)

> We don't write the AUX data out in this 'mode' at all, only the samples,
> which allows for much less data in the resulting perf.data, less work for
> the consumer, less IO bandwidth etc, and as a bonus, no AUX-related
> interrupts.
> 
> But actually, even to snapshot the location we need to stop the event.

Maybe new methods that should only be called from NMI context? 

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-21 20:16       ` Peter Zijlstra
@ 2018-10-02 14:00         ` Alexander Shishkin
  2019-08-09 12:32         ` Alexander Shishkin
  1 sibling, 0 replies; 24+ messages in thread
From: Alexander Shishkin @ 2018-10-02 14:00 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Jun 19, 2018 at 01:47:25PM +0300, Alexander Shishkin wrote:
>> On Thu, Jun 14, 2018 at 10:20:22PM +0200, Peter Zijlstra wrote:
>
>> > More yuck...
>> > 
>> > You really should not be calling these pmu::methods; they're meant to
>> > be used from _interrupt_ not NMI context. Using them like this is asking
>> > for tons of trouble.
>> 
>> Right, the SW stuff may then race with event_function_call() stuff. Hmm.
>> For the HW stuff, I'm hoping that some kind of a sleight of hand may
>> suffice. Let me think some more.
>
> I currently don't see how the SW-driven snapshot can ever work; see my
> comment on the last patch.

Unless we explicitly break clone when PERF_SAMPLE_AUX is set. Basically,
if you're asking for AUX samples, a full perf context switch is not your
biggest performance penalty.

>> > Why can't you just snapshot the current location and let the thing
>> > 'run' ?
>> 
>> Because the buffer will overwrite itself and the location will be useless.
>
> Not if it's large enough ;-)
>
>> We don't write the AUX data out in this 'mode' at all, only the samples,
>> which allows for much less data in the resulting perf.data, less work for
>> the consumer, less IO bandwidth etc, and as a bonus, no AUX-related
>> interrupts.
>> 
>> But actually, even to snapshot the location we need to stop the event.
>
> Maybe new methods that should only be called from NMI context? 

True. Ideally, the existing ->start()/->stop()/interrupt should already
do all necessary internal bookkeeping.

Regards,
--
Alex

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2018-06-21 20:16       ` Peter Zijlstra
  2018-10-02 14:00         ` Alexander Shishkin
@ 2019-08-09 12:32         ` Alexander Shishkin
  2019-09-26 12:04           ` Alexander Shishkin
  2019-09-26 14:44           ` Peter Zijlstra
  1 sibling, 2 replies; 24+ messages in thread
From: Alexander Shishkin @ 2019-08-09 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa,
	alexander.shishkin

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Jun 19, 2018 at 01:47:25PM +0300, Alexander Shishkin wrote:
>> Right, the SW stuff may then race with event_function_call() stuff. Hmm.
>> For the HW stuff, I'm hoping that some kind of a sleight of hand may
>> suffice. Let me think some more.
>
> I currently don't see how the SW-driven snapshot can ever work; see my
> comment on the last patch.

Yes, this should be solved by grouping, similarly to PEBS->PT.

>> > Why can't you just snapshot the current location and let the thing
>> > 'run' ?
>> 
>> Because the buffer will overwrite itself and the location will be useless.
>
> Not if it's large enough ;-)
>
>> We don't write the AUX data out in this 'mode' at all, only the samples,
>> which allows for much less data in the resulting perf.data, less work for
>> the consumer, less IO bandwidth etc, and as a bonus, no AUX-related
>> interrupts.
>> 
>> But actually, even to snapshot the location we need to stop the event.
>
> Maybe new methods that should only be called from NMI context? 

Right, I went with that and added a ->snapshot_aux() method that can't
change any state or do anything that would affect the operation of
potentially preempted in-IRQ callbacks. So, if the NMI hits in the middle
of ->stop() or whatnot, we should be fine. Some context: in AUX overwrite
mode (aka "snapshot mode") we don't get PT-related NMIs; the only NMIs
are from the HW events for which we're writing samples. We use the
cpu-local ->handle_nmi (a kind of misnomer) to tell us if the NMI hit
between ->start() and ->stop() and we can safely take the sample.

The other problem is sampling SW events, which would require taking
ctx->lock to prevent racing with event_function_call()s from other cpus,
resulting in a somewhat cringy "if (!in_nmi()) raw_spin_lock(...)", but I
don't have a better idea of how to handle that.

The whole thing is squashed into one patch below for convenience.

From df0b5efd9c24a9e4477a3501331888c4b4682588 Mon Sep 17 00:00:00 2001
From: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Date: Tue, 12 Jun 2018 10:51:15 +0300
Subject: [RFC v0] perf, pt: Allow using AUX data in perf samples

---
 arch/x86/events/intel/pt.c      |  40 +++++++++
 include/linux/perf_event.h      |  15 ++++
 include/uapi/linux/perf_event.h |   7 +-
 kernel/events/core.c            | 139 +++++++++++++++++++++++++++++++-
 kernel/events/internal.h        |   1 +
 5 files changed, 200 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index d58124d93e5f..a7318c29242e 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1413,6 +1413,45 @@ static void pt_event_stop(struct perf_event *event, int mode)
 	}
 }
 
+static long pt_event_snapshot_aux(struct perf_event *event,
+				  struct perf_output_handle *handle,
+				  unsigned long size)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+	struct pt_buffer *buf = perf_get_aux(&pt->handle);
+	unsigned long from = 0, to;
+	long ret;
+
+	if (!buf->snapshot)
+		return 0;
+
+	/*
+	 * Here, handle_nmi tells us if the tracing is on
+	 */
+	if (!READ_ONCE(pt->handle_nmi))
+		return 0;
+
+	pt_config_stop(event);
+
+	pt_read_offset(buf);
+	pt_update_head(pt);
+
+	to = local_read(&buf->data_size);
+	if (to < size)
+		from = buf->nr_pages << PAGE_SHIFT;
+	from += to - size;
+
+	ret = perf_output_copy_aux(&pt->handle, handle, from, to);
+
+	/*
+	 * If the tracing was on, restart it.
+	 */
+	if (READ_ONCE(pt->handle_nmi))
+		pt_config(event);
+
+	return ret;
+}
+
 static void pt_event_del(struct perf_event *event, int mode)
 {
 	pt_event_stop(event, PERF_EF_UPDATE);
@@ -1532,6 +1571,7 @@ static __init int pt_init(void)
 	pt_pmu.pmu.del			 = pt_event_del;
 	pt_pmu.pmu.start		 = pt_event_start;
 	pt_pmu.pmu.stop			 = pt_event_stop;
+	pt_pmu.pmu.snapshot_aux		 = pt_event_snapshot_aux;
 	pt_pmu.pmu.read			 = pt_event_read;
 	pt_pmu.pmu.setup_aux		 = pt_buffer_setup_aux;
 	pt_pmu.pmu.free_aux		 = pt_buffer_free_aux;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9c8b70e8dc99..48ad35da73cd 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -103,6 +103,10 @@ struct perf_branch_stack {
 	struct perf_branch_entry	entries[0];
 };
 
+struct perf_aux_record {
+	u64		size;
+};
+
 struct task_struct;
 
 /*
@@ -248,6 +252,8 @@ struct perf_event;
 #define PERF_PMU_CAP_NO_EXCLUDE			0x80
 #define PERF_PMU_CAP_AUX_OUTPUT			0x100
 
+struct perf_output_handle;
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -422,6 +428,13 @@ struct pmu {
 	 */
 	void (*free_aux)		(void *aux); /* optional */
 
+	/*
+	 * Fun stuff
+	 */
+	long (*snapshot_aux)		(struct perf_event *event,
+					 struct perf_output_handle *handle,
+					 unsigned long size);
+
 	/*
 	 * Validate address range filters: make sure the HW supports the
 	 * requested configuration and number of filters; return 0 if the
@@ -960,6 +973,7 @@ struct perf_sample_data {
 		u32	reserved;
 	}				cpu_entry;
 	struct perf_callchain_entry	*callchain;
+	struct perf_aux_record		aux;
 
 	/*
 	 * regs_user may point to task_pt_regs or to regs_user_copy, depending
@@ -992,6 +1006,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->weight = 0;
 	data->data_src.val = PERF_MEM_NA;
 	data->txn = 0;
+	data->aux.size = 0;
 }
 
 extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bb7b271397a6..947f49d46f0d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -141,8 +141,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_REGS_INTR			= 1U << 18,
 	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
+	PERF_SAMPLE_AUX				= 1U << 20,
 
-	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 21,		/* non-ABI */
 
 	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
 };
@@ -300,6 +301,7 @@ enum perf_event_read_format {
 					/* add: sample_stack_user */
 #define PERF_ATTR_SIZE_VER4	104	/* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5	112	/* add: aux_watermark */
+#define PERF_ATTR_SIZE_VER6	120	/* add: aux_sample_size */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -425,6 +427,7 @@ struct perf_event_attr {
 	__u32	aux_watermark;
 	__u16	sample_max_stack;
 	__u16	__reserved_2;	/* align to __u64 */
+	__u64	aux_sample_size;
 };
 
 /*
@@ -864,6 +867,8 @@ enum perf_event_type {
 	 *	{ u64			abi; # enum perf_sample_regs_abi
 	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
+	 *	{ u64			size;
+	 *	  char			data[size]; } && PERF_SAMPLE_AUX
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d72b756b4ccb..9754c1af51a4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1953,7 +1953,10 @@ static int perf_get_aux_event(struct perf_event *event,
 	if (!group_leader)
 		return 0;
 
-	if (!perf_aux_output_match(event, group_leader))
+	if (event->attr.aux_output && !perf_aux_output_match(event, group_leader))
+		return 0;
+
+	if (event->attr.aux_sample_size && !group_leader->pmu->snapshot_aux)
 		return 0;
 
 	if (!atomic_long_inc_not_zero(&group_leader->refcount))
@@ -6150,6 +6153,103 @@ perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
 	}
 }
 
+static unsigned long perf_aux_sample_size(struct perf_event *event,
+					  struct perf_sample_data *data,
+					  size_t size)
+{
+	struct perf_event *sampler = event->aux_event;
+	struct ring_buffer *rb;
+
+	data->aux.size = 0;
+
+	if (!sampler)
+		goto out;
+
+	if (WARN_ON_ONCE(READ_ONCE(sampler->state) != PERF_EVENT_STATE_ACTIVE))
+		goto out;
+
+	if (WARN_ON_ONCE(READ_ONCE(sampler->oncpu) != smp_processor_id()))
+		goto out;
+
+	rb = ring_buffer_get(sampler);
+	if (!rb)
+		goto out;
+
+	size = min(size, perf_aux_size(rb));
+	data->aux.size = ALIGN(size, sizeof(u64));
+	ring_buffer_put(rb);
+
+out:
+	return data->aux.size;
+}
+
+int perf_pmu_aux_sample_output(struct perf_event *event,
+			       struct perf_output_handle *handle,
+			       unsigned long size)
+{
+	unsigned long flags;
+	int ret;
+
+	/*
+	 * NMI vs IRQ
+	 *
+	 * Normal ->start()/->stop() callbacks run in IRQ mode in scheduler
+	 * paths. If we start calling them in NMI context, they may race with
+	 * the IRQ ones, that is, for example, re-starting an event that's just
+	 * been stopped.
+	 */
+	if (!in_nmi())
+		raw_spin_lock_irqsave(&event->ctx->lock, flags);
+
+	ret = event->pmu->snapshot_aux(event, handle, size);
+
+	if (!in_nmi())
+		raw_spin_unlock_irqrestore(&event->ctx->lock, flags);
+
+	return ret;
+}
+
+static void perf_aux_sample_output(struct perf_event *event,
+				   struct perf_output_handle *handle,
+				   struct perf_sample_data *data)
+{
+	struct perf_event *sampler = event->aux_event;
+	unsigned long pad;
+	struct ring_buffer *rb;
+	int ret;
+
+	if (WARN_ON_ONCE(!sampler || !data->aux.size))
+		return;
+
+	rb = ring_buffer_get(sampler);
+	if (!rb)
+		return;
+
+	if (READ_ONCE(rb->aux_in_sampling))
+		goto out_put;
+
+	WRITE_ONCE(rb->aux_in_sampling, 1);
+
+	ret = perf_pmu_aux_sample_output(sampler, handle, data->aux.size);
+	if (ret < 0) {
+		pr_warn_ratelimited("failed to copy trace data\n");
+		goto out_clear;
+	}
+
+	pad = data->aux.size - ret;
+	if (pad) {
+		u64 p = 0;
+
+		perf_output_copy(handle, &p, pad);
+	}
+
+out_clear:
+	WRITE_ONCE(rb->aux_in_sampling, 0);
+
+out_put:
+	ring_buffer_put(rb);
+}
+
 static void __perf_event_header__init_id(struct perf_event_header *header,
 					 struct perf_sample_data *data,
 					 struct perf_event *event)
@@ -6469,6 +6569,13 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
 		perf_output_put(handle, data->phys_addr);
 
+	if (sample_type & PERF_SAMPLE_AUX) {
+		perf_output_put(handle, data->aux.size);
+
+		if (data->aux.size)
+			perf_aux_sample_output(event, handle, data);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -6657,6 +6764,32 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
 		data->phys_addr = perf_virt_to_phys(data->addr);
+
+	if (sample_type & PERF_SAMPLE_AUX) {
+		u64 size;
+
+		header->size += sizeof(u64); /* size */
+
+		/*
+		 * Given the 16bit nature of header::size, an AUX sample can
+		 * easily overflow it, what with all the preceding sample bits.
+		 * Make sure this doesn't happen by using up to U16_MAX bytes
+		 * per sample in total (rounded down to 8 byte boundary).
+		 */
+		size = min_t(size_t, U16_MAX - header->size,
+			     event->attr.aux_sample_size);
+		size = rounddown(size, 8);
+		size = perf_aux_sample_size(event, data, size);
+
+		WARN_ON_ONCE(size + header->size > U16_MAX);
+		header->size += size;
+	}
+	/*
+	 * If you're adding more sample types here, you likely need to do
+	 * something about the overflowing header::size, like repurpose the
+	 * lowest 3 bits of size, which should be always zero at the moment.
+	 */
+	WARN_ON_ONCE(header->size & 7);
 }
 
 static __always_inline int
@@ -11171,9 +11304,12 @@ SYSCALL_DEFINE5(perf_event_open,
 		}
 	}
 
 	if (event->attr.aux_output && !perf_get_aux_event(event, group_leader))
 		goto err_locked;
 
+	if (event->attr.aux_sample_size && !perf_get_aux_event(event, group_leader))
+		goto err_locked;
+
 	/*
 	 * Must be under the same ctx::mutex as perf_install_in_context(),
 	 * because we need to serialize with concurrent event creation.
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 3aef4191798c..747d67f130cb 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -50,6 +50,7 @@ struct ring_buffer {
 	unsigned long			aux_mmap_locked;
 	void				(*free_aux)(void *);
 	refcount_t			aux_refcount;
+	int				aux_in_sampling;
 	void				**aux_pages;
 	void				*aux_priv;
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2019-08-09 12:32         ` Alexander Shishkin
@ 2019-09-26 12:04           ` Alexander Shishkin
  2019-09-26 14:44           ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Alexander Shishkin @ 2019-09-26 12:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa,
	alexander.shishkin

Alexander Shishkin <alexander.shishkin@linux.intel.com> writes:

> Peter Zijlstra <peterz@infradead.org> writes:
>
>> On Tue, Jun 19, 2018 at 01:47:25PM +0300, Alexander Shishkin wrote:
>>> Right, the SW stuff may then race with event_function_call() stuff. Hmm.
>>> For the HW stuff, I'm hoping that some kind of a sleight of hand may
>>> suffice. Let me think some more.
>>
>> I currently don't see how the SW driven snapshot can ever work, see my
>> comment on the last patch.
>
> Yes, this should be solved by grouping, similarly PEBS->PT.

Peter, do you have any thoughts/suggestions as to this RFC? Thanks!

Regards,
--
Alex


* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2019-08-09 12:32         ` Alexander Shishkin
  2019-09-26 12:04           ` Alexander Shishkin
@ 2019-09-26 14:44           ` Peter Zijlstra
  2019-09-30 11:50             ` Alexander Shishkin
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2019-09-26 14:44 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa

On Fri, Aug 09, 2019 at 03:32:39PM +0300, Alexander Shishkin wrote:
> The other problem is sampling SW events, which would require taking
> ctx->lock to prevent racing with event_function_call()s from other CPUs,
> resulting in a somewhat cringy "if (!in_nmi()) raw_spin_lock(...)", but
> I don't have a better idea of how to handle that.

> +int perf_pmu_aux_sample_output(struct perf_event *event,
> +			       struct perf_output_handle *handle,
> +			       unsigned long size)
> +{
> +	unsigned long flags;
> +	int ret;
> +
> +	/*
> +	 * NMI vs IRQ
> +	 *
> +	 * Normal ->start()/->stop() callbacks run in IRQ mode in scheduler
> +	 * paths. If we start calling them in NMI context, they may race with
> +	 * the IRQ ones, that is, for example, re-starting an event that's just
> +	 * been stopped.
> +	 */
> +	if (!in_nmi())
> +		raw_spin_lock_irqsave(&event->ctx->lock, flags);
> +
> +	ret = event->pmu->snapshot_aux(event, handle, size);
> +
> +	if (!in_nmi())
> +		raw_spin_unlock_irqrestore(&event->ctx->lock, flags);
> +
> +	return ret;
> +}

I'm confused... would not something like:

	unsigned long flags;

	local_irq_save(flags);
	ret = event->pmu->snapshot_aux(...);
	local_irq_restore(flags);

	return ret;

Be sufficient? By disabling IRQs we already hold off remote
event_function_call()s.

Or am I misunderstanding the race here?


* Re: [PATCH v1 4/6] perf: Allow using AUX data in perf samples
  2019-09-26 14:44           ` Peter Zijlstra
@ 2019-09-30 11:50             ` Alexander Shishkin
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Shishkin @ 2019-09-30 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, jolsa,
	alexander.shishkin

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Aug 09, 2019 at 03:32:39PM +0300, Alexander Shishkin wrote:
>> The other problem is sampling SW events, which would require taking
>> ctx->lock to prevent racing with event_function_call()s from other CPUs,
>> resulting in a somewhat cringy "if (!in_nmi()) raw_spin_lock(...)", but
>> I don't have a better idea of how to handle that.
>
>> +int perf_pmu_aux_sample_output(struct perf_event *event,
>> +			       struct perf_output_handle *handle,
>> +			       unsigned long size)
>> +{
>> +	unsigned long flags;
>> +	int ret;
>> +
>> +	/*
>> +	 * NMI vs IRQ
>> +	 *
>> +	 * Normal ->start()/->stop() callbacks run in IRQ mode in scheduler
>> +	 * paths. If we start calling them in NMI context, they may race with
>> +	 * the IRQ ones, that is, for example, re-starting an event that's just
>> +	 * been stopped.
>> +	 */
>> +	if (!in_nmi())
>> +		raw_spin_lock_irqsave(&event->ctx->lock, flags);
>> +
>> +	ret = event->pmu->snapshot_aux(event, handle, size);
>> +
>> +	if (!in_nmi())
>> +		raw_spin_unlock_irqrestore(&event->ctx->lock, flags);
>> +
>> +	return ret;
>> +}
>
> I'm confused... would not something like:
>
> 	unsigned long flags;
>
> 	local_irq_save(flags);
> 	ret = event->pmu->snapshot_aux(...);
> 	local_irq_restore(flags);
>
> 	return ret;
>
> Be sufficient? By disabling IRQs we already hold off remote
> event_function_call()s.
>
> Or am I misunderstanding the race here?

No, you're right, disabling IRQs covers our bases.

Thanks,
--
Alex


end of thread, other threads:[~2019-09-30 11:51 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-12  7:51 [PATCH v1 0/6] perf: Add AUX data sampling Alexander Shishkin
2018-06-12  7:51 ` [PATCH v1 1/6] perf: Disable PMU around address filter adjustment Alexander Shishkin
2018-06-12 20:11   ` Peter Zijlstra
2018-06-12  7:51 ` [PATCH v1 2/6] perf: Disable IRQs in address filter sync path Alexander Shishkin
2018-06-12  7:51 ` [PATCH v1 3/6] perf: Add an iterator for AUX data Alexander Shishkin
2018-06-12  7:51 ` [PATCH v1 4/6] perf: Allow using AUX data in perf samples Alexander Shishkin
2018-06-14 19:25   ` Peter Zijlstra
2018-06-14 19:39     ` Peter Zijlstra
2018-06-19 10:51       ` Alexander Shishkin
2018-06-14 19:30   ` Peter Zijlstra
2018-06-14 19:47   ` Peter Zijlstra
2018-06-19 11:00     ` Alexander Shishkin
2018-06-14 20:03   ` Peter Zijlstra
2018-06-14 20:20   ` Peter Zijlstra
2018-06-19 10:47     ` Alexander Shishkin
2018-06-21 20:16       ` Peter Zijlstra
2018-10-02 14:00         ` Alexander Shishkin
2019-08-09 12:32         ` Alexander Shishkin
2019-09-26 12:04           ` Alexander Shishkin
2019-09-26 14:44           ` Peter Zijlstra
2019-09-30 11:50             ` Alexander Shishkin
2018-06-12  7:51 ` [PATCH v1 5/6] perf: Drop PERF_FLAG_FD_OUTPUT Alexander Shishkin
2018-06-12  7:51 ` [PATCH v1 6/6] perf: Allow set-output for task contexts of different types Alexander Shishkin
2018-06-14 19:36   ` Peter Zijlstra
