* [RFC PATCH 00/17] perf: Detached events
@ 2017-09-05 13:30 Alexander Shishkin
  2017-09-05 13:30 ` [RFC PATCH 01/17] perf: Allow mmapping only user page Alexander Shishkin
                   ` (17 more replies)
  0 siblings, 18 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

Hi,

I'm going to keep this short.

Objective: include perf data (specifically, AUX/Intel PT) in process core
dumps.

Obstacles and how this patchset deals with them:
(1) Need to be able to have perf events running without a consumer (perf
record) running in the background.
Detached events: a new flag to the perf syscall creates a 'detached' event,
which continues to exist after its file descriptor is released. Not all
detached events are per-thread AUX events: this also tries to take into
account the need for system-wide persistent events.

(2) Need to be able to kill those events, so they need to be accessible
after they are created.
Event files: detached events exist as files in tracefs (at the moment), which
can be opened/mmapped/read/removed.

(3) Ring buffer contents from these events need to end up in the core dump
file.
Injecting the perf ring buffer into the target task's address space.

(4) Inheritance will have to allocate ring buffers for such events for this
feature to be useful.
A parentless detached event is created (with a ring buffer) upon inheritance;
there is no output redirection, and each event has its own ring buffer.

(5) A side effect of (4) is that we can't use GFP_KERNEL pages for such ring
buffers, or else we'd have to fail inherit_event() (and, therefore, the user's
fork()) when they exhaust their mlock limit.
Using shmemfs-backed pages for such ring buffers and only pinning them while
the corresponding target task is running; at other times these pages can be
swapped out.

(6) Ring buffer memory accounting needs to take this new arrangement into
account: one user can pin at most NR_CPUS * buffer_size worth of memory at
any given point in time.
Only account the first such event and undo the accounting when the last
event is gone.

(7) We also need to supply all the things that the [PT] decoder normally
finds out via sysfs attributes (clock ratios, capabilities, etc.), so that
this information also finds its way into the core dump file.
A "PMU info" structure is appended to the user page.

I've also hacked the perf tool to support all this; those changes can be
found at [1]. I'm not posting the tooling patches, though, as they are
thoroughly ugly and proof-of-concept. In short, perf record creates detached
events with '--detached' and afterwards opens them via their path in tracefs;
a rough syscall-level sketch of that flow follows the link below.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/ash/linux.git/log/?h=perf-detached-shmem-wip
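
For illustration, a minimal, hypothetical sketch of the syscall-level flow
such a tool would use (PERF_FLAG_DETACHED and the detached_*_nr_pages
attribute fields are introduced in patches 05 and 06 of this series; the PMU
type and page counts are made up for the example):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static int create_detached_event(pid_t pid, __u32 pmu_type)
{
        struct perf_event_attr attr = {
                .size = sizeof(attr),
                .type = pmu_type,               /* e.g. Intel PT's PMU type from sysfs */
                .exclude_kernel = 1,
                .detached_nr_pages = 8,         /* data area: 8 pages */
                .detached_aux_nr_pages = 64,    /* AUX area: 64 pages */
        };
        int fd = syscall(__NR_perf_event_open, &attr, pid, -1, -1,
                         PERF_FLAG_DETACHED);

        if (fd < 0)
                return -1;

        /*
         * The event now lives on without the file descriptor; it shows up
         * as a file under the "perf" directory in tracefs and is destroyed
         * by unlinking that file.
         */
        close(fd);
        return 0;
}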

Alexander Shishkin (17):
  perf: Allow mmapping only user page
  perf: Factor out mlock accounting
  tracefs: De-globalize instances' callbacks
  tracefs: Add ->unlink callback to tracefs_dir_ops
  perf: Introduce detached events
  perf: Add buffers to the detached events
  perf: Add pmu_info to user page
  perf: Allow inheritance for detached events
  perf: Use shmemfs pages for userspace-only per-thread detached events
  perf: Implement pinning and scheduling for SHMEM events
  perf: Implement mlock accounting for shmem ring buffers
  perf: Track pinned events per user
  perf: Re-inject shmem buffers after exec
  perf: Add ioctl(REATTACH) for detached events
  perf: Allow controlled non-root access to detached events
  perf/x86/intel/pt: Add PMU info
  perf/x86/intel/bts: Add PMU info

 arch/x86/events/intel/bts.c     |  20 +-
 arch/x86/events/intel/pt.c      |  23 +-
 arch/x86/events/intel/pt.h      |  11 +
 fs/tracefs/inode.c              |  71 +++-
 include/linux/perf_event.h      |  33 ++
 include/linux/sched/user.h      |   6 +
 include/linux/tracefs.h         |   3 +-
 include/uapi/linux/perf_event.h |  15 +
 kernel/events/core.c            | 526 +++++++++++++++++++++++------
 kernel/events/internal.h        |  27 +-
 kernel/events/ring_buffer.c     | 730 ++++++++++++++++++++++++++++++++++++--
 kernel/trace/trace.c            |   8 +-
 kernel/user.c                   |   1 +
 13 files changed, 1315 insertions(+), 159 deletions(-)

-- 
2.14.1


* [RFC PATCH 01/17] perf: Allow mmapping only user page
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-09-06 16:28   ` Borislav Petkov
  2017-09-05 13:30 ` [RFC PATCH 02/17] perf: Factor out mlock accounting Alexander Shishkin
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

The 'user page' contains the offsets and sizes of the data and AUX areas of
the ring buffer. If a user wants to mmap a pre-existing buffer, they need to
know these in order to issue mmap()s with correct offsets and sizes.

This patch allows mmapping just the user page when the ring buffer already
exists.
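
A minimal userspace sketch of what this enables (hypothetical helper; assumes
a file descriptor for an event whose ring buffer already exists, e.g. a
detached event opened via its tracefs file):

#include <linux/perf_event.h>
#include <sys/mman.h>
#include <unistd.h>

static int read_buffer_geometry(int event_fd, struct perf_event_mmap_page *out)
{
        long psz = sysconf(_SC_PAGESIZE);
        struct perf_event_mmap_page *pg;

        /* mapping a single page means nr_pages == 0 in perf_mmap(), now accepted */
        pg = mmap(NULL, psz, PROT_READ, MAP_SHARED, event_fd, 0);
        if (pg == MAP_FAILED)
                return -1;

        *out = *pg;     /* data_offset/data_size, aux_offset/aux_size */
        munmap(pg, psz);
        return 0;
}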

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index cb7eaf0f91..9389e27cb0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5366,7 +5366,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 again:
 	mutex_lock(&event->mmap_mutex);
 	if (event->rb) {
-		if (event->rb->nr_pages != nr_pages) {
+		if (nr_pages && event->rb->nr_pages != nr_pages) {
 			ret = -EINVAL;
 			goto unlock;
 		}
-- 
2.14.1


* [RFC PATCH 02/17] perf: Factor out mlock accounting
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
  2017-09-05 13:30 ` [RFC PATCH 01/17] perf: Allow mmapping only user page Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-09-05 13:30 ` [RFC PATCH 03/17] tracefs: De-globalize instances' callbacks Alexander Shishkin
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

This patch moves ring buffer memory accounting down the rb_alloc() path
so that its callers won't have to worry about it. This also serves the
additional purpose of slightly cleaning up perf_mmap().

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c        |  67 +++-----------------
 kernel/events/internal.h    |   5 +-
 kernel/events/ring_buffer.c | 145 ++++++++++++++++++++++++++++++++++++++------
 3 files changed, 136 insertions(+), 81 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9389e27cb0..24099ed9e5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5122,6 +5122,8 @@ void ring_buffer_put(struct ring_buffer *rb)
 	if (!atomic_dec_and_test(&rb->refcount))
 		return;
 
+	ring_buffer_unaccount(rb, false);
+
 	WARN_ON_ONCE(!list_empty(&rb->event_list));
 
 	call_rcu(&rb->rcu_head, rb_free_rcu);
@@ -5156,9 +5158,6 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	struct perf_event *event = vma->vm_file->private_data;
 
 	struct ring_buffer *rb = ring_buffer_get(event);
-	struct user_struct *mmap_user = rb->mmap_user;
-	int mmap_locked = rb->mmap_locked;
-	unsigned long size = perf_data_size(rb);
 
 	if (event->pmu->event_unmapped)
 		event->pmu->event_unmapped(event, vma->vm_mm);
@@ -5178,11 +5177,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 		 */
 		perf_pmu_output_stop(event);
 
-		/* now it's safe to free the pages */
-		atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
-		vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
-
-		/* this has to be the last one */
+		/* now it's safe to free the pages; ought to be the last one */
 		rb_free_aux(rb);
 		WARN_ON_ONCE(atomic_read(&rb->aux_refcount));
 
@@ -5243,19 +5238,6 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	}
 	rcu_read_unlock();
 
-	/*
-	 * It could be there's still a few 0-ref events on the list; they'll
-	 * get cleaned up by free_event() -- they'll also still have their
-	 * ref on the rb and will free it whenever they are done with it.
-	 *
-	 * Aside from that, this buffer is 'fully' detached and unmapped,
-	 * undo the VM accounting.
-	 */
-
-	atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
-	vma->vm_mm->pinned_vm -= mmap_locked;
-	free_uid(mmap_user);
-
 out_put:
 	ring_buffer_put(rb); /* could be last */
 }
@@ -5270,13 +5252,9 @@ static const struct vm_operations_struct perf_mmap_vmops = {
 static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct perf_event *event = file->private_data;
-	unsigned long user_locked, user_lock_limit;
-	struct user_struct *user = current_user();
-	unsigned long locked, lock_limit;
 	struct ring_buffer *rb = NULL;
 	unsigned long vma_size;
 	unsigned long nr_pages;
-	long user_extra = 0, extra = 0;
 	int ret = 0, flags = 0;
 
 	/*
@@ -5347,7 +5325,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		}
 
 		atomic_set(&rb->aux_mmap_count, 1);
-		user_extra = nr_pages;
 
 		goto accounting;
 	}
@@ -5384,49 +5361,24 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		goto unlock;
 	}
 
-	user_extra = nr_pages + 1;
-
 accounting:
-	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
-
-	/*
-	 * Increase the limit linearly with more CPUs:
-	 */
-	user_lock_limit *= num_online_cpus();
-
-	user_locked = atomic_long_read(&user->locked_vm) + user_extra;
-
-	if (user_locked > user_lock_limit)
-		extra = user_locked - user_lock_limit;
-
-	lock_limit = rlimit(RLIMIT_MEMLOCK);
-	lock_limit >>= PAGE_SHIFT;
-	locked = vma->vm_mm->pinned_vm + extra;
-
-	if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
-		!capable(CAP_IPC_LOCK)) {
-		ret = -EPERM;
-		goto unlock;
-	}
-
 	WARN_ON(!rb && event->rb);
 
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
 
 	if (!rb) {
-		rb = rb_alloc(nr_pages,
+		rb = rb_alloc(vma->vm_mm, nr_pages,
 			      event->attr.watermark ? event->attr.wakeup_watermark : 0,
 			      event->cpu, flags);
 
-		if (!rb) {
-			ret = -ENOMEM;
+		if (IS_ERR_OR_NULL(rb)) {
+			ret = PTR_ERR(rb);
+			rb = NULL;
 			goto unlock;
 		}
 
 		atomic_set(&rb->mmap_count, 1);
-		rb->mmap_user = get_current_user();
-		rb->mmap_locked = extra;
 
 		ring_buffer_attach(event, rb);
 
@@ -5435,15 +5387,10 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	} else {
 		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
 				   event->attr.aux_watermark, flags);
-		if (!ret)
-			rb->aux_mmap_locked = extra;
 	}
 
 unlock:
 	if (!ret) {
-		atomic_long_add(user_extra, &user->locked_vm);
-		vma->vm_mm->pinned_vm += extra;
-
 		atomic_inc(&event->mmap_count);
 	} else if (rb) {
 		atomic_dec(&rb->mmap_count);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 843e970473..3e603c45eb 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -36,6 +36,7 @@ struct ring_buffer {
 	atomic_t			mmap_count;
 	unsigned long			mmap_locked;
 	struct user_struct		*mmap_user;
+	struct mm_struct		*mmap_mapping;
 
 	/* AUX area */
 	long				aux_head;
@@ -56,6 +57,7 @@ struct ring_buffer {
 };
 
 extern void rb_free(struct ring_buffer *rb);
+extern void ring_buffer_unaccount(struct ring_buffer *rb, bool aux);
 
 static inline void rb_free_rcu(struct rcu_head *rcu_head)
 {
@@ -74,7 +76,8 @@ static inline void rb_toggle_paused(struct ring_buffer *rb, bool pause)
 }
 
 extern struct ring_buffer *
-rb_alloc(int nr_pages, long watermark, int cpu, int flags);
+rb_alloc(struct mm_struct *mm, int nr_pages, long watermark, int cpu,
+	 int flags);
 extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, long watermark, int flags);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index af71a84e12..d36f169cae 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -505,6 +505,88 @@ void *perf_get_aux(struct perf_output_handle *handle)
 	return handle->rb->aux_priv;
 }
 
+/*
+ * Check if the current user can afford @nr_pages, considering the
+ * perf_event_mlock sysctl and their mlock limit. If the former is exceeded,
+ * pin the remainder on their mm, if the latter is not sufficient either,
+ * error out. Otherwise, keep track of the pages used in the ring_buffer so
+ * that the accounting can be undone when the pages are freed.
+ */
+static int ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
+			       unsigned long nr_pages, bool aux)
+{
+	unsigned long total, limit, pinned;
+
+	if (!mm)
+		mm = rb->mmap_mapping;
+
+	rb->mmap_user = current_user();
+
+	limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
+
+	/*
+	 * Increase the limit linearly with more CPUs:
+	 */
+	limit *= num_online_cpus();
+
+	total = atomic_long_read(&rb->mmap_user->locked_vm) + nr_pages;
+
+	pinned = 0;
+	if (total > limit) {
+		/*
+		 * Everything that's over the sysctl_perf_event_mlock
+		 * limit needs to be accounted to the consumer's mm.
+		 */
+		if (!mm)
+			return -EPERM;
+
+		pinned = total - limit;
+
+		limit = rlimit(RLIMIT_MEMLOCK);
+		limit >>= PAGE_SHIFT;
+		total = mm->pinned_vm + pinned;
+
+		if ((total > limit) && perf_paranoid_tracepoint_raw() &&
+		    !capable(CAP_IPC_LOCK)) {
+			return -EPERM;
+		}
+
+		if (aux)
+			rb->aux_mmap_locked = pinned;
+		else
+			rb->mmap_locked = pinned;
+
+		mm->pinned_vm += pinned;
+	}
+
+	if (!rb->mmap_mapping)
+		rb->mmap_mapping = mm;
+
+	/* account for user page */
+	if (!aux)
+		nr_pages++;
+
+	rb->mmap_user = get_current_user();
+	atomic_long_add(nr_pages, &rb->mmap_user->locked_vm);
+
+	return 0;
+}
+
+/*
+ * Undo the mlock pages accounting done in ring_buffer_account().
+ */
+void ring_buffer_unaccount(struct ring_buffer *rb, bool aux)
+{
+	unsigned long nr_pages = aux ? rb->aux_nr_pages : rb->nr_pages + 1;
+	unsigned long pinned = aux ? rb->aux_mmap_locked : rb->mmap_locked;
+
+	atomic_long_sub(nr_pages, &rb->mmap_user->locked_vm);
+	if (rb->mmap_mapping)
+		rb->mmap_mapping->pinned_vm -= pinned;
+
+	free_uid(rb->mmap_user);
+}
+
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
 static struct page *rb_alloc_aux_page(int node, int order)
@@ -574,11 +656,16 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 {
 	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
 	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
-	int ret = -ENOMEM, max_order = 0;
+	int ret, max_order = 0;
 
 	if (!has_aux(event))
 		return -EOPNOTSUPP;
 
+	ret = ring_buffer_account(rb, NULL, nr_pages, true);
+	if (ret)
+		return ret;
+
+	ret = -ENOMEM;
 	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
 		/*
 		 * We need to start with the max_order that fits in nr_pages,
@@ -593,7 +680,7 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 		if ((event->pmu->capabilities & PERF_PMU_CAP_AUX_SW_DOUBLEBUF) &&
 		    !overwrite) {
 			if (!max_order)
-				return -EINVAL;
+				goto out;
 
 			max_order--;
 		}
@@ -654,18 +741,23 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 		rb->aux_watermark = nr_pages << (PAGE_SHIFT - 1);
 
 out:
-	if (!ret)
+	if (!ret) {
 		rb->aux_pgoff = pgoff;
-	else
+	} else {
+		ring_buffer_unaccount(rb, true);
 		__rb_free_aux(rb);
+	}
 
 	return ret;
 }
 
 void rb_free_aux(struct ring_buffer *rb)
 {
-	if (atomic_dec_and_test(&rb->aux_refcount))
+	if (atomic_dec_and_test(&rb->aux_refcount)) {
+		ring_buffer_unaccount(rb, true);
+
 		__rb_free_aux(rb);
+	}
 }
 
 #ifndef CONFIG_PERF_USE_VMALLOC
@@ -699,22 +791,25 @@ static void *perf_mmap_alloc_page(int cpu)
 	return page_address(page);
 }
 
-struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
+struct ring_buffer *rb_alloc(struct mm_struct *mm, int nr_pages, long watermark,
+			     int cpu, int flags)
 {
+	unsigned long size = offsetof(struct ring_buffer, data_pages[nr_pages]);
 	struct ring_buffer *rb;
-	unsigned long size;
-	int i;
-
-	size = sizeof(struct ring_buffer);
-	size += nr_pages * sizeof(void *);
+	int i, ret = -ENOMEM;
 
 	rb = kzalloc(size, GFP_KERNEL);
 	if (!rb)
 		goto fail;
 
+	ret = ring_buffer_account(rb, mm, nr_pages, false);
+	if (ret)
+		goto fail_free_rb;
+
+	ret = -ENOMEM;
 	rb->user_page = perf_mmap_alloc_page(cpu);
 	if (!rb->user_page)
-		goto fail_user_page;
+		goto fail_unaccount;
 
 	for (i = 0; i < nr_pages; i++) {
 		rb->data_pages[i] = perf_mmap_alloc_page(cpu);
@@ -734,11 +829,14 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
 
 	free_page((unsigned long)rb->user_page);
 
-fail_user_page:
+fail_unaccount:
+	ring_buffer_unaccount(rb, false);
+
+fail_free_rb:
 	kfree(rb);
 
 fail:
-	return NULL;
+	return ERR_PTR(ret);
 }
 
 static void perf_mmap_free_page(unsigned long addr)
@@ -805,19 +903,23 @@ void rb_free(struct ring_buffer *rb)
 	schedule_work(&rb->work);
 }
 
-struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
+struct ring_buffer *rb_alloc(struct mm_struct *mm, int nr_pages, long watermark,
+			     int cpu, int flags)
 {
+	unsigned long size = offsetof(struct ring_buffer, data_pages[1]);
 	struct ring_buffer *rb;
-	unsigned long size;
 	void *all_buf;
-
-	size = sizeof(struct ring_buffer);
-	size += sizeof(void *);
+	int ret = -ENOMEM;
 
 	rb = kzalloc(size, GFP_KERNEL);
 	if (!rb)
 		goto fail;
 
+	ret = ring_buffer_account(rb, mm, nr_pages, false);
+	if (ret)
+		goto fail_free;
+
+	ret = -ENOMEM;
 	INIT_WORK(&rb->work, rb_free_work);
 
 	all_buf = vmalloc_user((nr_pages + 1) * PAGE_SIZE);
@@ -836,10 +938,13 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
 	return rb;
 
 fail_all_buf:
+	ring_buffer_unaccount(rb, false);
+
+fail_free:
 	kfree(rb);
 
 fail:
-	return NULL;
+	return ERR_PTR(ret);
 }
 
 #endif
-- 
2.14.1


* [RFC PATCH 03/17] tracefs: De-globalize instances' callbacks
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
  2017-09-05 13:30 ` [RFC PATCH 01/17] perf: Allow mmapping only user page Alexander Shishkin
  2017-09-05 13:30 ` [RFC PATCH 02/17] perf: Factor out mlock accounting Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2018-01-24 18:54   ` Steven Rostedt
  2017-09-05 13:30 ` [RFC PATCH 04/17] tracefs: Add ->unlink callback to tracefs_dir_ops Alexander Shishkin
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin, Steven Rostedt

Currently, tracefs has exactly one special 'instances' subdirectory, for
which the caller can install their own .mkdir/.rmdir callbacks to handle the
user's mkdir/rmdir inside that directory. Tracefs allows only one set of
these callbacks (the global tracefs_ops).

This patch de-globalizes tracefs_dir_ops so that it's possible to have
multiple such subdirectories.
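
To illustrate what this makes possible, a hypothetical second user of the
instances mechanism (names and callbacks below are made up); with the dir_ops
stored in the dentry's ->d_fsdata, this no longer trips the single-instance
WARN_ON:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/tracefs.h>

static int demo_mkdir(const char *name)
{
        /* set up per-instance state for @name */
        return 0;
}

static int demo_rmdir(const char *name)
{
        /* tear down the instance named @name */
        return 0;
}

static int __init demo_tracefs_init(void)
{
        struct dentry *dir;

        dir = tracefs_create_instance_dir("demo_instances", NULL,
                                          demo_mkdir, demo_rmdir);
        return dir ? 0 : -ENOMEM;
}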

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 fs/tracefs/inode.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index bea8ad876b..b14f03a655 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -50,10 +50,10 @@ static const struct file_operations tracefs_file_operations = {
 	.llseek =	noop_llseek,
 };
 
-static struct tracefs_dir_ops {
+struct tracefs_dir_ops {
 	int (*mkdir)(const char *name);
 	int (*rmdir)(const char *name);
-} tracefs_ops;
+};
 
 static char *get_dname(struct dentry *dentry)
 {
@@ -72,6 +72,7 @@ static char *get_dname(struct dentry *dentry)
 
 static int tracefs_syscall_mkdir(struct inode *inode, struct dentry *dentry, umode_t mode)
 {
+	struct tracefs_dir_ops *tracefs_ops = dentry->d_parent->d_fsdata;
 	char *name;
 	int ret;
 
@@ -85,7 +86,7 @@ static int tracefs_syscall_mkdir(struct inode *inode, struct dentry *dentry, umo
 	 * mkdir routine to handle races.
 	 */
 	inode_unlock(inode);
-	ret = tracefs_ops.mkdir(name);
+	ret = tracefs_ops->mkdir(name);
 	inode_lock(inode);
 
 	kfree(name);
@@ -95,6 +96,7 @@ static int tracefs_syscall_mkdir(struct inode *inode, struct dentry *dentry, umo
 
 static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry)
 {
+	struct tracefs_dir_ops *tracefs_ops = dentry->d_fsdata;
 	char *name;
 	int ret;
 
@@ -112,7 +114,7 @@ static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry)
 	inode_unlock(inode);
 	inode_unlock(dentry->d_inode);
 
-	ret = tracefs_ops.rmdir(name);
+	ret = tracefs_ops->rmdir(name);
 
 	inode_lock_nested(inode, I_MUTEX_PARENT);
 	inode_lock(dentry->d_inode);
@@ -342,6 +344,9 @@ static struct dentry *start_creating(const char *name, struct dentry *parent)
 	if (IS_ERR(dentry)) {
 		inode_unlock(parent->d_inode);
 		simple_release_fs(&tracefs_mount, &tracefs_mount_count);
+	} else {
+		/* propagate dir ops */
+		dentry->d_fsdata = parent->d_fsdata;
 	}
 
 	return dentry;
@@ -482,18 +487,25 @@ struct dentry *tracefs_create_instance_dir(const char *name, struct dentry *pare
 					  int (*mkdir)(const char *name),
 					  int (*rmdir)(const char *name))
 {
+	struct tracefs_dir_ops *tracefs_ops = parent ? parent->d_fsdata : NULL;
 	struct dentry *dentry;
 
-	/* Only allow one instance of the instances directory. */
-	if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir))
+	if (WARN_ON(tracefs_ops))
+		return NULL;
+
+	tracefs_ops = kzalloc(sizeof(*tracefs_ops), GFP_KERNEL);
+	if (!tracefs_ops)
 		return NULL;
 
 	dentry = __create_dir(name, parent, &tracefs_dir_inode_operations);
-	if (!dentry)
+	if (!dentry) {
+		kfree(tracefs_ops);
 		return NULL;
+	}
 
-	tracefs_ops.mkdir = mkdir;
-	tracefs_ops.rmdir = rmdir;
+	tracefs_ops->mkdir = mkdir;
+	tracefs_ops->rmdir = rmdir;
+	dentry->d_fsdata = tracefs_ops;
 
 	return dentry;
 }
@@ -513,8 +525,11 @@ static int __tracefs_remove(struct dentry *dentry, struct dentry *parent)
 				simple_unlink(parent->d_inode, dentry);
 				break;
 			}
-			if (!ret)
+			if (!ret) {
 				d_delete(dentry);
+				if (dentry->d_fsdata != parent->d_fsdata)
+					kfree(dentry->d_fsdata);
+			}
 			dput(dentry);
 		}
 	}
-- 
2.14.1


* [RFC PATCH 04/17] tracefs: Add ->unlink callback to tracefs_dir_ops
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (2 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 03/17] tracefs: De-globalize instances' callbacks Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-09-05 13:30 ` [RFC PATCH 05/17] perf: Introduce detached events Alexander Shishkin
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin, Steven Rostedt

In addition to mkdir and rmdir, also allow the unlink operation within the
'instances' directory if such a callback is defined.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 fs/tracefs/inode.c      | 36 +++++++++++++++++++++++++++++++++++-
 include/linux/tracefs.h |  3 ++-
 kernel/trace/trace.c    |  8 +++++++-
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index b14f03a655..fba5a0ce07 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -53,6 +53,7 @@ static const struct file_operations tracefs_file_operations = {
 struct tracefs_dir_ops {
 	int (*mkdir)(const char *name);
 	int (*rmdir)(const char *name);
+	int (*unlink)(const char *name);
 };
 
 static char *get_dname(struct dentry *dentry)
@@ -124,10 +125,41 @@ static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry)
 	return ret;
 }
 
+static int tracefs_syscall_unlink(struct inode *inode, struct dentry *dentry)
+{
+	struct tracefs_dir_ops *tracefs_ops = dentry->d_fsdata;
+	char *name;
+	int ret;
+
+	name = get_dname(dentry);
+	if (!name)
+		return -ENOMEM;
+
+	/*
+	 * The unlink call can call the generic functions that create
+	 * the files within the tracefs system. It is up to the individual
+	 * unlink routine to handle races.
+	 * This time we need to unlock not only the parent (inode) but
+	 * also the file that is being deleted.
+	 */
+	inode_unlock(inode);
+	inode_unlock(dentry->d_inode);
+
+	ret = tracefs_ops->unlink(name);
+
+	inode_lock_nested(inode, I_MUTEX_PARENT);
+	inode_lock(dentry->d_inode);
+
+	kfree(name);
+
+	return ret;
+}
+
 static const struct inode_operations tracefs_dir_inode_operations = {
 	.lookup		= simple_lookup,
 	.mkdir		= tracefs_syscall_mkdir,
 	.rmdir		= tracefs_syscall_rmdir,
+	.unlink		= tracefs_syscall_unlink,
 };
 
 static struct inode *tracefs_get_inode(struct super_block *sb)
@@ -485,7 +517,8 @@ struct dentry *tracefs_create_dir(const char *name, struct dentry *parent)
  */
 struct dentry *tracefs_create_instance_dir(const char *name, struct dentry *parent,
 					  int (*mkdir)(const char *name),
-					  int (*rmdir)(const char *name))
+					  int (*rmdir)(const char *name),
+					  int (*unlink)(const char *name))
 {
 	struct tracefs_dir_ops *tracefs_ops = parent ? parent->d_fsdata : NULL;
 	struct dentry *dentry;
@@ -505,6 +538,7 @@ struct dentry *tracefs_create_instance_dir(const char *name, struct dentry *pare
 
 	tracefs_ops->mkdir = mkdir;
 	tracefs_ops->rmdir = rmdir;
+	tracefs_ops->unlink = unlink;
 	dentry->d_fsdata = tracefs_ops;
 
 	return dentry;
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 5b727a17be..e5bd1f01b6 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -36,7 +36,8 @@ void tracefs_remove_recursive(struct dentry *dentry);
 
 struct dentry *tracefs_create_instance_dir(const char *name, struct dentry *parent,
 					   int (*mkdir)(const char *name),
-					   int (*rmdir)(const char *name));
+					   int (*rmdir)(const char *name),
+					   int (*unlink)(const char *name));
 
 bool tracefs_initialized(void);
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 44004d8aa3..b9abd2029e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7792,11 +7792,17 @@ static int instance_rmdir(const char *name)
 	return ret;
 }
 
+static int instance_unlink(const char *name)
+{
+	return -EACCES;
+}
+
 static __init void create_trace_instances(struct dentry *d_tracer)
 {
 	trace_instance_dir = tracefs_create_instance_dir("instances", d_tracer,
 							 instance_mkdir,
-							 instance_rmdir);
+							 instance_rmdir,
+							 instance_unlink);
 	if (WARN_ON(!trace_instance_dir))
 		return;
 }
-- 
2.14.1


* [RFC PATCH 05/17] perf: Introduce detached events
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (3 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 04/17] tracefs: Add ->unlink callback to tracefs_dir_ops Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-10-03 14:34   ` Peter Zijlstra
  2017-09-05 13:30 ` [RFC PATCH 06/17] perf: Add buffers to the " Alexander Shishkin
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

There are use cases where it is desirable to have perf events without the
userspace tool running in the background to keep them alive, and instead to
only collect the data when it is needed, for example when an MCE event is
triggered.

This patch adds a new flag to the perf_event_open() syscall that allows
creating such events. Once created, the file descriptor can be closed
and the event continues to exist on its own. To allow access to this
event, a file is created in tracefs, which the user can open.

Finally, when it is no longer needed, it can be destroyed by unlinking
the file.
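
To illustrate this lifecycle from userspace (the path below is a placeholder:
the actual filename is a hash-based name such as task:<hash>.event under the
perf directory in tracefs):

#include <fcntl.h>
#include <unistd.h>

static void reattach_and_destroy(const char *tracefs_path)
{
        /* re-open a previously created detached event via its file */
        int fd = open(tracefs_path, O_RDONLY);

        if (fd >= 0) {
                /* read()/mmap() the event as with a regular perf fd */
                close(fd);
        }

        /* once it is no longer needed, unlinking the file destroys the event */
        unlink(tracefs_path);
}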

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      |   4 ++
 include/uapi/linux/perf_event.h |   1 +
 kernel/events/core.c            | 138 ++++++++++++++++++++++++++++++++++++++--
 kernel/events/internal.h        |   6 ++
 4 files changed, 142 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 82b2e3fef9..a07982f48d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -537,6 +537,7 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
 #define PERF_EV_CAP_SOFTWARE		BIT(0)
 #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
 
+#define PERF_TRACEFS_HASH_BITS		32
 #define SWEVENT_HLIST_BITS		8
 #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
 
@@ -550,6 +551,7 @@ struct swevent_hlist {
 #define PERF_ATTACH_TASK	0x04
 #define PERF_ATTACH_TASK_DATA	0x08
 #define PERF_ATTACH_ITRACE	0x10
+#define PERF_ATTACH_DETACHED	0x20
 
 struct perf_cgroup;
 struct ring_buffer;
@@ -672,6 +674,8 @@ struct perf_event {
 	struct list_head		owner_entry;
 	struct task_struct		*owner;
 
+	struct dentry			*dent;
+
 	/* mmap bits */
 	struct mutex			mmap_mutex;
 	atomic_t			mmap_count;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 140ae638cf..89355584fa 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -946,6 +946,7 @@ enum perf_callchain_context {
 #define PERF_FLAG_FD_OUTPUT		(1UL << 1)
 #define PERF_FLAG_PID_CGROUP		(1UL << 2) /* pid=cgroup id, per-cpu mode only */
 #define PERF_FLAG_FD_CLOEXEC		(1UL << 3) /* O_CLOEXEC */
+#define PERF_FLAG_DETACHED		(1UL << 4) /* event w/o owner */
 
 #if defined(__LITTLE_ENDIAN_BITFIELD)
 union perf_mem_data_src {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 24099ed9e5..320070410d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -50,11 +50,14 @@
 #include <linux/sched/mm.h>
 #include <linux/proc_ns.h>
 #include <linux/mount.h>
+#include <linux/tracefs.h>
 
 #include "internal.h"
 
 #include <asm/irq_regs.h>
 
+static struct dentry *perf_tracefs_dir;
+
 typedef int (*remote_function_f)(void *);
 
 struct remote_function_call {
@@ -346,7 +349,8 @@ static void event_function_local(struct perf_event *event, event_f func, void *d
 #define PERF_FLAG_ALL (PERF_FLAG_FD_NO_GROUP |\
 		       PERF_FLAG_FD_OUTPUT  |\
 		       PERF_FLAG_PID_CGROUP |\
-		       PERF_FLAG_FD_CLOEXEC)
+		       PERF_FLAG_FD_CLOEXEC |\
+		       PERF_FLAG_DETACHED)
 
 /*
  * branch priv levels that need permission checks
@@ -4177,6 +4181,12 @@ static void _free_event(struct perf_event *event)
 
 	unaccount_event(event);
 
+	if (event->dent) {
+		tracefs_remove(event->dent);
+
+		event->attach_state &= ~PERF_ATTACH_DETACHED;
+	}
+
 	if (event->rb) {
 		/*
 		 * Can happen when we close an event with re-directed output.
@@ -5427,8 +5437,27 @@ static int perf_fasync(int fd, struct file *filp, int on)
 	return 0;
 }
 
+static int perf_open(struct inode *inode, struct file *file)
+{
+	struct perf_event *event = inode->i_private;
+	int ret;
+
+	if (WARN_ON_ONCE(!event))
+		return -EINVAL;
+
+	if (!atomic_long_inc_not_zero(&event->refcount))
+		return -ENOENT;
+
+	ret = simple_open(inode, file);
+	if (ret)
+		put_event(event);
+
+	return ret;
+}
+
 static const struct file_operations perf_fops = {
 	.llseek			= no_llseek,
+	.open			= perf_open,
 	.release		= perf_release,
 	.read			= perf_read,
 	.poll			= perf_poll,
@@ -9387,6 +9416,27 @@ static void account_event(struct perf_event *event)
 	account_pmu_sb_event(event);
 }
 
+static int perf_event_detach(struct perf_event *event, struct task_struct *task,
+			     struct mm_struct *mm)
+{
+	char *filename;
+
+	filename = kasprintf(GFP_KERNEL, "%s:%x.event",
+			     task ? "task" : "cpu",
+			     hash_64((u64)event, PERF_TRACEFS_HASH_BITS));
+	if (!filename)
+		return -ENOMEM;
+
+	event->dent = tracefs_create_file(filename, 0600,
+					  perf_tracefs_dir,
+					  event, &perf_fops);
+	kfree(filename);
+
+	if (!event->dent)
+		return -ENOMEM;
+
+	return 0;
+}
 /*
  * Allocate and initialize a event structure
  */
@@ -9716,6 +9766,10 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 	struct ring_buffer *rb = NULL;
 	int ret = -EINVAL;
 
+	if ((event->attach_state | output_event->attach_state) &
+	    PERF_ATTACH_DETACHED)
+		goto out;
+
 	if (!output_event)
 		goto set;
 
@@ -9876,7 +9930,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	struct task_struct *task = NULL;
 	struct pmu *pmu;
 	int event_fd;
-	int move_group = 0;
+	int move_group = 0, detached = 0;
 	int err;
 	int f_flags = O_RDWR;
 	int cgroup_fd = -1;
@@ -9956,6 +10010,16 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_task;
 	}
 
+	if (flags & PERF_FLAG_DETACHED) {
+		err = -EINVAL;
+
+		/* output redirection and grouping are not allowed */
+		if (output_event || (group_fd != -1))
+			goto err_task;
+
+		detached = 1;
+	}
+
 	if (task) {
 		err = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
 		if (err)
@@ -10104,6 +10168,16 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_context;
 	}
 
+	if (detached) {
+		err = perf_event_detach(event, task, NULL);
+		if (err)
+			goto err_context;
+
+		atomic_long_inc(&event->refcount);
+
+		event_file->private_data = event;
+	}
+
 	if (move_group) {
 		gctx = __perf_event_ctx_lock_double(group_leader, ctx);
 
@@ -10236,7 +10310,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	perf_event__header_size(event);
 	perf_event__id_header_size(event);
 
-	event->owner = current;
+	event->owner = detached ? TASK_TOMBSTONE : current;
 
 	perf_install_in_context(ctx, event, event->cpu);
 	perf_unpin_context(ctx);
@@ -10250,9 +10324,11 @@ SYSCALL_DEFINE5(perf_event_open,
 		put_task_struct(task);
 	}
 
-	mutex_lock(&current->perf_event_mutex);
-	list_add_tail(&event->owner_entry, &current->perf_event_list);
-	mutex_unlock(&current->perf_event_mutex);
+	if (!detached) {
+		mutex_lock(&current->perf_event_mutex);
+		list_add_tail(&event->owner_entry, &current->perf_event_list);
+		mutex_unlock(&current->perf_event_mutex);
+	}
 
 	/*
 	 * Drop the reference on the group_event after placing the
@@ -10492,7 +10568,16 @@ perf_event_exit_event(struct perf_event *child_event,
 	 * Parent events are governed by their filedesc, retain them.
 	 */
 	if (!parent_event) {
-		perf_event_wakeup(child_event);
+		/*
+		 * unless they are DETACHED, in which case we still have
+		 * to dispose of them; they have an extra reference with
+		 * the DETACHED state and a tracefs file
+		 */
+		if (is_detached_event(child_event))
+			put_event(child_event); /* can be last */
+		else
+			perf_event_wakeup(child_event);
+
 		return;
 	}
 	/*
@@ -11205,6 +11290,45 @@ static int __init perf_event_sysfs_init(void)
 }
 device_initcall(perf_event_sysfs_init);
 
+static int perf_instance_nop(const char *name)
+{
+	return -EACCES;
+}
+
+static int perf_instance_unlink(const char *name)
+{
+	struct perf_event *event;
+	struct dentry *dent;
+
+	dent = lookup_one_len_unlocked(name, perf_tracefs_dir, strlen(name));
+	if (!dent)
+		return -ENOENT;
+
+	event = dent->d_inode->i_private;
+	if (!event)
+		return -EINVAL;
+
+	if (!(event->attach_state & PERF_ATTACH_CONTEXT))
+		return -EBUSY;
+
+	perf_event_release_kernel(event);
+
+	return 0;
+}
+
+static int __init perf_event_tracefs_init(void)
+{
+	perf_tracefs_dir = tracefs_create_instance_dir("perf", NULL,
+						       perf_instance_nop,
+						       perf_instance_nop,
+						       perf_instance_unlink);
+	if (!perf_tracefs_dir)
+		return -ENOMEM;
+
+	return 0;
+}
+device_initcall(perf_event_tracefs_init);
+
 #ifdef CONFIG_CGROUP_PERF
 static struct cgroup_subsys_state *
 perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 3e603c45eb..59136a0e98 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -126,6 +126,12 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb)
 	return rb->aux_nr_pages << PAGE_SHIFT;
 }
 
+static inline bool is_detached_event(struct perf_event *event)
+{
+	lockdep_assert_held(&event->ctx->mutex);
+	return !!(event->attach_state & PERF_ATTACH_DETACHED);
+}
+
 #define __DEFINE_OUTPUT_COPY_BODY(advance_buf, memcpy_func, ...)	\
 {									\
 	unsigned long size, written;					\
-- 
2.14.1


* [RFC PATCH 06/17] perf: Add buffers to the detached events
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (4 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 05/17] perf: Introduce detached events Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-10-03 14:36   ` Peter Zijlstra
  2017-09-05 13:30 ` [RFC PATCH 07/17] perf: Add pmu_info to user page Alexander Shishkin
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

Detached events make much more sense with ring buffers, which the user can
mmap and read a snapshot of. Unlike normal perf events, these ring buffers
are allocated by the perf syscall; the sizes of the data and AUX areas are
specified in the event attribute.

These ring buffers can be mmapped read-only.
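
A rough userspace sketch of the read side (assumptions: the fd comes from
opening the event's tracefs file; the offsets and sizes are taken from the
user page, which can be mapped on its own as of patch 01):

#include <linux/perf_event.h>
#include <sys/mman.h>
#include <unistd.h>

static int map_detached_buffers(int fd, void **data, void **aux)
{
        long psz = sysconf(_SC_PAGESIZE);
        struct perf_event_mmap_page *pg;

        pg = mmap(NULL, psz, PROT_READ, MAP_SHARED, fd, 0);
        if (pg == MAP_FAILED)
                return -1;

        /* user page + data pages, mapped read-only from offset 0 */
        *data = mmap(NULL, psz + pg->data_size, PROT_READ, MAP_SHARED, fd, 0);

        /* AUX area at the offset/size advertised in the user page */
        *aux = mmap(NULL, pg->aux_size, PROT_READ, MAP_SHARED, fd,
                    pg->aux_offset);

        munmap(pg, psz);
        return (*data == MAP_FAILED || *aux == MAP_FAILED) ? -1 : 0;
}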

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h |  3 +++
 kernel/events/core.c            | 19 ++++++++++++++++
 kernel/events/internal.h        |  2 ++
 kernel/events/ring_buffer.c     | 50 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 74 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 89355584fa..3d64d9ea80 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -297,6 +297,7 @@ enum perf_event_read_format {
 					/* add: sample_stack_user */
 #define PERF_ATTR_SIZE_VER4	104	/* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5	112	/* add: aux_watermark */
+#define PERF_ATTR_SIZE_VER6	120	/* add: detached_* */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -415,6 +416,8 @@ struct perf_event_attr {
 	__u32	aux_watermark;
 	__u16	sample_max_stack;
 	__u16	__reserved_2;	/* align to __u64 */
+	__u32	detached_nr_pages;
+	__u32	detached_aux_nr_pages;
 };
 
 #define perf_flags(attr)	(*(&(attr)->read_format + 1))
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 320070410d..fef1f97974 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4185,6 +4185,9 @@ static void _free_event(struct perf_event *event)
 		tracefs_remove(event->dent);
 
 		event->attach_state &= ~PERF_ATTACH_DETACHED;
+
+		ring_buffer_unaccount(event->rb, false);
+		rb_free_detached(event->rb, event);
 	}
 
 	if (event->rb) {
@@ -5012,6 +5015,10 @@ static int perf_mmap_fault(struct vm_fault *vmf)
 	int ret = VM_FAULT_SIGBUS;
 
 	if (vmf->flags & FAULT_FLAG_MKWRITE) {
+		/* detached events R/O only */
+		if (event->dent)
+			return ret;
+
 		if (vmf->pgoff == 0)
 			ret = 0;
 		return ret;
@@ -9420,6 +9427,7 @@ static int perf_event_detach(struct perf_event *event, struct task_struct *task,
 			     struct mm_struct *mm)
 {
 	char *filename;
+	int err;
 
 	filename = kasprintf(GFP_KERNEL, "%s:%x.event",
 			     task ? "task" : "cpu",
@@ -9435,6 +9443,13 @@ static int perf_event_detach(struct perf_event *event, struct task_struct *task,
 	if (!event->dent)
 		return -ENOMEM;
 
+	err = rb_alloc_detached(event);
+	if (err) {
+		tracefs_remove(event->dent);
+		event->dent = NULL;
+		return err;
+	}
+
 	return 0;
 }
 /*
@@ -10017,6 +10032,9 @@ SYSCALL_DEFINE5(perf_event_open,
 		if (output_event || (group_fd != -1))
 			goto err_task;
 
+		if (!attr.detached_nr_pages)
+			goto err_task;
+
 		detached = 1;
 	}
 
@@ -10174,6 +10192,7 @@ SYSCALL_DEFINE5(perf_event_open,
 			goto err_context;
 
 		atomic_long_inc(&event->refcount);
+		atomic_inc(&event->mmap_count);
 
 		event_file->private_data = event;
 	}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 59136a0e98..8e267d8faa 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -82,6 +82,8 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb);
+extern int rb_alloc_detached(struct perf_event *event);
+extern void rb_free_detached(struct ring_buffer *rb, struct perf_event *event);
 extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
 extern void ring_buffer_put(struct ring_buffer *rb);
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index d36f169cae..b4d7841025 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -760,6 +760,56 @@ void rb_free_aux(struct ring_buffer *rb)
 	}
 }
 
+/*
+ * Allocate a ring_buffer for a detached event and attach it to this event.
+ * There's one ring_buffer per detached event and vice versa, so
+ * ring_buffer_attach() does not apply.
+ */
+int rb_alloc_detached(struct perf_event *event)
+{
+	int aux_nr_pages = event->attr.detached_aux_nr_pages;
+	int nr_pages = event->attr.detached_nr_pages;
+	struct ring_buffer *rb;
+	int ret, pgoff = nr_pages + 1;
+
+	/*
+	 * Use overwrite mode (!RING_BUFFER_WRITABLE) for both data and aux
+	 * areas as we don't want wakeups or interrupts.
+	 */
+	rb = rb_alloc(NULL, nr_pages, 0, event->cpu, 0);
+	if (IS_ERR(rb))
+		return PTR_ERR(rb);
+
+	ret = rb_alloc_aux(rb, event, pgoff, aux_nr_pages, 0, 0);
+	if (ret) {
+		rb_free(rb);
+		return ret;
+	}
+
+	atomic_set(&rb->mmap_count, 1);
+	if (aux_nr_pages)
+		atomic_set(&rb->aux_mmap_count, 1);
+
+	/*
+	 * Detached events don't need ring buffer wakeups, therefore we don't
+	 * use ring_buffer_attach() here and event->rb_entry stays empty.
+	 */
+	rcu_assign_pointer(event->rb, rb);
+
+	return 0;
+}
+
+void rb_free_detached(struct ring_buffer *rb, struct perf_event *event)
+{
+	/* Must be the last one */
+	WARN_ON_ONCE(atomic_read(&rb->refcount) != 1);
+
+	atomic_set(&rb->aux_mmap_count, 0);
+	rcu_assign_pointer(event->rb, NULL);
+	rb_free_aux(rb);
+	rb_free(rb);
+}
+
 #ifndef CONFIG_PERF_USE_VMALLOC
 
 /*
-- 
2.14.1


* [RFC PATCH 07/17] perf: Add pmu_info to user page
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (5 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 06/17] perf: Add buffers to the " Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-10-03 14:40   ` Peter Zijlstra
  2017-09-05 13:30 ` [RFC PATCH 08/17] perf: Allow inheritance for detached events Alexander Shishkin
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

Allow PMUs to supply additional static information that may be required by
their decoders. Most of what the Intel PT driver exports as capability
attributes (timing packet frequencies, frequency ratios, etc.) is needed by
its decoder to correctly decode the binary stream. However, when decoding an
Intel PT stream from a core dump, we can't rely on the sysfs attributes, so
we need to pack this information into the perf buffer, so that the resulting
core dump is self-contained.

In order to do this, we append a PMU-specific structure to the user page.
Such structures include a size field, for versioning.
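
For illustration, a sketch of how a (hypothetical) decoder would locate this
data in the mapped user page; per this patch, a copy of the event's attr is
placed at pmu_offset, immediately followed by the PMU-specific descriptor:

#include <linux/perf_event.h>
#include <stddef.h>

/* returns the PMU-specific descriptor that follows the attr copy, if any */
static const void *find_pmu_desc(const struct perf_event_mmap_page *pg,
                                 size_t *desc_size)
{
        if (pg->pmu_size <= sizeof(struct perf_event_attr))
                return NULL;

        *desc_size = pg->pmu_size - sizeof(struct perf_event_attr);
        return (const char *)pg + pg->pmu_offset +
               sizeof(struct perf_event_attr);
}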

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h      | 17 ++++++++++
 include/uapi/linux/perf_event.h | 10 ++++++
 kernel/events/core.c            | 27 +--------------
 kernel/events/internal.h        |  2 +-
 kernel/events/ring_buffer.c     | 75 ++++++++++++++++++++++++++++++++++-------
 5 files changed, 92 insertions(+), 39 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a07982f48d..b7939e8811 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -235,6 +235,8 @@ struct hw_perf_event {
 
 struct perf_event;
 
+struct pmu_info;
+
 /*
  * Common implementation detail of pmu::{start,commit,cancel}_txn
  */
@@ -285,6 +287,9 @@ struct pmu {
 	/* number of address filters this PMU can do */
 	unsigned int			nr_addr_filters;
 
+	/* PMU-specific data to append to the user page */
+	const struct pmu_info		*pmu_info;
+
 	/*
 	 * Fully disable/enable this PMU, can be used to protect from the PMI
 	 * as well as for lazy/batch writing of the MSRs.
@@ -508,6 +513,18 @@ struct perf_addr_filters_head {
 	unsigned int		nr_file_filters;
 };
 
+struct pmu_info {
+	/*
+	 * Size of this structure, for versioning.
+	 */
+	u32	note_size;
+
+	/*
+	 * Size of the container structure, not including this one
+	 */
+	u32	pmu_descsz;
+};
+
 /**
  * enum perf_event_active_state - the states of a event
  */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 3d64d9ea80..4cdd4fab9d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -599,6 +599,16 @@ struct perf_event_mmap_page {
 	__u64	aux_tail;
 	__u64	aux_offset;
 	__u64	aux_size;
+
+	/*
+	 * PMU data: static info that (AUX) decoder wants to know in order to
+	 * decode correctly:
+	 *
+	 *   pmu_offset >= sizeof(struct perf_event_mmap_page)
+	 *   pmu_offset + pmu_size <= PAGE_SIZE
+	 */
+	__u64	pmu_offset;
+	__u64	pmu_size;
 };
 
 #define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fef1f97974..d62ab2d1de 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4926,28 +4926,6 @@ static void calc_timer_values(struct perf_event *event,
 	*running = ctx_time - event->tstamp_running;
 }
 
-static void perf_event_init_userpage(struct perf_event *event)
-{
-	struct perf_event_mmap_page *userpg;
-	struct ring_buffer *rb;
-
-	rcu_read_lock();
-	rb = rcu_dereference(event->rb);
-	if (!rb)
-		goto unlock;
-
-	userpg = rb->user_page;
-
-	/* Allow new userspace to detect that bit 0 is deprecated */
-	userpg->cap_bit0_is_deprecated = 1;
-	userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
-	userpg->data_offset = PAGE_SIZE;
-	userpg->data_size = perf_data_size(rb);
-
-unlock:
-	rcu_read_unlock();
-}
-
 void __weak arch_perf_update_userpage(
 	struct perf_event *event, struct perf_event_mmap_page *userpg, u64 now)
 {
@@ -5385,9 +5363,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		flags |= RING_BUFFER_WRITABLE;
 
 	if (!rb) {
-		rb = rb_alloc(vma->vm_mm, nr_pages,
-			      event->attr.watermark ? event->attr.wakeup_watermark : 0,
-			      event->cpu, flags);
+		rb = rb_alloc(event, vma->vm_mm, nr_pages, flags);
 
 		if (IS_ERR_OR_NULL(rb)) {
 			ret = PTR_ERR(rb);
@@ -5399,7 +5375,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 
 		ring_buffer_attach(event, rb);
 
-		perf_event_init_userpage(event);
 		perf_event_update_userpage(event);
 	} else {
 		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 8e267d8faa..4b345ee0d4 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -76,7 +76,7 @@ static inline void rb_toggle_paused(struct ring_buffer *rb, bool pause)
 }
 
 extern struct ring_buffer *
-rb_alloc(struct mm_struct *mm, int nr_pages, long watermark, int cpu,
+rb_alloc(struct perf_event *event, struct mm_struct *mm, int nr_pages,
 	 int flags);
 extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index b4d7841025..d7051868d0 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -268,10 +268,59 @@ void perf_output_end(struct perf_output_handle *handle)
 	rcu_read_unlock();
 }
 
+static void perf_event_init_pmu_info(struct perf_event *event,
+				     struct perf_event_mmap_page *userpg)
+{
+	const struct pmu_info *pi = NULL;
+	void *ptr = (void *)userpg + sizeof(*userpg);
+	size_t size = sizeof(event->attr);
+
+	if (event->pmu && event->pmu->pmu_info) {
+		pi = event->pmu->pmu_info;
+		size += pi->pmu_descsz;
+	}
+
+	if (size + sizeof(*userpg) > PAGE_SIZE)
+		return;
+
+	userpg->pmu_offset = offset_in_page(ptr);
+	userpg->pmu_size = size;
+
+	memcpy(ptr, &event->attr, sizeof(event->attr));
+	if (pi) {
+		ptr += sizeof(event->attr);
+		memcpy(ptr, (void *)pi + pi->note_size, pi->pmu_descsz);
+	}
+}
+
+static void perf_event_init_userpage(struct perf_event *event,
+				     struct ring_buffer *rb)
+{
+	struct perf_event_mmap_page *userpg;
+
+	userpg = rb->user_page;
+
+	/* Allow new userspace to detect that bit 0 is deprecated */
+	userpg->cap_bit0_is_deprecated = 1;
+	userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
+	userpg->data_offset = PAGE_SIZE;
+	userpg->data_size = perf_data_size(rb);
+	if (event->attach_state & PERF_ATTACH_DETACHED) {
+		userpg->aux_offset =
+			(event->attr.detached_nr_pages + 1) << PAGE_SHIFT;
+		userpg->aux_size =
+			event->attr.detached_aux_nr_pages << PAGE_SHIFT;
+	}
+
+	perf_event_init_pmu_info(event, userpg);
+}
+
 static void
-ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
+ring_buffer_init(struct ring_buffer *rb, struct perf_event *event, int flags)
 {
 	long max_size = perf_data_size(rb);
+	long watermark =
+		event->attr.watermark ? event->attr.wakeup_watermark : 0;
 
 	if (watermark)
 		rb->watermark = min(max_size, watermark);
@@ -295,6 +344,8 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	 */
 	if (!rb->nr_pages)
 		rb->paused = 1;
+
+	perf_event_init_userpage(event, rb);
 }
 
 void perf_aux_output_flag(struct perf_output_handle *handle, u64 flags)
@@ -776,7 +827,7 @@ int rb_alloc_detached(struct perf_event *event)
 	 * Use overwrite mode (!RING_BUFFER_WRITABLE) for both data and aux
 	 * areas as we don't want wakeups or interrupts.
 	 */
-	rb = rb_alloc(NULL, nr_pages, 0, event->cpu, 0);
+	rb = rb_alloc(event, NULL, nr_pages, 0);
 	if (IS_ERR(rb))
 		return PTR_ERR(rb);
 
@@ -841,8 +892,8 @@ static void *perf_mmap_alloc_page(int cpu)
 	return page_address(page);
 }
 
-struct ring_buffer *rb_alloc(struct mm_struct *mm, int nr_pages, long watermark,
-			     int cpu, int flags)
+struct ring_buffer *rb_alloc(struct perf_event *event, struct mm_struct *mm,
+			     int nr_pages, int flags)
 {
 	unsigned long size = offsetof(struct ring_buffer, data_pages[nr_pages]);
 	struct ring_buffer *rb;
@@ -850,26 +901,27 @@ struct ring_buffer *rb_alloc(struct mm_struct *mm, int nr_pages, long watermark,
 
 	rb = kzalloc(size, GFP_KERNEL);
 	if (!rb)
-		goto fail;
+		return ERR_PTR(-ENOMEM);
 
 	ret = ring_buffer_account(rb, mm, nr_pages, false);
 	if (ret)
 		goto fail_free_rb;
 
 	ret = -ENOMEM;
-	rb->user_page = perf_mmap_alloc_page(cpu);
+	rb->user_page = perf_mmap_alloc_page(event->cpu);
 	if (!rb->user_page)
 		goto fail_unaccount;
 
 	for (i = 0; i < nr_pages; i++) {
-		rb->data_pages[i] = perf_mmap_alloc_page(cpu);
+		rb->data_pages[i] = perf_mmap_alloc_page(event->cpu);
+
 		if (!rb->data_pages[i])
 			goto fail_data_pages;
 	}
 
 	rb->nr_pages = nr_pages;
 
-	ring_buffer_init(rb, watermark, flags);
+	ring_buffer_init(rb, event, flags);
 
 	return rb;
 
@@ -885,7 +937,6 @@ struct ring_buffer *rb_alloc(struct mm_struct *mm, int nr_pages, long watermark,
 fail_free_rb:
 	kfree(rb);
 
-fail:
 	return ERR_PTR(ret);
 }
 
@@ -953,8 +1004,8 @@ void rb_free(struct ring_buffer *rb)
 	schedule_work(&rb->work);
 }
 
-struct ring_buffer *rb_alloc(struct mm_struct *mm, int nr_pages, long watermark,
-			     int cpu, int flags)
+struct ring_buffer *rb_alloc(struct perf_event *event, struct mm_struct *mm,
+			     int nr_pages, int flags)
 {
 	unsigned long size = offsetof(struct ring_buffer, data_pages[1]);
 	struct ring_buffer *rb;
@@ -983,7 +1034,7 @@ struct ring_buffer *rb_alloc(struct mm_struct *mm, int nr_pages, long watermark,
 		rb->page_order = ilog2(nr_pages);
 	}
 
-	ring_buffer_init(rb, watermark, flags);
+	ring_buffer_init(rb, event, flags);
 
 	return rb;
 
-- 
2.14.1


* [RFC PATCH 08/17] perf: Allow inheritance for detached events
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (6 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 07/17] perf: Add pmu_info to user page Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-10-03 14:42   ` Peter Zijlstra
  2017-09-05 13:30 ` [RFC PATCH 09/17] perf: Use shmemfs pages for userspace-only per-thread " Alexander Shishkin
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

This enables inheritance for detached events. Unlike traditional events,
these do not have parents: inheritance produces a new independent event with
the same attributes. If the 'parent' event has a ring buffer, so will the new
event. Because of the mlock accounting, this buffer allocation may fail,
which in turn will fail the parent's fork(), something to be aware of.

This also effectively disables context cloning: unlike traditional events,
these each have their own ring buffer, so the context switch optimization
can't work.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 64 ++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 54 insertions(+), 11 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b7939e8811..0b45abad12 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -780,6 +780,7 @@ struct perf_event_context {
 	int				nr_stat;
 	int				nr_freq;
 	int				rotate_disable;
+	int				clone_disable;
 	atomic_t			refcount;
 	struct task_struct		*task;
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d62ab2d1de..89c14644df 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -259,11 +259,12 @@ static void event_function_call(struct perf_event *event, event_f func, void *da
 		.data = data,
 	};
 
-	if (!event->parent) {
+	if (!event->parent && !ctx->clone_disable) {
 		/*
 		 * If this is a !child event, we must hold ctx::mutex to
 		 * stabilize the the event->ctx relation. See
 		 * perf_event_ctx_lock().
+		 * Note: detached events' ctx is always stable.
 		 */
 		lockdep_assert_held(&ctx->mutex);
 	}
@@ -10169,6 +10170,7 @@ SYSCALL_DEFINE5(perf_event_open,
 		atomic_long_inc(&event->refcount);
 		atomic_inc(&event->mmap_count);
 
+		ctx->clone_disable = 1;
 		event_file->private_data = event;
 	}
 
@@ -10699,14 +10701,18 @@ static void perf_free_event(struct perf_event *event,
 {
 	struct perf_event *parent = event->parent;
 
-	if (WARN_ON_ONCE(!parent))
-		return;
+	/*
+	 * If a parentless event turns up here, it has to be a detached
+	 * event, in case of inherit_event() failure.
+	 */
 
-	mutex_lock(&parent->child_mutex);
-	list_del_init(&event->child_list);
-	mutex_unlock(&parent->child_mutex);
+	if (parent) {
+		mutex_lock(&parent->child_mutex);
+		list_del_init(&event->child_list);
+		mutex_unlock(&parent->child_mutex);
 
-	put_event(parent);
+		put_event(parent);
+	}
 
 	raw_spin_lock_irq(&ctx->lock);
 	perf_group_detach(event);
@@ -10803,6 +10809,7 @@ inherit_event(struct perf_event *parent_event,
 	      struct perf_event_context *child_ctx)
 {
 	enum perf_event_active_state parent_state = parent_event->state;
+	bool detached = is_detached_event(parent_event);
 	struct perf_event *child_event;
 	unsigned long flags;
 
@@ -10815,10 +10822,16 @@ inherit_event(struct perf_event *parent_event,
 	if (parent_event->parent)
 		parent_event = parent_event->parent;
 
+	/*
+	 * Detached events don't have parents; instead, inheritance
+	 * creates a new independent event, which is accessible via
+	 * tracefs.
+	 */
 	child_event = perf_event_alloc(&parent_event->attr,
 					   parent_event->cpu,
 					   child,
-					   group_leader, parent_event,
+					   group_leader,
+					   detached ? NULL : parent_event,
 					   NULL, NULL, -1);
 	if (IS_ERR(child_event))
 		return child_event;
@@ -10864,6 +10877,29 @@ inherit_event(struct perf_event *parent_event,
 	child_event->overflow_handler_context
 		= parent_event->overflow_handler_context;
 
+	/*
+	 * For per-task detached events with ring buffers, set_output doesn't
+	 * make sense, but we can allocate a new buffer here. CPU-wide events
+	 * don't have inheritance.
+	 */
+	if (detached) {
+		int err;
+
+		err = perf_event_detach(child_event, child, NULL);
+		if (err) {
+			perf_free_event(child_event, child_ctx);
+			mutex_unlock(&parent_event->child_mutex);
+			put_event(parent_event);
+			return NULL;
+		}
+
+		/*
+		 * Inherited detached events don't use their parent's
+		 * ring buffer, so cloning can't work for them.
+		 */
+		child_ctx->clone_disable = 1;
+	}
+
 	/*
 	 * Precalculate sample_data sizes
 	 */
@@ -10878,11 +10914,17 @@ inherit_event(struct perf_event *parent_event,
 	raw_spin_unlock_irqrestore(&child_ctx->lock, flags);
 
 	/*
-	 * Link this into the parent event's child list
+	 * Link this into the parent event's child list, unless
+	 * it's a detached event, see above.
 	 */
-	list_add_tail(&child_event->child_list, &parent_event->child_list);
+	if (!detached)
+		list_add_tail(&child_event->child_list,
+			      &parent_event->child_list);
 	mutex_unlock(&parent_event->child_mutex);
 
+	if (detached)
+		put_event(parent_event);
+
 	return child_event;
 }
 
@@ -11042,7 +11084,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn)
 
 	child_ctx = child->perf_event_ctxp[ctxn];
 
-	if (child_ctx && inherited_all) {
+	if (child_ctx && inherited_all && !child_ctx->clone_disable) {
 		/*
 		 * Mark the child context as a clone of the parent
 		 * context, or of whatever the parent is a clone of.
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 09/17] perf: Use shmemfs pages for userspace-only per-thread detached events
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (7 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 08/17] perf: Allow inheritance for detached events Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-10-03 14:43   ` Peter Zijlstra
  2017-09-05 13:30 ` [RFC PATCH 10/17] perf: Implement pinning and scheduling for SHMEM events Alexander Shishkin
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

In order to work around the problem of using up mlocked memory for the
detached events, we can pin the ring buffer pages only while they are
in use (that is, while the event is ACTIVE), and unpin them for the
rest of the time. When not pinned, these pages can be swapped out. This
way, one user can have at most mlock_limit*nr_cpus kB of memory pinned
at any given moment, however many events they actually have.

This enforces a constraint: pinning and unpinning may sleep and thus
can't be done in the event scheduling path. Instead, we use a task
work to do this, which limits this scheme to userspace-only events.
Also, since one userspace thread only needs one buffer (for whatever
CPU it's running on at any given moment), we only do this for
per-thread events.

The source for such swappable pages is shmemfs. This patch allows
allocating perf ring buffer pages from a shmemfs file if the above
constraints are met.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  |   1 +
 kernel/events/core.c        |   2 +-
 kernel/events/internal.h    |   8 +-
 kernel/events/ring_buffer.c | 177 +++++++++++++++++++++++++++++++++++++-------
 4 files changed, 160 insertions(+), 28 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0b45abad12..341e9960bc 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -569,6 +569,7 @@ struct swevent_hlist {
 #define PERF_ATTACH_TASK_DATA	0x08
 #define PERF_ATTACH_ITRACE	0x10
 #define PERF_ATTACH_DETACHED	0x20
+#define PERF_ATTACH_SHMEM	0x40
 
 struct perf_cgroup;
 struct ring_buffer;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 89c14644df..feff812e30 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9419,7 +9419,7 @@ static int perf_event_detach(struct perf_event *event, struct task_struct *task,
 	if (!event->dent)
 		return -ENOMEM;
 
-	err = rb_alloc_detached(event);
+	err = rb_alloc_detached(event, task, mm);
 	if (err) {
 		tracefs_remove(event->dent);
 		event->dent = NULL;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 4b345ee0d4..8de9e9cb6a 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -7,6 +7,7 @@
 /* Buffer handling */
 
 #define RING_BUFFER_WRITABLE		0x01
+#define RING_BUFFER_SHMEM		0x02
 
 struct ring_buffer {
 	atomic_t			refcount;
@@ -52,6 +53,9 @@ struct ring_buffer {
 	void				**aux_pages;
 	void				*aux_priv;
 
+	/* tmpfs file for kernel-owned ring buffers */
+	struct file			*shmem_file;
+
 	struct perf_event_mmap_page	*user_page;
 	void				*data_pages[0];
 };
@@ -82,7 +86,9 @@ extern void perf_event_wakeup(struct perf_event *event);
 extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 			pgoff_t pgoff, int nr_pages, long watermark, int flags);
 extern void rb_free_aux(struct ring_buffer *rb);
-extern int rb_alloc_detached(struct perf_event *event);
+extern int rb_alloc_detached(struct perf_event *event,
+			     struct task_struct *task,
+			     struct mm_struct *mm);
 extern void rb_free_detached(struct ring_buffer *rb, struct perf_event *event);
 extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
 extern void ring_buffer_put(struct ring_buffer *rb);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index d7051868d0..25159fe038 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -14,6 +14,7 @@
 #include <linux/slab.h>
 #include <linux/circ_buf.h>
 #include <linux/poll.h>
+#include <linux/shmem_fs.h>
 
 #include "internal.h"
 
@@ -342,10 +343,11 @@ ring_buffer_init(struct ring_buffer *rb, struct perf_event *event, int flags)
 	 * perf_output_begin() only checks rb->paused, therefore
 	 * rb->paused must be true if we have no pages for output.
 	 */
-	if (!rb->nr_pages)
+	if (!rb->nr_pages || (flags & RING_BUFFER_SHMEM))
 		rb->paused = 1;
 
-	perf_event_init_userpage(event, rb);
+	if (!(flags & RING_BUFFER_SHMEM))
+		perf_event_init_userpage(event, rb);
 }
 
 void perf_aux_output_flag(struct perf_output_handle *handle, u64 flags)
@@ -631,6 +633,9 @@ void ring_buffer_unaccount(struct ring_buffer *rb, bool aux)
 	unsigned long nr_pages = aux ? rb->aux_nr_pages : rb->nr_pages + 1;
 	unsigned long pinned = aux ? rb->aux_mmap_locked : rb->mmap_locked;
 
+	if (!rb->nr_pages && !rb->aux_nr_pages)
+		return;
+
 	atomic_long_sub(nr_pages, &rb->mmap_user->locked_vm);
 	if (rb->mmap_mapping)
 		rb->mmap_mapping->pinned_vm -= pinned;
@@ -640,10 +645,15 @@ void ring_buffer_unaccount(struct ring_buffer *rb, bool aux)
 
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
-static struct page *rb_alloc_aux_page(int node, int order)
+static struct page *
+rb_alloc_aux_page(struct ring_buffer *rb, int node, int order, int pgoff)
 {
+	struct file *file = rb->shmem_file;
 	struct page *page;
 
+	if (order && file)
+		return NULL;
+
 	if (order > MAX_ORDER)
 		order = MAX_ORDER;
 
@@ -670,8 +680,13 @@ static void rb_free_aux_page(struct ring_buffer *rb, int idx)
 {
 	struct page *page = virt_to_page(rb->aux_pages[idx]);
 
-	ClearPagePrivate(page);
+	/* SHMEM pages are freed elsewhere */
+	if (rb->shmem_file)
+		return;
+
 	page->mapping = NULL;
+
+	ClearPagePrivate(page);
 	__free_page(page);
 }
 
@@ -706,17 +721,20 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 		 pgoff_t pgoff, int nr_pages, long watermark, int flags)
 {
 	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
+	bool shmem = !!(flags & RING_BUFFER_SHMEM);
 	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
 	int ret, max_order = 0;
 
 	if (!has_aux(event))
 		return -EOPNOTSUPP;
 
-	ret = ring_buffer_account(rb, NULL, nr_pages, true);
-	if (ret)
-		return ret;
+	if (!shmem) {
+		ret = ring_buffer_account(rb, NULL, nr_pages, true);
+		if (ret)
+			return ret;
+	}
 
-	ret = -ENOMEM;
+	ret = -EINVAL;
 	if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NO_SG) {
 		/*
 		 * We need to start with the max_order that fits in nr_pages,
@@ -737,21 +755,41 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 		}
 	}
 
+	ret = -ENOMEM;
 	rb->aux_pages = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
 	if (!rb->aux_pages)
-		return -ENOMEM;
+		goto out;
 
 	rb->free_aux = event->pmu->free_aux;
+
+	if (shmem) {
+		/*
+		 * Can't guarantee contiguous high order allocations.
+		 */
+		if (max_order)
+			goto out;
+
+		/*
+		 * Skip page allocation; it's done in rb_get_kernel_pages().
+		 */
+		rb->aux_nr_pages = nr_pages;
+
+		goto post_setup;
+	}
+
 	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;) {
 		struct page *page;
 		int last, order;
 
 		order = min(max_order, ilog2(nr_pages - rb->aux_nr_pages));
-		page = rb_alloc_aux_page(node, order);
+		page = rb_alloc_aux_page(rb, node, order, pgoff + rb->aux_nr_pages);
 		if (!page)
 			goto out;
 
-		for (last = rb->aux_nr_pages + (1 << page_private(page));
+		if (order)
+			order = page_private(page);
+
+		for (last = rb->aux_nr_pages + (1 << order);
 		     last > rb->aux_nr_pages; rb->aux_nr_pages++)
 			rb->aux_pages[rb->aux_nr_pages] = page_address(page++);
 	}
@@ -775,6 +813,7 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 	if (!rb->aux_priv)
 		goto out;
 
+post_setup:
 	ret = 0;
 
 	/*
@@ -795,7 +834,8 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 	if (!ret) {
 		rb->aux_pgoff = pgoff;
 	} else {
-		ring_buffer_unaccount(rb, true);
+		if (!shmem)
+			ring_buffer_unaccount(rb, true);
 		__rb_free_aux(rb);
 	}
 
@@ -811,35 +851,95 @@ void rb_free_aux(struct ring_buffer *rb)
 	}
 }
 
+static int rb_shmem_setup(struct perf_event *event,
+			  struct task_struct *task,
+			  struct ring_buffer *rb)
+{
+	int nr_pages, err;
+	char *name;
+
+	if (WARN_ON_ONCE(!task))
+		return -EINVAL;
+
+	name = event->dent && event->dent->d_name.name ?
+		kasprintf(GFP_KERNEL, "perf/%s/%s/%d",
+			  event->dent->d_name.name, event->pmu->name,
+			  task_pid_nr_ns(task, event->ns)) :
+		kasprintf(GFP_KERNEL, "perf/%s/%d", event->pmu->name,
+			  task_pid_nr_ns(task, event->ns));
+	if (!name)
+		return -ENOMEM;
+
+	WARN_ON_ONCE(rb->user_page);
+
+	nr_pages = rb->nr_pages + rb->aux_nr_pages + 1;
+	rb->shmem_file = shmem_file_setup(name, nr_pages << PAGE_SHIFT,
+					  VM_NORESERVE);
+	kfree(name);
+
+	if (IS_ERR(rb->shmem_file)) {
+		err = PTR_ERR(rb->shmem_file);
+		rb->shmem_file = NULL;
+		return err;
+	}
+
+	mapping_set_gfp_mask(rb->shmem_file->f_mapping,
+			     GFP_HIGHUSER | __GFP_RECLAIMABLE);
+
+	event->dent->d_inode->i_mapping = rb->shmem_file->f_mapping;
+	event->attach_state |= PERF_ATTACH_SHMEM;
+
+	return 0;
+}
+
 /*
  * Allocate a ring_buffer for a detached event and attach it to this event.
  * There's one ring_buffer per detached event and vice versa, so
  * ring_buffer_attach() does not apply.
  */
-int rb_alloc_detached(struct perf_event *event)
+int rb_alloc_detached(struct perf_event *event, struct task_struct *task,
+		      struct mm_struct *mm)
 {
 	int aux_nr_pages = event->attr.detached_aux_nr_pages;
 	int nr_pages = event->attr.detached_nr_pages;
-	struct ring_buffer *rb;
 	int ret, pgoff = nr_pages + 1;
+	struct ring_buffer *rb;
+	int flags = 0;
 
 	/*
-	 * Use overwrite mode (!RING_BUFFER_WRITABLE) for both data and aux
-	 * areas as we don't want wakeups or interrupts.
+	 * These are basically coredump conditions. If these are
+	 * not met, we proceed as we would, but with pinned pages
+	 * and therefore *no inheritance*.
 	 */
-	rb = rb_alloc(event, NULL, nr_pages, 0);
+	if (event->attr.inherit && event->attr.exclude_kernel &&
+	    event->cpu == -1)
+		flags = RING_BUFFER_SHMEM;
+	else if (event->attr.inherit)
+		return -EINVAL;
+
+	rb = rb_alloc(event, mm, nr_pages, flags);
 	if (IS_ERR(rb))
 		return PTR_ERR(rb);
 
-	ret = rb_alloc_aux(rb, event, pgoff, aux_nr_pages, 0, 0);
-	if (ret) {
-		rb_free(rb);
-		return ret;
+	if (aux_nr_pages) {
+		ret = rb_alloc_aux(rb, event, pgoff, aux_nr_pages, 0, flags);
+		if (ret)
+			goto err_free;
 	}
 
-	atomic_set(&rb->mmap_count, 1);
-	if (aux_nr_pages)
-		atomic_set(&rb->aux_mmap_count, 1);
+	if (flags & RING_BUFFER_SHMEM) {
+		ret = rb_shmem_setup(event, task, rb);
+		if (ret) {
+			rb_free_aux(rb);
+			goto err_free;
+		}
+
+		rb_toggle_paused(rb, true);
+	} else {
+		atomic_inc(&rb->mmap_count);
+		if (aux_nr_pages)
+			atomic_inc(&rb->aux_mmap_count);
+	}
 
 	/*
 	 * Detached events don't need ring buffer wakeups, therefore we don't
@@ -847,7 +947,14 @@ int rb_alloc_detached(struct perf_event *event)
 	 */
 	rcu_assign_pointer(event->rb, rb);
 
+	event->attach_state |= PERF_ATTACH_DETACHED;
+
 	return 0;
+
+err_free:
+	rb_free(rb);
+
+	return ret;
 }
 
 void rb_free_detached(struct ring_buffer *rb, struct perf_event *event)
@@ -855,6 +962,9 @@ void rb_free_detached(struct ring_buffer *rb, struct perf_event *event)
 	/* Must be the last one */
 	WARN_ON_ONCE(atomic_read(&rb->refcount) != 1);
 
+	if (rb->shmem_file)
+		shmem_truncate_range(rb->shmem_file->f_inode, 0, (loff_t)-1);
+
 	atomic_set(&rb->aux_mmap_count, 0);
 	rcu_assign_pointer(event->rb, NULL);
 	rb_free_aux(rb);
@@ -896,6 +1006,7 @@ struct ring_buffer *rb_alloc(struct perf_event *event, struct mm_struct *mm,
 			     int nr_pages, int flags)
 {
 	unsigned long size = offsetof(struct ring_buffer, data_pages[nr_pages]);
+	bool shmem = !!(flags & RING_BUFFER_SHMEM);
 	struct ring_buffer *rb;
 	int i, ret = -ENOMEM;
 
@@ -903,6 +1014,9 @@ struct ring_buffer *rb_alloc(struct perf_event *event, struct mm_struct *mm,
 	if (!rb)
 		return ERR_PTR(-ENOMEM);
 
+	if (shmem)
+		goto post_alloc;
+
 	ret = ring_buffer_account(rb, mm, nr_pages, false);
 	if (ret)
 		goto fail_free_rb;
@@ -919,6 +1033,7 @@ struct ring_buffer *rb_alloc(struct perf_event *event, struct mm_struct *mm,
 			goto fail_data_pages;
 	}
 
+post_alloc:
 	rb->nr_pages = nr_pages;
 
 	ring_buffer_init(rb, event, flags);
@@ -927,9 +1042,9 @@ struct ring_buffer *rb_alloc(struct perf_event *event, struct mm_struct *mm,
 
 fail_data_pages:
 	for (i--; i >= 0; i--)
-		free_page((unsigned long)rb->data_pages[i]);
+		put_page(virt_to_page(rb->data_pages[i]));
 
-	free_page((unsigned long)rb->user_page);
+	put_page(virt_to_page(rb->user_page));
 
 fail_unaccount:
 	ring_buffer_unaccount(rb, false);
@@ -952,9 +1067,16 @@ void rb_free(struct ring_buffer *rb)
 {
 	int i;
 
+	if (rb->shmem_file) {
+		/* the pages should have been freed before */
+		fput(rb->shmem_file);
+		goto out_free;
+	}
+
 	perf_mmap_free_page((unsigned long)rb->user_page);
 	for (i = 0; i < rb->nr_pages; i++)
 		perf_mmap_free_page((unsigned long)rb->data_pages[i]);
+out_free:
 	kfree(rb);
 }
 
@@ -1012,6 +1134,9 @@ struct ring_buffer *rb_alloc(struct perf_event *event, struct mm_struct *mm,
 	void *all_buf;
 	int ret = -ENOMEM;
 
+	if (flags & RING_BUFFER_SHMEM)
+		return -EOPNOTSUPP;
+
 	rb = kzalloc(size, GFP_KERNEL);
 	if (!rb)
 		goto fail;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 10/17] perf: Implement pinning and scheduling for SHMEM events
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (8 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 09/17] perf: Use shmemfs pages for userspace-only per-thread " Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-09-05 13:30 ` [RFC PATCH 11/17] perf: Implement mlock accounting for shmem ring buffers Alexander Shishkin
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

A SHMEM buffer is only pinned while its task is scheduled in, and the
pinning is done in a task work, which also implies that the
corresponding event can only be started from that task work.

Pinning is done on a per-cpu basis: if a different event has previously
been pinned on the local cpu, it is unpinned (its pin count is dropped)
and the new event is pinned on this cpu instead. When an event's pin
count drops to zero, we unpin its pages; when it goes to one, we pin
them.
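
A minimal userspace model of these pin count transitions (illustration
only, not kernel code; all names here are made up):

	#include <stdatomic.h>
	#include <stdio.h>

	struct model_event {
		atomic_int xpinned;	/* number of CPUs holding a pin */
	};

	/* Stands in for the per-cpu pinned-event slot on the current CPU. */
	static struct model_event *pinned_here;

	static void pin_pages(struct model_event *e)   { printf("pin %p\n", (void *)e); }
	static void unpin_pages(struct model_event *e) { printf("unpin %p\n", (void *)e); }

	/* Runs in task context when @e's task is scheduled in on this CPU. */
	static void schedule_in(struct model_event *e)
	{
		struct model_event *old = pinned_here;

		if (old == e)
			return;			/* already pinned here */

		/* Drop the previous event's pin; 1 -> 0 unpins its pages. */
		if (old && atomic_fetch_sub(&old->xpinned, 1) == 1)
			unpin_pages(old);

		pinned_here = e;
		/* 0 -> 1 is when the pages actually get pinned. */
		if (atomic_fetch_add(&e->xpinned, 1) == 0)
			pin_pages(e);
	}

	int main(void)
	{
		struct model_event a = { 0 }, b = { 0 };

		schedule_in(&a);	/* pins a's pages */
		schedule_in(&a);	/* no-op */
		schedule_in(&b);	/* drops a's pin, pins b's pages */
		return 0;
	}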

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h  |  10 +++
 kernel/events/core.c        | 134 ++++++++++++++++++++++++++++-
 kernel/events/internal.h    |   5 ++
 kernel/events/ring_buffer.c | 202 +++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 347 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 341e9960bc..4b966dd0d8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -703,6 +703,13 @@ struct perf_event {
 	unsigned long			rcu_batches;
 	int				rcu_pending;
 
+	/*
+	 * Number of times (CPUs) this event's been pinned (on):
+	 *  xpinned -> 0: unpin the pages,
+	 *  xpinned -> 1: pin the pages. See get_pages_work().
+	 */
+	atomic_t			xpinned;
+
 	/* poll related */
 	wait_queue_head_t		waitq;
 	struct fasync_struct		*fasync;
@@ -735,6 +742,9 @@ struct perf_event {
 	struct bpf_prog			*prog;
 #endif
 
+	/* Task work to pin event's rb pages if needed */
+	struct callback_head		get_pages_work;
+
 #ifdef CONFIG_EVENT_TRACING
 	struct trace_event_call		*tp_event;
 	struct event_filter		*filter;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index feff812e30..c80ffcdb5c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -50,6 +50,7 @@
 #include <linux/sched/mm.h>
 #include <linux/proc_ns.h>
 #include <linux/mount.h>
+#include <linux/task_work.h>
 #include <linux/tracefs.h>
 
 #include "internal.h"
@@ -383,6 +384,7 @@ static atomic_t perf_sched_count;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
 static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 static DEFINE_PER_CPU(struct pmu_event_list, pmu_sb_events);
+static DEFINE_PER_CPU(struct perf_event *, shmem_events);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -2058,6 +2060,94 @@ static void perf_set_shadow_time(struct perf_event *event,
 		event->shadow_ctx_time = tstamp - ctx->timestamp;
 }
 
+static void __unpin_event_pages(struct perf_event *event,
+				struct perf_cpu_context *cpuctx,
+				struct perf_event_context *ctx,
+				void *info)
+{
+	if (!atomic_dec_and_test(&event->xpinned))
+		return;
+
+	/*
+	 * If this event happens to be running, we need to stop it before we
+	 * can pull the pages. Note that this will be happening if we allow
+	 * concurrent shmem events, which seems like a bad idea.
+	 */
+	if (READ_ONCE(event->state) == PERF_EVENT_STATE_ACTIVE)
+		event->pmu->stop(event, PERF_EF_UPDATE);
+
+	rb_put_kernel_pages(event->rb, false);
+}
+
+enum pin_event_t {
+	PIN_IN = 0,
+	PIN_NOP,
+};
+
+static enum pin_event_t pin_event_pages(struct perf_event *event)
+{
+	struct perf_event **pinned_event = this_cpu_ptr(&shmem_events);
+	struct perf_event *old_event = *pinned_event;
+
+	if (old_event == event)
+		return PIN_NOP;
+
+	if (old_event && old_event->state > PERF_EVENT_STATE_DEAD)
+		event_function_call(old_event, __unpin_event_pages, NULL);
+
+	*pinned_event = event;
+	if (atomic_inc_return(&event->xpinned) != 1)
+		return PIN_NOP;
+
+	return PIN_IN;
+}
+
+static int perf_event_stop(struct perf_event *event, int restart);
+
+static void get_pages_work(struct callback_head *work)
+{
+	struct perf_event *event = container_of(work, struct perf_event, get_pages_work);
+	int ret;
+	struct ring_buffer *rb = event->rb;
+	int (*get_fn)(struct perf_event *event) = rb_get_kernel_pages;
+
+	work->func = NULL;
+
+	if (!rb || current->flags & PF_EXITING)
+		return;
+
+	if (!rb->shmem_file_addr) {
+		get_fn = rb_inject;
+		if (atomic_cmpxchg(&event->xpinned, 1, 0))
+			rb_put_kernel_pages(rb, false);
+	}
+
+	if (pin_event_pages(event) == PIN_IN) {
+		ret = get_fn(event);
+	} else {
+		ret = 0;
+	}
+
+	if (!ret)
+		perf_event_stop(event, 1);
+}
+
+static int perf_event_queue_work(struct perf_event *event,
+				 struct task_struct *task)
+{
+	int ret;
+
+	if (event->get_pages_work.func)
+		return 0;
+
+	init_task_work(&event->get_pages_work, get_pages_work);
+	ret = task_work_add(task, &event->get_pages_work, true);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
 #define MAX_INTERRUPTS (~0ULL)
 
 static void perf_log_throttle(struct perf_event *event, int enable);
@@ -2069,7 +2159,7 @@ event_sched_in(struct perf_event *event,
 		 struct perf_event_context *ctx)
 {
 	u64 tstamp = perf_event_time(event);
-	int ret = 0;
+	int ret = 0, shmem = event->attach_state & PERF_ATTACH_SHMEM;
 
 	lockdep_assert_held(&ctx->lock);
 
@@ -2105,13 +2195,21 @@ event_sched_in(struct perf_event *event,
 
 	perf_log_itrace_start(event);
 
-	if (event->pmu->add(event, PERF_EF_START)) {
+	/*
+	 * For shmem events pmu::start will fail because of
+	 * rb::aux_mmap_count==0, so skip the PERF_EF_START, but
+	 * queue the task work that will actually start it.
+	 */
+	if (event->pmu->add(event, shmem ? 0 : PERF_EF_START)) {
 		event->state = PERF_EVENT_STATE_INACTIVE;
 		event->oncpu = -1;
 		ret = -EAGAIN;
 		goto out;
 	}
 
+	if (shmem)
+		perf_event_queue_work(event, ctx->task);
+
 	event->tstamp_running += tstamp - event->tstamp_stopped;
 
 	if (!is_software_event(event))
@@ -4182,6 +4280,30 @@ static void _free_event(struct perf_event *event)
 
 	unaccount_event(event);
 
+	if (event->attach_state & PERF_ATTACH_SHMEM) {
+		struct perf_event_context *ctx = event->ctx;
+		int cpu;
+
+		atomic_set(&event->xpinned, 0);
+		for_each_possible_cpu(cpu) {
+			struct perf_event **pinned_event =
+				per_cpu_ptr(&shmem_events, cpu);
+
+			cmpxchg(pinned_event, event, NULL);
+		}
+
+		event->attach_state &= ~PERF_ATTACH_SHMEM;
+
+		/*
+		 * XXX: !ctx means event is still being created;
+		 * we can get here via tracefs file though
+		 */
+		if (ctx && ctx->task && ctx->task != TASK_TOMBSTONE)
+			task_work_cancel(ctx->task, get_pages_work);
+
+		rb_put_kernel_pages(event->rb, false);
+	}
+
 	if (event->dent) {
 		tracefs_remove(event->dent);
 
@@ -4948,6 +5070,10 @@ void perf_event_update_userpage(struct perf_event *event)
 	if (!rb)
 		goto unlock;
 
+	/* Don't bother with the file backed rb when it's inactive */
+	if (rb->shmem_file && rb->paused)
+		goto unlock;
+
 	/*
 	 * compute total_time_enabled, total_time_running
 	 * based on snapshot values taken when the event
@@ -10684,6 +10810,8 @@ void perf_event_exit_task(struct task_struct *child)
 	}
 	mutex_unlock(&child->perf_event_mutex);
 
+	task_work_cancel(child, get_pages_work);
+
 	for_each_task_context_nr(ctxn)
 		perf_event_exit_task_context(child, ctxn);
 
@@ -10881,6 +11009,8 @@ inherit_event(struct perf_event *parent_event,
 	 * For per-task detached events with ring buffers, set_output doesn't
 	 * make sense, but we can allocate a new buffer here. CPU-wide events
 	 * don't have inheritance.
+	 * If we have to allocate a ring buffer, it must be shmem backed,
+	 * otherwise inheritance is disallowed in rb_alloc_detached().
 	 */
 	if (detached) {
 		int err;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 8de9e9cb6a..80d36a7277 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -55,11 +55,16 @@ struct ring_buffer {
 
 	/* tmpfs file for kernel-owned ring buffers */
 	struct file			*shmem_file;
+	unsigned long			shmem_file_addr;
+	int				shmem_pages_in;
 
 	struct perf_event_mmap_page	*user_page;
 	void				*data_pages[0];
 };
 
+extern int rb_inject(struct perf_event *event);
+extern int rb_get_kernel_pages(struct perf_event *event);
+extern void rb_put_kernel_pages(struct ring_buffer *rb, bool final);
 extern void rb_free(struct ring_buffer *rb);
 extern void ring_buffer_unaccount(struct ring_buffer *rb, bool aux);
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 25159fe038..771dfdb71f 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -15,6 +15,8 @@
 #include <linux/circ_buf.h>
 #include <linux/poll.h>
 #include <linux/shmem_fs.h>
+#include <linux/mman.h>
+#include <linux/sched/mm.h>
 
 #include "internal.h"
 
@@ -384,8 +386,11 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 	unsigned long aux_head, aux_tail;
 	struct ring_buffer *rb;
 
-	if (output_event->parent)
+	if (output_event->parent) {
+		WARN_ON_ONCE(is_detached_event(event));
+		WARN_ON_ONCE(event->attach_state & PERF_ATTACH_SHMEM);
 		output_event = output_event->parent;
+	}
 
 	/*
 	 * Since this will typically be open across pmu::add/pmu::del, we
@@ -851,6 +856,64 @@ void rb_free_aux(struct ring_buffer *rb)
 	}
 }
 
+static unsigned long perf_rb_size(struct ring_buffer *rb)
+{
+	return perf_data_size(rb) + perf_aux_size(rb) + PAGE_SIZE;
+}
+
+int rb_inject(struct perf_event *event)
+{
+	struct ring_buffer *rb = event->rb;
+	struct mm_struct *mm;
+	unsigned long addr;
+	int err = -ENOMEM;
+
+	mm = get_task_mm(current);
+	if (!mm)
+		return -ESRCH;
+
+	err = rb_get_kernel_pages(event);
+	if (err)
+		goto err_mmput;
+
+	addr = vm_mmap(rb->shmem_file, 0, perf_rb_size(rb), PROT_READ,
+		       MAP_SHARED | MAP_POPULATE, 0);
+
+	mmput(mm);
+	rb->mmap_mapping = mm;
+	rb->shmem_file_addr = addr;
+
+	return 0;
+
+err_mmput:
+	mmput(mm);
+
+	return err;
+}
+
+static void rb_shmem_unmap(struct perf_event *event)
+{
+	struct ring_buffer *rb = event->rb;
+	struct mm_struct *mm = rb->mmap_mapping;
+
+	rb_toggle_paused(rb, true);
+
+	if (!rb->shmem_file_addr)
+		return;
+
+	/*
+	 * EXIT state means the task is past exit_mm(),
+	 * no need to unmap anything
+	 */
+	if (event->state == PERF_EVENT_STATE_EXIT)
+		return;
+
+	down_write(&mm->mmap_sem);
+	(void)do_munmap(mm, rb->shmem_file_addr, perf_rb_size(rb), NULL);
+	up_write(&mm->mmap_sem);
+	rb->shmem_file_addr = 0;
+}
+
 static int rb_shmem_setup(struct perf_event *event,
 			  struct task_struct *task,
 			  struct ring_buffer *rb)
@@ -892,6 +955,138 @@ static int rb_shmem_setup(struct perf_event *event,
 	return 0;
 }
 
+/*
+ * Pin ring_buffer's pages to memory while the task is scheduled in;
+ * populate its page arrays (data_pages, aux_pages, user_page).
+ */
+int rb_get_kernel_pages(struct perf_event *event)
+{
+	struct ring_buffer *rb = event->rb;
+	struct address_space *mapping;
+	int nr_pages, i = 0, err = -EINVAL, changed = 0, mc = 0;
+	struct page *page;
+
+	/*
+	 * The mmap_count rules for SHMEM buffers:
+	 *  - they are always taken together
+	 *  - except for perf_mmap(), which doesn't work for shmem buffers:
+	 *    mmaping will force-pin more user's pages than is allowed
+	 *  - if either of them was taken before us, the pages are there
+	 */
+	if (atomic_inc_return(&rb->mmap_count) == 1)
+		mc++;
+
+	if (atomic_inc_return(&rb->aux_mmap_count) == 1)
+		mc++;
+
+	if (mc < 2)
+		goto done;
+
+	if (WARN_ON_ONCE(!rb->shmem_file))
+		goto err_put;
+
+	nr_pages = perf_rb_size(rb) >> PAGE_SHIFT;
+
+	mapping = rb->shmem_file->f_mapping;
+
+restart:
+	for (i = 0; i < nr_pages; i++) {
+		WRITE_ONCE(rb->shmem_pages_in, i);
+		err = shmem_getpage(mapping->host, i, &page, SGP_NOHUGE);
+		if (err)
+			goto err_put;
+
+		unlock_page(page);
+
+		if (READ_ONCE(rb->shmem_pages_in) != i) {
+			put_page(page);
+			goto restart;
+		}
+
+		mark_page_accessed(page);
+		set_page_dirty(page);
+		page->mapping = mapping;
+
+		if (page == perf_mmap_to_page(rb, i))
+			continue;
+
+		changed++;
+		if (!i) {
+			bool init = !rb->user_page;
+
+			rb->user_page = page_address(page);
+			if (init)
+				perf_event_init_userpage(event, rb);
+		} else if (i <= rb->nr_pages) {
+			rb->data_pages[i - 1] = page_address(page);
+		} else {
+			rb->aux_pages[i - rb->nr_pages - 1] = page_address(page);
+		}
+	}
+
+	/* rebuild SG tables: pages may have changed */
+	if (changed) {
+		if (rb->aux_priv)
+			rb->free_aux(rb->aux_priv);
+
+		rb->aux_priv = event->pmu->setup_aux(smp_processor_id(),
+						     rb->aux_pages,
+						     rb->aux_nr_pages, true);
+	}
+
+	if (!rb->aux_priv) {
+		err = -ENOMEM;
+		goto err_put;
+	}
+
+done:
+	rb_toggle_paused(rb, false);
+	if (changed)
+		perf_event_update_userpage(event);
+
+	return 0;
+
+err_put:
+	for (i--; i >= 0; i--) {
+		page = perf_mmap_to_page(rb, i);
+		put_page(page);
+	}
+
+	atomic_dec(&rb->aux_mmap_count);
+	atomic_dec(&rb->mmap_count);
+
+	return err;
+}
+
+void rb_put_kernel_pages(struct ring_buffer *rb, bool final)
+{
+	struct page *page;
+	int i;
+
+	if (!rb || !rb->shmem_file)
+		return;
+
+	rb_toggle_paused(rb, true);
+
+	/*
+	 * If both mmap_counts go to zero, put the pages, otherwise
+	 * do nothing.
+	 */
+	if (!atomic_dec_and_test(&rb->aux_mmap_count) ||
+	    !atomic_dec_and_test(&rb->mmap_count))
+		return;
+
+	for (i = 0; i < READ_ONCE(rb->shmem_pages_in); i++) {
+		page = perf_mmap_to_page(rb, i);
+		set_page_dirty(page);
+		if (final)
+			page->mapping = NULL;
+		put_page(page);
+	}
+
+	WRITE_ONCE(rb->shmem_pages_in, 0);
+}
+
 /*
  * Allocate a ring_buffer for a detached event and attach it to this event.
  * There's one ring_buffer per detached event and vice versa, so
@@ -962,8 +1157,11 @@ void rb_free_detached(struct ring_buffer *rb, struct perf_event *event)
 	/* Must be the last one */
 	WARN_ON_ONCE(atomic_read(&rb->refcount) != 1);
 
-	if (rb->shmem_file)
+	if (rb->shmem_file) {
+		rb_shmem_unmap(event);
 		shmem_truncate_range(rb->shmem_file->f_inode, 0, (loff_t)-1);
+		rb_put_kernel_pages(rb, true);
+	}
 
 	atomic_set(&rb->aux_mmap_count, 0);
 	rcu_assign_pointer(event->rb, NULL);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 11/17] perf: Implement mlock accounting for shmem ring buffers
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (9 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 10/17] perf: Implement pinning and scheduling for SHMEM events Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-09-05 13:30 ` [RFC PATCH 12/17] perf: Track pinned events per user Alexander Shishkin
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

With shmem ring buffers, a user can have at most nr_pages * nr_cpus
pages pinned at any given time, so we only need to do the accounting
once, when the event is created (by means of sys_perf_event_open()).
This implements such accounting by adding a shared reference counter:
when it goes 0 -> 1, we account the pages; when it drops back to 0, we
undo the accounting.
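
A back-of-the-envelope illustration of the bound being accounted here
(buffer sizes are made-up examples):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		long page_size  = sysconf(_SC_PAGESIZE);
		long nr_cpus    = sysconf(_SC_NPROCESSORS_ONLN);
		long data_pages = 64, aux_pages = 1024;	/* example sizes */

		/*
		 * One user page + data + AUX per buffer, and at most one
		 * buffer pinned per CPU at a time, no matter how many
		 * inherited events exist; hence accounting once per
		 * syscall-created event is enough.
		 */
		long worst = nr_cpus * (1 + data_pages + aux_pages) * page_size;

		printf("worst-case pinned memory: %ld bytes\n", worst);
		return 0;
	}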

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c        |  12 +++--
 kernel/events/internal.h    |   5 +-
 kernel/events/ring_buffer.c | 124 +++++++++++++++++++++++++++++++++++++-------
 3 files changed, 116 insertions(+), 25 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c80ffcdb5c..1fed69d4ba 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4309,7 +4309,6 @@ static void _free_event(struct perf_event *event)
 
 		event->attach_state &= ~PERF_ATTACH_DETACHED;
 
-		ring_buffer_unaccount(event->rb, false);
 		rb_free_detached(event->rb, event);
 	}
 
@@ -9525,9 +9524,11 @@ static void account_event(struct perf_event *event)
 	account_pmu_sb_event(event);
 }
 
-static int perf_event_detach(struct perf_event *event, struct task_struct *task,
-			     struct mm_struct *mm)
+static int
+perf_event_detach(struct perf_event *event, struct perf_event *parent_event,
+		  struct task_struct *task, struct mm_struct *mm)
 {
+	struct ring_buffer *parent_rb = parent_event ? parent_event->rb : NULL;
 	char *filename;
 	int err;
 
@@ -9545,7 +9546,7 @@ static int perf_event_detach(struct perf_event *event, struct task_struct *task,
 	if (!event->dent)
 		return -ENOMEM;
 
-	err = rb_alloc_detached(event, task, mm);
+	err = rb_alloc_detached(event, task, mm, parent_rb);
 	if (err) {
 		tracefs_remove(event->dent);
 		event->dent = NULL;
@@ -11015,7 +11016,8 @@ inherit_event(struct perf_event *parent_event,
 	if (detached) {
 		int err;
 
-		err = perf_event_detach(child_event, child, NULL);
+		err = perf_event_detach(child_event, parent_event, child,
+					NULL);
 		if (err) {
 			perf_free_event(child_event, child_ctx);
 			mutex_unlock(&parent_event->child_mutex);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 80d36a7277..3dc66961d9 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -53,6 +53,8 @@ struct ring_buffer {
 	void				**aux_pages;
 	void				*aux_priv;
 
+	atomic_t			*acct_refcount;
+
 	/* tmpfs file for kernel-owned ring buffers */
 	struct file			*shmem_file;
 	unsigned long			shmem_file_addr;
@@ -93,7 +95,8 @@ extern int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 extern void rb_free_aux(struct ring_buffer *rb);
 extern int rb_alloc_detached(struct perf_event *event,
 			     struct task_struct *task,
-			     struct mm_struct *mm);
+			     struct mm_struct *mm,
+			     struct ring_buffer *parent_rb);
 extern void rb_free_detached(struct ring_buffer *rb, struct perf_event *event);
 extern struct ring_buffer *ring_buffer_get(struct perf_event *event);
 extern void ring_buffer_put(struct ring_buffer *rb);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 771dfdb71f..896d441642 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -570,8 +570,8 @@ void *perf_get_aux(struct perf_output_handle *handle)
  * error out. Otherwise, keep track of the pages used in the ring_buffer so
  * that the accounting can be undone when the pages are freed.
  */
-static int ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
-			       unsigned long nr_pages, bool aux)
+static int __ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
+                                 unsigned long nr_pages, unsigned long *locked)
 {
 	unsigned long total, limit, pinned;
 
@@ -589,6 +589,9 @@ static int ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
 
 	total = atomic_long_read(&rb->mmap_user->locked_vm) + nr_pages;
 
+	free_uid(rb->mmap_user);
+	rb->mmap_user = NULL;
+
 	pinned = 0;
 	if (total > limit) {
 		/*
@@ -609,27 +612,33 @@ static int ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
 			return -EPERM;
 		}
 
-		if (aux)
-			rb->aux_mmap_locked = pinned;
-		else
-			rb->mmap_locked = pinned;
-
+		*locked = pinned;
 		mm->pinned_vm += pinned;
 	}
 
 	if (!rb->mmap_mapping)
 		rb->mmap_mapping = mm;
 
-	/* account for user page */
-	if (!aux)
-		nr_pages++;
-
 	rb->mmap_user = get_current_user();
 	atomic_long_add(nr_pages, &rb->mmap_user->locked_vm);
 
 	return 0;
 }
 
+static int ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
+			       unsigned long nr_pages, bool aux)
+{
+	int ret;
+
+	/* account for user page */
+	if (!aux)
+		nr_pages++;
+	ret = __ring_buffer_account(rb, mm, nr_pages,
+	                            aux ? &rb->aux_mmap_locked : &rb->mmap_locked);
+
+	return ret;
+}
+
 /*
  * Undo the mlock pages accounting done in ring_buffer_account().
  */
@@ -641,6 +650,9 @@ void ring_buffer_unaccount(struct ring_buffer *rb, bool aux)
 	if (!rb->nr_pages && !rb->aux_nr_pages)
 		return;
 
+	if (WARN_ON_ONCE(!rb->mmap_user))
+		return;
+
 	atomic_long_sub(nr_pages, &rb->mmap_user->locked_vm);
 	if (rb->mmap_mapping)
 		rb->mmap_mapping->pinned_vm -= pinned;
@@ -850,7 +862,8 @@ int rb_alloc_aux(struct ring_buffer *rb, struct perf_event *event,
 void rb_free_aux(struct ring_buffer *rb)
 {
 	if (atomic_dec_and_test(&rb->aux_refcount)) {
-		ring_buffer_unaccount(rb, true);
+		if (!rb->shmem_file)
+			ring_buffer_unaccount(rb, true);
 
 		__rb_free_aux(rb);
 	}
@@ -1087,13 +1100,68 @@ void rb_put_kernel_pages(struct ring_buffer *rb, bool final)
 	WRITE_ONCE(rb->shmem_pages_in, 0);
 }
 
+/*
+ * SHMEM memory is accounted once per user allocated event (via
+ * the syscall), since we can have at most NR_CPUS * nr_pages
+ * pinned pages at any given point in time, regardless of how
+ * many events there actually are.
+ *
+ * The first one (parent_rb==NULL) is where we do the accounting;
+ * it will also be the one coming from the syscall, so if it fails,
+ * we'll hand them back the error.
+ * Others just inherit and bump the counter; can't fail.
+ */
+static int
+rb_shmem_account(struct ring_buffer *rb, struct ring_buffer *parent_rb)
+{
+	unsigned long nr_pages = perf_rb_size(rb) >> PAGE_SHIFT;
+	int ret = 0;
+
+	if (parent_rb) {
+		/* "parent" rb *must* have accounting refcounter */
+		if (WARN_ON_ONCE(!parent_rb->acct_refcount))
+			return -EINVAL;
+
+		rb->acct_refcount = parent_rb->acct_refcount;
+		atomic_inc(rb->acct_refcount);
+
+		return 0;
+	}
+
+	/* All (data + aux + user page) in one go */
+	ret = __ring_buffer_account(rb, NULL, nr_pages,
+	                            &rb->mmap_locked);
+	if (ret)
+		return ret;
+
+	rb->acct_refcount = kmalloc(sizeof(*rb->acct_refcount),
+	                            GFP_KERNEL);
+	if (!rb->acct_refcount)
+		return -ENOMEM;
+
+	atomic_set(rb->acct_refcount, 1);
+
+	return 0;
+}
+
+static void rb_shmem_unaccount(struct ring_buffer *rb)
+{
+	if (!atomic_dec_and_test(rb->acct_refcount)) {
+		rb->acct_refcount = NULL;
+		return;
+	}
+
+	ring_buffer_unaccount(rb, false);
+	kfree(rb->acct_refcount);
+}
+
 /*
  * Allocate a ring_buffer for a detached event and attach it to this event.
  * There's one ring_buffer per detached event and vice versa, so
  * ring_buffer_attach() does not apply.
  */
 int rb_alloc_detached(struct perf_event *event, struct task_struct *task,
-		      struct mm_struct *mm)
+		      struct mm_struct *mm, struct ring_buffer *parent_rb)
 {
 	int aux_nr_pages = event->attr.detached_aux_nr_pages;
 	int nr_pages = event->attr.detached_nr_pages;
@@ -1116,18 +1184,22 @@ int rb_alloc_detached(struct perf_event *event, struct task_struct *task,
 	if (IS_ERR(rb))
 		return PTR_ERR(rb);
 
+	if (flags & RING_BUFFER_SHMEM) {
+		ret = rb_shmem_account(rb, parent_rb);
+		if (ret)
+			goto err_free;
+	}
+
 	if (aux_nr_pages) {
 		ret = rb_alloc_aux(rb, event, pgoff, aux_nr_pages, 0, flags);
 		if (ret)
-			goto err_free;
+			goto err_unaccount;
 	}
 
 	if (flags & RING_BUFFER_SHMEM) {
 		ret = rb_shmem_setup(event, task, rb);
-		if (ret) {
-			rb_free_aux(rb);
-			goto err_free;
-		}
+		if (ret)
+			goto err_free_aux;
 
 		rb_toggle_paused(rb, true);
 	} else {
@@ -1146,8 +1218,19 @@ int rb_alloc_detached(struct perf_event *event, struct task_struct *task,
 
 	return 0;
 
+err_free_aux:
+	if (!(flags & RING_BUFFER_SHMEM))
+		rb_free_aux(rb);
+
+err_unaccount:
+	if (flags & RING_BUFFER_SHMEM)
+		rb_shmem_unaccount(rb);
+
 err_free:
-	rb_free(rb);
+	if (flags & RING_BUFFER_SHMEM)
+		kfree(rb);
+	else
+		rb_free(rb);
 
 	return ret;
 }
@@ -1161,6 +1244,9 @@ void rb_free_detached(struct ring_buffer *rb, struct perf_event *event)
 		rb_shmem_unmap(event);
 		shmem_truncate_range(rb->shmem_file->f_inode, 0, (loff_t)-1);
 		rb_put_kernel_pages(rb, true);
+		rb_shmem_unaccount(rb);
+	} else {
+		ring_buffer_unaccount(rb, false);
 	}
 
 	atomic_set(&rb->aux_mmap_count, 0);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 12/17] perf: Track pinned events per user
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (10 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 11/17] perf: Implement mlock accounting for shmem ring buffers Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-09-05 13:30 ` [RFC PATCH 13/17] perf: Re-inject shmem buffers after exec Alexander Shishkin
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

Maintain a per-user, cpu-indexed array of shmemfs-backed events, the
same way as the mlock accounting is done per user.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/sched/user.h  |  6 ++++
 kernel/events/core.c        | 14 ++++-----
 kernel/events/ring_buffer.c | 69 +++++++++++++++++++++++++++++++++++++--------
 kernel/user.c               |  1 +
 4 files changed, 71 insertions(+), 19 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 5d5415e129..bf10f95250 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -5,6 +5,7 @@
 #include <linux/atomic.h>
 
 struct key;
+struct perf_event;
 
 /*
  * Some day this will be a full-fledged user tracking system..
@@ -39,6 +40,11 @@ struct user_struct {
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL)
 	atomic_long_t locked_vm;
 #endif
+#ifdef CONFIG_PERF_EVENTS
+	atomic_long_t nr_pinnable_events;
+	struct mutex pinned_mutex;
+	struct perf_event ** __percpu pinned_events;
+#endif
 };
 
 extern int uids_sysfs_init(void);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1fed69d4ba..e00f1f6aaf 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -384,7 +384,6 @@ static atomic_t perf_sched_count;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
 static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 static DEFINE_PER_CPU(struct pmu_event_list, pmu_sb_events);
-static DEFINE_PER_CPU(struct perf_event *, shmem_events);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -2086,7 +2085,8 @@ enum pin_event_t {
 
 static enum pin_event_t pin_event_pages(struct perf_event *event)
 {
-	struct perf_event **pinned_event = this_cpu_ptr(&shmem_events);
+	struct user_struct *user = event->rb->mmap_user;
+	struct perf_event **pinned_event = this_cpu_ptr(user->pinned_events);
 	struct perf_event *old_event = *pinned_event;
 
 	if (old_event == event)
@@ -4281,13 +4281,14 @@ static void _free_event(struct perf_event *event)
 	unaccount_event(event);
 
 	if (event->attach_state & PERF_ATTACH_SHMEM) {
+		struct user_struct *user = event->rb->mmap_user;
 		struct perf_event_context *ctx = event->ctx;
 		int cpu;
 
 		atomic_set(&event->xpinned, 0);
 		for_each_possible_cpu(cpu) {
 			struct perf_event **pinned_event =
-				per_cpu_ptr(&shmem_events, cpu);
+				per_cpu_ptr(user->pinned_events, cpu);
 
 			cmpxchg(pinned_event, event, NULL);
 		}
@@ -9530,7 +9531,7 @@ perf_event_detach(struct perf_event *event, struct perf_event *parent_event,
 {
 	struct ring_buffer *parent_rb = parent_event ? parent_event->rb : NULL;
 	char *filename;
-	int err;
+	int err = -ENOMEM;
 
 	filename = kasprintf(GFP_KERNEL, "%s:%x.event",
 			     task ? "task" : "cpu",
@@ -9550,10 +9551,9 @@ perf_event_detach(struct perf_event *event, struct perf_event *parent_event,
 	if (err) {
 		tracefs_remove(event->dent);
 		event->dent = NULL;
-		return err;
 	}
 
-	return 0;
+	return err;
 }
 /*
  * Allocate and initialize a event structure
@@ -10290,7 +10290,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	}
 
 	if (detached) {
-		err = perf_event_detach(event, task, NULL);
+		err = perf_event_detach(event, NULL, task, NULL);
 		if (err)
 			goto err_context;
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 896d441642..8d37e4e591 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -563,6 +563,44 @@ void *perf_get_aux(struct perf_output_handle *handle)
 	return handle->rb->aux_priv;
 }
 
+static struct user_struct *get_users_pinned_events(void)
+{
+	struct user_struct *user = current_user(), *ret = NULL;
+
+	if (atomic_long_inc_not_zero(&user->nr_pinnable_events))
+		return user;
+
+	mutex_lock(&user->pinned_mutex);
+	if (!atomic_long_read(&user->nr_pinnable_events)) {
+		if (WARN_ON_ONCE(!!user->pinned_events))
+			goto unlock;
+
+		user->pinned_events = alloc_percpu(struct perf_event *);
+		if (!user->pinned_events) {
+			goto unlock;
+		} else {
+			atomic_long_inc(&user->nr_pinnable_events);
+			ret = get_current_user();
+		}
+	}
+
+unlock:
+	mutex_unlock(&user->pinned_mutex);
+
+	return ret;
+}
+
+static void put_users_pinned_events(struct user_struct *user)
+{
+	if (!atomic_long_dec_and_test(&user->nr_pinnable_events))
+		return;
+
+	mutex_lock(&user->pinned_mutex);
+	free_percpu(user->pinned_events);
+	user->pinned_events = NULL;
+	mutex_unlock(&user->pinned_mutex);
+}
+
 /*
  * Check if the current user can afford @nr_pages, considering the
  * perf_event_mlock sysctl and their mlock limit. If the former is exceeded,
@@ -574,11 +612,14 @@ static int __ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
                                  unsigned long nr_pages, unsigned long *locked)
 {
 	unsigned long total, limit, pinned;
+	struct user_struct *user;
 
 	if (!mm)
 		mm = rb->mmap_mapping;
 
-	rb->mmap_user = current_user();
+	user = get_users_pinned_events();
+	if (!user)
+		return -ENOMEM;
 
 	limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
 
@@ -587,10 +628,7 @@ static int __ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
 	 */
 	limit *= num_online_cpus();
 
-	total = atomic_long_read(&rb->mmap_user->locked_vm) + nr_pages;
-
-	free_uid(rb->mmap_user);
-	rb->mmap_user = NULL;
+	total = atomic_long_read(&user->locked_vm) + nr_pages;
 
 	pinned = 0;
 	if (total > limit) {
@@ -599,7 +637,7 @@ static int __ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
 		 * limit needs to be accounted to the consumer's mm.
 		 */
 		if (!mm)
-			return -EPERM;
+			goto err_put_user;
 
 		pinned = total - limit;
 
@@ -608,9 +646,8 @@ static int __ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
 		total = mm->pinned_vm + pinned;
 
 		if ((total > limit) && perf_paranoid_tracepoint_raw() &&
-		    !capable(CAP_IPC_LOCK)) {
-			return -EPERM;
-		}
+		    !capable(CAP_IPC_LOCK))
+			goto err_put_user;
 
 		*locked = pinned;
 		mm->pinned_vm += pinned;
@@ -619,10 +656,15 @@ static int __ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
 	if (!rb->mmap_mapping)
 		rb->mmap_mapping = mm;
 
-	rb->mmap_user = get_current_user();
-	atomic_long_add(nr_pages, &rb->mmap_user->locked_vm);
+	rb->mmap_user = user;
+	atomic_long_add(nr_pages, &user->locked_vm);
 
 	return 0;
+
+err_put_user:
+	put_users_pinned_events(user);
+
+	return -EPERM;
 }
 
 static int ring_buffer_account(struct ring_buffer *rb, struct mm_struct *mm,
@@ -657,7 +699,7 @@ void ring_buffer_unaccount(struct ring_buffer *rb, bool aux)
 	if (rb->mmap_mapping)
 		rb->mmap_mapping->pinned_vm -= pinned;
 
-	free_uid(rb->mmap_user);
+	put_users_pinned_events(rb->mmap_user);
 }
 
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
@@ -1124,6 +1166,7 @@ rb_shmem_account(struct ring_buffer *rb, struct ring_buffer *parent_rb)
 
 		rb->acct_refcount = parent_rb->acct_refcount;
 		atomic_inc(rb->acct_refcount);
+		rb->mmap_user = get_uid(parent_rb->mmap_user);
 
 		return 0;
 	}
@@ -1146,6 +1189,8 @@ rb_shmem_account(struct ring_buffer *rb, struct ring_buffer *parent_rb)
 
 static void rb_shmem_unaccount(struct ring_buffer *rb)
 {
+	free_uid(rb->mmap_user);
+
 	if (!atomic_dec_and_test(rb->acct_refcount)) {
 		rb->acct_refcount = NULL;
 		return;
diff --git a/kernel/user.c b/kernel/user.c
index 00281add65..e95a82d31d 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -185,6 +185,7 @@ struct user_struct *alloc_uid(kuid_t uid)
 
 		new->uid = uid;
 		atomic_set(&new->__count, 1);
+		mutex_init(&new->pinned_mutex);
 
 		/*
 		 * Before adding this, check whether we raced
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 13/17] perf: Re-inject shmem buffers after exec
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (11 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 12/17] perf: Track pinned events per user Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-09-05 13:30 ` [RFC PATCH 14/17] perf: Add ioctl(REATTACH) for detached events Alexander Shishkin
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

An exec will unmap everything, but we want our shmem buffers to persist.
This tells the page-pinning task work to re-mmap the event's ring buffer
after the task has exec'ed.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index e00f1f6aaf..f0b77b33b4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6482,6 +6482,29 @@ static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
 		perf_event_stop(event, 1);
 }
 
+static void perf_shmem_ctx_exec(struct perf_event_context *ctx)
+{
+	struct perf_event *event;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&ctx->lock, flags);
+
+	list_for_each_entry(event, &ctx->event_list, event_entry) {
+		if (event->attach_state & PERF_ATTACH_SHMEM) {
+			struct ring_buffer *rb;
+
+			/* called inside rcu read section */
+			rb = rcu_dereference(event->rb);
+			if (!rb)
+				continue;
+
+			rb->shmem_file_addr = 0;
+		}
+	}
+
+	raw_spin_unlock_irqrestore(&ctx->lock, flags);
+}
+
 void perf_event_exec(void)
 {
 	struct perf_event_context *ctx;
@@ -6497,6 +6520,7 @@ void perf_event_exec(void)
 
 		perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL,
 				   true);
+		perf_shmem_ctx_exec(ctx);
 	}
 	rcu_read_unlock();
 }
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 14/17] perf: Add ioctl(REATTACH) for detached events
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (12 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 13/17] perf: Re-inject shmem buffers after exec Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-10-03 14:50   ` Peter Zijlstra
  2017-09-05 13:30 ` [RFC PATCH 15/17] perf: Allow controlled non-root access to " Alexander Shishkin
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

This adds an ioctl command to demote a detached event to a 'normal' one
that gets destroyed when its file descriptor is closed. The file
descriptor can still be used to mmap the buffers, but it is not very
useful otherwise.
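
A minimal userspace sketch of the intended use; PERF_EVENT_IOC_REATTACH
comes from this patch, while the tracefs mount point and event file name
below are only illustrative:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/perf_event.h>

	#ifndef PERF_EVENT_IOC_REATTACH
	#define PERF_EVENT_IOC_REATTACH	_IO('$', 10)
	#endif

	int main(void)
	{
		/* illustrative path: <tracefs>/perf/task:<hex>.event */
		int fd = open("/sys/kernel/tracing/perf/task:1a2b.event", O_RDWR);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* From here on, closing fd tears the event down as usual. */
		if (ioctl(fd, PERF_EVENT_IOC_REATTACH) < 0)
			perror("ioctl(REATTACH)");

		close(fd);
		return 0;
	}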

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/uapi/linux/perf_event.h |  1 +
 kernel/events/core.c            | 32 +++++++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4cdd4fab9d..ae54bd496d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -435,6 +435,7 @@ struct perf_event_attr {
 #define PERF_EVENT_IOC_ID		_IOR('$', 7, __u64 *)
 #define PERF_EVENT_IOC_SET_BPF		_IOW('$', 8, __u32)
 #define PERF_EVENT_IOC_PAUSE_OUTPUT	_IOW('$', 9, __u32)
+#define PERF_EVENT_IOC_REATTACH		_IO ('$', 10)
 
 enum perf_event_ioc_flags {
 	PERF_IOC_FLAG_GROUP		= 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f0b77b33b4..fbee221d19 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4532,7 +4532,19 @@ EXPORT_SYMBOL_GPL(perf_event_release_kernel);
  */
 static int perf_release(struct inode *inode, struct file *file)
 {
-	perf_event_release_kernel(file->private_data);
+	struct perf_event *event = file->private_data;
+
+	/*
+	 * For a DETACHED event, perf_release() can't have the last reference,
+	 * because we grabbed one extra in the sys_perf_event_open, IOW it is
+	 * always put_event(). In order for it to be the last reference, we'd
+	 * first need to ioctl(REATTACH) on this event, which would drop the
+	 * PERF_ATTACH_DETACHED attach state.
+	 */
+	if (event->attach_state & PERF_ATTACH_DETACHED)
+		put_event(event);
+	else
+		perf_event_release_kernel(file->private_data);
 	return 0;
 }
 
@@ -4885,6 +4897,11 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
 	void (*func)(struct perf_event *);
 	u32 flags = arg;
 
+	if (event->attach_state & PERF_ATTACH_DETACHED &&
+	    cmd != PERF_EVENT_IOC_REATTACH &&
+	    cmd != PERF_EVENT_IOC_ID)
+		return -EINVAL;
+
 	switch (cmd) {
 	case PERF_EVENT_IOC_ENABLE:
 		func = _perf_event_enable;
@@ -4948,6 +4965,19 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
 		rcu_read_unlock();
 		return 0;
 	}
+	case PERF_EVENT_IOC_REATTACH:
+		/*
+		 * DETACHED state is serialized on ctx::mutex
+		 */
+		if (!is_detached_event(event))
+			return -EINVAL;
+
+		event->attach_state &= ~PERF_ATTACH_DETACHED;
+		tracefs_remove(event->dent);
+		event->dent = NULL;
+		put_event(event); /* can't be last */
+
+		return 0;
 	default:
 		return -ENOTTY;
 	}
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 15/17] perf: Allow controlled non-root access to detached events
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (13 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 14/17] perf: Add ioctl(REATTACH) for detached events Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-10-03 14:53   ` Peter Zijlstra
  2017-09-05 13:30 ` [RFC PATCH 16/17] perf/x86/intel/pt: Add PMU info Alexander Shishkin
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

The user who created the event should also be able to open its
corresponding file in tracefs and/or remove it.
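
A short sketch of the intended permission model from a non-owning user's
point of view (the path is illustrative):

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/sys/kernel/tracing/perf/task:1a2b.event", O_RDONLY);

		if (fd < 0 && errno == EACCES) {
			/* not the creating user and no CAP_SYS_ADMIN */
			fprintf(stderr, "not our event, access denied\n");
			return 1;
		}

		if (fd >= 0)
			close(fd);

		return 0;
	}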

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index fbee221d19..802c0862a9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5579,7 +5579,7 @@ static int perf_fasync(int fd, struct file *filp, int on)
 static int perf_open(struct inode *inode, struct file *file)
 {
 	struct perf_event *event = inode->i_private;
-	int ret;
+	int ret = 0;
 
 	if (WARN_ON_ONCE(!event))
 		return -EINVAL;
@@ -5587,7 +5587,13 @@ static int perf_open(struct inode *inode, struct file *file)
 	if (!atomic_long_inc_not_zero(&event->refcount))
 		return -ENOENT;
 
-	ret = simple_open(inode, file);
+	/* event's user is stable while we're holding the reference */
+	if (event->rb->mmap_user != current_user() &&
+	    !capable(CAP_SYS_ADMIN))
+		ret = -EACCES;
+
+	if (!ret)
+		ret = simple_open(inode, file);
 	if (ret)
 		put_event(event);
 
@@ -9593,7 +9599,7 @@ perf_event_detach(struct perf_event *event, struct perf_event *parent_event,
 	if (!filename)
 		return -ENOMEM;
 
-	event->dent = tracefs_create_file(filename, 0600,
+	event->dent = tracefs_create_file(filename, 0666,
 					  perf_tracefs_dir,
 					  event, &perf_fops);
 	kfree(filename);
@@ -11521,6 +11527,7 @@ static int perf_instance_unlink(const char *name)
 {
 	struct perf_event *event;
 	struct dentry *dent;
+	int ret = 0;
 
 	dent = lookup_one_len_unlocked(name, perf_tracefs_dir, strlen(name));
 	if (!dent)
@@ -11530,6 +11537,18 @@ static int perf_instance_unlink(const char *name)
 	if (!event)
 		return -EINVAL;
 
+	if (!atomic_long_inc_not_zero(&event->refcount))
+		return 0;
+
+	/* event's user is stable while we're holding the reference */
+	if (event->rb->mmap_user != current_user() &&
+	    !capable(CAP_SYS_ADMIN))
+		ret = -EACCES;
+	put_event(event);
+
+	if (ret)
+		return ret;
+
 	if (!(event->attach_state & PERF_ATTACH_CONTEXT))
 		return -EBUSY;
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 16/17] perf/x86/intel/pt: Add PMU info
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (14 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 15/17] perf: Allow controlled non-root access to " Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-09-05 13:30 ` [RFC PATCH 17/17] perf/x86/intel/bts: " Alexander Shishkin
  2017-09-06 16:24 ` [RFC PATCH 00/17] perf: Detached events Borislav Petkov
  17 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

Add PMU-specific data structure with family/model/stepping and clock
information required by the decoder.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/events/intel/pt.c | 23 ++++++++++++++++++++++-
 arch/x86/events/intel/pt.h | 11 +++++++++++
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 3b993942a0..053b96f491 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -35,6 +35,8 @@
 
 static DEFINE_PER_CPU(struct pt, pt_ctx);
 
+#define PMU_NAME "intel_pt"
+
 static struct pt_pmu pt_pmu;
 
 /*
@@ -271,6 +273,22 @@ static int __init pt_pmu_hw_init(void)
 	return ret;
 }
 
+static struct intel_pt_pmu_info pt_pmu_info;
+
+static void pt_pmu_info_setup(void)
+{
+	BUILD_BUG_ON(sizeof(pt_pmu_info) +
+	             sizeof(struct perf_event_mmap_page) > PAGE_SIZE);
+	pt_pmu_info.pi.note_size  = sizeof(pt_pmu_info.pi);
+	pt_pmu_info.pi.pmu_descsz = sizeof(pt_pmu_info) - pt_pmu_info.pi.note_size;
+	pt_pmu_info.x86_family    = boot_cpu_data.x86;
+	pt_pmu_info.x86_model     = boot_cpu_data.x86_model;
+	pt_pmu_info.x86_step      = boot_cpu_data.x86_mask;
+	pt_pmu_info.x86_tsc_max_nonturbo_ratio = pt_pmu.max_nonturbo_ratio;
+	pt_pmu_info.x86_tsc_to_art_numerator   = pt_pmu.tsc_art_num;
+	pt_pmu_info.x86_tsc_to_art_denominator = pt_pmu.tsc_art_den;
+}
+
 #define RTIT_CTL_CYC_PSB (RTIT_CTL_CYCLEACC	| \
 			  RTIT_CTL_CYC_THRESH	| \
 			  RTIT_CTL_PSB_FREQ)
@@ -1512,6 +1530,8 @@ static __init int pt_init(void)
 		return -ENODEV;
 	}
 
+	pt_pmu_info_setup();
+
 	if (!pt_cap_get(PT_CAP_topa_multiple_entries))
 		pt_pmu.pmu.capabilities =
 			PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF;
@@ -1531,8 +1551,9 @@ static __init int pt_init(void)
 	pt_pmu.pmu.addr_filters_validate = pt_event_addr_filters_validate;
 	pt_pmu.pmu.nr_addr_filters       =
 		pt_cap_get(PT_CAP_num_address_ranges);
+	pt_pmu.pmu.pmu_info		 = &pt_pmu_info.pi;
 
-	ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);
+	ret = perf_pmu_register(&pt_pmu.pmu, PMU_NAME, -1);
 
 	return ret;
 }
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index 25fa9710f4..fc19080ca3 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -189,4 +189,15 @@ struct pt {
 	int			vmx_on;
 };
 
+struct intel_pt_pmu_info {
+	struct pmu_info		pi;
+	u8			x86_family;
+	u8			x86_model;
+	u8			x86_step;
+	u8			x86_tsc_max_nonturbo_ratio;
+	u32			x86_tsc_to_art_numerator;
+	u32			x86_tsc_to_art_denominator;
+	u32			__reserved_0;
+};
+
 #endif /* __INTEL_PT_H__ */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH 17/17] perf/x86/intel/bts: Add PMU info
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (15 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 16/17] perf/x86/intel/pt: Add PMU info Alexander Shishkin
@ 2017-09-05 13:30 ` Alexander Shishkin
  2017-09-06 16:24 ` [RFC PATCH 00/17] perf: Detached events Borislav Petkov
  17 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-05 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric, Alexander Shishkin

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/events/intel/bts.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 16076eb346..ce1dac7115 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -21,13 +21,22 @@
 #include <linux/slab.h>
 #include <linux/debugfs.h>
 #include <linux/device.h>
-#include <linux/coredump.h>
 
 #include <asm-generic/sizes.h>
 #include <asm/perf_event.h>
 
 #include "../perf_event.h"
 
+#define PMU_NAME "intel_bts"
+
+static struct intel_bts_pmu_info {
+	struct pmu_info		pi;
+	u8			x86_family;
+	u8			x86_model;
+	u8			x86_step;
+	u8			__reserved_0[5];
+} bts_pmu_info;
+
 struct bts_ctx {
 	struct perf_output_handle	handle;
 	struct debug_store		ds_back;
@@ -582,6 +591,12 @@ static __init int bts_init(void)
 	if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
 		return -ENODEV;
 
+	bts_pmu_info.pi.note_size  = sizeof(bts_pmu_info.pi);
+	bts_pmu_info.pi.pmu_descsz = sizeof(bts_pmu_info) - bts_pmu_info.pi.note_size;
+	bts_pmu_info.x86_family    = boot_cpu_data.x86;
+	bts_pmu_info.x86_model	    = boot_cpu_data.x86_model;
+	bts_pmu_info.x86_step	    = boot_cpu_data.x86_mask;
+
 	bts_pmu.capabilities	= PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE |
 				  PERF_PMU_CAP_EXCLUSIVE;
 	bts_pmu.task_ctx_nr	= perf_sw_context;
@@ -593,7 +608,8 @@ static __init int bts_init(void)
 	bts_pmu.read		= bts_event_read;
 	bts_pmu.setup_aux	= bts_buffer_setup_aux;
 	bts_pmu.free_aux	= bts_buffer_free_aux;
+	bts_pmu.pmu_info	= &bts_pmu_info.pi;
 
-	return perf_pmu_register(&bts_pmu, "intel_bts", -1);
+	return perf_pmu_register(&bts_pmu, PMU_NAME, -1);
 }
 arch_initcall(bts_init);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 00/17] perf: Detached events
  2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
                   ` (16 preceding siblings ...)
  2017-09-05 13:30 ` [RFC PATCH 17/17] perf/x86/intel/bts: " Alexander Shishkin
@ 2017-09-06 16:24 ` Borislav Petkov
  2017-09-13 11:54   ` Alexander Shishkin
  17 siblings, 1 reply; 34+ messages in thread
From: Borislav Petkov @ 2017-09-06 16:24 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, acme, kirill.shutemov, rric

On Tue, Sep 05, 2017 at 04:30:09PM +0300, Alexander Shishkin wrote:
> Detached events: a new flag to the perf syscall makes a 'detached' event,
> which exists after its file descriptor is released. Not all detached events
> are per-thread AUX events: this tries to take into account the need for
> system-wide persistent events too.

Nice, thanks!

> (2) Need to be able to kill those events, so they need to be accessible
> after they are created.
> Event files: detached events exist as files in tracefs (at the moment), can
> be opened/mmaped/read/removed.

I guess I'll see when I continue reading but I remember us doing ioctls
on the event fd.

> (6) Ring buffer memory accounting needs to take this new arrangement into
> account: one user can use up at most NR_CPUS * buffer_size memory at any
> given point in time.
> Only account the first such event and undo the accounting when the last
> event is gone.

... and I guess we probably shouldn't allow the user to create too many
events and shoot herself in the OOM-foot.

> (7) We'll also need to supply all the things that the [PT] decoder normally
> finds out via sysfs attributes, like clock ratios, capabilities, etc so that
> it also finds its way into the core dump file.
> "PMU info" structure is appended to the user page.
> 
> I've also hack the perf tool to support all this, all these things can be
> found at [1]. I'm not posting the tooling patches though, them being
> thoroughly ugly and proof-of-concept. In short, perf record will create
> detached events with '--detached' and afterwards will open detached events
> via their path in tracefs.

Sounds nice. I'd need to test all that so I can create detached RAS
events (which are tracepoints) with it.

Thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 01/17] perf: Allow mmapping only user page
  2017-09-05 13:30 ` [RFC PATCH 01/17] perf: Allow mmapping only user page Alexander Shishkin
@ 2017-09-06 16:28   ` Borislav Petkov
  2017-09-13 11:35     ` Alexander Shishkin
  0 siblings, 1 reply; 34+ messages in thread
From: Borislav Petkov @ 2017-09-06 16:28 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, acme, kirill.shutemov, rric

On Tue, Sep 05, 2017 at 04:30:10PM +0300, Alexander Shishkin wrote:
> The 'user page' contains offsets and sizes of data and aux areas of the
> ring buffer. If a user wants to mmap a pre-existing buffer, they need to
> know these in order to issue mmap()s with correct offsets and sizes.

Ok, stupid question: shouldn't this be a properly defined interface
instead of allowing userspace to poke inside the user page? Or are we
prepared to handle any changes in the layout of that user page and there
won't be any userspace crying because of it?

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 01/17] perf: Allow mmapping only user page
  2017-09-06 16:28   ` Borislav Petkov
@ 2017-09-13 11:35     ` Alexander Shishkin
  2017-09-13 12:58       ` Borislav Petkov
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-13 11:35 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, acme, kirill.shutemov, rric

Borislav Petkov <bp@alien8.de> writes:

> On Tue, Sep 05, 2017 at 04:30:10PM +0300, Alexander Shishkin wrote:
>> The 'user page' contains offsets and sizes of data and aux areas of the
>> ring buffer. If a user wants to mmap a pre-existing buffer, they need to
>> know these in order to issue mmap()s with correct offsets and sizes.
>
> Ok, stupid question: shouldn't this be a properly defined interface
> instead of allowing userspace to poke inside the user page? Or are we
> prepared to handle any changes in the layout of that user page and there
> won't be any userspace crying because of it?

Well, it is a 'defined' interface: there's the 'struct
perf_event_mmap_page' with versioning and whatnot, which is used for all
the ring buffer metainformation. Or am I misunderstanding your question?
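
For instance, with this series a consumer that's only handed the event
file could do roughly the following (just a sketch of the intent; the
path comes from wherever the tool found the event, and error handling
plus actual use of the mappings are left out):

  #include <linux/perf_event.h>
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>

  static void map_detached(const char *path)	/* the file in tracefs */
  {
          int fd = open(path, O_RDWR);
          long psz = sysconf(_SC_PAGESIZE);
          struct perf_event_mmap_page *up;
          void *data, *aux;

          /* map the user page alone first -- what patch 01 allows ... */
          up = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

          /* ... then use the offsets/sizes it advertises */
          data = mmap(NULL, up->data_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, up->data_offset);
          aux  = mmap(NULL, up->aux_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, up->aux_offset);
  }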

Thanks,
--
Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 00/17] perf: Detached events
  2017-09-06 16:24 ` [RFC PATCH 00/17] perf: Detached events Borislav Petkov
@ 2017-09-13 11:54   ` Alexander Shishkin
  0 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-09-13 11:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, acme, kirill.shutemov, rric

Borislav Petkov <bp@alien8.de> writes:

> On Tue, Sep 05, 2017 at 04:30:09PM +0300, Alexander Shishkin wrote:
>> Detached events: a new flag to the perf syscall makes a 'detached' event,
>> which exists after its file descriptor is released. Not all detached events
>> are per-thread AUX events: this tries to take into account the need for
>> system-wide persistent events too.
>
> Nice, thanks!

Forgot to mention that I did hack the tracepoint support into the
tooling as well to make sure it's a workable idea.

>> (2) Need to be able to kill those events, so they need to be accessible
>> after they are created.
>> Event files: detached events exist as files in tracefs (at the moment), can
>> be opened/mmaped/read/removed.
>
> I guess I'll see when I continue reading but I remember us doing ioctls
> on the event fd.

Iirc that was for re-attaching to the event to make it 'normal' before
closing.

>> (6) Ring buffer memory accounting needs to take this new arrangement into
>> account: one user can use up at most NR_CPUS * buffer_size memory at any
>> given point in time.
>> Only account the first such event and undo the accounting when the last
>> event is gone.
>
> ... and I guess we probably shouldn't allow the user to create too many
> events and shoot herself in the OOM-foot.

Well, they are still limited by the RLIMIT_MEMLOCK and perf_event_mlock
sysctl for the total amount of memory that can be pinned for the ring
buffers at any given time, so that should be fine.
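
(Back-of-the-envelope, with made-up numbers: with the default-ish
perf_event_mlock_kb and N online CPUs, the worst case one user should be
able to keep pinned at any instant is roughly

  long mlock_kb = 516;	/* kernel.perf_event_mlock_kb, illustrative */
  long ncpus    = sysconf(_SC_NPROCESSORS_ONLN);
  long worst_kb = mlock_kb * ncpus;	/* plus whatever RLIMIT_MEMLOCK allows */

and anything beyond that should fail the allocation rather than pile up.)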

>> (7) We'll also need to supply all the things that the [PT] decoder normally
>> finds out via sysfs attributes, like clock ratios, capabilities, etc so that
>> it also finds its way into the core dump file.
>> "PMU info" structure is appended to the user page.
>> 
>> I've also hack the perf tool to support all this, all these things can be
>> found at [1]. I'm not posting the tooling patches though, them being
>> thoroughly ugly and proof-of-concept. In short, perf record will create
>> detached events with '--detached' and afterwards will open detached events
>> via their path in tracefs.
>
> Sounds nice. I'd need to test all that just so I can be able to create
> detached RAS events (which are tracepoints) with it.

Thanks, let me know how it goes.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 01/17] perf: Allow mmapping only user page
  2017-09-13 11:35     ` Alexander Shishkin
@ 2017-09-13 12:58       ` Borislav Petkov
  0 siblings, 0 replies; 34+ messages in thread
From: Borislav Petkov @ 2017-09-13 12:58 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, acme, kirill.shutemov, rric

On Wed, Sep 13, 2017 at 02:35:42PM +0300, Alexander Shishkin wrote:
> Well, it is a 'defined' interface: there's the 'struct
> perf_event_mmap_page' with versioning and whatnot, which is used for all
> the ring buffer metainformation. Or am I misunderstanding your question?

No, you're not. Looking at perf_event_mmap_page, sounds like my concerns are
addressed. :)

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 05/17] perf: Introduce detached events
  2017-09-05 13:30 ` [RFC PATCH 05/17] perf: Introduce detached events Alexander Shishkin
@ 2017-10-03 14:34   ` Peter Zijlstra
  2017-10-06 11:23     ` Alexander Shishkin
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2017-10-03 14:34 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov, Borislav Petkov, rric

On Tue, Sep 05, 2017 at 04:30:14PM +0300, Alexander Shishkin wrote:
> There are use cases where it is desirable to have perf events without the
> userspace tool running in the background to keep them alive, and instead
> to only collect the data when it is needed, for example when an MCE event
> is triggered.
> 
> This patch adds a new flag to the perf_event_open() syscall that allows
> creating such events. Once created, the file descriptor can be closed
> and the event continues to exist on its own. To allow access to this
> event, a file is created in the tracefs, which the user can open.
> 
> Finally, when it is no longer needed, it can be destroyed by unlinking
> the file.
> 

> @@ -9387,6 +9416,27 @@ static void account_event(struct perf_event *event)
>  	account_pmu_sb_event(event);
>  }
>  
> +static int perf_event_detach(struct perf_event *event, struct task_struct *task,
> +			     struct mm_struct *mm)
> +{
> +	char *filename;
> +
> +	filename = kasprintf(GFP_KERNEL, "%s:%x.event",
> +			     task ? "task" : "cpu",
> +			     hash_64((u64)event, PERF_TRACEFS_HASH_BITS));
> +	if (!filename)
> +		return -ENOMEM;
> +
> +	event->dent = tracefs_create_file(filename, 0600,
> +					  perf_tracefs_dir,
> +					  event, &perf_fops);
> +	kfree(filename);
> +
> +	if (!event->dent)
> +		return -ENOMEM;
> +
> +	return 0;
> +}

So I'm not opposed to the idea of creating events that live independently
of file descriptors. And stuffing them in a filesystem makes sense.
However, I'm not entirely convinced by the details.

The above has a number of problems:

 - there's a filesystem race; two concurrent syscalls can try and create
   the same file. In that case the error most certainly is not -ENOMEM.

 - there's a hash collision, similar issue.

 - there's some asymmetry in the create/destroy; that is you create the
   file with sys_perf_event_open() and remove it with unlink().

 - the actual name is very opaque and hard to use; how would a tool find
   the right event to open?


Would it instead make sense to allow the user to creat() their own files
in this filesystem (with whatever descriptive name they need) and then
pass that fd like:

  sys_perf_event_open(.group_fd=fd, .flags=PERF_FLAG_FD_DETACH);

or something to associate the file with the event. Of course, that makes
it very hard to create detached cgroup events :/
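
Something like this, I mean (flag and path names invented, only to
illustrate the flow):

  int file = open("/sys/kernel/tracing/perf/my-ras-event",
                  O_CREAT | O_RDWR, 0600);	/* user picks the name */

  int efd = syscall(__NR_perf_event_open, &attr, pid, cpu,
                    file /* group_fd reused to pass the file */,
                    PERF_FLAG_FD_DETACH);
  close(efd);					/* the event lives on */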

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 06/17] perf: Add buffers to the detached events
  2017-09-05 13:30 ` [RFC PATCH 06/17] perf: Add buffers to the " Alexander Shishkin
@ 2017-10-03 14:36   ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2017-10-03 14:36 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov, Borislav Petkov, rric

On Tue, Sep 05, 2017 at 04:30:15PM +0300, Alexander Shishkin wrote:
> @@ -415,6 +416,8 @@ struct perf_event_attr {
>  	__u32	aux_watermark;
>  	__u16	sample_max_stack;
>  	__u16	__reserved_2;	/* align to __u64 */
> +	__u32	detached_nr_pages;
> +	__u32	detached_aux_nr_pages;
>  };

Not sure the naming makes sense; I don't see why this would be limited
to detached events. That is, what would stop someone from pre-allocating
buffers on 'regular' events if they wanted to?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 07/17] perf: Add pmu_info to user page
  2017-09-05 13:30 ` [RFC PATCH 07/17] perf: Add pmu_info to user page Alexander Shishkin
@ 2017-10-03 14:40   ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2017-10-03 14:40 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov, Borislav Petkov, rric

On Tue, Sep 05, 2017 at 04:30:16PM +0300, Alexander Shishkin wrote:
> Allow PMUs to supply additional static information that may be required
> by their decoders. Most of what the Intel PT driver exports as capability
> attributes (timing packet frequencies, frequency ratios, etc.), its decoder
> needs in order to decode its binary stream correctly. However, when
> decoding Intel PT stream from a core dump, we can't rely on the sysfs
> attributes, so we need to pack this information into the perf buffer,
> so that the resulting core dump is self-contained.
> 
> In order to do this, we append a PMU-specific structure to the user
> page. Such structures will include size, for versioning.
> 

> @@ -508,6 +513,18 @@ struct perf_addr_filters_head {
>  	unsigned int		nr_file_filters;
>  };
>  
> +struct pmu_info {
> +	/*
> +	 * Size of this structure, for versioning.
> +	 */
> +	u32	note_size;
> +
> +	/*
> +	 * Size of the container structure, not including this one
> +	 */
> +	u32	pmu_descsz;
> +};
> +
>  /**
>   * enum perf_event_active_state - the states of a event
>   */
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 3d64d9ea80..4cdd4fab9d 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -599,6 +599,16 @@ struct perf_event_mmap_page {
>  	__u64	aux_tail;
>  	__u64	aux_offset;
>  	__u64	aux_size;
> +
> +	/*
> +	 * PMU data: static info that (AUX) decoder wants to know in order to
> +	 * decode correctly:
> +	 *
> +	 *   pmu_offset >= sizeof(struct perf_event_mmap_page)
> +	 *   pmu_offset + pmu_size <= PAGE_SIZE
> +	 */
> +	__u64	pmu_offset;
> +	__u64	pmu_size;
>  };

Why like this? Why not dump the data as part of
PERF_RECORD_ITRACE_START/PERF_RECORD_AUX ?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 08/17] perf: Allow inheritance for detached events
  2017-09-05 13:30 ` [RFC PATCH 08/17] perf: Allow inheritance for detached events Alexander Shishkin
@ 2017-10-03 14:42   ` Peter Zijlstra
  2017-10-06 11:40     ` Alexander Shishkin
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2017-10-03 14:42 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov, Borislav Petkov, rric

On Tue, Sep 05, 2017 at 04:30:17PM +0300, Alexander Shishkin wrote:
> This enables inheritance for detached events. Unlike traditional events,
> these do not have parents: inheritance produces a new independent event
> with the same attribute. If the 'parent' event has a ring buffer, so will
> the new event. Considering the mlock accounting, this buffer allocation
> may fail, which in turn will fail the parent's fork, something to be
> aware of.
> 
> This also effectively disables context cloning, because unlike the
> traditional events, these will each have its own ring buffer and
> context switch optimization can't work.

Right, so this thing is icky... as you know. More naming issues though,
what will you go and call those files?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 09/17] perf: Use shmemfs pages for userspace-only per-thread detached events
  2017-09-05 13:30 ` [RFC PATCH 09/17] perf: Use shmemfs pages for userspace-only per-thread " Alexander Shishkin
@ 2017-10-03 14:43   ` Peter Zijlstra
  2017-10-06 11:52     ` Alexander Shishkin
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2017-10-03 14:43 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov, Borislav Petkov, rric

On Tue, Sep 05, 2017 at 04:30:18PM +0300, Alexander Shishkin wrote:
> In order to work around the problem of using up mlocked memory for the
> detached events, we can pin the ring buffer pages only while they are
> in use (that is, the event is ACTIVE), and unpin them for the rest of
> the time. When not pinned in, these pages can be swapped out. This way,
> one user can have at most mlock_limit*nr_cpus kB of memory pinned at
> any given moment, however many events they actually have.
> 
> This enforces a constraint: pinning and unpinning may sleep and thus
> can't be done in the event scheduling path. Instead, we use a task
> work to do this, which limits this pattern to userspace-only events.
> Also, since one userspace thread only needs one buffer (for whatever
> CPU it's running on at any given moment), we only do this for per-thread
> events.
> 
> The source for such swappable pages is shmemfs. This patch allows
> allocating perf ring buffer pages from an shmemfs file if the above
> constraints are met.

Right, so why still allow that previous icky thing? What cases do we
need that for?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 14/17] perf: Add ioctl(REATTACH) for detached events
  2017-09-05 13:30 ` [RFC PATCH 14/17] perf: Add ioctl(REATTACH) for detached events Alexander Shishkin
@ 2017-10-03 14:50   ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2017-10-03 14:50 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov, Borislav Petkov, rric

On Tue, Sep 05, 2017 at 04:30:23PM +0300, Alexander Shishkin wrote:
> This adds an ioctl command to demote a detached event to a 'normal' one
> that gets destroyed when its file descriptor is closed. It can still be
> used to mmap the buffers, but not very useful otherwise.

why not simply use the fd obtained from open() on our special
filesystem?

If you open and then unlink, you lose the 'detached' state and the
filedesc is the only life-line.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 15/17] perf: Allow controlled non-root access to detached events
  2017-09-05 13:30 ` [RFC PATCH 15/17] perf: Allow controlled non-root access to " Alexander Shishkin
@ 2017-10-03 14:53   ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2017-10-03 14:53 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov, Borislav Petkov, rric

On Tue, Sep 05, 2017 at 04:30:24PM +0300, Alexander Shishkin wrote:
> @@ -5587,7 +5587,13 @@ static int perf_open(struct inode *inode, struct file *file)
>  	if (!atomic_long_inc_not_zero(&event->refcount))
>  		return -ENOENT;
>  
> -	ret = simple_open(inode, file);
> +	/* event's user is stable while we're holding the reference */
> +	if (event->rb->mmap_user != current_user() &&
> +	    !capable(CAP_SYS_ADMIN))
> +		ret = -EACCES;
> +
> +	if (!ret)
> +		ret = simple_open(inode, file);
>  	if (ret)
>  		put_event(event);
>  

> @@ -11530,6 +11537,18 @@ static int perf_instance_unlink(const char *name)
>  	if (!event)
>  		return -EINVAL;
>  
> +	if (!atomic_long_inc_not_zero(&event->refcount))
> +		return 0;
> +
> +	/* event's user is stable while we're holding the reference */
> +	if (event->rb->mmap_user != current_user() &&
> +	    !capable(CAP_SYS_ADMIN))
> +		ret = -EACCES;
> +	put_event(event);
> +
> +	if (ret)
> +		return ret;
> +
>  	if (!(event->attach_state & PERF_ATTACH_CONTEXT))
>  		return -EBUSY;
>  

Why aren't we using regular file permissions for this?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 05/17] perf: Introduce detached events
  2017-10-03 14:34   ` Peter Zijlstra
@ 2017-10-06 11:23     ` Alexander Shishkin
  0 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-10-06 11:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	rric, alexander.shishkin

Peter Zijlstra <peterz@infradead.org> writes:

> So I'm not opposed to the idea of creating events that live independently
> of file descriptors. And stuffing them in a filesystem makes sense.
> However, I'm not entirely convinced by the details.
>
> The above has a number of problems:
>
>  - there's a filesystem race; two concurrent syscalls can try and create
>    the same file. In that case the error most certainly is not -ENOMEM.

Indeed.

>  - there's a hash collision, similar issue.
>
>  - there's some asymmetry in the create/destroy; that is you create the
>    file with sys_perf_event_open() and remove it with unlink().

There is also an ioctl() to turn it into a normal event fd that can then
be closed.
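
E.g. (sketch only):

  fd = open(path_in_tracefs, O_RDWR);
  ioctl(fd, PERF_EVENT_IOC_REATTACH, 0);
  close(fd);	/* now the last reference: the event is gone */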

>  - the actual name is very opaque and hard to use; how would a tool find
>    the right event to open?

They can readlink("/proc/self/fd/$fd"), something that I hacked into the
perf tool as well, although, truth be told, I didn't actually need it for
anything, partly because it's not a useful name. One use case that I
could think of would be a task that's inherited a detached event wanting
to get rid of it. They can scan their /proc/$pid/maps, find the vma by
name and use that to locate the file.
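
(I.e. something along these lines, just to show the idea:

  char lnk[64], path[PATH_MAX];

  snprintf(lnk, sizeof(lnk), "/proc/self/fd/%d", fd);
  ssize_t n = readlink(lnk, path, sizeof(path) - 1);
  if (n > 0)
          path[n] = '\0';	/* ".../perf/task:<hash>.event" with this series */
)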

> Would it instead make sense to allow the user to creat() their own files
> in this filesystem (with whatever descriptive name they need) and then
> pass that fd like:
>
>   sys_perf_event_open(.group_fd=fd, .flags=PERF_FLAG_FD_DETACH);
>
> or something to associate the file with the event. Of course, that makes
> it very hard to create detached cgroup events :/

Yes, I like the idea of moving the burden of naming to userspace,
but then we have a problem with inheritance, which would still produce
new events without the user's input.

Maybe use a directory for the 'parent' event? Then the above would still
work.
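
Something like (layout invented, just to illustrate):

  perf/
      my-trace/			<- created/named by the user
          task:1f3a.event	<- per-child events added on inherit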

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 08/17] perf: Allow inheritance for detached events
  2017-10-03 14:42   ` Peter Zijlstra
@ 2017-10-06 11:40     ` Alexander Shishkin
  0 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-10-06 11:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov, Borislav Petkov, rric

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Sep 05, 2017 at 04:30:17PM +0300, Alexander Shishkin wrote:
>> This enables inheritance for detached events. Unlike traditional events,
>> these do not have parents: inheritance produces a new independent event
>> with the same attribute. If the 'parent' event has a ring buffer, so will
>> the new event. Considering the mlock accounting, this buffer allocation
>> may fail, which in turn will fail the parent's fork, something to be
>> aware of.
>> 
>> This also effectively disables context cloning, because unlike the
>> traditional events, these will each have its own ring buffer and
>> context switch optimization can't work.
>
> Right, so this thing is icky... as you know. More naming issues though,
> what will you go and call those files.

Yes. The failing-the-fork ickiness is dealt with later on in 11/17.
But true about the naming.

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 09/17] perf: Use shmemfs pages for userspace-only per-thread detached events
  2017-10-03 14:43   ` Peter Zijlstra
@ 2017-10-06 11:52     ` Alexander Shishkin
  0 siblings, 0 replies; 34+ messages in thread
From: Alexander Shishkin @ 2017-10-06 11:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, acme, kirill.shutemov, Borislav Petkov, rric

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Sep 05, 2017 at 04:30:18PM +0300, Alexander Shishkin wrote:
>> In order to work around the problem of using up mlocked memory for the
>> detached events, we can pin the ring buffer pages only while they are
>> in use (that is, the event is ACTIVE), and unpin them for the rest of
>> the time. When not pinned in, these pages can be swapped out. This way,
>> one user can have at most mlock_limit*nr_cpus kB of memory pinned at
>> any given moment, however many events they actually have.
>> 
>> This enforces a constraint: pinning and unpinning may sleep and thus
>> can't be done in the event scheduling path. Instead, we use a task
>> work to do this, which limits this pattern to userspace-only events.
>> Also, since one userspace thread only needs one buffer (for whatever
>> CPU it's running on at any given moment), we only do this for per-thread
>> events.
>> 
>> The source for such swappable pages is shmemfs. This patch allows
>> allocating perf ring buffer pages from an shmemfs file if the above
>> constraints are met.
>
> Right, so why still allow that previous icky thing? What cases do we
> need that for?

8/17..12/17 are really one patch split into smaller chunks. The first
one does the icky thing and then we get to what we actually want.

The idea is that you won't be able to enable inheritance for detached
events unless they are shmem-backed.
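
I.e. roughly (a sketch of the intent, not the literal check in the
series; rb_is_shmem() is a made-up name):

  if (event->attr.inherit && is_detached_event(event) &&
      !rb_is_shmem(event->rb))
          return -EINVAL;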

Regards,
--
Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 03/17] tracefs: De-globalize instances' callbacks
  2017-09-05 13:30 ` [RFC PATCH 03/17] tracefs: De-globalize instances' callbacks Alexander Shishkin
@ 2018-01-24 18:54   ` Steven Rostedt
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2018-01-24 18:54 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, acme, kirill.shutemov,
	Borislav Petkov, rric


I just stumbled across this patch (and the following one). What purpose
would this have? The "instances" directory is used to make multiple
buffers. Why would we have more than one?

-- Steve


On Tue,  5 Sep 2017 16:30:12 +0300
Alexander Shishkin <alexander.shishkin@linux.intel.com> wrote:

> Currently, tracefs has exactly one special 'instances' subdirectory, where
> the caller can have their own .mkdir/.rmdir callbacks, which allow the
> caller to handle user's mkdir/rmdir inside that directory. Tracefs allows
> one set of these callbacks (tracefs_dir_ops).
> 
> This patch de-globalizes tracefs_dir_ops so that it's possible to have
> multiple such subdirectories.
> 
> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> ---
>  fs/tracefs/inode.c | 35 +++++++++++++++++++++++++----------
>  1 file changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> index bea8ad876b..b14f03a655 100644
> --- a/fs/tracefs/inode.c
> +++ b/fs/tracefs/inode.c
> @@ -50,10 +50,10 @@ static const struct file_operations tracefs_file_operations = {
>  	.llseek =	noop_llseek,
>  };
>  
> -static struct tracefs_dir_ops {
> +struct tracefs_dir_ops {
>  	int (*mkdir)(const char *name);
>  	int (*rmdir)(const char *name);
> -} tracefs_ops;
> +};
>  
>  static char *get_dname(struct dentry *dentry)
>  {
> @@ -72,6 +72,7 @@ static char *get_dname(struct dentry *dentry)
>  
>  static int tracefs_syscall_mkdir(struct inode *inode, struct dentry *dentry, umode_t mode)
>  {
> +	struct tracefs_dir_ops *tracefs_ops = dentry->d_parent->d_fsdata;
>  	char *name;
>  	int ret;
>  
> @@ -85,7 +86,7 @@ static int tracefs_syscall_mkdir(struct inode *inode, struct dentry *dentry, umo
>  	 * mkdir routine to handle races.
>  	 */
>  	inode_unlock(inode);
> -	ret = tracefs_ops.mkdir(name);
> +	ret = tracefs_ops->mkdir(name);
>  	inode_lock(inode);
>  
>  	kfree(name);
> @@ -95,6 +96,7 @@ static int tracefs_syscall_mkdir(struct inode *inode, struct dentry *dentry, umo
>  
>  static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry)
>  {
> +	struct tracefs_dir_ops *tracefs_ops = dentry->d_fsdata;
>  	char *name;
>  	int ret;
>  
> @@ -112,7 +114,7 @@ static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry)
>  	inode_unlock(inode);
>  	inode_unlock(dentry->d_inode);
>  
> -	ret = tracefs_ops.rmdir(name);
> +	ret = tracefs_ops->rmdir(name);
>  
>  	inode_lock_nested(inode, I_MUTEX_PARENT);
>  	inode_lock(dentry->d_inode);
> @@ -342,6 +344,9 @@ static struct dentry *start_creating(const char *name, struct dentry *parent)
>  	if (IS_ERR(dentry)) {
>  		inode_unlock(parent->d_inode);
>  		simple_release_fs(&tracefs_mount, &tracefs_mount_count);
> +	} else {
> +		/* propagate dir ops */
> +		dentry->d_fsdata = parent->d_fsdata;
>  	}
>  
>  	return dentry;
> @@ -482,18 +487,25 @@ struct dentry *tracefs_create_instance_dir(const char *name, struct dentry *pare
>  					  int (*mkdir)(const char *name),
>  					  int (*rmdir)(const char *name))
>  {
> +	struct tracefs_dir_ops *tracefs_ops = parent ? parent->d_fsdata : NULL;
>  	struct dentry *dentry;
>  
> -	/* Only allow one instance of the instances directory. */
> -	if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir))
> +	if (WARN_ON(tracefs_ops))
> +		return NULL;
> +
> +	tracefs_ops = kzalloc(sizeof(*tracefs_ops), GFP_KERNEL);
> +	if (!tracefs_ops)
>  		return NULL;
>  
>  	dentry = __create_dir(name, parent, &tracefs_dir_inode_operations);
> -	if (!dentry)
> +	if (!dentry) {
> +		kfree(tracefs_ops);
>  		return NULL;
> +	}
>  
> -	tracefs_ops.mkdir = mkdir;
> -	tracefs_ops.rmdir = rmdir;
> +	tracefs_ops->mkdir = mkdir;
> +	tracefs_ops->rmdir = rmdir;
> +	dentry->d_fsdata = tracefs_ops;
>  
>  	return dentry;
>  }
> @@ -513,8 +525,11 @@ static int __tracefs_remove(struct dentry *dentry, struct dentry *parent)
>  				simple_unlink(parent->d_inode, dentry);
>  				break;
>  			}
> -			if (!ret)
> +			if (!ret) {
>  				d_delete(dentry);
> +				if (dentry->d_fsdata != parent->d_fsdata)
> +					kfree(dentry->d_fsdata);
> +			}
>  			dput(dentry);
>  		}
>  	}

^ permalink raw reply	[flat|nested] 34+ messages in thread

Thread overview: 34+ messages
2017-09-05 13:30 [RFC PATCH 00/17] perf: Detached events Alexander Shishkin
2017-09-05 13:30 ` [RFC PATCH 01/17] perf: Allow mmapping only user page Alexander Shishkin
2017-09-06 16:28   ` Borislav Petkov
2017-09-13 11:35     ` Alexander Shishkin
2017-09-13 12:58       ` Borislav Petkov
2017-09-05 13:30 ` [RFC PATCH 02/17] perf: Factor out mlock accounting Alexander Shishkin
2017-09-05 13:30 ` [RFC PATCH 03/17] tracefs: De-globalize instances' callbacks Alexander Shishkin
2018-01-24 18:54   ` Steven Rostedt
2017-09-05 13:30 ` [RFC PATCH 04/17] tracefs: Add ->unlink callback to tracefs_dir_ops Alexander Shishkin
2017-09-05 13:30 ` [RFC PATCH 05/17] perf: Introduce detached events Alexander Shishkin
2017-10-03 14:34   ` Peter Zijlstra
2017-10-06 11:23     ` Alexander Shishkin
2017-09-05 13:30 ` [RFC PATCH 06/17] perf: Add buffers to the " Alexander Shishkin
2017-10-03 14:36   ` Peter Zijlstra
2017-09-05 13:30 ` [RFC PATCH 07/17] perf: Add pmu_info to user page Alexander Shishkin
2017-10-03 14:40   ` Peter Zijlstra
2017-09-05 13:30 ` [RFC PATCH 08/17] perf: Allow inheritance for detached events Alexander Shishkin
2017-10-03 14:42   ` Peter Zijlstra
2017-10-06 11:40     ` Alexander Shishkin
2017-09-05 13:30 ` [RFC PATCH 09/17] perf: Use shmemfs pages for userspace-only per-thread " Alexander Shishkin
2017-10-03 14:43   ` Peter Zijlstra
2017-10-06 11:52     ` Alexander Shishkin
2017-09-05 13:30 ` [RFC PATCH 10/17] perf: Implement pinning and scheduling for SHMEM events Alexander Shishkin
2017-09-05 13:30 ` [RFC PATCH 11/17] perf: Implement mlock accounting for shmem ring buffers Alexander Shishkin
2017-09-05 13:30 ` [RFC PATCH 12/17] perf: Track pinned events per user Alexander Shishkin
2017-09-05 13:30 ` [RFC PATCH 13/17] perf: Re-inject shmem buffers after exec Alexander Shishkin
2017-09-05 13:30 ` [RFC PATCH 14/17] perf: Add ioctl(REATTACH) for detached events Alexander Shishkin
2017-10-03 14:50   ` Peter Zijlstra
2017-09-05 13:30 ` [RFC PATCH 15/17] perf: Allow controlled non-root access to " Alexander Shishkin
2017-10-03 14:53   ` Peter Zijlstra
2017-09-05 13:30 ` [RFC PATCH 16/17] perf/x86/intel/pt: Add PMU info Alexander Shishkin
2017-09-05 13:30 ` [RFC PATCH 17/17] perf/x86/intel/bts: " Alexander Shishkin
2017-09-06 16:24 ` [RFC PATCH 00/17] perf: Detached events Borislav Petkov
2017-09-13 11:54   ` Alexander Shishkin
