* [PATCH] perf: Cleanup user's child events
From: Alexander Shishkin @ 2016-01-15 11:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Alexander Shishkin

Events are leaking in the following scenario: user creates an event for
task A, task A forks into B (producing a child event), user closes the
original event. Both the original user's event and its child will remain for
as long as task B is around. In other words, we don't clean up children
when we try to release the parent.
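
For illustration, the scenario looks roughly like this from userspace
(a minimal sketch, not part of the patch: error handling is omitted and
pid_of_A is a placeholder for task A's pid):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

struct perf_event_attr attr = {
	.size    = sizeof(attr),
	.type    = PERF_TYPE_HARDWARE,
	.config  = PERF_COUNT_HW_CPU_CYCLES,
	.inherit = 1,			/* forks produce child events */
};

/* user creates an event for task A */
int fd = syscall(__NR_perf_event_open, &attr, pid_of_A, -1, -1, 0);

/* ... task A forks into B, producing a child event ... */

close(fd);	/* releases the parent; B's child event leaks */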

This patch cleans up a user event's children when its file descriptor
is closed.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 kernel/events/core.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 630f53acce..867c4347ea 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3828,7 +3828,43 @@ EXPORT_SYMBOL_GPL(perf_event_release_kernel);
  */
 static int perf_release(struct inode *inode, struct file *file)
 {
-	put_event(file->private_data);
+	struct perf_event *event = file->private_data, *child, *tmp;
+	LIST_HEAD(child_list);
+
+	/*
+	 * event::child_mutex nests inside ctx::lock, so move children
+	 * to a safe place first and avoid inversion
+	 */
+	mutex_lock(&event->child_mutex);
+	list_splice_init(&event->child_list, &child_list);
+	mutex_unlock(&event->child_mutex);
+
+	list_for_each_entry_safe(child, tmp, &child_list, child_list) {
+		struct perf_event_context *ctx;
+
+		/*
+		 * This is somewhat similar to perf_free_event(),
+		 * except for these events are alive and need
+		 * proper perf_remove_from_context().
+		 */
+		ctx = perf_event_ctx_lock(child);
+		perf_remove_from_context(child, true);
+		perf_event_ctx_unlock(child, ctx);
+
+		list_del(&child->child_list);
+
+		/* Children will have exactly one reference */
+		free_event(child);
+
+		/*
+		 * This matches the refcount bump in inherit_event();
+		 * this can't be the last reference.
+		 */
+		put_event(event);
+	}
+
+	/* Must be the last reference */
+	put_event(event);
 	return 0;
 }
 
-- 
2.7.0.rc3

* Re: [PATCH] perf: Cleanup user's child events
From: Peter Zijlstra @ 2016-01-15 12:54 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian, Arnaldo Carvalho de Melo

On Fri, Jan 15, 2016 at 01:22:15PM +0200, Alexander Shishkin wrote:
> Events are leaking in the following scenario: user creates an event for
> task A, task A forks into B (producing a child event), user closes the
> original event. Both the original user's event and its child will remain for
> as long as task B is around. In other words, we don't clean up children
> when we try to release the parent.

The orphan stuff should clear those up, no?

* Re: [PATCH] perf: Cleanup user's child events
From: Alexander Shishkin @ 2016-01-15 13:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian, Arnaldo Carvalho de Melo

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Jan 15, 2016 at 01:22:15PM +0200, Alexander Shishkin wrote:
>> Events are leaking in the following scenario: user creates an event for
>> task A, task A forks into B (producing a child event), user closes the
>> original event. Both the original user's event and its child will remain for
>> as long as task B is around. In other words, we don't clean up children
>> when we try to release the parent.
>
> The orphan stuff should clear those up, no?

Not if they don't schedule after the parent's gone.

Regards,
--
Alex

* Re: [PATCH] perf: Cleanup user's child events
From: Peter Zijlstra @ 2016-01-15 13:09 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, jolsa

On Fri, Jan 15, 2016 at 03:05:33PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Fri, Jan 15, 2016 at 01:22:15PM +0200, Alexander Shishkin wrote:
> >> Events are leaking in the following scenario: user creates an event for
> >> task A, task A forks into B (producing a child event), user closes the
> >> original event. Both the original user's event and its child will remain for
> >> as long as task B is around. In other words, we don't clean up children
> >> when we try to release the parent.
> >
> > The orphan stuff should clear those up, no?
> 
> Not if they don't schedule after the parent's gone.

This is true. So when Jiri did this we tried the immediate thing and
that exploded due to lock inversions.

You mention some of that. Let me go dig out that old thread to see if
it's the same.

I feel that we should not have both approaches.

* [PATCH] perf: Synchronously cleanup child events
From: Alexander Shishkin @ 2016-01-15 14:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, Alexander Shishkin

The orphan cleanup workqueue doesn't always catch orphans; for example,
it will miss them if they never schedule after they are orphaned. IOW,
the event leak is still very real. It also wouldn't work for kernel
counters.

Also, there seems to be no reason not to carry out this cleanup
procedure synchronously during the parent event's destruction.

This patch replaces the workqueue approach with a simple cleanup round
in the event's destruction path. To avoid racing with clone, we still
check in the inheritance path that the parent event has an owner.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h |   3 --
 kernel/events/core.c       | 121 ++++++++++++++++-----------------------------
 2 files changed, 43 insertions(+), 81 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6612732d8f..cd9c1ace29 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -634,9 +634,6 @@ struct perf_event_context {
 	int				nr_cgroups;	 /* cgroup evts */
 	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
-
-	struct delayed_work		orphans_remove;
-	bool				orphans_remove_sched;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 630f53acce..8eb3fee429 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,8 +49,6 @@
 
 #include <asm/irq_regs.h>
 
-static struct workqueue_struct *perf_wq;
-
 typedef int (*remote_function_f)(void *);
 
 struct remote_function_call {
@@ -1652,40 +1650,9 @@ out:
  */
 static bool is_orphaned_event(struct perf_event *event)
 {
-	return event && !is_kernel_event(event) && !event->owner;
-}
-
-/*
- * Event has a parent but parent's task finished and it's
- * alive only because of children holding refference.
- */
-static bool is_orphaned_child(struct perf_event *event)
-{
-	return is_orphaned_event(event->parent);
-}
-
-static void orphans_remove_work(struct work_struct *work);
-
-static void schedule_orphans_remove(struct perf_event_context *ctx)
-{
-	if (!ctx->task || ctx->orphans_remove_sched || !perf_wq)
-		return;
-
-	if (queue_delayed_work(perf_wq, &ctx->orphans_remove, 1)) {
-		get_ctx(ctx);
-		ctx->orphans_remove_sched = true;
-	}
-}
-
-static int __init perf_workqueue_init(void)
-{
-	perf_wq = create_singlethread_workqueue("perf");
-	WARN(!perf_wq, "failed to create perf workqueue\n");
-	return perf_wq ? 0 : -1;
+	return event && !event->owner;
 }
 
-core_initcall(perf_workqueue_init);
-
 static inline int pmu_filter_match(struct perf_event *event)
 {
 	struct pmu *pmu = event->pmu;
@@ -1746,9 +1713,6 @@ event_sched_out(struct perf_event *event,
 	if (event->attr.exclusive || !cpuctx->active_oncpu)
 		cpuctx->exclusive = 0;
 
-	if (is_orphaned_child(event))
-		schedule_orphans_remove(ctx);
-
 	perf_pmu_enable(event->pmu);
 }
 
@@ -1991,9 +1955,6 @@ event_sched_in(struct perf_event *event,
 	if (event->attr.exclusive)
 		cpuctx->exclusive = 1;
 
-	if (is_orphaned_child(event))
-		schedule_orphans_remove(ctx);
-
 out:
 	perf_pmu_enable(event->pmu);
 
@@ -3370,7 +3331,6 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 	INIT_LIST_HEAD(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
 	atomic_set(&ctx->refcount, 1);
-	INIT_DELAYED_WORK(&ctx->orphans_remove, orphans_remove_work);
 }
 
 static struct perf_event_context *
@@ -3818,54 +3778,59 @@ static void put_event(struct perf_event *event)
 
 int perf_event_release_kernel(struct perf_event *event)
 {
-	put_event(event);
-	return 0;
-}
-EXPORT_SYMBOL_GPL(perf_event_release_kernel);
+	struct perf_event *child, *tmp;
+	LIST_HEAD(child_list);
 
-/*
- * Called when the last reference to the file is gone.
- */
-static int perf_release(struct inode *inode, struct file *file)
-{
-	put_event(file->private_data);
-	return 0;
-}
+	if (!is_kernel_event(event))
+		perf_remove_from_owner(event);
 
-/*
- * Remove all orphanes events from the context.
- */
-static void orphans_remove_work(struct work_struct *work)
-{
-	struct perf_event_context *ctx;
-	struct perf_event *event, *tmp;
+	event->owner = NULL;
 
-	ctx = container_of(work, struct perf_event_context,
-			   orphans_remove.work);
+	/*
+	 * event::child_mutex nests inside ctx::lock, so move children
+	 * to a safe place first and avoid inversion
+	 */
+	mutex_lock(&event->child_mutex);
+	list_splice_init(&event->child_list, &child_list);
+	mutex_unlock(&event->child_mutex);
 
-	mutex_lock(&ctx->mutex);
-	list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry) {
-		struct perf_event *parent_event = event->parent;
+	list_for_each_entry_safe(child, tmp, &child_list, child_list) {
+		struct perf_event_context *ctx;
 
-		if (!is_orphaned_child(event))
-			continue;
+		/*
+		 * This is somewhat similar to perf_free_event(),
+		 * except for these events are alive and need
+		 * proper perf_remove_from_context().
+		 */
+		ctx = perf_event_ctx_lock(child);
+		perf_remove_from_context(child, true);
+		perf_event_ctx_unlock(child, ctx);
 
-		perf_remove_from_context(event, true);
+		list_del(&child->child_list);
 
-		mutex_lock(&parent_event->child_mutex);
-		list_del_init(&event->child_list);
-		mutex_unlock(&parent_event->child_mutex);
+		/* Children will have exactly one reference */
+		free_event(child);
 
-		free_event(event);
-		put_event(parent_event);
+		/*
+		 * This matches the refcount bump in inherit_event();
+		 * this can't be the last reference.
+		 */
+		put_event(event);
 	}
 
-	raw_spin_lock_irq(&ctx->lock);
-	ctx->orphans_remove_sched = false;
-	raw_spin_unlock_irq(&ctx->lock);
-	mutex_unlock(&ctx->mutex);
+	/* Must be the last reference */
+	put_event(event);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(perf_event_release_kernel);
 
-	put_ctx(ctx);
+/*
+ * Called when the last reference to the file is gone.
+ */
+static int perf_release(struct inode *inode, struct file *file)
+{
+	perf_event_release_kernel(file->private_data);
+	return 0;
 }
 
 u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
-- 
2.7.0.rc3

* Re: [PATCH] perf: Synchronously cleanup child events
From: Peter Zijlstra @ 2016-01-15 17:57 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Fri, Jan 15, 2016 at 04:07:41PM +0200, Alexander Shishkin wrote:
>  int perf_event_release_kernel(struct perf_event *event)
>  {
> +	struct perf_event *child, *tmp;
> +	LIST_HEAD(child_list);
>  
> +	if (!is_kernel_event(event))
> +		perf_remove_from_owner(event);
>  
> +	event->owner = NULL;
>  
> +	/*
> +	 * event::child_mutex nests inside ctx::lock, so move children
> +	 * to a safe place first and avoid inversion
> +	 */
> +	mutex_lock(&event->child_mutex);
> +	list_splice_init(&event->child_list, &child_list);
> +	mutex_unlock(&event->child_mutex);

I suspect this races against inherit_event(), like:

	inherit_event()			perf_event_release_kernel()

	if (is_orphaned_event(parent_event) /* false */

					event->owner = NULL

	mutex_lock(child_mutex);
	list_splice
	mutex_unlock(child_mutex);

					mutex_lock(child_mutex);
					list_add_tail
					mutex_unlock(child_mutex);


Something like this would fix that, I think; not sure it's the best way
though...


--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -979,8 +979,8 @@ static void put_ctx(struct perf_event_co
  * Lock order:
  *	task_struct::perf_event_mutex
  *	  perf_event_context::mutex
- *	    perf_event_context::lock
  *	    perf_event::child_mutex;
+ *	      perf_event_context::lock
  *	    perf_event::mmap_mutex
  *	    mmap_sem
  */
@@ -8956,6 +8958,16 @@ inherit_event(struct perf_event *parent_
 	if (parent_event->parent)
 		parent_event = parent_event->parent;
 
+	WARN_ON_ONCE(parent_event->ctx->parent_ctx);
+	/*
+	 * Serialize against perf_event_kernel_release()'s orphanage..
+	 */
+	mutex_lock(&parent_event->child_mutex);
+	if (is_orphaned_event(parent_event)) {
+		mutex_unlock(&parent_event->child_mutex);
+		return NULL;
+	}
+
 	child_event = perf_event_alloc(&parent_event->attr,
 					   parent_event->cpu,
 					   child,
@@ -8964,8 +8976,8 @@ inherit_event(struct perf_event *parent_
 	if (IS_ERR(child_event))
 		return child_event;
 
-	if (is_orphaned_event(parent_event) ||
-	    !atomic_long_inc_not_zero(&parent_event->refcount)) {
+	if (!atomic_long_inc_not_zero(&parent_event->refcount)) {
+		mutex_unlock(&parent_event->child_mutex);
 		free_event(child_event);
 		return NULL;
 	}
@@ -9013,8 +9025,6 @@ inherit_event(struct perf_event *parent_
 	/*
 	 * Link this into the parent event's child list
 	 */
-	WARN_ON_ONCE(parent_event->ctx->parent_ctx);
-	mutex_lock(&parent_event->child_mutex);
 	list_add_tail(&child_event->child_list, &parent_event->child_list);
 	mutex_unlock(&parent_event->child_mutex);
 

* Re: [PATCH] perf: Synchronously cleanup child events
From: Alexander Shishkin @ 2016-01-18 12:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Jan 15, 2016 at 04:07:41PM +0200, Alexander Shishkin wrote:
>>  int perf_event_release_kernel(struct perf_event *event)
>>  {
>> +	struct perf_event *child, *tmp;
>> +	LIST_HEAD(child_list);
>>  
>> +	if (!is_kernel_event(event))
>> +		perf_remove_from_owner(event);
>>  
>> +	event->owner = NULL;
>>  
>> +	/*
>> +	 * event::child_mutex nests inside ctx::lock, so move children
>> +	 * to a safe place first and avoid inversion
>> +	 */
>> +	mutex_lock(&event->child_mutex);
>> +	list_splice_init(&event->child_list, &child_list);
>> +	mutex_unlock(&event->child_mutex);
>
> I suspect this races against inherit_event(), like:
>
> 	inherit_event()			perf_event_release_kernel()
>
> 	if (is_orphaned_event(parent_event) /* false */
>
> 					event->owner = NULL
>
> 	mutex_lock(child_mutex);
> 	list_splice
> 	mutex_unlock(child_mutex);
>
> 					mutex_lock(child_mutex);
> 					list_add_tail
> 					mutex_unlock(child_mutex);

Indeed, this is possible.

>
> Something like this would fix that I think, not sure its the best way
> though...
>
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -979,8 +979,8 @@ static void put_ctx(struct perf_event_co
>   * Lock order:
>   *	task_struct::perf_event_mutex
>   *	  perf_event_context::mutex
> - *	    perf_event_context::lock
>   *	    perf_event::child_mutex;
> + *	      perf_event_context::lock
>   *	    perf_event::mmap_mutex
>   *	    mmap_sem
>   */

This is, actually, the order that we have already:

perf_ioctl():                ctx::mutex
-> perf_event_for_each():    event::child_mutex
   -> _perf_event_enable():  ctx::lock

that is, ctx::lock already nests inside event::child_mutex. So what
you're suggesting is an ok solution.

Regards,
--
Alex

* Re: [PATCH] perf: Synchronously cleanup child events
From: Alexander Shishkin @ 2016-01-18 12:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

Peter Zijlstra <peterz@infradead.org> writes:

> I suspect this races against inherit_event(), like:
>
> 	inherit_event()			perf_event_release_kernel()
>
> 	if (is_orphaned_event(parent_event) /* false */
>
> 					event->owner = NULL
>
> 	mutex_lock(child_mutex);
> 	list_splice
> 	mutex_unlock(child_mutex);
>
> 					mutex_lock(child_mutex);
> 					list_add_tail
> 					mutex_unlock(child_mutex);

Or how about this instead:

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8eb3fee429..cd9f1ac537 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3786,8 +3786,9 @@ int perf_event_release_kernel(struct perf_event *event)
 
 	event->owner = NULL;
 
+retry:
 	/*
-	 * event::child_mutex nests inside ctx::lock, so move children
+	 * event::child_mutex nests inside ctx::mutex, so move children
 	 * to a safe place first and avoid inversion
 	 */
 	mutex_lock(&event->child_mutex);
@@ -3818,8 +3819,13 @@ int perf_event_release_kernel(struct perf_event *event)
 		put_event(event);
 	}
 
-	/* Must be the last reference */
+	/* Must be the last reference, .. */
 	put_event(event);
+
+	/* .. unless we raced with inherit_event(), in which case, repeat */
+	if (atomic_long_inc_not_zero(&event->refcount))
+		goto retry;
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(perf_event_release_kernel);

* Re: [PATCH] perf: Synchronously cleanup child events
From: Peter Zijlstra @ 2016-01-18 14:44 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Mon, Jan 18, 2016 at 02:37:06PM +0200, Alexander Shishkin wrote:

> Or how about this instead:

In principle better since it doesn't increase the lock hold times etc.,
but buggy I think:

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8eb3fee429..cd9f1ac537 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3786,8 +3786,9 @@ int perf_event_release_kernel(struct perf_event *event)
>  
>  	event->owner = NULL;
>  
> +retry:
>  	/*
> -	 * event::child_mutex nests inside ctx::lock, so move children
> +	 * event::child_mutex nests inside ctx::mutex, so move children
>  	 * to a safe place first and avoid inversion
>  	 */
>  	mutex_lock(&event->child_mutex);
> @@ -3818,8 +3819,13 @@ int perf_event_release_kernel(struct perf_event *event)
>  		put_event(event);
>  	}
>  
> -	/* Must be the last reference */
> +	/* Must be the last reference, .. */
>  	put_event(event);
> +
> +	/* .. unless we raced with inherit_event(), in which case, repeat */

Nothing prevents @event from being freed here, which would make:

> +	if (atomic_long_inc_not_zero(&event->refcount))

a use-after-free.

You'll have to do something like:

static bool put_event_last(struct perf_event *event)
{
	if (atomic_long_cmpxchg(&event->refcount, 1, 0)) {
		__put_event(event); /* normal put_event() body */
		return true;
	}
	return false;
}

with which you can do:

	if (!put_event_last(event))
		goto retry;

* Re: [PATCH] perf: Synchronously cleanup child events
From: Ingo Molnar @ 2016-01-19  7:45 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa


* Alexander Shishkin <alexander.shishkin@linux.intel.com> wrote:

> The orphan cleanup workqueue doesn't always catch orphans; for example, it will
> miss them if they never schedule after they are orphaned. IOW, the event leak is
> still very real. It also wouldn't work for kernel counters.
> 
> Also, there seems to be no reason not to carry out this cleanup procedure
> synchronously during the parent event's destruction.

Absolutely, synchronous cleanup will probably also be quicker at triggering any 
remaining (and new ;-) races.

Thanks,

	Ingo

* [PATCH v2] perf: Synchronously cleanup child events
From: Alexander Shishkin @ 2016-01-19 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, Alexander Shishkin

The orphan cleanup workqueue doesn't always catch orphans; for example,
it will miss them if they never schedule after they are orphaned. IOW,
the event leak is still very real. It also wouldn't work for kernel
counters.

Also, there seems to be no reason not to carry out this cleanup
procedure synchronously during the parent event's destruction.

This patch replaces the workqueue approach with a simple cleanup round
in the event's destruction path. To avoid racing with clone, we still
check in the inheritance path that the parent event has an owner.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 include/linux/perf_event.h |   3 -
 kernel/events/core.c       | 142 ++++++++++++++++++++-------------------------
 2 files changed, 63 insertions(+), 82 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6612732d8f..cd9c1ace29 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -634,9 +634,6 @@ struct perf_event_context {
 	int				nr_cgroups;	 /* cgroup evts */
 	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
-
-	struct delayed_work		orphans_remove;
-	bool				orphans_remove_sched;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 630f53acce..33083ed5a6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,8 +49,6 @@
 
 #include <asm/irq_regs.h>
 
-static struct workqueue_struct *perf_wq;
-
 typedef int (*remote_function_f)(void *);
 
 struct remote_function_call {
@@ -1652,40 +1650,9 @@ out:
  */
 static bool is_orphaned_event(struct perf_event *event)
 {
-	return event && !is_kernel_event(event) && !event->owner;
-}
-
-/*
- * Event has a parent but parent's task finished and it's
- * alive only because of children holding refference.
- */
-static bool is_orphaned_child(struct perf_event *event)
-{
-	return is_orphaned_event(event->parent);
+	return event && !event->owner;
 }
 
-static void orphans_remove_work(struct work_struct *work);
-
-static void schedule_orphans_remove(struct perf_event_context *ctx)
-{
-	if (!ctx->task || ctx->orphans_remove_sched || !perf_wq)
-		return;
-
-	if (queue_delayed_work(perf_wq, &ctx->orphans_remove, 1)) {
-		get_ctx(ctx);
-		ctx->orphans_remove_sched = true;
-	}
-}
-
-static int __init perf_workqueue_init(void)
-{
-	perf_wq = create_singlethread_workqueue("perf");
-	WARN(!perf_wq, "failed to create perf workqueue\n");
-	return perf_wq ? 0 : -1;
-}
-
-core_initcall(perf_workqueue_init);
-
 static inline int pmu_filter_match(struct perf_event *event)
 {
 	struct pmu *pmu = event->pmu;
@@ -1746,9 +1713,6 @@ event_sched_out(struct perf_event *event,
 	if (event->attr.exclusive || !cpuctx->active_oncpu)
 		cpuctx->exclusive = 0;
 
-	if (is_orphaned_child(event))
-		schedule_orphans_remove(ctx);
-
 	perf_pmu_enable(event->pmu);
 }
 
@@ -1991,9 +1955,6 @@ event_sched_in(struct perf_event *event,
 	if (event->attr.exclusive)
 		cpuctx->exclusive = 1;
 
-	if (is_orphaned_child(event))
-		schedule_orphans_remove(ctx);
-
 out:
 	perf_pmu_enable(event->pmu);
 
@@ -3370,7 +3331,6 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 	INIT_LIST_HEAD(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
 	atomic_set(&ctx->refcount, 1);
-	INIT_DELAYED_WORK(&ctx->orphans_remove, orphans_remove_work);
 }
 
 static struct perf_event_context *
@@ -3786,13 +3746,10 @@ static void perf_remove_from_owner(struct perf_event *event)
 	}
 }
 
-static void put_event(struct perf_event *event)
+static void __put_event(struct perf_event *event)
 {
 	struct perf_event_context *ctx;
 
-	if (!atomic_long_dec_and_test(&event->refcount))
-		return;
-
 	if (!is_kernel_event(event))
 		perf_remove_from_owner(event);
 
@@ -3816,56 +3773,83 @@ static void put_event(struct perf_event *event)
 	_free_event(event);
 }
 
-int perf_event_release_kernel(struct perf_event *event)
+static void put_event(struct perf_event *event)
 {
-	put_event(event);
-	return 0;
+	if (atomic_long_dec_and_test(&event->refcount))
+		__put_event(event);
 }
-EXPORT_SYMBOL_GPL(perf_event_release_kernel);
 
-/*
- * Called when the last reference to the file is gone.
- */
-static int perf_release(struct inode *inode, struct file *file)
+static bool put_event_last(struct perf_event *event)
 {
-	put_event(file->private_data);
-	return 0;
+	if (atomic_long_cmpxchg(&event->refcount, 1, 0)) {
+		__put_event(event);
+		return true;
+	}
+
+	return false;
 }
 
-/*
- * Remove all orphanes events from the context.
- */
-static void orphans_remove_work(struct work_struct *work)
+int perf_event_release_kernel(struct perf_event *event)
 {
-	struct perf_event_context *ctx;
-	struct perf_event *event, *tmp;
+	struct perf_event *child, *tmp;
+	LIST_HEAD(child_list);
 
-	ctx = container_of(work, struct perf_event_context,
-			   orphans_remove.work);
+	if (!is_kernel_event(event))
+		perf_remove_from_owner(event);
 
-	mutex_lock(&ctx->mutex);
-	list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry) {
-		struct perf_event *parent_event = event->parent;
+	event->owner = NULL;
 
-		if (!is_orphaned_child(event))
-			continue;
+retry:
+	/*
+	 * event::child_mutex nests inside ctx::mutex, so move children
+	 * to a safe place first and avoid inversion
+	 */
+	mutex_lock(&event->child_mutex);
+	list_splice_init(&event->child_list, &child_list);
+	mutex_unlock(&event->child_mutex);
 
-		perf_remove_from_context(event, true);
+	list_for_each_entry_safe(child, tmp, &child_list, child_list) {
+		struct perf_event_context *ctx;
 
-		mutex_lock(&parent_event->child_mutex);
-		list_del_init(&event->child_list);
-		mutex_unlock(&parent_event->child_mutex);
+		/*
+		 * This is somewhat similar to perf_free_event(),
+		 * except for these events are alive and need
+		 * proper perf_remove_from_context().
+		 */
+		ctx = perf_event_ctx_lock(child);
+		perf_remove_from_context(child, true);
+		perf_event_ctx_unlock(child, ctx);
+
+		list_del(&child->child_list);
 
-		free_event(event);
-		put_event(parent_event);
+		/* Children will have exactly one reference */
+		free_event(child);
+
+		/*
+		 * This matches the refcount bump in inherit_event();
+		 * this can't be the last reference.
+		 */
+		put_event(event);
 	}
 
-	raw_spin_lock_irq(&ctx->lock);
-	ctx->orphans_remove_sched = false;
-	raw_spin_unlock_irq(&ctx->lock);
-	mutex_unlock(&ctx->mutex);
+	/*
+	 * If this is the last reference, we're done here, otherwise
+	 * we must have raced with inherit_event(), in which case, repeat
+	 */
+	if (!put_event_last(event))
+		goto retry;
 
-	put_ctx(ctx);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(perf_event_release_kernel);
+
+/*
+ * Called when the last reference to the file is gone.
+ */
+static int perf_release(struct inode *inode, struct file *file)
+{
+	perf_event_release_kernel(file->private_data);
+	return 0;
 }
 
 u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
-- 
2.7.0.rc3

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Peter Zijlstra @ 2016-01-19 20:05 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, alexei.starovoitov

On Tue, Jan 19, 2016 at 05:12:34PM +0200, Alexander Shishkin wrote:

> +static void __put_event(struct perf_event *event)
>  {
>  	struct perf_event_context *ctx;
>  
>  	if (!is_kernel_event(event))
>  		perf_remove_from_owner(event);
>  

> +int perf_event_release_kernel(struct perf_event *event)
>  {
> +	struct perf_event *child, *tmp;
> +	LIST_HEAD(child_list);
>  
> +	if (!is_kernel_event(event))
> +		perf_remove_from_owner(event);
>  
> +	event->owner = NULL;
>  
> +retry:

	<snip>

> +	/*
> +	 * If this is the last reference, we're done here, otherwise
> +	 * we must have raced with inherit_event(), in which case, repeat
> +	 */
> +	if (!put_event_last(event))
> +		goto retry;
>  
> +	return 0;
> +}

So I think there's a number of problems still :-(

It all starts with having two perf_remove_from_owner() calls (as I
mentioned on IRC); this doesn't make sense.

I think the moment you close the file and userspace loses control over
it, we should drop the owner bit, which is exactly the one
remove_from_owner in perf_release().

If, for some magical reason, the event lives on after that (and we'll
get to that), it should live on owner-less.

Now, assume someone has such a magical reference, then our
put_event_last() goto-again loop will never terminate, which seems like
a bad thing.

The most obvious place that generates such magical references would be
the bpf arraymap doing perf_event_get() on things. There are a few other
places that take temp references (perf_mmap_close), but those are
'short' lived and while ugly will not cause massive grief. The BPF one
OTOH is a real problem here.
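
For context, the arraymap takes that reference via perf_event_get(),
which at this point looks roughly like this (paraphrased, not a
verbatim copy of kernel/events/core.c):

struct perf_event *perf_event_get(unsigned int fd)
{
	struct fd f;
	struct perf_event *event;
	int err;

	err = perf_fget_light(fd, &f);		/* must be a perf file */
	if (err)
		return ERR_PTR(err);

	event = f.file->private_data;
	atomic_long_inc(&event->refcount);	/* long-lived event ref */
	fdput(f);				/* file ref dropped right away */

	return event;
}

i.e. the file reference is dropped immediately and only the bare event
reference survives.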

And looking at the BPF stuff, that code seems to assume
perf_event_release_kernel() := put_event(), so this patch breaks that
too.


Alexei, is there a reason the arraymap stuff needs a perf event ref as
opposed to a file ref? I'm forever a little confused on how perf<->bpf
works.

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Peter Zijlstra @ 2016-01-19 20:07 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Tue, Jan 19, 2016 at 05:12:34PM +0200, Alexander Shishkin wrote:
> +static bool put_event_last(struct perf_event *event)
>  {
> +	if (atomic_long_cmpxchg(&event->refcount, 1, 0)) {

	if (atomic_long_cmpxchg(&event->refcount, 1, 0) == 1) {

> +		__put_event(event);
> +		return true;
> +	}
> +
> +	return false;
>  }

(atomic_long_cmpxchg() returns the old value, so the original test is
true for any nonzero refcount, even when the exchange failed.)

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Alexei Starovoitov @ 2016-01-19 21:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Tue, Jan 19, 2016 at 09:05:58PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 19, 2016 at 05:12:34PM +0200, Alexander Shishkin wrote:
> 
> > +static void __put_event(struct perf_event *event)
> >  {
> >  	struct perf_event_context *ctx;
> >  
> >  	if (!is_kernel_event(event))
> >  		perf_remove_from_owner(event);
> >  
> 
> > +int perf_event_release_kernel(struct perf_event *event)
> >  {
> > +	struct perf_event *child, *tmp;
> > +	LIST_HEAD(child_list);
> >  
> > +	if (!is_kernel_event(event))
> > +		perf_remove_from_owner(event);
> >  
> > +	event->owner = NULL;
> >  
> > +retry:
> 
> 	<snip>
> 
> > +	/*
> > +	 * If this is the last reference, we're done here, otherwise
> > +	 * we must have raced with inherit_event(), in which case, repeat
> > +	 */
> > +	if (!put_event_last(event))
> > +		goto retry;
> >  
> > +	return 0;
> > +}
> 
> So I think there's a number of problems still :-(
> 
> It all starts with having two perf_remove_from_owner() calls (as I
> mentioned on IRC); this doesn't make sense.
> 
> I think the moment you close the file and userspace loses control over
> it, we should drop the owner bit, which is exactly the one
> remove_from_owner in perf_release().
> 
> If, for some magical reason, the event lives on after that (and we'll
> get to that), it should live on owner-less.
> 
> Now, assume someone has such a magical reference, then our
> put_event_last() goto-again loop will never terminate, which seems like
> a bad thing.
> 
> The most obvious place that generates such magical references would be
> the bpf arraymap doing perf_event_get() on things. There are a few other
> places that take temp references (perf_mmap_close), but those are
> 'short' lived and while ugly will not cause massive grief. The BPF one
> OTOH is a real problem here.
> 
> And looking at the BPF stuff, that code seems to assume
> perf_event_release_kernel() := put_event(), so this patch breaks that
> too.
> 
> 
> Alexei, is there a reason the arraymap stuff needs a perf event ref as
> opposed to a file ref? I'm forever a little confused on how perf<->bpf
> works.

A file ref will not work, since user space could have closed that
perf_event file to avoid unnecessary FDs.
The program only needs a stable pointer to 'struct perf_event', which
it will use while running.
At the end it will call perf_event_release_kernel(), which
is == put_event().
It was the case that 'perf_events' were normal refcnt-ed structures
and the last guy frees it.
This put_event_last() logic definitely looks problematic.
There are no ordering guarantees.
User space may close the FD while struct perf_event is still alive.
The loop around put_event_last() looks buggy.
I'm obviously missing the main goal of this patch.

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Alexander Shishkin @ 2016-01-20  7:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, alexei.starovoitov

Peter Zijlstra <peterz@infradead.org> writes:

> So I think there's a number of problems still :-(
>
> It all starts with having two perf_remove_from_owner() calls (as I
> mentioned on IRC); this doesn't make sense.
>
> I think the moment you close the file and userspace loses control over
> it, we should drop the owner bit, which is exactly the one
> remove_from_owner in perf_release().

Fair enough.

> If, for some magical reason, the event lives on after that (and we'll
> get to that), it should live on owner-less.
>
> Now, assume someone has such a magical reference, then our
> put_event_last() goto-again loop will never terminate, which seems like
> a bad thing.
>
> The most obvious place that generates such magical references would be
> the bpf arraymap doing perf_event_get() on things. There are a few other
> places that take temp references (perf_mmap_close), but those are
> 'short' lived and while ugly will not cause massive grief.

We won't get to perf_release() before we're done with perf_mmap_close(),
so that one's not really a problem.

> The BPF one OTOH is a real problem here.
>
> And looking at the BPF stuff, that code seems to assume
> perf_event_release_kernel() := put_event(), so this patch breaks that
> too.

Yes, that one's very much an API abuse; it should clearly be using
get_file()/fput() instead. Now that the code is there already, there's a
slight chance that changing this will have userspace running into the fd
limit and cause a regression. As a workaround we can probably introduce
yet another magical owner to allow a userspace event to be
'stolen'. Since bpf is the only user of perf_event_get(), this can be
somewhat easily arranged.
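
A rough sketch of the file-based variant (hypothetical code, only to
illustrate the get_file()/fput() idea; perf_fops is how perf recognizes
its own files):

static struct file *perf_event_get_file(unsigned int fd)
{
	struct file *file = fget(fd);		/* long-lived file ref */

	if (!file)
		return ERR_PTR(-EBADF);

	if (file->f_op != &perf_fops) {		/* not a perf event file */
		fput(file);
		return ERR_PTR(-EBADF);
	}

	/* callers use file->private_data and drop the ref with fput() */
	return file;
}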

Regards,
--
Alex

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Peter Zijlstra @ 2016-01-20  8:03 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, alexei.starovoitov

On Wed, Jan 20, 2016 at 09:04:28AM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:

> > The most obvious place that generates such magical references would be
> > the bpf arraymap doing perf_event_get() on things. There are a few other
> > places that take temp references (perf_mmap_close), but those are
> > 'short' lived and while ugly will not cause massive grief.
> 
> We won't get to perf_release() before we're done with perf_mmap_close(),
> so that one's not really a problem.

Only for the file we mmap()'ed; the events we've attached through
IOC_SET_OUTPUT will not have a file reference from the mmap().

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Peter Zijlstra @ 2016-01-20  8:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Tue, Jan 19, 2016 at 01:58:19PM -0800, Alexei Starovoitov wrote:
> On Tue, Jan 19, 2016 at 09:05:58PM +0100, Peter Zijlstra wrote:

> > The most obvious place that generates such magical references would be
> > the bpf arraymap doing perf_event_get() on things. There are a few other
> > places that take temp references (perf_mmap_close), but those are
> > 'short' lived and while ugly will not cause massive grief. The BPF one
> > OTOH is a real problem here.
> > 
> > And looking at the BPF stuff, that code seems to assume
> > perf_event_release_kernel() := put_event(), so this patch breaks that
> > too.
> > 
> > 
> > Alexei, is there a reason the arraymap stuff needs a perf event ref as
> > opposed to a file ref? I'm forever a little confused on how perf<->bpf
> > works.
> 
> A file ref will not work, since user space could have closed that
> perf_event file to avoid unnecessary FDs.

So I'm (possibly again) confused on how BPF works.

I thought the reason you handed in perf events from userspace, as
opposed to creating your own with perf_event_create_kernel_counter(),
was because userspace was interested in the output.

Also, BPF should not be a way to get around the filedesc resource limit.

> The program only needs a stable pointer to 'struct perf_event', which
> it will use while running.
> At the end it will call perf_event_release_kernel(), which
> is == put_event().
> It was the case that 'perf_events' were normal refcnt-ed structures
> and the last guy frees it.

Sort-of, but user events are (or should be, rather) tied to the filedesc
to account for the resources used.

There is also the event->owner field: we track the task that created the
event, and with your current scheme that is left dangling once userspace
closes the last filedesc while you still have a ref open.

> This put_event_last() logic definitely looks problematic.
> There are no ordering guarantees.
> User space may close the FD while struct perf_event is still alive.
> The loop around put_event_last() looks buggy.
> I'm obviously missing the main goal of this patch.

Right, so the patch in question tries to synchronously clean up
everything related to the counter when we close the file, such that the
file better reflects the actual resource usage.

Currently we do this async (and with holes).


In short, user-created events really should be filedesc-based; yes, we
have event references, but those 'should' be short-lived.

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Alexei Starovoitov @ 2016-01-21  4:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Wed, Jan 20, 2016 at 09:32:22AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 19, 2016 at 01:58:19PM -0800, Alexei Starovoitov wrote:
> > On Tue, Jan 19, 2016 at 09:05:58PM +0100, Peter Zijlstra wrote:
> 
> > > The most obvious place that generates such magical references would be
> > > the bpf arraymap doing perf_event_get() on things. There are a few other
> > > places that take temp references (perf_mmap_close), but those are
> > > 'short' lived and while ugly will not cause massive grief. The BPF one
> > > OTOH is a real problem here.
> > > 
> > > And looking at the BPF stuff, that code seems to assume
> > > perf_event_release_kernel() := put_event(), so this patch breaks that
> > > too.
> > > 
> > > 
> > > Alexei, is there a reason the arraymap stuff needs a perf event ref as
> > > opposed to a file ref? I'm forever a little confused on how perf<->bpf
> > > works.
> > 
> > A file ref will not work, since user space could have closed that
> > perf_event file to avoid unnecessary FDs.
> 
> So I'm (possibly again) confused on how BPF works.
> 
> I thought the reason you handed in perf events from userspace, as
> opposed to creating your own with perf_event_create_kernel_counter(),
> was because userspace was interested in the output.

Yes. There are two use cases of perf_events from bpf:
1. a sw_bpf_output event is used by bpf to push samples into it, and
   user space reads it as normal via mmap
2. a PERF_TYPE_HARDWARE event is used by a bpf program to read
   counters, to measure things like the number of cycles or tlb misses
   in a given function (see the sketch below).
   In this case user space typically leaves FDs around, but it doesn't
   use them for anything.
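
A minimal sketch of case 2, along the lines of samples/bpf/tracex6 (the
map and section names here are illustrative, not from an actual tree):

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") counters = {
	.type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size    = sizeof(int),
	.value_size  = sizeof(u32),
	.max_entries = 64,	/* user space stores one perf fd per CPU */
};

SEC("kprobe/sys_write")
int measure(struct pt_regs *ctx)
{
	u32 key = bpf_get_smp_processor_id();

	/* read the hardware counter user space stored in slot 'key' */
	u64 count = bpf_perf_event_read(&counters, key);

	/* ... use count ... */
	return 0;
}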

> Also, BPF should not be a way to get around the filedesc resource limit.

all bpf tracing stuff is root only and maps are charged for every element.

> > The program only needs a stable pointer to 'struct perf_event', which
> > it will use while running.
> > At the end it will call perf_event_release_kernel(), which
> > is == put_event().
> > It was the case that 'perf_events' were normal refcnt-ed structures
> > and the last guy frees it.
> 
> Sort-of, but user events are (or should be, rather) tied to the filedesc
> to account for the resources used.
> 
> There is also the event->owner field: we track the task that created the
> event, and with your current scheme that is left dangling once userspace
> closes the last filedesc while you still have a ref open.
> 
> > This put_event_last() logic definitely looks problematic.
> > There are no ordering guarantees.
> > User space may close the FD while struct perf_event is still alive.
> > The loop around put_event_last() looks buggy.
> > I'm obviously missing the main goal of this patch.
> 
> Right, so the patch in question tries to synchronously clean up
> everything related to the counter when we close the file, such that the
> file better reflects the actual resource usage.
> 
> Currently we do this async (and with holes).
> 
> In short, user-created events really should be filedesc-based; yes, we
> have event references, but those 'should' be short-lived.

I'm still missing why it's a problem.
Which counter do you want to bump as part of perf_event_get()?
Still event->refcount, right?
But the same perf_event can be stored in multiple bpf maps
and many bpf programs can be using it, while nothing can
possibly prevent user space from doing close(perf_event_fd)
while programs are still running and collecting tlb miss data
from the counters.
So what do you propose?

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Alexander Shishkin @ 2016-01-22 11:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, alexei.starovoitov

Peter Zijlstra <peterz@infradead.org> writes:

> So I think there's a number of problems still :-(

Also, it does indeed race with
__perf_event_exit_task()/sync_child_event(), but that one I'd fix by
simply wrapping the sync_child_event()/free_event() in

mutex_lock(&parent_event->child_mutex);
if (!is_orphaned_event(parent_event)) {
   sync_child_event(child_event);
   free_event(child_event);
}
mutex_unlock(&parent_event->child_mutex);

At some later point in time the code there could use a bit of
reshuffling, I guess.

Regards,
--
Alex

* Re: [PATCH v2] perf: Synchronously cleanup child events
  2016-01-22 11:35                   ` Alexander Shishkin
@ 2016-01-22 12:12                     ` Peter Zijlstra
  2016-01-22 12:38                     ` Peter Zijlstra
  1 sibling, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2016-01-22 12:12 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, alexei.starovoitov

On Fri, Jan 22, 2016 at 01:35:40PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > So I think there's a number of problems still :-(
> 
> Also, it does indeed race with
> __perf_event_exit_task()/sync_child_event(), but that one I'd fix by
> simply wrapping the sync_child_event()/free_event() in
> 
> mutex_lock(&parent_event->child_mutex);
> if (!is_orphaned_event(parent_event)) {
>    sync_child_event(child_event);
>    free_event(child_event);
> }
> mutex_unlock(&parent_event->child_mutex);

So I've been staring at exactly that code for a while because Ingo
managed to trigger that race (and I could reproduce with his
'workload').

But I'm not seeing how; both sites hold ctx->mutex and remove the event
from the ctx->event_list.

So the way I'm seeing it, either the orphan work finds and frees it, or
the __perf_event_exit_task() one does, but I'm a bit stumped on how they
can both do so.

Sure, the sync stuff is pointless if we're orphaned, but I don't see how
it can harm.

> At some later point in time the code there could use a bit of
> reshuffling, I guess.

Yes.

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Peter Zijlstra @ 2016-01-22 12:38 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, alexei.starovoitov

On Fri, Jan 22, 2016 at 01:35:40PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > So I think there's a number of problems still :-(
> 
> Also, it does indeed race with
> __perf_event_exit_task()/sync_child_event(), but that one I'd fix by
> simply wrapping the sync_child_event()/free_event() in
> 
> mutex_lock(&parent_event->child_mutex);
> if (!is_orphaned_event(parent_event)) {
>    sync_child_event(child_event);
>    free_event(child_event);
> }
> mutex_unlock(&parent_event->child_mutex);

Also, note the comment in _perf_event_disable(): that code relies on
sync_child_event() taking event->parent->child_mutex.

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Alexei Starovoitov @ 2016-01-22 19:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Fri, Jan 22, 2016 at 01:38:47PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 22, 2016 at 01:35:40PM +0200, Alexander Shishkin wrote:
> > Peter Zijlstra <peterz@infradead.org> writes:
> > 
> > > So I think there's a number of problems still :-(

I've been looking at how perf_event->owner is handled and couldn't
figure out how you deal with the case of passing perf_event_fd via scm_rights.
It seems one process can open an event, pass it to another process,
but when the current process exits, owner will still point to the dead
task, since refcount > 0.
Which part am I missing?
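
For reference, the scm_rights case is plain SCM_RIGHTS fd passing over
a UNIX domain socket; a minimal sender sketch (receiver side and error
handling omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd)
{
	char iobuf[1] = { 0 }, cbuf[CMSG_SPACE(sizeof(int))];
	struct iovec iov = { .iov_base = iobuf, .iov_len = 1 };
	struct msghdr msg = {
		.msg_iov        = &iov,
		.msg_iovlen     = 1,
		.msg_control    = cbuf,
		.msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type  = SCM_RIGHTS;
	cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	/* the receiver gets its own reference to the same perf event file */
	return sendmsg(sock, &msg, 0);
}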

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Peter Zijlstra @ 2016-01-25 11:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Fri, Jan 22, 2016 at 11:44:03AM -0800, Alexei Starovoitov wrote:
> On Fri, Jan 22, 2016 at 01:38:47PM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 22, 2016 at 01:35:40PM +0200, Alexander Shishkin wrote:
> > > Peter Zijlstra <peterz@infradead.org> writes:
> > > 
> > > > So I think there's a number of problems still :-(
> 
> I've been looking at how perf_event->owner is handled and couldn't
> figure out how you deal with the case of passing perf_event_fd via scm_rights.
> It seems one process can open an event, pass it to another process,
> but when the current process exits, owner will still point to the dead
> task, since refcount > 0.
> Which part am I missing?

Nothing, you raised a good point. I think this shows we cannot link
!event->owner to an event being 'dead'.

If we keep these two states separate, the scm_rights thing should work
again.

* Re: [PATCH v2] perf: Synchronously cleanup child events
From: Peter Zijlstra @ 2016-01-25 14:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Mon, Jan 25, 2016 at 12:48:46PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 22, 2016 at 11:44:03AM -0800, Alexei Starovoitov wrote:
> > On Fri, Jan 22, 2016 at 01:38:47PM +0100, Peter Zijlstra wrote:
> > > On Fri, Jan 22, 2016 at 01:35:40PM +0200, Alexander Shishkin wrote:
> > > > Peter Zijlstra <peterz@infradead.org> writes:
> > > > 
> > > > > So I think there's a number of problems still :-(
> > 
> > I've been looking at how perf_event->owner is handled and couldn't
> > figure out how you deal with the case of passing perf_event_fd via scm_rights.
> > It seems one process can open an event, pass it to another process,
> > but when the current process exits, owner will still point to the dead
> > task, since refcount > 0.
> > Which part am I missing?
> 
> Nothing, you raised a good point. I think this shows we cannot link
> !event->owner to an event being 'dead'.
> 
> If we keep these two states separate, the scm_rights thing should work
> again.

Alexander, Alexei,

How about the below? That uses event->state == PERF_EVENT_STATE_EXIT to
indicate the event has been given up by its 'owner' and decouples us
from the actual event->owner logic.

This retains the event->owner and event->owner_list thing purely for the
prctl(.option = PR_TASK_PERF_EVENTS_{EN,DIS}ABLE) calls, but does give
us strict 'owner' semantics in that:

  struct perf_event *my_event = perf_event_create_kernel_counter();

  /* ... */

  perf_event_release_kernel(my_event);

Or

  int fd = sys_perf_event_open(...);

  close(fd); /* last, calls fops::release */

Will destroy the event dead. event::refcount will 'retain' the object,
but it will become non-functional and is strictly meant as a temporary
existence guarantee (for when RCU isn't good enough).

So this should restore the scm_rights case, which preserves the fd but
could result in not having event->owner (and therefore being removed
from its owner_list), which is fine.

BPF still needs to get fixed to use filedesc references instead.

---

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -634,9 +634,6 @@ struct perf_event_context {
 	int				nr_cgroups;	 /* cgroup evts */
 	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
-
-	struct delayed_work		orphans_remove;
-	bool				orphans_remove_sched;
 };
 
 /*
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,8 +49,6 @@
 
 #include <asm/irq_regs.h>
 
-static struct workqueue_struct *perf_wq;
-
 typedef int (*remote_function_f)(void *);
 
 struct remote_function_call {
@@ -1086,8 +1084,8 @@ static void put_ctx(struct perf_event_co
  * Lock order:
  *	task_struct::perf_event_mutex
  *	  perf_event_context::mutex
- *	    perf_event_context::lock
  *	    perf_event::child_mutex;
+ *	      perf_event_context::lock
  *	    perf_event::mmap_mutex
  *	    mmap_sem
  */
@@ -1646,45 +1644,11 @@ static void perf_group_detach(struct per
 		perf_event__header_size(tmp);
 }
 
-/*
- * User event without the task.
- */
 static bool is_orphaned_event(struct perf_event *event)
 {
-	return event && !is_kernel_event(event) && !event->owner;
-}
-
-/*
- * Event has a parent but parent's task finished and it's
- * alive only because of children holding refference.
- */
-static bool is_orphaned_child(struct perf_event *event)
-{
-	return is_orphaned_event(event->parent);
-}
-
-static void orphans_remove_work(struct work_struct *work);
-
-static void schedule_orphans_remove(struct perf_event_context *ctx)
-{
-	if (!ctx->task || ctx->orphans_remove_sched || !perf_wq)
-		return;
-
-	if (queue_delayed_work(perf_wq, &ctx->orphans_remove, 1)) {
-		get_ctx(ctx);
-		ctx->orphans_remove_sched = true;
-	}
+	return event->state == PERF_EVENT_STATE_EXIT;
 }
 
-static int __init perf_workqueue_init(void)
-{
-	perf_wq = create_singlethread_workqueue("perf");
-	WARN(!perf_wq, "failed to create perf workqueue\n");
-	return perf_wq ? 0 : -1;
-}
-
-core_initcall(perf_workqueue_init);
-
 static inline int pmu_filter_match(struct perf_event *event)
 {
 	struct pmu *pmu = event->pmu;
@@ -1745,9 +1709,6 @@ event_sched_out(struct perf_event *event
 	if (event->attr.exclusive || !cpuctx->active_oncpu)
 		cpuctx->exclusive = 0;
 
-	if (is_orphaned_child(event))
-		schedule_orphans_remove(ctx);
-
 	perf_pmu_enable(event->pmu);
 }
 
@@ -1771,6 +1732,9 @@ group_sched_out(struct perf_event *group
 		cpuctx->exclusive = 0;
 }
 
+#define DETACH_GROUP	0x01
+#define DETACH_STATE	0x02
+
 /*
  * Cross CPU call to remove a performance event
  *
@@ -1783,13 +1747,19 @@ __perf_remove_from_context(struct perf_e
 			   struct perf_event_context *ctx,
 			   void *info)
 {
-	bool detach_group = (unsigned long)info;
+	unsigned long flags = (unsigned long)info;
 
 	event_sched_out(event, cpuctx, ctx);
-	if (detach_group)
+
+	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
 	list_del_event(event, ctx);
 
+	if (flags & DETACH_STATE) {
+		event->state = PERF_EVENT_STATE_EXIT;
+		perf_event_wakeup(event);
+	}
+
 	if (!ctx->nr_events && ctx->is_active) {
 		ctx->is_active = 0;
 		if (ctx->task) {
@@ -1809,12 +1779,11 @@ __perf_remove_from_context(struct perf_e
  * When called from perf_event_exit_task, it's OK because the
  * context has been detached from its task.
  */
-static void perf_remove_from_context(struct perf_event *event, bool detach_group)
+static void perf_remove_from_context(struct perf_event *event, unsigned long flags)
 {
 	lockdep_assert_held(&event->ctx->mutex);
 
-	event_function_call(event, __perf_remove_from_context,
-			    (void *)(unsigned long)detach_group);
+	event_function_call(event, __perf_remove_from_context, (void *)flags);
 }
 
 /*
@@ -1980,9 +1949,6 @@ event_sched_in(struct perf_event *event,
 	if (event->attr.exclusive)
 		cpuctx->exclusive = 1;
 
-	if (is_orphaned_child(event))
-		schedule_orphans_remove(ctx);
-
 out:
 	perf_pmu_enable(event->pmu);
 
@@ -2253,7 +2219,8 @@ static void __perf_event_enable(struct p
 	struct perf_event *leader = event->group_leader;
 	struct perf_event_context *task_ctx;
 
-	if (event->state >= PERF_EVENT_STATE_INACTIVE)
+	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
+	    event->state <= PERF_EVENT_STATE_ERROR)
 		return;
 
 	update_context_time(ctx);
@@ -2298,7 +2265,8 @@ static void _perf_event_enable(struct pe
 	struct perf_event_context *ctx = event->ctx;
 
 	raw_spin_lock_irq(&ctx->lock);
-	if (event->state >= PERF_EVENT_STATE_INACTIVE) {
+	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
+	    event->state < PERF_EVENT_STATE_ERROR) {
 		raw_spin_unlock_irq(&ctx->lock);
 		return;
 	}
@@ -3363,7 +3331,6 @@ static void __perf_event_init_context(st
 	INIT_LIST_HEAD(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
 	atomic_set(&ctx->refcount, 1);
-	INIT_DELAYED_WORK(&ctx->orphans_remove, orphans_remove_work);
 }
 
 static struct perf_event_context *
@@ -3665,29 +3632,6 @@ static bool exclusive_event_installable(
 	return true;
 }
 
-static void __free_event(struct perf_event *event)
-{
-	if (!event->parent) {
-		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
-			put_callchain_buffers();
-	}
-
-	perf_event_free_bpf_prog(event);
-
-	if (event->destroy)
-		event->destroy(event);
-
-	if (event->ctx)
-		put_ctx(event->ctx);
-
-	if (event->pmu) {
-		exclusive_event_destroy(event);
-		module_put(event->pmu->module);
-	}
-
-	call_rcu(&event->rcu_head, free_event_rcu);
-}
-
 static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending);
@@ -3709,7 +3653,25 @@ static void _free_event(struct perf_even
 	if (is_cgroup_event(event))
 		perf_detach_cgroup(event);
 
-	__free_event(event);
+	if (!event->parent) {
+		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
+			put_callchain_buffers();
+	}
+
+	perf_event_free_bpf_prog(event);
+
+	if (event->destroy)
+		event->destroy(event);
+
+	if (event->ctx)
+		put_ctx(event->ctx);
+
+	if (event->pmu) {
+		exclusive_event_destroy(event);
+		module_put(event->pmu->module);
+	}
+
+	call_rcu(&event->rcu_head, free_event_rcu);
 }
 
 /*
@@ -3771,8 +3733,10 @@ static void perf_remove_from_owner(struc
 		 * ensured they're done, and we can proceed with freeing the
 		 * event.
 		 */
-		if (event->owner)
+		if (event->owner) {
 			list_del_init(&event->owner_entry);
+			event->owner = NULL;
+		}
 		mutex_unlock(&owner->perf_event_mutex);
 		put_task_struct(owner);
 	}
@@ -3780,11 +3744,23 @@ static void perf_remove_from_owner(struc
 
 static void put_event(struct perf_event *event)
 {
-	struct perf_event_context *ctx;
-
 	if (!atomic_long_dec_and_test(&event->refcount))
 		return;
 
+	_free_event(event);
+}
+
+/*
+ * Kill an event dead; while event::refcount will preserve the event
+ * object, it will not preserve its functionality. Once the last 'user'
+ * gives up the object, we'll destroy the thing.
+ */
+int perf_event_release_kernel(struct perf_event *event)
+{
+	struct perf_event_context *ctx;
+	struct perf_event *child, *tmp;
+	LIST_HEAD(child_list);
+
 	if (!is_kernel_event(event))
 		perf_remove_from_owner(event);
 
@@ -3802,14 +3778,57 @@ static void put_event(struct perf_event
 	 */
 	ctx = perf_event_ctx_lock_nested(event, SINGLE_DEPTH_NESTING);
 	WARN_ON_ONCE(ctx->parent_ctx);
-	perf_remove_from_context(event, true);
+	perf_remove_from_context(event, DETACH_GROUP | DETACH_STATE);
 	perf_event_ctx_unlock(event, ctx);
 
-	_free_event(event);
-}
+	/*
+	 * At this point we must have event->state == PERF_EVENT_STATE_EXIT,
+	 * either from the above perf_remove_from_context() or through
+	 * __perf_event_exit_task().
+	 *
+	 * Therefore, anybody acquiring event->child_mutex after the below
+	 * list_splice_init() _must_ also see this, most importantly
+	 * inherit_event() which will avoid placing more children on the
+	 * list.
+	 *
+	 * Thus this guarantees that we will in fact observe and kill _ALL_
+	 * child events.
+	 */
+	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_EXIT);
 
-int perf_event_release_kernel(struct perf_event *event)
-{
+	/*
+	 * event::child_mutex nests inside ctx::mutex, so move children
+	 * to a safe place first and avoid inversion
+	 */
+	mutex_lock(&event->child_mutex);
+	list_splice_init(&event->child_list, &child_list);
+	mutex_unlock(&event->child_mutex);
+
+	list_for_each_entry_safe(child, tmp, &child_list, child_list) {
+		struct perf_event_context *ctx;
+
+		/*
+		 * This is somewhat similar to perf_free_event(),
+		 * except for these events are alive and need
+		 * proper perf_remove_from_context().
+		 */
+		ctx = perf_event_ctx_lock(child);
+		perf_remove_from_context(child, DETACH_GROUP);
+		perf_event_ctx_unlock(child, ctx);
+
+		list_del(&child->child_list);
+
+		/* Children will have exactly one reference */
+		free_event(child);
+
+		/*
+		 * This matches the refcount bump in inherit_event();
+		 * this can't be the last reference.
+		 */
+		put_event(event);
+	}
+
+	/* Must be the last reference */
 	put_event(event);
 	return 0;
 }
@@ -3820,46 +3839,10 @@ EXPORT_SYMBOL_GPL(perf_event_release_ker
  */
 static int perf_release(struct inode *inode, struct file *file)
 {
-	put_event(file->private_data);
+	perf_event_release_kernel(file->private_data);
 	return 0;
 }
 
-/*
- * Remove all orphanes events from the context.
- */
-static void orphans_remove_work(struct work_struct *work)
-{
-	struct perf_event_context *ctx;
-	struct perf_event *event, *tmp;
-
-	ctx = container_of(work, struct perf_event_context,
-			   orphans_remove.work);
-
-	mutex_lock(&ctx->mutex);
-	list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry) {
-		struct perf_event *parent_event = event->parent;
-
-		if (!is_orphaned_child(event))
-			continue;
-
-		perf_remove_from_context(event, true);
-
-		mutex_lock(&parent_event->child_mutex);
-		list_del_init(&event->child_list);
-		mutex_unlock(&parent_event->child_mutex);
-
-		free_event(event);
-		put_event(parent_event);
-	}
-
-	raw_spin_lock_irq(&ctx->lock);
-	ctx->orphans_remove_sched = false;
-	raw_spin_unlock_irq(&ctx->lock);
-	mutex_unlock(&ctx->mutex);
-
-	put_ctx(ctx);
-}
-
 u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 {
 	struct perf_event *child;
@@ -8432,11 +8415,11 @@ SYSCALL_DEFINE5(perf_event_open,
 		 * See perf_event_ctx_lock() for comments on the details
 		 * of swizzling perf_event::ctx.
 		 */
-		perf_remove_from_context(group_leader, false);
+		perf_remove_from_context(group_leader, 0);
 
 		list_for_each_entry(sibling, &group_leader->sibling_list,
 				    group_entry) {
-			perf_remove_from_context(sibling, false);
+			perf_remove_from_context(sibling, 0);
 			put_ctx(gctx);
 		}
 
@@ -8616,7 +8599,7 @@ void perf_pmu_migrate_context(struct pmu
 	mutex_lock_double(&src_ctx->mutex, &dst_ctx->mutex);
 	list_for_each_entry_safe(event, tmp, &src_ctx->event_list,
 				 event_entry) {
-		perf_remove_from_context(event, false);
+		perf_remove_from_context(event, 0);
 		unaccount_event_cpu(event, src_cpu);
 		put_ctx(src_ctx);
 		list_add(&event->migrate_entry, &events);
@@ -8665,7 +8648,7 @@ void perf_pmu_migrate_context(struct pmu
 EXPORT_SYMBOL_GPL(perf_pmu_migrate_context);
 
 static void sync_child_event(struct perf_event *child_event,
-			       struct task_struct *child)
+			     struct task_struct *child)
 {
 	struct perf_event *parent_event = child_event->parent;
 	u64 child_val;
@@ -8684,25 +8667,6 @@ static void sync_child_event(struct perf
 	atomic64_add(child_event->total_time_running,
 		     &parent_event->child_total_time_running);
 
-	/*
-	 * Remove this event from the parent's list
-	 */
-	WARN_ON_ONCE(parent_event->ctx->parent_ctx);
-	mutex_lock(&parent_event->child_mutex);
-	list_del_init(&child_event->child_list);
-	mutex_unlock(&parent_event->child_mutex);
-
-	/*
-	 * Make sure user/parent get notified, that we just
-	 * lost one event.
-	 */
-	perf_event_wakeup(parent_event);
-
-	/*
-	 * Release the parent event, if this was the last
-	 * reference to it.
-	 */
-	put_event(parent_event);
 }
 
 static void
@@ -8710,6 +8674,8 @@ __perf_event_exit_task(struct perf_event
 			 struct perf_event_context *child_ctx,
 			 struct task_struct *child)
 {
+	struct perf_event *parent_event = child_event->parent;
+
 	/*
 	 * Do not destroy the 'original' grouping; because of the context
 	 * switch optimization the original events could've ended up in a
@@ -8728,20 +8694,39 @@ __perf_event_exit_task(struct perf_event
 	if (!!child_event->parent)
 		perf_group_detach(child_event);
 	list_del_event(child_event, child_ctx);
+	child_event->state = PERF_EVENT_STATE_EXIT; /* see perf_event_release_kernel() */
 	raw_spin_unlock_irq(&child_ctx->lock);
 
 	/*
-	 * It can happen that the parent exits first, and has events
-	 * that are still around due to the child reference. These
-	 * events need to be zapped.
+	 * Parent events are governed by their filedesc, retain them.
 	 */
-	if (child_event->parent) {
-		sync_child_event(child_event, child);
-		free_event(child_event);
-	} else {
-		child_event->state = PERF_EVENT_STATE_EXIT;
+	if (!child_event->parent) {
 		perf_event_wakeup(child_event);
+		return;
 	}
+	/*
+	 * This is a child event, cleanup and free.
+	 */
+
+	/*
+	 * Fold delta back into parent counts.
+	 */
+	sync_child_event(child_event, child);
+
+	/*
+	 * Remove this event from the parent's list.
+	 */
+	WARN_ON_ONCE(parent_event->ctx->parent_ctx);
+	mutex_lock(&parent_event->child_mutex);
+	list_del_init(&child_event->child_list);
+	mutex_unlock(&parent_event->child_mutex);
+
+	/*
+	 * Kick perf_poll for is_event_hup().
+	 */
+	perf_event_wakeup(parent_event);
+	free_event(child_event);
+	put_event(parent_event);
 }
 
 static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
@@ -8973,8 +8958,15 @@ inherit_event(struct perf_event *parent_
 	if (IS_ERR(child_event))
 		return child_event;
 
+	/*
+	 * is_orphaned_event() and list_add_tail(&parent_event->child_list)
+	 * must be under the same lock in order to serialize against
+	 * perf_event_release_kernel().
+	 */
+	mutex_lock(&parent_event->child_mutex);
 	if (is_orphaned_event(parent_event) ||
 	    !atomic_long_inc_not_zero(&parent_event->refcount)) {
+		mutex_unlock(&parent_event->child_mutex);
 		free_event(child_event);
 		return NULL;
 	}
@@ -9022,8 +9014,6 @@ inherit_event(struct perf_event *parent_
 	/*
 	 * Link this into the parent event's child list
 	 */
-	WARN_ON_ONCE(parent_event->ctx->parent_ctx);
-	mutex_lock(&parent_event->child_mutex);
 	list_add_tail(&child_event->child_list, &parent_event->child_list);
 	mutex_unlock(&parent_event->child_mutex);
 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2] perf: Synchronously cleanup child events
  2016-01-25 14:54                           ` Peter Zijlstra
@ 2016-01-25 21:04                             ` Peter Zijlstra
  2016-01-26  4:59                               ` Alexei Starovoitov
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2016-01-25 21:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Mon, Jan 25, 2016 at 03:54:14PM +0100, Peter Zijlstra wrote:
> Alexander, Alexei,
> 
> How about the below? That uses event->state == PERF_EVENT_STATE_EXIT to
> indicate the event has been given up by its 'owner' and decouples us
> from the actual event->owner logic.
> 
> This retains the event->owner and event->owner_list thing purely for the
> prctl(.option = PR_TASK_PERF_EVENTS_{EN,DIS}ABLE) calls, but does give
> us strict 'owner' semantics in that:
> 
>   struct perf_event *my_event = perf_event_create_kernel_counter(...);
> 
>   /* ... */
> 
>   perf_event_release_kernel(my_event);
> 
> Or
> 
>   int fd = sys_perf_event_open(...);
> 
>   close(fd); /* last, calls fops::release */
> 
> Will destroy the event dead. event::refcount will 'retain' the object,
> but it will become non-functional and is strictly meant as a temporary
> existence guarantee (for when RCU isn't good enough).
> 
> So this should restore the scm_rights case, which preserves the fd but
> could result in not having event->owner (and therefore being removed
> from its owner_list), which is fine.
> 
> BPF still needs to get fixed to use filedesc references instead.

Still no BPF, but this one actually 'works', as in it doesn't have the
blatant exit races and has survived a few hours of runtime.
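
Relative to v1, the children can no longer simply be spliced off under
child_mutex, because ctx::mutex must be taken before it. The reaping
loop below therefore peeks at a child under child_mutex, drops that
lock, takes the child's ctx::mutex, re-takes child_mutex and then
revalidates before freeing. In pattern form (a simplified sketch, not
the exact kernel code):

  again:
  	lock(child_mutex);
  	for_each(child, child_list) {
  		ctx = child->ctx;
  		get_ref(ctx);			/* ctx must survive the unlock */
  		unlock(child_mutex);

  		lock(ctx_mutex);		/* correct order: ctx, then child */
  		lock(child_mutex);
  		if (first(child_list) == child)	/* revalidate; an exiting task */
  			reap(child);		/* may have raced us meanwhile  */
  		unlock(child_mutex);
  		unlock(ctx_mutex);
  		put_ref(ctx);
  		goto again;			/* the list may have changed */
  	}
  	unlock(child_mutex);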

---
 include/linux/perf_event.h |    3 
 kernel/events/core.c       |  304 ++++++++++++++++++++++-----------------------
 2 files changed, 150 insertions(+), 157 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -634,9 +634,6 @@ struct perf_event_context {
 	int				nr_cgroups;	 /* cgroup evts */
 	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
-
-	struct delayed_work		orphans_remove;
-	bool				orphans_remove_sched;
 };
 
 /*
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,8 +49,6 @@
 
 #include <asm/irq_regs.h>
 
-static struct workqueue_struct *perf_wq;
-
 typedef int (*remote_function_f)(void *);
 
 struct remote_function_call {
@@ -1086,8 +1084,8 @@ static void put_ctx(struct perf_event_co
  * Lock order:
  *	task_struct::perf_event_mutex
  *	  perf_event_context::mutex
- *	    perf_event_context::lock
  *	    perf_event::child_mutex;
+ *	      perf_event_context::lock
  *	    perf_event::mmap_mutex
  *	    mmap_sem
  */
@@ -1646,45 +1644,11 @@ static void perf_group_detach(struct per
 		perf_event__header_size(tmp);
 }
 
-/*
- * User event without the task.
- */
 static bool is_orphaned_event(struct perf_event *event)
 {
-	return event && !is_kernel_event(event) && !event->owner;
-}
-
-/*
- * Event has a parent but parent's task finished and it's
- * alive only because of children holding refference.
- */
-static bool is_orphaned_child(struct perf_event *event)
-{
-	return is_orphaned_event(event->parent);
+	return event->state == PERF_EVENT_STATE_EXIT;
 }
 
-static void orphans_remove_work(struct work_struct *work);
-
-static void schedule_orphans_remove(struct perf_event_context *ctx)
-{
-	if (!ctx->task || ctx->orphans_remove_sched || !perf_wq)
-		return;
-
-	if (queue_delayed_work(perf_wq, &ctx->orphans_remove, 1)) {
-		get_ctx(ctx);
-		ctx->orphans_remove_sched = true;
-	}
-}
-
-static int __init perf_workqueue_init(void)
-{
-	perf_wq = create_singlethread_workqueue("perf");
-	WARN(!perf_wq, "failed to create perf workqueue\n");
-	return perf_wq ? 0 : -1;
-}
-
-core_initcall(perf_workqueue_init);
-
 static inline int pmu_filter_match(struct perf_event *event)
 {
 	struct pmu *pmu = event->pmu;
@@ -1745,9 +1709,6 @@ event_sched_out(struct perf_event *event
 	if (event->attr.exclusive || !cpuctx->active_oncpu)
 		cpuctx->exclusive = 0;
 
-	if (is_orphaned_child(event))
-		schedule_orphans_remove(ctx);
-
 	perf_pmu_enable(event->pmu);
 }
 
@@ -1771,6 +1732,9 @@ group_sched_out(struct perf_event *group
 		cpuctx->exclusive = 0;
 }
 
+#define DETACH_GROUP	0x01
+#define DETACH_STATE	0x02
+
 /*
  * Cross CPU call to remove a performance event
  *
@@ -1783,13 +1747,19 @@ __perf_remove_from_context(struct perf_e
 			   struct perf_event_context *ctx,
 			   void *info)
 {
-	bool detach_group = (unsigned long)info;
+	unsigned long flags = (unsigned long)info;
 
 	event_sched_out(event, cpuctx, ctx);
-	if (detach_group)
+
+	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
 	list_del_event(event, ctx);
 
+	if (flags & DETACH_STATE) {
+		event->state = PERF_EVENT_STATE_EXIT;
+		perf_event_wakeup(event);
+	}
+
 	if (!ctx->nr_events && ctx->is_active) {
 		ctx->is_active = 0;
 		if (ctx->task) {
@@ -1809,12 +1779,11 @@ __perf_remove_from_context(struct perf_e
  * When called from perf_event_exit_task, it's OK because the
  * context has been detached from its task.
  */
-static void perf_remove_from_context(struct perf_event *event, bool detach_group)
+static void perf_remove_from_context(struct perf_event *event, unsigned long flags)
 {
 	lockdep_assert_held(&event->ctx->mutex);
 
-	event_function_call(event, __perf_remove_from_context,
-			    (void *)(unsigned long)detach_group);
+	event_function_call(event, __perf_remove_from_context, (void *)flags);
 }
 
 /*
@@ -1980,9 +1949,6 @@ event_sched_in(struct perf_event *event,
 	if (event->attr.exclusive)
 		cpuctx->exclusive = 1;
 
-	if (is_orphaned_child(event))
-		schedule_orphans_remove(ctx);
-
 out:
 	perf_pmu_enable(event->pmu);
 
@@ -2253,7 +2219,8 @@ static void __perf_event_enable(struct p
 	struct perf_event *leader = event->group_leader;
 	struct perf_event_context *task_ctx;
 
-	if (event->state >= PERF_EVENT_STATE_INACTIVE)
+	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
+	    event->state <= PERF_EVENT_STATE_ERROR)
 		return;
 
 	update_context_time(ctx);
@@ -2298,7 +2265,8 @@ static void _perf_event_enable(struct pe
 	struct perf_event_context *ctx = event->ctx;
 
 	raw_spin_lock_irq(&ctx->lock);
-	if (event->state >= PERF_EVENT_STATE_INACTIVE) {
+	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
+	    event->state < PERF_EVENT_STATE_ERROR) {
 		raw_spin_unlock_irq(&ctx->lock);
 		return;
 	}
@@ -3363,7 +3331,6 @@ static void __perf_event_init_context(st
 	INIT_LIST_HEAD(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
 	atomic_set(&ctx->refcount, 1);
-	INIT_DELAYED_WORK(&ctx->orphans_remove, orphans_remove_work);
 }
 
 static struct perf_event_context *
@@ -3665,29 +3632,6 @@ static bool exclusive_event_installable(
 	return true;
 }
 
-static void __free_event(struct perf_event *event)
-{
-	if (!event->parent) {
-		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
-			put_callchain_buffers();
-	}
-
-	perf_event_free_bpf_prog(event);
-
-	if (event->destroy)
-		event->destroy(event);
-
-	if (event->ctx)
-		put_ctx(event->ctx);
-
-	if (event->pmu) {
-		exclusive_event_destroy(event);
-		module_put(event->pmu->module);
-	}
-
-	call_rcu(&event->rcu_head, free_event_rcu);
-}
-
 static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending);
@@ -3709,7 +3653,25 @@ static void _free_event(struct perf_even
 	if (is_cgroup_event(event))
 		perf_detach_cgroup(event);
 
-	__free_event(event);
+	if (!event->parent) {
+		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
+			put_callchain_buffers();
+	}
+
+	perf_event_free_bpf_prog(event);
+
+	if (event->destroy)
+		event->destroy(event);
+
+	if (event->ctx)
+		put_ctx(event->ctx);
+
+	if (event->pmu) {
+		exclusive_event_destroy(event);
+		module_put(event->pmu->module);
+	}
+
+	call_rcu(&event->rcu_head, free_event_rcu);
 }
 
 /*
@@ -3771,8 +3733,10 @@ static void perf_remove_from_owner(struc
 		 * ensured they're done, and we can proceed with freeing the
 		 * event.
 		 */
-		if (event->owner)
+		if (event->owner) {
 			list_del_init(&event->owner_entry);
+			event->owner = NULL;
+		}
 		mutex_unlock(&owner->perf_event_mutex);
 		put_task_struct(owner);
 	}
@@ -3780,11 +3744,23 @@ static void perf_remove_from_owner(struc
 
 static void put_event(struct perf_event *event)
 {
-	struct perf_event_context *ctx;
-
 	if (!atomic_long_dec_and_test(&event->refcount))
 		return;
 
+	_free_event(event);
+}
+
+/*
+ * Kill an event dead; while event::refcount will preserve the event
+ * object, it will not preserve its functionality. Once the last 'user'
+ * gives up the object, we'll destroy the thing.
+ */
+int perf_event_release_kernel(struct perf_event *event)
+{
+	struct perf_event_context *ctx;
+	struct perf_event *child, *tmp;
+	LIST_HEAD(child_list);
+
 	if (!is_kernel_event(event))
 		perf_remove_from_owner(event);
 
@@ -3802,14 +3778,61 @@ static void put_event(struct perf_event
 	 */
 	ctx = perf_event_ctx_lock_nested(event, SINGLE_DEPTH_NESTING);
 	WARN_ON_ONCE(ctx->parent_ctx);
-	perf_remove_from_context(event, true);
+	perf_remove_from_context(event, DETACH_GROUP | DETACH_STATE);
 	perf_event_ctx_unlock(event, ctx);
 
-	_free_event(event);
-}
+	/*
+	 * At this point we must have event->state == PERF_EVENT_STATE_EXIT,
+	 * either from the above perf_remove_from_context() or through
+	 * __perf_event_exit_task().
+	 *
+	 * Therefore, anybody acquiring event->child_mutex after the below
+	 * list_splice_init() _must_ also see this, most importantly
+	 * inherit_event() which will avoid placing more children on the
+	 * list.
+	 *
+	 * Thus this guarantees that we will in fact observe and kill _ALL_
+	 * child events.
+	 */
+	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_EXIT);
 
-int perf_event_release_kernel(struct perf_event *event)
-{
+again:
+	mutex_lock(&event->child_mutex);
+	list_for_each_entry(child, &event->child_list, child_list) {
+
+		ctx = lockless_dereference(child->ctx);
+		/*
+		 * must stay valid, see __perf_event_exit_task() holding
+		 * event->child_mutex.
+		 */
+		get_ctx(ctx);
+
+		mutex_unlock(&event->child_mutex);
+		mutex_lock(&ctx->mutex);
+		mutex_lock(&event->child_mutex);
+
+		tmp = list_first_entry_or_null(&event->child_list,
+					       struct perf_event, child_list);
+
+		if (tmp == child) {
+			perf_remove_from_context(child, DETACH_GROUP);
+			list_del(&child->child_list);
+			free_event(child);
+			/*
+			 * This matches the refcount bump in inherit_event();
+			 * this can't be the last reference.
+			 */
+			put_event(event);
+		}
+
+		mutex_unlock(&event->child_mutex);
+		mutex_unlock(&ctx->mutex);
+		put_ctx(ctx);
+		goto again;
+	}
+	mutex_unlock(&event->child_mutex);
+
+	/* Must be the last reference */
 	put_event(event);
 	return 0;
 }
@@ -3820,46 +3843,10 @@ EXPORT_SYMBOL_GPL(perf_event_release_ker
  */
 static int perf_release(struct inode *inode, struct file *file)
 {
-	put_event(file->private_data);
+	perf_event_release_kernel(file->private_data);
 	return 0;
 }
 
-/*
- * Remove all orphanes events from the context.
- */
-static void orphans_remove_work(struct work_struct *work)
-{
-	struct perf_event_context *ctx;
-	struct perf_event *event, *tmp;
-
-	ctx = container_of(work, struct perf_event_context,
-			   orphans_remove.work);
-
-	mutex_lock(&ctx->mutex);
-	list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry) {
-		struct perf_event *parent_event = event->parent;
-
-		if (!is_orphaned_child(event))
-			continue;
-
-		perf_remove_from_context(event, true);
-
-		mutex_lock(&parent_event->child_mutex);
-		list_del_init(&event->child_list);
-		mutex_unlock(&parent_event->child_mutex);
-
-		free_event(event);
-		put_event(parent_event);
-	}
-
-	raw_spin_lock_irq(&ctx->lock);
-	ctx->orphans_remove_sched = false;
-	raw_spin_unlock_irq(&ctx->lock);
-	mutex_unlock(&ctx->mutex);
-
-	put_ctx(ctx);
-}
-
 u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 {
 	struct perf_event *child;
@@ -8432,11 +8419,11 @@ SYSCALL_DEFINE5(perf_event_open,
 		 * See perf_event_ctx_lock() for comments on the details
 		 * of swizzling perf_event::ctx.
 		 */
-		perf_remove_from_context(group_leader, false);
+		perf_remove_from_context(group_leader, 0);
 
 		list_for_each_entry(sibling, &group_leader->sibling_list,
 				    group_entry) {
-			perf_remove_from_context(sibling, false);
+			perf_remove_from_context(sibling, 0);
 			put_ctx(gctx);
 		}
 
@@ -8616,7 +8603,7 @@ void perf_pmu_migrate_context(struct pmu
 	mutex_lock_double(&src_ctx->mutex, &dst_ctx->mutex);
 	list_for_each_entry_safe(event, tmp, &src_ctx->event_list,
 				 event_entry) {
-		perf_remove_from_context(event, false);
+		perf_remove_from_context(event, 0);
 		unaccount_event_cpu(event, src_cpu);
 		put_ctx(src_ctx);
 		list_add(&event->migrate_entry, &events);
@@ -8665,7 +8652,7 @@ void perf_pmu_migrate_context(struct pmu
 EXPORT_SYMBOL_GPL(perf_pmu_migrate_context);
 
 static void sync_child_event(struct perf_event *child_event,
-			       struct task_struct *child)
+			     struct task_struct *child)
 {
 	struct perf_event *parent_event = child_event->parent;
 	u64 child_val;
@@ -8684,25 +8671,6 @@ static void sync_child_event(struct perf
 	atomic64_add(child_event->total_time_running,
 		     &parent_event->child_total_time_running);
 
-	/*
-	 * Remove this event from the parent's list
-	 */
-	WARN_ON_ONCE(parent_event->ctx->parent_ctx);
-	mutex_lock(&parent_event->child_mutex);
-	list_del_init(&child_event->child_list);
-	mutex_unlock(&parent_event->child_mutex);
-
-	/*
-	 * Make sure user/parent get notified, that we just
-	 * lost one event.
-	 */
-	perf_event_wakeup(parent_event);
-
-	/*
-	 * Release the parent event, if this was the last
-	 * reference to it.
-	 */
-	put_event(parent_event);
 }
 
 static void
@@ -8710,6 +8678,11 @@ __perf_event_exit_task(struct perf_event
 			 struct perf_event_context *child_ctx,
 			 struct task_struct *child)
 {
+	struct perf_event *parent_event = child_event->parent;
+
+	if (parent_event)
+		mutex_lock(&parent_event->child_mutex);
+
 	/*
 	 * Do not destroy the 'original' grouping; because of the context
 	 * switch optimization the original events could've ended up in a
@@ -8728,20 +8701,38 @@ __perf_event_exit_task(struct perf_event
 	if (!!child_event->parent)
 		perf_group_detach(child_event);
 	list_del_event(child_event, child_ctx);
+	child_event->state = PERF_EVENT_STATE_EXIT; /* see perf_event_release_kernel() */
 	raw_spin_unlock_irq(&child_ctx->lock);
 
 	/*
-	 * It can happen that the parent exits first, and has events
-	 * that are still around due to the child reference. These
-	 * events need to be zapped.
+	 * Parent events are governed by their filedesc, retain them.
 	 */
-	if (child_event->parent) {
-		sync_child_event(child_event, child);
-		free_event(child_event);
-	} else {
-		child_event->state = PERF_EVENT_STATE_EXIT;
+	if (!child_event->parent) {
 		perf_event_wakeup(child_event);
+		return;
 	}
+	/*
+	 * This is a child event, cleanup and free.
+	 */
+
+	/*
+	 * Fold delta back into parent counts.
+	 */
+	sync_child_event(child_event, child);
+
+	/*
+	 * Remove this event from the parent's list.
+	 */
+	WARN_ON_ONCE(parent_event->ctx->parent_ctx);
+	list_del_init(&child_event->child_list);
+	mutex_unlock(&parent_event->child_mutex);
+
+	/*
+	 * Kick perf_poll for is_event_hup().
+	 */
+	perf_event_wakeup(parent_event);
+	free_event(child_event);
+	put_event(parent_event);
 }
 
 static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
@@ -8973,8 +8964,15 @@ inherit_event(struct perf_event *parent_
 	if (IS_ERR(child_event))
 		return child_event;
 
+	/*
+	 * is_orphaned_event() and list_add_tail(&parent_event->child_list)
+	 * must be under the same lock in order to serialize against
+	 * perf_event_release_kernel().
+	 */
+	mutex_lock(&parent_event->child_mutex);
 	if (is_orphaned_event(parent_event) ||
 	    !atomic_long_inc_not_zero(&parent_event->refcount)) {
+		mutex_unlock(&parent_event->child_mutex);
 		free_event(child_event);
 		return NULL;
 	}
@@ -9022,8 +9020,6 @@ inherit_event(struct perf_event *parent_
 	/*
 	 * Link this into the parent event's child list
 	 */
-	WARN_ON_ONCE(parent_event->ctx->parent_ctx);
-	mutex_lock(&parent_event->child_mutex);
 	list_add_tail(&child_event->child_list, &parent_event->child_list);
 	mutex_unlock(&parent_event->child_mutex);
 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2] perf: Synchronously cleanup child events
  2016-01-25 21:04                             ` Peter Zijlstra
@ 2016-01-26  4:59                               ` Alexei Starovoitov
  2016-01-26 16:16                                 ` Peter Zijlstra
  2016-01-29 11:28                                 ` [tip:perf/urgent] perf/bpf: Convert perf_event_array to use struct file tip-bot for Alexei Starovoitov
  0 siblings, 2 replies; 33+ messages in thread
From: Alexei Starovoitov @ 2016-01-26  4:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, Daniel Borkmann, Wang Nan

On Mon, Jan 25, 2016 at 10:04:10PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 25, 2016 at 03:54:14PM +0100, Peter Zijlstra wrote:
> > Alexander, Alexei,
> > 
> > How about the below? That uses event->state == PERF_EVENT_STATE_EXIT to
> > indicate the event has been given up by its 'owner' and decouples us
> > from the actual event->owner logic.
> > 
> > This retains the event->owner and event->owner_list thing purely for the
> > prctl(.option = PR_TASK_PERF_EVENTS_{EN,DIS}ABLE) calls, but does give
> > us strict 'owner' semantics in that:
> > 
> >   struct perf_event *my_event = perf_event_create_kernel_counter(...);
> > 
> >   /* ... */
> > 
> >   perf_event_release_kernel(my_event);
> > 
> > Or
> > 
> >   int fd = sys_perf_event_open(...);
> > 
> >   close(fd); /* last, calls fops::release */
> > 
> > Will destroy the event dead. event::refcount will 'retain' the object,
> > but it will become non-functional and is strictly meant as a temporary
> > existence guarantee (for when RCU isn't good enough).
> > 
> > So this should restore the scm_rights case, which preserves the fd but
> > could result in not having event->owner (and therefore being removed
> > from its owner_list), which is fine.
> > 
> > BPF still needs to get fixed to use filedesc references instead.
> 
> Still no BPF, but this one actually 'works', as in it doesn't have the
> blatant exit races and has survived a few hours of runtime.
> 
> ---
>  include/linux/perf_event.h |    3 
>  kernel/events/core.c       |  304 ++++++++++++++++++++++-----------------------
>  2 files changed, 150 insertions(+), 157 deletions(-)

I think I understand what you're trying to do and
the patch looks good to me.
As far as the BPF side goes, I did the following...
does it match the model you outlined above?
I did basic testing and it looks fine.

Subject: [PATCH ] perf,bpf: convert perf_event_array to use struct file

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/perf_event.h |  4 ++--
 kernel/bpf/arraymap.c      | 21 +++++++++++----------
 kernel/events/core.c       | 20 ++++++++------------
 kernel/trace/bpf_trace.c   | 14 ++++++++++----
 4 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f9828a48f16a..df275020fde9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -729,7 +729,7 @@ extern int perf_event_init_task(struct task_struct *child);
 extern void perf_event_exit_task(struct task_struct *child);
 extern void perf_event_free_task(struct task_struct *task);
 extern void perf_event_delayed_put(struct task_struct *task);
-extern struct perf_event *perf_event_get(unsigned int fd);
+extern struct file *perf_event_get(unsigned int fd);
 extern const struct perf_event_attr *perf_event_attrs(struct perf_event *event);
 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
@@ -1070,7 +1070,7 @@ static inline int perf_event_init_task(struct task_struct *child)	{ return 0; }
 static inline void perf_event_exit_task(struct task_struct *child)	{ }
 static inline void perf_event_free_task(struct task_struct *task)	{ }
 static inline void perf_event_delayed_put(struct task_struct *task)	{ }
-static inline struct perf_event *perf_event_get(unsigned int fd)	{ return ERR_PTR(-EINVAL); }
+static inline struct file *perf_event_get(unsigned int fd)	{ return ERR_PTR(-EINVAL); }
 static inline const struct perf_event_attr *perf_event_attrs(struct perf_event *event)
 {
 	return ERR_PTR(-EINVAL);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index b0799bced518..89ebbc4d1164 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -291,10 +291,13 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
 {
 	struct perf_event *event;
 	const struct perf_event_attr *attr;
+	struct file *file;
 
-	event = perf_event_get(fd);
-	if (IS_ERR(event))
-		return event;
+	file = perf_event_get(fd);
+	if (IS_ERR(file))
+		return file;
+
+	event = file->private_data;
 
 	attr = perf_event_attrs(event);
 	if (IS_ERR(attr))
@@ -304,24 +307,22 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
 		goto err;
 
 	if (attr->type == PERF_TYPE_RAW)
-		return event;
+		return file;
 
 	if (attr->type == PERF_TYPE_HARDWARE)
-		return event;
+		return file;
 
 	if (attr->type == PERF_TYPE_SOFTWARE &&
 	    attr->config == PERF_COUNT_SW_BPF_OUTPUT)
-		return event;
+		return file;
 err:
-	perf_event_release_kernel(event);
+	fput(file);
 	return ERR_PTR(-EINVAL);
 }
 
 static void perf_event_fd_array_put_ptr(void *ptr)
 {
-	struct perf_event *event = ptr;
-
-	perf_event_release_kernel(event);
+	fput((struct file *)ptr);
 }
 
 static const struct bpf_map_ops perf_event_array_ops = {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 06ae52e99ac2..2a95e0d2370f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8896,21 +8896,17 @@ void perf_event_delayed_put(struct task_struct *task)
 		WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
 }
 
-struct perf_event *perf_event_get(unsigned int fd)
+struct file *perf_event_get(unsigned int fd)
 {
-	int err;
-	struct fd f;
-	struct perf_event *event;
-
-	err = perf_fget_light(fd, &f);
-	if (err)
-		return ERR_PTR(err);
+	struct file *file;
 
-	event = f.file->private_data;
-	atomic_long_inc(&event->refcount);
-	fdput(f);
+	file = fget_raw(fd);
+	if (file->f_op != &perf_fops) {
+		fput(file);
+		return ERR_PTR(-EBADF);
+	}
 
-	return event;
+	return file;
 }
 
 const struct perf_event_attr *perf_event_attrs(struct perf_event *event)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 45dd798bcd37..326a75e884db 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -191,14 +191,17 @@ static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
 	struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	struct perf_event *event;
+	struct file *file;
 
 	if (unlikely(index >= array->map.max_entries))
 		return -E2BIG;
 
-	event = (struct perf_event *)array->ptrs[index];
-	if (!event)
+	file = (struct file *)array->ptrs[index];
+	if (unlikely(!file))
 		return -ENOENT;
 
+	event = file->private_data;
+
 	/* make sure event is local and doesn't have pmu::count */
 	if (event->oncpu != smp_processor_id() ||
 	    event->pmu->count)
@@ -228,6 +231,7 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
 	void *data = (void *) (long) r4;
 	struct perf_sample_data sample_data;
 	struct perf_event *event;
+	struct file *file;
 	struct perf_raw_record raw = {
 		.size = size,
 		.data = data,
@@ -236,10 +240,12 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
 	if (unlikely(index >= array->map.max_entries))
 		return -E2BIG;
 
-	event = (struct perf_event *)array->ptrs[index];
-	if (unlikely(!event))
+	file = (struct file *)array->ptrs[index];
+	if (unlikely(!file))
 		return -ENOENT;
 
+	event = file->private_data;
+
 	if (unlikely(event->attr.type != PERF_TYPE_SOFTWARE ||
 		     event->attr.config != PERF_COUNT_SW_BPF_OUTPUT))
 		return -EINVAL;
-- 
2.4.6

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2] perf: Synchronously cleanup child events
  2016-01-26  4:59                               ` Alexei Starovoitov
@ 2016-01-26 16:16                                 ` Peter Zijlstra
  2016-01-26 17:24                                   ` Peter Zijlstra
  2016-01-29 11:28                                 ` [tip:perf/urgent] perf/bpf: Convert perf_event_array to use struct file tip-bot for Alexei Starovoitov
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2016-01-26 16:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, Daniel Borkmann, Wang Nan

On Mon, Jan 25, 2016 at 08:59:49PM -0800, Alexei Starovoitov wrote:
> I think I understand what you're trying to do and
> the patch looks good to me.

Great, thanks!

> As far as the BPF side goes, I did the following...
> does it match the model you outlined above?

Yes, a few comments/questions below.

> 
> Subject: [PATCH ] perf,bpf: convert perf_event_array to use struct file
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Can I take this through the tip/perf tree so that all these changes land
together?

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 06ae52e99ac2..2a95e0d2370f 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8896,21 +8896,17 @@ void perf_event_delayed_put(struct task_struct *task)
>  		WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
>  }
>  
> +struct file *perf_event_get(unsigned int fd)
>  {
> +	struct file *file;
>  
> +	file = fget_raw(fd);

fget_raw() to guarantee the return value isn't NULL? afaict the O_PATH
stuff does not apply to perf events, so you'd put (fput) any fd for
which the distinction matters anyway.
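
For reference, the difference between the two helpers in sketch form
(simplified):

  struct file *f;

  f = fget(fd);		/* NULL for a bad fd, and for O_PATH (FMODE_PATH) files */
  f = fget_raw(fd);	/* NULL only for a bad fd; O_PATH files are handed back */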

> +	if (file->f_op != &perf_fops) {
> +		fput(file);
> +		return ERR_PTR(-EBADF);
> +	}
>  
> +	return file;
>  }

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2] perf: Synchronously cleanup child events
  2016-01-26 16:16                                 ` Peter Zijlstra
@ 2016-01-26 17:24                                   ` Peter Zijlstra
  2016-01-26 23:31                                     ` Alexei Starovoitov
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2016-01-26 17:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, Daniel Borkmann, Wang Nan

On Tue, Jan 26, 2016 at 05:16:37PM +0100, Peter Zijlstra wrote:
> > +struct file *perf_event_get(unsigned int fd)
> >  {
> > +	struct file *file;
> >  
> > +	file = fget_raw(fd);
> 
> fget_raw() to guarantee the return value isn't NULL? afaict the O_PATH
> stuff does not apply to perf events, so you'd put (fput) any fd for
> which the distinction matters anyway.
> 
> > +	if (file->f_op != &perf_fops) {
> > +		fput(file);
> > +		return ERR_PTR(-EBADF);
> > +	}
> >  
> > +	return file;
> >  }

Is it not possible for one thread to concurrently call close() while
this thread tries to fget()? In which case, we must check the return
value anyway?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2] perf: Synchronously cleanup child events
  2016-01-26 17:24                                   ` Peter Zijlstra
@ 2016-01-26 23:31                                     ` Alexei Starovoitov
  2016-01-27  9:58                                       ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2016-01-26 23:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, Daniel Borkmann, Wang Nan

On Tue, Jan 26, 2016 at 06:24:25PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 26, 2016 at 05:16:37PM +0100, Peter Zijlstra wrote:
> > > +struct file *perf_event_get(unsigned int fd)
> > >  {
> > > +	struct file *file;
> > >  
> > > +	file = fget_raw(fd);
> > 
> > fget_raw() to guarantee the return value isn't NULL? afaict the O_PATH
> > stuff does not apply to perf events, so you'd put (fput) any fd for
> > which the distinction matters anyway.

Yeah, good catch; the following is needed:
        file = fget_raw(fd);
+       if (!file)
+               return ERR_PTR(-EBADF);

> > 
> > > +	if (file->f_op != &perf_fops) {
> > > +		fput(file);
> > > +		return ERR_PTR(-EBADF);
> > > +	}
> > >  
> > > +	return file;
> > >  }
> 
> Is it not possible for one thread to concurrently call close() while
> this thread tries to fget()? In which case, we must check the return
> value anyway?

The !file check is definitely needed; the fd passed by the user can be bogus.
The caller of perf_event_get() checks for errors too.

This patch will conflict with changes to kernel/bpf/arraymap.c and
kernel/trace/bpf_trace.c that are planned for net-next,
but the conflicts in kernel/events/core.c are probably harder
to resolve, so yes, please take it into tip/perf.
I think your scm_rights fixes depend on this patch, and together
they make an important bug fix, so it probably makes sense to send
them right now without waiting for the next merge window.
As soon as you get the whole thing into tip, I'll test it
to make sure the bpf side is ok, and I hope Wang will test it too.

I'm still a bit concerned about taking a file reference for this,
since bpf programs that use perf_events won't be able to be
'detached'. Meaning there always has to be a user space process
holding the perf_event FDs. On the networking side we don't have
this limitation: we can attach bpf to TC, iproute2 will exit and
reattach some time later. So it kinda sux, but it sounds like you
want to get rid of perf_event->refcnt completely, so I don't see
any other way. We can fix it later if it really becomes an issue.
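
Concretely, the constraint is that something in userspace must keep a
reference chain to the event alive. A rough sketch of the flow, with
bpf_map_update_elem() standing in for whatever wrapper around the
BPF_MAP_UPDATE_ELEM command is in use (names illustrative):

  struct perf_event_attr attr = {
  	.size	= sizeof(attr),
  	.type	= PERF_TYPE_SOFTWARE,
  	.config	= PERF_COUNT_SW_BPF_OUTPUT,
  };
  int key = 0, pe_fd;

  pe_fd = syscall(__NR_perf_event_open, &attr,
  		-1 /* pid */, 0 /* cpu */, -1 /* group */, 0 /* flags */);

  /* the map's fget() now pins the struct file behind pe_fd ... */
  bpf_map_update_elem(map_fd, &key, &pe_fd, BPF_ANY);

  /*
   * ... so the loader may close its own copy; but once the map itself
   * goes away (no fds and no programs referencing it), the event is
   * released too. There is no way to leave things 'detached' as with TC.
   */
  close(pe_fd);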

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2] perf: Synchronously cleanup child events
  2016-01-26 23:31                                     ` Alexei Starovoitov
@ 2016-01-27  9:58                                       ` Peter Zijlstra
  2016-01-27 17:52                                         ` Alexei Starovoitov
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2016-01-27  9:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, Daniel Borkmann, Wang Nan

On Tue, Jan 26, 2016 at 03:31:59PM -0800, Alexei Starovoitov wrote:

> This patch will conflict with changes to kernel/bpf/arraymap.c and
> kernel/trace/bpf_trace.c that are planned for net-next,
> but the conflicts in kernel/events/core.c are probably harder
> to resolve, so yes, please take it into tip/perf.

Thanks.

> I think your scm_rights fixes depend on this patch, and together
> they make an important bug fix, so it probably makes sense to send
> them right now without waiting for the next merge window.

I'll leave that up to Ingo, but likely yes.

> As soon as you get the whole thing into tip, I'll test it
> to make sure the bpf side is ok, and I hope Wang will test it too.
> 
> I'm still a bit concerned about taking a file reference for this,
> since bpf programs that use perf_events won't be able to be
> 'detached'.

I was not aware BPF could be detached like that.

> Meaning there always has to be a user space process
> holding the perf_event FDs.

By using fget() the BPF array thing will hold the FDs, right? I mean
once you do a full fget(), userspace can go and kill itself; the struct
file will persist.

> On the networking side we don't have
> this limitation: we can attach bpf to TC, iproute2 will exit and
> reattach some time later. So it kinda sux, but it sounds like you
> want to get rid of perf_event->refcnt completely,

We cannot actually get rid of it; we need it for some existence stuff.
But yes, we need stricter cleanup.

> so I don't see any other way.
> We can fix it later if it really becomes an issue.

One possible avenue would be to allow BPF to create its own (kernel)
events and manage their lifetime along with the BPF object. But so far I
don't see a problem with the file thing.
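
For comparison, the in-kernel avenue mentioned above would look
something like this (a sketch only; error handling and the tie-in to
the BPF object's teardown elided):

  struct perf_event_attr attr = {
  	.size	= sizeof(attr),
  	.type	= PERF_TYPE_SOFTWARE,
  	.config	= PERF_COUNT_SW_BPF_OUTPUT,
  };
  struct perf_event *event;

  /* no filedesc involved; the caller manages the lifetime */
  event = perf_event_create_kernel_counter(&attr, cpu, NULL /* task */,
  					 NULL /* overflow_handler */,
  					 NULL /* context */);
  if (IS_ERR(event))
  	return PTR_ERR(event);

  /* ... and when the owning BPF object is destroyed: */
  perf_event_release_kernel(event);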

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2] perf: Synchronously cleanup child events
  2016-01-27  9:58                                       ` Peter Zijlstra
@ 2016-01-27 17:52                                         ` Alexei Starovoitov
  0 siblings, 0 replies; 33+ messages in thread
From: Alexei Starovoitov @ 2016-01-27 17:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel, vince, eranian,
	Arnaldo Carvalho de Melo, Jiri Olsa, Daniel Borkmann, Wang Nan

On Wed, Jan 27, 2016 at 10:58:22AM +0100, Peter Zijlstra wrote:
> 
> > Meaning there always has to be a user space process
> > holding the perf_event FDs.
> 
> By using fget() the BPF array thing will hold the FDs, right? I mean
> once you do a full fget(), userspace can go and kill itself; the struct
> file will persist.

Right. I mistakenly thought that fget() would prevent the task from
being freed, or that some fs/directory resources would be left hanging.
It's from an anon_inode, so all good. All concerns cleared.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [tip:perf/urgent] perf/bpf: Convert perf_event_array to use struct file
  2016-01-26  4:59                               ` Alexei Starovoitov
  2016-01-26 16:16                                 ` Peter Zijlstra
@ 2016-01-29 11:28                                 ` tip-bot for Alexei Starovoitov
  2016-01-29 20:01                                   ` Alexei Starovoitov
  1 sibling, 1 reply; 33+ messages in thread
From: tip-bot for Alexei Starovoitov @ 2016-01-29 11:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: ast, hpa, namhyung, jolsa, vincent.weaver, peterz, acme, eranian,
	linux-kernel, daniel, acme, jolsa, mingo, alexei.starovoitov,
	torvalds, alexander.shishkin, wangnan0, tglx, dsahern

Commit-ID:  e03e7ee34fdd1c3ef494949a75cb8c61c7265fa9
Gitweb:     http://git.kernel.org/tip/e03e7ee34fdd1c3ef494949a75cb8c61c7265fa9
Author:     Alexei Starovoitov <alexei.starovoitov@gmail.com>
AuthorDate: Mon, 25 Jan 2016 20:59:49 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 29 Jan 2016 08:35:25 +0100

perf/bpf: Convert perf_event_array to use struct file

Robustify refcounting.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160126045947.GA40151@ast-mbp.thefacebook.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/perf_event.h |  4 ++--
 kernel/bpf/arraymap.c      | 21 +++++++++++----------
 kernel/events/core.c       | 21 ++++++++++-----------
 kernel/trace/bpf_trace.c   | 14 ++++++++++----
 4 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6612732..4f90434 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -729,7 +729,7 @@ extern int perf_event_init_task(struct task_struct *child);
 extern void perf_event_exit_task(struct task_struct *child);
 extern void perf_event_free_task(struct task_struct *task);
 extern void perf_event_delayed_put(struct task_struct *task);
-extern struct perf_event *perf_event_get(unsigned int fd);
+extern struct file *perf_event_get(unsigned int fd);
 extern const struct perf_event_attr *perf_event_attrs(struct perf_event *event);
 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
@@ -1070,7 +1070,7 @@ static inline int perf_event_init_task(struct task_struct *child)	{ return 0; }
 static inline void perf_event_exit_task(struct task_struct *child)	{ }
 static inline void perf_event_free_task(struct task_struct *task)	{ }
 static inline void perf_event_delayed_put(struct task_struct *task)	{ }
-static inline struct perf_event *perf_event_get(unsigned int fd)	{ return ERR_PTR(-EINVAL); }
+static inline struct file *perf_event_get(unsigned int fd)	{ return ERR_PTR(-EINVAL); }
 static inline const struct perf_event_attr *perf_event_attrs(struct perf_event *event)
 {
 	return ERR_PTR(-EINVAL);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index b0799bc..89ebbc4 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -291,10 +291,13 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
 {
 	struct perf_event *event;
 	const struct perf_event_attr *attr;
+	struct file *file;
 
-	event = perf_event_get(fd);
-	if (IS_ERR(event))
-		return event;
+	file = perf_event_get(fd);
+	if (IS_ERR(file))
+		return file;
+
+	event = file->private_data;
 
 	attr = perf_event_attrs(event);
 	if (IS_ERR(attr))
@@ -304,24 +307,22 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
 		goto err;
 
 	if (attr->type == PERF_TYPE_RAW)
-		return event;
+		return file;
 
 	if (attr->type == PERF_TYPE_HARDWARE)
-		return event;
+		return file;
 
 	if (attr->type == PERF_TYPE_SOFTWARE &&
 	    attr->config == PERF_COUNT_SW_BPF_OUTPUT)
-		return event;
+		return file;
 err:
-	perf_event_release_kernel(event);
+	fput(file);
 	return ERR_PTR(-EINVAL);
 }
 
 static void perf_event_fd_array_put_ptr(void *ptr)
 {
-	struct perf_event *event = ptr;
-
-	perf_event_release_kernel(event);
+	fput((struct file *)ptr);
 }
 
 static const struct bpf_map_ops perf_event_array_ops = {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fe97f95..eb44730 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8916,21 +8916,20 @@ void perf_event_delayed_put(struct task_struct *task)
 		WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
 }
 
-struct perf_event *perf_event_get(unsigned int fd)
+struct file *perf_event_get(unsigned int fd)
 {
-	int err;
-	struct fd f;
-	struct perf_event *event;
+	struct file *file;
 
-	err = perf_fget_light(fd, &f);
-	if (err)
-		return ERR_PTR(err);
+	file = fget_raw(fd);
+	if (!file)
+		return ERR_PTR(-EBADF);
 
-	event = f.file->private_data;
-	atomic_long_inc(&event->refcount);
-	fdput(f);
+	if (file->f_op != &perf_fops) {
+		fput(file);
+		return ERR_PTR(-EBADF);
+	}
 
-	return event;
+	return file;
 }
 
 const struct perf_event_attr *perf_event_attrs(struct perf_event *event)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 45dd798..326a75e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -191,14 +191,17 @@ static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
 	struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	struct perf_event *event;
+	struct file *file;
 
 	if (unlikely(index >= array->map.max_entries))
 		return -E2BIG;
 
-	event = (struct perf_event *)array->ptrs[index];
-	if (!event)
+	file = (struct file *)array->ptrs[index];
+	if (unlikely(!file))
 		return -ENOENT;
 
+	event = file->private_data;
+
 	/* make sure event is local and doesn't have pmu::count */
 	if (event->oncpu != smp_processor_id() ||
 	    event->pmu->count)
@@ -228,6 +231,7 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
 	void *data = (void *) (long) r4;
 	struct perf_sample_data sample_data;
 	struct perf_event *event;
+	struct file *file;
 	struct perf_raw_record raw = {
 		.size = size,
 		.data = data,
@@ -236,10 +240,12 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 index, u64 r4, u64 size)
 	if (unlikely(index >= array->map.max_entries))
 		return -E2BIG;
 
-	event = (struct perf_event *)array->ptrs[index];
-	if (unlikely(!event))
+	file = (struct file *)array->ptrs[index];
+	if (unlikely(!file))
 		return -ENOENT;
 
+	event = file->private_data;
+
 	if (unlikely(event->attr.type != PERF_TYPE_SOFTWARE ||
 		     event->attr.config != PERF_COUNT_SW_BPF_OUTPUT))
 		return -EINVAL;

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [tip:perf/urgent] perf/bpf: Convert perf_event_array to use struct file
  2016-01-29 11:28                                 ` [tip:perf/urgent] perf/bpf: Convert perf_event_array to use struct file tip-bot for Alexei Starovoitov
@ 2016-01-29 20:01                                   ` Alexei Starovoitov
  0 siblings, 0 replies; 33+ messages in thread
From: Alexei Starovoitov @ 2016-01-29 20:01 UTC (permalink / raw)
  To: tglx, dsahern, wangnan0, alexander.shishkin, torvalds, mingo,
	acme, daniel, jolsa, linux-kernel, eranian, acme, vincent.weaver,
	jolsa, peterz, ast, hpa, namhyung

On Fri, Jan 29, 2016 at 03:28:40AM -0800, tip-bot for Alexei Starovoitov wrote:
> Commit-ID:  e03e7ee34fdd1c3ef494949a75cb8c61c7265fa9
> Gitweb:     http://git.kernel.org/tip/e03e7ee34fdd1c3ef494949a75cb8c61c7265fa9
> Author:     Alexei Starovoitov <alexei.starovoitov@gmail.com>
> AuthorDate: Mon, 25 Jan 2016 20:59:49 -0800
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Fri, 29 Jan 2016 08:35:25 +0100
> 
> perf/bpf: Convert perf_event_array to use struct file
> 
> Robustify refcounting.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20160126045947.GA40151@ast-mbp.thefacebook.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/perf_event.h |  4 ++--
>  kernel/bpf/arraymap.c      | 21 +++++++++++----------
>  kernel/events/core.c       | 21 ++++++++++-----------
>  kernel/trace/bpf_trace.c   | 14 ++++++++++----
>  4 files changed, 33 insertions(+), 27 deletions(-)

Tested the latest perf/urgent.
bpf tests are passing. kmemleak and kasan are quiet.
Thanks

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2016-01-29 20:01 UTC | newest]

Thread overview: 33+ messages
2016-01-15 11:22 [PATCH] perf: Cleanup user's child events Alexander Shishkin
2016-01-15 12:54 ` Peter Zijlstra
2016-01-15 13:05   ` Alexander Shishkin
2016-01-15 13:09     ` Peter Zijlstra
2016-01-15 14:07       ` [PATCH] perf: Synchronously cleanup " Alexander Shishkin
2016-01-15 17:57         ` Peter Zijlstra
2016-01-18 12:07           ` Alexander Shishkin
2016-01-18 12:37           ` Alexander Shishkin
2016-01-18 14:44             ` Peter Zijlstra
2016-01-19 15:12               ` [PATCH v2] " Alexander Shishkin
2016-01-19 20:05                 ` Peter Zijlstra
2016-01-19 21:58                   ` Alexei Starovoitov
2016-01-20  8:32                     ` Peter Zijlstra
2016-01-21  4:55                       ` Alexei Starovoitov
2016-01-20  7:04                   ` Alexander Shishkin
2016-01-20  8:03                     ` Peter Zijlstra
2016-01-22 11:35                   ` Alexander Shishkin
2016-01-22 12:12                     ` Peter Zijlstra
2016-01-22 12:38                     ` Peter Zijlstra
2016-01-22 19:44                       ` Alexei Starovoitov
2016-01-25 11:48                         ` Peter Zijlstra
2016-01-25 14:54                           ` Peter Zijlstra
2016-01-25 21:04                             ` Peter Zijlstra
2016-01-26  4:59                               ` Alexei Starovoitov
2016-01-26 16:16                                 ` Peter Zijlstra
2016-01-26 17:24                                   ` Peter Zijlstra
2016-01-26 23:31                                     ` Alexei Starovoitov
2016-01-27  9:58                                       ` Peter Zijlstra
2016-01-27 17:52                                         ` Alexei Starovoitov
2016-01-29 11:28                                 ` [tip:perf/urgent] perf/bpf: Convert perf_event_array to use struct file tip-bot for Alexei Starovoitov
2016-01-29 20:01                                   ` Alexei Starovoitov
2016-01-19 20:07                 ` [PATCH v2] perf: Synchronously cleanup child events Peter Zijlstra
2016-01-19  7:45         ` [PATCH] " Ingo Molnar
