From: Peter Zijlstra <peterz@infradead.org>
To: Marco Elver <elver@google.com>
Cc: Hillf Danton <hdanton@sina.com>,
	syzbot <syzbot+9228d6098455bb209ec8@syzkaller.appspotmail.com>,
	linux-kernel@vger.kernel.org, syzkaller-bugs@googlegroups.com
Subject: Re: [syzbot] KASAN: use-after-free Read in task_work_run (2)
Date: Wed, 23 Nov 2022 17:27:15 +0100
Message-ID: <Y35J4383T5rAjSjO@hirez.programming.kicks-ass.net>
In-Reply-To: <Y340Y2iwy3YubOk1@elver.google.com>

On Wed, Nov 23, 2022 at 03:55:31PM +0100, Marco Elver wrote:

> > > --- x/kernel/events/core.c
> > > +++ c/kernel/events/core.c
> > > @@ -2291,6 +2291,7 @@ event_sched_out(struct perf_event *event
> > >                     !event->pending_work) {
> > >                         event->pending_work = 1;
> > >                         dec = false;
> > > +                       atomic_long_inc(&event->refcount);
> > >                         task_work_add(current, &event->pending_task, TWA_RESUME);
> > >                 }
> > >                 if (dec)
> > > @@ -6561,6 +6562,8 @@ static void perf_pending_task(struct cal
> > >         struct perf_event *event = container_of(head, struct perf_event, pending_task);
> > >         int rctx;
> > >
> > > +       if (event->state == PERF_EVENT_STATE_DEAD)
> > > +               goto out;
> > >         /*
> > >          * If we 'fail' here, that's OK, it means recursion is already disabled
> > >          * and we won't recurse 'further'.
> > > @@ -6577,6 +6580,8 @@ static void perf_pending_task(struct cal
> > >         if (rctx >= 0)
> > >                 perf_swevent_put_recursion_context(rctx);
> > >         preempt_enable_notrace();
> > > +out:
> > > +       put_event(event);
> > >  }
> > >
> > >  #ifdef CONFIG_GUEST_PERF_EVENTS

This is broken and will corrupt ctx->nr_pending: on a DEAD event the
early 'goto out' jumps over the pending_work handling in
perf_pending_task(), so the nr_pending count taken when the sigtrap was
armed is never dropped.
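
For reference, a paraphrased sketch of the block the early return skips
(from this era's perf_pending_task() in kernel/events/core.c, slightly
simplified):

	if (event->pending_work) {
		event->pending_work = 0;
		perf_sigtrap(event);
		/* skipped by the early return; never decremented */
		local_dec(&event->ctx->nr_pending);
	}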


> My guess is that the __fput task work is in the same task as the perf
> task work, and so if we try to cancel the task work from within
> __fput, it won't actually cancel it if task_work_run() has already
> exchanged the 'task_works' list.

That seems very likely indeed.
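
To make that failure mode concrete, here's a minimal userspace model of
the task_work machinery (emphatically not the kernel code: names mirror
kernel/task_work.c, but the task argument, locking and the cmpxchg() are
all omitted) showing why a cancel issued from inside one of the works
finds nothing:

#include <stdio.h>
#include <stddef.h>

struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *);
};

static struct callback_head *task_works;	/* models current->task_works */

static void task_work_add(struct callback_head *work)
{
	work->next = task_works;		/* LIFO push, as in the kernel */
	task_works = work;
}

static struct callback_head *task_work_cancel(void (*func)(struct callback_head *))
{
	struct callback_head **pprev = &task_works;
	struct callback_head *work;

	/* only walks the attached list; sees nothing once the runner
	 * has taken the whole list */
	for (; (work = *pprev); pprev = &work->next) {
		if (work->func == func) {
			*pprev = work->next;
			return work;
		}
	}
	return NULL;
}

static void task_work_run(void)
{
	/* the kernel detaches the whole list with cmpxchg(); from here
	 * on task_work_cancel() cannot find any of these works */
	struct callback_head *work = task_works;

	task_works = NULL;
	while (work) {
		struct callback_head *next = work->next;
		work->func(work);
		work = next;
	}
}

static void perf_pending_task(struct callback_head *head)
{
	printf("perf_pending_task() runs anyway\n");
}

static void fput_work(struct callback_head *head)
{
	/* models __fput() -> perf_release() -> task_work_cancel() */
	printf("__fput(): cancel returned %p\n",
	       (void *)task_work_cancel(perf_pending_task));
}

int main(void)
{
	struct callback_head perf_work = { .func = perf_pending_task };
	struct callback_head file_work = { .func = fput_work };

	task_work_add(&perf_work);	/* queued first, runs last */
	task_work_add(&file_work);	/* fput(), queued second, runs first */
	task_work_run();
	return 0;
}

The cancel returns NULL: by the time __fput() runs, both works are
already off the list, and perf_pending_task() fires regardless.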

> So it looks like prolonging the perf event's lifetime is the only option
> right now?

Depends a bit on how complicated all this is; at the very least
perf_event_release_kernel() will schedule out the event if it is still
running. It does this before switching the state to DEAD (it has to),
which means it can raise perf_pending_task() at this point in time, even
though we're tearing down the event.

This can be avoided by a patch like this...

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9ab0eb073bd5..e9ad1ff7a9f8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2287,6 +2287,7 @@ group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)
 
 #define DETACH_GROUP	0x01UL
 #define DETACH_CHILD	0x02UL
+#define DETACH_DEAD	0x04UL
 
 /*
  * Cross CPU call to remove a performance event
@@ -2308,12 +2309,20 @@ __perf_remove_from_context(struct perf_event *event,
 		update_cgrp_time_from_cpuctx(cpuctx, false);
 	}
 
+	/*
+	 * Ensure event_sched_out() switches to OFF, at the very least
+	 * this avoids raising perf_pending_task() at this time.
+	 */
+	if (flags & DETACH_DEAD)
+		event->pending_disable = 1;
 	event_sched_out(event, ctx);
 	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
 	if (flags & DETACH_CHILD)
 		perf_child_detach(event);
 	list_del_event(event, ctx);
+	if (flags & DETACH_DEAD)
+		event->state = PERF_EVENT_STATE_DEAD;
 
 	if (!pmu_ctx->nr_events) {
 		pmu_ctx->rotate_necessary = 0;
@@ -5299,9 +5308,7 @@ int perf_event_release_kernel(struct perf_event *event)
 
 	ctx = perf_event_ctx_lock(event);
 	WARN_ON_ONCE(ctx->parent_ctx);
-	perf_remove_from_context(event, DETACH_GROUP);
 
-	raw_spin_lock_irq(&ctx->lock);
 	/*
 	 * Mark this event as STATE_DEAD, there is no external reference to it
 	 * anymore.
@@ -5313,8 +5320,7 @@ int perf_event_release_kernel(struct perf_event *event)
 	 * Thus this guarantees that we will in fact observe and kill _ALL_
 	 * child events.
 	 */
-	event->state = PERF_EVENT_STATE_DEAD;
-	raw_spin_unlock_irq(&ctx->lock);
+	perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
 
 	perf_event_ctx_unlock(event, ctx);
 
---

However, I don't think that actually helps, because in this case the new
task_work would actually still be on the ->task_works list and
task_work_cancel() should've worked.

The other possibility seems to be that the sample happens and we
schedule before close() can terminate the event, which means we've
already got perf_pending_task() queued by the time we get to
perf_remove_from_context().

This means the perf_pending_task() queue happened before the fput()
queue, and it is thus run later (due to FILO ordering -- also see commit
c82199061009 ("task_work: remove fifo ordering guarantee")).

And I can't really see a way out of that other than doing refcount games
indeed.
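
Schematically (with &file_work standing in for the file's real callback
head), the queue order on the task is:

	task_work_add(current, &event->pending_task, TWA_RESUME);  /* sample */
	task_work_add(current, &file_work, TWA_RESUME);            /* fput() */

and task_work_run() pops from the head of the list, so __fput() drops
the last reference on the event before perf_pending_task() gets to run.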

There is the straightforward way, similar to what Hillf attempted, and a
really nasty one that avoids the atomics in the common case and is
really only targeted at this case -- given the overhead of signals, I'm
thinking simple is better.

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9ab0eb073bd5..0228ea090b98 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2248,6 +2248,7 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
 		    !event->pending_work) {
 			event->pending_work = 1;
 			dec = false;
+			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
 			task_work_add(current, &event->pending_task, TWA_RESUME);
 		}
 		if (dec)
@@ -6755,6 +6762,8 @@ static void perf_pending_task(struct callback_head *head)
 	if (rctx >= 0)
 		perf_swevent_put_recursion_context(rctx);
 	preempt_enable_notrace();
+
+	put_event(event);
 }
 
 #ifdef CONFIG_GUEST_PERF_EVENTS


So perhaps both of the above..

Does that actually work?
