From: Peter Zijlstra <peterz@infradead.org>
To: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
	linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	jeremy.linton@arm.com
Subject: Re: Perf hotplug lockup in v4.9-rc8
Date: Mon, 12 Dec 2016 13:42:28 +0100
Message-ID: <20161212124228.GE3124@twins.programming.kicks-ass.net>
In-Reply-To: <20161212114640.GD21248@arm.com>

On Mon, Dec 12, 2016 at 11:46:40AM +0000, Will Deacon wrote:
> > @@ -2331,13 +2330,36 @@ perf_install_in_context(struct perf_event_context *ctx,
> >  	/*
> >  	 * Installing events is tricky because we cannot rely on ctx->is_active
> >  	 * to be set in case this is the nr_events 0 -> 1 transition.
> > +	 *
> > +	 * Instead we use task_curr(), which tells us if the task is running.
> > +	 * However, since we use task_curr() outside of rq::lock, we can race
> > +	 * against the actual state. This means the result can be wrong.
> > +	 *
> > +	 * If we get a false positive, we retry, this is harmless.
> > +	 *
> > +	 * If we get a false negative, things are complicated. If we are after
> > +	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
> > +	 * value must be correct. If we're before, it doesn't matter since
> > +	 * perf_event_context_sched_in() will program the counter.
> > +	 *
> > +	 * However, this hinges on the remote context switch having observed
> > +	 * our task->perf_event_ctxp[] store, such that it will in fact take
> > +	 * ctx::lock in perf_event_context_sched_in().
> > +	 *
> > +	 * We do this by task_function_call(): if the IPI fails to hit the task,
> > +	 * we know any future context switch of the task must see the
> > +	 * perf_event_ctxp[] store.
> >  	 */
> > +
> >  	/*
> > +	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
> > +	 * task_cpu() load, such that if the IPI then does not find the task
> > +	 * running, a future context switch of that task must observe the
> > +	 * store.
> >  	 */
> > +	smp_mb();
> > +again:
> > +	if (!task_function_call(task, __perf_install_in_context, event))
> >  		return;
> 
> I'm trying to figure out whether or not the barriers implied by the IPI
> are sufficient here, or whether we really need the explicit smp_mb().
> Certainly, arch_send_call_function_single_ipi has to order the publishing
> of the remote work before the signalling of the interrupt, but the comment
> above refers to "the task_cpu() load" and I can't see that after your
> diff.
> 
> What are you trying to order here?

I suppose something like this:


CPU0		CPU1		CPU2

		(current == t)

t->perf_event_ctxp[] = ctx;
smp_mb();
cpu = task_cpu(t);

		switch(t, n);
				migrate(t, 2);
				switch(p, t);

				ctx = t->perf_event_ctxp[]; // must not be NULL

smp_function_call(cpu, ..);

		generic_exec_single()
		  func();
		    spin_lock(ctx->lock);
		    if (task_curr(t)) // false

		    add_event_to_ctx();
		    spin_unlock(ctx->lock);

				perf_event_context_sched_in();
				  spin_lock(ctx->lock);
				  // sees event


Where, between setting perf_event_ctxp[] and sending the IPI, the task
moves away and the IPI misses it. While the new CPU is in the middle of
scheduling in t, it hasn't yet passed through
perf_event_context_sched_in(); but when it does, it _must_ observe the
ctx value we stored.

My thinking was that the IPI by itself is not sufficient: when it
misses the task, nothing then guarantees the remote context switch sees
our store. However, if we order our perf_event_ctxp[] store against the
task_cpu() load, then any context switch/migration involved in changing
that value should ensure the remote side observes our prior store.

Of course, even now writing this, I'm still confused.

Thread overview: 17+ messages
2016-12-07 13:53 Perf hotplug lockup in v4.9-rc8 Mark Rutland
2016-12-07 14:30 ` Mark Rutland
2016-12-07 16:39   ` Mark Rutland
2016-12-07 17:53 ` Mark Rutland
2016-12-07 18:34   ` Peter Zijlstra
2016-12-07 19:56     ` Mark Rutland
2016-12-09 13:59     ` Peter Zijlstra
2016-12-12 11:46       ` Will Deacon
2016-12-12 12:42         ` Peter Zijlstra [this message]
2016-12-22  8:45           ` Peter Zijlstra
2016-12-22 14:00             ` Peter Zijlstra
2016-12-22 16:33               ` Paul E. McKenney
2017-01-11 14:59       ` Mark Rutland
2017-01-11 16:03         ` Peter Zijlstra
2017-01-11 16:26           ` Mark Rutland
2017-01-11 19:51           ` Peter Zijlstra
2017-01-14 12:28       ` [tip:perf/urgent] perf/core: Fix sys_perf_event_open() vs. hotplug tip-bot for Peter Zijlstra
