Re: Perf hotplug lockup in v4.9-rc8

From: Will Deacon <will.deacon@arm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>,
	linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	jeremy.linton@arm.com
Subject: Re: Perf hotplug lockup in v4.9-rc8
Date: Mon, 12 Dec 2016 11:46:40 +0000	[thread overview]
Message-ID: <20161212114640.GD21248@arm.com> (raw)
In-Reply-To: <20161209135900.GU3174@twins.programming.kicks-ass.net>

On Fri, Dec 09, 2016 at 02:59:00PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 07, 2016 at 07:34:55PM +0100, Peter Zijlstra wrote:
> 
> > @@ -2352,6 +2357,28 @@ perf_install_in_context(struct perf_event_context *ctx,
> >  		return;
> >  	}
> >  	raw_spin_unlock_irq(&ctx->lock);
> > +
> > +	raw_spin_lock_irq(&task->pi_lock);
> > +	if (!(task->state == TASK_RUNNING || task->state == TASK_WAKING)) {
> > +		/*
> > +		 * XXX horrific hack...
> > +		 */
> > +		raw_spin_lock(&ctx->lock);
> > +		if (task != ctx->task) {
> > +			raw_spin_unlock(&ctx->lock);
> > +			raw_spin_unlock_irq(&task->pi_lock);
> > +			goto again;
> > +		}
> > +
> > +		add_event_to_ctx(event, ctx);
> > +		raw_spin_unlock(&ctx->lock);
> > +		raw_spin_unlock_irq(&task->pi_lock);
> > +		return;
> > +	}
> > +	raw_spin_unlock_irq(&task->pi_lock);
> > +
> > +	cond_resched();
> > +
> >  	/*
> >  	 * Since !ctx->is_active doesn't mean anything, we must IPI
> >  	 * unconditionally.
> 
> So while I went back and forth trying to make that less ugly, I figured
> there was another problem.
> 
> Imagine the cpu_function_call() hitting the 'right' cpu, but not finding
> the task current. It will then continue to install the event in the
> context. However, that doesn't stop another CPU from pulling the task in
> question from our rq and scheduling it elsewhere.
> 
> This all lead me to the below patch.. Now it has a rather large comment,
> and while it represents my current thinking on the matter, I'm not at
> all sure its entirely correct. I got my brain in a fair twist while
> writing it.
> 
> Please as to carefully think about it.
> 
> ---
>  kernel/events/core.c | 70 +++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 48 insertions(+), 22 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6ee1febdf6ff..7d9ae461c535 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2252,7 +2252,7 @@ static int  __perf_install_in_context(void *info)
>  	struct perf_event_context *ctx = event->ctx;
>  	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>  	struct perf_event_context *task_ctx = cpuctx->task_ctx;
> -	bool activate = true;
> +	bool reprogram = true;
>  	int ret = 0;
>  
>  	raw_spin_lock(&cpuctx->ctx.lock);
> @@ -2260,27 +2260,26 @@ static int  __perf_install_in_context(void *info)
>  		raw_spin_lock(&ctx->lock);
>  		task_ctx = ctx;
>  
> -		/* If we're on the wrong CPU, try again */
> -		if (task_cpu(ctx->task) != smp_processor_id()) {
> -			ret = -ESRCH;
> -			goto unlock;
> -		}
> +		reprogram = (ctx->task == current);
>  
>  		/*
> -		 * If we're on the right CPU, see if the task we target is
> -		 * current, if not we don't have to activate the ctx, a future
> -		 * context switch will do that for us.
> +		 * If the task is running, it must be running on this CPU,
> +		 * otherwise we cannot reprogram things.
> +		 *
> +		 * If its not running, we don't care, ctx->lock will
> +		 * serialize against it becoming runnable.
>  		 */
> -		if (ctx->task != current)
> -			activate = false;
> -		else
> -			WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
> +		if (task_curr(ctx->task) && !reprogram) {
> +			ret = -ESRCH;
> +			goto unlock;
> +		}
>  
> +		WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
>  	} else if (task_ctx) {
>  		raw_spin_lock(&task_ctx->lock);
>  	}
>  
> -	if (activate) {
> +	if (reprogram) {
>  		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  		add_event_to_ctx(event, ctx);
>  		ctx_resched(cpuctx, task_ctx);
> @@ -2331,13 +2330,36 @@ perf_install_in_context(struct perf_event_context *ctx,
>  	/*
>  	 * Installing events is tricky because we cannot rely on ctx->is_active
>  	 * to be set in case this is the nr_events 0 -> 1 transition.
> +	 *
> +	 * Instead we use task_curr(), which tells us if the task is running.
> +	 * However, since we use task_curr() outside of rq::lock, we can race
> +	 * against the actual state. This means the result can be wrong.
> +	 *
> +	 * If we get a false positive, we retry, this is harmless.
> +	 *
> +	 * If we get a false negative, things are complicated. If we are after
> +	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
> +	 * value must be correct. If we're before, it doesn't matter since
> +	 * perf_event_context_sched_in() will program the counter.
> +	 *
> +	 * However, this hinges on the remote context switch having observed
> +	 * our task->perf_event_ctxp[] store, such that it will in fact take
> +	 * ctx::lock in perf_event_context_sched_in().
> +	 *
> +	 * We do this by task_function_call(), if the IPI fails to hit the task
> +	 * we know any future context switch of task must see the
> +	 * perf_event_ctpx[] store.
>  	 */
> -again:
> +
>  	/*
> -	 * Cannot use task_function_call() because we need to run on the task's
> -	 * CPU regardless of whether its current or not.
> +	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
> +	 * task_cpu() load, such that if the IPI then does not find the task
> +	 * running, a future context switch of that task must observe the
> +	 * store.
>  	 */
> -	if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
> +	smp_mb();
> +again:
> +	if (!task_function_call(task, __perf_install_in_context, event))
>  		return;

I'm trying to figure out whether or not the barriers implied by the IPI
are sufficient here, or whether we really need the explicit smp_mb().
Certainly, arch_send_call_function_single_ipi has to order the publishing
of the remote work before the signalling of the interrupt, but the comment
above refers to "the task_cpu() load" and I can't see that after your
diff.

What are you trying to order here?

Will

>  
>  	raw_spin_lock_irq(&ctx->lock);
> @@ -2351,12 +2373,16 @@ perf_install_in_context(struct perf_event_context *ctx,
>  		raw_spin_unlock_irq(&ctx->lock);
>  		return;
>  	}
> -	raw_spin_unlock_irq(&ctx->lock);
>  	/*
> -	 * Since !ctx->is_active doesn't mean anything, we must IPI
> -	 * unconditionally.
> +	 * If the task is not running, ctx->lock will avoid it becoming so,
> +	 * thus we can safely install the event.
>  	 */
> -	goto again;
> +	if (task_curr(task)) {
> +		raw_spin_unlock_irq(&ctx->lock);
> +		goto again;
> +	}
> +	add_event_to_ctx(event, ctx);
> +	raw_spin_unlock_irq(&ctx->lock);
>  }
>  
>  /*