Date: Thu, 22 Dec 2016 09:45:09 +0100
From: Peter Zijlstra
To: Will Deacon
Cc: Mark Rutland, linux-kernel@vger.kernel.org, Ingo Molnar,
    Arnaldo Carvalho de Melo, Thomas Gleixner, Sebastian Andrzej Siewior,
    jeremy.linton@arm.com, Boqun Feng, Paul McKenney
Subject: Re: Perf hotplug lockup in v4.9-rc8
Message-ID: <20161222084509.GX3174@twins.programming.kicks-ass.net>
References: <20161207135217.GA25605@leverpostej>
    <20161207175347.GB13840@leverpostej>
    <20161207183455.GQ3124@twins.programming.kicks-ass.net>
    <20161209135900.GU3174@twins.programming.kicks-ass.net>
    <20161212114640.GD21248@arm.com>
    <20161212124228.GE3124@twins.programming.kicks-ass.net>
In-Reply-To: <20161212124228.GE3124@twins.programming.kicks-ass.net>

On Mon, Dec 12, 2016 at 01:42:28PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 12, 2016 at 11:46:40AM +0000, Will Deacon wrote:
> > > @@ -2331,13 +2330,36 @@ perf_install_in_context(struct perf_event_context *ctx,
> > >     /*
> > >      * Installing events is tricky because we cannot rely on ctx->is_active
> > >      * to be set in case this is the nr_events 0 -> 1 transition.
> > > +    *
> > > +    * Instead we use task_curr(), which tells us if the task is running.
> > > +    * However, since we use task_curr() outside of rq::lock, we can race
> > > +    * against the actual state. This means the result can be wrong.
> > > +    *
> > > +    * If we get a false positive, we retry, this is harmless.
> > > +    *
> > > +    * If we get a false negative, things are complicated. If we are after
> > > +    * perf_event_context_sched_in() ctx::lock will serialize us, and the
> > > +    * value must be correct. If we're before, it doesn't matter since
> > > +    * perf_event_context_sched_in() will program the counter.
> > > +    *
> > > +    * However, this hinges on the remote context switch having observed
> > > +    * our task->perf_event_ctxp[] store, such that it will in fact take
> > > +    * ctx::lock in perf_event_context_sched_in().
> > > +    *
> > > +    * We do this by task_function_call(), if the IPI fails to hit the task
> > > +    * we know any future context switch of task must see the
> > > +    * perf_event_ctxp[] store.
> > >      */
> > > +
> > >     /*
> > > +    * This smp_mb() orders the task->perf_event_ctxp[] store with the
> > > +    * task_cpu() load, such that if the IPI then does not find the task
> > > +    * running, a future context switch of that task must observe the
> > > +    * store.
> > >      */
> > > +   smp_mb();
> > > +again:
> > > +   if (!task_function_call(task, __perf_install_in_context, event))
> > >             return;
> >
> > I'm trying to figure out whether or not the barriers implied by the IPI
> > are sufficient here, or whether we really need the explicit smp_mb().
> > Certainly, arch_send_call_function_single_ipi has to order the publishing
> > of the remote work before the signalling of the interrupt, but the comment
> > above refers to "the task_cpu() load" and I can't see that after your
> > diff.
> >
> > What are you trying to order here?
>
> I suppose something like this:
>
>
>   CPU0                           CPU1                    CPU2
>
>                                  (current == t)
>
>   t->perf_event_ctxp[] = ctx;
>   smp_mb();
>   cpu = task_cpu(t);
>
>                                  switch(t, n);
>                                  migrate(t, 2);
>                                                          switch(p, t);
>
>                                                          ctx = t->perf_event_ctxp[]; // must not be NULL
>

So I think I can cast the above into a test like:

  W[x] = 1        W[y] = 1        R[z] = 1
  mb              mb              mb
  R[y] = 0        W[z] = 1        R[x] = 0

Where x is the perf_event_ctxp[], y is our task's cpu, and z is our task
being placed on the rq of cpu2.

See also commit: 8643cda549ca ("sched/core, locking: Document
Program-Order guarantees").

Independent of which cpu initiates the migration between CPU1 and CPU2
there is ordering between the CPUs.

This would then translate into something like:

C C-peterz

{
}

P0(int *x, int *y)
{
        int r1;

        WRITE_ONCE(*x, 1);
        smp_mb();
        r1 = READ_ONCE(*y);
}

P1(int *y, int *z)
{
        WRITE_ONCE(*y, 1);
        smp_mb();
        WRITE_ONCE(*z, 1);
}

P2(int *x, int *z)
{
        int r1;
        int r2;

        r1 = READ_ONCE(*z);
        smp_mb();
        r2 = READ_ONCE(*x);
}

exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)

Which evaluates into:

Test C-peterz Allowed
States 7
0:r1=0; 2:r1=0; 2:r2=0;
0:r1=0; 2:r1=0; 2:r2=1;
0:r1=0; 2:r1=1; 2:r2=1;
0:r1=1; 2:r1=0; 2:r2=0;
0:r1=1; 2:r1=0; 2:r2=1;
0:r1=1; 2:r1=1; 2:r2=0;
0:r1=1; 2:r1=1; 2:r2=1;
No
Witnesses
Positive: 0 Negative: 7
Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
Observation C-peterz Never 0 7
Hash=661589febb9e41b222d8acae1fd64e25

And the strong and weak model agree.

>   smp_function_call(cpu, ..);
>
>     generic_exec_single()
>       func();
>         spin_lock(ctx->lock);
>         if (task_curr(t)) // false
>
>         add_event_to_ctx();
>         spin_unlock(ctx->lock);
>
>                                                          perf_event_context_sched_in();
>                                                            spin_lock(ctx->lock);
>                                                            // sees event
>
>
> Where between setting the perf_event_ctxp[] and sending the IPI the task
> moves away and the IPI misses, and while the new CPU is in the middle of
> scheduling in t, it hasn't yet passed through perf_event_sched_in(), but
> when it does, it _must_ observe the ctx value we stored.
>
> My thinking was that the IPI itself is not sufficient since when it
> misses the task, nothing then guarantees we see the store. However, if
> we order the store and the task_cpu() load, then any context
> switching/migrating involved with changing that value, should ensure we
> see our prior store.
>
> Of course, even now writing this, I'm still confused.

On IRC you said:

: I think it's similar to the "ISA2" litmus test, only the first reads-from edge is an IPI and the second is an Unlock->Lock

In case the IPI misses, we cannot use the IPI itself for anything, I'm
afraid; also, per the above, we don't need to.

: the case I'm more confused by is if CPU2 takes the ctx->lock before CPU1
: I'm guessing that's prevented by the way migration works?

So same scenario, but CPU2 takes the ctx->lock first. In that case it
will not observe our event and do nothing. CPU1 will then acquire
ctx->lock, which implies ordering against CPU2, which means it _must_
observe task_curr() && task != current; it too will not do anything, but
we'll loop and try the whole thing again.
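As an aside (this sketch is not from the original exchange): the
unlock/lock argument in that last paragraph can itself be cast as a
litmus test, assuming the spinlock support that the Linux-kernel memory
model (tools/memory-model, shipped with the kernel for herd7) later
grew. Here ctx_lock stands in for ctx->lock, "installed" for the event
being visible in the context, and "curr" for the task having become
current on the sched-in CPU (what task_curr() would report); the install
is modelled as unconditional since only the outcome where the installer
saw the task as not running matters.

C C-ctx-lock

(*
 * Sketch only: P0 is the installer's locked fallback path after the IPI
 * missed, P1 is the sched-in path.  The task becomes current (curr = 1)
 * before the sched-in path takes ctx->lock.  The forbidden outcome is
 * "the installer saw the task as not running AND the sched-in path saw
 * no event".
 *)

{}

P0(spinlock_t *ctx_lock, int *installed, int *curr)
{
        int r0;

        spin_lock(ctx_lock);
        WRITE_ONCE(*installed, 1);
        r0 = READ_ONCE(*curr);
        spin_unlock(ctx_lock);
}

P1(spinlock_t *ctx_lock, int *installed, int *curr)
{
        int r1;

        WRITE_ONCE(*curr, 1);
        spin_lock(ctx_lock);
        r1 = READ_ONCE(*installed);
        spin_unlock(ctx_lock);
}

exists (0:r0=0 /\ 1:r1=0)

Whichever CPU takes ctx_lock second is ordered after the other's
critical section, so herd7 with that model should report the condition
as never reachable: either the sched-in path sees the freshly installed
event, or the installer sees the task running and loops, which is the
retry described above.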