From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 4 Nov 2015 05:48:38 -0800
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra
Cc: Dave Jones, Linux Kernel, Ingo Molnar, Stephane Eranian, Andi Kleen
Subject: Re: perf related lockdep bug
Message-ID: <20151104134838.GR29027@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
In-Reply-To: <20151104105010.GA11639@twins.programming.kicks-ass.net>
References: <20151104051717.GA6098@codemonkey.org.uk>
 <20151104102151.GG17308@twins.programming.kicks-ass.net>
 <20151104102800.GZ11639@twins.programming.kicks-ass.net>
 <20151104105010.GA11639@twins.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 04, 2015 at 11:50:10AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 04, 2015 at 11:28:00AM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 04, 2015 at 11:21:51AM +0100, Peter Zijlstra wrote:
> > 
> > > The problem appears to be due to the new RCU expedited grace period
> > > stuff, with rcu_read_unlock() now randomly trying to acquire locks it
> > > previously didn't.
> > > 
> > > Lemme go look at those rcu bits again..
> > 
> > Paul, I think this is because of:
> > 
> >   8203d6d0ee78 ("rcu: Use single-stage IPI algorithm for RCU expedited grace period")
> > 
> > What happens is that the IPI comes in and tags any random
> > rcu_read_unlock() with the special bit, which then goes on and takes
> > locks.
> > 
> > Now the problem is that we have scheduler activity inside this lock;
> > the one reported by lockdep seems easy enough to fix, see below.
> > 
> > I'll go and see if there are more sites that can cause this.
> 
> *sigh* yes, there's gobs more. In fact the very first one I looked at:
> 
>   rcu_dump_cpu_stacks()
>     raw_spin_lock_irqsave(&rnp->lock, flags);
>     dump_cpu_task()
>       pr_info()
> 
> That too will end up doing wakeups..
> 
> idem:
>  - print_other_cpu_stall()
>  - rcu_print_detail_task_stall_rnp()
> 
> So just like you had to pull out all the rcu_gp_kthread_wake() calls
> from under rnp->lock, so too must we pull out all implied wakeups,
> which very much include printk().

Ouch!!!  Thank you for the analysis, though I am very surprised that my
testing did not find this.

But pulling all printk()s out from under rnp->lock is going to
re-introduce some stall-warning bugs.  So what other options do I have?

o	I could do raise_softirq(), then report the quiescent state in
	the core RCU code, but I bet that raise_softirq()'s wakeup gets
	me into just as much trouble.

o	Ditto for workqueues, I suspect.

o	I cannot send an IPI because interrupts are disabled, and that
	would be rather annoying from a real-time perspective in any
	case.

Any others?

Ah, and I see why my testing missed this -- I don't normally exercise
perf while running rcutorture.
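
(For the archives, the forbidden pattern Peter is pointing at boils
down to the following hand-written sketch -- an illustration, not
actual kernel code; rnp->lock plus printk() stands in for any of the
call chains he lists above:)

	raw_spin_lock_irqsave(&rnp->lock, flags);
	/* ... inspect the rcu_node structure ... */
	printk("INFO: stall diagnostics ...\n");
	/*
	 * printk() implies wakeups (console/klogd), and a wakeup takes
	 * scheduler locks.  Meanwhile rcu_read_unlock() can now take
	 * rnp->lock while scheduler locks are held, so doing a wakeup
	 * under rnp->lock closes a deadlock cycle.
	 */
	raw_spin_unlock_irqrestore(&rnp->lock, flags);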
So this hit the code in perf_lock_task_context() that disables
preemption across an RCU read-side critical section, which previously
sufficed to prevent this scenario.  What happened this time is as
follows:

o	CPU 0 entered perf_lock_task_context(), disabled preemption,
	and entered its RCU read-side critical section.  Of course,
	the whole point of disabling preemption is to prevent the
	matching rcu_read_unlock() from grabbing locks.

o	CPU 1 started an expedited grace period.  It checked CPU state,
	saw that CPU 0 was running in the kernel, and therefore IPIed it.

o	The IPI handler running on CPU 0 saw that there was an RCU
	read-side critical section in effect, so it set the
	->exp_need_qs flag.

o	When the matching rcu_read_unlock() executes, it notes that
	->exp_need_qs is set, and therefore grabs the locks that it
	shouldn't, hence lockdep's complaints about deadlock.

This problem is caused by the IPI handler interrupting the RCU
read-side critical section.  One way to prevent the IPI from doing
this is to disable interrupts across the RCU read-side critical
section instead of merely disabling preemption.  This is a reasonable
approach given that acquiring the scheduler locks is going to disable
interrupts in any case.

The (untested) patch below takes this approach.  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 90ab56c9f90865991155d8f23441a5e135aa33cc
Author: Paul E. McKenney
Date:   Wed Nov 4 05:37:08 2015 -0800

    perf: Disable irqs across RCU RS CS that acquires scheduler lock
    
    The perf_lock_task_context() function disables preemption across its
    RCU read-side critical section because that critical section acquires
    a scheduler lock.  If there was a preemption during that RCU read-side
    critical section, the rcu_read_unlock() could attempt to acquire
    scheduler locks, resulting in deadlock.
    
    However, recent optimizations to expedited grace periods mean that IPI
    handlers that execute during preemptible RCU read-side critical
    sections can now cause the subsequent rcu_read_unlock() to acquire
    scheduler locks.  Disabling preemption does nothing to prevent these
    IPI handlers from executing, so these optimizations introduced a
    deadlock.  In theory, this deadlock could be avoided by pulling all
    wakeups and printk()s out from rnp->lock critical sections, but in
    practice this would re-introduce some RCU CPU stall warning bugs.
    
    Given that acquiring scheduler locks entails disabling interrupts,
    these deadlocks can be avoided by disabling interrupts (instead of
    disabling preemption) across any RCU read-side critical section that
    acquires scheduler locks and holds them across the rcu_read_unlock().
    This commit therefore makes this change for perf_lock_task_context().
    
    Reported-by: Dave Jones
    Reported-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f548f69c4299..2ff0605a5384 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1050,13 +1050,13 @@ retry:
 	/*
 	 * One of the few rules of preemptible RCU is that one cannot do
 	 * rcu_read_unlock() while holding a scheduler (or nested) lock when
-	 * part of the read side critical section was preemptible -- see
+	 * part of the read side critical section was irqs-enabled -- see
 	 * rcu_read_unlock_special().
 	 *
 	 * Since ctx->lock nests under rq->lock we must ensure the entire read
-	 * side critical section is non-preemptible.
+	 * side critical section has interrupts disabled.
 	 */
-	preempt_disable();
+	local_irq_save(*flags);
 	rcu_read_lock();
 	ctx = rcu_dereference(task->perf_event_ctxp[ctxn]);
 	if (ctx) {
@@ -1070,21 +1070,22 @@ retry:
 		 * if so.  If we locked the right context, then it
 		 * can't get swapped on us any more.
 		 */
-		raw_spin_lock_irqsave(&ctx->lock, *flags);
+		raw_spin_lock(&ctx->lock);
 		if (ctx != rcu_dereference(task->perf_event_ctxp[ctxn])) {
-			raw_spin_unlock_irqrestore(&ctx->lock, *flags);
+			raw_spin_unlock(&ctx->lock);
 			rcu_read_unlock();
-			preempt_enable();
+			local_irq_restore(*flags);
 			goto retry;
 		}
 
 		if (!atomic_inc_not_zero(&ctx->refcount)) {
-			raw_spin_unlock_irqrestore(&ctx->lock, *flags);
+			raw_spin_unlock(&ctx->lock);
 			ctx = NULL;
 		}
 	}
 	rcu_read_unlock();
-	preempt_enable();
+	if (!ctx)
+		local_irq_restore(*flags);
 	return ctx;
 }
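
(Aside on the patch: interrupts are restored only when ctx is NULL
because on success perf_lock_task_context() returns with ctx->lock
held and the saved flags in *flags, so it is the caller's unlock that
re-enables interrupts.  A simplified sketch of the caller side --
modeled on find_get_context(), with error handling and other details
elided:)

	struct perf_event_context *ctx;
	unsigned long flags;

	ctx = perf_lock_task_context(task, ctxn, &flags);
	if (ctx) {
		/*
		 * ctx->lock is held and interrupts are still disabled,
		 * so the RCU read-side critical section above ran with
		 * irqs off end to end.  This single unlock both drops
		 * ctx->lock and restores the flags saved by
		 * local_irq_save() in perf_lock_task_context().
		 */
		++ctx->pin_count;
		raw_spin_unlock_irqrestore(&ctx->lock, flags);
	}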