From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 4 Nov 2015 05:48:38 -0800
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra
Cc: Dave Jones, Linux Kernel, Ingo Molnar, Stephane Eranian, Andi Kleen
Subject: Re: perf related lockdep bug
Message-ID: <20151104134838.GR29027@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
In-Reply-To: <20151104105010.GA11639@twins.programming.kicks-ass.net>
References: <20151104051717.GA6098@codemonkey.org.uk>
 <20151104102151.GG17308@twins.programming.kicks-ass.net>
 <20151104102800.GZ11639@twins.programming.kicks-ass.net>
 <20151104105010.GA11639@twins.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 04, 2015 at 11:50:10AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 04, 2015 at 11:28:00AM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 04, 2015 at 11:21:51AM +0100, Peter Zijlstra wrote:
> > 
> > > The problem appears to be due to the new RCU expedited grace period
> > > stuff, with rcu_read_unlock() now randomly trying to acquire locks it
> > > previously didn't.
> > > 
> > > Lemme go look at those rcu bits again..
> > 
> > Paul, I think this is because of:
> > 
> >   8203d6d0ee78 ("rcu: Use single-stage IPI algorithm for RCU expedited grace period")
> > 
> > What happens is that the IPI comes in and tags any random
> > rcu_read_unlock() with the special bit, which then goes on and takes
> > locks.
> > 
> > Now the problem is that we have scheduler activity inside this lock;
> > the one reported by lockdep seems easy enough to fix, see below.
> > 
> > I'll go and see if there are more sites that can cause this.
> 
> *sigh* yes, there's gobs more. In fact the very first one I looked at:
> 
>   rcu_dump_cpu_stacks()
>     raw_spin_lock_irqsave(&rnp->lock, flags);
>     dump_cpu_task()
>       pr_info()
> 
> That too will end up doing wakeups..
> 
> idem:
>  - print_other_cpu_stall()
>  - rcu_print_detail_task_stall_rnp()
> 
> So just like you had to pull out all the rcu_gp_kthread_wake() calls
> from under rnp->lock, so too must we pull out all implied wakeups,
> which very much include printk().

Ouch!!!  Thank you for the analysis, though I am very surprised that my
testing did not find this.

But pulling all printk()s out from under rnp->lock is going to
re-introduce some stall-warning bugs.  So what other options do I have?

o	I could do raise_softirq(), then report the quiescent state in
	the core RCU code, but I bet that raise_softirq()'s wakeup gets
	me into just as much trouble.

o	Ditto for workqueues, I suspect.

o	I cannot send an IPI because interrupts are disabled, and that
	would be rather annoying from a real-time perspective in any
	case.

Any others?

Ah, and I see why my testing missed this -- I don't normally exercise
perf while running rcutorture.
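
(For the archives, the forbidden pattern Peter is pointing at boils
down to the following hand-written sketch -- an illustration, not
actual kernel code; rnp->lock plus printk() stands in for any of the
call chains he lists above:)

	raw_spin_lock_irqsave(&rnp->lock, flags);
	/* ... inspect the rcu_node structure ... */
	printk("INFO: stall diagnostics ...\n");
	/*
	 * printk() implies wakeups (console/klogd), and a wakeup takes
	 * scheduler locks.  Meanwhile rcu_read_unlock() can now take
	 * rnp->lock while scheduler locks are held, so doing a wakeup
	 * under rnp->lock closes a deadlock cycle.
	 */
	raw_spin_unlock_irqrestore(&rnp->lock, flags);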
So this hit the code in perf_lock_task_context() that disables
preemption across an RCU read-side critical section, which previously
sufficed to prevent this scenario.  What happened this time is as
follows:

o	CPU 0 entered perf_lock_task_context(), disabled preemption,
	and entered its RCU read-side critical section.  Of course,
	the whole point of disabling preemption is to prevent the
	matching rcu_read_unlock() from grabbing locks.

o	CPU 1 started an expedited grace period.  It checked CPU state,
	saw that CPU 0 was running in the kernel, and therefore IPIed it.

o	The IPI handler running on CPU 0 saw that there was an RCU
	read-side critical section in effect, so it set the
	->exp_need_qs flag.

o	When the matching rcu_read_unlock() executes, it notes that
	->exp_need_qs is set, and therefore grabs the locks that it
	shouldn't, hence lockdep's complaints about deadlock.

This problem is caused by the IPI handler interrupting the RCU
read-side critical section.  One way to prevent the IPI from doing
this is to disable interrupts across the RCU read-side critical
section instead of merely disabling preemption.  This is a reasonable
approach given that acquiring the scheduler locks is going to disable
interrupts in any case.

The (untested) patch below takes this approach.  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 90ab56c9f90865991155d8f23441a5e135aa33cc
Author: Paul E. McKenney
Date:   Wed Nov 4 05:37:08 2015 -0800

    perf: Disable irqs across RCU RS CS that acquires scheduler lock
    
    The perf_lock_task_context() function disables preemption across its
    RCU read-side critical section because that critical section acquires
    a scheduler lock.  If there was a preemption during that RCU read-side
    critical section, the rcu_read_unlock() could attempt to acquire
    scheduler locks, resulting in deadlock.
    
    However, recent optimizations to expedited grace periods mean that IPI
    handlers that execute during preemptible RCU read-side critical
    sections can now cause the subsequent rcu_read_unlock() to acquire
    scheduler locks.  Disabling preemption does nothing to prevent these
    IPI handlers from executing, so these optimizations introduced a
    deadlock.  In theory, this deadlock could be avoided by pulling all
    wakeups and printk()s out from rnp->lock critical sections, but in
    practice this would re-introduce some RCU CPU stall warning bugs.
    
    Given that acquiring scheduler locks entails disabling interrupts,
    these deadlocks can be avoided by disabling interrupts (instead of
    disabling preemption) across any RCU read-side critical section that
    acquires scheduler locks and holds them across the rcu_read_unlock().
    This commit therefore makes this change for perf_lock_task_context().
    
    Reported-by: Dave Jones
    Reported-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f548f69c4299..2ff0605a5384 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1050,13 +1050,13 @@ retry:
 	/*
 	 * One of the few rules of preemptible RCU is that one cannot do
 	 * rcu_read_unlock() while holding a scheduler (or nested) lock when
-	 * part of the read side critical section was preemptible -- see
+	 * part of the read side critical section was irqs-enabled -- see
 	 * rcu_read_unlock_special().
 	 *
 	 * Since ctx->lock nests under rq->lock we must ensure the entire read
-	 * side critical section is non-preemptible.
+	 * side critical section has interrupts disabled.
 	 */
-	preempt_disable();
+	local_irq_save(*flags);
 	rcu_read_lock();
 	ctx = rcu_dereference(task->perf_event_ctxp[ctxn]);
 	if (ctx) {
@@ -1070,21 +1070,22 @@ retry:
 		 * if so.  If we locked the right context, then it
 		 * can't get swapped on us any more.
 		 */
-		raw_spin_lock_irqsave(&ctx->lock, *flags);
+		raw_spin_lock(&ctx->lock);
 		if (ctx != rcu_dereference(task->perf_event_ctxp[ctxn])) {
-			raw_spin_unlock_irqrestore(&ctx->lock, *flags);
+			raw_spin_unlock(&ctx->lock);
 			rcu_read_unlock();
-			preempt_enable();
+			local_irq_restore(*flags);
 			goto retry;
 		}
 
 		if (!atomic_inc_not_zero(&ctx->refcount)) {
-			raw_spin_unlock_irqrestore(&ctx->lock, *flags);
+			raw_spin_unlock(&ctx->lock);
 			ctx = NULL;
 		}
 	}
 	rcu_read_unlock();
-	preempt_enable();
+	if (!ctx)
+		local_irq_restore(*flags);
 	return ctx;
 }
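
(Aside on the patch: interrupts are restored only when ctx is NULL
because on success perf_lock_task_context() returns with ctx->lock
held and the saved flags in *flags, so it is the caller's unlock that
re-enables interrupts.  A simplified sketch of the caller side --
modeled on find_get_context(), with error handling and other details
elided:)

	struct perf_event_context *ctx;
	unsigned long flags;

	ctx = perf_lock_task_context(task, ctxn, &flags);
	if (ctx) {
		/*
		 * ctx->lock is held and interrupts are still disabled,
		 * so the RCU read-side critical section above ran with
		 * irqs off end to end.  This single unlock both drops
		 * ctx->lock and restores the flags saved by
		 * local_irq_save() in perf_lock_task_context().
		 */
		++ctx->pin_count;
		raw_spin_unlock_irqrestore(&ctx->lock, flags);
	}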