From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754467Ab3GAPmY (ORCPT ); Mon, 1 Jul 2013 11:42:24 -0400
Received: from e9.ny.us.ibm.com ([32.97.182.139]:32883 "EHLO e9.ny.us.ibm.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751665Ab3GAPmX (ORCPT ); Mon, 1 Jul 2013 11:42:23 -0400
Date: Mon, 1 Jul 2013 08:41:09 -0700
From: "Paul E. McKenney"
To: Thomas Gleixner
Cc: Prarit Bhargava, Linux Kernel, athorlton@sgi.com, CAI Qian,
        Peter Zijlstra
Subject: Re: BUG: tick device NULL pointer during system initialization and shutdown
Message-ID: <20130701154109.GK3773@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <51C0AB09.2090605@redhat.com> <51CA2CBF.70404@redhat.com>
        <51CC708B.7040605@redhat.com> <51D17F2B.7000300@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-MML: No
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 13070115-7182-0000-0000-00000797CF93
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jul 01, 2013 at 03:30:47PM +0200, Thomas Gleixner wrote:
> On Mon, 1 Jul 2013, Prarit Bhargava wrote:
> > On 06/28/2013 06:52 AM, Thomas Gleixner wrote:
> > > Huch. Did the warning in the broadcast code trigger before that?
> >
> > tglx,
> >
> > AFAICT it does not. Log below on the system I'm testing on. The test on the
> > system is system boots, sleeps for 30 seconds and then reboots.
>
> > [ 270.563197] INFO: rcu_sched detected stalls on CPUs/tasks: { 51} (detected by
> > 63, t=217205 jiffies, g=3583, c=3582, q=578)
>
> So the stall is on CPU51, but we do not get a backtrace for CPU51.
>
> The backtrace trigger is only sent to online cpus. So CPU51 is offline
> already. Which makes sense as we are in the process of bringing CPUs
> down and the CPUs with backtrace are 0 and 53-63.
>
> I'm pretty sure, that the patch which clears the stale flag is
> unrelated to this and it cures the NULL pointer dereference (the
> reason why this can happen is clear).
>
> So now you do not longer trip over the NULL pointer dereference, but
> you see a weird RCU stall on an already DEAD cpu. Note, it's dead
> because we already took CPU52 offline as well.
>
> Paul???

Odd. The force-quiescent-state machinery should notice that the dead
CPU gets a true return from cpu_is_offline(), at which point it should
note a quiescent state on behalf of that CPU and get on with the grace
period.

In the meantime, here are my guesses as to what might be causing this bug:

o       RCU's grace-period kthreads got stuck somehow. One way that
        this could happen is if you don't have commit #971394f3 (Fix
        deadlock with CPU hotplug, RCU GP init, and timer migration)
        but do have CONFIG_PROVE_RCU_DELAY=y.

o       The handling of CPU-hotplug bitmaps has changed so that RCU needs
        to do something other than cpu_is_offline(). I have been expecting
        that RCU would be needing to keep its own mask of online CPUs
        at some point, but didn't think that time had arrived.

If neither of those help, then it is time for me to add more information
to CONFIG_RCU_CPU_STALL_INFO. ;-)

                                                        Thanx, Paul

> Thanks,
>
>       tglx
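
For illustration only, the expected force-quiescent-state behavior
described in the reply can be modeled in a few lines of user-space C.
This is a toy sketch, not kernel code: struct toy_cpu, toy_cpu_is_offline()
and toy_force_qs() are hypothetical names invented here, and the real RCU
implementation tracks this state quite differently. The point it shows is
only that an offline CPU should have a quiescent state noted on its
behalf, so it cannot hold up the grace period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy per-CPU state (hypothetical; not the kernel's per-CPU RCU data). */
struct toy_cpu {
    bool online;      /* false once the CPU has been taken offline */
    bool qs_reported; /* quiescent state noted for the current grace period */
};

/* Stand-in for the kernel's cpu_is_offline(): true for a dead CPU. */
static bool toy_cpu_is_offline(const struct toy_cpu *cpu)
{
    return !cpu->online;
}

/*
 * One force-quiescent-state pass over all CPUs. A quiescent state is
 * noted on behalf of every offline CPU that has not reported one.
 * Returns the number of online CPUs still holding up the grace period.
 */
static size_t toy_force_qs(struct toy_cpu *cpus, size_t ncpus)
{
    size_t holdouts = 0;

    assert(cpus != NULL);
    for (size_t i = 0; i < ncpus; i++) {
        if (cpus[i].qs_reported)
            continue;                   /* already accounted for */
        if (toy_cpu_is_offline(&cpus[i]))
            cpus[i].qs_reported = true; /* dead CPU cannot block the GP */
        else
            holdouts++;                 /* online CPU must report itself */
    }
    return holdouts;
}
```

In this model, a stall blamed on an offline CPU would mean either that
the pass never ran (the first guess in the reply: a stuck grace-period
kthread) or that the offline check no longer reflects reality (the
second guess: changed CPU-hotplug bitmap handling).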