Message-ID: <51DAB8DF.6060806@redhat.com>
Date: Mon, 08 Jul 2013 09:04:31 -0400
From: Prarit Bhargava
To: Thomas Gleixner
CC: Linux Kernel, athorlton@sgi.com, CAI Qian, "Paul E. McKenney", Peter Zijlstra
Subject: Re: BUG: tick device NULL pointer during system initialization and shutdown
References: <51C0AB09.2090605@redhat.com> <51CA2CBF.70404@redhat.com> <51CC708B.7040605@redhat.com> <51D17F2B.7000300@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 07/01/2013 09:30 AM, Thomas Gleixner wrote:
> On Mon, 1 Jul 2013, Prarit Bhargava wrote:
>> On 06/28/2013 06:52 AM, Thomas Gleixner wrote:
>>> Huch. Did the warning in the broadcast code trigger before that?
>>
>> tglx,
>>
>> AFAICT it does not. Log below on the system I'm testing on. The test on the
>> system is system boots, sleeps for 30 seconds and then reboots.
>
>> [ 270.563197] INFO: rcu_sched detected stalls on CPUs/tasks: { 51} (detected by
>> 63, t=217205 jiffies, g=3583, c=3582, q=578)
>
> So the stall is on CPU51, but we do not get a backtrace for CPU51.
>
> The backtrace trigger is only sent to online cpus. So CPU51 is offline
> already. Which makes sense as we are in the process of bringing CPUs
> down and the CPUs with backtrace are 0 and 53-63.
>
> I'm pretty sure, that the patch which clears the stale flag is
> unrelated to this and it cures the NULL pointer dereference (the
> reason why this can happen is clear).
>
> So now you do not longer trip over the NULL pointer dereference, but
> you see a weird RCU stall on an already DEAD cpu. Note, it's dead
> because we already took CPU52 offline as well.
>
> Paul???

I hit this a few times ... but the frequency of hitting this is MUCH less than
that of the original bug.

So Thomas, can you add

Tested-by: Prarit Bhargava

to the "tick: Make oneshot broadcast robust vs. CPU offlining" patch? IMO that
problem seems to be solved and we're just peeling the proverbial onion and
finding deeper bugs.

P.