From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755426AbcBWUzf (ORCPT ); Tue, 23 Feb 2016 15:55:35 -0500 Received: from e31.co.us.ibm.com ([32.97.110.149]:37787 "EHLO e31.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755162AbcBWUzc (ORCPT ); Tue, 23 Feb 2016 15:55:32 -0500 X-IBM-Helo: d03dlp02.boulder.ibm.com X-IBM-MailFrom: paulmck@linux.vnet.ibm.com X-IBM-RcptTo: linux-kernel@vger.kernel.org Date: Tue, 23 Feb 2016 12:55:22 -0800 From: "Paul E. McKenney" To: Mathieu Desnoyers Cc: Ross Green , John Stultz , Thomas Gleixner , Peter Zijlstra , lkml , Ingo Molnar , Lai Jiangshan , dipankar@in.ibm.com, Andrew Morton , Josh Triplett , rostedt , David Howells , Eric Dumazet , Darren Hart , =?iso-8859-1?Q?Fr=E9d=E9ric?= Weisbecker , Oleg Nesterov , pranith kumar Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17 Message-ID: <20160223205522.GT3522@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <20160217192817.GA21818@linux.vnet.ibm.com> <20160219173343.GB3522@linux.vnet.ibm.com> <20160220063248.GE3522@linux.vnet.ibm.com> <686568926.5862.1456259651418.JavaMail.zimbra@efficios.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <686568926.5862.1456259651418.JavaMail.zimbra@efficios.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-MML: disable X-Content-Scanned: Fidelis XPS MAILER x-cbid: 16022320-8236-0000-0000-00001664F1A8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 23, 2016 at 08:34:11PM +0000, Mathieu Desnoyers wrote: > ----- On Feb 21, 2016, at 1:15 PM, Ross Green rgkernel@gmail.com wrote: > > > On Sun, Feb 21, 2016 at 4:04 PM, Ross Green wrote: > >> On Sat, Feb 20, 2016 at 5:32 PM, Paul E. McKenney > >> wrote: > >>> On Sat, Feb 20, 2016 at 03:34:30PM +1100, Ross Green wrote: > >>>> On Sat, Feb 20, 2016 at 4:33 AM, Paul E. McKenney > >>>> wrote: > >>>> > On Thu, Feb 18, 2016 at 08:13:18PM -0800, John Stultz wrote: > >>>> >> On Thu, Feb 18, 2016 at 7:56 PM, Ross Green wrote: > >>>> >> > Well a bonus extra! > >>>> >> > Kept everything running and there was another stall. > >>>> >> > So i have included the demsg output for perusal. > >>>> >> > > >>>> >> > Just to clear things up there is no hotplug involved in this system. > >>>> >> > It is a standard Pandaboard ES Ti4460 two processor system. > >>>> >> > I use this for testing as a generic armv7 processor, plus can keep it > >>>> >> > just running along for testing for a long time. the system has a total > >>>> >> > of 23-25 process running on average. Mainly standard daemons. There is > >>>> >> > certainly no heavy processing going on. I run a series of benchmarks > >>>> >> > that are cpu intensive for the first 20 miinutes after boot and then > >>>> >> > just leave it idle away. checking every so often to see how it has > >>>> >> > gone. > >>>> >> > As mentioned I have observed these stalls going back to 3.17 kernel. > >>>> >> > It will often take up to a week to record such a stall. I will > >>>> >> > typically test every new release kernel, so the -rc? series will get > >>>> >> > around a weeks testing. > >>>> >> > >>>> >> Sorry. Kind of hopping in a bit late here. Is this always happening > >>>> >> with just the pandaboard? Or are you seeing this on different > >>>> >> machines? > >>>> >> > >>>> >> Have you tried enabling CONFIG_DEBUG_TIMEKEEPING just in case > >>>> >> something is going awry there? > >>>> > > >>>> > Excellent point -- timekeeping issues have caused this sort of issue > >>>> > in the past. > >>>> > > >>>> > Ross, on your next test, could you please enable CONFIG_DEBUG_TIMEKEEPING > >>>> > as John suggests? > >>>> > > >>>> > Thanx, Paul > >>>> > > >>>> As John has suggested have already enabled CONFIG_DEBUG_TIMEKEEPING. > >>>> > >>>> So far just on 1 day running. > >>>> > >>>> Sigh...!! Nothing to report as yet, only one day on the clock. > >>>> Its like watching grass grow! > >>> > >>> I hear you! Though I was thinking in terms of watching paint dry... > >>> > >>> Thanx, Paul > >>> > >> Yes, > >> > >> but with paint drying there is an end point! > >> Grass just keeps on growing ... > >> > >> More like the children in the back of the car ... > >> Are we there yet? ... > >> > >> Well still nothing .. to report. I have just built a 4.5-rc5, but will > >> wait till I get some outcome from the previous test. That can't be too > >> much longer! > >> > >> In hope, > >> > >> Ross Green > > Patience little ones ... > > > > Well after 2 days plus running pulled another stall. > > This is with 4.5-rc4 and CONFIG_DEBUG_TIMEKEEPING set. > > > > Can't see anything related to the TIMEKEEPING. > > > > Anyway here is the dmesg output for people to look at. > > > > Paul, I was going to move onto 4.5-rc5 kernel, is there something else > > that you want me to test with that, Anyone else have any suggestions > > or ideas? > > Starting from kernel 3.17, if we request a e.g. 1000 jiffies schedule_timeout > on a HZ=1000 kernel, is there an upper bound to the number of jiffies it > can actually take before the timeout happens, especially on idle systems ? > I remember a talk from Thomas Gleixner on the new timer wheel which could > add some imprecision to those timeouts. Not sure in which kernel version it > got in though. > > My thinking is that it might be a good idea to try using hrtimers rather than > a jiffies-based timeout to awaken the RCU thread if it's really important to > run in a bounded amount of jiffies. Or else the jiffies-based sanity check > that triggers the warning is perhaps too strict. Some imprecision I could well understand. 21 seconds of imprecision on a three-millisecond timeout I am having somewhat more difficulty understanding. ;-) But I would rather avoid hrtimers for the RCU grace-period kthread because a few milliseconds of adjustment for energy-efficiency reasons is a very good thing. Still working on getting decent traces... Thanx, Paul