On Wed, Feb 24, 2016 at 7:55 AM, Paul E. McKenney wrote: > On Tue, Feb 23, 2016 at 08:34:11PM +0000, Mathieu Desnoyers wrote: >> ----- On Feb 21, 2016, at 1:15 PM, Ross Green rgkernel@gmail.com wrote: >> >> > On Sun, Feb 21, 2016 at 4:04 PM, Ross Green wrote: >> >> On Sat, Feb 20, 2016 at 5:32 PM, Paul E. McKenney >> >> wrote: >> >>> On Sat, Feb 20, 2016 at 03:34:30PM +1100, Ross Green wrote: >> >>>> On Sat, Feb 20, 2016 at 4:33 AM, Paul E. McKenney >> >>>> wrote: >> >>>> > On Thu, Feb 18, 2016 at 08:13:18PM -0800, John Stultz wrote: >> >>>> >> On Thu, Feb 18, 2016 at 7:56 PM, Ross Green wrote: >> >>>> >> > Well a bonus extra! >> >>>> >> > Kept everything running and there was another stall. >> >>>> >> > So i have included the demsg output for perusal. >> >>>> >> > >> >>>> >> > Just to clear things up there is no hotplug involved in this system. >> >>>> >> > It is a standard Pandaboard ES Ti4460 two processor system. >> >>>> >> > I use this for testing as a generic armv7 processor, plus can keep it >> >>>> >> > just running along for testing for a long time. the system has a total >> >>>> >> > of 23-25 process running on average. Mainly standard daemons. There is >> >>>> >> > certainly no heavy processing going on. I run a series of benchmarks >> >>>> >> > that are cpu intensive for the first 20 miinutes after boot and then >> >>>> >> > just leave it idle away. checking every so often to see how it has >> >>>> >> > gone. >> >>>> >> > As mentioned I have observed these stalls going back to 3.17 kernel. >> >>>> >> > It will often take up to a week to record such a stall. I will >> >>>> >> > typically test every new release kernel, so the -rc? series will get >> >>>> >> > around a weeks testing. >> >>>> >> >> >>>> >> Sorry. Kind of hopping in a bit late here. Is this always happening >> >>>> >> with just the pandaboard? Or are you seeing this on different >> >>>> >> machines? >> >>>> >> >> >>>> >> Have you tried enabling CONFIG_DEBUG_TIMEKEEPING just in case >> >>>> >> something is going awry there? >> >>>> > >> >>>> > Excellent point -- timekeeping issues have caused this sort of issue >> >>>> > in the past. >> >>>> > >> >>>> > Ross, on your next test, could you please enable CONFIG_DEBUG_TIMEKEEPING >> >>>> > as John suggests? >> >>>> > >> >>>> > Thanx, Paul >> >>>> > >> >>>> As John has suggested have already enabled CONFIG_DEBUG_TIMEKEEPING. >> >>>> >> >>>> So far just on 1 day running. >> >>>> >> >>>> Sigh...!! Nothing to report as yet, only one day on the clock. >> >>>> Its like watching grass grow! >> >>> >> >>> I hear you! Though I was thinking in terms of watching paint dry... >> >>> >> >>> Thanx, Paul >> >>> >> >> Yes, >> >> >> >> but with paint drying there is an end point! >> >> Grass just keeps on growing ... >> >> >> >> More like the children in the back of the car ... >> >> Are we there yet? ... >> >> >> >> Well still nothing .. to report. I have just built a 4.5-rc5, but will >> >> wait till I get some outcome from the previous test. That can't be too >> >> much longer! >> >> >> >> In hope, >> >> >> >> Ross Green >> > Patience little ones ... >> > >> > Well after 2 days plus running pulled another stall. >> > This is with 4.5-rc4 and CONFIG_DEBUG_TIMEKEEPING set. >> > >> > Can't see anything related to the TIMEKEEPING. >> > >> > Anyway here is the dmesg output for people to look at. >> > >> > Paul, I was going to move onto 4.5-rc5 kernel, is there something else >> > that you want me to test with that, Anyone else have any suggestions >> > or ideas? >> >> Starting from kernel 3.17, if we request a e.g. 1000 jiffies schedule_timeout >> on a HZ=1000 kernel, is there an upper bound to the number of jiffies it >> can actually take before the timeout happens, especially on idle systems ? >> I remember a talk from Thomas Gleixner on the new timer wheel which could >> add some imprecision to those timeouts. Not sure in which kernel version it >> got in though. >> >> My thinking is that it might be a good idea to try using hrtimers rather than >> a jiffies-based timeout to awaken the RCU thread if it's really important to >> run in a bounded amount of jiffies. Or else the jiffies-based sanity check >> that triggers the warning is perhaps too strict. > > Some imprecision I could well understand. 21 seconds of imprecision > on a three-millisecond timeout I am having somewhat more difficulty > understanding. ;-) > > But I would rather avoid hrtimers for the RCU grace-period kthread > because a few milliseconds of adjustment for energy-efficiency > reasons is a very good thing. > > Still working on getting decent traces... > > Thanx, Paul > G'day all, Here is another dmesg output for 4.5-rc5 showing another rcu_preempt stall. This one appeared after only a day of running. CONFIG_DEBUG_TIMING is turned on, but can't see any output that shows from this. Again testing as before, Boot, run a series of small benchmarks, then just let the system be and idle away. I notice in the stack trace there is mention of hrtimer_run_queues and hrtimer_interrupt. Anyway, leave this for a few more eyes to look at. Open to any other suggestions of things to test. Regards, Ross Green