On Wed, Feb 24, 2016 at 8:28 AM, Ross Green wrote:
> On Wed, Feb 24, 2016 at 7:55 AM, Paul E. McKenney
> wrote:
>> On Tue, Feb 23, 2016 at 08:34:11PM +0000, Mathieu Desnoyers wrote:
>>> ----- On Feb 21, 2016, at 1:15 PM, Ross Green rgkernel@gmail.com wrote:
>>>
>>> > On Sun, Feb 21, 2016 at 4:04 PM, Ross Green wrote:
>>> >> On Sat, Feb 20, 2016 at 5:32 PM, Paul E. McKenney
>>> >> wrote:
>>> >>> On Sat, Feb 20, 2016 at 03:34:30PM +1100, Ross Green wrote:
>>> >>>> On Sat, Feb 20, 2016 at 4:33 AM, Paul E. McKenney
>>> >>>> wrote:
>>> >>>> > On Thu, Feb 18, 2016 at 08:13:18PM -0800, John Stultz wrote:
>>> >>>> >> On Thu, Feb 18, 2016 at 7:56 PM, Ross Green wrote:
>>> >>>> >> > Well, a bonus extra!
>>> >>>> >> > Kept everything running and there was another stall,
>>> >>>> >> > so I have included the dmesg output for perusal.
>>> >>>> >> >
>>> >>>> >> > Just to clear things up: there is no hotplug involved in this system.
>>> >>>> >> > It is a standard Pandaboard ES TI OMAP4460 two-processor system.
>>> >>>> >> > I use this for testing as a generic ARMv7 processor, plus I can keep
>>> >>>> >> > it just running along for testing for a long time. The system has a
>>> >>>> >> > total of 23-25 processes running on average, mainly standard daemons.
>>> >>>> >> > There is certainly no heavy processing going on. I run a series of
>>> >>>> >> > benchmarks that are CPU-intensive for the first 20 minutes after boot,
>>> >>>> >> > and then just leave it to idle away, checking every so often to see
>>> >>>> >> > how it has gone.
>>> >>>> >> > As mentioned, I have observed these stalls going back to the 3.17
>>> >>>> >> > kernel. It will often take up to a week to record such a stall. I
>>> >>>> >> > will typically test every new release kernel, so each -rc? series
>>> >>>> >> > will get around a week's testing.
>>> >>>> >>
>>> >>>> >> Sorry, kind of hopping in a bit late here. Is this always happening
>>> >>>> >> with just the Pandaboard? Or are you seeing this on different
>>> >>>> >> machines?
>>> >>>> >>
>>> >>>> >> Have you tried enabling CONFIG_DEBUG_TIMEKEEPING, just in case
>>> >>>> >> something is going awry there?
>>> >>>> >
>>> >>>> > Excellent point -- timekeeping issues have caused this sort of issue
>>> >>>> > in the past.
>>> >>>> >
>>> >>>> > Ross, on your next test, could you please enable
>>> >>>> > CONFIG_DEBUG_TIMEKEEPING as John suggests?
>>> >>>> >
>>> >>>> >							Thanx, Paul
>>> >>>> >
>>> >>>> As John has suggested, I have already enabled CONFIG_DEBUG_TIMEKEEPING.
>>> >>>>
>>> >>>> So far, just one day of running.
>>> >>>>
>>> >>>> Sigh...!! Nothing to report as yet, only one day on the clock.
>>> >>>> It's like watching grass grow!
>>> >>>
>>> >>> I hear you! Though I was thinking in terms of watching paint dry...
>>> >>>
>>> >>>							Thanx, Paul
>>> >>>
>>> >> Yes,
>>> >>
>>> >> but with paint drying there is an end point!
>>> >> Grass just keeps on growing ...
>>> >>
>>> >> More like the children in the back of the car ...
>>> >> Are we there yet? ...
>>> >>
>>> >> Well, still nothing to report. I have just built 4.5-rc5, but will
>>> >> wait till I get some outcome from the previous test. That can't be
>>> >> too much longer!
>>> >>
>>> >> In hope,
>>> >>
>>> >> Ross Green
>>> > Patience, little ones ...
>>> >
>>> > Well, after 2-plus days of running, I pulled another stall.
>>> > This is with 4.5-rc4 and CONFIG_DEBUG_TIMEKEEPING set.
>>> >
>>> > I can't see anything related to TIMEKEEPING.
>>> >
>>> > Anyway, here is the dmesg output for people to look at.
>>> >
>>> > Paul, I was going to move on to the 4.5-rc5 kernel. Is there something
>>> > else that you want me to test with that? Does anyone else have any
>>> > suggestions or ideas?
>>>
>>> Starting from kernel 3.17, if we request e.g. a 1000-jiffy schedule_timeout
>>> on a HZ=1000 kernel, is there an upper bound on the number of jiffies it
>>> can actually take before the timeout happens, especially on idle systems?
>>> I remember a talk by Thomas Gleixner on the new timer wheel, which could
>>> add some imprecision to those timeouts. Not sure which kernel version it
>>> went into, though.
>>>
>>> My thinking is that it might be a good idea to try using hrtimers rather
>>> than a jiffies-based timeout to awaken the RCU thread if it's really
>>> important to run in a bounded number of jiffies. Or else the jiffies-based
>>> sanity check that triggers the warning is perhaps too strict.
>>
>> Some imprecision I could well understand. Twenty-one seconds of imprecision
>> on a three-millisecond timeout I am having somewhat more difficulty
>> understanding. ;-)
>>
>> But I would rather avoid hrtimers for the RCU grace-period kthread,
>> because a few milliseconds of adjustment for energy-efficiency
>> reasons is a very good thing.
>>
>> Still working on getting decent traces...
>>
>>							Thanx, Paul
>>
>
> G'day all,
>
> Here is another dmesg output for 4.5-rc5 showing another rcu_preempt stall.
> This one appeared after only a day of running. CONFIG_DEBUG_TIMEKEEPING is
> turned on, but I can't see any output from it.
>
> Again, testing as before:
>
> Boot, run a series of small benchmarks, then just let the system be
> and idle away.
>
> I notice that the stack trace mentions hrtimer_run_queues and
> hrtimer_interrupt.
>
> Anyway, I'll leave this for a few more eyes to look at.
>
> Open to any other suggestions of things to test.
>
> Regards,
>
> Ross Green

G'day Paul,

I left the Pandaboard running and captured another stall.
The attachment is the dmesg output.

Again, there is no apparent output from CONFIG_DEBUG_TIMEKEEPING, so I
assume nothing is going awry there.

I just saw the updates for the 4.6 RCU code. Is the patch in
[PATCH tip/core/rcu 04/13] relevant here? Do you want me to try the new
patch set with this configuration?

Regards,

Ross Green
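
To make Mathieu's jiffies-versus-hrtimers suggestion above concrete, here is
a minimal sketch contrasting the two waiting styles in kernel C. It is
illustrative only, not the actual RCU grace-period kthread code; the function
names gp_wait_jiffies() and gp_wait_hrtimer() are invented for the example.

#include <linux/sched.h>
#include <linux/jiffies.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>

/*
 * Jiffies-based wait: at HZ=1000, 1000 jiffies is nominally one second,
 * but timer-wheel slack can delay expiry, especially on an idle system.
 */
static void gp_wait_jiffies(void)
{
	set_current_state(TASK_INTERRUPTIBLE);
	schedule_timeout(1000);		/* ~1 s at HZ=1000, plus slack */
}

/*
 * hrtimer-based wait: a much tighter expiry bound, but it bypasses the
 * batching that lets an idle CPU stay asleep longer.
 */
static void gp_wait_hrtimer(void)
{
	ktime_t expires = ms_to_ktime(1000);

	set_current_state(TASK_INTERRUPTIBLE);
	schedule_hrtimeout(&expires, HRTIMER_MODE_REL);
}

The trade-off is the one Paul describes: schedule_timeout() tolerates timer
slack so that wakeups can be batched for energy efficiency, while
schedule_hrtimeout() gives a tight expiry bound at the cost of that batching.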
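
The "jiffies-based sanity check" Mathieu refers to amounts to a deadline
comparison in jiffies. A hypothetical sketch of that pattern follows;
rcu_gp_deadline and the two helpers are invented names, not the real
implementation:

#include <linux/jiffies.h>
#include <linux/printk.h>

/* Invented storage for the example; not the real RCU state. */
static unsigned long rcu_gp_deadline;

static void gp_arm_check(unsigned long timeout_jiffies)
{
	rcu_gp_deadline = jiffies + timeout_jiffies;
}

static void gp_check_overdue(void)
{
	/* time_after() copes with jiffies wraparound. */
	if (time_after(jiffies, rcu_gp_deadline))
		pr_warn("grace-period kthread overdue by %lu jiffies\n",
			jiffies - rcu_gp_deadline);
}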