On Thu, Feb 18, 2016 at 10:19 AM, Paul E. McKenney wrote: > On Wed, Feb 17, 2016 at 12:28:29PM -0800, Paul E. McKenney wrote: >> On Wed, Feb 17, 2016 at 08:45:54PM +0100, Peter Zijlstra wrote: >> > On Wed, Feb 17, 2016 at 11:28:17AM -0800, Paul E. McKenney wrote: >> > > On Tue, Feb 16, 2016 at 09:45:49PM -0800, Paul E. McKenney wrote: >> > > > On Tue, Feb 09, 2016 at 09:11:55PM +1100, Ross Green wrote: >> > > > > Continued testing with the latest linux-4.5-rc3 release. >> > > > > >> > > > > Please find attached a copy of traces from dmesg: >> > > > > >> > > > > There is a lot more debug and trace data so hopefully this will shed >> > > > > some light on what might be happening here. >> > > > > >> > > > > My testing remains run a series of simple benchmarks, let that run to >> > > > > completion and then leave the system idle away with just a few daemons >> > > > > running. >> > > > > >> > > > > the self detected stalls in this instance turned up after a days run time. >> > > > > There were NO heavy artificial computational loads on the machine. >> > > > >> > > > It does indeed look quiet on that dmesg for a good long time. >> > > > >> > > > The following insanely crude not-for-mainline hack -might- be producing >> > > > good results in my testing. It will take some time before I can claim >> > > > statistically different results. But please feel free to give it a go >> > > > in the meantime. (Thanks to Al Viro for pointing me in this direction.) >> > >> > Your case was special in that is was hotplug triggering it, right? >> >> Yes, it has thus far only shown up with CPU hotplug enabled. >> >> > I was auditing the hotplug paths involved when I fell ill two weeks ago, >> > and have not really made any progress on that because of that :/ >> >> I have always said that being sick is bad for one's health, but I didn't >> realize that it could be bad for the kernel's health as well. ;-) >> >> > I'll go have another look, I had a vague feeling for a race back then, >> > lets see if I can still remember how.. >> >> I believe that I can -finally- get an ftrace_dump() to happen within >> 10-20 milliseconds of the problem, which just might be soon enough >> after the problem to gather some useful information. I am currently >> testing this theory with "ftrace trace_event=sched_waking,sched_wakeup" >> boot arguments on a two-hour run. > > And apparently another way to greatly reduce the probability of this > bug occurring is to enable ftrace. :-/ > > Will try longer runs. > > Thanx, Paul > >> If this works out, what would be a useful set of trace events for me >> to capture? >> >> Thanx, Paul > Well managed to catch this one on linux-4.5-rc4. Took over 3 days and 7 hours to hit. Same test as before, boot, run a series of simple benchmarks and then let the machine just idle away. As before, the reported stall, AND everything keeps on running as if nothing had happened. I notice in the task dump for both the cpus, the swapper is running on both cpus. does that make any sense? There is around 3% of memory actually used. Anyway, please find attached a copy of the dmesg output. Hope this helps a few people fill in the missing pieces here. Regards, Ross Green