Date: Wed, 17 Feb 2016 12:28:29 -0800
From: "Paul E. McKenney"
To: Peter Zijlstra
Cc: Ross Green, linux-kernel@vger.kernel.org, mingo@kernel.org,
	jiangshanlai@gmail.com, dipankar@in.ibm.com,
	akpm@linux-foundation.org, Mathieu Desnoyers, josh@joshtriplett.org,
	tglx@linutronix.de, rostedt@goodmis.org, dhowells@redhat.com,
	Eric Dumazet, dvhart@linux.intel.com, Frédéric Weisbecker,
	oleg@redhat.com, pranith kumar
Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17
Message-ID: <20160217202829.GO6719@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <20160217054549.GB6719@linux.vnet.ibm.com>
	<20160217192817.GA21818@linux.vnet.ibm.com>
	<20160217194554.GO6357@twins.programming.kicks-ass.net>
In-Reply-To: <20160217194554.GO6357@twins.programming.kicks-ass.net>

On Wed, Feb 17, 2016 at 08:45:54PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 11:28:17AM -0800, Paul E. McKenney wrote:
> > On Tue, Feb 16, 2016 at 09:45:49PM -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 09, 2016 at 09:11:55PM +1100, Ross Green wrote:
> > > > Continued testing with the latest linux-4.5-rc3 release.
> > > >
> > > > Please find attached a copy of traces from dmesg:
> > > >
> > > > There is a lot more debug and trace data so hopefully this will shed
> > > > some light on what might be happening here.
> > > >
> > > > My testing remains run a series of simple benchmarks, let that run to
> > > > completion and then leave the system idle away with just a few daemons
> > > > running.
> > > >
> > > > the self detected stalls in this instance turned up after a days run time.
> > > > There were NO heavy artificial computational loads on the machine.
> > >
> > > It does indeed look quiet on that dmesg for a good long time.
> > >
> > > The following insanely crude not-for-mainline hack -might- be producing
> > > good results in my testing. It will take some time before I can claim
> > > statistically different results. But please feel free to give it a go
> > > in the meantime. (Thanks to Al Viro for pointing me in this direction.)
>
> Your case was special in that is was hotplug triggering it, right?

Yes, it has thus far only shown up with CPU hotplug enabled.

> I was auditing the hotplug paths involved when I fell ill two weeks ago,
> and have not really made any progress on that because of that :/

I have always said that being sick is bad for one's health, but I didn't
realize that it could be bad for the kernel's health as well. ;-)

> I'll go have another look, I had a vague feeling for a race back then,
> lets see if I can still remember how..

I believe that I can -finally- get an ftrace_dump() to happen within
10-20 milliseconds of the problem, which just might be soon enough
after the problem to gather some useful information. I am currently
testing this theory with "ftrace trace_event=sched_waking,sched_wakeup"
boot arguments on a two-hour run.

If this works out, what would be a useful set of trace events for me
to capture?

Thanx, Paul