Date: Sat, 13 Dec 2014 15:41:52 -0500
From: Dave Jones
To: "Paul E. McKenney"
Cc: Linus Torvalds, Chris Mason, Mike Galbraith, Ingo Molnar,
    Peter Zijlstra, Dâniel Fraga, Sasha Levin,
    Linux Kernel Mailing List
Subject: Re: frequent lockups in 3.18rc4
Message-ID: <20141213204152.GA15714@redhat.com>
References: <20141205171501.GA1320@redhat.com>
    <1417806247.4845.1@mail.thefacebook.com>
    <20141211145408.GB16800@redhat.com>
    <20141212185454.GB4716@redhat.com>
    <20141213165915.GA12756@redhat.com>
    <20141213180408.GH25340@linux.vnet.ibm.com>
In-Reply-To: <20141213180408.GH25340@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Dec 13, 2014 at 10:04:08AM -0800, Paul E. McKenney wrote:
> On Sat, Dec 13, 2014 at 11:59:15AM -0500, Dave Jones wrote:
> > On Fri, Dec 12, 2014 at 11:14:06AM -0800, Linus Torvalds wrote:
> > > On Fri, Dec 12, 2014 at 10:54 AM, Dave Jones wrote:
> > > >
> > > > Something that's still making me wonder if it's some kind of hardware
> > > > problem is the non-deterministic nature of this bug.
> > >
> > > I'd expect it to be a race condition, though. Which can easily cause
> > > these kinds of issues, and the timing will be pretty random even if
> > > the load is very regular.
> > >
> > > And we know that the scheduler has an integer overflow under Sasha's
> > > loads, although I didn't hear anything from Ingo and friends about it.
> > > Ingo/Peter, you were cc'd on that report, where at least one of the
> > > multiplications in wake_affine() ended up overflowing..
> > >
> > > Some scheduler thing that overflows only under heavy load, and screws
> > > up scheduling could easily account for the RCU thread thing. I see it
> > > *less* easily accounting for DaveJ's case, though, because the
> > > watchdog is running at RT priority, and the scheduler would have to
> > > screw up much more to then not schedule an RT task, but..
> > >
> > > I'm also not sure if the bug ever happens with preemption disabled.
> >
> > Bah, so I see some watchdog traces with preemption off, and that then
> > taints the kernel, and the fuzzing stops. I'll hack something up
> > so it ignores the taint and keeps going. All I really care about here
> > is the "machine hangs completely" case, which the trace below didn't
> > hit..
> >
> > (back to fuzzing almost everything, not just lsetxattr btw)
>
> Hmmm... This one looks like the RCU grace-period kthread is getting
> starved: "idle=b4c/0/0". Is this running with the "dangerous" patch
> that sets these kthreads to RT priority?

sorry, no. Ran out of time yesterday. I'll try and get to applying that
later this evening if I get a chance.

	Dave