Date: Sun, 21 Dec 2014 20:22:21 -0500
From: Dave Jones
To: Linus Torvalds
Cc: Thomas Gleixner, Chris Mason, Mike Galbraith, Ingo Molnar,
 Peter Zijlstra, Dâniel Fraga, Sasha Levin, "Paul E. McKenney",
 Linux Kernel Mailing List, Suresh Siddha, Oleg Nesterov, Peter Anvin
Subject: Re: frequent lockups in 3.18rc4
Message-ID: <20141222012221.GA11533@codemonkey.org.uk>
References: <20141221223204.GA9618@codemonkey.org.uk>

On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote:
> > The second time (or third, or fourth - it might not take immediately)
> > you get a lockup or similar. Bad things happen.
>
> I've only tested it twice now, but the first time I got a weird
> lockup-like thing (things *kind* of worked, but I could imagine that
> one CPU was stuck with a lock held, because things eventually ground
> to a screeching halt.
>
> The second time I got
>
>    INFO: rcu_sched self-detected stall on CPU { 5} (t=84533 jiffies
> g=11971 c=11970 q=17)
>
> and then
>
>    INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7}
> (detected by 0, t=291309 jiffies, g=12031, c=12030, q=57)
>
> with backtraces that made no sense (because obviously no actual stall
> had taken place), and were the CPU's mostly being idle.
>
> I could easily see it resulting in your softlockup scenario too.

So something trinity does when it doesn't have a better idea of
what to pass to a syscall is to generate a random number.

A wild hypothesis could be that we're in one of these situations,
and we randomly generated 0xfed000f0 and passed that as a value to
a syscall, and the kernel wrote 0 to that address.

What syscall could do that, and not just fail an access_ok() check
or similar, is a mystery though.

	Dave
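
For illustration only, the "no better idea, so use a random number"
fallback described above could look roughly like the sketch below. This
is not trinity's actual code: the names (arg_kind, rand64, gen_arg) and
the fd/length ranges are made up for this example. The point is just
that the fallback path can return any 64-bit value, 0xfed000f0 included.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical argument kinds, invented for this sketch. */
enum arg_kind { ARG_FD, ARG_LEN, ARG_UNKNOWN };

/*
 * Build 64 random bits from rand(). rand() is only guaranteed to give
 * 15 bits per call, so combine five calls (high bits shift out).
 */
uint64_t rand64(void)
{
	uint64_t r = 0;

	for (int i = 0; i < 5; i++)
		r = (r << 15) | ((uint64_t)rand() & 0x7fff);
	return r;
}

/*
 * Pick a value for one syscall argument. Known kinds get a value in an
 * assumed plausible range; anything else falls back to raw random
 * bits, which may happen to be an address such as 0xfed000f0.
 */
uint64_t gen_arg(enum arg_kind kind)
{
	switch (kind) {
	case ARG_FD:
		return rand64() % 1024;	/* assumed fd range */
	case ARG_LEN:
		return rand64() % 4096;	/* assumed length range */
	default:
		return rand64();	/* no better idea */
	}
}
```

Calling gen_arg(ARG_UNKNOWN) in a loop is the situation the hypothesis
relies on: nothing in the generator itself rules out any particular
value.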