From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751820AbaLOOAI (ORCPT ); Mon, 15 Dec 2014 09:00:08 -0500
Received: from mail.skyhub.de ([78.46.96.112]:53512 "EHLO mail.skyhub.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751239AbaLOOAE (ORCPT ); Mon, 15 Dec 2014 09:00:04 -0500
Date: Mon, 15 Dec 2014 15:00:00 +0100
From: Borislav Petkov
To: Linus Torvalds
Cc: Dave Jones, Chris Mason, Mike Galbraith, Ingo Molnar, Peter Zijlstra,
	=?utf-8?Q?D=C3=A2niel?= Fraga, Sasha Levin, "Paul E. McKenney",
	Linux Kernel Mailing List, Suresh Siddha, Oleg Nesterov, Peter Anvin
Subject: Re: frequent lockups in 3.18rc4
Message-ID: <20141215140000.GB6590@pd.tnic>
References: <20141211145408.GB16800@redhat.com>
	<20141212185454.GB4716@redhat.com>
	<20141213165915.GA12756@redhat.com>
	<20141213223616.GA22559@redhat.com>
	<20141214234654.GA396@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Dec 14, 2014 at 09:47:26PM -0800, Linus Torvalds wrote:
> and "save_xstate_sig+0x81" shows up on all stacks, although only on
> CPU1 does it show up as a "guaranteed" part of the stack chain (ie it
> matches frame pointer data too). CPU1 also has that __clear_user show
> up (which is called from save_xstate_sig), but not other CPU's. CPU2
> and CPU3 have "save_xstate_sig+0x98" in addition to that +0x81 thing.
>
> My guess is that "save_xstate_sig+0x81" is the instruction after the
> __clear_user call, and that CPU1 took the fault in __clear_user(),
> while CPU2 and CPU3 took the fault at "save_xstate_sig+0x98" instead,
> which I'd guess is the
>
>     xsave64 (%rdi)

Err, maybe a wild guess, but could XSAVE be encountering some problems,
like store ordering violations or somesuch?
Quick search shows "AZ72. Store Ordering Violation When Using XSAVE" here

http://download.intel.com/design/mobile/specupdt/320121.pdf

which talks about SSE context stores happening out of order.

Now, there are a lot of IFs: does Dave's machine even have the erratum
and, even if it does, would that erratum cause some sort of a livelock
leading to the kernel lockups, and so on and so on...

It might be worth ruling out, though.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--