From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757173AbcGZQ0r (ORCPT ); Tue, 26 Jul 2016 12:26:47 -0400 Received: from mx1.redhat.com ([209.132.183.28]:45493 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753992AbcGZQ0q (ORCPT ); Tue, 26 Jul 2016 12:26:46 -0400 Date: Tue, 26 Jul 2016 11:26:42 -0500 From: Josh Poimboeuf To: Andy Lutomirski Cc: Thomas Gleixner , Ingo Molnar , "H . Peter Anvin" , X86 ML , "linux-kernel@vger.kernel.org" , Linus Torvalds , Steven Rostedt , Brian Gerst , Kees Cook , Peter Zijlstra , Frederic Weisbecker , Byungchul Park Subject: Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface Message-ID: <20160726162642.e3ro4ynrgex74232@treble> References: <88568c51f1a253210897e368262d0f5fa1d7e97a.1469136008.git.jpoimboe@redhat.com> <20160722235459.xtikpj263hroloqo@treble> <20160723140439.36bhw5aslcxlkf3f@treble> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.6.0.1 (2016-04-01) X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.28]); Tue, 26 Jul 2016 16:26:45 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote: > On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf wrote: > >> > Unless I'm missing something, I think it should be fine for nested NMIs, > >> > since they're all on the same stack. I can try to test it. What in > >> > particular are you worried about? > >> > > >> > >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS, > >> CS, IP) records. Off the top of my head, the record that matters is > >> the third one, so it should be regs[-15]. If an MCE hits while this > >> mess is being set up, good luck unwinding *that*. If you really want > >> to know, take a deep breath, read the long comment in entry_64.S after > >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other > >> architecture. > > > > I did read that comment. Fortunately there's a big difference between > > reading and understanding so I can go on being an ignorant x86 hacker! > > > > For nested NMIs, it does look like the stack of the exception which > > interrupted the first NMI would get skipped by the stack dump. (But > > that's a general problem, not specific to my patch set.) > > If we end up with task -> IST -> NMI -> same IST, we're doomed and > we're going to crash, so it doesn't matter much whether the unwinder > works. Is that what you mean? I read the NMI entry code again, and now I realize my comment was completely misinformed, so never mind. Is "IST -> NMI -> same IST" even possible, since the other IST's are higher priority than NMI? > > Am I correct in understanding that there can only be one level of NMI > > nesting at any given time? If so, could we make it easier on the > > unwinder by putting the nested NMI on a separate software stack, so the > > "next stack" pointers are always in the same place? Or am I just being > > naive? > > I think you're being naive :) > > But we don't really need the unwinder to be 100% faithful here. If we have: > > task stack > NMI > nested NMI > > then the nested NMI code won't call into C and thus it should be > impossible to ever invoke your unwinder on that state. Instead the > nested NMI code will fiddle with the saved regs and return in such a > way that the outer NMI will be forced to loop through again. So it > *should* (assuming I'm remembering all this crap correctly) be > sufficient to find the "outermost" pt_regs, which is sitting at > (struct pt_regs *)(end - 12) - 1 or thereabouts and look at it's ->sp > value. This ought to be the same thing that the frame-based unwinder > would naturally try to do. But if you make this change, ISTM you > should make it separately because it does change behavior and Linus is > understandably leery about that. Ok, I think that makes sense to me now. As I understand it, the "outermost" RIP is the authoritative one, because it was written by the original NMI. Any nested NMIs will update the original and/or iret RIPs, which will only ever point to NMI entry code, and so they should be ignored. But I think there's a case where this wouldn't work: task stack NMI IST stack dump If the IST interrupt hits before the NMI has a chance to update the outermost regs, the authoritative RIP would be the original one written by HW, right? > Hmm. I wonder if it would make sense to decode this thing both ways > and display it. So show_trace_log_lvl() could print something like: > > <#DB (0xffffwhatever000-0xffffwhateverfff), next frame is at 0xffffsomething> > > and, in the case where the frame unwinder disagrees, it'll at least be > visible in that 0xffffsomething won't be between 0xffffwhatever000 and > 0xffffwhateverfff. > > Then Linus is happy because the unwinder works just like it did before > but people like me are also happy because it's clearer what's going > on. FWIW, I've personally debugged crashes in the NMI code where the > current unwinder falls apart completely and it's not fun -- having a > display indicating that the unwinder got confused would be nice. Hm, maybe. Another idea would be to have the unwinder print some kind of warning if it goes off the rails. It should be able to detect that, because every stack trace should end at a user pt_regs. -- Josh