From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751082AbcGWAPZ (ORCPT ); Fri, 22 Jul 2016 20:15:25 -0400 Received: from mail-vk0-f45.google.com ([209.85.213.45]:35531 "EHLO mail-vk0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750728AbcGWAPY (ORCPT ); Fri, 22 Jul 2016 20:15:24 -0400 MIME-Version: 1.0 In-Reply-To: <20160722235459.xtikpj263hroloqo@treble> References: <88568c51f1a253210897e368262d0f5fa1d7e97a.1469136008.git.jpoimboe@redhat.com> <20160722235459.xtikpj263hroloqo@treble> From: Andy Lutomirski Date: Fri, 22 Jul 2016 17:15:03 -0700 Message-ID: Subject: Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface To: Josh Poimboeuf Cc: Thomas Gleixner , Ingo Molnar , "H . Peter Anvin" , X86 ML , "linux-kernel@vger.kernel.org" , Linus Torvalds , Steven Rostedt , Brian Gerst , Kees Cook , Peter Zijlstra , Frederic Weisbecker , Byungchul Park Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf wrote: >> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info, >> > + unsigned long *visit_mask) >> > +{ >> > + unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack); >> > + unsigned long *end = begin + (THREAD_SIZE / sizeof(long)); >> > + >> > + if (stack < begin || stack >= end) >> > + return false; >> > + >> > + if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask)) >> > + return false; >> > + >> > + info->type = STACK_TYPE_IRQ; >> > + info->begin = begin; >> > + info->end = end; >> > + info->next = (unsigned long *)*begin; >> >> This works, but it's a bit magic. I don't suppose we could get rid of >> this ->next thing entirely and teach show_stack_log_lvl(), etc. to >> move from stack to stack by querying the stack type of whatever the >> frame base address is if the frame base address ends up being out of >> bounds for the current stack? Or maybe the unwinder could even do >> this by itself. > > I'm not quite sure what you mean here. The ->next stack pointer is > quite useful and it abstracts that ugliness away from the callers of > get_stack_info(). I'm open to any specific suggestions. So far I've found two users of this thing. One is show_stack_log_lvl(), and it makes sense there, but maybe info->heuristic_next_stack would be a better name. The other is the unwinder itself, and I think that walking from stack to stack using this heuristic is the wrong approach there, at least in the long term. I'd rather we just followed the bp chain wherever it leads us, as long as it leads us to a valid stack that we haven't visited before. As a concrete example of what I think is wrong with the current approach, ISTM it would be totally valid to implement stack switching like this: some_func: push %rbp mov %rsp, %rbp ... mov [next stack], %rsp call some_other_func mov %rbp, %rsp pop %rbp ret With the current approach, you can't unwind out of that function, because there's no way to populate info->next. I'm not actually suggesting that the kernel should ever do such a thing on x86, and my proposed rewrite of the IRQ stack code [1] will be fully compatible with your approach, but it seems odd to me that the unwinder should depend on idea that the stacks in use are chained together in a way that can be decoded without . (But maybe some of the Go compilers do work this way -- I've never looked at their precise stack layout.) Also, if you ever intend to port this thing to other architectures, I think there are architectures that have separate exception stacks and that track the next available slot on those stacks dynamically. I think that x86_32 is an example of this if task gates are used in a back-and-forth manner, although Linux doesn't do that. (x86_64 should have done this for IST, but it didn't.) On those architectures, you can have two separate switches onto the same stack live at the same time, and your current approach won't work. (Even if you make the change I'm suggesting, visit_mask will break, too, but fixing that would be a much less invasive change.) Am I making any sense? This is a suggestion for making it better, not something I see as a requirement for getting the x86 code upstream. >> >> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info, >> > + unsigned long *visit_mask) >> > { >> > unsigned long stack = (unsigned long)s; >> > unsigned long begin, end; >> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long *s, char **name, >> > if (stack < begin || stack >= end) >> > continue; >> > >> > - if (test_and_set_bit(k, visit_mask)) >> > + if (visit_mask && >> > + test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask)) >> > return false; >> > >> > - *name = exception_stack_names[k]; >> > - return (unsigned long *)end; >> > + info->type = STACK_TYPE_EXCEPTION + k; >> > + info->begin = (unsigned long *)begin; >> > + info->end = (unsigned long *)end; >> > + info->next = (unsigned long *)info->end[-2]; >> >> This is so magical that I don't immediately see why it's correct. >> Presumably it's because the thing two slots down from the top of the >> stack is regs->sp? If so, that needs a comment. > > Heck if I know, I just stole it from dump_trace() ;-) > > I'll figure it out and add a comment. If you can write it as: struct pt_regs *regs = (struct pt_regs *)end - 1; info->next = regs->sp; and it still works, then no comment required :) > >> But again, couldn't we use the fact that we now know how to decode >> pt_regs to avoid needing this? I can imagine it being useful as a >> fallback in the event that the unwinder fails, but this is just a >> fallback. > > Yeah, this is needed as a fallback. But I wouldn't call it "just" a > fallback: the stack dump code *needs* to be able to still traverse the > stacks if frame pointers fail. > >> Also, NMI is weird and I'm wondering whether this works at >> all when trying to unwind from a looped NMI. > > Unless I'm missing something, I think it should be fine for nested NMIs, > since they're all on the same stack. I can try to test it. What in > particular are you worried about? > The top of the NMI frame contains no less than *three* (SS, SP, FLAGS, CS, IP) records. Off the top of my head, the record that matters is the third one, so it should be regs[-15]. If an MCE hits while this mess is being set up, good luck unwinding *that*. If you really want to know, take a deep breath, read the long comment in entry_64.S after .Lnmi_from_kernel, then give up on x86 and start hacking on some other architecture.