From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751082AbcGWAPZ (ORCPT <rfc822;w@1wt.eu>);
	Fri, 22 Jul 2016 20:15:25 -0400
Received: from mail-vk0-f45.google.com ([209.85.213.45]:35531 "EHLO
	mail-vk0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750728AbcGWAPY (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Fri, 22 Jul 2016 20:15:24 -0400
MIME-Version: 1.0
In-Reply-To: <20160722235459.xtikpj263hroloqo@treble>
References: <cover.1469136008.git.jpoimboe@redhat.com> <88568c51f1a253210897e368262d0f5fa1d7e97a.1469136008.git.jpoimboe@redhat.com>
 <CALCETrWLKgYB9KaSZgY3Hz9uVvYNrFc5Yw8moz=CdzJu4nN06g@mail.gmail.com> <20160722235459.xtikpj263hroloqo@treble>
From: Andy Lutomirski <luto@amacapital.net>
Date: Fri, 22 Jul 2016 17:15:03 -0700
Message-ID: <CALCETrVjWtXQJhpnJsDuQv=kuHKY2+Hk43TzRVxDnThQ8GX6Fw@mail.gmail.com>
Subject: Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
To: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@kernel.org>,
        "H . Peter Anvin" <hpa@zytor.com>, X86 ML <x86@kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Steven Rostedt <rostedt@goodmis.org>, Brian Gerst <brgerst@gmail.com>,
        Kees Cook <keescook@chromium.org>,
        Peter Zijlstra <peterz@infradead.org>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Byungchul Park <byungchul.park@lge.com>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
>> > +                            unsigned long *visit_mask)
>> > +{
>> > +       unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
>> > +       unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> > +
>> > +       if (stack < begin || stack >= end)
>> > +               return false;
>> > +
>> > +       if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> > +               return false;
>> > +
>> > +       info->type      = STACK_TYPE_IRQ;
>> > +       info->begin     = begin;
>> > +       info->end       = end;
>> > +       info->next      = (unsigned long *)*begin;
>>
>> This works, but it's a bit magic.  I don't suppose we could get rid of
>> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
>> move from stack to stack by querying the stack type of whatever the
>> frame base address is if the frame base address ends up being out of
>> bounds for the current stack?  Or maybe the unwinder could even do
>> this by itself.
>
> I'm not quite sure what you mean here.  The ->next stack pointer is
> quite useful and it abstracts that ugliness away from the callers of
> get_stack_info().  I'm open to any specific suggestions.

So far I've found two users of this thing.  One is
show_stack_log_lvl(), and it makes sense there, but maybe
info->heuristic_next_stack would be a better name.  The other is the
unwinder itself, and I think that walking from stack to stack using
this heuristic is the wrong approach there, at least in the long term.
I'd rather we just followed the bp chain wherever it leads us, as long
as it leads us to a valid stack that we haven't visited before.

As a concrete example of what I think is wrong with the current
approach, ISTM it would be totally valid to implement stack switching
like this:

some_func:
 push %rbp
 mov %rsp, %rbp
 ...
 mov [next stack], %rsp
 call some_other_func
 mov %rbp, %rsp
 pop %rbp
 ret

With the current approach, you can't unwind out of that function,
because there's no way to populate info->next.  I'm not actually
suggesting that the kernel should ever do such a thing on x86, and my
proposed rewrite of the IRQ stack code [1] will be fully compatible
with your approach, but it seems odd to me that the unwinder should
depend on idea that the stacks in use are chained together in a way
that can be decoded without .  (But maybe some of the Go compilers do
work this way -- I've never looked at their precise stack layout.)

Also, if you ever intend to port this thing to other architectures, I
think there are architectures that have separate exception stacks and
that track the next available slot on those stacks dynamically.  I
think that x86_32 is an example of this if task gates are used in a
back-and-forth manner, although Linux doesn't do that.  (x86_64 should
have done this for IST, but it didn't.)  On those architectures, you
can have two separate switches onto the same stack live at the same
time, and your current approach won't work.  (Even if you make the
change I'm suggesting, visit_mask will break, too, but fixing that
would be a much less invasive change.)

Am I making any sense?  This is a suggestion for making it better, not
something I see as a requirement for getting the x86 code upstream.

>>
>> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
>> > +                              unsigned long *visit_mask)
>> >  {
>> >         unsigned long stack = (unsigned long)s;
>> >         unsigned long begin, end;
>> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long *s, char **name,
>> >                 if (stack < begin || stack >= end)
>> >                         continue;
>> >
>> > -               if (test_and_set_bit(k, visit_mask))
>> > +               if (visit_mask &&
>> > +                   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>> >                         return false;
>> >
>> > -               *name = exception_stack_names[k];
>> > -               return (unsigned long *)end;
>> > +               info->type      = STACK_TYPE_EXCEPTION + k;
>> > +               info->begin     = (unsigned long *)begin;
>> > +               info->end       = (unsigned long *)end;
>> > +               info->next      = (unsigned long *)info->end[-2];
>>
>> This is so magical that I don't immediately see why it's correct.
>> Presumably it's because the thing two slots down from the top of the
>> stack is regs->sp?  If so, that needs a comment.
>
> Heck if I know, I just stole it from dump_trace() ;-)
>
> I'll figure it out and add a comment.

If you can write it as:

struct pt_regs *regs = (struct pt_regs *)end - 1;
info->next = regs->sp;

and it still works, then no comment required :)

>
>> But again, couldn't we use the fact that we now know how to decode
>> pt_regs to avoid needing this?  I can imagine it being useful as a
>> fallback in the event that the unwinder fails, but this is just a
>> fallback.
>
> Yeah, this is needed as a fallback.  But I wouldn't call it "just" a
> fallback: the stack dump code *needs* to be able to still traverse the
> stacks if frame pointers fail.
>
>> Also, NMI is weird and I'm wondering whether this works at
>> all when trying to unwind from a looped NMI.
>
> Unless I'm missing something, I think it should be fine for nested NMIs,
> since they're all on the same stack.  I can try to test it.  What in
> particular are you worried about?
>

The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
CS, IP) records.  Off the top of my head, the record that matters is
the third one, so it should be regs[-15].  If an MCE hits while this
mess is being set up, good luck unwinding *that*.  If you really want
to know, take a deep breath, read the long comment in entry_64.S after
.Lnmi_from_kernel, then give up on x86 and start hacking on some other
architecture.