All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86/orc: Don't bail on stack overflow
@ 2017-11-25 17:28 Andy Lutomirski
  2017-11-25 18:26 ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2017-11-25 17:28 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

If we overflow the stack into a guard page and then try to unwind
it with ORC, it should work perfectly: by construction, there can't
be any meaningful data in the guard page because no writes to the
guard page will have succeeded.

ORC seems entirely capable of unwinding in this situation, except
that it doesn't even try.  Adjust its initial stack check so that
it's willing to try unwinding.

I tested this by intentionally overflowing the task stack.  The
result is an accurate call trace instead of a trace consisting
purely of '?' entries.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

Hi all-

Ingo, this would have fixed half the debugging problem you had, I think.
To really nail it, we'd want some kind of magic to annotate the trace
so that page_fault (and async_page_fault) entries show CR2 and error_code.

Josh, any ideas of how to do that cleanly?  We could easily hard-code it
in the OOPS unwinder, I guess.

arch/x86/kernel/unwind_orc.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index a3f973b2c97a..7f6e3935666b 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -553,8 +553,18 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
 	}
 
 	if (get_stack_info((unsigned long *)state->sp, state->task,
-			   &state->stack_info, &state->stack_mask))
-		return;
+			   &state->stack_info, &state->stack_mask)) {
+		/*
+		 * We weren't on a valid stack.  It's possible that
+		 * we overflowed a valid stack into a guard page.
+		 * See if the next page up is valid so that we can
+		 * generate some kind of backtrace if this happens.
+		 */
+		void *next_page = (void *)PAGE_ALIGN((unsigned long)regs->sp);
+		if (get_stack_info(next_page, state->task, &state->stack_info,
+				   &state->stack_mask))
+			return;
+	}
 
 	/*
 	 * The caller can provide the address of the first frame directly
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-25 17:28 [PATCH] x86/orc: Don't bail on stack overflow Andy Lutomirski
@ 2017-11-25 18:26 ` Andy Lutomirski
  2017-11-25 23:13   ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2017-11-25 18:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf

On Sat, Nov 25, 2017 at 9:28 AM, Andy Lutomirski <luto@kernel.org> wrote:
> If we overflow the stack into a guard page and then try to unwind
> it with ORC, it should work perfectly: by construction, there can't
> be any meaningful data in the guard page because no writes to the
> guard page will have succeeded.
>
> ORC seems entirely capable of unwinding in this situation, except
> that it doesn't even try.  Adjust its initial stack check so that
> it's willing to try unwinding.
>
> I tested this by intentionally overflowing the task stack.  The
> result is an accurate call trace instead of a trace consisting
> purely of '?' entries.
>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>
> Hi all-
>
> Ingo, this would have fixed half the debugging problem you had, I think.
> To really nail it, we'd want some kind of magic to annotate the trace
> so that page_fault (and async_page_fault) entries show CR2 and error_code.
>
> Josh, any ideas of how to do that cleanly?  We could easily hard-code it
> in the OOPS unwinder, I guess.

Actually, this does pretty well.  We don't get CR2, but, when I added
an intentional bug kind of along the lines of the one you debugged,
the intermediate page fault successfully dumps all the regs in the
stack trace, so we get the faulting instruction *and* the registers.
We also get ORIG_RAX, which tells us the error code.  We could be
fancy and decode that.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-25 18:26 ` Andy Lutomirski
@ 2017-11-25 23:13   ` Thomas Gleixner
  2017-11-26  0:16     ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2017-11-25 23:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf

On Sat, 25 Nov 2017, Andy Lutomirski wrote:

> On Sat, Nov 25, 2017 at 9:28 AM, Andy Lutomirski <luto@kernel.org> wrote:
> > If we overflow the stack into a guard page and then try to unwind
> > it with ORC, it should work perfectly: by construction, there can't
> > be any meaningful data in the guard page because no writes to the
> > guard page will have succeeded.
> >
> > ORC seems entirely capable of unwinding in this situation, except
> > that it doesn't even try.  Adjust its initial stack check so that
> > it's willing to try unwinding.
> >
> > I tested this by intentionally overflowing the task stack.  The
> > result is an accurate call trace instead of a trace consisting
> > purely of '?' entries.
> >
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > ---
> >
> > Hi all-
> >
> > Ingo, this would have fixed half the debugging problem you had, I think.
> > To really nail it, we'd want some kind of magic to annotate the trace
> > so that page_fault (and async_page_fault) entries show CR2 and error_code.
> >
> > Josh, any ideas of how to do that cleanly?  We could easily hard-code it
> > in the OOPS unwinder, I guess.
> 
> Actually, this does pretty well.  We don't get CR2, but, when I added
> an intentional bug kind of along the lines of the one you debugged,
> the intermediate page fault successfully dumps all the regs in the
> stack trace, so we get the faulting instruction *and* the registers.
> We also get ORIG_RAX, which tells us the error code.  We could be
> fancy and decode that.

It works in general, but for that case it's not much better than before
vs. the '?' entries.

Thanks,

	tglx

[    2.556065] PANIC: double fault, error_code: 0x0
[    2.557116] CPU: 1 PID: 273 Comm: systemd-udevd Not tainted 4.14.0-01256-g03dea81fe9f2-dirty #30
[    2.558930] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[    2.560133] task: ffff880428dd8000 task.stack: ffffc900025fc000
[    2.560729] RIP: 0010:page_fault+0x11/0x60
[    2.561122] RSP: 0000:ffffffffff083fc8 EFLAGS: 00010046
[    2.561607] RAX: 00000000819d0ac7 RBX: 0000000000000001 RCX: ffffffff819d0ac7
[    2.562357] RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffffffffff084078
[    2.563027] RBP: 000000000000000b R08: 00000000ffffffff R09: 0000000000000040
[    2.563726] R10: 0000000000000018 R11: 0000000000000246 R12: 0000000000000003
[    2.564429] R13: 000055719fd7d410 R14: 0000000000000000 R15: 0000000003938700
[    2.565104] FS:  00007f9edc0b78c0(0000) GS:ffff88042e440000(0000) knlGS:0000000000000000
[    2.565844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.566396] CR2: ffffffffff083fb8 CR3: 0000000428ec4005 CR4: 00000000001606e0
[    2.567097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.567761] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.568451] Call Trace:
[    2.568704]  <SYSENTER>
[    2.568950]  ? __do_page_fault+0x4b0/0x4b0
[    2.569348]  ? page_fault+0x2c/0x60
[    2.569680]  ? native_iret+0x7/0x7
[    2.570019]  ? __do_page_fault+0x4b0/0x4b0
[    2.570396]  ? page_fault+0x2c/0x60
[    2.570743]  ? call_function_interrupt+0xc0/0xc0
[    2.571199]  </SYSENTER>
[    2.571422] Code: ff e8 34 b7 6a ff e9 9f 02 00 00 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 83 c4 88 f6 84 24 88 00 00 00 03 75 20 <e8> 4a 01 00 00 48 89 e7 48 8b 74 24 78 48 c7 44 24 78 ff ff ff 
[    2.573192] Kernel panic - not syncing: Machine halted.
[    2.573694] CPU: 1 PID: 273 Comm: systemd-udevd Not tainted 4.14.0-01256-g03dea81fe9f2-dirty #30
[    2.574528] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[    2.575330] Call Trace:
[    2.575570]  <#DF>
[    2.575760]  dump_stack+0x46/0x59
[    2.576120]  panic+0xde/0x223
[    2.576405]  df_debug+0x29/0x30
[    2.576687]  do_double_fault+0x9a/0x120
[    2.577057]  double_fault+0x22/0x30
[    2.577376] RIP: 0010:page_fault+0x11/0x60
[    2.577775] RSP: 0000:ffffffffff083fc8 EFLAGS: 00010046
[    2.578314] RAX: 00000000819d0ac7 RBX: 0000000000000001 RCX: ffffffff819d0ac7
[    2.578979] RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffffffffff084078
[    2.579666] RBP: 000000000000000b R08: 00000000ffffffff R09: 0000000000000040
[    2.580334] R10: 0000000000000018 R11: 0000000000000246 R12: 0000000000000003
[    2.581008] R13: 000055719fd7d410 R14: 0000000000000000 R15: 0000000003938700
[    2.581684]  ? native_iret+0x7/0x7
[    2.582007] WARNING: can't dereference iret registers at ffffffffff084048 for ip page_fault+0x11/0x60
[    2.582008]  </#DF>
[    2.583134]  <SYSENTER>
[    2.583367]  ? __do_page_fault+0x4b0/0x4b0
[    2.583751]  ? page_fault+0x2c/0x60
[    2.584127]  ? native_iret+0x7/0x7
[    2.584466]  ? __do_page_fault+0x4b0/0x4b0
[    2.584860]  ? page_fault+0x2c/0x60
[    2.585195]  ? call_function_interrupt+0xc0/0xc0
[    2.585621]  </SYSENTER>
[    2.586966] Dumping ftrace buffer:
[    2.587254]    (ftrace buffer empty)
[    2.587534] Kernel Offset: disabled
[    2.587814] ---[ end Kernel panic - not syncing: Machine halted.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-25 23:13   ` Thomas Gleixner
@ 2017-11-26  0:16     ` Andy Lutomirski
  2017-11-26  2:40       ` Josh Poimboeuf
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2017-11-26  0:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds, Josh Poimboeuf

Can you send me whatever config and exact commit hash generated this?
I can try to figure out why it failed.

On Sat, Nov 25, 2017 at 3:13 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sat, 25 Nov 2017, Andy Lutomirski wrote:
>
>> On Sat, Nov 25, 2017 at 9:28 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> > If we overflow the stack into a guard page and then try to unwind
>> > it with ORC, it should work perfectly: by construction, there can't
>> > be any meaningful data in the guard page because no writes to the
>> > guard page will have succeeded.
>> >
>> > ORC seems entirely capable of unwinding in this situation, except
>> > that it doesn't even try.  Adjust its initial stack check so that
>> > it's willing to try unwinding.
>> >
>> > I tested this by intentionally overflowing the task stack.  The
>> > result is an accurate call trace instead of a trace consisting
>> > purely of '?' entries.
>> >
>> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> > ---
>> >
>> > Hi all-
>> >
>> > Ingo, this would have fixed half the debugging problem you had, I think.
>> > To really nail it, we'd want some kind of magic to annotate the trace
>> > so that page_fault (and async_page_fault) entries show CR2 and error_code.
>> >
>> > Josh, any ideas of how to do that cleanly?  We could easily hard-code it
>> > in the OOPS unwinder, I guess.
>>
>> Actually, this does pretty well.  We don't get CR2, but, when I added
>> an intentional bug kind of along the lines of the one you debugged,
>> the intermediate page fault successfully dumps all the regs in the
>> stack trace, so we get the faulting instruction *and* the registers.
>> We also get ORIG_RAX, which tells us the error code.  We could be
>> fancy and decode that.
>
> It works in general, but for that case it's not much better than before
> vs. the '?' entries.
>
> Thanks,
>
>         tglx
>
> [    2.556065] PANIC: double fault, error_code: 0x0
> [    2.557116] CPU: 1 PID: 273 Comm: systemd-udevd Not tainted 4.14.0-01256-g03dea81fe9f2-dirty #30
> [    2.558930] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [    2.560133] task: ffff880428dd8000 task.stack: ffffc900025fc000
> [    2.560729] RIP: 0010:page_fault+0x11/0x60
> [    2.561122] RSP: 0000:ffffffffff083fc8 EFLAGS: 00010046
> [    2.561607] RAX: 00000000819d0ac7 RBX: 0000000000000001 RCX: ffffffff819d0ac7
> [    2.562357] RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffffffffff084078
> [    2.563027] RBP: 000000000000000b R08: 00000000ffffffff R09: 0000000000000040
> [    2.563726] R10: 0000000000000018 R11: 0000000000000246 R12: 0000000000000003
> [    2.564429] R13: 000055719fd7d410 R14: 0000000000000000 R15: 0000000003938700
> [    2.565104] FS:  00007f9edc0b78c0(0000) GS:ffff88042e440000(0000) knlGS:0000000000000000
> [    2.565844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    2.566396] CR2: ffffffffff083fb8 CR3: 0000000428ec4005 CR4: 00000000001606e0
> [    2.567097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    2.567761] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    2.568451] Call Trace:
> [    2.568704]  <SYSENTER>
> [    2.568950]  ? __do_page_fault+0x4b0/0x4b0
> [    2.569348]  ? page_fault+0x2c/0x60
> [    2.569680]  ? native_iret+0x7/0x7
> [    2.570019]  ? __do_page_fault+0x4b0/0x4b0
> [    2.570396]  ? page_fault+0x2c/0x60
> [    2.570743]  ? call_function_interrupt+0xc0/0xc0
> [    2.571199]  </SYSENTER>
> [    2.571422] Code: ff e8 34 b7 6a ff e9 9f 02 00 00 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 83 c4 88 f6 84 24 88 00 00 00 03 75 20 <e8> 4a 01 00 00 48 89 e7 48 8b 74 24 78 48 c7 44 24 78 ff ff ff
> [    2.573192] Kernel panic - not syncing: Machine halted.
> [    2.573694] CPU: 1 PID: 273 Comm: systemd-udevd Not tainted 4.14.0-01256-g03dea81fe9f2-dirty #30
> [    2.574528] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [    2.575330] Call Trace:
> [    2.575570]  <#DF>
> [    2.575760]  dump_stack+0x46/0x59
> [    2.576120]  panic+0xde/0x223
> [    2.576405]  df_debug+0x29/0x30
> [    2.576687]  do_double_fault+0x9a/0x120
> [    2.577057]  double_fault+0x22/0x30
> [    2.577376] RIP: 0010:page_fault+0x11/0x60
> [    2.577775] RSP: 0000:ffffffffff083fc8 EFLAGS: 00010046
> [    2.578314] RAX: 00000000819d0ac7 RBX: 0000000000000001 RCX: ffffffff819d0ac7
> [    2.578979] RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffffffffff084078
> [    2.579666] RBP: 000000000000000b R08: 00000000ffffffff R09: 0000000000000040
> [    2.580334] R10: 0000000000000018 R11: 0000000000000246 R12: 0000000000000003
> [    2.581008] R13: 000055719fd7d410 R14: 0000000000000000 R15: 0000000003938700
> [    2.581684]  ? native_iret+0x7/0x7
> [    2.582007] WARNING: can't dereference iret registers at ffffffffff084048 for ip page_fault+0x11/0x60
> [    2.582008]  </#DF>
> [    2.583134]  <SYSENTER>
> [    2.583367]  ? __do_page_fault+0x4b0/0x4b0
> [    2.583751]  ? page_fault+0x2c/0x60
> [    2.584127]  ? native_iret+0x7/0x7
> [    2.584466]  ? __do_page_fault+0x4b0/0x4b0
> [    2.584860]  ? page_fault+0x2c/0x60
> [    2.585195]  ? call_function_interrupt+0xc0/0xc0
> [    2.585621]  </SYSENTER>
> [    2.586966] Dumping ftrace buffer:
> [    2.587254]    (ftrace buffer empty)
> [    2.587534] Kernel Offset: disabled
> [    2.587814] ---[ end Kernel panic - not syncing: Machine halted.
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-26  0:16     ` Andy Lutomirski
@ 2017-11-26  2:40       ` Josh Poimboeuf
  2017-11-26  4:25         ` Andy Lutomirski
  2017-11-27  9:38         ` Ingo Molnar
  0 siblings, 2 replies; 13+ messages in thread
From: Josh Poimboeuf @ 2017-11-26  2:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds

On Sat, Nov 25, 2017 at 04:16:23PM -0800, Andy Lutomirski wrote:
> Can you send me whatever config and exact commit hash generated this?
> I can try to figure out why it failed.

Sorry, I've been traveling.  I just got some time to take a look at
this.  I think there are at least two unwinder issues here:

- It doesn't deal gracefully with the case where the stack overflows and
  the stack pointer itself isn't on a valid stack but the
  to-be-dereferenced data *is*.

- The oops dump code doesn't know how to print partial pt_regs, for the
  case where if we get an interrupt/exception in *early* entry code
  before the full pt_regs have been saved.

(Andy, I'm not quite sure about your patch, and whether it's still
needed after these patches.  I'll need to look at it later when I have
more time.)

I attempted to fix both of the issues with the below patch.  Thomas or
Ingo, can you test to see if this gets rid of the question marks?

I can split it up into proper patches next week.  I'm assuming this
isn't holding up the KAISER merge?


diff --git a/arch/x86/include/asm/kdebug.h b/arch/x86/include/asm/kdebug.h
index f86a8caa561e..395c9631e000 100644
--- a/arch/x86/include/asm/kdebug.h
+++ b/arch/x86/include/asm/kdebug.h
@@ -26,6 +26,7 @@ extern void die(const char *, struct pt_regs *,long);
 extern int __must_check __die(const char *, struct pt_regs *, long);
 extern void show_stack_regs(struct pt_regs *regs);
 extern void __show_regs(struct pt_regs *regs, int all);
+extern void show_iret_regs(struct pt_regs *regs);
 extern unsigned long oops_begin(void);
 extern void oops_end(unsigned long, struct pt_regs *, int signr);
 
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index e9cc6fe1fc6f..5be2fb23825a 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -7,6 +7,9 @@
 #include <asm/ptrace.h>
 #include <asm/stacktrace.h>
 
+#define IRET_FRAME_OFFSET (offsetof(struct pt_regs, ip))
+#define IRET_FRAME_SIZE   (sizeof(struct pt_regs) - IRET_FRAME_OFFSET)
+
 struct unwind_state {
 	struct stack_info stack_info;
 	unsigned long stack_mask;
@@ -52,6 +55,10 @@ void unwind_start(struct unwind_state *state, struct task_struct *task,
 }
 
 #if defined(CONFIG_UNWINDER_ORC) || defined(CONFIG_UNWINDER_FRAME_POINTER)
+/*
+ * WARNING: The entire pt_regs may not be safe to dereference.  In some cases,
+ * only the iret registers are accessible.  Use with caution!
+ */
 static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	if (unwind_done(state))
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index d58dd121c0af..f7f3b5b3bc05 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -68,6 +68,21 @@ static void printk_stack_address(unsigned long address, int reliable,
 	printk("%s %s%pB\n", log_lvl, reliable ? "" : "? ", (void *)address);
 }
 
+static void show_regs_safe(struct stack_info *info, struct pt_regs *regs)
+{
+	if (on_stack(info, regs, sizeof(*regs)))
+		__show_regs(regs, 0);
+	else if (on_stack(info, (void *)regs + IRET_FRAME_OFFSET,
+			  IRET_FRAME_SIZE)) {
+		/*
+		 * When an interrupt or exception occurs in entry code, the
+		 * full pt_regs might not have been saved yet.  In that case
+		 * just print the iret return frame.
+		 */
+		show_iret_regs(regs);
+	}
+}
+
 void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 			unsigned long *stack, char *log_lvl)
 {
@@ -116,8 +131,8 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		if (stack_name)
 			printk("%s <%s>\n", log_lvl, stack_name);
 
-		if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
-			__show_regs(regs, 0);
+		if (regs)
+			show_regs_safe(&stack_info, regs);
 
 		/*
 		 * Scan the stack, printing any text addresses we find.  At the
@@ -141,7 +156,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 			/*
 			 * Don't print regs->ip again if it was already printed
-			 * by __show_regs() below.
+			 * by show_regs_safe() below.
 			 */
 			if (regs && stack == &regs->ip)
 				goto next;
@@ -177,8 +192,8 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 			/* if the frame has entry regs, print them */
 			regs = unwind_get_entry_regs(&state);
-			if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
-				__show_regs(regs, 0);
+			if (regs)
+				show_regs_safe(&stack_info, regs);
 		}
 
 		if (stack_name)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9a0220aa2bf9..2e0e2979e34d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -61,6 +61,13 @@
 
 __visible DEFINE_PER_CPU_USER_MAPPED(unsigned long, rsp_scratch);
 
+void show_iret_regs(struct pt_regs *regs)
+{
+	printk(KERN_DEFAULT "RIP: %04lx:%pS\n", regs->cs, (void *)regs->ip);
+	printk(KERN_DEFAULT "RSP: %04lx:%016lx EFLAGS: %08lx", regs->ss,
+		regs->sp, regs->flags);
+}
+
 /* Prints also some state that isn't saved in the pt_regs */
 void __show_regs(struct pt_regs *regs, int all)
 {
@@ -69,9 +76,8 @@ void __show_regs(struct pt_regs *regs, int all)
 	unsigned int fsindex, gsindex;
 	unsigned int ds, cs, es;
 
-	printk(KERN_DEFAULT "RIP: %04lx:%pS\n", regs->cs, (void *)regs->ip);
-	printk(KERN_DEFAULT "RSP: %04lx:%016lx EFLAGS: %08lx", regs->ss,
-		regs->sp, regs->flags);
+	show_iret_regs(regs);
+
 	if (regs->orig_ax != -1)
 		pr_cont(" ORIG_RAX: %016lx\n", regs->orig_ax);
 	else
@@ -88,6 +94,9 @@ void __show_regs(struct pt_regs *regs, int all)
 	printk(KERN_DEFAULT "R13: %016lx R14: %016lx R15: %016lx\n",
 	       regs->r13, regs->r14, regs->r15);
 
+	if (!all)
+		return;
+
 	asm("movl %%ds,%0" : "=r" (ds));
 	asm("movl %%cs,%0" : "=r" (cs));
 	asm("movl %%es,%0" : "=r" (es));
@@ -98,9 +107,6 @@ void __show_regs(struct pt_regs *regs, int all)
 	rdmsrl(MSR_GS_BASE, gs);
 	rdmsrl(MSR_KERNEL_GS_BASE, shadowgs);
 
-	if (!all)
-		return;
-
 	cr0 = read_cr0();
 	cr2 = read_cr2();
 	cr3 = __read_cr3();
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 77835bc021c7..7dd0d2a0d142 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -102,7 +102,7 @@ __save_stack_trace_reliable(struct stack_trace *trace,
 	for (unwind_start(&state, task, NULL, NULL); !unwind_done(&state);
 	     unwind_next_frame(&state)) {
 
-		regs = unwind_get_entry_regs(&state);
+		regs = unwind_get_entry_regs(&state, NULL);
 		if (regs) {
 			/*
 			 * Kernel mode registers on the stack indicate an
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index a3f973b2c97a..06a0bdb9de46 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -253,22 +253,15 @@ unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
 	return NULL;
 }
 
-static bool stack_access_ok(struct unwind_state *state, unsigned long addr,
+static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
 			    size_t len)
 {
 	struct stack_info *info = &state->stack_info;
+	void *addr = (void *)_addr;
 
-	/*
-	 * If the address isn't on the current stack, switch to the next one.
-	 *
-	 * We may have to traverse multiple stacks to deal with the possibility
-	 * that info->next_sp could point to an empty stack and the address
-	 * could be on a subsequent stack.
-	 */
-	while (!on_stack(info, (void *)addr, len))
-		if (get_stack_info(info->next_sp, state->task, info,
-				   &state->stack_mask))
-			return false;
+	if (!on_stack(info, addr, len) &&
+	    (get_stack_info(addr, state->task, info, &state->stack_mask)))
+		return false;
 
 	return true;
 }
@@ -283,42 +276,32 @@ static bool deref_stack_reg(struct unwind_state *state, unsigned long addr,
 	return true;
 }
 
-#define REGS_SIZE (sizeof(struct pt_regs))
-#define SP_OFFSET (offsetof(struct pt_regs, sp))
-#define IRET_REGS_SIZE (REGS_SIZE - offsetof(struct pt_regs, ip))
-#define IRET_SP_OFFSET (SP_OFFSET - offsetof(struct pt_regs, ip))
-
 static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
-			     unsigned long *ip, unsigned long *sp, bool full)
+			     unsigned long *ip, unsigned long *sp)
 {
-	size_t regs_size = full ? REGS_SIZE : IRET_REGS_SIZE;
-	size_t sp_offset = full ? SP_OFFSET : IRET_SP_OFFSET;
-	struct pt_regs *regs = (struct pt_regs *)(addr + regs_size - REGS_SIZE);
-
-	if (IS_ENABLED(CONFIG_X86_64)) {
-		if (!stack_access_ok(state, addr, regs_size))
-			return false;
+	struct pt_regs *regs = (struct pt_regs *)addr;
 
-		*ip = regs->ip;
-		*sp = regs->sp;
-
-		return true;
-	}
+	/* x86-32 support will be more complicated due to the &regs->sp hack */
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_X86_32));
 
-	if (!stack_access_ok(state, addr, sp_offset))
+	if (!stack_access_ok(state, addr, sizeof(struct pt_regs)))
 		return false;
 
 	*ip = regs->ip;
+	*sp = regs->sp;
+	return true;
+}
 
-	if (user_mode(regs)) {
-		if (!stack_access_ok(state, addr + sp_offset,
-				     REGS_SIZE - SP_OFFSET))
-			return false;
+static bool deref_stack_iret_regs(struct unwind_state *state, unsigned long addr,
+				  unsigned long *ip, unsigned long *sp)
+{
+	struct pt_regs *regs = (void *)addr - IRET_FRAME_OFFSET;
 
-		*sp = regs->sp;
-	} else
-		*sp = (unsigned long)&regs->sp;
+	if (!stack_access_ok(state, addr, IRET_FRAME_SIZE))
+		return false;
 
+	*ip = regs->ip;
+	*sp = regs->sp;
 	return true;
 }
 
@@ -327,7 +310,6 @@ bool unwind_next_frame(struct unwind_state *state)
 	unsigned long ip_p, sp, orig_ip, prev_sp = state->sp;
 	enum stack_type prev_type = state->stack_info.type;
 	struct orc_entry *orc;
-	struct pt_regs *ptregs;
 	bool indirect = false;
 
 	if (unwind_done(state))
@@ -435,7 +417,7 @@ bool unwind_next_frame(struct unwind_state *state)
 		break;
 
 	case ORC_TYPE_REGS:
-		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, true)) {
+		if (!deref_stack_regs(state, sp, &state->ip, &state->sp)) {
 			orc_warn("can't dereference registers at %p for ip %pB\n",
 				 (void *)sp, (void *)orig_ip);
 			goto done;
@@ -447,20 +429,14 @@ bool unwind_next_frame(struct unwind_state *state)
 		break;
 
 	case ORC_TYPE_REGS_IRET:
-		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, false)) {
+		if (!deref_stack_iret_regs(state, sp, &state->ip, &state->sp)) {
 			orc_warn("can't dereference iret registers at %p for ip %pB\n",
 				 (void *)sp, (void *)orig_ip);
 			goto done;
 		}
 
-		ptregs = container_of((void *)sp, struct pt_regs, ip);
-		if ((unsigned long)ptregs >= prev_sp &&
-		    on_stack(&state->stack_info, ptregs, REGS_SIZE)) {
-			state->regs = ptregs;
-			state->full_regs = false;
-		} else
-			state->regs = NULL;
-
+		state->regs = (void *)sp - IRET_FRAME_OFFSET;
+		state->full_regs = false;
 		state->signal = true;
 		break;
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-26  2:40       ` Josh Poimboeuf
@ 2017-11-26  4:25         ` Andy Lutomirski
  2017-11-26  4:41           ` Josh Poimboeuf
  2017-11-27  9:38         ` Ingo Molnar
  1 sibling, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2017-11-26  4:25 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, X86 ML, Borislav Petkov,
	linux-kernel, Brian Gerst, Dave Hansen, Linus Torvalds

On Sat, Nov 25, 2017 at 6:40 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Sat, Nov 25, 2017 at 04:16:23PM -0800, Andy Lutomirski wrote:
>> Can you send me whatever config and exact commit hash generated this?
>> I can try to figure out why it failed.
>
> Sorry, I've been traveling.  I just got some time to take a look at
> this.  I think there are at least two unwinder issues here:
>
> - It doesn't deal gracefully with the case where the stack overflows and
>   the stack pointer itself isn't on a valid stack but the
>   to-be-dereferenced data *is*.
>
> - The oops dump code doesn't know how to print partial pt_regs, for the
>   case where if we get an interrupt/exception in *early* entry code
>   before the full pt_regs have been saved.
>
> (Andy, I'm not quite sure about your patch, and whether it's still
> needed after these patches.  I'll need to look at it later when I have
> more time.)

I haven't tested yet, but I think my patch is probably still needed.
The issue I fixed is that unwind_start() would bail out early if sp
was below the stack.  Also:

> -static bool stack_access_ok(struct unwind_state *state, unsigned long addr,
> +static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
>                             size_t len)
>  {
>         struct stack_info *info = &state->stack_info;
> +       void *addr = (void *)_addr;
>
> -       /*
> -        * If the address isn't on the current stack, switch to the next one.
> -        *
> -        * We may have to traverse multiple stacks to deal with the possibility
> -        * that info->next_sp could point to an empty stack and the address
> -        * could be on a subsequent stack.
> -        */
> -       while (!on_stack(info, (void *)addr, len))
> -               if (get_stack_info(info->next_sp, state->task, info,
> -                                  &state->stack_mask))
> -                       return false;
> +       if (!on_stack(info, addr, len) &&
> +           (get_stack_info(addr, state->task, info, &state->stack_mask)))
> +               return false;
>
>         return true;
>  }

This looks odd to me before and after.  Shouldn't this be side-effect
free?  That is, shouldn't it create a copy of info and stack_mask and
point that to get_stack_info() rather than allowing get_stack_info()
to modify the unwind state?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-26  4:25         ` Andy Lutomirski
@ 2017-11-26  4:41           ` Josh Poimboeuf
  2017-11-26  4:48             ` Josh Poimboeuf
  0 siblings, 1 reply; 13+ messages in thread
From: Josh Poimboeuf @ 2017-11-26  4:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds

On Sat, Nov 25, 2017 at 08:25:12PM -0800, Andy Lutomirski wrote:
> On Sat, Nov 25, 2017 at 6:40 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Sat, Nov 25, 2017 at 04:16:23PM -0800, Andy Lutomirski wrote:
> >> Can you send me whatever config and exact commit hash generated this?
> >> I can try to figure out why it failed.
> >
> > Sorry, I've been traveling.  I just got some time to take a look at
> > this.  I think there are at least two unwinder issues here:
> >
> > - It doesn't deal gracefully with the case where the stack overflows and
> >   the stack pointer itself isn't on a valid stack but the
> >   to-be-dereferenced data *is*.
> >
> > - The oops dump code doesn't know how to print partial pt_regs, for the
> >   case where if we get an interrupt/exception in *early* entry code
> >   before the full pt_regs have been saved.
> >
> > (Andy, I'm not quite sure about your patch, and whether it's still
> > needed after these patches.  I'll need to look at it later when I have
> > more time.)
> 
> I haven't tested yet, but I think my patch is probably still needed.
> The issue I fixed is that unwind_start() would bail out early if sp
> was below the stack.  Also:

Makes sense, maybe both are needed.  Your patch deals with a bad SP at
the beginning and mine deals with a bad SP in the middle.

> > -static bool stack_access_ok(struct unwind_state *state, unsigned long addr,
> > +static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
> >                             size_t len)
> >  {
> >         struct stack_info *info = &state->stack_info;
> > +       void *addr = (void *)_addr;
> >
> > -       /*
> > -        * If the address isn't on the current stack, switch to the next one.
> > -        *
> > -        * We may have to traverse multiple stacks to deal with the possibility
> > -        * that info->next_sp could point to an empty stack and the address
> > -        * could be on a subsequent stack.
> > -        */
> > -       while (!on_stack(info, (void *)addr, len))
> > -               if (get_stack_info(info->next_sp, state->task, info,
> > -                                  &state->stack_mask))
> > -                       return false;
> > +       if (!on_stack(info, addr, len) &&
> > +           (get_stack_info(addr, state->task, info, &state->stack_mask)))
> > +               return false;
> >
> >         return true;
> >  }
> 
> This looks odd to me before and after.  Shouldn't this be side-effect
> free?  That is, shouldn't it create a copy of info and stack_mask and
> point that to get_stack_info() rather than allowing get_stack_info()
> to modify the unwind state?

I think the side effects are ok, but maybe stack_access_ok() should be
renamed to make it clearer that it has side effects.

-- 
Josh

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-26  4:41           ` Josh Poimboeuf
@ 2017-11-26  4:48             ` Josh Poimboeuf
  2017-11-26  9:27               ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Josh Poimboeuf @ 2017-11-26  4:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds

On Sat, Nov 25, 2017 at 10:41:15PM -0600, Josh Poimboeuf wrote:
> On Sat, Nov 25, 2017 at 08:25:12PM -0800, Andy Lutomirski wrote:
> > On Sat, Nov 25, 2017 at 6:40 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > On Sat, Nov 25, 2017 at 04:16:23PM -0800, Andy Lutomirski wrote:
> > >> Can you send me whatever config and exact commit hash generated this?
> > >> I can try to figure out why it failed.
> > >
> > > Sorry, I've been traveling.  I just got some time to take a look at
> > > this.  I think there are at least two unwinder issues here:
> > >
> > > - It doesn't deal gracefully with the case where the stack overflows and
> > >   the stack pointer itself isn't on a valid stack but the
> > >   to-be-dereferenced data *is*.
> > >
> > > - The oops dump code doesn't know how to print partial pt_regs, for the
> > >   case where if we get an interrupt/exception in *early* entry code
> > >   before the full pt_regs have been saved.
> > >
> > > (Andy, I'm not quite sure about your patch, and whether it's still
> > > needed after these patches.  I'll need to look at it later when I have
> > > more time.)
> > 
> > I haven't tested yet, but I think my patch is probably still needed.
> > The issue I fixed is that unwind_start() would bail out early if sp
> > was below the stack.  Also:
> 
> Makes sense, maybe both are needed.  Your patch deals with a bad SP at
> the beginning and mine deals with a bad SP in the middle.

I was able to recreate with the config Ingo posted earlier, along with
the following patch:

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index e12168936d3f..693a20d309e3 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -500,7 +500,7 @@
                VMLINUX_SYMBOL(__entry_text_end) = .;

 #define IRQENTRY_TEXT                                                  \
-               ALIGN_FUNCTION();                                       \
+               . = ALIGN(4096);                                        \
                VMLINUX_SYMBOL(__irqentry_text_start) = .;              \
                *(.irqentry.text)                                       \
                VMLINUX_SYMBOL(__irqentry_text_end) = .;


It looks a *lot* better with mine and your patches applied.  It probably
would have helped Ingo and Thomas figure the problem out a lot sooner:

[    1.159016] PANIC: double fault, error_code: 0x0
[    1.159583] CPU: 1 PID: 68 Comm: modprobe Not tainted 4.14.0-01257-g761d390195b6-dirty #19
[    1.159583] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[    1.159583] task: ffff880136f6bb00 task.stack: ffffc90000984000
[    1.159583] RIP: 0010:page_fault+0x11/0x60
[    1.159583] RSP: 0000:ffffffffff083fc8 EFLAGS: 00010046
[    1.159583] RAX: 00000000819d0a87 RBX: 0000000000000001 RCX: ffffffff819d0a87
[    1.159583] RDX: 0000000000001000 RSI: 0000000000000010 RDI: ffffffffff084078
[    1.159583] RBP: 0000000000000d68 R08: 00007f6d6bb24278 R09: 0000000000000023
[    1.159583] R10: 0000558e0feca600 R11: 0000000000000246 R12: 00007f6d6bb203c0
[    1.159583] R13: 00007f6d6bb1f880 R14: 00007ffff793bebc R15: 0000000000000100
[    1.159583] FS:  00007f6d6c39c5c0(0000) GS:ffff88013fc40000(0000) knlGS:0000000000000000
[    1.159583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.159583] CR2: ffffffffff083fb8 CR3: 0000000136f78002 CR4: 00000000001606e0
[    1.159583] Call Trace:
[    1.159583]  <SYSENTER>
[    1.159583]  __do_page_fault+0x4b0/0x4b0
[    1.159583]  page_fault+0x2c/0x60
[    1.159583] RIP: 0010:do_page_fault+0x0/0x100
[    1.159583] RSP: 0000:ffffffffff084120 EFLAGS: 00010012
[    1.159583] RAX: 00000000819d0a87 RBX: 0000000000000001 RCX: ffffffff819d0a87
[    1.159583] RDX: 0000000000001000 RSI: 0000000000000010 RDI: ffffffffff084128
[    1.159583] RBP: 0000000000000d68 R08: 00007f6d6bb24278 R09: 0000000000000023
[    1.159583] R10: 0000558e0feca600 R11: 0000000000000246 R12: 00007f6d6bb203c0
[    1.159583] R13: 00007f6d6bb1f880 R14: 00007ffff793bebc R15: 0000000000000100
[    1.159583]  ? native_iret+0x7/0x7
[    1.159583]  page_fault+0x2c/0x60
[    1.159583] RIP: 0010:apic_timer_interrupt+0x0/0xb0
[    1.159583] RSP: 0000:ffffffffff0841d8 EFLAGS: 00010046
[    1.159583] RAX: 0000000000000374 RBX: 0000558e0feca2c0 RCX: 00007f6d6b85aaf0
[    1.159583] RDX: 0000000000001000 RSI: 0000558e0feca600 RDI: 0000000000000000
[    1.159583] RBP: 0000000000000d68 R08: 00007f6d6bb24278 R09: 0000000000000023
[    1.159583] R10: 0000558e0feca600 R11: 0000000000000246 R12: 00007f6d6bb203c0
[    1.159583] R13: 00007f6d6bb1f880 R14: 00007ffff793bebc R15: 0000000000000100
[    1.159583] RIP: 0033:0x7f6d6b85aaf0
[    1.159583] RSP: 002b:00007ffff793bd68 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[    1.159583] RAX: ffffffff819d2000 RBX: 00007f6d6b85aaf0 RCX: 0000000000000010
[    1.159583] RDX: 0000000000010046 RSI: ffffffffff0841d8 RDI: 0000000000000000
[    1.159583] RBP: 0000000000000374 R08: ffffffffffffffff R09: 0000000000000000
[    1.159583] R10: 0000558e0feca600 R11: 0000000000001000 R12: 00007f6d6bb24278
[    1.159583] R13: 0000000000000023 R14: 0000558e0feca600 R15: 0000000000000246
[    1.159583]  </SYSENTER>
[    1.159583] Code: ff e8 94 b7 6a ff e9 9f 02 00 00 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 83 c4 88 f6 84 24 88 00 00 00 03 75 20 <e8> 4a 01 00 00 48 89 e7 48 8b 74 24 78 48 c7 44 24 78 ff ff ff
[    1.159583] Kernel panic - not syncing: Machine halted.
[    1.159583] CPU: 1 PID: 68 Comm: modprobe Not tainted 4.14.0-01257-g761d390195b6-dirty #19
[    1.159583] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[    1.159583] Call Trace:
[    1.159583]  <#DF>
[    1.159583]  dump_stack+0x46/0x59
[    1.159583]  panic+0xde/0x223
[    1.159583]  df_debug+0x29/0x30
[    1.159583]  do_double_fault+0x9a/0x120
[    1.159583]  double_fault+0x22/0x30
[    1.159583] RIP: 0010:page_fault+0x11/0x60
[    1.159583] RSP: 0000:ffffffffff083fc8 EFLAGS: 00010046
[    1.159583] RAX: 00000000819d0a87 RBX: 0000000000000001 RCX: ffffffff819d0a87
[    1.159583] RDX: 0000000000001000 RSI: 0000000000000010 RDI: ffffffffff084078
[    1.159583] RBP: 0000000000000d68 R08: 00007f6d6bb24278 R09: 0000000000000023
[    1.159583] R10: 0000558e0feca600 R11: 0000000000000246 R12: 00007f6d6bb203c0
[    1.159583] R13: 00007f6d6bb1f880 R14: 00007ffff793bebc R15: 0000000000000100
[    1.159583]  ? native_iret+0x7/0x7
[    1.159583]  </#DF>
[    1.159583]  <SYSENTER>
[    1.159583] RIP: 0010:do_page_fault+0x0/0x100
[    1.159583] RSP: 0000:ffffffffff084070 EFLAGS: 00010097
[    1.159583]  page_fault+0x2c/0x60
[    1.159583] RIP: 0010:do_page_fault+0x0/0x100
[    1.159583] RSP: 0000:ffffffffff084120 EFLAGS: 00010012
[    1.159583] RAX: 00000000819d0a87 RBX: 0000000000000001 RCX: ffffffff819d0a87
[    1.159583] RDX: 0000000000001000 RSI: 0000000000000010 RDI: ffffffffff084128
[    1.159583] RBP: 0000000000000d68 R08: 00007f6d6bb24278 R09: 0000000000000023
[    1.159583] R10: 0000558e0feca600 R11: 0000000000000246 R12: 00007f6d6bb203c0
[    1.159583] R13: 00007f6d6bb1f880 R14: 00007ffff793bebc R15: 0000000000000100
[    1.159583]  ? native_iret+0x7/0x7
[    1.159583]  page_fault+0x2c/0x60
[    1.159583] RIP: 0010:apic_timer_interrupt+0x0/0xb0
[    1.159583] RSP: 0000:ffffffffff0841d8 EFLAGS: 00010046
[    1.159583] RAX: 0000000000000374 RBX: 0000558e0feca2c0 RCX: 00007f6d6b85aaf0
[    1.159583] RDX: 0000000000001000 RSI: 0000558e0feca600 RDI: 0000000000000000
[    1.159583] RBP: 0000000000000d68 R08: 00007f6d6bb24278 R09: 0000000000000023
[    1.159583] R10: 0000558e0feca600 R11: 0000000000000246 R12: 00007f6d6bb203c0
[    1.159583] R13: 00007f6d6bb1f880 R14: 00007ffff793bebc R15: 0000000000000100
[    1.159583] RIP: 0033:0x7f6d6b85aaf0
[    1.159583] RSP: 002b:00007ffff793bd68 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[    1.159583] RAX: ffffffff819d2000 RBX: 00007f6d6b85aaf0 RCX: 0000000000000010
[    1.159583] RDX: 0000000000010046 RSI: ffffffffff0841d8 RDI: 0000000000000000
[    1.159583] RBP: 0000000000000374 R08: ffffffffffffffff R09: 0000000000000000
[    1.159583] R10: 0000558e0feca600 R11: 0000000000001000 R12: 00007f6d6bb24278
[    1.159583] R13: 0000000000000023 R14: 0000558e0feca600 R15: 0000000000000246
[    1.159583]  </SYSENTER>
[    1.159583] Kernel Offset: disabled
[    1.159583] ---[ end Kernel panic - not syncing: Machine halted.

 
-- 
Josh

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-26  4:48             ` Josh Poimboeuf
@ 2017-11-26  9:27               ` Thomas Gleixner
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2017-11-26  9:27 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds

On Sat, 25 Nov 2017, Josh Poimboeuf wrote:

> It looks a *lot* better with mine and your patches applied.  It probably
> would have helped Ingo and Thomas figure the problem out a lot sooner:

> [    1.159583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.159583] CR2: ffffffffff083fb8 CR3: 0000000136f78002 CR4: 00000000001606e0
> [    1.159583] Call Trace:
> [    1.159583]  <SYSENTER>
> [    1.159583]  __do_page_fault+0x4b0/0x4b0
> [    1.159583]  page_fault+0x2c/0x60
> [    1.159583] RIP: 0010:do_page_fault+0x0/0x100
> [    1.159583] RSP: 0000:ffffffffff084120 EFLAGS: 00010012
> [    1.159583] RAX: 00000000819d0a87 RBX: 0000000000000001 RCX: ffffffff819d0a87
> [    1.159583] RDX: 0000000000001000 RSI: 0000000000000010 RDI: ffffffffff084128
> [    1.159583] RBP: 0000000000000d68 R08: 00007f6d6bb24278 R09: 0000000000000023
> [    1.159583] R10: 0000558e0feca600 R11: 0000000000000246 R12: 00007f6d6bb203c0
> [    1.159583] R13: 00007f6d6bb1f880 R14: 00007ffff793bebc R15: 0000000000000100
> [    1.159583]  ? native_iret+0x7/0x7
> [    1.159583]  page_fault+0x2c/0x60
> [    1.159583] RIP: 0010:apic_timer_interrupt+0x0/0xb0

Yes. That would have pointed immediately to the right place. It'd been
obvious that apic_timer_interrupt is not mapped.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-26  2:40       ` Josh Poimboeuf
  2017-11-26  4:25         ` Andy Lutomirski
@ 2017-11-27  9:38         ` Ingo Molnar
  2017-11-27  9:49           ` Ingo Molnar
  2017-11-27 12:45           ` Josh Poimboeuf
  1 sibling, 2 replies; 13+ messages in thread
From: Ingo Molnar @ 2017-11-27  9:38 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, X86 ML, Borislav Petkov,
	linux-kernel, Brian Gerst, Dave Hansen, Linus Torvalds


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Sat, Nov 25, 2017 at 04:16:23PM -0800, Andy Lutomirski wrote:
> > Can you send me whatever config and exact commit hash generated this?
> > I can try to figure out why it failed.
> 
> Sorry, I've been traveling.  I just got some time to take a look at
> this.  I think there are at least two unwinder issues here:
> 
> - It doesn't deal gracefully with the case where the stack overflows and
>   the stack pointer itself isn't on a valid stack but the
>   to-be-dereferenced data *is*.
> 
> - The oops dump code doesn't know how to print partial pt_regs, for the
>   case where if we get an interrupt/exception in *early* entry code
>   before the full pt_regs have been saved.
> 
> (Andy, I'm not quite sure about your patch, and whether it's still
> needed after these patches.  I'll need to look at it later when I have
> more time.)
> 
> I attempted to fix both of the issues with the below patch.  Thomas or
> Ingo, can you test to see if this gets rid of the question marks?
> 
> I can split it up into proper patches next week.  I'm assuming this
> isn't holding up the KAISER merge?

It's not holding up the Kaiser merge, but good debuggability of weird crashes is a 
really good thing, so I constructed a changelog and picked up this patch as a 
single commit, and added your Signed-off-by, if that's OK with you.

Will only push it out if it passes testing.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-27  9:38         ` Ingo Molnar
@ 2017-11-27  9:49           ` Ingo Molnar
  2017-11-27 12:45           ` Josh Poimboeuf
  1 sibling, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2017-11-27  9:49 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, X86 ML, Borislav Petkov,
	linux-kernel, Brian Gerst, Dave Hansen, Linus Torvalds


* Ingo Molnar <mingo@kernel.org> wrote:

> Will only push it out if it passes testing.

So far it required the small fix below for 32-bit.

Thanks,

	Ingo

---
 arch/x86/kernel/dumpstack.c  | 7 +++++++
 arch/x86/kernel/process_64.c | 7 -------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index f7f3b5b3bc05..f214f7a6ccad 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -68,6 +68,13 @@ static void printk_stack_address(unsigned long address, int reliable,
 	printk("%s %s%pB\n", log_lvl, reliable ? "" : "? ", (void *)address);
 }
 
+void show_iret_regs(struct pt_regs *regs)
+{
+	printk(KERN_DEFAULT "RIP: %04x:%pS\n", (int)regs->cs, (void *)regs->ip);
+	printk(KERN_DEFAULT "RSP: %04x:%016lx EFLAGS: %08lx", (int)regs->ss,
+		regs->sp, regs->flags);
+}
+
 static void show_regs_safe(struct stack_info *info, struct pt_regs *regs)
 {
 	if (on_stack(info, regs, sizeof(*regs)))
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 2e0e2979e34d..631e229ab428 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -61,13 +61,6 @@
 
 __visible DEFINE_PER_CPU_USER_MAPPED(unsigned long, rsp_scratch);
 
-void show_iret_regs(struct pt_regs *regs)
-{
-	printk(KERN_DEFAULT "RIP: %04lx:%pS\n", regs->cs, (void *)regs->ip);
-	printk(KERN_DEFAULT "RSP: %04lx:%016lx EFLAGS: %08lx", regs->ss,
-		regs->sp, regs->flags);
-}
-
 /* Prints also some state that isn't saved in the pt_regs */
 void __show_regs(struct pt_regs *regs, int all)
 {

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-27  9:38         ` Ingo Molnar
  2017-11-27  9:49           ` Ingo Molnar
@ 2017-11-27 12:45           ` Josh Poimboeuf
  2017-11-27 13:13             ` Ingo Molnar
  1 sibling, 1 reply; 13+ messages in thread
From: Josh Poimboeuf @ 2017-11-27 12:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Thomas Gleixner, X86 ML, Borislav Petkov,
	linux-kernel, Brian Gerst, Dave Hansen, Linus Torvalds

On Mon, Nov 27, 2017 at 10:38:42AM +0100, Ingo Molnar wrote:
> 
> * Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> > On Sat, Nov 25, 2017 at 04:16:23PM -0800, Andy Lutomirski wrote:
> > > Can you send me whatever config and exact commit hash generated this?
> > > I can try to figure out why it failed.
> > 
> > Sorry, I've been traveling.  I just got some time to take a look at
> > this.  I think there are at least two unwinder issues here:
> > 
> > - It doesn't deal gracefully with the case where the stack overflows and
> >   the stack pointer itself isn't on a valid stack but the
> >   to-be-dereferenced data *is*.
> > 
> > - The oops dump code doesn't know how to print partial pt_regs, for the
> >   case where if we get an interrupt/exception in *early* entry code
> >   before the full pt_regs have been saved.
> > 
> > (Andy, I'm not quite sure about your patch, and whether it's still
> > needed after these patches.  I'll need to look at it later when I have
> > more time.)
> > 
> > I attempted to fix both of the issues with the below patch.  Thomas or
> > Ingo, can you test to see if this gets rid of the question marks?
> > 
> > I can split it up into proper patches next week.  I'm assuming this
> > isn't holding up the KAISER merge?
> 
> It's not holding up the Kaiser merge, but good debuggability of weird crashes is a 
> really good thing, so I constructed a changelog and picked up this patch as a 
> single commit, and added your Signed-off-by, if that's OK with you.
> 
> Will only push it out if it passes testing.

The commit log looks good, though there's a CONFIG_FRAME_POINTER build
failure.  Can you fold in this fix?

diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 7dd0d2a0d142..77835bc021c7 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -102,7 +102,7 @@ __save_stack_trace_reliable(struct stack_trace *trace,
 	for (unwind_start(&state, task, NULL, NULL); !unwind_done(&state);
 	     unwind_next_frame(&state)) {
 
-		regs = unwind_get_entry_regs(&state, NULL);
+		regs = unwind_get_entry_regs(&state);
 		if (regs) {
 			/*
 			 * Kernel mode registers on the stack indicate an

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86/orc: Don't bail on stack overflow
  2017-11-27 12:45           ` Josh Poimboeuf
@ 2017-11-27 13:13             ` Ingo Molnar
  0 siblings, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2017-11-27 13:13 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, X86 ML, Borislav Petkov,
	linux-kernel, Brian Gerst, Dave Hansen, Linus Torvalds


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Mon, Nov 27, 2017 at 10:38:42AM +0100, Ingo Molnar wrote:
> > 
> > * Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > > On Sat, Nov 25, 2017 at 04:16:23PM -0800, Andy Lutomirski wrote:
> > > > Can you send me whatever config and exact commit hash generated this?
> > > > I can try to figure out why it failed.
> > > 
> > > Sorry, I've been traveling.  I just got some time to take a look at
> > > this.  I think there are at least two unwinder issues here:
> > > 
> > > - It doesn't deal gracefully with the case where the stack overflows and
> > >   the stack pointer itself isn't on a valid stack but the
> > >   to-be-dereferenced data *is*.
> > > 
> > > - The oops dump code doesn't know how to print partial pt_regs, for the
> > >   case where if we get an interrupt/exception in *early* entry code
> > >   before the full pt_regs have been saved.
> > > 
> > > (Andy, I'm not quite sure about your patch, and whether it's still
> > > needed after these patches.  I'll need to look at it later when I have
> > > more time.)
> > > 
> > > I attempted to fix both of the issues with the below patch.  Thomas or
> > > Ingo, can you test to see if this gets rid of the question marks?
> > > 
> > > I can split it up into proper patches next week.  I'm assuming this
> > > isn't holding up the KAISER merge?
> > 
> > It's not holding up the Kaiser merge, but good debuggability of weird crashes is a 
> > really good thing, so I constructed a changelog and picked up this patch as a 
> > single commit, and added your Signed-off-by, if that's OK with you.
> > 
> > Will only push it out if it passes testing.
> 
> The commit log looks good, though there's a CONFIG_FRAME_POINTER build
> failure.  Can you fold in this fix?
> 
> diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
> index 7dd0d2a0d142..77835bc021c7 100644
> --- a/arch/x86/kernel/stacktrace.c
> +++ b/arch/x86/kernel/stacktrace.c
> @@ -102,7 +102,7 @@ __save_stack_trace_reliable(struct stack_trace *trace,
>  	for (unwind_start(&state, task, NULL, NULL); !unwind_done(&state);
>  	     unwind_next_frame(&state)) {
>  
> -		regs = unwind_get_entry_regs(&state, NULL);
> +		regs = unwind_get_entry_regs(&state);
>  		if (regs) {
>  			/*
>  			 * Kernel mode registers on the stack indicate an

Done, thanks!

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-11-27 13:13 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-25 17:28 [PATCH] x86/orc: Don't bail on stack overflow Andy Lutomirski
2017-11-25 18:26 ` Andy Lutomirski
2017-11-25 23:13   ` Thomas Gleixner
2017-11-26  0:16     ` Andy Lutomirski
2017-11-26  2:40       ` Josh Poimboeuf
2017-11-26  4:25         ` Andy Lutomirski
2017-11-26  4:41           ` Josh Poimboeuf
2017-11-26  4:48             ` Josh Poimboeuf
2017-11-26  9:27               ` Thomas Gleixner
2017-11-27  9:38         ` Ingo Molnar
2017-11-27  9:49           ` Ingo Molnar
2017-11-27 12:45           ` Josh Poimboeuf
2017-11-27 13:13             ` Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.