* [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
@ 2016-07-21 21:21 Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 01/19] x86/dumpstack: remove show_trace() Josh Poimboeuf
                   ` (19 more replies)
  0 siblings, 20 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

The x86 stack dump code is a bit of a mess.  dump_trace() uses
callbacks, and each user of it seems to have slightly different
requirements, so there are several slightly different callbacks floating
around.

Also there are some upcoming features which will require more changes to
the stack dump code: reliable stack detection for live patching,
hardened user copy, and the DWARF unwinder.  Each of those features
would at least need more callbacks and/or callback interfaces, resulting
in a much bigger mess than what we have today.

Before doing all that, we should try to clean things up and replace
dump_trace() with something cleaner and more flexible.

The new unwinder is a simple state machine which was heavily inspired by
a suggestion from Andy Lutomirski:

  https://lkml.kernel.org/r/CALCETrUbNTqaM2LRyXGRx=kVLRPeY5A3Pc6k4TtQxF320rUT=w@mail.gmail.com

It's also similar to the libunwind API:

  http://www.nongnu.org/libunwind/man/libunwind(3).html

Some of its advantages:

- simplicity: no more callback sprawl and less code duplication.

- flexibility: allows the caller to stop and inspect the stack state at
  each step in the unwinding process.

- modularity: the unwinder code, console stack dump code, and stack
  metadata analysis code are all better separated so that changing one
  of them shouldn't have much of an impact on any of the others.
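
To make the "state machine" idea concrete, here is a minimal userspace
sketch of a libunwind-style iterator.  It is only an illustration: the
toy_* names, the frame layout, and the toy_save_trace() helper are
assumptions made for this example and are not the interface added by
this series.  The caller owns the state and drives each step, so it can
stop and look at the state between frames instead of registering
callbacks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical frame record: saved frame pointer, then return address. */
struct toy_frame {
	struct toy_frame *next_frame;
	unsigned long return_address;
};

/* Caller-owned unwind state; it can be inspected between steps. */
struct toy_unwind_state {
	const struct toy_frame *frame;
};

static void toy_unwind_start(struct toy_unwind_state *state,
			     const struct toy_frame *first)
{
	assert(state != NULL);
	state->frame = first;
}

static int toy_unwind_done(const struct toy_unwind_state *state)
{
	return state->frame == NULL;
}

static unsigned long
toy_unwind_get_return_address(const struct toy_unwind_state *state)
{
	return state->frame ? state->frame->return_address : 0;
}

static void toy_unwind_next_frame(struct toy_unwind_state *state)
{
	if (state->frame)
		state->frame = state->frame->next_frame;
}

/* Walk a chain of frames, storing up to max return addresses in out[]. */
static size_t toy_save_trace(const struct toy_frame *first,
			     unsigned long *out, size_t max)
{
	struct toy_unwind_state state;
	size_t n = 0;

	for (toy_unwind_start(&state, first);
	     !toy_unwind_done(&state) && n < max;
	     toy_unwind_next_frame(&state))
		out[n++] = toy_unwind_get_return_address(&state);

	return n;
}
```

A different consumer (say, a console dumper) would reuse the same loop
and only change the loop body, which is the kind of separation the
"modularity" point above is after.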


Josh Poimboeuf (19):
  x86/dumpstack: remove show_trace()
  x86/dumpstack: add get_stack_pointer() and get_frame_pointer()
  x86/dumpstack: remove unnecessary stack pointer arguments
  x86/dumpstack: make printk_stack_address() more generally useful
  x86/dumpstack: fix function graph tracing stack dump reliability
    issues
  x86/dumpstack: remove extra brackets around "EOE"
  x86/dumpstack: add IRQ_USABLE_STACK_SIZE define
  x86/dumpstack: don't disable preemption in show_stack_log_lvl() and
    dump_trace()
  x86/dumpstack: simplify in_exception_stack()
  x86/dumpstack: add get_stack_info() interface
  x86/dumptrace: add new unwind interface and implementations
  perf/x86: convert perf_callchain_kernel() to the new unwinder
  x86/stacktrace: convert save_stack_trace_*() to the new unwinder
  oprofile/x86: convert x86_backtrace() to the new unwinder
  x86/dumpstack: convert show_trace_log_lvl() to the new unwinder
  x86/dumpstack: remove dump_trace()
  x86/entry/dumpstack: encode pt_regs pointer in frame pointer
  x86/dumpstack: print stack identifier on its own line
  x86/dumpstack: print any pt_regs found on the stack

 arch/x86/entry/calling.h             |  21 +++
 arch/x86/entry/entry_64.S            |   7 +-
 arch/x86/events/core.c               |  32 +---
 arch/x86/include/asm/kdebug.h        |   2 -
 arch/x86/include/asm/page_64_types.h |  19 ++-
 arch/x86/include/asm/stacktrace.h    | 127 +++++++-------
 arch/x86/include/asm/unwind.h        |  91 ++++++++++
 arch/x86/kernel/Makefile             |   6 +
 arch/x86/kernel/cpu/common.c         |   2 +-
 arch/x86/kernel/dumpstack.c          | 269 +++++++++++++++---------------
 arch/x86/kernel/dumpstack_32.c       | 120 +++++++-------
 arch/x86/kernel/dumpstack_64.c       | 310 ++++++++++-------------------------
 arch/x86/kernel/setup_percpu.c       |   2 +-
 arch/x86/kernel/stacktrace.c         |  74 ++++-----
 arch/x86/kernel/unwind_frame.c       | 133 +++++++++++++++
 arch/x86/kernel/unwind_guess.c       |  40 +++++
 arch/x86/oprofile/backtrace.c        |  44 +++--
 17 files changed, 713 insertions(+), 586 deletions(-)
 create mode 100644 arch/x86/include/asm/unwind.h
 create mode 100644 arch/x86/kernel/unwind_frame.c
 create mode 100644 arch/x86/kernel/unwind_guess.c

-- 
2.7.4


* [PATCH 01/19] x86/dumpstack: remove show_trace()
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:49   ` Andy Lutomirski
  2016-07-21 21:21 ` [PATCH 02/19] x86/dumpstack: add get_stack_pointer() and get_frame_pointer() Josh Poimboeuf
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

There is a bewildering array of options for dumping the stack.
Simplify things a little by removing show_trace(), which is unused.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/include/asm/kdebug.h | 2 --
 arch/x86/kernel/dumpstack.c   | 6 ------
 2 files changed, 8 deletions(-)

diff --git a/arch/x86/include/asm/kdebug.h b/arch/x86/include/asm/kdebug.h
index 1ef9d58..d318811 100644
--- a/arch/x86/include/asm/kdebug.h
+++ b/arch/x86/include/asm/kdebug.h
@@ -24,8 +24,6 @@ enum die_val {
 extern void printk_address(unsigned long address);
 extern void die(const char *, struct pt_regs *,long);
 extern int __must_check __die(const char *, struct pt_regs *, long);
-extern void show_trace(struct task_struct *t, struct pt_regs *regs,
-		       unsigned long *sp, unsigned long bp);
 extern void show_stack_regs(struct pt_regs *regs);
 extern void __show_regs(struct pt_regs *regs, int all);
 extern unsigned long oops_begin(void);
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 92e8f0a..5f49c04 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -182,12 +182,6 @@ show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	dump_trace(task, regs, stack, bp, &print_trace_ops, log_lvl);
 }
 
-void show_trace(struct task_struct *task, struct pt_regs *regs,
-		unsigned long *stack, unsigned long bp)
-{
-	show_trace_log_lvl(task, regs, stack, bp, "");
-}
-
 void show_stack(struct task_struct *task, unsigned long *sp)
 {
 	unsigned long bp = 0;
-- 
2.7.4


* [PATCH 02/19] x86/dumpstack: add get_stack_pointer() and get_frame_pointer()
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 01/19] x86/dumpstack: remove show_trace() Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:53   ` Andy Lutomirski
  2016-07-21 21:21 ` [PATCH 03/19] x86/dumpstack: remove unnecessary stack pointer arguments Josh Poimboeuf
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

The various functions involved in dumping the stack all do similar
things with regard to getting the stack pointer and the frame pointer
based on the regs and task arguments.  Create helper functions to
do that instead.
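
As a rough userspace model of the selection order such a helper
follows: explicit regs win, then the live stack of the running task,
then the saved context of a sleeping task.  The toy_* types, the "cur"
parameter standing in for current, and the "live_sp" parameter standing
in for the caller's own stack position are all assumptions for
illustration, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for pt_regs and task_struct. */
struct toy_regs { unsigned long sp; };
struct toy_task { unsigned long saved_sp; };

/*
 * Pick a starting stack pointer:
 *   1. regs, if the caller supplied them;
 *   2. the live stack position, if task is NULL or the running task;
 *   3. otherwise the value saved when the task was switched out.
 */
static unsigned long toy_get_stack_pointer(const struct toy_task *task,
					   const struct toy_regs *regs,
					   const struct toy_task *cur,
					   unsigned long live_sp)
{
	assert(cur != NULL);

	if (regs)
		return regs->sp;

	if (!task || task == cur)
		return live_sp;

	return task->saved_sp;
}
```

With one function encoding this order, each caller can reduce its
open-coded NULL checks to a single fallback expression.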

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/include/asm/stacktrace.h | 39 ++++++++++++++++++++++-----------------
 arch/x86/kernel/dumpstack.c       |  5 ++---
 arch/x86/kernel/dumpstack_32.c    | 25 ++++---------------------
 arch/x86/kernel/dumpstack_64.c    | 30 ++++--------------------------
 4 files changed, 32 insertions(+), 67 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 0944218..6f65995 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -49,37 +49,42 @@ void dump_trace(struct task_struct *tsk, struct pt_regs *regs,
 
 #ifdef CONFIG_X86_32
 #define STACKSLOTS_PER_LINE 8
-#define get_bp(bp) asm("movl %%ebp, %0" : "=r" (bp) :)
 #else
 #define STACKSLOTS_PER_LINE 4
-#define get_bp(bp) asm("movq %%rbp, %0" : "=r" (bp) :)
 #endif
 
 #ifdef CONFIG_FRAME_POINTER
-static inline unsigned long
-stack_frame(struct task_struct *task, struct pt_regs *regs)
+static inline unsigned long *
+get_frame_pointer(struct task_struct *task, struct pt_regs *regs)
 {
-	unsigned long bp;
-
 	if (regs)
-		return regs->bp;
+		return (unsigned long *)regs->bp;
 
-	if (task == current) {
-		/* Grab bp right from our regs */
-		get_bp(bp);
-		return bp;
-	}
+	if (!task || task == current)
+		return __builtin_frame_address(0);
 
 	/* bp is the last reg pushed by switch_to */
-	return *(unsigned long *)task->thread.sp;
+	return (unsigned long *)*(unsigned long *)task->thread.sp;
 }
 #else
-static inline unsigned long
-stack_frame(struct task_struct *task, struct pt_regs *regs)
+static inline unsigned long *
+get_frame_pointer(struct task_struct *task, struct pt_regs *regs)
 {
 	return 0;
 }
-#endif
+#endif /* CONFIG_FRAME_POINTER */
+
+static inline unsigned long *
+get_stack_pointer(struct task_struct *task, struct pt_regs *regs)
+{
+	if (regs)
+		return (unsigned long *)kernel_stack_pointer(regs);
+
+	if (!task || task == current)
+		return __builtin_frame_address(0);
+
+	return (unsigned long *)task->thread.sp;
+}
 
 extern void
 show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
@@ -106,7 +111,7 @@ static inline unsigned long caller_frame_pointer(void)
 {
 	struct stack_frame *frame;
 
-	get_bp(frame);
+	frame = __builtin_frame_address(0);
 
 #ifdef CONFIG_FRAME_POINTER
 	frame = frame->next_frame;
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 5f49c04..145f18b 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -185,15 +185,14 @@ show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 void show_stack(struct task_struct *task, unsigned long *sp)
 {
 	unsigned long bp = 0;
-	unsigned long stack;
 
 	/*
 	 * Stack frames below this one aren't interesting.  Don't show them
 	 * if we're printing for %current.
 	 */
 	if (!sp && (!task || task == current)) {
-		sp = &stack;
-		bp = stack_frame(current, NULL);
+		sp = get_stack_pointer(current, NULL);
+		bp = (unsigned long)get_frame_pointer(current, NULL);
 	}
 
 	show_stack_log_lvl(task, NULL, sp, bp, "");
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index 0967571..358fe1c 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -46,19 +46,9 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	int graph = 0;
 	u32 *prev_esp;
 
-	if (!task)
-		task = current;
-
-	if (!stack) {
-		unsigned long dummy;
-
-		stack = &dummy;
-		if (task != current)
-			stack = (unsigned long *)task->thread.sp;
-	}
-
-	if (!bp)
-		bp = stack_frame(task, regs);
+	task = task ? : current;
+	stack = stack ? : get_stack_pointer(task, regs);
+	bp = bp ? : (unsigned long)get_frame_pointer(task, regs);
 
 	for (;;) {
 		void *end_stack;
@@ -95,14 +85,7 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	unsigned long *stack;
 	int i;
 
-	if (sp == NULL) {
-		if (regs)
-			sp = (unsigned long *)regs->sp;
-		else if (task)
-			sp = (unsigned long *)task->thread.sp;
-		else
-			sp = (unsigned long *)&sp;
-	}
+	sp = sp ? : get_stack_pointer(task, regs);
 
 	stack = sp;
 	for (i = 0; i < kstack_depth_to_print; i++) {
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 9ee4520..bc08e8b 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -154,25 +154,14 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 {
 	const unsigned cpu = get_cpu();
 	unsigned long *irq_stack = (unsigned long *)per_cpu(irq_stack_ptr, cpu);
-	unsigned long dummy;
 	unsigned used = 0;
 	int graph = 0;
 	int done = 0;
 
-	if (!task)
-		task = current;
+	task = task ? : current;
+	stack = stack ? : get_stack_pointer(task, regs);
+	bp = bp ? : (unsigned long)get_frame_pointer(task, regs);
 
-	if (!stack) {
-		if (regs)
-			stack = (unsigned long *)regs->sp;
-		else if (task != current)
-			stack = (unsigned long *)task->thread.sp;
-		else
-			stack = &dummy;
-	}
-
-	if (!bp)
-		bp = stack_frame(task, regs);
 	/*
 	 * Print function call entries in all stacks, starting at the
 	 * current stack address. If the stacks consist of nested
@@ -259,18 +248,7 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	irq_stack_end	= (unsigned long *)(per_cpu(irq_stack_ptr, cpu));
 	irq_stack	= (unsigned long *)(per_cpu(irq_stack_ptr, cpu) - IRQ_STACK_SIZE);
 
-	/*
-	 * Debugging aid: "show_stack(NULL, NULL);" prints the
-	 * back trace for this cpu:
-	 */
-	if (sp == NULL) {
-		if (regs)
-			sp = (unsigned long *)regs->sp;
-		else if (task)
-			sp = (unsigned long *)task->thread.sp;
-		else
-			sp = (unsigned long *)&sp;
-	}
+	sp = sp ? : get_stack_pointer(task, regs);
 
 	stack = sp;
 	for (i = 0; i < kstack_depth_to_print; i++) {
-- 
2.7.4


* [PATCH 03/19] x86/dumpstack: remove unnecessary stack pointer arguments
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 01/19] x86/dumpstack: remove show_trace() Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 02/19] x86/dumpstack: add get_stack_pointer() and get_frame_pointer() Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:56   ` Andy Lutomirski
  2016-07-21 21:21 ` [PATCH 04/19] x86/dumpstack: make printk_stack_address() more generally useful Josh Poimboeuf
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

When calling show_stack_log_lvl() or dump_trace() with a regs argument,
providing a stack pointer or frame pointer is redundant.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/dumpstack.c    | 2 +-
 arch/x86/kernel/dumpstack_32.c | 2 +-
 arch/x86/kernel/dumpstack_64.c | 5 +----
 arch/x86/oprofile/backtrace.c  | 4 +---
 4 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 145f18b..75d21ac 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -200,7 +200,7 @@ void show_stack(struct task_struct *task, unsigned long *sp)
 
 void show_stack_regs(struct pt_regs *regs)
 {
-	show_stack_log_lvl(current, regs, (unsigned long *)regs->sp, regs->bp, "");
+	show_stack_log_lvl(current, regs, NULL, 0, "");
 }
 
 static arch_spinlock_t die_lock = __ARCH_SPIN_LOCK_UNLOCKED;
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index 358fe1c..c533b8b 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -122,7 +122,7 @@ void show_regs(struct pt_regs *regs)
 		u8 *ip;
 
 		pr_emerg("Stack:\n");
-		show_stack_log_lvl(NULL, regs, &regs->sp, 0, KERN_EMERG);
+		show_stack_log_lvl(NULL, regs, NULL, 0, KERN_EMERG);
 
 		pr_emerg("Code:");
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index bc08e8b..360f2e8 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -286,9 +286,7 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 void show_regs(struct pt_regs *regs)
 {
 	int i;
-	unsigned long sp;
 
-	sp = regs->sp;
 	show_regs_print_info(KERN_DEFAULT);
 	__show_regs(regs, 1);
 
@@ -303,8 +301,7 @@ void show_regs(struct pt_regs *regs)
 		u8 *ip;
 
 		printk(KERN_DEFAULT "Stack:\n");
-		show_stack_log_lvl(NULL, regs, (unsigned long *)sp,
-				   0, KERN_DEFAULT);
+		show_stack_log_lvl(NULL, regs, NULL, 0, KERN_DEFAULT);
 
 		printk(KERN_DEFAULT "Code: ");
 
diff --git a/arch/x86/oprofile/backtrace.c b/arch/x86/oprofile/backtrace.c
index cb31a44..c594768 100644
--- a/arch/x86/oprofile/backtrace.c
+++ b/arch/x86/oprofile/backtrace.c
@@ -113,10 +113,8 @@ x86_backtrace(struct pt_regs * const regs, unsigned int depth)
 	struct stack_frame *head = (struct stack_frame *)frame_pointer(regs);
 
 	if (!user_mode(regs)) {
-		unsigned long stack = kernel_stack_pointer(regs);
 		if (depth)
-			dump_trace(NULL, regs, (unsigned long *)stack, 0,
-				   &backtrace_ops, &depth);
+			dump_trace(NULL, regs, NULL, 0, &backtrace_ops, &depth);
 		return;
 	}
 
-- 
2.7.4


* [PATCH 04/19] x86/dumpstack: make printk_stack_address() more generally useful
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (2 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 03/19] x86/dumpstack: remove unnecessary stack pointer arguments Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues Josh Poimboeuf
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

Change printk_stack_address() to be useful when called by an unwinder
outside the context of dump_trace().

Specifically:

- printk_stack_address()'s 'data' argument is always used as the log
  level string.  Make that explicit.

- Call touch_nmi_watchdog().
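
For readers unfamiliar with the output format, the line layout can be
approximated in userspace as follows.  This is a sketch under
assumptions: toy_format_stack_address() is a made-up name, the kernel's
%p and %pB specifiers are replaced with a fixed-width hex address and a
caller-supplied symbol string, and the watchdog call is not modeled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one stack trace line: log level prefix, bracketed address,
 * then the symbol, preceded by "? " when the entry is unreliable.
 * Returns the snprintf() result.
 */
static int toy_format_stack_address(char *buf, size_t len,
				    const char *log_lvl,
				    unsigned long address,
				    const char *symbol, int reliable)
{
	assert(buf != NULL && log_lvl != NULL && symbol != NULL);

	return snprintf(buf, len, "%s [<%016lx>] %s%s\n", log_lvl, address,
			reliable ? "" : "? ", symbol);
}
```

Taking the log level as a typed string rather than an opaque pointer
lets any unwinder call the function directly, without going through
the callback's data argument.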

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/dumpstack.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 75d21ac..692eecae 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -26,10 +26,11 @@ int kstack_depth_to_print = 3 * STACKSLOTS_PER_LINE;
 static int die_counter;
 
 static void printk_stack_address(unsigned long address, int reliable,
-		void *data)
+				 char *log_lvl)
 {
+	touch_nmi_watchdog();
 	printk("%s [<%p>] %s%pB\n",
-		(char *)data, (void *)address, reliable ? "" : "? ",
+		log_lvl, (void *)address, reliable ? "" : "? ",
 		(void *)address);
 }
 
@@ -163,7 +164,6 @@ static int print_trace_stack(void *data, char *name)
  */
 static int print_trace_address(void *data, unsigned long addr, int reliable)
 {
-	touch_nmi_watchdog();
 	printk_stack_address(addr, reliable, data);
 	return 0;
 }
-- 
2.7.4


* [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (3 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 04/19] x86/dumpstack: make printk_stack_address() more generally useful Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-29 22:55   ` Steven Rostedt
  2016-07-21 21:21 ` [PATCH 06/19] x86/dumpstack: remove extra brackets around "EOE" Josh Poimboeuf
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

When function graph tracing is enabled for a function, its return
address on the stack is replaced with the address of an ftrace handler
(return_to_handler).  When dumping the stack of a task with graph
tracing enabled, there are some subtle bugs:

- The fake return_to_handler() address can be reported as reliable.
  Instead, because it's not the real caller, it should be considered
  unreliable.

- In print_context_stack(), the real caller's return address is always
  reported as reliable, even if the return_to_handler() address wasn't
  referred to by a frame pointer.

In addition to fixing these bugs, convert print_ftrace_graph_addr() to a
more generic function which can be used outside of dump_trace()
callbacks.
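
The lookup can be modeled in userspace roughly like this.  The sketch
mirrors the logic of the function added below, but the toy_* names, the
TOY_RETURN_TO_HANDLER sentinel value, and the flattened ret_stack array
(plain addresses instead of per-entry structures) are simplifications
assumed for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the address of the ftrace return trampoline. */
#define TOY_RETURN_TO_HANDLER 0xdeadUL

struct toy_task {
	const unsigned long *ret_stack;	/* real return addresses, oldest first */
	int curr_ret_stack;		/* index of the newest entry */
};

/*
 * Translate a return address found on the stack.  Anything other than
 * the trampoline is returned unchanged.  Each trampoline hit consumes
 * one entry, newest first, with *idx counting the hits so far.
 */
static unsigned long toy_graph_ret_addr(const struct toy_task *task,
					int *idx, unsigned long addr)
{
	int task_idx;

	assert(task != NULL && idx != NULL);

	if (addr != TOY_RETURN_TO_HANDLER)
		return addr;

	task_idx = task->curr_ret_stack;

	/* No saved entries left: hand back the address as found. */
	if (!task->ret_stack || task_idx < *idx)
		return addr;

	task_idx -= *idx;
	(*idx)++;

	return task->ret_stack[task_idx];
}
```

Because the function returns the translated address instead of
printing it, the caller decides how to report it, including whether to
mark the trampoline address itself as unreliable.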

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/include/asm/stacktrace.h | 13 ++++++++++
 arch/x86/kernel/dumpstack.c       | 50 +++++++++++++++++----------------------
 2 files changed, 35 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 6f65995..5d3d258 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -14,6 +14,19 @@ extern int kstack_depth_to_print;
 struct thread_info;
 struct stacktrace_ops;
 
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+
+unsigned long
+ftrace_graph_ret_addr(struct task_struct *task, int *idx, unsigned long addr);
+
+#else
+static inline unsigned long
+ftrace_graph_ret_addr(struct task_struct *task, int *idx, unsigned long addr)
+{
+	return addr;
+}
+#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
+
 typedef unsigned long (*walk_stack_t)(struct task_struct *task,
 				      unsigned long *stack,
 				      unsigned long bp,
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 692eecae..0a8694b 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -40,36 +40,25 @@ void printk_address(unsigned long address)
 }
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-static void
-print_ftrace_graph_addr(unsigned long addr, void *data,
-			const struct stacktrace_ops *ops,
-			struct task_struct *task, int *graph)
+unsigned long
+ftrace_graph_ret_addr(struct task_struct *task, int *idx, unsigned long addr)
 {
-	unsigned long ret_addr;
-	int index;
+	int task_idx;
 
 	if (addr != (unsigned long)return_to_handler)
-		return;
+		return addr;
 
-	index = task->curr_ret_stack;
+	task_idx = task->curr_ret_stack;
 
-	if (!task->ret_stack || index < *graph)
-		return;
+	if (!task->ret_stack || task_idx < *idx)
+		return addr;
 
-	index -= *graph;
-	ret_addr = task->ret_stack[index].ret;
+	task_idx -= *idx;
+	(*idx)++;
 
-	ops->address(data, ret_addr, 1);
-
-	(*graph)++;
+	return task->ret_stack[task_idx].ret;
 }
-#else
-static inline void
-print_ftrace_graph_addr(unsigned long addr, void *data,
-			const struct stacktrace_ops *ops,
-			struct task_struct *task, int *graph)
-{ }
-#endif
+#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
 
 /*
  * x86-64 can have up to three kernel stacks:
@@ -108,18 +97,23 @@ print_context_stack(struct task_struct *task,
 		stack = (unsigned long *)task_stack_page(task);
 
 	while (valid_stack_ptr(task, stack, sizeof(*stack), end)) {
-		unsigned long addr;
+		unsigned long addr = *stack;
 
 		addr = *stack;
 		if (__kernel_text_address(addr)) {
+			int reliable = 0;
+			unsigned long real_addr;
+
 			if ((unsigned long) stack == bp + sizeof(long)) {
-				ops->address(data, addr, 1);
+				reliable = 1;
 				frame = frame->next_frame;
 				bp = (unsigned long) frame;
-			} else {
-				ops->address(data, addr, 0);
 			}
-			print_ftrace_graph_addr(addr, data, ops, task, graph);
+
+			real_addr = ftrace_graph_ret_addr(task, graph, addr);
+			if (addr != real_addr)
+				ops->address(data, addr, 0);
+			ops->address(data, real_addr, reliable);
 		}
 		stack++;
 	}
@@ -142,11 +136,11 @@ print_context_stack_bp(struct task_struct *task,
 		if (!__kernel_text_address(addr))
 			break;
 
+		addr = ftrace_graph_ret_addr(task, graph, addr);
 		if (ops->address(data, addr, 1))
 			break;
 		frame = frame->next_frame;
 		ret_addr = &frame->return_address;
-		print_ftrace_graph_addr(addr, data, ops, task, graph);
 	}
 
 	return (unsigned long)frame;
-- 
2.7.4


* [PATCH 06/19] x86/dumpstack: remove extra brackets around "EOE"
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (4 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 07/19] x86/dumpstack: add IRQ_USABLE_STACK_SIZE define Josh Poimboeuf
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

When starting the dump of an exception stack, it shows "<<EOE>>" instead
of "<EOE>".  print_trace_stack() already adds brackets, so there is no
need to add them again.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/dumpstack_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 360f2e8..55cc88f 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -191,7 +191,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 
 			bp = ops->walk_stack(task, stack, bp, ops,
 					     data, stack_end, &graph);
-			ops->stack(data, "<EOE>");
+			ops->stack(data, "EOE");
 			/*
 			 * We link to the next stack via the
 			 * second-to-last pointer (index -2 to end) in the
-- 
2.7.4


* [PATCH 07/19] x86/dumpstack: add IRQ_USABLE_STACK_SIZE define
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (5 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 06/19] x86/dumpstack: remove extra brackets around "EOE" Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 22:01   ` Andy Lutomirski
  2016-07-21 21:21 ` [PATCH 08/19] x86/dumpstack: don't disable preemption in show_stack_log_lvl() and dump_trace() Josh Poimboeuf
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

For reasons unknown, the x86_64 irq stack starts at an offset 64 bytes
from the end of the page.  At least make that explicit.

FIXME: Can we just remove the 64 byte gap?  If not, at least document
why.
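
The arithmetic behind the new define can be checked with a small
userspace sketch.  It assumes a 4096-byte page and KASAN_STACK_ORDER
of 0; the TOY_* macros and toy_* helpers are illustrative names, not
kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE			4096UL
#define TOY_IRQ_STACK_ORDER		2	/* assumes KASAN_STACK_ORDER == 0 */
#define TOY_IRQ_STACK_SIZE		(TOY_PAGE_SIZE << TOY_IRQ_STACK_ORDER)
#define TOY_IRQ_USABLE_STACK_SIZE	(TOY_IRQ_STACK_SIZE - 64)

/* Top of the usable IRQ stack, given the base of the stack area. */
static uintptr_t toy_irq_stack_ptr(uintptr_t base)
{
	return base + TOY_IRQ_USABLE_STACK_SIZE;
}

/* Lowest address of the IRQ stack, recovered from the top pointer. */
static uintptr_t toy_irq_stack_begin(uintptr_t irq_stack_ptr)
{
	assert(irq_stack_ptr >= TOY_IRQ_USABLE_STACK_SIZE);
	return irq_stack_ptr - TOY_IRQ_USABLE_STACK_SIZE;
}
```

Under these assumptions the usable size is 16320 bytes.  Subtracting
the usable size from the top pointer lands exactly on the base, whereas
subtracting the full IRQ_STACK_SIZE, as the line removed from
show_stack_log_lvl() below did, lands 64 bytes under it.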

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/include/asm/page_64_types.h | 19 +++++++++++--------
 arch/x86/kernel/cpu/common.c         |  2 +-
 arch/x86/kernel/dumpstack_64.c       |  8 +++-----
 arch/x86/kernel/setup_percpu.c       |  2 +-
 4 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 9215e05..6256baf 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -12,17 +12,20 @@
 #endif
 
 #define THREAD_SIZE_ORDER	(2 + KASAN_STACK_ORDER)
-#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
-#define CURRENT_MASK (~(THREAD_SIZE - 1))
+#define THREAD_SIZE		(PAGE_SIZE << THREAD_SIZE_ORDER)
+#define CURRENT_MASK		(~(THREAD_SIZE - 1))
 
-#define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER)
-#define EXCEPTION_STKSZ (PAGE_SIZE << EXCEPTION_STACK_ORDER)
+#define EXCEPTION_STACK_ORDER	(0 + KASAN_STACK_ORDER)
+#define EXCEPTION_STKSZ		(PAGE_SIZE << EXCEPTION_STACK_ORDER)
 
-#define DEBUG_STACK_ORDER (EXCEPTION_STACK_ORDER + 1)
-#define DEBUG_STKSZ (PAGE_SIZE << DEBUG_STACK_ORDER)
+#define DEBUG_STACK_ORDER	(EXCEPTION_STACK_ORDER + 1)
+#define DEBUG_STKSZ		(PAGE_SIZE << DEBUG_STACK_ORDER)
 
-#define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)
-#define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER)
+#define IRQ_STACK_ORDER		(2 + KASAN_STACK_ORDER)
+#define IRQ_STACK_SIZE		(PAGE_SIZE << IRQ_STACK_ORDER)
+
+/* FIXME: why? */
+#define IRQ_USABLE_STACK_SIZE	(IRQ_STACK_SIZE - 64)
 
 #define DOUBLEFAULT_STACK 1
 #define NMI_STACK 2
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 809eda0..8f3f7a4 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1281,7 +1281,7 @@ DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned =
 EXPORT_PER_CPU_SYMBOL(current_task);
 
 DEFINE_PER_CPU(char *, irq_stack_ptr) =
-	init_per_cpu_var(irq_stack_union.irq_stack) + IRQ_STACK_SIZE - 64;
+	init_per_cpu_var(irq_stack_union.irq_stack) + IRQ_USABLE_STACK_SIZE;
 
 DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 55cc88f..6a2d14e 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -103,9 +103,6 @@ in_irq_stack(unsigned long *stack, unsigned long *irq_stack,
 	return (stack >= irq_stack && stack < irq_stack_end);
 }
 
-static const unsigned long irq_stack_size =
-	(IRQ_STACK_SIZE - 64) / sizeof(unsigned long);
-
 enum stack_type {
 	STACK_IS_UNKNOWN,
 	STACK_IS_NORMAL,
@@ -133,7 +130,7 @@ analyze_stack(int cpu, struct task_struct *task, unsigned long *stack,
 		return STACK_IS_NORMAL;
 
 	*stack_end = irq_stack;
-	irq_stack = irq_stack - irq_stack_size;
+	irq_stack -= (IRQ_USABLE_STACK_SIZE / sizeof(long));
 
 	if (in_irq_stack(stack, irq_stack, *stack_end))
 		return STACK_IS_IRQ;
@@ -246,7 +243,8 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	cpu = smp_processor_id();
 
 	irq_stack_end	= (unsigned long *)(per_cpu(irq_stack_ptr, cpu));
-	irq_stack	= (unsigned long *)(per_cpu(irq_stack_ptr, cpu) - IRQ_STACK_SIZE);
+	irq_stack	= (unsigned long *)(per_cpu(irq_stack_ptr, cpu) -
+			  IRQ_USABLE_STACK_SIZE);
 
 	sp = sp ? : get_stack_pointer(task, regs);
 
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index e4fcb87..043454f 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -244,7 +244,7 @@ void __init setup_per_cpu_areas(void)
 #ifdef CONFIG_X86_64
 		per_cpu(irq_stack_ptr, cpu) =
 			per_cpu(irq_stack_union.irq_stack, cpu) +
-			IRQ_STACK_SIZE - 64;
+			IRQ_USABLE_STACK_SIZE;
 #endif
 #ifdef CONFIG_NUMA
 		per_cpu(x86_cpu_to_node_map, cpu) =
-- 
2.7.4


* [PATCH 08/19] x86/dumpstack: don't disable preemption in show_stack_log_lvl() and dump_trace()
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (6 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 07/19] x86/dumpstack: add IRQ_USABLE_STACK_SIZE define Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 09/19] x86/dumpstack: simplify in_exception_stack() Josh Poimboeuf
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

show_stack_log_lvl() and dump_trace() are already preemption safe:

- If they're running in irq or exception context, preemption is already
  disabled, and the percpu irq stack pointers can be trusted.

- If they're running with preemption enabled, they must be running on
  the task stack anyway, so it doesn't matter if they're comparing the
  stack pointer against the percpu irq stack pointer from this CPU or
  another one: either way it won't match.
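
The second case can be modelled in user space.  This is only a sketch under stated assumptions: struct range, SKETCH_NR_CPUS and both helpers are hypothetical names, not kernel symbols, and the addresses used with them are made up.  It shows that a task-stack pointer gets the same "not an irq stack" answer whichever CPU's irq stack it is compared against:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 2	/* hypothetical CPU count for the model */

struct range {
	uintptr_t begin, end;	/* half-open interval [begin, end) */
};

/* Is 'sp' on the irq stack belonging to 'cpu'? */
static bool on_irq_stack_of(const struct range irq[SKETCH_NR_CPUS],
			    unsigned int cpu, uintptr_t sp)
{
	assert(cpu < SKETCH_NR_CPUS);
	return sp >= irq[cpu].begin && sp < irq[cpu].end;
}

/* True if every CPU's irq stack gives the same answer for 'sp'. */
static bool answer_is_cpu_independent(const struct range irq[SKETCH_NR_CPUS],
				      uintptr_t sp)
{
	bool first = on_irq_stack_of(irq, 0, sp);

	for (unsigned int cpu = 1; cpu < SKETCH_NR_CPUS; cpu++)
		if (on_irq_stack_of(irq, cpu, sp) != first)
			return false;
	return true;
}
```

A pointer that really is on one CPU's irq stack does get a CPU-dependent answer in this model, which is the first case above: there preemption is already disabled, so the CPU cannot change underneath the check.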

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/dumpstack_32.c | 14 ++++++--------
 arch/x86/kernel/dumpstack_64.c | 29 ++++++++++-------------------
 2 files changed, 16 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index c533b8b..b07d5c9 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -24,16 +24,16 @@ static void *is_irq_stack(void *p, void *irq)
 }
 
 
-static void *is_hardirq_stack(unsigned long *stack, int cpu)
+static void *is_hardirq_stack(unsigned long *stack)
 {
-	void *irq = per_cpu(hardirq_stack, cpu);
+	void *irq = this_cpu_read(hardirq_stack);
 
 	return is_irq_stack(stack, irq);
 }
 
-static void *is_softirq_stack(unsigned long *stack, int cpu)
+static void *is_softirq_stack(unsigned long *stack)
 {
-	void *irq = per_cpu(softirq_stack, cpu);
+	void *irq = this_cpu_read(softirq_stack);
 
 	return is_irq_stack(stack, irq);
 }
@@ -42,7 +42,6 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 		unsigned long *stack, unsigned long bp,
 		const struct stacktrace_ops *ops, void *data)
 {
-	const unsigned cpu = get_cpu();
 	int graph = 0;
 	u32 *prev_esp;
 
@@ -53,9 +52,9 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	for (;;) {
 		void *end_stack;
 
-		end_stack = is_hardirq_stack(stack, cpu);
+		end_stack = is_hardirq_stack(stack);
 		if (!end_stack)
-			end_stack = is_softirq_stack(stack, cpu);
+			end_stack = is_softirq_stack(stack);
 
 		bp = ops->walk_stack(task, stack, bp, ops, data,
 				     end_stack, &graph);
@@ -74,7 +73,6 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 			break;
 		touch_nmi_watchdog();
 	}
-	put_cpu();
 }
 EXPORT_SYMBOL(dump_trace);
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 6a2d14e..634ed22 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -31,8 +31,8 @@ static char x86_stack_ids[][8] = {
 #endif
 };
 
-static unsigned long *in_exception_stack(unsigned cpu, unsigned long stack,
-					 unsigned *usedp, char **idp)
+static unsigned long *in_exception_stack(unsigned long stack, unsigned *usedp,
+					 char **idp)
 {
 	unsigned k;
 
@@ -41,7 +41,7 @@ static unsigned long *in_exception_stack(unsigned cpu, unsigned long stack,
 	 * 'stack' is in one of them:
 	 */
 	for (k = 0; k < N_EXCEPTION_STACKS; k++) {
-		unsigned long end = per_cpu(orig_ist, cpu).ist[k];
+		unsigned long end = this_cpu_ptr(&orig_ist)->ist[k];
 		/*
 		 * Is 'stack' above this exception frame's end?
 		 * If yes then skip to the next frame.
@@ -111,7 +111,7 @@ enum stack_type {
 };
 
 static enum stack_type
-analyze_stack(int cpu, struct task_struct *task, unsigned long *stack,
+analyze_stack(struct task_struct *task, unsigned long *stack,
 	      unsigned long **stack_end, unsigned long *irq_stack,
 	      unsigned *used, char **id)
 {
@@ -121,8 +121,7 @@ analyze_stack(int cpu, struct task_struct *task, unsigned long *stack,
 	if ((unsigned long)task_stack_page(task) == addr)
 		return STACK_IS_NORMAL;
 
-	*stack_end = in_exception_stack(cpu, (unsigned long)stack,
-					used, id);
+	*stack_end = in_exception_stack((unsigned long)stack, used, id);
 	if (*stack_end)
 		return STACK_IS_EXCEPTION;
 
@@ -149,8 +148,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 		unsigned long *stack, unsigned long bp,
 		const struct stacktrace_ops *ops, void *data)
 {
-	const unsigned cpu = get_cpu();
-	unsigned long *irq_stack = (unsigned long *)per_cpu(irq_stack_ptr, cpu);
+	unsigned long *irq_stack = (unsigned long *)this_cpu_read(irq_stack_ptr);
 	unsigned used = 0;
 	int graph = 0;
 	int done = 0;
@@ -169,8 +167,8 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 		enum stack_type stype;
 		char *id;
 
-		stype = analyze_stack(cpu, task, stack, &stack_end,
-				      irq_stack, &used, &id);
+		stype = analyze_stack(task, stack, &stack_end, irq_stack, &used,
+				      &id);
 
 		/* Default finish unless specified to continue */
 		done = 1;
@@ -225,7 +223,6 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	 * This handles the process stack:
 	 */
 	bp = ops->walk_stack(task, stack, bp, ops, data, NULL, &graph);
-	put_cpu();
 }
 EXPORT_SYMBOL(dump_trace);
 
@@ -236,15 +233,10 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	unsigned long *irq_stack_end;
 	unsigned long *irq_stack;
 	unsigned long *stack;
-	int cpu;
 	int i;
 
-	preempt_disable();
-	cpu = smp_processor_id();
-
-	irq_stack_end	= (unsigned long *)(per_cpu(irq_stack_ptr, cpu));
-	irq_stack	= (unsigned long *)(per_cpu(irq_stack_ptr, cpu) -
-			  IRQ_USABLE_STACK_SIZE);
+	irq_stack_end	= (unsigned long *)this_cpu_read(irq_stack_ptr);
+	irq_stack	= irq_stack_end - IRQ_USABLE_STACK_SIZE;
 
 	sp = sp ? : get_stack_pointer(task, regs);
 
@@ -275,7 +267,6 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		stack++;
 		touch_nmi_watchdog();
 	}
-	preempt_enable();
 
 	pr_cont("\n");
 	show_trace_log_lvl(task, regs, sp, bp, log_lvl);
-- 
2.7.4


* [PATCH 09/19] x86/dumpstack: simplify in_exception_stack()
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (7 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 08/19] x86/dumpstack: don't disable preemption in show_stack_log_lvl() and dump_trace() Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 22:05   ` Andy Lutomirski
  2016-07-21 21:21 ` [PATCH 10/19] x86/dumpstack: add get_stack_info() interface Josh Poimboeuf
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

in_exception_stack() does some bad, bad things just so the unwinder can
print different values for different areas of the debug exception stack.

There's no need to clarify where exactly on the stack it is.  Just print
"#DB" and be done with it.

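For illustration only, the simplified lookup can be modelled in user space as below.  This is a sketch, not the kernel code: the SK_* constants, the index enum, the 'ends' array (standing in for the per-cpu orig_ist values) and lookup_exception_stack() are all hypothetical, and the sizes are arbitrary.  It uses a per-stack size table so that any address in the larger debug stack maps to the single name "#DB", and a visit mask so that a stack is never walked twice:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel constants; values are arbitrary. */
#define SK_N_STACKS	4
#define SK_STKSZ	4096UL
#define SK_DEBUG_STKSZ	(2 * SK_STKSZ)

enum { SK_DF, SK_NMI, SK_DB, SK_MC };	/* illustrative indices only */

static const char *const sk_names[SK_N_STACKS] = {
	[SK_DF] = "#DF", [SK_NMI] = "NMI", [SK_DB] = "#DB", [SK_MC] = "#MC",
};

static const unsigned long sk_sizes[SK_N_STACKS] = {
	[SK_DF] = SK_STKSZ, [SK_NMI] = SK_STKSZ,
	[SK_DB] = SK_DEBUG_STKSZ, [SK_MC] = SK_STKSZ,
};

/*
 * Return the name of the exception stack containing 'addr', or NULL if
 * 'addr' is in none of them or that stack was already visited.
 * 'ends[k]' is the exclusive top address of stack k.
 */
static const char *lookup_exception_stack(unsigned long addr,
					  const unsigned long ends[SK_N_STACKS],
					  unsigned long *visit_mask)
{
	assert(ends && visit_mask);

	for (unsigned int k = 0; k < SK_N_STACKS; k++) {
		unsigned long end   = ends[k];
		unsigned long begin = end - sk_sizes[k];

		if (addr < begin || addr >= end)
			continue;

		if (*visit_mask & (1UL << k))
			return NULL;	/* second visit: refuse to loop */
		*visit_mask |= 1UL << k;

		return sk_names[k];
	}

	return NULL;
}
```

With one size entry per stack, the lower part of the debug stack needs no special case; it simply falls inside the [begin, end) range of the "#DB" entry.
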
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/dumpstack_64.c | 106 +++++++++++++----------------------------
 1 file changed, 32 insertions(+), 74 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 634ed22..0641d75 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -16,83 +16,41 @@
 
 #include <asm/stacktrace.h>
 
+static char *exception_stack_names[N_EXCEPTION_STACKS] = {
+		[ DOUBLEFAULT_STACK-1	]	= "#DF",
+		[ NMI_STACK-1		]	= "NMI",
+		[ DEBUG_STACK-1		]	= "#DB",
+		[ MCE_STACK-1		]	= "#MC",
+};
 
-#define N_EXCEPTION_STACKS_END \
-		(N_EXCEPTION_STACKS + DEBUG_STKSZ/EXCEPTION_STKSZ - 2)
-
-static char x86_stack_ids[][8] = {
-		[ DEBUG_STACK-1			]	= "#DB",
-		[ NMI_STACK-1			]	= "NMI",
-		[ DOUBLEFAULT_STACK-1		]	= "#DF",
-		[ MCE_STACK-1			]	= "#MC",
-#if DEBUG_STKSZ > EXCEPTION_STKSZ
-		[ N_EXCEPTION_STACKS ...
-		  N_EXCEPTION_STACKS_END	]	= "#DB[?]"
-#endif
+static unsigned long exception_stack_sizes[N_EXCEPTION_STACKS] = {
+	[0 ... N_EXCEPTION_STACKS - 1]		= EXCEPTION_STKSZ,
+	[DEBUG_STACK - 1]			= DEBUG_STKSZ
 };
 
-static unsigned long *in_exception_stack(unsigned long stack, unsigned *usedp,
-					 char **idp)
+static unsigned long *in_exception_stack(unsigned long *s, char **name,
+					 unsigned long *visit_mask)
 {
+	unsigned long stack = (unsigned long)s;
+	unsigned long begin, end;
 	unsigned k;
 
-	/*
-	 * Iterate over all exception stacks, and figure out whether
-	 * 'stack' is in one of them:
-	 */
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 4);
+
 	for (k = 0; k < N_EXCEPTION_STACKS; k++) {
-		unsigned long end = this_cpu_ptr(&orig_ist)->ist[k];
-		/*
-		 * Is 'stack' above this exception frame's end?
-		 * If yes then skip to the next frame.
-		 */
-		if (stack >= end)
+		end   = this_cpu_ptr(&orig_ist)->ist[k];
+		begin = end - exception_stack_sizes[k];
+
+		if (stack < begin || stack >= end)
 			continue;
-		/*
-		 * Is 'stack' above this exception frame's start address?
-		 * If yes then we found the right frame.
-		 */
-		if (stack >= end - EXCEPTION_STKSZ) {
-			/*
-			 * Make sure we only iterate through an exception
-			 * stack once. If it comes up for the second time
-			 * then there's something wrong going on - just
-			 * break out and return NULL:
-			 */
-			if (*usedp & (1U << k))
-				break;
-			*usedp |= 1U << k;
-			*idp = x86_stack_ids[k];
-			return (unsigned long *)end;
-		}
-		/*
-		 * If this is a debug stack, and if it has a larger size than
-		 * the usual exception stacks, then 'stack' might still
-		 * be within the lower portion of the debug stack:
-		 */
-#if DEBUG_STKSZ > EXCEPTION_STKSZ
-		if (k == DEBUG_STACK - 1 && stack >= end - DEBUG_STKSZ) {
-			unsigned j = N_EXCEPTION_STACKS - 1;
 
-			/*
-			 * Black magic. A large debug stack is composed of
-			 * multiple exception stack entries, which we
-			 * iterate through now. Dont look:
-			 */
-			do {
-				++j;
-				end -= EXCEPTION_STKSZ;
-				x86_stack_ids[j][4] = '1' +
-						(j - N_EXCEPTION_STACKS);
-			} while (stack < end - EXCEPTION_STKSZ);
-			if (*usedp & (1U << j))
-				break;
-			*usedp |= 1U << j;
-			*idp = x86_stack_ids[j];
-			return (unsigned long *)end;
-		}
-#endif
+		if (test_and_set_bit(k, visit_mask))
+			return false;
+
+		*name = exception_stack_names[k];
+		return (unsigned long *)end;
 	}
+
 	return NULL;
 }
 
@@ -113,7 +71,7 @@ enum stack_type {
 static enum stack_type
 analyze_stack(struct task_struct *task, unsigned long *stack,
 	      unsigned long **stack_end, unsigned long *irq_stack,
-	      unsigned *used, char **id)
+	      unsigned long *visit_mask, char **name)
 {
 	unsigned long addr;
 
@@ -121,7 +79,7 @@ analyze_stack(struct task_struct *task, unsigned long *stack,
 	if ((unsigned long)task_stack_page(task) == addr)
 		return STACK_IS_NORMAL;
 
-	*stack_end = in_exception_stack((unsigned long)stack, used, id);
+	*stack_end = in_exception_stack(stack, name, visit_mask);
 	if (*stack_end)
 		return STACK_IS_EXCEPTION;
 
@@ -149,7 +107,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 		const struct stacktrace_ops *ops, void *data)
 {
 	unsigned long *irq_stack = (unsigned long *)this_cpu_read(irq_stack_ptr);
-	unsigned used = 0;
+	unsigned long visit_mask = 0;
 	int graph = 0;
 	int done = 0;
 
@@ -165,10 +123,10 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	while (!done) {
 		unsigned long *stack_end;
 		enum stack_type stype;
-		char *id;
+		char *name;
 
-		stype = analyze_stack(task, stack, &stack_end, irq_stack, &used,
-				      &id);
+		stype = analyze_stack(task, stack, &stack_end, irq_stack,
+				      &visit_mask, &name);
 
 		/* Default finish unless specified to continue */
 		done = 1;
@@ -181,7 +139,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 
 		case STACK_IS_EXCEPTION:
 
-			if (ops->stack(data, id) < 0)
+			if (ops->stack(data, name) < 0)
 				break;
 
 			bp = ops->walk_stack(task, stack, bp, ops,
-- 
2.7.4


* [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (8 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 09/19] x86/dumpstack: simplify in_exception_stack() Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-22 23:26   ` Andy Lutomirski
  2016-07-21 21:21 ` [PATCH 11/19] x86/dumptrace: add new unwind interface and implementations Josh Poimboeuf
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

valid_stack_ptr() is buggy: it assumes that all stacks are of size
THREAD_SIZE, which is not true for exception stacks.  So the
walk_stack() callbacks will need to know the location of the beginning
of the stack as well as the end.

Another issue is that in general the various features of a stack (type,
size, next stack pointer, description string) are scattered around in
various places throughout the stack dump code.

Encapsulate all that information in a single place with a new stack_info
struct and a get_stack_info() interface.
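
As a rough user-space illustration of why both bounds matter, the sketch below mirrors the shape of the new struct and of the on_stack() check.  It is not the kernel code: the enum is trimmed to three values, and the comparison is done on uintptr_t (my choice, since ISO C has no arithmetic on void pointers).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed-down stand-in for the enum added by this patch. */
enum stack_type { STACK_TYPE_UNKNOWN, STACK_TYPE_TASK, STACK_TYPE_EXCEPTION };

struct stack_info {
	enum stack_type type;
	unsigned long *begin, *end, *next;
};

/* True if the 'len' bytes starting at 'addr' lie wholly inside the stack. */
static bool on_stack(const struct stack_info *info, const void *addr,
		     size_t len)
{
	assert(info != NULL);

	uintptr_t begin = (uintptr_t)info->begin;
	uintptr_t end   = (uintptr_t)info->end;
	uintptr_t a     = (uintptr_t)addr;

	return info->type != STACK_TYPE_UNKNOWN &&
	       a >= begin && a < end &&
	       a + len > begin && a + len <= end;
}
```

Because the check uses the stack's own begin and end, it works unchanged for stacks that are smaller or larger than THREAD_SIZE.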

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/events/core.c            |   2 +-
 arch/x86/include/asm/stacktrace.h |  41 +++++++++-
 arch/x86/kernel/dumpstack.c       |  42 ++++++-----
 arch/x86/kernel/dumpstack_32.c    | 100 ++++++++++++++++++------
 arch/x86/kernel/dumpstack_64.c    | 155 ++++++++++++++++++++------------------
 arch/x86/kernel/stacktrace.c      |   2 +-
 6 files changed, 218 insertions(+), 124 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index fad9788..f388f57 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2248,7 +2248,7 @@ void arch_perf_update_userpage(struct perf_event *event,
  * callchain support
  */
 
-static int backtrace_stack(void *data, char *name)
+static int backtrace_stack(void *data, const char *name)
 {
 	return 0;
 }
diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 5d3d258..647ce3f 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -9,6 +9,39 @@
 #include <linux/uaccess.h>
 #include <linux/ptrace.h>
 
+enum stack_type {
+	STACK_TYPE_UNKNOWN,
+	STACK_TYPE_TASK,
+	STACK_TYPE_IRQ,
+	STACK_TYPE_SOFTIRQ,
+	STACK_TYPE_EXCEPTION,
+	STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
+};
+
+struct stack_info {
+	enum stack_type type;
+	unsigned long *begin, *end, *next;
+};
+
+bool in_task_stack(unsigned long *stack, struct task_struct *task,
+		   struct stack_info *info, unsigned long *visit_mask);
+
+int get_stack_info(unsigned long *stack, struct task_struct *task,
+		   struct stack_info *info, unsigned long *visit_mask);
+
+void stack_type_str(enum stack_type type, const char **begin,
+		    const char **end);
+
+static inline bool on_stack(struct stack_info *info, void *addr, size_t len)
+{
+	void *begin = info->begin;
+	void *end   = info->end;
+
+	return (info->type != STACK_TYPE_UNKNOWN &&
+		addr >= begin && addr < end &&
+		addr + len > begin && addr + len <= end);
+}
+
 extern int kstack_depth_to_print;
 
 struct thread_info;
@@ -32,27 +65,27 @@ typedef unsigned long (*walk_stack_t)(struct task_struct *task,
 				      unsigned long bp,
 				      const struct stacktrace_ops *ops,
 				      void *data,
-				      unsigned long *end,
+				      struct stack_info *info,
 				      int *graph);
 
 extern unsigned long
 print_context_stack(struct task_struct *task,
 		    unsigned long *stack, unsigned long bp,
 		    const struct stacktrace_ops *ops, void *data,
-		    unsigned long *end, int *graph);
+		    struct stack_info *info, int *graph);
 
 extern unsigned long
 print_context_stack_bp(struct task_struct *task,
 		       unsigned long *stack, unsigned long bp,
 		       const struct stacktrace_ops *ops, void *data,
-		       unsigned long *end, int *graph);
+		       struct stack_info *info, int *graph);
 
 /* Generic stack tracer with callbacks */
 
 struct stacktrace_ops {
 	int (*address)(void *data, unsigned long address, int reliable);
 	/* On negative return stop dumping */
-	int (*stack)(void *data, char *name);
+	int (*stack)(void *data, const char *name);
 	walk_stack_t	walk_stack;
 };
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 0a8694b..6ef8ab5 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -25,6 +25,25 @@ unsigned int code_bytes = 64;
 int kstack_depth_to_print = 3 * STACKSLOTS_PER_LINE;
 static int die_counter;
 
+bool in_task_stack(unsigned long *stack, struct task_struct *task,
+		   struct stack_info *info, unsigned long *visit_mask)
+{
+	unsigned long addr = (unsigned long)stack & ~(THREAD_SIZE - 1);
+
+	if ((unsigned long)task_stack_page(task) != addr)
+		return false;
+
+	if (visit_mask && test_and_set_bit(STACK_TYPE_TASK, visit_mask))
+		return false;
+
+	info->type	= STACK_TYPE_TASK;
+	info->begin	= task_stack_page(task);
+	info->end	= task_stack_page(task) + THREAD_SIZE;
+	info->next	= NULL;
+
+	return true;
+}
+
 static void printk_stack_address(unsigned long address, int reliable,
 				 char *log_lvl)
 {
@@ -67,24 +86,11 @@ ftrace_graph_ret_addr(struct task_struct *task, int *idx, unsigned long addr)
  * severe exception (double fault, nmi, stack fault, debug, mce) hardware stack
  */
 
-static inline int valid_stack_ptr(struct task_struct *task,
-			void *p, unsigned int size, void *end)
-{
-	void *t = task_stack_page(task);
-	if (end) {
-		if (p < end && p >= (end-THREAD_SIZE))
-			return 1;
-		else
-			return 0;
-	}
-	return p >= t && p < t + THREAD_SIZE - size;
-}
-
 unsigned long
 print_context_stack(struct task_struct *task,
 		unsigned long *stack, unsigned long bp,
 		const struct stacktrace_ops *ops, void *data,
-		unsigned long *end, int *graph)
+		struct stack_info *info, int *graph)
 {
 	struct stack_frame *frame = (struct stack_frame *)bp;
 
@@ -96,7 +102,7 @@ print_context_stack(struct task_struct *task,
 	    PAGE_SIZE)
 		stack = (unsigned long *)task_stack_page(task);
 
-	while (valid_stack_ptr(task, stack, sizeof(*stack), end)) {
+	while (on_stack(info, stack, sizeof(*stack))) {
 		unsigned long addr = *stack;
 
 		addr = *stack;
@@ -125,12 +131,12 @@ unsigned long
 print_context_stack_bp(struct task_struct *task,
 		       unsigned long *stack, unsigned long bp,
 		       const struct stacktrace_ops *ops, void *data,
-		       unsigned long *end, int *graph)
+		       struct stack_info *info, int *graph)
 {
 	struct stack_frame *frame = (struct stack_frame *)bp;
 	unsigned long *ret_addr = &frame->return_address;
 
-	while (valid_stack_ptr(task, ret_addr, sizeof(*ret_addr), end)) {
+	while (on_stack(info, stack, sizeof(*stack) * 2)) {
 		unsigned long addr = *ret_addr;
 
 		if (!__kernel_text_address(addr))
@@ -147,7 +153,7 @@ print_context_stack_bp(struct task_struct *task,
 }
 EXPORT_SYMBOL_GPL(print_context_stack_bp);
 
-static int print_trace_stack(void *data, char *name)
+static int print_trace_stack(void *data, const char *name)
 {
 	printk("%s <%s> ", (char *)data, name);
 	return 0;
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index b07d5c9..8f55ddb 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -16,61 +16,111 @@
 
 #include <asm/stacktrace.h>
 
-static void *is_irq_stack(void *p, void *irq)
+void stack_type_str(enum stack_type type, const char **begin, const char **end)
 {
-	if (p < irq || p >= (irq + THREAD_SIZE))
-		return NULL;
-	return irq + THREAD_SIZE;
+	switch (type) {
+	case STACK_TYPE_IRQ:
+	case STACK_TYPE_SOFTIRQ:
+		*begin = "IRQ";
+		*end   = "EOI";
+		break;
+	default:
+		*begin = NULL;
+		*end   = NULL;
+	}
 }
 
+static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
+			     unsigned long *visit_mask)
+{
+	unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
+	unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
+
+	if (stack < begin || stack >= end)
+		return false;
+
+	if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
+		return false;
+
+	info->type	= STACK_TYPE_IRQ;
+	info->begin	= begin;
+	info->end	= end;
+	info->next	= (unsigned long *)*begin;
 
-static void *is_hardirq_stack(unsigned long *stack)
+	return true;
+}
+
+static bool in_softirq_stack(unsigned long *stack, struct stack_info *info,
+			     unsigned long *visit_mask)
 {
-	void *irq = this_cpu_read(hardirq_stack);
+	unsigned long *begin = (unsigned long *)this_cpu_read(softirq_stack);
+	unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
+
+	if (stack < begin || stack >= end)
+		return false;
+
+	if (visit_mask && test_and_set_bit(STACK_TYPE_SOFTIRQ, visit_mask))
+		return false;
+
+	info->type	= STACK_TYPE_SOFTIRQ;
+	info->begin	= begin;
+	info->end	= end;
+	info->next	= (unsigned long *)*begin;
 
-	return is_irq_stack(stack, irq);
+	return true;
 }
 
-static void *is_softirq_stack(unsigned long *stack)
+int get_stack_info(unsigned long *stack, struct task_struct *task,
+		   struct stack_info *info, unsigned long *visit_mask)
 {
-	void *irq = this_cpu_read(softirq_stack);
+	if (!task)
+		task = current;
 
-	return is_irq_stack(stack, irq);
+	if (task == current) {
+		if (in_hardirq_stack(stack, info, visit_mask))
+			return 0;
+
+		if (in_softirq_stack(stack, info, visit_mask))
+			return 0;
+	}
+
+	if (in_task_stack(stack, task, info, visit_mask))
+		return 0;
+
+	info->type = STACK_TYPE_UNKNOWN;
+	return -EINVAL;
 }
 
 void dump_trace(struct task_struct *task, struct pt_regs *regs,
 		unsigned long *stack, unsigned long bp,
 		const struct stacktrace_ops *ops, void *data)
 {
+	unsigned long visit_mask = 0;
 	int graph = 0;
-	u32 *prev_esp;
 
 	task = task ? : current;
 	stack = stack ? : get_stack_pointer(task, regs);
 	bp = bp ? : (unsigned long)get_frame_pointer(task, regs);
 
 	for (;;) {
-		void *end_stack;
+		const char *begin_str, *end_str;
+		struct stack_info info;
 
-		end_stack = is_hardirq_stack(stack);
-		if (!end_stack)
-			end_stack = is_softirq_stack(stack);
+		if (get_stack_info(stack, task, &info, &visit_mask))
+			break;
 
-		bp = ops->walk_stack(task, stack, bp, ops, data,
-				     end_stack, &graph);
+		stack_type_str(info.type, &begin_str, &end_str);
 
-		/* Stop if not on irq stack */
-		if (!end_stack)
+		if (begin_str && ops->stack(data, begin_str) < 0)
 			break;
 
-		/* The previous esp is saved on the bottom of the stack */
-		prev_esp = (u32 *)(end_stack - THREAD_SIZE);
-		stack = (unsigned long *)*prev_esp;
-		if (!stack)
-			break;
+		bp = ops->walk_stack(task, stack, bp, ops, data, &info, &graph);
 
-		if (ops->stack(data, "IRQ") < 0)
+		if (end_str && ops->stack(data, end_str) < 0)
 			break;
+
+		stack = info.next;
+
 		touch_nmi_watchdog();
 	}
 }
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 0641d75..e1a5b6f 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -28,8 +28,27 @@ static unsigned long exception_stack_sizes[N_EXCEPTION_STACKS] = {
 	[DEBUG_STACK - 1]			= DEBUG_STKSZ
 };
 
-static unsigned long *in_exception_stack(unsigned long *s, char **name,
-					 unsigned long *visit_mask)
+void stack_type_str(enum stack_type type, const char **begin, const char **end)
+{
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 4);
+
+	switch (type) {
+	case STACK_TYPE_IRQ:
+		*begin = "IRQ";
+		*end   = "EOI";
+		break;
+	case STACK_TYPE_EXCEPTION ... STACK_TYPE_EXCEPTION_LAST:
+		*begin = exception_stack_names[type - STACK_TYPE_EXCEPTION];
+		*end   = "EOE";
+		break;
+	default:
+		*begin = NULL;
+		*end   = NULL;
+	}
+}
+
+static bool in_exception_stack(unsigned long *s, struct stack_info *info,
+			       unsigned long *visit_mask)
 {
 	unsigned long stack = (unsigned long)s;
 	unsigned long begin, end;
@@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long *s, char **name,
 		if (stack < begin || stack >= end)
 			continue;
 
-		if (test_and_set_bit(k, visit_mask))
+		if (visit_mask &&
+		    test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
 			return false;
 
-		*name = exception_stack_names[k];
-		return (unsigned long *)end;
+		info->type	= STACK_TYPE_EXCEPTION + k;
+		info->begin	= (unsigned long *)begin;
+		info->end	= (unsigned long *)end;
+		info->next	= (unsigned long *)info->end[-2];
+
+		return true;
 	}
 
-	return NULL;
+	return false;
 }
 
-static inline int
-in_irq_stack(unsigned long *stack, unsigned long *irq_stack,
-	     unsigned long *irq_stack_end)
+static bool in_irq_stack(unsigned long *stack, struct stack_info *info,
+			 unsigned long *visit_mask)
 {
-	return (stack >= irq_stack && stack < irq_stack_end);
-}
+	unsigned long *end   = (unsigned long *)this_cpu_read(irq_stack_ptr);
+	unsigned long *begin = end - (IRQ_USABLE_STACK_SIZE / sizeof(long));
 
-enum stack_type {
-	STACK_IS_UNKNOWN,
-	STACK_IS_NORMAL,
-	STACK_IS_EXCEPTION,
-	STACK_IS_IRQ,
-};
+	if (stack < begin || stack >= end)
+		return false;
 
-static enum stack_type
-analyze_stack(struct task_struct *task, unsigned long *stack,
-	      unsigned long **stack_end, unsigned long *irq_stack,
-	      unsigned long *visit_mask, char **name)
-{
-	unsigned long addr;
+	if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
+		return false;
 
-	addr = ((unsigned long)stack & (~(THREAD_SIZE - 1)));
-	if ((unsigned long)task_stack_page(task) == addr)
-		return STACK_IS_NORMAL;
+	info->type	= STACK_TYPE_IRQ;
+	info->begin	= begin;
+	info->end	= end;
+	info->next	= (unsigned long *)end[-1];
 
-	*stack_end = in_exception_stack(stack, name, visit_mask);
-	if (*stack_end)
-		return STACK_IS_EXCEPTION;
+	return true;
+}
 
-	if (!irq_stack)
-		return STACK_IS_NORMAL;
+int get_stack_info(unsigned long *stack, struct task_struct *task,
+		   struct stack_info *info, unsigned long *visit_mask)
+{
+	if (!task)
+		task = current;
+
+	if (in_task_stack(stack, task, info, visit_mask))
+		return 0;
 
-	*stack_end = irq_stack;
-	irq_stack -= (IRQ_USABLE_STACK_SIZE / sizeof(long));
+	if (task != current)
+		goto unknown;
+
+	if (in_exception_stack(stack, info, visit_mask))
+		return 0;
 
-	if (in_irq_stack(stack, irq_stack, *stack_end))
-		return STACK_IS_IRQ;
+	if (in_irq_stack(stack, info, visit_mask))
+		return 0;
 
-	return STACK_IS_UNKNOWN;
+unknown:
+	info->type = STACK_TYPE_UNKNOWN;
+	return -EINVAL;
 }
 
 /*
@@ -106,8 +132,8 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 		unsigned long *stack, unsigned long bp,
 		const struct stacktrace_ops *ops, void *data)
 {
-	unsigned long *irq_stack = (unsigned long *)this_cpu_read(irq_stack_ptr);
 	unsigned long visit_mask = 0;
+	struct stack_info info;
 	int graph = 0;
 	int done = 0;
 
@@ -121,57 +147,37 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	 * exceptions
 	 */
 	while (!done) {
-		unsigned long *stack_end;
-		enum stack_type stype;
-		char *name;
+		const char *begin_str, *end_str;
 
-		stype = analyze_stack(task, stack, &stack_end, irq_stack,
-				      &visit_mask, &name);
+		get_stack_info(stack, task, &info, &visit_mask);
 
 		/* Default finish unless specified to continue */
 		done = 1;
 
-		switch (stype) {
+		switch (info.type) {
 
 		/* Break out early if we are on the thread stack */
-		case STACK_IS_NORMAL:
+		case STACK_TYPE_TASK:
 			break;
 
-		case STACK_IS_EXCEPTION:
+		case STACK_TYPE_IRQ:
+		case STACK_TYPE_EXCEPTION ... STACK_TYPE_EXCEPTION_LAST:
+
+			stack_type_str(info.type, &begin_str, &end_str);
 
-			if (ops->stack(data, name) < 0)
+			if (ops->stack(data, begin_str) < 0)
 				break;
 
 			bp = ops->walk_stack(task, stack, bp, ops,
-					     data, stack_end, &graph);
-			ops->stack(data, "EOE");
-			/*
-			 * We link to the next stack via the
-			 * second-to-last pointer (index -2 to end) in the
-			 * exception stack:
-			 */
-			stack = (unsigned long *) stack_end[-2];
-			done = 0;
-			break;
+					     data, &info, &graph);
 
-		case STACK_IS_IRQ:
+			ops->stack(data, end_str);
 
-			if (ops->stack(data, "IRQ") < 0)
-				break;
-			bp = ops->walk_stack(task, stack, bp,
-				     ops, data, stack_end, &graph);
-			/*
-			 * We link to the next stack (which would be
-			 * the process stack normally) the last
-			 * pointer (index -1 to end) in the IRQ stack:
-			 */
-			stack = (unsigned long *) (stack_end[-1]);
-			irq_stack = NULL;
-			ops->stack(data, "EOI");
+			stack = info.next;
 			done = 0;
 			break;
 
-		case STACK_IS_UNKNOWN:
+		default:
 			ops->stack(data, "UNK");
 			break;
 		}
@@ -180,7 +186,7 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 	/*
 	 * This handles the process stack:
 	 */
-	bp = ops->walk_stack(task, stack, bp, ops, data, NULL, &graph);
+	bp = ops->walk_stack(task, stack, bp, ops, data, &info, &graph);
 }
 EXPORT_SYMBOL(dump_trace);
 
@@ -188,13 +194,12 @@ void
 show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		   unsigned long *sp, unsigned long bp, char *log_lvl)
 {
-	unsigned long *irq_stack_end;
-	unsigned long *irq_stack;
+	unsigned long *irq_stack, *irq_stack_end;
 	unsigned long *stack;
 	int i;
 
-	irq_stack_end	= (unsigned long *)this_cpu_read(irq_stack_ptr);
-	irq_stack	= irq_stack_end - IRQ_USABLE_STACK_SIZE;
+	irq_stack_end = (unsigned long *)this_cpu_read(irq_stack_ptr);
+	irq_stack     = irq_stack_end - (IRQ_USABLE_STACK_SIZE / sizeof(long));
 
 	sp = sp ? : get_stack_pointer(task, regs);
 
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 4738f5e..785aef1 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -9,7 +9,7 @@
 #include <linux/uaccess.h>
 #include <asm/stacktrace.h>
 
-static int save_stack_stack(void *data, char *name)
+static int save_stack_stack(void *data, const char *name)
 {
 	return 0;
 }
-- 
2.7.4


* [PATCH 11/19] x86/dumptrace: add new unwind interface and implementations
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (9 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 10/19] x86/dumpstack: add get_stack_info() interface Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 12/19] perf/x86: convert perf_callchain_kernel() to the new unwinder Josh Poimboeuf
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

The x86 stack dump code is a bit of a mess.  dump_trace() uses
callbacks, and each user of it seems to have slightly different
requirements, so there are several slightly different callbacks floating
around.

Also there are some upcoming features which will require more changes to
the stack dump code: reliable stack detection for live patching,
hardened user copy, and the DWARF unwinder.  Each of those features
would at least need more callbacks and/or callback interfaces, resulting
in a much bigger mess than what we have today.

Before doing all that, we should try to clean things up and replace
dump_trace() with something cleaner and more flexible.

The new unwinder is a simple state machine which was heavily inspired by
a suggestion from Andy Lutomirski:

  https://lkml.kernel.org/r/CALCETrUbNTqaM2LRyXGRx=kVLRPeY5A3Pc6k4TtQxF320rUT=w@mail.gmail.com

It's also very similar to the libunwind API:

  http://www.nongnu.org/libunwind/man/libunwind(3).html

Some of its advantages:

- Simplicity: no more callback sprawl and less code duplication.

- Flexibility: it allows the caller to stop and inspect the stack state
  at each step in the unwinding process.

- Modularity: the unwinder code, console stack dump code, and stack
  metadata analysis code are all better separated so that changing one
  of them shouldn't have much of an impact on any of the others.
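
As a rough illustration of the state machine idea, here is a userspace
sketch of the start/done/next/get-return-address pattern.  It is not
the kernel code: the toy_* names, the index-based fake stack, and the
frame layout are assumptions made only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct unwind_state. */
struct toy_unwind_state {
	const unsigned long *stack;	/* fake stack, indexed in words */
	size_t nwords;			/* number of words in the fake stack */
	size_t bp;			/* index of the current saved-bp slot */
	bool done;
};

/*
 * Assumed frame layout, modeled on the frame pointer convention:
 *   stack[bp]     = index of the caller's saved-bp slot
 *   stack[bp + 1] = return address
 */
static bool toy_frame_valid(const struct toy_unwind_state *s, size_t bp)
{
	return s->nwords >= 2 && bp <= s->nwords - 2;
}

static void toy_unwind_start(struct toy_unwind_state *s,
			     const unsigned long *stack, size_t nwords,
			     size_t bp)
{
	s->stack = stack;
	s->nwords = nwords;
	s->bp = bp;
	s->done = !toy_frame_valid(s, bp);
}

static bool toy_unwind_done(const struct toy_unwind_state *s)
{
	return s->done;
}

static unsigned long
toy_unwind_get_return_address(const struct toy_unwind_state *s)
{
	return s->done ? 0 : s->stack[s->bp + 1];
}

static bool toy_unwind_next_frame(struct toy_unwind_state *s)
{
	size_t next;

	if (s->done)
		return false;

	next = (size_t)s->stack[s->bp];

	/* Frames must move up the stack; anything else ends the walk. */
	if (next <= s->bp || !toy_frame_valid(s, next)) {
		s->done = true;
		return false;
	}

	s->bp = next;
	return true;
}

/* Caller-side loop: no callbacks, the caller decides when to stop. */
static size_t toy_collect(const unsigned long *stack, size_t nwords,
			  size_t bp, unsigned long *out, size_t max)
{
	struct toy_unwind_state s;
	size_t n = 0;

	assert(stack && out);

	for (toy_unwind_start(&s, stack, nwords, bp); !toy_unwind_done(&s);
	     toy_unwind_next_frame(&s)) {
		if (n >= max)
			break;
		out[n++] = toy_unwind_get_return_address(&s);
	}

	return n;
}
```

The point of the sketch is that the caller owns the loop, so stopping
early or inspecting the state between steps needs no extra callback.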

Two implementations are added which conform to the new unwind interface:

- The frame pointer unwinder, which is used for CONFIG_FRAME_POINTER=y.

- The "guess" unwinder, which is used for CONFIG_FRAME_POINTER=n.  This
  isn't an "unwinder" per se.  All it does is scan the stack for kernel
  text addresses.  But with no frame pointers, guesses are better than
  nothing in most cases.
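
To make the "guess" approach concrete, here is a small userspace sketch
of scanning a stack for words that look like text addresses.  The
TOY_TEXT_* range and the toy_* helpers are made up for illustration;
the real code relies on __kernel_text_address() and get_stack_info().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical text range standing in for __kernel_text_address(). */
#define TOY_TEXT_START	0x1000UL
#define TOY_TEXT_END	0x2000UL

static bool toy_text_address(unsigned long addr)
{
	return addr >= TOY_TEXT_START && addr < TOY_TEXT_END;
}

/*
 * Return the index of the first word at or after 'from' which looks like
 * a text address, or 'nwords' if there is none.
 */
static size_t toy_guess_next(const unsigned long *stack, size_t nwords,
			     size_t from)
{
	size_t i;

	for (i = from; i < nwords; i++)
		if (toy_text_address(stack[i]))
			return i;

	return nwords;
}

/* Collect every guessed address, up to 'max' entries. */
static size_t toy_guess_collect(const unsigned long *stack, size_t nwords,
				unsigned long *out, size_t max)
{
	size_t i, n = 0;

	assert(stack && out);

	for (i = toy_guess_next(stack, nwords, 0); i < nwords && n < max;
	     i = toy_guess_next(stack, nwords, i + 1))
		out[n++] = stack[i];

	return n;
}
```

Every word in the text range is reported, including stale ones left
over from earlier calls, which is why the results are only guesses.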

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/include/asm/unwind.h  | 80 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/Makefile       |  6 +++
 arch/x86/kernel/unwind_frame.c | 89 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/unwind_guess.c | 40 +++++++++++++++++++
 4 files changed, 215 insertions(+)
 create mode 100644 arch/x86/include/asm/unwind.h
 create mode 100644 arch/x86/kernel/unwind_frame.c
 create mode 100644 arch/x86/kernel/unwind_guess.c

diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
new file mode 100644
index 0000000..61c6e95
--- /dev/null
+++ b/arch/x86/include/asm/unwind.h
@@ -0,0 +1,80 @@
+#ifndef _ASM_X86_UNWIND_H
+#define _ASM_X86_UNWIND_H
+
+#include <linux/sched.h>
+#include <linux/ftrace.h>
+#include <asm/ptrace.h>
+#include <asm/stacktrace.h>
+
+struct unwind_state {
+	struct stack_info stack_info;
+	unsigned long stack_mask;
+	struct task_struct *task;
+	unsigned long *sp;
+	int graph_idx;
+#ifdef CONFIG_FRAME_POINTER
+	unsigned long *bp;
+#endif
+};
+
+void __unwind_start(struct unwind_state *state, struct task_struct *task,
+		    struct pt_regs *regs, unsigned long *sp);
+
+bool unwind_next_frame(struct unwind_state *state);
+
+
+#ifdef CONFIG_FRAME_POINTER
+
+static inline unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	if (state->stack_info.type == STACK_TYPE_UNKNOWN)
+		return NULL;
+
+	return state->bp + 1;
+}
+
+unsigned long unwind_get_return_address(struct unwind_state *state);
+
+#else /* !CONFIG_FRAME_POINTER */
+
+static inline unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	return NULL;
+}
+
+static inline unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+	if (state->stack_info.type == STACK_TYPE_UNKNOWN)
+		return 0;
+
+	return *state->sp;
+}
+
+#endif /* CONFIG_FRAME_POINTER */
+
+static inline unsigned long *unwind_get_stack_ptr(struct unwind_state *state)
+{
+	if (state->stack_info.type == STACK_TYPE_UNKNOWN)
+		return NULL;
+
+	return state->sp;
+}
+
+static inline bool unwind_done(struct unwind_state *state)
+{
+	return (state->stack_info.type == STACK_TYPE_UNKNOWN);
+}
+
+static inline
+void unwind_start(struct unwind_state *state, struct task_struct *task,
+		  struct pt_regs *regs, unsigned long *sp)
+{
+	if (!task)
+		task = current;
+
+	sp = sp ? : get_stack_pointer(task, regs);
+
+	__unwind_start(state, task, regs, sp);
+}
+
+#endif /* _ASM_X86_UNWIND_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 0503f5b..45257cf 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -125,6 +125,12 @@ obj-$(CONFIG_EFI)			+= sysfb_efi.o
 obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
 obj-$(CONFIG_TRACING)			+= tracepoint.o
 
+ifdef CONFIG_FRAME_POINTER
+obj-y					+= unwind_frame.o
+else
+obj-y					+= unwind_guess.o
+endif
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
new file mode 100644
index 0000000..1234480
--- /dev/null
+++ b/arch/x86/kernel/unwind_frame.c
@@ -0,0 +1,89 @@
+#include <linux/sched.h>
+#include <asm/ptrace.h>
+#include <asm/bitops.h>
+#include <asm/stacktrace.h>
+#include <asm/unwind.h>
+
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+	unsigned long addr, graph_addr;
+
+	if (state->stack_info.type == STACK_TYPE_UNKNOWN)
+		return 0;
+
+	addr = *unwind_get_return_address_ptr(state);
+	graph_addr = ftrace_graph_ret_addr(state->task, &state->graph_idx,
+					   addr);
+	return graph_addr ? : addr;
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+static unsigned long *update_stack_state(struct unwind_state *state, void *addr,
+					 size_t len)
+{
+	struct stack_info *info = &state->stack_info;
+	unsigned long *sp;
+
+	if (on_stack(info, addr, len))
+		return addr;
+
+	sp = info->next;
+	if (!sp)
+		goto unknown;
+
+	if (get_stack_info(sp, state->task, info, &state->stack_mask))
+		goto unknown;
+
+	if (!on_stack(info, addr, len))
+		goto unknown;
+
+	return sp;
+
+unknown:
+	info->type = STACK_TYPE_UNKNOWN;
+	return NULL;
+}
+
+static bool unwind_next_frame_bp(struct unwind_state *state, unsigned long *bp)
+{
+	unsigned long *sp;
+
+	sp = update_stack_state(state, bp, sizeof(*bp) * 2);
+	if (state->stack_info.type == STACK_TYPE_UNKNOWN)
+		return false;
+
+	state->bp = bp;
+	state->sp = sp;
+
+	return true;
+}
+
+bool unwind_next_frame(struct unwind_state *state)
+{
+	unsigned long *bp;
+
+	if (unwind_done(state))
+		return false;
+
+	bp = (unsigned long *)*state->bp;
+
+	return unwind_next_frame_bp(state, bp);
+}
+EXPORT_SYMBOL_GPL(unwind_next_frame);
+
+void __unwind_start(struct unwind_state *state, struct task_struct *task,
+		    struct pt_regs *regs, unsigned long *sp)
+{
+	memset(state, 0, sizeof(*state));
+
+	state->task = task;
+	state->sp = sp;
+	state->bp = get_frame_pointer(task, regs);
+
+	get_stack_info(sp, state->task, &state->stack_info, &state->stack_mask);
+
+	/* unwind to the first frame after the user-specified stack pointer */
+	while (state->bp < sp && !unwind_done(state))
+		unwind_next_frame(state);
+}
+EXPORT_SYMBOL_GPL(__unwind_start);
diff --git a/arch/x86/kernel/unwind_guess.c b/arch/x86/kernel/unwind_guess.c
new file mode 100644
index 0000000..223d020
--- /dev/null
+++ b/arch/x86/kernel/unwind_guess.c
@@ -0,0 +1,40 @@
+#include <linux/sched.h>
+#include <linux/ftrace.h>
+#include <asm/ptrace.h>
+#include <asm/bitops.h>
+#include <asm/stacktrace.h>
+#include <asm/unwind.h>
+
+bool unwind_next_frame(struct unwind_state *state)
+{
+	struct stack_info *info = &state->stack_info;
+
+	if (info->type == STACK_TYPE_UNKNOWN)
+		return false;
+
+	do {
+		for (state->sp++; state->sp < info->end; state->sp++)
+			if (__kernel_text_address(*state->sp))
+				return true;
+
+		state->sp = info->next;
+
+	} while (!get_stack_info(state->sp, state->task, info,
+				 &state->stack_mask));
+
+	return false;
+}
+
+void __unwind_start(struct unwind_state *state, struct task_struct *task,
+		    struct pt_regs *regs, unsigned long *sp)
+{
+	memset(state, 0, sizeof(*state));
+
+	state->task = task;
+	state->sp   = sp;
+
+	get_stack_info(sp, state->task, &state->stack_info, &state->stack_mask);
+
+	if (!__kernel_text_address(*sp))
+		unwind_next_frame(state);
+}
-- 
2.7.4

* [PATCH 12/19] perf/x86: convert perf_callchain_kernel() to the new unwinder
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (10 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 11/19] x86/dumptrace: add new unwind interface and implementations Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 13/19] x86/stacktrace: convert save_stack_trace_*() " Josh Poimboeuf
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

Convert perf_callchain_kernel() to the new unwinder.  dump_trace() has
been deprecated.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/events/core.c | 32 +++++++++-----------------------
 1 file changed, 9 insertions(+), 23 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f388f57..e91b9c3 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -37,6 +37,7 @@
 #include <asm/timer.h>
 #include <asm/desc.h>
 #include <asm/ldt.h>
+#include <asm/unwind.h>
 
 #include "perf_event.h"
 
@@ -2244,31 +2245,12 @@ void arch_perf_update_userpage(struct perf_event *event,
 	cyc2ns_read_end(data);
 }
 
-/*
- * callchain support
- */
-
-static int backtrace_stack(void *data, const char *name)
-{
-	return 0;
-}
-
-static int backtrace_address(void *data, unsigned long addr, int reliable)
-{
-	struct perf_callchain_entry_ctx *entry = data;
-
-	return perf_callchain_store(entry, addr);
-}
-
-static const struct stacktrace_ops backtrace_ops = {
-	.stack			= backtrace_stack,
-	.address		= backtrace_address,
-	.walk_stack		= print_context_stack_bp,
-};
-
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
+	struct unwind_state state;
+	unsigned long addr;
+
 	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
 		/* TODO: We don't support guest os callchain now */
 		return;
@@ -2276,7 +2258,11 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 
 	perf_callchain_store(entry, regs->ip);
 
-	dump_trace(NULL, regs, NULL, 0, &backtrace_ops, entry);
+	for (unwind_start(&state, NULL, regs, NULL); !unwind_done(&state);
+	     unwind_next_frame(&state)) {
+		addr = unwind_get_return_address(&state);
+		perf_callchain_store(entry, addr);
+	}
 }
 
 static inline int
-- 
2.7.4

* [PATCH 13/19] x86/stacktrace: convert save_stack_trace_*() to the new unwinder
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (11 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 12/19] perf/x86: convert perf_callchain_kernel() to the new unwinder Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 14/19] oprofile/x86: convert x86_backtrace() " Josh Poimboeuf
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

Convert save_stack_trace_*() to the new unwinder.  dump_trace() has been
deprecated.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/stacktrace.c | 74 +++++++++++++++++---------------------------
 1 file changed, 29 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 785aef1..63342f2 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -8,80 +8,64 @@
 #include <linux/export.h>
 #include <linux/uaccess.h>
 #include <asm/stacktrace.h>
+#include <asm/unwind.h>
 
-static int save_stack_stack(void *data, const char *name)
+static int save_stack_address(struct stack_trace *trace, unsigned long addr,
+			      bool nosched)
 {
-	return 0;
-}
-
-static int
-__save_stack_address(void *data, unsigned long addr, bool reliable, bool nosched)
-{
-	struct stack_trace *trace = data;
-#ifdef CONFIG_FRAME_POINTER
-	if (!reliable)
-		return 0;
-#endif
 	if (nosched && in_sched_functions(addr))
 		return 0;
+
 	if (trace->skip > 0) {
 		trace->skip--;
 		return 0;
 	}
-	if (trace->nr_entries < trace->max_entries) {
-		trace->entries[trace->nr_entries++] = addr;
-		return 0;
-	} else {
-		return -1; /* no more room, stop walking the stack */
-	}
-}
 
-static int save_stack_address(void *data, unsigned long addr, int reliable)
-{
-	return __save_stack_address(data, addr, reliable, false);
+	if (trace->nr_entries >= trace->max_entries)
+		return -1;
+
+	trace->entries[trace->nr_entries++] = addr;
+	return 0;
 }
 
-static int
-save_stack_address_nosched(void *data, unsigned long addr, int reliable)
+static void __save_stack_trace(struct stack_trace *trace,
+			       struct task_struct *task, struct pt_regs *regs,
+			       bool nosched)
 {
-	return __save_stack_address(data, addr, reliable, true);
-}
+	struct unwind_state state;
+	unsigned long addr;
 
-static const struct stacktrace_ops save_stack_ops = {
-	.stack		= save_stack_stack,
-	.address	= save_stack_address,
-	.walk_stack	= print_context_stack,
-};
+	if (regs)
+		save_stack_address(trace, regs->ip, nosched);
 
-static const struct stacktrace_ops save_stack_ops_nosched = {
-	.stack		= save_stack_stack,
-	.address	= save_stack_address_nosched,
-	.walk_stack	= print_context_stack,
-};
+	for (unwind_start(&state, task, regs, NULL); !unwind_done(&state);
+	     unwind_next_frame(&state)) {
+		addr = unwind_get_return_address(&state);
+		if (save_stack_address(trace, addr, nosched))
+			break;
+	}
+
+	if (trace->nr_entries < trace->max_entries)
+		trace->entries[trace->nr_entries++] = ULONG_MAX;
+}
 
 /*
  * Save stack-backtrace addresses into a stack_trace buffer.
  */
 void save_stack_trace(struct stack_trace *trace)
 {
-	dump_trace(current, NULL, NULL, 0, &save_stack_ops, trace);
-	if (trace->nr_entries < trace->max_entries)
-		trace->entries[trace->nr_entries++] = ULONG_MAX;
+	__save_stack_trace(trace, NULL, NULL, false);
 }
 EXPORT_SYMBOL_GPL(save_stack_trace);
 
 void save_stack_trace_regs(struct pt_regs *regs, struct stack_trace *trace)
 {
-	dump_trace(current, regs, NULL, 0, &save_stack_ops, trace);
-	if (trace->nr_entries < trace->max_entries)
-		trace->entries[trace->nr_entries++] = ULONG_MAX;
+	__save_stack_trace(trace, NULL, regs, false);
 }
 
 void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 {
-	dump_trace(tsk, NULL, NULL, 0, &save_stack_ops_nosched, trace);
-	if (trace->nr_entries < trace->max_entries)
-		trace->entries[trace->nr_entries++] = ULONG_MAX;
+	__save_stack_trace(trace, tsk, NULL, true);
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
 
-- 
2.7.4

* [PATCH 14/19] oprofile/x86: convert x86_backtrace() to the new unwinder
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (12 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 13/19] x86/stacktrace: convert save_stack_trace_*() " Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 15/19] x86/dumpstack: convert show_trace_log_lvl() " Josh Poimboeuf
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park, Robert Richter,
	oprofile-list

Convert oprofile's x86_backtrace() to the new unwinder.  dump_trace()
has been deprecated.

Cc: Robert Richter <rric@kernel.org>
Cc: oprofile-list@lists.sf.net
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/oprofile/backtrace.c | 42 +++++++++++++++++++-----------------------
 1 file changed, 19 insertions(+), 23 deletions(-)

diff --git a/arch/x86/oprofile/backtrace.c b/arch/x86/oprofile/backtrace.c
index c594768..6cda1f4 100644
--- a/arch/x86/oprofile/backtrace.c
+++ b/arch/x86/oprofile/backtrace.c
@@ -16,27 +16,7 @@
 
 #include <asm/ptrace.h>
 #include <asm/stacktrace.h>
-
-static int backtrace_stack(void *data, char *name)
-{
-	/* Yes, we want all stacks */
-	return 0;
-}
-
-static int backtrace_address(void *data, unsigned long addr, int reliable)
-{
-	unsigned int *depth = data;
-
-	if ((*depth)--)
-		oprofile_add_trace(addr);
-	return 0;
-}
-
-static struct stacktrace_ops backtrace_ops = {
-	.stack		= backtrace_stack,
-	.address	= backtrace_address,
-	.walk_stack	= print_context_stack,
-};
+#include <asm/unwind.h>
 
 #ifdef CONFIG_COMPAT
 static struct stack_frame_ia32 *
@@ -113,8 +93,24 @@ x86_backtrace(struct pt_regs * const regs, unsigned int depth)
 	struct stack_frame *head = (struct stack_frame *)frame_pointer(regs);
 
 	if (!user_mode(regs)) {
-		if (depth)
-			dump_trace(NULL, regs, NULL, 0, &backtrace_ops, &depth);
+		struct unwind_state state;
+		unsigned long addr;
+
+		if (!depth)
+			return;
+
+		oprofile_add_trace(regs->ip);
+
+		if (!--depth)
+			return;
+
+		for (unwind_start(&state, NULL, regs, NULL);
+		     !unwind_done(&state); unwind_next_frame(&state)) {
+			addr = unwind_get_return_address(&state);
+			oprofile_add_trace(addr);
+			if (!--depth)
+				break;
+		}
 		return;
 	}
 
-- 
2.7.4

* [PATCH 15/19] x86/dumpstack: convert show_trace_log_lvl() to the new unwinder
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (13 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 14/19] oprofile/x86: convert x86_backtrace() " Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:49   ` Byungchul Park
  2016-07-21 21:21 ` [PATCH 16/19] x86/dumpstack: remove dump_trace() Josh Poimboeuf
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

Convert show_trace_log_lvl() to the new unwinder.  dump_trace() has been
deprecated.

show_trace_log_lvl() is special compared to other users of the unwinder.
It's the only place where both reliable *and* unreliable addresses are
needed.  With frame pointers enabled, most stack walking code doesn't
want to know about unreliable addresses.  But in this case, when we're
dumping the stack to the console because something presumably went
wrong, the unreliable addresses are useful:

- They show stale data on the stack which can provide useful clues.

- If something goes wrong with the unwinder, or if frame pointers are
  corrupt or missing, all the stack addresses still get shown.

So in order to show all addresses on the stack, and at the same time
figure out which addresses are reliable, we have to do the scanning and
the unwinding in parallel.

The scanning is done with the help of get_stack_info() to traverse the
stacks.  The unwinding is done separately by the new unwinder.
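
The parallel scan-and-unwind idea can be sketched in userspace roughly
as follows.  This is only an illustration under assumed names and an
assumed index-based frame layout (toy_entry, toy_show_trace, the
TOY_TEXT_* range); it leaves out multiple stacks and the function graph
tracer handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical text range standing in for __kernel_text_address(). */
#define TOY_TEXT_START	0x1000UL
#define TOY_TEXT_END	0x2000UL

struct toy_entry {
	unsigned long addr;
	bool reliable;	/* false corresponds to the '?' prefix */
};

static bool toy_text_address(unsigned long addr)
{
	return addr >= TOY_TEXT_START && addr < TOY_TEXT_END;
}

/*
 * Scan every word of the fake stack.  In parallel, follow frames where
 * stack[bp] is the index of the next saved-bp slot and stack[bp + 1] is
 * the return address.  A text address is reliable only if it sits in the
 * slot the frame walk currently points at.
 */
static size_t toy_show_trace(const unsigned long *stack, size_t nwords,
			     size_t bp, struct toy_entry *out, size_t max)
{
	bool unwinding = nwords >= 2 && bp <= nwords - 2;
	size_t i, n = 0;

	assert(stack && out);

	for (i = 0; i < nwords && n < max; i++) {
		unsigned long addr = stack[i];
		bool reliable;

		if (!toy_text_address(addr))
			continue;

		reliable = unwinding && i == bp + 1;
		out[n].addr = addr;
		out[n].reliable = reliable;
		n++;

		if (reliable) {
			size_t next = (size_t)stack[bp];

			/*
			 * Advance the frame walk.  If the link is bad, stop
			 * trusting it: the remaining addresses are simply
			 * reported as unreliable.
			 */
			if (next > bp && next <= nwords - 2)
				bp = next;
			else
				unwinding = false;
		}
	}

	return n;
}
```

If the frame walk fails at any point, the scan still reports every
remaining text address, just without the reliable flag.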

In theory we could simplify show_trace_log_lvl() by instead pushing some
of this logic into the unwind code.  But then we would need some kind of
"fake" frame logic in the unwinder which would add a lot of complexity
and wouldn't be worth it in order to support only one user.

Another benefit of this approach is that once we have a DWARF unwinder,
we should be able to just plug it in with minimal impact to this code.

Another change here is that callers of show_trace_log_lvl() don't need
to provide the 'bp' argument.  The unwinder already finds the relevant
frame pointer by unwinding until it reaches the first frame after the
provided stack pointer.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/include/asm/stacktrace.h |  10 +--
 arch/x86/kernel/dumpstack.c       | 180 +++++++++++++++++++-------------------
 arch/x86/kernel/dumpstack_32.c    |   6 +-
 arch/x86/kernel/dumpstack_64.c    |  10 +--
 4 files changed, 101 insertions(+), 105 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 647ce3f..c66dece 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -132,13 +132,11 @@ get_stack_pointer(struct task_struct *task, struct pt_regs *regs)
 	return (unsigned long *)task->thread.sp;
 }
 
-extern void
-show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
-		   unsigned long *stack, unsigned long bp, char *log_lvl);
+void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
+			unsigned long *stack, char *log_lvl);
 
-extern void
-show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
-		   unsigned long *sp, unsigned long bp, char *log_lvl);
+void show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
+			unsigned long *sp, char *log_lvl);
 
 extern unsigned int code_bytes;
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 6ef8ab5..198dc9e 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -17,7 +17,7 @@
 #include <linux/sysfs.h>
 
 #include <asm/stacktrace.h>
-
+#include <asm/unwind.h>
 
 int panic_on_unrecovered_nmi;
 int panic_on_io_nmi;
@@ -79,107 +79,105 @@ ftrace_graph_ret_addr(struct task_struct *task, int *idx, unsigned long addr)
 }
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
 
-/*
- * x86-64 can have up to three kernel stacks:
- * process stack
- * interrupt stack
- * severe exception (double fault, nmi, stack fault, debug, mce) hardware stack
- */
-
-unsigned long
-print_context_stack(struct task_struct *task,
-		unsigned long *stack, unsigned long bp,
-		const struct stacktrace_ops *ops, void *data,
-		struct stack_info *info, int *graph)
+void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
+			unsigned long *stack, char *log_lvl)
 {
-	struct stack_frame *frame = (struct stack_frame *)bp;
+	struct unwind_state state;
+	struct stack_info stack_info = {0};
+	unsigned long stack_mask = 0;
+	int graph_idx = 0;
 
-	/*
-	 * If we overflowed the stack into a guard page, jump back to the
-	 * bottom of the usable stack.
-	 */
-	if ((unsigned long)task_stack_page(task) - (unsigned long)stack <
-	    PAGE_SIZE)
-		stack = (unsigned long *)task_stack_page(task);
-
-	while (on_stack(info, stack, sizeof(*stack))) {
-		unsigned long addr = *stack;
-
-		addr = *stack;
-		if (__kernel_text_address(addr)) {
-			int reliable = 0;
-			unsigned long real_addr;
+	printk("%sCall Trace:\n", log_lvl);
 
-			if ((unsigned long) stack == bp + sizeof(long)) {
-				reliable = 1;
-				frame = frame->next_frame;
-				bp = (unsigned long) frame;
-			}
+	stack = stack ? : get_stack_pointer(task, regs);
+	if (!task)
+		task = current;
 
-			real_addr = ftrace_graph_ret_addr(task, graph, addr);
-			if (addr != real_addr)
-				ops->address(data, addr, 0);
-			ops->address(data, real_addr, reliable);
-		}
-		stack++;
-	}
-	return bp;
-}
-EXPORT_SYMBOL_GPL(print_context_stack);
+	unwind_start(&state, task, regs, stack);
 
-unsigned long
-print_context_stack_bp(struct task_struct *task,
-		       unsigned long *stack, unsigned long bp,
-		       const struct stacktrace_ops *ops, void *data,
-		       struct stack_info *info, int *graph)
-{
-	struct stack_frame *frame = (struct stack_frame *)bp;
-	unsigned long *ret_addr = &frame->return_address;
-
-	while (on_stack(info, stack, sizeof(*stack) * 2)) {
-		unsigned long addr = *ret_addr;
+	/*
+	 * Iterate through the stacks, starting with the current stack pointer.
+	 * Each stack has a pointer to the next one.
+	 *
+	 * x86-64 can have several stacks:
+	 * - task stack
+	 * - interrupt stack
+	 * - HW exception stacks (double fault, nmi, debug, mce)
+	 *
+	 * x86-32 can have up to three stacks:
+	 * - task stack
+	 * - softirq stack
+	 * - hardirq stack
+	 */
+	for (; stack; stack = stack_info.next) {
+		const char *str_begin, *str_end;
 
-		if (!__kernel_text_address(addr))
-			break;
+		/*
+		 * If we overflowed the task stack into a guard page, jump back
+		 * to the bottom of the usable stack.
+		 */
+		if (task_stack_page(task) - (void *)stack < PAGE_SIZE)
+			stack = task_stack_page(task);
 
-		addr = ftrace_graph_ret_addr(task, graph, addr);
-		if (ops->address(data, addr, 1))
+		if (get_stack_info(stack, task, &stack_info, &stack_mask))
 			break;
-		frame = frame->next_frame;
-		ret_addr = &frame->return_address;
-	}
 
-	return (unsigned long)frame;
-}
-EXPORT_SYMBOL_GPL(print_context_stack_bp);
+		stack_type_str(stack_info.type, &str_begin, &str_end);
+		if (str_begin)
+			printk("%s <%s> ", log_lvl, str_begin);
+
+		/*
+		 * Scan the stack, printing any text addresses we find.  At the
+		 * same time, follow proper stack frames with the unwinder.
+		 *
+		 * Addresses found during the scan which are not reported by
+		 * the unwinder are considered to be additional clues which are
+		 * sometimes useful for debugging and are prefixed with '?'.
+		 * This also serves as a failsafe option in case the unwinder
+		 * goes off the rails.
+		 */
+		for (; stack < stack_info.end; stack++) {
+			unsigned long addr = *stack;
+			unsigned long real_addr;
+			unsigned long *ret_addr_p = \
+				unwind_get_return_address_ptr(&state);
 
-static int print_trace_stack(void *data, const char *name)
-{
-	printk("%s <%s> ", (char *)data, name);
-	return 0;
-}
+			if (!__kernel_text_address(addr))
+				continue;
 
-/*
- * Print one address/symbol entries per line.
- */
-static int print_trace_address(void *data, unsigned long addr, int reliable)
-{
-	printk_stack_address(addr, reliable, data);
-	return 0;
-}
+			if (stack != ret_addr_p) {
+				/* found an "unreliable" address */
+				printk_stack_address(addr, 0, log_lvl);
+				continue;
+			}
 
-static const struct stacktrace_ops print_trace_ops = {
-	.stack			= print_trace_stack,
-	.address		= print_trace_address,
-	.walk_stack		= print_context_stack,
-};
+			/*
+			 * When function graph tracing is enabled, the original
+			 * return address on the stack of a traced function is
+			 * replaced with the address of an ftrace handler.
+			 * In that case we print the ftrace handler address as
+			 * an unreliable clue and then print the real function
+			 * as a reliable address.
+			 */
+			real_addr = ftrace_graph_ret_addr(task, &graph_idx,
+							  addr);
+			if (real_addr != addr)
+				printk_stack_address(addr, 0, log_lvl);
+
+			printk_stack_address(real_addr, 1, log_lvl);
+
+			/*
+			 * Get the next frame from the unwinder.  No need to
+			 * check for an error: if anything goes wrong with the
+			 * unwinder, the rest of the addresses will just be
+			 * printed as unreliable.
+			 */
+			unwind_next_frame(&state);
+		}
 
-void
-show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
-		unsigned long *stack, unsigned long bp, char *log_lvl)
-{
-	printk("%sCall Trace:\n", log_lvl);
-	dump_trace(task, regs, stack, bp, &print_trace_ops, log_lvl);
+		if (str_end)
+			printk("%s <%s> ", log_lvl, str_end);
+	}
 }
 
 void show_stack(struct task_struct *task, unsigned long *sp)
@@ -195,12 +193,12 @@ void show_stack(struct task_struct *task, unsigned long *sp)
 		bp = (unsigned long)get_frame_pointer(current, NULL);
 	}
 
-	show_stack_log_lvl(task, NULL, sp, bp, "");
+	show_stack_log_lvl(task, NULL, sp, "");
 }
 
 void show_stack_regs(struct pt_regs *regs)
 {
-	show_stack_log_lvl(current, regs, NULL, 0, "");
+	show_stack_log_lvl(NULL, regs, NULL, "");
 }
 
 static arch_spinlock_t die_lock = __ARCH_SPIN_LOCK_UNLOCKED;
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index 8f55ddb..6a881cc 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -128,7 +128,7 @@ EXPORT_SYMBOL(dump_trace);
 
 void
 show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
-		   unsigned long *sp, unsigned long bp, char *log_lvl)
+		   unsigned long *sp, char *log_lvl)
 {
 	unsigned long *stack;
 	int i;
@@ -148,7 +148,7 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		touch_nmi_watchdog();
 	}
 	pr_cont("\n");
-	show_trace_log_lvl(task, regs, sp, bp, log_lvl);
+	show_trace_log_lvl(task, regs, sp, log_lvl);
 }
 
 
@@ -170,7 +170,7 @@ void show_regs(struct pt_regs *regs)
 		u8 *ip;
 
 		pr_emerg("Stack:\n");
-		show_stack_log_lvl(NULL, regs, NULL, 0, KERN_EMERG);
+		show_stack_log_lvl(NULL, regs, NULL, KERN_EMERG);
 
 		pr_emerg("Code:");
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index e1a5b6f..6e5ccec 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -15,6 +15,7 @@
 #include <linux/nmi.h>
 
 #include <asm/stacktrace.h>
+#include <asm/unwind.h>
 
 static char *exception_stack_names[N_EXCEPTION_STACKS] = {
 		[ DOUBLEFAULT_STACK-1	]	= "#DF",
@@ -190,9 +191,8 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
 }
 EXPORT_SYMBOL(dump_trace);
 
-void
-show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
-		   unsigned long *sp, unsigned long bp, char *log_lvl)
+void show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
+			unsigned long *sp, char *log_lvl)
 {
 	unsigned long *irq_stack, *irq_stack_end;
 	unsigned long *stack;
@@ -232,7 +232,7 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	}
 
 	pr_cont("\n");
-	show_trace_log_lvl(task, regs, sp, bp, log_lvl);
+	show_trace_log_lvl(task, regs, sp, log_lvl);
 }
 
 void show_regs(struct pt_regs *regs)
@@ -253,7 +253,7 @@ void show_regs(struct pt_regs *regs)
 		u8 *ip;
 
 		printk(KERN_DEFAULT "Stack:\n");
-		show_stack_log_lvl(NULL, regs, NULL, 0, KERN_DEFAULT);
+		show_stack_log_lvl(NULL, regs, NULL, KERN_DEFAULT);
 
 		printk(KERN_DEFAULT "Code: ");
 
-- 
2.7.4

* [PATCH 16/19] x86/dumpstack: remove dump_trace()
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (14 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 15/19] x86/dumpstack: convert show_trace_log_lvl() " Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 17/19] x86/entry/dumpstack: encode pt_regs pointer in frame pointer Josh Poimboeuf
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

All previous users of dump_trace() have been converted to use the new
unwind interfaces, so we can remove it.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/include/asm/stacktrace.h | 38 +--------------------
 arch/x86/kernel/dumpstack_32.c    | 35 --------------------
 arch/x86/kernel/dumpstack_64.c    | 69 ---------------------------------------
 3 files changed, 1 insertion(+), 141 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index c66dece..bb2b74f 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -42,11 +42,6 @@ static inline bool on_stack(struct stack_info *info, void *addr, size_t len)
 		addr + len > begin && addr + len <= end);
 }
 
-extern int kstack_depth_to_print;
-
-struct thread_info;
-struct stacktrace_ops;
-
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 
 unsigned long
@@ -60,38 +55,7 @@ ftrace_graph_ret_addr(struct task_struct *task, int *idx, unsigned long addr)
 }
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
 
-typedef unsigned long (*walk_stack_t)(struct task_struct *task,
-				      unsigned long *stack,
-				      unsigned long bp,
-				      const struct stacktrace_ops *ops,
-				      void *data,
-				      struct stack_info *info,
-				      int *graph);
-
-extern unsigned long
-print_context_stack(struct task_struct *task,
-		    unsigned long *stack, unsigned long bp,
-		    const struct stacktrace_ops *ops, void *data,
-		    struct stack_info *info, int *graph);
-
-extern unsigned long
-print_context_stack_bp(struct task_struct *task,
-		       unsigned long *stack, unsigned long bp,
-		       const struct stacktrace_ops *ops, void *data,
-		       struct stack_info *info, int *graph);
-
-/* Generic stack tracer with callbacks */
-
-struct stacktrace_ops {
-	int (*address)(void *data, unsigned long address, int reliable);
-	/* On negative return stop dumping */
-	int (*stack)(void *data, const char *name);
-	walk_stack_t	walk_stack;
-};
-
-void dump_trace(struct task_struct *tsk, struct pt_regs *regs,
-		unsigned long *stack, unsigned long bp,
-		const struct stacktrace_ops *ops, void *data);
+extern int kstack_depth_to_print;
 
 #ifdef CONFIG_X86_32
 #define STACKSLOTS_PER_LINE 8
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index 6a881cc..69b4ddd 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -91,41 +91,6 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	return -EINVAL;
 }
 
-void dump_trace(struct task_struct *task, struct pt_regs *regs,
-		unsigned long *stack, unsigned long bp,
-		const struct stacktrace_ops *ops, void *data)
-{
-	unsigned long visit_mask = 0;
-	int graph = 0;
-
-	task = task ? : current;
-	stack = stack ? : get_stack_pointer(task, regs);
-	bp = bp ? : (unsigned long)get_frame_pointer(task, regs);
-
-	for (;;) {
-		const char *begin_str, *end_str;
-		struct stack_info info;
-
-		if (get_stack_info(stack, task, &info, &visit_mask))
-			break;
-
-		stack_type_str(info.type, &begin_str, &end_str);
-
-		if (begin_str && ops->stack(data, begin_str) < 0)
-			break;
-
-		bp = ops->walk_stack(task, stack, bp, ops, data, &info, &graph);
-
-		if (end_str && ops->stack(data, end_str) < 0)
-			break;
-
-		stack = info.next;
-
-		touch_nmi_watchdog();
-	}
-}
-EXPORT_SYMBOL(dump_trace);
-
 void
 show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		   unsigned long *sp, char *log_lvl)
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 6e5ccec..b4e3bd3 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -122,75 +122,6 @@ unknown:
 	return -EINVAL;
 }
 
-/*
- * x86-64 can have up to three kernel stacks:
- * process stack
- * interrupt stack
- * severe exception (double fault, nmi, stack fault, debug, mce) hardware stack
- */
-
-void dump_trace(struct task_struct *task, struct pt_regs *regs,
-		unsigned long *stack, unsigned long bp,
-		const struct stacktrace_ops *ops, void *data)
-{
-	unsigned long visit_mask = 0;
-	struct stack_info info;
-	int graph = 0;
-	int done = 0;
-
-	task = task ? : current;
-	stack = stack ? : get_stack_pointer(task, regs);
-	bp = bp ? : (unsigned long)get_frame_pointer(task, regs);
-
-	/*
-	 * Print function call entries in all stacks, starting at the
-	 * current stack address. If the stacks consist of nested
-	 * exceptions
-	 */
-	while (!done) {
-		const char *begin_str, *end_str;
-
-		get_stack_info(stack, task, &info, &visit_mask);
-
-		/* Default finish unless specified to continue */
-		done = 1;
-
-		switch (info.type) {
-
-		/* Break out early if we are on the thread stack */
-		case STACK_TYPE_TASK:
-			break;
-
-		case STACK_TYPE_IRQ:
-		case STACK_TYPE_EXCEPTION ... STACK_TYPE_EXCEPTION_LAST:
-
-			stack_type_str(info.type, &begin_str, &end_str);
-
-			if (ops->stack(data, begin_str) < 0)
-				break;
-
-			bp = ops->walk_stack(task, stack, bp, ops,
-					     data, &info, &graph);
-
-			ops->stack(data, end_str);
-
-			stack = info.next;
-			done = 0;
-			break;
-
-		default:
-			ops->stack(data, "UNK");
-			break;
-		}
-	}
-
-	/*
-	 * This handles the process stack:
-	 */
-	bp = ops->walk_stack(task, stack, bp, ops, data, &info, &graph);
-}
-EXPORT_SYMBOL(dump_trace);
-
 void show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 			unsigned long *sp, char *log_lvl)
 {
-- 
2.7.4

* [PATCH 17/19] x86/entry/dumpstack: encode pt_regs pointer in frame pointer
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (15 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 16/19] x86/dumpstack: remove dump_trace() Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 22:27   ` Andy Lutomirski
  2016-07-21 21:21 ` [PATCH 18/19] x86/dumpstack: print stack identifier on its own line Josh Poimboeuf
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

With frame pointers, when a task is interrupted, its stack is no longer
completely reliable because the function could have been interrupted
before it had a chance to save the previous frame pointer on the stack.
So the caller of the interrupted function could get skipped by a stack
trace.

This is problematic for live patching, which needs to know whether a
stack trace of a sleeping task can be relied upon.  There's currently no
way to detect if a sleeping task was interrupted by a page fault
exception or preemption before it went to sleep.

Another issue is that when dumping the stack of an interrupted task, the
unwinder has no way of knowing where the saved pt_regs registers are, so
it can't print them.

This solves those issues by encoding the pt_regs pointer in the frame
pointer on entry from an interrupt or an exception.  The frame pointer
unwinder is also updated to decode it.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/entry/calling.h       | 21 ++++++++++++++++++++
 arch/x86/entry/entry_64.S      |  7 ++++++-
 arch/x86/include/asm/unwind.h  | 11 +++++++++++
 arch/x86/kernel/unwind_frame.c | 44 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9a9e588..ff5a5a3 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -201,6 +201,27 @@ For 32-bit we have the following conventions - kernel is built with
 	.byte 0xf1
 	.endm
 
+	/*
+	 * This is a sneaky trick to help the unwinder find pt_regs on the
+	 * stack.  The frame pointer is replaced with an encoded pointer to
+	 * pt_regs.  The encoding is just a clearing of the highest-order bit,
+	 * which makes it an invalid address and is also a signal to the
+	 * unwinder that it's a pt_regs pointer in disguise.
+	 *
+	 * NOTE: This must be called *after* SAVE_EXTRA_REGS because it
+	 * corrupts rbp.
+	 */
+.macro ENCODE_FRAME_POINTER ptregs_offset=0
+#ifdef CONFIG_FRAME_POINTER
+	.if \ptregs_offset
+		leaq \ptregs_offset(%rsp), %rbp
+	.else
+		mov %rsp, %rbp
+	.endif
+	btr $63, %rbp
+#endif
+.endm
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index b846875..7f492e2 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -431,6 +431,7 @@ END(irq_entries_start)
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
+	ENCODE_FRAME_POINTER
 
 	testb	$3, CS(%rsp)
 	jz	1f
@@ -893,6 +894,7 @@ ENTRY(xen_failsafe_callback)
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
+	ENCODE_FRAME_POINTER
 	jmp	error_exit
 END(xen_failsafe_callback)
 
@@ -936,6 +938,7 @@ ENTRY(paranoid_entry)
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
+	ENCODE_FRAME_POINTER 8
 	movl	$1, %ebx
 	movl	$MSR_GS_BASE, %ecx
 	rdmsr
@@ -983,6 +986,7 @@ ENTRY(error_entry)
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
+	ENCODE_FRAME_POINTER 8
 	xorl	%ebx, %ebx
 	testb	$3, CS+8(%rsp)
 	jz	.Lerror_kernelspace
@@ -1165,6 +1169,7 @@ ENTRY(nmi)
 	pushq	%r13		/* pt_regs->r13 */
 	pushq	%r14		/* pt_regs->r14 */
 	pushq	%r15		/* pt_regs->r15 */
+	ENCODE_FRAME_POINTER
 
 	/*
 	 * At this point we no longer need to worry about stack damage
@@ -1182,7 +1187,7 @@ ENTRY(nmi)
 	 * do_nmi doesn't modify pt_regs.
 	 */
 	SWAPGS
-	jmp	restore_c_regs_and_iret
+	jmp	restore_regs_and_iret
 
 .Lnmi_from_kernel:
 	/*
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index 61c6e95..6d461ee 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -14,6 +14,7 @@ struct unwind_state {
 	int graph_idx;
 #ifdef CONFIG_FRAME_POINTER
 	unsigned long *bp;
+	struct pt_regs *regs;
 #endif
 };
 
@@ -35,6 +36,11 @@ static inline unsigned long *unwind_get_return_address_ptr(struct unwind_state *
 
 unsigned long unwind_get_return_address(struct unwind_state *state);
 
+static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
+{
+	return state->regs;
+}
+
 #else /* !CONFIG_FRAME_POINTER */
 
 static inline unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
@@ -50,6 +56,11 @@ static inline unsigned long unwind_get_return_address(struct unwind_state *state
 	return *state->sp;
 }
 
+static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_FRAME_POINTER */
 
 static inline unsigned long *unwind_get_stack_ptr(struct unwind_state *state)
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index 1234480..2536353 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -18,6 +18,31 @@ unsigned long unwind_get_return_address(struct unwind_state *state)
 }
 EXPORT_SYMBOL_GPL(unwind_get_return_address);
 
+/*
+ * This determines if the frame pointer actually contains an encoded pointer to
+ * pt_regs on the stack.  See ENCODE_FRAME_POINTER.
+ */
+static struct pt_regs *decode_frame_pointer(struct unwind_state *state,
+					    unsigned long *bp)
+{
+	struct pt_regs *regs = (struct pt_regs *)bp;
+	unsigned long *task_begin = task_stack_page(state->task);
+	unsigned long *task_end   = task_stack_page(state->task) + THREAD_SIZE;
+
+	if (test_and_set_bit(BITS_PER_LONG - 1, (unsigned long *)&regs))
+		return NULL;
+
+	if (on_stack(&state->stack_info, regs, sizeof(*regs)))
+		return regs;
+
+	if ((unsigned long *)regs >= task_begin &&
+	    (unsigned long *)regs < task_end &&
+	    (unsigned long *)(regs + 1) <= task_end)
+		return regs;
+
+	return NULL;
+}
+
 static unsigned long *update_stack_state(struct unwind_state *state, void *addr,
 					 size_t len)
 {
@@ -58,14 +83,32 @@ static bool unwind_next_frame_bp(struct unwind_state *state, unsigned long *bp)
 	return true;
 }
 
+static bool unwind_next_frame_regs(struct unwind_state *state,
+				   struct pt_regs *regs)
+{
+	update_stack_state(state, regs, sizeof(*regs));
+	if (state->stack_info.type == STACK_TYPE_UNKNOWN)
+		return false;
+
+	state->regs = regs;
+
+	return unwind_next_frame_bp(state, (unsigned long *)regs->bp);
+}
+
 bool unwind_next_frame(struct unwind_state *state)
 {
+	struct pt_regs *regs;
 	unsigned long *bp;
 
+	state->regs = NULL;
+
 	if (unwind_done(state))
 		return false;
 
 	bp = (unsigned long *)*state->bp;
+	regs = decode_frame_pointer(state, bp);
+	if (regs)
+		return unwind_next_frame_regs(state, regs);
 
 	return unwind_next_frame_bp(state, bp);
 }
@@ -79,6 +122,7 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
 	state->task = task;
 	state->sp = sp;
 	state->bp = get_frame_pointer(task, regs);
+	state->regs = NULL;
 
 	get_stack_info(sp, state->task, &state->stack_info, &state->stack_mask);
 
-- 
2.7.4

* [PATCH 18/19] x86/dumpstack: print stack identifier on its own line
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (16 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 17/19] x86/entry/dumpstack: encode pt_regs pointer in frame pointer Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 21:21 ` [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack Josh Poimboeuf
  2016-07-23  0:22 ` [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Linus Torvalds
  19 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

show_trace_log_lvl() prints the stack id (e.g. "<IRQ>") without a
newline so that any stack address printed after it will appear on the
same line.  That causes the first stack address to be vertically
misaligned with the rest, making it visually cluttered and slightly
confusing:

  Call Trace:
   <IRQ> [<ffffffff814431c3>] dump_stack+0x86/0xc3
   [<ffffffff8100828b>] perf_callchain_kernel+0x14b/0x160
   [<ffffffff811e915f>] get_perf_callchain+0x15f/0x2b0
   ...
   <EOI> [<ffffffff8189c6c3>] ? _raw_spin_unlock_irq+0x33/0x60
   [<ffffffff810e1c84>] finish_task_switch+0xb4/0x250
   [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0

It will look worse once we start printing pt_regs registers found in the
middle of the stack:

  <IRQ> RIP: 0010:[<ffffffff8189c6c3>]  [<ffffffff8189c6c3>] _raw_spin_unlock_irq+0x33/0x60
  RSP: 0018:ffff88007876f720  EFLAGS: 00000206
  RAX: ffff8800786caa40 RBX: ffff88007d5da140 RCX: 0000000000000007
  ...

Improve readability by adding a newline to the stack name:

  Call Trace:
   <IRQ>
   [<ffffffff814431c3>] dump_stack+0x86/0xc3
   [<ffffffff8100828b>] perf_callchain_kernel+0x14b/0x160
   [<ffffffff811e915f>] get_perf_callchain+0x15f/0x2b0
   ...
   <EOI>
   [<ffffffff8189c6c3>] ? _raw_spin_unlock_irq+0x33/0x60
   [<ffffffff810e1c84>] finish_task_switch+0xb4/0x250
   [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0

Now that "continued" lines are no longer needed, we can also remove the
hack of using the empty string (aka KERN_CONT) and replace it with
KERN_DEFAULT.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/dumpstack.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 198dc9e..0eedb01 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -124,7 +124,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 		stack_type_str(stack_info.type, &str_begin, &str_end);
 		if (str_begin)
-			printk("%s <%s> ", log_lvl, str_begin);
+			printk("%s <%s>\n", log_lvl, str_begin);
 
 		/*
 		 * Scan the stack, printing any text addresses we find.  At the
@@ -176,7 +176,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		}
 
 		if (str_end)
-			printk("%s <%s> ", log_lvl, str_end);
+			printk("%s <%s>\n", log_lvl, str_end);
 	}
 }
 
@@ -193,12 +193,12 @@ void show_stack(struct task_struct *task, unsigned long *sp)
 		bp = (unsigned long)get_frame_pointer(current, NULL);
 	}
 
-	show_stack_log_lvl(task, NULL, sp, "");
+	show_stack_log_lvl(task, NULL, sp, KERN_DEFAULT);
 }
 
 void show_stack_regs(struct pt_regs *regs)
 {
-	show_stack_log_lvl(NULL, regs, NULL, "");
+	show_stack_log_lvl(NULL, regs, NULL, KERN_DEFAULT);
 }
 
 static arch_spinlock_t die_lock = __ARCH_SPIN_LOCK_UNLOCKED;
-- 
2.7.4

* [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (17 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 18/19] x86/dumpstack: print stack identifier on its own line Josh Poimboeuf
@ 2016-07-21 21:21 ` Josh Poimboeuf
  2016-07-21 22:32   ` Andy Lutomirski
  2016-07-23  0:22 ` [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Linus Torvalds
  19 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-21 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

Now that we can find pt_regs registers in the middle of the stack due to
an interrupt or exception, we can print them.  Here's what it looks
like:

   ...
   [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
   [<ffffffff8189f558>] async_page_fault+0x28/0x30
  RIP: 0010:[<ffffffff814529e2>]  [<ffffffff814529e2>] __clear_user+0x42/0x70
  RSP: 0018:ffff88007876fd38  EFLAGS: 00010202
  RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
  RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
  RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
  R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
  R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
   [<ffffffff814529e2>] ? __clear_user+0x42/0x70
   [<ffffffff814529c3>] ? __clear_user+0x23/0x70
   [<ffffffff81452a7b>] clear_user+0x2b/0x40
   ...

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/dumpstack.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 0eedb01..4509866 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -173,6 +173,14 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 			 * printed as unreliable.
 			 */
 			unwind_next_frame(&state);
+
+			/*
+			 * If the previous frame had pt_regs associated with it
+			 * due to an interrupt or exception, print them.
+			 */
+			regs = unwind_get_entry_regs(&state);
+			if (regs)
+				__show_regs(regs, 0);
 		}
 
 		if (str_end)
-- 
2.7.4

* Re: [PATCH 15/19] x86/dumpstack: convert show_trace_log_lvl() to the new unwinder
  2016-07-21 21:21 ` [PATCH 15/19] x86/dumpstack: convert show_trace_log_lvl() " Josh Poimboeuf
@ 2016-07-21 21:49   ` Byungchul Park
  2016-07-22  1:38     ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Byungchul Park @ 2016-07-21 21:49 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker

On Thu, Jul 21, 2016 at 04:21:52PM -0500, Josh Poimboeuf wrote:
> Convert show_trace_log_lvl() to the new unwinder.  dump_trace() has been
> deprecated.
> 
> show_trace_log_lvl() is special compared to other users of the unwinder.
> It's the only place where both reliable *and* unreliable addresses are
> needed.  With frame pointers enabled, most stack walking code doesn't
> want to know about unreliable addresses.  But in this case, when we're
> dumping the stack to the console because something presumably went
> wrong, the unreliable addresses are useful:
> 
> - They show stale data on the stack which can provide useful clues.
> 
> - If something goes wrong with the unwinder, or if frame pointers are
>   corrupt or missing, all the stack addresses still get shown.
> 
> So in order to show all addresses on the stack, and at the same time
> figure out which addresses are reliable, we have to do the scanning and
> the unwinding in parallel.
> 
> The scanning is done with the help of get_stack_info() to traverse the
> stacks.  The unwinding is done separately by the new unwinder.
> 
> In theory we could simplify show_trace_log_lvl() by instead pushing some
> of this logic into the unwind code.  But then we would need some kind of
> "fake" frame logic in the unwinder which would add a lot of complexity
> and wouldn't be worth it in order to support only one user.
> 
> Another benefit of this approach is that once we have a DWARF unwinder,
> we should be able to just plug it in with minimal impact to this code.
> 
> Another change here is that callers of show_trace_log_lvl() don't need
> to provide the 'bp' argument.  The unwinder already finds the relevant
> frame pointer by unwinding until it reaches the first frame after the
> provided stack pointer.

Hello,

You seem to have changed a lot of the code that I dealt with in another
patch.  I should probably wait until yours is done.  I need to check yours
first anyway.

* Re: [PATCH 01/19] x86/dumpstack: remove show_trace()
  2016-07-21 21:21 ` [PATCH 01/19] x86/dumpstack: remove show_trace() Josh Poimboeuf
@ 2016-07-21 21:49   ` Andy Lutomirski
  0 siblings, 0 replies; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-21 21:49 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> There are a bewildering array of options for dumping the stack.
> Simplify things a little by removing show_trace(), which is unused.

Reviewed-by: Andy Lutomirski <luto@kernel.org>

* Re: [PATCH 02/19] x86/dumpstack: add get_stack_pointer() and get_frame_pointer()
  2016-07-21 21:21 ` [PATCH 02/19] x86/dumpstack: add get_stack_pointer() and get_frame_pointer() Josh Poimboeuf
@ 2016-07-21 21:53   ` Andy Lutomirski
  0 siblings, 0 replies; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-21 21:53 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> The various functions involved in dumping the stack all do similar
> things with regard to getting the stack pointer and the frame pointer
> based on the regs and task arguments.  Create helper functions to
> do that instead.
>

Reviewed-by: Andy Lutomirski <luto@kernel.org>

* Re: [PATCH 03/19] x86/dumpstack: remove unnecessary stack pointer arguments
  2016-07-21 21:21 ` [PATCH 03/19] x86/dumpstack: remove unnecessary stack pointer arguments Josh Poimboeuf
@ 2016-07-21 21:56   ` Andy Lutomirski
  2016-07-22  1:41     ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-21 21:56 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> When calling show_stack_log_lvl() or dump_trace() with a regs argument,
> providing a stack pointer or frame pointer is redundant.
>

> diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
> index 358fe1c..c533b8b 100644
> --- a/arch/x86/kernel/dumpstack_32.c
> +++ b/arch/x86/kernel/dumpstack_32.c
> @@ -122,7 +122,7 @@ void show_regs(struct pt_regs *regs)
>                 u8 *ip;
>
>                 pr_emerg("Stack:\n");
> -               show_stack_log_lvl(NULL, regs, &regs->sp, 0, KERN_EMERG);
> +               show_stack_log_lvl(NULL, regs, NULL, 0, KERN_EMERG);

This is weird -- note the &.  You're at some risk of exposing a bug in
x86_32's kernel_stack_pointer() function, which is a mess.  (I don't
see why it's written the way it is -- the actual return stack pointer
given a pt_regs is quite well defined -- if (regs->cs & 3) != 0, then
it's regs->sp, else it's &regs->sp.)

That being said, this isn't a big deal, so:

Reviewed-by: Andy Lutomirski <luto@kernel.org>

If you want to make this all a bit more reliable on x86_32, you could
fix kernel_stack_pointer().
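
The rule described above can be sketched in plain C.  This is only a
userspace model under stated assumptions, not the kernel's
kernel_stack_pointer(): struct fake_regs and entry_stack_pointer() are
hypothetical names, and the struct models only the cs and sp fields of
pt_regs.

```c
/* Hypothetical stand-in for the two pt_regs fields the rule needs. */
struct fake_regs {
	unsigned long cs;
	unsigned long sp;
};

/*
 * If the entry came from user mode (low two bits of cs are non-zero), the
 * saved sp field holds the interrupted stack pointer.  Otherwise the
 * interrupted stack pointer is the address of the sp slot itself.
 */
static unsigned long entry_stack_pointer(struct fake_regs *regs)
{
	if ((regs->cs & 3) != 0)
		return regs->sp;

	return (unsigned long)&regs->sp;
}
```

Note the parentheses around the mask: in C, != binds tighter than &, so
writing the test without them would check a different condition.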

* Re: [PATCH 07/19] x86/dumpstack: add IRQ_USABLE_STACK_SIZE define
  2016-07-21 21:21 ` [PATCH 07/19] x86/dumpstack: add IRQ_USABLE_STACK_SIZE define Josh Poimboeuf
@ 2016-07-21 22:01   ` Andy Lutomirski
  2016-07-22  1:48     ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-21 22:01 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> For reasons unknown, the x86_64 irq stack starts at an offset 64 bytes
> from the end of the page.  At least make that explicit.

This is a change in behavior -- see below.  Please mention this in the
changelog.

>
> FIXME: Can we just remove the 64 byte gap?  If not, at least document
> why.

I have no clue.

>
>         irq_stack_end   = (unsigned long *)(per_cpu(irq_stack_ptr, cpu));
> -       irq_stack       = (unsigned long *)(per_cpu(irq_stack_ptr, cpu) - IRQ_STACK_SIZE);
> +       irq_stack       = (unsigned long *)(per_cpu(irq_stack_ptr, cpu) -
> +                         IRQ_USABLE_STACK_SIZE);

This is different.
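
To make the difference concrete, here is a small arithmetic sketch.  The
numbers are assumptions for illustration only: a 16384-byte IRQ stack, a
64-byte gap as described in the changelog, and IRQ_USABLE_STACK_SIZE
defined as the stack size minus that gap (the real definition is not
quoted in this message).  The helper names are hypothetical.

```c
/* Assumed values for illustration; the real definitions are not quoted here. */
#define IRQ_STACK_SIZE		16384UL
#define IRQ_STACK_GAP		64UL
#define IRQ_USABLE_STACK_SIZE	(IRQ_STACK_SIZE - IRQ_STACK_GAP)

/* Hypothetical helper: where irq_stack_ptr sits for an allocation at base. */
static unsigned long irq_stack_top(unsigned long base)
{
	return base + IRQ_STACK_SIZE - IRQ_STACK_GAP;
}

/* Old computation: lands IRQ_STACK_GAP bytes below the allocation. */
static unsigned long old_irq_stack_begin(unsigned long top)
{
	return top - IRQ_STACK_SIZE;
}

/* New computation: lands exactly on the start of the allocation. */
static unsigned long new_irq_stack_begin(unsigned long top)
{
	return top - IRQ_USABLE_STACK_SIZE;
}
```

Under these assumptions the two lower bounds differ by exactly the gap, so
an address in the 64 bytes just below the allocation would be treated as
on the IRQ stack by the old bound but not by the new one.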

* Re: [PATCH 09/19] x86/dumpstack: simplify in_exception_stack()
  2016-07-21 21:21 ` [PATCH 09/19] x86/dumpstack: simplify in_exception_stack() Josh Poimboeuf
@ 2016-07-21 22:05   ` Andy Lutomirski
  0 siblings, 0 replies; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-21 22:05 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> in_exception_stack() does some bad, bad things just so the unwinder can
> print different values for different areas of the debug exception stack.
>
> There's no need to clarify where exactly on the stack it is.  Just print
> "#DB" and be done with it.

This is a huge improvement.

However: could you add a comment clarifying what purpose visit_mask serves?

FWIW, I have patches that remove the extra debug stacks, and they'll
be nicer on top of this.

--Andy

* Re: [PATCH 17/19] x86/entry/dumpstack: encode pt_regs pointer in frame pointer
  2016-07-21 21:21 ` [PATCH 17/19] x86/entry/dumpstack: encode pt_regs pointer in frame pointer Josh Poimboeuf
@ 2016-07-21 22:27   ` Andy Lutomirski
  0 siblings, 0 replies; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-21 22:27 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> With frame pointers, when a task is interrupted, its stack is no longer
> completely reliable because the function could have been interrupted
> before it had a chance to save the previous frame pointer on the stack.
> So the caller of the interrupted function could get skipped by a stack
> trace.
>
> This is problematic for live patching, which needs to know whether a
> stack trace of a sleeping task can be relied upon.  There's currently no
> way to detect if a sleeping task was interrupted by a page fault
> exception or preemption before it went to sleep.
>
> Another issue is that when dumping the stack of an interrupted task, the
> unwinder has no way of knowing where the saved pt_regs registers are, so
> it can't print them.
>
> This solves those issues by encoding the pt_regs pointer in the frame
> pointer on entry from an interrupt or an exception.  The frame pointer
> unwinder is also updated to decode it.
>
> Suggested-by: Andy Lutomirski <luto@amacapital.net>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>

>
> +/*
> + * This determines if the frame pointer actually contains an encoded pointer to
> + * pt_regs on the stack.  See ENCODE_FRAME_POINTER.
> + */
> +static struct pt_regs *decode_frame_pointer(struct unwind_state *state,
> +                                           unsigned long *bp)
> +{
> +       struct pt_regs *regs = (struct pt_regs *)bp;
> +       unsigned long *task_begin = task_stack_page(state->task);
> +       unsigned long *task_end   = task_stack_page(state->task) + THREAD_SIZE;
> +
> +       if (test_and_set_bit(BITS_PER_LONG - 1, (unsigned long *)&regs))
> +               return NULL;

test_and_set_bit is a fairly heavyweight atomic operation.  It's
probably better to use plain C bitwise ops.
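
A minimal sketch of what the non-atomic version might look like, modelled
in userspace with 64-bit integers rather than kernel pointers.  The names
PTREGS_ENCODE_BIT, encode_ptregs_bits() and decode_ptregs_bits() are
hypothetical, and the on-stack range checks from the patch are left out.

```c
#include <stdint.h>

/* Highest-order bit: cleared when a pt_regs pointer is encoded. */
#define PTREGS_ENCODE_BIT	(UINT64_C(1) << 63)

/* Mirrors the entry-side encoding: clear the top bit of the address. */
static uint64_t encode_ptregs_bits(uint64_t regs_addr)
{
	return regs_addr & ~PTREGS_ENCODE_BIT;
}

/*
 * Returns the decoded address, or 0 if the top bit was already set, in
 * which case bp is treated as an ordinary frame pointer.
 */
static uint64_t decode_ptregs_bits(uint64_t bp)
{
	if (bp & PTREGS_ENCODE_BIT)
		return 0;

	return bp | PTREGS_ENCODE_BIT;
}
```

Since regs is a local variable here, nothing else can observe it, so the
plain test and OR are enough.  A zero bp would still decode to a non-zero
value in this sketch, so the subsequent range checks remain necessary.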

* Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
  2016-07-21 21:21 ` [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack Josh Poimboeuf
@ 2016-07-21 22:32   ` Andy Lutomirski
  2016-07-22  3:30     ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-21 22:32 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> Now that we can find pt_regs registers in the middle of the stack due to
> an interrupt or exception, we can print them.  Here's what it looks
> like:
>
>    ...
>    [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
>    [<ffffffff8189f558>] async_page_fault+0x28/0x30
>   RIP: 0010:[<ffffffff814529e2>]  [<ffffffff814529e2>] __clear_user+0x42/0x70
>   RSP: 0018:ffff88007876fd38  EFLAGS: 00010202
>   RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
>   RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
>   RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
>   R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
>   R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
>    [<ffffffff814529e2>] ? __clear_user+0x42/0x70
>    [<ffffffff814529c3>] ? __clear_user+0x23/0x70
>    [<ffffffff81452a7b>] clear_user+0x2b/0x40
>    ...

This looks wrong.  Here are some theories:

(a) __clear_user is a reliable address that is indicated by RIP: ....
Then it's found again as an unreliable address as "?
__clear_user+0x42/0x70" by scanning the stack.  "?
__clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
In this case, shouldn't "? __clear_user+0x42/0x70" have been
suppressed because it matched a reliable address?

(b) You actually intended for all the addresses to be printed, in
which case "? __clear_user+0x42/0x70" should have been
"__clear_user+0x42/0x70" and you have a bug.  In this case, it's
plausible that your state machine got a bit lost leading to "?
__clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
it's a real frame and you didn't find it).

(c) Something else and I'm confused.

--Andy

* Re: [PATCH 15/19] x86/dumpstack: convert show_trace_log_lvl() to the new unwinder
  2016-07-21 21:49   ` Byungchul Park
@ 2016-07-22  1:38     ` Josh Poimboeuf
  0 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-22  1:38 UTC (permalink / raw)
  To: Byungchul Park
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker

On Fri, Jul 22, 2016 at 06:49:01AM +0900, Byungchul Park wrote:
> On Thu, Jul 21, 2016 at 04:21:52PM -0500, Josh Poimboeuf wrote:
> > Convert show_trace_log_lvl() to the new unwinder.  dump_trace() has been
> > deprecated.
> > 
> > show_trace_log_lvl() is special compared to other users of the unwinder.
> > It's the only place where both reliable *and* unreliable addresses are
> > needed.  With frame pointers enabled, most stack walking code doesn't
> > want to know about unreliable addresses.  But in this case, when we're
> > dumping the stack to the console because something presumably went
> > wrong, the unreliable addresses are useful:
> > 
> > - They show stale data on the stack which can provide useful clues.
> > 
> > - If something goes wrong with the unwinder, or if frame pointers are
> >   corrupt or missing, all the stack addresses still get shown.
> > 
> > So in order to show all addresses on the stack, and at the same time
> > figure out which addresses are reliable, we have to do the scanning and
> > the unwinding in parallel.
> > 
> > The scanning is done with the help of get_stack_info() to traverse the
> > stacks.  The unwinding is done separately by the new unwinder.
> > 
> > In theory we could simplify show_trace_log_lvl() by instead pushing some
> > of this logic into the unwind code.  But then we would need some kind of
> > "fake" frame logic in the unwinder which would add a lot of complexity
> > and wouldn't be worth it in order to support only one user.
> > 
> > Another benefit of this approach is that once we have a DWARF unwinder,
> > we should be able to just plug it in with minimal impact to this code.
> > 
> > Another change here is that callers of show_trace_log_lvl() don't need
> > to provide the 'bp' argument.  The unwinder already finds the relevant
> > frame pointer by unwinding until it reaches the first frame after the
> > provided stack pointer.
> 
> Hello,
> 
> You seem to have changed a lot of the code that I dealt with in another
> patch.  I should probably wait until yours is done.  I need to look at
> yours first anyway.

Yeah, that's why I added you to cc :-)  I think this obsoletes your
patches.

-- 
Josh


* Re: [PATCH 03/19] x86/dumpstack: remove unnecessary stack pointer arguments
  2016-07-21 21:56   ` Andy Lutomirski
@ 2016-07-22  1:41     ` Josh Poimboeuf
  2016-07-22  2:29       ` Andy Lutomirski
  2016-07-22  3:08       ` Brian Gerst
  0 siblings, 2 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-22  1:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 02:56:52PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > When calling show_stack_log_lvl() or dump_trace() with a regs argument,
> > providing a stack pointer or frame pointer is redundant.
> >
> 
> > diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
> > index 358fe1c..c533b8b 100644
> > --- a/arch/x86/kernel/dumpstack_32.c
> > +++ b/arch/x86/kernel/dumpstack_32.c
> > @@ -122,7 +122,7 @@ void show_regs(struct pt_regs *regs)
> >                 u8 *ip;
> >
> >                 pr_emerg("Stack:\n");
> > -               show_stack_log_lvl(NULL, regs, &regs->sp, 0, KERN_EMERG);
> > +               show_stack_log_lvl(NULL, regs, NULL, 0, KERN_EMERG);
> 
> This is weird -- note the &.  You're at some risk of exposing a bug in
> x86_32's kernel_stack_pointer() function, which is a mess.  (I don't
> see why it's written the way it is -- the actual return stack pointer
> given a pt_regs is quite well defined -- if (regs->cs & 3) != 0, then
> it's regs->sp, else it's &regs->sp.)
> 
> That being said, this isn't a big deal, so:
> 
> Reviewed-by: Andy Lutomirski <luto@kernel.org>
> 
> If you want to make this all a bit more reliable on x86_32, you could
> fix kernel_stack_pointer().

Ok.  The whole '&regs->sp' thing threw me for a loop.  I have no idea
what kernel_stack_pointer() is trying to do.  I just assumed it was
correct.  I'll take a look at it and try to fix it in another patch.

-- 
Josh


* Re: [PATCH 07/19] x86/dumpstack: add IRQ_USABLE_STACK_SIZE define
  2016-07-21 22:01   ` Andy Lutomirski
@ 2016-07-22  1:48     ` Josh Poimboeuf
  2016-07-22  8:24       ` Ingo Molnar
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-22  1:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 03:01:10PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > For reasons unknown, the x86_64 irq stack starts at an offset 64 bytes
> > from the end of the page.  At least make that explicit.
> 
> This is a change in behavior -- see below.  Please mention this in the
> changelog.

Ah, right.

> 
> >
> > FIXME: Can we just remove the 64 byte gap?  If not, at least document
> > why.
> 
> I have no clue.
> 
> >
> >         irq_stack_end   = (unsigned long *)(per_cpu(irq_stack_ptr, cpu));
> > -       irq_stack       = (unsigned long *)(per_cpu(irq_stack_ptr, cpu) - IRQ_STACK_SIZE);
> > +       irq_stack       = (unsigned long *)(per_cpu(irq_stack_ptr, cpu) -
> > +                         IRQ_USABLE_STACK_SIZE);
> 
> This is different.

If nobody knows the reason for it, I may just remove it.  It doesn't
seem to blow anything up on my system.  I tried digging through the git
history but it's been there since the beginning of git time.
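
The behavior change being discussed can be sketched with plain arithmetic.
This is a toy model, not the kernel code: it assumes IRQ_USABLE_STACK_SIZE is
IRQ_STACK_SIZE minus the 64-byte gap, that irq_stack_ptr sits 64 bytes below
the end of the stack area, and a 16 KiB IRQ_STACK_SIZE.  The thread shows only
the use of the define, so all three are assumptions, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; the thread does not state them. */
#define IRQ_STACK_SIZE        16384UL
#define IRQ_STACK_GAP         64UL
#define IRQ_USABLE_STACK_SIZE (IRQ_STACK_SIZE - IRQ_STACK_GAP)

struct irq_stack_bounds {
	uintptr_t begin;	/* lowest address treated as irq stack */
	uintptr_t end;		/* irq_stack_ptr, the initial stack pointer */
};

/* Hypothetical: irq_stack_ptr for a stack area allocated at 'base'. */
static uintptr_t irq_stack_ptr_for(uintptr_t base)
{
	return base + IRQ_STACK_SIZE - IRQ_STACK_GAP;
}

/* Old calculation: subtract the full stack size from irq_stack_ptr. */
static struct irq_stack_bounds bounds_old(uintptr_t irq_stack_ptr)
{
	struct irq_stack_bounds b;

	assert(irq_stack_ptr >= IRQ_STACK_SIZE);
	b.begin = irq_stack_ptr - IRQ_STACK_SIZE;
	b.end = irq_stack_ptr;
	return b;
}

/* New calculation: subtract only the usable size. */
static struct irq_stack_bounds bounds_new(uintptr_t irq_stack_ptr)
{
	struct irq_stack_bounds b;

	assert(irq_stack_ptr >= IRQ_USABLE_STACK_SIZE);
	b.begin = irq_stack_ptr - IRQ_USABLE_STACK_SIZE;
	b.end = irq_stack_ptr;
	return b;
}
```

Under those assumptions the old lower bound lands 64 bytes below the start of
the stack area, while the new one lands exactly on it, which is why the change
is not purely cosmetic.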

-- 
Josh


* Re: [PATCH 03/19] x86/dumpstack: remove unnecessary stack pointer arguments
  2016-07-22  1:41     ` Josh Poimboeuf
@ 2016-07-22  2:29       ` Andy Lutomirski
  2016-07-22  3:08       ` Brian Gerst
  1 sibling, 0 replies; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-22  2:29 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 6:41 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Jul 21, 2016 at 02:56:52PM -0700, Andy Lutomirski wrote:
>> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > When calling show_stack_log_lvl() or dump_trace() with a regs argument,
>> > providing a stack pointer or frame pointer is redundant.
>> >
>>
>> > diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
>> > index 358fe1c..c533b8b 100644
>> > --- a/arch/x86/kernel/dumpstack_32.c
>> > +++ b/arch/x86/kernel/dumpstack_32.c
>> > @@ -122,7 +122,7 @@ void show_regs(struct pt_regs *regs)
>> >                 u8 *ip;
>> >
>> >                 pr_emerg("Stack:\n");
>> > -               show_stack_log_lvl(NULL, regs, &regs->sp, 0, KERN_EMERG);
>> > +               show_stack_log_lvl(NULL, regs, NULL, 0, KERN_EMERG);
>>
>> This is weird -- note the &.  You're at some risk of exposing a bug in
>> x86_32's kernel_stack_pointer() function, which is a mess.  (I don't
>> see why it's written the way it is -- the actual return stack pointer
>> given a pt_regs is quite well defined -- if (regs->cs & 3) != 0, then
>> it's regs->sp, else it's &regs->sp.)
>>
>> That being said, this isn't a big deal, so:
>>
>> Reviewed-by: Andy Lutomirski <luto@kernel.org>
>>
>> If you want to make this all a bit more reliable on x86_32, you could
>> fix kernel_stack_pointer().
>
> Ok.  The whole '&regs->sp' thing threw me for a loop.  I have no idea
> what kernel_stack_pointer() is trying to do.  I just assumed it was
> correct.  I'll take a look at it and try to fix it in another patch.
>

On further inspection, it's probably correct except in cases of stack
overflow, so I wouldn't worry about it.  It's certainly
overcomplicated.

--Andy


* Re: [PATCH 03/19] x86/dumpstack: remove unnecessary stack pointer arguments
  2016-07-22  1:41     ` Josh Poimboeuf
  2016-07-22  2:29       ` Andy Lutomirski
@ 2016-07-22  3:08       ` Brian Gerst
  1 sibling, 0 replies; 91+ messages in thread
From: Brian Gerst @ 2016-07-22  3:08 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, linux-kernel, Linus Torvalds, Steven Rostedt, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 9:41 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Jul 21, 2016 at 02:56:52PM -0700, Andy Lutomirski wrote:
>> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > When calling show_stack_log_lvl() or dump_trace() with a regs argument,
>> > providing a stack pointer or frame pointer is redundant.
>> >
>>
>> > diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
>> > index 358fe1c..c533b8b 100644
>> > --- a/arch/x86/kernel/dumpstack_32.c
>> > +++ b/arch/x86/kernel/dumpstack_32.c
>> > @@ -122,7 +122,7 @@ void show_regs(struct pt_regs *regs)
>> >                 u8 *ip;
>> >
>> >                 pr_emerg("Stack:\n");
>> > -               show_stack_log_lvl(NULL, regs, &regs->sp, 0, KERN_EMERG);
>> > +               show_stack_log_lvl(NULL, regs, NULL, 0, KERN_EMERG);
>>
>> This is weird -- note the &.  You're at some risk of exposing a bug in
>> x86_32's kernel_stack_pointer() function, which is a mess.  (I don't
>> see why it's written the way it is -- the actual return stack pointer
>> given a pt_regs is quite well defined -- if (regs->cs & 3) != 0, then
>> it's regs->sp, else it's &regs->sp.)
>>
>> That being said, this isn't a big deal, so:
>>
>> Reviewed-by: Andy Lutomirski <luto@kernel.org>
>>
>> If you want to make this all a bit more reliable on x86_32, you could
>> fix kernel_stack_pointer().
>
> Ok.  The whole '&regs->sp' thing threw me for a loop.  I have no idea
> what kernel_stack_pointer() is trying to do.  I just assumed it was
> correct.  I'll take a look at it and try to fix it in another patch.

On 32-bit, when an interrupt doesn't change CPL, SS:ESP is not pushed.
So, effectively, the old stack pointer is &regs->sp.
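
A minimal sketch of that rule, using a toy stand-in for the tail of the 32-bit
pt_regs.  The struct and function names are hypothetical, and this is not the
kernel's kernel_stack_pointer(); it only encodes the CPL test described in
this thread and ignores every other case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the hardware-saved part of a 32-bit pt_regs. */
struct toy_pt_regs {
	uint32_t ip;
	uint32_t cs;
	uint32_t flags;
	uint32_t sp;	/* pushed by the CPU only when the CPL changes */
	uint32_t ss;	/* likewise */
};

/* Return the stack pointer that was live when the interrupt hit. */
static uintptr_t toy_stack_pointer(const struct toy_pt_regs *regs)
{
	assert(regs != NULL);

	/* Low two bits of CS are the privilege level; non-zero means user. */
	if ((regs->cs & 3) != 0)
		return regs->sp;

	/*
	 * Same privilege level: SS:ESP was never pushed, so the slot where
	 * 'sp' would live is really the top of the interrupted stack.
	 */
	return (uintptr_t)&regs->sp;
}
```

For a frame with a user-mode CS the saved sp value is returned; for a frame
with a kernel CS the result is the address of the sp slot itself.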

--
Brian Gerst


* Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
  2016-07-21 22:32   ` Andy Lutomirski
@ 2016-07-22  3:30     ` Josh Poimboeuf
  2016-07-22  5:13       ` Andy Lutomirski
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-22  3:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > Now that we can find pt_regs registers in the middle of the stack due to
> > an interrupt or exception, we can print them.  Here's what it looks
> > like:
> >
> >    ...
> >    [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
> >    [<ffffffff8189f558>] async_page_fault+0x28/0x30
> >   RIP: 0010:[<ffffffff814529e2>]  [<ffffffff814529e2>] __clear_user+0x42/0x70
> >   RSP: 0018:ffff88007876fd38  EFLAGS: 00010202
> >   RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
> >   RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
> >   RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
> >   R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
> >   R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
> >    [<ffffffff814529e2>] ? __clear_user+0x42/0x70
> >    [<ffffffff814529c3>] ? __clear_user+0x23/0x70
> >    [<ffffffff81452a7b>] clear_user+0x2b/0x40
> >    ...
> 
> This looks wrong.  Here are some theories:
> 
> (a) __clear_user is a reliable address that is indicated by RIP: ....
> Then it's found again as an unreliable address as "?
> __clear_user+0x42/0x70" by scanning the stack.  "?
> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
> In this case, shouldn't "? __clear_user+0x42/0x70" have been
> suppressed because it matched a reliable address?
> 
> (b) You actually intended for all the addresses to be printed, in
> which case "? __clear_user+0x42/0x70" should have been
> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
> plausible that your state machine got a bit lost leading to "?
> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
> it's a real frame and you didn't find it).
> 
> (c) Something else and I'm confused.

So there's a subtle difference between addresses reported by regs->ip
and normal return addresses.  For example:

   ...
   [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50
   [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0
  RIP: 0010:[<ffffffff8129b350>]  [<ffffffff8129b350>] path_init+0x0/0x750
  RSP: 0018:ffff880036a3fd80  EFLAGS: 00000296
  RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08
  RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0
  RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010
  R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0
  R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000
   <EOI>
   [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0
   [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120
   [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180
   ...

In this case the irq hit right after path_lookupat() called into
path_init().  So the "path_init+0x0" printed by __show_regs() is right.

Note the backtrace reports the same address, but it instead describes it
as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
because normally, such an address after a call instruction at the end of
a function would indicate a tail call (e.g., to a noreturn function).
If that were the case, printing "path_init+0x0" would be completely
wrong, because path_init() just happens to be the function located
immediately after lookup_fast().

Maybe I could add some special logic to say: "if this return address was
from a call, use printk_stack_address(); else if it was from a pt_regs,
use printk_address()."  (The former prints the preceding function, the
latter prints the current function.)  Then we could remove the question
mark.
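
To make the difference concrete, here is a toy symbol resolver.  It assumes
the "preceding function" behavior comes from looking up the address minus one,
which is an assumption about the mechanism; only the effect is described
above.  The table holds just the two symbols from the example, with start
addresses derived from the offsets in the dump, and all toy_* names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_sym {
	uint64_t start;
	uint64_t size;
	const char *name;
};

/* lookup_fast ends exactly where path_init begins. */
static const struct toy_sym toy_syms[] = {
	{ 0xffffffff8129af80ULL, 0x3d0, "lookup_fast" },
	{ 0xffffffff8129b350ULL, 0x750, "path_init" },
};

struct toy_resolved {
	const struct toy_sym *sym;	/* NULL if no symbol matched */
	uint64_t offset;		/* addr - sym->start */
};

/* Find the symbol whose [start, start + size) range contains 'key'. */
static const struct toy_sym *toy_lookup(uint64_t key)
{
	for (size_t i = 0; i < sizeof(toy_syms) / sizeof(toy_syms[0]); i++) {
		const struct toy_sym *s = &toy_syms[i];

		if (key >= s->start && key - s->start < s->size)
			return s;
	}
	return NULL;
}

/* An ip taken from pt_regs: resolve the address itself. */
static struct toy_resolved toy_resolve_ip(uint64_t addr)
{
	struct toy_resolved r = { toy_lookup(addr), 0 };

	if (r.sym)
		r.offset = addr - r.sym->start;
	return r;
}

/* A return address: resolve addr - 1 so it stays within the caller. */
static struct toy_resolved toy_resolve_return(uint64_t addr)
{
	struct toy_resolved r = { NULL, 0 };

	assert(addr != 0);
	r.sym = toy_lookup(addr - 1);
	if (r.sym)
		r.offset = addr - r.sym->start;
	return r;
}
```

With address ffffffff8129b350, the first resolver gives path_init+0x0 and the
second gives lookup_fast+0x3d0, matching the two lines in the dump.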

There's also the question of whether or not the address should be
printed again, after it's already been printed by __show_regs().  I
don't have a strong opinion either way.

-- 
Josh


* Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
  2016-07-22  3:30     ` Josh Poimboeuf
@ 2016-07-22  5:13       ` Andy Lutomirski
  2016-07-22 15:57         ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-22  5:13 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
>> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > Now that we can find pt_regs registers in the middle of the stack due to
>> > an interrupt or exception, we can print them.  Here's what it looks
>> > like:
>> >
>> >    ...
>> >    [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
>> >    [<ffffffff8189f558>] async_page_fault+0x28/0x30
>> >   RIP: 0010:[<ffffffff814529e2>]  [<ffffffff814529e2>] __clear_user+0x42/0x70
>> >   RSP: 0018:ffff88007876fd38  EFLAGS: 00010202
>> >   RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
>> >   RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
>> >   RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
>> >   R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
>> >   R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
>> >    [<ffffffff814529e2>] ? __clear_user+0x42/0x70
>> >    [<ffffffff814529c3>] ? __clear_user+0x23/0x70
>> >    [<ffffffff81452a7b>] clear_user+0x2b/0x40
>> >    ...
>>
>> This looks wrong.  Here are some theories:
>>
>> (a) __clear_user is a reliable address that is indicated by RIP: ....
>> Then it's found again as an unreliable address as "?
>> __clear_user+0x42/0x70" by scanning the stack.  "?
>> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
>> In this case, shouldn't "? __clear_user+0x42/0x70" have been
>> suppressed because it matched a reliable address?
>>
>> (b) You actually intended for all the addresses to be printed, in
>> which case "? __clear_user+0x42/0x70" should have been
>> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
>> plausible that your state machine got a bit lost leading to "?
>> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
>> it's a real frame and you didn't find it).
>>
>> (c) Something else and I'm confused.
>
> So there's a subtle difference between addresses reported by regs->ip
> and normal return addresses.  For example:
>
>    ...
>    [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50
>    [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0
>   RIP: 0010:[<ffffffff8129b350>]  [<ffffffff8129b350>] path_init+0x0/0x750
>   RSP: 0018:ffff880036a3fd80  EFLAGS: 00000296
>   RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08
>   RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0
>   RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010
>   R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0
>   R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000
>    <EOI>
>    [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0
>    [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120
>    [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180
>    ...
>
> In this case the irq hit right after path_lookupat() called into
> path_init().  So the "path_init+0x0" printed by __show_regs() is right.
>
> Note the backtrace reports the same address, but it instead describes it
> as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
> because normally, such an address after a call instruction at the end of
> a function would indicate a tail call (e.g., to a noreturn function).
> If that were the case, printing "path_init+0x0" would be completely
> wrong, because path_init() just happens to be the function located
> immediately after lookup_fast().
>
> Maybe I could add some special logic to say: "if this return address was
> from a call, use printk_stack_address(); else if it was from a pt_regs,
> use printk_address()."  (The former prints the preceding function, the
> latter prints the current function.)  Then we could remove the question
> mark.
>
> There's also the question of whether or not the address should be
> printed again, after it's already been printed by __show_regs().  I
> don't have a strong opinion either way.
>

IIRC we don't show the actual faulting function in the call trace, so
we probably shouldn't duplicate the entry after the show_regs.

That being said, I'm still confused by the question marks.  What
exactly is going on?  Is the code really doing the right thing wrt
resuming the unwind?  Is there a git tree with these patches applied
somewhere so I can look at it easily in context?

--Andy


* Re: [PATCH 07/19] x86/dumpstack: add IRQ_USABLE_STACK_SIZE define
  2016-07-22  1:48     ` Josh Poimboeuf
@ 2016-07-22  8:24       ` Ingo Molnar
  0 siblings, 0 replies; 91+ messages in thread
From: Ingo Molnar @ 2016-07-22  8:24 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> > >         irq_stack_end   = (unsigned long *)(per_cpu(irq_stack_ptr, cpu));
> > > -       irq_stack       = (unsigned long *)(per_cpu(irq_stack_ptr, cpu) - IRQ_STACK_SIZE);
> > > +       irq_stack       = (unsigned long *)(per_cpu(irq_stack_ptr, cpu) -
> > > +                         IRQ_USABLE_STACK_SIZE);
> > 
> > This is different.
> 
> If nobody knows the reason for it, I may just remove it.  It doesn't
> seem to blow anything up on my system.  I tried digging through the git
> history but it's been there since the beginning of git time.

Please do any behavioral changes in separate patches - ordered after all the 'does 
not change behavior' low-risk patches.

I.e. try to order the patches by risk: (near-)zero-risk ones first, followed by 
lower risk ones, closed by higher risk ones. This makes review, application of 
the patches, and any later bisection/fixing work easier.

If you ever see a good chance to split a patch that changes behavior into a 
zero-risk and a low-risk component, do so - we'd rather err on the side of being 
too fine-grained in a series than having to scratch heads when bisecting to overly 
large patches.

Thanks,

	Ingo


* Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
  2016-07-22  5:13       ` Andy Lutomirski
@ 2016-07-22 15:57         ` Josh Poimboeuf
  2016-07-22 21:46           ` Andy Lutomirski
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-22 15:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > Now that we can find pt_regs registers in the middle of the stack due to
> >> > an interrupt or exception, we can print them.  Here's what it looks
> >> > like:
> >> >
> >> >    ...
> >> >    [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
> >> >    [<ffffffff8189f558>] async_page_fault+0x28/0x30
> >> >   RIP: 0010:[<ffffffff814529e2>]  [<ffffffff814529e2>] __clear_user+0x42/0x70
> >> >   RSP: 0018:ffff88007876fd38  EFLAGS: 00010202
> >> >   RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
> >> >   RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
> >> >   RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
> >> >   R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
> >> >   R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
> >> >    [<ffffffff814529e2>] ? __clear_user+0x42/0x70
> >> >    [<ffffffff814529c3>] ? __clear_user+0x23/0x70
> >> >    [<ffffffff81452a7b>] clear_user+0x2b/0x40
> >> >    ...
> >>
> >> This looks wrong.  Here are some theories:
> >>
> >> (a) __clear_user is a reliable address that is indicated by RIP: ....
> >> Then it's found again as an unreliable address as "?
> >> __clear_user+0x42/0x70" by scanning the stack.  "?
> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
> >> suppressed because it matched a reliable address?
> >>
> >> (b) You actually intended for all the addresses to be printed, in
> >> which case "? __clear_user+0x42/0x70" should have been
> >> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
> >> plausible that your state machine got a bit lost leading to "?
> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
> >> it's a real frame and you didn't find it).
> >>
> >> (c) Something else and I'm confused.
> >
> > So there's a subtle difference between addresses reported by regs->ip
> > and normal return addresses.  For example:
> >
> >    ...
> >    [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50
> >    [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0
> >   RIP: 0010:[<ffffffff8129b350>]  [<ffffffff8129b350>] path_init+0x0/0x750
> >   RSP: 0018:ffff880036a3fd80  EFLAGS: 00000296
> >   RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08
> >   RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0
> >   RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010
> >   R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0
> >   R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000
> >    <EOI>
> >    [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0
> >    [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120
> >    [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180
> >    ...
> >
> > In this case the irq hit right after path_lookupat() called into
> > path_init().  So the "path_init+0x0" printed by __show_regs() is right.
> >
> > Note the backtrace reports the same address, but it instead describes it
> > as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
> > because normally, such an address after a call instruction at the end of
> > a function would indicate a tail call (e.g., to a noreturn function).
> > If that were the case, printing "path_init+0x0" would be completely
> > wrong, because path_init() just happens to be the function located
> > immediately after lookup_fast().
> >
> > Maybe I could add some special logic to say: "if this return address was
> > from a call, use printk_stack_address(); else if it was from a pt_regs,
> > use printk_address()."  (The former prints the preceding function, the
> > latter prints the current function.)  Then we could remove the question
> > mark.
> >
> > There's also the question of whether or not the address should be
> > printed again, after it's already been printed by __show_regs().  I
> > don't have a strong opinion either way.
> >
> 
> IIRC we don't show the actual faulting function in the call trace, so
> we probably shouldn't duplicate the entry after the show_regs.

Just to clarify, that's true today for cases where the stack dump starts
from a handler which has regs.  It starts dumping based on regs->ip and
regs->bp, so the regs themselves aren't dumped.

But for cases where regs are in the middle of the stack, they aren't
detected today, and you'll still see the value of regs->ip dumped with a
question mark.

That said, with this patch, now that regs in the middle of the stack
*are* being printed, I can't think of a good reason to print the return
address twice: both in regs and the stack trace.  So removing it from
the stack trace is fine with me.

> That being said, I'm still confused by the question marks.  What
> exactly is going on?  Is the code really doing the right thing wrt
> resuming the unwind?  Is there a git tree with these patches applied
> somewhere so I can look at it easily in context?

show_trace_log_lvl() is doing two things in parallel: scanning all
kernel text addresses on the stack while simultaneously using the
unwinder to walk the frame pointers.  Only those scanned addresses which
are also found by the unwinder are printed without question marks.

The pt_regs aren't in a frame of their own; they're just data inside of
a bigger frame.  (You may recall that you objected to my proposal to put
them in their own frame :-))  So that's why the address stored in
regs->ip was printed with a question mark: it's not in the header of a
real frame; it's just data.
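
The scan-plus-unwind idea can be sketched on a toy stack.  This is not the
code in show_trace_log_lvl(); it is a simplified model under stated
assumptions: the stack is an array of words, a frame at index f stores the
index of the caller's frame at f and the return address at f + 1, and "kernel
text" is an arbitrary address range.  All toy_* and TOY_* names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TEXT_START 0xffffffff81000000ULL
#define TOY_TEXT_END   0xffffffff82000000ULL
#define TOY_NO_FRAME   ((uint64_t)-1)

struct toy_entry {
	uint64_t addr;
	int reliable;	/* 0 means it would be printed with a '?' */
};

static int toy_is_text(uint64_t v)
{
	return v >= TOY_TEXT_START && v < TOY_TEXT_END;
}

/*
 * Scan every word of the stack for text addresses while stepping a
 * frame-pointer unwinder alongside.  An address is reliable only if it
 * sits in the return-address slot of the unwinder's current frame.
 */
static size_t toy_show_trace(const uint64_t *stack, size_t nwords,
			     size_t first_frame, struct toy_entry *out,
			     size_t max_out)
{
	uint64_t frame = first_frame;
	size_t n = 0;

	assert(stack != NULL && out != NULL);

	for (size_t i = 0; i < nwords && n < max_out; i++) {
		int reliable;

		if (!toy_is_text(stack[i]))
			continue;

		reliable = frame != TOY_NO_FRAME && i == frame + 1;
		out[n].addr = stack[i];
		out[n].reliable = reliable;
		n++;

		if (reliable) {
			/* Step to the caller's frame; stop on a bad link. */
			uint64_t next = stack[frame];

			if (next > frame && next < nwords - 1)
				frame = next;
			else
				frame = TOY_NO_FRAME;
		}
	}
	return n;
}
```

In this model a saved regs->ip lying in the middle of a frame is found by the
scan but never by the unwinder, so it comes out marked unreliable, which is
the question-mark case discussed above.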

I pushed the code out at:

  https://github.com/jpoimboe/linux unwind-v1

See show_trace_log_lvl() in arch/x86/kernel/dumpstack.c.

-- 
Josh


* Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
  2016-07-22 15:57         ` Josh Poimboeuf
@ 2016-07-22 21:46           ` Andy Lutomirski
  2016-07-22 22:20             ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-22 21:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
>> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
>> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > Now that we can find pt_regs registers in the middle of the stack due to
>> >> > an interrupt or exception, we can print them.  Here's what it looks
>> >> > like:
>> >> >
>> >> >    ...
>> >> >    [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
>> >> >    [<ffffffff8189f558>] async_page_fault+0x28/0x30
>> >> >   RIP: 0010:[<ffffffff814529e2>]  [<ffffffff814529e2>] __clear_user+0x42/0x70
>> >> >   RSP: 0018:ffff88007876fd38  EFLAGS: 00010202
>> >> >   RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
>> >> >   RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
>> >> >   RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
>> >> >   R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
>> >> >   R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
>> >> >    [<ffffffff814529e2>] ? __clear_user+0x42/0x70
>> >> >    [<ffffffff814529c3>] ? __clear_user+0x23/0x70
>> >> >    [<ffffffff81452a7b>] clear_user+0x2b/0x40
>> >> >    ...
>> >>
>> >> This looks wrong.  Here are some theories:
>> >>
>> >> (a) __clear_user is a reliable address that is indicated by RIP: ....
>> >> Then it's found again as an unreliable address as "?
>> >> __clear_user+0x42/0x70" by scanning the stack.  "?
>> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
>> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
>> >> suppressed because it matched a reliable address?
>> >>
>> >> (b) You actually intended for all the addresses to be printed, in
>> >> which case "? __clear_user+0x42/0x70" should have been
>> >> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
>> >> plausible that your state machine got a bit lost leading to "?
>> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
>> >> it's a real frame and you didn't find it).
>> >>
>> >> (c) Something else and I'm confused.
>> >
>> > So there's a subtle difference between addresses reported by regs->ip
>> > and normal return addresses.  For example:
>> >
>> >    ...
>> >    [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50
>> >    [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0
>> >   RIP: 0010:[<ffffffff8129b350>]  [<ffffffff8129b350>] path_init+0x0/0x750
>> >   RSP: 0018:ffff880036a3fd80  EFLAGS: 00000296
>> >   RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08
>> >   RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0
>> >   RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010
>> >   R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0
>> >   R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000
>> >    <EOI>
>> >    [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0
>> >    [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120
>> >    [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180
>> >    ...
>> >
>> > In this case the irq hit right after path_lookupat() called into
>> > path_init().  So the "path_init+0x0" printed by __show_regs() is right.
>> >
>> > Note the backtrace reports the same address, but it instead describes it
>> > as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
>> > because normally, such an address after a call instruction at the end of
>> > a function would indicate a tail call (e.g., to a noreturn function).
>> > If that were the case, printing "path_init+0x0" would be completely
>> > wrong, because path_init() just happens to be the function located
>> > immediately after lookup_fast().
>> >
>> > Maybe I could add some special logic to say: "if this return address was
>> > from a call, use printk_stack_address(); else if it was from a pt_regs,
>> > use printk_address()."  (The former prints the preceding function, the
>> > latter prints the current function.)  Then we could remove the question
>> > mark.
>> >
>> > There's also the question of whether or not the address should be
>> > printed again, after it's already been printed by __show_regs().  I
>> > don't have a strong opinion either way.
>> >
>>
>> IIRC we don't show the actual faulting function in the call trace, so
>> we probably shouldn't duplicate the entry after the show_regs.
>
> Just to clarify, that's true today for cases where the stack dump starts
> from a handler which has regs.  It starts dumping based on regs->ip and
> regs->bp, so the regs themselves aren't dumped.
>
> But for cases where regs are in the middle of the stack, they aren't
> detected today, and you'll still see the value of regs->ip dumped with a
> question mark.
>
> That said, with this patch, now that regs in the middle of the stack
> *are* being printed, I can't think of a good reason to print the return
> address twice: both in regs and the stack trace.  So removing it from
> the stack trace is fine with me.
>
>> That being said, I'm still confused by the question marks.  What
>> exactly is going on?  Is the code really doing the right thing wrt
>> resuming the unwind?  Is there a git tree with these patches applied
>> somewhere so I can look at it easily in context?
>
> show_trace_log_lvl() is doing two things in parallel: scanning all
> kernel text addresses on the stack while simultaneously using the
> unwinder to walk the frame pointers.  Only those scanned addresses which
> are also found by the unwinder are printed without question marks.
>
> The pt_regs aren't in a frame of their own; they're just data inside of
> a bigger frame.  (You may recall that you objected to my proposal to put
> them in their own frame :-))  So that's why the address stored in
> regs->ip was printed with a question mark: it's not in the header of a
> real frame; it's just data.

It wasn't the separate frame part I was objecting to -- it was their
encoding on the stack.  Maybe they should unwind as though they're a
separate frame.  For example, the unwind API could give the frame that
returns to apic_timer_interrupt+0x9e/0xb0 and then the next frame
could literally list regs->ip as its retaddr (and maybe that frame or
even the following one should be the one with non-NULL
unwind_get_entry_regs).

In some sense, the regs belong to the frame that got interrupted, not
the frame that did the interrupting.  But maybe that's backwards -- if
we have DWARF, then the regs correspond to the regs at the time of a
call, and those regs are reasonably likely to contain the arguments to
the called function.

But regardless of which way this goes, it seems quite awkward to me
that regs->ip never shows up as the return addr of any frame as
exposed by the unwind API.

--Andy

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
  2016-07-22 21:46           ` Andy Lutomirski
@ 2016-07-22 22:20             ` Josh Poimboeuf
  2016-07-22 23:18               ` Andy Lutomirski
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-22 22:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> > Now that we can find pt_regs registers in the middle of the stack due to
> >> >> > an interrupt or exception, we can print them.  Here's what it looks
> >> >> > like:
> >> >> >
> >> >> >    ...
> >> >> >    [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
> >> >> >    [<ffffffff8189f558>] async_page_fault+0x28/0x30
> >> >> >   RIP: 0010:[<ffffffff814529e2>]  [<ffffffff814529e2>] __clear_user+0x42/0x70
> >> >> >   RSP: 0018:ffff88007876fd38  EFLAGS: 00010202
> >> >> >   RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
> >> >> >   RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
> >> >> >   RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
> >> >> >   R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
> >> >> >   R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
> >> >> >    [<ffffffff814529e2>] ? __clear_user+0x42/0x70
> >> >> >    [<ffffffff814529c3>] ? __clear_user+0x23/0x70
> >> >> >    [<ffffffff81452a7b>] clear_user+0x2b/0x40
> >> >> >    ...
> >> >>
> >> >> This looks wrong.  Here are some theories:
> >> >>
> >> >> (a) __clear_user is a reliable address that is indicated by RIP: ....
> >> >> Then it's found again as an unreliable address as "?
> >> >> __clear_user+0x42/0x70" by scanning the stack.  "?
> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
> >> >> suppressed because it matched a reliable address?
> >> >>
> >> >> (b) You actually intended for all the addresses to be printed, in
> >> >> which case "? __clear_user+0x42/0x70" should have been
> >> >> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
> >> >> plausible that your state machine got a bit lost leading to "?
> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
> >> >> it's a real frame and you didn't find it).
> >> >>
> >> >> (c) Something else and I'm confused.
> >> >
> >> > So there's a subtle difference between addresses reported by regs->ip
> >> > and normal return addresses.  For example:
> >> >
> >> >    ...
> >> >    [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50
> >> >    [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0
> >> >   RIP: 0010:[<ffffffff8129b350>]  [<ffffffff8129b350>] path_init+0x0/0x750
> >> >   RSP: 0018:ffff880036a3fd80  EFLAGS: 00000296
> >> >   RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08
> >> >   RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0
> >> >   RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010
> >> >   R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0
> >> >   R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000
> >> >    <EOI>
> >> >    [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0
> >> >    [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120
> >> >    [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180
> >> >    ...
> >> >
> >> > In this case the irq hit right after path_lookupat() called into
> >> > path_init().  So the "path_init+0x0" printed by __show_regs() is right.
> >> >
> >> > Note the backtrace reports the same address, but it instead describes it
> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
> >> > because normally, such an address after a call instruction at the end of
> >> > a function would indicate a tail call (e.g., to a noreturn function).
> >> > If that were the case, printing "path_init+0x0" would be completely
> >> > wrong, because path_init() just happens to be the function located
> >> > immediately after lookup_fast().
> >> >
> >> > Maybe I could add some special logic to say: "if this return address was
> >> > from a call, use printk_stack_address(); else if it was from a pt_regs,
> >> > use printk_address()."  (The former prints the preceding function, the
> >> > latter prints the current function.)  Then we could remove the question
> >> > mark.
> >> >
> >> > There's also the question of whether or not the address should be
> >> > printed again, after it's already been printed by __show_regs().  I
> >> > don't have a strong opinion either way.
> >> >
> >>
> >> IIRC we don't show the actual faulting function in the call trace, so
> >> we probably shouldn't duplicate the entry after the show_regs.
> >
> > Just to clarify, that's true today for cases where the stack dump starts
> > from a handler which has regs.  It starts dumping based on regs->ip and
> > regs->bp, so the regs themselves aren't dumped.
> >
> > But for cases where regs are in the middle of the stack, they aren't
> > detected today, and you'll still see the value of regs->ip dumped with a
> > question mark.
> >
> > That said, with this patch, now that regs in the middle of the stack
> > *are* being printed, I can't think of a good reason to print the return
> > address twice: both in regs and the stack trace.  So removing it from
> > the stack trace is fine with me.
> >
> >> That being said, I'm still confused by the question marks.  What
> >> exactly is going on?  Is the code really doing the right thing wrt
> >> resuming the unwind?  Is there a git tree with these patches applied
> >> somewhere so I can look at it easily in context?
> >
> > show_trace_log_lvl() is doing two things in parallel: scanning all
> > kernel text addresses on the stack while simultaneously using the
> > unwinder to walk the frame pointers.  Only those scanned addresses which
> > are also found by the unwinder are printed without question marks.
> >
> > The pt_regs aren't in a frame of their own; they're just data inside of
> > a bigger frame.  (You may recall that you objected to my proposal to put
> > them in their own frame :-))  So that's why the address stored in
> > regs->ip was printed with a question mark: it's not in the header of a
> > real frame; it's just data.
> 
> It wasn't the separate frame part I was objecting to -- it was their
> encoding on the stack.  Maybe they should unwind as though they're a
> separate frame.  For example, the unwind API could give the frame that
> returns to apic_timer_interrupt+0x9e/0xb0 and then the next frame
> could literally list regs->ip as its retaddr (and maybe that frame or
> even the following one should be the one with non-NULL
> unwind_get_entry_regs).

Having the unwinder treat the pt_regs as a "fake" frame is problematic:

- As I described above, you can't treat regs->ip as a normal return
  address anyway.

- Also, for exceptions and nested interrupts, the regs are stored on the
  interrupting stack.  But for non-nested interrupts, they're stored on
  the thread stack.  So the regs aren't always on the same stack as the
  corresponding encoded pt_regs pointer.  Another issue is that there's
  not always a frame after the regs.  For those reasons, creating a
  "fake" frame abstraction in the state machine is quite a bit trickier
  than just dealing with those details in the only place that cares
  about them: show_trace_log_lvl().

> In some sense, the regs belong to the frame that got interrupted, not
> the frame that did the interrupting.  But maybe that's backwards -- if
> we have DWARF, then the regs correspond to the regs at the time of a
> call, and those regs are reasonably likely to contain the arguments to
> the called function.
> 
> But regardless of which way this goes, it seems quite awkward to me
> that regs->ip never shows up as the return addr of any frame as
> exposed by the unwind API.

Again, regs->ip is special.  It's not a call return address and we
shouldn't force it to be.
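
For illustration only (this is not code from the series): a minimal
userspace sketch of the two symbolization conventions discussed above.
The symbol table and the names lookup(), describe_regs_ip() and
describe_ret_addr() are hypothetical; the two function ranges are
derived from the offsets shown in the trace.  It assumes the
call-return convention resolves addr - 1 and then reports the offset of
addr itself, which is my understanding of what printk_stack_address()
ends up doing.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical two-entry symbol table; ranges derived from the trace. */
struct sym {
	const char *name;
	unsigned long long start;
	unsigned long long size;
};

static const struct sym symtab[] = {
	{ "lookup_fast", 0xffffffff8129af80ULL, 0x3d0 },
	{ "path_init",   0xffffffff8129b350ULL, 0x750 },
};

/* Return the symbol whose [start, start + size) range contains addr. */
static const struct sym *lookup(unsigned long long addr)
{
	size_t i;

	for (i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++) {
		if (addr >= symtab[i].start &&
		    addr - symtab[i].start < symtab[i].size)
			return &symtab[i];
	}
	return NULL;
}

/* regs->ip convention: addr is the next instruction to execute. */
static int describe_regs_ip(unsigned long long addr, char *buf, size_t len)
{
	const struct sym *s = lookup(addr);

	if (!s)
		return -1;
	snprintf(buf, len, "%s+0x%llx/0x%llx", s->name, addr - s->start,
		 s->size);
	return 0;
}

/*
 * Call-return convention: the call instruction ends at addr, so resolve
 * addr - 1 to find the calling function, then report the offset of addr.
 */
static int describe_ret_addr(unsigned long long addr, char *buf, size_t len)
{
	const struct sym *s = lookup(addr - 1);

	if (!s)
		return -1;
	snprintf(buf, len, "%s+0x%llx/0x%llx", s->name, addr - s->start,
		 s->size);
	return 0;
}
```

Fed the address 0xffffffff8129b350, the first helper names path_init
and the second names lookup_fast, which is the discrepancy visible in
the trace.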

-- 
Josh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
  2016-07-22 22:20             ` Josh Poimboeuf
@ 2016-07-22 23:18               ` Andy Lutomirski
  2016-07-22 23:30                 ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-22 23:18 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
>> It wasn't the separate frame part I was objecting to -- it was their
>> encoding on the stack.  Maybe they should unwind as though they're a
>> separate frame.  For example, the unwind API could give the frame that
>> returns to apic_timer_interrupt+0x9e/0xb0 and then the next frame
>> could literally list regs->ip as its retaddr (and maybe that frame or
>> even the following one should be the one with non-NULL
>> unwind_get_entry_regs).
>
> Having the unwinder treat the pt_regs as a "fake" frame is problematic:
>
> - As I described above, you can't treat regs->ip as a normal return
>   address anyway.
>
> - Also, for exceptions and nested interrupts, the regs are stored on the
>   interrupting stack.  But for non-nested interrupts, they're stored on
>   the thread stack.  So the regs aren't always on the same stack as the
>   corresponding encoded pt_regs pointer.  Another issue is that there's
>   not always a frame after the regs.  For those reasons, creating a
>   "fake" frame abstraction in the state machine is quite a bit trickier
>   than just dealing with those details in the only place that cares
>   about them: show_trace_log_lvl().
>
>> In some sense, the regs belong to the frame that got interrupted, not
>> the frame that did the interrupting.  But maybe that's backwards -- if
>> we have DWARF, then the regs correspond to the regs at the time of a
>> call, and those regs are reasonably likely to contain the arguments to
>> the called function.
>>
>> But regardless of which way this goes, it seems quite awkward to me
>> that regs->ip never shows up as the return addr of any frame as
>> exposed by the unwind API.
>
> Again, regs->ip is special.  It's not a call return address and we
> shouldn't force it to be.

This is only mostly true.  If the exception was a trap, then it is
(e.g. if a function ends in int3, then regs->ip will be off the end).
But that's just me being pedantic.

More relevantly, regs->ip is a reliable address indicating a function
that will be returned to if we ever return, and both
show_trace_log_lvl() and the livepatch stuff should interpret it as
such.  Whether this means the unwinder should change or
show_trace_log_lvl() should change isn't a big deal, but I think one
of them should change so we get this right.

>
> --
> Josh



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-21 21:21 ` [PATCH 10/19] x86/dumpstack: add get_stack_info() interface Josh Poimboeuf
@ 2016-07-22 23:26   ` Andy Lutomirski
  2016-07-22 23:52     ` Andy Lutomirski
  2016-07-22 23:54     ` Josh Poimboeuf
  0 siblings, 2 replies; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-22 23:26 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> valid_stack_ptr() is buggy: it assumes that all stacks are of size
> THREAD_SIZE, which is not true for exception stacks.  So the
> walk_stack() callbacks will need to know the location of the beginning
> of the stack as well as the end.
>
> Another issue is that in general the various features of a stack (type,
> size, next stack pointer, description string) are scattered around in
> various places throughout the stack dump code.

I finally figured out what visit_mask is.  But would it make more
sense to track it in the unwind state so that the unwinder can
directly make sure it doesn't start looping?

And please remove test_and_set_bit() -- it's pointlessly slow.
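
To illustrate the point (a userspace sketch, not code from the patch):
the visit mask is private to a single stack walk, so a plain
read-modify-write is enough and no atomic operation is needed.  The
helper name visit_test_and_set() is hypothetical; in the kernel, the
non-atomic __test_and_set_bit() would be the natural substitute.

```c
#include <stdbool.h>

/*
 * Non-atomic test-and-set on a caller-private mask: returns true if the
 * bit was already set (stack already visited), and sets it either way.
 */
static bool visit_test_and_set(unsigned int bit, unsigned long *mask)
{
	unsigned long m = 1UL << bit;
	bool was_set = (*mask & m) != 0;

	*mask |= m;
	return was_set;
}
```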

> +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
> +                            unsigned long *visit_mask)
> +{
> +       unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
> +       unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> +
> +       if (stack < begin || stack >= end)
> +               return false;
> +
> +       if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> +               return false;
> +
> +       info->type      = STACK_TYPE_IRQ;
> +       info->begin     = begin;
> +       info->end       = end;
> +       info->next      = (unsigned long *)*begin;

This works, but it's a bit magic.  I don't suppose we could get rid of
this ->next thing entirely and teach show_stack_log_lvl(), etc. to
move from stack to stack by querying the stack type of whatever the
frame base address is if the frame base address ends up being out of
bounds for the current stack?  Or maybe the unwinder could even do
this by itself.
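
A rough sketch of that idea, under stated assumptions (nothing here is
from the series): struct stack_range and find_stack() are hypothetical
names, and each stack gets a small id that doubles as its bit in the
visit mask.  When a frame address falls outside the current stack, the
walker would ask which stack contains it, and refuse to re-enter a
stack it has already walked.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one stack: the address range [begin, end). */
struct stack_range {
	unsigned int id;	/* also the bit index in the visit mask */
	uintptr_t begin;
	uintptr_t end;
};

/*
 * Find the stack containing addr.  Returns NULL if addr is on no known
 * stack, or if that stack was already visited (which would mean a loop).
 */
static const struct stack_range *
find_stack(const struct stack_range *stacks, size_t n, uintptr_t addr,
	   unsigned long *visit_mask)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (addr < stacks[i].begin || addr >= stacks[i].end)
			continue;

		if (*visit_mask & (1UL << stacks[i].id))
			return NULL;

		*visit_mask |= 1UL << stacks[i].id;
		return &stacks[i];
	}
	return NULL;
}
```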

> +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
> +                              unsigned long *visit_mask)
>  {
>         unsigned long stack = (unsigned long)s;
>         unsigned long begin, end;
> @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long *s, char **name,
>                 if (stack < begin || stack >= end)
>                         continue;
>
> -               if (test_and_set_bit(k, visit_mask))
> +               if (visit_mask &&
> +                   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>                         return false;
>
> -               *name = exception_stack_names[k];
> -               return (unsigned long *)end;
> +               info->type      = STACK_TYPE_EXCEPTION + k;
> +               info->begin     = (unsigned long *)begin;
> +               info->end       = (unsigned long *)end;
> +               info->next      = (unsigned long *)info->end[-2];

This is so magical that I don't immediately see why it's correct.
Presumably it's because the thing two slots down from the top of the
stack is regs->sp?  If so, that needs a comment.
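
For reference, a small userspace simulation of why end[-2] would be the
saved stack pointer.  It assumes 64-bit mode, where the CPU pushes SS,
RSP, RFLAGS, CS and RIP in that order, and it assumes the hardware
frame starts exactly at the top of the exception stack.  The helper
names push_hw_frame() and next_stack() are hypothetical.

```c
#include <stdint.h>

/*
 * Simulate the hardware frame pushed on entry to an exception stack
 * whose top is 'end'.  Returns the resulting stack pointer.
 */
static uintptr_t *push_hw_frame(uintptr_t *end, uintptr_t ss, uintptr_t sp,
				uintptr_t flags, uintptr_t cs, uintptr_t ip)
{
	uintptr_t *p = end;

	*--p = ss;	/* end[-1] */
	*--p = sp;	/* end[-2]: stack pointer of the interrupted context */
	*--p = flags;	/* end[-3] */
	*--p = cs;	/* end[-4] */
	*--p = ip;	/* end[-5] */
	return p;
}

/* The previous stack is wherever the saved stack pointer points. */
static uintptr_t *next_stack(uintptr_t *end)
{
	return (uintptr_t *)end[-2];
}
```

An error code, when the exception has one, is pushed below RIP and does
not change which slot holds the saved stack pointer.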

But again, couldn't we use the fact that we now know how to decode
pt_regs to avoid needing this?  I can imagine it being useful as a
fallback in the event that the unwinder fails, but this is just a
fallback.  Also, NMI is weird and I'm wondering whether this works at
all when trying to unwind from a looped NMI.

Fixing this up could be a followup after this series is in, I think --
you're preserving existing behavior AFAICS.  I just don't particularly
like the existing behavior.

FWIW, I think this code needs to be explicitly tested for the 32-bit
double fault case.  It's highly magical.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
  2016-07-22 23:18               ` Andy Lutomirski
@ 2016-07-22 23:30                 ` Josh Poimboeuf
  2016-07-22 23:39                   ` Andy Lutomirski
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-22 23:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
> >> It wasn't the separate frame part I was objecting to -- it was their
> >> encoding on the stack.  Maybe they should unwind as though they're a
> >> separate frame.  For example, the unwind API could give the frame that
> >> returns to apic_timer_interrupt+0x9e/0xb0 and then the next frame
> >> could literally list regs->ip as its retaddr (and maybe that frame or
> >> even the following one should be the one with non-NULL
> >> unwind_get_entry_regs).
> >
> > Having the unwinder treat the pt_regs as a "fake" frame is problematic:
> >
> > - As I described above, you can't treat regs->ip as a normal return
> >   value anyway.
> >
> > - Also, for exceptions and nested interrupts, the regs are stored on the
> >   interrupting stack.  But for non-nested interrupts, they're stored on
> >   the thread stack.  So the regs aren't always on the same stack as the
> >   corresponding encoded pt_regs pointer.  Another issue is that there's
> >   not always a frame after the regs.  For those reasons, creating a
> >   "fake" frame abstraction in the state machine is quite a bit trickier
> >   than just dealing with those details in the only place that cares
> >   about them: show_trace_log_lvl().
> >
> >> In some sense, the regs belong to the frame that got interrupted, not
> >> the frame that did the interrupting.  But maybe that's backwards -- if
> >> we have DWARF, then the regs correspond to the regs at the time of a
> >> call, and those regs are reasonably likely to contain the arguments to
> >> the called function.
> >>
> >> But regardless of which way this goes, it seems quite awkward to me
> >> that regs->ip never shows up as the return addr of any frame as
> >> exposed by the unwind API.
> >
> > Again, regs->ip is special.  It's not a call return address and we
> > shouldn't force it to be.
> 
> This is only mostly true.  If the exception was a trap, then it is
> (e.g. if a function ends in int3, then regs->ip will be off the end).
> But that's just me being pedantic.
> 
> More relevantly, regs->ip is a reliable address indicating a function
> that will be returned to if we ever return, and both
> show_trace_log_lvl() and the livepatch stuff should interpret it as
> such.

Actually livepatch doesn't care; once it sees that there are regs, it
will bail because the stack is unreliable.

> Whether this means the unwinder should change or
> show_trace_log_lvl() should change isn't a big deal, but I think one
> of them should change so we get this right.

I have no problem doing so, but can you clarify what you mean?  Earlier
you said:

  "IIRC we don't show the actual faulting function in the call trace, so
  we probably shouldn't duplicate the entry after the show_regs."

Maybe I'm misunderstanding, but that seems to contradict what you're
saying now.  So which is it?  Do you want the RIP address printed twice
(both in the regs printout and in the stack trace)?  Or not?

-- 
Josh


* Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
  2016-07-22 23:30                 ` Josh Poimboeuf
@ 2016-07-22 23:39                   ` Andy Lutomirski
  2016-07-23  0:00                     ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-22 23:39 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
>> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
>> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> >> >> > Now that we can find pt_regs registers in the middle of the stack due to
>> >> >> >> > an interrupt or exception, we can print them.  Here's what it looks
>> >> >> >> > like:
>> >> >> >> >
>> >> >> >> >    ...
>> >> >> >> >    [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
>> >> >> >> >    [<ffffffff8189f558>] async_page_fault+0x28/0x30
>> >> >> >> >   RIP: 0010:[<ffffffff814529e2>]  [<ffffffff814529e2>] __clear_user+0x42/0x70
>> >> >> >> >   RSP: 0018:ffff88007876fd38  EFLAGS: 00010202
>> >> >> >> >   RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
>> >> >> >> >   RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
>> >> >> >> >   RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
>> >> >> >> >   R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
>> >> >> >> >   R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
>> >> >> >> >    [<ffffffff814529e2>] ? __clear_user+0x42/0x70
>> >> >> >> >    [<ffffffff814529c3>] ? __clear_user+0x23/0x70
>> >> >> >> >    [<ffffffff81452a7b>] clear_user+0x2b/0x40
>> >> >> >> >    ...
>> >> >> >>
>> >> >> >> This looks wrong.  Here are some theories:
>> >> >> >>
>> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: ....
>> >> >> >> Then it's found again as an unreliable address as "?
>> >> >> >> __clear_user+0x42/0x70" by scanning the stack.  "?
>> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
>> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
>> >> >> >> suppressed because it matched a reliable address?
>> >> >> >>
>> >> >> >> (b) You actually intended for all the addresses to be printed, in
>> >> >> >> which case "? __clear_user+0x42/0x70" should have been
>> >> >> >> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
>> >> >> >> plausible that your state machine got a bit lost leading to "?
>> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
>> >> >> >> it's a real frame and you didn't find it).
>> >> >> >>
>> >> >> >> (c) Something else and I'm confused.
>> >> >> >
>> >> >> > So there's a subtle difference between addresses reported by regs->ip
>> >> >> > and normal return addresses.  For example:
>> >> >> >
>> >> >> >    ...
>> >> >> >    [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50
>> >> >> >    [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0
>> >> >> >   RIP: 0010:[<ffffffff8129b350>]  [<ffffffff8129b350>] path_init+0x0/0x750
>> >> >> >   RSP: 0018:ffff880036a3fd80  EFLAGS: 00000296
>> >> >> >   RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08
>> >> >> >   RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0
>> >> >> >   RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010
>> >> >> >   R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0
>> >> >> >   R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000
>> >> >> >    <EOI>
>> >> >> >    [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0
>> >> >> >    [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120
>> >> >> >    [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180
>> >> >> >    ...
>> >> >> >
>> >> >> > In this case the irq hit right after path_lookupat() called into
>> >> >> > path_init().  So the "path_init+0x0" printed by __show_regs() is right.
>> >> >> >
>> >> >> > Note the backtrace reports the same address, but it instead describes it
>> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
>> >> >> > because normally, such an address after a call instruction at the end of
>> >> >> > a function would indicate a tail call (e.g., to a noreturn function).
>> >> >> > If that were the case, printing "path_init+0x0" would be completely
>> >> >> > wrong, because path_init() just happens to be the function located
>> >> >> > immediately after lookup_fast().
>> >> >> >
>> >> >> > Maybe I could add some special logic to say: "if this return address was
>> >> >> > from a call, use printk_stack_address(); else if it was from a pt_regs,
>> >> >> > use printk_address()."  (The former prints the preceding function, the
>> >> >> > latter prints the current function.)  Then we could remove the question
>> >> >> > mark.
>> >> >> >
>> >> >> > There's also the question of whether or not the address should be
>> >> >> > printed again, after it's already been printed by __show_regs().  I
>> >> >> > don't have a strong opinion either way.
>> >> >> >
>> >> >>
>> >> >> IIRC we don't show the actual faulting function in the call trace, so
>> >> >> we probably shouldn't duplicate the entry after the show_regs.
>> >> >
>> >> > Just to clarify, that's true today for cases where the stack dump starts
>> >> > from a handler which has regs.  It starts dumping based on regs->ip and
>> >> > regs->bp, so the regs themselves aren't dumped.
>> >> >
>> >> > But for cases where regs are in the middle of the stack, they aren't
>> >> > detected today, and you'll still see the value of regs->ip dumped with a
>> >> > question mark.
>> >> >
>> >> > That said, with this patch, now that regs in the middle of the stack
>> >> > *are* being printed, I can't think of a good reason to print the return
>> >> > address twice: both in regs and the stack trace.  So removing it from
>> >> > the stack trace is fine with me.
>> >> >
>> >> >> That being said, I'm still confused by the question marks.  What
>> >> >> exactly is going on?  Is the code really doing the right thing wrt
>> >> >> resuming the unwind?  Is there a git tree with these patches applied
>> >> >> somewhere so I can look at it easily in context?
>> >> >
>> >> > show_trace_log_lvl() is doing two things in parallel: scanning all
>> >> > kernel text addresses on the stack while simultaneously using the
>> >> > unwinder to walk the frame pointers.  Only those scanned addresses which
>> >> > are also found by the unwinder are printed without question marks.
>> >> >
>> >> > The pt_regs aren't in a frame of their own; they're just data inside of
>> >> > a bigger frame.  (You may recall that you objected to my proposal to put
>> >> > them in their own frame :-))  So that's why the address stored in
>> >> > regs->ip was printed with a question mark: it's not in the header of a
>> >> > real frame; it's just data.
>> >>
>> >> It wasn't the separate frame part I was objecting to -- it was their
>> >> encoding on the stack.  Maybe they should unwind as though they're a
>> >> separate frame.  For example, the unwind API could give the frame that
>> >> returns to apic_timer_interrupt+0x9e/0xb0 and then the next frame
>> >> could literally list regs->ip as its retaddr (and maybe that frame or
>> >> even the following one should be the one with non-NULL
>> >> unwind_get_entry_regs).
>> >
>> > Having the unwinder treat the pt_regs as a "fake" frame is problematic:
>> >
>> > - As I described above, you can't treat regs->ip as a normal return
>> >   value anyway.
>> >
>> > - Also, for exceptions and nested interrupts, the regs are stored on the
>> >   interrupting stack.  But for non-nested interrupts, they're stored on
>> >   the thread stack.  So the regs aren't always on the same stack as the
>> >   corresponding encoded pt_regs pointer.  Another issue is that there's
>> >   not always a frame after the regs.  For those reasons, creating a
>> >   "fake" frame abstraction in the state machine is quite a bit trickier
>> >   than just dealing with those details in the only place that cares
>> >   about them: show_trace_log_lvl().
>> >
>> >> In some sense, the regs belong to the frame that got interrupted, not
>> >> the frame that did the interrupting.  But maybe that's backwards -- if
>> >> we have DWARF, then the regs correspond to the regs at the time of a
>> >> call, and those regs are reasonably likely to contain the arguments to
>> >> the called function.
>> >>
>> >> But regardless of which way this goes, it seems quite awkward to me
>> >> that regs->ip never shows up as the return addr of any frame as
>> >> exposed by the unwind API.
>> >
>> > Again, regs->ip is special.  It's not a call return address and we
>> > shouldn't force it to be.
>>
>> This is only mostly true.  If the exception was a trap, then it is
>> (e.g. if a function ends in int3, then regs->ip will be off the end).
>> But that's just me being pedantic.
>>
>> More relevantly, regs->ip is a reliable address indicating a function
>> that will be returned to if we ever return, and both
>> show_trace_log_lvl() and the livepatch stuff should interpret it as
>> such.
>
> Actually livepatch doesn't care; once it sees that there are regs, it
> will bail because the stack is unreliable.

Would it be better for livepatch not to bail some day?

>
>> Whether this means the unwinder should change or
>> show_trace_log_lvl() should change isn't a big deal, but I think one
>> of them should change so we get this right.
>
> I have no problem doing so, but can you clarify what you mean?  Earlier
> you said:
>
>   "IIRC we don't show the actual faulting function in the call trace, so
>   we probably shouldn't duplicate the entry after the show_regs."
>
> Maybe I'm misunderstanding, but that seems to contradict what you're
> saying now.  So which is it?  Do you want the RIP address printed twice
> (both in the regs printout and in the stack trace)?  Or not?

I don't have a strong preference as to how many times it's printed.
But I think we need to get rid of the question mark.  I think that
means there are two options:

a) Teach show_stack_log_lvl() that regs->ip is a "reliable" entry and
print it again.  That will get confused if it's the first instruction
in a function, so maybe it's not so great.

b) Teach show_stack_log_lvl() that regs->ip is a thing that we just
printed (via show_regs) and skip the ? entry.

Option b probably makes more sense.  I think I'm starting to
understand all this, but maybe I'm still missing something.
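
For illustration, option (b) could look roughly like the userspace sketch
below.  This is not the kernel code: struct scan_entry and both helper
functions are hypothetical names, and the only assumption is that the
stack dump code knows the regs->ip value it has just printed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one text address found while scanning the stack. */
struct scan_entry {
	uint64_t addr;		/* candidate kernel text address */
	bool reliable;		/* also reported by the unwinder? */
};

/*
 * Decide whether a scanned entry should be printed.  regs_ip is the
 * regs->ip value that was just shown in the register dump, or 0 if no
 * pt_regs were printed for this part of the stack.
 */
bool should_print_scanned(const struct scan_entry *e, uint64_t regs_ip)
{
	/* Option (b): the address was already shown as RIP, so skip the "?" copy. */
	if (!e->reliable && regs_ip && e->addr == regs_ip)
		return false;
	return true;
}

/* Count how many scanned entries would still be printed. */
size_t count_printed(const struct scan_entry *entries, size_t n,
		     uint64_t regs_ip)
{
	size_t printed = 0;

	for (size_t i = 0; i < n; i++)
		if (should_print_scanned(&entries[i], regs_ip))
			printed++;
	return printed;
}
```

With the __clear_user example from the patch description, the
"? __clear_user+0x42/0x70" line would be dropped while the other two
entries stay.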

--Andy


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-22 23:26   ` Andy Lutomirski
@ 2016-07-22 23:52     ` Andy Lutomirski
  2016-07-23 13:09       ` Josh Poimboeuf
  2016-07-22 23:54     ` Josh Poimboeuf
  1 sibling, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-22 23:52 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 4:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> valid_stack_ptr() is buggy: it assumes that all stacks are of size
>> THREAD_SIZE, which is not true for exception stacks.  So the
>> walk_stack() callbacks will need to know the location of the beginning
>> of the stack as well as the end.
>>
>> Another issue is that in general the various features of a stack (type,
>> size, next stack pointer, description string) are scattered around in
>> various places throughout the stack dump code.
>
> I finally figured out what visit_info is.  But would it make more
> sense to track it in the unwind state so that the unwinder can
> directly make sure it doesn't start looping?
>

I just realized that it *is* in the unwind state.  But maybe this code
in update_stack_state:

    sp = info->next;
    if (!sp)
        goto unknown;

    if (get_stack_info(sp, state->task, info, &state->stack_mask))
        goto unknown;

    if (!on_stack(info, addr, len))
        goto unknown;

should do something like:

if (get_stack_info(addr, ...))
  goto unknown;

sp = info->end;

instead.  Alternatively, maybe it would make sense to keep sp as is
(have update_stack_state return bool instead of returning a pointer)
so that a frame that switches stacks still shows the actual sp at the
time that the frame called whatever it called.
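
A rough userspace sketch of that alternative follows.  It is a toy model
only: stacks are plain address ranges, and lookup_stack() and
update_stack() are hypothetical stand-ins for get_stack_info() and
update_stack_state(), not their real signatures.  The new frame address
is looked up directly instead of following info->next, and the function
returns bool so the caller's sp is left alone.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum stack_type { STACK_TYPE_UNKNOWN, STACK_TYPE_TASK, STACK_TYPE_IRQ };

struct stack_info {
	enum stack_type type;
	uintptr_t begin, end;	/* the stack covers [begin, end) */
};

/* True if [addr, addr + len) lies entirely within the stack. */
bool on_stack(const struct stack_info *info, uintptr_t addr, size_t len)
{
	return addr >= info->begin && addr + len >= addr &&
	       addr + len <= info->end;
}

/*
 * Toy lookup: find the stack containing addr and refuse any stack that
 * was already visited during this walk.  Returns 0 on success.
 */
int lookup_stack(const struct stack_info *stacks, size_t n, uintptr_t addr,
		 struct stack_info *info, unsigned long *visit_mask)
{
	for (size_t i = 0; i < n; i++) {
		if (addr < stacks[i].begin || addr >= stacks[i].end)
			continue;
		if (*visit_mask & (1UL << stacks[i].type))
			return -1;	/* already visited: possible loop */
		*visit_mask |= 1UL << stacks[i].type;
		*info = stacks[i];
		return 0;
	}
	return -1;
}

/* Move the walk to the frame at addr, switching stacks only if needed. */
bool update_stack(const struct stack_info *stacks, size_t n,
		  struct stack_info *info, unsigned long *visit_mask,
		  uintptr_t addr, size_t len)
{
	if (on_stack(info, addr, len))
		return true;	/* still on the current stack */
	if (lookup_stack(stacks, n, addr, info, visit_mask))
		return false;	/* unknown or already-visited stack */
	return on_stack(info, addr, len);
}
```

The walk is expected to start with one lookup_stack() call for the
initial address, so that the starting stack is marked as visited too.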

I'm really quite confused by what state->sp means in your current
code.  In the non-stack-switching case (everything is on the thread
stack), it appears to always match state->bp.  Am I missing something?
If I'm understanding this correctly, when you're pointing at a call
frame, state->bp is that frame's base address (the top of the stack
frame), unwind_get_return_address() returns the address to which that
frame would return, and, in the future, unwind_get_gpr(UNWIND_DI) or
whatever it ends up looking like will return RDI at the time that the
frame called whatever function it called, if known.  By that logic,
shouldn't state->sp be sp on entry to the call instruction?  (Or could
sp just be removed?  Does it do anything?)

I guess the reason I'm still not 100% comfortable with the idea that
pt_regs frames don't exist as real frames is that I keep waffling as to
how I should think about the regs associated with a frame in the
future DWARF world.  I think I imagine them being the regs at the time
that the frame did its call to the next frame, which, by an
admittedly weak analogy, means that the pt_regs state would be the
regs at the time that the exception or interrupt happened.  That
offers a third silly option for dealing with the annoying '?': emit
two frames for a pt_regs, but have the frame in the entry code show
NULL for its return address because it's not a normal return.

> And please remove test_and_set_bit() -- it's pointlessly slow.
>
>> +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
>> +                            unsigned long *visit_mask)
>> +{
>> +       unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
>> +       unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> +
>> +       if (stack < begin || stack >= end)
>> +               return false;
>> +
>> +       if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> +               return false;
>> +
>> +       info->type      = STACK_TYPE_IRQ;
>> +       info->begin     = begin;
>> +       info->end       = end;
>> +       info->next      = (unsigned long *)*begin;
>
> This works, but it's a bit magic.  I don't suppose we could get rid of
> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> move from stack to stack by querying the stack type of whatever the
> frame base address is if the frame base address ends up being out of
> bounds for the current stack?  Or maybe the unwinder could even do
> this by itself.
>
>> +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
>> +                              unsigned long *visit_mask)
>>  {
>>         unsigned long stack = (unsigned long)s;
>>         unsigned long begin, end;
>> @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long *s, char **name,
>>                 if (stack < begin || stack >= end)
>>                         continue;
>>
>> -               if (test_and_set_bit(k, visit_mask))
>> +               if (visit_mask &&
>> +                   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>>                         return false;
>>
>> -               *name = exception_stack_names[k];
>> -               return (unsigned long *)end;
>> +               info->type      = STACK_TYPE_EXCEPTION + k;
>> +               info->begin     = (unsigned long *)begin;
>> +               info->end       = (unsigned long *)end;
>> +               info->next      = (unsigned long *)info->end[-2];
>
> This is so magical that I don't immediately see why it's correct.
> Presumably it's because the thing two slots down from the top of the
> stack is regs->sp?  If so, that needs a comment.
>
> But again, couldn't we use the fact that we now know how to decode
> pt_regs to avoid needing this?  I can imagine it being useful as a
> fallback in the event that the unwinder fails, but this is just a
> fallback.  Also, NMI is weird and I'm wondering whether this works at
> all when trying to unwind from a looped NMI.
>
> Fixing this up could be a followup after this series is in, I think --
> you're preserving existing behavior AFAICS.  I just don't particularly
> like the existing behavior.
>
> FWIW, I think this code needs to be explicitly tested for the 32-bit
> double fault case.  It's highly magical.



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-22 23:26   ` Andy Lutomirski
  2016-07-22 23:52     ` Andy Lutomirski
@ 2016-07-22 23:54     ` Josh Poimboeuf
  2016-07-23  0:15       ` Andy Lutomirski
  1 sibling, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-22 23:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 04:26:46PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > valid_stack_ptr() is buggy: it assumes that all stacks are of size
> > THREAD_SIZE, which is not true for exception stacks.  So the
> > walk_stack() callbacks will need to know the location of the beginning
> > of the stack as well as the end.
> >
> > Another issue is that in general the various features of a stack (type,
> > size, next stack pointer, description string) are scattered around in
> > various places throughout the stack dump code.
> 
> I finally figured out what visit_info is.  But would it make more
> sense to track it in the unwind state so that the unwinder can
> directly make sure it doesn't start looping?

Well, the unwinders aren't the only users of get_stack_info() and the
visit_mask.  show_trace_log_lvl() also uses it.

But it would probably be cleaner to at least do the visit_mask bit
testing/setting in get_stack_info() rather than in the in_*_stack()
functions.
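
Something along these lines, as a minimal userspace sketch.  The stacks
are toy arrays, and in_stack() and toy_get_stack_info() are hypothetical
simplifications rather than the functions in the patch.  It also assumes
the mask belongs to a single walk, so plain bit operations are enough.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum stack_type { STACK_TYPE_UNKNOWN, STACK_TYPE_TASK, STACK_TYPE_IRQ };

struct stack_info {
	enum stack_type type;
	uintptr_t begin, end;	/* the stack covers [begin, end) */
};

/* Toy stacks standing in for the real thread and irq stacks. */
unsigned long task_stack[64];
unsigned long irq_stack[64];

/* Pure range check: fills in info, knows nothing about the visit mask. */
bool in_stack(uintptr_t addr, const unsigned long *base, size_t words,
	      enum stack_type type, struct stack_info *info)
{
	uintptr_t begin = (uintptr_t)base;
	uintptr_t end = begin + words * sizeof(*base);

	if (addr < begin || addr >= end)
		return false;

	info->type = type;
	info->begin = begin;
	info->end = end;
	return true;
}

/* The visit mask is tested and set in exactly one place. */
int toy_get_stack_info(uintptr_t addr, struct stack_info *info,
		       unsigned long *visit_mask)
{
	if (!in_stack(addr, task_stack, 64, STACK_TYPE_TASK, info) &&
	    !in_stack(addr, irq_stack, 64, STACK_TYPE_IRQ, info))
		goto unknown;

	if (visit_mask) {
		/* Non-atomic test and set: the mask is local to one walk. */
		if (*visit_mask & (1UL << info->type))
			goto unknown;
		*visit_mask |= 1UL << info->type;
	}
	return 0;

unknown:
	info->type = STACK_TYPE_UNKNOWN;
	return -1;
}
```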

> And please remove test_and_set_bit() -- it's pointlessly slow.

Ok.

> 
> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
> > +                            unsigned long *visit_mask)
> > +{
> > +       unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
> > +       unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> > +
> > +       if (stack < begin || stack >= end)
> > +               return false;
> > +
> > +       if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> > +               return false;
> > +
> > +       info->type      = STACK_TYPE_IRQ;
> > +       info->begin     = begin;
> > +       info->end       = end;
> > +       info->next      = (unsigned long *)*begin;
> 
> This works, but it's a bit magic.  I don't suppose we could get rid of
> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> move from stack to stack by querying the stack type of whatever the
> frame base address is if the frame base address ends up being out of
> bounds for the current stack?  Or maybe the unwinder could even do
> this by itself.

I'm not quite sure what you mean here.  The ->next stack pointer is
quite useful and it abstracts that ugliness away from the callers of
get_stack_info().  I'm open to any specific suggestions.

> 
> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
> > +                              unsigned long *visit_mask)
> >  {
> >         unsigned long stack = (unsigned long)s;
> >         unsigned long begin, end;
> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long *s, char **name,
> >                 if (stack < begin || stack >= end)
> >                         continue;
> >
> > -               if (test_and_set_bit(k, visit_mask))
> > +               if (visit_mask &&
> > +                   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
> >                         return false;
> >
> > -               *name = exception_stack_names[k];
> > -               return (unsigned long *)end;
> > +               info->type      = STACK_TYPE_EXCEPTION + k;
> > +               info->begin     = (unsigned long *)begin;
> > +               info->end       = (unsigned long *)end;
> > +               info->next      = (unsigned long *)info->end[-2];
> 
> This is so magical that I don't immediately see why it's correct.
> Presumably it's because the thing two slots down from the top of the
> stack is regs->sp?  If so, that needs a comment.

Heck if I know, I just stole it from dump_trace() ;-)

I'll figure it out and add a comment.
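
For reference, the guess above is consistent with the 64-bit hardware
exception frame: on entry the CPU pushes SS, RSP, RFLAGS, CS and RIP
(plus an error code for some vectors).  If that frame starts at the very
top of the exception stack, as it would for an IST stack, the second
word from the top is the interrupted RSP.  A small userspace sketch of
the layout, under that assumption; push_hw_frame() and
next_stack_pointer() are hypothetical helpers, not kernel functions:

```c
#include <stdint.h>

/* Position of each hardware-pushed word, counted down from the stack top. */
enum { HW_SS = 1, HW_RSP = 2, HW_RFLAGS = 3, HW_CS = 4, HW_RIP = 5 };

/*
 * Simulate the CPU pushing its exception frame, starting at the top of
 * the stack.  Returns the resulting stack pointer.
 */
uint64_t *push_hw_frame(uint64_t *end, uint64_t ss, uint64_t rsp,
			uint64_t rflags, uint64_t cs, uint64_t rip)
{
	uint64_t *sp = end;

	*--sp = ss;
	*--sp = rsp;
	*--sp = rflags;
	*--sp = cs;
	*--sp = rip;
	return sp;
}

/* The interrupted context's stack pointer: the value read via end[-2]. */
uint64_t next_stack_pointer(const uint64_t *end)
{
	return end[-HW_RSP];
}
```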

> But again, couldn't we use the fact that we now know how to decode
> pt_regs to avoid needing this?  I can imagine it being useful as a
> fallback in the event that the unwinder fails, but this is just a
> fallback.

Yeah, this is needed as a fallback.  But I wouldn't call it "just" a
fallback: the stack dump code *needs* to be able to still traverse the
stacks if frame pointers fail.

> Also, NMI is weird and I'm wondering whether this works at
> all when trying to unwind from a looped NMI.

Unless I'm missing something, I think it should be fine for nested NMIs,
since they're all on the same stack.  I can try to test it.  What in
particular are you worried about?

> Fixing this up could be a followup after this series is in, I think --
> you're preserving existing behavior AFAICS.  I just don't particularly
> like the existing behavior.
> 
> FWIW, I think this code needs to be explicitly tested for the 32-bit
> double fault case.  It's highly magical.

Ok, I'll test it.

-- 
Josh


* Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
  2016-07-22 23:39                   ` Andy Lutomirski
@ 2016-07-23  0:00                     ` Josh Poimboeuf
  0 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-23  0:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 04:39:00PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
> >> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
> >> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
> >> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> >> >> > Now that we can find pt_regs registers in the middle of the stack due to
> >> >> >> >> > an interrupt or exception, we can print them.  Here's what it looks
> >> >> >> >> > like:
> >> >> >> >> >
> >> >> >> >> >    ...
> >> >> >> >> >    [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
> >> >> >> >> >    [<ffffffff8189f558>] async_page_fault+0x28/0x30
> >> >> >> >> >   RIP: 0010:[<ffffffff814529e2>]  [<ffffffff814529e2>] __clear_user+0x42/0x70
> >> >> >> >> >   RSP: 0018:ffff88007876fd38  EFLAGS: 00010202
> >> >> >> >> >   RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
> >> >> >> >> >   RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
> >> >> >> >> >   RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
> >> >> >> >> >   R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
> >> >> >> >> >   R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
> >> >> >> >> >    [<ffffffff814529e2>] ? __clear_user+0x42/0x70
> >> >> >> >> >    [<ffffffff814529c3>] ? __clear_user+0x23/0x70
> >> >> >> >> >    [<ffffffff81452a7b>] clear_user+0x2b/0x40
> >> >> >> >> >    ...
> >> >> >> >>
> >> >> >> >> This looks wrong.  Here are some theories:
> >> >> >> >>
> >> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: ....
> >> >> >> >> Then it's found again as an unreliable address as "?
> >> >> >> >> __clear_user+0x42/0x70" by scanning the stack.  "?
> >> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
> >> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
> >> >> >> >> suppressed because it matched a reliable address?
> >> >> >> >>
> >> >> >> >> (b) You actually intended for all the addresses to be printed, in
> >> >> >> >> which case "? __clear_user+0x42/0x70" should have been
> >> >> >> >> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
> >> >> >> >> plausible that your state machine got a bit lost leading to "?
> >> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
> >> >> >> >> it's a real frame and you didn't find it).
> >> >> >> >>
> >> >> >> >> (c) Something else and I'm confused.
> >> >> >> >
> >> >> >> > So there's a subtle difference between addresses reported by regs->ip
> >> >> >> > and normal return addresses.  For example:
> >> >> >> >
> >> >> >> >    ...
> >> >> >> >    [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50
> >> >> >> >    [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0
> >> >> >> >   RIP: 0010:[<ffffffff8129b350>]  [<ffffffff8129b350>] path_init+0x0/0x750
> >> >> >> >   RSP: 0018:ffff880036a3fd80  EFLAGS: 00000296
> >> >> >> >   RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08
> >> >> >> >   RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0
> >> >> >> >   RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010
> >> >> >> >   R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0
> >> >> >> >   R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000
> >> >> >> >    <EOI>
> >> >> >> >    [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0
> >> >> >> >    [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120
> >> >> >> >    [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180
> >> >> >> >    ...
> >> >> >> >
> >> >> >> > In this case the irq hit right after path_lookupat() called into
> >> >> >> > path_init().  So the "path_init+0x0" printed by __show_regs() is right.
> >> >> >> >
> >> >> >> > Note the backtrace reports the same address, but it instead describes it
> >> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
> >> >> >> > because normally, such an address after a call instruction at the end of
> >> >> >> > a function would indicate a tail call (e.g., to a noreturn function).
> >> >> >> > If that were the case, printing "path_init+0x0" would be completely
> >> >> >> > wrong, because path_init() just happens to be the function located
> >> >> >> > immediately after lookup_fast().
> >> >> >> >
> >> >> >> > Maybe I could add some special logic to say: "if this return address was
> >> >> >> > from a call, use printk_stack_address(); else if it was from a pt_regs,
> >> >> >> > use printk_address()."  (The former prints the preceding function, the
> >> >> >> > latter prints the current function.)  Then we could remove the question
> >> >> >> > mark.
> >> >> >> >
> >> >> >> > There's also the question of whether or not the address should be
> >> >> >> > printed again, after it's already been printed by __show_regs().  I
> >> >> >> > don't have a strong opinion either way.
> >> >> >> >
> >> >> >>
> >> >> >> IIRC we don't show the actual faulting function in the call trace, so
> >> >> >> we probably shouldn't duplicate the entry after the show_regs.
> >> >> >
> >> >> > Just to clarify, that's true today for cases where the stack dump starts
> >> >> > from a handler which has regs.  It starts dumping based on regs->ip and
> >> >> > regs->bp, so the regs themselves aren't dumped.
> >> >> >
> >> >> > But for cases where regs are in the middle of the stack, they aren't
> >> >> > detected today, and you'll still see the value of regs->ip dumped with a
> >> >> > question mark.
> >> >> >
> >> >> > That said, with this patch, now that regs in the middle of the stack
> >> >> > *are* being printed, I can't think of a good reason to print the return
> >> >> > address twice: both in regs and the stack trace.  So removing it from
> >> >> > the stack trace is fine with me.
> >> >> >
> >> >> >> That being said, I'm still confused by the question marks.  What
> >> >> >> exactly is going on?  Is the code really doing the right thing wrt
> >> >> >> resuming the unwind?  Is there a git tree with these patches applied
> >> >> >> somewhere so I can look at it easily in context?
> >> >> >
> >> >> > show_trace_log_lvl() is doing two things in parallel: scanning all
> >> >> > kernel text addresses on the stack while simultaneously using the
> >> >> > unwinder to walk the frame pointers.  Only those scanned addresses which
> >> >> > are also found by the unwinder are printed without question marks.
> >> >> >
> >> >> > The pt_regs aren't in a frame of their own; they're just data inside of
> >> >> > a bigger frame.  (You may recall that you objected to my proposal to put
> >> >> > them in their own frame :-))  So that's why the address stored in
> >> >> > regs->ip was printed with a question mark: it's not in the header of a
> >> >> > real frame; it's just data.
> >> >>
> >> >> It wasn't the separate frame part I was objecting to -- it was their
> >> >> encoding on the stack.  Maybe they should unwind as though they're a
> >> >> separate frame.  For example, the unwind API could give the frame that
> >> >> returns to apic_timer_interrupt+0x9e/0xb0 and then the next frame
> >> >> could literally list regs->ip as its retaddr (and maybe that frame or
> >> >> even the following one should be the one with non-NULL
> >> >> unwind_get_entry_regs).
> >> >
> >> > Having the unwinder treat the pt_regs as a "fake" frame is problematic:
> >> >
> >> > - As I described above, you can't treat regs->ip as a normal return
> >> >   value anyway.
> >> >
> >> > - Also, for exceptions and nested interrupts, the regs are stored on the
> >> >   interrupting stack.  But for non-nested interrupts, they're stored on
> >> >   the thread stack.  So the regs aren't always on the same stack as the
> >> >   corresponding encoded pt_regs pointer.  Another issue is that there's
> >> >   not always a frame after the regs.  For those reasons, creating a
> >> >   "fake" frame abstraction in the state machine is quite a bit trickier
> >> >   than just dealing with those details in the only place that cares
> >> >   about them: show_trace_log_lvl().
> >> >
> >> >> In some sense, the regs belong to the frame that got interrupted, not
> >> >> the frame that did the interrupting.  But maybe that's backwards -- if
> >> >> we have DWARF, then the regs correspond to the regs at the time of a
> >> >> call, and those regs are reasonably likely to contain the arguments to
> >> >> the called function.
> >> >>
> >> >> But regardless of which way this goes, it seems quite awkward to me
> >> >> that regs->ip never shows up as the return addr of any frame as
> >> >> exposed by the unwind API.
> >> >
> >> > Again, regs->ip is special.  It's not a call return address and we
> >> > shouldn't force it to be.
> >>
> >> This is only mostly true.  If the exception was a trap, then it is
> >> (e.g. if a function ends in int3, then regs->ip will be off the end).
> >> But that's just me being pedantic.
> >>
> >> More relevantly, regs->ip is a reliable address indicating a function
> >> that will be returned to if we ever return, and both
> >> show_trace_log_lvl() and the livepatch stuff should interpret it as
> >> such.
> >
> > Actually livepatch doesn't care; once it sees that there are regs, it
> > will bail because the stack is unreliable.
> 
> Would it be better for livepatch not to bail some day?

Not until we have a DWARF unwinder.

> >> Whether this means the unwinder should change or
> >> show_trace_log_lvl() should change isn't a big deal, but I think one
> >> of them should change so we get this right.
> >
> > I have no problem doing so, but can you clarify what you mean?  Earlier
> > you said:
> >
> >   "IIRC we don't show the actual faulting function in the call trace, so
> >   we probably shouldn't duplicate the entry after the show_regs."
> >
> > Maybe I'm misunderstanding, but that seems to contradict what you're
> > saying now.  So which is it?  Do you want the RIP address printed twice
> > (both in the regs printout and in the stack trace)?  Or not?
> 
> I don't have a strong preference as to how many times it's printed.
> But I think we need to get rid of the question mark.  I think that
> means there are two options:
> 
> a) Teach show_stack_log_lvl() that regs->ip is a "reliable" entry and
> print it again.  That will get confused if it's the first instruction
> in a function, so maybe it's not so great.

I proposed a fix for this above, so that it would print regs->ip one
way and a normal return address another way, to avoid the confusion:

> >> >> >> > Maybe I could add some special logic to say: "if this return address was
> >> >> >> > from a call, use printk_stack_address(); else if it was from a pt_regs,
> >> >> >> > use printk_address()."  (The former prints the preceding function, the
> >> >> >> > latter prints the current function.)  Then we could remove the question
> >> >> >> > mark.

> b) Teach show_stack_log_lvl() that regs->ip is a thing that we just
> printed (via show_regs) and skip the ? entry.
> 
> Option b probably makes more sense.  I think I'm starting to
> understand all this, but maybe I'm still missing something.

I also think b is a good option.  I'll do it for v2 unless others
disagree.
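
For reference, the difference between the two lookups discussed above
can be shown with a toy symbol table.  This is only a userspace sketch:
the toy_* names, addresses, and sizes are made up, and resolving
addr - 1 for return addresses is an assumption about how the
"preceding function" behavior is obtained, not code from this series.

```c
#include <assert.h>
#include <stddef.h>

/* Toy symbol table.  Addresses and sizes are made up for illustration. */
struct toy_sym {
	unsigned long start;
	unsigned long size;
	const char *name;
};

static const struct toy_sym toy_syms[] = {
	{ 0x1000, 0x3d0, "lookup_fast" },	/* covers 0x1000..0x13cf */
	{ 0x13d0, 0x750, "path_init" },		/* starts right after it */
};

/* Return the symbol whose range contains addr, or NULL if there is none. */
static const struct toy_sym *toy_lookup(unsigned long addr)
{
	size_t i;

	for (i = 0; i < sizeof(toy_syms) / sizeof(toy_syms[0]); i++) {
		const struct toy_sym *s = &toy_syms[i];

		if (addr >= s->start && addr - s->start < s->size)
			return s;
	}
	return NULL;
}

/*
 * A return address points just past a call instruction, so resolve
 * addr - 1 to land inside the function that made the call.
 */
static const struct toy_sym *toy_sym_for_return_address(unsigned long addr)
{
	assert(addr > 0);
	return toy_lookup(addr - 1);
}

/* regs->ip is the next instruction to run, so resolve it unchanged. */
static const struct toy_sym *toy_sym_for_regs_ip(unsigned long addr)
{
	return toy_lookup(addr);
}
```

With this table, toy_sym_for_return_address(0x13d0) resolves to
lookup_fast and toy_sym_for_regs_ip(0x13d0) resolves to path_init,
which mirrors the two descriptions of the same address in the trace
quoted earlier.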

-- 
Josh


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-22 23:54     ` Josh Poimboeuf
@ 2016-07-23  0:15       ` Andy Lutomirski
  2016-07-23 14:04         ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-23  0:15 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
>> > +                            unsigned long *visit_mask)
>> > +{
>> > +       unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
>> > +       unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> > +
>> > +       if (stack < begin || stack >= end)
>> > +               return false;
>> > +
>> > +       if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> > +               return false;
>> > +
>> > +       info->type      = STACK_TYPE_IRQ;
>> > +       info->begin     = begin;
>> > +       info->end       = end;
>> > +       info->next      = (unsigned long *)*begin;
>>
>> This works, but it's a bit magic.  I don't suppose we could get rid of
>> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
>> move from stack to stack by querying the stack type of whatever the
>> frame base address is if the frame base address ends up being out of
>> bounds for the current stack?  Or maybe the unwinder could even do
>> this by itself.
>
> I'm not quite sure what you mean here.  The ->next stack pointer is
> quite useful and it abstracts that ugliness away from the callers of
> get_stack_info().  I'm open to any specific suggestions.

So far I've found two users of this thing.  One is
show_stack_log_lvl(), and it makes sense there, but maybe
info->heuristic_next_stack would be a better name.  The other is the
unwinder itself, and I think that walking from stack to stack using
this heuristic is the wrong approach there, at least in the long term.
I'd rather we just followed the bp chain wherever it leads us, as long
as it leads us to a valid stack that we haven't visited before.

As a concrete example of what I think is wrong with the current
approach, ISTM it would be totally valid to implement stack switching
like this:

some_func:
 push %rbp
 mov %rsp, %rbp
 ...
 mov [next stack], %rsp
 call some_other_func
 mov %rbp, %rsp
 pop %rbp
 ret

With the current approach, you can't unwind out of that function,
because there's no way to populate info->next.  I'm not actually
suggesting that the kernel should ever do such a thing on x86, and my
proposed rewrite of the IRQ stack code [1] will be fully compatible
with your approach, but it seems odd to me that the unwinder should
depend on the idea that the stacks in use are chained together in a way
that can be decoded without .  (But maybe some of the Go compilers do
work this way -- I've never looked at their precise stack layout.)

Also, if you ever intend to port this thing to other architectures, I
think there are architectures that have separate exception stacks and
that track the next available slot on those stacks dynamically.  I
think that x86_32 is an example of this if task gates are used in a
back-and-forth manner, although Linux doesn't do that.  (x86_64 should
have done this for IST, but it didn't.)  On those architectures, you
can have two separate switches onto the same stack live at the same
time, and your current approach won't work.  (Even if you make the
change I'm suggesting, visit_mask will break, too, but fixing that
would be a much less invasive change.)

Am I making any sense?  This is a suggestion for making it better, not
something I see as a requirement for getting the x86 code upstream.
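
To make the "follow the bp chain wherever it leads" idea a bit more
concrete, here is a rough userspace sketch.  Everything in it is
hypothetical (struct toy_stack, toy_next_frame(), the fake stacks); it
is not code from the series, and it still uses a per-stack visit mask,
so it shares the limitation mentioned above for stacks that can be
entered twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct stack_info: one stack's bounds. */
struct toy_stack {
	uintptr_t *begin;
	uintptr_t *end;		/* one past the last word */
};

/* Return the index of the stack containing p, or -1 if none does. */
static int toy_find_stack(const struct toy_stack *stacks, size_t n,
			  const uintptr_t *p)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if ((uintptr_t)p >= (uintptr_t)stacks[i].begin &&
		    (uintptr_t)p < (uintptr_t)stacks[i].end)
			return (int)i;
	}
	return -1;
}

/*
 * Follow the saved frame pointer stored at *bp.  The step is accepted only
 * if the new frame lies on a known stack with room for a saved bp plus a
 * return address, and it either moves forward on the current stack or
 * enters a stack that has not been visited yet.
 */
static bool toy_next_frame(const struct toy_stack *stacks, size_t n,
			   uintptr_t **bp, int *cur, unsigned long *visit_mask)
{
	uintptr_t raw = **bp;
	uintptr_t *next;
	int idx;

	if (raw % sizeof(uintptr_t))
		return false;			/* misaligned, not a frame */

	next = (uintptr_t *)raw;
	idx = toy_find_stack(stacks, n, next);
	if (idx < 0 || stacks[idx].end - next < 2)
		return false;

	if (idx == *cur) {
		if (next <= *bp)
			return false;		/* no forward progress */
	} else {
		if (*visit_mask & (1UL << idx))
			return false;		/* would loop back */
		*visit_mask |= 1UL << idx;
	}

	*cur = idx;
	*bp = next;
	return true;
}

/* Walk a fake IRQ stack into a fake thread stack; return frames followed. */
static int toy_walk_demo(void)
{
	uintptr_t irq[8] = { 0 }, thread[8] = { 0 };
	struct toy_stack stacks[] = { { irq, irq + 8 }, { thread, thread + 8 } };
	uintptr_t *bp = &irq[2];
	unsigned long visit_mask = 1UL << 0;	/* starting on the IRQ stack */
	int cur = 0, steps = 0;

	irq[2] = (uintptr_t)&thread[4];		/* frame that switched stacks */
	thread[4] = (uintptr_t)&thread[6];	/* ordinary frame */
	thread[6] = (uintptr_t)&irq[2];		/* corrupt: points back */

	while (toy_next_frame(stacks, 2, &bp, &cur, &visit_mask))
		steps++;
	return steps;
}
```

The demo follows two links (IRQ stack to thread stack, then one more
frame on the thread stack) and then stops, because the third link
points back at a stack that was already visited.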

>>
>> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
>> > +                              unsigned long *visit_mask)
>> >  {
>> >         unsigned long stack = (unsigned long)s;
>> >         unsigned long begin, end;
>> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long *s, char **name,
>> >                 if (stack < begin || stack >= end)
>> >                         continue;
>> >
>> > -               if (test_and_set_bit(k, visit_mask))
>> > +               if (visit_mask &&
>> > +                   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>> >                         return false;
>> >
>> > -               *name = exception_stack_names[k];
>> > -               return (unsigned long *)end;
>> > +               info->type      = STACK_TYPE_EXCEPTION + k;
>> > +               info->begin     = (unsigned long *)begin;
>> > +               info->end       = (unsigned long *)end;
>> > +               info->next      = (unsigned long *)info->end[-2];
>>
>> This is so magical that I don't immediately see why it's correct.
>> Presumably it's because the thing two slots down from the top of the
>> stack is regs->sp?  If so, that needs a comment.
>
> Heck if I know, I just stole it from dump_trace() ;-)
>
> I'll figure it out and add a comment.

If you can write it as:

struct pt_regs *regs = (struct pt_regs *)end - 1;
info->next = (unsigned long *)regs->sp;

and it still works, then no comment required :)

>
>> But again, couldn't we use the fact that we now know how to decode
>> pt_regs to avoid needing this?  I can imagine it being useful as a
>> fallback in the event that the unwinder fails, but this is just a
>> fallback.
>
> Yeah, this is needed as a fallback.  But I wouldn't call it "just" a
> fallback: the stack dump code *needs* to be able to still traverse the
> stacks if frame pointers fail.
>
>> Also, NMI is weird and I'm wondering whether this works at
>> all when trying to unwind from a looped NMI.
>
> Unless I'm missing something, I think it should be fine for nested NMIs,
> since they're all on the same stack.  I can try to test it.  What in
> particular are you worried about?
>

The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
CS, IP) records.  Off the top of my head, the record that matters is
the third one, so it should be regs[-15].  If an MCE hits while this
mess is being set up, good luck unwinding *that*.  If you really want
to know, take a deep breath, read the long comment in entry_64.S after
.Lnmi_from_kernel, then give up on x86 and start hacking on some other
architecture.


* Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
  2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
                   ` (18 preceding siblings ...)
  2016-07-21 21:21 ` [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack Josh Poimboeuf
@ 2016-07-23  0:22 ` Linus Torvalds
  2016-07-23  0:31   ` Andy Lutomirski
  19 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2016-07-23  0:22 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Andy Lutomirski, Steven Rostedt, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 6:21 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> Some of its advantages:
>
> - simplicity: no more callback sprawl and less code duplication.
>
> - flexibility: allows the caller to stop and inspect the stack state at
>   each step in the unwinding process.
>
> - modularity: the unwinder code, console stack dump code, and stack
>   metadata analysis code are all better separated so that changing one
>   of them shouldn't have much of an impact on any of the others.

I've been without internet for the last week, so I have a ton pending,
and not good enough internet even now to take a good look.

However, I want to make one thing really really clear: the absolute
NUMBER ONE requirement for the stack tracing code is none of the
above.

The #1 requirement is that it works, and does not have a chance in hell of
ever breaking. We had that happen once before when people wanted to
make it fancy and add Dwarf info, and it was such a f*cking disaster
that I am not sure I ever want to do that again. Seriously.

It does not matter if the stack tracing gives the wrong answers.

It does not matter if the stack tracing is complicated and odd old code.

It does not matter one whit if some new user is inconvenienced, and in
fact it is possible that new users should write their *own* stack
tracer code.

The ONLY thing that matters (to a very high degree) is that the code
is stable, and if an Oops happens, the stack tracer never *ever*
causes even more problems than we already have.

If the stack tracer *ever* takes a recursive fault and kills the
machine, the stack tracer is worse than bad - we'd be better off
*without* a stack tracer at all.

And yes, we had exactly that situation, where bugs in the stack tracer
meant that other bugs ended up being much harder to debug, because
instead of a nice logged oops message that would have been trivial to
figure out, we very occasionally ended up with a dead machine instead.

So without having yet looked at the code, I want people to understand
that to a very real degree, the stack tracer used by the *oopsing* code
(ie what all the usual kernel fault handlers use) is very very special
code and needs to be handled very carefully, and needs to be extra
robust, even in the presence of stack corruption, and even in the
presence of the dwarf info being totally corrupted. Because we've very
much had both things happen.

It is very possible that we should have two different stack tracers -
the stupid "for oopses only" code that doesn't necessarily give the
perfect trace, but is very anal and happily gives old stale addresses
(which can be very useful for seeing what happened just before the
"real" stack trace), and then a separate stack trace engine that is
clever and gets things right, and if that one faults it can depend on
the normal kernel fault handling picking up the pieces.

Yes, the current stack tracer is crufty. No, it's not perfect. But it
is very well tested, and has held up. That should not be dismissed.

                 Linus


* Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
  2016-07-23  0:22 ` [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Linus Torvalds
@ 2016-07-23  0:31   ` Andy Lutomirski
  2016-07-23  5:35     ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-23  0:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Josh Poimboeuf, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 5:22 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So without having yet looked at the code, I want people to understand
> that to a very real degree, the stack tracer used by the *oopsing* code
> (ie what all the usual kernel fault handlers use) is very very special
> code and needs to be handled very carefully, and needs to be extra
> robust, even in the presence of stack corruption, and even in the
> presence of the dwarf info being totally corrupted. Because we've very
> much had both things happen.
>
> It is very possible that we should have two different stack tracers -
> the stupid "for oopses only" code that doesn't necessarily give the
> perfect trace, but is very anal and happily gives old stale addresses
> (which can be very useful for seeing what happened just before the
> "real" stack trace), and then a separate stack trace engine that is
> clever and gets things right, and if that one faults it can depend on
> the normal kernel fault handling picking up the pieces.

I think that Josh's code has the potential to be extremely robust
*and* give more correct results when possible.  One thing I intend to
review when v2 shows up is that it's as conservative as it needs to be
to avoid ever dereferencing an out-of-bounds pointer.  And Josh's oops
printer carefully walks and prints out all addresses on the stack
(complete with question marks) even if the unwinder doesn't find them.

>
> Yes, the current stack tracer is crufty. No, it's not perfect. But it
> is very well tested, and has held up. That should not be dismissed.
>

I think you may be giving the current tracer slightly more credit than
it's due.  In my stack guard page patchset, I fixed two separate
issues, one of which caused recursive faults and one of which caused
it to output nothing at all.  So maybe *now* it's very robust :)  But
it's still an unmaintainable mess IMO, and Josh's patchset helps a
*lot*.

--Andy


* Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
  2016-07-23  0:31   ` Andy Lutomirski
@ 2016-07-23  5:35     ` Josh Poimboeuf
  2016-07-23  5:39       ` Linus Torvalds
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-23  5:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 05:31:47PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 5:22 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So without having yet looked at the code, I want people to understand
> > that to a very real degree, the stack tracer used by the *oopsing* code
> > (ie what all the usual kernel fault handlers use) is very very special
> > code and needs to be handled very carefully, and needs to be extra
> > robust, even in the presence of stack corruption, and even in the
> > presence of the dwarf info being totally corrupted. Because we've very
> > much had both things happen.
> >
> > It is very possible that we should have two different stack tracers -
> > the stupid "for oopses only" code that doesn't necessarily give the
> > perfect trace, but is very anal and happily gives old stale addresses
> > (which can be very useful for seeing what happened just before the
> > "real" stack trace), and then a separate stack trace engine that is
> > clever and gets things right, and if that one faults it can depend on
> > the normal kernel fault handling picking up the pieces.
> 
> I think that Josh's code has the potential to be extremely robust
> *and* give more correct results when possible.  One thing I intend to
> review when v2 shows up is that it's as conservative as it needs to be
> to avoid ever dereferencing an out-of-bounds pointer.  And Josh's oops
> printer carefully walks and prints out all addresses on the stack
> (complete with question marks) even if the unwinder doesn't find them.

I should add that while the show_trace_log_lvl() code (which is used for
oopses) looks different on the surface, the algorithm is fundamentally
the same as before: traverse the stacks, scanning and printing any
kernel text addresses.

While doing the scanning and printing, it does call the frame pointer
unwinder in parallel, but like before, that's *only* used to determine
whether a found address should be printed without a question mark.  If
the unwinder goes off the rails, the scanning and printing of text
addresses goes on, undisturbed.

The frame pointer unwinder code itself is quite careful not to
dereference anything it shouldn't (though of course I welcome any review
comments that find otherwise).
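
As a rough illustration of that split (not the actual patch code), the
loop can be sketched in userspace C with a fake stack and a made-up
text range.  All toy_* names and constants are hypothetical, and the
frame layout is assumed to be a saved bp followed by a return address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up "kernel text" range for this userspace toy. */
#define TOY_TEXT_START	0x1000u
#define TOY_TEXT_END	0x2000u

static bool toy_is_text(uintptr_t v)
{
	return v >= TOY_TEXT_START && v < TOY_TEXT_END;
}

/* A frame pointer is usable only if bp and bp + 1 are both on the stack. */
static bool toy_valid_bp(const uintptr_t *stack, size_t nwords, uintptr_t raw)
{
	uintptr_t lo = (uintptr_t)stack;
	uintptr_t hi = (uintptr_t)(stack + nwords);

	return raw >= lo && raw < hi &&
	       (raw - lo) % sizeof(uintptr_t) == 0 &&
	       hi - raw >= 2 * sizeof(uintptr_t);
}

/*
 * Scan every word of the stack and record each text address.  An entry is
 * marked reliable only when it sits in the return-address slot (bp + 1) of
 * the frame the frame-pointer walk is currently on.  If the walk hits a
 * bad frame pointer it stops, but the scan carries on.
 */
static size_t toy_scan(const uintptr_t *stack, size_t nwords, uintptr_t bp_raw,
		       uintptr_t *addrs, bool *reliable, size_t max)
{
	const uintptr_t *bp = NULL;
	size_t i, n = 0;

	if (toy_valid_bp(stack, nwords, bp_raw))
		bp = (const uintptr_t *)bp_raw;

	for (i = 0; i < nwords && n < max; i++) {
		if (!toy_is_text(stack[i]))
			continue;

		addrs[n] = stack[i];
		reliable[n] = bp && &stack[i] == bp + 1;
		n++;

		if (reliable[n - 1]) {
			/* Advance the walk; require forward progress. */
			uintptr_t next = *bp;

			if (toy_valid_bp(stack, nwords, next) &&
			    next > (uintptr_t)bp)
				bp = (const uintptr_t *)next;
			else
				bp = NULL;
		}
	}
	return n;
}

/* Build a fake stack and return a bitmask of which entries were reliable. */
static unsigned int toy_scan_demo(void)
{
	uintptr_t stack[8], addrs[8];
	bool reliable[8];
	unsigned int mask = 0;
	size_t i, n;

	stack[0] = 0x1100;			/* stale leftover */
	stack[1] = (uintptr_t)&stack[4];	/* saved bp, frame 0 */
	stack[2] = 0x1200;			/* return address */
	stack[3] = 0x1300;			/* stale leftover */
	stack[4] = (uintptr_t)&stack[6];	/* saved bp, frame 1 */
	stack[5] = 0x1400;			/* return address */
	stack[6] = 0;				/* end of the chain */
	stack[7] = 0x1500;			/* return address */

	n = toy_scan(stack, 8, (uintptr_t)&stack[1], addrs, reliable, 8);
	for (i = 0; i < n; i++)
		if (reliable[i])
			mask |= 1u << i;
	return mask;
}
```

In the demo all five text addresses are recorded, and only the three
that sit in a return-address slot reached via the bp chain are flagged
as reliable; the two stale ones would be the "?" entries.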

> > Yes, the current stack tracer is crufty. No, it's not perfect. But it
> > is very well tested, and has held up. That should not be dismissed.
> >
> 
> I think you may be giving the current tracer slightly more credit than
> it's due.  In my stack guard page patchset, I fixed two separate
> issues, one of which caused recursive faults and one of which caused
> it to output nothing at all.  So maybe *now* it's very robust :)  But
> it's still an unmaintainable mess IMO, and Josh's patchset helps a
> *lot*.

-- 
Josh


* Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
  2016-07-23  5:35     ` Josh Poimboeuf
@ 2016-07-23  5:39       ` Linus Torvalds
  2016-07-23 12:53         ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2016-07-23  5:39 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

On Sat, Jul 23, 2016 at 2:35 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> While doing the scanning and printing, it does call the frame pointer
> unwinder in parallel, but like before, that's *only* used to determine
> whether a found address should be printed without a question mark.  If
> the unwinder goes off the rails, the scanning and printing of text
> addresses goes on, undisturbed.
>
> The frame pointer unwinder code itself is quite careful not to
> dereference anything it shouldn't (though of course I welcome any review
> comments that find otherwise).

So this was the bug the last time around we did unwinders - the code
would dereference the unwind tables, and the tables would be
corrupted. End result: recursive oops.

And they were corrupted not even because of memory corruption, but
simply because they contained incorrect data, due to compiler bugs and
other issues.

I have really bad memories from that time. Several years after the
fact. It took months to finally revert the crap, because the author
continued to insist that "this was the last bug" for several passes
through that thing.

As they say, "Once burned, twice shy". But in this case, it's more
like "Four times burned, sixteen times as shy".

            Linus


* Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
  2016-07-23  5:39       ` Linus Torvalds
@ 2016-07-23 12:53         ` Josh Poimboeuf
  0 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-23 12:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Brian Gerst, Kees Cook, Peter Zijlstra,
	Frederic Weisbecker, Byungchul Park

On Sat, Jul 23, 2016 at 02:39:52PM +0900, Linus Torvalds wrote:
> On Sat, Jul 23, 2016 at 2:35 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >
> > While doing the scanning and printing, it does call the frame pointer
> > unwinder in parallel, but like before, that's *only* used to determine
> > whether a found address should be printed without a question mark.  If
> > the unwinder goes off the rails, the scanning and printing of text
> > addresses goes on, undisturbed.
> >
> > The frame pointer unwinder code itself is quite careful not to
> > dereference anything it shouldn't (though of course I welcome any review
> > comments that find otherwise).
> 
> So this was the bug the last time around we did unwinders - the code
> would dereference the unwind tables, and the tables would be
> corrupted. End result: recursive oops.
> 
> And they were corrupted not even because of memory corruption, but
> simply because they contained incorrect data, due to compiler bugs and
> other issues.
> 
> I have really bad memories from that time. Several years after the
> fact. It took months to finally revert the crap, because the author
> continued to insist that "this was the last bug" for several passes
> through that thing.
> 
> As they say, "Once burned, twice shy". But in this case, it's more
> like "Four times burned, sixteen times as shy".

But that was DWARF, right?  This is still just simple frame pointers.

Don't think of it as a new unwinder.  Think of it instead as a "gentle
reshuffling of the existing code to vastly improve readability and
maintenance."

Yes, I would like to eventually propose a DWARF unwinder, which
hopefully learns from the mistakes of previous attempts.  But either
way, I think this patch set stands on its own as a big improvement.

-- 
Josh


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-22 23:52     ` Andy Lutomirski
@ 2016-07-23 13:09       ` Josh Poimboeuf
  0 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-23 13:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 04:52:10PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 4:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> valid_stack_ptr() is buggy: it assumes that all stacks are of size
> >> THREAD_SIZE, which is not true for exception stacks.  So the
> >> walk_stack() callbacks will need to know the location of the beginning
> >> of the stack as well as the end.
> >>
> >> Another issue is that in general the various features of a stack (type,
> >> size, next stack pointer, description string) are scattered around in
> >> various places throughout the stack dump code.
> >
> > I finally figured out what visit_info is.  But would it make more
> > sense to track it in the unwind state so that the unwinder can
> > directly make sure it doesn't start looping?
> >
> 
> I just realized that it *is* in the unwind state.  But maybe this code
> in update_stack_state:
> 
>     sp = info->next;
> >     if (!sp)
>         goto unknown;
> 
>     if (get_stack_info(sp, state->task, info, &state->stack_mask))
>         goto unknown;
> 
>     if (!on_stack(info, addr, len))
>         goto unknown;
> 
> should do something like:
> 
> if (get_stack_info(addr, ...))
> >   goto unknown;
> 
> sp = info->end;
> 
> instead.  Alternatively, maybe it would make sense to keep sp as is
> (have update_stack_state return bool instead of returning a pointer)
> so that a frame that switches stacks still shows the actual sp at the
> > time that the frame called whatever it called.
> 
> I'm really quite confused by what state->sp means in your current
> code.  In the non-stack-switching case (everything is on the thread
> stack), it appears to always match state->bp.  Am I missing something?
>  If I'm understanding this correctly, when you're pointing at a call
> frame, state->bp is that frame's base address (the top of the stack
> frame), unwind_get_return_address() returns the address to which that
> frame would return, and, in the future, unwind_get_gpr(UNWIND_DI) or
> whatever it ends up looking like will return RDI at the time that the
> frame called whatever function it called, if known.  By that logic,
> shouldn't state->sp be sp on entry to the call instruction?  (Or could
> sp just be removed?  Does it do anything?)

Yeah, I think sp has no purpose and can actually just be removed.

(It was leftover from a previous iteration of the code where it did have
a purpose and I forgot to remove it.)

> I guess the reason I'm still not 100% comfortable with the idea that
> > pt_regs frames don't exist as real frames is that I keep waffling as to
> how I should think about the regs associated with a frame in the
> future DWARF world.  I think I imagine them being the regs at the time
> > that the frame did its call to the next frame, which, by an
> admittedly weak analogy, means that the pt_regs state would be the
> regs at the time that the exception or interrupt happened.  That
> offers a third silly option for dealing with the annoying '?': emit
> two frames for a pt_regs, but have the frame in the entry code show
> NULL for its return address because it's not a normal return.

Well, I'd say let's not get ahead of ourselves.  I think the current
regs-aren't-a-frame design works fine for now, and the code is fairly
simple.  If/when we get a DWARF unwinder, we can revisit that decision.

-- 
Josh


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-23  0:15       ` Andy Lutomirski
@ 2016-07-23 14:04         ` Josh Poimboeuf
  2016-07-26  0:09           ` Andy Lutomirski
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-23 14:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 22, 2016 at 05:15:03PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
> >> > +                            unsigned long *visit_mask)
> >> > +{
> >> > +       unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
> >> > +       unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> >> > +
> >> > +       if (stack < begin || stack >= end)
> >> > +               return false;
> >> > +
> >> > +       if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> >> > +               return false;
> >> > +
> >> > +       info->type      = STACK_TYPE_IRQ;
> >> > +       info->begin     = begin;
> >> > +       info->end       = end;
> >> > +       info->next      = (unsigned long *)*begin;
> >>
> >> This works, but it's a bit magic.  I don't suppose we could get rid of
> >> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> >> move from stack to stack by querying the stack type of whatever the
> >> frame base address is if the frame base address ends up being out of
> >> bounds for the current stack?  Or maybe the unwinder could even do
> >> this by itself.
> >
> > I'm not quite sure what you mean here.  The ->next stack pointer is
> > quite useful and it abstracts that ugliness away from the callers of
> > get_stack_info().  I'm open to any specific suggestions.
> 
> So far I've found two users of this thing.  One is
> show_stack_log_lvl(), and it makes sense there, but maybe
> info->heuristic_next_stack would be a better name.  The other is the
> unwinder itself, and I think that walking from stack to stack using
> this heuristic is the wrong approach there, at least in the long term.
> I'd rather we just followed the bp chain wherever it leads us, as long
> as it leads us to a valid stack that we haven't visited before.
>
> As a concrete example of what I think is wrong with the current
> approach, ISTM it would be totally valid to implement stack switching
> like this:
> 
> some_func:
>  push %rbp
>  mov %rsp, %rbp
>  ...
>  mov [next stack], %rsp
>  call some_other_func
>  mov %rbp, %rsp
>  pop %rbp
>  ret
> 
> With the current approach, you can't unwind out of that function,
> because there's no way to populate info->next.  I'm not actually
> suggesting that the kernel should ever do such a thing on x86, and my
> proposed rewrite of the IRQ stack code [1] will be fully compatible
> with your approach, but it seems odd to me that the unwinder should
> depend on the idea that the stacks in use are chained together in a way
> that can be decoded without .  (But maybe some of the Go compilers do
> work this way -- I've never looked at their precise stack layout.)

I don't think relying on frame pointers to switch between stacks is
necessarily a good idea:

- It requires CONFIG_FRAME_POINTER, which makes it unwinder-specific.
  The current approach is unwinder-agnostic.

- Instead of relying on a single correct "next stack" pointer, it
  requires relying on potentially dozens of correct frame pointers,
  across multiple stacks.  So a lot of things have to go right, instead
  of just one.  And then show_trace_log_lvl() becomes more dependent on
  the unwinder not screwing things up.
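
To make the "next stack" approach discussed above concrete, here is a
rough userspace sketch (not code from the patch set).  It assumes each
stack records a single ->next pointer into the previous stack and that a
visit mask stops the walk from looping.  All of the *_sketch names, the
enum values, and the lookup table are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stack types, loosely modeled on the STACK_TYPE_* names. */
enum stack_type_sketch {
	SKETCH_STACK_TASK,
	SKETCH_STACK_IRQ,
	SKETCH_STACK_EXCEPTION,
};

struct stack_info_sketch {
	enum stack_type_sketch type;
	unsigned long *begin, *end;	/* [begin, end) */
	unsigned long *next;		/* saved pointer into the previous stack */
};

/* Userspace stand-in for test_and_set_bit(): returns the old bit value. */
static bool test_and_set_bit_sketch(unsigned int nr, unsigned long *mask)
{
	bool old = (*mask >> nr) & 1UL;

	*mask |= 1UL << nr;
	return old;
}

/*
 * Find the known stack containing sp.  Fails if sp is on no known stack,
 * or on a stack that has already been visited during this walk.
 */
static bool get_stack_info_sketch(const struct stack_info_sketch *stacks,
				  size_t n, unsigned long *sp,
				  struct stack_info_sketch *info,
				  unsigned long *visit_mask)
{
	for (size_t i = 0; i < n; i++) {
		if (sp < stacks[i].begin || sp >= stacks[i].end)
			continue;
		if (visit_mask &&
		    test_and_set_bit_sketch(stacks[i].type, visit_mask))
			return false;
		*info = stacks[i];
		return true;
	}
	return false;
}

/* Count the stacks traversed when following ->next from sp. */
static int count_stacks_sketch(const struct stack_info_sketch *stacks,
			       size_t n, unsigned long *sp)
{
	struct stack_info_sketch info;
	unsigned long visit_mask = 0;
	int count = 0;

	while (sp && get_stack_info_sketch(stacks, n, sp, &info, &visit_mask)) {
		count++;
		sp = info.next;
	}
	return count;
}
```

Only one pointer per stack has to be correct for the walk to reach the
next stack, and a corrupted ->next that points back at an already
visited stack ends the walk instead of looping.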

> Also, if you ever intend to port this thing to other architectures, I
> think there are architectures that have separate exception stacks and
> that track the next available slot on those stacks dynamically.  I
> think that x86_32 is an example of this if task gates are used in a
> back-and-forth manner, although Linux doesn't do that.  (x86_64 should
> have done this for IST, but it didn't.)  On those architectures, you
> can have two separate switches onto the same stack live at the same
> time, and your current approach won't work.  (Even if you make the
> change I'm suggesting, visit_mask will break, too, but fixing that
> would be a much less invasive change.)
>
> Am I making any sense?  This is a suggestion for making it better, not
> something I see as a requirement for getting the x86 code upstream.

I think porting these interfaces to other architectures could eventually
be a good idea, and you're right that the current approach might need to
be tweaked in order to work everywhere.  (But I agree this needs more
thought and this discussion can wait until later.)

> >> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
> >> > +                              unsigned long *visit_mask)
> >> >  {
> >> >         unsigned long stack = (unsigned long)s;
> >> >         unsigned long begin, end;
> >> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long *s, char **name,
> >> >                 if (stack < begin || stack >= end)
> >> >                         continue;
> >> >
> >> > -               if (test_and_set_bit(k, visit_mask))
> >> > +               if (visit_mask &&
> >> > +                   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
> >> >                         return false;
> >> >
> >> > -               *name = exception_stack_names[k];
> >> > -               return (unsigned long *)end;
> >> > +               info->type      = STACK_TYPE_EXCEPTION + k;
> >> > +               info->begin     = (unsigned long *)begin;
> >> > +               info->end       = (unsigned long *)end;
> >> > +               info->next      = (unsigned long *)info->end[-2];
> >>
> >> This is so magical that I don't immediately see why it's correct.
> >> Presumably it's because the thing two slots down from the top of the
> >> stack is regs->sp?  If so, that needs a comment.
> >
> > Heck if I know, I just stole it from dump_trace() ;-)
> >
> > I'll figure it out and add a comment.
> 
> If you can write it as:
> 
> struct pt_regs *regs = (struct pt_regs *)end - 1;
> info->next = regs->sp;
> 
> and it still works, then no comment required :)

Yeah.  in_irq_stack() does something similar, though it uses end[-1].
And its regs are actually stored on the thread stack.  So something
doesn't quite add up for irqs.  I still need to do some homework there.
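
For the exception stack case, the equivalence between end[-2] and
regs->sp can be checked with a small userspace sketch.  It assumes the
usual x86_64 pt_regs field order, where the hardware frame (ip, cs,
flags, sp, ss) comes last, and that the regs sit at the very top of the
stack.  struct pt_regs_sketch and both helpers are illustrative
stand-ins, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the x86_64 pt_regs layout (an assumption). */
struct pt_regs_sketch {
	unsigned long r15, r14, r13, r12, bp, bx;
	unsigned long r11, r10, r9, r8, ax, cx, dx, si, di;
	unsigned long orig_ax;
	unsigned long ip, cs, flags, sp, ss;	/* hardware frame */
};

/* sp is the second-to-last word of the structure. */
static_assert(offsetof(struct pt_regs_sketch, sp) ==
	      sizeof(struct pt_regs_sketch) - 2 * sizeof(unsigned long),
	      "sp must sit two words below the end of pt_regs");

/* The "magic index" form. */
static unsigned long next_sp_by_index(unsigned long *end)
{
	return end[-2];
}

/* The self-documenting form: treat the top of the stack as a pt_regs. */
static unsigned long next_sp_by_regs(unsigned long *end)
{
	struct pt_regs_sketch *regs = (struct pt_regs_sketch *)end - 1;

	return regs->sp;
}
```

Under those assumptions both helpers read the same word.  This says
nothing about the irq stack case, where the saved pointer is a single
word rather than part of a pt_regs.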

> >> But again, couldn't we use the fact that we now know how to decode
> >> pt_regs to avoid needing this?  I can imagine it being useful as a
> >> fallback in the event that the unwinder fails, but this is just a
> >> fallback.
> >
> > Yeah, this is needed as a fallback.  But I wouldn't call it "just" a
> > fallback: the stack dump code *needs* to be able to still traverse the
> > stacks if frame pointers fail.
> >
> >> Also, NMI is weird and I'm wondering whether this works at
> >> all when trying to unwind from a looped NMI.
> >
> > Unless I'm missing something, I think it should be fine for nested NMIs,
> > since they're all on the same stack.  I can try to test it.  What in
> > particular are you worried about?
> >
> 
> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
> CS, IP) records.  Off the top of my head, the record that matters is
> the third one, so it should be regs[-15].  If an MCE hits while this
> mess is being set up, good luck unwinding *that*.  If you really want
> to know, take a deep breath, read the long comment in entry_64.S after
> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
> architecture.

I did read that comment.  Fortunately there's a big difference between
reading and understanding so I can go on being an ignorant x86 hacker!

For nested NMIs, it does look like the stack of the exception which
interrupted the first NMI would get skipped by the stack dump.  (But
that's a general problem, not specific to my patch set.)

Am I correct in understanding that there can only be one level of NMI
nesting at any given time?  If so, could we make it easier on the
unwinder by putting the nested NMI on a separate software stack, so the
"next stack" pointers are always in the same place?  Or am I just being
naive?

-- 
Josh


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-23 14:04         ` Josh Poimboeuf
@ 2016-07-26  0:09           ` Andy Lutomirski
  2016-07-26 16:26             ` Josh Poimboeuf
  2016-07-26 16:47             ` Josh Poimboeuf
  0 siblings, 2 replies; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-26  0:09 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Jul 22, 2016 at 05:15:03PM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
>> >> > +                            unsigned long *visit_mask)
>> >> > +{
>> >> > +       unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
>> >> > +       unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> >> > +
>> >> > +       if (stack < begin || stack >= end)
>> >> > +               return false;
>> >> > +
>> >> > +       if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> >> > +               return false;
>> >> > +
>> >> > +       info->type      = STACK_TYPE_IRQ;
>> >> > +       info->begin     = begin;
>> >> > +       info->end       = end;
>> >> > +       info->next      = (unsigned long *)*begin;
>> >>
>> >> This works, but it's a bit magic.  I don't suppose we could get rid of
>> >> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
>> >> move from stack to stack by querying the stack type of whatever the
>> >> frame base address is if the frame base address ends up being out of
>> >> bounds for the current stack?  Or maybe the unwinder could even do
>> >> this by itself.
>> >
>> > I'm not quite sure what you mean here.  The ->next stack pointer is
>> > quite useful and it abstracts that ugliness away from the callers of
>> > get_stack_info().  I'm open to any specific suggestions.
>>
>> So far I've found two users of this thing.  One is
>> show_stack_log_lvl(), and it makes sense there, but maybe
>> info->heuristic_next_stack would be a better name.  The other is the
>> unwinder itself, and I think that walking from stack to stack using
>> this heuristic is the wrong approach there, at least in the long term.
>> I'd rather we just followed the bp chain wherever it leads us, as long
>> as it leads us to a valid stack that we haven't visited before.
>>
>> As a concrete example of what I think is wrong with the current
>> approach, ISTM it would be totally valid to implement stack switching
>> like this:
>>
>> some_func:
>>  push %rbp
>>  mov %rsp, %rbp
>>  ...
>>  mov [next stack], %rsp
>>  call some_other_func
>>  mov %rbp, %rsp
>>  pop %rbp
>>  ret
>>
>> With the current approach, you can't unwind out of that function,
>> because there's no way to populate info->next.  I'm not actually
>> suggesting that the kernel should ever do such a thing on x86, and my
>> proposed rewrite of the IRQ stack code [1] will be fully compatible
>> with your approach, but it seems odd to me that the unwinder should
>> depend on the idea that the stacks in use are chained together in a way
>> that can be decoded without .  (But maybe some of the Go compilers do
>> work this way -- I've never looked at their precise stack layout.)
>
> I don't think relying on frame pointers to switch between stacks is
> necessarily a good idea:
>
> - It requires CONFIG_FRAME_POINTER, which makes it unwinder-specific.
>   The current approach is unwinder-agnostic.
>
> - Instead of relying on a single correct "next stack" pointer, it
>   requires relying on potentially dozens of correct frame pointers,
>   across multiple stacks.  So a lot of things have to go right, instead
>   of just one.  And then show_trace_log_lvl() becomes more dependent on
>   the unwinder not screwing things up.

That's a fair point, at least for show_trace_log_lvl().  So let's
leave it alone for now.  We can always revisit it later.

>>
>> If you can write it as:
>>
>> struct pt_regs *regs = (struct pt_regs *)end - 1;
>> info->next = regs->sp;
>>
>> and it still works, then no comment required :)
>
> Yeah.  in_irq_stack() does something similar, though it uses end[-1].
> And its regs are actually stored on the thread stack.  So something
> doesn't quite add up for irqs.  I still need to do some homework there.

I can do your homework for you: the irq stack doesn't contain pt_regs.

The current code is quite hard to understand, but this patch of mine
(which I'll try to dust off and send in soon) cleans it up and should
be much easier to understand:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_ist&id=2231ec7e0bcc1a2bc94a17081511ab54cc6badd1

So a comment like /* When the IRQ stack is in use, the top word stores
the previous stack pointer. */ should do the trick.

>> >
>> > Unless I'm missing something, I think it should be fine for nested NMIs,
>> > since they're all on the same stack.  I can try to test it.  What in
>> > particular are you worried about?
>> >
>>
>> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> CS, IP) records.  Off the top of my head, the record that matters is
>> the third one, so it should be regs[-15].  If an MCE hits while this
>> mess is being set up, good luck unwinding *that*.  If you really want
>> to know, take a deep breath, read the long comment in entry_64.S after
>> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
>> architecture.
>
> I did read that comment.  Fortunately there's a big difference between
> reading and understanding so I can go on being an ignorant x86 hacker!
>
> For nested NMIs, it does look like the stack of the exception which
> interrupted the first NMI would get skipped by the stack dump.  (But
> that's a general problem, not specific to my patch set.)

If we end up with task -> IST -> NMI -> same IST, we're doomed and
we're going to crash, so it doesn't matter much whether the unwinder
works.  Is that what you mean?

>
> Am I correct in understanding that there can only be one level of NMI
> nesting at any given time?  If so, could we make it easier on the
> unwinder by putting the nested NMI on a separate software stack, so the
> "next stack" pointers are always in the same place?  Or am I just being
> naive?

I think you're being naive :)

But we don't really need the unwinder to be 100% faithful here.  If we have:

task stack
NMI
nested NMI

then the nested NMI code won't call into C and thus it should be
impossible to ever invoke your unwinder on that state.  Instead the
nested NMI code will fiddle with the saved regs and return in such a
way that the outer NMI will be forced to loop through again.  So it
*should* (assuming I'm remembering all this crap correctly) be
sufficient to find the "outermost" pt_regs, which is sitting at
(struct pt_regs *)(end - 12) - 1 or thereabouts and look at its ->sp
value.  This ought to be the same thing that the frame-based unwinder
would naturally try to do.  But if you make this change, ISTM you
should make it separately because it does change behavior and Linus is
understandably leery about that.

Hmm.  I wonder if it would make sense to decode this thing both ways
and display it.  So show_trace_log_lvl() could print something like:

<#DB (0xffffwhatever000-0xffffwhateverfff), next frame is at 0xffffsomething>

and, in the case where the frame unwinder disagrees, it'll at least be
visible in that 0xffffsomething won't be between 0xffffwhatever000 and
0xffffwhateverfff.

Then Linus is happy because the unwinder works just like it did before
but people like me are also happy because it's clearer what's going
on.  FWIW, I've personally debugged crashes in the NMI code where the
current unwinder falls apart completely and it's not fun -- having a
display indicating that the unwinder got confused would be nice.
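
A rough userspace sketch of what such a banner could look like, with the
disagreement check made explicit.  The function names, the exact format
string, and the idea of passing the bounds as plain integers are
assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdio.h>

/* True if the unwinder's next frame lies inside [begin, end). */
static bool next_frame_in_stack(unsigned long begin, unsigned long end,
				unsigned long next_frame)
{
	return next_frame >= begin && next_frame < end;
}

/*
 * Format a banner such as
 *   <#DB (0x1000-0x1fff), next frame is at 0x1f80>
 * The second address printed is the last byte of the stack, end - 1.
 * Returns what snprintf() returns.
 */
static int format_stack_banner(char *buf, size_t len, const char *name,
			       unsigned long begin, unsigned long end,
			       unsigned long next_frame)
{
	return snprintf(buf, len,
			"<%s (0x%lx-0x%lx), next frame is at 0x%lx>",
			name, begin, end - 1, next_frame);
}
```

A reader (or a later warning in the dump code) can then compare the
printed frame address against the printed range to see whether the
frame unwinder and the stack heuristic agree.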

--Andy


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-26  0:09           ` Andy Lutomirski
@ 2016-07-26 16:26             ` Josh Poimboeuf
  2016-07-26 17:51               ` Steven Rostedt
  2016-07-26 20:59               ` Andy Lutomirski
  2016-07-26 16:47             ` Josh Poimboeuf
  1 sibling, 2 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-26 16:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
> >> > since they're all on the same stack.  I can try to test it.  What in
> >> > particular are you worried about?
> >> >
> >>
> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
> >> CS, IP) records.  Off the top of my head, the record that matters is
> >> the third one, so it should be regs[-15].  If an MCE hits while this
> >> mess is being set up, good luck unwinding *that*.  If you really want
> >> to know, take a deep breath, read the long comment in entry_64.S after
> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
> >> architecture.
> >
> > I did read that comment.  Fortunately there's a big difference between
> > reading and understanding so I can go on being an ignorant x86 hacker!
> >
> > For nested NMIs, it does look like the stack of the exception which
> > interrupted the first NMI would get skipped by the stack dump.  (But
> > that's a general problem, not specific to my patch set.)
> 
> If we end up with task -> IST -> NMI -> same IST, we're doomed and
> we're going to crash, so it doesn't matter much whether the unwinder
> works.  Is that what you mean?

I read the NMI entry code again, and now I realize my comment was
completely misinformed, so never mind.

Is "IST -> NMI -> same IST" even possible, since the other IST's are
higher priority than NMI?

> > Am I correct in understanding that there can only be one level of NMI
> > nesting at any given time?  If so, could we make it easier on the
> > unwinder by putting the nested NMI on a separate software stack, so the
> > "next stack" pointers are always in the same place?  Or am I just being
> > naive?
> 
> I think you're being naive :)
> 
> But we don't really need the unwinder to be 100% faithful here.  If we have:
> 
> task stack
> NMI
> nested NMI
> 
> then the nested NMI code won't call into C and thus it should be
> impossible to ever invoke your unwinder on that state.  Instead the
> nested NMI code will fiddle with the saved regs and return in such a
> way that the outer NMI will be forced to loop through again.  So it
> *should* (assuming I'm remembering all this crap correctly) be
> sufficient to find the "outermost" pt_regs, which is sitting at
> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at its ->sp
> value.  This ought to be the same thing that the frame-based unwinder
> would naturally try to do.  But if you make this change, ISTM you
> should make it separately because it does change behavior and Linus is
> understandably leery about that.

Ok, I think that makes sense to me now.  As I understand it, the
"outermost" RIP is the authoritative one, because it was written by the
original NMI.  Any nested NMIs will update the original and/or iret
RIPs, which will only ever point to NMI entry code, and so they should
be ignored.

But I think there's a case where this wouldn't work:

task stack
NMI
IST
stack dump

If the IST interrupt hits before the NMI has a chance to update the
outermost regs, the authoritative RIP would be the original one written
by HW, right?

> Hmm.  I wonder if it would make sense to decode this thing both ways
> and display it.  So show_trace_log_lvl() could print something like:
> 
> <#DB (0xffffwhatever000-0xffffwhateverfff), next frame is at 0xffffsomething>
> 
> and, in the case where the frame unwinder disagrees, it'll at least be
> visible in that 0xffffsomething won't be between 0xffffwhatever000 and
> 0xffffwhateverfff.
> 
> Then Linus is happy because the unwinder works just like it did before
> but people like me are also happy because it's clearer what's going
> on.  FWIW, I've personally debugged crashes in the NMI code where the
> current unwinder falls apart completely and it's not fun -- having a
> display indicating that the unwinder got confused would be nice.

Hm, maybe.  Another idea would be to have the unwinder print some kind
of warning if it goes off the rails.  It should be able to detect that,
because every stack trace should end at a user pt_regs.

-- 
Josh


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-26  0:09           ` Andy Lutomirski
  2016-07-26 16:26             ` Josh Poimboeuf
@ 2016-07-26 16:47             ` Josh Poimboeuf
  2016-07-26 17:49               ` Brian Gerst
  1 sibling, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-26 16:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > Am I correct in understanding that there can only be one level of NMI
> > nesting at any given time?  If so, could we make it easier on the
> > unwinder by putting the nested NMI on a separate software stack, so the
> > "next stack" pointers are always in the same place?  Or am I just being
> > naive?
> 
> I think you're being naive :)

Another dumb question: since NMIs are reentrant, have you considered
removing the NMI IST entry, and instead just have NMIs keep using the
current stack?

The first NMI could then be switched to an NMI software stack, like IRQs
(assuming there's a way to do that atomically!).  And then determining
the context of subsequent NMIs would be straightforward, and we'd no
longer need to jump through all those horrible hoops in the entry code
to deal with NMI nesting.

Now you can tell me what else I'm missing...

-- 
Josh


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-26 16:47             ` Josh Poimboeuf
@ 2016-07-26 17:49               ` Brian Gerst
  2016-07-26 18:59                 ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Brian Gerst @ 2016-07-26 17:49 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, linux-kernel, Linus Torvalds, Steven Rostedt, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Jul 26, 2016 at 12:47 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > Am I correct in understanding that there can only be one level of NMI
>> > nesting at any given time?  If so, could we make it easier on the
>> > unwinder by putting the nested NMI on a separate software stack, so the
>> > "next stack" pointers are always in the same place?  Or am I just being
>> > naive?
>>
>> I think you're being naive :)
>
> Another dumb question: since NMIs are reentrant, have you considered
> removing the NMI IST entry, and instead just have NMIs keep using the
> current stack?
>
> The first NMI could then be switched to an NMI software stack, like IRQs
> (assuming there's a way to do that atomically!).  And then determining
> the context of subsequent NMIs would be straightforward, and we'd no
> longer need to jump through all those horrible hoops in the entry code
> to deal with NMI nesting.
>
> Now you can tell me what else I'm missing...

There are several places (most notably SYSCALL entry) where the kernel
stack pointer is unsafe/user controlled for a brief time.  Since an
NMI can interrupt anywhere in the kernel, you have to use an IST to
protect against that case.

Blame Intel's legacy behavior for this mess, because IRET
unconditionally re-enables NMIs even if you are returning from another
exception like a page fault.  This wasn't a problem on the 8086 which
didn't have an MMU, but makes no sense on modern systems.

--
Brian Gerst


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-26 16:26             ` Josh Poimboeuf
@ 2016-07-26 17:51               ` Steven Rostedt
  2016-07-26 18:56                 ` Josh Poimboeuf
  2016-07-26 20:59               ` Andy Lutomirski
  1 sibling, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2016-07-26 17:51 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, linux-kernel, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, 26 Jul 2016 11:26:42 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:


> Ok, I think that makes sense to me now.  As I understand it, the
> "outermost" RIP is the authoritative one, because it was written by the
> original NMI.  Any nested NMIs will update the original and/or iret
> RIPs, which will only ever point to NMI entry code, and so they should
> be ignored.

Just to confirm:

  -- top-of-stack --
  [ hardware written stack ] <- what the NMI hardware mechanism wrote
  [ internal variables ] <- you don't need to know what this is
  [ where to go next ] <- the stack to use to return on current NMI
  [ original copy of hardware stack ] <- the stack of the first NMI

IIRC, the original version had the "where to go next" stack last, but
to keep pt_regs in line with the stack, it made sense to have the
original NMI stack at the bottom, just above pt_regs, like a real
interrupt would.
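
The arithmetic behind the layout above can be sketched in userspace C.
This assumes each hardware-style frame is five words (SS, SP, FLAGS, CS,
IP, as mentioned earlier in the thread) and that the internal variables
take two words.  The two-word figure is an assumption chosen so that the
total matches the "end - 12" estimate given earlier in the thread; the
names are hypothetical.

```c
#include <stddef.h>

enum {
	SKETCH_HW_FRAME_WORDS	  = 5,	/* SS, SP, FLAGS, CS, IP */
	SKETCH_INTERNAL_VAR_WORDS = 2,	/* assumed size of the internal variables */
};

/*
 * Number of words between the top of the NMI stack and the end of the
 * original ("outermost") frame: the hardware-written frame, the internal
 * variables, and the "where to go next" frame all sit above it.
 */
static size_t nmi_outermost_frame_end_offset(void)
{
	return SKETCH_HW_FRAME_WORDS + SKETCH_INTERNAL_VAR_WORDS +
	       SKETCH_HW_FRAME_WORDS;
}

/*
 * Read the saved SP from the outermost frame.  Within a frame SS is the
 * highest word and SP sits just below it.
 */
static unsigned long nmi_outermost_sp(const unsigned long *end)
{
	const unsigned long *frame_end = end - nmi_outermost_frame_end_offset();

	return frame_end[-2];
}
```

With those sizes the offset comes out to 12 words, so the outermost
frame's SP would be end[-14].  If the real layout differs, only the two
constants need to change.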

> 
> But I think there's a case where this wouldn't work:
> 
> task stack
> NMI
> IST
> stack dump
> 
> If the IST interrupt hits before the NMI has a chance to update the
> outermost regs, the authoritative RIP would be the original one written
> by HW, right?

The only IST interrupt that would hit there is MCE and it would
probably be a critical error. Do we really need to worry about such an
unlikely scenario? The system is probably doomed anyway.

-- Steve


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-26 17:51               ` Steven Rostedt
@ 2016-07-26 18:56                 ` Josh Poimboeuf
  0 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-26 18:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, linux-kernel, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Jul 26, 2016 at 01:51:27PM -0400, Steven Rostedt wrote:
> On Tue, 26 Jul 2016 11:26:42 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> 
> > Ok, I think that makes sense to me now.  As I understand it, the
> > "outermost" RIP is the authoritative one, because it was written by the
> > original NMI.  Any nested NMIs will update the original and/or iret
> > RIPs, which will only ever point to NMI entry code, and so they should
> > be ignored.
> 
> Just to confirm:
> 
>   -- top-of-stack --
>   [ hardware written stack ] <- what the NMI hardware mechanism wrote
>   [ internal variables ] <- you don't need to know what this is
>   [ where to go next ] <- the stack to use to return on current NMI
>   [ original copy of hardware stack ] <- the stack of the first NMI
> 
> IIRC, the original version had the "where to go next" stack last, but
> to keep pt_regs in line with the stack, it made sense to have the
> original NMI stack at the bottom, just above pt_regs, like a real
> interrupt would.
> 
> > 
> > But I think there's a case where this wouldn't work:
> > 
> > task stack
> > NMI
> > IST
> > stack dump
> > 
> > If the IST interrupt hits before the NMI has a chance to update the
> > outermost regs, the authoritative RIP would be the original one written
> > by HW, right?
> 
> The only IST interrupt that would hit there is MCE and it would
> probably be a critical error. Do we really need to worry about such an
> unlikely scenario? The system is probably doomed anyway.

According to entry_64.S:

	/*
	 * We allow breakpoints in NMIs. If a breakpoint occurs, then
	 * the iretq it performs will take us out of NMI context.
	 * This means that we can have nested NMIs where the next
	 * NMI is using the top of the stack of the previous NMI.

So I think this means that when a debug exception returns to an NMI with
iret, further NMIs are no longer masked.

-- 
Josh


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-26 17:49               ` Brian Gerst
@ 2016-07-26 18:59                 ` Josh Poimboeuf
  0 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-26 18:59 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, linux-kernel, Linus Torvalds, Steven Rostedt, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Jul 26, 2016 at 01:49:06PM -0400, Brian Gerst wrote:
> On Tue, Jul 26, 2016 at 12:47 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> >> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > Am I correct in understanding that there can only be one level of NMI
> >> > nesting at any given time?  If so, could we make it easier on the
> >> > unwinder by putting the nested NMI on a separate software stack, so the
> >> > "next stack" pointers are always in the same place?  Or am I just being
> >> > naive?
> >>
> >> I think you're being naive :)
> >
> > Another dumb question: since NMIs are reentrant, have you considered
> > removing the NMI IST entry, and instead just have NMIs keep using the
> > current stack?
> >
> > The first NMI could then be switched to an NMI software stack, like IRQs
> > (assuming there's a way to do that atomically!).  And then determining
> > the context of subsequent NMIs would be straightforward, and we'd no
> > longer need to jump through all those horrible hoops in the entry code
> > to deal with NMI nesting.
> >
> > Now you can tell me what else I'm missing...
> 
> There are several places (most notably SYSCALL entry) where the kernel
> stack pointer is unsafe/user controlled for a brief time.  Since an
> NMI can interrupt anywhere in the kernel, you have to use an IST to
> protect against that case.

Ah, that makes sense.  Thanks.

-- 
Josh


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-26 16:26             ` Josh Poimboeuf
  2016-07-26 17:51               ` Steven Rostedt
@ 2016-07-26 20:59               ` Andy Lutomirski
  2016-07-26 22:24                 ` Josh Poimboeuf
  1 sibling, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-26 20:59 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
>> >> > since they're all on the same stack.  I can try to test it.  What in
>> >> > particular are you worried about?
>> >> >
>> >>
>> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> >> CS, IP) records.  Off the top of my head, the record that matters is
>> >> the third one, so it should be regs[-15].  If an MCE hits while this
>> >> mess is being set up, good luck unwinding *that*.  If you really want
>> >> to know, take a deep breath, read the long comment in entry_64.S after
>> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
>> >> architecture.
>> >
>> > I did read that comment.  Fortunately there's a big difference between
>> > reading and understanding so I can go on being an ignorant x86 hacker!
>> >
>> > For nested NMIs, it does look like the stack of the exception which
>> > interrupted the first NMI would get skipped by the stack dump.  (But
>> > that's a general problem, not specific to my patch set.)
>>
>> If we end up with task -> IST -> NMI -> same IST, we're doomed and
>> we're going to crash, so it doesn't matter much whether the unwinder
>> works.  Is that what you mean?
>
> I read the NMI entry code again, and now I realize my comment was
> completely misinformed, so never mind.
>
> Is "IST -> NMI -> same IST" even possible, since the other IST's are
> higher priority than NMI?

Priority only matters for events that happen concurrently.
Synchronous things like #DB will always fire if the conditions that
trigger them are hit.

>
>> > Am I correct in understanding that there can only be one level of NMI
>> > nesting at any given time?  If so, could we make it easier on the
>> > unwinder by putting the nested NMI on a separate software stack, so the
>> > "next stack" pointers are always in the same place?  Or am I just being
>> > naive?
>>
>> I think you're being naive :)
>>
>> But we don't really need the unwinder to be 100% faithful here.  If we have:
>>
>> task stack
>> NMI
>> nested NMI
>>
>> then the nested NMI code won't call into C and thus it should be
>> impossible to ever invoke your unwinder on that state.  Instead the
>> nested NMI code will fiddle with the saved regs and return in such a
>> way that the outer NMI will be forced to loop through again.  So it
>> *should* (assuming I'm remembering all this crap correctly) be
>> sufficient to find the "outermost" pt_regs, which is sitting at
>> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at its ->sp
>> value.  This ought to be the same thing that the frame-based unwinder
>> would naturally try to do.  But if you make this change, ISTM you
>> should make it separately because it does change behavior and Linus is
>> understandably leery about that.
>
> Ok, I think that makes sense to me now.  As I understand it, the
> "outermost" RIP is the authoritative one, because it was written by the
> original NMI.  Any nested NMIs will update the original and/or iret
> RIPs, which will only ever point to NMI entry code, and so they should
> be ignored.
>
> But I think there's a case where this wouldn't work:
>
> task stack
> NMI
> IST
> stack dump
>
> If the IST interrupt hits before the NMI has a chance to update the
> outermost regs, the authoritative RIP would be the original one written
> by HW, right?

This should be impossible unless that last entry is MCE.  If we
actually fire an event that isn't MCE early in NMI entry, something
already went very wrong.

For NMI -> MCE -> stack dump, the frame-based unwinder will do better
than get_stack_info() unless get_stack_info() learns to use the
top-of-stack hardware copy if the actual RSP it finds is too high
(above the "outermost" frame).
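
The fallback described above can be modelled in user space.  This is
only a sketch, not the kernel's code: the toy_* names are hypothetical,
and the layout (12 words between the end of the NMI stack and the end
of the "outermost" frame) is an assumption taken from the
"(end - 12)" arithmetic quoted earlier in this thread.  The exact
offsets are "or thereabouts", as noted above.

```c
#include <stddef.h>

/* Word offsets inside one hardware-style exception frame, lowest first. */
enum { TOY_IP, TOY_CS, TOY_FLAGS, TOY_SP, TOY_SS, TOY_FRAME_WORDS };

/*
 * ASSUMPTION: number of words between the end of the NMI stack and the
 * end of the "outermost" frame, per the "(end - 12)" estimate in this
 * thread.
 */
#define TOY_WORDS_ABOVE_OUTERMOST 12

/* The frame pushed by hardware sits at the very top of the stack. */
static const unsigned long *toy_hw_frame(const unsigned long *end)
{
	return end - TOY_FRAME_WORDS;
}

/* The "outermost" copy sits TOY_WORDS_ABOVE_OUTERMOST words lower. */
static const unsigned long *toy_outermost_frame(const unsigned long *end)
{
	return end - TOY_WORDS_ABOVE_OUTERMOST - TOY_FRAME_WORDS;
}

/*
 * If the interrupted stack pointer is still above the start of the
 * "outermost" frame, that frame has not been fully written yet, so
 * fall back to the hardware copy at the top of the stack.
 */
static const unsigned long *toy_pick_frame(const unsigned long *end,
					   const unsigned long *interrupted_sp)
{
	const unsigned long *outer = toy_outermost_frame(end);

	if (interrupted_sp > outer)
		return toy_hw_frame(end);
	return outer;
}
```

Indexing the returned pointer with TOY_SP then gives the saved stack
pointer to continue unwinding from.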

>
>> Hmm.  I wonder if it would make sense to decode this thing both ways
>> and display it.  So show_trace_log_lvl() could print something like:
>>
>> <#DB (0xffffwhatever000-0xffffwhateverfff), next frame is at 0xffffsomething>
>>
>> and, in the case where the frame unwinder disagrees, it'll at least be
>> visible in that 0xffffsomething won't be between 0xffffwhatever000 and
>> 0xffffwhateverfff.
>>
>> Then Linus is happy because the unwinder works just like it did before
>> but people like me are also happy because it's clearer what's going
>> on.  FWIW, I've personally debugged crashes in the NMI code where the
>> current unwinder falls apart completely and it's not fun -- having a
>> display indicating that the unwinder got confused would be nice.
>
> Hm, maybe.  Another idea would be to have the unwinder print some kind
> of warning if it goes off the rails.  It should be able to detect that,
> because every stack trace should end at a user pt_regs.

I like that.

Be careful, though: kernel threads might not have a "user" pt_regs in
the "user_mode" returns true sense.  Checking that it's either
user_mode() or at task_pt_regs() might be a good condition to check.

--Andy


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-26 20:59               ` Andy Lutomirski
@ 2016-07-26 22:24                 ` Josh Poimboeuf
  2016-07-26 22:31                   ` Steven Rostedt
  2016-07-26 22:37                   ` Andy Lutomirski
  0 siblings, 2 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-26 22:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Jul 26, 2016 at 01:59:20PM -0700, Andy Lutomirski wrote:
> On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> >> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
> >> >> > since they're all on the same stack.  I can try to test it.  What in
> >> >> > particular are you worried about?
> >> >> >
> >> >>
> >> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
> >> >> CS, IP) records.  Off the top of my head, the record that matters is
> >> >> the third one, so it should be regs[-15].  If an MCE hits while this
> >> >> mess is being set up, good luck unwinding *that*.  If you really want
> >> >> to know, take a deep breath, read the long comment in entry_64.S after
> >> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
> >> >> architecture.
> >> >
> >> > I did read that comment.  Fortunately there's a big difference between
> >> > reading and understanding so I can go on being an ignorant x86 hacker!
> >> >
> >> > For nested NMIs, it does look like the stack of the exception which
> >> > interrupted the first NMI would get skipped by the stack dump.  (But
> >> > that's a general problem, not specific to my patch set.)
> >>
> >> If we end up with task -> IST -> NMI -> same IST, we're doomed and
> >> we're going to crash, so it doesn't matter much whether the unwinder
> >> works.  Is that what you mean?
> >
> > I read the NMI entry code again, and now I realize my comment was
> > completely misinformed, so never mind.
> >
> > Is "IST -> NMI -> same IST" even possible, since the other IST's are
> > higher priority than NMI?
> 
> Priority only matters for events that happen concurrently.
> Synchronous things like #DB will always fire if the conditions that
> > trigger them are hit.

So just to clarify, are you saying a lower priority exception like NMI
can interrupt a higher priority exception handler like #DB?  I'm getting
a different conclusion from reading section 6.9 of the Intel System
Programming Guide.

> >> > Am I correct in understanding that there can only be one level of NMI
> >> > nesting at any given time?  If so, could we make it easier on the
> >> > unwinder by putting the nested NMI on a separate software stack, so the
> >> > "next stack" pointers are always in the same place?  Or am I just being
> >> > naive?
> >>
> >> I think you're being naive :)
> >>
> >> But we don't really need the unwinder to be 100% faithful here.  If we have:
> >>
> >> task stack
> >> NMI
> >> nested NMI
> >>
> >> then the nested NMI code won't call into C and thus it should be
> >> impossible to ever invoke your unwinder on that state.  Instead the
> >> nested NMI code will fiddle with the saved regs and return in such a
> >> way that the outer NMI will be forced to loop through again.  So it
> >> *should* (assuming I'm remembering all this crap correctly) be
> >> sufficient to find the "outermost" pt_regs, which is sitting at
> >> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at its ->sp
> >> value.  This ought to be the same thing that the frame-based unwinder
> >> would naturally try to do.  But if you make this change, ISTM you
> >> should make it separately because it does change behavior and Linus is
> >> understandably leery about that.
> >
> > Ok, I think that makes sense to me now.  As I understand it, the
> > "outermost" RIP is the authoritative one, because it was written by the
> > original NMI.  Any nested NMIs will update the original and/or iret
> > RIPs, which will only ever point to NMI entry code, and so they should
> > be ignored.
> >
> > But I think there's a case where this wouldn't work:
> >
> > task stack
> > NMI
> > IST
> > stack dump
> >
> > If the IST interrupt hits before the NMI has a chance to update the
> > outermost regs, the authoritative RIP would be the original one written
> > by HW, right?
> 
> This should be impossible unless that last entry is MCE.  If we
> actually fire an event that isn't MCE early in NMI entry, something
> already went very wrong.

So we don't need to support breakpoints in the early NMI entry code?

> For NMI -> MCE -> stack dump, the frame-based unwinder will do better
> than get_stack_info() unless get_stack_info() learns to use the
> top-of-stack hardware copy if the actual RSP it finds is too high
> (above the "outermost" frame).

Ok.

> >> Hmm.  I wonder if it would make sense to decode this thing both ways
> >> and display it.  So show_trace_log_lvl() could print something like:
> >>
> >> <#DB (0xffffwhatever000-0xffffwhateverfff), next frame is at 0xffffsomething>
> >>
> >> and, in the case where the frame unwinder disagrees, it'll at least be
> >> visible in that 0xffffsomething won't be between 0xffffwhatever000 and
> >> 0xffffwhateverfff.
> >>
> >> Then Linus is happy because the unwinder works just like it did before
> >> but people like me are also happy because it's clearer what's going
> >> on.  FWIW, I've personally debugged crashes in the NMI code where the
> >> current unwinder falls apart completely and it's not fun -- having a
> >> display indicating that the unwinder got confused would be nice.
> >
> > Hm, maybe.  Another idea would be to have the unwinder print some kind
> > of warning if it goes off the rails.  It should be able to detect that,
> > because every stack trace should end at a user pt_regs.
> 
> I like that.
> 
> Be careful, though: kernel threads might not have a "user" pt_regs in
> the "user_mode" returns true sense.  Checking that it's either
> user_mode() or at task_pt_regs() might be a good condition to check.

Yeah.  I guess there are two distinct cases of "going off the rails":

1) The unwinder doesn't get to the end of the stack (user regs for user
   tasks, or whatever the end is for kthreads).

2) The unwinder strays away from the current stack's "previous stack"
   pointer.

We could warn on either case (though there's probably overlap between
the two).
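
As a rough illustration, the two checks could look something like the
user-space sketch below.  All of the toy_* types and helpers are
hypothetical stand-ins, not existing kernel interfaces; toy_user_mode()
only mimics the idea behind user_mode(), and the task_regs argument
stands in for what task_pt_regs() would return.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the one pt_regs field this sketch needs. */
struct toy_regs {
	unsigned long cs;	/* low two bits hold the privilege level */
};

/* Hypothetical description of one stack's address range. */
struct toy_stack {
	unsigned long begin;	/* lowest address of the stack */
	unsigned long end;	/* one past the highest address */
};

/* Mimics the idea of user_mode(): CPL 3 in the saved CS. */
static bool toy_user_mode(const struct toy_regs *regs)
{
	return (regs->cs & 3) == 3;
}

/*
 * Case 1: did the walk end where a trace is expected to end?  Either
 * at user-mode regs, or (for kthreads) at the task's own regs slot.
 */
static bool toy_reached_end(const struct toy_regs *last,
			    const struct toy_regs *task_regs)
{
	return last && (toy_user_mode(last) || last == task_regs);
}

static bool toy_on_stack(const struct toy_stack *s, unsigned long addr)
{
	return addr >= s->begin && addr < s->end;
}

/*
 * Case 2: 'expected' is the stack that the current stack's "previous
 * stack" pointer refers to.  The unwinder has strayed if the address
 * it moved to is not on that stack.
 */
static bool toy_strayed(const struct toy_stack *expected,
			unsigned long unwinder_sp)
{
	return !toy_on_stack(expected, unwinder_sp);
}
```

A caller would warn if toy_reached_end() returns false when the walk
stops, or if toy_strayed() returns true on a stack switch.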

-- 
Josh


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-26 22:24                 ` Josh Poimboeuf
@ 2016-07-26 22:31                   ` Steven Rostedt
  2016-07-26 22:37                   ` Andy Lutomirski
  1 sibling, 0 replies; 91+ messages in thread
From: Steven Rostedt @ 2016-07-26 22:31 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, linux-kernel, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, 26 Jul 2016 17:24:54 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> > This should be impossible unless that last entry is MCE.  If we
> > actually fire an event that isn't MCE early in NMI entry, something
> > already went very wrong.  
> 
> So we don't need to support breakpoints in the early NMI entry code?

Yes, if that happens, then bad things can really happen.

The only way a breakpoint could be added there is perhaps with KGDB,
and that's just asking for trouble anyway.

-- Steve


* Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
  2016-07-26 22:24                 ` Josh Poimboeuf
  2016-07-26 22:31                   ` Steven Rostedt
@ 2016-07-26 22:37                   ` Andy Lutomirski
  1 sibling, 0 replies; 91+ messages in thread
From: Andy Lutomirski @ 2016-07-26 22:37 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML,
	linux-kernel, Linus Torvalds, Steven Rostedt, Brian Gerst,
	Kees Cook, Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Jul 26, 2016 at 3:24 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Tue, Jul 26, 2016 at 01:59:20PM -0700, Andy Lutomirski wrote:
>> On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> >> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
>> >> >> > since they're all on the same stack.  I can try to test it.  What in
>> >> >> > particular are you worried about?
>> >> >> >
>> >> >>
>> >> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> >> >> CS, IP) records.  Off the top of my head, the record that matters is
>> >> >> the third one, so it should be regs[-15].  If an MCE hits while this
>> >> >> mess is being set up, good luck unwinding *that*.  If you really want
>> >> >> to know, take a deep breath, read the long comment in entry_64.S after
>> >> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
>> >> >> architecture.
>> >> >
>> >> > I did read that comment.  Fortunately there's a big difference between
>> >> > reading and understanding so I can go on being an ignorant x86 hacker!
>> >> >
>> >> > For nested NMIs, it does look like the stack of the exception which
>> >> > interrupted the first NMI would get skipped by the stack dump.  (But
>> >> > that's a general problem, not specific to my patch set.)
>> >>
>> >> If we end up with task -> IST -> NMI -> same IST, we're doomed and
>> >> we're going to crash, so it doesn't matter much whether the unwinder
>> >> works.  Is that what you mean?
>> >
>> > I read the NMI entry code again, and now I realize my comment was
>> > completely misinformed, so never mind.
>> >
>> > Is "IST -> NMI -> same IST" even possible, since the other IST's are
>> > higher priority than NMI?
>>
>> Priority only matters for events that happen concurrently.
>> Synchronous things like #DB will always fire if the conditions that
>> trigger them are hit.
>
> So just to clarify, are you saying a lower priority exception like NMI
> can interrupt a higher priority exception handler like #DB?  I'm getting
> a different conclusion from reading section 6.9 of the Intel System
> Programming Guide.

Yes, effectively.  From the CPU's perspective, it's done with the #DB
as soon as it finishes pushing the stack frame and starts running
instructions again.  So the chain of events looks like:


<-- CPU is delivering #DB.  NMI can't be delivered.
debug:
<-- Oh boy, done with delivering #DB.  NMIs can be delivered again!
  pushq $whatever
  ...
  iretq  <-- CPU has no idea that this is related to the #DB

>
>> >> > Am I correct in understanding that there can only be one level of NMI
>> >> > nesting at any given time?  If so, could we make it easier on the
>> >> > unwinder by putting the nested NMI on a separate software stack, so the
>> >> > "next stack" pointers are always in the same place?  Or am I just being
>> >> > naive?
>> >>
>> >> I think you're being naive :)
>> >>
>> >> But we don't really need the unwinder to be 100% faithful here.  If we have:
>> >>
>> >> task stack
>> >> NMI
>> >> nested NMI
>> >>
>> >> then the nested NMI code won't call into C and thus it should be
>> >> impossible to ever invoke your unwinder on that state.  Instead the
>> >> nested NMI code will fiddle with the saved regs and return in such a
>> >> way that the outer NMI will be forced to loop through again.  So it
>> >> *should* (assuming I'm remembering all this crap correctly) be
>> >> sufficient to find the "outermost" pt_regs, which is sitting at
>> >> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at its ->sp
>> >> value.  This ought to be the same thing that the frame-based unwinder
>> >> would naturally try to do.  But if you make this change, ISTM you
>> >> should make it separately because it does change behavior and Linus is
>> >> understandably leery about that.
>> >
>> > Ok, I think that makes sense to me now.  As I understand it, the
>> > "outermost" RIP is the authoritative one, because it was written by the
>> > original NMI.  Any nested NMIs will update the original and/or iret
>> > RIPs, which will only ever point to NMI entry code, and so they should
>> > be ignored.
>> >
>> > But I think there's a case where this wouldn't work:
>> >
>> > task stack
>> > NMI
>> > IST
>> > stack dump
>> >
>> > If the IST interrupt hits before the NMI has a chance to update the
>> > outermost regs, the authoritative RIP would be the original one written
>> > by HW, right?
>>
>> This should be impossible unless that last entry is MCE.  If we
>> actually fire an event that isn't MCE early in NMI entry, something
>> already went very wrong.
>
> So we don't need to support breakpoints in the early NMI entry code?

No.  Instead we try not to let it happen.  See, for example:

commit e5779e8e12299f77c2421a707855d8d124171d85
Author: Andy Lutomirski <luto@kernel.org>
Date:   Thu Jul 30 20:32:40 2015 -0700

    perf/x86/hw_breakpoints: Disallow kernel breakpoints unless kprobe-safe


>>
>> Be careful, though: kernel threads might not have a "user" pt_regs in
>> the "user_mode" returns true sense.  Checking that it's either
>> user_mode() or at task_pt_regs() might be a good condition to check.
>
> Yeah.  I guess there are two distinct cases of "going off the rails":
>
> 1) The unwinder doesn't get to the end of the stack (user regs for user
>    tasks, or whatever the end is for kthreads).
>
> 2) The unwinder strays away from the current stack's "previous stack"
>    pointer.
>
> We could warn on either case (though there's probably overlap between
> the two).

I'm in favor of both.  But I think it's best to do them at the end of
the series so that they're easy to revert in the event that Linus
complains and neither of us can convince him that it's okay.

>
> --
> Josh



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-07-21 21:21 ` [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues Josh Poimboeuf
@ 2016-07-29 22:55   ` Steven Rostedt
  2016-07-30  0:50     ` Josh Poimboeuf
                       ` (2 more replies)
  0 siblings, 3 replies; 91+ messages in thread
From: Steven Rostedt @ 2016-07-29 22:55 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Thu, 21 Jul 2016 16:21:42 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> When function graph tracing is enabled for a function, its return
> address on the stack is replaced with the address of an ftrace handler
> (return_to_handler).  When dumping the stack of a task with graph
> tracing enabled, there are some subtle bugs:
> 
> - The fake return_to_handler() address can be reported as reliable.
>   Instead, because it's not the real caller, it should be considered
>   unreliable.

I have some mixed emotions about this. First, it's not "fake", the
function *is* going to return to it, but you are right, that's not the
function that was called.

I do like to see these in the trace, because sometimes these functions
are an issue. But I guess I can live with them being marked as
"unreliable".


> 
> - In print_context_stack(), the real caller's return address is always
>   reported as reliable, even if the return_to_handler() address wasn't
>   referred to by a frame pointer.

Hmm, if CONFIG_FRAME_POINTER is enabled, perhaps we should only call
ftrace_graph_ret_addr() for addresses that the frame pointer marks as
reliable. Hmm, playing with this, yeah, we definitely should. Otherwise
it can report the wrong reliability.

Without doing the reliability check we can get out of sync with the
ret_stack. I have a patch to go on top of this patch below (hmm, it may
not apply fully, because I was using a different base tree than you).
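
To make the "out of sync" problem concrete, here is a small user-space
model.  It is only a sketch: the toy_* names and the
TOY_RETURN_TO_HANDLER constant are made up for illustration, and
toy_graph_ret_addr() just mirrors the index logic of the
ftrace_graph_ret_addr() in the patch quoted below.  A stale,
unreliable return_to_handler word on the stack consumes a ret_stack
entry, so the reliable word after it resolves to the wrong caller
unless the lookup is skipped for unreliable words.

```c
#include <stddef.h>

/* Stand-in for the address of return_to_handler (made-up value). */
#define TOY_RETURN_TO_HANDLER 0xdeadUL

struct toy_task {
	const unsigned long *ret_stack;	/* saved real return addresses */
	int curr_ret_stack;		/* index of the newest entry */
};

/* One word found on the stack, and whether a frame pointer led to it. */
struct toy_word {
	unsigned long addr;
	int reliable;
};

/* Same index handling as ftrace_graph_ret_addr() in the quoted patch. */
static unsigned long toy_graph_ret_addr(const struct toy_task *task,
					int *idx, unsigned long addr)
{
	int task_idx;

	if (addr != TOY_RETURN_TO_HANDLER)
		return addr;

	task_idx = task->curr_ret_stack;
	if (!task->ret_stack || task_idx < *idx)
		return addr;

	task_idx -= *idx;
	(*idx)++;
	return task->ret_stack[task_idx];
}

/*
 * Scan the words and return the address reported for the first
 * reliable one (0 if none).  With check_reliable set, the lookup is
 * only done for reliable words, as with CONFIG_FRAME_POINTER enabled.
 */
static unsigned long toy_scan(const struct toy_task *task,
			      const struct toy_word *words, size_t n,
			      int check_reliable)
{
	int graph = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long real = words[i].addr;

		if (!check_reliable || words[i].reliable)
			real = toy_graph_ret_addr(task, &graph, words[i].addr);
		if (words[i].reliable)
			return real;
	}
	return 0;
}
```

With two ret_stack entries and a stale return_to_handler word ahead of
the reliable one, the unchecked scan reports the older entry for the
reliable word, while the checked scan reports the newest entry.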

> 
> In addition to fixing these bugs, convert print_ftrace_graph_addr() to a
> more generic function which can be used outside of dump_trace()
> callbacks.
> 
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> ---
>  arch/x86/include/asm/stacktrace.h | 13 ++++++++++
>  arch/x86/kernel/dumpstack.c       | 50 +++++++++++++++++----------------------
>  2 files changed, 35 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
> index 6f65995..5d3d258 100644
> --- a/arch/x86/include/asm/stacktrace.h
> +++ b/arch/x86/include/asm/stacktrace.h
> @@ -14,6 +14,19 @@ extern int kstack_depth_to_print;
>  struct thread_info;
>  struct stacktrace_ops;
>  
> +#ifdef CONFIG_FUNCTION_GRAPH_TRACER
> +
> +unsigned long
> +ftrace_graph_ret_addr(struct task_struct *task, int *idx, unsigned long addr);
> +
> +#else
> +static inline unsigned long
> +ftrace_graph_ret_addr(struct task_struct *task, int *idx, unsigned long addr)
> +{
> +	return addr;
> +}
> +#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
> +
>  typedef unsigned long (*walk_stack_t)(struct task_struct *task,
>  				      unsigned long *stack,
>  				      unsigned long bp,
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index 692eecae..0a8694b 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -40,36 +40,25 @@ void printk_address(unsigned long address)
>  }
>  
>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> -static void
> -print_ftrace_graph_addr(unsigned long addr, void *data,
> -			const struct stacktrace_ops *ops,
> -			struct task_struct *task, int *graph)
> +unsigned long
> +ftrace_graph_ret_addr(struct task_struct *task, int *idx, unsigned long addr)
>  {
> -	unsigned long ret_addr;
> -	int index;
> +	int task_idx;
>  
>  	if (addr != (unsigned long)return_to_handler)
> -		return;
> +		return addr;
>  
> -	index = task->curr_ret_stack;
> +	task_idx = task->curr_ret_stack;
>  
> -	if (!task->ret_stack || index < *graph)
> -		return;
> +	if (!task->ret_stack || task_idx < *idx)
> +		return addr;
>  
> -	index -= *graph;
> -	ret_addr = task->ret_stack[index].ret;
> +	task_idx -= *idx;
> +	(*idx)++;
>  
> -	ops->address(data, ret_addr, 1);
> -
> -	(*graph)++;
> +	return task->ret_stack[task_idx].ret;
>  }
> -#else
> -static inline void
> -print_ftrace_graph_addr(unsigned long addr, void *data,
> -			const struct stacktrace_ops *ops,
> -			struct task_struct *task, int *graph)
> -{ }
> -#endif
> +#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
>  
>  /*
>   * x86-64 can have up to three kernel stacks:
> @@ -108,18 +97,23 @@ print_context_stack(struct task_struct *task,
>  		stack = (unsigned long *)task_stack_page(task);
>  
>  	while (valid_stack_ptr(task, stack, sizeof(*stack), end)) {
> -		unsigned long addr;
> +		unsigned long addr = *stack;
>  
>  		addr = *stack;
>  		if (__kernel_text_address(addr)) {
> +			int reliable = 0;
> +			unsigned long real_addr;
> +
>  			if ((unsigned long) stack == bp + sizeof(long)) {
> -				ops->address(data, addr, 1);
> +				reliable = 1;
>  				frame = frame->next_frame;
>  				bp = (unsigned long) frame;
> -			} else {
> -				ops->address(data, addr, 0);
>  			}
> -			print_ftrace_graph_addr(addr, data, ops, task, graph);
> +
> +			real_addr = ftrace_graph_ret_addr(task, graph, addr);
> +			if (addr != real_addr)
> +				ops->address(data, addr, 0);

Note that this changes behavior: the original code printed
return_to_handler first, and this prints it second. (I fixed this below.)

We should also add a reliability check if CONFIG_FRAME_POINTER is
enabled.

> +			ops->address(data, real_addr, reliable);
>  		}
>  		stack++;
>  	}
> @@ -142,11 +136,11 @@ print_context_stack_bp(struct task_struct *task,
>  		if (!__kernel_text_address(addr))
>  			break;
>  
> +		addr = ftrace_graph_ret_addr(task, graph, addr);
>  		if (ops->address(data, addr, 1))
>  			break;
>  		frame = frame->next_frame;
>  		ret_addr = &frame->return_address;
> -		print_ftrace_graph_addr(addr, data, ops, task, graph);

This also changes behavior: the current code prints the
return_to_handler address as well.

>  	}
>  
>  	return (unsigned long)frame;

Here's my patch that should be applied on top.

Maybe add a Signed-off-by: Steven Rostedt <rostedt@goodmis.org> along
with your SOB. But you should remain Author.

-- Steve

---
 arch/x86/kernel/dumpstack.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

Index: linux-trace.git/arch/x86/kernel/dumpstack.c
===================================================================
--- linux-trace.git.orig/arch/x86/kernel/dumpstack.c	2016-07-29 17:17:10.995002677 -0400
+++ linux-trace.git/arch/x86/kernel/dumpstack.c	2016-07-29 18:50:53.497633797 -0400
@@ -90,10 +90,9 @@ print_context_stack(struct task_struct *
 	while (valid_stack_ptr(task, stack, sizeof(*stack), end)) {
 		unsigned long addr = *stack;
 
-		addr = *stack;
 		if (__kernel_text_address(addr)) {
+			unsigned long real_addr = addr;
 			int reliable = 0;
-			unsigned long real_addr;
 
 			if ((unsigned long) stack == bp + sizeof(long)) {
 				reliable = 1;
@@ -101,10 +100,12 @@ print_context_stack(struct task_struct *
 				bp = (unsigned long) frame;
 			}
 
-			real_addr = ftrace_graph_ret_addr(task, graph, addr);
+			if (!IS_ENABLED(CONFIG_FRAME_POINTER) || reliable)
+				real_addr = ftrace_graph_ret_addr(task, graph, addr);
+
+			ops->address(data, real_addr, reliable);
 			if (addr != real_addr)
 				ops->address(data, addr, 0);
-			ops->address(data, real_addr, reliable);
 		}
 		stack++;
 	}
@@ -123,13 +124,16 @@ print_context_stack_bp(struct task_struc
 
 	while (valid_stack_ptr(task, ret_addr, sizeof(*ret_addr), end)) {
 		unsigned long addr = *ret_addr;
+		unsigned long real_addr;
 
 		if (!__kernel_text_address(addr))
 			break;
 
-		addr = ftrace_graph_ret_addr(task, graph, addr);
-		if (ops->address(data, addr, 1))
+		real_addr = ftrace_graph_ret_addr(task, graph, addr);
+		if (ops->address(data, real_addr, 1))
 			break;
+		if (real_addr != addr)
+			ops->address(data, addr, 0);
 		frame = frame->next_frame;
 		ret_addr = &frame->return_address;
 	}


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-07-29 22:55   ` Steven Rostedt
@ 2016-07-30  0:50     ` Josh Poimboeuf
  2016-07-30  2:20       ` Steven Rostedt
  2016-08-01 15:59     ` Josh Poimboeuf
  2016-08-01 16:24     ` Josh Poimboeuf
  2 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-30  0:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 29, 2016 at 06:55:21PM -0400, Steven Rostedt wrote:
> On Thu, 21 Jul 2016 16:21:42 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> > When function graph tracing is enabled for a function, its return
> > address on the stack is replaced with the address of an ftrace handler
> > (return_to_handler).  When dumping the stack of a task with graph
> > tracing enabled, there are some subtle bugs:
> > 
> > - The fake return_to_handler() address can be reported as reliable.
> >   Instead, because it's not the real caller, it should be considered
> >   unreliable.
> 
> I have some mixed emotions about this. First, it's not "fake", the
> function *is* going to return to it, but you are right, that's not the
> function that was called.
> 
> I do like to see these in the trace, because sometimes these functions
> are an issue. But I guess I can live with them being marked as
> "unreliable".

Yeah, this is a little iffy.  Calling return_to_handler() "fake" isn't
100% accurate.  It wasn't involved in the *call* path, but it will be
involved in the *return* path.

My thinking was that when either saving or dumping the stack, you
normally only care about what led up to that point (the call path),
rather than what will happen in the future (the return path).

That's especially true in the non-oops stack trace case, which isn't
used for debugging.  For example, reporting return_to_handler() in the
reliable trace of a perf profiling operation would just be confusing.

And in the oops case, where debugging is important, I think "unreliable"
is more appropriate because it serves as a hint that graph tracing was
involved, instead of trying to assert that it was the real caller, which
could create some confusion.

> > - In print_context_stack(), the real caller's return address is always
> >   reported as reliable, even if the return_to_handler() address wasn't
> >   referred to by a frame pointer.
> 
> Hmm, if CONFIG_FRAME_POINTER is enabled, perhaps we should only call
> ftrace_graph_ret_addr() for addresses that the frame pointer marks as
> reliable. Hmm, playing with this, yeah, we definitely should. Otherwise
> it can report the wrong reliability.
> 
> Without doing the reliability check we can get out of sync with the
> ret_stack. I have a patch to go on top of this patch below (hmm, it may
> not apply fully, because I was using a different base tree than you).

Yeah, your patch makes it better.  Thanks!

BTW, it would be really nice if ftrace_graph_ret_addr() were idempotent
so we could get the "real" return address without having to pass in a
state variable.

For example we could add an "unsigned long *retp" pointer to
ftrace_ret_stack, which points to the return address on the stack.  Then
we could get rid of the index state variable in ftrace_graph_ret_addr,
and also then there would never be a chance of the stack dump getting
out of sync with the ret_stack.
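
A minimal userspace sketch of that idea, not kernel code: the struct
layout, the RETURN_TO_HANDLER stand-in value and graph_ret_addr() are
hypothetical simplifications used only for illustration.  The point is
that the lookup is keyed on the stack slot itself, so calling it twice
for the same slot gives the same answer and no index has to be carried
between calls.

```c
#include <stddef.h>

#define RET_STACK_DEPTH 50	/* the per-task entry count cited in this thread */

/* Stand-in for the address of return_to_handler(); the value is arbitrary. */
#define RETURN_TO_HANDLER 0xdeadbeefUL

/* Hypothetical, simplified ret_stack entry with the proposed retp field. */
struct ret_stack_entry {
	unsigned long ret;	/* original return address */
	unsigned long *retp;	/* stack slot that held that return address */
};

/*
 * Idempotent lookup: the result depends only on the stack slot being
 * examined, not on how many earlier slots the caller already visited.
 */
unsigned long graph_ret_addr(const struct ret_stack_entry *rs, int depth,
			     unsigned long *slot)
{
	int i;

	if (*slot != RETURN_TO_HANDLER)
		return *slot;

	for (i = 0; i < depth; i++)
		if (rs[i].retp == slot)
			return rs[i].ret;

	/* A stale copy of return_to_handler with no matching entry. */
	return *slot;
}
```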

What do you think?

-- 
Josh


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-07-30  0:50     ` Josh Poimboeuf
@ 2016-07-30  2:20       ` Steven Rostedt
  2016-07-30 13:51         ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2016-07-30  2:20 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, 29 Jul 2016 19:50:59 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> BTW, it would be really nice if ftrace_graph_ret_addr() were idempotent
> so we could get the "real" return address without having to pass in a
> state variable.
> 
> For example we could add an "unsigned long *retp" pointer to
> ftrace_ret_stack, which points to the return address on the stack.  Then
> we could get rid of the index state variable in ftrace_graph_ret_addr,
> and also then there would never be a chance of the stack dump getting
> out of sync with the ret_stack.
> 
> What do you think?
> 

I don't want to extend ret_stack, as 50 of these structures are
allocated for every task. That said, we have the "fp" field that's
used to check for frame pointer corruption when mcount is used. With
CC_USING_FENTRY, that field is ignored. Perhaps we could overload that
field for this.
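
To make the trade-off concrete, here is a rough userspace sketch that
compares the two layouts.  The structs are hypothetical, trimmed-down
stand-ins rather than the real ftrace_ret_stack definition, and the
helper names are made up; only the depth of 50 comes from the
discussion above.

```c
#include <stddef.h>

#define RET_STACK_DEPTH 50	/* per-task entry count cited above */

/* Hypothetical, trimmed-down entry as it looks today. */
struct entry_today {
	unsigned long ret;
	unsigned long fp;
};

/* Variant 1: extend the entry with a new pointer field. */
struct entry_extended {
	unsigned long ret;
	unsigned long fp;
	unsigned long *retp;
};

/* Variant 2: overload fp, which is only needed for the mcount check. */
struct entry_overloaded {
	unsigned long ret;
	union {
		unsigned long fp;	/* mcount: frame pointer sanity check */
		unsigned long *retp;	/* fentry: location of return address */
	} u;
};

/* Extra bytes per task for variant 1, relative to today's layout. */
size_t extra_bytes_extended(void)
{
	return RET_STACK_DEPTH *
	       (sizeof(struct entry_extended) - sizeof(struct entry_today));
}

/* Extra bytes per task for variant 2, relative to today's layout. */
size_t extra_bytes_overloaded(void)
{
	return RET_STACK_DEPTH *
	       (sizeof(struct entry_overloaded) - sizeof(struct entry_today));
}
```

On a build where long and pointers are both 8 bytes, the first variant
costs 50 * 8 = 400 extra bytes per task and the second costs nothing.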

-- Steve


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-07-30  2:20       ` Steven Rostedt
@ 2016-07-30 13:51         ` Josh Poimboeuf
  2016-08-01 14:28           ` Steven Rostedt
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-07-30 13:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 29, 2016 at 10:20:36PM -0400, Steven Rostedt wrote:
> On Fri, 29 Jul 2016 19:50:59 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> > BTW, it would be really nice if ftrace_graph_ret_addr() were idempotent
> > so we could get the "real" return address without having to pass in a
> > state variable.
> > 
> > For example we could add an "unsigned long *retp" pointer to
> > ftrace_ret_stack, which points to the return address on the stack.  Then
> > we could get rid of the index state variable in ftrace_graph_ret_addr,
> > and also then there would never be a chance of the stack dump getting
> > out of sync with the ret_stack.
> > 
> > What do you think?
> > 
> 
> I don't want to extend ret_stack as that is allocated 50 of these
> structures for every task. That said, we have the "fp" field that's
> used to check for frame pointer corruption when mcount is used. With
> CC_USING_FENTRY, that field is ignored. Perhaps we could overload that
> field for this.

In that case, I guess we would need two versions of
ftrace_graph_ret_addr(), with the current implementation still needed
for mcount+HAVE_FUNCTION_GRAPH_FP_TEST.

Or would you want to get rid of HAVE_FUNCTION_GRAPH_FP_TEST for x86?

BTW, on a different note, should I move ftrace_graph_ret_addr() to
kernel/trace/trace_functions_graph.c so other arches can use it?

-- 
Josh


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-07-30 13:51         ` Josh Poimboeuf
@ 2016-08-01 14:28           ` Steven Rostedt
  2016-08-01 15:36             ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2016-08-01 14:28 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Sat, 30 Jul 2016 08:51:25 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Fri, Jul 29, 2016 at 10:20:36PM -0400, Steven Rostedt wrote:
> > On Fri, 29 Jul 2016 19:50:59 -0500
> > Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >   
> > > BTW, it would be really nice if ftrace_graph_ret_addr() were idempotent
> > > so we could get the "real" return address without having to pass in a
> > > state variable.
> > > 
> > > For example we could add an "unsigned long *retp" pointer to
> > > ftrace_ret_stack, which points to the return address on the stack.  Then
> > > we could get rid of the index state variable in ftrace_graph_ret_addr,
> > > and also then there would never be a chance of the stack dump getting
> > > out of sync with the ret_stack.
> > > 
> > > What do you think?
> > >   
> > 
> > I don't want to extend ret_stack as that is allocated 50 of these
> > structures for every task. That said, we have the "fp" field that's
> > used to check for frame pointer corruption when mcount is used. With
> > CC_USING_FENTRY, that field is ignored. Perhaps we could overload that
> > field for this.  
> 
> In that case, I guess we would need two versions of
> ftrace_graph_ret_addr(), with the current implementation still needed
> for mcount+HAVE_FUNCTION_GRAPH_FP_TEST.

How hard would it be in that case?

> 
> Or would you want to get rid of HAVE_FUNCTION_GRAPH_FP_TEST for x86?

No, because there are gcc versions that we still support that mess up
mcount, and could still cause issues with function graph.


> 
> BTW, on a different note, should I put ftrace_graph_ret_addr() to
> kernel/trace/trace_functions_graph.c so other arches can use it?
> 

I guess you could. There doesn't seem to be any x86-specific code in
that, right?

-- Steve


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-01 14:28           ` Steven Rostedt
@ 2016-08-01 15:36             ` Josh Poimboeuf
  2016-08-02 21:00               ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-01 15:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Mon, Aug 01, 2016 at 10:28:21AM -0400, Steven Rostedt wrote:
> On Sat, 30 Jul 2016 08:51:25 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> > On Fri, Jul 29, 2016 at 10:20:36PM -0400, Steven Rostedt wrote:
> > > On Fri, 29 Jul 2016 19:50:59 -0500
> > > Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >   
> > > > BTW, it would be really nice if ftrace_graph_ret_addr() were idempotent
> > > > so we could get the "real" return address without having to pass in a
> > > > state variable.
> > > > 
> > > > For example we could add an "unsigned long *retp" pointer to
> > > > ftrace_ret_stack, which points to the return address on the stack.  Then
> > > > we could get rid of the index state variable in ftrace_graph_ret_addr,
> > > > and also then there would never be a chance of the stack dump getting
> > > > out of sync with the ret_stack.
> > > > 
> > > > What do you think?
> > > >   
> > > 
> > > I don't want to extend ret_stack as that is allocated 50 of these
> > > structures for every task. That said, we have the "fp" field that's
> > > used to check for frame pointer corruption when mcount is used. With
> > > CC_USING_FENTRY, that field is ignored. Perhaps we could overload that
> > > field for this.  
> > 
> > In that case, I guess we would need two versions of
> > ftrace_graph_ret_addr(), with the current implementation still needed
> > for mcount+HAVE_FUNCTION_GRAPH_FP_TEST.
> 
> How hard would it be in that case?

Well, it would be easy enough, but then the caller would still need to
pass in the state variable.  So maybe it's not worth the trouble.

> > Or would you want to get rid of HAVE_FUNCTION_GRAPH_FP_TEST for x86?
> 
> No, because there's gcc versions that we still support that mess up
> mcount, and could still cause issues with function graph.
> 
> 
> > 
> > BTW, on a different note, should I put ftrace_graph_ret_addr() to
> > kernel/trace/trace_functions_graph.c so other arches can use it?
> > 
> 
> I guess you could. There doesn't seem to be any x86 specific code in
> that right?

Right.  And I noticed that several arches implement this same
functionality in a slightly different way to fit with their stack dump
code.  So it would be nice to make it common.

-- 
Josh


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-07-29 22:55   ` Steven Rostedt
  2016-07-30  0:50     ` Josh Poimboeuf
@ 2016-08-01 15:59     ` Josh Poimboeuf
  2016-08-01 16:05       ` Steven Rostedt
  2016-08-01 16:24     ` Josh Poimboeuf
  2 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-01 15:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 29, 2016 at 06:55:21PM -0400, Steven Rostedt wrote:
> Here's my patch that should be applied on top.
> 
> Maybe add a Signed-off-by: Steven Rostedt <rostedt@goodmis.org> along
> with your SOB. But you should remain Author.

[...]

> @@ -123,13 +124,16 @@ print_context_stack_bp(struct task_struc
>  
>  	while (valid_stack_ptr(task, ret_addr, sizeof(*ret_addr), end)) {
>  		unsigned long addr = *ret_addr;
> +		unsigned long real_addr;
>  
>  		if (!__kernel_text_address(addr))
>  			break;
>  
> -		addr = ftrace_graph_ret_addr(task, graph, addr);
> -		if (ops->address(data, addr, 1))
> +		real_addr = ftrace_graph_ret_addr(task, graph, addr);
> +		if (ops->address(data, real_addr, 1))
>  			break;
> +		if (real_addr != addr)
> +			ops->address(data, addr, 0);
>  		frame = frame->next_frame;
>  		ret_addr = &frame->return_address;
>  	}

Actually this hunk isn't needed because all users of
print_context_stack_bp() only care about "reliable" addresses.  With
frame pointers enabled, the only place "unreliable" addresses are used
is in show_trace_log_lvl() -- and it uses the print_context_stack()
callback.

I rely on that fact in the new frame pointer unwind code: it only
reports reliable addresses.
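
As a simplified userspace sketch of why a pure frame pointer walk only
ever yields "reliable" addresses: it never scans arbitrary stack words,
it just follows the chain.  The struct mirrors the next_frame and
return_address fields used in the hunk above; walk_frames() and its
parameters are hypothetical, and the real code also validates each
pointer before dereferencing it.

```c
#include <stddef.h>

/* Same two fields the quoted hunk uses: frame->next_frame, ->return_address. */
struct stack_frame {
	struct stack_frame *next_frame;
	unsigned long return_address;
};

/*
 * Follow the frame pointer chain and collect return addresses.  Every
 * address stored in out[] was reached through a frame pointer, so none
 * of them is a leftover word that merely looks like a text address.
 * Returns the number of addresses written, at most max.
 */
int walk_frames(const struct stack_frame *frame, unsigned long *out, int max)
{
	int n = 0;

	while (frame != NULL && n < max) {
		out[n++] = frame->return_address;
		frame = frame->next_frame;
	}

	return n;
}
```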

-- 
Josh


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-01 15:59     ` Josh Poimboeuf
@ 2016-08-01 16:05       ` Steven Rostedt
  2016-08-01 16:19         ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2016-08-01 16:05 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Mon, 1 Aug 2016 10:59:03 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Fri, Jul 29, 2016 at 06:55:21PM -0400, Steven Rostedt wrote:
> > Here's my patch that should be applied on top.
> > 
> > Maybe add a Signed-off-by: Steven Rostedt <rostedt@goodmis.org> along
> > with your SOB. But you should remain Author.  
> 
> [...]
> 
> > @@ -123,13 +124,16 @@ print_context_stack_bp(struct task_struc
> >  
> >  	while (valid_stack_ptr(task, ret_addr, sizeof(*ret_addr), end)) {
> >  		unsigned long addr = *ret_addr;
> > +		unsigned long real_addr;
> >  
> >  		if (!__kernel_text_address(addr))
> >  			break;
> >  
> > -		addr = ftrace_graph_ret_addr(task, graph, addr);
> > -		if (ops->address(data, addr, 1))
> > +		real_addr = ftrace_graph_ret_addr(task, graph, addr);
> > +		if (ops->address(data, real_addr, 1))
> >  			break;
> > +		if (real_addr != addr)
> > +			ops->address(data, addr, 0);
> >  		frame = frame->next_frame;
> >  		ret_addr = &frame->return_address;
> >  	}  
> 
> Actually this hunk isn't needed because all users of
> print_context_stack_bp() only care about "reliable" addresses.  With
> frame pointers enabled, the only place "unreliable" addresses are used
> is in show_trace_log_lvl() -- and it uses the print_context_stack()
> callback.
> 
> I rely on that fact in the new frame pointer unwind code: it only
> reports reliable addresses.
> 

Can you make this a separate patch then, before this one, and explain
in the change log why it isn't needed? I'd rather the current patch
not make such a change in logic.

-- Steve


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-01 16:05       ` Steven Rostedt
@ 2016-08-01 16:19         ` Josh Poimboeuf
  0 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-01 16:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Mon, Aug 01, 2016 at 12:05:41PM -0400, Steven Rostedt wrote:
> On Mon, 1 Aug 2016 10:59:03 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> > On Fri, Jul 29, 2016 at 06:55:21PM -0400, Steven Rostedt wrote:
> > > Here's my patch that should be applied on top.
> > > 
> > > Maybe add a Signed-off-by: Steven Rostedt <rostedt@goodmis.org> along
> > > with your SOB. But you should remain Author.  
> > 
> > [...]
> > 
> > > @@ -123,13 +124,16 @@ print_context_stack_bp(struct task_struc
> > >  
> > >  	while (valid_stack_ptr(task, ret_addr, sizeof(*ret_addr), end)) {
> > >  		unsigned long addr = *ret_addr;
> > > +		unsigned long real_addr;
> > >  
> > >  		if (!__kernel_text_address(addr))
> > >  			break;
> > >  
> > > -		addr = ftrace_graph_ret_addr(task, graph, addr);
> > > -		if (ops->address(data, addr, 1))
> > > +		real_addr = ftrace_graph_ret_addr(task, graph, addr);
> > > +		if (ops->address(data, real_addr, 1))
> > >  			break;
> > > +		if (real_addr != addr)
> > > +			ops->address(data, addr, 0);
> > >  		frame = frame->next_frame;
> > >  		ret_addr = &frame->return_address;
> > >  	}  
> > 
> > Actually this hunk isn't needed because all users of
> > print_context_stack_bp() only care about "reliable" addresses.  With
> > frame pointers enabled, the only place "unreliable" addresses are used
> > is in show_trace_log_lvl() -- and it uses the print_context_stack()
> > callback.
> > 
> > I rely on that fact in the new frame pointer unwind code: it only
> > reports reliable addresses.
> > 
> 
> Can you make this a separate patch then. Before this one, and explain
> why it isn't needed in the change log. I rather have the current patch
> not make such a change in logic.

Sure, I'll do that.

-- 
Josh


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-07-29 22:55   ` Steven Rostedt
  2016-07-30  0:50     ` Josh Poimboeuf
  2016-08-01 15:59     ` Josh Poimboeuf
@ 2016-08-01 16:24     ` Josh Poimboeuf
  2016-08-01 16:56       ` Steven Rostedt
  2 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-01 16:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Fri, Jul 29, 2016 at 06:55:21PM -0400, Steven Rostedt wrote:
> > @@ -108,18 +97,23 @@ print_context_stack(struct task_struct *task,
> >  		stack = (unsigned long *)task_stack_page(task);
> >  
> >  	while (valid_stack_ptr(task, stack, sizeof(*stack), end)) {
> > -		unsigned long addr;
> > +		unsigned long addr = *stack;
> >  
> >  		addr = *stack;
> >  		if (__kernel_text_address(addr)) {
> > +			int reliable = 0;
> > +			unsigned long real_addr;
> > +
> >  			if ((unsigned long) stack == bp + sizeof(long)) {
> > -				ops->address(data, addr, 1);
> > +				reliable = 1;
> >  				frame = frame->next_frame;
> >  				bp = (unsigned long) frame;
> > -			} else {
> > -				ops->address(data, addr, 0);
> >  			}
> > -			print_ftrace_graph_addr(addr, data, ops, task, graph);
> > +
> > +			real_addr = ftrace_graph_ret_addr(task, graph, addr);
> > +			if (addr != real_addr)
> > +				ops->address(data, addr, 0);
> > +			ops->address(data, real_addr, reliable);
> 
> Note this changes behavior, as the original code had the ret_to_handler
> first. This makes it second. (I fixed this below).

Hm, as far as I can tell this actually keeps the original behavior.  The
"unreliable" ret_to_handler is still printed first, no?

-- 
Josh


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-01 16:24     ` Josh Poimboeuf
@ 2016-08-01 16:56       ` Steven Rostedt
  0 siblings, 0 replies; 91+ messages in thread
From: Steven Rostedt @ 2016-08-01 16:56 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Mon, 1 Aug 2016 11:24:59 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:


> > > -			print_ftrace_graph_addr(addr, data, ops, task, graph);
> > > +
> > > +			real_addr = ftrace_graph_ret_addr(task, graph, addr);
> > > +			if (addr != real_addr)
> > > +				ops->address(data, addr, 0);
> > > +			ops->address(data, real_addr, reliable);  
> > 
> > Note this changes behavior, as the original code had the ret_to_handler
> > first. This makes it second. (I fixed this below).  
> 
> Hm, as far as I can tell this actually keeps the original behavior.  The
> "unreliable" ret_to_handler is still printed first, no?
> 


Yep, I guess it does. I mixed up the meaning of "real_addr" and "addr",
and was thinking of the reverse.

-- Steve


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-01 15:36             ` Josh Poimboeuf
@ 2016-08-02 21:00               ` Josh Poimboeuf
  2016-08-02 21:16                 ` Steven Rostedt
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-02 21:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Mon, Aug 01, 2016 at 10:36:33AM -0500, Josh Poimboeuf wrote:
> On Mon, Aug 01, 2016 at 10:28:21AM -0400, Steven Rostedt wrote:
> > On Sat, 30 Jul 2016 08:51:25 -0500
> > Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > > On Fri, Jul 29, 2016 at 10:20:36PM -0400, Steven Rostedt wrote:
> > > > On Fri, 29 Jul 2016 19:50:59 -0500
> > > > Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > >   
> > > > > BTW, it would be really nice if ftrace_graph_ret_addr() were idempotent
> > > > > so we could get the "real" return address without having to pass in a
> > > > > state variable.
> > > > > 
> > > > > For example we could add an "unsigned long *retp" pointer to
> > > > > ftrace_ret_stack, which points to the return address on the stack.  Then
> > > > > we could get rid of the index state variable in ftrace_graph_ret_addr,
> > > > > and also then there would never be a chance of the stack dump getting
> > > > > out of sync with the ret_stack.
> > > > > 
> > > > > What do you think?
> > > > >   
> > > > 
> > > > I don't want to extend ret_stack as that is allocated 50 of these
> > > > structures for every task. That said, we have the "fp" field that's
> > > > used to check for frame pointer corruption when mcount is used. With
> > > > CC_USING_FENTRY, that field is ignored. Perhaps we could overload that
> > > > field for this.  
> > > 
> > > In that case, I guess we would need two versions of
> > > ftrace_graph_ret_addr(), with the current implementation still needed
> > > for mcount+HAVE_FUNCTION_GRAPH_FP_TEST.
> > 
> > How hard would it be in that case?
> 
> Well, it would be easy enough, but then the caller would still need to
> pass in the state variable.  So maybe it's not worth the trouble.

I did some stack trace testing on mainline with function graph tracing.
As it turns out, print_ftrace_graph_addr() is already buggy today if the
caller of dump_trace() specifies a stack pointer or a pt_regs (which is
usually done in order to skip some irrelevant stack frames in the
trace).

For example, here's a stack trace based on NMI regs:

  $ echo 1 > /proc/sys/kernel/sysrq
  $ echo l > /proc/sysrq-trigger
  ...
  Call Trace:
   [<ffffffff81066141>] ?  __x2apic_send_IPI_dest.constprop.4+0x31/0x40
   [<ffffffff810661e5>] __x2apic_send_IPI_mask+0x95/0xe0
   [<ffffffff81061d70>] ? irq_force_complete_move+0xf0/0xf0
   [<ffffffff810662a3>] x2apic_send_IPI_mask+0x13/0x20
   [<ffffffff81061d8b>] nmi_raise_cpu_backtrace+0x1b/0x20
   [<ffffffff8144ff76>] nmi_trigger_all_cpu_backtrace+0xc6/0xf0
   [<ffffffff81061de9>] arch_trigger_all_cpu_backtrace+0x19/0x20
   [<ffffffff8155c463>] sysrq_handle_showallcpus+0x13/0x20
   [<ffffffff8155cc18>] __handle_sysrq+0x138/0x220
   [<ffffffff8155cae5>] ? __handle_sysrq+0x5/0x220
   [<ffffffff8155d111>] write_sysrq_trigger+0x51/0x60
   [<ffffffff813104e2>] proc_reg_write+0x42/0x70
   [<ffffffff81291877>] __vfs_write+0x37/0x140
   [<ffffffff8110d161>] ? update_fast_ctr+0x51/0x80
   [<ffffffff8110d217>] ? percpu_down_read+0x57/0xa0
   [<ffffffff81296074>] ? __sb_start_write+0xb4/0xf0
   [<ffffffff81296074>] ? __sb_start_write+0xb4/0xf0
   [<ffffffff81292b38>] vfs_write+0xb8/0x1a0
   [<ffffffff81293fe8>] SyS_write+0x58/0xc0
   [<ffffffff818af97c>] entry_SYSCALL_64_fastpath+0x1f/0xbd

And here's the same trace with function graph tracing:

  $ echo function_graph > /sys/kernel/debug/tracing/current_tracer 
  $ echo l > /proc/sysrq-trigger
  ...
  Call Trace:
   [<ffffffff81066141>] ?  __x2apic_send_IPI_dest.constprop.4+0x31/0x40
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff810394cc>] print_context_stack+0xfc/0x100
   [<ffffffff81061d70>] ? irq_force_complete_move+0xf0/0xf0
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff8103891b>] dump_trace+0x12b/0x350
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff810396eb>] show_trace_log_lvl+0x4b/0x60
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff81038c76>] show_stack_log_lvl+0x136/0x1d0
   [<ffffffff81061de9>] arch_trigger_all_cpu_backtrace+0x19/0x20
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff81038db8>] show_regs+0xa8/0x1b0
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff8144fe96>] nmi_cpu_backtrace+0x46/0x60
   [<ffffffff8155cae5>] ? __handle_sysrq+0x5/0x220
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff81039b5f>] nmi_handle+0xbf/0x2f0
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff8103a2b3>] default_do_nmi+0x73/0x180
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff8103a4d9>] do_nmi+0x119/0x170
   [<ffffffff811bb3cd>] ?  ftrace_return_to_handler+0x9d/0x110
   [<ffffffff81291845>] ? __vfs_write+0x5/0x140
   [<ffffffff81291845>] ? __vfs_write+0x5/0x140
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff810661e5>] __x2apic_send_IPI_mask+0x95/0xe0
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff810662a3>] x2apic_send_IPI_mask+0x13/0x20
   [<ffffffff818b2428>] ftrace_graph_caller+0xa8/0xa8
   [<ffffffff81061d8b>] nmi_raise_cpu_backtrace+0x1b/0x20

The ret_stack is out of sync with the stack dump because the stack dump
was started with the regs from the NMI, instead of being started from
the current frame.

So I guess there are a couple of ways to fix it:

  a) keep track of the return address pointer like we discussed above;

     or

  b) have the unwinder count the # of skipped frames which refer to
     'return_to_handler', and pass that as the initial index value to
     ftrace_graph_ret_addr().

Option a) would be much cleaner.  But to fix it for both mcount and
fentry, we couldn't override 'fp' so I guess we'd need to add a new
field to ftrace_ret_stack.

Option b) is uglier, but I could probably make it work with the new
unwinder.
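
The mismatch and option b) can be illustrated with a small userspace
sketch.  Nothing here is kernel code: the flat ret_stack array (ordered
innermost call first), the RETURN_TO_HANDLER stand-in value and both
function names are assumptions made for the example.

```c
/* Stand-in for the address of return_to_handler(); the value is arbitrary. */
#define RETURN_TO_HANDLER 0xdeadbeefUL

/*
 * Stateful lookup: the Nth return_to_handler the caller presents is
 * mapped to the Nth ret_stack entry.  That is only correct if the walk
 * started at the innermost frame with *idx == 0.
 */
unsigned long lookup_stateful(const unsigned long *ret_stack, int depth,
			      int *idx, unsigned long addr)
{
	if (addr != RETURN_TO_HANDLER || *idx >= depth)
		return addr;

	return ret_stack[(*idx)++];
}

/*
 * Option b): count the return_to_handler words in the part of the stack
 * that was skipped, so the result can seed the index.
 */
int count_skipped_handlers(const unsigned long *stack, int nr_skipped)
{
	int i, n = 0;

	for (i = 0; i < nr_skipped; i++)
		if (stack[i] == RETURN_TO_HANDLER)
			n++;

	return n;
}
```

Starting a walk two slots in with the index at 0 returns the entry that
belongs to the innermost skipped frame; seeding the index with
count_skipped_handlers() makes the same call return the right entry.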

-- 
Josh


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-02 21:00               ` Josh Poimboeuf
@ 2016-08-02 21:16                 ` Steven Rostedt
  2016-08-02 22:13                   ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2016-08-02 21:16 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, 2 Aug 2016 16:00:11 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

>    [<ffffffff81061d8b>] nmi_raise_cpu_backtrace+0x1b/0x20
> 
> The ret_stack is out of sync with the stack dump because the stack dump
> was started with the regs from the NMI, instead of being started from
> the current frame.
> 
> So I guess there are a couple of ways to fix it:
> 
>   a) keep track of the return address pointer like we discussed above;
> 
>      or
> 
>   b) have the unwinder count the # of skipped frames which refer to
>      'return_to_handler', and pass that as the initial index value to
>      ftrace_graph_ret_addr().
> 
> Option a) would be much cleaner.  But to fix it for both mcount and
> fentry, we couldn't override 'fp' so I guess we'd need to add a new
> field to ftrace_ret_stack.

Actually, what about calling ftrace_graph_ret_addr() to figure out the
next stack conversion only if reliable or CONFIG_FRAME_POINTER is not
enabled?

	unsigned long real_addr = addr;

	[...]

	if (!IS_ENABLED(CONFIG_FRAME_POINTER) || reliable)
		real_addr = ftrace_graph_ret_addr(task, graph, addr);
	if (addr != real_addr)
		ops->address(data, addr, 0);
	ops->address(data, real_addr, reliable);

Then we only need the fp use case when FRAME_POINTER is not set. As
mcount forces FRAME_POINTER, we only need to worry about the fentry
case.

-- Steve

> 
> Option b) is uglier, but I could probably make it work with the new
> unwinder.
> 


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-02 21:16                 ` Steven Rostedt
@ 2016-08-02 22:13                   ` Josh Poimboeuf
  2016-08-02 23:16                     ` Steven Rostedt
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-02 22:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Aug 02, 2016 at 05:16:10PM -0400, Steven Rostedt wrote:
> On Tue, 2 Aug 2016 16:00:11 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
>    [<ffffffff81061d8b>] nmi_raise_cpu_backtrace+0x1b/0x20
> > 
> > The ret_stack is out of sync with the stack dump because the stack dump
> > was started with the regs from the NMI, instead of being started from
> > the current frame.
> > 
> > So I guess there are a couple of ways to fix it:
> > 
> >   a) keep track of the return address pointer like we discussed above;
> > 
> >      or
> > 
> >   b) have the unwinder count the # of skipped frames which refer to
> >      'return_to_handler', and pass that as the initial index value to
> >      ftrace_graph_ret_addr().
> > 
> > Option a) would be much cleaner.  But to fix it for both mcount and
> > fentry, we couldn't override 'fp' so I guess we'd need to add a new
> > field to ftrace_ret_stack.
> 
> Actually, what about calling ftrace_graph_ret_addr() to figure out the
> next stack conversion only if reliable or CONFIG_FRAME_POINTER is not
> enabled?
> 
> 	unsigned long real_addr = addr;
> 
> 	[...]
> 
> 	if (!IS_ENABLED(CONFIG_FRAME_POINTER) || reliable)
> 		real_addr = ftrace_graph_ret_addr(task, graph, addr);
> 	if (addr != real_addr)
> 		ops->address(data, addr, 0);
> 	ops->address(data, real_addr, reliable);
> 
> Then we only need the fp use case when FRAME_POINTER is not set. As
> mcount forces FRAME_POINTER, we only need to worry about the fentry
> case.

Hm, I'm confused.  First, I don't see where mcount forces FRAME_POINTER.

Second, I don't see why that even matters.  If mcount and frame pointers
are enabled, then the 'fp' field of ftrace_ret_stack is needed for the
gcc sanity check, right?  So we couldn't override 'fp', and the old
"stateful index" version of ftrace_graph_ret_addr() would have to be
used in the code above for reliable addresses, and we'd still have the
same out-of-sync bug.

Or am I missing something?

-- 
Josh


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-02 22:13                   ` Josh Poimboeuf
@ 2016-08-02 23:16                     ` Steven Rostedt
  2016-08-03  1:56                       ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2016-08-02 23:16 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, 2 Aug 2016 17:13:59 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:


> > Then we only need the fp use case when FRAME_POINTER is not set. As
> > mcount forces FRAME_POINTER, we only need to worry about the fentry
> > case.  
> 
> Hm, I'm confused.  First, I don't see where mcount forces FRAME_POINTER.

Hmm, we should probably force it generally, as gcc itself requires
mcount to be used with framepointers. -mcount can't be used without
them.

> 
> Second, I don't see why that even matters.  If mcount and frame pointers
> are enabled, then the 'fp' field of ftrace_ret_stack is needed for the
> gcc sanity check, right?  So we couldn't override 'fp', and the old
> "stateful index" version of ftrace_graph_ret_addr() would have to be
> used in the code above for reliable addresses, and we'd still have the
> same out-of-sync bug.
> 
> Or am I missing something?
> 

Or I missed something. How did we get out of sync? If we have frame
pointers, shouldn't the "return_to_handler" be seen as reliable by the
code (not that we save it as such)? That is, if the frame pointer shows
that the next function is return_to_handler, then we increment the
index into ret_stack, otherwise we simply record the return_to_handler
as a normal "unreliable" function, without any processing of it.

I guess I don't actually understand how the NMI screwed it up, as
function graph doesn't trace "do_nmi()" itself nor anything before that.
I'm guessing it really got out of sync because there's a
"return_to_handler" in the stack that wasn't really called (not a frame
pointer). Currently ftrace_graph_ret_addr() shifts the index
regardless of whether the return_to_handler found is part of a stack frame, or
just left over in the stack. THAT is why I think it got out of sync.
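
A userspace sketch of that theory, under the assumption that the index
should only advance for a return_to_handler that a frame pointer
actually refers to.  The flat ret_stack array, the RETURN_TO_HANDLER
stand-in value and resolve_ret_addr() are hypothetical and only model
the behaviour described here.

```c
/* Stand-in for the address of return_to_handler(); the value is arbitrary. */
#define RETURN_TO_HANDLER 0xdeadbeefUL

/*
 * Resolve one stack word.  A return_to_handler that is not backed by a
 * frame pointer (reliable == 0) is treated as a leftover: it is reported
 * unchanged and does not consume a ret_stack entry, so later lookups
 * stay in sync.
 */
unsigned long resolve_ret_addr(const unsigned long *ret_stack, int depth,
			       int *idx, unsigned long addr, int reliable)
{
	if (addr != RETURN_TO_HANDLER)
		return addr;

	if (!reliable || *idx >= depth)
		return addr;

	return ret_stack[(*idx)++];
}
```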

-- Steve


* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-02 23:16                     ` Steven Rostedt
@ 2016-08-03  1:56                       ` Josh Poimboeuf
  2016-08-03  2:30                         ` Steven Rostedt
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-03  1:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Aug 02, 2016 at 07:16:22PM -0400, Steven Rostedt wrote:
> On Tue, 2 Aug 2016 17:13:59 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> 
> > > Then we only need the fp use case when FRAME_POINTER is not set. As
> > > mcount forces FRAME_POINTER, we only need to worry about the fentry
> > > case.  
> > 
> > Hm, I'm confused.  First, I don't see where mcount forces FRAME_POINTER.
> 
> Hmm, we should probably force it generally, as gcc itself requires
> mcount to be used with framepointers. -mcount can't be used without
> them.
> 
> > 
> > Second, I don't see why that even matters.  If mcount and frame pointers
> > are enabled, then the 'fp' field of ftrace_ret_stack is needed for the
> > gcc sanity check, right?  So we couldn't override 'fp', and the old
> > "stateful index" version of ftrace_graph_ret_addr() would have to be
> > used in the code above for reliable addresses, and we'd still have the
> > same out-of-sync bug.
> > 
> > Or am I missing something?
> > 
> 
> Or I missed something. How did we get out of sync? If we have frame
> pointers, shouldn't the "return_to_handler" be seen as reliable by the
> code (not that we save it as such)? That is, if the frame pointer shows
> that the next function is return_to_handler, then we increment the
> index into ret_stack, otherwise we simply record the return_to_handler
> as a normal "unreliable" function, without any processing of it.
> 
> I guess I don't actually understand how the NMI screwed it up, as
> function graph doesn't trace "do_nmi()" itself nor anything before that.
> I'm guessing it really got out of sync because there's a
> "return_to_handler" in the stack that wasn't really called (not a frame
> pointer). ftrace_graph_ret_addr() currently shifts the index
> regardless of whether the return_to_handler it found is part of a stack
> frame or just left over on the stack. THAT is why I think it got out of sync.

It's not specific to NMIs.  The problem is that dump_trace() is starting
from the frame pointed to by a pt_regs, rather than the current frame.
Instead of starting with the current frame, the first 10 functions on
the stack are skipped by the unwinder, but they're *not* skipped on the
ret_stack.  So it starts out out-of-sync.

If it had first initialized the graph index variable to 10 instead of 0
before passing it to ftrace_graph_ret_addr(), it would have worked.

It happens anywhere the first few stack frames are skipped, which is
very common.  For example:

  $ cat /proc/self/stack
  [<ffffffff810489a2>] save_stack_trace_tsk+0x22/0x40
  [<ffffffff81311a89>] proc_pid_stack+0xb9/0x110
  [<ffffffff813127c4>] proc_single_show+0x54/0x80
  [<ffffffff812be088>] seq_read+0x108/0x3e0
  [<ffffffff812923d7>] __vfs_read+0x37/0x140
  [<ffffffff812929d9>] vfs_read+0x99/0x140
  [<ffffffff81293f28>] SyS_read+0x58/0xc0
  [<ffffffff818af97c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
  [<ffffffffffffffff>] 0xffffffffffffffff

  $ echo function_graph > /sys/kernel/debug/tracing/current_tracer 
  $ cat /proc/self/stack
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff810394cc>] print_context_stack+0xfc/0x100
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff8103891b>] dump_trace+0x12b/0x350
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff810489a2>] save_stack_trace_tsk+0x22/0x40
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff81311a89>] proc_pid_stack+0xb9/0x110
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff813127c4>] proc_single_show+0x54/0x80
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff812be088>] seq_read+0x108/0x3e0
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff812923d7>] __vfs_read+0x37/0x140
  [<ffffffff818b2428>] return_to_handler+0x0/0x27
  [<ffffffff812929d9>] vfs_read+0x99/0x140
  [<ffffffffffffffff>] 0xffffffffffffffff

In this case, it's offset by two frames.  With function graph tracing
enabled, it starts with print_context_stack() instead of
save_stack_trace_tsk(), and it doesn't show the last two frames.
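
The mismatch can be reproduced in a small user-space model.  This is only
a sketch under stated assumptions: ret_stack_model, lookup_stateful() and
unwind_model() are hypothetical names, not kernel code, and the model
assumes every frame on the stack was hooked by the graph tracer.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_DEPTH 6

/*
 * Hypothetical model of the per-task ret_stack: entry 0 is the most
 * recently called function, as seen when walking from the current frame.
 */
static const char *const ret_stack_model[MODEL_DEPTH] = {
	"print_context_stack",
	"dump_trace",
	"save_stack_trace_tsk",
	"proc_pid_stack",
	"proc_single_show",
	"seq_read",
};

/*
 * Stateful lookup: every return_to_handler found on the stack consumes
 * the next ret_stack entry, whatever frame it was actually found in.
 */
static const char *lookup_stateful(int *graph_idx)
{
	if (*graph_idx >= MODEL_DEPTH)
		return NULL;	/* ran out of entries */
	return ret_stack_model[(*graph_idx)++];
}

/*
 * Walk the stack after skipping the 'skip' newest frames, with the graph
 * index starting at 'graph_idx'.  Stores the resolved names in 'out' and
 * returns how many frames were visited.
 */
static size_t unwind_model(int skip, int graph_idx, const char **out)
{
	size_t n = 0;

	assert(skip >= 0 && skip <= MODEL_DEPTH);
	for (int frame = skip; frame < MODEL_DEPTH; frame++)
		out[n++] = lookup_stateful(&graph_idx);
	return n;
}
```

In this model, unwind_model(2, 0, out) reports print_context_stack first
and never reaches proc_single_show or seq_read, much like the trace above.
unwind_model(2, 2, out) starts at save_stack_trace_tsk, which is what
initializing the graph index to the number of skipped frames would give.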

-- 
Josh

* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-03  1:56                       ` Josh Poimboeuf
@ 2016-08-03  2:30                         ` Steven Rostedt
  2016-08-03  2:50                           ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2016-08-03  2:30 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, 2 Aug 2016 20:56:56 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> It's not specific to NMIs.  The problem is that dump_trace() is starting
> from the frame pointed to by a pt_regs, rather than the current frame.
> Instead of starting with the current frame, the first 10 functions on
> the stack are skipped by the unwinder, but they're *not* skipped on the
> ret_stack.  So it starts out out-of-sync.

OK, I see what you mean. If we do a dumpstack from an interrupt, passing
in the pt_regs of the kernel thread that was interrupted, the functions
called and traced since the interrupt are still on the ret_stack, so
they show up in the stack dump when they shouldn't.

OK, you convinced me. Add the extra pointer, then we will have 4 longs
and 2 long longs in ftrace_ret_stack. That's not that bad.

-- Steve

* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-03  2:30                         ` Steven Rostedt
@ 2016-08-03  2:50                           ` Josh Poimboeuf
  2016-08-03  2:59                             ` Steven Rostedt
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-03  2:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Aug 02, 2016 at 10:30:11PM -0400, Steven Rostedt wrote:
> On Tue, 2 Aug 2016 20:56:56 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> > It's not specific to NMIs.  The problem is that dump_trace() is starting
> > from the frame pointed to by a pt_regs, rather than the current frame.
> > Instead of starting with the current frame, the first 10 functions on
> > the stack are skipped by the unwinder, but they're *not* skipped on the
> > ret_stack.  So it starts out out-of-sync.
> 
> OK, I see what you mean. If we do a dumpstack from an interrupt, passing
> in the pt_regs of the kernel thread that was interrupted, the functions
> called and traced since the interrupt are still on the ret_stack, so
> they show up in the stack dump when they shouldn't.
> 
> OK, you convinced me. Add the extra pointer, then we will have 4 longs
> and 2 long longs in ftrace_ret_stack. That's not that bad.

Hm, since 'fp' is only used for mcount, I guess we could avoid
allocating it for fentry?  That would save a long when a modern compiler
is used.  Like:

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 1e814ae..fc508a7 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -795,7 +795,9 @@ struct ftrace_ret_stack {
 	unsigned long func;
 	unsigned long long calltime;
 	unsigned long long subtime;
+#if defined(CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST) && !defined(CC_USING_FENTRY)
 	unsigned long fp;
+#endif
 };
 
 /*
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 9caa9b2..86b2719 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -171,7 +171,9 @@ ftrace_push_return_trace(unsigned long ret, unsigned long func, int *depth,
 	current->ret_stack[index].func = func;
 	current->ret_stack[index].calltime = calltime;
 	current->ret_stack[index].subtime = 0;
+#if defined(CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST) && !defined(CC_USING_FENTRY)
 	current->ret_stack[index].fp = frame_pointer;
+#endif
 	*depth = current->curr_ret_stack;
 
 	return 0;

* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-03  2:50                           ` Josh Poimboeuf
@ 2016-08-03  2:59                             ` Steven Rostedt
  2016-08-03  3:12                               ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2016-08-03  2:59 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, 2 Aug 2016 21:50:12 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:


> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 1e814ae..fc508a7 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -795,7 +795,9 @@ struct ftrace_ret_stack {
>  	unsigned long func;
>  	unsigned long long calltime;
>  	unsigned long long subtime;
> +#if defined(CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST) && !defined(CC_USING_FENTRY)

We need to make a new define in ftrace.h:

#if defined(CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST) && !defined(CC_USING_FENTRY)
# define HAVE_FUNCTION_GRAPH_FP_TEST
#endif

And use that instead of this && complexity.

Or better yet, get rid of the CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST define
and only have HAVE_FUNCTION_GRAPH_FP_TEST defined in the asm/ftrace.h
in each arch. Then, x86 could just do:

#ifndef CC_USING_FENTRY
# define HAVE_FUNCTION_GRAPH_FP_TEST
#endif
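
With that single define, both hunks of the patch would collapse to a
plain #ifdef.  A minimal user-space sketch, assuming a non-fentry build
(CC_USING_FENTRY undefined); ret_stack_entry and push_return_trace() are
simplified, hypothetical stand-ins for struct ftrace_ret_stack and
ftrace_push_return_trace():

```c
#include <assert.h>

/* Arch header part (sketch): only mcount builds need the fp sanity check. */
#ifndef CC_USING_FENTRY
# define HAVE_FUNCTION_GRAPH_FP_TEST
#endif

/* Simplified stand-in for struct ftrace_ret_stack. */
struct ret_stack_entry {
	unsigned long ret;
	unsigned long func;
	unsigned long long calltime;
	unsigned long long subtime;
#ifdef HAVE_FUNCTION_GRAPH_FP_TEST
	unsigned long fp;	/* allocated only when the test is compiled in */
#endif
};

/* Simplified stand-in for ftrace_push_return_trace(). */
static void push_return_trace(struct ret_stack_entry *entry,
			      unsigned long ret, unsigned long func,
			      unsigned long frame_pointer)
{
	assert(entry);
	entry->ret = ret;
	entry->func = func;
	entry->calltime = 0;
	entry->subtime = 0;
#ifdef HAVE_FUNCTION_GRAPH_FP_TEST
	entry->fp = frame_pointer;
#else
	(void)frame_pointer;	/* fentry build: nothing to record */
#endif
}
```

Building the same sketch with -DCC_USING_FENTRY drops the 'fp' member
and the store, which is the saving discussed above.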


-- Steve

>  	unsigned long fp;
> +#endif
>  };
>  
>  /*
> diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
> index 9caa9b2..86b2719 100644
> --- a/kernel/trace/trace_functions_graph.c
> +++ b/kernel/trace/trace_functions_graph.c
> @@ -171,7 +171,9 @@ ftrace_push_return_trace(unsigned long ret, unsigned long func, int *depth,
>  	current->ret_stack[index].func = func;
>  	current->ret_stack[index].calltime = calltime;
>  	current->ret_stack[index].subtime = 0;
> +#if defined(CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST) && !defined(CC_USING_FENTRY)
>  	current->ret_stack[index].fp = frame_pointer;
> +#endif
>  	*depth = current->curr_ret_stack;
>  
>  	return 0;

* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-03  2:59                             ` Steven Rostedt
@ 2016-08-03  3:12                               ` Josh Poimboeuf
  2016-08-03  3:18                                 ` Steven Rostedt
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-03  3:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Aug 02, 2016 at 10:59:36PM -0400, Steven Rostedt wrote:
> On Tue, 2 Aug 2016 21:50:12 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> 
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > index 1e814ae..fc508a7 100644
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -795,7 +795,9 @@ struct ftrace_ret_stack {
> >  	unsigned long func;
> >  	unsigned long long calltime;
> >  	unsigned long long subtime;
> > +#if defined(CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST) && !defined(CC_USING_FENTRY)
> 
> We need to make a new define in ftrace.h:
> 
> #if defined(CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST) && !defined(CC_USING_FENTRY)
> # define HAVE_FUNCTION_GRAPH_FP_TEST
> #endif
> 
> And use that instead of this && complexity.
> 
> Or better yet, get rid of the CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST define
> and only have HAVE_FUNCTION_GRAPH_FP_TEST defined in the asm/ftrace.h
> in each arch. Then, x86 could just do:
> 
> #ifndef CC_USING_FENTRY
> # define HAVE_FUNCTION_GRAPH_FP_TEST
> #endif

Sounds good.  I was thinking I could also add a similar define to
indicate whether an arch passes the return address stack pointer to
ftrace_push_return_trace().  HAVE_FUNCTION_GRAPH_RET_ADDR_PTR?

-- 
Josh

* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-03  3:12                               ` Josh Poimboeuf
@ 2016-08-03  3:18                                 ` Steven Rostedt
  2016-08-03  3:21                                   ` Steven Rostedt
  2016-08-03  3:30                                   ` Josh Poimboeuf
  0 siblings, 2 replies; 91+ messages in thread
From: Steven Rostedt @ 2016-08-03  3:18 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, 2 Aug 2016 22:12:33 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:


> Sounds good.  I was thinking I could also add a similar define to
> indicate whether an arch passes the return address stack pointer to
> ftrace_push_return_trace().  HAVE_FUNCTION_GRAPH_RET_ADDR_PTR?
> 

If you are making this function global, might as well make all pass
that pointer when you do the conversion. I don't think we need a define
to differentiate it.

-- Steve

* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-03  3:18                                 ` Steven Rostedt
@ 2016-08-03  3:21                                   ` Steven Rostedt
  2016-08-03  3:31                                     ` Josh Poimboeuf
  2016-08-03  3:30                                   ` Josh Poimboeuf
  1 sibling, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2016-08-03  3:21 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, 2 Aug 2016 23:18:57 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 2 Aug 2016 22:12:33 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> 
> > Sounds good.  I was thinking I could also add a similar define to
> > indicate whether an arch passes the return address stack pointer to
> > ftrace_push_return_trace().  HAVE_FUNCTION_GRAPH_RET_ADDR_PTR?
> >   
> 
> If you are making this function global, might as well make all pass
> that pointer when you do the conversion. I don't think we need a define
> to differentiate it.
> 

Bah, I was thinking of your ftrace_graph_ret_addr() function. /me needs
to go to bed.

Anyway, if we have to add a parameter, we probably need to update all
the callers anyway. We do need to add a parameter for this, right?

-- Steve

* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-03  3:18                                 ` Steven Rostedt
  2016-08-03  3:21                                   ` Steven Rostedt
@ 2016-08-03  3:30                                   ` Josh Poimboeuf
  1 sibling, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-03  3:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Aug 02, 2016 at 11:18:57PM -0400, Steven Rostedt wrote:
> On Tue, 2 Aug 2016 22:12:33 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> 
> > Sounds good.  I was thinking I could also add a similar define to
> > indicate whether an arch passes the return address stack pointer to
> > ftrace_push_return_trace().  HAVE_FUNCTION_GRAPH_RET_ADDR_PTR?
> > 
> 
> If you are making this function global, might as well make all pass
> that pointer when you do the conversion. I don't think we need a define
> to differentiate it.

In theory, I like the idea.  But from what I can tell, it looks like a
few arches would require some assembly changes: s390, powerpc, and
sparc.  I can probably handle s390 and power, but sparc is a whole
different story...

-- 
Josh

* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-03  3:21                                   ` Steven Rostedt
@ 2016-08-03  3:31                                     ` Josh Poimboeuf
  2016-08-03  3:45                                       ` Steven Rostedt
  0 siblings, 1 reply; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-03  3:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Aug 02, 2016 at 11:21:04PM -0400, Steven Rostedt wrote:
> On Tue, 2 Aug 2016 23:18:57 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Tue, 2 Aug 2016 22:12:33 -0500
> > Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > 
> > > Sounds good.  I was thinking I could also add a similar define to
> > > indicate whether an arch passes the return address stack pointer to
> > > ftrace_push_return_trace().  HAVE_FUNCTION_GRAPH_RET_ADDR_PTR?
> > >   
> > 
> > If you are making this function global, might as well make all pass
> > that pointer when you do the conversion. I don't think we need a define
> > to differentiate it.
> > 
> 
> Bah, I was thinking of your ftrace_graph_ret_addr() function. /me needs
> to go to bed.
> 
> Anyway, if we have to add a parameter, we probably need to update all
> the callers anyway. We do need to add a parameter for this, right?

Yeah, we do need to add a parameter to ftrace_push_return_trace().  But
callers which don't implement it could just pass zero like they do with
'fp'.

-- 
Josh

* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-03  3:31                                     ` Josh Poimboeuf
@ 2016-08-03  3:45                                       ` Steven Rostedt
  2016-08-03 14:13                                         ` Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Steven Rostedt @ 2016-08-03  3:45 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, 2 Aug 2016 22:31:25 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:


> Yeah, we do need to add a parameter to ftrace_push_return_trace().  But
> callers which don't implement it could just pass zero like they do with
> 'fp'.
> 

Right, if zero is passed in, then just ignore it.

Bed time!

-- Steve

* Re: [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues
  2016-08-03  3:45                                       ` Steven Rostedt
@ 2016-08-03 14:13                                         ` Josh Poimboeuf
  0 siblings, 0 replies; 91+ messages in thread
From: Josh Poimboeuf @ 2016-08-03 14:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	Andy Lutomirski, Linus Torvalds, Brian Gerst, Kees Cook,
	Peter Zijlstra, Frederic Weisbecker, Byungchul Park

On Tue, Aug 02, 2016 at 11:45:30PM -0400, Steven Rostedt wrote:
> On Tue, 2 Aug 2016 22:31:25 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > Yeah, we do need to add a parameter to ftrace_push_return_trace().  But
> > callers which don't implement it could just pass zero like they do with
> > 'fp'.
> > 
> 
> Right, if zero is passed in, then just ignore it.

I still think we need the define though, because there will be two
versions of ftrace_graph_ret_addr().
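
Roughly, the two versions could look like the following user-space
model.  It is a sketch, not the actual patch: graph_entry,
graph_ret_addr(), RETURN_TO_HANDLER and the fixed-size array are
hypothetical simplifications, and the define is set by hand here instead
of by an arch header.

```c
#include <assert.h>
#include <stddef.h>

#define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR	/* set by hand for this sketch */
#define RETURN_TO_HANDLER	0xdeadUL	/* placeholder trampoline address */
#define NR_ENTRIES		3

struct graph_entry {
	unsigned long ret;	/* original return address */
	unsigned long *retp;	/* stack slot that held it, NULL if not passed */
};

#ifdef HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
/* Stateless version: match on the location of the return address. */
static unsigned long graph_ret_addr(const struct graph_entry *stack, int *idx,
				    unsigned long ret, unsigned long *retp)
{
	(void)idx;
	assert(stack != NULL);
	if (ret != RETURN_TO_HANDLER || !retp)
		return ret;
	for (int i = 0; i < NR_ENTRIES; i++)
		if (stack[i].retp == retp)
			return stack[i].ret;
	return ret;	/* stale return_to_handler left over on the stack */
}
#else
/* Stateful version: trust the caller-maintained index. */
static unsigned long graph_ret_addr(const struct graph_entry *stack, int *idx,
				    unsigned long ret, unsigned long *retp)
{
	(void)retp;
	assert(stack != NULL);
	if (ret != RETURN_TO_HANDLER || *idx >= NR_ENTRIES)
		return ret;
	return stack[(*idx)++].ret;
}
#endif
```

The stateless variant gives the same answer no matter where the unwind
starts, so skipped frames can no longer put it out of sync.  Arches that
pass zero for the pointer would keep using the index-based variant.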

-- 
Josh

Thread overview: 91+ messages
2016-07-21 21:21 [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 01/19] x86/dumpstack: remove show_trace() Josh Poimboeuf
2016-07-21 21:49   ` Andy Lutomirski
2016-07-21 21:21 ` [PATCH 02/19] x86/dumpstack: add get_stack_pointer() and get_frame_pointer() Josh Poimboeuf
2016-07-21 21:53   ` Andy Lutomirski
2016-07-21 21:21 ` [PATCH 03/19] x86/dumpstack: remove unnecessary stack pointer arguments Josh Poimboeuf
2016-07-21 21:56   ` Andy Lutomirski
2016-07-22  1:41     ` Josh Poimboeuf
2016-07-22  2:29       ` Andy Lutomirski
2016-07-22  3:08       ` Brian Gerst
2016-07-21 21:21 ` [PATCH 04/19] x86/dumpstack: make printk_stack_address() more generally useful Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 05/19] x86/dumpstack: fix function graph tracing stack dump reliability issues Josh Poimboeuf
2016-07-29 22:55   ` Steven Rostedt
2016-07-30  0:50     ` Josh Poimboeuf
2016-07-30  2:20       ` Steven Rostedt
2016-07-30 13:51         ` Josh Poimboeuf
2016-08-01 14:28           ` Steven Rostedt
2016-08-01 15:36             ` Josh Poimboeuf
2016-08-02 21:00               ` Josh Poimboeuf
2016-08-02 21:16                 ` Steven Rostedt
2016-08-02 22:13                   ` Josh Poimboeuf
2016-08-02 23:16                     ` Steven Rostedt
2016-08-03  1:56                       ` Josh Poimboeuf
2016-08-03  2:30                         ` Steven Rostedt
2016-08-03  2:50                           ` Josh Poimboeuf
2016-08-03  2:59                             ` Steven Rostedt
2016-08-03  3:12                               ` Josh Poimboeuf
2016-08-03  3:18                                 ` Steven Rostedt
2016-08-03  3:21                                   ` Steven Rostedt
2016-08-03  3:31                                     ` Josh Poimboeuf
2016-08-03  3:45                                       ` Steven Rostedt
2016-08-03 14:13                                         ` Josh Poimboeuf
2016-08-03  3:30                                   ` Josh Poimboeuf
2016-08-01 15:59     ` Josh Poimboeuf
2016-08-01 16:05       ` Steven Rostedt
2016-08-01 16:19         ` Josh Poimboeuf
2016-08-01 16:24     ` Josh Poimboeuf
2016-08-01 16:56       ` Steven Rostedt
2016-07-21 21:21 ` [PATCH 06/19] x86/dumpstack: remove extra brackets around "EOE" Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 07/19] x86/dumpstack: add IRQ_USABLE_STACK_SIZE define Josh Poimboeuf
2016-07-21 22:01   ` Andy Lutomirski
2016-07-22  1:48     ` Josh Poimboeuf
2016-07-22  8:24       ` Ingo Molnar
2016-07-21 21:21 ` [PATCH 08/19] x86/dumpstack: don't disable preemption in show_stack_log_lvl() and dump_trace() Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 09/19] x86/dumpstack: simplify in_exception_stack() Josh Poimboeuf
2016-07-21 22:05   ` Andy Lutomirski
2016-07-21 21:21 ` [PATCH 10/19] x86/dumpstack: add get_stack_info() interface Josh Poimboeuf
2016-07-22 23:26   ` Andy Lutomirski
2016-07-22 23:52     ` Andy Lutomirski
2016-07-23 13:09       ` Josh Poimboeuf
2016-07-22 23:54     ` Josh Poimboeuf
2016-07-23  0:15       ` Andy Lutomirski
2016-07-23 14:04         ` Josh Poimboeuf
2016-07-26  0:09           ` Andy Lutomirski
2016-07-26 16:26             ` Josh Poimboeuf
2016-07-26 17:51               ` Steven Rostedt
2016-07-26 18:56                 ` Josh Poimboeuf
2016-07-26 20:59               ` Andy Lutomirski
2016-07-26 22:24                 ` Josh Poimboeuf
2016-07-26 22:31                   ` Steven Rostedt
2016-07-26 22:37                   ` Andy Lutomirski
2016-07-26 16:47             ` Josh Poimboeuf
2016-07-26 17:49               ` Brian Gerst
2016-07-26 18:59                 ` Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 11/19] x86/dumptrace: add new unwind interface and implementations Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 12/19] perf/x86: convert perf_callchain_kernel() to the new unwinder Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 13/19] x86/stacktrace: convert save_stack_trace_*() " Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 14/19] oprofile/x86: convert x86_backtrace() " Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 15/19] x86/dumpstack: convert show_trace_log_lvl() " Josh Poimboeuf
2016-07-21 21:49   ` Byungchul Park
2016-07-22  1:38     ` Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 16/19] x86/dumpstack: remove dump_trace() Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 17/19] x86/entry/dumpstack: encode pt_regs pointer in frame pointer Josh Poimboeuf
2016-07-21 22:27   ` Andy Lutomirski
2016-07-21 21:21 ` [PATCH 18/19] x86/dumpstack: print stack identifier on its own line Josh Poimboeuf
2016-07-21 21:21 ` [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack Josh Poimboeuf
2016-07-21 22:32   ` Andy Lutomirski
2016-07-22  3:30     ` Josh Poimboeuf
2016-07-22  5:13       ` Andy Lutomirski
2016-07-22 15:57         ` Josh Poimboeuf
2016-07-22 21:46           ` Andy Lutomirski
2016-07-22 22:20             ` Josh Poimboeuf
2016-07-22 23:18               ` Andy Lutomirski
2016-07-22 23:30                 ` Josh Poimboeuf
2016-07-22 23:39                   ` Andy Lutomirski
2016-07-23  0:00                     ` Josh Poimboeuf
2016-07-23  0:22 ` [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code Linus Torvalds
2016-07-23  0:31   ` Andy Lutomirski
2016-07-23  5:35     ` Josh Poimboeuf
2016-07-23  5:39       ` Linus Torvalds
2016-07-23 12:53         ` Josh Poimboeuf
