* [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
@ 2017-07-11 15:33 Josh Poimboeuf
  2017-07-11 15:33 ` [PATCH v3 01/10] x86/entry/64: Refactor IRQ stacks and make them NMI-safe Josh Poimboeuf
                   ` (12 more replies)
  0 siblings, 13 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-11 15:33 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

The biggest change is that undwarf was renamed to ORC.  Here's the
relevant explanation from the docs:

  Etymology
  ---------
  
  Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
  enemies.  Similarly, the ORC unwinder was created in opposition to the
  complexity and slowness of DWARF.
  
  "Although Orcs rarely consider multiple solutions to a problem, they do
  excel at getting things done because they are creatures of action, not
  thought." [3]  Similarly, unlike the esoteric DWARF unwinder, the
  veracious ORC unwinder wastes no time or siloconic effort decoding
  variable-length zero-extended unsigned-integer byte-coded
  state-machine-based debug information entries.
  
  Similar to how Orcs frequently unravel the well-intentioned plans of
  their adversaries, the ORC unwinder frequently unravels stacks with
  brutal, unyielding efficiency.
  
  ORC stands for Oops Rewind Capability.


Other v3 changes:

- Rebase on tip (first 3 objtool patches were merged).
- Add Andy's patches (1-2) to fix unwinding from an empty irq stack.
- Add new patches (3-4) to fix other related issues.
- Add new patch (10) to make it easier to move from FRAME_POINTER to
  ORC_UNWINDER.
- Use packed struct for orc_entry (600k savings, 2% perf hit). (Ingo)
- Change the fast lookup array block size to a power of two to avoid the
  'div' instruction (10% speedup).
- Allocate the fast lookup array in the vmlinux linker script, since we
  don't know the array size at compile time, and it's better than
  allocating such a big block at runtime.
- Add cache locality improvement info to docs. (Ingo)
- "orc-types.h" -> "orc_types.h" (Ingo)
- "cfa" -> "sp" (Ingo)
- struct vertical whitespace alignment (Ingo)
- short -> s16, etc (Ingo)
- asm/undwarf.h -> asm/unwind_hints.h

-----

Create a new "ORC" unwinder, enabled by CONFIG_ORC_UNWINDER, and plug it
into the x86 unwinder framework.  Objtool is used to generate the ORC
debuginfo.  The ORC debuginfo format is basically a simplified version
of DWARF CFI.  More details below.

The unwinder works well in my testing.  It unwinds through interrupts,
exceptions, and preemption, with and without frame pointers, across
aligned stacks and dynamically allocated stacks.  If something goes
wrong during an oops, it successfully falls back to printing the '?'
entries just like the frame pointer unwinder.

Some potential future improvements:
- properly annotate or fix whitelisted functions and files
- add reliability checks for livepatch
- runtime NMI stack reliability checker
- generated code integration

This code can also be found at:

  https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git orc-v3

Here's the contents of the orc-unwinder.txt file which explains the
'why' in more detail:


ORC unwinder
============

Overview
--------

The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
similar in concept to a DWARF unwinder.  The difference is that the
format of the ORC data is much simpler than DWARF, which in turn allows
the ORC unwinder to be much simpler and faster.

The ORC data consists of unwind tables which are generated by objtool.
They contain out-of-band data which is used by the in-kernel ORC
unwinder.  Objtool generates the ORC data by first doing compile-time
stack metadata validation (CONFIG_STACK_VALIDATION).  After analyzing
all the code paths of a .o file, it determines information about the
stack state at each instruction address in the file and outputs that
information to the .orc_unwind and .orc_unwind_ip sections.

The per-object ORC sections are combined at link time and are sorted and
post-processed at boot time.  The unwinder uses the resulting data to
correlate instruction addresses with their stack states at run time.
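
For a sense of scale, each ORC table entry is tiny.  Here's a sketch
of the orc_entry layout proposed in this series (see
arch/x86/include/asm/orc_types.h in the patches; exact fields could
still change):

  struct orc_entry {
          s16             sp_offset;      /* prev SP = sp_reg base + sp_offset */
          s16             bp_offset;      /* where the prev BP value is stored */
          unsigned        sp_reg:4;       /* base register for sp_offset */
          unsigned        bp_reg:4;       /* base register for bp_offset */
          unsigned        type:2;         /* call frame, pt_regs, or iret regs */
  } __packed;

Given an instruction address, the unwinder looks up the matching
entry and computes the previous frame's stack pointer directly; there
is no state machine to interpret.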


ORC vs frame pointers
---------------------

With frame pointers enabled, GCC adds instrumentation code to every
function in the kernel.  The kernel's .text size increases by about
3.2%, resulting in a broad kernel-wide slowdown.  Measurements by Mel
Gorman [1] have shown a slowdown of 5-10% for some workloads.

In contrast, the ORC unwinder has no effect on text size or runtime
performance, because the debuginfo is out of band.  So if you disable
frame pointers and enable the ORC unwinder, you get a nice performance
improvement across the board, and still have reliable stack traces.

Ingo Molnar says:

  "Note that it's not just a performance improvement, but also an
  instruction cache locality improvement: 3.2% .text savings almost
  directly transform into a similarly sized reduction in cache
  footprint. That can transform to even higher speedups for workloads
  whose cache locality is borderline."

Another benefit of ORC compared to frame pointers is that it can
reliably unwind across interrupts and exceptions.  Frame pointer based
unwinds can sometimes skip the caller of the interrupted function, if it
was a leaf function or if the interrupt hit before the frame pointer was
saved.

The main disadvantage of the ORC unwinder compared to frame pointers is
that it needs more memory to store the ORC unwind tables: roughly 2-4MB
depending on the kernel config.


ORC vs DWARF
------------

ORC debuginfo's advantage over DWARF itself is that it's much simpler.
It gets rid of the complex DWARF CFI state machine and also gets rid of
the tracking of unnecessary registers.  This allows the unwinder to be
much simpler, meaning fewer bugs, which is especially important for
mission critical oops code.

The simpler debuginfo format also enables the unwinder to be much faster
than DWARF, which is important for perf and lockdep.  In a basic
performance test by Jiri Slaby [2], the ORC unwinder was about 20x
faster than an out-of-tree DWARF unwinder.  (Note: That measurement was
taken before some performance tweaks were added, which doubled
performance, so the speedup over DWARF may be closer to 40x.)

The ORC data format does have a few downsides compared to DWARF.  The
ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.

Another potential downside is that, as GCC evolves, it's conceivable
that the ORC data may end up being *too* simple to describe the state of
the stack for certain optimizations.  But IMO this is unlikely because
GCC saves the frame pointer for any unusual stack adjustments it does,
so I suspect we'll really only ever need to keep track of the stack
pointer and the frame pointer between call frames.  But even if we do
end up having to track all the registers DWARF tracks, at least we will
still be able to control the format, e.g. no complex state machines.


ORC unwind table generation
---------------------------

The ORC data is generated by objtool.  With the existing compile-time
stack metadata validation feature, objtool already follows all code
paths, and so it already has all the information it needs to be able to
generate ORC data from scratch.  So it's an easy step to go from stack
validation to ORC data generation.

It should be possible to instead generate the ORC data with a simple
tool which converts DWARF to ORC data.  However, such a solution would
be incomplete due to the kernel's extensive use of asm, inline asm, and
special sections like exception tables.

That could be rectified by manually annotating those special code paths
using GNU assembler .cfi annotations in .S files, and homegrown
annotations for inline asm in .c files.  But asm annotations were tried
in the past and were found to be unmaintainable.  They were often
incorrect/incomplete and made the code harder to read and keep updated.
And based on looking at glibc code, annotating inline asm in .c files
might be even worse.

Objtool still needs a few annotations, but only in code which does
unusual things to the stack like entry code.  And even then, far fewer
annotations are needed than what DWARF would need, so they're much more
maintainable than DWARF CFI annotations.

So the advantage of using objtool to generate ORC data is that it
gives more accurate debuginfo, with very few annotations.  It also
insulates the kernel from toolchain bugs, which can be very painful to
deal with in the kernel since we often have to work around issues in
older versions of the toolchain for years.

The downside is that the unwinder now becomes dependent on objtool's
ability to reverse engineer GCC code flow.  If GCC optimizations become
too complicated for objtool to follow, the ORC data generation might
stop working or become incomplete.  (It's worth noting that livepatch
already has such a dependency on objtool's ability to follow GCC code
flow.)

If newer versions of GCC come up with some optimizations which break
objtool, we may need to revisit the current implementation.  Some
possible solutions would be asking GCC to make the optimizations more
palatable, or having objtool use DWARF as an additional input, or
creating a GCC plugin to assist objtool with its analysis.  But for now,
objtool follows GCC code quite well.


Unwinder implementation details
-------------------------------

Objtool generates the ORC data by integrating with the compile-time
stack metadata validation feature, which is described in detail in
tools/objtool/Documentation/stack-validation.txt.  After analyzing all
the code paths of a .o file, it creates an array of orc_entry structs,
and a parallel array of instruction addresses associated with those
structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
respectively.

The ORC data is split into the two arrays for performance reasons, to
make the searchable part of the data (.orc_unwind_ip) more compact.  The
arrays are sorted in parallel at boot time.
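
As a concrete sketch of the lookup, modeled on the series'
unwind_orc.c (assumed names): each .orc_unwind_ip entry is a signed
32-bit offset relative to its own address, which keeps the searchable
array compact, and a binary search finds the last entry covering a
given instruction pointer:

  /* Decode a .orc_unwind_ip entry into an absolute address. */
  static inline unsigned long orc_ip(const int *ip)
  {
          return (unsigned long)ip + *ip;
  }

  /* Find the last entry whose address is <= the target ip. */
  static struct orc_entry *__orc_find(int *ip_table, struct orc_entry *u_table,
                                      unsigned int num_entries,
                                      unsigned long ip)
  {
          int *first = ip_table;
          int *last = ip_table + num_entries - 1;
          int *mid, *found = first;

          if (!num_entries)
                  return NULL;

          while (first <= last) {
                  mid = first + ((last - first) / 2);
                  if (orc_ip(mid) <= ip) {
                          found = mid;
                          first = mid + 1;
                  } else {
                          last = mid - 1;
                  }
          }

          return u_table + (found - ip_table);
  }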

Performance is further improved by the use of a fast lookup table which
is created at runtime.  The fast lookup table associates a given address
with a range of indices for the .orc_unwind table, so that only a small
subset of the table needs to be searched.
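
A sketch of how the fast lookup table narrows that search, with names
and the 256-byte block size following arch/x86/include/asm/orc_lookup.h
in the patches (the orc_lookup[] array itself is allocated in the
vmlinux linker script, per the changelog above; the power-of-two
block size is what lets the divide compile to a shift instead of a
'div'):

  #define LOOKUP_BLOCK_ORDER      8
  #define LOOKUP_BLOCK_SIZE       (1 << LOOKUP_BLOCK_ORDER)

  /*
   * Each orc_lookup[] slot covers one 256-byte block of kernel text
   * and holds a starting index into the sorted .orc_unwind_ip table,
   * so the binary search only has to scan entries for that block.
   */
  unsigned int idx   = (ip - LOOKUP_START_IP) / LOOKUP_BLOCK_SIZE;
  unsigned int start = orc_lookup[idx];
  unsigned int stop  = orc_lookup[idx + 1] + 1;
  struct orc_entry *orc;

  orc = __orc_find(__start_orc_unwind_ip + start,
                   __start_orc_unwind + start, stop - start, ip);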


Etymology
---------

Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
enemies.  Similarly, the ORC unwinder was created in opposition to the
complexity and slowness of DWARF.

"Although Orcs rarely consider multiple solutions to a problem, they do
excel at getting things done because they are creatures of action, not
thought." [3]  Similarly, unlike the esoteric DWARF unwinder, the
veracious ORC unwinder wastes no time or siloconic effort decoding
variable-length zero-extended unsigned-integer byte-coded
state-machine-based debug information entries.

Similar to how Orcs frequently unravel the well-intentioned plans of
their adversaries, the ORC unwinder frequently unravels stacks with
brutal, unyielding efficiency.

ORC stands for Oops Rewind Capability.


[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
[3] http://dustin.wikidot.com/half-orcs-and-orcs


Andy Lutomirski (2):
  x86/entry/64: Refactor IRQ stacks and make them NMI-safe
  x86/entry/64: Initialize the top of the IRQ stack before switching
    stacks

Josh Poimboeuf (8):
  x86/dumpstack: fix occasionally missing registers
  x86/dumpstack: fix interrupt and exception stack boundary checks
  objtool: add ORC unwind table generation
  objtool, x86: add facility for asm code to provide unwind hints
  x86/entry/64: add unwind hint annotations
  x86/asm: add unwind hint annotations to sync_core()
  x86/unwind: add ORC unwinder
  x86/kconfig: make it easier to switch to the new ORC unwinder

 Documentation/x86/orc-unwinder.txt               | 178 +++++++
 arch/um/include/asm/unwind.h                     |   8 +
 arch/x86/Kconfig                                 |   1 +
 arch/x86/Kconfig.debug                           |  26 +-
 arch/x86/entry/Makefile                          |   1 -
 arch/x86/entry/calling.h                         |   5 +
 arch/x86/entry/entry_64.S                        | 170 +++++--
 arch/x86/include/asm/module.h                    |   9 +
 arch/x86/include/asm/orc_lookup.h                |  46 ++
 arch/x86/include/asm/orc_types.h                 | 107 +++++
 arch/x86/include/asm/processor.h                 |   3 +
 arch/x86/include/asm/unwind.h                    |  76 +--
 arch/x86/include/asm/unwind_hints.h              | 103 ++++
 arch/x86/kernel/Makefile                         |   8 +-
 arch/x86/kernel/dumpstack.c                      |  12 +-
 arch/x86/kernel/dumpstack_32.c                   |   4 +-
 arch/x86/kernel/dumpstack_64.c                   |   4 +-
 arch/x86/kernel/module.c                         |  11 +-
 arch/x86/kernel/process_64.c                     |   3 +
 arch/x86/kernel/setup.c                          |   3 +
 arch/x86/kernel/unwind_frame.c                   |  39 +-
 arch/x86/kernel/unwind_guess.c                   |   5 +
 arch/x86/kernel/unwind_orc.c                     | 576 +++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S                    |   3 +
 include/asm-generic/vmlinux.lds.h                |  27 +-
 lib/Kconfig.debug                                |   9 +-
 scripts/Makefile.build                           |  14 +-
 tools/objtool/Build                              |   3 +
 tools/objtool/Documentation/stack-validation.txt |  56 +--
 tools/objtool/Makefile                           |   3 +
 tools/objtool/builtin-check.c                    |   2 +-
 tools/objtool/builtin-orc.c                      |  70 +++
 tools/objtool/builtin.h                          |   1 +
 tools/objtool/check.c                            | 249 +++++++++-
 tools/objtool/check.h                            |  19 +-
 tools/objtool/elf.c                              | 212 ++++++++-
 tools/objtool/elf.h                              |  15 +-
 tools/objtool/objtool.c                          |   3 +-
 tools/objtool/{builtin.h => orc.h}               |  18 +-
 tools/objtool/orc_dump.c                         | 212 +++++++++
 tools/objtool/orc_gen.c                          | 214 +++++++++
 tools/objtool/orc_types.h                        | 107 +++++
 42 files changed, 2449 insertions(+), 186 deletions(-)
 create mode 100644 Documentation/x86/orc-unwinder.txt
 create mode 100644 arch/um/include/asm/unwind.h
 create mode 100644 arch/x86/include/asm/orc_lookup.h
 create mode 100644 arch/x86/include/asm/orc_types.h
 create mode 100644 arch/x86/include/asm/unwind_hints.h
 create mode 100644 arch/x86/kernel/unwind_orc.c
 create mode 100644 tools/objtool/builtin-orc.c
 copy tools/objtool/{builtin.h => orc.h} (69%)
 create mode 100644 tools/objtool/orc_dump.c
 create mode 100644 tools/objtool/orc_gen.c
 create mode 100644 tools/objtool/orc_types.h

-- 
2.7.5


* [PATCH v3 01/10] x86/entry/64: Refactor IRQ stacks and make them NMI-safe
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
@ 2017-07-11 15:33 ` Josh Poimboeuf
  2017-07-18 10:40   ` [tip:x86/asm] " tip-bot for Andy Lutomirski
  2017-07-11 15:33 ` [PATCH v3 02/10] x86/entry/64: Initialize the top of the IRQ stack before switching stacks Josh Poimboeuf
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-11 15:33 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

From: Andy Lutomirski <luto@kernel.org>

This will allow IRQ stacks to nest inside NMIs or similar entries
that can happen during IRQ stack setup or teardown.

The new macros won't work correctly if they're invoked with IRQs on.
Add a check under CONFIG_DEBUG_ENTRY to detect that.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
[ Use %r10 instead of %r11 in xen_do_hypervisor_callback to make objtool
  and ORC unwinder's lives a little easier. ]
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/Kconfig.debug       |  2 --
 arch/x86/entry/entry_64.S    | 85 +++++++++++++++++++++++++++++++-------------
 arch/x86/kernel/process_64.c |  3 ++
 3 files changed, 64 insertions(+), 26 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index fcb7604..353ed09 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -305,8 +305,6 @@ config DEBUG_ENTRY
 	  Some of these sanity checks may slow down kernel entries and
 	  exits or otherwise impact performance.
 
-	  This is currently used to help test NMI code.
-
 	  If unsure, say N.
 
 config DEBUG_NMI_SELFTEST
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a9a8027..0d4483a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -447,6 +447,59 @@ ENTRY(irq_entries_start)
     .endr
 END(irq_entries_start)
 
+.macro DEBUG_ENTRY_ASSERT_IRQS_OFF
+#ifdef CONFIG_DEBUG_ENTRY
+	pushfq
+	testl $X86_EFLAGS_IF, (%rsp)
+	jz .Lokay_\@
+	ud2
+.Lokay_\@:
+	addq $8, %rsp
+#endif
+.endm
+
+/*
+ * Enters the IRQ stack if we're not already using it.  NMI-safe.  Clobbers
+ * flags and puts old RSP into old_rsp, and leaves all other GPRs alone.
+ * Requires kernel GSBASE.
+ *
+ * The invariant is that, if irq_count != -1, then the IRQ stack is in use.
+ */
+.macro ENTER_IRQ_STACK old_rsp
+	DEBUG_ENTRY_ASSERT_IRQS_OFF
+	movq	%rsp, \old_rsp
+	incl	PER_CPU_VAR(irq_count)
+
+	/*
+	 * Right now, if we just incremented irq_count to zero, we've
+	 * claimed the IRQ stack but we haven't switched to it yet.
+	 *
+	 * If anything is added that can interrupt us here without using IST,
+	 * it must be *extremely* careful to limit its stack usage.  This
+	 * could include kprobes and a hypothetical future IST-less #DB
+	 * handler.
+	 */
+
+	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
+	pushq	\old_rsp
+.endm
+
+/*
+ * Undoes ENTER_IRQ_STACK.
+ */
+.macro LEAVE_IRQ_STACK
+	DEBUG_ENTRY_ASSERT_IRQS_OFF
+	/* We need to be off the IRQ stack before decrementing irq_count. */
+	popq	%rsp
+
+	/*
+	 * As in ENTER_IRQ_STACK, irq_count == 0 means we are still
+	 * claiming the IRQ stack even though we're no longer on it.
+	 */
+
+	decl	PER_CPU_VAR(irq_count)
+.endm
+
 /*
  * Interrupt entry/exit.
  *
@@ -485,17 +538,7 @@ END(irq_entries_start)
 	CALL_enter_from_user_mode
 
 1:
-	/*
-	 * Save previous stack pointer, optionally switch to interrupt stack.
-	 * irq_count is used to check if a CPU is already on an interrupt stack
-	 * or not. While this is essentially redundant with preempt_count it is
-	 * a little cheaper to use a separate counter in the PDA (short of
-	 * moving irq_enter into assembly, which would be too much work)
-	 */
-	movq	%rsp, %rdi
-	incl	PER_CPU_VAR(irq_count)
-	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
-	pushq	%rdi
+	ENTER_IRQ_STACK old_rsp=%rdi
 	/* We entered an interrupt context - irqs are off: */
 	TRACE_IRQS_OFF
 
@@ -515,10 +558,8 @@ common_interrupt:
 ret_from_intr:
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
-	decl	PER_CPU_VAR(irq_count)
 
-	/* Restore saved previous stack */
-	popq	%rsp
+	LEAVE_IRQ_STACK
 
 	testb	$3, CS(%rsp)
 	jz	retint_kernel
@@ -891,12 +932,10 @@ bad_gs:
 ENTRY(do_softirq_own_stack)
 	pushq	%rbp
 	mov	%rsp, %rbp
-	incl	PER_CPU_VAR(irq_count)
-	cmove	PER_CPU_VAR(irq_stack_ptr), %rsp
-	push	%rbp				/* frame pointer backlink */
+	ENTER_IRQ_STACK old_rsp=%r11
 	call	__do_softirq
+	LEAVE_IRQ_STACK
 	leaveq
-	decl	PER_CPU_VAR(irq_count)
 	ret
 END(do_softirq_own_stack)
 
@@ -923,13 +962,11 @@ ENTRY(xen_do_hypervisor_callback)		/* do_hypervisor_callback(struct *pt_regs) */
  * see the correct pointer to the pt_regs
  */
 	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
-11:	incl	PER_CPU_VAR(irq_count)
-	movq	%rsp, %rbp
-	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
-	pushq	%rbp				/* frame pointer backlink */
+
+	ENTER_IRQ_STACK old_rsp=%r10
 	call	xen_evtchn_do_upcall
-	popq	%rsp
-	decl	PER_CPU_VAR(irq_count)
+	LEAVE_IRQ_STACK
+
 #ifndef CONFIG_PREEMPT
 	call	xen_maybe_preempt_hcall
 #endif
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index c3169be..2987e39 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -279,6 +279,9 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
 	unsigned prev_fsindex, prev_gsindex;
 
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
+		     this_cpu_read(irq_count) != -1);
+
 	switch_fpu_prepare(prev_fpu, cpu);
 
 	/* We must save %fs and %gs before load_TLS() because
-- 
2.7.5


* [PATCH v3 02/10] x86/entry/64: Initialize the top of the IRQ stack before switching stacks
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
  2017-07-11 15:33 ` [PATCH v3 01/10] x86/entry/64: Refactor IRQ stacks and make them NMI-safe Josh Poimboeuf
@ 2017-07-11 15:33 ` Josh Poimboeuf
  2017-07-18 10:41   ` [tip:x86/asm] " tip-bot for Andy Lutomirski
  2017-07-11 15:33 ` [PATCH v3 03/10] x86/dumpstack: fix occasionally missing registers Josh Poimboeuf
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-11 15:33 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

From: Andy Lutomirski <luto@kernel.org>

The OOPS unwinder wants the word at the top of the IRQ stack to
point back to the previous stack at all times when the IRQ stack
is in use.  There's currently a one-instruction window in ENTER_IRQ_STACK
during which this isn't the case.  Fix it by writing the old RSP to the
top of the IRQ stack before jumping.

This currently writes the pointer to the stack twice, which is a bit
ugly.  We could get rid of this by replacing irq_stack_ptr with
irq_stack_ptr_minus_eight (better name welcome).  OTOH, there may be
all kinds of odd microarchitectural considerations in play that
affect performance by a few cycles here.

Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/entry/entry_64.S | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 0d4483a..b56f7f2 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -469,6 +469,7 @@ END(irq_entries_start)
 	DEBUG_ENTRY_ASSERT_IRQS_OFF
 	movq	%rsp, \old_rsp
 	incl	PER_CPU_VAR(irq_count)
+	jnz	.Lirq_stack_push_old_rsp_\@
 
 	/*
 	 * Right now, if we just incremented irq_count to zero, we've
@@ -478,9 +479,30 @@ END(irq_entries_start)
 	 * it must be *extremely* careful to limit its stack usage.  This
 	 * could include kprobes and a hypothetical future IST-less #DB
 	 * handler.
+	 *
+	 * The OOPS unwinder relies on the word at the top of the IRQ
+	 * stack linking back to the previous RSP for the entire time we're
+	 * on the IRQ stack.  For this to work reliably, we need to write
+	 * it before we actually move ourselves to the IRQ stack.
+	 */
+
+	movq	\old_rsp, PER_CPU_VAR(irq_stack_union + IRQ_STACK_SIZE - 8)
+	movq	PER_CPU_VAR(irq_stack_ptr), %rsp
+
+#ifdef CONFIG_DEBUG_ENTRY
+	/*
+	 * If the first movq above becomes wrong due to IRQ stack layout
+	 * changes, the only way we'll notice is if we try to unwind right
+	 * here.  Assert that we set up the stack right to catch this type
+	 * of bug quickly.
 	 */
+	cmpq	-8(%rsp), \old_rsp
+	je	.Lirq_stack_okay\@
+	ud2
+	.Lirq_stack_okay\@:
+#endif
 
-	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
+.Lirq_stack_push_old_rsp_\@:
 	pushq	\old_rsp
 .endm
 
-- 
2.7.5


* [PATCH v3 03/10] x86/dumpstack: fix occasionally missing registers
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
  2017-07-11 15:33 ` [PATCH v3 01/10] x86/entry/64: Refactor IRQ stacks and make them NMI-safe Josh Poimboeuf
  2017-07-11 15:33 ` [PATCH v3 02/10] x86/entry/64: Initialize the top of the IRQ stack before switching stacks Josh Poimboeuf
@ 2017-07-11 15:33 ` Josh Poimboeuf
  2017-07-18 10:41   ` [tip:x86/asm] x86/dumpstack: Fix " tip-bot for Josh Poimboeuf
  2017-07-11 15:33 ` [PATCH v3 04/10] x86/dumpstack: fix interrupt and exception stack boundary checks Josh Poimboeuf
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-11 15:33 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

If two consecutive stack frames have pt_regs, the oops dump code fails
to print the second frame's registers.  Fix that.

Fixes: 3b3fa11bc700 ("x86/dumpstack: Print any pt_regs found on the stack")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/dumpstack.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index dbce3cc..bd265a4 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -94,6 +94,9 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		if (stack_name)
 			printk("%s <%s>\n", log_lvl, stack_name);
 
+		if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
+			__show_regs(regs, 0);
+
 		/*
 		 * Scan the stack, printing any text addresses we find.  At the
 		 * same time, follow proper stack frames with the unwinder.
@@ -118,10 +121,8 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 			 * Don't print regs->ip again if it was already printed
 			 * by __show_regs() below.
 			 */
-			if (regs && stack == &regs->ip) {
-				unwind_next_frame(&state);
-				continue;
-			}
+			if (regs && stack == &regs->ip)
+				goto next;
 
 			if (stack == ret_addr_p)
 				reliable = 1;
@@ -144,6 +145,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 			if (!reliable)
 				continue;
 
+next:
 			/*
 			 * Get the next frame from the unwinder.  No need to
 			 * check for an error: if anything goes wrong, the rest
@@ -153,7 +155,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 			/* if the frame has entry regs, print them */
 			regs = unwind_get_entry_regs(&state);
-			if (regs)
+			if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
 				__show_regs(regs, 0);
 		}
 
-- 
2.7.5


* [PATCH v3 04/10] x86/dumpstack: fix interrupt and exception stack boundary checks
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
                   ` (2 preceding siblings ...)
  2017-07-11 15:33 ` [PATCH v3 03/10] x86/dumpstack: fix occasionally missing registers Josh Poimboeuf
@ 2017-07-11 15:33 ` Josh Poimboeuf
  2017-07-18 10:42   ` [tip:x86/asm] x86/dumpstack: Fix " tip-bot for Josh Poimboeuf
  2017-07-11 15:33 ` [PATCH v3 05/10] objtool: add ORC unwind table generation Josh Poimboeuf
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-11 15:33 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

On x86_64, the double fault exception stack is located immediately after
the interrupt stack in memory.  This causes confusion in the unwinder
when it tries to unwind through an empty interrupt stack, where the
stack pointer points to the address bordering the two stacks.  The
unwinder incorrectly thinks it's running on the double fault stack.

Fix this kind of stack border confusion by never considering the
beginning address of an exception or interrupt stack to be part of the
stack.

Fixes: 5fe599e02e41 ("x86/dumpstack: Add support for unwinding empty IRQ stacks")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/kernel/dumpstack_32.c | 4 ++--
 arch/x86/kernel/dumpstack_64.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index e5f0b40..4f04814 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -37,7 +37,7 @@ static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info)
 	 * This is a software stack, so 'end' can be a valid stack pointer.
 	 * It just means the stack is empty.
 	 */
-	if (stack < begin || stack > end)
+	if (stack <= begin || stack > end)
 		return false;
 
 	info->type	= STACK_TYPE_IRQ;
@@ -62,7 +62,7 @@ static bool in_softirq_stack(unsigned long *stack, struct stack_info *info)
 	 * This is a software stack, so 'end' can be a valid stack pointer.
 	 * It just means the stack is empty.
 	 */
-	if (stack < begin || stack > end)
+	if (stack <= begin || stack > end)
 		return false;
 
 	info->type	= STACK_TYPE_SOFTIRQ;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 3e1471d..225af41 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -55,7 +55,7 @@ static bool in_exception_stack(unsigned long *stack, struct stack_info *info)
 		begin = end - (exception_stack_sizes[k] / sizeof(long));
 		regs  = (struct pt_regs *)end - 1;
 
-		if (stack < begin || stack >= end)
+		if (stack <= begin || stack >= end)
 			continue;
 
 		info->type	= STACK_TYPE_EXCEPTION + k;
@@ -78,7 +78,7 @@ static bool in_irq_stack(unsigned long *stack, struct stack_info *info)
 	 * This is a software stack, so 'end' can be a valid stack pointer.
 	 * It just means the stack is empty.
 	 */
-	if (stack < begin || stack > end)
+	if (stack <= begin || stack > end)
 		return false;
 
 	info->type	= STACK_TYPE_IRQ;
-- 
2.7.5


* [PATCH v3 05/10] objtool: add ORC unwind table generation
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
                   ` (3 preceding siblings ...)
  2017-07-11 15:33 ` [PATCH v3 04/10] x86/dumpstack: fix interrupt and exception stack boundary checks Josh Poimboeuf
@ 2017-07-11 15:33 ` Josh Poimboeuf
  2017-07-18 10:42   ` [tip:x86/asm] objtool: Add " tip-bot for Josh Poimboeuf
  2017-07-11 15:33 ` [PATCH v3 06/10] objtool, x86: add facility for asm code to provide unwind hints Josh Poimboeuf
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-11 15:33 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

Now that objtool knows the states of all registers on the stack for each
instruction, it's straightforward to generate debuginfo for an unwinder
to use.

Instead of generating DWARF, generate a new format called ORC, which is
more suitable for an in-kernel unwinder.  See
Documentation/x86/orc-unwinder.txt for a more detailed description of
this new debuginfo format and why it's preferable to DWARF.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 tools/objtool/Build                              |   3 +
 tools/objtool/Documentation/stack-validation.txt |  56 ++----
 tools/objtool/builtin-check.c                    |   2 +-
 tools/objtool/builtin-orc.c                      |  70 ++++++++
 tools/objtool/builtin.h                          |   1 +
 tools/objtool/check.c                            |  58 +++++-
 tools/objtool/check.h                            |  15 +-
 tools/objtool/elf.c                              | 212 ++++++++++++++++++++--
 tools/objtool/elf.h                              |  15 +-
 tools/objtool/objtool.c                          |   3 +-
 tools/objtool/{builtin.h => orc.h}               |  18 +-
 tools/objtool/orc_dump.c                         | 212 ++++++++++++++++++++++
 tools/objtool/orc_gen.c                          | 214 +++++++++++++++++++++++
 tools/objtool/orc_types.h                        |  85 +++++++++
 14 files changed, 899 insertions(+), 65 deletions(-)
 create mode 100644 tools/objtool/builtin-orc.c
 copy tools/objtool/{builtin.h => orc.h} (69%)
 create mode 100644 tools/objtool/orc_dump.c
 create mode 100644 tools/objtool/orc_gen.c
 create mode 100644 tools/objtool/orc_types.h

diff --git a/tools/objtool/Build b/tools/objtool/Build
index 6f2e198..749becd 100644
--- a/tools/objtool/Build
+++ b/tools/objtool/Build
@@ -1,6 +1,9 @@
 objtool-y += arch/$(SRCARCH)/
 objtool-y += builtin-check.o
+objtool-y += builtin-orc.o
 objtool-y += check.o
+objtool-y += orc_gen.o
+objtool-y += orc_dump.o
 objtool-y += elf.o
 objtool-y += special.o
 objtool-y += objtool.o
diff --git a/tools/objtool/Documentation/stack-validation.txt b/tools/objtool/Documentation/stack-validation.txt
index 17c1195..6a1af43 100644
--- a/tools/objtool/Documentation/stack-validation.txt
+++ b/tools/objtool/Documentation/stack-validation.txt
@@ -11,9 +11,6 @@ analyzes every .o file and ensures the validity of its stack metadata.
 It enforces a set of rules on asm code and C inline assembly code so
 that stack traces can be reliable.
 
-Currently it only checks frame pointer usage, but there are plans to add
-CFI validation for C files and CFI generation for asm files.
-
 For each function, it recursively follows all possible code paths and
 validates the correct frame pointer state at each instruction.
 
@@ -23,6 +20,10 @@ alternative execution paths to a given instruction (or set of
 instructions).  Similarly, it knows how to follow switch statements, for
 which gcc sometimes uses jump tables.
 
+(Objtool also has an 'orc generate' subcommand which generates debuginfo
+for the ORC unwinder.  See Documentation/x86/orc-unwinder.txt in the
+kernel tree for more details.)
+
 
 Why do we need stack metadata validation?
 -----------------------------------------
@@ -93,37 +94,14 @@ a) More reliable stack traces for frame pointer enabled kernels
        or at the very end of the function after the stack frame has been
        destroyed.  This is an inherent limitation of frame pointers.
 
-b) 100% reliable stack traces for DWARF enabled kernels
-
-   (NOTE: This is not yet implemented)
-
-   As an alternative to frame pointers, DWARF Call Frame Information
-   (CFI) metadata can be used to walk the stack.  Unlike frame pointers,
-   CFI metadata is out of band.  So it doesn't affect runtime
-   performance and it can be reliable even when interrupts or exceptions
-   are involved.
-
-   For C code, gcc automatically generates DWARF CFI metadata.  But for
-   asm code, generating CFI is a tedious manual approach which requires
-   manually placed .cfi assembler macros to be scattered throughout the
-   code.  It's clumsy and very easy to get wrong, and it makes the real
-   code harder to read.
-
-   Stacktool will improve this situation in several ways.  For code
-   which already has CFI annotations, it will validate them.  For code
-   which doesn't have CFI annotations, it will generate them.  So an
-   architecture can opt to strip out all the manual .cfi annotations
-   from their asm code and have objtool generate them instead.
+b) ORC (Oops Rewind Capability) unwind table generation
 
-   We might also add a runtime stack validation debug option where we
-   periodically walk the stack from schedule() and/or an NMI to ensure
-   that the stack metadata is sane and that we reach the bottom of the
-   stack.
+   An alternative to frame pointers and DWARF, ORC unwind data can be
+   used to walk the stack.  Unlike frame pointers, ORC data is out of
+   band.  So it doesn't affect runtime performance and it can be
+   reliable even when interrupts or exceptions are involved.
 
-   So the benefit of objtool here will be that external tooling should
-   always show perfect stack traces.  And the same will be true for
-   kernel warning/oops traces if the architecture has a runtime DWARF
-   unwinder.
+   For more details, see Documentation/x86/orc-unwinder.txt.
 
 c) Higher live patching compatibility rate
 
@@ -211,7 +189,7 @@ they mean, and suggestions for how to fix them.
    function, add proper frame pointer logic using the FRAME_BEGIN and
    FRAME_END macros.  Otherwise, if it's not a callable function, remove
    its ELF function annotation by changing ENDPROC to END, and instead
-   use the manual CFI hint macros in asm/undwarf.h.
+   use the manual unwind hint macros in asm/unwind_hints.h.
 
    If it's a GCC-compiled .c file, the error may be because the function
    uses an inline asm() statement which has a "call" instruction.  An
@@ -231,8 +209,8 @@ they mean, and suggestions for how to fix them.
    If the error is for an asm file, and the instruction is inside (or
    reachable from) a callable function, the function should be annotated
    with the ENTRY/ENDPROC macros (ENDPROC is the important one).
-   Otherwise, the code should probably be annotated with the CFI hint
-   macros in asm/undwarf.h so objtool and the unwinder can know the
+   Otherwise, the code should probably be annotated with the unwind hint
+   macros in asm/unwind_hints.h so objtool and the unwinder can know the
    stack state associated with the code.
 
    If you're 100% sure the code won't affect stack traces, or if you're
@@ -258,7 +236,7 @@ they mean, and suggestions for how to fix them.
    instructions aren't allowed in a callable function, and are most
    likely part of the kernel entry code.  They should usually not have
    the callable function annotation (ENDPROC) and should always be
-   annotated with the CFI hint macros in asm/undwarf.h.
+   annotated with the unwind hint macros in asm/unwind_hints.h.
 
 
 6. file.o: warning: objtool: func()+0x26: sibling call from callable instruction with modified stack frame
@@ -272,7 +250,7 @@ they mean, and suggestions for how to fix them.
 
    If the instruction is not actually in a callable function (e.g.
    kernel entry code), change ENDPROC to END and annotate manually with
-   the CFI hint macros in asm/undwarf.h.
+   the unwind hint macros in asm/unwind_hints.h.
 
 
 7. file: warning: objtool: func()+0x5c: stack state mismatch
@@ -288,8 +266,8 @@ they mean, and suggestions for how to fix them.
 
    Another possibility is that the code has some asm or inline asm which
    does some unusual things to the stack or the frame pointer.  In such
-   cases it's probably appropriate to use the CFI hint macros in
-   asm/undwarf.h.
+   cases it's probably appropriate to use the unwind hint macros in
+   asm/unwind_hints.h.
 
 
 8. file.o: warning: objtool: funcA() falls through to next function funcB()
diff --git a/tools/objtool/builtin-check.c b/tools/objtool/builtin-check.c
index 365c34e..eedf089 100644
--- a/tools/objtool/builtin-check.c
+++ b/tools/objtool/builtin-check.c
@@ -52,5 +52,5 @@ int cmd_check(int argc, const char **argv)
 
 	objname = argv[0];
 
-	return check(objname, nofp);
+	return check(objname, nofp, false);
 }
diff --git a/tools/objtool/builtin-orc.c b/tools/objtool/builtin-orc.c
new file mode 100644
index 0000000..5ca41ab
--- /dev/null
+++ b/tools/objtool/builtin-orc.c
@@ -0,0 +1,70 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * objtool orc:
+ *
+ * This command analyzes a .o file and adds .orc_unwind and .orc_unwind_ip
+ * sections to it, which is used by the in-kernel ORC unwinder.
+ *
+ * This command is a superset of "objtool check".
+ */
+
+#include <string.h>
+#include <subcmd/parse-options.h>
+#include "builtin.h"
+#include "check.h"
+
+
+static const char *orc_usage[] = {
+	"objtool orc generate [<options>] file.o",
+	"objtool orc dump file.o",
+	NULL,
+};
+
+extern const struct option check_options[];
+extern bool nofp;
+
+int cmd_orc(int argc, const char **argv)
+{
+	const char *objname;
+
+	argc--; argv++;
+	if (!strncmp(argv[0], "gen", 3)) {
+		argc = parse_options(argc, argv, check_options, orc_usage, 0);
+		if (argc != 1)
+			usage_with_options(orc_usage, check_options);
+
+		objname = argv[0];
+
+		return check(objname, nofp, true);
+
+	}
+
+	if (!strcmp(argv[0], "dump")) {
+		if (argc != 2)
+			usage_with_options(orc_usage, check_options);
+
+		objname = argv[1];
+
+		return orc_dump(objname);
+	}
+
+	usage_with_options(orc_usage, check_options);
+
+	return 0;
+}
diff --git a/tools/objtool/builtin.h b/tools/objtool/builtin.h
index 34d2ba7..dd52606 100644
--- a/tools/objtool/builtin.h
+++ b/tools/objtool/builtin.h
@@ -18,5 +18,6 @@
 #define _BUILTIN_H
 
 extern int cmd_check(int argc, const char **argv);
+extern int cmd_orc(int argc, const char **argv);
 
 #endif /* _BUILTIN_H */
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 2c6d748..cb57c52 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -36,8 +36,8 @@ const char *objname;
 static bool nofp;
 struct cfi_state initial_func_cfi;
 
-static struct instruction *find_insn(struct objtool_file *file,
-				     struct section *sec, unsigned long offset)
+struct instruction *find_insn(struct objtool_file *file,
+			      struct section *sec, unsigned long offset)
 {
 	struct instruction *insn;
 
@@ -259,6 +259,11 @@ static int decode_instructions(struct objtool_file *file)
 		if (!(sec->sh.sh_flags & SHF_EXECINSTR))
 			continue;
 
+		if (strcmp(sec->name, ".altinstr_replacement") &&
+		    strcmp(sec->name, ".altinstr_aux") &&
+		    strncmp(sec->name, ".discard.", 9))
+			sec->text = true;
+
 		for (offset = 0; offset < sec->len; offset += insn->len) {
 			insn = malloc(sizeof(*insn));
 			if (!insn) {
@@ -947,6 +952,30 @@ static bool has_valid_stack_frame(struct insn_state *state)
 	return false;
 }
 
+static int update_insn_state_regs(struct instruction *insn, struct insn_state *state)
+{
+	struct cfi_reg *cfa = &state->cfa;
+	struct stack_op *op = &insn->stack_op;
+
+	if (cfa->base != CFI_SP)
+		return 0;
+
+	/* push */
+	if (op->dest.type == OP_DEST_PUSH)
+		cfa->offset += 8;
+
+	/* pop */
+	if (op->src.type == OP_SRC_POP)
+		cfa->offset -= 8;
+
+	/* add immediate to sp */
+	if (op->dest.type == OP_DEST_REG && op->src.type == OP_SRC_ADD &&
+	    op->dest.reg == CFI_SP && op->src.reg == CFI_SP)
+		cfa->offset -= op->src.offset;
+
+	return 0;
+}
+
 static void save_reg(struct insn_state *state, unsigned char reg, int base,
 		     int offset)
 {
@@ -1032,6 +1061,9 @@ static int update_insn_state(struct instruction *insn, struct insn_state *state)
 		return 0;
 	}
 
+	if (state->type == ORC_TYPE_REGS || state->type == ORC_TYPE_REGS_IRET)
+		return update_insn_state_regs(insn, state);
+
 	switch (op->dest.type) {
 
 	case OP_DEST_REG:
@@ -1323,6 +1355,10 @@ static bool insn_state_match(struct instruction *insn, struct insn_state *state)
 			break;
 		}
 
+	} else if (state1->type != state2->type) {
+		WARN_FUNC("stack state mismatch: type1=%d type2=%d",
+			  insn->sec, insn->offset, state1->type, state2->type);
+
 	} else if (state1->drap != state2->drap ||
 		 (state1->drap && state1->drap_reg != state2->drap_reg)) {
 		WARN_FUNC("stack state mismatch: drap1=%d(%d) drap2=%d(%d)",
@@ -1613,7 +1649,7 @@ static void cleanup(struct objtool_file *file)
 	elf_close(file->elf);
 }
 
-int check(const char *_objname, bool _nofp)
+int check(const char *_objname, bool _nofp, bool orc)
 {
 	struct objtool_file file;
 	int ret, warnings = 0;
@@ -1621,7 +1657,7 @@ int check(const char *_objname, bool _nofp)
 	objname = _objname;
 	nofp = _nofp;
 
-	file.elf = elf_open(objname);
+	file.elf = elf_open(objname, orc ? O_RDWR : O_RDONLY);
 	if (!file.elf)
 		return 1;
 
@@ -1654,6 +1690,20 @@ int check(const char *_objname, bool _nofp)
 		warnings += ret;
 	}
 
+	if (orc) {
+		ret = create_orc(&file);
+		if (ret < 0)
+			goto out;
+
+		ret = create_orc_sections(&file);
+		if (ret < 0)
+			goto out;
+
+		ret = elf_write(file.elf);
+		if (ret < 0)
+			goto out;
+	}
+
 out:
 	cleanup(&file);
 
diff --git a/tools/objtool/check.h b/tools/objtool/check.h
index da85f5b..046874b 100644
--- a/tools/objtool/check.h
+++ b/tools/objtool/check.h
@@ -22,12 +22,14 @@
 #include "elf.h"
 #include "cfi.h"
 #include "arch.h"
+#include "orc.h"
 #include <linux/hashtable.h>
 
 struct insn_state {
 	struct cfi_reg cfa;
 	struct cfi_reg regs[CFI_NUM_REGS];
 	int stack_size;
+	unsigned char type;
 	bool bp_scratch;
 	bool drap;
 	int drap_reg;
@@ -48,6 +50,7 @@ struct instruction {
 	struct symbol *func;
 	struct stack_op stack_op;
 	struct insn_state state;
+	struct orc_entry orc;
 };
 
 struct objtool_file {
@@ -58,9 +61,19 @@ struct objtool_file {
 	bool ignore_unreachables, c_file;
 };
 
-int check(const char *objname, bool nofp);
+int check(const char *objname, bool nofp, bool orc);
+
+struct instruction *find_insn(struct objtool_file *file,
+			      struct section *sec, unsigned long offset);
 
 #define for_each_insn(file, insn)					\
 	list_for_each_entry(insn, &file->insn_list, list)
 
+#define sec_for_each_insn(file, sec, insn)				\
+	for (insn = find_insn(file, sec, 0);				\
+	     insn && &insn->list != &file->insn_list &&			\
+			insn->sec == sec;				\
+	     insn = list_next_entry(insn, list))
+
+
 #endif /* _CHECK_H */
diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index 1a7e8aa..6e9f980 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -30,16 +30,6 @@
 #include "elf.h"
 #include "warn.h"
 
-/*
- * Fallback for systems without this "read, mmaping if possible" cmd.
- */
-#ifndef ELF_C_READ_MMAP
-#define ELF_C_READ_MMAP ELF_C_READ
-#endif
-
-#define WARN_ELF(format, ...)					\
-	WARN(format ": %s", ##__VA_ARGS__, elf_errmsg(-1))
-
 struct section *find_section_by_name(struct elf *elf, const char *name)
 {
 	struct section *sec;
@@ -349,9 +339,10 @@ static int read_relas(struct elf *elf)
 	return 0;
 }
 
-struct elf *elf_open(const char *name)
+struct elf *elf_open(const char *name, int flags)
 {
 	struct elf *elf;
+	Elf_Cmd cmd;
 
 	elf_version(EV_CURRENT);
 
@@ -364,13 +355,20 @@ struct elf *elf_open(const char *name)
 
 	INIT_LIST_HEAD(&elf->sections);
 
-	elf->fd = open(name, O_RDONLY);
+	elf->fd = open(name, flags);
 	if (elf->fd == -1) {
 		perror("open");
 		goto err;
 	}
 
-	elf->elf = elf_begin(elf->fd, ELF_C_READ_MMAP, NULL);
+	if ((flags & O_ACCMODE) == O_RDONLY)
+		cmd = ELF_C_READ_MMAP;
+	else if ((flags & O_ACCMODE) == O_RDWR)
+		cmd = ELF_C_RDWR;
+	else /* O_WRONLY */
+		cmd = ELF_C_WRITE;
+
+	elf->elf = elf_begin(elf->fd, cmd, NULL);
 	if (!elf->elf) {
 		WARN_ELF("elf_begin");
 		goto err;
@@ -397,6 +395,194 @@ struct elf *elf_open(const char *name)
 	return NULL;
 }
 
+struct section *elf_create_section(struct elf *elf, const char *name,
+				   size_t entsize, int nr)
+{
+	struct section *sec, *shstrtab;
+	size_t size = entsize * nr;
+	struct Elf_Scn *s;
+	Elf_Data *data;
+
+	sec = malloc(sizeof(*sec));
+	if (!sec) {
+		perror("malloc");
+		return NULL;
+	}
+	memset(sec, 0, sizeof(*sec));
+
+	INIT_LIST_HEAD(&sec->symbol_list);
+	INIT_LIST_HEAD(&sec->rela_list);
+	hash_init(sec->rela_hash);
+	hash_init(sec->symbol_hash);
+
+	list_add_tail(&sec->list, &elf->sections);
+
+	s = elf_newscn(elf->elf);
+	if (!s) {
+		WARN_ELF("elf_newscn");
+		return NULL;
+	}
+
+	sec->name = strdup(name);
+	if (!sec->name) {
+		perror("strdup");
+		return NULL;
+	}
+
+	sec->idx = elf_ndxscn(s);
+	sec->len = size;
+	sec->changed = true;
+
+	sec->data = elf_newdata(s);
+	if (!sec->data) {
+		WARN_ELF("elf_newdata");
+		return NULL;
+	}
+
+	sec->data->d_size = size;
+	sec->data->d_align = 1;
+
+	if (size) {
+		sec->data->d_buf = malloc(size);
+		if (!sec->data->d_buf) {
+			perror("malloc");
+			return NULL;
+		}
+		memset(sec->data->d_buf, 0, size);
+	}
+
+	if (!gelf_getshdr(s, &sec->sh)) {
+		WARN_ELF("gelf_getshdr");
+		return NULL;
+	}
+
+	sec->sh.sh_size = size;
+	sec->sh.sh_entsize = entsize;
+	sec->sh.sh_type = SHT_PROGBITS;
+	sec->sh.sh_addralign = 1;
+	sec->sh.sh_flags = SHF_ALLOC;
+
+
+	/* Add section name to .shstrtab */
+	shstrtab = find_section_by_name(elf, ".shstrtab");
+	if (!shstrtab) {
+		WARN("can't find .shstrtab section");
+		return NULL;
+	}
+
+	s = elf_getscn(elf->elf, shstrtab->idx);
+	if (!s) {
+		WARN_ELF("elf_getscn");
+		return NULL;
+	}
+
+	data = elf_newdata(s);
+	if (!data) {
+		WARN_ELF("elf_newdata");
+		return NULL;
+	}
+
+	data->d_buf = sec->name;
+	data->d_size = strlen(name) + 1;
+	data->d_align = 1;
+
+	sec->sh.sh_name = shstrtab->len;
+
+	shstrtab->len += strlen(name) + 1;
+	shstrtab->changed = true;
+
+	return sec;
+}
+
+struct section *elf_create_rela_section(struct elf *elf, struct section *base)
+{
+	char *relaname;
+	struct section *sec;
+
+	relaname = malloc(strlen(base->name) + strlen(".rela") + 1);
+	if (!relaname) {
+		perror("malloc");
+		return NULL;
+	}
+	strcpy(relaname, ".rela");
+	strcat(relaname, base->name);
+
+	sec = elf_create_section(elf, relaname, sizeof(GElf_Rela), 0);
+	if (!sec)
+		return NULL;
+
+	base->rela = sec;
+	sec->base = base;
+
+	sec->sh.sh_type = SHT_RELA;
+	sec->sh.sh_addralign = 8;
+	sec->sh.sh_link = find_section_by_name(elf, ".symtab")->idx;
+	sec->sh.sh_info = base->idx;
+	sec->sh.sh_flags = SHF_INFO_LINK;
+
+	return sec;
+}
+
+int elf_rebuild_rela_section(struct section *sec)
+{
+	struct rela *rela;
+	int nr, idx = 0, size;
+	GElf_Rela *relas;
+
+	nr = 0;
+	list_for_each_entry(rela, &sec->rela_list, list)
+		nr++;
+
+	size = nr * sizeof(*relas);
+	relas = malloc(size);
+	if (!relas) {
+		perror("malloc");
+		return -1;
+	}
+
+	sec->data->d_buf = relas;
+	sec->data->d_size = size;
+
+	sec->sh.sh_size = size;
+
+	idx = 0;
+	list_for_each_entry(rela, &sec->rela_list, list) {
+		relas[idx].r_offset = rela->offset;
+		relas[idx].r_addend = rela->addend;
+		relas[idx].r_info = GELF_R_INFO(rela->sym->idx, rela->type);
+		idx++;
+	}
+
+	return 0;
+}
+
+int elf_write(struct elf *elf)
+{
+	struct section *sec;
+	Elf_Scn *s;
+
+	list_for_each_entry(sec, &elf->sections, list) {
+		if (sec->changed) {
+			s = elf_getscn(elf->elf, sec->idx);
+			if (!s) {
+				WARN_ELF("elf_getscn");
+				return -1;
+			}
+			if (!gelf_update_shdr (s, &sec->sh)) {
+				WARN_ELF("gelf_update_shdr");
+				return -1;
+			}
+		}
+	}
+
+	if (elf_update(elf->elf, ELF_C_WRITE) < 0) {
+		WARN_ELF("elf_update");
+		return -1;
+	}
+
+	return 0;
+}
+
 void elf_close(struct elf *elf)
 {
 	struct section *sec, *tmpsec;
diff --git a/tools/objtool/elf.h b/tools/objtool/elf.h
index 343968b..d86e2ff1 100644
--- a/tools/objtool/elf.h
+++ b/tools/objtool/elf.h
@@ -28,6 +28,13 @@
 # define elf_getshdrstrndx elf_getshstrndx
 #endif
 
+/*
+ * Fallback for systems without this "read, mmaping if possible" cmd.
+ */
+#ifndef ELF_C_READ_MMAP
+#define ELF_C_READ_MMAP ELF_C_READ
+#endif
+
 struct section {
 	struct list_head list;
 	GElf_Shdr sh;
@@ -41,6 +48,7 @@ struct section {
 	char *name;
 	int idx;
 	unsigned int len;
+	bool changed, text;
 };
 
 struct symbol {
@@ -75,7 +83,7 @@ struct elf {
 };
 
 
-struct elf *elf_open(const char *name);
+struct elf *elf_open(const char *name, int flags);
 struct section *find_section_by_name(struct elf *elf, const char *name);
 struct symbol *find_symbol_by_offset(struct section *sec, unsigned long offset);
 struct symbol *find_symbol_containing(struct section *sec, unsigned long offset);
@@ -83,6 +91,11 @@ struct rela *find_rela_by_dest(struct section *sec, unsigned long offset);
 struct rela *find_rela_by_dest_range(struct section *sec, unsigned long offset,
 				     unsigned int len);
 struct symbol *find_containing_func(struct section *sec, unsigned long offset);
+struct section *elf_create_section(struct elf *elf, const char *name, size_t
+				   entsize, int nr);
+struct section *elf_create_rela_section(struct elf *elf, struct section *base);
+int elf_rebuild_rela_section(struct section *sec);
+int elf_write(struct elf *elf);
 void elf_close(struct elf *elf);
 
 #define for_each_sec(file, sec)						\
diff --git a/tools/objtool/objtool.c b/tools/objtool/objtool.c
index ecc5b1b..31e0f91 100644
--- a/tools/objtool/objtool.c
+++ b/tools/objtool/objtool.c
@@ -42,10 +42,11 @@ struct cmd_struct {
 };
 
 static const char objtool_usage_string[] =
-	"objtool [OPTIONS] COMMAND [ARGS]";
+	"objtool COMMAND [ARGS]";
 
 static struct cmd_struct objtool_cmds[] = {
 	{"check",	cmd_check,	"Perform stack metadata validation on an object file" },
+	{"orc",		cmd_orc,	"Generate in-place ORC unwind tables for an object file" },
 };
 
 bool help;
diff --git a/tools/objtool/builtin.h b/tools/objtool/orc.h
similarity index 69%
copy from tools/objtool/builtin.h
copy to tools/objtool/orc.h
index 34d2ba7..a4139e3 100644
--- a/tools/objtool/builtin.h
+++ b/tools/objtool/orc.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2015 Josh Poimboeuf <jpoimboe@redhat.com>
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
@@ -14,9 +14,17 @@
  * You should have received a copy of the GNU General Public License
  * along with this program; if not, see <http://www.gnu.org/licenses/>.
  */
-#ifndef _BUILTIN_H
-#define _BUILTIN_H
 
-extern int cmd_check(int argc, const char **argv);
+#ifndef _ORC_H
+#define _ORC_H
 
-#endif /* _BUILTIN_H */
+#include "orc_types.h"
+
+struct objtool_file;
+
+int create_orc(struct objtool_file *file);
+int create_orc_sections(struct objtool_file *file);
+
+int orc_dump(const char *objname);
+
+#endif /* _ORC_H */
diff --git a/tools/objtool/orc_dump.c b/tools/objtool/orc_dump.c
new file mode 100644
index 0000000..36c5bf6
--- /dev/null
+++ b/tools/objtool/orc_dump.c
@@ -0,0 +1,212 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <unistd.h>
+#include "orc.h"
+#include "warn.h"
+
+static const char *reg_name(unsigned int reg)
+{
+	switch (reg) {
+	case ORC_REG_PREV_SP:
+		return "prevsp";
+	case ORC_REG_DX:
+		return "dx";
+	case ORC_REG_DI:
+		return "di";
+	case ORC_REG_BP:
+		return "bp";
+	case ORC_REG_SP:
+		return "sp";
+	case ORC_REG_R10:
+		return "r10";
+	case ORC_REG_R13:
+		return "r13";
+	case ORC_REG_BP_INDIRECT:
+		return "bp(ind)";
+	case ORC_REG_SP_INDIRECT:
+		return "sp(ind)";
+	default:
+		return "?";
+	}
+}
+
+static const char *orc_type_name(unsigned int type)
+{
+	switch (type) {
+	case ORC_TYPE_CALL:
+		return "call";
+	case ORC_TYPE_REGS:
+		return "regs";
+	case ORC_TYPE_REGS_IRET:
+		return "iret";
+	default:
+		return "?";
+	}
+}
+
+static void print_reg(unsigned int reg, int offset)
+{
+	if (reg == ORC_REG_BP_INDIRECT)
+		printf("(bp%+d)", offset);
+	else if (reg == ORC_REG_SP_INDIRECT)
+		printf("(sp%+d)", offset);
+	else if (reg == ORC_REG_UNDEFINED)
+		printf("(und)");
+	else
+		printf("%s%+d", reg_name(reg), offset);
+}
+
+int orc_dump(const char *_objname)
+{
+	int fd, nr_entries, i, *orc_ip = NULL, orc_size = 0;
+	struct orc_entry *orc = NULL;
+	char *name;
+	unsigned long nr_sections, orc_ip_addr = 0;
+	size_t shstrtab_idx;
+	Elf *elf;
+	Elf_Scn *scn;
+	GElf_Shdr sh;
+	GElf_Rela rela;
+	GElf_Sym sym;
+	Elf_Data *data, *symtab = NULL, *rela_orc_ip = NULL;
+
+	objname = _objname;
+
+	elf_version(EV_CURRENT);
+
+	fd = open(objname, O_RDONLY);
+	if (fd == -1) {
+		perror("open");
+		return -1;
+	}
+
+	elf = elf_begin(fd, ELF_C_READ_MMAP, NULL);
+	if (!elf) {
+		WARN_ELF("elf_begin");
+		return -1;
+	}
+
+	if (elf_getshdrnum(elf, &nr_sections)) {
+		WARN_ELF("elf_getshdrnum");
+		return -1;
+	}
+
+	if (elf_getshdrstrndx(elf, &shstrtab_idx)) {
+		WARN_ELF("elf_getshdrstrndx");
+		return -1;
+	}
+
+	for (i = 0; i < nr_sections; i++) {
+		scn = elf_getscn(elf, i);
+		if (!scn) {
+			WARN_ELF("elf_getscn");
+			return -1;
+		}
+
+		if (!gelf_getshdr(scn, &sh)) {
+			WARN_ELF("gelf_getshdr");
+			return -1;
+		}
+
+		name = elf_strptr(elf, shstrtab_idx, sh.sh_name);
+		if (!name) {
+			WARN_ELF("elf_strptr");
+			return -1;
+		}
+
+		data = elf_getdata(scn, NULL);
+		if (!data) {
+			WARN_ELF("elf_getdata");
+			return -1;
+		}
+
+		if (!strcmp(name, ".symtab")) {
+			symtab = data;
+		} else if (!strcmp(name, ".orc_unwind")) {
+			orc = data->d_buf;
+			orc_size = sh.sh_size;
+		} else if (!strcmp(name, ".orc_unwind_ip")) {
+			orc_ip = data->d_buf;
+			orc_ip_addr = sh.sh_addr;
+		} else if (!strcmp(name, ".rela.orc_unwind_ip")) {
+			rela_orc_ip = data;
+		}
+	}
+
+	if (!symtab || !orc || !orc_ip)
+		return 0;
+
+	if (orc_size % sizeof(*orc) != 0) {
+		WARN("bad .orc_unwind section size");
+		return -1;
+	}
+
+	nr_entries = orc_size / sizeof(*orc);
+	for (i = 0; i < nr_entries; i++) {
+		if (rela_orc_ip) {
+			if (!gelf_getrela(rela_orc_ip, i, &rela)) {
+				WARN_ELF("gelf_getrela");
+				return -1;
+			}
+
+			if (!gelf_getsym(symtab, GELF_R_SYM(rela.r_info), &sym)) {
+				WARN_ELF("gelf_getsym");
+				return -1;
+			}
+
+			scn = elf_getscn(elf, sym.st_shndx);
+			if (!scn) {
+				WARN_ELF("elf_getscn");
+				return -1;
+			}
+
+			if (!gelf_getshdr(scn, &sh)) {
+				WARN_ELF("gelf_getshdr");
+				return -1;
+			}
+
+			name = elf_strptr(elf, shstrtab_idx, sh.sh_name);
+			if (!name || !*name) {
+				WARN_ELF("elf_strptr");
+				return -1;
+			}
+
+			printf("%s+%lx:", name, rela.r_addend);
+
+		} else {
+			printf("%lx:", orc_ip_addr + (i * sizeof(int)) + orc_ip[i]);
+		}
+
+		printf(" sp:");
+
+		print_reg(orc[i].sp_reg, orc[i].sp_offset);
+
+		printf(" bp:");
+
+		print_reg(orc[i].bp_reg, orc[i].bp_offset);
+
+		printf(" type:%s\n", orc_type_name(orc[i].type));
+	}
+
+	elf_end(elf);
+	close(fd);
+
+	return 0;
+}
diff --git a/tools/objtool/orc_gen.c b/tools/objtool/orc_gen.c
new file mode 100644
index 0000000..e5ca314
--- /dev/null
+++ b/tools/objtool/orc_gen.c
@@ -0,0 +1,214 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <stdlib.h>
+#include <string.h>
+
+#include "orc.h"
+#include "check.h"
+#include "warn.h"
+
+int create_orc(struct objtool_file *file)
+{
+	struct instruction *insn;
+
+	for_each_insn(file, insn) {
+		struct orc_entry *orc = &insn->orc;
+		struct cfi_reg *cfa = &insn->state.cfa;
+		struct cfi_reg *bp = &insn->state.regs[CFI_BP];
+
+		if (cfa->base == CFI_UNDEFINED) {
+			orc->sp_reg = ORC_REG_UNDEFINED;
+			continue;
+		}
+
+		switch (cfa->base) {
+		case CFI_SP:
+			orc->sp_reg = ORC_REG_SP;
+			break;
+		case CFI_SP_INDIRECT:
+			orc->sp_reg = ORC_REG_SP_INDIRECT;
+			break;
+		case CFI_BP:
+			orc->sp_reg = ORC_REG_BP;
+			break;
+		case CFI_BP_INDIRECT:
+			orc->sp_reg = ORC_REG_BP_INDIRECT;
+			break;
+		case CFI_R10:
+			orc->sp_reg = ORC_REG_R10;
+			break;
+		case CFI_R13:
+			orc->sp_reg = ORC_REG_R13;
+			break;
+		case CFI_DI:
+			orc->sp_reg = ORC_REG_DI;
+			break;
+		case CFI_DX:
+			orc->sp_reg = ORC_REG_DX;
+			break;
+		default:
+			WARN_FUNC("unknown CFA base reg %d",
+				  insn->sec, insn->offset, cfa->base);
+			return -1;
+		}
+
+		switch(bp->base) {
+		case CFI_UNDEFINED:
+			orc->bp_reg = ORC_REG_UNDEFINED;
+			break;
+		case CFI_CFA:
+			orc->bp_reg = ORC_REG_PREV_SP;
+			break;
+		case CFI_BP:
+			orc->bp_reg = ORC_REG_BP;
+			break;
+		default:
+			WARN_FUNC("unknown BP base reg %d",
+				  insn->sec, insn->offset, bp->base);
+			return -1;
+		}
+
+		orc->sp_offset = cfa->offset;
+		orc->bp_offset = bp->offset;
+		orc->type = insn->state.type;
+	}
+
+	return 0;
+}
+
+static int create_orc_entry(struct section *u_sec, struct section *ip_relasec,
+				unsigned int idx, struct section *insn_sec,
+				unsigned long insn_off, struct orc_entry *o)
+{
+	struct orc_entry *orc;
+	struct rela *rela;
+
+	/* populate ORC data */
+	orc = (struct orc_entry *)u_sec->data->d_buf + idx;
+	memcpy(orc, o, sizeof(*orc));
+
+	/* populate rela for ip */
+	rela = malloc(sizeof(*rela));
+	if (!rela) {
+		perror("malloc");
+		return -1;
+	}
+	memset(rela, 0, sizeof(*rela));
+
+	rela->sym = insn_sec->sym;
+	rela->addend = insn_off;
+	rela->type = R_X86_64_PC32;
+	rela->offset = idx * sizeof(int);
+
+	list_add_tail(&rela->list, &ip_relasec->rela_list);
+	hash_add(ip_relasec->rela_hash, &rela->hash, rela->offset);
+
+	return 0;
+}
+
+int create_orc_sections(struct objtool_file *file)
+{
+	struct instruction *insn, *prev_insn;
+	struct section *sec, *u_sec, *ip_relasec;
+	unsigned int idx;
+
+	struct orc_entry empty = {
+		.sp_reg = ORC_REG_UNDEFINED,
+		.bp_reg  = ORC_REG_UNDEFINED,
+		.type    = ORC_TYPE_CALL,
+	};
+
+	sec = find_section_by_name(file->elf, ".orc_unwind");
+	if (sec) {
+		WARN("file already has .orc_unwind section, skipping");
+		return -1;
+	}
+
+	/* count the number of needed orcs */
+	idx = 0;
+	for_each_sec(file, sec) {
+		if (!sec->text)
+			continue;
+
+		prev_insn = NULL;
+		sec_for_each_insn(file, sec, insn) {
+			if (!prev_insn ||
+			    memcmp(&insn->orc, &prev_insn->orc,
+				   sizeof(struct orc_entry))) {
+				idx++;
+			}
+			prev_insn = insn;
+		}
+
+		/* section terminator */
+		if (prev_insn)
+			idx++;
+	}
+	if (!idx)
+		return -1;
+
+	/* create .orc_unwind_ip and .rela.orc_unwind_ip sections */
+	sec = elf_create_section(file->elf, ".orc_unwind_ip", sizeof(int), idx);
+	if (!sec)
+		return -1;
+
+	ip_relasec = elf_create_rela_section(file->elf, sec);
+	if (!ip_relasec)
+		return -1;
+
+	/* create .orc_unwind section */
+	u_sec = elf_create_section(file->elf, ".orc_unwind",
+				   sizeof(struct orc_entry), idx);
+	if (!u_sec)
+		return -1;
+
+	/* populate sections */
+	idx = 0;
+	for_each_sec(file, sec) {
+		if (!sec->text)
+			continue;
+
+		prev_insn = NULL;
+		sec_for_each_insn(file, sec, insn) {
+			if (!prev_insn || memcmp(&insn->orc, &prev_insn->orc,
+						 sizeof(struct orc_entry))) {
+
+				if (create_orc_entry(u_sec, ip_relasec, idx,
+						     insn->sec, insn->offset,
+						     &insn->orc))
+					return -1;
+
+				idx++;
+			}
+			prev_insn = insn;
+		}
+
+		/* section terminator */
+		if (prev_insn) {
+			if (create_orc_entry(u_sec, ip_relasec, idx,
+					     prev_insn->sec,
+					     prev_insn->offset + prev_insn->len,
+					     &empty))
+				return -1;
+
+			idx++;
+		}
+	}
+
+	if (elf_rebuild_rela_section(ip_relasec))
+		return -1;
+
+	return 0;
+}
diff --git a/tools/objtool/orc_types.h b/tools/objtool/orc_types.h
new file mode 100644
index 0000000..fc5cf6c
--- /dev/null
+++ b/tools/objtool/orc_types.h
@@ -0,0 +1,85 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _ORC_TYPES_H
+#define _ORC_TYPES_H
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+
+/*
+ * The ORC_REG_* registers are base registers which are used to find other
+ * registers on the stack.
+ *
+ * ORC_REG_PREV_SP, also known as DWARF Call Frame Address (CFA), is the
+ * address of the previous frame: the caller's SP before it called the current
+ * function.
+ *
+ * ORC_REG_UNDEFINED means the corresponding register's value didn't change in
+ * the current frame.
+ *
+ * The most commonly used base registers are SP and BP -- which the previous SP
+ * is usually based on -- and PREV_SP and UNDEFINED -- which the previous BP is
+ * usually based on.
+ *
+ * The rest of the base registers are needed for special cases like entry code
+ * and GCC realigned stacks.
+ */
+#define ORC_REG_UNDEFINED		0
+#define ORC_REG_PREV_SP			1
+#define ORC_REG_DX			2
+#define ORC_REG_DI			3
+#define ORC_REG_BP			4
+#define ORC_REG_SP			5
+#define ORC_REG_R10			6
+#define ORC_REG_R13			7
+#define ORC_REG_BP_INDIRECT		8
+#define ORC_REG_SP_INDIRECT		9
+#define ORC_REG_MAX			15
+
+/*
+ * ORC_TYPE_CALL: Indicates that sp_reg+sp_offset resolves to PREV_SP (the
+ * caller's SP right before it made the call).  Used for all callable
+ * functions, i.e. all C code and all callable asm functions.
+ *
+ * ORC_TYPE_REGS: Used in entry code to indicate that sp_reg+sp_offset points
+ * to a fully populated pt_regs from a syscall, interrupt, or exception.
+ *
+ * ORC_TYPE_REGS_IRET: Used in entry code to indicate that sp_reg+sp_offset
+ * points to the iret return frame.
+ */
+#define ORC_TYPE_CALL			0
+#define ORC_TYPE_REGS			1
+#define ORC_TYPE_REGS_IRET		2
+
+/*
+ * This struct is more or less a vastly simplified version of the DWARF Call
+ * Frame Information standard.  It contains only the necessary parts of DWARF
+ * CFI, simplified for ease of access by the in-kernel unwinder.  It tells the
+ * unwinder how to find the previous SP and BP (and sometimes entry regs) on
+ * the stack for a given code address.  Each instance of the struct corresponds
+ * to one or more code locations.
+ */
+struct orc_entry {
+	s16		sp_offset;
+	s16		bp_offset;
+	unsigned	sp_reg:4;
+	unsigned	bp_reg:4;
+	unsigned	type:2;
+} __packed;
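+
+/*
+ * Example with illustrative values: for a typical C function after
+ * "push %rbp; mov %rsp, %rbp", the entry would look roughly like:
+ *
+ *	{ .sp_offset = 16, .bp_offset = -16, .sp_reg = ORC_REG_BP,
+ *	  .bp_reg = ORC_REG_PREV_SP, .type = ORC_TYPE_CALL }
+ *
+ * i.e. the previous SP is BP+16, and the previous BP was saved at
+ * (previous SP)-16.
+ */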
+
+#endif /* _ORC_TYPES_H */
-- 
2.7.5


* [PATCH v3 06/10] objtool, x86: add facility for asm code to provide unwind hints
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
                   ` (4 preceding siblings ...)
  2017-07-11 15:33 ` [PATCH v3 05/10] objtool: add ORC unwind table generation Josh Poimboeuf
@ 2017-07-11 15:33 ` Josh Poimboeuf
  2017-07-18 10:43   ` [tip:x86/asm] objtool, x86: Add " tip-bot for Josh Poimboeuf
  2017-07-11 15:33 ` [PATCH v3 07/10] x86/entry/64: add unwind hint annotations Josh Poimboeuf
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-11 15:33 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

Some asm (and inline asm) code does special things to the stack which
objtool can't understand.  (Nor can GCC or GNU assembler, for that
matter.)  In such cases we need a facility for the code to provide
annotations, so the unwinder can unwind through it.

This provides such a facility, in the form of unwind hints.  They're
similar to the GNU assembler .cfi* directives, but they give more
information, and are needed in far fewer places, because objtool can
fill in the blanks by following branches and adjusting the stack pointer
for pushes and pops.
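
As a rough illustration (field values made up), each use of a hint
macro becomes one struct unwind_hint record in the
.discard.unwind_hints section, which objtool consumes when generating
the ORC data:

	struct unwind_hint hint = {
		.ip		= 0x1234,	/* PC-relative insn address */
		.sp_offset	= 8,		/* prev SP = sp_reg + sp_offset */
		.sp_reg		= ORC_REG_SP,
		.type		= ORC_TYPE_CALL,
	};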

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 .../objtool => arch/x86/include/asm}/orc_types.h   |  24 ++-
 arch/x86/include/asm/unwind_hints.h                | 103 +++++++++++
 tools/objtool/Makefile                             |   3 +
 tools/objtool/check.c                              | 191 +++++++++++++++++++--
 tools/objtool/check.h                              |   4 +-
 tools/objtool/orc_types.h                          |  22 +++
 6 files changed, 333 insertions(+), 14 deletions(-)
 copy {tools/objtool => arch/x86/include/asm}/orc_types.h (82%)
 create mode 100644 arch/x86/include/asm/unwind_hints.h

diff --git a/tools/objtool/orc_types.h b/arch/x86/include/asm/orc_types.h
similarity index 82%
copy from tools/objtool/orc_types.h
copy to arch/x86/include/asm/orc_types.h
index fc5cf6c..7dc777a 100644
--- a/tools/objtool/orc_types.h
+++ b/arch/x86/include/asm/orc_types.h
@@ -61,11 +61,19 @@
  *
  * ORC_TYPE_REGS_IRET: Used in entry code to indicate that sp_reg+sp_offset
  * points to the iret return frame.
+ *
+ * The UNWIND_HINT macros are used only for the unwind_hint struct.  They
+ * aren't used in struct orc_entry due to size and complexity constraints.
+ * Objtool converts them to real types when it converts the hints to orc
+ * entries.
  */
 #define ORC_TYPE_CALL			0
 #define ORC_TYPE_REGS			1
 #define ORC_TYPE_REGS_IRET		2
+#define UNWIND_HINT_TYPE_SAVE		3
+#define UNWIND_HINT_TYPE_RESTORE	4
 
+#ifndef __ASSEMBLY__
 /*
  * This struct is more or less a vastly simplified version of the DWARF Call
  * Frame Information standard.  It contains only the necessary parts of DWARF
@@ -80,6 +88,20 @@ struct orc_entry {
 	unsigned	sp_reg:4;
 	unsigned	bp_reg:4;
 	unsigned	type:2;
-} __packed;
+};
+
+/*
+ * This struct is used by asm and inline asm code to manually annotate the
+ * location of registers on the stack for the ORC unwinder.
+ *
+ * Type can be either ORC_TYPE_* or UNWIND_HINT_TYPE_*.
+ */
+struct unwind_hint {
+	u32		ip;
+	s16		sp_offset;
+	u8		sp_reg;
+	u8		type;
+};
+#endif /* __ASSEMBLY__ */
 
 #endif /* _ORC_TYPES_H */
diff --git a/arch/x86/include/asm/unwind_hints.h b/arch/x86/include/asm/unwind_hints.h
new file mode 100644
index 0000000..5e02b11
--- /dev/null
+++ b/arch/x86/include/asm/unwind_hints.h
@@ -0,0 +1,103 @@
+#ifndef _ASM_X86_UNWIND_HINTS_H
+#define _ASM_X86_UNWIND_HINTS_H
+
+#include "orc_types.h"
+
+#ifdef __ASSEMBLY__
+
+/*
+ * In asm, there are two kinds of code: normal C-type callable functions and
+ * the rest.  The normal callable functions can be called by other code, and
+ * don't do anything unusual with the stack.  Such normal callable functions
+ * are annotated with the ENTRY/ENDPROC macros.  Most asm code falls in this
+ * category.  In this case, no special debugging annotations are needed because
+ * objtool can automatically generate the ORC data for the ORC unwinder to read
+ * at runtime.
+ *
+ * Anything which doesn't fall into the above category, such as syscall and
+ * interrupt handlers, tends to not be called directly by other functions, and
+ * often does unusual non-C-function-type things with the stack pointer.  Such
+ * code needs to be annotated such that objtool can understand it.  The
+ * following CFI hint macros are for this type of code.
+ *
+ * These macros provide hints to objtool about the state of the stack at each
+ * instruction.  Objtool starts from the hints and follows the code flow,
+ * making automatic CFI adjustments when it sees pushes and pops, filling out
+ * the debuginfo as necessary.  It will also warn if it sees any
+ * inconsistencies.
+ */
+.macro UNWIND_HINT sp_reg=ORC_REG_SP sp_offset=0 type=ORC_TYPE_CALL
+#ifdef CONFIG_STACK_VALIDATION
+.Lunwind_hint_ip_\@:
+	.pushsection .discard.unwind_hints
+		/* struct unwind_hint */
+		.long .Lunwind_hint_ip_\@ - .
+		.short \sp_offset
+		.byte \sp_reg
+		.byte \type
+	.popsection
+#endif
+.endm
+
+.macro UNWIND_HINT_EMPTY
+	UNWIND_HINT sp_reg=ORC_REG_UNDEFINED
+.endm
+
+.macro UNWIND_HINT_REGS base=%rsp offset=0 indirect=0 extra=1 iret=0
+	.if \base == %rsp && \indirect
+		.set sp_reg, ORC_REG_SP_INDIRECT
+	.elseif \base == %rsp
+		.set sp_reg, ORC_REG_SP
+	.elseif \base == %rbp
+		.set sp_reg, ORC_REG_BP
+	.elseif \base == %rdi
+		.set sp_reg, ORC_REG_DI
+	.elseif \base == %rdx
+		.set sp_reg, ORC_REG_DX
+	.elseif \base == %r10
+		.set sp_reg, ORC_REG_R10
+	.else
+		.error "UNWIND_HINT_REGS: bad base register"
+	.endif
+
+	.set sp_offset, \offset
+
+	.if \iret
+		.set type, ORC_TYPE_REGS_IRET
+	.elseif \extra == 0
+		.set type, ORC_TYPE_REGS_IRET
+		.set sp_offset, \offset + (16*8)
+	.else
+		.set type, ORC_TYPE_REGS
+	.endif
+
+	UNWIND_HINT sp_reg=sp_reg sp_offset=sp_offset type=type
+.endm
+
+.macro UNWIND_HINT_IRET_REGS base=%rsp offset=0
+	UNWIND_HINT_REGS base=\base offset=\offset iret=1
+.endm
+
+.macro UNWIND_HINT_FUNC sp_offset=8
+	UNWIND_HINT sp_offset=\sp_offset
+.endm
+
+#else /* !__ASSEMBLY__ */
+
+#define UNWIND_HINT(sp_reg, sp_offset, type)			\
+	"987: \n\t"						\
+	".pushsection .discard.unwind_hints\n\t"		\
+	/* struct unwind_hint */				\
+	".long 987b - .\n\t"					\
+	".short " __stringify(sp_offset) "\n\t"		\
+	".byte " __stringify(sp_reg) "\n\t"			\
+	".byte " __stringify(type) "\n\t"			\
+	".popsection\n\t"
+
+#define UNWIND_HINT_SAVE UNWIND_HINT(0, 0, UNWIND_HINT_TYPE_SAVE)
+
+#define UNWIND_HINT_RESTORE UNWIND_HINT(0, 0, UNWIND_HINT_TYPE_RESTORE)
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_UNWIND_HINTS_H */
diff --git a/tools/objtool/Makefile b/tools/objtool/Makefile
index 0e2765e..3a6425f 100644
--- a/tools/objtool/Makefile
+++ b/tools/objtool/Makefile
@@ -52,6 +52,9 @@ $(OBJTOOL): $(LIBSUBCMD) $(OBJTOOL_IN)
 	diff -I'^#include' arch/x86/insn/inat.h ../../arch/x86/include/asm/inat.h >/dev/null && \
 	diff -I'^#include' arch/x86/insn/inat_types.h ../../arch/x86/include/asm/inat_types.h >/dev/null) \
 	|| echo "warning: objtool: x86 instruction decoder differs from kernel" >&2 )) || true
+	@(test -d ../../kernel -a -d ../../tools -a -d ../objtool && (( \
+	diff ../../arch/x86/include/asm/orc_types.h orc_types.h >/dev/null) \
+	|| echo "warning: objtool: orc_types.h differs from kernel" >&2 )) || true
 	$(QUIET_LINK)$(CC) $(OBJTOOL_IN) $(LDFLAGS) -o $@
 
 
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index cb57c52..368275d 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -100,7 +100,6 @@ static bool gcov_enabled(struct objtool_file *file)
 static bool ignore_func(struct objtool_file *file, struct symbol *func)
 {
 	struct rela *rela;
-	struct instruction *insn;
 
 	/* check for STACK_FRAME_NON_STANDARD */
 	if (file->whitelist && file->whitelist->rela)
@@ -113,11 +112,6 @@ static bool ignore_func(struct objtool_file *file, struct symbol *func)
 				return true;
 		}
 
-	/* check if it has a context switching instruction */
-	func_for_each_insn(file, func, insn)
-		if (insn->type == INSN_CONTEXT_SWITCH)
-			return true;
-
 	return false;
 }
 
@@ -879,6 +873,99 @@ static int add_switch_table_alts(struct objtool_file *file)
 	return 0;
 }
 
+static int read_unwind_hints(struct objtool_file *file)
+{
+	struct section *sec, *relasec;
+	struct rela *rela;
+	struct unwind_hint *hint;
+	struct instruction *insn;
+	struct cfi_reg *cfa;
+	int i;
+
+	sec = find_section_by_name(file->elf, ".discard.unwind_hints");
+	if (!sec)
+		return 0;
+
+	relasec = sec->rela;
+	if (!relasec) {
+		WARN("missing .rela.discard.unwind_hints section");
+		return -1;
+	}
+
+	if (sec->len % sizeof(struct unwind_hint)) {
+		WARN("struct unwind_hint size mismatch");
+		return -1;
+	}
+
+	file->hints = true;
+
+	for (i = 0; i < sec->len / sizeof(struct unwind_hint); i++) {
+		hint = (struct unwind_hint *)sec->data->d_buf + i;
+
+		rela = find_rela_by_dest(sec, i * sizeof(*hint));
+		if (!rela) {
+			WARN("can't find rela for unwind_hints[%d]", i);
+			return -1;
+		}
+
+		insn = find_insn(file, rela->sym->sec, rela->addend);
+		if (!insn) {
+			WARN("can't find insn for unwind_hints[%d]", i);
+			return -1;
+		}
+
+		cfa = &insn->state.cfa;
+
+		if (hint->type == UNWIND_HINT_TYPE_SAVE) {
+			insn->save = true;
+			continue;
+
+		} else if (hint->type == UNWIND_HINT_TYPE_RESTORE) {
+			insn->restore = true;
+			insn->hint = true;
+			continue;
+		}
+
+		insn->hint = true;
+
+		switch (hint->sp_reg) {
+		case ORC_REG_UNDEFINED:
+			cfa->base = CFI_UNDEFINED;
+			break;
+		case ORC_REG_SP:
+			cfa->base = CFI_SP;
+			break;
+		case ORC_REG_BP:
+			cfa->base = CFI_BP;
+			break;
+		case ORC_REG_SP_INDIRECT:
+			cfa->base = CFI_SP_INDIRECT;
+			break;
+		case ORC_REG_R10:
+			cfa->base = CFI_R10;
+			break;
+		case ORC_REG_R13:
+			cfa->base = CFI_R13;
+			break;
+		case ORC_REG_DI:
+			cfa->base = CFI_DI;
+			break;
+		case ORC_REG_DX:
+			cfa->base = CFI_DX;
+			break;
+		default:
+			WARN_FUNC("unsupported unwind_hint sp base reg %d",
+				  insn->sec, insn->offset, hint->sp_reg);
+			return -1;
+		}
+
+		cfa->offset = hint->sp_offset;
+		insn->state.type = hint->type;
+	}
+
+	return 0;
+}
+
 static int decode_sections(struct objtool_file *file)
 {
 	int ret;
@@ -909,6 +996,10 @@ static int decode_sections(struct objtool_file *file)
 	if (ret)
 		return ret;
 
+	ret = read_unwind_hints(file);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 
@@ -1382,7 +1473,7 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 			   struct insn_state state)
 {
 	struct alternative *alt;
-	struct instruction *insn;
+	struct instruction *insn, *next_insn;
 	struct section *sec;
 	struct symbol *func = NULL;
 	int ret;
@@ -1397,6 +1488,8 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 	}
 
 	while (1) {
+		next_insn = next_insn_same_sec(file, insn);
+
 		if (file->c_file && insn->func) {
 			if (func && func != insn->func) {
 				WARN("%s() falls through to next function %s()",
@@ -1414,13 +1507,54 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 		}
 
 		if (insn->visited) {
-			if (!insn_state_match(insn, &state))
+			if (!insn->hint && !insn_state_match(insn, &state))
 				return 1;
 
 			return 0;
 		}
 
-		insn->state = state;
+		if (insn->hint) {
+			if (insn->restore) {
+				struct instruction *save_insn, *i;
+
+				i = insn;
+				save_insn = NULL;
+				func_for_each_insn_continue_reverse(file, func, i) {
+					if (i->save) {
+						save_insn = i;
+						break;
+					}
+				}
+
+				if (!save_insn) {
+					WARN_FUNC("no corresponding CFI save for CFI restore",
+						  sec, insn->offset);
+					return 1;
+				}
+
+				if (!save_insn->visited) {
+					/*
+					 * Oops, no state to copy yet.
+					 * Hopefully we can reach this
+					 * instruction from another branch
+					 * after the save insn has been
+					 * visited.
+					 */
+					if (insn == first)
+						return 0;
+
+					WARN_FUNC("objtool isn't smart enough to handle this CFI save/restore combo",
+						  sec, insn->offset);
+					return 1;
+				}
+
+				insn->state = save_insn->state;
+			}
+
+			state = insn->state;
+
+		} else
+			insn->state = state;
 
 		insn->visited = true;
 
@@ -1497,6 +1631,14 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 
 			return 0;
 
+		case INSN_CONTEXT_SWITCH:
+			if (func && (!next_insn || !next_insn->hint)) {
+				WARN_FUNC("unsupported instruction in callable function",
+					  sec, insn->offset);
+				return 1;
+			}
+			return 0;
+
 		case INSN_STACK:
 			if (update_insn_state(insn, &state))
 				return -1;
@@ -1510,7 +1652,7 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 		if (insn->dead_end)
 			return 0;
 
-		insn = next_insn_same_sec(file, insn);
+		insn = next_insn;
 		if (!insn) {
 			WARN("%s: unexpected end of section", sec->name);
 			return 1;
@@ -1520,6 +1662,27 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 	return 0;
 }
 
+static int validate_unwind_hints(struct objtool_file *file)
+{
+	struct instruction *insn;
+	int ret, warnings = 0;
+	struct insn_state state;
+
+	if (!file->hints)
+		return 0;
+
+	clear_insn_state(&state);
+
+	for_each_insn(file, insn) {
+		if (insn->hint && !insn->visited) {
+			ret = validate_branch(file, insn, state);
+			warnings += ret;
+		}
+	}
+
+	return warnings;
+}
+
 static bool is_kasan_insn(struct instruction *insn)
 {
 	return (insn->type == INSN_CALL &&
@@ -1665,8 +1828,9 @@ int check(const char *_objname, bool _nofp, bool orc)
 	hash_init(file.insn_hash);
 	file.whitelist = find_section_by_name(file.elf, ".discard.func_stack_frame_non_standard");
 	file.rodata = find_section_by_name(file.elf, ".rodata");
-	file.ignore_unreachables = false;
 	file.c_file = find_section_by_name(file.elf, ".comment");
+	file.ignore_unreachables = false;
+	file.hints = false;
 
 	arch_initial_func_cfi_state(&initial_func_cfi);
 
@@ -1683,6 +1847,11 @@ int check(const char *_objname, bool _nofp, bool orc)
 		goto out;
 	warnings += ret;
 
+	ret = validate_unwind_hints(&file);
+	if (ret < 0)
+		goto out;
+	warnings += ret;
+
 	if (!warnings) {
 		ret = validate_reachable_instructions(&file);
 		if (ret < 0)
diff --git a/tools/objtool/check.h b/tools/objtool/check.h
index 046874b..ac3d4b1 100644
--- a/tools/objtool/check.h
+++ b/tools/objtool/check.h
@@ -43,7 +43,7 @@ struct instruction {
 	unsigned int len;
 	unsigned char type;
 	unsigned long immediate;
-	bool alt_group, visited, dead_end, ignore;
+	bool alt_group, visited, dead_end, ignore, hint, save, restore;
 	struct symbol *call_dest;
 	struct instruction *jump_dest;
 	struct list_head alts;
@@ -58,7 +58,7 @@ struct objtool_file {
 	struct list_head insn_list;
 	DECLARE_HASHTABLE(insn_hash, 16);
 	struct section *rodata, *whitelist;
-	bool ignore_unreachables, c_file;
+	bool ignore_unreachables, c_file, hints;
 };
 
 int check(const char *objname, bool nofp, bool orc);
diff --git a/tools/objtool/orc_types.h b/tools/objtool/orc_types.h
index fc5cf6c..9c9dc57 100644
--- a/tools/objtool/orc_types.h
+++ b/tools/objtool/orc_types.h
@@ -61,11 +61,19 @@
  *
  * ORC_TYPE_REGS_IRET: Used in entry code to indicate that sp_reg+sp_offset
  * points to the iret return frame.
+ *
+ * The UNWIND_HINT macros are used only for the unwind_hint struct.  They
+ * aren't used in struct orc_entry due to size and complexity constraints.
+ * Objtool converts them to real types when it converts the hints to orc
+ * entries.
  */
 #define ORC_TYPE_CALL			0
 #define ORC_TYPE_REGS			1
 #define ORC_TYPE_REGS_IRET		2
+#define UNWIND_HINT_TYPE_SAVE		3
+#define UNWIND_HINT_TYPE_RESTORE	4
 
+#ifndef __ASSEMBLY__
 /*
  * This struct is more or less a vastly simplified version of the DWARF Call
  * Frame Information standard.  It contains only the necessary parts of DWARF
@@ -82,4 +90,18 @@ struct orc_entry {
 	unsigned	type:2;
 } __packed;
 
+/*
+ * This struct is used by asm and inline asm code to manually annotate the
+ * location of registers on the stack for the ORC unwinder.
+ *
+ * Type can be either ORC_TYPE_* or UNWIND_HINT_TYPE_*.
+ */
+struct unwind_hint {
+	u32		ip;
+	s16		sp_offset;
+	u8		sp_reg;
+	u8		type;
+};
+#endif /* __ASSEMBLY__ */
+
 #endif /* _ORC_TYPES_H */
-- 
2.7.5


* [PATCH v3 07/10] x86/entry/64: add unwind hint annotations
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
                   ` (5 preceding siblings ...)
  2017-07-11 15:33 ` [PATCH v3 06/10] objtool, x86: add facility for asm code to provide unwind hints Josh Poimboeuf
@ 2017-07-11 15:33 ` Josh Poimboeuf
  2017-07-18 10:43   ` [tip:x86/asm] x86/entry/64: Add " tip-bot for Josh Poimboeuf
  2017-07-11 15:33 ` [PATCH v3 08/10] x86/asm: add unwind hint annotations to sync_core() Josh Poimboeuf
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-11 15:33 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

Add unwind hint annotations to entry_64.S.  This will enable the ORC
unwinder to unwind through any location in the entry code including
syscalls, interrupts, and exceptions.
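
To illustrate the pattern (a simplified sketch of the irq_entries_start
hunk below): each interrupt stub begins while only the hardware iret
frame is on the stack, so it's annotated accordingly:

	UNWIND_HINT_IRET_REGS			/* RIP, CS, FLAGS, RSP, SS */
	pushq	$(~vector+0x80)			/* pt_regs->orig_ax */
	jmp	common_interrupt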

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/entry/Makefile   |  1 -
 arch/x86/entry/calling.h  |  5 ++++
 arch/x86/entry/entry_64.S | 71 ++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 66 insertions(+), 11 deletions(-)

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 9976fce..af28a8a 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -2,7 +2,6 @@
 # Makefile for the x86 low level entry code
 #
 
-OBJECT_FILES_NON_STANDARD_entry_$(BITS).o   := y
 OBJECT_FILES_NON_STANDARD_entry_64_compat.o := y
 
 CFLAGS_syscall_64.o		+= $(call cc-option,-Wno-override-init,)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 05ed3d3..640aafe 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -1,4 +1,5 @@
 #include <linux/jump_label.h>
+#include <asm/unwind_hints.h>
 
 /*
 
@@ -112,6 +113,7 @@ For 32-bit we have the following conventions - kernel is built with
 	movq %rdx, 12*8+\offset(%rsp)
 	movq %rsi, 13*8+\offset(%rsp)
 	movq %rdi, 14*8+\offset(%rsp)
+	UNWIND_HINT_REGS offset=\offset extra=0
 	.endm
 	.macro SAVE_C_REGS offset=0
 	SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
@@ -136,6 +138,7 @@ For 32-bit we have the following conventions - kernel is built with
 	movq %r12, 3*8+\offset(%rsp)
 	movq %rbp, 4*8+\offset(%rsp)
 	movq %rbx, 5*8+\offset(%rsp)
+	UNWIND_HINT_REGS offset=\offset
 	.endm
 
 	.macro RESTORE_EXTRA_REGS offset=0
@@ -145,6 +148,7 @@ For 32-bit we have the following conventions - kernel is built with
 	movq 3*8+\offset(%rsp), %r12
 	movq 4*8+\offset(%rsp), %rbp
 	movq 5*8+\offset(%rsp), %rbx
+	UNWIND_HINT_REGS offset=\offset extra=0
 	.endm
 
 	.macro RESTORE_C_REGS_HELPER rstor_rax=1, rstor_rcx=1, rstor_r11=1, rstor_r8910=1, rstor_rdx=1
@@ -167,6 +171,7 @@ For 32-bit we have the following conventions - kernel is built with
 	.endif
 	movq 13*8(%rsp), %rsi
 	movq 14*8(%rsp), %rdi
+	UNWIND_HINT_IRET_REGS offset=16*8
 	.endm
 	.macro RESTORE_C_REGS
 	RESTORE_C_REGS_HELPER 1,1,1,1,1
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index b56f7f2..aa58155 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -36,6 +36,7 @@
 #include <asm/smap.h>
 #include <asm/pgtable_types.h>
 #include <asm/export.h>
+#include <asm/frame.h>
 #include <linux/err.h>
 
 .code64
@@ -43,9 +44,10 @@
 
 #ifdef CONFIG_PARAVIRT
 ENTRY(native_usergs_sysret64)
+	UNWIND_HINT_EMPTY
 	swapgs
 	sysretq
-ENDPROC(native_usergs_sysret64)
+END(native_usergs_sysret64)
 #endif /* CONFIG_PARAVIRT */
 
 .macro TRACE_IRQS_IRETQ
@@ -134,6 +136,7 @@ ENDPROC(native_usergs_sysret64)
  */
 
 ENTRY(entry_SYSCALL_64)
+	UNWIND_HINT_EMPTY
 	/*
 	 * Interrupts are off on entry.
 	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
@@ -169,6 +172,7 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
 	pushq	%r10				/* pt_regs->r10 */
 	pushq	%r11				/* pt_regs->r11 */
 	sub	$(6*8), %rsp			/* pt_regs->bp, bx, r12-15 not saved */
+	UNWIND_HINT_REGS extra=0
 
 	/*
 	 * If we need to do entry work or if we guess we'll need to do
@@ -223,6 +227,7 @@ entry_SYSCALL_64_fastpath:
 	movq	EFLAGS(%rsp), %r11
 	RESTORE_C_REGS_EXCEPT_RCX_R11
 	movq	RSP(%rsp), %rsp
+	UNWIND_HINT_EMPTY
 	USERGS_SYSRET64
 
 1:
@@ -316,6 +321,7 @@ syscall_return_via_sysret:
 	/* rcx and r11 are already restored (see code above) */
 	RESTORE_C_REGS_EXCEPT_RCX_R11
 	movq	RSP(%rsp), %rsp
+	UNWIND_HINT_EMPTY
 	USERGS_SYSRET64
 
 opportunistic_sysret_failed:
@@ -343,6 +349,7 @@ ENTRY(stub_ptregs_64)
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
 	popq	%rax
+	UNWIND_HINT_REGS extra=0
 	jmp	entry_SYSCALL64_slow_path
 
 1:
@@ -351,6 +358,7 @@ END(stub_ptregs_64)
 
 .macro ptregs_stub func
 ENTRY(ptregs_\func)
+	UNWIND_HINT_FUNC
 	leaq	\func(%rip), %rax
 	jmp	stub_ptregs_64
 END(ptregs_\func)
@@ -367,6 +375,7 @@ END(ptregs_\func)
  * %rsi: next task
  */
 ENTRY(__switch_to_asm)
+	UNWIND_HINT_FUNC
 	/*
 	 * Save callee-saved registers
 	 * This must match the order in inactive_task_frame
@@ -406,6 +415,7 @@ END(__switch_to_asm)
  * r12: kernel thread arg
  */
 ENTRY(ret_from_fork)
+	UNWIND_HINT_EMPTY
 	movq	%rax, %rdi
 	call	schedule_tail			/* rdi: 'prev' task parameter */
 
@@ -413,6 +423,7 @@ ENTRY(ret_from_fork)
 	jnz	1f				/* kernel threads are uncommon */
 
 2:
+	UNWIND_HINT_REGS
 	movq	%rsp, %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
@@ -440,10 +451,11 @@ END(ret_from_fork)
 ENTRY(irq_entries_start)
     vector=FIRST_EXTERNAL_VECTOR
     .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
+	UNWIND_HINT_IRET_REGS
 	pushq	$(~vector+0x80)			/* Note: always in signed byte range */
-    vector=vector+1
 	jmp	common_interrupt
 	.align	8
+	vector=vector+1
     .endr
 END(irq_entries_start)
 
@@ -465,9 +477,14 @@ END(irq_entries_start)
  *
  * The invariant is that, if irq_count != -1, then the IRQ stack is in use.
  */
-.macro ENTER_IRQ_STACK old_rsp
+.macro ENTER_IRQ_STACK regs=1 old_rsp
 	DEBUG_ENTRY_ASSERT_IRQS_OFF
 	movq	%rsp, \old_rsp
+
+	.if \regs
+	UNWIND_HINT_REGS base=\old_rsp
+	.endif
+
 	incl	PER_CPU_VAR(irq_count)
 	jnz	.Lirq_stack_push_old_rsp_\@
 
@@ -504,16 +521,24 @@ END(irq_entries_start)
 
 .Lirq_stack_push_old_rsp_\@:
 	pushq	\old_rsp
+
+	.if \regs
+	UNWIND_HINT_REGS indirect=1
+	.endif
 .endm
 
 /*
  * Undoes ENTER_IRQ_STACK.
  */
-.macro LEAVE_IRQ_STACK
+.macro LEAVE_IRQ_STACK regs=1
 	DEBUG_ENTRY_ASSERT_IRQS_OFF
 	/* We need to be off the IRQ stack before decrementing irq_count. */
 	popq	%rsp
 
+	.if \regs
+	UNWIND_HINT_REGS
+	.endif
+
 	/*
 	 * As in ENTER_IRQ_STACK, irq_count == 0, we are still claiming
 	 * the irq stack but we're not on it.
@@ -624,6 +649,7 @@ restore_c_regs_and_iret:
 	INTERRUPT_RETURN
 
 ENTRY(native_iret)
+	UNWIND_HINT_IRET_REGS
 	/*
 	 * Are we returning to a stack segment from the LDT?  Note: in
 	 * 64-bit mode SS:RSP on the exception stack is always valid.
@@ -696,6 +722,7 @@ native_irq_return_ldt:
 	orq	PER_CPU_VAR(espfix_stack), %rax
 	SWAPGS
 	movq	%rax, %rsp
+	UNWIND_HINT_IRET_REGS offset=8
 
 	/*
 	 * At this point, we cannot write to the stack any more, but we can
@@ -717,6 +744,7 @@ END(common_interrupt)
  */
 .macro apicinterrupt3 num sym do_sym
 ENTRY(\sym)
+	UNWIND_HINT_IRET_REGS
 	ASM_CLAC
 	pushq	$~(\num)
 .Lcommon_\sym:
@@ -802,6 +830,8 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
 
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
+	UNWIND_HINT_IRET_REGS offset=8
+
 	/* Sanity check */
 	.if \shift_ist != -1 && \paranoid == 0
 	.error "using shift_ist requires paranoid=1"
@@ -825,6 +855,7 @@ ENTRY(\sym)
 	.else
 	call	error_entry
 	.endif
+	UNWIND_HINT_REGS
 	/* returned flag: ebx=0: need swapgs on exit, ebx=1: don't need it */
 
 	.if \paranoid
@@ -922,6 +953,7 @@ idtentry simd_coprocessor_error		do_simd_coprocessor_error	has_error_code=0
 	 * edi:  new selector
 	 */
 ENTRY(native_load_gs_index)
+	FRAME_BEGIN
 	pushfq
 	DISABLE_INTERRUPTS(CLBR_ANY & ~CLBR_RDI)
 	SWAPGS
@@ -930,8 +962,9 @@ ENTRY(native_load_gs_index)
 2:	ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
 	SWAPGS
 	popfq
+	FRAME_END
 	ret
-END(native_load_gs_index)
+ENDPROC(native_load_gs_index)
 EXPORT_SYMBOL(native_load_gs_index)
 
 	_ASM_EXTABLE(.Lgs_change, bad_gs)
@@ -954,12 +987,12 @@ bad_gs:
 ENTRY(do_softirq_own_stack)
 	pushq	%rbp
 	mov	%rsp, %rbp
-	ENTER_IRQ_STACK old_rsp=%r11
+	ENTER_IRQ_STACK regs=0 old_rsp=%r11
 	call	__do_softirq
-	LEAVE_IRQ_STACK
+	LEAVE_IRQ_STACK regs=0
 	leaveq
 	ret
-END(do_softirq_own_stack)
+ENDPROC(do_softirq_own_stack)
 
 #ifdef CONFIG_XEN
 idtentry xen_hypervisor_callback xen_do_hypervisor_callback has_error_code=0
@@ -983,7 +1016,9 @@ ENTRY(xen_do_hypervisor_callback)		/* do_hypervisor_callback(struct *pt_regs) */
  * Since we don't modify %rdi, evtchn_do_upall(struct *pt_regs) will
  * see the correct pointer to the pt_regs
  */
+	UNWIND_HINT_FUNC
 	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
+	UNWIND_HINT_REGS
 
 	ENTER_IRQ_STACK old_rsp=%r10
 	call	xen_evtchn_do_upcall
@@ -1009,6 +1044,7 @@ END(xen_do_hypervisor_callback)
  * with its current contents: any discrepancy means we in category 1.
  */
 ENTRY(xen_failsafe_callback)
+	UNWIND_HINT_EMPTY
 	movl	%ds, %ecx
 	cmpw	%cx, 0x10(%rsp)
 	jne	1f
@@ -1028,11 +1064,13 @@ ENTRY(xen_failsafe_callback)
 	pushq	$0				/* RIP */
 	pushq	%r11
 	pushq	%rcx
+	UNWIND_HINT_IRET_REGS offset=8
 	jmp	general_protection
 1:	/* Segment mismatch => Category 1 (Bad segment). Retry the IRET. */
 	movq	(%rsp), %rcx
 	movq	8(%rsp), %r11
 	addq	$0x30, %rsp
+	UNWIND_HINT_IRET_REGS
 	pushq	$-1 /* orig_ax = -1 => not a system call */
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
@@ -1078,6 +1116,7 @@ idtentry machine_check					has_error_code=0	paranoid=1 do_sym=*machine_check_vec
  * Return: ebx=0: need swapgs on exit, ebx=1: otherwise
  */
 ENTRY(paranoid_entry)
+	UNWIND_HINT_FUNC
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
@@ -1105,6 +1144,7 @@ END(paranoid_entry)
  * On entry, ebx is "no swapgs" flag (1: don't need swapgs, 0: need it)
  */
 ENTRY(paranoid_exit)
+	UNWIND_HINT_REGS
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF_DEBUG
 	testl	%ebx, %ebx			/* swapgs needed? */
@@ -1126,6 +1166,7 @@ END(paranoid_exit)
  * Return: EBX=0: came from user mode; EBX=1: otherwise
  */
 ENTRY(error_entry)
+	UNWIND_HINT_FUNC
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
@@ -1210,6 +1251,7 @@ END(error_entry)
  *   0: user gsbase is loaded, we need SWAPGS and standard preparation for return to usermode
  */
 ENTRY(error_exit)
+	UNWIND_HINT_REGS
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
 	testl	%ebx, %ebx
@@ -1219,6 +1261,7 @@ END(error_exit)
 
 /* Runs on exception stack */
 ENTRY(nmi)
+	UNWIND_HINT_IRET_REGS
 	/*
 	 * Fix up the exception frame if we're on Xen.
 	 * PARAVIRT_ADJUST_EXCEPTION_FRAME is guaranteed to push at most
@@ -1290,11 +1333,13 @@ ENTRY(nmi)
 	cld
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	UNWIND_HINT_IRET_REGS base=%rdx offset=8
 	pushq	5*8(%rdx)	/* pt_regs->ss */
 	pushq	4*8(%rdx)	/* pt_regs->rsp */
 	pushq	3*8(%rdx)	/* pt_regs->flags */
 	pushq	2*8(%rdx)	/* pt_regs->cs */
 	pushq	1*8(%rdx)	/* pt_regs->rip */
+	UNWIND_HINT_IRET_REGS
 	pushq   $-1		/* pt_regs->orig_ax */
 	pushq   %rdi		/* pt_regs->di */
 	pushq   %rsi		/* pt_regs->si */
@@ -1311,6 +1356,7 @@ ENTRY(nmi)
 	pushq	%r13		/* pt_regs->r13 */
 	pushq	%r14		/* pt_regs->r14 */
 	pushq	%r15		/* pt_regs->r15 */
+	UNWIND_HINT_REGS
 	ENCODE_FRAME_POINTER
 
 	/*
@@ -1465,6 +1511,7 @@ first_nmi:
 	.rept 5
 	pushq	11*8(%rsp)
 	.endr
+	UNWIND_HINT_IRET_REGS
 
 	/* Everything up to here is safe from nested NMIs */
 
@@ -1480,6 +1527,7 @@ first_nmi:
 	pushq	$__KERNEL_CS	/* CS */
 	pushq	$1f		/* RIP */
 	INTERRUPT_RETURN	/* continues at repeat_nmi below */
+	UNWIND_HINT_IRET_REGS
 1:
 #endif
 
@@ -1529,6 +1577,7 @@ end_repeat_nmi:
 	 * exceptions might do.
 	 */
 	call	paranoid_entry
+	UNWIND_HINT_REGS
 
 	/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
 	movq	%rsp, %rdi
@@ -1566,17 +1615,19 @@ nmi_restore:
 END(nmi)
 
 ENTRY(ignore_sysret)
+	UNWIND_HINT_EMPTY
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
 
 ENTRY(rewind_stack_do_exit)
+	UNWIND_HINT_FUNC
 	/* Prevent any naive code from trying to unwind to our caller. */
 	xorl	%ebp, %ebp
 
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rax
-	leaq	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%rax), %rsp
+	leaq	-PTREGS_SIZE(%rax), %rsp
+	UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
 
 	call	do_exit
-1:	jmp 1b
 END(rewind_stack_do_exit)
-- 
2.7.5


* [PATCH v3 08/10] x86/asm: add unwind hint annotations to sync_core()
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
                   ` (6 preceding siblings ...)
  2017-07-11 15:33 ` [PATCH v3 07/10] x86/entry/64: add unwind hint annotations Josh Poimboeuf
@ 2017-07-11 15:33 ` Josh Poimboeuf
  2017-07-18 10:43   ` [tip:x86/asm] x86/asm: Add " tip-bot for Josh Poimboeuf
  2017-07-11 15:33 ` [PATCH v3 09/10] x86/unwind: add ORC unwinder Josh Poimboeuf
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-11 15:33 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

This enables objtool to grok the iret in the middle of a C function.
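
Roughly, UNWIND_HINT_SAVE snapshots objtool's view of the stack state,
and UNWIND_HINT_RESTORE declares that the state at that point matches
the snapshot, so objtool can skip whatever lies in between.  A minimal
sketch of the mechanism (not this patch's actual diff):

	asm volatile(UNWIND_HINT_SAVE
		     "pushq %%rsp\n\t"		/* stack games objtool... */
		     "popq %%rsp\n\t"		/* ...would otherwise flag */
		     UNWIND_HINT_RESTORE
		     ::: "memory");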

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/include/asm/processor.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 6a79547..b27dc9b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -22,6 +22,7 @@ struct vm86;
 #include <asm/nops.h>
 #include <asm/special_insns.h>
 #include <asm/fpu/types.h>
+#include <asm/unwind_hints.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -684,6 +685,7 @@ static inline void sync_core(void)
 	unsigned int tmp;
 
 	asm volatile (
+		UNWIND_HINT_SAVE
 		"mov %%ss, %0\n\t"
 		"pushq %q0\n\t"
 		"pushq %%rsp\n\t"
@@ -693,6 +695,7 @@ static inline void sync_core(void)
 		"pushq %q0\n\t"
 		"pushq $1f\n\t"
 		"iretq\n\t"
+		UNWIND_HINT_RESTORE
 		"1:"
 		: "=&r" (tmp), "+r" (__sp) : : "cc", "memory");
 #endif
-- 
2.7.5


* [PATCH v3 09/10] x86/unwind: add ORC unwinder
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
                   ` (7 preceding siblings ...)
  2017-07-11 15:33 ` [PATCH v3 08/10] x86/asm: add unwind hint annotations to sync_core() Josh Poimboeuf
@ 2017-07-11 15:33 ` Josh Poimboeuf
  2017-07-14 17:22   ` [PATCH v3.1 " Josh Poimboeuf
  2017-07-11 15:33 ` [PATCH v3 10/10] x86/kconfig: make it easier to switch to the new " Josh Poimboeuf
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-11 15:33 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

Add a new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER.  It
plugs into the existing x86 unwinder framework.

It relies on objtool to generate the needed .orc_unwind and
.orc_unwind_ip sections.

For more details on why ORC is used instead of DWARF, see
Documentation/x86/orc-unwinder.txt.

Thanks to Andy Lutomirski for the performance improvement ideas:
splitting the ORC unwind table into two parallel arrays and creating a
fast lookup table to search a subset of the unwind table.
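
The resulting lookup is roughly the following (a sketch; the function
and symbol names here are illustrative, the real code is in
unwind_orc.c):

	static struct orc_entry *orc_find_sketch(unsigned long ip)
	{
		unsigned int idx, start, stop;

		/* fast table narrows the search to one small slice */
		idx   = (ip - LOOKUP_START_IP) / LOOKUP_BLOCK_SIZE;
		start = orc_lookup[idx];
		stop  = orc_lookup[idx + 1] + 1;

		/* binary search that slice of .orc_unwind_ip for ip */
		return __orc_find(__start_orc_unwind_ip + start,
				  __start_orc_unwind + start,
				  stop - start, ip);
	}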

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 Documentation/x86/orc-unwinder.txt | 178 ++++++++++++
 arch/um/include/asm/unwind.h       |   8 +
 arch/x86/Kconfig                   |   1 +
 arch/x86/Kconfig.debug             |  25 ++
 arch/x86/include/asm/module.h      |   9 +
 arch/x86/include/asm/orc_lookup.h  |  46 +++
 arch/x86/include/asm/orc_types.h   |   2 +-
 arch/x86/include/asm/unwind.h      |  76 +++--
 arch/x86/kernel/Makefile           |   8 +-
 arch/x86/kernel/module.c           |  11 +-
 arch/x86/kernel/setup.c            |   3 +
 arch/x86/kernel/unwind_frame.c     |  39 ++-
 arch/x86/kernel/unwind_guess.c     |   5 +
 arch/x86/kernel/unwind_orc.c       | 576 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S      |   3 +
 include/asm-generic/vmlinux.lds.h  |  27 +-
 lib/Kconfig.debug                  |   3 +
 scripts/Makefile.build             |  14 +-
 18 files changed, 970 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/x86/orc-unwinder.txt
 create mode 100644 arch/um/include/asm/unwind.h
 create mode 100644 arch/x86/include/asm/orc_lookup.h
 create mode 100644 arch/x86/kernel/unwind_orc.c

diff --git a/Documentation/x86/orc-unwinder.txt b/Documentation/x86/orc-unwinder.txt
new file mode 100644
index 0000000..d9fadba
--- /dev/null
+++ b/Documentation/x86/orc-unwinder.txt
@@ -0,0 +1,178 @@
+ORC unwinder
+============
+
+Overview
+--------
+
+The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
+similar in concept to a DWARF unwinder.  The difference is that the
+format of the ORC data is much simpler than DWARF, which in turn allows
+the ORC unwinder to be much simpler and faster.
+
+The ORC data consists of unwind tables which are generated by objtool.
+They contain out-of-band data which is used by the in-kernel ORC
+unwinder.  Objtool generates the ORC data by first doing compile-time
+stack metadata validation (CONFIG_STACK_VALIDATION).  After analyzing
+all the code paths of a .o file, it determines information about the
+stack state at each instruction address in the file and outputs that
+information to the .orc_unwind and .orc_unwind_ip sections.
+
+The per-object ORC sections are combined at link time and are sorted and
+post-processed at boot time.  The unwinder uses the resulting data to
+correlate instruction addresses with their stack states at run time.
+
+
+ORC vs frame pointers
+---------------------
+
+With frame pointers enabled, GCC adds instrumentation code to every
+function in the kernel.  The kernel's .text size increases by about
+3.2%, resulting in a broad kernel-wide slowdown.  Measurements by Mel
+Gorman [1] have shown a slowdown of 5-10% for some workloads.
+
+In contrast, the ORC unwinder has no effect on text size or runtime
+performance, because the debuginfo is out of band.  So if you disable
+frame pointers and enable the ORC unwinder, you get a nice performance
+improvement across the board, and still have reliable stack traces.
+
+Ingo Molnar says:
+
+  "Note that it's not just a performance improvement, but also an
+  instruction cache locality improvement: 3.2% .text savings almost
+  directly transform into a similarly sized reduction in cache
+  footprint. That can transform to even higher speedups for workloads
+  whose cache locality is borderline."
+
+Another benefit of ORC compared to frame pointers is that it can
+reliably unwind across interrupts and exceptions.  Frame pointer based
+unwinds can sometimes skip the caller of the interrupted function, if it
+was a leaf function or if the interrupt hit before the frame pointer was
+saved.
+
+The main disadvantage of the ORC unwinder compared to frame pointers is
+that it needs more memory to store the ORC unwind tables: roughly 2-4MB
+depending on the kernel config.
+
+
+ORC vs DWARF
+------------
+
+ORC debuginfo's advantage over DWARF itself is that it's much simpler.
+It gets rid of the complex DWARF CFI state machine and also gets rid of
+the tracking of unnecessary registers.  This allows the unwinder to be
+much simpler, meaning fewer bugs, which is especially important for
+mission critical oops code.
+
+The simpler debuginfo format also enables the unwinder to be much faster
+than DWARF, which is important for perf and lockdep.  In a basic
+performance test by Jiri Slaby [2], the ORC unwinder was about 20x
+faster than an out-of-tree DWARF unwinder.  (Note: That measurement was
+taken before some performance tweaks were added, which doubled
+performance, so the speedup over DWARF may be closer to 40x.)
+
+The ORC data format does have a few downsides compared to DWARF.  The
+ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
+
+Another potential downside is that, as GCC evolves, it's conceivable
+that the ORC data may end up being *too* simple to describe the state of
+the stack for certain optimizations.  But IMO this is unlikely because
+GCC saves the frame pointer for any unusual stack adjustments it does,
+so I suspect we'll really only ever need to keep track of the stack
+pointer and the frame pointer between call frames.  But even if we do
+end up having to track all the registers DWARF tracks, at least we will
+still be able to control the format, e.g. no complex state machines.
+
+
+ORC unwind table generation
+---------------------------
+
+The ORC data is generated by objtool.  With the existing compile-time
+stack metadata validation feature, objtool already follows all code
+paths, and so it already has all the information it needs to be able to
+generate ORC data from scratch.  So it's an easy step to go from stack
+validation to ORC data generation.
+
+It should be possible to instead generate the ORC data with a simple
+tool which converts DWARF to ORC data.  However, such a solution would
+be incomplete due to the kernel's extensive use of asm, inline asm, and
+special sections like exception tables.
+
+That could be rectified by manually annotating those special code paths
+using GNU assembler .cfi annotations in .S files, and homegrown
+annotations for inline asm in .c files.  But asm annotations were tried
+in the past and were found to be unmaintainable.  They were often
+incorrect/incomplete and made the code harder to read and keep updated.
+And based on looking at glibc code, annotating inline asm in .c files
+might be even worse.
+
+Objtool still needs a few annotations, but only in code which does
+unusual things to the stack like entry code.  And even then, far fewer
+annotations are needed than what DWARF would need, so they're much more
+maintainable than DWARF CFI annotations.
+
+So the advantages of using objtool to generate ORC data are that it
+gives more accurate debuginfo, with very few annotations.  It also
+insulates the kernel from toolchain bugs which can be very painful to
+deal with in the kernel since we often have to work around issues in
+older versions of the toolchain for years.
+
+The downside is that the unwinder now becomes dependent on objtool's
+ability to reverse engineer GCC code flow.  If GCC optimizations become
+too complicated for objtool to follow, the ORC data generation might
+stop working or become incomplete.  (It's worth noting that livepatch
+already has such a dependency on objtool's ability to follow GCC code
+flow.)
+
+If newer versions of GCC come up with some optimizations which break
+objtool, we may need to revisit the current implementation.  Some
+possible solutions would be asking GCC to make the optimizations more
+palatable, or having objtool use DWARF as an additional input, or
+creating a GCC plugin to assist objtool with its analysis.  But for now,
+objtool follows GCC code quite well.
+
+
+Unwinder implementation details
+-------------------------------
+
+Objtool generates the ORC data by integrating with the compile-time
+stack metadata validation feature, which is described in detail in
+tools/objtool/Documentation/stack-validation.txt.  After analyzing all
+the code paths of a .o file, it creates an array of orc_entry structs,
+and a parallel array of instruction addresses associated with those
+structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
+respectively.
+
+The ORC data is split into the two arrays for performance reasons, to
+make the searchable part of the data (.orc_unwind_ip) more compact.  The
+arrays are sorted in parallel at boot time.
+
+Performance is further improved by the use of a fast lookup table which
+is created at runtime.  The fast lookup table associates a given address
+with a range of indices for the .orc_unwind table, so that only a small
+subset of the table needs to be searched.
+
+
+Etymology
+---------
+
+Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
+enemies.  Similarly, the ORC unwinder was created in opposition to the
+complexity and slowness of DWARF.
+
+"Although Orcs rarely consider multiple solutions to a problem, they do
+excel at getting things done because they are creatures of action, not
+thought." [3]  Similarly, unlike the esoteric DWARF unwinder, the
+veracious ORC unwinder wastes no time or siloconic effort decoding
+variable-length zero-extended unsigned-integer byte-coded
+state-machine-based debug information entries.
+
+Similar to how Orcs frequently unravel the well-intentioned plans of
+their adversaries, the ORC unwinder frequently unravels stacks with
+brutal, unyielding efficiency.
+
+ORC stands for Oops Rewind Capability.
+
+
+[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
+[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
+[3] http://dustin.wikidot.com/half-orcs-and-orcs
diff --git a/arch/um/include/asm/unwind.h b/arch/um/include/asm/unwind.h
new file mode 100644
index 0000000..7ffa543
--- /dev/null
+++ b/arch/um/include/asm/unwind.h
@@ -0,0 +1,8 @@
+#ifndef _ASM_UML_UNWIND_H
+#define _ASM_UML_UNWIND_H
+
+static inline void
+unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+		   void *orc, size_t orc_size) {}
+
+#endif /* _ASM_UML_UNWIND_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e767ed2..0dac5a0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -153,6 +153,7 @@ config X86
 	select HAVE_MEMBLOCK
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_MIXED_BREAKPOINTS_REGS
+	select HAVE_MOD_ARCH_SPECIFIC
 	select HAVE_NMI
 	select HAVE_OPROFILE
 	select HAVE_OPTPROBES
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 353ed09..dc10ec6 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -355,4 +355,29 @@ config PUNIT_ATOM_DEBUG
 	  The current power state can be read from
 	  /sys/kernel/debug/punit_atom/dev_power_state
 
+config ORC_UNWINDER
+	bool "ORC unwinder"
+	depends on X86_64
+	select STACK_VALIDATION
+	---help---
+	  This option enables the ORC (Oops Rewind Capability) unwinder for
+	  unwinding kernel stack traces.  It uses a custom data format which is
+	  a simplified version of the DWARF Call Frame Information standard.
+
+	  This unwinder is more accurate across interrupt entry frames than the
+	  frame pointer unwinder.  It can also enable a 5-10% performance
+	  improvement across the entire kernel if CONFIG_FRAME_POINTER is
+	  disabled.
+
+	  Enabling this option will increase the kernel's runtime memory usage
+	  by roughly 2-4MB, depending on your kernel config.
+
+config FRAME_POINTER_UNWINDER
+	def_bool y
+	depends on !ORC_UNWINDER && FRAME_POINTER
+
+config GUESS_UNWINDER
+	def_bool y
+	depends on !ORC_UNWINDER && !FRAME_POINTER
+
 endmenu
diff --git a/arch/x86/include/asm/module.h b/arch/x86/include/asm/module.h
index e3b7819..9eb7c71 100644
--- a/arch/x86/include/asm/module.h
+++ b/arch/x86/include/asm/module.h
@@ -2,6 +2,15 @@
 #define _ASM_X86_MODULE_H
 
 #include <asm-generic/module.h>
+#include <asm/orc_types.h>
+
+struct mod_arch_specific {
+#ifdef CONFIG_ORC_UNWINDER
+	unsigned int num_orcs;
+	int *orc_unwind_ip;
+	struct orc_entry *orc_unwind;
+#endif
+};
 
 #ifdef CONFIG_X86_64
 /* X86_64 does not define MODULE_PROC_FAMILY */
diff --git a/arch/x86/include/asm/orc_lookup.h b/arch/x86/include/asm/orc_lookup.h
new file mode 100644
index 0000000..91c8d86
--- /dev/null
+++ b/arch/x86/include/asm/orc_lookup.h
@@ -0,0 +1,46 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef _ORC_LOOKUP_H
+#define _ORC_LOOKUP_H
+
+/*
+ * This is a lookup table for speeding up access to the .orc_unwind table.
+ * Given an input address offset, the corresponding lookup table entry
+ * specifies a subset of the .orc_unwind table to search.
+ *
+ * Each block represents the end of the previous range and the start of the
+ * next range.  An extra block is added to give the last range an end.
+ *
+ * The block size should be a power of 2 to avoid a costly 'div' instruction.
+ *
+ * A block size of 256 was chosen because it roughly doubles unwinder
+ * performance while only adding ~5% to the ORC data footprint.
+ */
+#define LOOKUP_BLOCK_ORDER	8
+#define LOOKUP_BLOCK_SIZE	(1 << LOOKUP_BLOCK_ORDER)
+
+#ifndef LINKER_SCRIPT
+
+extern unsigned int orc_lookup[];
+extern unsigned int orc_lookup_end[];
+
+#define LOOKUP_START_IP		(unsigned long)_stext
+#define LOOKUP_STOP_IP		(unsigned long)_etext
+
+#endif /* LINKER_SCRIPT */
+
+#endif /* _ORC_LOOKUP_H */
diff --git a/arch/x86/include/asm/orc_types.h b/arch/x86/include/asm/orc_types.h
index 7dc777a..9c9dc57 100644
--- a/arch/x86/include/asm/orc_types.h
+++ b/arch/x86/include/asm/orc_types.h
@@ -88,7 +88,7 @@ struct orc_entry {
 	unsigned	sp_reg:4;
 	unsigned	bp_reg:4;
 	unsigned	type:2;
-};
+} __packed;
 
 /*
  * This struct is used by asm and inline asm code to manually annotate the
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index e667649..25b8d31a 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -12,11 +12,14 @@ struct unwind_state {
 	struct task_struct *task;
 	int graph_idx;
 	bool error;
-#ifdef CONFIG_FRAME_POINTER
+#if defined(CONFIG_ORC_UNWINDER)
+	bool signal, full_regs;
+	unsigned long sp, bp, ip;
+	struct pt_regs *regs;
+#elif defined(CONFIG_FRAME_POINTER)
 	bool got_irq;
-	unsigned long *bp, *orig_sp;
+	unsigned long *bp, *orig_sp, ip;
 	struct pt_regs *regs;
-	unsigned long ip;
 #else
 	unsigned long *sp;
 #endif
@@ -24,41 +27,30 @@ struct unwind_state {
 
 void __unwind_start(struct unwind_state *state, struct task_struct *task,
 		    struct pt_regs *regs, unsigned long *first_frame);
-
 bool unwind_next_frame(struct unwind_state *state);
-
 unsigned long unwind_get_return_address(struct unwind_state *state);
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state);
 
 static inline bool unwind_done(struct unwind_state *state)
 {
 	return state->stack_info.type == STACK_TYPE_UNKNOWN;
 }
 
-static inline
-void unwind_start(struct unwind_state *state, struct task_struct *task,
-		  struct pt_regs *regs, unsigned long *first_frame)
-{
-	first_frame = first_frame ? : get_stack_pointer(task, regs);
-
-	__unwind_start(state, task, regs, first_frame);
-}
-
 static inline bool unwind_error(struct unwind_state *state)
 {
 	return state->error;
 }
 
-#ifdef CONFIG_FRAME_POINTER
-
 static inline
-unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+void unwind_start(struct unwind_state *state, struct task_struct *task,
+		  struct pt_regs *regs, unsigned long *first_frame)
 {
-	if (unwind_done(state))
-		return NULL;
+	first_frame = first_frame ? : get_stack_pointer(task, regs);
 
-	return state->regs ? &state->regs->ip : state->bp + 1;
+	__unwind_start(state, task, regs, first_frame);
 }
 
+#if defined(CONFIG_ORC_UNWINDER) || defined(CONFIG_FRAME_POINTER)
 static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	if (unwind_done(state))
@@ -66,20 +58,46 @@ static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 
 	return state->regs;
 }
-
-#else /* !CONFIG_FRAME_POINTER */
-
-static inline
-unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+#else
+static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	return NULL;
 }
+#endif
 
-static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
+#ifdef CONFIG_ORC_UNWINDER
+void unwind_init(void);
+void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+			void *orc, size_t orc_size);
+#else
+static inline void unwind_init(void) {}
+static inline
+void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+			void *orc, size_t orc_size) {}
+#endif
+
+/*
+ * This disables KASAN checking when reading a value from another task's stack,
+ * since the other task could be running on another CPU and could have poisoned
+ * the stack in the meantime.
+ */
+#define READ_ONCE_TASK_STACK(task, x)			\
+({							\
+	unsigned long val;				\
+	if (task == current)				\
+		val = READ_ONCE(x);			\
+	else						\
+		val = READ_ONCE_NOCHECK(x);		\
+	val;						\
+})
+
+static inline bool task_on_another_cpu(struct task_struct *task)
 {
-	return NULL;
+#ifdef CONFIG_SMP
+	return task != current && task->on_cpu;
+#else
+	return false;
+#endif
 }
 
-#endif /* CONFIG_FRAME_POINTER */
-
 #endif /* _ASM_X86_UNWIND_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a01892b..287eac7 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -126,11 +126,9 @@ obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
 obj-$(CONFIG_TRACING)			+= tracepoint.o
 obj-$(CONFIG_SCHED_MC_PRIO)		+= itmt.o
 
-ifdef CONFIG_FRAME_POINTER
-obj-y					+= unwind_frame.o
-else
-obj-y					+= unwind_guess.o
-endif
+obj-$(CONFIG_ORC_UNWINDER)		+= unwind_orc.o
+obj-$(CONFIG_FRAME_POINTER_UNWINDER)	+= unwind_frame.o
+obj-$(CONFIG_GUESS_UNWINDER)		+= unwind_guess.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index f67bd32..62e7d70 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -35,6 +35,7 @@
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/setup.h>
+#include <asm/unwind.h>
 
 #if 0
 #define DEBUGP(fmt, ...)				\
@@ -213,7 +214,7 @@ int module_finalize(const Elf_Ehdr *hdr,
 		    struct module *me)
 {
 	const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
-		*para = NULL;
+		*para = NULL, *orc = NULL, *orc_ip = NULL;
 	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
 
 	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
@@ -225,6 +226,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 			locks = s;
 		if (!strcmp(".parainstructions", secstrings + s->sh_name))
 			para = s;
+		if (!strcmp(".orc_unwind", secstrings + s->sh_name))
+			orc = s;
+		if (!strcmp(".orc_unwind_ip", secstrings + s->sh_name))
+			orc_ip = s;
 	}
 
 	if (alt) {
@@ -248,6 +253,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 	/* make jump label nops */
 	jump_label_apply_nops(me);
 
+	if (orc && orc_ip)
+		unwind_module_init(me, (void *)orc_ip->sh_addr, orc_ip->sh_size,
+				   (void *)orc->sh_addr, orc->sh_size);
+
 	return 0;
 }
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3486d04..ecab322 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -115,6 +115,7 @@
 #include <asm/microcode.h>
 #include <asm/mmu_context.h>
 #include <asm/kaslr.h>
+#include <asm/unwind.h>
 
 /*
  * max_low_pfn_mapped: highest direct mapped pfn under 4GB
@@ -1310,6 +1311,8 @@ void __init setup_arch(char **cmdline_p)
 	if (efi_enabled(EFI_BOOT))
 		efi_apply_memmap_quirks();
 #endif
+
+	unwind_init();
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index b9389d7..7574ef5 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -10,20 +10,22 @@
 
 #define FRAME_HEADER_SIZE (sizeof(long) * 2)
 
-/*
- * This disables KASAN checking when reading a value from another task's stack,
- * since the other task could be running on another CPU and could have poisoned
- * the stack in the meantime.
- */
-#define READ_ONCE_TASK_STACK(task, x)			\
-({							\
-	unsigned long val;				\
-	if (task == current)				\
-		val = READ_ONCE(x);			\
-	else						\
-		val = READ_ONCE_NOCHECK(x);		\
-	val;						\
-})
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return 0;
+
+	return __kernel_text_address(state->ip) ? state->ip : 0;
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return NULL;
+
+	return state->regs ? &state->regs->ip : state->bp + 1;
+}
 
 static void unwind_dump(struct unwind_state *state)
 {
@@ -66,15 +68,6 @@ static void unwind_dump(struct unwind_state *state)
 	}
 }
 
-unsigned long unwind_get_return_address(struct unwind_state *state)
-{
-	if (unwind_done(state))
-		return 0;
-
-	return __kernel_text_address(state->ip) ? state->ip : 0;
-}
-EXPORT_SYMBOL_GPL(unwind_get_return_address);
-
 static size_t regs_size(struct pt_regs *regs)
 {
 	/* x86_32 regs from kernel mode are two words shorter: */
diff --git a/arch/x86/kernel/unwind_guess.c b/arch/x86/kernel/unwind_guess.c
index 039f367..4f0e17b 100644
--- a/arch/x86/kernel/unwind_guess.c
+++ b/arch/x86/kernel/unwind_guess.c
@@ -19,6 +19,11 @@ unsigned long unwind_get_return_address(struct unwind_state *state)
 }
 EXPORT_SYMBOL_GPL(unwind_get_return_address);
 
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	return NULL;
+}
+
 bool unwind_next_frame(struct unwind_state *state)
 {
 	struct stack_info *info = &state->stack_info;
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
new file mode 100644
index 0000000..9a8ad84
--- /dev/null
+++ b/arch/x86/kernel/unwind_orc.c
@@ -0,0 +1,576 @@
+#include <linux/module.h>
+#include <linux/sort.h>
+#include <asm/ptrace.h>
+#include <asm/stacktrace.h>
+#include <asm/unwind.h>
+#include <asm/orc_types.h>
+#include <asm/orc_lookup.h>
+#include <asm/sections.h>
+
+#define orc_warn(fmt, ...) \
+	printk_deferred_once(KERN_WARNING pr_fmt("WARNING: " fmt), ##__VA_ARGS__)
+
+extern int __start_orc_unwind_ip[];
+extern int __stop_orc_unwind_ip[];
+extern struct orc_entry __start_orc_unwind[];
+extern struct orc_entry __stop_orc_unwind[];
+
+static DEFINE_MUTEX(sort_mutex);
+int *cur_orc_ip_table = __start_orc_unwind_ip;
+struct orc_entry *cur_orc_table = __start_orc_unwind;
+
+unsigned int lookup_num_blocks;
+bool orc_init;
+
+static inline unsigned long orc_ip(const int *ip)
+{
+	return (unsigned long)ip + *ip;
+}
+
+static struct orc_entry *__orc_find(int *ip_table, struct orc_entry *u_table,
+				    unsigned int num_entries, unsigned long ip)
+{
+	int *first = ip_table;
+	int *last = ip_table + num_entries - 1;
+	int *mid = first, *found = first;
+
+	if (!num_entries)
+		return NULL;
+
+	/*
+	 * Do a binary range search to find the rightmost duplicate of a given
+	 * starting address.  Some entries are section terminators which are
+	 * "weak" entries for ensuring there are no gaps.  They should be
+	 * ignored when they conflict with a real entry.
+	 */
+	while (first <= last) {
+		mid = first + ((last - first) / 2);
+
+		if (orc_ip(mid) <= ip) {
+			found = mid;
+			first = mid + 1;
+		} else
+			last = mid - 1;
+	}
+
+	return u_table + (found - ip_table);
+}
+
+#ifdef CONFIG_MODULES
+static struct orc_entry *orc_module_find(unsigned long ip)
+{
+	struct module *mod;
+
+	mod = __module_address(ip);
+	if (!mod || !mod->arch.orc_unwind || !mod->arch.orc_unwind_ip)
+		return NULL;
+	return __orc_find(mod->arch.orc_unwind_ip, mod->arch.orc_unwind,
+			  mod->arch.num_orcs, ip);
+}
+#else
+static struct orc_entry *orc_module_find(unsigned long ip)
+{
+	return NULL;
+}
+#endif
+
+static struct orc_entry *orc_find(unsigned long ip)
+{
+	if (!orc_init)
+		return NULL;
+
+	/* For non-init vmlinux addresses, use the fast lookup table: */
+	if (ip >= LOOKUP_START_IP && ip < LOOKUP_STOP_IP) {
+		unsigned int idx, start, stop;
+
+		idx = (ip - LOOKUP_START_IP) / LOOKUP_BLOCK_SIZE;
+
+		if (WARN_ON_ONCE(idx >= lookup_num_blocks-1))
+			return NULL;
+
+		start = orc_lookup[idx];
+		stop = orc_lookup[idx + 1] + 1;
+
+		if (WARN_ON_ONCE(__start_orc_unwind + start >= __stop_orc_unwind ||
+				 __start_orc_unwind + stop > __stop_orc_unwind))
+			return NULL;
+
+		return __orc_find(__start_orc_unwind_ip + start,
+				  __start_orc_unwind + start, stop - start, ip);
+	}
+
+	/* vmlinux .init slow lookup: */
+	if (ip >= (unsigned long)_sinittext && ip < (unsigned long)_einittext)
+		return __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
+				  __stop_orc_unwind_ip - __start_orc_unwind_ip, ip);
+
+	/* Module lookup: */
+	return orc_module_find(ip);
+}
+
+static void orc_sort_swap(void *_a, void *_b, int size)
+{
+	struct orc_entry *orc_a, *orc_b;
+	struct orc_entry orc_tmp;
+	int *a = _a, *b = _b, tmp;
+	int delta = _b - _a;
+
+	/* Swap the .orc_unwind_ip entries: */
+	tmp = *a;
+	*a = *b + delta;
+	*b = tmp - delta;
+
+	/* Swap the corresponding .orc_unwind entries: */
+	orc_a = cur_orc_table + (a - cur_orc_ip_table);
+	orc_b = cur_orc_table + (b - cur_orc_ip_table);
+	orc_tmp = *orc_a;
+	*orc_a = *orc_b;
+	*orc_b = orc_tmp;
+}
+
+static int orc_sort_cmp(const void *_a, const void *_b)
+{
+	struct orc_entry *orc_a;
+	const int *a = _a, *b = _b;
+	unsigned long a_val = orc_ip(a);
+	unsigned long b_val = orc_ip(b);
+
+	if (a_val > b_val)
+		return 1;
+	if (a_val < b_val)
+		return -1;
+
+	/*
+	 * The "weak" section terminator entries need to always be on the left
+	 * to ensure the lookup code skips them in favor of real entries.
+	 * These terminator entries exist to handle any gaps created by
+	 * whitelisted .o files which didn't get objtool generation.
+	 */
+	orc_a = cur_orc_table + (a - cur_orc_ip_table);
+	return orc_a->sp_reg == ORC_REG_UNDEFINED ? -1 : 1;
+}
+
+#ifdef CONFIG_MODULES
+void unwind_module_init(struct module *mod, void *_orc_ip, size_t orc_ip_size,
+			void *_orc, size_t orc_size)
+{
+	int *orc_ip = _orc_ip;
+	struct orc_entry *orc = _orc;
+	unsigned int num_entries = orc_ip_size / sizeof(int);
+
+	WARN_ON_ONCE(orc_ip_size % sizeof(int) != 0 ||
+		     orc_size % sizeof(*orc) != 0 ||
+		     num_entries != orc_size / sizeof(*orc));
+
+	/*
+	 * The 'cur_orc_*' globals allow the orc_sort_swap() callback to
+	 * associate an .orc_unwind_ip table entry with its corresponding
+	 * .orc_unwind entry so they can both be swapped.
+	 */
+	mutex_lock(&sort_mutex);
+	cur_orc_ip_table = orc_ip;
+	cur_orc_table = orc;
+	sort(orc_ip, num_entries, sizeof(int), orc_sort_cmp, orc_sort_swap);
+	mutex_unlock(&sort_mutex);
+
+	mod->arch.orc_unwind_ip = orc_ip;
+	mod->arch.orc_unwind = orc;
+	mod->arch.num_orcs = num_entries;
+}
+#endif
+
+void __init unwind_init(void)
+{
+	size_t orc_ip_size = (void *)__stop_orc_unwind_ip - (void *)__start_orc_unwind_ip;
+	size_t orc_size = (void *)__stop_orc_unwind - (void *)__start_orc_unwind;
+	size_t num_entries = orc_ip_size / sizeof(int);
+	struct orc_entry *orc;
+	int i;
+
+	if (!num_entries || orc_ip_size % sizeof(int) != 0 ||
+	    orc_size % sizeof(struct orc_entry) != 0 ||
+	    num_entries != orc_size / sizeof(struct orc_entry)) {
+		orc_warn("WARNING: Bad or missing .orc_unwind table.  Disabling unwinder.\n");
+		return;
+	}
+
+	/* Sort the .orc_unwind and .orc_unwind_ip tables: */
+	sort(__start_orc_unwind_ip, num_entries, sizeof(int), orc_sort_cmp,
+	     orc_sort_swap);
+
+	/* Initialize the fast lookup table: */
+	lookup_num_blocks = orc_lookup_end - orc_lookup;
+	for (i = 0; i < lookup_num_blocks-1; i++) {
+		orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
+				 num_entries,
+				 LOOKUP_START_IP + (LOOKUP_BLOCK_SIZE * i));
+		if (!orc) {
+			orc_warn("WARNING: Corrupt .orc_unwind table.  Disabling unwinder.\n");
+			return;
+		}
+
+		orc_lookup[i] = orc - __start_orc_unwind;
+	}
+
+	/* Initialize the ending block: */
+	orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind, num_entries,
+			 LOOKUP_STOP_IP);
+	if (!orc) {
+		orc_warn("WARNING: Corrupt .orc_unwind table.  Disabling unwinder.\n");
+		return;
+	}
+	orc_lookup[lookup_num_blocks-1] = orc - __start_orc_unwind;
+
+	orc_init = true;
+}
+
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return 0;
+
+	return __kernel_text_address(state->ip) ? state->ip : 0;
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return NULL;
+
+	if (state->regs)
+		return &state->regs->ip;
+
+	if (state->sp)
+		return (unsigned long *)state->sp - 1;
+
+	return NULL;
+}
+
+static bool stack_access_ok(struct unwind_state *state, unsigned long addr,
+			    size_t len)
+{
+	struct stack_info *info = &state->stack_info;
+
+	/*
+	 * If the address isn't on the current stack, switch to the next one.
+	 *
+	 * We may have to traverse multiple stacks to deal with the possibility
+	 * that info->next_sp could point to an empty stack and the address
+	 * could be on a subsequent stack.
+	 */
+	while (!on_stack(info, (void *)addr, len))
+		if (get_stack_info(info->next_sp, state->task, info,
+				   &state->stack_mask))
+			return false;
+
+	return true;
+}
+
+static bool deref_stack_reg(struct unwind_state *state, unsigned long addr,
+			    unsigned long *val)
+{
+	if (!stack_access_ok(state, addr, sizeof(long)))
+		return false;
+
+	*val = READ_ONCE_TASK_STACK(state->task, *(unsigned long *)addr);
+	return true;
+}
+
+#define REGS_SIZE (sizeof(struct pt_regs))
+#define SP_OFFSET (offsetof(struct pt_regs, sp))
+#define IRET_REGS_SIZE (REGS_SIZE - offsetof(struct pt_regs, ip))
+#define IRET_SP_OFFSET (SP_OFFSET - offsetof(struct pt_regs, ip))
+
+static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
+			     unsigned long *ip, unsigned long *sp, bool full)
+{
+	size_t regs_size = full ? REGS_SIZE : IRET_REGS_SIZE;
+	size_t sp_offset = full ? SP_OFFSET : IRET_SP_OFFSET;
+	struct pt_regs *regs = (struct pt_regs *)(addr + regs_size - REGS_SIZE);
+
+	if (IS_ENABLED(CONFIG_X86_64)) {
+		if (!stack_access_ok(state, addr, regs_size))
+			return false;
+
+		*ip = regs->ip;
+		*sp = regs->sp;
+
+		return true;
+	}
+
+	if (!stack_access_ok(state, addr, sp_offset))
+		return false;
+
+	*ip = regs->ip;
+
+	if (user_mode(regs)) {
+		if (!stack_access_ok(state, addr + sp_offset,
+				     REGS_SIZE - SP_OFFSET))
+			return false;
+
+		*sp = regs->sp;
+	} else
+		*sp = (unsigned long)&regs->sp;
+
+	return true;
+}
+
+bool unwind_next_frame(struct unwind_state *state)
+{
+	unsigned long ip_p, sp, orig_ip, prev_sp = state->sp;
+	enum stack_type prev_type = state->stack_info.type;
+	struct orc_entry *orc;
+	struct pt_regs *ptregs;
+	bool indirect = false;
+
+	if (unwind_done(state))
+		return false;
+
+	/* Don't let modules unload while we're reading their ORC data. */
+	preempt_disable();
+
+	/* Have we reached the end? */
+	if (state->regs && user_mode(state->regs))
+		goto done;
+
+	/*
+	 * Find the orc_entry associated with the text address.
+	 *
+	 * Decrement call return addresses by one so they work for sibling
+	 * calls and calls to noreturn functions.
+	 */
+	orc = orc_find(state->signal ? state->ip : state->ip - 1);
+	if (!orc || orc->sp_reg == ORC_REG_UNDEFINED)
+		goto done;
+	orig_ip = state->ip;
+
+	/* Find the previous frame's stack: */
+	switch (orc->sp_reg) {
+	case ORC_REG_SP:
+		sp = state->sp + orc->sp_offset;
+		break;
+
+	case ORC_REG_BP:
+		sp = state->bp + orc->sp_offset;
+		break;
+
+	case ORC_REG_SP_INDIRECT:
+		sp = state->sp + orc->sp_offset;
+		indirect = true;
+		break;
+
+	case ORC_REG_BP_INDIRECT:
+		sp = state->bp + orc->sp_offset;
+		indirect = true;
+		break;
+
+	case ORC_REG_R10:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg R10 at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->r10;
+		break;
+
+	case ORC_REG_R13:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg R13 at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->r13;
+		break;
+
+	case ORC_REG_DI:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg DI at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->di;
+		break;
+
+	case ORC_REG_DX:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg DX at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->dx;
+		break;
+
+	default:
+		orc_warn("unknown SP base reg %d for ip %p\n",
+			 orc->sp_reg, (void *)state->ip);
+		goto done;
+	}
+
+	if (indirect) {
+		if (!deref_stack_reg(state, sp, &sp))
+			goto done;
+	}
+
+	/* Find IP, SP and possibly regs: */
+	switch (orc->type) {
+	case ORC_TYPE_CALL:
+		ip_p = sp - sizeof(long);
+
+		if (!deref_stack_reg(state, ip_p, &state->ip))
+			goto done;
+
+		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
+						  state->ip, (void *)ip_p);
+
+		state->sp = sp;
+		state->regs = NULL;
+		state->signal = false;
+		break;
+
+	case ORC_TYPE_REGS:
+		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, true)) {
+			orc_warn("can't dereference registers at %p for ip %p\n",
+				 (void *)sp, (void *)orig_ip);
+			goto done;
+		}
+
+		state->regs = (struct pt_regs *)sp;
+		state->full_regs = true;
+		state->signal = true;
+		break;
+
+	case ORC_TYPE_REGS_IRET:
+		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, false)) {
+			orc_warn("can't dereference iret registers at %p for ip %p\n",
+				 (void *)sp, (void *)orig_ip);
+			goto done;
+		}
+
+		ptregs = container_of((void *)sp, struct pt_regs, ip);
+		if ((unsigned long)ptregs >= prev_sp &&
+		    on_stack(&state->stack_info, ptregs, REGS_SIZE)) {
+			state->regs = ptregs;
+			state->full_regs = false;
+		} else
+			state->regs = NULL;
+
+		state->signal = true;
+		break;
+
+	default:
+		orc_warn("unknown .orc_unwind entry type %d\n", orc->type);
+		break;
+	}
+
+	/* Find BP: */
+	switch (orc->bp_reg) {
+	case ORC_REG_UNDEFINED:
+		if (state->regs && state->full_regs)
+			state->bp = state->regs->bp;
+		break;
+
+	case ORC_REG_PREV_SP:
+		if (!deref_stack_reg(state, sp + orc->bp_offset, &state->bp))
+			goto done;
+		break;
+
+	case ORC_REG_BP:
+		if (!deref_stack_reg(state, state->bp + orc->bp_offset, &state->bp))
+			goto done;
+		break;
+
+	default:
+		orc_warn("unknown BP base reg %d for ip %p\n",
+			 orc->bp_reg, (void *)orig_ip);
+		goto done;
+	}
+
+	/* Prevent a recursive loop due to bad ORC data: */
+	if (state->stack_info.type == prev_type &&
+	    on_stack(&state->stack_info, (void *)state->sp, sizeof(long)) &&
+	    state->sp <= prev_sp) {
+		orc_warn("stack going in the wrong direction? ip=%p\n",
+			 (void *)orig_ip);
+		goto done;
+	}
+
+	preempt_enable();
+	return true;
+
+done:
+	preempt_enable();
+	state->stack_info.type = STACK_TYPE_UNKNOWN;
+	return false;
+}
+EXPORT_SYMBOL_GPL(unwind_next_frame);
+
+void __unwind_start(struct unwind_state *state, struct task_struct *task,
+		    struct pt_regs *regs, unsigned long *first_frame)
+{
+	memset(state, 0, sizeof(*state));
+	state->task = task;
+
+	/*
+	 * Refuse to unwind the stack of a task while it's executing on another
+	 * CPU.  This check is racy, but that's ok: the unwinder has other
+	 * checks to prevent it from going off the rails.
+	 */
+	if (task_on_another_cpu(task))
+		goto done;
+
+	if (regs) {
+		if (user_mode(regs))
+			goto done;
+
+		state->ip = regs->ip;
+		state->sp = kernel_stack_pointer(regs);
+		state->bp = regs->bp;
+		state->regs = regs;
+		state->full_regs = true;
+		state->signal = true;
+
+	} else if (task == current) {
+		asm volatile("lea (%%rip), %0\n\t"
+			     "mov %%rsp, %1\n\t"
+			     "mov %%rbp, %2\n\t"
+			     : "=r" (state->ip), "=r" (state->sp),
+			       "=r" (state->bp));
+
+	} else {
+		struct inactive_task_frame *frame = (void *)task->thread.sp;
+
+		state->ip = frame->ret_addr;
+		state->sp = task->thread.sp;
+		state->bp = frame->bp;
+	}
+
+	if (get_stack_info((unsigned long *)state->sp, state->task,
+			   &state->stack_info, &state->stack_mask))
+		return;
+
+	/*
+	 * The caller can provide the address of the first frame directly
+	 * (first_frame) or indirectly (regs->sp) to indicate which stack frame
+	 * to start unwinding at.  Skip ahead until we reach it.
+	 */
+
+	/* When starting from regs, skip the regs frame: */
+	if (regs) {
+		unwind_next_frame(state);
+		return;
+	}
+
+	/* Otherwise, skip ahead to the user-specified starting frame: */
+	while (!unwind_done(state) &&
+	       (!on_stack(&state->stack_info, first_frame, sizeof(long)) ||
+			state->sp <= (unsigned long)first_frame))
+		unwind_next_frame(state);
+
+	return;
+
+done:
+	state->stack_info.type = STACK_TYPE_UNKNOWN;
+	return;
+}
+EXPORT_SYMBOL_GPL(__unwind_start);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index c8a3b61..f05f00a 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -24,6 +24,7 @@
 #include <asm/asm-offsets.h>
 #include <asm/thread_info.h>
 #include <asm/page_types.h>
+#include <asm/orc_lookup.h>
 #include <asm/cache.h>
 #include <asm/boot.h>
 
@@ -148,6 +149,8 @@ SECTIONS
 
 	BUG_TABLE
 
+	ORC_UNWIND_TABLE
+
 	. = ALIGN(PAGE_SIZE);
 	__vvar_page = .;
 
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 0d64658..e98b052 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -668,6 +668,31 @@
 #define BUG_TABLE
 #endif
 
+#ifdef CONFIG_ORC_UNWINDER
+#define ORC_UNWIND_TABLE						\
+	. = ALIGN(4);							\
+	.orc_unwind_ip : AT(ADDR(.orc_unwind_ip) - LOAD_OFFSET) {	\
+		VMLINUX_SYMBOL(__start_orc_unwind_ip) = .;		\
+		KEEP(*(.orc_unwind_ip))					\
+		VMLINUX_SYMBOL(__stop_orc_unwind_ip) = .;		\
+	}								\
+	. = ALIGN(6);							\
+	.orc_unwind : AT(ADDR(.orc_unwind) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(__start_orc_unwind) = .;			\
+		KEEP(*(.orc_unwind))					\
+		VMLINUX_SYMBOL(__stop_orc_unwind) = .;			\
+	}								\
+	. = ALIGN(4);							\
+	.orc_lookup : AT(ADDR(.orc_lookup) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(orc_lookup) = .;				\
+		. += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) /	\
+			LOOKUP_BLOCK_SIZE) + 1) * 4;			\
+		VMLINUX_SYMBOL(orc_lookup_end) = .;			\
+	}
+#else
+#define ORC_UNWIND_TABLE
+#endif
+
 #ifdef CONFIG_PM_TRACE
 #define TRACEDATA							\
 	. = ALIGN(4);							\
@@ -854,7 +879,7 @@
 		DATA_DATA						\
 		CONSTRUCTORS						\
 	}								\
-	BUG_TABLE
+	BUG_TABLE							\
 
 #define INIT_TEXT_SECTION(inittext_align)				\
 	. = ALIGN(inittext_align);					\
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 9c5d40a..a7abffa 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -374,6 +374,9 @@ config STACK_VALIDATION
 	  pointers (if CONFIG_FRAME_POINTER is enabled).  This helps ensure
 	  that runtime stack traces are more reliable.
 
+	  This is also a prerequisite for generation of ORC unwind data, which
+	  is needed for CONFIG_ORC_UNWINDER.
+
 	  For more information, see
 	  tools/objtool/Documentation/stack-validation.txt.
 
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 733e044..11b5c28 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -258,7 +258,8 @@ ifneq ($(SKIP_STACK_VALIDATION),1)
 
 __objtool_obj := $(objtree)/tools/objtool/objtool
 
-objtool_args = check
+objtool_args = $(if $(CONFIG_ORC_UNWINDER),orc generate,check)
+
 ifndef CONFIG_FRAME_POINTER
 objtool_args += --no-fp
 endif
@@ -276,6 +277,11 @@ objtool_obj = $(if $(patsubst y%,, \
 endif # SKIP_STACK_VALIDATION
 endif # CONFIG_STACK_VALIDATION
 
+# Rebuild all objects when objtool changes, or is enabled/disabled.
+objtool_dep = $(objtool_obj)					\
+	      $(wildcard include/config/orc/unwinder.h		\
+			 include/config/stack/validation.h)
+
 define rule_cc_o_c
 	$(call echo-cmd,checksrc) $(cmd_checksrc)			  \
 	$(call cmd_and_fixdep,cc_o_c)					  \
@@ -298,13 +304,13 @@ cmd_undef_syms = echo
 endif
 
 # Built-in and composite module parts
-$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 	$(call cmd,force_checksrc)
 	$(call if_changed_rule,cc_o_c)
 
 # Single-part modules are special since we need to mark them in $(MODVERDIR)
 
-$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 	$(call cmd,force_checksrc)
 	$(call if_changed_rule,cc_o_c)
 	@{ echo $(@:.o=.ko); echo $@; \
@@ -399,7 +405,7 @@ cmd_modversions_S =								\
 endif
 endif
 
-$(obj)/%.o: $(src)/%.S $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.S $(objtool_dep) FORCE
 	$(call if_changed_rule,as_o_S)
 
 targets += $(real-objs-y) $(real-objs-m) $(lib-y)
-- 
2.7.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v3 10/10] x86/kconfig: make it easier to switch to the new ORC unwinder
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
                   ` (8 preceding siblings ...)
  2017-07-11 15:33 ` [PATCH v3 09/10] x86/unwind: add ORC unwinder Josh Poimboeuf
@ 2017-07-11 15:33 ` Josh Poimboeuf
  2017-07-12  8:27 ` [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Ingo Molnar
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-11 15:33 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

A couple of Kconfig changes which make it much easier to switch to the
new CONFIG_ORC_UNWINDER:

1) Remove x86 dependencies on CONFIG_FRAME_POINTER for lockdep,
   latencytop, and fault injection.  x86 has a 'guess' unwinder which
   just scans the stack for kernel text addresses.  It's not 100%
   accurate but in many cases it's good enough.  This allows those users
   who don't want the text overhead of the frame pointer or ORC
   unwinders to still use these features.  More importantly, this also
   makes it much more straightforward to disable frame pointers.

2) Make CONFIG_ORC_UNWINDER depend on !CONFIG_FRAME_POINTER.  While it
   would be possible to have both enabled, it doesn't really make sense
   to do so.  So enforce a sane configuration to prevent the user from
   making a dumb mistake.

With these changes, when you disable CONFIG_FRAME_POINTER, "make
oldconfig" will ask if you want to enable CONFIG_ORC_UNWINDER.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/Kconfig.debug | 7 +++----
 lib/Kconfig.debug      | 6 +++---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index dc10ec6..268a318 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -357,7 +357,7 @@ config PUNIT_ATOM_DEBUG
 
 config ORC_UNWINDER
 	bool "ORC unwinder"
-	depends on X86_64
+	depends on X86_64 && !FRAME_POINTER
 	select STACK_VALIDATION
 	---help---
 	  This option enables the ORC (Oops Rewind Capability) unwinder for
@@ -365,9 +365,8 @@ config ORC_UNWINDER
 	  a simplified version of the DWARF Call Frame Information standard.
 
 	  This unwinder is more accurate across interrupt entry frames than the
-	  frame pointer unwinder.  It can also enable a 5-10% performance
-	  improvement across the entire kernel if CONFIG_FRAME_POINTER is
-	  disabled.
+	  frame pointer unwinder.  It also enables a 5-10% performance
+	  improvement across the entire kernel compared to frame pointers.
 
 	  Enabling this option will increase the kernel's runtime memory usage
 	  by roughly 2-4MB, depending on your kernel config.
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a7abffa..325c1c5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1113,7 +1113,7 @@ config LOCKDEP
 	bool
 	depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
 	select STACKTRACE
-	select FRAME_POINTER if !MIPS && !PPC && !ARM_UNWIND && !S390 && !MICROBLAZE && !ARC && !SCORE
+	select FRAME_POINTER if !MIPS && !PPC && !ARM_UNWIND && !S390 && !MICROBLAZE && !ARC && !SCORE && !X86
 	select KALLSYMS
 	select KALLSYMS_ALL
 
@@ -1504,7 +1504,7 @@ config FAULT_INJECTION_STACKTRACE_FILTER
 	depends on FAULT_INJECTION_DEBUG_FS && STACKTRACE_SUPPORT
 	depends on !X86_64
 	select STACKTRACE
-	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !SCORE
+	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !SCORE && !X86
 	help
 	  Provide stacktrace filter for fault-injection capabilities
 
@@ -1513,7 +1513,7 @@ config LATENCYTOP
 	depends on DEBUG_KERNEL
 	depends on STACKTRACE_SUPPORT
 	depends on PROC_FS
-	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC
+	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !X86
 	select KALLSYMS
 	select KALLSYMS_ALL
 	select STACKTRACE
-- 
2.7.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
                   ` (9 preceding siblings ...)
  2017-07-11 15:33 ` [PATCH v3 10/10] x86/kconfig: make it easier to switch to the new " Josh Poimboeuf
@ 2017-07-12  8:27 ` Ingo Molnar
  2017-07-12 14:42   ` Josh Poimboeuf
  2017-07-12 21:49 ` Andres Freund
  2017-07-12 22:30 ` [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Andi Kleen
  12 siblings, 1 reply; 60+ messages in thread
From: Ingo Molnar @ 2017-07-12  8:27 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> The biggest change is that undwarf was renamed to ORC.  Here's the
> relevant explanation from the docs:
> 
>   Etymology
>   ---------
>   
>   Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
>   enemies.  Similarly, the ORC unwinder was created in opposition to the
>   complexity and slowness of DWARF.
>   
>   "Although Orcs rarely consider multiple solutions to a problem, they do
>   excel at getting things done because they are creatures of action, not
>   thought." [3]  Similarly, unlike the esoteric DWARF unwinder, the
>   veracious ORC unwinder wastes no time or siloconic effort decoding
>   variable-length zero-extended unsigned-integer byte-coded
>   state-machine-based debug information entries.
>   
>   Similar to how Orcs frequently unravel the well-intentioned plans of
>   their adversaries, the ORC unwinder frequently unravels stacks with
>   brutal, unyielding efficiency.
>   
>   ORC stands for Oops Rewind Capability.

Perfect naming!

(ORC might also stand for "Optimized Rewind Capability".)

> Create a new "ORC" unwinder, enabled by CONFIG_ORC_UNWINDER, and plug it
> into the x86 unwinder framework.  Objtool is used to generate the ORC
> debuginfo.  The ORC debuginfo format is basically a simplified version
> of DWARF CFI.  More details below.

BTW., we should perhaps consolidate our unwinder related Kconfig space, 
hierarchically:

	CONFIG_UNWINDER
	CONFIG_UNWINDER_ORC
	CONFIG_UNWINDER_FRAME_POINTERS

Note that as a side effect it would be a valid small systems build option to have 
no unwinder at all, if CONFIG_EXPERT=y is set and such: CONFIG_UNWINDER=n would 
be a sibling to !CONFIG_BUG.
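
Roughly, as a first sketch (exact names, dependencies and defaults would
need to be worked out - this is not what the current patches implement):

config UNWINDER
	bool "Stack unwinder support" if EXPERT
	default y

config UNWINDER_ORC
	bool "ORC unwinder"
	depends on UNWINDER && X86_64
	select STACK_VALIDATION

config UNWINDER_FRAME_POINTERS
	bool "Frame pointer unwinder"
	depends on UNWINDER
	select FRAME_POINTER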

CONFIG_FRAME_POINTERS et al would be left for architectures where it has a meaning 
beyond backtrace generation. (Not sure whether there are any such architectures.)

> The unwinder works well in my testing.  It unwinds through interrupts,
> exceptions, and preemption, with and without frame pointers, across
> aligned stacks and dynamically allocated stacks.  If something goes
> wrong during an oops, it successfully falls back to printing the '?'
> entries just like the frame pointer unwinder.

Ok, I'll start applying your patches after -rc1, unless anyone objects.

> The ORC data format does have a few downsides compared to DWARF.  The
> ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.

Could we also write this in percentage, not absolute RAM size - i.e. ORC unwind 
tables take 30% more RAM (+0.7 MB on an x86 defconfig kernel) than DWARF eh_frame 
tables.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12  8:27 ` [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Ingo Molnar
@ 2017-07-12 14:42   ` Josh Poimboeuf
  2017-07-12 19:27     ` Ingo Molnar
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-12 14:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

On Wed, Jul 12, 2017 at 10:27:10AM +0200, Ingo Molnar wrote:
> > Create a new "ORC" unwinder, enabled by CONFIG_ORC_UNWINDER, and plug it
> > into the x86 unwinder framework.  Objtool is used to generate the ORC
> > debuginfo.  The ORC debuginfo format is basically a simplified version
> > of DWARF CFI.  More details below.
> 
> BTW., we should perhaps consolidate our unwinder related Kconfig space, 
> hierarchically:
> 
> 	CONFIG_UNWINDER
> 	CONFIG_UNWINDER_ORC
> 	CONFIG_UNWINDER_FRAME_POINTERS
> 
> Note that as a side effect it would be a valid small systems build option to have 
> no unwinder at all, if CONFIG_EXPERT=y is set and such: CONFIG_UNWINDER=n would 
> be a sibling to !CONFIG_BUG.

So is the idea that CONFIG_UNWINDER=n means "use the 'guess' unwinder"?
Or should it mean that the unwind API isn't available?

Without frame pointers and orc, it defaults to the 'guess' unwinder, for
which the only overhead is a tiny amount of code.  It's still
technically considered an unwinder because it plugs into the unwind
interfaces (unwind_start(), unwind_next_frame(), etc) and is used for
things like /proc/<pid>/stack.
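
For reference, a consumer of that API looks roughly like this (an
illustrative sketch -- the function name is made up, but the calls are
the real interfaces from this series; assumes kernel context with
<asm/unwind.h>):

static void print_task_stack_example(struct task_struct *task)
{
	struct unwind_state state;
	unsigned long addr;

	/* Walk the stack one frame at a time until the unwinder is done: */
	for (unwind_start(&state, task, NULL, NULL); !unwind_done(&state);
	     unwind_next_frame(&state)) {
		addr = unwind_get_return_address(&state);
		if (!addr)
			break;
		printk("  %pB\n", (void *)addr);
	}
}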

So I'm not really sure CONFIG_UNWINDER=n would make sense.  Maybe there
should just be a multiple-choice where you have to choose one of
CONFIG_UNWINDER_{ORC,FRAME_POINTER,GUESS}.

> CONFIG_FRAME_POINTERS et al would be left for architectures where it has a meaning 
> beyond backtrace generation. (Not sure whether there's any such architectures.)

Well, on x86, hardened usercopy relies on frame pointers, but not the
unwinder.  It does the frame pointer walk manually to avoid the full
unwinder overhead.  See arch_within_stack_frames().
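
The walk itself is pretty simple.  A simplified sketch (not the actual
arch_within_stack_frames() code): with CONFIG_FRAME_POINTER, each frame
starts with the saved caller bp followed by the return address, so the
frames form a linked list:

/* frame layout from asm/stacktrace.h: */
struct stack_frame {
	struct stack_frame *next_frame;
	unsigned long return_address;
};

static void fp_walk_example(void)
{
	struct stack_frame *frame = __builtin_frame_address(0);

	/* Follow saved frame pointers while they stay on this stack: */
	while (frame && object_is_on_stack(frame)) {
		/* inspect frame->return_address here */
		frame = frame->next_frame;
	}
}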

> > The unwinder works well in my testing.  It unwinds through interrupts,
> > exceptions, and preemption, with and without frame pointers, across
> > aligned stacks and dynamically allocated stacks.  If something goes
> > wrong during an oops, it successfully falls back to printing the '?'
> > entries just like the frame pointer unwinder.
> 
> Ok, I'll start applying your patches after -rc1, unless anyone objects.

Thank you Ingo!

> > The ORC data format does have a few downsides compared to DWARF.  The
> > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> 
> Could we also write this in percentage, not absolute RAM size - i.e. ORC unwind 
> tables take 30% more RAM (+0.7 MB on an x86 defconfig kernel) than DWARF eh_frame 
> tables.

Ok, how about:

  "Orc unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig
  kernel) than DWARF eh_frame tables."

(My previous 1MB number was from my distro-based config, and it also
forgot to take into account the fast lookup table (".orc_lookup")).

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 14:42   ` Josh Poimboeuf
@ 2017-07-12 19:27     ` Ingo Molnar
  2017-07-14 17:17       ` Josh Poimboeuf
  0 siblings, 1 reply; 60+ messages in thread
From: Ingo Molnar @ 2017-07-12 19:27 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Wed, Jul 12, 2017 at 10:27:10AM +0200, Ingo Molnar wrote:
> > > Create a new "ORC" unwinder, enabled by CONFIG_ORC_UNWINDER, and plug it
> > > into the x86 unwinder framework.  Objtool is used to generate the ORC
> > > debuginfo.  The ORC debuginfo format is basically a simplified version
> > > of DWARF CFI.  More details below.
> > 
> > BTW., we should perhaps consolidate our unwinder related Kconfig space, 
> > hierarchically:
> > 
> > 	CONFIG_UNWINDER
> > 	CONFIG_UNWINDER_ORC
> > 	CONFIG_UNWINDER_FRAME_POINTERS
> > 
> > Note that as a side effect it would be a valid small systems build option to have 
> > no unwinder at all, if CONFIG_EXPERT=y is set and such: CONFIG_UNWINDER=n would 
> > be a sibling to !CONFIG_BUG.
> 
> So is the idea that CONFIG_UNWINDER=n means "use the 'guess' unwinder"?
> Or should it mean that the unwind API isn't available?
> 
> Without frame pointers and orc, it defaults to the 'guess' unwinder, for
> which the only overhead is a tiny amount of code.  It's still
> technically considered an unwinder because it plugs into the unwind
> interfaces (unwind_start(), unwind_next_frame(), etc) and is used for
> things like /proc/<pid>/stack.
> 
> So I'm not really sure CONFIG_UNWINDER=n would make sense.  Maybe there
> should just be a multiple-choice where you have to choose one of
> CONFIG_UNWINDER_{ORC,FRAME_POINTER,GUESS}.

Ok, you are right.

Maybe we could offer a menu of unwinders - i.e. make the whole Kconfig interface a 
bit nicer:

  CONFIG_UNWINDER_FRAME_POINTER
  CONFIG_UNWINDER_ORC
  CONFIG_UNWINDER_GUESS

... or so?

Default would be the historic FRAME_POINTER, at least initially, I think.

I wouldn't mind making CONFIG_UNWINDER_ORC the new default either, due to the 
non-trivial speedup it offers - but maybe folks would object?
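
I.e. something along these lines - only a quick sketch of the 'choice'
idea, not a finished patch:

choice
	prompt "Choose kernel unwinder"
	default UNWINDER_FRAME_POINTER

config UNWINDER_FRAME_POINTER
	bool "Frame pointer unwinder"
	select FRAME_POINTER

config UNWINDER_ORC
	bool "ORC unwinder"
	depends on X86_64
	select STACK_VALIDATION

config UNWINDER_GUESS
	bool "Guess unwinder"
	depends on EXPERT

endchoice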

> > CONFIG_FRAME_POINTERS et al would be left for architectures where it has a meaning 
> > beyond backtrace generation. (Not sure whether there are any such architectures.)
> 
> Well, on x86, hardened usercopy relies on frame pointers, but not the
> unwinder.  It does the frame pointer walk manually to avoid the full
> unwinder overhead.  See arch_within_stack_frames().

Oh well...

> Ok, how about:
> 
>   "Orc unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig
>   kernel) than DWARF eh_frame tables."
> 
> (My previous 1MB number was from my distro-based config, and it also
> forgot to take into account the fast lookup table (".orc_lookup")).

Sounds good to me!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
                   ` (10 preceding siblings ...)
  2017-07-12  8:27 ` [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Ingo Molnar
@ 2017-07-12 21:49 ` Andres Freund
  2017-07-12 22:32   ` Josh Poimboeuf
  2017-07-12 22:30 ` [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Andi Kleen
  12 siblings, 1 reply; 60+ messages in thread
From: Andres Freund @ 2017-07-12 21:49 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

Hi,

On 2017-07-11 10:33:37 -0500, Josh Poimboeuf wrote:
> The simpler debuginfo format also enables the unwinder to be much faster
> than DWARF, which is important for perf and lockdep.

Is this going to be usable for userland call-graphs as well? If one
converts dwarf to that, I mean? Because right now with perf dwarf is
often the only thing that works properly through libc, as libc isn't
compiled with fps and has hardcoded asm not preserving fp. lbr isn't
available for many events, and often not at all available in VMs etc.

Regards,

Andres

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
                   ` (11 preceding siblings ...)
  2017-07-12 21:49 ` Andres Freund
@ 2017-07-12 22:30 ` Andi Kleen
  2017-07-12 22:47   ` Josh Poimboeuf
                     ` (2 more replies)
  12 siblings, 3 replies; 60+ messages in thread
From: Andi Kleen @ 2017-07-12 22:30 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

Josh Poimboeuf <jpoimboe@redhat.com> writes:
>
> The ORC data format does have a few downsides compared to DWARF.  The
> ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
>
Can we have an option to just use dwarf instead? For people
who don't want to waste a MB+ to solve a problem that doesn't
exist (as proven by many years of opensuse kernel experience)

As far as I can tell this whole thing has only downsides compared
to the dwarf unwinder that was earlier proposed. I don't see
a single advantage.

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 21:49 ` Andres Freund
@ 2017-07-12 22:32   ` Josh Poimboeuf
  2017-07-12 22:36     ` Andres Freund
  2017-07-13  7:12     ` Peter Zijlstra
  0 siblings, 2 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-12 22:32 UTC (permalink / raw)
  To: Andres Freund
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On Wed, Jul 12, 2017 at 02:49:20PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2017-07-11 10:33:37 -0500, Josh Poimboeuf wrote:
> > The simpler debuginfo format also enables the unwinder to be much faster
> > than DWARF, which is important for perf and lockdep.
> 
> Is this going to be usable for userland call-graphs as well? If one
> converts dwarf to that, I mean? Because right now with perf dwarf is
> often the only thing that works properly through libc, as libc isn't
> compiled with fps and has hardcoded asm not preserving fp. lbr isn't
> available for many events, and often not at all available in VMs etc.

Just to clarify, these patches are completely separate from DWARF and
shouldn't break any existing DWARF-based functionality for user space
tooling.  So perf can still use DWARF for user space binaries just fine.

(Also, tools which rely on CONFIG_DEBUG_INFO for kernel debugging, like
gdb and crash, will continue to work.)

If you want perf to be able to use ORC instead of DWARF for user space
binaries, that's not currently possible, though I don't see any
technical blockers for doing so.  Perf would need to be taught to read
ORC data.

And I think it should be possible to convert DWARF to ORC, assuming the
DWARF data is trusted.  We could probably add an objtool subcommand for
that.

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 22:32   ` Josh Poimboeuf
@ 2017-07-12 22:36     ` Andres Freund
  2017-07-12 22:40       ` Josh Poimboeuf
  2017-07-13  7:12     ` Peter Zijlstra
  1 sibling, 1 reply; 60+ messages in thread
From: Andres Freund @ 2017-07-12 22:36 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

Hi,

On 2017-07-12 17:32:25 -0500, Josh Poimboeuf wrote:
> If you want perf to be able to use ORC instead of DWARF for user space
> binaries, that's not currently possible, though I don't see any
> technical blockers for doing so.  Perf would need to be taught to read
> ORC data.

Right, that's what I was hoping for.


> And I think it should be possible to convert DWARF to ORC, assuming the
> DWARF data is trusted.  We could probably add an objtool subcommand for
> that.

That'd be pretty helpful.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 22:36     ` Andres Freund
@ 2017-07-12 22:40       ` Josh Poimboeuf
  2017-07-12 22:54         ` Andres Freund
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-12 22:40 UTC (permalink / raw)
  To: Andres Freund
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On Wed, Jul 12, 2017 at 03:36:05PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2017-07-12 17:32:25 -0500, Josh Poimboeuf wrote:
> > If you want perf to be able to use ORC instead of DWARF for user space
> > binaries, that's not currently possible, though I don't see any
> > technical blockers for doing so.  Perf would need to be taught to read
> > ORC data.
> 
> Right, that's what I was hoping for.
> 
> 
> > And I think it should be possible to convert DWARF to ORC, assuming the
> > DWARF data is trusted.  We could probably add an objtool subcommand for
> > that.
> 
> That'd be pretty helpful.

Can I ask why?  Is DWARF too slow, or is it something else?

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 22:30 ` [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Andi Kleen
@ 2017-07-12 22:47   ` Josh Poimboeuf
  2017-07-13  4:29     ` Andi Kleen
  2017-07-13  9:29     ` Ingo Molnar
  2017-07-12 23:22   ` Andy Lutomirski
  2017-07-13  3:03   ` Mike Galbraith
  2 siblings, 2 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-12 22:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On Wed, Jul 12, 2017 at 03:30:31PM -0700, Andi Kleen wrote:
> Josh Poimboeuf <jpoimboe@redhat.com> writes:
> >
> > The ORC data format does have a few downsides compared to DWARF.  The
> > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> >
> Can we have an option to just use dwarf instead? For people
> who don't want to waste a MB+ to solve a problem that doesn't
> exist (as proven by many years of opensuse kernel experience)
> 
> As far as I can tell this whole thing has only downsides compared
> to the dwarf unwinder that was earlier proposed. I don't see
> a single advantage.

Improved speed, reliability, maintainability.  Are those not advantages?

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 22:40       ` Josh Poimboeuf
@ 2017-07-12 22:54         ` Andres Freund
  0 siblings, 0 replies; 60+ messages in thread
From: Andres Freund @ 2017-07-12 22:54 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On 2017-07-12 17:40:45 -0500, Josh Poimboeuf wrote:
> On Wed, Jul 12, 2017 at 03:36:05PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2017-07-12 17:32:25 -0500, Josh Poimboeuf wrote:
> > > If you want perf to be able to use ORC instead of DWARF for user space
> > > binaries, that's not currently possible, though I don't see any
> > > technical blockers for doing so.  Perf would need to be taught to read
> > > ORC data.
> > 
> > Right, that's what I was hoping for.
> > 
> > 
> > > And I think it should be possible to convert DWARF to ORC, assuming the
> > > DWARF data is trusted.  We could probably add an objtool subcommand for
> > > that.
> > 
> > That'd be pretty helpful.
> 
> Can I ask why?  Is DWARF too slow, or is it something else?

Both. Dwarf is really slow and uses a lot of space - on a bigger machine
it's often nearly unusable. Secondly dwarf isn't available for BPF based
stuff, IIUC because the kernel has to create a full backtrace there
(rather than saving enough data that userland can do so). Which wasn't
"allowed" to be done in-kernel w/ dwarf, just fp so far.

- Andres

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 22:30 ` [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Andi Kleen
  2017-07-12 22:47   ` Josh Poimboeuf
@ 2017-07-12 23:22   ` Andy Lutomirski
  2017-07-13  3:03   ` Mike Galbraith
  2 siblings, 0 replies; 60+ messages in thread
From: Andy Lutomirski @ 2017-07-12 23:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Josh Poimboeuf, X86 ML, linux-kernel, live-patching,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra, Mike Galbraith

On Wed, Jul 12, 2017 at 3:30 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Josh Poimboeuf <jpoimboe@redhat.com> writes:
>>
>> The ORC data format does have a few downsides compared to DWARF.  The
>> ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
>>
> Can we have an option to just use dwarf instead? For people
> who don't want to waste a MB+ to solve a problem that doesn't
> exist (as proven by many years of opensuse kernel experience)
>
> As far as I can tell this whole thing has only downsides compared
> to the dwarf unwinder that was earlier proposed. I don't see
> a single advantage.
>

If someone wanted to write an in-kernel DWARF parser that hooked into
the same machinery that Josh is using and came with a complete formal
verification package, I might not object, with the caveat that it's
likely to be *much* slower than ORC.  By complete formal verification,
I mean that a user tool running the exact same code, compiled with
strong sanitization, should decode the tables for every single kernel
IP and confirm that (a) the output is sane, (b) the output matches
what objtool says it should do and (c) doesn't crash.

I'm not sure I see the point, though.  I also think that Linus would
object, since I asked him quite recently and he said he'd object.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 22:30 ` [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Andi Kleen
  2017-07-12 22:47   ` Josh Poimboeuf
  2017-07-12 23:22   ` Andy Lutomirski
@ 2017-07-13  3:03   ` Mike Galbraith
  2017-07-13  4:15     ` Andi Kleen
  2 siblings, 1 reply; 60+ messages in thread
From: Mike Galbraith @ 2017-07-13  3:03 UTC (permalink / raw)
  To: Andi Kleen, Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra

On Wed, 2017-07-12 at 15:30 -0700, Andi Kleen wrote:
> Josh Poimboeuf <jpoimboe@redhat.com> writes:
> >
> > The ORC data format does have a few downsides compared to DWARF.  The
> > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> >
> Can we have an option to just use dwarf instead? For people
> who don't want to waste a MB+ to solve a problem that doesn't
> exist (as proven by many years of opensuse kernel experience)

Sure the dwarf unwinder works well for crashes, but at the price of
demolishing ftrace/perf utility.

	-Mike

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13  3:03   ` Mike Galbraith
@ 2017-07-13  4:15     ` Andi Kleen
  2017-07-13  4:28       ` Mike Galbraith
  0 siblings, 1 reply; 60+ messages in thread
From: Andi Kleen @ 2017-07-13  4:15 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andi Kleen, Josh Poimboeuf, x86, linux-kernel, live-patching,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra

On Thu, Jul 13, 2017 at 05:03:00AM +0200, Mike Galbraith wrote:
> On Wed, 2017-07-12 at 15:30 -0700, Andi Kleen wrote:
> > Josh Poimboeuf <jpoimboe@redhat.com> writes:
> > >
> > > The ORC data format does have a few downsides compared to DWARF.  The
> > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > >
> > Can we have an option to just use dwarf instead? For people
> > who don't want to waste a MB+ to solve a problem that doesn't
> > exist (as proven by many years of opensuse kernel experience)
> 
> Sure the dwarf unwinder works well for crashes, but at the price of
> demolishing ftrace/perf utility.

You mean the unwind performance?

That's a valid concern, but neither ORC nor dwarf are likely
to address it. However most usages of ftrace/perf shouldn't be that
depending on unwind performance -- just lower the frequency of your
events. 

The only possible win is if the win from not using FP code is
significant enough. On the x86 side the only modern CPUs that should really
care about this are Atoms.

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13  4:15     ` Andi Kleen
@ 2017-07-13  4:28       ` Mike Galbraith
  2017-07-13  4:40         ` Andi Kleen
  0 siblings, 1 reply; 60+ messages in thread
From: Mike Galbraith @ 2017-07-13  4:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Josh Poimboeuf, x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra

On Wed, 2017-07-12 at 21:15 -0700, Andi Kleen wrote:
> On Thu, Jul 13, 2017 at 05:03:00AM +0200, Mike Galbraith wrote:
> > On Wed, 2017-07-12 at 15:30 -0700, Andi Kleen wrote:
> > > Josh Poimboeuf <jpoimboe@redhat.com> writes:
> > > >
> > > > The ORC data format does have a few downsides compared to DWARF.  The
> > > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > > >
> > > Can we have an option to just use dwarf instead? For people
> > > who don't want to waste a MB+ to solve a problem that doesn't
> > > exist (as proven by many years of opensuse kernel experience)
> > 
> > Sure, the dwarf unwinder works well for crashes, but at the price of
> > demolishing ftrace/perf utility.
> 
> You mean the unwind performance?

Yeah, it hurts... massively; it has even been known to kill big boxen.

> That's a valid concern, but neither ORC nor dwarf is likely
> to address it. However, most usages of ftrace/perf shouldn't be that
> dependent on unwind performance -- just lower the frequency of your
> events.
> 
> The only possible win is if the gain from not using FP code is
> significant enough. On the x86 side, the only modern CPUs that should
> really care about this are Atoms.

Nope, they all care.  Measure performance delta of fast/light stuff.

Maybe I'm expecting too much good stuff to follow, but don't spoil it
for me, I think I'm looking at a real winner :)

	-Mike

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 22:47   ` Josh Poimboeuf
@ 2017-07-13  4:29     ` Andi Kleen
  2017-07-13 13:15       ` Josh Poimboeuf
  2017-07-13  9:29     ` Ingo Molnar
  1 sibling, 1 reply; 60+ messages in thread
From: Andi Kleen @ 2017-07-13  4:29 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andi Kleen, x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On Wed, Jul 12, 2017 at 05:47:59PM -0500, Josh Poimboeuf wrote:
> On Wed, Jul 12, 2017 at 03:30:31PM -0700, Andi Kleen wrote:
> > Josh Poimboeuf <jpoimboe@redhat.com> writes:
> > >
> > > The ORC data format does have a few downsides compared to DWARF.  The
> > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > >
> > Can we have an option to just use dwarf instead? For people
> > who don't want to waste a MB+ to solve a problem that doesn't
> > exist (as proven by many years of opensuse kernel experience)
> > 
> > As far as I can tell this whole thing has only downsides compared
> > to the dwarf unwinder that was earlier proposed. I don't see
> > a single advantage.
> 
> Improved speed, reliability, maintainability.  Are those not advantages?

Ok. We'll see how it works out.

The memory overhead is quite bad though. You're basically undoing many
years of effort to shrink kernel text. I hope this can still be
done better.

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13  4:28       ` Mike Galbraith
@ 2017-07-13  4:40         ` Andi Kleen
  2017-07-13  5:22           ` Mike Galbraith
  2017-07-13 12:02           ` Jiri Kosina
  0 siblings, 2 replies; 60+ messages in thread
From: Andi Kleen @ 2017-07-13  4:40 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andi Kleen, Josh Poimboeuf, x86, linux-kernel, live-patching,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra

On Thu, Jul 13, 2017 at 06:28:43AM +0200, Mike Galbraith wrote:
> On Wed, 2017-07-12 at 21:15 -0700, Andi Kleen wrote:
> > On Thu, Jul 13, 2017 at 05:03:00AM +0200, Mike Galbraith wrote:
> > > On Wed, 2017-07-12 at 15:30 -0700, Andi Kleen wrote:
> > > > Josh Poimboeuf <jpoimboe@redhat.com> writes:
> > > > >
> > > > > The ORC data format does have a few downsides compared to DWARF.  The
> > > > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > > > >
> > > > Can we have an option to just use dwarf instead? For people
> > > > who don't want to waste a MB+ to solve a problem that doesn't
> > > > exist (as proven by many years of opensuse kernel experience)
> > > 
> > > Sure, the dwarf unwinder works well for crashes, but at the price of
> > > demolishing ftrace/perf utility.
> > 
> > You mean the unwind performance?
> 
> Yeah, it hurts... massively; it has even been known to kill big boxen.

Why was that? 

> 
> > That's a valid concern, but neither ORC nor dwarf is likely
> > to address it. However, most usages of ftrace/perf shouldn't be that
> > dependent on unwind performance -- just lower the frequency of your
> > events.
> > 
> > The only possible win is if the gain from not using FP code is
> > significant enough. On the x86 side, the only modern CPUs that should
> > really care about this are Atoms.
> 
> Nope, they all care.  Measure performance delta of fast/light stuff.

Well, if your test cares that much about function overhead, you may
want to try LTO. It can get rid of a lot of functions by doing
cross-file inlining.

https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=lto-411-2

> Maybe I'm expecting too much good stuff to follow, but don't spoil it
> for me, I think I'm looking at a real winner :)

It's somewhat surprising. It would be good to understand why that
happens. Is it icache misses, data cache misses for the stack, or
simply more instructions executed, or worse tail calls?

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13  4:40         ` Andi Kleen
@ 2017-07-13  5:22           ` Mike Galbraith
  2017-07-13 12:02           ` Jiri Kosina
  1 sibling, 0 replies; 60+ messages in thread
From: Mike Galbraith @ 2017-07-13  5:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Josh Poimboeuf, x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra

On Wed, 2017-07-12 at 21:40 -0700, Andi Kleen wrote:
> On Thu, Jul 13, 2017 at 06:28:43AM +0200, Mike Galbraith wrote:
> > On Wed, 2017-07-12 at 21:15 -0700, Andi Kleen wrote:
> > > On Thu, Jul 13, 2017 at 05:03:00AM +0200, Mike Galbraith wrote:
> > > > On Wed, 2017-07-12 at 15:30 -0700, Andi Kleen wrote:
> > > > > Josh Poimboeuf <jpoimboe@redhat.com> writes:
> > > > > >
> > > > > > The ORC data format does have a few downsides compared to DWARF.  The
> > > > > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > > > > >
> > > > > Can we have an option to just use dwarf instead? For people
> > > > > who don't want to waste a MB+ to solve a problem that doesn't
> > > > > exist (as proven by many years of opensuse kernel experience)
> > > > 
> > > > Sure, the dwarf unwinder works well for crashes, but at the price of
> > > > demolishing ftrace/perf utility.
> > > 
> > > You mean the unwind performance?
> > 
> > Yeah, it hurts... massively; it has even been known to kill big boxen.
> 
> Why was that?

Presuming you mean the big box bit, danged if I know, I haven't
personally met that, only the massive overhead.

> > > That's a valid concern, but neither ORC nor dwarf is likely
> > > to address it. However, most usages of ftrace/perf shouldn't be that
> > > dependent on unwind performance -- just lower the frequency of your
> > > events.
> > > 
> > > The only possible win is if the gain from not using FP code is
> > > significant enough. On the x86 side, the only modern CPUs that should
> > > really care about this are Atoms.
> > 
> > Nope, they all care.  Measure performance delta of fast/light stuff.
> 
> Well, if your test cares that much about function overhead, you may
> want to try LTO. It can get rid of a lot of functions by doing
> cross-file inlining.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=lto-411-2
> 
> > Maybe I'm expecting too much good stuff to follow, but don't spoil it
> > for me, I think I'm looking at a real winner :)
> 
> It's somewhat surprising. It would be good to understand why that
> happens. Is it icache misses, data cache misses for the stack, or
> simply more instructions executed, or worse tail calls?

No idea.  It was speculated that it was register loss, but I played
with that, saw nearly zero delta until I stole too many.

	-Mike

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 22:32   ` Josh Poimboeuf
  2017-07-12 22:36     ` Andres Freund
@ 2017-07-13  7:12     ` Peter Zijlstra
  2017-07-13  8:50       ` Peter Zijlstra
  1 sibling, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2017-07-13  7:12 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andres Freund, x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Mike Galbraith

On Wed, Jul 12, 2017 at 05:32:25PM -0500, Josh Poimboeuf wrote:
> If you want perf to be able to use ORC instead of DWARF for user space
> binaries, that's not currently possible, though I don't see any
> technical blockers for doing so.  Perf would need to be taught to read
> ORC data.

So the problem with userspace stuff is that the unwind data isn't
readily available from NMI context.

So the kernel unwinder will trigger a fault and abort.

The very best we can hope for is using the EH [*] stuff that all
binaries actually have _and_ map. The only problem is that most programs
don't actually use the EH stuff much, so while it's mapped, it's not
actually paged in, so we're still stuck.

[*] C++ ABI requires EH bits for stack unwinding for exception handling
and the like, and because C++ can unwind through C code, C ABI also
mandates EH bits be present.


ORC doesn't much change this. What is currently an option is for perf to
simply copy out the top n-Kb of the stack for each sample (talk about
expensive) and then have userspace unwind it. And for userspace
unwinding in userspace, libunwind and the like are fine; I see absolutely
no reason to use ORC bits here.
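
FWIW that stack-copy mode already exists in the perf ABI as
PERF_SAMPLE_STACK_USER (it's what 'perf record --call-graph dwarf'
uses).  A minimal sketch of asking for it -- the register mask and
sizes are illustrative only:

  #include <linux/perf_event.h>
  #include <asm/perf_regs.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int open_stack_copy_event(pid_t pid)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CPU_CYCLES;
          attr.sample_period = 100000;
          attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_REGS_USER |
                             PERF_SAMPLE_STACK_USER;

          /* Registers the unwinder needs to get started; a real DWARF
           * unwind would ask for more than these three: */
          attr.sample_regs_user = (1ULL << PERF_REG_X86_IP) |
                                  (1ULL << PERF_REG_X86_SP) |
                                  (1ULL << PERF_REG_X86_BP);

          /* Copy out the top 8KB of the user stack with each sample: */
          attr.sample_stack_user = 8192;

          return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
  }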

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13  7:12     ` Peter Zijlstra
@ 2017-07-13  8:50       ` Peter Zijlstra
  2017-07-13  8:51         ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2017-07-13  8:50 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andres Freund, x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Mike Galbraith

On Thu, Jul 13, 2017 at 09:12:53AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 12, 2017 at 05:32:25PM -0500, Josh Poimboeuf wrote:
> > If you want perf to be able to use ORC instead of DWARF for user space
> > binaries, that's not currently possible, though I don't see any
> > technical blockers for doing so.  Perf would need to be taught to read
> > ORC data.
> 
> So the problem with userspace stuff is that the unwind data isn't
> readily available from NMI context.
> 
> So the kernel unwinder will trigger a fault and abort.
> 
> The very best we can hope for is using the EH [*] stuff that all
> binaries actually have _and_ map. The only problem is that most programs
> don't actually use the EH stuff much, so while it's mapped, it's not
> actually paged in, so we're still stuck.

One gloriously ugly hack would be to delay the userspace unwind to
return-to-userspace, at which point we have a schedulable context and
can take faults.

Of course, then you have to somehow identify this later unwind sample
with all relevant prior samples and stitch the whole thing back
together, but that should be doable.

In fact, it would be at all hard to do, just queue a task_work from the
NMI and have that do the EH based unwind.
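
Something with roughly this shape -- a sketch only, the callback name
is made up, the sample bookkeeping is hand-waved, and real code would
want a per-event callback_head instead of a single static one:

  #include <linux/task_work.h>
  #include <linux/sched.h>

  /* Runs in task context on the way back to userspace, where the EH
   * data can be faulted in: */
  static void deferred_user_unwind(struct callback_head *head)
  {
          /* ... do the EH based unwind, emit a user backtrace event
           * that tooling can stitch to the earlier NMI samples ... */
  }

  static struct callback_head unwind_work = {
          .func = deferred_user_unwind,
  };

  /* Called from the PMI/NMI handler: */
  static void queue_deferred_unwind(void)
  {
          /*
           * task_work_add() queues with a lockless cmpxchg, which is
           * why queueing from NMI context is plausible at all; the
           * notification side would still need an audit.
           */
          task_work_add(current, &unwind_work, true);
  }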

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13  8:50       ` Peter Zijlstra
@ 2017-07-13  8:51         ` Peter Zijlstra
  2017-07-13  9:19           ` Ingo Molnar
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2017-07-13  8:51 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andres Freund, x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Mike Galbraith

On Thu, Jul 13, 2017 at 10:50:15AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 13, 2017 at 09:12:53AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 12, 2017 at 05:32:25PM -0500, Josh Poimboeuf wrote:
> > > If you want perf to be able to use ORC instead of DWARF for user space
> > > binaries, that's not currently possible, though I don't see any
> > > technical blockers for doing so.  Perf would need to be taught to read
> > > ORC data.
> > 
> > So the problem with userspace stuff is that the unwind data isn't
> > readily available from NMI context.
> > 
> > So the kernel unwinder will trigger a fault and abort.
> > 
> > The very best we can hope for is using the EH [*] stuff that all
> > binaries actually have _and_ map. The only problem is that most programs
> > don't actually use the EH stuff much, so while it's mapped, it's not
> > actually paged in, so we're still stuck.
> 
> One gloriously ugly hack would be to delay the userspace unwind to
> return-to-userspace, at which point we have a schedulable context and
> can take faults.
> 
> Of course, then you have to somehow identify this later unwind sample
> with all relevant prior samples and stitch the whole thing back
> together, but that should be doable.
> 
> In fact, it would be at all hard to do, just queue a task_work from the

+not

> NMI and have that do the EH based unwind.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13  8:51         ` Peter Zijlstra
@ 2017-07-13  9:19           ` Ingo Molnar
  2017-07-13 12:17             ` Josh Poimboeuf
  2017-07-25 11:55             ` [RFC] perf: Delayed userspace unwind (Was: [PATCH v3 00/10] x86: ORC unwinder) Peter Zijlstra
  0 siblings, 2 replies; 60+ messages in thread
From: Ingo Molnar @ 2017-07-13  9:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Poimboeuf, Andres Freund, x86, linux-kernel, live-patching,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, H. Peter Anvin,
	Mike Galbraith, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin


* Peter Zijlstra <peterz@infradead.org> wrote:

> > One gloriously ugly hack would be to delay the userspace unwind to 
> > return-to-userspace, at which point we have a schedulable context and can take 
> > faults.

I don't think it's ugly, and it has various advantages:

> > Of course, then you have to somehow identify this later unwind sample with all 
> > relevant prior samples and stitch the whole thing back together, but that 
> > should be doable.
> > 
> > In fact, it would not be at all hard to do, just queue a task_work from the 
> > NMI and have that do the EH based unwind.

This would have a couple of advantages:

 - as you mention, being able to fault in debug info and generally do 
   IO/scheduling,

 - profiling overhead would be accounted to the task context that generates it,
   not the NMI context,

 - there would be a natural batching/coalescing optimization if multiple events
   hit the same system call: the user-space backtrace would only have to be looked 
   up once for all samples that got collected.

This could be done by separating the user-space backtrace into a separate event, 
and perf tooling would then apply the same user-space backtrace to all prior 
kernel samples.

I.e. the ring-buffer would have trace entries like:

 [ kernel sample #1, with kernel backtrace #1 ]
 [ kernel sample #2, with kernel backtrace #2 ]
 [ kernel sample #3, with kernel backtrace #3 ]
 [ user-space backtrace #1 at syscall return ]
 ...

Note how the three kernel samples didn't have to do any user-space unwinding at 
all, so the user-space unwinding overhead got reduced by a factor of 3.

Tooling would know that 'user-space backtrace #1' applies to the previous three 
kernel samples.
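
The pairing could be as simple as a shared cookie in both record types
(layout entirely hypothetical, this is not an existing perf ABI):

  #include <linux/types.h>

  struct kernel_sample {
          __u64 cookie;           /* matches a later user backtrace */
          __u64 nr_kernel_ips;
          __u64 ips[];            /* captured in NMI context */
  };

  struct user_callchain {
          __u64 cookie;           /* same value as the kernel samples */
          __u64 nr_user_ips;
          __u64 ips[];            /* captured at syscall return */
  };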

Or so?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 22:47   ` Josh Poimboeuf
  2017-07-13  4:29     ` Andi Kleen
@ 2017-07-13  9:29     ` Ingo Molnar
  1 sibling, 0 replies; 60+ messages in thread
From: Ingo Molnar @ 2017-07-13  9:29 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andi Kleen, x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith, Thomas Gleixner, Peter Zijlstra


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Wed, Jul 12, 2017 at 03:30:31PM -0700, Andi Kleen wrote:
> > Josh Poimboeuf <jpoimboe@redhat.com> writes:
> > >
> > > The ORC data format does have a few downsides compared to DWARF.  The
> > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > >
> > Can we have an option to just use dwarf instead? For people
> > who don't want to waste a MB+ to solve a problem that doesn't
> > exist (as proven by many years of opensuse kernel experience)
> > 
> > As far as I can tell this whole thing has only downsides compared
> > to the dwarf unwinder that was earlier proposed. I don't see
> > a single advantage.
> 
> Improved speed, reliability, maintainability.  Are those not advantages?

Exactly, and all these advantages of the ORC debuginfo over DWARF debuginfo are 
enabled by an unwinding optimized data format that the kernel project generates, 
controls and is able to trust inherently.

DWARF generated by external tooling can just never reach that level of trust, 
without insane amounts of formal verification.

Even if ORC was _slower_ its reliability would be reason enough to merge. The fact 
that it's 20-40 times faster than the DWARF unwinder is really just icing on the 
cake.

BTW., as a side note, (and I hope my optimism isn't premature), I believe the ORC 
unwinder is a prime example of where Linus's stubborness resisting poor concepts 
paid off in the long run: had we merged the DWARF unwinder years ago we'd never 
have gained the ORC unwinder. We quite literally had to wait over a decade, but 
good things happened in the end.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13  4:40         ` Andi Kleen
  2017-07-13  5:22           ` Mike Galbraith
@ 2017-07-13 12:02           ` Jiri Kosina
  1 sibling, 0 replies; 60+ messages in thread
From: Jiri Kosina @ 2017-07-13 12:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mike Galbraith, Josh Poimboeuf, x86, linux-kernel, live-patching,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra

On Wed, 12 Jul 2017, Andi Kleen wrote:

> It's somewhat surprising. It would be good to understand why that
> happens. Is it icache misses, data cache misses for the stack, or simply
> more instructions executed, or worse tail calls?

http://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13  9:19           ` Ingo Molnar
@ 2017-07-13 12:17             ` Josh Poimboeuf
  2017-07-13 12:21               ` Josh Poimboeuf
  2017-07-14  8:29               ` Ingo Molnar
  2017-07-25 11:55             ` [RFC] perf: Delayed userspace unwind (Was: [PATCH v3 00/10] x86: ORC unwinder) Peter Zijlstra
  1 sibling, 2 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-13 12:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andres Freund, x86, linux-kernel, live-patching,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, H. Peter Anvin,
	Mike Galbraith, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin

On Thu, Jul 13, 2017 at 11:19:11AM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > One gloriously ugly hack would be to delay the userspace unwind to 
> > > return-to-userspace, at which point we have a schedulable context and can take 
> > > faults.
> 
> I don't think it's ugly, and it has various advantages:
> 
> > > Of course, then you have to somehow identify this later unwind sample with all 
> > > relevant prior samples and stitch the whole thing back together, but that 
> > > should be doable.
> > > 
> > > In fact, it would not be at all hard to do, just queue a task_work from the 
> > > NMI and have that do the EH based unwind.
> 
> This would have a couple of advantages:
> 
>  - as you mention, being able to fault in debug info and generally do 
>    IO/scheduling,
> 
>  - profiling overhead would be accounted to the task context that generates it,
>    not the NMI context,
> 
>  - there would be a natural batching/coalescing optimization if multiple events
>    hit the same system call: the user-space backtrace would only have to be looked 
>    up once for all samples that got collected.
> 
> This could be done by separating the user-space backtrace into a separate event, 
> and perf tooling would then apply the same user-space backtrace to all prior 
> kernel samples.
> 
> I.e. the ring-buffer would have trace entries like:
> 
>  [ kernel sample #1, with kernel backtrace #1 ]
>  [ kernel sample #2, with kernel backtrace #2 ]
>  [ kernel sample #3, with kernel backtrace #3 ]
>  [ user-space backtrace #1 at syscall return ]
>  ...
> 
> Note how the three kernel samples didn't have to do any user-space unwinding at 
> all, so the user-space unwinding overhead got reduced by a factor of 3.
> 
> Tooling would know that 'user-space backtrace #1' applies to the previous three 
> kernel samples.
> 
> Or so?

BTW, while we're throwing out ideas for this, here's another idea,
though it's almost certainly not a good one :-)

For user space stack unwinding, the kernel could emulate what the kernel
'guess' unwinder does by scanning the user space stack and returning all
the text addresses it finds.

The results wouldn't be 100% accurate, but they could end up being
useful over time.
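
Roughly something like this, say -- a sketch only, glossing over
mmap_sem locking, uaccess setup and stack bounds:

  #include <linux/mm.h>
  #include <linux/sched.h>
  #include <linux/uaccess.h>

  /* Scan the user stack word by word and keep anything that points
   * into an executable mapping: */
  static int guess_user_stack(unsigned long sp, unsigned long *ips, int max)
  {
          int n = 0;
          unsigned long word;

          while (n < max && !__get_user(word, (unsigned long __user *)sp)) {
                  struct vm_area_struct *vma = find_vma(current->mm, word);

                  if (vma && vma->vm_start <= word &&
                      (vma->vm_flags & VM_EXEC))
                          ips[n++] = word;
                  sp += sizeof(long);
          }
          return n;
  }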

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13 12:17             ` Josh Poimboeuf
@ 2017-07-13 12:21               ` Josh Poimboeuf
  2017-07-13 12:35                 ` Josh Poimboeuf
  2017-07-14  8:29               ` Ingo Molnar
  1 sibling, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-13 12:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andres Freund, x86, linux-kernel, live-patching,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, H. Peter Anvin,
	Mike Galbraith, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin

On Thu, Jul 13, 2017 at 07:17:55AM -0500, Josh Poimboeuf wrote:
> BTW, while we're throwing out ideas for this, here's another idea,
> though it's almost certainly not a good one :-)
> 
> For user space stack unwinding, the kernel could emulate what the kernel
> 'guess' unwinder does by scanning the user space stack and returning all
> the text addresses it finds.
> 
> The results wouldn't be 100% accurate, but they could end up being
> useful over time.

And to expound further on the bad idea, maybe the "bad" addresses could
be filtered out somehow in post-processing (insert lots of hand waving).

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13 12:21               ` Josh Poimboeuf
@ 2017-07-13 12:35                 ` Josh Poimboeuf
  2017-07-14  8:33                   ` Ingo Molnar
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-13 12:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andres Freund, x86, linux-kernel, live-patching,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, H. Peter Anvin,
	Mike Galbraith, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin

On Thu, Jul 13, 2017 at 07:21:15AM -0500, Josh Poimboeuf wrote:
> On Thu, Jul 13, 2017 at 07:17:55AM -0500, Josh Poimboeuf wrote:
> > BTW, while we're throwing out ideas for this, here's another idea,
> > though it's almost certainly not a good one :-)
> > 
> > For user space stack unwinding, the kernel could emulate what the kernel
> > 'guess' unwinder does by scanning the user space stack and returning all
> > the text addresses it finds.

To clarify, text address would mean any address in a VMA with the
executable bit set.

> > The results wouldn't be 100% accurate, but they could end up being
> > useful over time.
> 
> And to expound further on the bad idea, maybe the "bad" addresses could
> be filtered out somehow in post-processing (insert lots of hand waving).

And some details on the post-processing: in most cases it should be
possible to determine which of the found stack addresses are valid by
looking at the call instructions immediately preceding the stack text
addresses, and making sure the call target points to the same function
as the previously found address.  But of course that wouldn't work for
indirect calls.
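
For the direct-call case, the check could look something like this --
a sketch only, where read_text_bytes() is a made-up helper that reads
the profiled binary's text, and only the 5-byte 'call rel32' (0xe8)
encoding is handled:

  #include <stdbool.h>
  #include <string.h>

  extern int read_text_bytes(void *dst, unsigned long addr, size_t len);

  /* Does the instruction just before a candidate return address look
   * like a direct call into the function containing the previously
   * found address? */
  static bool plausible_return_addr(unsigned long ret_addr,
                                    unsigned long prev_func_start,
                                    unsigned long prev_func_end)
  {
          unsigned char insn[5];
          unsigned long target;
          int rel;

          if (read_text_bytes(insn, ret_addr - 5, sizeof(insn)))
                  return false;

          if (insn[0] != 0xe8)    /* call rel32 */
                  return false;

          memcpy(&rel, &insn[1], sizeof(rel));
          target = ret_addr + rel;
          return target >= prev_func_start && target < prev_func_end;
  }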

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13  4:29     ` Andi Kleen
@ 2017-07-13 13:15       ` Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-13 13:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On Wed, Jul 12, 2017 at 09:29:17PM -0700, Andi Kleen wrote:
> On Wed, Jul 12, 2017 at 05:47:59PM -0500, Josh Poimboeuf wrote:
> > On Wed, Jul 12, 2017 at 03:30:31PM -0700, Andi Kleen wrote:
> > > Josh Poimboeuf <jpoimboe@redhat.com> writes:
> > > >
> > > > The ORC data format does have a few downsides compared to DWARF.  The
> > > > ORC unwind tables take up ~1MB more memory than DWARF eh_frame tables.
> > > >
> > > Can we have an option to just use dwarf instead? For people
> > > who don't want to waste a MB+ to solve a problem that doesn't
> > > exist (as proven by many years of opensuse kernel experience)
> > > 
> > > As far as I can tell this whole thing has only downsides compared
> > > to the dwarf unwinder that was earlier proposed. I don't see
> > > a single advantage.
> > 
> > Improved speed, reliability, maintainability.  Are those not advantages?
> 
> Ok. We'll see how it works out.
> 
> The memory overhead is quite bad though. You're basically undoing many
> years of effort to shrink kernel text. I hope this can still be
> done better.

If we're talking *text*, this further shrinks text size by 3% because
frame pointers can be disabled.

As far as the data size goes, is anyone *truly* impacted by that extra
1MB or so?  If you're enabling a DWARF/ORC unwinder, you're already
signing up for a few extra megs anyway.

I do have a vague idea about how to reduce the data size, if/when the
size becomes a problem.  Basically there's a *lot* of duplication in the
ORC data:

  $ tools/objtool/objtool orc dump vmlinux | wc -l
  311095

  $ tools/objtool/objtool orc dump vmlinux |cut -d' ' -f2- |sort |uniq |wc -l
  345

So that's over 300,000 6-byte entries, only 345 of which are unique.
There should be a way to compress that.  However, it will probably
require sacrificing some combination of speed and simplicity.
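
For example -- just a sketch of the idea, not something this patch
does: with only ~345 unique entries, the 6-byte orc_entry per address
could become a 2-byte index into a table of unique entries, at the
cost of one extra dependent load per lookup:

  #include <linux/types.h>
  #include <asm/orc_types.h>

  /* The ~345 unique entries, stored once: */
  static struct orc_entry orc_dedup_table[345];

  /* One 2-byte index per .orc_unwind_ip entry instead of a 6-byte
   * orc_entry (~300k entries: ~1.8MB -> ~0.6MB): */
  static u16 orc_index[311095];

  static struct orc_entry *orc_find_dedup(unsigned int ip_slot)
  {
          return &orc_dedup_table[orc_index[ip_slot]];
  }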

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13 12:17             ` Josh Poimboeuf
  2017-07-13 12:21               ` Josh Poimboeuf
@ 2017-07-14  8:29               ` Ingo Molnar
  1 sibling, 0 replies; 60+ messages in thread
From: Ingo Molnar @ 2017-07-14  8:29 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Andres Freund, x86, linux-kernel, live-patching,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, H. Peter Anvin,
	Mike Galbraith, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> For user space stack unwinding, the kernel could emulate what the kernel
> 'guess' unwinder does by scanning the user space stack and returning all
> the text addresses it finds.

User-space stacks tend to be much larger than kernel stacks, the cost of doing 
such a full scan on every PMI would kill a lot of profiling workloads.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-13 12:35                 ` Josh Poimboeuf
@ 2017-07-14  8:33                   ` Ingo Molnar
  0 siblings, 0 replies; 60+ messages in thread
From: Ingo Molnar @ 2017-07-14  8:33 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Andres Freund, x86, linux-kernel, live-patching,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, H. Peter Anvin,
	Mike Galbraith, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> > > The results wouldn't be 100% accurate, but they could end up being useful 
> > > over time.
> > 
> > And to expound further on the bad idea, maybe the "bad" addresses could be 
> > filtered out somehow in post-processing (insert lots of hand waving).
> 
> And some details on the post-processing: in most cases it should be possible to 
> determine which of the found stack addresses are valid by looking at the call 
> instructions immediately preceding the stack text addresses, and making sure the 
> call target points to the same function as the previously found address.  But of 
> course that wouldn't work for indirect calls.

I believe this is similar to how OProfile did graph/dwarf profiling, by saving a 
copy of the stack and post-processing it.

To the best of my recollection (but I haven't used OProfile that much),
it was a performance nightmare, was limited (because it only saved part
of the stack), and was rather fragile as well, because it depended on
the task VM being post-processable.

I think the highest quality implementation is to generate the call trace either in 
hardware (LBR), or as close to the event as possible: generate the kernel call 
chain in the PMI context, and the user-space call chain before user-space executes 
again (at the latest). Call chain generation should be roughly O(chain_depth), 
which both FP and ORC ensure.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-12 19:27     ` Ingo Molnar
@ 2017-07-14 17:17       ` Josh Poimboeuf
  2017-07-25  9:09         ` Ingo Molnar
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-14 17:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

On Wed, Jul 12, 2017 at 09:27:50PM +0200, Ingo Molnar wrote:
> Maybe we could offer a menu of unwinders - i.e. make the whole Kconfig interface a 
> bit nicer:
> 
>   CONFIG_UNWINDER_FRAME_POINTER
>   CONFIG_UNWINDER_ORC
>   CONFIG_UNWINDER_GUESS
> 
> ... or so?

So far I haven't been able to figure out how to make the above three
options into a multiple choice selection, such that allnoconfig selects
CONFIG_UNWINDER_GUESS and alldefconfig selects
CONFIG_UNWINDER_FRAME_POINTER.

I *think* I should be able to do it by setting the choice default to
FRAME_POINTER, and setting the 'allnoconfig_y' option for
UNWINDER_GUESS.  But kconfig apparently doesn't support 'allnoconfig_y'
for choice selections yet.  I may need to modify kconfig for that.
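
For reference, the choice block in question would look roughly like
this (a sketch only; the allnoconfig_y wrinkle above still applies):

  choice
          prompt "Choose kernel unwinder"
          default UNWINDER_FRAME_POINTER

  config UNWINDER_FRAME_POINTER
          bool "Frame pointer unwinder"
          select FRAME_POINTER

  config UNWINDER_ORC
          bool "ORC unwinder"
          depends on X86_64
          select STACK_VALIDATION

  config UNWINDER_GUESS
          bool "Guess unwinder"

  endchoice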

But IMO, this change can come later, and the current patches should be
fine to merge as-is.  And it might make sense to delay such a patch
anyway, see below.

> Default would be the historic FRAME_POINTER, at least initially, I think.
> 
> I wouldn't mind making CONFIG_UNWINDER_ORC the new default either, due to the 
> non-trivial speedup it offers - but maybe folks would object?

Personally I wouldn't have an objection to making ORC the default,
though we should probably wait to give it some burn-in time first.

If we *do* decide to eventually make it the default, we could flip the
switch at the same time we introduce the multiple-choice config and
rename described above.  That way, users of "make oldconfig" would see
the change and would be encouraged to switch to ORC.

> > > CONFIG_FRAME_POINTERS et al would be left for architectures where it has a meaning 
> > > beyond backtrace generation. (Not sure whether there's any such architectures.)
> > 
> > Well, on x86, hardened usercopy relies on frame pointers, but not the
> > unwinder.  It does the frame pointer walk manually to avoid the full
> > unwinder overhead.  See arch_within_stack_frames().
> 
> Oh well...
> 
> > Ok, how about:
> > 
> >   "Orc unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig
> >   kernel) than DWARF eh_frame tables."
> > 
> > (My previous 1MB number was from my distro-based config, and it also
> > forgot to take into account the fast lookup table (".orc_lookup")).
> 
> Sounds good to me!

Ok, I'll post a v3.1 of patch 9 with the changed wording.

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v3.1 09/10] x86/unwind: add ORC unwinder
  2017-07-11 15:33 ` [PATCH v3 09/10] x86/unwind: add ORC unwinder Josh Poimboeuf
@ 2017-07-14 17:22   ` Josh Poimboeuf
  2017-07-20  7:12     ` Jiri Slaby
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-14 17:22 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

Add a new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER.  It
plugs into the existing x86 unwinder framework.

It relies on objtool to generate the needed .orc_unwind and
.orc_unwind_ip sections.

For more details on why ORC is used instead of DWARF, see
Documentation/x86/orc-unwinder.txt.

Thanks to Andy Lutomirski for the performance improvement ideas:
splitting the ORC unwind table into two parallel arrays and creating a
fast lookup table to search a subset of the unwind table.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
Slightly tweaked documentation wording, as recommended by Ingo:

  "ORC unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
   than DWARF-based eh_frame tables."

 Documentation/x86/orc-unwinder.txt | 179 ++++++++++++
 arch/um/include/asm/unwind.h       |   8 +
 arch/x86/Kconfig                   |   1 +
 arch/x86/Kconfig.debug             |  25 ++
 arch/x86/include/asm/module.h      |   9 +
 arch/x86/include/asm/orc_lookup.h  |  46 +++
 arch/x86/include/asm/orc_types.h   |   2 +-
 arch/x86/include/asm/unwind.h      |  76 +++--
 arch/x86/kernel/Makefile           |   8 +-
 arch/x86/kernel/module.c           |  11 +-
 arch/x86/kernel/setup.c            |   3 +
 arch/x86/kernel/unwind_frame.c     |  39 ++-
 arch/x86/kernel/unwind_guess.c     |   5 +
 arch/x86/kernel/unwind_orc.c       | 576 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S      |   3 +
 include/asm-generic/vmlinux.lds.h  |  27 +-
 lib/Kconfig.debug                  |   3 +
 scripts/Makefile.build             |  14 +-
 18 files changed, 971 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/x86/orc-unwinder.txt
 create mode 100644 arch/um/include/asm/unwind.h
 create mode 100644 arch/x86/include/asm/orc_lookup.h
 create mode 100644 arch/x86/kernel/unwind_orc.c

diff --git a/Documentation/x86/orc-unwinder.txt b/Documentation/x86/orc-unwinder.txt
new file mode 100644
index 0000000..af0c9a4
--- /dev/null
+++ b/Documentation/x86/orc-unwinder.txt
@@ -0,0 +1,179 @@
+ORC unwinder
+============
+
+Overview
+--------
+
+The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
+similar in concept to a DWARF unwinder.  The difference is that the
+format of the ORC data is much simpler than DWARF, which in turn allows
+the ORC unwinder to be much simpler and faster.
+
+The ORC data consists of unwind tables which are generated by objtool.
+They contain out-of-band data which is used by the in-kernel ORC
+unwinder.  Objtool generates the ORC data by first doing compile-time
+stack metadata validation (CONFIG_STACK_VALIDATION).  After analyzing
+all the code paths of a .o file, it determines information about the
+stack state at each instruction address in the file and outputs that
+information to the .orc_unwind and .orc_unwind_ip sections.
+
+The per-object ORC sections are combined at link time and are sorted and
+post-processed at boot time.  The unwinder uses the resulting data to
+correlate instruction addresses with their stack states at run time.
+
+
+ORC vs frame pointers
+---------------------
+
+With frame pointers enabled, GCC adds instrumentation code to every
+function in the kernel.  The kernel's .text size increases by about
+3.2%, resulting in a broad kernel-wide slowdown.  Measurements by Mel
+Gorman [1] have shown a slowdown of 5-10% for some workloads.
+
+In contrast, the ORC unwinder has no effect on text size or runtime
+performance, because the debuginfo is out of band.  So if you disable
+frame pointers and enable the ORC unwinder, you get a nice performance
+improvement across the board, and still have reliable stack traces.
+
+Ingo Molnar says:
+
+  "Note that it's not just a performance improvement, but also an
+  instruction cache locality improvement: 3.2% .text savings almost
+  directly transform into a similarly sized reduction in cache
+  footprint. That can transform to even higher speedups for workloads
+  whose cache locality is borderline."
+
+Another benefit of ORC compared to frame pointers is that it can
+reliably unwind across interrupts and exceptions.  Frame pointer based
+unwinds can sometimes skip the caller of the interrupted function, if it
+was a leaf function or if the interrupt hit before the frame pointer was
+saved.
+
+The main disadvantage of the ORC unwinder compared to frame pointers is
+that it needs more memory to store the ORC unwind tables: roughly 2-4MB
+depending on the kernel config.
+
+
+ORC vs DWARF
+------------
+
+ORC debuginfo's advantage over DWARF itself is that it's much simpler.
+It gets rid of the complex DWARF CFI state machine and also gets rid of
+the tracking of unnecessary registers.  This allows the unwinder to be
+much simpler, meaning fewer bugs, which is especially important for
+mission critical oops code.
+
+The simpler debuginfo format also enables the unwinder to be much faster
+than DWARF, which is important for perf and lockdep.  In a basic
+performance test by Jiri Slaby [2], the ORC unwinder was about 20x
+faster than an out-of-tree DWARF unwinder.  (Note: That measurement was
+taken before some performance tweaks were added, which doubled
+performance, so the speedup over DWARF may be closer to 40x.)
+
+The ORC data format does have a few downsides compared to DWARF.  ORC
+unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
+than DWARF-based eh_frame tables.
+
+Another potential downside is that, as GCC evolves, it's conceivable
+that the ORC data may end up being *too* simple to describe the state of
+the stack for certain optimizations.  But IMO this is unlikely because
+GCC saves the frame pointer for any unusual stack adjustments it does,
+so I suspect we'll really only ever need to keep track of the stack
+pointer and the frame pointer between call frames.  But even if we do
+end up having to track all the registers DWARF tracks, at least we will
+still be able to control the format, e.g. no complex state machines.
+
+
+ORC unwind table generation
+---------------------------
+
+The ORC data is generated by objtool.  With the existing compile-time
+stack metadata validation feature, objtool already follows all code
+paths, and so it already has all the information it needs to be able to
+generate ORC data from scratch.  So it's an easy step to go from stack
+validation to ORC data generation.
+
+It should be possible to instead generate the ORC data with a simple
+tool which converts DWARF to ORC data.  However, such a solution would
+be incomplete due to the kernel's extensive use of asm, inline asm, and
+special sections like exception tables.
+
+That could be rectified by manually annotating those special code paths
+using GNU assembler .cfi annotations in .S files, and homegrown
+annotations for inline asm in .c files.  But asm annotations were tried
+in the past and were found to be unmaintainable.  They were often
+incorrect/incomplete and made the code harder to read and keep updated.
+And based on looking at glibc code, annotating inline asm in .c files
+might be even worse.
+
+Objtool still needs a few annotations, but only in code which does
+unusual things to the stack like entry code.  And even then, far fewer
+annotations are needed than what DWARF would need, so they're much more
+maintainable than DWARF CFI annotations.
+
+So the advantages of using objtool to generate ORC data are that it
+gives more accurate debuginfo, with very few annotations.  It also
+insulates the kernel from toolchain bugs which can be very painful to
+deal with in the kernel since we often have to work around issues in
+older versions of the toolchain for years.
+
+The downside is that the unwinder now becomes dependent on objtool's
+ability to reverse engineer GCC code flow.  If GCC optimizations become
+too complicated for objtool to follow, the ORC data generation might
+stop working or become incomplete.  (It's worth noting that livepatch
+already has such a dependency on objtool's ability to follow GCC code
+flow.)
+
+If newer versions of GCC come up with some optimizations which break
+objtool, we may need to revisit the current implementation.  Some
+possible solutions would be asking GCC to make the optimizations more
+palatable, or having objtool use DWARF as an additional input, or
+creating a GCC plugin to assist objtool with its analysis.  But for now,
+objtool follows GCC code quite well.
+
+
+Unwinder implementation details
+-------------------------------
+
+Objtool generates the ORC data by integrating with the compile-time
+stack metadata validation feature, which is described in detail in
+tools/objtool/Documentation/stack-validation.txt.  After analyzing all
+the code paths of a .o file, it creates an array of orc_entry structs,
+and a parallel array of instruction addresses associated with those
+structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
+respectively.
+
+The ORC data is split into the two arrays for performance reasons, to
+make the searchable part of the data (.orc_unwind_ip) more compact.  The
+arrays are sorted in parallel at boot time.
+
+Performance is further improved by the use of a fast lookup table which
+is created at runtime.  The fast lookup table associates a given address
+with a range of indices for the .orc_unwind table, so that only a small
+subset of the table needs to be searched.
+
+
+Etymology
+---------
+
+Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
+enemies.  Similarly, the ORC unwinder was created in opposition to the
+complexity and slowness of DWARF.
+
+"Although Orcs rarely consider multiple solutions to a problem, they do
+excel at getting things done because they are creatures of action, not
+thought." [3]  Similarly, unlike the esoteric DWARF unwinder, the
+veracious ORC unwinder wastes no time or siloconic effort decoding
+variable-length zero-extended unsigned-integer byte-coded
+state-machine-based debug information entries.
+
+Similar to how Orcs frequently unravel the well-intentioned plans of
+their adversaries, the ORC unwinder frequently unravels stacks with
+brutal, unyielding efficiency.
+
+ORC stands for Oops Rewind Capability.
+
+
+[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
+[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
+[3] http://dustin.wikidot.com/half-orcs-and-orcs
diff --git a/arch/um/include/asm/unwind.h b/arch/um/include/asm/unwind.h
new file mode 100644
index 0000000..7ffa543
--- /dev/null
+++ b/arch/um/include/asm/unwind.h
@@ -0,0 +1,8 @@
+#ifndef _ASM_UML_UNWIND_H
+#define _ASM_UML_UNWIND_H
+
+static inline void
+unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+		   void *orc, size_t orc_size) {}
+
+#endif /* _ASM_UML_UNWIND_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e767ed2..0dac5a0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -153,6 +153,7 @@ config X86
 	select HAVE_MEMBLOCK
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_MIXED_BREAKPOINTS_REGS
+	select HAVE_MOD_ARCH_SPECIFIC
 	select HAVE_NMI
 	select HAVE_OPROFILE
 	select HAVE_OPTPROBES
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 353ed09..dc10ec6 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -355,4 +355,29 @@ config PUNIT_ATOM_DEBUG
 	  The current power state can be read from
 	  /sys/kernel/debug/punit_atom/dev_power_state
 
+config ORC_UNWINDER
+	bool "ORC unwinder"
+	depends on X86_64
+	select STACK_VALIDATION
+	---help---
+	  This option enables the ORC (Oops Rewind Capability) unwinder for
+	  unwinding kernel stack traces.  It uses a custom data format which is
+	  a simplified version of the DWARF Call Frame Information standard.
+
+	  This unwinder is more accurate across interrupt entry frames than the
+	  frame pointer unwinder.  It can also enable a 5-10% performance
+	  improvement across the entire kernel if CONFIG_FRAME_POINTER is
+	  disabled.
+
+	  Enabling this option will increase the kernel's runtime memory usage
+	  by roughly 2-4MB, depending on your kernel config.
+
+config FRAME_POINTER_UNWINDER
+	def_bool y
+	depends on !ORC_UNWINDER && FRAME_POINTER
+
+config GUESS_UNWINDER
+	def_bool y
+	depends on !ORC_UNWINDER && !FRAME_POINTER
+
 endmenu
diff --git a/arch/x86/include/asm/module.h b/arch/x86/include/asm/module.h
index e3b7819..9eb7c71 100644
--- a/arch/x86/include/asm/module.h
+++ b/arch/x86/include/asm/module.h
@@ -2,6 +2,15 @@
 #define _ASM_X86_MODULE_H
 
 #include <asm-generic/module.h>
+#include <asm/orc_types.h>
+
+struct mod_arch_specific {
+#ifdef CONFIG_ORC_UNWINDER
+	unsigned int num_orcs;
+	int *orc_unwind_ip;
+	struct orc_entry *orc_unwind;
+#endif
+};
 
 #ifdef CONFIG_X86_64
 /* X86_64 does not define MODULE_PROC_FAMILY */
diff --git a/arch/x86/include/asm/orc_lookup.h b/arch/x86/include/asm/orc_lookup.h
new file mode 100644
index 0000000..91c8d86
--- /dev/null
+++ b/arch/x86/include/asm/orc_lookup.h
@@ -0,0 +1,46 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef _ORC_LOOKUP_H
+#define _ORC_LOOKUP_H
+
+/*
+ * This is a lookup table for speeding up access to the .orc_unwind table.
+ * Given an input address offset, the corresponding lookup table entry
+ * specifies a subset of the .orc_unwind table to search.
+ *
+ * Each block represents the end of the previous range and the start of the
+ * next range.  An extra block is added to give the last range an end.
+ *
+ * The block size should be a power of 2 to avoid a costly 'div' instruction.
+ *
+ * A block size of 256 was chosen because it roughly doubles unwinder
+ * performance while only adding ~5% to the ORC data footprint.
+ */
+#define LOOKUP_BLOCK_ORDER	8
+#define LOOKUP_BLOCK_SIZE	(1 << LOOKUP_BLOCK_ORDER)
+
+#ifndef LINKER_SCRIPT
+
+extern unsigned int orc_lookup[];
+extern unsigned int orc_lookup_end[];
+
+#define LOOKUP_START_IP		(unsigned long)_stext
+#define LOOKUP_STOP_IP		(unsigned long)_etext
+
+#endif /* LINKER_SCRIPT */
+
+#endif /* _ORC_LOOKUP_H */
diff --git a/arch/x86/include/asm/orc_types.h b/arch/x86/include/asm/orc_types.h
index 7dc777a..9c9dc57 100644
--- a/arch/x86/include/asm/orc_types.h
+++ b/arch/x86/include/asm/orc_types.h
@@ -88,7 +88,7 @@ struct orc_entry {
 	unsigned	sp_reg:4;
 	unsigned	bp_reg:4;
 	unsigned	type:2;
-};
+} __packed;
 
 /*
  * This struct is used by asm and inline asm code to manually annotate the
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index e667649..25b8d31a 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -12,11 +12,14 @@ struct unwind_state {
 	struct task_struct *task;
 	int graph_idx;
 	bool error;
-#ifdef CONFIG_FRAME_POINTER
+#if defined(CONFIG_ORC_UNWINDER)
+	bool signal, full_regs;
+	unsigned long sp, bp, ip;
+	struct pt_regs *regs;
+#elif defined(CONFIG_FRAME_POINTER)
 	bool got_irq;
-	unsigned long *bp, *orig_sp;
+	unsigned long *bp, *orig_sp, ip;
 	struct pt_regs *regs;
-	unsigned long ip;
 #else
 	unsigned long *sp;
 #endif
@@ -24,41 +27,30 @@ struct unwind_state {
 
 void __unwind_start(struct unwind_state *state, struct task_struct *task,
 		    struct pt_regs *regs, unsigned long *first_frame);
-
 bool unwind_next_frame(struct unwind_state *state);
-
 unsigned long unwind_get_return_address(struct unwind_state *state);
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state);
 
 static inline bool unwind_done(struct unwind_state *state)
 {
 	return state->stack_info.type == STACK_TYPE_UNKNOWN;
 }
 
-static inline
-void unwind_start(struct unwind_state *state, struct task_struct *task,
-		  struct pt_regs *regs, unsigned long *first_frame)
-{
-	first_frame = first_frame ? : get_stack_pointer(task, regs);
-
-	__unwind_start(state, task, regs, first_frame);
-}
-
 static inline bool unwind_error(struct unwind_state *state)
 {
 	return state->error;
 }
 
-#ifdef CONFIG_FRAME_POINTER
-
 static inline
-unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+void unwind_start(struct unwind_state *state, struct task_struct *task,
+		  struct pt_regs *regs, unsigned long *first_frame)
 {
-	if (unwind_done(state))
-		return NULL;
+	first_frame = first_frame ? : get_stack_pointer(task, regs);
 
-	return state->regs ? &state->regs->ip : state->bp + 1;
+	__unwind_start(state, task, regs, first_frame);
 }
 
+#if defined(CONFIG_ORC_UNWINDER) || defined(CONFIG_FRAME_POINTER)
 static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	if (unwind_done(state))
@@ -66,20 +58,46 @@ static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 
 	return state->regs;
 }
-
-#else /* !CONFIG_FRAME_POINTER */
-
-static inline
-unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+#else
+static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	return NULL;
 }
+#endif
 
-static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
+#ifdef CONFIG_ORC_UNWINDER
+void unwind_init(void);
+void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+			void *orc, size_t orc_size);
+#else
+static inline void unwind_init(void) {}
+static inline
+void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+			void *orc, size_t orc_size) {}
+#endif
+
+/*
+ * This disables KASAN checking when reading a value from another task's stack,
+ * since the other task could be running on another CPU and could have poisoned
+ * the stack in the meantime.
+ */
+#define READ_ONCE_TASK_STACK(task, x)			\
+({							\
+	unsigned long val;				\
+	if (task == current)				\
+		val = READ_ONCE(x);			\
+	else						\
+		val = READ_ONCE_NOCHECK(x);		\
+	val;						\
+})
+
+static inline bool task_on_another_cpu(struct task_struct *task)
 {
-	return NULL;
+#ifdef CONFIG_SMP
+	return task != current && task->on_cpu;
+#else
+	return false;
+#endif
 }
 
-#endif /* CONFIG_FRAME_POINTER */
-
 #endif /* _ASM_X86_UNWIND_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a01892b..287eac7 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -126,11 +126,9 @@ obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
 obj-$(CONFIG_TRACING)			+= tracepoint.o
 obj-$(CONFIG_SCHED_MC_PRIO)		+= itmt.o
 
-ifdef CONFIG_FRAME_POINTER
-obj-y					+= unwind_frame.o
-else
-obj-y					+= unwind_guess.o
-endif
+obj-$(CONFIG_ORC_UNWINDER)		+= unwind_orc.o
+obj-$(CONFIG_FRAME_POINTER_UNWINDER)	+= unwind_frame.o
+obj-$(CONFIG_GUESS_UNWINDER)		+= unwind_guess.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index f67bd32..62e7d70 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -35,6 +35,7 @@
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/setup.h>
+#include <asm/unwind.h>
 
 #if 0
 #define DEBUGP(fmt, ...)				\
@@ -213,7 +214,7 @@ int module_finalize(const Elf_Ehdr *hdr,
 		    struct module *me)
 {
 	const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
-		*para = NULL;
+		*para = NULL, *orc = NULL, *orc_ip = NULL;
 	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
 
 	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
@@ -225,6 +226,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 			locks = s;
 		if (!strcmp(".parainstructions", secstrings + s->sh_name))
 			para = s;
+		if (!strcmp(".orc_unwind", secstrings + s->sh_name))
+			orc = s;
+		if (!strcmp(".orc_unwind_ip", secstrings + s->sh_name))
+			orc_ip = s;
 	}
 
 	if (alt) {
@@ -248,6 +253,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 	/* make jump label nops */
 	jump_label_apply_nops(me);
 
+	if (orc && orc_ip)
+		unwind_module_init(me, (void *)orc_ip->sh_addr, orc_ip->sh_size,
+				   (void *)orc->sh_addr, orc->sh_size);
+
 	return 0;
 }
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3486d04..ecab322 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -115,6 +115,7 @@
 #include <asm/microcode.h>
 #include <asm/mmu_context.h>
 #include <asm/kaslr.h>
+#include <asm/unwind.h>
 
 /*
  * max_low_pfn_mapped: highest direct mapped pfn under 4GB
@@ -1310,6 +1311,8 @@ void __init setup_arch(char **cmdline_p)
 	if (efi_enabled(EFI_BOOT))
 		efi_apply_memmap_quirks();
 #endif
+
+	unwind_init();
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index b9389d7..7574ef5 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -10,20 +10,22 @@
 
 #define FRAME_HEADER_SIZE (sizeof(long) * 2)
 
-/*
- * This disables KASAN checking when reading a value from another task's stack,
- * since the other task could be running on another CPU and could have poisoned
- * the stack in the meantime.
- */
-#define READ_ONCE_TASK_STACK(task, x)			\
-({							\
-	unsigned long val;				\
-	if (task == current)				\
-		val = READ_ONCE(x);			\
-	else						\
-		val = READ_ONCE_NOCHECK(x);		\
-	val;						\
-})
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return 0;
+
+	return __kernel_text_address(state->ip) ? state->ip : 0;
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return NULL;
+
+	return state->regs ? &state->regs->ip : state->bp + 1;
+}
 
 static void unwind_dump(struct unwind_state *state)
 {
@@ -66,15 +68,6 @@ static void unwind_dump(struct unwind_state *state)
 	}
 }
 
-unsigned long unwind_get_return_address(struct unwind_state *state)
-{
-	if (unwind_done(state))
-		return 0;
-
-	return __kernel_text_address(state->ip) ? state->ip : 0;
-}
-EXPORT_SYMBOL_GPL(unwind_get_return_address);
-
 static size_t regs_size(struct pt_regs *regs)
 {
 	/* x86_32 regs from kernel mode are two words shorter: */
diff --git a/arch/x86/kernel/unwind_guess.c b/arch/x86/kernel/unwind_guess.c
index 039f367..4f0e17b 100644
--- a/arch/x86/kernel/unwind_guess.c
+++ b/arch/x86/kernel/unwind_guess.c
@@ -19,6 +19,11 @@ unsigned long unwind_get_return_address(struct unwind_state *state)
 }
 EXPORT_SYMBOL_GPL(unwind_get_return_address);
 
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	return NULL;
+}
+
 bool unwind_next_frame(struct unwind_state *state)
 {
 	struct stack_info *info = &state->stack_info;
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
new file mode 100644
index 0000000..9a8ad84
--- /dev/null
+++ b/arch/x86/kernel/unwind_orc.c
@@ -0,0 +1,576 @@
+#include <linux/module.h>
+#include <linux/sort.h>
+#include <asm/ptrace.h>
+#include <asm/stacktrace.h>
+#include <asm/unwind.h>
+#include <asm/orc_types.h>
+#include <asm/orc_lookup.h>
+#include <asm/sections.h>
+
+#define orc_warn(fmt, ...) \
+	printk_deferred_once(KERN_WARNING pr_fmt("WARNING: " fmt), ##__VA_ARGS__)
+
+extern int __start_orc_unwind_ip[];
+extern int __stop_orc_unwind_ip[];
+extern struct orc_entry __start_orc_unwind[];
+extern struct orc_entry __stop_orc_unwind[];
+
+static DEFINE_MUTEX(sort_mutex);
+int *cur_orc_ip_table = __start_orc_unwind_ip;
+struct orc_entry *cur_orc_table = __start_orc_unwind;
+
+unsigned int lookup_num_blocks;
+bool orc_init;
+
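+/*
+ * The .orc_unwind_ip entries are 32-bit offsets relative to their own
+ * addresses; orc_ip() converts one back to an absolute text address.
+ */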
+static inline unsigned long orc_ip(const int *ip)
+{
+	return (unsigned long)ip + *ip;
+}
+
+static struct orc_entry *__orc_find(int *ip_table, struct orc_entry *u_table,
+				    unsigned int num_entries, unsigned long ip)
+{
+	int *first = ip_table;
+	int *last = ip_table + num_entries - 1;
+	int *mid = first, *found = first;
+
+	if (!num_entries)
+		return NULL;
+
+	/*
+	 * Do a binary range search to find the rightmost duplicate of a given
+	 * starting address.  Some entries are section terminators which are
+	 * "weak" entries for ensuring there are no gaps.  They should be
+	 * ignored when they conflict with a real entry.
+	 */
+	while (first <= last) {
+		mid = first + ((last - first) / 2);
+
+		if (orc_ip(mid) <= ip) {
+			found = mid;
+			first = mid + 1;
+		} else
+			last = mid - 1;
+	}
+
+	return u_table + (found - ip_table);
+}
+
+#ifdef CONFIG_MODULES
+static struct orc_entry *orc_module_find(unsigned long ip)
+{
+	struct module *mod;
+
+	mod = __module_address(ip);
+	if (!mod || !mod->arch.orc_unwind || !mod->arch.orc_unwind_ip)
+		return NULL;
+	return __orc_find(mod->arch.orc_unwind_ip, mod->arch.orc_unwind,
+			  mod->arch.num_orcs, ip);
+}
+#else
+static struct orc_entry *orc_module_find(unsigned long ip)
+{
+	return NULL;
+}
+#endif
+
+static struct orc_entry *orc_find(unsigned long ip)
+{
+	if (!orc_init)
+		return NULL;
+
+	/* For non-init vmlinux addresses, use the fast lookup table: */
+	if (ip >= LOOKUP_START_IP && ip < LOOKUP_STOP_IP) {
+		unsigned int idx, start, stop;
+
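+		/*
+		 * orc_lookup[i] caches the index of the ORC entry covering
+		 * the start of the i-th LOOKUP_BLOCK_SIZE block of text, so
+		 * the binary search only has to scan one block's entries.
+		 */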
+		idx = (ip - LOOKUP_START_IP) / LOOKUP_BLOCK_SIZE;
+
+		if (WARN_ON_ONCE(idx >= lookup_num_blocks-1))
+			return NULL;
+
+		start = orc_lookup[idx];
+		stop = orc_lookup[idx + 1] + 1;
+
+		if (WARN_ON_ONCE(__start_orc_unwind + start >= __stop_orc_unwind ||
+				 __start_orc_unwind + stop > __stop_orc_unwind))
+			return NULL;
+
+		return __orc_find(__start_orc_unwind_ip + start,
+				  __start_orc_unwind + start, stop - start, ip);
+	}
+
+	/* vmlinux .init slow lookup: */
+	if (ip >= (unsigned long)_sinittext && ip < (unsigned long)_einittext)
+		return __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
+				  __stop_orc_unwind_ip - __start_orc_unwind_ip, ip);
+
+	/* Module lookup: */
+	return orc_module_find(ip);
+}
+
+static void orc_sort_swap(void *_a, void *_b, int size)
+{
+	struct orc_entry *orc_a, *orc_b;
+	struct orc_entry orc_tmp;
+	int *a = _a, *b = _b, tmp;
+	int delta = _b - _a;
+
+	/* Swap the .orc_unwind_ip entries: */
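+	/* (They are self-relative, so adjust each by the slot distance.) */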
+	tmp = *a;
+	*a = *b + delta;
+	*b = tmp - delta;
+
+	/* Swap the corresponding .orc_unwind entries: */
+	orc_a = cur_orc_table + (a - cur_orc_ip_table);
+	orc_b = cur_orc_table + (b - cur_orc_ip_table);
+	orc_tmp = *orc_a;
+	*orc_a = *orc_b;
+	*orc_b = orc_tmp;
+}
+
+static int orc_sort_cmp(const void *_a, const void *_b)
+{
+	struct orc_entry *orc_a;
+	const int *a = _a, *b = _b;
+	unsigned long a_val = orc_ip(a);
+	unsigned long b_val = orc_ip(b);
+
+	if (a_val > b_val)
+		return 1;
+	if (a_val < b_val)
+		return -1;
+
+	/*
+	 * The "weak" section terminator entries need to always be on the left
+	 * to ensure the lookup code skips them in favor of real entries.
+	 * These terminator entries exist to handle any gaps created by
+	 * whitelisted .o files which didn't get objtool generation.
+	 */
+	orc_a = cur_orc_table + (a - cur_orc_ip_table);
+	return orc_a->sp_reg == ORC_REG_UNDEFINED ? -1 : 1;
+}
+
+#ifdef CONFIG_MODULES
+void unwind_module_init(struct module *mod, void *_orc_ip, size_t orc_ip_size,
+			void *_orc, size_t orc_size)
+{
+	int *orc_ip = _orc_ip;
+	struct orc_entry *orc = _orc;
+	unsigned int num_entries = orc_ip_size / sizeof(int);
+
+	WARN_ON_ONCE(orc_ip_size % sizeof(int) != 0 ||
+		     orc_size % sizeof(*orc) != 0 ||
+		     num_entries != orc_size / sizeof(*orc));
+
+	/*
+	 * The 'cur_orc_*' globals allow the orc_sort_swap() callback to
+	 * associate an .orc_unwind_ip table entry with its corresponding
+	 * .orc_unwind entry so they can both be swapped.
+	 */
+	mutex_lock(&sort_mutex);
+	cur_orc_ip_table = orc_ip;
+	cur_orc_table = orc;
+	sort(orc_ip, num_entries, sizeof(int), orc_sort_cmp, orc_sort_swap);
+	mutex_unlock(&sort_mutex);
+
+	mod->arch.orc_unwind_ip = orc_ip;
+	mod->arch.orc_unwind = orc;
+	mod->arch.num_orcs = num_entries;
+}
+#endif
+
+void __init unwind_init(void)
+{
+	size_t orc_ip_size = (void *)__stop_orc_unwind_ip - (void *)__start_orc_unwind_ip;
+	size_t orc_size = (void *)__stop_orc_unwind - (void *)__start_orc_unwind;
+	size_t num_entries = orc_ip_size / sizeof(int);
+	struct orc_entry *orc;
+	int i;
+
+	if (!num_entries || orc_ip_size % sizeof(int) != 0 ||
+	    orc_size % sizeof(struct orc_entry) != 0 ||
+	    num_entries != orc_size / sizeof(struct orc_entry)) {
+		orc_warn("WARNING: Bad or missing .orc_unwind table.  Disabling unwinder.\n");
+		return;
+	}
+
+	/* Sort the .orc_unwind and .orc_unwind_ip tables: */
+	sort(__start_orc_unwind_ip, num_entries, sizeof(int), orc_sort_cmp,
+	     orc_sort_swap);
+
+	/* Initialize the fast lookup table: */
+	lookup_num_blocks = orc_lookup_end - orc_lookup;
+	for (i = 0; i < lookup_num_blocks-1; i++) {
+		orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
+				 num_entries,
+				 LOOKUP_START_IP + (LOOKUP_BLOCK_SIZE * i));
+		if (!orc) {
+			orc_warn("WARNING: Corrupt .orc_unwind table.  Disabling unwinder.\n");
+			return;
+		}
+
+		orc_lookup[i] = orc - __start_orc_unwind;
+	}
+
+	/* Initialize the ending block: */
+	orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind, num_entries,
+			 LOOKUP_STOP_IP);
+	if (!orc) {
+		orc_warn("WARNING: Corrupt .orc_unwind table.  Disabling unwinder.\n");
+		return;
+	}
+	orc_lookup[lookup_num_blocks-1] = orc - __start_orc_unwind;
+
+	orc_init = true;
+}
+
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return 0;
+
+	return __kernel_text_address(state->ip) ? state->ip : 0;
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return NULL;
+
+	if (state->regs)
+		return &state->regs->ip;
+
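+	/* The return address sits just below the previous frame's SP: */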
+	if (state->sp)
+		return (unsigned long *)state->sp - 1;
+
+	return NULL;
+}
+
+static bool stack_access_ok(struct unwind_state *state, unsigned long addr,
+			    size_t len)
+{
+	struct stack_info *info = &state->stack_info;
+
+	/*
+	 * If the address isn't on the current stack, switch to the next one.
+	 *
+	 * We may have to traverse multiple stacks to deal with the possibility
+	 * that info->next_sp could point to an empty stack and the address
+	 * could be on a subsequent stack.
+	 */
+	while (!on_stack(info, (void *)addr, len))
+		if (get_stack_info(info->next_sp, state->task, info,
+				   &state->stack_mask))
+			return false;
+
+	return true;
+}
+
+static bool deref_stack_reg(struct unwind_state *state, unsigned long addr,
+			    unsigned long *val)
+{
+	if (!stack_access_ok(state, addr, sizeof(long)))
+		return false;
+
+	*val = READ_ONCE_TASK_STACK(state->task, *(unsigned long *)addr);
+	return true;
+}
+
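+/*
+ * An "iret" frame is the tail end of pt_regs, beginning at the ip field,
+ * i.e. the part the hardware pushes on exception entry.
+ */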
+#define REGS_SIZE (sizeof(struct pt_regs))
+#define SP_OFFSET (offsetof(struct pt_regs, sp))
+#define IRET_REGS_SIZE (REGS_SIZE - offsetof(struct pt_regs, ip))
+#define IRET_SP_OFFSET (SP_OFFSET - offsetof(struct pt_regs, ip))
+
+static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
+			     unsigned long *ip, unsigned long *sp, bool full)
+{
+	size_t regs_size = full ? REGS_SIZE : IRET_REGS_SIZE;
+	size_t sp_offset = full ? SP_OFFSET : IRET_SP_OFFSET;
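+	/* For an iret frame, back up so that regs->ip lands at 'addr': */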
+	struct pt_regs *regs = (struct pt_regs *)(addr + regs_size - REGS_SIZE);
+
+	if (IS_ENABLED(CONFIG_X86_64)) {
+		if (!stack_access_ok(state, addr, regs_size))
+			return false;
+
+		*ip = regs->ip;
+		*sp = regs->sp;
+
+		return true;
+	}
+
+	if (!stack_access_ok(state, addr, sp_offset))
+		return false;
+
+	*ip = regs->ip;
+
+	if (user_mode(regs)) {
+		if (!stack_access_ok(state, addr + sp_offset,
+				     REGS_SIZE - SP_OFFSET))
+			return false;
+
+		*sp = regs->sp;
+	} else
+		*sp = (unsigned long)&regs->sp;
+
+	return true;
+}
+
+bool unwind_next_frame(struct unwind_state *state)
+{
+	unsigned long ip_p, sp, orig_ip, prev_sp = state->sp;
+	enum stack_type prev_type = state->stack_info.type;
+	struct orc_entry *orc;
+	struct pt_regs *ptregs;
+	bool indirect = false;
+
+	if (unwind_done(state))
+		return false;
+
+	/* Don't let modules unload while we're reading their ORC data. */
+	preempt_disable();
+
+	/* Have we reached the end? */
+	if (state->regs && user_mode(state->regs))
+		goto done;
+
+	/*
+	 * Find the orc_entry associated with the text address.
+	 *
+	 * Decrement call return addresses by one so they work for sibling
+	 * calls and calls to noreturn functions.
+	 */
+	orc = orc_find(state->signal ? state->ip : state->ip - 1);
+	if (!orc || orc->sp_reg == ORC_REG_UNDEFINED)
+		goto done;
+	orig_ip = state->ip;
+
+	/* Find the previous frame's stack: */
+	switch (orc->sp_reg) {
+	case ORC_REG_SP:
+		sp = state->sp + orc->sp_offset;
+		break;
+
+	case ORC_REG_BP:
+		sp = state->bp + orc->sp_offset;
+		break;
+
+	case ORC_REG_SP_INDIRECT:
+		sp = state->sp + orc->sp_offset;
+		indirect = true;
+		break;
+
+	case ORC_REG_BP_INDIRECT:
+		sp = state->bp + orc->sp_offset;
+		indirect = true;
+		break;
+
+	case ORC_REG_R10:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg R10 at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->r10;
+		break;
+
+	case ORC_REG_R13:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg R13 at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->r13;
+		break;
+
+	case ORC_REG_DI:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg DI at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->di;
+		break;
+
+	case ORC_REG_DX:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg DX at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->dx;
+		break;
+
+	default:
+		orc_warn("unknown SP base reg %d for ip %p\n",
+			 orc->sp_reg, (void *)state->ip);
+		goto done;
+	}
+
+	if (indirect) {
+		if (!deref_stack_reg(state, sp, &sp))
+			goto done;
+	}
+
+	/* Find IP, SP and possibly regs: */
+	switch (orc->type) {
+	case ORC_TYPE_CALL:
+		ip_p = sp - sizeof(long);
+
+		if (!deref_stack_reg(state, ip_p, &state->ip))
+			goto done;
+
+		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
+						  state->ip, (void *)ip_p);
+
+		state->sp = sp;
+		state->regs = NULL;
+		state->signal = false;
+		break;
+
+	case ORC_TYPE_REGS:
+		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, true)) {
+			orc_warn("can't dereference registers at %p for ip %p\n",
+				 (void *)sp, (void *)orig_ip);
+			goto done;
+		}
+
+		state->regs = (struct pt_regs *)sp;
+		state->full_regs = true;
+		state->signal = true;
+		break;
+
+	case ORC_TYPE_REGS_IRET:
+		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, false)) {
+			orc_warn("can't dereference iret registers at %p for ip %p\n",
+				 (void *)sp, (void *)orig_ip);
+			goto done;
+		}
+
+		ptregs = container_of((void *)sp, struct pt_regs, ip);
+		if ((unsigned long)ptregs >= prev_sp &&
+		    on_stack(&state->stack_info, ptregs, REGS_SIZE)) {
+			state->regs = ptregs;
+			state->full_regs = false;
+		} else
+			state->regs = NULL;
+
+		state->signal = true;
+		break;
+
+	default:
+		orc_warn("unknown .orc_unwind entry type %d\n", orc->type);
+		break;
+	}
+
+	/* Find BP: */
+	switch (orc->bp_reg) {
+	case ORC_REG_UNDEFINED:
+		if (state->regs && state->full_regs)
+			state->bp = state->regs->bp;
+		break;
+
+	case ORC_REG_PREV_SP:
+		if (!deref_stack_reg(state, sp + orc->bp_offset, &state->bp))
+			goto done;
+		break;
+
+	case ORC_REG_BP:
+		if (!deref_stack_reg(state, state->bp + orc->bp_offset, &state->bp))
+			goto done;
+		break;
+
+	default:
+		orc_warn("unknown BP base reg %d for ip %p\n",
+			 orc->bp_reg, (void *)orig_ip);
+		goto done;
+	}
+
+	/* Prevent a recursive loop due to bad ORC data: */
+	if (state->stack_info.type == prev_type &&
+	    on_stack(&state->stack_info, (void *)state->sp, sizeof(long)) &&
+	    state->sp <= prev_sp) {
+		orc_warn("stack going in the wrong direction? ip=%p\n",
+			 (void *)orig_ip);
+		goto done;
+	}
+
+	preempt_enable();
+	return true;
+
+done:
+	preempt_enable();
+	state->stack_info.type = STACK_TYPE_UNKNOWN;
+	return false;
+}
+EXPORT_SYMBOL_GPL(unwind_next_frame);
+
+void __unwind_start(struct unwind_state *state, struct task_struct *task,
+		    struct pt_regs *regs, unsigned long *first_frame)
+{
+	memset(state, 0, sizeof(*state));
+	state->task = task;
+
+	/*
+	 * Refuse to unwind the stack of a task while it's executing on another
+	 * CPU.  This check is racy, but that's ok: the unwinder has other
+	 * checks to prevent it from going off the rails.
+	 */
+	if (task_on_another_cpu(task))
+		goto done;
+
+	if (regs) {
+		if (user_mode(regs))
+			goto done;
+
+		state->ip = regs->ip;
+		state->sp = kernel_stack_pointer(regs);
+		state->bp = regs->bp;
+		state->regs = regs;
+		state->full_regs = true;
+		state->signal = true;
+
+	} else if (task == current) {
+		asm volatile("lea (%%rip), %0\n\t"
+			     "mov %%rsp, %1\n\t"
+			     "mov %%rbp, %2\n\t"
+			     : "=r" (state->ip), "=r" (state->sp),
+			       "=r" (state->bp));
+
+	} else {
+		struct inactive_task_frame *frame = (void *)task->thread.sp;
+
+		state->ip = frame->ret_addr;
+		state->sp = task->thread.sp;
+		state->bp = frame->bp;
+	}
+
+	if (get_stack_info((unsigned long *)state->sp, state->task,
+			   &state->stack_info, &state->stack_mask))
+		return;
+
+	/*
+	 * The caller can provide the address of the first frame directly
+	 * (first_frame) or indirectly (regs->sp) to indicate which stack frame
+	 * to start unwinding at.  Skip ahead until we reach it.
+	 */
+
+	/* When starting from regs, skip the regs frame: */
+	if (regs) {
+		unwind_next_frame(state);
+		return;
+	}
+
+	/* Otherwise, skip ahead to the user-specified starting frame: */
+	while (!unwind_done(state) &&
+	       (!on_stack(&state->stack_info, first_frame, sizeof(long)) ||
+			state->sp <= (unsigned long)first_frame))
+		unwind_next_frame(state);
+
+	return;
+
+done:
+	state->stack_info.type = STACK_TYPE_UNKNOWN;
+	return;
+}
+EXPORT_SYMBOL_GPL(__unwind_start);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index c8a3b61..f05f00a 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -24,6 +24,7 @@
 #include <asm/asm-offsets.h>
 #include <asm/thread_info.h>
 #include <asm/page_types.h>
+#include <asm/orc_lookup.h>
 #include <asm/cache.h>
 #include <asm/boot.h>
 
@@ -148,6 +149,8 @@ SECTIONS
 
 	BUG_TABLE
 
+	ORC_UNWIND_TABLE
+
 	. = ALIGN(PAGE_SIZE);
 	__vvar_page = .;
 
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 0d64658..e98b052 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -668,6 +668,31 @@
 #define BUG_TABLE
 #endif
 
+#ifdef CONFIG_ORC_UNWINDER
+#define ORC_UNWIND_TABLE						\
+	. = ALIGN(4);							\
+	.orc_unwind_ip : AT(ADDR(.orc_unwind_ip) - LOAD_OFFSET) {	\
+		VMLINUX_SYMBOL(__start_orc_unwind_ip) = .;		\
+		KEEP(*(.orc_unwind_ip))					\
+		VMLINUX_SYMBOL(__stop_orc_unwind_ip) = .;		\
+	}								\
+	. = ALIGN(6);							\
+	.orc_unwind : AT(ADDR(.orc_unwind) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(__start_orc_unwind) = .;			\
+		KEEP(*(.orc_unwind))					\
+		VMLINUX_SYMBOL(__stop_orc_unwind) = .;			\
+	}								\
+	. = ALIGN(4);							\
+	.orc_lookup : AT(ADDR(.orc_lookup) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(orc_lookup) = .;				\
+		. += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) /	\
+			LOOKUP_BLOCK_SIZE) + 1) * 4;			\
+		VMLINUX_SYMBOL(orc_lookup_end) = .;			\
+	}
+#else
+#define ORC_UNWIND_TABLE
+#endif
+
 #ifdef CONFIG_PM_TRACE
 #define TRACEDATA							\
 	. = ALIGN(4);							\
@@ -854,7 +879,7 @@
 		DATA_DATA						\
 		CONSTRUCTORS						\
 	}								\
-	BUG_TABLE
+	BUG_TABLE							\
 
 #define INIT_TEXT_SECTION(inittext_align)				\
 	. = ALIGN(inittext_align);					\
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 9c5d40a..a7abffa 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -374,6 +374,9 @@ config STACK_VALIDATION
 	  pointers (if CONFIG_FRAME_POINTER is enabled).  This helps ensure
 	  that runtime stack traces are more reliable.
 
+	  This is also a prerequisite for generation of ORC unwind data, which
+	  is needed for CONFIG_ORC_UNWINDER.
+
 	  For more information, see
 	  tools/objtool/Documentation/stack-validation.txt.
 
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 733e044..11b5c28 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -258,7 +258,8 @@ ifneq ($(SKIP_STACK_VALIDATION),1)
 
 __objtool_obj := $(objtree)/tools/objtool/objtool
 
-objtool_args = check
+objtool_args = $(if $(CONFIG_ORC_UNWINDER),orc generate,check)
+
 ifndef CONFIG_FRAME_POINTER
 objtool_args += --no-fp
 endif
@@ -276,6 +277,11 @@ objtool_obj = $(if $(patsubst y%,, \
 endif # SKIP_STACK_VALIDATION
 endif # CONFIG_STACK_VALIDATION
 
+# Rebuild all objects when objtool changes, or is enabled/disabled.
+objtool_dep = $(objtool_obj)					\
+	      $(wildcard include/config/orc/unwinder.h		\
+			 include/config/stack/validation.h)
+
 define rule_cc_o_c
 	$(call echo-cmd,checksrc) $(cmd_checksrc)			  \
 	$(call cmd_and_fixdep,cc_o_c)					  \
@@ -298,13 +304,13 @@ cmd_undef_syms = echo
 endif
 
 # Built-in and composite module parts
-$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 	$(call cmd,force_checksrc)
 	$(call if_changed_rule,cc_o_c)
 
 # Single-part modules are special since we need to mark them in $(MODVERDIR)
 
-$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 	$(call cmd,force_checksrc)
 	$(call if_changed_rule,cc_o_c)
 	@{ echo $(@:.o=.ko); echo $@; \
@@ -399,7 +405,7 @@ cmd_modversions_S =								\
 endif
 endif
 
-$(obj)/%.o: $(src)/%.S $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.S $(objtool_dep) FORCE
 	$(call if_changed_rule,as_o_S)
 
 targets += $(real-objs-y) $(real-objs-m) $(lib-y)
-- 
2.7.5

* [tip:x86/asm] x86/entry/64: Refactor IRQ stacks and make them NMI-safe
  2017-07-11 15:33 ` [PATCH v3 01/10] x86/entry/64: Refactor IRQ stacks and make them NMI-safe Josh Poimboeuf
@ 2017-07-18 10:40   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Andy Lutomirski @ 2017-07-18 10:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, efault, peterz, mingo, brgerst, bp, tglx, linux-kernel,
	luto, jpoimboe, torvalds, jslaby, dvlasenk

Commit-ID:  1d3e53e8624a3ec85f4041ca6d973da7c1575938
Gitweb:     http://git.kernel.org/tip/1d3e53e8624a3ec85f4041ca6d973da7c1575938
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Tue, 11 Jul 2017 10:33:38 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 18 Jul 2017 10:56:22 +0200

x86/entry/64: Refactor IRQ stacks and make them NMI-safe

This will allow IRQ stacks to nest inside NMIs or similar entries
that can happen during IRQ stack setup or teardown.

The new macros won't work correctly if they're invoked with IRQs on.
Add a check under CONFIG_DEBUG_ENTRY to detect that.
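
As a sanity model, the claim/release protocol behind the new macros
boils down to this (toy user-space C, illustration only; the bare
counter stands in for the per-cpu irq_count and the actual stack
switch):

  #include <assert.h>

  static int irq_count = -1;	/* -1 == IRQ stack not in use */

  /* Returns 1 when this entry claimed the stack and must switch. */
  static int enter_irq_stack(void)
  {
  	return ++irq_count == 0;
  }

  static void leave_irq_stack(void)
  {
  	irq_count--;
  }

  int main(void)
  {
  	assert(enter_irq_stack());	/* outermost IRQ: switch stacks */
  	assert(!enter_irq_stack());	/* nested entry (NMI): stay put */
  	leave_irq_stack();
  	leave_irq_stack();
  	assert(irq_count == -1);	/* stack is free again */
  	return 0;
  }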

Signed-off-by: Andy Lutomirski <luto@kernel.org>
[ Use %r10 instead of %r11 in xen_do_hypervisor_callback to make objtool
  and ORC unwinder's lives a little easier. ]
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/b0b2ff5fb97d2da2e1d7e1f380190c92545c8bb5.1499786555.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig.debug       |  2 --
 arch/x86/entry/entry_64.S    | 85 +++++++++++++++++++++++++++++++-------------
 arch/x86/kernel/process_64.c |  3 ++
 3 files changed, 64 insertions(+), 26 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index fcb7604..353ed09 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -305,8 +305,6 @@ config DEBUG_ENTRY
 	  Some of these sanity checks may slow down kernel entries and
 	  exits or otherwise impact performance.
 
-	  This is currently used to help test NMI code.
-
 	  If unsure, say N.
 
 config DEBUG_NMI_SELFTEST
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a9a8027..0d4483a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -447,6 +447,59 @@ ENTRY(irq_entries_start)
     .endr
 END(irq_entries_start)
 
+.macro DEBUG_ENTRY_ASSERT_IRQS_OFF
+#ifdef CONFIG_DEBUG_ENTRY
+	pushfq
+	testl $X86_EFLAGS_IF, (%rsp)
+	jz .Lokay_\@
+	ud2
+.Lokay_\@:
+	addq $8, %rsp
+#endif
+.endm
+
+/*
+ * Enters the IRQ stack if we're not already using it.  NMI-safe.  Clobbers
+ * flags and puts old RSP into old_rsp, and leaves all other GPRs alone.
+ * Requires kernel GSBASE.
+ *
+ * The invariant is that, if irq_count != -1, then the IRQ stack is in use.
+ */
+.macro ENTER_IRQ_STACK old_rsp
+	DEBUG_ENTRY_ASSERT_IRQS_OFF
+	movq	%rsp, \old_rsp
+	incl	PER_CPU_VAR(irq_count)
+
+	/*
+	 * Right now, if we just incremented irq_count to zero, we've
+	 * claimed the IRQ stack but we haven't switched to it yet.
+	 *
+	 * If anything is added that can interrupt us here without using IST,
+	 * it must be *extremely* careful to limit its stack usage.  This
+	 * could include kprobes and a hypothetical future IST-less #DB
+	 * handler.
+	 */
+
+	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
+	pushq	\old_rsp
+.endm
+
+/*
+ * Undoes ENTER_IRQ_STACK.
+ */
+.macro LEAVE_IRQ_STACK
+	DEBUG_ENTRY_ASSERT_IRQS_OFF
+	/* We need to be off the IRQ stack before decrementing irq_count. */
+	popq	%rsp
+
+	/*
+	 * As in ENTER_IRQ_STACK, when irq_count == 0 we are still claiming
+	 * the IRQ stack even though we're no longer on it.
+	 */
+
+	decl	PER_CPU_VAR(irq_count)
+.endm
+
 /*
  * Interrupt entry/exit.
  *
@@ -485,17 +538,7 @@ END(irq_entries_start)
 	CALL_enter_from_user_mode
 
 1:
-	/*
-	 * Save previous stack pointer, optionally switch to interrupt stack.
-	 * irq_count is used to check if a CPU is already on an interrupt stack
-	 * or not. While this is essentially redundant with preempt_count it is
-	 * a little cheaper to use a separate counter in the PDA (short of
-	 * moving irq_enter into assembly, which would be too much work)
-	 */
-	movq	%rsp, %rdi
-	incl	PER_CPU_VAR(irq_count)
-	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
-	pushq	%rdi
+	ENTER_IRQ_STACK old_rsp=%rdi
 	/* We entered an interrupt context - irqs are off: */
 	TRACE_IRQS_OFF
 
@@ -515,10 +558,8 @@ common_interrupt:
 ret_from_intr:
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
-	decl	PER_CPU_VAR(irq_count)
 
-	/* Restore saved previous stack */
-	popq	%rsp
+	LEAVE_IRQ_STACK
 
 	testb	$3, CS(%rsp)
 	jz	retint_kernel
@@ -891,12 +932,10 @@ bad_gs:
 ENTRY(do_softirq_own_stack)
 	pushq	%rbp
 	mov	%rsp, %rbp
-	incl	PER_CPU_VAR(irq_count)
-	cmove	PER_CPU_VAR(irq_stack_ptr), %rsp
-	push	%rbp				/* frame pointer backlink */
+	ENTER_IRQ_STACK old_rsp=%r11
 	call	__do_softirq
+	LEAVE_IRQ_STACK
 	leaveq
-	decl	PER_CPU_VAR(irq_count)
 	ret
 END(do_softirq_own_stack)
 
@@ -923,13 +962,11 @@ ENTRY(xen_do_hypervisor_callback)		/* do_hypervisor_callback(struct *pt_regs) */
  * see the correct pointer to the pt_regs
  */
 	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
-11:	incl	PER_CPU_VAR(irq_count)
-	movq	%rsp, %rbp
-	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
-	pushq	%rbp				/* frame pointer backlink */
+
+	ENTER_IRQ_STACK old_rsp=%r10
 	call	xen_evtchn_do_upcall
-	popq	%rsp
-	decl	PER_CPU_VAR(irq_count)
+	LEAVE_IRQ_STACK
+
 #ifndef CONFIG_PREEMPT
 	call	xen_maybe_preempt_hcall
 #endif
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index c3169be..2987e39 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -279,6 +279,9 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
 	unsigned prev_fsindex, prev_gsindex;
 
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
+		     this_cpu_read(irq_count) != -1);
+
 	switch_fpu_prepare(prev_fpu, cpu);
 
 	/* We must save %fs and %gs before load_TLS() because

* [tip:x86/asm] x86/entry/64: Initialize the top of the IRQ stack before switching stacks
  2017-07-11 15:33 ` [PATCH v3 02/10] x86/entry/64: Initialize the top of the IRQ stack before switching stacks Josh Poimboeuf
@ 2017-07-18 10:41   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Andy Lutomirski @ 2017-07-18 10:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jpoimboe, jslaby, torvalds, bp, linux-kernel, tglx, mingo, hpa,
	luto, peterz, brgerst, efault, dvlasenk

Commit-ID:  2995590964da93e1fd9a91550f9c9d9fab28f160
Gitweb:     http://git.kernel.org/tip/2995590964da93e1fd9a91550f9c9d9fab28f160
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Tue, 11 Jul 2017 10:33:39 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 18 Jul 2017 10:56:23 +0200

x86/entry/64: Initialize the top of the IRQ stack before switching stacks

The OOPS unwinder wants the word at the top of the IRQ stack to
point back to the previous stack at all times when the IRQ stack
is in use.  There's currently a one-instruction window in ENTER_IRQ_STACK
during which this isn't the case.  Fix it by writing the old RSP to the
top of the IRQ stack before jumping.

This currently writes the pointer to the stack twice, which is a bit
ugly.  We could get rid of this by replacing irq_stack_ptr with
irq_stack_ptr_minus_eight (better name welcome).  OTOH, there may be
all kinds of odd microarchitectural considerations in play that
affect performance by a few cycles here.
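
For illustration, the invariant the unwinder needs can be modeled in a
few lines of toy user-space C (the array stands in for the per-cpu IRQ
stack; none of this is kernel code):

  #include <assert.h>

  static unsigned long irq_stack[64];
  static unsigned long *const irq_stack_top = &irq_stack[64];

  /* Write the back-link *before* moving RSP to the IRQ stack, so an
   * NMI arriving in between always finds a valid link at the top. */
  static unsigned long *enter_irq_stack(unsigned long *old_rsp)
  {
  	irq_stack_top[-1] = (unsigned long)old_rsp;
  	return irq_stack_top - 1;
  }

  int main(void)
  {
  	unsigned long outer_frame;
  	unsigned long *rsp = enter_irq_stack(&outer_frame);

  	/* What the unwinder checks at any instruction boundary: */
  	assert(*rsp == (unsigned long)&outer_frame);
  	return 0;
  }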

Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/aae7e79e49914808440ad5310ace138ced2179ca.1499786555.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 0d4483a..b56f7f2 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -469,6 +469,7 @@ END(irq_entries_start)
 	DEBUG_ENTRY_ASSERT_IRQS_OFF
 	movq	%rsp, \old_rsp
 	incl	PER_CPU_VAR(irq_count)
+	jnz	.Lirq_stack_push_old_rsp_\@
 
 	/*
 	 * Right now, if we just incremented irq_count to zero, we've
@@ -478,9 +479,30 @@ END(irq_entries_start)
 	 * it must be *extremely* careful to limit its stack usage.  This
 	 * could include kprobes and a hypothetical future IST-less #DB
 	 * handler.
+	 *
+	 * The OOPS unwinder relies on the word at the top of the IRQ
+	 * stack linking back to the previous RSP for the entire time we're
+	 * on the IRQ stack.  For this to work reliably, we need to write
+	 * it before we actually move ourselves to the IRQ stack.
+	 */
+
+	movq	\old_rsp, PER_CPU_VAR(irq_stack_union + IRQ_STACK_SIZE - 8)
+	movq	PER_CPU_VAR(irq_stack_ptr), %rsp
+
+#ifdef CONFIG_DEBUG_ENTRY
+	/*
+	 * If the first movq above becomes wrong due to IRQ stack layout
+	 * changes, the only way we'll notice is if we try to unwind right
+	 * here.  Assert that we set up the stack right to catch this type
+	 * of bug quickly.
 	 */
+	cmpq	-8(%rsp), \old_rsp
+	je	.Lirq_stack_okay\@
+	ud2
+	.Lirq_stack_okay\@:
+#endif
 
-	cmovzq	PER_CPU_VAR(irq_stack_ptr), %rsp
+.Lirq_stack_push_old_rsp_\@:
 	pushq	\old_rsp
 .endm
 

* [tip:x86/asm] x86/dumpstack: Fix occasionally missing registers
  2017-07-11 15:33 ` [PATCH v3 03/10] x86/dumpstack: fix occasionally missing registers Josh Poimboeuf
@ 2017-07-18 10:41   ` tip-bot for Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Josh Poimboeuf @ 2017-07-18 10:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: efault, torvalds, tglx, jpoimboe, luto, linux-kernel, mingo,
	dvlasenk, peterz, jslaby, hpa, brgerst, bp

Commit-ID:  b0529becebde629ff6abf2afdca6def6824f4fa9
Gitweb:     http://git.kernel.org/tip/b0529becebde629ff6abf2afdca6def6824f4fa9
Author:     Josh Poimboeuf <jpoimboe@redhat.com>
AuthorDate: Tue, 11 Jul 2017 10:33:40 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 18 Jul 2017 10:56:23 +0200

x86/dumpstack: Fix occasionally missing registers

If two consecutive stack frames have pt_regs, the oops dump code fails
to print the second frame's registers.  Fix that.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Fixes: 3b3fa11bc700 ("x86/dumpstack: Print any pt_regs found on the stack")
Link: http://lkml.kernel.org/r/269c5c00c7d45c699f3dcea42a3a594c6cf7a9a3.1499786555.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index dbce3cc..bd265a4 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -94,6 +94,9 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		if (stack_name)
 			printk("%s <%s>\n", log_lvl, stack_name);
 
+		if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
+			__show_regs(regs, 0);
+
 		/*
 		 * Scan the stack, printing any text addresses we find.  At the
 		 * same time, follow proper stack frames with the unwinder.
@@ -118,10 +121,8 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 			 * Don't print regs->ip again if it was already printed
 			 * by __show_regs() below.
 			 */
-			if (regs && stack == &regs->ip) {
-				unwind_next_frame(&state);
-				continue;
-			}
+			if (regs && stack == &regs->ip)
+				goto next;
 
 			if (stack == ret_addr_p)
 				reliable = 1;
@@ -144,6 +145,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 			if (!reliable)
 				continue;
 
+next:
 			/*
 			 * Get the next frame from the unwinder.  No need to
 			 * check for an error: if anything goes wrong, the rest
@@ -153,7 +155,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 			/* if the frame has entry regs, print them */
 			regs = unwind_get_entry_regs(&state);
-			if (regs)
+			if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
 				__show_regs(regs, 0);
 		}
 

* [tip:x86/asm] x86/dumpstack: Fix interrupt and exception stack boundary checks
  2017-07-11 15:33 ` [PATCH v3 04/10] x86/dumpstack: fix interrupt and exception stack boundary checks Josh Poimboeuf
@ 2017-07-18 10:42   ` tip-bot for Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Josh Poimboeuf @ 2017-07-18 10:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, mingo, torvalds, luto, peterz, brgerst, efault, tglx, hpa,
	dvlasenk, jpoimboe, jslaby, linux-kernel

Commit-ID:  5a3cf86978a1ac433407704ec280919751aa2699
Gitweb:     http://git.kernel.org/tip/5a3cf86978a1ac433407704ec280919751aa2699
Author:     Josh Poimboeuf <jpoimboe@redhat.com>
AuthorDate: Tue, 11 Jul 2017 10:33:41 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 18 Jul 2017 10:56:23 +0200

x86/dumpstack: Fix interrupt and exception stack boundary checks

On x86_64, the double fault exception stack is located immediately after
the interrupt stack in memory.  This causes confusion in the unwinder
when it tries to unwind through an empty interrupt stack, where the
stack pointer points to the address bordering the two stacks.  The
unwinder incorrectly thinks it's running on the double fault stack.

Fix this kind of stack border confusion by never considering the
beginning address of an exception or interrupt stack to be part of the
stack.
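
Worked example of the border case (made-up addresses; the helper just
models the fixed range check, it's not the kernel code):

  #include <assert.h>
  #include <stdbool.h>

  /* Exception stack membership with the new '<=' lower bound: */
  static bool on_exception_stack(unsigned long sp,
  				 unsigned long begin, unsigned long end)
  {
  	return !(sp <= begin || sp >= end);	/* was: sp < begin */
  }

  int main(void)
  {
  	/* IRQ stack [0x8000, 0x9000) borders #DF stack [0x9000, 0xa000).
  	 * An empty IRQ stack leaves RSP at 0x9000, the shared border,
  	 * which the old 'sp < begin' check wrongly claimed for #DF: */
  	assert(!on_exception_stack(0x9000, 0x9000, 0xa000));
  	return 0;
  }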

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Fixes: 5fe599e02e41 ("x86/dumpstack: Add support for unwinding empty IRQ stacks")
Link: http://lkml.kernel.org/r/bcc142160a5104de5c354c21c394c93a0173943f.1499786555.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/dumpstack_32.c | 4 ++--
 arch/x86/kernel/dumpstack_64.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index e5f0b40..4f04814 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -37,7 +37,7 @@ static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info)
 	 * This is a software stack, so 'end' can be a valid stack pointer.
 	 * It just means the stack is empty.
 	 */
-	if (stack < begin || stack > end)
+	if (stack <= begin || stack > end)
 		return false;
 
 	info->type	= STACK_TYPE_IRQ;
@@ -62,7 +62,7 @@ static bool in_softirq_stack(unsigned long *stack, struct stack_info *info)
 	 * This is a software stack, so 'end' can be a valid stack pointer.
 	 * It just means the stack is empty.
 	 */
-	if (stack < begin || stack > end)
+	if (stack <= begin || stack > end)
 		return false;
 
 	info->type	= STACK_TYPE_SOFTIRQ;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 3e1471d..225af41 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -55,7 +55,7 @@ static bool in_exception_stack(unsigned long *stack, struct stack_info *info)
 		begin = end - (exception_stack_sizes[k] / sizeof(long));
 		regs  = (struct pt_regs *)end - 1;
 
-		if (stack < begin || stack >= end)
+		if (stack <= begin || stack >= end)
 			continue;
 
 		info->type	= STACK_TYPE_EXCEPTION + k;
@@ -78,7 +78,7 @@ static bool in_irq_stack(unsigned long *stack, struct stack_info *info)
 	 * This is a software stack, so 'end' can be a valid stack pointer.
 	 * It just means the stack is empty.
 	 */
-	if (stack < begin || stack > end)
+	if (stack <= begin || stack > end)
 		return false;
 
 	info->type	= STACK_TYPE_IRQ;

* [tip:x86/asm] objtool: Add ORC unwind table generation
  2017-07-11 15:33 ` [PATCH v3 05/10] objtool: add ORC unwind table generation Josh Poimboeuf
@ 2017-07-18 10:42   ` tip-bot for Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Josh Poimboeuf @ 2017-07-18 10:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, jpoimboe, bp, linux-kernel, efault, jslaby, brgerst, hpa,
	luto, peterz, torvalds, dvlasenk, mingo

Commit-ID:  627fce14809ba5610b0cb476cd0186d3fcedecfc
Gitweb:     http://git.kernel.org/tip/627fce14809ba5610b0cb476cd0186d3fcedecfc
Author:     Josh Poimboeuf <jpoimboe@redhat.com>
AuthorDate: Tue, 11 Jul 2017 10:33:42 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 18 Jul 2017 10:57:43 +0200

objtool: Add ORC unwind table generation

Now that objtool knows the states of all registers on the stack for each
instruction, it's straightforward to generate debuginfo for an unwinder
to use.

Instead of generating DWARF, generate a new format called ORC, which is
more suitable for an in-kernel unwinder.  See
Documentation/x86/orc-unwinder.txt for a more detailed description of
this new debuginfo format and why it's preferable to DWARF.
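
For orientation, the record generated for each instruction address is
tiny; per the orc_types.h added by this patch (packed struct, s16
offsets), the layout is roughly:

  struct orc_entry {
  	s16		sp_offset;	/* prev SP = base reg + sp_offset */
  	s16		bp_offset;	/* where the saved BP lives */
  	unsigned	sp_reg:4;	/* SP base reg: SP, BP, R10, ... */
  	unsigned	bp_reg:4;	/* BP base reg */
  	unsigned	type:2;		/* call, regs, or iret-regs frame */
  } __packed;

Each orc_entry is paired with a 4-byte, self-relative entry in
.orc_unwind_ip giving the text address where it starts to apply.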

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/c9b9f01ba6c5ed2bdc9bb0957b78167fdbf9632e.1499786555.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 tools/objtool/Build                              |   3 +
 tools/objtool/Documentation/stack-validation.txt |  56 ++----
 tools/objtool/builtin-check.c                    |   2 +-
 tools/objtool/builtin-orc.c                      |  70 ++++++++
 tools/objtool/builtin.h                          |   1 +
 tools/objtool/check.c                            |  58 +++++-
 tools/objtool/check.h                            |  15 +-
 tools/objtool/elf.c                              | 212 ++++++++++++++++++++--
 tools/objtool/elf.h                              |  15 +-
 tools/objtool/objtool.c                          |   3 +-
 tools/objtool/{builtin.h => orc.h}               |  18 +-
 tools/objtool/orc_dump.c                         | 212 ++++++++++++++++++++++
 tools/objtool/orc_gen.c                          | 214 +++++++++++++++++++++++
 tools/objtool/orc_types.h                        |  85 +++++++++
 14 files changed, 899 insertions(+), 65 deletions(-)

diff --git a/tools/objtool/Build b/tools/objtool/Build
index 6f2e198..749becd 100644
--- a/tools/objtool/Build
+++ b/tools/objtool/Build
@@ -1,6 +1,9 @@
 objtool-y += arch/$(SRCARCH)/
 objtool-y += builtin-check.o
+objtool-y += builtin-orc.o
 objtool-y += check.o
+objtool-y += orc_gen.o
+objtool-y += orc_dump.o
 objtool-y += elf.o
 objtool-y += special.o
 objtool-y += objtool.o
diff --git a/tools/objtool/Documentation/stack-validation.txt b/tools/objtool/Documentation/stack-validation.txt
index 17c1195..6a1af43 100644
--- a/tools/objtool/Documentation/stack-validation.txt
+++ b/tools/objtool/Documentation/stack-validation.txt
@@ -11,9 +11,6 @@ analyzes every .o file and ensures the validity of its stack metadata.
 It enforces a set of rules on asm code and C inline assembly code so
 that stack traces can be reliable.
 
-Currently it only checks frame pointer usage, but there are plans to add
-CFI validation for C files and CFI generation for asm files.
-
 For each function, it recursively follows all possible code paths and
 validates the correct frame pointer state at each instruction.
 
@@ -23,6 +20,10 @@ alternative execution paths to a given instruction (or set of
 instructions).  Similarly, it knows how to follow switch statements, for
 which gcc sometimes uses jump tables.
 
+(Objtool also has an 'orc generate' subcommand which generates debuginfo
+for the ORC unwinder.  See Documentation/x86/orc-unwinder.txt in the
+kernel tree for more details.)
+
 
 Why do we need stack metadata validation?
 -----------------------------------------
@@ -93,37 +94,14 @@ a) More reliable stack traces for frame pointer enabled kernels
        or at the very end of the function after the stack frame has been
        destroyed.  This is an inherent limitation of frame pointers.
 
-b) 100% reliable stack traces for DWARF enabled kernels
-
-   (NOTE: This is not yet implemented)
-
-   As an alternative to frame pointers, DWARF Call Frame Information
-   (CFI) metadata can be used to walk the stack.  Unlike frame pointers,
-   CFI metadata is out of band.  So it doesn't affect runtime
-   performance and it can be reliable even when interrupts or exceptions
-   are involved.
-
-   For C code, gcc automatically generates DWARF CFI metadata.  But for
-   asm code, generating CFI is a tedious manual approach which requires
-   manually placed .cfi assembler macros to be scattered throughout the
-   code.  It's clumsy and very easy to get wrong, and it makes the real
-   code harder to read.
-
-   Stacktool will improve this situation in several ways.  For code
-   which already has CFI annotations, it will validate them.  For code
-   which doesn't have CFI annotations, it will generate them.  So an
-   architecture can opt to strip out all the manual .cfi annotations
-   from their asm code and have objtool generate them instead.
+b) ORC (Oops Rewind Capability) unwind table generation
 
-   We might also add a runtime stack validation debug option where we
-   periodically walk the stack from schedule() and/or an NMI to ensure
-   that the stack metadata is sane and that we reach the bottom of the
-   stack.
+   An alternative to frame pointers and DWARF, ORC unwind data can be
+   used to walk the stack.  Unlike frame pointers, ORC data is out of
+   band.  So it doesn't affect runtime performance and it can be
+   reliable even when interrupts or exceptions are involved.
 
-   So the benefit of objtool here will be that external tooling should
-   always show perfect stack traces.  And the same will be true for
-   kernel warning/oops traces if the architecture has a runtime DWARF
-   unwinder.
+   For more details, see Documentation/x86/orc-unwinder.txt.
 
 c) Higher live patching compatibility rate
 
@@ -211,7 +189,7 @@ they mean, and suggestions for how to fix them.
    function, add proper frame pointer logic using the FRAME_BEGIN and
    FRAME_END macros.  Otherwise, if it's not a callable function, remove
    its ELF function annotation by changing ENDPROC to END, and instead
-   use the manual CFI hint macros in asm/undwarf.h.
+   use the manual unwind hint macros in asm/unwind_hints.h.
 
    If it's a GCC-compiled .c file, the error may be because the function
    uses an inline asm() statement which has a "call" instruction.  An
@@ -231,8 +209,8 @@ they mean, and suggestions for how to fix them.
    If the error is for an asm file, and the instruction is inside (or
    reachable from) a callable function, the function should be annotated
    with the ENTRY/ENDPROC macros (ENDPROC is the important one).
-   Otherwise, the code should probably be annotated with the CFI hint
-   macros in asm/undwarf.h so objtool and the unwinder can know the
+   Otherwise, the code should probably be annotated with the unwind hint
+   macros in asm/unwind_hints.h so objtool and the unwinder can know the
    stack state associated with the code.
 
    If you're 100% sure the code won't affect stack traces, or if you're
@@ -258,7 +236,7 @@ they mean, and suggestions for how to fix them.
    instructions aren't allowed in a callable function, and are most
    likely part of the kernel entry code.  They should usually not have
    the callable function annotation (ENDPROC) and should always be
-   annotated with the CFI hint macros in asm/undwarf.h.
+   annotated with the unwind hint macros in asm/unwind_hints.h.
 
 
 6. file.o: warning: objtool: func()+0x26: sibling call from callable instruction with modified stack frame
@@ -272,7 +250,7 @@ they mean, and suggestions for how to fix them.
 
    If the instruction is not actually in a callable function (e.g.
    kernel entry code), change ENDPROC to END and annotate manually with
-   the CFI hint macros in asm/undwarf.h.
+   the unwind hint macros in asm/unwind_hints.h.
 
 
 7. file: warning: objtool: func()+0x5c: stack state mismatch
@@ -288,8 +266,8 @@ they mean, and suggestions for how to fix them.
 
    Another possibility is that the code has some asm or inline asm which
    does some unusual things to the stack or the frame pointer.  In such
-   cases it's probably appropriate to use the CFI hint macros in
-   asm/undwarf.h.
+   cases it's probably appropriate to use the unwind hint macros in
+   asm/unwind_hints.h.
 
 
 8. file.o: warning: objtool: funcA() falls through to next function funcB()
diff --git a/tools/objtool/builtin-check.c b/tools/objtool/builtin-check.c
index 365c34e..eedf089 100644
--- a/tools/objtool/builtin-check.c
+++ b/tools/objtool/builtin-check.c
@@ -52,5 +52,5 @@ int cmd_check(int argc, const char **argv)
 
 	objname = argv[0];
 
-	return check(objname, nofp);
+	return check(objname, nofp, false);
 }
diff --git a/tools/objtool/builtin-orc.c b/tools/objtool/builtin-orc.c
new file mode 100644
index 0000000..5ca41ab
--- /dev/null
+++ b/tools/objtool/builtin-orc.c
@@ -0,0 +1,70 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * objtool orc:
+ *
+ * This command analyzes a .o file and adds .orc_unwind and .orc_unwind_ip
+ * sections to it, which are used by the in-kernel ORC unwinder.
+ *
+ * This command is a superset of "objtool check".
+ */
+
+#include <string.h>
+#include <subcmd/parse-options.h>
+#include "builtin.h"
+#include "check.h"
+
+
+static const char *orc_usage[] = {
+	"objtool orc generate [<options>] file.o",
+	"objtool orc dump file.o",
+	NULL,
+};
+
+extern const struct option check_options[];
+extern bool nofp;
+
+int cmd_orc(int argc, const char **argv)
+{
+	const char *objname;
+
+	argc--; argv++;
+	if (!strncmp(argv[0], "gen", 3)) {
+		argc = parse_options(argc, argv, check_options, orc_usage, 0);
+		if (argc != 1)
+			usage_with_options(orc_usage, check_options);
+
+		objname = argv[0];
+
+		return check(objname, nofp, true);
+
+	}
+
+	if (!strcmp(argv[0], "dump")) {
+		if (argc != 2)
+			usage_with_options(orc_usage, check_options);
+
+		objname = argv[1];
+
+		return orc_dump(objname);
+	}
+
+	usage_with_options(orc_usage, check_options);
+
+	return 0;
+}
diff --git a/tools/objtool/builtin.h b/tools/objtool/builtin.h
index 34d2ba7..dd52606 100644
--- a/tools/objtool/builtin.h
+++ b/tools/objtool/builtin.h
@@ -18,5 +18,6 @@
 #define _BUILTIN_H
 
 extern int cmd_check(int argc, const char **argv);
+extern int cmd_orc(int argc, const char **argv);
 
 #endif /* _BUILTIN_H */
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 2c6d748..cb57c52 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -36,8 +36,8 @@ const char *objname;
 static bool nofp;
 struct cfi_state initial_func_cfi;
 
-static struct instruction *find_insn(struct objtool_file *file,
-				     struct section *sec, unsigned long offset)
+struct instruction *find_insn(struct objtool_file *file,
+			      struct section *sec, unsigned long offset)
 {
 	struct instruction *insn;
 
@@ -259,6 +259,11 @@ static int decode_instructions(struct objtool_file *file)
 		if (!(sec->sh.sh_flags & SHF_EXECINSTR))
 			continue;
 
+		if (strcmp(sec->name, ".altinstr_replacement") &&
+		    strcmp(sec->name, ".altinstr_aux") &&
+		    strncmp(sec->name, ".discard.", 9))
+			sec->text = true;
+
 		for (offset = 0; offset < sec->len; offset += insn->len) {
 			insn = malloc(sizeof(*insn));
 			if (!insn) {
@@ -947,6 +952,30 @@ static bool has_valid_stack_frame(struct insn_state *state)
 	return false;
 }
 
+static int update_insn_state_regs(struct instruction *insn, struct insn_state *state)
+{
+	struct cfi_reg *cfa = &state->cfa;
+	struct stack_op *op = &insn->stack_op;
+
+	if (cfa->base != CFI_SP)
+		return 0;
+
+	/* push */
+	if (op->dest.type == OP_DEST_PUSH)
+		cfa->offset += 8;
+
+	/* pop */
+	if (op->src.type == OP_SRC_POP)
+		cfa->offset -= 8;
+
+	/* add immediate to sp */
+	if (op->dest.type == OP_DEST_REG && op->src.type == OP_SRC_ADD &&
+	    op->dest.reg == CFI_SP && op->src.reg == CFI_SP)
+		cfa->offset -= op->src.offset;
+
+	return 0;
+}
+
 static void save_reg(struct insn_state *state, unsigned char reg, int base,
 		     int offset)
 {
@@ -1032,6 +1061,9 @@ static int update_insn_state(struct instruction *insn, struct insn_state *state)
 		return 0;
 	}
 
+	if (state->type == ORC_TYPE_REGS || state->type == ORC_TYPE_REGS_IRET)
+		return update_insn_state_regs(insn, state);
+
 	switch (op->dest.type) {
 
 	case OP_DEST_REG:
@@ -1323,6 +1355,10 @@ static bool insn_state_match(struct instruction *insn, struct insn_state *state)
 			break;
 		}
 
+	} else if (state1->type != state2->type) {
+		WARN_FUNC("stack state mismatch: type1=%d type2=%d",
+			  insn->sec, insn->offset, state1->type, state2->type);
+
 	} else if (state1->drap != state2->drap ||
 		 (state1->drap && state1->drap_reg != state2->drap_reg)) {
 		WARN_FUNC("stack state mismatch: drap1=%d(%d) drap2=%d(%d)",
@@ -1613,7 +1649,7 @@ static void cleanup(struct objtool_file *file)
 	elf_close(file->elf);
 }
 
-int check(const char *_objname, bool _nofp)
+int check(const char *_objname, bool _nofp, bool orc)
 {
 	struct objtool_file file;
 	int ret, warnings = 0;
@@ -1621,7 +1657,7 @@ int check(const char *_objname, bool _nofp)
 	objname = _objname;
 	nofp = _nofp;
 
-	file.elf = elf_open(objname);
+	file.elf = elf_open(objname, orc ? O_RDWR : O_RDONLY);
 	if (!file.elf)
 		return 1;
 
@@ -1654,6 +1690,20 @@ int check(const char *_objname, bool _nofp)
 		warnings += ret;
 	}
 
+	if (orc) {
+		ret = create_orc(&file);
+		if (ret < 0)
+			goto out;
+
+		ret = create_orc_sections(&file);
+		if (ret < 0)
+			goto out;
+
+		ret = elf_write(file.elf);
+		if (ret < 0)
+			goto out;
+	}
+
 out:
 	cleanup(&file);
 
diff --git a/tools/objtool/check.h b/tools/objtool/check.h
index da85f5b..046874b 100644
--- a/tools/objtool/check.h
+++ b/tools/objtool/check.h
@@ -22,12 +22,14 @@
 #include "elf.h"
 #include "cfi.h"
 #include "arch.h"
+#include "orc.h"
 #include <linux/hashtable.h>
 
 struct insn_state {
 	struct cfi_reg cfa;
 	struct cfi_reg regs[CFI_NUM_REGS];
 	int stack_size;
+	unsigned char type;
 	bool bp_scratch;
 	bool drap;
 	int drap_reg;
@@ -48,6 +50,7 @@ struct instruction {
 	struct symbol *func;
 	struct stack_op stack_op;
 	struct insn_state state;
+	struct orc_entry orc;
 };
 
 struct objtool_file {
@@ -58,9 +61,19 @@ struct objtool_file {
 	bool ignore_unreachables, c_file;
 };
 
-int check(const char *objname, bool nofp);
+int check(const char *objname, bool nofp, bool orc);
+
+struct instruction *find_insn(struct objtool_file *file,
+			      struct section *sec, unsigned long offset);
 
 #define for_each_insn(file, insn)					\
 	list_for_each_entry(insn, &file->insn_list, list)
 
+#define sec_for_each_insn(file, sec, insn)				\
+	for (insn = find_insn(file, sec, 0);				\
+	     insn && &insn->list != &file->insn_list &&			\
+			insn->sec == sec;				\
+	     insn = list_next_entry(insn, list))
+
+
 #endif /* _CHECK_H */
diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index 1a7e8aa..6e9f980 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -30,16 +30,6 @@
 #include "elf.h"
 #include "warn.h"
 
-/*
- * Fallback for systems without this "read, mmaping if possible" cmd.
- */
-#ifndef ELF_C_READ_MMAP
-#define ELF_C_READ_MMAP ELF_C_READ
-#endif
-
-#define WARN_ELF(format, ...)					\
-	WARN(format ": %s", ##__VA_ARGS__, elf_errmsg(-1))
-
 struct section *find_section_by_name(struct elf *elf, const char *name)
 {
 	struct section *sec;
@@ -349,9 +339,10 @@ static int read_relas(struct elf *elf)
 	return 0;
 }
 
-struct elf *elf_open(const char *name)
+struct elf *elf_open(const char *name, int flags)
 {
 	struct elf *elf;
+	Elf_Cmd cmd;
 
 	elf_version(EV_CURRENT);
 
@@ -364,13 +355,20 @@ struct elf *elf_open(const char *name)
 
 	INIT_LIST_HEAD(&elf->sections);
 
-	elf->fd = open(name, O_RDONLY);
+	elf->fd = open(name, flags);
 	if (elf->fd == -1) {
 		perror("open");
 		goto err;
 	}
 
-	elf->elf = elf_begin(elf->fd, ELF_C_READ_MMAP, NULL);
+	if ((flags & O_ACCMODE) == O_RDONLY)
+		cmd = ELF_C_READ_MMAP;
+	else if ((flags & O_ACCMODE) == O_RDWR)
+		cmd = ELF_C_RDWR;
+	else /* O_WRONLY */
+		cmd = ELF_C_WRITE;
+
+	elf->elf = elf_begin(elf->fd, cmd, NULL);
 	if (!elf->elf) {
 		WARN_ELF("elf_begin");
 		goto err;
@@ -397,6 +395,194 @@ err:
 	return NULL;
 }
 
+struct section *elf_create_section(struct elf *elf, const char *name,
+				   size_t entsize, int nr)
+{
+	struct section *sec, *shstrtab;
+	size_t size = entsize * nr;
+	struct Elf_Scn *s;
+	Elf_Data *data;
+
+	sec = malloc(sizeof(*sec));
+	if (!sec) {
+		perror("malloc");
+		return NULL;
+	}
+	memset(sec, 0, sizeof(*sec));
+
+	INIT_LIST_HEAD(&sec->symbol_list);
+	INIT_LIST_HEAD(&sec->rela_list);
+	hash_init(sec->rela_hash);
+	hash_init(sec->symbol_hash);
+
+	list_add_tail(&sec->list, &elf->sections);
+
+	s = elf_newscn(elf->elf);
+	if (!s) {
+		WARN_ELF("elf_newscn");
+		return NULL;
+	}
+
+	sec->name = strdup(name);
+	if (!sec->name) {
+		perror("strdup");
+		return NULL;
+	}
+
+	sec->idx = elf_ndxscn(s);
+	sec->len = size;
+	sec->changed = true;
+
+	sec->data = elf_newdata(s);
+	if (!sec->data) {
+		WARN_ELF("elf_newdata");
+		return NULL;
+	}
+
+	sec->data->d_size = size;
+	sec->data->d_align = 1;
+
+	if (size) {
+		sec->data->d_buf = malloc(size);
+		if (!sec->data->d_buf) {
+			perror("malloc");
+			return NULL;
+		}
+		memset(sec->data->d_buf, 0, size);
+	}
+
+	if (!gelf_getshdr(s, &sec->sh)) {
+		WARN_ELF("gelf_getshdr");
+		return NULL;
+	}
+
+	sec->sh.sh_size = size;
+	sec->sh.sh_entsize = entsize;
+	sec->sh.sh_type = SHT_PROGBITS;
+	sec->sh.sh_addralign = 1;
+	sec->sh.sh_flags = SHF_ALLOC;
+
+
+	/* Add section name to .shstrtab */
+	shstrtab = find_section_by_name(elf, ".shstrtab");
+	if (!shstrtab) {
+		WARN("can't find .shstrtab section");
+		return NULL;
+	}
+
+	s = elf_getscn(elf->elf, shstrtab->idx);
+	if (!s) {
+		WARN_ELF("elf_getscn");
+		return NULL;
+	}
+
+	data = elf_newdata(s);
+	if (!data) {
+		WARN_ELF("elf_newdata");
+		return NULL;
+	}
+
+	data->d_buf = sec->name;
+	data->d_size = strlen(name) + 1;
+	data->d_align = 1;
+
+	sec->sh.sh_name = shstrtab->len;
+
+	shstrtab->len += strlen(name) + 1;
+	shstrtab->changed = true;
+
+	return sec;
+}
+
+struct section *elf_create_rela_section(struct elf *elf, struct section *base)
+{
+	char *relaname;
+	struct section *sec;
+
+	relaname = malloc(strlen(base->name) + strlen(".rela") + 1);
+	if (!relaname) {
+		perror("malloc");
+		return NULL;
+	}
+	strcpy(relaname, ".rela");
+	strcat(relaname, base->name);
+
+	sec = elf_create_section(elf, relaname, sizeof(GElf_Rela), 0);
+	if (!sec)
+		return NULL;
+
+	base->rela = sec;
+	sec->base = base;
+
+	sec->sh.sh_type = SHT_RELA;
+	sec->sh.sh_addralign = 8;
+	sec->sh.sh_link = find_section_by_name(elf, ".symtab")->idx;
+	sec->sh.sh_info = base->idx;
+	sec->sh.sh_flags = SHF_INFO_LINK;
+
+	return sec;
+}
+
+int elf_rebuild_rela_section(struct section *sec)
+{
+	struct rela *rela;
+	int nr, idx = 0, size;
+	GElf_Rela *relas;
+
+	nr = 0;
+	list_for_each_entry(rela, &sec->rela_list, list)
+		nr++;
+
+	size = nr * sizeof(*relas);
+	relas = malloc(size);
+	if (!relas) {
+		perror("malloc");
+		return -1;
+	}
+
+	sec->data->d_buf = relas;
+	sec->data->d_size = size;
+
+	sec->sh.sh_size = size;
+
+	idx = 0;
+	list_for_each_entry(rela, &sec->rela_list, list) {
+		relas[idx].r_offset = rela->offset;
+		relas[idx].r_addend = rela->addend;
+		relas[idx].r_info = GELF_R_INFO(rela->sym->idx, rela->type);
+		idx++;
+	}
+
+	return 0;
+}
+
+int elf_write(struct elf *elf)
+{
+	struct section *sec;
+	Elf_Scn *s;
+
+	list_for_each_entry(sec, &elf->sections, list) {
+		if (sec->changed) {
+			s = elf_getscn(elf->elf, sec->idx);
+			if (!s) {
+				WARN_ELF("elf_getscn");
+				return -1;
+			}
+			if (!gelf_update_shdr (s, &sec->sh)) {
+				WARN_ELF("gelf_update_shdr");
+				return -1;
+			}
+		}
+	}
+
+	if (elf_update(elf->elf, ELF_C_WRITE) < 0) {
+		WARN_ELF("elf_update");
+		return -1;
+	}
+
+	return 0;
+}
+
 void elf_close(struct elf *elf)
 {
 	struct section *sec, *tmpsec;
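
To illustrate how the new write-side API fits together, here's a minimal
sketch of a caller using it (hypothetical usage with abbreviated error
handling; the object and section names are made up):

#include <fcntl.h>
#include "elf.h"

static int add_example_section(const char *objname)
{
	struct elf *elf;
	struct section *sec;

	/* open read-write so elf_begin() gets ELF_C_RDWR */
	elf = elf_open(objname, O_RDWR);
	if (!elf)
		return -1;

	/* create an empty section plus its companion rela section */
	sec = elf_create_section(elf, ".example", sizeof(int), 0);
	if (!sec || !elf_create_rela_section(elf, sec))
		return -1;

	/* write the changed section headers and data back to disk */
	if (elf_write(elf))
		return -1;

	elf_close(elf);
	return 0;
}
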
diff --git a/tools/objtool/elf.h b/tools/objtool/elf.h
index 343968b..d86e2ff1 100644
--- a/tools/objtool/elf.h
+++ b/tools/objtool/elf.h
@@ -28,6 +28,13 @@
 # define elf_getshdrstrndx elf_getshstrndx
 #endif
 
+/*
+ * Fallback for systems without this "read, mmapping if possible" cmd.
+ */
+#ifndef ELF_C_READ_MMAP
+#define ELF_C_READ_MMAP ELF_C_READ
+#endif
+
 struct section {
 	struct list_head list;
 	GElf_Shdr sh;
@@ -41,6 +48,7 @@ struct section {
 	char *name;
 	int idx;
 	unsigned int len;
+	bool changed, text;
 };
 
 struct symbol {
@@ -75,7 +83,7 @@ struct elf {
 };
 
 
-struct elf *elf_open(const char *name);
+struct elf *elf_open(const char *name, int flags);
 struct section *find_section_by_name(struct elf *elf, const char *name);
 struct symbol *find_symbol_by_offset(struct section *sec, unsigned long offset);
 struct symbol *find_symbol_containing(struct section *sec, unsigned long offset);
@@ -83,6 +91,11 @@ struct rela *find_rela_by_dest(struct section *sec, unsigned long offset);
 struct rela *find_rela_by_dest_range(struct section *sec, unsigned long offset,
 				     unsigned int len);
 struct symbol *find_containing_func(struct section *sec, unsigned long offset);
+struct section *elf_create_section(struct elf *elf, const char *name,
+				   size_t entsize, int nr);
+struct section *elf_create_rela_section(struct elf *elf, struct section *base);
+int elf_rebuild_rela_section(struct section *sec);
+int elf_write(struct elf *elf);
 void elf_close(struct elf *elf);
 
 #define for_each_sec(file, sec)						\
diff --git a/tools/objtool/objtool.c b/tools/objtool/objtool.c
index ecc5b1b..31e0f91 100644
--- a/tools/objtool/objtool.c
+++ b/tools/objtool/objtool.c
@@ -42,10 +42,11 @@ struct cmd_struct {
 };
 
 static const char objtool_usage_string[] =
-	"objtool [OPTIONS] COMMAND [ARGS]";
+	"objtool COMMAND [ARGS]";
 
 static struct cmd_struct objtool_cmds[] = {
 	{"check",	cmd_check,	"Perform stack metadata validation on an object file" },
+	{"orc",		cmd_orc,	"Generate in-place ORC unwind tables for an object file" },
 };
 
 bool help;
diff --git a/tools/objtool/builtin.h b/tools/objtool/orc.h
similarity index 69%
copy from tools/objtool/builtin.h
copy to tools/objtool/orc.h
index 34d2ba7..a4139e3 100644
--- a/tools/objtool/builtin.h
+++ b/tools/objtool/orc.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2015 Josh Poimboeuf <jpoimboe@redhat.com>
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
@@ -14,9 +14,17 @@
  * You should have received a copy of the GNU General Public License
  * along with this program; if not, see <http://www.gnu.org/licenses/>.
  */
-#ifndef _BUILTIN_H
-#define _BUILTIN_H
 
-extern int cmd_check(int argc, const char **argv);
+#ifndef _ORC_H
+#define _ORC_H
 
-#endif /* _BUILTIN_H */
+#include "orc_types.h"
+
+struct objtool_file;
+
+int create_orc(struct objtool_file *file);
+int create_orc_sections(struct objtool_file *file);
+
+int orc_dump(const char *objname);
+
+#endif /* _ORC_H */
diff --git a/tools/objtool/orc_dump.c b/tools/objtool/orc_dump.c
new file mode 100644
index 0000000..36c5bf6
--- /dev/null
+++ b/tools/objtool/orc_dump.c
@@ -0,0 +1,212 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <unistd.h>
+#include "orc.h"
+#include "warn.h"
+
+static const char *reg_name(unsigned int reg)
+{
+	switch (reg) {
+	case ORC_REG_PREV_SP:
+		return "prevsp";
+	case ORC_REG_DX:
+		return "dx";
+	case ORC_REG_DI:
+		return "di";
+	case ORC_REG_BP:
+		return "bp";
+	case ORC_REG_SP:
+		return "sp";
+	case ORC_REG_R10:
+		return "r10";
+	case ORC_REG_R13:
+		return "r13";
+	case ORC_REG_BP_INDIRECT:
+		return "bp(ind)";
+	case ORC_REG_SP_INDIRECT:
+		return "sp(ind)";
+	default:
+		return "?";
+	}
+}
+
+static const char *orc_type_name(unsigned int type)
+{
+	switch (type) {
+	case ORC_TYPE_CALL:
+		return "call";
+	case ORC_TYPE_REGS:
+		return "regs";
+	case ORC_TYPE_REGS_IRET:
+		return "iret";
+	default:
+		return "?";
+	}
+}
+
+static void print_reg(unsigned int reg, int offset)
+{
+	if (reg == ORC_REG_BP_INDIRECT)
+		printf("(bp%+d)", offset);
+	else if (reg == ORC_REG_SP_INDIRECT)
+		printf("(sp%+d)", offset);
+	else if (reg == ORC_REG_UNDEFINED)
+		printf("(und)");
+	else
+		printf("%s%+d", reg_name(reg), offset);
+}
+
+int orc_dump(const char *_objname)
+{
+	int fd, nr_entries, i, *orc_ip = NULL, orc_size = 0;
+	struct orc_entry *orc = NULL;
+	char *name;
+	unsigned long nr_sections, orc_ip_addr = 0;
+	size_t shstrtab_idx;
+	Elf *elf;
+	Elf_Scn *scn;
+	GElf_Shdr sh;
+	GElf_Rela rela;
+	GElf_Sym sym;
+	Elf_Data *data, *symtab = NULL, *rela_orc_ip = NULL;
+
+
+	objname = _objname;
+
+	elf_version(EV_CURRENT);
+
+	fd = open(objname, O_RDONLY);
+	if (fd == -1) {
+		perror("open");
+		return -1;
+	}
+
+	elf = elf_begin(fd, ELF_C_READ_MMAP, NULL);
+	if (!elf) {
+		WARN_ELF("elf_begin");
+		return -1;
+	}
+
+	if (elf_getshdrnum(elf, &nr_sections)) {
+		WARN_ELF("elf_getshdrnum");
+		return -1;
+	}
+
+	if (elf_getshdrstrndx(elf, &shstrtab_idx)) {
+		WARN_ELF("elf_getshdrstrndx");
+		return -1;
+	}
+
+	for (i = 0; i < nr_sections; i++) {
+		scn = elf_getscn(elf, i);
+		if (!scn) {
+			WARN_ELF("elf_getscn");
+			return -1;
+		}
+
+		if (!gelf_getshdr(scn, &sh)) {
+			WARN_ELF("gelf_getshdr");
+			return -1;
+		}
+
+		name = elf_strptr(elf, shstrtab_idx, sh.sh_name);
+		if (!name) {
+			WARN_ELF("elf_strptr");
+			return -1;
+		}
+
+		data = elf_getdata(scn, NULL);
+		if (!data) {
+			WARN_ELF("elf_getdata");
+			return -1;
+		}
+
+		if (!strcmp(name, ".symtab")) {
+			symtab = data;
+		} else if (!strcmp(name, ".orc_unwind")) {
+			orc = data->d_buf;
+			orc_size = sh.sh_size;
+		} else if (!strcmp(name, ".orc_unwind_ip")) {
+			orc_ip = data->d_buf;
+			orc_ip_addr = sh.sh_addr;
+		} else if (!strcmp(name, ".rela.orc_unwind_ip")) {
+			rela_orc_ip = data;
+		}
+	}
+
+	if (!symtab || !orc || !orc_ip)
+		return 0;
+
+	if (orc_size % sizeof(*orc) != 0) {
+		WARN("bad .orc_unwind section size");
+		return -1;
+	}
+
+	nr_entries = orc_size / sizeof(*orc);
+	for (i = 0; i < nr_entries; i++) {
+		if (rela_orc_ip) {
+			if (!gelf_getrela(rela_orc_ip, i, &rela)) {
+				WARN_ELF("gelf_getrela");
+				return -1;
+			}
+
+			if (!gelf_getsym(symtab, GELF_R_SYM(rela.r_info), &sym)) {
+				WARN_ELF("gelf_getsym");
+				return -1;
+			}
+
+			scn = elf_getscn(elf, sym.st_shndx);
+			if (!scn) {
+				WARN_ELF("elf_getscn");
+				return -1;
+			}
+
+			if (!gelf_getshdr(scn, &sh)) {
+				WARN_ELF("gelf_getshdr");
+				return -1;
+			}
+
+			name = elf_strptr(elf, shstrtab_idx, sh.sh_name);
+			if (!name || !*name) {
+				WARN_ELF("elf_strptr");
+				return -1;
+			}
+
+			printf("%s+%lx:", name, rela.r_addend);
+
+		} else {
+			printf("%lx:", orc_ip_addr + (i * sizeof(int)) + orc_ip[i]);
+		}
+
+
+		printf(" sp:");
+
+		print_reg(orc[i].sp_reg, orc[i].sp_offset);
+
+		printf(" bp:");
+
+		print_reg(orc[i].bp_reg, orc[i].bp_offset);
+
+		printf(" type:%s\n", orc_type_name(orc[i].type));
+	}
+
+	elf_end(elf);
+	close(fd);
+
+	return 0;
+}
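
With the format strings above, each table entry dumps as one line.
Hypothetical output for two ordinary call-frame entries might look like:

  .text+2a40: sp:sp+8 bp:(und) type:call
  .text+2a60: sp:sp+16 bp:prevsp-16 type:call
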
diff --git a/tools/objtool/orc_gen.c b/tools/objtool/orc_gen.c
new file mode 100644
index 0000000..e5ca314
--- /dev/null
+++ b/tools/objtool/orc_gen.c
@@ -0,0 +1,214 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <stdlib.h>
+#include <string.h>
+
+#include "orc.h"
+#include "check.h"
+#include "warn.h"
+
+int create_orc(struct objtool_file *file)
+{
+	struct instruction *insn;
+
+	for_each_insn(file, insn) {
+		struct orc_entry *orc = &insn->orc;
+		struct cfi_reg *cfa = &insn->state.cfa;
+		struct cfi_reg *bp = &insn->state.regs[CFI_BP];
+
+		if (cfa->base == CFI_UNDEFINED) {
+			orc->sp_reg = ORC_REG_UNDEFINED;
+			continue;
+		}
+
+		switch (cfa->base) {
+		case CFI_SP:
+			orc->sp_reg = ORC_REG_SP;
+			break;
+		case CFI_SP_INDIRECT:
+			orc->sp_reg = ORC_REG_SP_INDIRECT;
+			break;
+		case CFI_BP:
+			orc->sp_reg = ORC_REG_BP;
+			break;
+		case CFI_BP_INDIRECT:
+			orc->sp_reg = ORC_REG_BP_INDIRECT;
+			break;
+		case CFI_R10:
+			orc->sp_reg = ORC_REG_R10;
+			break;
+		case CFI_R13:
+			orc->sp_reg = ORC_REG_R13;
+			break;
+		case CFI_DI:
+			orc->sp_reg = ORC_REG_DI;
+			break;
+		case CFI_DX:
+			orc->sp_reg = ORC_REG_DX;
+			break;
+		default:
+			WARN_FUNC("unknown CFA base reg %d",
+				  insn->sec, insn->offset, cfa->base);
+			return -1;
+		}
+
+		switch(bp->base) {
+		case CFI_UNDEFINED:
+			orc->bp_reg = ORC_REG_UNDEFINED;
+			break;
+		case CFI_CFA:
+			orc->bp_reg = ORC_REG_PREV_SP;
+			break;
+		case CFI_BP:
+			orc->bp_reg = ORC_REG_BP;
+			break;
+		default:
+			WARN_FUNC("unknown BP base reg %d",
+				  insn->sec, insn->offset, bp->base);
+			return -1;
+		}
+
+		orc->sp_offset = cfa->offset;
+		orc->bp_offset = bp->offset;
+		orc->type = insn->state.type;
+	}
+
+	return 0;
+}
+
+static int create_orc_entry(struct section *u_sec, struct section *ip_relasec,
+				unsigned int idx, struct section *insn_sec,
+				unsigned long insn_off, struct orc_entry *o)
+{
+	struct orc_entry *orc;
+	struct rela *rela;
+
+	/* populate ORC data */
+	orc = (struct orc_entry *)u_sec->data->d_buf + idx;
+	memcpy(orc, o, sizeof(*orc));
+
+	/* populate rela for ip */
+	rela = malloc(sizeof(*rela));
+	if (!rela) {
+		perror("malloc");
+		return -1;
+	}
+	memset(rela, 0, sizeof(*rela));
+
+	rela->sym = insn_sec->sym;
+	rela->addend = insn_off;
+	rela->type = R_X86_64_PC32;
+	rela->offset = idx * sizeof(int);
+
+	list_add_tail(&rela->list, &ip_relasec->rela_list);
+	hash_add(ip_relasec->rela_hash, &rela->hash, rela->offset);
+
+	return 0;
+}
+
+int create_orc_sections(struct objtool_file *file)
+{
+	struct instruction *insn, *prev_insn;
+	struct section *sec, *u_sec, *ip_relasec;
+	unsigned int idx;
+
+	struct orc_entry empty = {
+		.sp_reg = ORC_REG_UNDEFINED,
+		.bp_reg  = ORC_REG_UNDEFINED,
+		.type    = ORC_TYPE_CALL,
+	};
+
+	sec = find_section_by_name(file->elf, ".orc_unwind");
+	if (sec) {
+		WARN("file already has .orc_unwind section, skipping");
+		return -1;
+	}
+
+	/* count the number of needed orcs */
+	idx = 0;
+	for_each_sec(file, sec) {
+		if (!sec->text)
+			continue;
+
+		prev_insn = NULL;
+		sec_for_each_insn(file, sec, insn) {
+			if (!prev_insn ||
+			    memcmp(&insn->orc, &prev_insn->orc,
+				   sizeof(struct orc_entry))) {
+				idx++;
+			}
+			prev_insn = insn;
+		}
+
+		/* section terminator */
+		if (prev_insn)
+			idx++;
+	}
+	if (!idx)
+		return -1;
+
+
+	/* create .orc_unwind_ip and .rela.orc_unwind_ip sections */
+	sec = elf_create_section(file->elf, ".orc_unwind_ip", sizeof(int), idx);
+
+	ip_relasec = elf_create_rela_section(file->elf, sec);
+	if (!ip_relasec)
+		return -1;
+
+	/* create .orc_unwind section */
+	u_sec = elf_create_section(file->elf, ".orc_unwind",
+				   sizeof(struct orc_entry), idx);
+
+	/* populate sections */
+	idx = 0;
+	for_each_sec(file, sec) {
+		if (!sec->text)
+			continue;
+
+		prev_insn = NULL;
+		sec_for_each_insn(file, sec, insn) {
+			if (!prev_insn || memcmp(&insn->orc, &prev_insn->orc,
+						 sizeof(struct orc_entry))) {
+
+				if (create_orc_entry(u_sec, ip_relasec, idx,
+						     insn->sec, insn->offset,
+						     &insn->orc))
+					return -1;
+
+				idx++;
+			}
+			prev_insn = insn;
+		}
+
+		/* section terminator */
+		if (prev_insn) {
+			if (create_orc_entry(u_sec, ip_relasec, idx,
+					     prev_insn->sec,
+					     prev_insn->offset + prev_insn->len,
+					     &empty))
+				return -1;
+
+			idx++;
+		}
+	}
+
+	if (elf_rebuild_rela_section(ip_relasec))
+		return -1;
+
+	return 0;
+}
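
Because runs of instructions with identical unwind state collapse into a
single entry (plus a terminator per text section), a consumer finds the entry
for a given address by locating the last .orc_unwind_ip slot whose decoded
address is <= that address.  A minimal binary-search sketch of such a lookup
(illustrative only; the in-kernel unwinder layers a fast lookup table on top
of a search like this):

/* orc_ip[] stores each address relative to its own slot (R_X86_64_PC32): */
static struct orc_entry *orc_find(int *orc_ip, struct orc_entry *orc,
				  unsigned int num, unsigned long ip)
{
	unsigned int lo = 0, hi = num;

	/* caller must ensure ip >= the first covered address */
	while (hi - lo > 1) {
		unsigned int mid = lo + (hi - lo) / 2;

		if ((unsigned long)&orc_ip[mid] + orc_ip[mid] <= ip)
			lo = mid;
		else
			hi = mid;
	}

	return &orc[lo];
}
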
diff --git a/tools/objtool/orc_types.h b/tools/objtool/orc_types.h
new file mode 100644
index 0000000..fc5cf6c
--- /dev/null
+++ b/tools/objtool/orc_types.h
@@ -0,0 +1,85 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _ORC_TYPES_H
+#define _ORC_TYPES_H
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+
+/*
+ * The ORC_REG_* registers are base registers which are used to find other
+ * registers on the stack.
+ *
+ * ORC_REG_PREV_SP, also known as DWARF Call Frame Address (CFA), is the
+ * address of the previous frame: the caller's SP before it called the current
+ * function.
+ *
+ * ORC_REG_UNDEFINED means the corresponding register's value didn't change in
+ * the current frame.
+ *
+ * The most commonly used base registers are SP and BP -- which the previous SP
+ * is usually based on -- and PREV_SP and UNDEFINED -- which the previous BP is
+ * usually based on.
+ *
+ * The rest of the base registers are needed for special cases like entry code
+ * and GCC realigned stacks.
+ */
+#define ORC_REG_UNDEFINED		0
+#define ORC_REG_PREV_SP			1
+#define ORC_REG_DX			2
+#define ORC_REG_DI			3
+#define ORC_REG_BP			4
+#define ORC_REG_SP			5
+#define ORC_REG_R10			6
+#define ORC_REG_R13			7
+#define ORC_REG_BP_INDIRECT		8
+#define ORC_REG_SP_INDIRECT		9
+#define ORC_REG_MAX			15
+
+/*
+ * ORC_TYPE_CALL: Indicates that sp_reg+sp_offset resolves to PREV_SP (the
+ * caller's SP right before it made the call).  Used for all callable
+ * functions, i.e. all C code and all callable asm functions.
+ *
+ * ORC_TYPE_REGS: Used in entry code to indicate that sp_reg+sp_offset points
+ * to a fully populated pt_regs from a syscall, interrupt, or exception.
+ *
+ * ORC_TYPE_REGS_IRET: Used in entry code to indicate that sp_reg+sp_offset
+ * points to the iret return frame.
+ */
+#define ORC_TYPE_CALL			0
+#define ORC_TYPE_REGS			1
+#define ORC_TYPE_REGS_IRET		2
+
+/*
+ * This struct is more or less a vastly simplified version of the DWARF Call
+ * Frame Information standard.  It contains only the necessary parts of DWARF
+ * CFI, simplified for ease of access by the in-kernel unwinder.  It tells the
+ * unwinder how to find the previous SP and BP (and sometimes entry regs) on
+ * the stack for a given code address.  Each instance of the struct corresponds
+ * to one or more code locations.
+ */
+struct orc_entry {
+	s16		sp_offset;
+	s16		bp_offset;
+	unsigned	sp_reg:4;
+	unsigned	bp_reg:4;
+	unsigned	type:2;
+} __packed;
+
+#endif /* _ORC_TYPES_H */
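
To make the encoding concrete, here's a sketch of how an unwinder would apply
one entry to find the previous stack pointer (illustrative only; the real
unwinder validates every dereference and handles the remaining base registers
as well):

static unsigned long orc_prev_sp(struct orc_entry *orc,
				 unsigned long sp, unsigned long bp)
{
	switch (orc->sp_reg) {
	case ORC_REG_SP:
		return sp + orc->sp_offset;
	case ORC_REG_BP:
		return bp + orc->sp_offset;
	case ORC_REG_SP_INDIRECT:
		/* e.g. GCC-realigned stacks: the old SP was spilled here */
		return *(unsigned long *)(sp + orc->sp_offset);
	default:
		return 0;	/* ORC_REG_UNDEFINED: can't unwind */
	}
}

For ORC_TYPE_CALL entries, the caller's return address then sits directly
below the recovered PREV_SP.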


* [tip:x86/asm] objtool, x86: Add facility for asm code to provide unwind hints
  2017-07-11 15:33 ` [PATCH v3 06/10] objtool, x86: add facility for asm code to provide unwind hints Josh Poimboeuf
@ 2017-07-18 10:43   ` tip-bot for Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Josh Poimboeuf @ 2017-07-18 10:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, efault, jslaby, luto, brgerst, bp, dvlasenk,
	linux-kernel, hpa, mingo, peterz, tglx, jpoimboe

Commit-ID:  39358a033b2e4432052265c1fa0f36f572d8cfb5
Gitweb:     http://git.kernel.org/tip/39358a033b2e4432052265c1fa0f36f572d8cfb5
Author:     Josh Poimboeuf <jpoimboe@redhat.com>
AuthorDate: Tue, 11 Jul 2017 10:33:43 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 18 Jul 2017 10:57:44 +0200

objtool, x86: Add facility for asm code to provide unwind hints

Some asm (and inline asm) code does special things to the stack which
objtool can't understand.  (Nor can GCC or GNU assembler, for that
matter.)  In such cases we need a facility for the code to provide
annotations, so the unwinder can unwind through it.

This provides such a facility, in the form of unwind hints.  They're
similar to the GNU assembler .cfi* directives, but they give more
information, and are needed in far fewer places, because objtool can
fill in the blanks by following branches and adjusting the stack pointer
for pushes and pops.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/0f5f3c9104fca559ff4088bece1d14ae3bca52d5.1499786555.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 .../objtool => arch/x86/include/asm}/orc_types.h   |  24 ++-
 arch/x86/include/asm/unwind_hints.h                | 103 +++++++++++
 tools/objtool/Makefile                             |   3 +
 tools/objtool/check.c                              | 191 +++++++++++++++++++--
 tools/objtool/check.h                              |   4 +-
 tools/objtool/orc_types.h                          |  22 +++
 6 files changed, 333 insertions(+), 14 deletions(-)
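
As the UNWIND_HINT macro below shows, each hint record stores the annotated
instruction's address PC-relative (".long .Lunwind_hint_ip_\@ - .").  A sketch
of how such a record decodes back to an absolute address (illustrative only;
objtool itself resolves the address through the section's relocation entries
in read_unwind_hints()):

static unsigned long hint_ip(struct unwind_hint *hint)
{
	/* 'ip' holds (target - &hint->ip), so add the field's address back: */
	return (unsigned long)&hint->ip + (s32)hint->ip;
}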

diff --git a/tools/objtool/orc_types.h b/arch/x86/include/asm/orc_types.h
similarity index 82%
copy from tools/objtool/orc_types.h
copy to arch/x86/include/asm/orc_types.h
index fc5cf6c..7dc777a 100644
--- a/tools/objtool/orc_types.h
+++ b/arch/x86/include/asm/orc_types.h
@@ -61,11 +61,19 @@
  *
  * ORC_TYPE_REGS_IRET: Used in entry code to indicate that sp_reg+sp_offset
  * points to the iret return frame.
+ *
+ * The UNWIND_HINT macros are used only for the unwind_hint struct.  They
+ * aren't used in struct orc_entry due to size and complexity constraints.
+ * Objtool converts them to real types when it converts the hints to orc
+ * entries.
  */
 #define ORC_TYPE_CALL			0
 #define ORC_TYPE_REGS			1
 #define ORC_TYPE_REGS_IRET		2
+#define UNWIND_HINT_TYPE_SAVE		3
+#define UNWIND_HINT_TYPE_RESTORE	4
 
+#ifndef __ASSEMBLY__
 /*
  * This struct is more or less a vastly simplified version of the DWARF Call
  * Frame Information standard.  It contains only the necessary parts of DWARF
@@ -80,6 +88,20 @@ struct orc_entry {
 	unsigned	sp_reg:4;
 	unsigned	bp_reg:4;
 	unsigned	type:2;
-} __packed;
+};
+
+/*
+ * This struct is used by asm and inline asm code to manually annotate the
+ * location of registers on the stack for the ORC unwinder.
+ *
+ * Type can be either ORC_TYPE_* or UNWIND_HINT_TYPE_*.
+ */
+struct unwind_hint {
+	u32		ip;
+	s16		sp_offset;
+	u8		sp_reg;
+	u8		type;
+};
+#endif /* __ASSEMBLY__ */
 
 #endif /* _ORC_TYPES_H */
diff --git a/arch/x86/include/asm/unwind_hints.h b/arch/x86/include/asm/unwind_hints.h
new file mode 100644
index 0000000..5e02b11
--- /dev/null
+++ b/arch/x86/include/asm/unwind_hints.h
@@ -0,0 +1,103 @@
+#ifndef _ASM_X86_UNWIND_HINTS_H
+#define _ASM_X86_UNWIND_HINTS_H
+
+#include "orc_types.h"
+
+#ifdef __ASSEMBLY__
+
+/*
+ * In asm, there are two kinds of code: normal C-type callable functions and
+ * the rest.  The normal callable functions can be called by other code, and
+ * don't do anything unusual with the stack.  Such normal callable functions
+ * are annotated with the ENTRY/ENDPROC macros.  Most asm code falls in this
+ * category.  In this case, no special debugging annotations are needed because
+ * objtool can automatically generate the ORC data for the ORC unwinder to read
+ * at runtime.
+ *
+ * Anything which doesn't fall into the above category, such as syscall and
+ * interrupt handlers, tends to not be called directly by other functions, and
+ * often does unusual non-C-function-type things with the stack pointer.  Such
+ * code needs to be annotated such that objtool can understand it.  The
+ * following CFI hint macros are for this type of code.
+ *
+ * These macros provide hints to objtool about the state of the stack at each
+ * instruction.  Objtool starts from the hints and follows the code flow,
+ * making automatic CFI adjustments when it sees pushes and pops, filling out
+ * the debuginfo as necessary.  It will also warn if it sees any
+ * inconsistencies.
+ */
+.macro UNWIND_HINT sp_reg=ORC_REG_SP sp_offset=0 type=ORC_TYPE_CALL
+#ifdef CONFIG_STACK_VALIDATION
+.Lunwind_hint_ip_\@:
+	.pushsection .discard.unwind_hints
+		/* struct unwind_hint */
+		.long .Lunwind_hint_ip_\@ - .
+		.short \sp_offset
+		.byte \sp_reg
+		.byte \type
+	.popsection
+#endif
+.endm
+
+.macro UNWIND_HINT_EMPTY
+	UNWIND_HINT sp_reg=ORC_REG_UNDEFINED
+.endm
+
+.macro UNWIND_HINT_REGS base=%rsp offset=0 indirect=0 extra=1 iret=0
+	.if \base == %rsp && \indirect
+		.set sp_reg, ORC_REG_SP_INDIRECT
+	.elseif \base == %rsp
+		.set sp_reg, ORC_REG_SP
+	.elseif \base == %rbp
+		.set sp_reg, ORC_REG_BP
+	.elseif \base == %rdi
+		.set sp_reg, ORC_REG_DI
+	.elseif \base == %rdx
+		.set sp_reg, ORC_REG_DX
+	.elseif \base == %r10
+		.set sp_reg, ORC_REG_R10
+	.else
+		.error "UNWIND_HINT_REGS: bad base register"
+	.endif
+
+	.set sp_offset, \offset
+
+	.if \iret
+		.set type, ORC_TYPE_REGS_IRET
+	.elseif \extra == 0
+		.set type, ORC_TYPE_REGS_IRET
+		.set sp_offset, \offset + (16*8)
+	.else
+		.set type, ORC_TYPE_REGS
+	.endif
+
+	UNWIND_HINT sp_reg=sp_reg sp_offset=sp_offset type=type
+.endm
+
+.macro UNWIND_HINT_IRET_REGS base=%rsp offset=0
+	UNWIND_HINT_REGS base=\base offset=\offset iret=1
+.endm
+
+.macro UNWIND_HINT_FUNC sp_offset=8
+	UNWIND_HINT sp_offset=\sp_offset
+.endm
+
+#else /* !__ASSEMBLY__ */
+
+#define UNWIND_HINT(sp_reg, sp_offset, type)			\
+	"987: \n\t"						\
+	".pushsection .discard.unwind_hints\n\t"		\
+	/* struct unwind_hint */				\
+	".long 987b - .\n\t"					\
+	".short " __stringify(sp_offset) "\n\t"		\
+	".byte " __stringify(sp_reg) "\n\t"			\
+	".byte " __stringify(type) "\n\t"			\
+	".popsection\n\t"
+
+#define UNWIND_HINT_SAVE UNWIND_HINT(0, 0, UNWIND_HINT_TYPE_SAVE)
+
+#define UNWIND_HINT_RESTORE UNWIND_HINT(0, 0, UNWIND_HINT_TYPE_RESTORE)
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_UNWIND_HINTS_H */
diff --git a/tools/objtool/Makefile b/tools/objtool/Makefile
index 0e2765e..3a6425f 100644
--- a/tools/objtool/Makefile
+++ b/tools/objtool/Makefile
@@ -52,6 +52,9 @@ $(OBJTOOL): $(LIBSUBCMD) $(OBJTOOL_IN)
 	diff -I'^#include' arch/x86/insn/inat.h ../../arch/x86/include/asm/inat.h >/dev/null && \
 	diff -I'^#include' arch/x86/insn/inat_types.h ../../arch/x86/include/asm/inat_types.h >/dev/null) \
 	|| echo "warning: objtool: x86 instruction decoder differs from kernel" >&2 )) || true
+	@(test -d ../../kernel -a -d ../../tools -a -d ../objtool && (( \
+	diff ../../arch/x86/include/asm/orc_types.h orc_types.h >/dev/null) \
+	|| echo "warning: objtool: orc_types.h differs from kernel" >&2 )) || true
 	$(QUIET_LINK)$(CC) $(OBJTOOL_IN) $(LDFLAGS) -o $@
 
 
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index cb57c52..368275d 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -100,7 +100,6 @@ static bool gcov_enabled(struct objtool_file *file)
 static bool ignore_func(struct objtool_file *file, struct symbol *func)
 {
 	struct rela *rela;
-	struct instruction *insn;
 
 	/* check for STACK_FRAME_NON_STANDARD */
 	if (file->whitelist && file->whitelist->rela)
@@ -113,11 +112,6 @@ static bool ignore_func(struct objtool_file *file, struct symbol *func)
 				return true;
 		}
 
-	/* check if it has a context switching instruction */
-	func_for_each_insn(file, func, insn)
-		if (insn->type == INSN_CONTEXT_SWITCH)
-			return true;
-
 	return false;
 }
 
@@ -879,6 +873,99 @@ static int add_switch_table_alts(struct objtool_file *file)
 	return 0;
 }
 
+static int read_unwind_hints(struct objtool_file *file)
+{
+	struct section *sec, *relasec;
+	struct rela *rela;
+	struct unwind_hint *hint;
+	struct instruction *insn;
+	struct cfi_reg *cfa;
+	int i;
+
+	sec = find_section_by_name(file->elf, ".discard.unwind_hints");
+	if (!sec)
+		return 0;
+
+	relasec = sec->rela;
+	if (!relasec) {
+		WARN("missing .rela.discard.unwind_hints section");
+		return -1;
+	}
+
+	if (sec->len % sizeof(struct unwind_hint)) {
+		WARN("struct unwind_hint size mismatch");
+		return -1;
+	}
+
+	file->hints = true;
+
+	for (i = 0; i < sec->len / sizeof(struct unwind_hint); i++) {
+		hint = (struct unwind_hint *)sec->data->d_buf + i;
+
+		rela = find_rela_by_dest(sec, i * sizeof(*hint));
+		if (!rela) {
+			WARN("can't find rela for unwind_hints[%d]", i);
+			return -1;
+		}
+
+		insn = find_insn(file, rela->sym->sec, rela->addend);
+		if (!insn) {
+			WARN("can't find insn for unwind_hints[%d]", i);
+			return -1;
+		}
+
+		cfa = &insn->state.cfa;
+
+		if (hint->type == UNWIND_HINT_TYPE_SAVE) {
+			insn->save = true;
+			continue;
+
+		} else if (hint->type == UNWIND_HINT_TYPE_RESTORE) {
+			insn->restore = true;
+			insn->hint = true;
+			continue;
+		}
+
+		insn->hint = true;
+
+		switch (hint->sp_reg) {
+		case ORC_REG_UNDEFINED:
+			cfa->base = CFI_UNDEFINED;
+			break;
+		case ORC_REG_SP:
+			cfa->base = CFI_SP;
+			break;
+		case ORC_REG_BP:
+			cfa->base = CFI_BP;
+			break;
+		case ORC_REG_SP_INDIRECT:
+			cfa->base = CFI_SP_INDIRECT;
+			break;
+		case ORC_REG_R10:
+			cfa->base = CFI_R10;
+			break;
+		case ORC_REG_R13:
+			cfa->base = CFI_R13;
+			break;
+		case ORC_REG_DI:
+			cfa->base = CFI_DI;
+			break;
+		case ORC_REG_DX:
+			cfa->base = CFI_DX;
+			break;
+		default:
+			WARN_FUNC("unsupported unwind_hint sp base reg %d",
+				  insn->sec, insn->offset, hint->sp_reg);
+			return -1;
+		}
+
+		cfa->offset = hint->sp_offset;
+		insn->state.type = hint->type;
+	}
+
+	return 0;
+}
+
 static int decode_sections(struct objtool_file *file)
 {
 	int ret;
@@ -909,6 +996,10 @@ static int decode_sections(struct objtool_file *file)
 	if (ret)
 		return ret;
 
+	ret = read_unwind_hints(file);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 
@@ -1382,7 +1473,7 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 			   struct insn_state state)
 {
 	struct alternative *alt;
-	struct instruction *insn;
+	struct instruction *insn, *next_insn;
 	struct section *sec;
 	struct symbol *func = NULL;
 	int ret;
@@ -1397,6 +1488,8 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 	}
 
 	while (1) {
+		next_insn = next_insn_same_sec(file, insn);
+
 		if (file->c_file && insn->func) {
 			if (func && func != insn->func) {
 				WARN("%s() falls through to next function %s()",
@@ -1414,13 +1507,54 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 		}
 
 		if (insn->visited) {
-			if (!insn_state_match(insn, &state))
+			if (!insn->hint && !insn_state_match(insn, &state))
 				return 1;
 
 			return 0;
 		}
 
-		insn->state = state;
+		if (insn->hint) {
+			if (insn->restore) {
+				struct instruction *save_insn, *i;
+
+				i = insn;
+				save_insn = NULL;
+				func_for_each_insn_continue_reverse(file, func, i) {
+					if (i->save) {
+						save_insn = i;
+						break;
+					}
+				}
+
+				if (!save_insn) {
+					WARN_FUNC("no corresponding CFI save for CFI restore",
+						  sec, insn->offset);
+					return 1;
+				}
+
+				if (!save_insn->visited) {
+					/*
+					 * Oops, no state to copy yet.
+					 * Hopefully we can reach this
+					 * instruction from another branch
+					 * after the save insn has been
+					 * visited.
+					 */
+					if (insn == first)
+						return 0;
+
+					WARN_FUNC("objtool isn't smart enough to handle this CFI save/restore combo",
+						  sec, insn->offset);
+					return 1;
+				}
+
+				insn->state = save_insn->state;
+			}
+
+			state = insn->state;
+
+		} else
+			insn->state = state;
 
 		insn->visited = true;
 
@@ -1497,6 +1631,14 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 
 			return 0;
 
+		case INSN_CONTEXT_SWITCH:
+			if (func && (!next_insn || !next_insn->hint)) {
+				WARN_FUNC("unsupported instruction in callable function",
+					  sec, insn->offset);
+				return 1;
+			}
+			return 0;
+
 		case INSN_STACK:
 			if (update_insn_state(insn, &state))
 				return -1;
@@ -1510,7 +1652,7 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 		if (insn->dead_end)
 			return 0;
 
-		insn = next_insn_same_sec(file, insn);
+		insn = next_insn;
 		if (!insn) {
 			WARN("%s: unexpected end of section", sec->name);
 			return 1;
@@ -1520,6 +1662,27 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 	return 0;
 }
 
+static int validate_unwind_hints(struct objtool_file *file)
+{
+	struct instruction *insn;
+	int ret, warnings = 0;
+	struct insn_state state;
+
+	if (!file->hints)
+		return 0;
+
+	clear_insn_state(&state);
+
+	for_each_insn(file, insn) {
+		if (insn->hint && !insn->visited) {
+			ret = validate_branch(file, insn, state);
+			warnings += ret;
+		}
+	}
+
+	return warnings;
+}
+
 static bool is_kasan_insn(struct instruction *insn)
 {
 	return (insn->type == INSN_CALL &&
@@ -1665,8 +1828,9 @@ int check(const char *_objname, bool _nofp, bool orc)
 	hash_init(file.insn_hash);
 	file.whitelist = find_section_by_name(file.elf, ".discard.func_stack_frame_non_standard");
 	file.rodata = find_section_by_name(file.elf, ".rodata");
-	file.ignore_unreachables = false;
 	file.c_file = find_section_by_name(file.elf, ".comment");
+	file.ignore_unreachables = false;
+	file.hints = false;
 
 	arch_initial_func_cfi_state(&initial_func_cfi);
 
@@ -1683,6 +1847,11 @@ int check(const char *_objname, bool _nofp, bool orc)
 		goto out;
 	warnings += ret;
 
+	ret = validate_unwind_hints(&file);
+	if (ret < 0)
+		goto out;
+	warnings += ret;
+
 	if (!warnings) {
 		ret = validate_reachable_instructions(&file);
 		if (ret < 0)
diff --git a/tools/objtool/check.h b/tools/objtool/check.h
index 046874b..ac3d4b1 100644
--- a/tools/objtool/check.h
+++ b/tools/objtool/check.h
@@ -43,7 +43,7 @@ struct instruction {
 	unsigned int len;
 	unsigned char type;
 	unsigned long immediate;
-	bool alt_group, visited, dead_end, ignore;
+	bool alt_group, visited, dead_end, ignore, hint, save, restore;
 	struct symbol *call_dest;
 	struct instruction *jump_dest;
 	struct list_head alts;
@@ -58,7 +58,7 @@ struct objtool_file {
 	struct list_head insn_list;
 	DECLARE_HASHTABLE(insn_hash, 16);
 	struct section *rodata, *whitelist;
-	bool ignore_unreachables, c_file;
+	bool ignore_unreachables, c_file, hints;
 };
 
 int check(const char *objname, bool nofp, bool orc);
diff --git a/tools/objtool/orc_types.h b/tools/objtool/orc_types.h
index fc5cf6c..9c9dc57 100644
--- a/tools/objtool/orc_types.h
+++ b/tools/objtool/orc_types.h
@@ -61,11 +61,19 @@
  *
  * ORC_TYPE_REGS_IRET: Used in entry code to indicate that sp_reg+sp_offset
  * points to the iret return frame.
+ *
+ * The UNWIND_HINT macros are used only for the unwind_hint struct.  They
+ * aren't used in struct orc_entry due to size and complexity constraints.
+ * Objtool converts them to real types when it converts the hints to orc
+ * entries.
  */
 #define ORC_TYPE_CALL			0
 #define ORC_TYPE_REGS			1
 #define ORC_TYPE_REGS_IRET		2
+#define UNWIND_HINT_TYPE_SAVE		3
+#define UNWIND_HINT_TYPE_RESTORE	4
 
+#ifndef __ASSEMBLY__
 /*
  * This struct is more or less a vastly simplified version of the DWARF Call
  * Frame Information standard.  It contains only the necessary parts of DWARF
@@ -82,4 +90,18 @@ struct orc_entry {
 	unsigned	type:2;
 } __packed;
 
+/*
+ * This struct is used by asm and inline asm code to manually annotate the
+ * location of registers on the stack for the ORC unwinder.
+ *
+ * Type can be either ORC_TYPE_* or UNWIND_HINT_TYPE_*.
+ */
+struct unwind_hint {
+	u32		ip;
+	s16		sp_offset;
+	u8		sp_reg;
+	u8		type;
+};
+#endif /* __ASSEMBLY__ */
+
 #endif /* _ORC_TYPES_H */


* [tip:x86/asm] x86/entry/64: Add unwind hint annotations
  2017-07-11 15:33 ` [PATCH v3 07/10] x86/entry/64: add unwind hint annotations Josh Poimboeuf
@ 2017-07-18 10:43   ` tip-bot for Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Josh Poimboeuf @ 2017-07-18 10:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, efault, torvalds, luto, dvlasenk, jpoimboe, bp, hpa,
	mingo, jslaby, linux-kernel, tglx, brgerst

Commit-ID:  8c1f75587a18ca032da8f6376d1ed882d7095289
Gitweb:     http://git.kernel.org/tip/8c1f75587a18ca032da8f6376d1ed882d7095289
Author:     Josh Poimboeuf <jpoimboe@redhat.com>
AuthorDate: Tue, 11 Jul 2017 10:33:44 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 18 Jul 2017 10:57:44 +0200

x86/entry/64: Add unwind hint annotations

Add unwind hint annotations to entry_64.S.  This will enable the ORC
unwinder to unwind through any location in the entry code including
syscalls, interrupts, and exceptions.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/b9f6d478aadf68ba57c739dcfac34ec0dc021c4c.1499786555.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/Makefile   |  1 -
 arch/x86/entry/calling.h  |  5 ++++
 arch/x86/entry/entry_64.S | 71 ++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 66 insertions(+), 11 deletions(-)

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 9976fce..af28a8a 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -2,7 +2,6 @@
 # Makefile for the x86 low level entry code
 #
 
-OBJECT_FILES_NON_STANDARD_entry_$(BITS).o   := y
 OBJECT_FILES_NON_STANDARD_entry_64_compat.o := y
 
 CFLAGS_syscall_64.o		+= $(call cc-option,-Wno-override-init,)
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 05ed3d3..640aafe 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -1,4 +1,5 @@
 #include <linux/jump_label.h>
+#include <asm/unwind_hints.h>
 
 /*
 
@@ -112,6 +113,7 @@ For 32-bit we have the following conventions - kernel is built with
 	movq %rdx, 12*8+\offset(%rsp)
 	movq %rsi, 13*8+\offset(%rsp)
 	movq %rdi, 14*8+\offset(%rsp)
+	UNWIND_HINT_REGS offset=\offset extra=0
 	.endm
 	.macro SAVE_C_REGS offset=0
 	SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
@@ -136,6 +138,7 @@ For 32-bit we have the following conventions - kernel is built with
 	movq %r12, 3*8+\offset(%rsp)
 	movq %rbp, 4*8+\offset(%rsp)
 	movq %rbx, 5*8+\offset(%rsp)
+	UNWIND_HINT_REGS offset=\offset
 	.endm
 
 	.macro RESTORE_EXTRA_REGS offset=0
@@ -145,6 +148,7 @@ For 32-bit we have the following conventions - kernel is built with
 	movq 3*8+\offset(%rsp), %r12
 	movq 4*8+\offset(%rsp), %rbp
 	movq 5*8+\offset(%rsp), %rbx
+	UNWIND_HINT_REGS offset=\offset extra=0
 	.endm
 
 	.macro RESTORE_C_REGS_HELPER rstor_rax=1, rstor_rcx=1, rstor_r11=1, rstor_r8910=1, rstor_rdx=1
@@ -167,6 +171,7 @@ For 32-bit we have the following conventions - kernel is built with
 	.endif
 	movq 13*8(%rsp), %rsi
 	movq 14*8(%rsp), %rdi
+	UNWIND_HINT_IRET_REGS offset=16*8
 	.endm
 	.macro RESTORE_C_REGS
 	RESTORE_C_REGS_HELPER 1,1,1,1,1
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index b56f7f2..aa58155 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -36,6 +36,7 @@
 #include <asm/smap.h>
 #include <asm/pgtable_types.h>
 #include <asm/export.h>
+#include <asm/frame.h>
 #include <linux/err.h>
 
 .code64
@@ -43,9 +44,10 @@
 
 #ifdef CONFIG_PARAVIRT
 ENTRY(native_usergs_sysret64)
+	UNWIND_HINT_EMPTY
 	swapgs
 	sysretq
-ENDPROC(native_usergs_sysret64)
+END(native_usergs_sysret64)
 #endif /* CONFIG_PARAVIRT */
 
 .macro TRACE_IRQS_IRETQ
@@ -134,6 +136,7 @@ ENDPROC(native_usergs_sysret64)
  */
 
 ENTRY(entry_SYSCALL_64)
+	UNWIND_HINT_EMPTY
 	/*
 	 * Interrupts are off on entry.
 	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
@@ -169,6 +172,7 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
 	pushq	%r10				/* pt_regs->r10 */
 	pushq	%r11				/* pt_regs->r11 */
 	sub	$(6*8), %rsp			/* pt_regs->bp, bx, r12-15 not saved */
+	UNWIND_HINT_REGS extra=0
 
 	/*
 	 * If we need to do entry work or if we guess we'll need to do
@@ -223,6 +227,7 @@ entry_SYSCALL_64_fastpath:
 	movq	EFLAGS(%rsp), %r11
 	RESTORE_C_REGS_EXCEPT_RCX_R11
 	movq	RSP(%rsp), %rsp
+	UNWIND_HINT_EMPTY
 	USERGS_SYSRET64
 
 1:
@@ -316,6 +321,7 @@ syscall_return_via_sysret:
 	/* rcx and r11 are already restored (see code above) */
 	RESTORE_C_REGS_EXCEPT_RCX_R11
 	movq	RSP(%rsp), %rsp
+	UNWIND_HINT_EMPTY
 	USERGS_SYSRET64
 
 opportunistic_sysret_failed:
@@ -343,6 +349,7 @@ ENTRY(stub_ptregs_64)
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
 	popq	%rax
+	UNWIND_HINT_REGS extra=0
 	jmp	entry_SYSCALL64_slow_path
 
 1:
@@ -351,6 +358,7 @@ END(stub_ptregs_64)
 
 .macro ptregs_stub func
 ENTRY(ptregs_\func)
+	UNWIND_HINT_FUNC
 	leaq	\func(%rip), %rax
 	jmp	stub_ptregs_64
 END(ptregs_\func)
@@ -367,6 +375,7 @@ END(ptregs_\func)
  * %rsi: next task
  */
 ENTRY(__switch_to_asm)
+	UNWIND_HINT_FUNC
 	/*
 	 * Save callee-saved registers
 	 * This must match the order in inactive_task_frame
@@ -406,6 +415,7 @@ END(__switch_to_asm)
  * r12: kernel thread arg
  */
 ENTRY(ret_from_fork)
+	UNWIND_HINT_EMPTY
 	movq	%rax, %rdi
 	call	schedule_tail			/* rdi: 'prev' task parameter */
 
@@ -413,6 +423,7 @@ ENTRY(ret_from_fork)
 	jnz	1f				/* kernel threads are uncommon */
 
 2:
+	UNWIND_HINT_REGS
 	movq	%rsp, %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
@@ -440,10 +451,11 @@ END(ret_from_fork)
 ENTRY(irq_entries_start)
     vector=FIRST_EXTERNAL_VECTOR
     .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
+	UNWIND_HINT_IRET_REGS
 	pushq	$(~vector+0x80)			/* Note: always in signed byte range */
-    vector=vector+1
 	jmp	common_interrupt
 	.align	8
+	vector=vector+1
     .endr
 END(irq_entries_start)
 
@@ -465,9 +477,14 @@ END(irq_entries_start)
  *
  * The invariant is that, if irq_count != -1, then the IRQ stack is in use.
  */
-.macro ENTER_IRQ_STACK old_rsp
+.macro ENTER_IRQ_STACK regs=1 old_rsp
 	DEBUG_ENTRY_ASSERT_IRQS_OFF
 	movq	%rsp, \old_rsp
+
+	.if \regs
+	UNWIND_HINT_REGS base=\old_rsp
+	.endif
+
 	incl	PER_CPU_VAR(irq_count)
 	jnz	.Lirq_stack_push_old_rsp_\@
 
@@ -504,16 +521,24 @@ END(irq_entries_start)
 
 .Lirq_stack_push_old_rsp_\@:
 	pushq	\old_rsp
+
+	.if \regs
+	UNWIND_HINT_REGS indirect=1
+	.endif
 .endm
 
 /*
  * Undoes ENTER_IRQ_STACK.
  */
-.macro LEAVE_IRQ_STACK
+.macro LEAVE_IRQ_STACK regs=1
 	DEBUG_ENTRY_ASSERT_IRQS_OFF
 	/* We need to be off the IRQ stack before decrementing irq_count. */
 	popq	%rsp
 
+	.if \regs
+	UNWIND_HINT_REGS
+	.endif
+
 	/*
 	 * As in ENTER_IRQ_STACK, irq_count == 0, we are still claiming
 	 * the irq stack but we're not on it.
@@ -624,6 +649,7 @@ restore_c_regs_and_iret:
 	INTERRUPT_RETURN
 
 ENTRY(native_iret)
+	UNWIND_HINT_IRET_REGS
 	/*
 	 * Are we returning to a stack segment from the LDT?  Note: in
 	 * 64-bit mode SS:RSP on the exception stack is always valid.
@@ -696,6 +722,7 @@ native_irq_return_ldt:
 	orq	PER_CPU_VAR(espfix_stack), %rax
 	SWAPGS
 	movq	%rax, %rsp
+	UNWIND_HINT_IRET_REGS offset=8
 
 	/*
 	 * At this point, we cannot write to the stack any more, but we can
@@ -717,6 +744,7 @@ END(common_interrupt)
  */
 .macro apicinterrupt3 num sym do_sym
 ENTRY(\sym)
+	UNWIND_HINT_IRET_REGS
 	ASM_CLAC
 	pushq	$~(\num)
 .Lcommon_\sym:
@@ -802,6 +830,8 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
 
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
+	UNWIND_HINT_IRET_REGS offset=8
+
 	/* Sanity check */
 	.if \shift_ist != -1 && \paranoid == 0
 	.error "using shift_ist requires paranoid=1"
@@ -825,6 +855,7 @@ ENTRY(\sym)
 	.else
 	call	error_entry
 	.endif
+	UNWIND_HINT_REGS
 	/* returned flag: ebx=0: need swapgs on exit, ebx=1: don't need it */
 
 	.if \paranoid
@@ -922,6 +953,7 @@ idtentry simd_coprocessor_error		do_simd_coprocessor_error	has_error_code=0
 	 * edi:  new selector
 	 */
 ENTRY(native_load_gs_index)
+	FRAME_BEGIN
 	pushfq
 	DISABLE_INTERRUPTS(CLBR_ANY & ~CLBR_RDI)
 	SWAPGS
@@ -930,8 +962,9 @@ ENTRY(native_load_gs_index)
 2:	ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
 	SWAPGS
 	popfq
+	FRAME_END
 	ret
-END(native_load_gs_index)
+ENDPROC(native_load_gs_index)
 EXPORT_SYMBOL(native_load_gs_index)
 
 	_ASM_EXTABLE(.Lgs_change, bad_gs)
@@ -954,12 +987,12 @@ bad_gs:
 ENTRY(do_softirq_own_stack)
 	pushq	%rbp
 	mov	%rsp, %rbp
-	ENTER_IRQ_STACK old_rsp=%r11
+	ENTER_IRQ_STACK regs=0 old_rsp=%r11
 	call	__do_softirq
-	LEAVE_IRQ_STACK
+	LEAVE_IRQ_STACK regs=0
 	leaveq
 	ret
-END(do_softirq_own_stack)
+ENDPROC(do_softirq_own_stack)
 
 #ifdef CONFIG_XEN
 idtentry xen_hypervisor_callback xen_do_hypervisor_callback has_error_code=0
@@ -983,7 +1016,9 @@ ENTRY(xen_do_hypervisor_callback)		/* do_hypervisor_callback(struct *pt_regs) */
  * Since we don't modify %rdi, evtchn_do_upall(struct *pt_regs) will
  * see the correct pointer to the pt_regs
  */
+	UNWIND_HINT_FUNC
 	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
+	UNWIND_HINT_REGS
 
 	ENTER_IRQ_STACK old_rsp=%r10
 	call	xen_evtchn_do_upcall
@@ -1009,6 +1044,7 @@ END(xen_do_hypervisor_callback)
  * with its current contents: any discrepancy means we in category 1.
  */
 ENTRY(xen_failsafe_callback)
+	UNWIND_HINT_EMPTY
 	movl	%ds, %ecx
 	cmpw	%cx, 0x10(%rsp)
 	jne	1f
@@ -1028,11 +1064,13 @@ ENTRY(xen_failsafe_callback)
 	pushq	$0				/* RIP */
 	pushq	%r11
 	pushq	%rcx
+	UNWIND_HINT_IRET_REGS offset=8
 	jmp	general_protection
 1:	/* Segment mismatch => Category 1 (Bad segment). Retry the IRET. */
 	movq	(%rsp), %rcx
 	movq	8(%rsp), %r11
 	addq	$0x30, %rsp
+	UNWIND_HINT_IRET_REGS
 	pushq	$-1 /* orig_ax = -1 => not a system call */
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
@@ -1078,6 +1116,7 @@ idtentry machine_check					has_error_code=0	paranoid=1 do_sym=*machine_check_vec
  * Return: ebx=0: need swapgs on exit, ebx=1: otherwise
  */
 ENTRY(paranoid_entry)
+	UNWIND_HINT_FUNC
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
@@ -1105,6 +1144,7 @@ END(paranoid_entry)
  * On entry, ebx is "no swapgs" flag (1: don't need swapgs, 0: need it)
  */
 ENTRY(paranoid_exit)
+	UNWIND_HINT_REGS
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF_DEBUG
 	testl	%ebx, %ebx			/* swapgs needed? */
@@ -1126,6 +1166,7 @@ END(paranoid_exit)
  * Return: EBX=0: came from user mode; EBX=1: otherwise
  */
 ENTRY(error_entry)
+	UNWIND_HINT_FUNC
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
@@ -1210,6 +1251,7 @@ END(error_entry)
  *   0: user gsbase is loaded, we need SWAPGS and standard preparation for return to usermode
  */
 ENTRY(error_exit)
+	UNWIND_HINT_REGS
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
 	testl	%ebx, %ebx
@@ -1219,6 +1261,7 @@ END(error_exit)
 
 /* Runs on exception stack */
 ENTRY(nmi)
+	UNWIND_HINT_IRET_REGS
 	/*
 	 * Fix up the exception frame if we're on Xen.
 	 * PARAVIRT_ADJUST_EXCEPTION_FRAME is guaranteed to push at most
@@ -1290,11 +1333,13 @@ ENTRY(nmi)
 	cld
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	UNWIND_HINT_IRET_REGS base=%rdx offset=8
 	pushq	5*8(%rdx)	/* pt_regs->ss */
 	pushq	4*8(%rdx)	/* pt_regs->rsp */
 	pushq	3*8(%rdx)	/* pt_regs->flags */
 	pushq	2*8(%rdx)	/* pt_regs->cs */
 	pushq	1*8(%rdx)	/* pt_regs->rip */
+	UNWIND_HINT_IRET_REGS
 	pushq   $-1		/* pt_regs->orig_ax */
 	pushq   %rdi		/* pt_regs->di */
 	pushq   %rsi		/* pt_regs->si */
@@ -1311,6 +1356,7 @@ ENTRY(nmi)
 	pushq	%r13		/* pt_regs->r13 */
 	pushq	%r14		/* pt_regs->r14 */
 	pushq	%r15		/* pt_regs->r15 */
+	UNWIND_HINT_REGS
 	ENCODE_FRAME_POINTER
 
 	/*
@@ -1465,6 +1511,7 @@ first_nmi:
 	.rept 5
 	pushq	11*8(%rsp)
 	.endr
+	UNWIND_HINT_IRET_REGS
 
 	/* Everything up to here is safe from nested NMIs */
 
@@ -1480,6 +1527,7 @@ first_nmi:
 	pushq	$__KERNEL_CS	/* CS */
 	pushq	$1f		/* RIP */
 	INTERRUPT_RETURN	/* continues at repeat_nmi below */
+	UNWIND_HINT_IRET_REGS
 1:
 #endif
 
@@ -1529,6 +1577,7 @@ end_repeat_nmi:
 	 * exceptions might do.
 	 */
 	call	paranoid_entry
+	UNWIND_HINT_REGS
 
 	/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
 	movq	%rsp, %rdi
@@ -1566,17 +1615,19 @@ nmi_restore:
 END(nmi)
 
 ENTRY(ignore_sysret)
+	UNWIND_HINT_EMPTY
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
 
 ENTRY(rewind_stack_do_exit)
+	UNWIND_HINT_FUNC
 	/* Prevent any naive code from trying to unwind to our caller. */
 	xorl	%ebp, %ebp
 
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rax
-	leaq	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%rax), %rsp
+	leaq	-PTREGS_SIZE(%rax), %rsp
+	UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
 
 	call	do_exit
-1:	jmp 1b
 END(rewind_stack_do_exit)


* [tip:x86/asm] x86/asm: Add unwind hint annotations to sync_core()
  2017-07-11 15:33 ` [PATCH v3 08/10] x86/asm: add unwind hint annotations to sync_core() Josh Poimboeuf
@ 2017-07-18 10:43   ` tip-bot for Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Josh Poimboeuf @ 2017-07-18 10:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: brgerst, torvalds, jpoimboe, efault, tglx, dvlasenk, bp, mingo,
	linux-kernel, luto, hpa, jslaby, peterz

Commit-ID:  76846bf3cb09e98881cb4908385a0e899716b01f
Gitweb:     http://git.kernel.org/tip/76846bf3cb09e98881cb4908385a0e899716b01f
Author:     Josh Poimboeuf <jpoimboe@redhat.com>
AuthorDate: Tue, 11 Jul 2017 10:33:45 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 18 Jul 2017 10:57:44 +0200

x86/asm: Add unwind hint annotations to sync_core()

This enables objtool to grok the iret in the middle of a C function.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/b057be26193c11d2ed3337b2107bc7adcba42c99.1499786555.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/processor.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 6a79547..b27dc9b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -22,6 +22,7 @@ struct vm86;
 #include <asm/nops.h>
 #include <asm/special_insns.h>
 #include <asm/fpu/types.h>
+#include <asm/unwind_hints.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -684,6 +685,7 @@ static inline void sync_core(void)
 	unsigned int tmp;
 
 	asm volatile (
+		UNWIND_HINT_SAVE
 		"mov %%ss, %0\n\t"
 		"pushq %q0\n\t"
 		"pushq %%rsp\n\t"
@@ -693,6 +695,7 @@ static inline void sync_core(void)
 		"pushq %q0\n\t"
 		"pushq $1f\n\t"
 		"iretq\n\t"
+		UNWIND_HINT_RESTORE
 		"1:"
 		: "=&r" (tmp), "+r" (__sp) : : "cc", "memory");
 #endif


* Re: [PATCH v3.1 09/10] x86/unwind: add ORC unwinder
  2017-07-14 17:22   ` [PATCH v3.1 " Josh Poimboeuf
@ 2017-07-20  7:12     ` Jiri Slaby
  2017-07-20 21:16       ` Josh Poimboeuf
  0 siblings, 1 reply; 60+ messages in thread
From: Jiri Slaby @ 2017-07-20  7:12 UTC (permalink / raw)
  To: Josh Poimboeuf, x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Mike Galbraith

On 07/14/2017, 07:22 PM, Josh Poimboeuf wrote:
> +void __unwind_start(struct unwind_state *state, struct task_struct *task,
> +		    struct pt_regs *regs, unsigned long *first_frame)
> +{
> +	memset(state, 0, sizeof(*state));
> +	state->task = task;
> +
> +	/*
> +	 * Refuse to unwind the stack of a task while it's executing on another
> +	 * CPU.  This check is racy, but that's ok: the unwinder has other
> +	 * checks to prevent it from going off the rails.
> +	 */
> +	if (task_on_another_cpu(task))
> +		goto done;
> +
> +	if (regs) {
> +		if (user_mode(regs))
> +			goto done;
> +
> +		state->ip = regs->ip;
> +		state->sp = kernel_stack_pointer(regs);
> +		state->bp = regs->bp;
> +		state->regs = regs;
> +		state->full_regs = true;
> +		state->signal = true;
> +
> +	} else if (task == current) {
> +		asm volatile("lea (%%rip), %0\n\t"
> +			     "mov %%rsp, %1\n\t"
> +			     "mov %%rbp, %2\n\t"
> +			     : "=r" (state->ip), "=r" (state->sp),
> +			       "=r" (state->bp));
> +
> +	} else {
> +		struct inactive_task_frame *frame = (void *)task->thread.sp;
> +
> +		state->ip = frame->ret_addr;
> +		state->sp = task->thread.sp;
> +		state->bp = frame->bp;

I wonder if the reads from 'frame' should have READ_ONCE_NOCHECK for
the same reason as in:
commit 84936118bdf37bda513d4a361c38181a216427e0
Author: Josh Poimboeuf <jpoimboe@redhat.com>
Date:   Mon Jan 9 12:00:23 2017 -0600

    x86/unwind: Disable KASAN checks for non-current tasks
?


thanks,
-- 
js
suse labs


* Re: [PATCH v3.1 09/10] x86/unwind: add ORC unwinder
  2017-07-20  7:12     ` Jiri Slaby
@ 2017-07-20 21:16       ` Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-20 21:16 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

On Thu, Jul 20, 2017 at 09:12:16AM +0200, Jiri Slaby wrote:
> On 07/14/2017, 07:22 PM, Josh Poimboeuf wrote:
> > +void __unwind_start(struct unwind_state *state, struct task_struct *task,
> > +		    struct pt_regs *regs, unsigned long *first_frame)
> > +{
> > +	memset(state, 0, sizeof(*state));
> > +	state->task = task;
> > +
> > +	/*
> > +	 * Refuse to unwind the stack of a task while it's executing on another
> > +	 * CPU.  This check is racy, but that's ok: the unwinder has other
> > +	 * checks to prevent it from going off the rails.
> > +	 */
> > +	if (task_on_another_cpu(task))
> > +		goto done;
> > +
> > +	if (regs) {
> > +		if (user_mode(regs))
> > +			goto done;
> > +
> > +		state->ip = regs->ip;
> > +		state->sp = kernel_stack_pointer(regs);
> > +		state->bp = regs->bp;
> > +		state->regs = regs;
> > +		state->full_regs = true;
> > +		state->signal = true;
> > +
> > +	} else if (task == current) {
> > +		asm volatile("lea (%%rip), %0\n\t"
> > +			     "mov %%rsp, %1\n\t"
> > +			     "mov %%rbp, %2\n\t"
> > +			     : "=r" (state->ip), "=r" (state->sp),
> > +			       "=r" (state->bp));
> > +
> > +	} else {
> > +		struct inactive_task_frame *frame = (void *)task->thread.sp;
> > +
> > +		state->ip = frame->ret_addr;
> > +		state->sp = task->thread.sp;
> > +		state->bp = frame->bp;
> 
> I wonder if the reads from 'frame' should have READ_ONCE_NOCHECK for
> the same reason as in:
> commit 84936118bdf37bda513d4a361c38181a216427e0
> Author: Josh Poimboeuf <jpoimboe@redhat.com>
> Date:   Mon Jan 9 12:00:23 2017 -0600
> 
>     x86/unwind: Disable KASAN checks for non-current tasks
> ?

Yeah, maybe so.  Since the task_on_another_cpu() check above is racy,
here it's remotely possible that the task has since started executing
and has poisoned the stack memory we're about to read.

I don't know how realistic that scenario is, but it wouldn't hurt to add
a couple of READ_ONCE_NOCHECKs here for the 'frame' dereferences.
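
Something like this, against the hunk quoted above (untested):

	} else {
		struct inactive_task_frame *frame = (void *)task->thread.sp;

		state->sp = task->thread.sp;
		/* the task may be concurrently poisoning its own stack */
		state->ip = READ_ONCE_NOCHECK(frame->ret_addr);
		state->bp = READ_ONCE_NOCHECK(frame->bp);
	}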

-- 
Josh


* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-14 17:17       ` Josh Poimboeuf
@ 2017-07-25  9:09         ` Ingo Molnar
  2017-07-25 17:58           ` Josh Poimboeuf
  0 siblings, 1 reply; 60+ messages in thread
From: Ingo Molnar @ 2017-07-25  9:09 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Wed, Jul 12, 2017 at 09:27:50PM +0200, Ingo Molnar wrote:
> > Maybe we could offer a menu of unwinders - i.e. make the whole Kconfig interface a 
> > bit nicer:
> > 
> >   CONFIG_UNWINDER_FRAME_POINTER
> >   CONFIG_UNWINDER_ORC
> >   CONFIG_UNWINDER_GUESS
> > 
> > ... or so?
> 
> So far I haven't been able to figure out how to make the above three
> options into a multiple choice selection, such that allnoconfig selects
> CONFIG_UNWINDER_GUESS and alldefconfig selects
> CONFIG_UNWINDER_FRAME_POINTER.

I don't think that's a problem: the scheduler preemption model Kconfig setup has 
similar behavior - allyesconfig does not enable CONFIG_PREEMPT=y.

The new x86 default will eventually be the Orc unwinder, but not initially.

> > I wouldn't mind making CONFIG_UNWINDER_ORC the new default either, due to the 
> > non-trivial speedup it offers - but maybe folks would object?
> 
> Personally I wouldn't have an objection to making ORC the default, though we 
> should probably wait to give it some burn-in time first.

Sure, that's what testing is for.

> If we *do* decide to eventually make it the default, we could flip the switch at 
> the same time we introduce the multiple-choice config and rename above.  That 
> way, users of "make oldconfig" would see the change and would be encouraged to 
> switch to ORC.

I disagree, as the current Kconfig layout actively hinders the 'more testing' 
part: you can only enable ORC if you know how to do it, and 99% of our testers 
won't bother. In practice that's testing coverage that is close to not testing 
it at all ...

> > > > CONFIG_FRAME_POINTERS et al would be left for architectures where it has a meaning 
> > > > beyond backtrace generation. (Not sure whether there's any such architectures.)
> > > 
> > > Well, on x86, hardened usercopy relies on frame pointers, but not the
> > > unwinder.  It does the frame pointer walk manually to avoid the full
> > > unwinder overhead.  See arch_within_stack_frames().

BTW., I think this aspect of the hardened user-copy is crazy stuff - there can be 
many stack frames, and this adds a serious amount of overhead even with frame 
pointers...

I think the current behavior is fine: if frame pointers are disabled then 
arch_within_stack_frames() returns NOT_STACK. Maybe it could do a few sanity 
checks: we do know the kernel stack range and we could check alignment as well.
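
[ Editor's note: a hypothetical sketch of such checks, reusing the
  existing BAD_STACK/NOT_STACK return convention; the stack-range part
  duplicates what check_stack_object() already does, as Josh notes
  downthread. ]

	static inline int arch_within_stack_frames(const void * const stack,
						   const void * const stackend,
						   const void *obj, unsigned long len)
	{
		/* Reject objects that straddle or exceed the stack range. */
		if (obj < stack || obj + len > stackend)
			return BAD_STACK;

		/*
		 * Further heuristics (e.g. alignment) could go here; without
		 * frame pointers there is no frame-accurate answer, so stay
		 * conservative.
		 */
		return NOT_STACK;
	}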

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC] perf: Delayed userspace unwind (Was: [PATCH v3 00/10] x86: ORC unwinder)
  2017-07-13  9:19           ` Ingo Molnar
  2017-07-13 12:17             ` Josh Poimboeuf
@ 2017-07-25 11:55             ` Peter Zijlstra
  2017-07-28 14:13               ` Jiri Olsa
  2017-07-29  3:35               ` Andy Lutomirski
  1 sibling, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-07-25 11:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Josh Poimboeuf, Andres Freund, x86, linux-kernel, live-patching,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, H. Peter Anvin,
	Mike Galbraith, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin

On Thu, Jul 13, 2017 at 11:19:11AM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > One gloriously ugly hack would be to delay the userspace unwind to 
> > > return-to-userspace, at which point we have a schedulable context and can take 
> > > faults.
> 
> I don't think it's ugly, and it has various advantages:
> 
> > > Of course, then you have to somehow identify this later unwind sample with all 
> > > relevant prior samples and stitch the whole thing back together, but that 
> > > should be doable.
> > > 
> > > In fact, it would not be at all hard to do, just queue a task_work from the 
> > > NMI and have that do the EH based unwind.
> 
> This would have a couple of advantages:
> 
>  - as you mention, being able to fault in debug info and generally do 
>    IO/scheduling,
> 
>  - profiling overhead would be accounted to the task context that generates it,
>    not the NMI context,
> 
>  - there would be a natural batching/coalescing optimization if multiple events
>    hit the same system call: the user-space backtrace would only have to be looked 
>    up once for all samples that got collected.
> 
> This could be done by separating the user-space backtrace into a separate event, 
> and perf tooling would then apply the same user-space backtrace to all prior 
> kernel samples.
> 
> I.e. the ring-buffer would have trace entries like:
> 
>  [ kernel sample #1, with kernel backtrace #1 ]
>  [ kernel sample #2, with kernel backtrace #2 ]
>  [ kernel sample #3, with kernel backtrace #3 ]
>  [ user-space backtrace #1 at syscall return ]
>  ...
> 
> Note how the three kernel samples didn't have to do any user-space unwinding at 
> all, so the user-space unwinding overhead got reduced by a factor of 3.
> 
> Tooling would know that 'user-space backtrace #1' applies to the previous three 
> kernel samples.
> 
> Or so?

Find a compile-tested patch below; someone still needs to teach the
userspace tooling about it, though.  Not sure I can still make sense of
that code.

---
 include/linux/perf_event.h      |  1 +
 include/uapi/linux/perf_event.h | 14 ++++++-
 kernel/events/callchain.c       | 86 ++++++++++++++++++++++++++++++++++++++---
 kernel/events/core.c            | 18 +++------
 4 files changed, 100 insertions(+), 19 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a3b873fc59e4..241251533e39 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -682,6 +682,7 @@ struct perf_event {
 	int				pending_disable;
 	struct irq_work			pending;
 
+	struct callback_head		pending_callchain;
 	atomic_t			event_limit;
 
 	/* address range filters */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 642db5fa3286..342def57ef34 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -368,7 +368,8 @@ struct perf_event_attr {
 				context_switch :  1, /* context switch data */
 				write_backward :  1, /* Write ring buffer from end to beginning */
 				namespaces     :  1, /* include namespaces data */
-				__reserved_1   : 35;
+				delayed_user_callchain   : 1, /* ... */
+				__reserved_1   : 34;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -915,6 +916,17 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_NAMESPACES			= 16,
 
+	/*
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	{ u64			nr,
+	 *	  u64			ips[nr];  } && PERF_SAMPLE_CALLCHAIN
+	 *	struct sample_id		sample_id;
+	 * };
+	 *
+	 */
+	PERF_RECORD_CALLCHAIN			= 17,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1b2be63c8528..c98a12f3592c 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -12,6 +12,7 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <linux/sched/task_stack.h>
+#include <linux/task_work.h>
 
 #include "internal.h"
 
@@ -178,19 +179,94 @@ put_callchain_entry(int rctx)
 	put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
 }
 
+static struct perf_callchain_entry __empty = { .nr = 0, };
+
+static void perf_callchain_work(struct callback_head *work)
+{
+	struct perf_event *event = container_of(work, struct perf_event, pending_callchain);
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	size_t size;
+	int ret;
+
+	struct {
+		struct perf_event_header	header;
+	} callchain_event = {
+		.header = {
+			.type = PERF_RECORD_CALLCHAIN,
+			.misc = 0,
+			.size = sizeof(callchain_event),
+		},
+	};
+
+	perf_event_header__init_id(&callchain_event.header, &sample, event);
+
+	sample.callchain = get_perf_callchain(task_pt_regs(current),
+					      /* init_nr   */ 0,
+					      /* kernel    */ false,
+					      /* user      */ true,
+					      event->attr.sample_max_stack,
+					      /* crosstask */ false,
+					      /* add_mark  */ true);
+
+	if (!sample.callchain)
+		sample.callchain = &__empty;
+
+	size = sizeof(u64) * (1 + sample.callchain->nr);
+	callchain_event.header.size += size;
+
+	ret = perf_output_begin(&handle, event, callchain_event.header.size);
+	if (ret)
+		return;
+
+	perf_output_put(&handle, callchain_event);
+	__output_copy(&handle, sample.callchain, size);
+	perf_event__output_id_sample(event, &handle, &sample);
+	perf_output_end(&handle);
+
+	barrier();
+	work->func = NULL; /* done */
+}
+
 struct perf_callchain_entry *
 perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
-	bool kernel = !event->attr.exclude_callchain_kernel;
-	bool user   = !event->attr.exclude_callchain_user;
+	bool kernel  = !event->attr.exclude_callchain_kernel;
+	bool user    = !event->attr.exclude_callchain_user;
+	bool delayed = event->attr.delayed_user_callchain;
+
 	/* Disallow cross-task user callchains. */
 	bool crosstask = event->ctx->task && event->ctx->task != current;
 	const u32 max_stack = event->attr.sample_max_stack;
 
-	if (!kernel && !user)
-		return NULL;
+	struct perf_callchain_entry *callchain = NULL;
+
+	if (user && delayed && !crosstask) {
+		struct callback_head *work = &event->pending_callchain;
+
+		if (!work->func) {
+			work->func = perf_callchain_work;
+			/*
+			 * We cannot do set_notify_resume() from NMI context,
+			 * also, knowing we are already in an interrupted
+			 * context and will pass return to userspace, we can
+			 * simply set TIF_NOTIFY_RESUME.
+			 */
+			task_work_add(current, work, false);
+			set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+		}
+		user = false;
+	}
+
+	if (kernel || user) {
+		callchain = get_perf_callchain(regs, 0, kernel, user,
+				               max_stack, crosstask, true);
+	}
+
+	if (!callchain)
+		callchain = &__empty;
 
-	return get_perf_callchain(regs, 0, kernel, user, max_stack, crosstask, true);
+	return callchain;
 }
 
 struct perf_callchain_entry *
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 426c2ffba16d..26aed7bfbb6a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5857,19 +5857,12 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_read(handle, event);
 
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
-		if (data->callchain) {
-			int size = 1;
-
-			if (data->callchain)
-				size += data->callchain->nr;
+		int size = 1;
 
-			size *= sizeof(u64);
+		size += data->callchain->nr;
+		size *= sizeof(u64);
 
-			__output_copy(handle, data->callchain, size);
-		} else {
-			u64 nr = 0;
-			perf_output_put(handle, nr);
-		}
+		__output_copy(handle, data->callchain, size);
 	}
 
 	if (sample_type & PERF_SAMPLE_RAW) {
@@ -6010,8 +6003,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 		data->callchain = perf_callchain(event, regs);
 
-		if (data->callchain)
-			size += data->callchain->nr;
+		size += data->callchain->nr;
 
 		header->size += size * sizeof(u64);
 	}

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-25  9:09         ` Ingo Molnar
@ 2017-07-25 17:58           ` Josh Poimboeuf
  2017-07-25 18:46             ` Kees Cook
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2017-07-25 17:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith, Kees Cook

[ Adding Kees to CC for the hardened usercopy discussion. ]

Kees, FYI: frame pointers may be disabled by default on x86 relatively
soon (presumably weeks or months) in favor of the ORC unwinder.  So the
hardened usercopy stack walk will no longer work as advertised.

Using the ORC unwinder for hardened usercopy would probably be pretty
bad performance-wise.  I'm not sure what else could be done.  Ingo did
have a few ideas for sanity checks:

On Tue, Jul 25, 2017 at 11:09:44AM +0200, Ingo Molnar wrote:
> > > > Well, on x86, hardened usercopy relies on frame pointers, but not the
> > > > unwinder.  It does the frame pointer walk manually to avoid the full
> > > > unwinder overhead.  See arch_within_stack_frames().
> 
> BTW., I think this aspect of the hardened user-copy is crazy stuff - there can be 
> many stack frames, and this adds a serious amount of overhead even with frame 
> pointers...
> 
> I think the current behavior is fine: if frame pointers are disabled then 
> arch_within_stack_frames() returns NOT_STACK. Maybe it could do a few sanity 
> checks: we do know the kernel stack range and we could check alignment as well.

I believe it checks the kernel stack range already in
check_stack_object() before deciding whether to call
arch_within_stack_frames().  It also has an overlapping stack check.
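
[ Editor's note: for reference, a simplified paraphrase of those
  existing checks in mm/usercopy.c; see the real source for the exact
  code. ]

	static noinline int check_stack_object(const void *obj, unsigned long len)
	{
		const void * const stack = task_stack_page(current);
		const void * const stackend = stack + THREAD_SIZE;
		int ret;

		/* Object is not on the stack at all. */
		if (obj + len <= stack || stackend <= obj)
			return NOT_STACK;

		/* Object partially overlaps the stack boundary: reject. */
		if (obj < stack || stackend < obj + len)
			return BAD_STACK;

		/* Frame-accurate check, where the architecture supports one. */
		ret = arch_within_stack_frames(stack, stackend, obj, len);
		if (ret)
			return ret;

		return GOOD_STACK;
	}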

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
  2017-07-25 17:58           ` Josh Poimboeuf
@ 2017-07-25 18:46             ` Kees Cook
  0 siblings, 0 replies; 60+ messages in thread
From: Kees Cook @ 2017-07-25 18:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Ingo Molnar, x86, LKML, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

On Tue, Jul 25, 2017 at 10:58 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> [ Adding Kees to CC for the hardened usercopy discussion. ]
>
> Kees, FYI: frame pointers may be disabled by default on x86 relatively
> soon (presumably weeks or months) in favor of the ORC unwinder.  So the
> hardened usercopy stack walk will no longer work as advertised.
>
> Using the ORC unwinder for hardened usercopy would probably be pretty
> bad performance-wise.  I'm not sure what else could be done.  Ingo did
> have a few ideas for sanity checks:
>
> On Tue, Jul 25, 2017 at 11:09:44AM +0200, Ingo Molnar wrote:
>> > > > Well, on x86, hardened usercopy relies on frame pointers, but not the
>> > > > unwinder.  It does the frame pointer walk manually to avoid the full
>> > > > unwinder overhead.  See arch_within_stack_frames().
>>
>> BTW., I think this aspect of the hardened user-copy is crazy stuff - there can be
>> many stack frames, and this adds a serious amount of overhead even with frame
>> pointers...
>>
>> I think the current behavior is fine: if frame pointers are disabled then
>> arch_within_stack_frames() returns NOT_STACK. Maybe it could do a few sanity
>> checks: we do know the kernel stack range and we could check alignment as well.
>
> I believe it checks the kernel stack range already in
> check_stack_object() before deciding whether to call
> arch_within_stack_frames().  It also has an overlapping stack check.

Right, pointers starting in the stack are already checked to not go
beyond the stack.

As for dropping inter-frame overflow checking: while I'd prefer to
keep it, its benefit in my mind is already pretty minimal, since it
doesn't protect/exclude the stack canary anyway. And since this is a
check for a linear overflow (i.e. a contiguous access), we're mostly
protected by the existing stack canary for writes. For reads, we do
risk allowing return addresses to get exposed, though without the
frame pointer, we've got even less to expose in the first place.

So, mainly, I'm fine with this. I'm slightly sad, but it's not a huge
loss. The main benefit of usercopy hardening is the slab cache object
size protections...

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC] perf: Delayed userspace unwind (Was: [PATCH v3 00/10] x86: ORC unwinder)
  2017-07-25 11:55             ` [RFC] perf: Delayed userspace unwind (Was: [PATCH v3 00/10] x86: ORC unwinder) Peter Zijlstra
@ 2017-07-28 14:13               ` Jiri Olsa
  2017-07-28 14:21                 ` Peter Zijlstra
  2017-07-29  3:35               ` Andy Lutomirski
  1 sibling, 1 reply; 60+ messages in thread
From: Jiri Olsa @ 2017-07-28 14:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Josh Poimboeuf, Andres Freund, x86, linux-kernel,
	live-patching, Linus Torvalds, Andy Lutomirski, Jiri Slaby,
	H. Peter Anvin, Mike Galbraith, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin

On Tue, Jul 25, 2017 at 01:55:12PM +0200, Peter Zijlstra wrote:

SNIP

>  
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 1b2be63c8528..c98a12f3592c 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -12,6 +12,7 @@
>  #include <linux/perf_event.h>
>  #include <linux/slab.h>
>  #include <linux/sched/task_stack.h>
> +#include <linux/task_work.h>
>  
>  #include "internal.h"
>  
> @@ -178,19 +179,94 @@ put_callchain_entry(int rctx)
>  	put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
>  }
>  
> +static struct perf_callchain_entry __empty = { .nr = 0, };
> +
> +static void perf_callchain_work(struct callback_head *work)
> +{
> +	struct perf_event *event = container_of(work, struct perf_event, pending_callchain);
> +	struct perf_output_handle handle;
> +	struct perf_sample_data sample;
> +	size_t size;
> +	int ret;
> +
> +	struct {
> +		struct perf_event_header	header;
> +	} callchain_event = {
> +		.header = {
> +			.type = PERF_RECORD_CALLCHAIN,
> +			.misc = 0,
> +			.size = sizeof(callchain_event),
> +		},
> +	};

how about we make this generic for all user space sample_type?

I think we could certainly use it for PERF_SAMPLE_STACK_USER,
maybe PERF_SAMPLE_REGS_USER would also help.. just a little ;-)

I'll check on that

jirka

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC] perf: Delayed userspace unwind (Was: [PATCH v3 00/10] x86: ORC unwinder)
  2017-07-28 14:13               ` Jiri Olsa
@ 2017-07-28 14:21                 ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-07-28 14:21 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, Josh Poimboeuf, Andres Freund, x86, linux-kernel,
	live-patching, Linus Torvalds, Andy Lutomirski, Jiri Slaby,
	H. Peter Anvin, Mike Galbraith, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin

On Fri, Jul 28, 2017 at 04:13:25PM +0200, Jiri Olsa wrote:
> On Tue, Jul 25, 2017 at 01:55:12PM +0200, Peter Zijlstra wrote:
> 
> SNIP
> 
> >  
> > diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> > index 1b2be63c8528..c98a12f3592c 100644
> > --- a/kernel/events/callchain.c
> > +++ b/kernel/events/callchain.c
> > @@ -12,6 +12,7 @@
> >  #include <linux/perf_event.h>
> >  #include <linux/slab.h>
> >  #include <linux/sched/task_stack.h>
> > +#include <linux/task_work.h>
> >  
> >  #include "internal.h"
> >  
> > @@ -178,19 +179,94 @@ put_callchain_entry(int rctx)
> >  	put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
> >  }
> >  
> > +static struct perf_callchain_entry __empty = { .nr = 0, };
> > +
> > +static void perf_callchain_work(struct callback_head *work)
> > +{
> > +	struct perf_event *event = container_of(work, struct perf_event, pending_callchain);
> > +	struct perf_output_handle handle;
> > +	struct perf_sample_data sample;
> > +	size_t size;
> > +	int ret;
> > +
> > +	struct {
> > +		struct perf_event_header	header;
> > +	} callchain_event = {
> > +		.header = {
> > +			.type = PERF_RECORD_CALLCHAIN,
> > +			.misc = 0,
> > +			.size = sizeof(callchain_event),
> > +		},
> > +	};
> 
> how about we make this generic for all user space sample_type?
> 
> I think we could certainly use it for PERF_SAMPLE_STACK_USER,
> maybe PERF_SAMPLE_REGS_USER would also help.. just a little ;-)
> 
> I'll check on that

Right, so I put all the magic in the callchain code itself, which would
make that a wee bit harder. But yes, putting all the USER stuff in there
makes sense.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC] perf: Delayed userspace unwind (Was: [PATCH v3 00/10] x86: ORC unwinder)
  2017-07-25 11:55             ` [RFC] perf: Delayed userspace unwind (Was: [PATCH v3 00/10] x86: ORC unwinder) Peter Zijlstra
  2017-07-28 14:13               ` Jiri Olsa
@ 2017-07-29  3:35               ` Andy Lutomirski
  2017-07-29  9:28                 ` Peter Zijlstra
  1 sibling, 1 reply; 60+ messages in thread
From: Andy Lutomirski @ 2017-07-29  3:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Josh Poimboeuf, Andres Freund, X86 ML, linux-kernel,
	live-patching, Linus Torvalds, Andy Lutomirski, Jiri Slaby,
	H. Peter Anvin, Mike Galbraith, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin

> On Jul 25, 2017, at 7:55 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Thu, Jul 13, 2017 at 11:19:11AM +0200, Ingo Molnar wrote:
>>
>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>>> One gloriously ugly hack would be to delay the userspace unwind to
>>>> return-to-userspace, at which point we have a schedulable context and can take
>>>> faults.
>>
>> I don't think it's ugly, and it has various advantages:
>>
>>>> Of course, then you have to somehow identify this later unwind sample with all
>>>> relevant prior samples and stitch the whole thing back together, but that
>>>> should be doable.
>>>>
>>>> In fact, it would not be at all hard to do, just queue a task_work from the
>>>> NMI and have that do the EH based unwind.
>>

I haven't checked task_work specifically, but a bunch of the exit work
is permitted to sleep, which is potentially useful.

If this becomes successful enough that we could eventually deprecate
the old code, I wonder if copy_from_user_nmi() could go away? :)

>
> ---
> include/linux/perf_event.h      |  1 +
> include/uapi/linux/perf_event.h | 14 ++++++-
> kernel/events/callchain.c       | 86 ++++++++++++++++++++++++++++++++++++++---
> kernel/events/core.c            | 18 +++------
> 4 files changed, 100 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index a3b873fc59e4..241251533e39 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -682,6 +682,7 @@ struct perf_event {
>    int                pending_disable;
>    struct irq_work            pending;
>
> +    struct callback_head        pending_callchain;
>    atomic_t            event_limit;
>
>    /* address range filters */
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 642db5fa3286..342def57ef34 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -368,7 +368,8 @@ struct perf_event_attr {
>                context_switch :  1, /* context switch data */
>                write_backward :  1, /* Write ring buffer from end to beginning */
>                namespaces     :  1, /* include namespaces data */
> -                __reserved_1   : 35;
> +                delayed_user_callchain   : 1, /* ... */
> +                __reserved_1   : 34;
>
>    union {
>        __u32        wakeup_events;      /* wakeup every n events */
> @@ -915,6 +916,17 @@ enum perf_event_type {
>     */
>    PERF_RECORD_NAMESPACES            = 16,
>
> +    /*
> +     * struct {
> +     *    struct perf_event_header    header;
> +     *    { u64            nr,
> +     *      u64            ips[nr];  } && PERF_SAMPLE_CALLCHAIN
> +     *    struct sample_id        sample_id;
> +     * };
> +     *
> +     */
> +    PERF_RECORD_CALLCHAIN            = 17,
> +
>    PERF_RECORD_MAX,            /* non-ABI */
> };
>
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 1b2be63c8528..c98a12f3592c 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -12,6 +12,7 @@
> #include <linux/perf_event.h>
> #include <linux/slab.h>
> #include <linux/sched/task_stack.h>
> +#include <linux/task_work.h>
>
> #include "internal.h"
>
> @@ -178,19 +179,94 @@ put_callchain_entry(int rctx)
>    put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
> }
>
> +static struct perf_callchain_entry __empty = { .nr = 0, };
> +
> +static void perf_callchain_work(struct callback_head *work)
> +{
> +    struct perf_event *event = container_of(work, struct perf_event, pending_callchain);
> +    struct perf_output_handle handle;
> +    struct perf_sample_data sample;
> +    size_t size;
> +    int ret;
> +
> +    struct {
> +        struct perf_event_header    header;
> +    } callchain_event = {
> +        .header = {
> +            .type = PERF_RECORD_CALLCHAIN,
> +            .misc = 0,
> +            .size = sizeof(callchain_event),
> +        },
> +    };
> +
> +    perf_event_header__init_id(&callchain_event.header, &sample, event);
> +
> +    sample.callchain = get_perf_callchain(task_pt_regs(current),
> +                          /* init_nr   */ 0,
> +                          /* kernel    */ false,
> +                          /* user      */ true,
> +                          event->attr.sample_max_stack,
> +                          /* crosstask */ false,
> +                          /* add_mark  */ true);
> +
> +    if (!sample.callchain)
> +        sample.callchain = &__empty;
> +
> +    size = sizeof(u64) * (1 + sample.callchain->nr);
> +    callchain_event.header.size += size;
> +
> +    ret = perf_output_begin(&handle, event, callchain_event.header.size);
> +    if (ret)
> +        return;
> +
> +    perf_output_put(&handle, callchain_event);
> +    __output_copy(&handle, sample.callchain, size);
> +    perf_event__output_id_sample(event, &handle, &sample);
> +    perf_output_end(&handle);
> +
> +    barrier();
> +    work->func = NULL; /* done */
> +}
> +
> struct perf_callchain_entry *
> perf_callchain(struct perf_event *event, struct pt_regs *regs)
> {
> -    bool kernel = !event->attr.exclude_callchain_kernel;
> -    bool user   = !event->attr.exclude_callchain_user;
> +    bool kernel  = !event->attr.exclude_callchain_kernel;
> +    bool user    = !event->attr.exclude_callchain_user;
> +    bool delayed = event->attr.delayed_user_callchain;
> +
>    /* Disallow cross-task user callchains. */
>    bool crosstask = event->ctx->task && event->ctx->task != current;
>    const u32 max_stack = event->attr.sample_max_stack;
>
> -    if (!kernel && !user)
> -        return NULL;
> +    struct perf_callchain_entry *callchain = NULL;
> +
> +    if (user && delayed && !crosstask) {
> +        struct callback_head *work = &event->pending_callchain;
> +
> +        if (!work->func) {
> +            work->func = perf_callchain_work;
> +            /*
> +             * We cannot do set_notify_resume() from NMI context,
> +             * also, knowing we are already in an interrupted
> +             * context and will pass return to userspace, we can
> +             * simply set TIF_NOTIFY_RESUME.
> +             */
> +            task_work_add(current, work, false);
> +            set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);

There's a more or less unavoidable window in which this won't be
noticed, which could plausibly confuse userspace.  It might be
possible to figure out a way for an NMI to tell if it lands in this
window, but it would be a bit tricky.  Also, is the task_work code
prepared to handle task_work_add() during exit?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC] perf: Delayed userspace unwind (Was: [PATCH v3 00/10] x86: ORC unwinder)
  2017-07-29  3:35               ` Andy Lutomirski
@ 2017-07-29  9:28                 ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-07-29  9:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Josh Poimboeuf, Andres Freund, X86 ML, linux-kernel,
	live-patching, Linus Torvalds, Andy Lutomirski, Jiri Slaby,
	H. Peter Anvin, Mike Galbraith, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin

On Fri, Jul 28, 2017 at 08:35:16PM -0700, Andy Lutomirski wrote:

> I haven't checked task_work specifically, but a bunch of the exit work
> is permitted to sleep, which is potentially useful.

Yes.

> If this becomes successful enough that we could eventually deprecate
> the old code, I wonder if copy_from_user_nmi() could go away? :)

So we still use that for things like the PEBS IP fixup for older CPUs.
That needs to read the userspace code.

Also, since all this is conditional on userspace asking for the new format,
we will probably (forever) need to support userspace not asking for it.

> > +        if (!work->func) {
> > +            work->func = perf_callchain_work;
> > +            /*
> > +             * We cannot do set_notify_resume() from NMI context,
> > +             * also, knowing we are already in an interrupted
> > +             * context and will pass return to userspace, we can
> > +             * simply set TIF_NOTIFY_RESUME.
> > +             */
> > +            task_work_add(current, work, false);
> > +            set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> 
> There's a more or less unavoidable window in which this won't be
> noticed, which could plausibly confuse userspace.  It might be
> possible to figure out a way for an NMI to tell if it lands in this
> window, but it would be a bit tricky.

Correct, I have been thinking about how to do that but haven't found
anything particularly nice yet.

> Also, is the task_work code prepared to handle task_work_add during
> exit?

That is one I hadn't thought of, but basically task_work_add() will fail
if the task is too far gone.  At that point we should fall back to the
'old' behaviour and simply include the information in the kernel SAMPLE
record.
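
[ Editor's note: a hypothetical sketch of that fallback on top of the
  patch upthread; the error handling shown here is an assumption, not
  part of the posted code. ]

	if (user && delayed && !crosstask) {
		struct callback_head *work = &event->pending_callchain;

		if (!work->func) {
			work->func = perf_callchain_work;
			if (task_work_add(current, work, false)) {
				/* Task is exiting: unwind synchronously instead. */
				work->func = NULL;
				delayed = false;
			} else {
				set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
			}
		}
		if (delayed)
			user = false;
	}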

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2017-07-29  9:28 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-11 15:33 [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Josh Poimboeuf
2017-07-11 15:33 ` [PATCH v3 01/10] x86/entry/64: Refactor IRQ stacks and make them NMI-safe Josh Poimboeuf
2017-07-18 10:40   ` [tip:x86/asm] " tip-bot for Andy Lutomirski
2017-07-11 15:33 ` [PATCH v3 02/10] x86/entry/64: Initialize the top of the IRQ stack before switching stacks Josh Poimboeuf
2017-07-18 10:41   ` [tip:x86/asm] " tip-bot for Andy Lutomirski
2017-07-11 15:33 ` [PATCH v3 03/10] x86/dumpstack: fix occasionally missing registers Josh Poimboeuf
2017-07-18 10:41   ` [tip:x86/asm] x86/dumpstack: Fix " tip-bot for Josh Poimboeuf
2017-07-11 15:33 ` [PATCH v3 04/10] x86/dumpstack: fix interrupt and exception stack boundary checks Josh Poimboeuf
2017-07-18 10:42   ` [tip:x86/asm] x86/dumpstack: Fix " tip-bot for Josh Poimboeuf
2017-07-11 15:33 ` [PATCH v3 05/10] objtool: add ORC unwind table generation Josh Poimboeuf
2017-07-18 10:42   ` [tip:x86/asm] objtool: Add " tip-bot for Josh Poimboeuf
2017-07-11 15:33 ` [PATCH v3 06/10] objtool, x86: add facility for asm code to provide unwind hints Josh Poimboeuf
2017-07-18 10:43   ` [tip:x86/asm] objtool, x86: Add " tip-bot for Josh Poimboeuf
2017-07-11 15:33 ` [PATCH v3 07/10] x86/entry/64: add unwind hint annotations Josh Poimboeuf
2017-07-18 10:43   ` [tip:x86/asm] x86/entry/64: Add " tip-bot for Josh Poimboeuf
2017-07-11 15:33 ` [PATCH v3 08/10] x86/asm: add unwind hint annotations to sync_core() Josh Poimboeuf
2017-07-18 10:43   ` [tip:x86/asm] x86/asm: Add " tip-bot for Josh Poimboeuf
2017-07-11 15:33 ` [PATCH v3 09/10] x86/unwind: add ORC unwinder Josh Poimboeuf
2017-07-14 17:22   ` [PATCH v3.1 " Josh Poimboeuf
2017-07-20  7:12     ` Jiri Slaby
2017-07-20 21:16       ` Josh Poimboeuf
2017-07-11 15:33 ` [PATCH v3 10/10] x86/kconfig: make it easier to switch to the new " Josh Poimboeuf
2017-07-12  8:27 ` [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Ingo Molnar
2017-07-12 14:42   ` Josh Poimboeuf
2017-07-12 19:27     ` Ingo Molnar
2017-07-14 17:17       ` Josh Poimboeuf
2017-07-25  9:09         ` Ingo Molnar
2017-07-25 17:58           ` Josh Poimboeuf
2017-07-25 18:46             ` Kees Cook
2017-07-12 21:49 ` Andres Freund
2017-07-12 22:32   ` Josh Poimboeuf
2017-07-12 22:36     ` Andres Freund
2017-07-12 22:40       ` Josh Poimboeuf
2017-07-12 22:54         ` Andres Freund
2017-07-13  7:12     ` Peter Zijlstra
2017-07-13  8:50       ` Peter Zijlstra
2017-07-13  8:51         ` Peter Zijlstra
2017-07-13  9:19           ` Ingo Molnar
2017-07-13 12:17             ` Josh Poimboeuf
2017-07-13 12:21               ` Josh Poimboeuf
2017-07-13 12:35                 ` Josh Poimboeuf
2017-07-14  8:33                   ` Ingo Molnar
2017-07-14  8:29               ` Ingo Molnar
2017-07-25 11:55             ` [RFC] perf: Delayed userspace unwind (Was: [PATCH v3 00/10] x86: ORC unwinder) Peter Zijlstra
2017-07-28 14:13               ` Jiri Olsa
2017-07-28 14:21                 ` Peter Zijlstra
2017-07-29  3:35               ` Andy Lutomirski
2017-07-29  9:28                 ` Peter Zijlstra
2017-07-12 22:30 ` [PATCH v3 00/10] x86: ORC unwinder (previously undwarf) Andi Kleen
2017-07-12 22:47   ` Josh Poimboeuf
2017-07-13  4:29     ` Andi Kleen
2017-07-13 13:15       ` Josh Poimboeuf
2017-07-13  9:29     ` Ingo Molnar
2017-07-12 23:22   ` Andy Lutomirski
2017-07-13  3:03   ` Mike Galbraith
2017-07-13  4:15     ` Andi Kleen
2017-07-13  4:28       ` Mike Galbraith
2017-07-13  4:40         ` Andi Kleen
2017-07-13  5:22           ` Mike Galbraith
2017-07-13 12:02           ` Jiri Kosina
