linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/2] ORC unwinder
@ 2017-07-24 23:36 Josh Poimboeuf
  2017-07-24 23:36 ` [PATCH v4 1/2] x86/unwind: add " Josh Poimboeuf
  2017-07-24 23:36 ` [PATCH v4 2/2] x86/kconfig: make it easier to switch to the new " Josh Poimboeuf
  0 siblings, 2 replies; 38+ messages in thread
From: Josh Poimboeuf @ 2017-07-24 23:36 UTC
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

All the prerequisite patches have been merged into -tip since I posted
v3.  These last two patches are:

  1) the ORC unwinder itself; and

  2) a kconfig tweak to make it easier to switch from
     CONFIG_FRAME_POINTER to CONFIG_ORC_UNWINDER.

v4 changes:
- slightly tweak documentation wording re: RAM usage (Ingo M)
- use READ_ONCE_NOCHECK in __unwind_start() (Jiri S)
- in orc_find(), use printk_deferred_once() instead of WARN_ON_ONCE()
  because printk doesn't work in all error scenarios

-----

Create a new "ORC" unwinder, enabled by CONFIG_ORC_UNWINDER, and plug it
into the x86 unwinder framework.  Objtool is used to generate the ORC
debuginfo.  The ORC debuginfo format is basically a simplified version
of DWARF CFI.  More details below.

The unwinder works well in my testing.  It unwinds through interrupts,
exceptions, and preemption, with and without frame pointers, across
aligned stacks and dynamically allocated stacks.  If something goes
wrong during an oops, it successfully falls back to printing the '?'
entries just like the frame pointer unwinder.

Some potential future improvements:
- properly annotate or fix whitelisted functions and files
- add reliability checks for livepatch
- runtime NMI stack reliability checker
- generated code integration

This code can also be found at:

  https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git orc-v4

Here's the contents of the orc-unwinder.txt file which explains the
'why' in more detail:


ORC unwinder
============

Overview
--------

The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
similar in concept to a DWARF unwinder.  The difference is that the
format of the ORC data is much simpler than DWARF, which in turn allows
the ORC unwinder to be much simpler and faster.

The ORC data consists of unwind tables which are generated by objtool.
They contain out-of-band data which is used by the in-kernel ORC
unwinder.  Objtool generates the ORC data by first doing compile-time
stack metadata validation (CONFIG_STACK_VALIDATION).  After analyzing
all the code paths of a .o file, it determines information about the
stack state at each instruction address in the file and outputs that
information to the .orc_unwind and .orc_unwind_ip sections.

The per-object ORC sections are combined at link time and are sorted and
post-processed at boot time.  The unwinder uses the resulting data to
correlate instruction addresses with their stack states at run time.


ORC vs frame pointers
---------------------

With frame pointers enabled, GCC adds instrumentation code to every
function in the kernel.  The kernel's .text size increases by about
3.2%, resulting in a broad kernel-wide slowdown.  Measurements by Mel
Gorman [1] have shown a slowdown of 5-10% for some workloads.

In contrast, the ORC unwinder has no effect on text size or runtime
performance, because the debuginfo is out of band.  So if you disable
frame pointers and enable the ORC unwinder, you get a nice performance
improvement across the board, and still have reliable stack traces.

Ingo Molnar says:

  "Note that it's not just a performance improvement, but also an
  instruction cache locality improvement: 3.2% .text savings almost
  directly transform into a similarly sized reduction in cache
  footprint. That can transform to even higher speedups for workloads
  whose cache locality is borderline."

Another benefit of ORC compared to frame pointers is that it can
reliably unwind across interrupts and exceptions.  Frame-pointer-based
unwinds can sometimes skip the caller of the interrupted function, if it
was a leaf function or if the interrupt hit before the frame pointer was
saved.

The main disadvantage of the ORC unwinder compared to frame pointers is
that it needs more memory to store the ORC unwind tables: roughly 2-4MB
depending on the kernel config.


ORC vs DWARF
------------

ORC debuginfo's advantage over DWARF itself is that it's much simpler.
It gets rid of the complex DWARF CFI state machine and also gets rid of
the tracking of unnecessary registers.  This allows the unwinder to be
much simpler, meaning fewer bugs, which is especially important for
mission critical oops code.

The simpler debuginfo format also enables the unwinder to be much faster
than DWARF, which is important for perf and lockdep.  In a basic
performance test by Jiri Slaby [2], the ORC unwinder was about 20x
faster than an out-of-tree DWARF unwinder.  (Note: That measurement was
taken before some performance tweaks were added, which doubled
performance, so the speedup over DWARF may be closer to 40x.)

The ORC data format does have a few downsides compared to DWARF.  ORC
unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
than DWARF-based eh_frame tables.

Another potential downside is that, as GCC evolves, it's conceivable
that the ORC data may end up being *too* simple to describe the state of
the stack for certain optimizations.  But IMO this is unlikely because
GCC saves the frame pointer for any unusual stack adjustments it does,
so I suspect we'll really only ever need to keep track of the stack
pointer and the frame pointer between call frames.  But even if we do
end up having to track all the registers DWARF tracks, at least we will
still be able to control the format, e.g. no complex state machines.


ORC unwind table generation
---------------------------

The ORC data is generated by objtool.  With the existing compile-time
stack metadata validation feature, objtool already follows all code
paths, and so it already has all the information it needs to be able to
generate ORC data from scratch.  So it's an easy step to go from stack
validation to ORC data generation.

It should be possible to instead generate the ORC data with a simple
tool which converts DWARF to ORC data.  However, such a solution would
be incomplete due to the kernel's extensive use of asm, inline asm, and
special sections like exception tables.

That could be rectified by manually annotating those special code paths
using GNU assembler .cfi annotations in .S files, and homegrown
annotations for inline asm in .c files.  But asm annotations were tried
in the past and were found to be unmaintainable.  They were often
incorrect/incomplete and made the code harder to read and keep updated.
And based on looking at glibc code, annotating inline asm in .c files
might be even worse.

Objtool still needs a few annotations, but only in code which does
unusual things to the stack like entry code.  And even then, far fewer
annotations are needed than what DWARF would need, so they're much more
maintainable than DWARF CFI annotations.

So the advantages of using objtool to generate ORC data are that it
gives more accurate debuginfo, with very few annotations.  It also
insulates the kernel from toolchain bugs which can be very painful to
deal with in the kernel since we often have to work around issues in
older versions of the toolchain for years.

The downside is that the unwinder now becomes dependent on objtool's
ability to reverse engineer GCC code flow.  If GCC optimizations become
too complicated for objtool to follow, the ORC data generation might
stop working or become incomplete.  (It's worth noting that livepatch
already has such a dependency on objtool's ability to follow GCC code
flow.)

If newer versions of GCC come up with some optimizations which break
objtool, we may need to revisit the current implementation.  Some
possible solutions would be asking GCC to make the optimizations more
palatable, or having objtool use DWARF as an additional input, or
creating a GCC plugin to assist objtool with its analysis.  But for now,
objtool follows GCC code quite well.


Unwinder implementation details
-------------------------------

Objtool generates the ORC data by integrating with the compile-time
stack metadata validation feature, which is described in detail in
tools/objtool/Documentation/stack-validation.txt.  After analyzing all
the code paths of a .o file, it creates an array of orc_entry structs,
and a parallel array of instruction addresses associated with those
structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
respectively.

The ORC data is split into the two arrays for performance reasons, to
make the searchable part of the data (.orc_unwind_ip) more compact.  The
arrays are sorted in parallel at boot time.

Performance is further improved by the use of a fast lookup table which
is created at runtime.  The fast lookup table associates a given address
with a range of indices for the .orc_unwind table, so that only a small
subset of the table needs to be searched.


Etymology
---------

Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
enemies.  Similarly, the ORC unwinder was created in opposition to the
complexity and slowness of DWARF.

"Although Orcs rarely consider multiple solutions to a problem, they do
excel at getting things done because they are creatures of action, not
thought." [3]  Similarly, unlike the esoteric DWARF unwinder, the
veracious ORC unwinder wastes no time or siloconic effort decoding
variable-length zero-extended unsigned-integer byte-coded
state-machine-based debug information entries.

Similar to how Orcs frequently unravel the well-intentioned plans of
their adversaries, the ORC unwinder frequently unravels stacks with
brutal, unyielding efficiency.

ORC stands for Oops Rewind Capability.


[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
[3] http://dustin.wikidot.com/half-orcs-and-orcs




Josh Poimboeuf (2):
  x86/unwind: add ORC unwinder
  x86/kconfig: make it easier to switch to the new ORC unwinder

 Documentation/x86/orc-unwinder.txt | 179 ++++++++++++
 arch/um/include/asm/unwind.h       |   8 +
 arch/x86/Kconfig                   |   1 +
 arch/x86/Kconfig.debug             |  24 ++
 arch/x86/include/asm/module.h      |   9 +
 arch/x86/include/asm/orc_lookup.h  |  46 +++
 arch/x86/include/asm/orc_types.h   |   2 +-
 arch/x86/include/asm/unwind.h      |  76 +++--
 arch/x86/kernel/Makefile           |   8 +-
 arch/x86/kernel/module.c           |  11 +-
 arch/x86/kernel/setup.c            |   3 +
 arch/x86/kernel/unwind_frame.c     |  39 +--
 arch/x86/kernel/unwind_guess.c     |   5 +
 arch/x86/kernel/unwind_orc.c       | 582 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S      |   3 +
 include/asm-generic/vmlinux.lds.h  |  27 +-
 lib/Kconfig.debug                  |   9 +-
 scripts/Makefile.build             |  14 +-
 18 files changed, 979 insertions(+), 67 deletions(-)
 create mode 100644 Documentation/x86/orc-unwinder.txt
 create mode 100644 arch/um/include/asm/unwind.h
 create mode 100644 arch/x86/include/asm/orc_lookup.h
 create mode 100644 arch/x86/kernel/unwind_orc.c

-- 
2.13.3


* [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-07-24 23:36 [PATCH v4 0/2] ORC unwinder Josh Poimboeuf
@ 2017-07-24 23:36 ` Josh Poimboeuf
  2017-07-26 12:10   ` [tip:x86/asm] x86/unwind: Add the " tip-bot for Josh Poimboeuf
  2017-07-28 16:48   ` [PATCH v4 1/2] x86/unwind: add " Levin, Alexander (Sasha Levin)
  2017-07-24 23:36 ` [PATCH v4 2/2] x86/kconfig: make it easier to switch to the new " Josh Poimboeuf
  1 sibling, 2 replies; 38+ messages in thread
From: Josh Poimboeuf @ 2017-07-24 23:36 UTC
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

Add a new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER.  It
plugs into the existing x86 unwinder framework.

It relies on objtool to generate the needed .orc_unwind and
.orc_unwind_ip sections.

For more details on why ORC is used instead of DWARF, see
Documentation/x86/orc-unwinder.txt.

Thanks to Andy Lutomirski for the performance improvement ideas:
splitting the ORC unwind table into two parallel arrays and creating a
fast lookup table to search a subset of the unwind table.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 Documentation/x86/orc-unwinder.txt | 179 ++++++++++++
 arch/um/include/asm/unwind.h       |   8 +
 arch/x86/Kconfig                   |   1 +
 arch/x86/Kconfig.debug             |  25 ++
 arch/x86/include/asm/module.h      |   9 +
 arch/x86/include/asm/orc_lookup.h  |  46 +++
 arch/x86/include/asm/orc_types.h   |   2 +-
 arch/x86/include/asm/unwind.h      |  76 +++--
 arch/x86/kernel/Makefile           |   8 +-
 arch/x86/kernel/module.c           |  11 +-
 arch/x86/kernel/setup.c            |   3 +
 arch/x86/kernel/unwind_frame.c     |  39 +--
 arch/x86/kernel/unwind_guess.c     |   5 +
 arch/x86/kernel/unwind_orc.c       | 582 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S      |   3 +
 include/asm-generic/vmlinux.lds.h  |  27 +-
 lib/Kconfig.debug                  |   3 +
 scripts/Makefile.build             |  14 +-
 18 files changed, 977 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/x86/orc-unwinder.txt
 create mode 100644 arch/um/include/asm/unwind.h
 create mode 100644 arch/x86/include/asm/orc_lookup.h
 create mode 100644 arch/x86/kernel/unwind_orc.c

diff --git a/Documentation/x86/orc-unwinder.txt b/Documentation/x86/orc-unwinder.txt
new file mode 100644
index 000000000000..af0c9a4c65a6
--- /dev/null
+++ b/Documentation/x86/orc-unwinder.txt
@@ -0,0 +1,179 @@
+ORC unwinder
+============
+
+Overview
+--------
+
+The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
+similar in concept to a DWARF unwinder.  The difference is that the
+format of the ORC data is much simpler than DWARF, which in turn allows
+the ORC unwinder to be much simpler and faster.
+
+The ORC data consists of unwind tables which are generated by objtool.
+They contain out-of-band data which is used by the in-kernel ORC
+unwinder.  Objtool generates the ORC data by first doing compile-time
+stack metadata validation (CONFIG_STACK_VALIDATION).  After analyzing
+all the code paths of a .o file, it determines information about the
+stack state at each instruction address in the file and outputs that
+information to the .orc_unwind and .orc_unwind_ip sections.
+
+The per-object ORC sections are combined at link time and are sorted and
+post-processed at boot time.  The unwinder uses the resulting data to
+correlate instruction addresses with their stack states at run time.
+
+
+ORC vs frame pointers
+---------------------
+
+With frame pointers enabled, GCC adds instrumentation code to every
+function in the kernel.  The kernel's .text size increases by about
+3.2%, resulting in a broad kernel-wide slowdown.  Measurements by Mel
+Gorman [1] have shown a slowdown of 5-10% for some workloads.
+
+In contrast, the ORC unwinder has no effect on text size or runtime
+performance, because the debuginfo is out of band.  So if you disable
+frame pointers and enable the ORC unwinder, you get a nice performance
+improvement across the board, and still have reliable stack traces.
+
+Ingo Molnar says:
+
+  "Note that it's not just a performance improvement, but also an
+  instruction cache locality improvement: 3.2% .text savings almost
+  directly transform into a similarly sized reduction in cache
+  footprint. That can transform to even higher speedups for workloads
+  whose cache locality is borderline."
+
+Another benefit of ORC compared to frame pointers is that it can
+reliably unwind across interrupts and exceptions.  Frame-pointer-based
+unwinds can sometimes skip the caller of the interrupted function, if it
+was a leaf function or if the interrupt hit before the frame pointer was
+saved.
+
+The main disadvantage of the ORC unwinder compared to frame pointers is
+that it needs more memory to store the ORC unwind tables: roughly 2-4MB
+depending on the kernel config.
+
+
+ORC vs DWARF
+------------
+
+ORC debuginfo's advantage over DWARF itself is that it's much simpler.
+It gets rid of the complex DWARF CFI state machine and also gets rid of
+the tracking of unnecessary registers.  This allows the unwinder to be
+much simpler, meaning fewer bugs, which is especially important for
+mission critical oops code.
+
+The simpler debuginfo format also enables the unwinder to be much faster
+than DWARF, which is important for perf and lockdep.  In a basic
+performance test by Jiri Slaby [2], the ORC unwinder was about 20x
+faster than an out-of-tree DWARF unwinder.  (Note: That measurement was
+taken before some performance tweaks were added, which doubled
+performance, so the speedup over DWARF may be closer to 40x.)
+
+The ORC data format does have a few downsides compared to DWARF.  ORC
+unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
+than DWARF-based eh_frame tables.
+
+Another potential downside is that, as GCC evolves, it's conceivable
+that the ORC data may end up being *too* simple to describe the state of
+the stack for certain optimizations.  But IMO this is unlikely because
+GCC saves the frame pointer for any unusual stack adjustments it does,
+so I suspect we'll really only ever need to keep track of the stack
+pointer and the frame pointer between call frames.  But even if we do
+end up having to track all the registers DWARF tracks, at least we will
+still be able to control the format, e.g. no complex state machines.
+
+
+ORC unwind table generation
+---------------------------
+
+The ORC data is generated by objtool.  With the existing compile-time
+stack metadata validation feature, objtool already follows all code
+paths, and so it already has all the information it needs to be able to
+generate ORC data from scratch.  So it's an easy step to go from stack
+validation to ORC data generation.
+
+It should be possible to instead generate the ORC data with a simple
+tool which converts DWARF to ORC data.  However, such a solution would
+be incomplete due to the kernel's extensive use of asm, inline asm, and
+special sections like exception tables.
+
+That could be rectified by manually annotating those special code paths
+using GNU assembler .cfi annotations in .S files, and homegrown
+annotations for inline asm in .c files.  But asm annotations were tried
+in the past and were found to be unmaintainable.  They were often
+incorrect/incomplete and made the code harder to read and keep updated.
+And based on looking at glibc code, annotating inline asm in .c files
+might be even worse.
+
+Objtool still needs a few annotations, but only in code which does
+unusual things to the stack like entry code.  And even then, far fewer
+annotations are needed than what DWARF would need, so they're much more
+maintainable than DWARF CFI annotations.
+
+So the advantages of using objtool to generate ORC data are that it
+gives more accurate debuginfo, with very few annotations.  It also
+insulates the kernel from toolchain bugs which can be very painful to
+deal with in the kernel since we often have to work around issues in
+older versions of the toolchain for years.
+
+The downside is that the unwinder now becomes dependent on objtool's
+ability to reverse engineer GCC code flow.  If GCC optimizations become
+too complicated for objtool to follow, the ORC data generation might
+stop working or become incomplete.  (It's worth noting that livepatch
+already has such a dependency on objtool's ability to follow GCC code
+flow.)
+
+If newer versions of GCC come up with some optimizations which break
+objtool, we may need to revisit the current implementation.  Some
+possible solutions would be asking GCC to make the optimizations more
+palatable, or having objtool use DWARF as an additional input, or
+creating a GCC plugin to assist objtool with its analysis.  But for now,
+objtool follows GCC code quite well.
+
+
+Unwinder implementation details
+-------------------------------
+
+Objtool generates the ORC data by integrating with the compile-time
+stack metadata validation feature, which is described in detail in
+tools/objtool/Documentation/stack-validation.txt.  After analyzing all
+the code paths of a .o file, it creates an array of orc_entry structs,
+and a parallel array of instruction addresses associated with those
+structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
+respectively.
+
+The ORC data is split into the two arrays for performance reasons, to
+make the searchable part of the data (.orc_unwind_ip) more compact.  The
+arrays are sorted in parallel at boot time.
+
+Performance is further improved by the use of a fast lookup table which
+is created at runtime.  The fast lookup table associates a given address
+with a range of indices for the .orc_unwind table, so that only a small
+subset of the table needs to be searched.
+
+
+Etymology
+---------
+
+Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
+enemies.  Similarly, the ORC unwinder was created in opposition to the
+complexity and slowness of DWARF.
+
+"Although Orcs rarely consider multiple solutions to a problem, they do
+excel at getting things done because they are creatures of action, not
+thought." [3]  Similarly, unlike the esoteric DWARF unwinder, the
+veracious ORC unwinder wastes no time or siloconic effort decoding
+variable-length zero-extended unsigned-integer byte-coded
+state-machine-based debug information entries.
+
+Similar to how Orcs frequently unravel the well-intentioned plans of
+their adversaries, the ORC unwinder frequently unravels stacks with
+brutal, unyielding efficiency.
+
+ORC stands for Oops Rewind Capability.
+
+
+[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
+[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
+[3] http://dustin.wikidot.com/half-orcs-and-orcs
diff --git a/arch/um/include/asm/unwind.h b/arch/um/include/asm/unwind.h
new file mode 100644
index 000000000000..7ffa5437b761
--- /dev/null
+++ b/arch/um/include/asm/unwind.h
@@ -0,0 +1,8 @@
+#ifndef _ASM_UML_UNWIND_H
+#define _ASM_UML_UNWIND_H
+
+static inline void
+unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+		   void *orc, size_t orc_size) {}
+
+#endif /* _ASM_UML_UNWIND_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ff637dedfafa..21c75a652bb9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -157,6 +157,7 @@ config X86
 	select HAVE_MEMBLOCK
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_MIXED_BREAKPOINTS_REGS
+	select HAVE_MOD_ARCH_SPECIFIC
 	select HAVE_NMI
 	select HAVE_OPROFILE
 	select HAVE_OPTPROBES
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 1fc519f3c49e..d5bca2ec8a74 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -356,4 +356,29 @@ config PUNIT_ATOM_DEBUG
 	  The current power state can be read from
 	  /sys/kernel/debug/punit_atom/dev_power_state
 
+config ORC_UNWINDER
+	bool "ORC unwinder"
+	depends on X86_64
+	select STACK_VALIDATION
+	---help---
+	  This option enables the ORC (Oops Rewind Capability) unwinder for
+	  unwinding kernel stack traces.  It uses a custom data format which is
+	  a simplified version of the DWARF Call Frame Information standard.
+
+	  This unwinder is more accurate across interrupt entry frames than the
+	  frame pointer unwinder.  It can also enable a 5-10% performance
+	  improvement across the entire kernel if CONFIG_FRAME_POINTER is
+	  disabled.
+
+	  Enabling this option will increase the kernel's runtime memory usage
+	  by roughly 2-4MB, depending on your kernel config.
+
+config FRAME_POINTER_UNWINDER
+	def_bool y
+	depends on !ORC_UNWINDER && FRAME_POINTER
+
+config GUESS_UNWINDER
+	def_bool y
+	depends on !ORC_UNWINDER && !FRAME_POINTER
+
 endmenu
diff --git a/arch/x86/include/asm/module.h b/arch/x86/include/asm/module.h
index e3b7819caeef..9eb7c718aaf8 100644
--- a/arch/x86/include/asm/module.h
+++ b/arch/x86/include/asm/module.h
@@ -2,6 +2,15 @@
 #define _ASM_X86_MODULE_H
 
 #include <asm-generic/module.h>
+#include <asm/orc_types.h>
+
+struct mod_arch_specific {
+#ifdef CONFIG_ORC_UNWINDER
+	unsigned int num_orcs;
+	int *orc_unwind_ip;
+	struct orc_entry *orc_unwind;
+#endif
+};
 
 #ifdef CONFIG_X86_64
 /* X86_64 does not define MODULE_PROC_FAMILY */
diff --git a/arch/x86/include/asm/orc_lookup.h b/arch/x86/include/asm/orc_lookup.h
new file mode 100644
index 000000000000..91c8d868424d
--- /dev/null
+++ b/arch/x86/include/asm/orc_lookup.h
@@ -0,0 +1,46 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef _ORC_LOOKUP_H
+#define _ORC_LOOKUP_H
+
+/*
+ * This is a lookup table for speeding up access to the .orc_unwind table.
+ * Given an input address offset, the corresponding lookup table entry
+ * specifies a subset of the .orc_unwind table to search.
+ *
+ * Each block represents the end of the previous range and the start of the
+ * next range.  An extra block is added to give the last range an end.
+ *
+ * The block size should be a power of 2 to avoid a costly 'div' instruction.
+ *
+ * A block size of 256 was chosen because it roughly doubles unwinder
+ * performance while only adding ~5% to the ORC data footprint.
+ */
+#define LOOKUP_BLOCK_ORDER	8
+#define LOOKUP_BLOCK_SIZE	(1 << LOOKUP_BLOCK_ORDER)
+
+#ifndef LINKER_SCRIPT
+
+extern unsigned int orc_lookup[];
+extern unsigned int orc_lookup_end[];
+
+#define LOOKUP_START_IP		(unsigned long)_stext
+#define LOOKUP_STOP_IP		(unsigned long)_etext
+
+#endif /* LINKER_SCRIPT */
+
+#endif /* _ORC_LOOKUP_H */
diff --git a/arch/x86/include/asm/orc_types.h b/arch/x86/include/asm/orc_types.h
index 7dc777a6cb40..9c9dc579bd7d 100644
--- a/arch/x86/include/asm/orc_types.h
+++ b/arch/x86/include/asm/orc_types.h
@@ -88,7 +88,7 @@ struct orc_entry {
 	unsigned	sp_reg:4;
 	unsigned	bp_reg:4;
 	unsigned	type:2;
-};
+} __packed;
 
 /*
  * This struct is used by asm and inline asm code to manually annotate the
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index e6676495b125..25b8d31a007d 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -12,11 +12,14 @@ struct unwind_state {
 	struct task_struct *task;
 	int graph_idx;
 	bool error;
-#ifdef CONFIG_FRAME_POINTER
+#if defined(CONFIG_ORC_UNWINDER)
+	bool signal, full_regs;
+	unsigned long sp, bp, ip;
+	struct pt_regs *regs;
+#elif defined(CONFIG_FRAME_POINTER)
 	bool got_irq;
-	unsigned long *bp, *orig_sp;
+	unsigned long *bp, *orig_sp, ip;
 	struct pt_regs *regs;
-	unsigned long ip;
 #else
 	unsigned long *sp;
 #endif
@@ -24,41 +27,30 @@ struct unwind_state {
 
 void __unwind_start(struct unwind_state *state, struct task_struct *task,
 		    struct pt_regs *regs, unsigned long *first_frame);
-
 bool unwind_next_frame(struct unwind_state *state);
-
 unsigned long unwind_get_return_address(struct unwind_state *state);
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state);
 
 static inline bool unwind_done(struct unwind_state *state)
 {
 	return state->stack_info.type == STACK_TYPE_UNKNOWN;
 }
 
-static inline
-void unwind_start(struct unwind_state *state, struct task_struct *task,
-		  struct pt_regs *regs, unsigned long *first_frame)
-{
-	first_frame = first_frame ? : get_stack_pointer(task, regs);
-
-	__unwind_start(state, task, regs, first_frame);
-}
-
 static inline bool unwind_error(struct unwind_state *state)
 {
 	return state->error;
 }
 
-#ifdef CONFIG_FRAME_POINTER
-
 static inline
-unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+void unwind_start(struct unwind_state *state, struct task_struct *task,
+		  struct pt_regs *regs, unsigned long *first_frame)
 {
-	if (unwind_done(state))
-		return NULL;
+	first_frame = first_frame ? : get_stack_pointer(task, regs);
 
-	return state->regs ? &state->regs->ip : state->bp + 1;
+	__unwind_start(state, task, regs, first_frame);
 }
 
+#if defined(CONFIG_ORC_UNWINDER) || defined(CONFIG_FRAME_POINTER)
 static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	if (unwind_done(state))
@@ -66,20 +58,46 @@ static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 
 	return state->regs;
 }
-
-#else /* !CONFIG_FRAME_POINTER */
-
-static inline
-unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+#else
+static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	return NULL;
 }
+#endif
 
-static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
+#ifdef CONFIG_ORC_UNWINDER
+void unwind_init(void);
+void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+			void *orc, size_t orc_size);
+#else
+static inline void unwind_init(void) {}
+static inline
+void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+			void *orc, size_t orc_size) {}
+#endif
+
+/*
+ * This disables KASAN checking when reading a value from another task's stack,
+ * since the other task could be running on another CPU and could have poisoned
+ * the stack in the meantime.
+ */
+#define READ_ONCE_TASK_STACK(task, x)			\
+({							\
+	unsigned long val;				\
+	if (task == current)				\
+		val = READ_ONCE(x);			\
+	else						\
+		val = READ_ONCE_NOCHECK(x);		\
+	val;						\
+})
+
+static inline bool task_on_another_cpu(struct task_struct *task)
 {
-	return NULL;
+#ifdef CONFIG_SMP
+	return task != current && task->on_cpu;
+#else
+	return false;
+#endif
 }
 
-#endif /* CONFIG_FRAME_POINTER */
-
 #endif /* _ASM_X86_UNWIND_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a01892bdd61a..287eac7d207f 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -126,11 +126,9 @@ obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
 obj-$(CONFIG_TRACING)			+= tracepoint.o
 obj-$(CONFIG_SCHED_MC_PRIO)		+= itmt.o
 
-ifdef CONFIG_FRAME_POINTER
-obj-y					+= unwind_frame.o
-else
-obj-y					+= unwind_guess.o
-endif
+obj-$(CONFIG_ORC_UNWINDER)		+= unwind_orc.o
+obj-$(CONFIG_FRAME_POINTER_UNWINDER)	+= unwind_frame.o
+obj-$(CONFIG_GUESS_UNWINDER)		+= unwind_guess.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index f67bd3205df7..62e7d70aadd5 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -35,6 +35,7 @@
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/setup.h>
+#include <asm/unwind.h>
 
 #if 0
 #define DEBUGP(fmt, ...)				\
@@ -213,7 +214,7 @@ int module_finalize(const Elf_Ehdr *hdr,
 		    struct module *me)
 {
 	const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
-		*para = NULL;
+		*para = NULL, *orc = NULL, *orc_ip = NULL;
 	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
 
 	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
@@ -225,6 +226,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 			locks = s;
 		if (!strcmp(".parainstructions", secstrings + s->sh_name))
 			para = s;
+		if (!strcmp(".orc_unwind", secstrings + s->sh_name))
+			orc = s;
+		if (!strcmp(".orc_unwind_ip", secstrings + s->sh_name))
+			orc_ip = s;
 	}
 
 	if (alt) {
@@ -248,6 +253,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 	/* make jump label nops */
 	jump_label_apply_nops(me);
 
+	if (orc && orc_ip)
+		unwind_module_init(me, (void *)orc_ip->sh_addr, orc_ip->sh_size,
+				   (void *)orc->sh_addr, orc->sh_size);
+
 	return 0;
 }
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 0bfe0c1628f6..022ebddb3734 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -116,6 +116,7 @@
 #include <asm/microcode.h>
 #include <asm/mmu_context.h>
 #include <asm/kaslr.h>
+#include <asm/unwind.h>
 
 /*
  * max_low_pfn_mapped: highest direct mapped pfn under 4GB
@@ -1319,6 +1320,8 @@ void __init setup_arch(char **cmdline_p)
 	if (efi_enabled(EFI_BOOT))
 		efi_apply_memmap_quirks();
 #endif
+
+	unwind_init();
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index b9389d72b2f7..7574ef5f16ec 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -10,20 +10,22 @@
 
 #define FRAME_HEADER_SIZE (sizeof(long) * 2)
 
-/*
- * This disables KASAN checking when reading a value from another task's stack,
- * since the other task could be running on another CPU and could have poisoned
- * the stack in the meantime.
- */
-#define READ_ONCE_TASK_STACK(task, x)			\
-({							\
-	unsigned long val;				\
-	if (task == current)				\
-		val = READ_ONCE(x);			\
-	else						\
-		val = READ_ONCE_NOCHECK(x);		\
-	val;						\
-})
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return 0;
+
+	return __kernel_text_address(state->ip) ? state->ip : 0;
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return NULL;
+
+	return state->regs ? &state->regs->ip : state->bp + 1;
+}
 
 static void unwind_dump(struct unwind_state *state)
 {
@@ -66,15 +68,6 @@ static void unwind_dump(struct unwind_state *state)
 	}
 }
 
-unsigned long unwind_get_return_address(struct unwind_state *state)
-{
-	if (unwind_done(state))
-		return 0;
-
-	return __kernel_text_address(state->ip) ? state->ip : 0;
-}
-EXPORT_SYMBOL_GPL(unwind_get_return_address);
-
 static size_t regs_size(struct pt_regs *regs)
 {
 	/* x86_32 regs from kernel mode are two words shorter: */
diff --git a/arch/x86/kernel/unwind_guess.c b/arch/x86/kernel/unwind_guess.c
index 039f36738e49..4f0e17b90463 100644
--- a/arch/x86/kernel/unwind_guess.c
+++ b/arch/x86/kernel/unwind_guess.c
@@ -19,6 +19,11 @@ unsigned long unwind_get_return_address(struct unwind_state *state)
 }
 EXPORT_SYMBOL_GPL(unwind_get_return_address);
 
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	return NULL;
+}
+
 bool unwind_next_frame(struct unwind_state *state)
 {
 	struct stack_info *info = &state->stack_info;
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
new file mode 100644
index 000000000000..570b70d3f604
--- /dev/null
+++ b/arch/x86/kernel/unwind_orc.c
@@ -0,0 +1,582 @@
+#include <linux/module.h>
+#include <linux/sort.h>
+#include <asm/ptrace.h>
+#include <asm/stacktrace.h>
+#include <asm/unwind.h>
+#include <asm/orc_types.h>
+#include <asm/orc_lookup.h>
+#include <asm/sections.h>
+
+#define orc_warn(fmt, ...) \
+	printk_deferred_once(KERN_WARNING pr_fmt("WARNING: " fmt), ##__VA_ARGS__)
+
+extern int __start_orc_unwind_ip[];
+extern int __stop_orc_unwind_ip[];
+extern struct orc_entry __start_orc_unwind[];
+extern struct orc_entry __stop_orc_unwind[];
+
+static DEFINE_MUTEX(sort_mutex);
+int *cur_orc_ip_table = __start_orc_unwind_ip;
+struct orc_entry *cur_orc_table = __start_orc_unwind;
+
+unsigned int lookup_num_blocks;
+bool orc_init;
+
+static inline unsigned long orc_ip(const int *ip)
+{
+	return (unsigned long)ip + *ip;
+}
+
+static struct orc_entry *__orc_find(int *ip_table, struct orc_entry *u_table,
+				    unsigned int num_entries, unsigned long ip)
+{
+	int *first = ip_table;
+	int *last = ip_table + num_entries - 1;
+	int *mid = first, *found = first;
+
+	if (!num_entries)
+		return NULL;
+
+	/*
+	 * Do a binary range search to find the rightmost duplicate of a given
+	 * starting address.  Some entries are section terminators which are
+	 * "weak" entries for ensuring there are no gaps.  They should be
+	 * ignored when they conflict with a real entry.
+	 */
+	while (first <= last) {
+		mid = first + ((last - first) / 2);
+
+		if (orc_ip(mid) <= ip) {
+			found = mid;
+			first = mid + 1;
+		} else
+			last = mid - 1;
+	}
+
+	return u_table + (found - ip_table);
+}
+
+#ifdef CONFIG_MODULES
+static struct orc_entry *orc_module_find(unsigned long ip)
+{
+	struct module *mod;
+
+	mod = __module_address(ip);
+	if (!mod || !mod->arch.orc_unwind || !mod->arch.orc_unwind_ip)
+		return NULL;
+	return __orc_find(mod->arch.orc_unwind_ip, mod->arch.orc_unwind,
+			  mod->arch.num_orcs, ip);
+}
+#else
+static struct orc_entry *orc_module_find(unsigned long ip)
+{
+	return NULL;
+}
+#endif
+
+static struct orc_entry *orc_find(unsigned long ip)
+{
+	if (!orc_init)
+		return NULL;
+
+	/* For non-init vmlinux addresses, use the fast lookup table: */
+	if (ip >= LOOKUP_START_IP && ip < LOOKUP_STOP_IP) {
+		unsigned int idx, start, stop;
+
+		idx = (ip - LOOKUP_START_IP) / LOOKUP_BLOCK_SIZE;
+
+		if (unlikely(idx >= lookup_num_blocks - 1)) {
+			orc_warn("bad lookup idx: idx=%u num=%u ip=%lx\n",
+				 idx, lookup_num_blocks, ip);
+			return NULL;
+		}
+
+		start = orc_lookup[idx];
+		stop = orc_lookup[idx + 1] + 1;
+
+		if (unlikely((__start_orc_unwind + start >= __stop_orc_unwind) ||
+			     (__start_orc_unwind + stop > __stop_orc_unwind))) {
+			orc_warn("bad lookup value: idx=%u num=%u start=%u stop=%u ip=%lx\n",
+				 idx, lookup_num_blocks, start, stop, ip);
+			return NULL;
+		}
+
+		return __orc_find(__start_orc_unwind_ip + start,
+				  __start_orc_unwind + start, stop - start, ip);
+	}
+
+	/* vmlinux .init slow lookup: */
+	if (ip >= (unsigned long)_sinittext && ip < (unsigned long)_einittext)
+		return __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
+				  __stop_orc_unwind_ip - __start_orc_unwind_ip, ip);
+
+	/* Module lookup: */
+	return orc_module_find(ip);
+}
+
+static void orc_sort_swap(void *_a, void *_b, int size)
+{
+	struct orc_entry *orc_a, *orc_b;
+	struct orc_entry orc_tmp;
+	int *a = _a, *b = _b, tmp;
+	int delta = _b - _a;
+
+	/* Swap the .orc_unwind_ip entries: */
+	tmp = *a;
+	*a = *b + delta;
+	*b = tmp - delta;
+
+	/* Swap the corresponding .orc_unwind entries: */
+	orc_a = cur_orc_table + (a - cur_orc_ip_table);
+	orc_b = cur_orc_table + (b - cur_orc_ip_table);
+	orc_tmp = *orc_a;
+	*orc_a = *orc_b;
+	*orc_b = orc_tmp;
+}
+
+static int orc_sort_cmp(const void *_a, const void *_b)
+{
+	struct orc_entry *orc_a;
+	const int *a = _a, *b = _b;
+	unsigned long a_val = orc_ip(a);
+	unsigned long b_val = orc_ip(b);
+
+	if (a_val > b_val)
+		return 1;
+	if (a_val < b_val)
+		return -1;
+
+	/*
+	 * The "weak" section terminator entries need to always be on the left
+	 * to ensure the lookup code skips them in favor of real entries.
+	 * These terminator entries exist to handle any gaps created by
+	 * whitelisted .o files which didn't get objtool generation.
+	 */
+	orc_a = cur_orc_table + (a - cur_orc_ip_table);
+	return orc_a->sp_reg == ORC_REG_UNDEFINED ? -1 : 1;
+}
+
+#ifdef CONFIG_MODULES
+void unwind_module_init(struct module *mod, void *_orc_ip, size_t orc_ip_size,
+			void *_orc, size_t orc_size)
+{
+	int *orc_ip = _orc_ip;
+	struct orc_entry *orc = _orc;
+	unsigned int num_entries = orc_ip_size / sizeof(int);
+
+	WARN_ON_ONCE(orc_ip_size % sizeof(int) != 0 ||
+		     orc_size % sizeof(*orc) != 0 ||
+		     num_entries != orc_size / sizeof(*orc));
+
+	/*
+	 * The 'cur_orc_*' globals allow the orc_sort_swap() callback to
+	 * associate an .orc_unwind_ip table entry with its corresponding
+	 * .orc_unwind entry so they can both be swapped.
+	 */
+	mutex_lock(&sort_mutex);
+	cur_orc_ip_table = orc_ip;
+	cur_orc_table = orc;
+	sort(orc_ip, num_entries, sizeof(int), orc_sort_cmp, orc_sort_swap);
+	mutex_unlock(&sort_mutex);
+
+	mod->arch.orc_unwind_ip = orc_ip;
+	mod->arch.orc_unwind = orc;
+	mod->arch.num_orcs = num_entries;
+}
+#endif
+
+void __init unwind_init(void)
+{
+	size_t orc_ip_size = (void *)__stop_orc_unwind_ip - (void *)__start_orc_unwind_ip;
+	size_t orc_size = (void *)__stop_orc_unwind - (void *)__start_orc_unwind;
+	size_t num_entries = orc_ip_size / sizeof(int);
+	struct orc_entry *orc;
+	int i;
+
+	if (!num_entries || orc_ip_size % sizeof(int) != 0 ||
+	    orc_size % sizeof(struct orc_entry) != 0 ||
+	    num_entries != orc_size / sizeof(struct orc_entry)) {
+		orc_warn("Bad or missing .orc_unwind table.  Disabling unwinder.\n");
+		return;
+	}
+
+	/* Sort the .orc_unwind and .orc_unwind_ip tables: */
+	sort(__start_orc_unwind_ip, num_entries, sizeof(int), orc_sort_cmp,
+	     orc_sort_swap);
+
+	/* Initialize the fast lookup table: */
+	lookup_num_blocks = orc_lookup_end - orc_lookup;
+	for (i = 0; i < lookup_num_blocks-1; i++) {
+		orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
+				 num_entries,
+				 LOOKUP_START_IP + (LOOKUP_BLOCK_SIZE * i));
+		if (!orc) {
+			orc_warn("Corrupt .orc_unwind table.  Disabling unwinder.\n");
+			return;
+		}
+
+		orc_lookup[i] = orc - __start_orc_unwind;
+	}
+
+	/* Initialize the ending block: */
+	orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind, num_entries,
+			 LOOKUP_STOP_IP);
+	if (!orc) {
+		orc_warn("Corrupt .orc_unwind table.  Disabling unwinder.\n");
+		return;
+	}
+	orc_lookup[lookup_num_blocks-1] = orc - __start_orc_unwind;
+
+	orc_init = true;
+}
+
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return 0;
+
+	return __kernel_text_address(state->ip) ? state->ip : 0;
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return NULL;
+
+	if (state->regs)
+		return &state->regs->ip;
+
+	if (state->sp)
+		return (unsigned long *)state->sp - 1;
+
+	return NULL;
+}
+
+static bool stack_access_ok(struct unwind_state *state, unsigned long addr,
+			    size_t len)
+{
+	struct stack_info *info = &state->stack_info;
+
+	/*
+	 * If the address isn't on the current stack, switch to the next one.
+	 *
+	 * We may have to traverse multiple stacks to deal with the possibility
+	 * that info->next_sp could point to an empty stack and the address
+	 * could be on a subsequent stack.
+	 */
+	while (!on_stack(info, (void *)addr, len))
+		if (get_stack_info(info->next_sp, state->task, info,
+				   &state->stack_mask))
+			return false;
+
+	return true;
+}
+
+static bool deref_stack_reg(struct unwind_state *state, unsigned long addr,
+			    unsigned long *val)
+{
+	if (!stack_access_ok(state, addr, sizeof(long)))
+		return false;
+
+	*val = READ_ONCE_TASK_STACK(state->task, *(unsigned long *)addr);
+	return true;
+}
+
+#define REGS_SIZE (sizeof(struct pt_regs))
+#define SP_OFFSET (offsetof(struct pt_regs, sp))
+#define IRET_REGS_SIZE (REGS_SIZE - offsetof(struct pt_regs, ip))
+#define IRET_SP_OFFSET (SP_OFFSET - offsetof(struct pt_regs, ip))
+
+static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
+			     unsigned long *ip, unsigned long *sp, bool full)
+{
+	size_t regs_size = full ? REGS_SIZE : IRET_REGS_SIZE;
+	size_t sp_offset = full ? SP_OFFSET : IRET_SP_OFFSET;
+	struct pt_regs *regs = (struct pt_regs *)(addr + regs_size - REGS_SIZE);
+
+	if (IS_ENABLED(CONFIG_X86_64)) {
+		if (!stack_access_ok(state, addr, regs_size))
+			return false;
+
+		*ip = regs->ip;
+		*sp = regs->sp;
+
+		return true;
+	}
+
+	if (!stack_access_ok(state, addr, sp_offset))
+		return false;
+
+	*ip = regs->ip;
+
+	if (user_mode(regs)) {
+		if (!stack_access_ok(state, addr + sp_offset,
+				     REGS_SIZE - SP_OFFSET))
+			return false;
+
+		*sp = regs->sp;
+	} else
+		*sp = (unsigned long)&regs->sp;
+
+	return true;
+}
+
+bool unwind_next_frame(struct unwind_state *state)
+{
+	unsigned long ip_p, sp, orig_ip, prev_sp = state->sp;
+	enum stack_type prev_type = state->stack_info.type;
+	struct orc_entry *orc;
+	struct pt_regs *ptregs;
+	bool indirect = false;
+
+	if (unwind_done(state))
+		return false;
+
+	/* Don't let modules unload while we're reading their ORC data. */
+	preempt_disable();
+
+	/* Have we reached the end? */
+	if (state->regs && user_mode(state->regs))
+		goto done;
+
+	/*
+	 * Find the orc_entry associated with the text address.
+	 *
+	 * Decrement call return addresses by one so they work for sibling
+	 * calls and calls to noreturn functions.
+	 */
+	orc = orc_find(state->signal ? state->ip : state->ip - 1);
+	if (!orc || orc->sp_reg == ORC_REG_UNDEFINED)
+		goto done;
+	orig_ip = state->ip;
+
+	/* Find the previous frame's stack: */
+	switch (orc->sp_reg) {
+	case ORC_REG_SP:
+		sp = state->sp + orc->sp_offset;
+		break;
+
+	case ORC_REG_BP:
+		sp = state->bp + orc->sp_offset;
+		break;
+
+	case ORC_REG_SP_INDIRECT:
+		sp = state->sp + orc->sp_offset;
+		indirect = true;
+		break;
+
+	case ORC_REG_BP_INDIRECT:
+		sp = state->bp + orc->sp_offset;
+		indirect = true;
+		break;
+
+	case ORC_REG_R10:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg R10 at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->r10;
+		break;
+
+	case ORC_REG_R13:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg R13 at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->r13;
+		break;
+
+	case ORC_REG_DI:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg DI at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->di;
+		break;
+
+	case ORC_REG_DX:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg DX at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->dx;
+		break;
+
+	default:
+		orc_warn("unknown SP base reg %d for ip %p\n",
+			 orc->sp_reg, (void *)state->ip);
+		goto done;
+	}
+
+	if (indirect) {
+		if (!deref_stack_reg(state, sp, &sp))
+			goto done;
+	}
+
+	/* Find IP, SP and possibly regs: */
+	switch (orc->type) {
+	case ORC_TYPE_CALL:
+		ip_p = sp - sizeof(long);
+
+		if (!deref_stack_reg(state, ip_p, &state->ip))
+			goto done;
+
+		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
+						  state->ip, (void *)ip_p);
+
+		state->sp = sp;
+		state->regs = NULL;
+		state->signal = false;
+		break;
+
+	case ORC_TYPE_REGS:
+		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, true)) {
+			orc_warn("can't dereference registers at %p for ip %p\n",
+				 (void *)sp, (void *)orig_ip);
+			goto done;
+		}
+
+		state->regs = (struct pt_regs *)sp;
+		state->full_regs = true;
+		state->signal = true;
+		break;
+
+	case ORC_TYPE_REGS_IRET:
+		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, false)) {
+			orc_warn("can't dereference iret registers at %p for ip %p\n",
+				 (void *)sp, (void *)orig_ip);
+			goto done;
+		}
+
+		ptregs = container_of((void *)sp, struct pt_regs, ip);
+		if ((unsigned long)ptregs >= prev_sp &&
+		    on_stack(&state->stack_info, ptregs, REGS_SIZE)) {
+			state->regs = ptregs;
+			state->full_regs = false;
+		} else
+			state->regs = NULL;
+
+		state->signal = true;
+		break;
+
+	default:
+		orc_warn("unknown .orc_unwind entry type %d\n", orc->type);
+		break;
+	}
+
+	/* Find BP: */
+	switch (orc->bp_reg) {
+	case ORC_REG_UNDEFINED:
+		if (state->regs && state->full_regs)
+			state->bp = state->regs->bp;
+		break;
+
+	case ORC_REG_PREV_SP:
+		if (!deref_stack_reg(state, sp + orc->bp_offset, &state->bp))
+			goto done;
+		break;
+
+	case ORC_REG_BP:
+		if (!deref_stack_reg(state, state->bp + orc->bp_offset, &state->bp))
+			goto done;
+		break;
+
+	default:
+		orc_warn("unknown BP base reg %d for ip %p\n",
+			 orc->bp_reg, (void *)orig_ip);
+		goto done;
+	}
+
+	/* Prevent a recursive loop due to bad ORC data: */
+	if (state->stack_info.type == prev_type &&
+	    on_stack(&state->stack_info, (void *)state->sp, sizeof(long)) &&
+	    state->sp <= prev_sp) {
+		orc_warn("stack going in the wrong direction? ip=%p\n",
+			 (void *)orig_ip);
+		goto done;
+	}
+
+	preempt_enable();
+	return true;
+
+done:
+	preempt_enable();
+	state->stack_info.type = STACK_TYPE_UNKNOWN;
+	return false;
+}
+EXPORT_SYMBOL_GPL(unwind_next_frame);
+
+void __unwind_start(struct unwind_state *state, struct task_struct *task,
+		    struct pt_regs *regs, unsigned long *first_frame)
+{
+	memset(state, 0, sizeof(*state));
+	state->task = task;
+
+	/*
+	 * Refuse to unwind the stack of a task while it's executing on another
+	 * CPU.  This check is racy, but that's ok: the unwinder has other
+	 * checks to prevent it from going off the rails.
+	 */
+	if (task_on_another_cpu(task))
+		goto done;
+
+	if (regs) {
+		if (user_mode(regs))
+			goto done;
+
+		state->ip = regs->ip;
+		state->sp = kernel_stack_pointer(regs);
+		state->bp = regs->bp;
+		state->regs = regs;
+		state->full_regs = true;
+		state->signal = true;
+
+	} else if (task == current) {
+		asm volatile("lea (%%rip), %0\n\t"
+			     "mov %%rsp, %1\n\t"
+			     "mov %%rbp, %2\n\t"
+			     : "=r" (state->ip), "=r" (state->sp),
+			       "=r" (state->bp));
+
+	} else {
+		struct inactive_task_frame *frame = (void *)task->thread.sp;
+
+		state->sp = task->thread.sp;
+		state->bp = READ_ONCE_NOCHECK(frame->bp);
+		state->ip = READ_ONCE_NOCHECK(frame->ret_addr);
+	}
+
+	if (get_stack_info((unsigned long *)state->sp, state->task,
+			   &state->stack_info, &state->stack_mask))
+		return;
+
+	/*
+	 * The caller can provide the address of the first frame directly
+	 * (first_frame) or indirectly (regs->sp) to indicate which stack frame
+	 * to start unwinding at.  Skip ahead until we reach it.
+	 */
+
+	/* When starting from regs, skip the regs frame: */
+	if (regs) {
+		unwind_next_frame(state);
+		return;
+	}
+
+	/* Otherwise, skip ahead to the user-specified starting frame: */
+	while (!unwind_done(state) &&
+	       (!on_stack(&state->stack_info, first_frame, sizeof(long)) ||
+			state->sp <= (unsigned long)first_frame))
+		unwind_next_frame(state);
+
+	return;
+
+done:
+	state->stack_info.type = STACK_TYPE_UNKNOWN;
+	return;
+}
+EXPORT_SYMBOL_GPL(__unwind_start);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index c8a3b61be0aa..f05f00acac89 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -24,6 +24,7 @@
 #include <asm/asm-offsets.h>
 #include <asm/thread_info.h>
 #include <asm/page_types.h>
+#include <asm/orc_lookup.h>
 #include <asm/cache.h>
 #include <asm/boot.h>
 
@@ -148,6 +149,8 @@ SECTIONS
 
 	BUG_TABLE
 
+	ORC_UNWIND_TABLE
+
 	. = ALIGN(PAGE_SIZE);
 	__vvar_page = .;
 
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index da0be9a8d1de..fffc9bdae025 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -680,6 +680,31 @@
 #define BUG_TABLE
 #endif
 
+#ifdef CONFIG_ORC_UNWINDER
+#define ORC_UNWIND_TABLE						\
+	. = ALIGN(4);							\
+	.orc_unwind_ip : AT(ADDR(.orc_unwind_ip) - LOAD_OFFSET) {	\
+		VMLINUX_SYMBOL(__start_orc_unwind_ip) = .;		\
+		KEEP(*(.orc_unwind_ip))					\
+		VMLINUX_SYMBOL(__stop_orc_unwind_ip) = .;		\
+	}								\
+	. = ALIGN(6);							\
+	.orc_unwind : AT(ADDR(.orc_unwind) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(__start_orc_unwind) = .;			\
+		KEEP(*(.orc_unwind))					\
+		VMLINUX_SYMBOL(__stop_orc_unwind) = .;			\
+	}								\
+	. = ALIGN(4);							\
+	.orc_lookup : AT(ADDR(.orc_lookup) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(orc_lookup) = .;				\
+		. += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) /	\
+			LOOKUP_BLOCK_SIZE) + 1) * 4;			\
+		VMLINUX_SYMBOL(orc_lookup_end) = .;			\
+	}
+#else
+#define ORC_UNWIND_TABLE
+#endif
+
 #ifdef CONFIG_PM_TRACE
 #define TRACEDATA							\
 	. = ALIGN(4);							\
@@ -866,7 +891,7 @@
 		DATA_DATA						\
 		CONSTRUCTORS						\
 	}								\
-	BUG_TABLE
+	BUG_TABLE							\
 
 #define INIT_TEXT_SECTION(inittext_align)				\
 	. = ALIGN(inittext_align);					\
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 98fe715522e8..0f0d019ffb99 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -374,6 +374,9 @@ config STACK_VALIDATION
 	  pointers (if CONFIG_FRAME_POINTER is enabled).  This helps ensure
 	  that runtime stack traces are more reliable.
 
+	  This is also a prerequisite for generation of ORC unwind data, which
+	  is needed for CONFIG_ORC_UNWINDER.
+
 	  For more information, see
 	  tools/objtool/Documentation/stack-validation.txt.
 
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 4a9a2cec0a1b..cc4a6e3b8cc8 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -258,7 +258,8 @@ ifneq ($(SKIP_STACK_VALIDATION),1)
 
 __objtool_obj := $(objtree)/tools/objtool/objtool
 
-objtool_args = check
+objtool_args = $(if $(CONFIG_ORC_UNWINDER),orc generate,check)
+
 ifndef CONFIG_FRAME_POINTER
 objtool_args += --no-fp
 endif
@@ -276,6 +277,11 @@ objtool_obj = $(if $(patsubst y%,, \
 endif # SKIP_STACK_VALIDATION
 endif # CONFIG_STACK_VALIDATION
 
+# Rebuild all objects when objtool changes, or is enabled/disabled.
+objtool_dep = $(objtool_obj)					\
+	      $(wildcard include/config/orc/unwinder.h		\
+			 include/config/stack/validation.h)
+
 define rule_cc_o_c
 	$(call echo-cmd,checksrc) $(cmd_checksrc)			  \
 	$(call cmd_and_fixdep,cc_o_c)					  \
@@ -298,13 +304,13 @@ cmd_undef_syms = echo
 endif
 
 # Built-in and composite module parts
-$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 	$(call cmd,force_checksrc)
 	$(call if_changed_rule,cc_o_c)
 
 # Single-part modules are special since we need to mark them in $(MODVERDIR)
 
-$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 	$(call cmd,force_checksrc)
 	$(call if_changed_rule,cc_o_c)
 	@{ echo $(@:.o=.ko); echo $@; \
@@ -399,7 +405,7 @@ cmd_modversions_S =								\
 endif
 endif
 
-$(obj)/%.o: $(src)/%.S $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.S $(objtool_dep) FORCE
 	$(call if_changed_rule,as_o_S)
 
 targets += $(real-objs-y) $(real-objs-m) $(lib-y)
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v4 2/2] x86/kconfig: make it easier to switch to the new ORC unwinder
  2017-07-24 23:36 [PATCH v4 0/2] ORC unwinder Josh Poimboeuf
  2017-07-24 23:36 ` [PATCH v4 1/2] x86/unwind: add " Josh Poimboeuf
@ 2017-07-24 23:36 ` Josh Poimboeuf
  2017-07-25  9:10   ` Ingo Molnar
  2017-07-26 12:10   ` [tip:x86/asm] x86/kconfig: Make it easier to switch to the new ORC unwinder tip-bot for Josh Poimboeuf
  1 sibling, 2 replies; 38+ messages in thread
From: Josh Poimboeuf @ 2017-07-24 23:36 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, live-patching, Linus Torvalds, Andy Lutomirski,
	Jiri Slaby, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

A couple of Kconfig changes which make it much easier to switch to the
new CONFIG_ORC_UNWINDER:

1) Remove x86 dependencies on CONFIG_FRAME_POINTER for lockdep,
   latencytop, and fault injection.  x86 has a 'guess' unwinder which
   just scans the stack for kernel text addresses.  It's not 100%
   accurate but in many cases it's good enough.  This allows those users
   who don't want the text overhead of the frame pointer or ORC
   unwinders to still use these features.  More importantly, this also
   makes it much more straightforward to disable frame pointers.

2) Make CONFIG_ORC_UNWINDER depend on !CONFIG_FRAME_POINTER.  While it
   would be possible to have both enabled, it doesn't really make sense
   to do so.  So enforce a sane configuration to prevent the user from
   making a dumb mistake.

With these changes, when you disable CONFIG_FRAME_POINTER, "make
oldconfig" will ask if you want to enable CONFIG_ORC_UNWINDER.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/Kconfig.debug | 7 +++----
 lib/Kconfig.debug      | 6 +++---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index d5bca2ec8a74..e6176007b838 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -358,7 +358,7 @@ config PUNIT_ATOM_DEBUG
 
 config ORC_UNWINDER
 	bool "ORC unwinder"
-	depends on X86_64
+	depends on X86_64 && !FRAME_POINTER
 	select STACK_VALIDATION
 	---help---
 	  This option enables the ORC (Oops Rewind Capability) unwinder for
@@ -366,9 +366,8 @@ config ORC_UNWINDER
 	  a simplified version of the DWARF Call Frame Information standard.
 
 	  This unwinder is more accurate across interrupt entry frames than the
-	  frame pointer unwinder.  It can also enable a 5-10% performance
-	  improvement across the entire kernel if CONFIG_FRAME_POINTER is
-	  disabled.
+	  frame pointer unwinder.  It also enables a 5-10% performance
+	  improvement across the entire kernel compared to frame pointers.
 
 	  Enabling this option will increase the kernel's runtime memory usage
 	  by roughly 2-4MB, depending on your kernel config.
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 0f0d019ffb99..32a48e739e26 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1124,7 +1124,7 @@ config LOCKDEP
 	bool
 	depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
 	select STACKTRACE
-	select FRAME_POINTER if !MIPS && !PPC && !ARM_UNWIND && !S390 && !MICROBLAZE && !ARC && !SCORE
+	select FRAME_POINTER if !MIPS && !PPC && !ARM_UNWIND && !S390 && !MICROBLAZE && !ARC && !SCORE && !X86
 	select KALLSYMS
 	select KALLSYMS_ALL
 
@@ -1543,7 +1543,7 @@ config FAULT_INJECTION_STACKTRACE_FILTER
 	depends on FAULT_INJECTION_DEBUG_FS && STACKTRACE_SUPPORT
 	depends on !X86_64
 	select STACKTRACE
-	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !SCORE
+	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !SCORE && !X86
 	help
 	  Provide stacktrace filter for fault-injection capabilities
 
@@ -1552,7 +1552,7 @@ config LATENCYTOP
 	depends on DEBUG_KERNEL
 	depends on STACKTRACE_SUPPORT
 	depends on PROC_FS
-	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC
+	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !X86
 	select KALLSYMS
 	select KALLSYMS_ALL
 	select STACKTRACE
-- 
2.13.3


* Re: [PATCH v4 2/2] x86/kconfig: make it easier to switch to the new ORC unwinder
  2017-07-24 23:36 ` [PATCH v4 2/2] x86/kconfig: make it easier to switch to the new " Josh Poimboeuf
@ 2017-07-25  9:10   ` Ingo Molnar
  2017-07-25 13:54     ` Josh Poimboeuf
  2017-07-26 12:10   ` [tip:x86/asm] x86/kconfig: Make it easier to switch to the new ORC unwinder tip-bot for Josh Poimboeuf
  1 sibling, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2017-07-25  9:10 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> A couple of Kconfig changes which make it much easier to switch to the
> new CONFIG_ORC_UNWINDER:
> 
> 1) Remove x86 dependencies on CONFIG_FRAME_POINTER for lockdep,
>    latencytop, and fault injection.  x86 has a 'guess' unwinder which
>    just scans the stack for kernel text addresses.  It's not 100%
>    accurate but in many cases it's good enough.  This allows those users
>    who don't want the text overhead of the frame pointer or ORC
>    unwinders to still use these features.  More importantly, this also
>    makes it much more straightforward to disable frame pointers.
> 
> 2) Make CONFIG_ORC_UNWINDER depend on !CONFIG_FRAME_POINTER.  While it
>    would be possible to have both enabled, it doesn't really make sense
>    to do so.  So enforce a sane configuration to prevent the user from
>    making a dumb mistake.
> 
> With these changes, when you disable CONFIG_FRAME_POINTER, "make
> oldconfig" will ask if you want to enable CONFIG_ORC_UNWINDER.

Yeah, so I think this is still suboptimal: the frame pointer and the Orc unwinders 
are configured in different places, and the user won't know about the various 
unwinder options unless stumbling across them by accidentally disabling frame 
pointers ...

Also, the Kconfig help text for frame pointers is now actively misleading:

  CONFIG_FRAME_POINTER:

  If you say Y here the resulting kernel image will be slightly
  larger and slower, but it gives very useful debugging information
  in case of kernel bugs. (precise oopses/stacktraces/warnings)

Please, as I suggested it before, make it a multiple choice option: frame, Orc, or 
the guess unwinder.

I'd only offer the 'guess' unwinder if EXPERT is selected, because it's a really 
sub-optimal selection all things considered.

Once things are tested and it's all rosy we can change the default x86 unwinder to 
Orc and organize the naming of the config variables to:

	CONFIG_UNWINDER_ORC
	CONFIG_UNWINDER_FRAME_POINTER
	CONFIG_UNWINDER_GUESS

Thanks,

	Ingo


* Re: [PATCH v4 2/2] x86/kconfig: make it easier to switch to the new ORC unwinder
  2017-07-25  9:10   ` Ingo Molnar
@ 2017-07-25 13:54     ` Josh Poimboeuf
  2017-07-26 12:11       ` [tip:x86/asm] x86/kconfig: Consolidate unwinders into multiple choice selection tip-bot for Josh Poimboeuf
  0 siblings, 1 reply; 38+ messages in thread
From: Josh Poimboeuf @ 2017-07-25 13:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, H. Peter Anvin, Peter Zijlstra,
	Mike Galbraith

On Tue, Jul 25, 2017 at 11:10:59AM +0200, Ingo Molnar wrote:
> 
> * Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> > A couple of Kconfig changes which make it much easier to switch to the
> > new CONFIG_ORC_UNWINDER:
> > 
> > 1) Remove x86 dependencies on CONFIG_FRAME_POINTER for lockdep,
> >    latencytop, and fault injection.  x86 has a 'guess' unwinder which
> >    just scans the stack for kernel text addresses.  It's not 100%
> >    accurate but in many cases it's good enough.  This allows those users
> >    who don't want the text overhead of the frame pointer or ORC
> >    unwinders to still use these features.  More importantly, this also
> >    makes it much more straightforward to disable frame pointers.
> > 
> > 2) Make CONFIG_ORC_UNWINDER depend on !CONFIG_FRAME_POINTER.  While it
> >    would be possible to have both enabled, it doesn't really make sense
> >    to do so.  So enforce a sane configuration to prevent the user from
> >    making a dumb mistake.
> > 
> > With these changes, when you disable CONFIG_FRAME_POINTER, "make
> > oldconfig" will ask if you want to enable CONFIG_ORC_UNWINDER.
> 
> Yeah, so I think this is still suboptimal: the frame pointer and the Orc unwinders 
> are configured in different places, and the user won't know about the various 
> unwinder options unless stumbling across them by accidentally disabling frame 
> pointers ...
> 
> Also, the Kconfig help text for frame pointers is now actively misleading:
> 
>   CONFIG_FRAME_POINTER:
> 
>   If you say Y here the resulting kernel image will be slightly
>   larger and slower, but it gives very useful debugging information
>   in case of kernel bugs. (precise oopses/stacktraces/warnings)
> 
> Please, as I suggested it before, make it a multiple choice option: frame, Orc, or 
> the guess unwinder.
> 
> I'd only offer the 'guess' unwinder if EXPERT is selected, because it's a really 
> sub-optimal selection all things considered.
> 
> Once things are tested and it's all rosy we can change the default x86 unwinder to 
> Orc and organize the naming of the config variables to:
> 
> 	CONFIG_UNWINDER_ORC
> 	CONFIG_UNWINDER_FRAME_POINTER
> 	CONFIG_UNWINDER_GUESS

How about the below patch?  It goes on top of the other two.

----

From: Josh Poimboeuf <jpoimboe@redhat.com>
Subject: [PATCH] x86/kconfig: consolidate unwinders into multiple choice selection

There are three mutually exclusive unwinders.  Make that more obvious by
combining them into a multiple-choice selection:

  CONFIG_FRAME_POINTER_UNWINDER
  CONFIG_ORC_UNWINDER
  CONFIG_GUESS_UNWINDER (if CONFIG_EXPERT=y)

Frame pointers are still the default (for now).

The old CONFIG_FRAME_POINTER option is still used in some
arch-independent places, so keep it around, but make it invisible to the
user on x86.  It's now selected by CONFIG_FRAME_POINTER_UNWINDER.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/Kconfig              |  3 +--
 arch/x86/Kconfig.debug        | 47 ++++++++++++++++++++++++++++++++++++-------
 arch/x86/configs/tiny.config  |  2 ++
 arch/x86/include/asm/unwind.h |  4 ++--
 4 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 21c75a652bb9..1e789ecefc02 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -73,7 +73,6 @@ config X86
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANTS_THP_SWAP		if X86_64
 	select BUILDTIME_EXTABLE_SORT
@@ -168,7 +167,7 @@ config X86
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
-	select HAVE_RELIABLE_STACKTRACE		if X86_64 && FRAME_POINTER && STACK_VALIDATION
+	select HAVE_RELIABLE_STACKTRACE		if X86_64 && FRAME_POINTER_UNWINDER && STACK_VALIDATION
 	select HAVE_STACK_VALIDATION		if X86_64
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index e6176007b838..71a48a30fc84 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -356,9 +356,32 @@ config PUNIT_ATOM_DEBUG
 	  The current power state can be read from
 	  /sys/kernel/debug/punit_atom/dev_power_state
 
+choice
+	prompt "Choose kernel unwinder"
+	default FRAME_POINTER_UNWINDER
+	---help---
+	  This determines which method will be used for unwinding kernel stack
+	  traces for panics, oopses, bugs, warnings, perf, /proc/<pid>/stack,
+	  livepatch, lockdep, and more.
+
+config FRAME_POINTER_UNWINDER
+	bool "Frame pointer unwinder"
+	select FRAME_POINTER
+	---help---
+	  This option enables the frame pointer unwinder for unwinding kernel
+	  stack traces.
+
+	  The unwinder itself is fast and it uses less RAM than the ORC
+	  unwinder, but the kernel text size will grow by ~3% and the kernel's
+	  overall performance will degrade by roughly 5-10%.
+
+	  This option is recommended if you want to use the livepatch
+	  consistency model, as this is currently the only way to get a
+	  reliable stack trace (CONFIG_HAVE_RELIABLE_STACKTRACE).
+
 config ORC_UNWINDER
 	bool "ORC unwinder"
-	depends on X86_64 && !FRAME_POINTER
+	depends on X86_64
 	select STACK_VALIDATION
 	---help---
 	  This option enables the ORC (Oops Rewind Capability) unwinder for
@@ -372,12 +395,22 @@ config ORC_UNWINDER
 	  Enabling this option will increase the kernel's runtime memory usage
 	  by roughly 2-4MB, depending on your kernel config.
 
-config FRAME_POINTER_UNWINDER
-	def_bool y
-	depends on !ORC_UNWINDER && FRAME_POINTER
-
 config GUESS_UNWINDER
-	def_bool y
-	depends on !ORC_UNWINDER && !FRAME_POINTER
+	bool "Guess unwinder"
+	depends on EXPERT
+	---help---
+	  This option enables the "guess" unwinder for unwinding kernel stack
+	  traces.  It scans the stack and reports every kernel text address it
+	  finds.  Some of the addresses it reports may be incorrect.
+
+	  While this option often produces false positives, it can still be
+	  useful in many cases.  Unlike the other unwinders, it has no runtime
+	  overhead.
+
+endchoice
+
+config FRAME_POINTER
+	depends on !ORC_UNWINDER && !GUESS_UNWINDER
+	bool
 
 endmenu
diff --git a/arch/x86/configs/tiny.config b/arch/x86/configs/tiny.config
index 4b429df40d7a..550cd5012b73 100644
--- a/arch/x86/configs/tiny.config
+++ b/arch/x86/configs/tiny.config
@@ -1,3 +1,5 @@
 CONFIG_NOHIGHMEM=y
 # CONFIG_HIGHMEM4G is not set
 # CONFIG_HIGHMEM64G is not set
+CONFIG_GUESS_UNWINDER=y
+# CONFIG_FRAME_POINTER_UNWINDER is not set
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index 25b8d31a007d..e9f793e2df7a 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -16,7 +16,7 @@ struct unwind_state {
 	bool signal, full_regs;
 	unsigned long sp, bp, ip;
 	struct pt_regs *regs;
-#elif defined(CONFIG_FRAME_POINTER)
+#elif defined(CONFIG_FRAME_POINTER_UNWINDER)
 	bool got_irq;
 	unsigned long *bp, *orig_sp, ip;
 	struct pt_regs *regs;
@@ -50,7 +50,7 @@ void unwind_start(struct unwind_state *state, struct task_struct *task,
 	__unwind_start(state, task, regs, first_frame);
 }
 
-#if defined(CONFIG_ORC_UNWINDER) || defined(CONFIG_FRAME_POINTER)
+#if defined(CONFIG_ORC_UNWINDER) || defined(CONFIG_FRAME_POINTER_UNWINDER)
 static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	if (unwind_done(state))
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/asm] x86/unwind: Add the ORC unwinder
  2017-07-24 23:36 ` [PATCH v4 1/2] x86/unwind: add " Josh Poimboeuf
@ 2017-07-26 12:10   ` tip-bot for Josh Poimboeuf
  2017-07-28 16:48   ` [PATCH v4 1/2] x86/unwind: add " Levin, Alexander (Sasha Levin)
  1 sibling, 0 replies; 38+ messages in thread
From: tip-bot for Josh Poimboeuf @ 2017-07-26 12:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: efault, jslaby, tglx, linux-kernel, hpa, jpoimboe, bp, mingo,
	peterz, luto, torvalds, dvlasenk, brgerst

Commit-ID:  ee9f8fce99640811b2b8e79d0d1dbe8bab69ba67
Gitweb:     http://git.kernel.org/tip/ee9f8fce99640811b2b8e79d0d1dbe8bab69ba67
Author:     Josh Poimboeuf <jpoimboe@redhat.com>
AuthorDate: Mon, 24 Jul 2017 18:36:57 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 26 Jul 2017 13:18:20 +0200

x86/unwind: Add the ORC unwinder

Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
It plugs into the existing x86 unwinder framework.

It relies on objtool to generate the needed .orc_unwind and
.orc_unwind_ip sections.

For more details on why ORC is used instead of DWARF, see
Documentation/x86/orc-unwinder.txt - but the short version is
that it's a simplified, fundamentally more robust debuginfo
data structure, which also allows up to two orders of magnitude
faster lookups than the DWARF unwinder - which matters to
profiling workloads like perf.

Thanks to Andy Lutomirski for the performance improvement ideas:
splitting the ORC unwind table into two parallel arrays and creating a
fast lookup table to search a subset of the unwind table.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com
[ Extended the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/x86/orc-unwinder.txt | 179 ++++++++++++
 arch/um/include/asm/unwind.h       |   8 +
 arch/x86/Kconfig                   |   1 +
 arch/x86/Kconfig.debug             |  25 ++
 arch/x86/include/asm/module.h      |   9 +
 arch/x86/include/asm/orc_lookup.h  |  46 +++
 arch/x86/include/asm/orc_types.h   |   2 +-
 arch/x86/include/asm/unwind.h      |  76 +++--
 arch/x86/kernel/Makefile           |   8 +-
 arch/x86/kernel/module.c           |  11 +-
 arch/x86/kernel/setup.c            |   3 +
 arch/x86/kernel/unwind_frame.c     |  39 +--
 arch/x86/kernel/unwind_guess.c     |   5 +
 arch/x86/kernel/unwind_orc.c       | 582 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S      |   3 +
 include/asm-generic/vmlinux.lds.h  |  27 +-
 lib/Kconfig.debug                  |   3 +
 scripts/Makefile.build             |  14 +-
 18 files changed, 977 insertions(+), 64 deletions(-)

diff --git a/Documentation/x86/orc-unwinder.txt b/Documentation/x86/orc-unwinder.txt
new file mode 100644
index 0000000..af0c9a4
--- /dev/null
+++ b/Documentation/x86/orc-unwinder.txt
@@ -0,0 +1,179 @@
+ORC unwinder
+============
+
+Overview
+--------
+
+The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
+similar in concept to a DWARF unwinder.  The difference is that the
+format of the ORC data is much simpler than DWARF, which in turn allows
+the ORC unwinder to be much simpler and faster.
+
+The ORC data consists of unwind tables which are generated by objtool.
+They contain out-of-band data which is used by the in-kernel ORC
+unwinder.  Objtool generates the ORC data by first doing compile-time
+stack metadata validation (CONFIG_STACK_VALIDATION).  After analyzing
+all the code paths of a .o file, it determines information about the
+stack state at each instruction address in the file and outputs that
+information to the .orc_unwind and .orc_unwind_ip sections.
+
+The per-object ORC sections are combined at link time and are sorted and
+post-processed at boot time.  The unwinder uses the resulting data to
+correlate instruction addresses with their stack states at run time.
+
+
+ORC vs frame pointers
+---------------------
+
+With frame pointers enabled, GCC adds instrumentation code to every
+function in the kernel.  The kernel's .text size increases by about
+3.2%, resulting in a broad kernel-wide slowdown.  Measurements by Mel
+Gorman [1] have shown a slowdown of 5-10% for some workloads.
+
+In contrast, the ORC unwinder has no effect on text size or runtime
+performance, because the debuginfo is out of band.  So if you disable
+frame pointers and enable the ORC unwinder, you get a nice performance
+improvement across the board, and still have reliable stack traces.
+
+Ingo Molnar says:
+
+  "Note that it's not just a performance improvement, but also an
+  instruction cache locality improvement: 3.2% .text savings almost
+  directly transform into a similarly sized reduction in cache
+  footprint. That can transform to even higher speedups for workloads
+  whose cache locality is borderline."
+
+Another benefit of ORC compared to frame pointers is that it can
+reliably unwind across interrupts and exceptions.  Frame pointer based
+unwinds can sometimes skip the caller of the interrupted function, if it
+was a leaf function or if the interrupt hit before the frame pointer was
+saved.
+
+The main disadvantage of the ORC unwinder compared to frame pointers is
+that it needs more memory to store the ORC unwind tables: roughly 2-4MB
+depending on the kernel config.
+
+
+ORC vs DWARF
+------------
+
+ORC debuginfo's advantage over DWARF itself is that it's much simpler.
+It gets rid of the complex DWARF CFI state machine and also gets rid of
+the tracking of unnecessary registers.  This allows the unwinder to be
+much simpler, meaning fewer bugs, which is especially important for
+mission critical oops code.
+
+The simpler debuginfo format also enables the unwinder to be much faster
+than DWARF, which is important for perf and lockdep.  In a basic
+performance test by Jiri Slaby [2], the ORC unwinder was about 20x
+faster than an out-of-tree DWARF unwinder.  (Note: That measurement was
+taken before some performance tweaks were added, which doubled
+performance, so the speedup over DWARF may be closer to 40x.)
+
+The ORC data format does have a few downsides compared to DWARF.  ORC
+unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
+than DWARF-based eh_frame tables.
+
+Another potential downside is that, as GCC evolves, it's conceivable
+that the ORC data may end up being *too* simple to describe the state of
+the stack for certain optimizations.  But IMO this is unlikely because
+GCC saves the frame pointer for any unusual stack adjustments it does,
+so I suspect we'll really only ever need to keep track of the stack
+pointer and the frame pointer between call frames.  But even if we do
+end up having to track all the registers DWARF tracks, at least we will
+still be able to control the format, e.g. no complex state machines.
+
+
+ORC unwind table generation
+---------------------------
+
+The ORC data is generated by objtool.  With the existing compile-time
+stack metadata validation feature, objtool already follows all code
+paths, and so it already has all the information it needs to be able to
+generate ORC data from scratch.  So it's an easy step to go from stack
+validation to ORC data generation.
+
+It should be possible to instead generate the ORC data with a simple
+tool which converts DWARF to ORC data.  However, such a solution would
+be incomplete due to the kernel's extensive use of asm, inline asm, and
+special sections like exception tables.
+
+That could be rectified by manually annotating those special code paths
+using GNU assembler .cfi annotations in .S files, and homegrown
+annotations for inline asm in .c files.  But asm annotations were tried
+in the past and were found to be unmaintainable.  They were often
+incorrect/incomplete and made the code harder to read and keep updated.
+And based on looking at glibc code, annotating inline asm in .c files
+might be even worse.
+
+Objtool still needs a few annotations, but only in code which does
+unusual things to the stack like entry code.  And even then, far fewer
+annotations are needed than what DWARF would need, so they're much more
+maintainable than DWARF CFI annotations.
+
+So the advantages of using objtool to generate ORC data are that it
+gives more accurate debuginfo, with very few annotations.  It also
+insulates the kernel from toolchain bugs which can be very painful to
+deal with in the kernel since we often have to work around issues in
+older versions of the toolchain for years.
+
+The downside is that the unwinder now becomes dependent on objtool's
+ability to reverse engineer GCC code flow.  If GCC optimizations become
+too complicated for objtool to follow, the ORC data generation might
+stop working or become incomplete.  (It's worth noting that livepatch
+already has such a dependency on objtool's ability to follow GCC code
+flow.)
+
+If newer versions of GCC come up with some optimizations which break
+objtool, we may need to revisit the current implementation.  Some
+possible solutions would be asking GCC to make the optimizations more
+palatable, or having objtool use DWARF as an additional input, or
+creating a GCC plugin to assist objtool with its analysis.  But for now,
+objtool follows GCC code quite well.
+
+
+Unwinder implementation details
+-------------------------------
+
+Objtool generates the ORC data by integrating with the compile-time
+stack metadata validation feature, which is described in detail in
+tools/objtool/Documentation/stack-validation.txt.  After analyzing all
+the code paths of a .o file, it creates an array of orc_entry structs,
+and a parallel array of instruction addresses associated with those
+structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
+respectively.
+
+The ORC data is split into the two arrays for performance reasons, to
+make the searchable part of the data (.orc_unwind_ip) more compact.  The
+arrays are sorted in parallel at boot time.
+
+Performance is further improved by the use of a fast lookup table which
+is created at runtime.  The fast lookup table associates a given address
+with a range of indices for the .orc_unwind table, so that only a small
+subset of the table needs to be searched.
+
+
+Etymology
+---------
+
+Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
+enemies.  Similarly, the ORC unwinder was created in opposition to the
+complexity and slowness of DWARF.
+
+"Although Orcs rarely consider multiple solutions to a problem, they do
+excel at getting things done because they are creatures of action, not
+thought." [3]  Similarly, unlike the esoteric DWARF unwinder, the
+veracious ORC unwinder wastes no time or siloconic effort decoding
+variable-length zero-extended unsigned-integer byte-coded
+state-machine-based debug information entries.
+
+Similar to how Orcs frequently unravel the well-intentioned plans of
+their adversaries, the ORC unwinder frequently unravels stacks with
+brutal, unyielding efficiency.
+
+ORC stands for Oops Rewind Capability.
+
+
+[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
+[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
+[3] http://dustin.wikidot.com/half-orcs-and-orcs
diff --git a/arch/um/include/asm/unwind.h b/arch/um/include/asm/unwind.h
new file mode 100644
index 0000000..7ffa543
--- /dev/null
+++ b/arch/um/include/asm/unwind.h
@@ -0,0 +1,8 @@
+#ifndef _ASM_UML_UNWIND_H
+#define _ASM_UML_UNWIND_H
+
+static inline void
+unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+		   void *orc, size_t orc_size) {}
+
+#endif /* _ASM_UML_UNWIND_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 781521b..7ccf26a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -157,6 +157,7 @@ config X86
 	select HAVE_MEMBLOCK
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_MIXED_BREAKPOINTS_REGS
+	select HAVE_MOD_ARCH_SPECIFIC
 	select HAVE_NMI
 	select HAVE_OPROFILE
 	select HAVE_OPTPROBES
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 353ed09..dc10ec6 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -355,4 +355,29 @@ config PUNIT_ATOM_DEBUG
 	  The current power state can be read from
 	  /sys/kernel/debug/punit_atom/dev_power_state
 
+config ORC_UNWINDER
+	bool "ORC unwinder"
+	depends on X86_64
+	select STACK_VALIDATION
+	---help---
+	  This option enables the ORC (Oops Rewind Capability) unwinder for
+	  unwinding kernel stack traces.  It uses a custom data format which is
+	  a simplified version of the DWARF Call Frame Information standard.
+
+	  This unwinder is more accurate across interrupt entry frames than the
+	  frame pointer unwinder.  It can also enable a 5-10% performance
+	  improvement across the entire kernel if CONFIG_FRAME_POINTER is
+	  disabled.
+
+	  Enabling this option will increase the kernel's runtime memory usage
+	  by roughly 2-4MB, depending on your kernel config.
+
+config FRAME_POINTER_UNWINDER
+	def_bool y
+	depends on !ORC_UNWINDER && FRAME_POINTER
+
+config GUESS_UNWINDER
+	def_bool y
+	depends on !ORC_UNWINDER && !FRAME_POINTER
+
 endmenu
diff --git a/arch/x86/include/asm/module.h b/arch/x86/include/asm/module.h
index e3b7819..9eb7c71 100644
--- a/arch/x86/include/asm/module.h
+++ b/arch/x86/include/asm/module.h
@@ -2,6 +2,15 @@
 #define _ASM_X86_MODULE_H
 
 #include <asm-generic/module.h>
+#include <asm/orc_types.h>
+
+struct mod_arch_specific {
+#ifdef CONFIG_ORC_UNWINDER
+	unsigned int num_orcs;
+	int *orc_unwind_ip;
+	struct orc_entry *orc_unwind;
+#endif
+};
 
 #ifdef CONFIG_X86_64
 /* X86_64 does not define MODULE_PROC_FAMILY */
diff --git a/arch/x86/include/asm/orc_lookup.h b/arch/x86/include/asm/orc_lookup.h
new file mode 100644
index 0000000..91c8d86
--- /dev/null
+++ b/arch/x86/include/asm/orc_lookup.h
@@ -0,0 +1,46 @@
+/*
+ * Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef _ORC_LOOKUP_H
+#define _ORC_LOOKUP_H
+
+/*
+ * This is a lookup table for speeding up access to the .orc_unwind table.
+ * Given an input address offset, the corresponding lookup table entry
+ * specifies a subset of the .orc_unwind table to search.
+ *
+ * Each block represents the end of the previous range and the start of the
+ * next range.  An extra block is added to give the last range an end.
+ *
+ * The block size should be a power of 2 to avoid a costly 'div' instruction.
+ *
+ * A block size of 256 was chosen because it roughly doubles unwinder
+ * performance while only adding ~5% to the ORC data footprint.
+ */
+#define LOOKUP_BLOCK_ORDER	8
+#define LOOKUP_BLOCK_SIZE	(1 << LOOKUP_BLOCK_ORDER)
+
+#ifndef LINKER_SCRIPT
+
+extern unsigned int orc_lookup[];
+extern unsigned int orc_lookup_end[];
+
+#define LOOKUP_START_IP		(unsigned long)_stext
+#define LOOKUP_STOP_IP		(unsigned long)_etext
+
+#endif /* LINKER_SCRIPT */
+
+#endif /* _ORC_LOOKUP_H */
diff --git a/arch/x86/include/asm/orc_types.h b/arch/x86/include/asm/orc_types.h
index 7dc777a..9c9dc57 100644
--- a/arch/x86/include/asm/orc_types.h
+++ b/arch/x86/include/asm/orc_types.h
@@ -88,7 +88,7 @@ struct orc_entry {
 	unsigned	sp_reg:4;
 	unsigned	bp_reg:4;
 	unsigned	type:2;
-};
+} __packed;
 
 /*
  * This struct is used by asm and inline asm code to manually annotate the
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index e667649..25b8d31a 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -12,11 +12,14 @@ struct unwind_state {
 	struct task_struct *task;
 	int graph_idx;
 	bool error;
-#ifdef CONFIG_FRAME_POINTER
+#if defined(CONFIG_ORC_UNWINDER)
+	bool signal, full_regs;
+	unsigned long sp, bp, ip;
+	struct pt_regs *regs;
+#elif defined(CONFIG_FRAME_POINTER)
 	bool got_irq;
-	unsigned long *bp, *orig_sp;
+	unsigned long *bp, *orig_sp, ip;
 	struct pt_regs *regs;
-	unsigned long ip;
 #else
 	unsigned long *sp;
 #endif
@@ -24,41 +27,30 @@ struct unwind_state {
 
 void __unwind_start(struct unwind_state *state, struct task_struct *task,
 		    struct pt_regs *regs, unsigned long *first_frame);
-
 bool unwind_next_frame(struct unwind_state *state);
-
 unsigned long unwind_get_return_address(struct unwind_state *state);
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state);
 
 static inline bool unwind_done(struct unwind_state *state)
 {
 	return state->stack_info.type == STACK_TYPE_UNKNOWN;
 }
 
-static inline
-void unwind_start(struct unwind_state *state, struct task_struct *task,
-		  struct pt_regs *regs, unsigned long *first_frame)
-{
-	first_frame = first_frame ? : get_stack_pointer(task, regs);
-
-	__unwind_start(state, task, regs, first_frame);
-}
-
 static inline bool unwind_error(struct unwind_state *state)
 {
 	return state->error;
 }
 
-#ifdef CONFIG_FRAME_POINTER
-
 static inline
-unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+void unwind_start(struct unwind_state *state, struct task_struct *task,
+		  struct pt_regs *regs, unsigned long *first_frame)
 {
-	if (unwind_done(state))
-		return NULL;
+	first_frame = first_frame ? : get_stack_pointer(task, regs);
 
-	return state->regs ? &state->regs->ip : state->bp + 1;
+	__unwind_start(state, task, regs, first_frame);
 }
 
+#if defined(CONFIG_ORC_UNWINDER) || defined(CONFIG_FRAME_POINTER)
 static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	if (unwind_done(state))
@@ -66,20 +58,46 @@ static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 
 	return state->regs;
 }
-
-#else /* !CONFIG_FRAME_POINTER */
-
-static inline
-unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+#else
+static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	return NULL;
 }
+#endif
 
-static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
+#ifdef CONFIG_ORC_UNWINDER
+void unwind_init(void);
+void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+			void *orc, size_t orc_size);
+#else
+static inline void unwind_init(void) {}
+static inline
+void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
+			void *orc, size_t orc_size) {}
+#endif
+
+/*
+ * This disables KASAN checking when reading a value from another task's stack,
+ * since the other task could be running on another CPU and could have poisoned
+ * the stack in the meantime.
+ */
+#define READ_ONCE_TASK_STACK(task, x)			\
+({							\
+	unsigned long val;				\
+	if (task == current)				\
+		val = READ_ONCE(x);			\
+	else						\
+		val = READ_ONCE_NOCHECK(x);		\
+	val;						\
+})
+
+static inline bool task_on_another_cpu(struct task_struct *task)
 {
-	return NULL;
+#ifdef CONFIG_SMP
+	return task != current && task->on_cpu;
+#else
+	return false;
+#endif
 }
 
-#endif /* CONFIG_FRAME_POINTER */
-
 #endif /* _ASM_X86_UNWIND_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a01892b..287eac7 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -126,11 +126,9 @@ obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
 obj-$(CONFIG_TRACING)			+= tracepoint.o
 obj-$(CONFIG_SCHED_MC_PRIO)		+= itmt.o
 
-ifdef CONFIG_FRAME_POINTER
-obj-y					+= unwind_frame.o
-else
-obj-y					+= unwind_guess.o
-endif
+obj-$(CONFIG_ORC_UNWINDER)		+= unwind_orc.o
+obj-$(CONFIG_FRAME_POINTER_UNWINDER)	+= unwind_frame.o
+obj-$(CONFIG_GUESS_UNWINDER)		+= unwind_guess.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index f67bd32..62e7d70 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -35,6 +35,7 @@
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/setup.h>
+#include <asm/unwind.h>
 
 #if 0
 #define DEBUGP(fmt, ...)				\
@@ -213,7 +214,7 @@ int module_finalize(const Elf_Ehdr *hdr,
 		    struct module *me)
 {
 	const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
-		*para = NULL;
+		*para = NULL, *orc = NULL, *orc_ip = NULL;
 	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
 
 	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
@@ -225,6 +226,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 			locks = s;
 		if (!strcmp(".parainstructions", secstrings + s->sh_name))
 			para = s;
+		if (!strcmp(".orc_unwind", secstrings + s->sh_name))
+			orc = s;
+		if (!strcmp(".orc_unwind_ip", secstrings + s->sh_name))
+			orc_ip = s;
 	}
 
 	if (alt) {
@@ -248,6 +253,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 	/* make jump label nops */
 	jump_label_apply_nops(me);
 
+	if (orc && orc_ip)
+		unwind_module_init(me, (void *)orc_ip->sh_addr, orc_ip->sh_size,
+				   (void *)orc->sh_addr, orc->sh_size);
+
 	return 0;
 }
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3486d04..ecab322 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -115,6 +115,7 @@
 #include <asm/microcode.h>
 #include <asm/mmu_context.h>
 #include <asm/kaslr.h>
+#include <asm/unwind.h>
 
 /*
  * max_low_pfn_mapped: highest direct mapped pfn under 4GB
@@ -1310,6 +1311,8 @@ void __init setup_arch(char **cmdline_p)
 	if (efi_enabled(EFI_BOOT))
 		efi_apply_memmap_quirks();
 #endif
+
+	unwind_init();
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index b9389d7..7574ef5 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -10,20 +10,22 @@
 
 #define FRAME_HEADER_SIZE (sizeof(long) * 2)
 
-/*
- * This disables KASAN checking when reading a value from another task's stack,
- * since the other task could be running on another CPU and could have poisoned
- * the stack in the meantime.
- */
-#define READ_ONCE_TASK_STACK(task, x)			\
-({							\
-	unsigned long val;				\
-	if (task == current)				\
-		val = READ_ONCE(x);			\
-	else						\
-		val = READ_ONCE_NOCHECK(x);		\
-	val;						\
-})
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return 0;
+
+	return __kernel_text_address(state->ip) ? state->ip : 0;
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return NULL;
+
+	return state->regs ? &state->regs->ip : state->bp + 1;
+}
 
 static void unwind_dump(struct unwind_state *state)
 {
@@ -66,15 +68,6 @@ static void unwind_dump(struct unwind_state *state)
 	}
 }
 
-unsigned long unwind_get_return_address(struct unwind_state *state)
-{
-	if (unwind_done(state))
-		return 0;
-
-	return __kernel_text_address(state->ip) ? state->ip : 0;
-}
-EXPORT_SYMBOL_GPL(unwind_get_return_address);
-
 static size_t regs_size(struct pt_regs *regs)
 {
 	/* x86_32 regs from kernel mode are two words shorter: */
diff --git a/arch/x86/kernel/unwind_guess.c b/arch/x86/kernel/unwind_guess.c
index 039f367..4f0e17b 100644
--- a/arch/x86/kernel/unwind_guess.c
+++ b/arch/x86/kernel/unwind_guess.c
@@ -19,6 +19,11 @@ unsigned long unwind_get_return_address(struct unwind_state *state)
 }
 EXPORT_SYMBOL_GPL(unwind_get_return_address);
 
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	return NULL;
+}
+
 bool unwind_next_frame(struct unwind_state *state)
 {
 	struct stack_info *info = &state->stack_info;
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
new file mode 100644
index 0000000..570b70d
--- /dev/null
+++ b/arch/x86/kernel/unwind_orc.c
@@ -0,0 +1,582 @@
+#include <linux/module.h>
+#include <linux/sort.h>
+#include <asm/ptrace.h>
+#include <asm/stacktrace.h>
+#include <asm/unwind.h>
+#include <asm/orc_types.h>
+#include <asm/orc_lookup.h>
+#include <asm/sections.h>
+
+#define orc_warn(fmt, ...) \
+	printk_deferred_once(KERN_WARNING pr_fmt("WARNING: " fmt), ##__VA_ARGS__)
+
+extern int __start_orc_unwind_ip[];
+extern int __stop_orc_unwind_ip[];
+extern struct orc_entry __start_orc_unwind[];
+extern struct orc_entry __stop_orc_unwind[];
+
+static DEFINE_MUTEX(sort_mutex);
+int *cur_orc_ip_table = __start_orc_unwind_ip;
+struct orc_entry *cur_orc_table = __start_orc_unwind;
+
+unsigned int lookup_num_blocks;
+bool orc_init;
+
+static inline unsigned long orc_ip(const int *ip)
+{
+	return (unsigned long)ip + *ip;
+}
+
+static struct orc_entry *__orc_find(int *ip_table, struct orc_entry *u_table,
+				    unsigned int num_entries, unsigned long ip)
+{
+	int *first = ip_table;
+	int *last = ip_table + num_entries - 1;
+	int *mid = first, *found = first;
+
+	if (!num_entries)
+		return NULL;
+
+	/*
+	 * Do a binary range search to find the rightmost duplicate of a given
+	 * starting address.  Some entries are section terminators which are
+	 * "weak" entries for ensuring there are no gaps.  They should be
+	 * ignored when they conflict with a real entry.
+	 */
+	while (first <= last) {
+		mid = first + ((last - first) / 2);
+
+		if (orc_ip(mid) <= ip) {
+			found = mid;
+			first = mid + 1;
+		} else
+			last = mid - 1;
+	}
+
+	return u_table + (found - ip_table);
+}
+
+#ifdef CONFIG_MODULES
+static struct orc_entry *orc_module_find(unsigned long ip)
+{
+	struct module *mod;
+
+	mod = __module_address(ip);
+	if (!mod || !mod->arch.orc_unwind || !mod->arch.orc_unwind_ip)
+		return NULL;
+	return __orc_find(mod->arch.orc_unwind_ip, mod->arch.orc_unwind,
+			  mod->arch.num_orcs, ip);
+}
+#else
+static struct orc_entry *orc_module_find(unsigned long ip)
+{
+	return NULL;
+}
+#endif
+
+static struct orc_entry *orc_find(unsigned long ip)
+{
+	if (!orc_init)
+		return NULL;
+
+	/* For non-init vmlinux addresses, use the fast lookup table: */
+	if (ip >= LOOKUP_START_IP && ip < LOOKUP_STOP_IP) {
+		unsigned int idx, start, stop;
+
+		idx = (ip - LOOKUP_START_IP) / LOOKUP_BLOCK_SIZE;
+
+		if (unlikely(idx >= lookup_num_blocks-1)) {
+			orc_warn("bad lookup idx: idx=%u num=%u ip=%lx\n",
+				 idx, lookup_num_blocks, ip);
+			return NULL;
+		}
+
+		start = orc_lookup[idx];
+		stop = orc_lookup[idx + 1] + 1;
+
+		if (unlikely((__start_orc_unwind + start >= __stop_orc_unwind) ||
+			     (__start_orc_unwind + stop > __stop_orc_unwind))) {
+			orc_warn("bad lookup value: idx=%u num=%u start=%u stop=%u ip=%lx\n",
+				 idx, lookup_num_blocks, start, stop, ip);
+			return NULL;
+		}
+
+		return __orc_find(__start_orc_unwind_ip + start,
+				  __start_orc_unwind + start, stop - start, ip);
+	}
+
+	/* vmlinux .init slow lookup: */
+	if (ip >= (unsigned long)_sinittext && ip < (unsigned long)_einittext)
+		return __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
+				  __stop_orc_unwind_ip - __start_orc_unwind_ip, ip);
+
+	/* Module lookup: */
+	return orc_module_find(ip);
+}
+
+static void orc_sort_swap(void *_a, void *_b, int size)
+{
+	struct orc_entry *orc_a, *orc_b;
+	struct orc_entry orc_tmp;
+	int *a = _a, *b = _b, tmp;
+	int delta = _b - _a;
+
+	/* Swap the .orc_unwind_ip entries: */
+	tmp = *a;
+	*a = *b + delta;
+	*b = tmp - delta;
+
+	/* Swap the corresponding .orc_unwind entries: */
+	orc_a = cur_orc_table + (a - cur_orc_ip_table);
+	orc_b = cur_orc_table + (b - cur_orc_ip_table);
+	orc_tmp = *orc_a;
+	*orc_a = *orc_b;
+	*orc_b = orc_tmp;
+}
+
+static int orc_sort_cmp(const void *_a, const void *_b)
+{
+	struct orc_entry *orc_a;
+	const int *a = _a, *b = _b;
+	unsigned long a_val = orc_ip(a);
+	unsigned long b_val = orc_ip(b);
+
+	if (a_val > b_val)
+		return 1;
+	if (a_val < b_val)
+		return -1;
+
+	/*
+	 * The "weak" section terminator entries need to always be on the left
+	 * to ensure the lookup code skips them in favor of real entries.
+	 * These terminator entries exist to handle any gaps created by
+	 * whitelisted .o files which didn't get objtool generation.
+	 */
+	orc_a = cur_orc_table + (a - cur_orc_ip_table);
+	return orc_a->sp_reg == ORC_REG_UNDEFINED ? -1 : 1;
+}
+
+#ifdef CONFIG_MODULES
+void unwind_module_init(struct module *mod, void *_orc_ip, size_t orc_ip_size,
+			void *_orc, size_t orc_size)
+{
+	int *orc_ip = _orc_ip;
+	struct orc_entry *orc = _orc;
+	unsigned int num_entries = orc_ip_size / sizeof(int);
+
+	WARN_ON_ONCE(orc_ip_size % sizeof(int) != 0 ||
+		     orc_size % sizeof(*orc) != 0 ||
+		     num_entries != orc_size / sizeof(*orc));
+
+	/*
+	 * The 'cur_orc_*' globals allow the orc_sort_swap() callback to
+	 * associate an .orc_unwind_ip table entry with its corresponding
+	 * .orc_unwind entry so they can both be swapped.
+	 */
+	mutex_lock(&sort_mutex);
+	cur_orc_ip_table = orc_ip;
+	cur_orc_table = orc;
+	sort(orc_ip, num_entries, sizeof(int), orc_sort_cmp, orc_sort_swap);
+	mutex_unlock(&sort_mutex);
+
+	mod->arch.orc_unwind_ip = orc_ip;
+	mod->arch.orc_unwind = orc;
+	mod->arch.num_orcs = num_entries;
+}
+#endif
+
+void __init unwind_init(void)
+{
+	size_t orc_ip_size = (void *)__stop_orc_unwind_ip - (void *)__start_orc_unwind_ip;
+	size_t orc_size = (void *)__stop_orc_unwind - (void *)__start_orc_unwind;
+	size_t num_entries = orc_ip_size / sizeof(int);
+	struct orc_entry *orc;
+	int i;
+
+	if (!num_entries || orc_ip_size % sizeof(int) != 0 ||
+	    orc_size % sizeof(struct orc_entry) != 0 ||
+	    num_entries != orc_size / sizeof(struct orc_entry)) {
+		orc_warn("Bad or missing .orc_unwind table.  Disabling unwinder.\n");
+		return;
+	}
+
+	/* Sort the .orc_unwind and .orc_unwind_ip tables: */
+	sort(__start_orc_unwind_ip, num_entries, sizeof(int), orc_sort_cmp,
+	     orc_sort_swap);
+
+	/* Initialize the fast lookup table: */
+	lookup_num_blocks = orc_lookup_end - orc_lookup;
+	for (i = 0; i < lookup_num_blocks-1; i++) {
+		orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
+				 num_entries,
+				 LOOKUP_START_IP + (LOOKUP_BLOCK_SIZE * i));
+		if (!orc) {
+			orc_warn("Corrupt .orc_unwind table.  Disabling unwinder.\n");
+			return;
+		}
+
+		orc_lookup[i] = orc - __start_orc_unwind;
+	}
+
+	/* Initialize the ending block: */
+	orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind, num_entries,
+			 LOOKUP_STOP_IP);
+	if (!orc) {
+		orc_warn("Corrupt .orc_unwind table.  Disabling unwinder.\n");
+		return;
+	}
+	orc_lookup[lookup_num_blocks-1] = orc - __start_orc_unwind;
+
+	orc_init = true;
+}
+
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return 0;
+
+	return __kernel_text_address(state->ip) ? state->ip : 0;
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+	if (unwind_done(state))
+		return NULL;
+
+	if (state->regs)
+		return &state->regs->ip;
+
+	if (state->sp)
+		return (unsigned long *)state->sp - 1;
+
+	return NULL;
+}
+
+static bool stack_access_ok(struct unwind_state *state, unsigned long addr,
+			    size_t len)
+{
+	struct stack_info *info = &state->stack_info;
+
+	/*
+	 * If the address isn't on the current stack, switch to the next one.
+	 *
+	 * We may have to traverse multiple stacks to deal with the possibility
+	 * that info->next_sp could point to an empty stack and the address
+	 * could be on a subsequent stack.
+	 */
+	while (!on_stack(info, (void *)addr, len))
+		if (get_stack_info(info->next_sp, state->task, info,
+				   &state->stack_mask))
+			return false;
+
+	return true;
+}
+
+static bool deref_stack_reg(struct unwind_state *state, unsigned long addr,
+			    unsigned long *val)
+{
+	if (!stack_access_ok(state, addr, sizeof(long)))
+		return false;
+
+	*val = READ_ONCE_TASK_STACK(state->task, *(unsigned long *)addr);
+	return true;
+}
+
+#define REGS_SIZE (sizeof(struct pt_regs))
+#define SP_OFFSET (offsetof(struct pt_regs, sp))
+#define IRET_REGS_SIZE (REGS_SIZE - offsetof(struct pt_regs, ip))
+#define IRET_SP_OFFSET (SP_OFFSET - offsetof(struct pt_regs, ip))
+
+static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
+			     unsigned long *ip, unsigned long *sp, bool full)
+{
+	size_t regs_size = full ? REGS_SIZE : IRET_REGS_SIZE;
+	size_t sp_offset = full ? SP_OFFSET : IRET_SP_OFFSET;
+	struct pt_regs *regs = (struct pt_regs *)(addr + regs_size - REGS_SIZE);
+
+	if (IS_ENABLED(CONFIG_X86_64)) {
+		if (!stack_access_ok(state, addr, regs_size))
+			return false;
+
+		*ip = regs->ip;
+		*sp = regs->sp;
+
+		return true;
+	}
+
+	if (!stack_access_ok(state, addr, sp_offset))
+		return false;
+
+	*ip = regs->ip;
+
+	if (user_mode(regs)) {
+		if (!stack_access_ok(state, addr + sp_offset,
+				     REGS_SIZE - SP_OFFSET))
+			return false;
+
+		*sp = regs->sp;
+	} else
+		*sp = (unsigned long)&regs->sp;
+
+	return true;
+}
+
+bool unwind_next_frame(struct unwind_state *state)
+{
+	unsigned long ip_p, sp, orig_ip, prev_sp = state->sp;
+	enum stack_type prev_type = state->stack_info.type;
+	struct orc_entry *orc;
+	struct pt_regs *ptregs;
+	bool indirect = false;
+
+	if (unwind_done(state))
+		return false;
+
+	/* Don't let modules unload while we're reading their ORC data. */
+	preempt_disable();
+
+	/* Have we reached the end? */
+	if (state->regs && user_mode(state->regs))
+		goto done;
+
+	/*
+	 * Find the orc_entry associated with the text address.
+	 *
+	 * Decrement call return addresses by one so they work for sibling
+	 * calls and calls to noreturn functions.
+	 */
+	orc = orc_find(state->signal ? state->ip : state->ip - 1);
+	if (!orc || orc->sp_reg == ORC_REG_UNDEFINED)
+		goto done;
+	orig_ip = state->ip;
+
+	/* Find the previous frame's stack: */
+	switch (orc->sp_reg) {
+	case ORC_REG_SP:
+		sp = state->sp + orc->sp_offset;
+		break;
+
+	case ORC_REG_BP:
+		sp = state->bp + orc->sp_offset;
+		break;
+
+	case ORC_REG_SP_INDIRECT:
+		sp = state->sp + orc->sp_offset;
+		indirect = true;
+		break;
+
+	case ORC_REG_BP_INDIRECT:
+		sp = state->bp + orc->sp_offset;
+		indirect = true;
+		break;
+
+	case ORC_REG_R10:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg R10 at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->r10;
+		break;
+
+	case ORC_REG_R13:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg R13 at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->r13;
+		break;
+
+	case ORC_REG_DI:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg DI at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->di;
+		break;
+
+	case ORC_REG_DX:
+		if (!state->regs || !state->full_regs) {
+			orc_warn("missing regs for base reg DX at ip %p\n",
+				 (void *)state->ip);
+			goto done;
+		}
+		sp = state->regs->dx;
+		break;
+
+	default:
+		orc_warn("unknown SP base reg %d for ip %p\n",
+			 orc->sp_reg, (void *)state->ip);
+		goto done;
+	}
+
+	if (indirect) {
+		if (!deref_stack_reg(state, sp, &sp))
+			goto done;
+	}
+
+	/* Find IP, SP and possibly regs: */
+	switch (orc->type) {
+	case ORC_TYPE_CALL:
+		ip_p = sp - sizeof(long);
+
+		if (!deref_stack_reg(state, ip_p, &state->ip))
+			goto done;
+
+		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
+						  state->ip, (void *)ip_p);
+
+		state->sp = sp;
+		state->regs = NULL;
+		state->signal = false;
+		break;
+
+	case ORC_TYPE_REGS:
+		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, true)) {
+			orc_warn("can't dereference registers at %p for ip %p\n",
+				 (void *)sp, (void *)orig_ip);
+			goto done;
+		}
+
+		state->regs = (struct pt_regs *)sp;
+		state->full_regs = true;
+		state->signal = true;
+		break;
+
+	case ORC_TYPE_REGS_IRET:
+		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, false)) {
+			orc_warn("can't dereference iret registers at %p for ip %p\n",
+				 (void *)sp, (void *)orig_ip);
+			goto done;
+		}
+
+		ptregs = container_of((void *)sp, struct pt_regs, ip);
+		if ((unsigned long)ptregs >= prev_sp &&
+		    on_stack(&state->stack_info, ptregs, REGS_SIZE)) {
+			state->regs = ptregs;
+			state->full_regs = false;
+		} else
+			state->regs = NULL;
+
+		state->signal = true;
+		break;
+
+	default:
+		orc_warn("unknown .orc_unwind entry type %d\n", orc->type);
+		break;
+	}
+
+	/* Find BP: */
+	switch (orc->bp_reg) {
+	case ORC_REG_UNDEFINED:
+		if (state->regs && state->full_regs)
+			state->bp = state->regs->bp;
+		break;
+
+	case ORC_REG_PREV_SP:
+		if (!deref_stack_reg(state, sp + orc->bp_offset, &state->bp))
+			goto done;
+		break;
+
+	case ORC_REG_BP:
+		if (!deref_stack_reg(state, state->bp + orc->bp_offset, &state->bp))
+			goto done;
+		break;
+
+	default:
+		orc_warn("unknown BP base reg %d for ip %p\n",
+			 orc->bp_reg, (void *)orig_ip);
+		goto done;
+	}
+
+	/* Prevent a recursive loop due to bad ORC data: */
+	if (state->stack_info.type == prev_type &&
+	    on_stack(&state->stack_info, (void *)state->sp, sizeof(long)) &&
+	    state->sp <= prev_sp) {
+		orc_warn("stack going in the wrong direction? ip=%p\n",
+			 (void *)orig_ip);
+		goto done;
+	}
+
+	preempt_enable();
+	return true;
+
+done:
+	preempt_enable();
+	state->stack_info.type = STACK_TYPE_UNKNOWN;
+	return false;
+}
+EXPORT_SYMBOL_GPL(unwind_next_frame);
+
+void __unwind_start(struct unwind_state *state, struct task_struct *task,
+		    struct pt_regs *regs, unsigned long *first_frame)
+{
+	memset(state, 0, sizeof(*state));
+	state->task = task;
+
+	/*
+	 * Refuse to unwind the stack of a task while it's executing on another
+	 * CPU.  This check is racy, but that's ok: the unwinder has other
+	 * checks to prevent it from going off the rails.
+	 */
+	if (task_on_another_cpu(task))
+		goto done;
+
+	if (regs) {
+		if (user_mode(regs))
+			goto done;
+
+		state->ip = regs->ip;
+		state->sp = kernel_stack_pointer(regs);
+		state->bp = regs->bp;
+		state->regs = regs;
+		state->full_regs = true;
+		state->signal = true;
+
+	} else if (task == current) {
+		asm volatile("lea (%%rip), %0\n\t"
+			     "mov %%rsp, %1\n\t"
+			     "mov %%rbp, %2\n\t"
+			     : "=r" (state->ip), "=r" (state->sp),
+			       "=r" (state->bp));
+
+	} else {
+		struct inactive_task_frame *frame = (void *)task->thread.sp;
+
+		state->sp = task->thread.sp;
+		state->bp = READ_ONCE_NOCHECK(frame->bp);
+		state->ip = READ_ONCE_NOCHECK(frame->ret_addr);
+	}
+
+	if (get_stack_info((unsigned long *)state->sp, state->task,
+			   &state->stack_info, &state->stack_mask))
+		return;
+
+	/*
+	 * The caller can provide the address of the first frame directly
+	 * (first_frame) or indirectly (regs->sp) to indicate which stack frame
+	 * to start unwinding at.  Skip ahead until we reach it.
+	 */
+
+	/* When starting from regs, skip the regs frame: */
+	if (regs) {
+		unwind_next_frame(state);
+		return;
+	}
+
+	/* Otherwise, skip ahead to the user-specified starting frame: */
+	while (!unwind_done(state) &&
+	       (!on_stack(&state->stack_info, first_frame, sizeof(long)) ||
+			state->sp <= (unsigned long)first_frame))
+		unwind_next_frame(state);
+
+	return;
+
+done:
+	state->stack_info.type = STACK_TYPE_UNKNOWN;
+	return;
+}
+EXPORT_SYMBOL_GPL(__unwind_start);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index c8a3b61..f05f00a 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -24,6 +24,7 @@
 #include <asm/asm-offsets.h>
 #include <asm/thread_info.h>
 #include <asm/page_types.h>
+#include <asm/orc_lookup.h>
 #include <asm/cache.h>
 #include <asm/boot.h>
 
@@ -148,6 +149,8 @@ SECTIONS
 
 	BUG_TABLE
 
+	ORC_UNWIND_TABLE
+
 	. = ALIGN(PAGE_SIZE);
 	__vvar_page = .;
 
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index da0be9a..fffc9bd 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -680,6 +680,31 @@
 #define BUG_TABLE
 #endif
 
+#ifdef CONFIG_ORC_UNWINDER
+#define ORC_UNWIND_TABLE						\
+	. = ALIGN(4);							\
+	.orc_unwind_ip : AT(ADDR(.orc_unwind_ip) - LOAD_OFFSET) {	\
+		VMLINUX_SYMBOL(__start_orc_unwind_ip) = .;		\
+		KEEP(*(.orc_unwind_ip))					\
+		VMLINUX_SYMBOL(__stop_orc_unwind_ip) = .;		\
+	}								\
+	. = ALIGN(6);							\
+	.orc_unwind : AT(ADDR(.orc_unwind) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(__start_orc_unwind) = .;			\
+		KEEP(*(.orc_unwind))					\
+		VMLINUX_SYMBOL(__stop_orc_unwind) = .;			\
+	}								\
+	. = ALIGN(4);							\
+	.orc_lookup : AT(ADDR(.orc_lookup) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(orc_lookup) = .;				\
+		. += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) /	\
+			LOOKUP_BLOCK_SIZE) + 1) * 4;			\
+		VMLINUX_SYMBOL(orc_lookup_end) = .;			\
+	}
+#else
+#define ORC_UNWIND_TABLE
+#endif
+
 #ifdef CONFIG_PM_TRACE
 #define TRACEDATA							\
 	. = ALIGN(4);							\
@@ -866,7 +891,7 @@
 		DATA_DATA						\
 		CONSTRUCTORS						\
 	}								\
-	BUG_TABLE
+	BUG_TABLE							\
 
 #define INIT_TEXT_SECTION(inittext_align)				\
 	. = ALIGN(inittext_align);					\
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 98fe715..0f0d019 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -374,6 +374,9 @@ config STACK_VALIDATION
 	  pointers (if CONFIG_FRAME_POINTER is enabled).  This helps ensure
 	  that runtime stack traces are more reliable.
 
+	  This is also a prerequisite for generation of ORC unwind data, which
+	  is needed for CONFIG_ORC_UNWINDER.
+
 	  For more information, see
 	  tools/objtool/Documentation/stack-validation.txt.
 
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 854608d4..80ed148 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -258,7 +258,8 @@ ifneq ($(SKIP_STACK_VALIDATION),1)
 
 __objtool_obj := $(objtree)/tools/objtool/objtool
 
-objtool_args = check
+objtool_args = $(if $(CONFIG_ORC_UNWINDER),orc generate,check)
+
 ifndef CONFIG_FRAME_POINTER
 objtool_args += --no-fp
 endif
@@ -279,6 +280,11 @@ objtool_obj = $(if $(patsubst y%,, \
 endif # SKIP_STACK_VALIDATION
 endif # CONFIG_STACK_VALIDATION
 
+# Rebuild all objects when objtool changes, or is enabled/disabled.
+objtool_dep = $(objtool_obj)					\
+	      $(wildcard include/config/orc/unwinder.h		\
+			 include/config/stack/validation.h)
+
 define rule_cc_o_c
 	$(call echo-cmd,checksrc) $(cmd_checksrc)			  \
 	$(call cmd_and_fixdep,cc_o_c)					  \
@@ -301,13 +307,13 @@ cmd_undef_syms = echo
 endif
 
 # Built-in and composite module parts
-$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 	$(call cmd,force_checksrc)
 	$(call if_changed_rule,cc_o_c)
 
 # Single-part modules are special since we need to mark them in $(MODVERDIR)
 
-$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 	$(call cmd,force_checksrc)
 	$(call if_changed_rule,cc_o_c)
 	@{ echo $(@:.o=.ko); echo $@; \
@@ -402,7 +408,7 @@ cmd_modversions_S =								\
 endif
 endif
 
-$(obj)/%.o: $(src)/%.S $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.S $(objtool_dep) FORCE
 	$(call if_changed_rule,as_o_S)
 
 targets += $(real-objs-y) $(real-objs-m) $(lib-y)

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/asm] x86/kconfig: Make it easier to switch to the new ORC unwinder
  2017-07-24 23:36 ` [PATCH v4 2/2] x86/kconfig: make it easier to switch to the new " Josh Poimboeuf
  2017-07-25  9:10   ` Ingo Molnar
@ 2017-07-26 12:10   ` tip-bot for Josh Poimboeuf
  1 sibling, 0 replies; 38+ messages in thread
From: tip-bot for Josh Poimboeuf @ 2017-07-26 12:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, dvlasenk, torvalds, hpa, efault, tglx, brgerst,
	luto, peterz, jpoimboe, mingo, bp, jslaby

Commit-ID:  a34a766ff96d9e88572e35a45066279e40a85d84
Gitweb:     http://git.kernel.org/tip/a34a766ff96d9e88572e35a45066279e40a85d84
Author:     Josh Poimboeuf <jpoimboe@redhat.com>
AuthorDate: Mon, 24 Jul 2017 18:36:58 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 26 Jul 2017 13:18:20 +0200

x86/kconfig: Make it easier to switch to the new ORC unwinder

A couple of Kconfig changes which make it much easier to switch to the
new CONFIG_ORC_UNWINDER:

1) Remove x86 dependencies on CONFIG_FRAME_POINTER for lockdep,
   latencytop, and fault injection.  x86 has a 'guess' unwinder which
   just scans the stack for kernel text addresses.  It's not 100%
   accurate but in many cases it's good enough.  This allows those users
   who don't want the text overhead of the frame pointer or ORC
   unwinders to still use these features.  More importantly, this also
   makes it much more straightforward to disable frame pointers.

2) Make CONFIG_ORC_UNWINDER depend on !CONFIG_FRAME_POINTER.  While it
   would be possible to have both enabled, it doesn't really make sense
   to do so.  So enforce a sane configuration to prevent the user from
   making a dumb mistake.

With these changes, when you disable CONFIG_FRAME_POINTER, "make
oldconfig" will ask if you want to enable CONFIG_ORC_UNWINDER.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/9985fb91ce5005fe33ea5cc2a20f14bd33c61d03.1500938583.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig.debug | 7 +++----
 lib/Kconfig.debug      | 6 +++---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index dc10ec6..268a318 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -357,7 +357,7 @@ config PUNIT_ATOM_DEBUG
 
 config ORC_UNWINDER
 	bool "ORC unwinder"
-	depends on X86_64
+	depends on X86_64 && !FRAME_POINTER
 	select STACK_VALIDATION
 	---help---
 	  This option enables the ORC (Oops Rewind Capability) unwinder for
@@ -365,9 +365,8 @@ config ORC_UNWINDER
 	  a simplified version of the DWARF Call Frame Information standard.
 
 	  This unwinder is more accurate across interrupt entry frames than the
-	  frame pointer unwinder.  It can also enable a 5-10% performance
-	  improvement across the entire kernel if CONFIG_FRAME_POINTER is
-	  disabled.
+	  frame pointer unwinder.  It also enables a 5-10% performance
+	  improvement across the entire kernel compared to frame pointers.
 
 	  Enabling this option will increase the kernel's runtime memory usage
 	  by roughly 2-4MB, depending on your kernel config.
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 0f0d019..32a48e7 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1124,7 +1124,7 @@ config LOCKDEP
 	bool
 	depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
 	select STACKTRACE
-	select FRAME_POINTER if !MIPS && !PPC && !ARM_UNWIND && !S390 && !MICROBLAZE && !ARC && !SCORE
+	select FRAME_POINTER if !MIPS && !PPC && !ARM_UNWIND && !S390 && !MICROBLAZE && !ARC && !SCORE && !X86
 	select KALLSYMS
 	select KALLSYMS_ALL
 
@@ -1543,7 +1543,7 @@ config FAULT_INJECTION_STACKTRACE_FILTER
 	depends on FAULT_INJECTION_DEBUG_FS && STACKTRACE_SUPPORT
 	depends on !X86_64
 	select STACKTRACE
-	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !SCORE
+	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !SCORE && !X86
 	help
 	  Provide stacktrace filter for fault-injection capabilities
 
@@ -1552,7 +1552,7 @@ config LATENCYTOP
 	depends on DEBUG_KERNEL
 	depends on STACKTRACE_SUPPORT
 	depends on PROC_FS
-	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC
+	select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !X86
 	select KALLSYMS
 	select KALLSYMS_ALL
 	select STACKTRACE


* [tip:x86/asm] x86/kconfig: Consolidate unwinders into multiple choice selection
  2017-07-25 13:54     ` Josh Poimboeuf
@ 2017-07-26 12:11       ` tip-bot for Josh Poimboeuf
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Josh Poimboeuf @ 2017-07-26 12:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, tglx, peterz, bp, hpa, jslaby, jpoimboe, mingo,
	efault, dvlasenk, brgerst, luto, torvalds

Commit-ID:  81d387190039c14edac8de2b3ec789beb899afd9
Gitweb:     http://git.kernel.org/tip/81d387190039c14edac8de2b3ec789beb899afd9
Author:     Josh Poimboeuf <jpoimboe@redhat.com>
AuthorDate: Tue, 25 Jul 2017 08:54:24 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 26 Jul 2017 14:05:36 +0200

x86/kconfig: Consolidate unwinders into multiple choice selection

There are three mutually exclusive unwinders.  Make that more obvious by
combining them into a multiple-choice selection:

  CONFIG_FRAME_POINTER_UNWINDER
  CONFIG_ORC_UNWINDER
  CONFIG_GUESS_UNWINDER (if CONFIG_EXPERT=y)

Frame pointers are still the default (for now).

The old CONFIG_FRAME_POINTER option is still used in some
arch-independent places, so keep it around, but make it
invisible to the user on x86 - it's now selected by
CONFIG_FRAME_POINTER_UNWINDER=y.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/20170725135424.zukjmgpz3plf5pmt@treble
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig              |  3 +--
 arch/x86/Kconfig.debug        | 47 ++++++++++++++++++++++++++++++++++++-------
 arch/x86/configs/tiny.config  |  2 ++
 arch/x86/include/asm/unwind.h |  4 ++--
 4 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7ccf26a..9b30212 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -73,7 +73,6 @@ config X86
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANTS_THP_SWAP		if X86_64
 	select BUILDTIME_EXTABLE_SORT
@@ -168,7 +167,7 @@ config X86
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
-	select HAVE_RELIABLE_STACKTRACE		if X86_64 && FRAME_POINTER && STACK_VALIDATION
+	select HAVE_RELIABLE_STACKTRACE		if X86_64 && FRAME_POINTER_UNWINDER && STACK_VALIDATION
 	select HAVE_STACK_VALIDATION		if X86_64
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 268a318..93bbb31 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -355,9 +355,32 @@ config PUNIT_ATOM_DEBUG
 	  The current power state can be read from
 	  /sys/kernel/debug/punit_atom/dev_power_state
 
+choice
+	prompt "Choose kernel unwinder"
+	default FRAME_POINTER_UNWINDER
+	---help---
+	  This determines which method will be used for unwinding kernel stack
+	  traces for panics, oopses, bugs, warnings, perf, /proc/<pid>/stack,
+	  livepatch, lockdep, and more.
+
+config FRAME_POINTER_UNWINDER
+	bool "Frame pointer unwinder"
+	select FRAME_POINTER
+	---help---
+	  This option enables the frame pointer unwinder for unwinding kernel
+	  stack traces.
+
+	  The unwinder itself is fast and it uses less RAM than the ORC
+	  unwinder, but the kernel text size will grow by ~3% and the kernel's
+	  overall performance will degrade by roughly 5-10%.
+
+	  This option is recommended if you want to use the livepatch
+	  consistency model, as this is currently the only way to get a
+	  reliable stack trace (CONFIG_HAVE_RELIABLE_STACKTRACE).
+
 config ORC_UNWINDER
 	bool "ORC unwinder"
-	depends on X86_64 && !FRAME_POINTER
+	depends on X86_64
 	select STACK_VALIDATION
 	---help---
 	  This option enables the ORC (Oops Rewind Capability) unwinder for
@@ -371,12 +394,22 @@ config ORC_UNWINDER
 	  Enabling this option will increase the kernel's runtime memory usage
 	  by roughly 2-4MB, depending on your kernel config.
 
-config FRAME_POINTER_UNWINDER
-	def_bool y
-	depends on !ORC_UNWINDER && FRAME_POINTER
-
 config GUESS_UNWINDER
-	def_bool y
-	depends on !ORC_UNWINDER && !FRAME_POINTER
+	bool "Guess unwinder"
+	depends on EXPERT
+	---help---
+	  This option enables the "guess" unwinder for unwinding kernel stack
+	  traces.  It scans the stack and reports every kernel text address it
+	  finds.  Some of the addresses it reports may be incorrect.
+
+	  While this option often produces false positives, it can still be
+	  useful in many cases.  Unlike the other unwinders, it has no runtime
+	  overhead.
+
+endchoice
+
+config FRAME_POINTER
+	depends on !ORC_UNWINDER && !GUESS_UNWINDER
+	bool
 
 endmenu
diff --git a/arch/x86/configs/tiny.config b/arch/x86/configs/tiny.config
index 4b429df..550cd50 100644
--- a/arch/x86/configs/tiny.config
+++ b/arch/x86/configs/tiny.config
@@ -1,3 +1,5 @@
 CONFIG_NOHIGHMEM=y
 # CONFIG_HIGHMEM4G is not set
 # CONFIG_HIGHMEM64G is not set
+CONFIG_GUESS_UNWINDER=y
+# CONFIG_FRAME_POINTER_UNWINDER is not set
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index 25b8d31a..e9f793e 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -16,7 +16,7 @@ struct unwind_state {
 	bool signal, full_regs;
 	unsigned long sp, bp, ip;
 	struct pt_regs *regs;
-#elif defined(CONFIG_FRAME_POINTER)
+#elif defined(CONFIG_FRAME_POINTER_UNWINDER)
 	bool got_irq;
 	unsigned long *bp, *orig_sp, ip;
 	struct pt_regs *regs;
@@ -50,7 +50,7 @@ void unwind_start(struct unwind_state *state, struct task_struct *task,
 	__unwind_start(state, task, regs, first_frame);
 }
 
-#if defined(CONFIG_ORC_UNWINDER) || defined(CONFIG_FRAME_POINTER)
+#if defined(CONFIG_ORC_UNWINDER) || defined(CONFIG_FRAME_POINTER_UNWINDER)
 static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	if (unwind_done(state))


* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-07-24 23:36 ` [PATCH v4 1/2] x86/unwind: add " Josh Poimboeuf
  2017-07-26 12:10   ` [tip:x86/asm] x86/unwind: Add the " tip-bot for Josh Poimboeuf
@ 2017-07-28 16:48   ` Levin, Alexander (Sasha Levin)
  2017-07-28 17:52     ` Josh Poimboeuf
  1 sibling, 1 reply; 38+ messages in thread
From: Levin, Alexander (Sasha Levin) @ 2017-07-28 16:48 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On Mon, Jul 24, 2017 at 06:36:57PM -0500, Josh Poimboeuf wrote:
>Add a new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER.  It
>plugs into the existing x86 unwinder framework.
>
>It relies on objtool to generate the needed .orc_unwind and
>.orc_unwind_ip sections.
>
>For more details on why ORC is used instead of DWARF, see
>Documentation/x86/orc-unwinder.txt.
>
>Thanks to Andy Lutomirski for the performance improvement ideas:
>splitting the ORC unwind table into two parallel arrays and creating a
>fast lookup table to search a subset of the unwind table.
>
>Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>

Hey Josh,

Syzkaller seems to trigger the following:

==================================================================
BUG: KASAN: stack-out-of-bounds in __read_once_size include/linux/compiler.h:253 [inline]
BUG: KASAN: stack-out-of-bounds in deref_stack_reg+0x123/0x140 arch/x86/kernel/unwind_orc.c:282
Read of size 8 at addr ffff8800374a7b28 by task syz-executor4/6474

CPU: 2 PID: 6474 Comm: syz-executor4 Not tainted 4.13.0-rc2-next-20170727 #232
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0xab/0x105 lib/dump_stack.c:52
 print_address_description+0xc2/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x24f/0x360 mm/kasan/report.c:408
 __read_once_size include/linux/compiler.h:253 [inline]
 deref_stack_reg+0x123/0x140 arch/x86/kernel/unwind_orc.c:282
 unwind_next_frame+0xd9b/0x1b80 arch/x86/kernel/unwind_orc.c:426
 __save_stack_trace+0x7d/0xf0 arch/x86/kernel/stacktrace.c:44
 save_stack+0x33/0xa0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:524
 slab_free_hook mm/slub.c:1357 [inline]
 slab_free_freelist_hook mm/slub.c:1379 [inline]
 slab_free mm/slub.c:2955 [inline]
 kmem_cache_free+0xae/0x310 mm/slub.c:2977
 put_pid+0xe2/0x120 kernel/pid.c:246
 __rcu_reclaim kernel/rcu/rcu.h:195 [inline]
 rcu_do_batch kernel/rcu/tree.c:2666 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2920 [inline]
 __rcu_process_callbacks kernel/rcu/tree.c:2887 [inline]
 rcu_process_callbacks+0x599/0x12b0 kernel/rcu/tree.c:2904
 __do_softirq+0x234/0x934 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x164/0x190 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:638 [inline]
 smp_apic_timer_interrupt+0x71/0x90 arch/x86/kernel/apic/apic.c:1044
 apic_timer_interrupt+0xb9/0xc0 arch/x86/entry/entry_64.S:793
 </IRQ>
RIP: 0010:arch_local_irq_enable arch/x86/include/asm/paravirt.h:824 [inline]
RIP: 0010:preempt_schedule_irq+0x71/0xd0 kernel/sched/core.c:3579
RSP: 0018:ffff8800374a7958 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
RAX: 0000000000000007 RBX: ffffed0006d95808 RCX: 1ffffffff534d022
RDX: 0000000000000000 RSI: ffffffffa7065fe0 RDI: ffff880036cac9a4
RBP: 0000000000000000 R08: ffff88007ffd709c R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880036cac040
R13: 00000000000002eb R14: ffff880036cac040 R15: ffff88005fdc2728
 retint_kernel+0x1b/0x2d
RIP: 0010:arch_local_save_flags arch/x86/include/asm/paravirt.h:809 [inline]
RIP: 0010:___might_sleep+0x159/0x480 kernel/sched/core.c:5968
RSP: 0018:ffff8800374a7a28 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 1ffff10006d95882 RSI: 00000000ffffffff RDI: ffff880036cac410
RBP: 0000000000000000 R08: ffffffffa3b1546e R09: dffffc0000000000
R10: ffff8800374a7c08 R11: 0000000000000001 R12: ffffffffa7065320
R13: 00000000000002eb R14: ffff880036cac040 R15: ffff88005fdc2728
 ext4_orphan_add+0x34e/0xd70 fs/ext4/namei.c:2801

The buggy address belongs to the page:
page:ffffea0000dd29c0 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0xfffe0000000000()
raw: 00fffe0000000000 0000000000000000 0000000000000000 00000000ffffffff
raw: 0000000000000000 dead000000000200 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8800374a7a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff8800374a7a80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 f4 f4
>ffff8800374a7b00: f2 f2 f2 f2 00 f4 f4 f4 f2 f2 f2 f2 00 00 00 00
                                  ^
 ffff8800374a7b80: 00 f4 f4 f4 00 00 00 00 00 00 00 00 00 00 00 00
 ffff8800374a7c00: 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00
==================================================================

-- 

Thanks,
Sasha

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-07-28 16:48   ` [PATCH v4 1/2] x86/unwind: add " Levin, Alexander (Sasha Levin)
@ 2017-07-28 17:52     ` Josh Poimboeuf
  2017-07-28 18:29       ` Levin, Alexander (Sasha Levin)
  0 siblings, 1 reply; 38+ messages in thread
From: Josh Poimboeuf @ 2017-07-28 17:52 UTC (permalink / raw)
  To: Levin, Alexander (Sasha Levin)
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On Fri, Jul 28, 2017 at 04:48:47PM +0000, Levin, Alexander (Sasha Levin) wrote:
> Hey Josh,
> 
> Syzkaller seems to trigger the following:
> 
> ==================================================================
> BUG: KASAN: stack-out-of-bounds in __read_once_size include/linux/compiler.h:253 [inline]
> BUG: KASAN: stack-out-of-bounds in deref_stack_reg+0x123/0x140 arch/x86/kernel/unwind_orc.c:282
> Read of size 8 at addr ffff8800374a7b28 by task syz-executor4/6474
> 
> CPU: 2 PID: 6474 Comm: syz-executor4 Not tainted 4.13.0-rc2-next-20170727 #232
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
> Call Trace:
>  <IRQ>
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0xab/0x105 lib/dump_stack.c:52
>  print_address_description+0xc2/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x24f/0x360 mm/kasan/report.c:408
>  __read_once_size include/linux/compiler.h:253 [inline]
>  deref_stack_reg+0x123/0x140 arch/x86/kernel/unwind_orc.c:282
>  unwind_next_frame+0xd9b/0x1b80 arch/x86/kernel/unwind_orc.c:426
>  __save_stack_trace+0x7d/0xf0 arch/x86/kernel/stacktrace.c:44
>  save_stack+0x33/0xa0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:524
>  slab_free_hook mm/slub.c:1357 [inline]
>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>  slab_free mm/slub.c:2955 [inline]
>  kmem_cache_free+0xae/0x310 mm/slub.c:2977
>  put_pid+0xe2/0x120 kernel/pid.c:246
>  __rcu_reclaim kernel/rcu/rcu.h:195 [inline]
>  rcu_do_batch kernel/rcu/tree.c:2666 [inline]
>  invoke_rcu_callbacks kernel/rcu/tree.c:2920 [inline]
>  __rcu_process_callbacks kernel/rcu/tree.c:2887 [inline]
>  rcu_process_callbacks+0x599/0x12b0 kernel/rcu/tree.c:2904
>  __do_softirq+0x234/0x934 kernel/softirq.c:284
>  invoke_softirq kernel/softirq.c:364 [inline]
>  irq_exit+0x164/0x190 kernel/softirq.c:405
>  exiting_irq arch/x86/include/asm/apic.h:638 [inline]
>  smp_apic_timer_interrupt+0x71/0x90 arch/x86/kernel/apic/apic.c:1044
>  apic_timer_interrupt+0xb9/0xc0 arch/x86/entry/entry_64.S:793
>  </IRQ>
> RIP: 0010:arch_local_irq_enable arch/x86/include/asm/paravirt.h:824 [inline]
> RIP: 0010:preempt_schedule_irq+0x71/0xd0 kernel/sched/core.c:3579
> RSP: 0018:ffff8800374a7958 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
> RAX: 0000000000000007 RBX: ffffed0006d95808 RCX: 1ffffffff534d022
> RDX: 0000000000000000 RSI: ffffffffa7065fe0 RDI: ffff880036cac9a4
> RBP: 0000000000000000 R08: ffff88007ffd709c R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff880036cac040
> R13: 00000000000002eb R14: ffff880036cac040 R15: ffff88005fdc2728
>  retint_kernel+0x1b/0x2d
> RIP: 0010:arch_local_save_flags arch/x86/include/asm/paravirt.h:809 [inline]
> RIP: 0010:___might_sleep+0x159/0x480 kernel/sched/core.c:5968
> RSP: 0018:ffff8800374a7a28 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
> RDX: 1ffff10006d95882 RSI: 00000000ffffffff RDI: ffff880036cac410
> RBP: 0000000000000000 R08: ffffffffa3b1546e R09: dffffc0000000000
> R10: ffff8800374a7c08 R11: 0000000000000001 R12: ffffffffa7065320
> R13: 00000000000002eb R14: ffff880036cac040 R15: ffff88005fdc2728
>  ext4_orphan_add+0x34e/0xd70 fs/ext4/namei.c:2801

Thanks for reporting this.  I'm confused by the stack trace.  It seems
to end at ext4_orphan_add, which would normally make sense because the
unwinder would have stopped when it read the bad address on the stack.

But there aren't any of the '?' entries, which should still be there.
Any chance your post-processing script removes those?  Can you share the
raw dmesg before it was post-processed?

-- 
Josh

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-07-28 17:52     ` Josh Poimboeuf
@ 2017-07-28 18:29       ` Levin, Alexander (Sasha Levin)
  2017-07-28 18:57         ` Josh Poimboeuf
  0 siblings, 1 reply; 38+ messages in thread
From: Levin, Alexander (Sasha Levin) @ 2017-07-28 18:29 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On Fri, Jul 28, 2017 at 12:52:34PM -0500, Josh Poimboeuf wrote:
>On Fri, Jul 28, 2017 at 04:48:47PM +0000, Levin, Alexander (Sasha Levin) wrote:
>> Hey Josh,
>>
>> Syzkaller seems to trigger the following:
>>
>> ==================================================================
>> BUG: KASAN: stack-out-of-bounds in __read_once_size include/linux/compiler.h:253 [inline]
>> BUG: KASAN: stack-out-of-bounds in deref_stack_reg+0x123/0x140 arch/x86/kernel/unwind_orc.c:282
>> Read of size 8 at addr ffff8800374a7b28 by task syz-executor4/6474
>>
>> CPU: 2 PID: 6474 Comm: syz-executor4 Not tainted 4.13.0-rc2-next-20170727 #232
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
>> Call Trace:
>>  <IRQ>
>>  __dump_stack lib/dump_stack.c:16 [inline]
>>  dump_stack+0xab/0x105 lib/dump_stack.c:52
>>  print_address_description+0xc2/0x250 mm/kasan/report.c:252
>>  kasan_report_error mm/kasan/report.c:351 [inline]
>>  kasan_report+0x24f/0x360 mm/kasan/report.c:408
>>  __read_once_size include/linux/compiler.h:253 [inline]
>>  deref_stack_reg+0x123/0x140 arch/x86/kernel/unwind_orc.c:282
>>  unwind_next_frame+0xd9b/0x1b80 arch/x86/kernel/unwind_orc.c:426
>>  __save_stack_trace+0x7d/0xf0 arch/x86/kernel/stacktrace.c:44
>>  save_stack+0x33/0xa0 mm/kasan/kasan.c:447
>>  set_track mm/kasan/kasan.c:459 [inline]
>>  kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:524
>>  slab_free_hook mm/slub.c:1357 [inline]
>>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>>  slab_free mm/slub.c:2955 [inline]
>>  kmem_cache_free+0xae/0x310 mm/slub.c:2977
>>  put_pid+0xe2/0x120 kernel/pid.c:246
>>  __rcu_reclaim kernel/rcu/rcu.h:195 [inline]
>>  rcu_do_batch kernel/rcu/tree.c:2666 [inline]
>>  invoke_rcu_callbacks kernel/rcu/tree.c:2920 [inline]
>>  __rcu_process_callbacks kernel/rcu/tree.c:2887 [inline]
>>  rcu_process_callbacks+0x599/0x12b0 kernel/rcu/tree.c:2904
>>  __do_softirq+0x234/0x934 kernel/softirq.c:284
>>  invoke_softirq kernel/softirq.c:364 [inline]
>>  irq_exit+0x164/0x190 kernel/softirq.c:405
>>  exiting_irq arch/x86/include/asm/apic.h:638 [inline]
>>  smp_apic_timer_interrupt+0x71/0x90 arch/x86/kernel/apic/apic.c:1044
>>  apic_timer_interrupt+0xb9/0xc0 arch/x86/entry/entry_64.S:793
>>  </IRQ>
>> RIP: 0010:arch_local_irq_enable arch/x86/include/asm/paravirt.h:824 [inline]
>> RIP: 0010:preempt_schedule_irq+0x71/0xd0 kernel/sched/core.c:3579
>> RSP: 0018:ffff8800374a7958 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
>> RAX: 0000000000000007 RBX: ffffed0006d95808 RCX: 1ffffffff534d022
>> RDX: 0000000000000000 RSI: ffffffffa7065fe0 RDI: ffff880036cac9a4
>> RBP: 0000000000000000 R08: ffff88007ffd709c R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000000 R12: ffff880036cac040
>> R13: 00000000000002eb R14: ffff880036cac040 R15: ffff88005fdc2728
>>  retint_kernel+0x1b/0x2d
>> RIP: 0010:arch_local_save_flags arch/x86/include/asm/paravirt.h:809 [inline]
>> RIP: 0010:___might_sleep+0x159/0x480 kernel/sched/core.c:5968
>> RSP: 0018:ffff8800374a7a28 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02
>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
>> RDX: 1ffff10006d95882 RSI: 00000000ffffffff RDI: ffff880036cac410
>> RBP: 0000000000000000 R08: ffffffffa3b1546e R09: dffffc0000000000
>> R10: ffff8800374a7c08 R11: 0000000000000001 R12: ffffffffa7065320
>> R13: 00000000000002eb R14: ffff880036cac040 R15: ffff88005fdc2728
>>  ext4_orphan_add+0x34e/0xd70 fs/ext4/namei.c:2801
>
>Thanks for reporting this.  I'm confused by the stack trace.  It seems
>to end at ext4_orphan_add, which would normally make sense because the
>unwinder would have stopped when it read the bad address on the stack.
>
>But there aren't any of the '?' entries, which should still be there.
>Any chance your post-processing script removes those?  Can you share the
>raw dmesg before it was post-processed?

Hey Josh,

Sure, here it is:

[  391.851860] ==================================================================
[  391.856663] BUG: KASAN: stack-out-of-bounds in deref_stack_reg+0x123/0x140
[  391.861469] Read of size 8 at addr ffff8800374a7b28 by task syz-executor4/6474
[  391.864387] 
[  391.864651] CPU: 2 PID: 6474 Comm: syz-executor4 Not tainted 4.13.0-rc2-next-20170727 #232
[  391.865919] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[  391.867424] Call Trace:
[  391.867875]  <IRQ>
[  391.868269]  dump_stack+0xab/0x105
[  391.868923]  print_address_description+0xc2/0x250
[  391.869843]  kasan_report+0x24f/0x360
[  391.870534]  ? deref_stack_reg+0x123/0x140
[  391.871490]  deref_stack_reg+0x123/0x140
[  391.872239]  ? __read_once_size_nocheck.constprop.7+0x10/0x10
[  391.873289]  ? get_stack_info+0x32/0x150
[  391.874016]  ? stack_access_ok+0xd6/0x150
[  391.874701]  ? sched_clock+0x5/0x10
[  391.875516]  ? sched_clock_cpu+0x18/0x1d0
[  391.877449]  unwind_next_frame+0xd9b/0x1b80
[  391.878149]  ? ext4_orphan_add+0x34e/0xd70
[  391.878798]  ? deref_stack_reg+0x140/0x140
[  391.882077]  ? check_preemption_disabled+0x34/0x1f0
[  391.884084]  __save_stack_trace+0x7d/0xf0
[  391.885471]  ? ext4_orphan_add+0x34e/0xd70
[  391.886123]  ? put_pid+0xe2/0x120
[  391.889839]  save_stack+0x33/0xa0
[  391.890367]  ? save_stack+0x33/0xa0
[  391.890902]  ? kasan_slab_free+0x72/0xc0
[  391.891508]  ? kmem_cache_free+0xae/0x310
[  391.892134]  ? put_pid+0xe2/0x120
[  391.892659]  ? rcu_process_callbacks+0x599/0x12b0
[  391.893791]  ? __do_softirq+0x234/0x934
[  391.894418]  ? irq_exit+0x164/0x190
[  391.894973]  ? smp_apic_timer_interrupt+0x71/0x90
[  391.895705]  ? apic_timer_interrupt+0xb9/0xc0
[  391.896379]  ? preempt_schedule_irq+0x71/0xd0
[  391.897062]  ? retint_kernel+0x1b/0x2d
[  391.901512]  ? ___might_sleep+0x159/0x480
[  391.902143]  ? ext4_orphan_add+0x34e/0xd70
[  391.902794]  ? mark_held_locks+0xc7/0x110
[  391.903422]  ? check_preemption_disabled+0x34/0x1f0
[  391.911223]  ? trace_hardirqs_on_caller+0x284/0x590
[  391.911987]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  391.912771]  ? free_object+0xce/0x160
[  391.913376]  ? __debug_check_no_obj_freed+0x400/0x900
[  391.919901]  ? check_preemption_disabled+0x34/0x1f0
[  391.920664]  ? debug_object_free+0x10/0x10
[  391.921324]  ? mark_held_locks+0xc7/0x110
[  391.921952]  ? check_preemption_disabled+0x34/0x1f0
[  391.922718]  kasan_slab_free+0x72/0xc0
[  391.923314]  kmem_cache_free+0xae/0x310
[  391.923927]  put_pid+0xe2/0x120
[  391.924633]  rcu_process_callbacks+0x599/0x12b0
[  391.928396]  ? rcu_exp_wait_wake+0x1100/0x1100
[  391.929078]  ? check_preemption_disabled+0x34/0x1f0
[  391.929967]  __do_softirq+0x234/0x934
[  391.930867]  irq_exit+0x164/0x190
[  391.931482]  smp_apic_timer_interrupt+0x71/0x90
[  391.934785]  apic_timer_interrupt+0xb9/0xc0
[  391.935393]  </IRQ>
[  391.935720] RIP: 0010:preempt_schedule_irq+0x71/0xd0
[  391.936799] RSP: 0018:ffff8800374a7958 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  391.940748] RAX: 0000000000000007 RBX: ffffed0006d95808 RCX: 1ffffffff534d022
[  391.942554] RDX: 0000000000000000 RSI: ffffffffa7065fe0 RDI: ffff880036cac9a4
[  391.943536] RBP: 0000000000000000 R08: ffff88007ffd709c R09: 0000000000000000
[  391.950270] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880036cac040
[  391.952836] R13: 00000000000002eb R14: ffff880036cac040 R15: ffff88005fdc2728
[  391.957222]  ? preempt_schedule_irq+0x6d/0xd0
[  391.957915]  retint_kernel+0x1b/0x2d
[  391.958483] RIP: 0010:___might_sleep+0x159/0x480
[  391.959904] RSP: 0018:ffff8800374a7a28 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02
[  391.969771] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[  391.971302] RDX: 1ffff10006d95882 RSI: 00000000ffffffff RDI: ffff880036cac410
[  391.972390] RBP: 0000000000000000 R08: ffffffffa3b1546e R09: dffffc0000000000
[  391.973523] R10: ffff8800374a7c08 R11: 0000000000000001 R12: ffffffffa7065320
[  391.974947] R13: 00000000000002eb R14: ffff880036cac040 R15: ffff88005fdc2728
[  391.976216]  ? ext4_orphan_add+0x34e/0xd70
[  391.977434]  ext4_orphan_add+0x34e/0xd70
[  391.979303]  ? __mutex_lock+0xc7/0x14b0
[  391.980692]  ? ext4_orphan_add+0x34e/0xd70
[  391.981547]  ? mutex_lock_io_nested+0x1350/0x1350
[  391.982276]  ? lock_acquire+0x3a0/0x3a0
[  391.982884]  ? check_preemption_disabled+0x34/0x1f0
[  391.983655]  ? jbd2_write_access_granted.part.7+0x1b9/0x2b0
[  391.984534]  ? jbd2_journal_get_write_access+0x77/0x90
[  391.990212]  ? __ext4_journal_get_write_access+0x117/0x1d0
[  391.994173]  ? ext4_orphan_add+0x34e/0xd70
[  391.994820]  ? ext4_orphan_add+0x34e/0xd70
[  391.995467]  ? ext4_delete_entry+0x25c/0x420
[  391.996137]  ? jbd2__journal_start+0xef/0x8c0
[  391.996799]  ? ext4_empty_dir+0x750/0x750
[  391.998128]  ? __ext4_journal_start_sb+0x100/0x440
[  391.998873]  ? ext4_rmdir+0x432/0xc30
[  391.999484]  ? ext4_rmdir+0x703/0xc30
[  392.000071]  ? ext4_rename2+0x130/0x130
[  392.000692]  ? vfs_rmdir+0x1cd/0x3a0
[  392.001273]  ? do_rmdir+0x39f/0x400
[  392.001833]  ? user_path_create+0x40/0x40
[  392.002458]  ? syscall_trace_enter+0x324/0xe30
[  392.008656]  ? lock_acquire+0x3a0/0x3a0
[  392.009278]  ? exit_to_usermode_loop+0x160/0x160
[  392.009997]  ? check_preemption_disabled+0x34/0x1f0
[  392.010765]  ? check_preemption_disabled+0x34/0x1f0
[  392.011527]  ? SyS_mkdir+0x230/0x230
[  392.012094]  ? do_syscall_64+0x1b0/0x600
[  392.012718]  ? entry_SYSCALL64_slow_path+0x25/0x25
[  392.013484] 
[  392.013734] The buggy address belongs to the page:
[  392.014750] page:ffffea0000dd29c0 count:0 mapcount:0 mapping:          (null) index:0x0
[  392.016276] flags: 0xfffe0000000000()
[  392.016939] raw: 00fffe0000000000 0000000000000000 0000000000000000 00000000ffffffff
[  392.018679] raw: 0000000000000000 dead000000000200 0000000000000000 0000000000000000
[  392.019881] page dumped because: kasan: bad access detected
[  392.020729] 
[  392.020979] Memory state around the buggy address:
[  392.021738]  ffff8800374a7a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  392.022836]  ffff8800374a7a80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 f4 f4
[  392.023950] >ffff8800374a7b00: f2 f2 f2 f2 00 f4 f4 f4 f2 f2 f2 f2 00 00 00 00
[  392.025193]                                   ^
[  392.025890]  ffff8800374a7b80: 00 f4 f4 f4 00 00 00 00 00 00 00 00 00 00 00 00
[  392.026961]  ffff8800374a7c00: 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00
[  392.028049] ==================================================================
[  392.029713] Disabling lock debugging due to kernel taint
[  392.030551] Kernel panic - not syncing: panic_on_warn set ...
[  392.030551] 
[  392.031651] CPU: 2 PID: 6474 Comm: syz-executor4 Tainted: G    B           4.13.0-rc2-next-20170727 #232
[  392.033060] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[  392.034399] Call Trace:
[  392.034785]  <IRQ>
[  392.035115]  dump_stack+0xab/0x105
[  392.035649]  panic+0x1bc/0x3ba
[  392.036306]  ? __warn+0x1d6/0x1d6
[  392.036836]  kasan_end_report+0x4a/0x50
[  392.037438]  kasan_report+0x168/0x360
[  392.038004]  ? deref_stack_reg+0x123/0x140
[  392.038581]  deref_stack_reg+0x123/0x140
[  392.039161]  ? __read_once_size_nocheck.constprop.7+0x10/0x10
[  392.040072]  ? get_stack_info+0x32/0x150
[  392.040895]  ? stack_access_ok+0xd6/0x150
[  392.041764]  ? sched_clock+0x5/0x10
[  392.042346]  ? sched_clock_cpu+0x18/0x1d0
[  392.043076]  unwind_next_frame+0xd9b/0x1b80
[  392.043776]  ? ext4_orphan_add+0x34e/0xd70
[  392.044425]  ? deref_stack_reg+0x140/0x140
[  392.045073]  ? check_preemption_disabled+0x34/0x1f0
[  392.045833]  __save_stack_trace+0x7d/0xf0
[  392.047297]  ? ext4_orphan_add+0x34e/0xd70
[  392.047932]  ? put_pid+0xe2/0x120
[  392.048447]  save_stack+0x33/0xa0
[  392.048986]  ? save_stack+0x33/0xa0
[  392.049564]  ? kasan_slab_free+0x72/0xc0
[  392.050171]  ? kmem_cache_free+0xae/0x310
[  392.050792]  ? put_pid+0xe2/0x120
[  392.051650]  ? rcu_process_callbacks+0x599/0x12b0
[  392.052357]  ? __do_softirq+0x234/0x934
[  392.052948]  ? irq_exit+0x164/0x190
[  392.053523]  ? smp_apic_timer_interrupt+0x71/0x90
[  392.054273]  ? apic_timer_interrupt+0xb9/0xc0
[  392.054951]  ? preempt_schedule_irq+0x71/0xd0
[  392.055594]  ? retint_kernel+0x1b/0x2d
[  392.056168]  ? ___might_sleep+0x159/0x480
[  392.056784]  ? ext4_orphan_add+0x34e/0xd70
[  392.057442]  ? mark_held_locks+0xc7/0x110
[  392.058235]  ? check_preemption_disabled+0x34/0x1f0
[  392.058979]  ? trace_hardirqs_on_caller+0x284/0x590
[  392.059730]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  392.060486]  ? free_object+0xce/0x160
[  392.060997]  ? __debug_check_no_obj_freed+0x400/0x900
[  392.061905]  ? check_preemption_disabled+0x34/0x1f0
[  392.062763]  ? debug_object_free+0x10/0x10
[  392.063334]  ? mark_held_locks+0xc7/0x110
[  392.063914]  ? check_preemption_disabled+0x34/0x1f0
[  392.064764]  kasan_slab_free+0x72/0xc0
[  392.065451]  kmem_cache_free+0xae/0x310
[  392.066139]  put_pid+0xe2/0x120
[  392.066723]  rcu_process_callbacks+0x599/0x12b0
[  392.067532]  ? rcu_exp_wait_wake+0x1100/0x1100
[  392.068293]  ? check_preemption_disabled+0x34/0x1f0
[  392.069231]  __do_softirq+0x234/0x934
[  392.069760]  irq_exit+0x164/0x190
[  392.070270]  smp_apic_timer_interrupt+0x71/0x90
[  392.070978]  apic_timer_interrupt+0xb9/0xc0
[  392.071614]  </IRQ>
[  392.072152] RIP: 0010:preempt_schedule_irq+0x71/0xd0
[  392.072903] RSP: 0018:ffff8800374a7958 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  392.074336] RAX: 0000000000000007 RBX: ffffed0006d95808 RCX: 1ffffffff534d022
[  392.075285] RDX: 0000000000000000 RSI: ffffffffa7065fe0 RDI: ffff880036cac9a4
[  392.076235] RBP: 0000000000000000 R08: ffff88007ffd709c R09: 0000000000000000
[  392.077204] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880036cac040
[  392.078271] R13: 00000000000002eb R14: ffff880036cac040 R15: ffff88005fdc2728
[  392.080105]  ? preempt_schedule_irq+0x6d/0xd0
[  392.080882]  retint_kernel+0x1b/0x2d
[  392.081563] RIP: 0010:___might_sleep+0x159/0x480
[  392.082364] RSP: 0018:ffff8800374a7a28 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02
[  392.083604] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[  392.084879] RDX: 1ffff10006d95882 RSI: 00000000ffffffff RDI: ffff880036cac410
[  392.085967] RBP: 0000000000000000 R08: ffffffffa3b1546e R09: dffffc0000000000
[  392.087029] R10: ffff8800374a7c08 R11: 0000000000000001 R12: ffffffffa7065320
[  392.088090] R13: 00000000000002eb R14: ffff880036cac040 R15: ffff88005fdc2728
[  392.089199]  ? ext4_orphan_add+0x34e/0xd70
[  392.089833]  ext4_orphan_add+0x34e/0xd70
[  392.090441]  ? __mutex_lock+0xc7/0x14b0
[  392.091197]  ? ext4_orphan_add+0x34e/0xd70
[  392.091833]  ? mutex_lock_io_nested+0x1350/0x1350
[  392.092550]  ? lock_acquire+0x3a0/0x3a0
[  392.093153]  ? check_preemption_disabled+0x34/0x1f0
[  392.093917]  ? jbd2_write_access_granted.part.7+0x1b9/0x2b0
[  392.094774]  ? jbd2_journal_get_write_access+0x77/0x90
[  392.095946]  ? __ext4_journal_get_write_access+0x117/0x1d0
[  392.096785]  ? ext4_orphan_add+0x34e/0xd70
[  392.097431]  ? ext4_orphan_add+0x34e/0xd70
[  392.098063]  ? ext4_delete_entry+0x25c/0x420
[  392.098934]  ? jbd2__journal_start+0xef/0x8c0
[  392.099602]  ? ext4_empty_dir+0x750/0x750
[  392.100233]  ? __ext4_journal_start_sb+0x100/0x440
[  392.100964]  ? ext4_rmdir+0x432/0xc30
[  392.101561]  ? ext4_rmdir+0x703/0xc30
[  392.102303]  ? ext4_rename2+0x130/0x130
[  392.102906]  ? vfs_rmdir+0x1cd/0x3a0
[  392.103460]  ? do_rmdir+0x39f/0x400
[  392.104097]  ? user_path_create+0x40/0x40
[  392.104806]  ? syscall_trace_enter+0x324/0xe30
[  392.105812]  ? lock_acquire+0x3a0/0x3a0
[  392.106684]  ? exit_to_usermode_loop+0x160/0x160
[  392.107501]  ? check_preemption_disabled+0x34/0x1f0
[  392.108299]  ? check_preemption_disabled+0x34/0x1f0
[  392.110963]  ? SyS_mkdir+0x230/0x230
[  392.111521]  ? do_syscall_64+0x1b0/0x600
[  392.112154]  ? entry_SYSCALL64_slow_path+0x25/0x25
[  392.113310] Dumping ftrace buffer:
[  392.113849]    (ftrace buffer empty)
[  392.114400] Kernel Offset: 0x22200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  392.116033] Rebooting in 86400 seconds..

-- 

Thanks,
Sasha

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-07-28 18:29       ` Levin, Alexander (Sasha Levin)
@ 2017-07-28 18:57         ` Josh Poimboeuf
  2017-07-28 19:59           ` Levin, Alexander (Sasha Levin)
  0 siblings, 1 reply; 38+ messages in thread
From: Josh Poimboeuf @ 2017-07-28 18:57 UTC (permalink / raw)
  To: Levin, Alexander (Sasha Levin)
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On Fri, Jul 28, 2017 at 06:29:57PM +0000, Levin, Alexander (Sasha Levin) wrote:
> On Fri, Jul 28, 2017 at 12:52:34PM -0500, Josh Poimboeuf wrote:
> >On Fri, Jul 28, 2017 at 04:48:47PM +0000, Levin, Alexander (Sasha Levin) wrote:
> >> Hey Josh,
> >>
> >> Syzkaller seems to trigger the following:
> >>
> >> [KASAN report snipped; quoted in full earlier in the thread]
> >
> >Thanks for reporting this.  I'm confused by the stack trace.  It seems
> >to end at ext4_orphan_add, which would normally make sense because the
> >unwinder would have stopped when it read the bad address on the stack.
> >
> >But there aren't any of the '?' entries, which should still be there.
> >Any chance your post-processing script removes those?  Can you share the
> >raw dmesg before it was post-processed?
> 
> Hey Josh,
> 
> Sure, here it is:

Thanks, that's much better.  I'm relieved the unwinder didn't screw that
up, at least.

This looks like a tricky one.  Is it easily reproducible?

Any chance you'd be able to share the vmlinux file somehow?  If not, at
least the .config file, GCC version, and base kernel commit would be
useful so I can try to build a similar image.

-- 
Josh

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-07-28 18:57         ` Josh Poimboeuf
@ 2017-07-28 19:59           ` Levin, Alexander (Sasha Levin)
  2017-07-29  3:54             ` Josh Poimboeuf
  0 siblings, 1 reply; 38+ messages in thread
From: Levin, Alexander (Sasha Levin) @ 2017-07-28 19:59 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 11561 bytes --]

On Fri, Jul 28, 2017 at 01:57:20PM -0500, Josh Poimboeuf wrote:
>Thanks, that's much better.  I'm relieved the unwinder didn't screw that
>up, at least.
>
>This looks like a tricky one.  Is it easily recreatable?

Yeah, I just hit it again with slightly different initial calls:

[   49.261152] ==================================================================
[   49.262388] BUG: KASAN: stack-out-of-bounds in deref_stack_reg+0x123/0x140
[   49.263455] Read of size 8 at addr ffff880037bdf9a8 by task udevd/2790
[   49.264460] 
[   49.264859] CPU: 3 PID: 2790 Comm: udevd Not tainted 4.13.0-rc2-next-20170727 #232
[   49.266469] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[   49.268372] Call Trace:
[   49.268993]  <IRQ>
[   49.269482]  dump_stack+0xab/0x105
[   49.270255]  print_address_description+0xc2/0x250
[   49.271295]  kasan_report+0x24f/0x360
[   49.272257]  ? deref_stack_reg+0x123/0x140
[   49.273179]  deref_stack_reg+0x123/0x140
[   49.274050]  ? __read_once_size_nocheck.constprop.7+0x10/0x10
[   49.275289]  ? get_stack_info+0x32/0x150
[   49.276161]  ? stack_access_ok+0xd6/0x150
[   49.277583]  ? sched_clock_cpu+0x18/0x1d0
[   49.278493]  unwind_next_frame+0xd9b/0x1b80
[   49.279422]  ? __alloc_skb+0xac/0x550
[   49.280240]  ? __lock_acquire+0xc44/0x4870
[   49.281181]  ? deref_stack_reg+0x140/0x140
[   49.282319]  __save_stack_trace+0x7d/0xf0
[   49.283377]  ? __alloc_skb+0xac/0x550
[   49.284204]  ? rcu_process_callbacks+0x599/0x12b0
[   49.285445]  save_stack+0x33/0xa0
[   49.286331]  ? save_stack+0x33/0xa0
[   49.287115]  ? kasan_slab_free+0x72/0xc0
[   49.287983]  ? kmem_cache_free+0xae/0x310
[   49.289100]  ? rcu_process_callbacks+0x599/0x12b0
[   49.290138]  ? __do_softirq+0x234/0x934
[   49.290998]  ? irq_exit+0x164/0x190
[   49.291776]  ? smp_apic_timer_interrupt+0x71/0x90
[   49.292816]  ? apic_timer_interrupt+0xb9/0xc0
[   49.293931]  ? __slab_alloc.isra.64+0x17/0xa0
[   49.294895]  ? __alloc_skb+0xac/0x550
[   49.295718]  ? lock_acquire+0x3a0/0x3a0
[   49.296586]  ? check_preemption_disabled+0x34/0x1f0
[   49.297668]  ? trace_hardirqs_on_caller+0x284/0x590
[   49.298740]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[   49.299823]  ? free_object+0xce/0x160
[   49.300673]  ? __debug_check_no_obj_freed+0x400/0x900
[   49.301790]  ? check_preemption_disabled+0x34/0x1f0
[   49.302856]  ? debug_object_free+0x10/0x10
[   49.303939]  ? check_preemption_disabled+0x34/0x1f0
[   49.305173]  ? check_preemption_disabled+0x34/0x1f0
[   49.306249]  kasan_slab_free+0x72/0xc0
[   49.307116]  kmem_cache_free+0xae/0x310
[   49.307968]  rcu_process_callbacks+0x599/0x12b0
[   49.308989]  ? rcu_exp_wait_wake+0x1100/0x1100
[   49.309974]  ? check_preemption_disabled+0x34/0x1f0
[   49.311055]  __do_softirq+0x234/0x934
[   49.311926]  ? __alloc_skb+0xac/0x550
[   49.312739]  irq_exit+0x164/0x190
[   49.313512]  smp_apic_timer_interrupt+0x71/0x90
[   49.314515]  apic_timer_interrupt+0xb9/0xc0
[   49.315442]  </IRQ>
[   49.316085] RIP: 0010:__slab_alloc.isra.64+0x17/0xa0
[   49.317298] RSP: 0018:ffff880037bdf8d8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[   49.318922] RAX: 0000000000000000 RBX: 00000000014000c0 RCX: ffffffffbacd0ecc
[   49.320445] RDX: 00000000ffffffff RSI: 00000000014000c0 RDI: ffff88003dd36640
[   49.321973] RBP: ffff88003dd36640 R08: 0000000000000000 R09: 0000000000000003
[   49.323505] R10: ffff880037bdfa7f R11: 0000000000000001 R12: ffff88003dd36640
[   49.325444] R13: 00000000014000c0 R14: 00000000ffffffff R15: ffffffffbacd0ecc
[   49.327316]  ? __alloc_skb+0xac/0x550
[   49.328140]  ? __alloc_skb+0xac/0x550
[   49.328981]  __alloc_skb+0xac/0x550
[   49.329773]  ? kmem_cache_alloc_node+0xc3/0x380
[   49.330785]  ? __alloc_skb+0xac/0x550
[   49.331604]  ? skb_to_sgvec_nomark+0x40/0x40
[   49.332557]  ? security_capable+0x88/0xb0
[   49.333471]  ? netlink_sendmsg+0x8da/0xbb0
[   49.334391]  ? netlink_unicast+0x7a0/0x7a0
[   49.335311]  ? netlink_unicast+0x7a0/0x7a0
[   49.336425]  ? sock_sendmsg+0xc0/0x100
[   49.337676]  ? ___sys_sendmsg+0x7b8/0x920
[   49.338702]  ? copy_msghdr_from_user+0x380/0x380
[   49.339755]  ? kvm_clock_read+0x1f/0x30
[   49.340591]  ? kvm_sched_clock_read+0x5/0x10
[   49.341541]  ? sched_clock+0x5/0x10
[   49.342325]  ? sched_clock_cpu+0x18/0x1d0
[   49.343213]  ? sched_clock+0x5/0x10
[   49.343997]  ? check_preemption_disabled+0x34/0x1f0
[   49.345085]  ? __lock_acquire+0xc44/0x4870
[   49.346045]  ? __fget_light+0x9a/0x1e0
[   49.347347]  ? __sys_sendmsg+0xc7/0x170
[   49.348658]  ? __sys_sendmsg+0xc7/0x170
[   49.349652]  ? SyS_shutdown+0x1a0/0x1a0
[   49.350525]  ? check_preemption_disabled+0x34/0x1f0
[   49.351594]  ? syscall_trace_enter+0x324/0xe30
[   49.352571]  ? lock_acquire+0x3a0/0x3a0
[   49.353448]  ? check_preemption_disabled+0x34/0x1f0
[   49.354790]  ? SyS_sendmsg+0x27/0x40
[   49.355602]  ? __sys_sendmsg+0x170/0x170
[   49.356477]  ? do_syscall_64+0x1b0/0x600
[   49.357874]  ? entry_SYSCALL64_slow_path+0x25/0x25
[   49.358940] 
[   49.359891] The buggy address belongs to the page:
[   49.360940] page:ffffea0000def7c0 count:0 mapcount:0 mapping:          (null) index:0x0
[   49.362788] flags: 0xfffe0000000000()
[   49.363608] raw: 00fffe0000000000 0000000000000000 0000000000000000 00000000ffffffff
[   49.365286] raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
[   49.366955] page dumped because: kasan: bad access detected
[   49.368162] 
[   49.368529] Memory state around the buggy address:
[   49.370008]  ffff880037bdf880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   49.371916]  ffff880037bdf900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1
[   49.373692] >ffff880037bdf980: f1 f1 f1 01 f4 f4 f4 00 00 00 00 00 00 00 00 00
[   49.375240]                                   ^
[   49.376235]  ffff880037bdfa00: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00 00
[   49.377794]  ffff880037bdfa80: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
[   49.379539] ==================================================================
[   49.381628] Disabling lock debugging due to kernel taint
[   49.382908] Kernel panic - not syncing: panic_on_warn set ...
[   49.382908] 
[   49.384486] CPU: 3 PID: 2790 Comm: udevd Tainted: G    B           4.13.0-rc2-next-20170727 #232
[   49.386541] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[   49.388443] Call Trace:
[   49.389006]  <IRQ>
[   49.389478]  dump_stack+0xab/0x105
[   49.390242]  panic+0x1bc/0x3ba
[   49.390929]  ? __warn+0x1d6/0x1d6
[   49.391687]  kasan_end_report+0x4a/0x50
[   49.392572]  kasan_report+0x168/0x360
[   49.393737]  ? deref_stack_reg+0x123/0x140
[   49.394644]  deref_stack_reg+0x123/0x140
[   49.395511]  ? __read_once_size_nocheck.constprop.7+0x10/0x10
[   49.396757]  ? get_stack_info+0x32/0x150
[   49.397622]  ? stack_access_ok+0xd6/0x150
[   49.398504]  ? sched_clock_cpu+0x18/0x1d0
[   49.399395]  unwind_next_frame+0xd9b/0x1b80
[   49.400313]  ? __alloc_skb+0xac/0x550
[   49.401133]  ? __lock_acquire+0xc44/0x4870
[   49.402029]  ? deref_stack_reg+0x140/0x140
[   49.402928]  __save_stack_trace+0x7d/0xf0
[   49.403954]  ? __alloc_skb+0xac/0x550
[   49.404984]  ? rcu_process_callbacks+0x599/0x12b0
[   49.406006]  save_stack+0x33/0xa0
[   49.406947]  ? save_stack+0x33/0xa0
[   49.407729]  ? kasan_slab_free+0x72/0xc0
[   49.408628]  ? kmem_cache_free+0xae/0x310
[   49.409497]  ? rcu_process_callbacks+0x599/0x12b0
[   49.410519]  ? __do_softirq+0x234/0x934
[   49.411353]  ? irq_exit+0x164/0x190
[   49.412110]  ? smp_apic_timer_interrupt+0x71/0x90
[   49.413121]  ? apic_timer_interrupt+0xb9/0xc0
[   49.414048]  ? __slab_alloc.isra.64+0x17/0xa0
[   49.415133]  ? __alloc_skb+0xac/0x550
[   49.415925]  ? lock_acquire+0x3a0/0x3a0
[   49.416768]  ? check_preemption_disabled+0x34/0x1f0
[   49.417809]  ? trace_hardirqs_on_caller+0x284/0x590
[   49.418847]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[   49.419907]  ? free_object+0xce/0x160
[   49.420705]  ? __debug_check_no_obj_freed+0x400/0x900
[   49.421797]  ? check_preemption_disabled+0x34/0x1f0
[   49.422831]  ? debug_object_free+0x10/0x10
[   49.423708]  ? check_preemption_disabled+0x34/0x1f0
[   49.424751]  ? check_preemption_disabled+0x34/0x1f0
[   49.425932]  kasan_slab_free+0x72/0xc0
[   49.426740]  kmem_cache_free+0xae/0x310
[   49.427566]  rcu_process_callbacks+0x599/0x12b0
[   49.428538]  ? rcu_exp_wait_wake+0x1100/0x1100
[   49.429699]  ? check_preemption_disabled+0x34/0x1f0
[   49.430747]  __do_softirq+0x234/0x934
[   49.431551]  ? __alloc_skb+0xac/0x550
[   49.432352]  irq_exit+0x164/0x190
[   49.433078]  smp_apic_timer_interrupt+0x71/0x90
[   49.434048]  apic_timer_interrupt+0xb9/0xc0
[   49.434942]  </IRQ>
[   49.435417] RIP: 0010:__slab_alloc.isra.64+0x17/0xa0
[   49.436473] RSP: 0018:ffff880037bdf8d8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[   49.438230] RAX: 0000000000000000 RBX: 00000000014000c0 RCX: ffffffffbacd0ecc
[   49.439739] RDX: 00000000ffffffff RSI: 00000000014000c0 RDI: ffff88003dd36640
[   49.441467] RBP: ffff88003dd36640 R08: 0000000000000000 R09: 0000000000000003
[   49.442972] R10: ffff880037bdfa7f R11: 0000000000000001 R12: ffff88003dd36640
[   49.444478] R13: 00000000014000c0 R14: 00000000ffffffff R15: ffffffffbacd0ecc
[   49.445989]  ? __alloc_skb+0xac/0x550
[   49.446777]  ? __alloc_skb+0xac/0x550
[   49.447570]  __alloc_skb+0xac/0x550
[   49.448491]  ? kmem_cache_alloc_node+0xc3/0x380
[   49.449477]  ? __alloc_skb+0xac/0x550
[   49.450269]  ? skb_to_sgvec_nomark+0x40/0x40
[   49.451186]  ? security_capable+0x88/0xb0
[   49.452057]  ? netlink_sendmsg+0x8da/0xbb0
[   49.452948]  ? netlink_unicast+0x7a0/0x7a0
[   49.453831]  ? netlink_unicast+0x7a0/0x7a0
[   49.454719]  ? sock_sendmsg+0xc0/0x100
[   49.455525]  ? ___sys_sendmsg+0x7b8/0x920
[   49.456389]  ? copy_msghdr_from_user+0x380/0x380
[   49.457390]  ? kvm_clock_read+0x1f/0x30
[   49.458216]  ? kvm_sched_clock_read+0x5/0x10
[   49.459281]  ? sched_clock+0x5/0x10
[   49.460041]  ? sched_clock_cpu+0x18/0x1d0
[   49.460912]  ? sched_clock+0x5/0x10
[   49.461670]  ? check_preemption_disabled+0x34/0x1f0
[   49.462710]  ? __lock_acquire+0xc44/0x4870
[   49.463598]  ? __fget_light+0x9a/0x1e0
[   49.464406]  ? __sys_sendmsg+0xc7/0x170
[   49.465244]  ? __sys_sendmsg+0xc7/0x170
[   49.466060]  ? SyS_shutdown+0x1a0/0x1a0
[   49.466886]  ? check_preemption_disabled+0x34/0x1f0
[   49.467920]  ? syscall_trace_enter+0x324/0xe30
[   49.468877]  ? lock_acquire+0x3a0/0x3a0
[   49.469700]  ? check_preemption_disabled+0x34/0x1f0
[   49.470891]  ? SyS_sendmsg+0x27/0x40
[   49.471664]  ? __sys_sendmsg+0x170/0x170
[   49.472509]  ? do_syscall_64+0x1b0/0x600
[   49.473361]  ? entry_SYSCALL64_slow_path+0x25/0x25
[   49.474538] Dumping ftrace buffer:
[   49.475276]    (ftrace buffer empty)
[   49.476055] Kernel Offset: 0x37400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   49.478324] Rebooting in 86400 seconds..

>Any chance you'd be able to share the vmlinux file somehow?  If not, at
>least the .config file, GCC version, and code level would be useful so I
>can try to build a similar image.

vmlinux is a bit big, ~180MB in my case. If you have somewhere to upload it to I'd be happy to push it.

Attached my config.

GCC is 7.1.0.

-- 

Thanks,
Sasha

[-- Attachment #2: config-sasha.gz --]
[-- Type: application/gzip, Size: 32906 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-07-28 19:59           ` Levin, Alexander (Sasha Levin)
@ 2017-07-29  3:54             ` Josh Poimboeuf
  2017-08-08 18:58               ` Josh Poimboeuf
  0 siblings, 1 reply; 38+ messages in thread
From: Josh Poimboeuf @ 2017-07-29  3:54 UTC (permalink / raw)
  To: Levin, Alexander (Sasha Levin)
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On Fri, Jul 28, 2017 at 07:59:12PM +0000, Levin, Alexander (Sasha Levin) wrote:
> On Fri, Jul 28, 2017 at 01:57:20PM -0500, Josh Poimboeuf wrote:
> >Thanks, that's much better.  I'm relieved the unwinder didn't screw that
> >up, at least.
> >
> >This looks like a tricky one.  Is it easily recreatable?
> 
> Yeah, I just hit it again with slightly different initial calls:

Sasha sent me some data privately.  As I suspected, the cause is some
bad ORC data.  Objtool incorrectly assumes that once the frame pointer
is set up, it no longer gets touched.

For example:

  ffffffff81820680 <pipe_wait>:
  ffffffff81820680:       41 56                   push   %r14
  ffffffff81820682:       41 55                   push   %r13
  ffffffff81820684:       41 54                   push   %r12
  ffffffff81820686:       49 89 fc                mov    %rdi,%r12
  ffffffff81820689:       55                      push   %rbp
  ffffffff8182068a:       53                      push   %rbx
  ffffffff8182068b:       48 bb 00 00 00 00 00    movabs $0xdffffc0000000000,%rbx
  ffffffff81820692:       fc ff df
  ffffffff81820695:       48 83 c4 80             add    $0xffffffffffffff80,%rsp
  ffffffff81820699:       48 89 e5                mov    %rsp,%rbp
  ffffffff8182069c:       48 c7 04 24 b3 8a b5    movq   $0x41b58ab3,(%rsp)
  ffffffff818206a3:       41
  ffffffff818206a4:       48 c7 44 24 08 07 a5    movq   $0xffffffff8621a507,0x8(%rsp)
  ffffffff818206ab:       21 86
                          ffffffff818206a9: R_X86_64_32S  .rodata+0xa1a507
  ffffffff818206ad:       48 c1 ed 03             shr    $0x3,%rbp

In this case, rbp was pushed ("push %rbp") and then replaced with rsp
("mov %rsp, %rbp"), which is the normal frame pointer setup.  But then
rbp was modified ("shr $0x3,%rbp"), which objtool didn't expect.

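The flawed invariant can be modeled in a few lines of C (hypothetical names; this is only a sketch of the assumption described above, not objtool's actual state tracking):

```c
#include <assert.h>

/* Hypothetical model of per-instruction frame-pointer tracking. */
enum reg_state { REG_UNDEFINED, REG_SAVED, REG_FRAME_POINTER };

struct insn_state {
	enum reg_state bp;	/* what we believe rbp holds */
	int sp_offset;		/* bytes pushed since function entry */
};

/* Model "push %rbp; mov %rsp,%rbp": the standard frame setup. */
static void setup_frame(struct insn_state *s)
{
	s->sp_offset += 8;		/* push %rbp */
	s->bp = REG_FRAME_POINTER;	/* mov %rsp,%rbp */
}

/* A naive tracker has no transition out of REG_FRAME_POINTER, so an
 * instruction like "shr $0x3,%rbp" leaves it believing rbp still
 * points at the frame -- and the emitted ORC data is wrong. */
static int bp_usable_naive(const struct insn_state *s)
{
	return s->bp == REG_FRAME_POINTER;
}
```

After setup_frame(), a tracker like this keeps trusting rbp even once "shr $0x3,%rbp" has clobbered it, which is exactly how the bad ORC data gets emitted.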
Objtool will need to be made smarter here somehow.  I'll be on vacation
next week so it might be a week or so before I can come up with the fix.

Despite the scary KASAN warning, this is only a minor bug.  The ORC data
isn't perfect yet, so these types of issues will happen until we get the
kinks worked out.  The good news is the unwinder recovered from the bad
ORC data gracefully, and the oops dump still showed the rest of the
addresses (with question marks).

-- 
Josh


* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-07-29  3:54             ` Josh Poimboeuf
@ 2017-08-08 18:58               ` Josh Poimboeuf
  2017-08-08 19:03                 ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Josh Poimboeuf @ 2017-08-08 18:58 UTC (permalink / raw)
  To: Levin, Alexander (Sasha Levin)
  Cc: x86, linux-kernel, live-patching, Linus Torvalds,
	Andy Lutomirski, Jiri Slaby, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Mike Galbraith

On Fri, Jul 28, 2017 at 10:54:37PM -0500, Josh Poimboeuf wrote:
> On Fri, Jul 28, 2017 at 07:59:12PM +0000, Levin, Alexander (Sasha Levin) wrote:
> > On Fri, Jul 28, 2017 at 01:57:20PM -0500, Josh Poimboeuf wrote:
> > >Thanks, that's much better.  I'm relieved the unwinder didn't screw that
> > >up, at least.
> > >
> > >This looks like a tricky one.  Is it easily recreatable?
> > 
> > Yeah, I just hit it again with slightly different initial calls:
> 
> Sasha sent me some data privately.  As I suspected, the cause is some
> bad ORC data.  Objtool incorrectly assumes that once the frame pointer
> is set up, it no longer gets touched.

So as it turns out, my pre-vacation self was completely wrong.  The
problem is actually some runtime instruction patching which affects the
accuracy of the ORC data.

Take for example the lock_is_held_type() function.  In vmlinux, it has
the following instruction:

  callq *0xffffffff85a94880 (pv_irq_ops.save_fl)

At runtime, that instruction is patched and replaced with a fast inline
version of arch_local_save_flags() which eliminates the call:

  pushfq
  pop %rax

The problem is when an interrupt hits after the push:

  pushfq
  --- irq ---
  pop %rax

The push offsets the stack pointer by 8 bytes, confusing the ORC
unwinder when it tries to unwind from the IRQ.

The race should be somewhat rare, though Sasha has no problems hitting
it with syzkaller.

I'm not sure what the solution should be.  It will probably need to be
one of the following:

  a) either don't allow runtime "alternative" patches to mess with the
     stack pointer (objtool could enforce this); or

  b) come up with some way to register such patches with the ORC
     unwinder at runtime.

I haven't looked much at either option to see how feasible they would
be.

Any thoughts/opinions?

-- 
Josh


* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-08 18:58               ` Josh Poimboeuf
@ 2017-08-08 19:03                 ` Linus Torvalds
  2017-08-08 19:13                   ` Josh Poimboeuf
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2017-08-08 19:03 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Levin, Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Andy Lutomirski, Jiri Slaby,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Mike Galbraith

On Tue, Aug 8, 2017 at 11:58 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> Take for example the lock_is_held_type() function.  In vmlinux, it has
> the following instruction:
>
>   callq *0xffffffff85a94880 (pv_irq_ops.save_fl)
>
> At runtime, that instruction is patched and replaced with a fast inline
> version of arch_local_save_flags() which eliminates the call:
>
>   pushfq
>   pop %rax
>
> The problem is when an interrupt hits after the push:
>
>   pushfq
>   --- irq ---
>   pop %rax

That should actually be something easily fixable, for an odd reason:
the instruction boundaries are different.

> I'm not sure what the solution should be.  It will probably need to be
> one of the following:
>
>   a) either don't allow runtime "alternative" patches to mess with the
>      stack pointer (objtool could enforce this); or
>
>   b) come up with some way to register such patches with the ORC
>      unwinder at runtime.

c) just add ORC data for the alternative statically and _unconditionally_.

No runtime registration. Just an unconditional entry for the
particular IP that comes after the "pushfq". It cannot match the
"callq" instruction, since it would be in the middle of that
instruction.

Basically, just do a "union" of the ORC data for all the alternatives.

Now, objtool should still verify that the instruction pointers for
alternatives are unique - or that they share the same ORC unwinder
information if they are not.

But in cases like this, when the instruction boundaries are different,
things should "just work", with no need for any special cases.

Hmm?

                    Linus


* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-08 19:03                 ` Linus Torvalds
@ 2017-08-08 19:13                   ` Josh Poimboeuf
  2017-08-08 20:09                     ` Andy Lutomirski
  0 siblings, 1 reply; 38+ messages in thread
From: Josh Poimboeuf @ 2017-08-08 19:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Levin, Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Andy Lutomirski, Jiri Slaby,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Mike Galbraith

On Tue, Aug 08, 2017 at 12:03:51PM -0700, Linus Torvalds wrote:
> On Tue, Aug 8, 2017 at 11:58 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >
> > Take for example the lock_is_held_type() function.  In vmlinux, it has
> > the following instruction:
> >
> >   callq *0xffffffff85a94880 (pv_irq_ops.save_fl)
> >
> > At runtime, that instruction is patched and replaced with a fast inline
> > version of arch_local_save_flags() which eliminates the call:
> >
> >   pushfq
> >   pop %rax
> >
> > The problem is when an interrupt hits after the push:
> >
> >   pushfq
> >   --- irq ---
> >   pop %rax
> 
> That should actually be something easily fixable, for an odd reason:
> the instruction boundaries are different.
> 
> > I'm not sure what the solution should be.  It will probably need to be
> > one of the following:
> >
> >   a) either don't allow runtime "alternative" patches to mess with the
> >      stack pointer (objtool could enforce this); or
> >
> >   b) come up with some way to register such patches with the ORC
> >      unwinder at runtime.
> 
> c) just add ORC data for the alternative statically and _unconditionally_.
> 
> No runtime registration. Just an unconditional entry for the
> particular IP that comes after the "pushfq". It cannot match the
> "callq" instruction, since it would be in the middle of that
> instruction.
> 
> Basically, just do a "union" of the ORC data for all the alternatives.
> 
> Now, objtool should still verify that the instruction pointers for
> alternatives are unique - or that they share the same ORC unwinder
> information if they are not.
> 
> But in cases like this, when the instruction boundaries are different,
> things should "just work", with no need for any special cases.
> 
> Hmm?

Yeah, that might work.  Objtool already knows about alternatives, so it
might not be too hard.  I'll try it.

And it can spit out a warning if we get two different ORC states for the
same address after doing the "union".  Then I guess we'd have to
rearrange things or sprinkle some nops to work around it.
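The merge-and-verify step could be sketched like this (illustrative C, not objtool code): take the union of every alternative's ORC data keyed by IP, and fail loudly when a single IP would need two different unwind states:

```c
#include <stddef.h>

/* Toy "union" table: one row per IP.  Names and layout are
 * illustrative only. */
struct orc_row {
	unsigned long ip;
	int sp_offset;
};

/* Returns 0 on success, -1 if the same IP already has different
 * unwind data -- the conflict the warning above would catch. */
static int orc_union_add(struct orc_row *tbl, size_t *n,
			 unsigned long ip, int sp_offset)
{
	size_t i;

	for (i = 0; i < *n; i++) {
		if (tbl[i].ip != ip)
			continue;
		return tbl[i].sp_offset == sp_offset ? 0 : -1;
	}
	tbl[(*n)++] = (struct orc_row){ ip, sp_offset };
	return 0;
}
```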

-- 
Josh


* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-08 19:13                   ` Josh Poimboeuf
@ 2017-08-08 20:09                     ` Andy Lutomirski
  2017-08-08 22:00                       ` Josh Poimboeuf
                                         ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Andy Lutomirski @ 2017-08-08 20:09 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Linus Torvalds, Levin, Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Andy Lutomirski, Jiri Slaby,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Mike Galbraith

On Tue, Aug 8, 2017 at 12:13 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Tue, Aug 08, 2017 at 12:03:51PM -0700, Linus Torvalds wrote:
>> On Tue, Aug 8, 2017 at 11:58 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >
>> > Take for example the lock_is_held_type() function.  In vmlinux, it has
>> > the following instruction:
>> >
>> >   callq *0xffffffff85a94880 (pv_irq_ops.save_fl)
>> >
>> > At runtime, that instruction is patched and replaced with a fast inline
>> > version of arch_local_save_flags() which eliminates the call:
>> >
>> >   pushfq
>> >   pop %rax
>> >
>> > The problem is when an interrupt hits after the push:
>> >
>> >   pushfq
>> >   --- irq ---
>> >   pop %rax
>>
>> That should actually be something easily fixable, for an odd reason:
>> the instruction boundaries are different.
>>
>> > I'm not sure what the solution should be.  It will probably need to be
>> > one of the following:
>> >
>> >   a) either don't allow runtime "alternative" patches to mess with the
>> >      stack pointer (objtool could enforce this); or
>> >
>> >   b) come up with some way to register such patches with the ORC
>> >      unwinder at runtime.
>>
>> c) just add ORC data for the alternative statically and _unconditionally_.
>>
>> No runtime registration. Just an unconditional entry for the
>> particular IP that comes after the "pushfq". It cannot match the
>> "callq" instruction, since it would be in the middle of that
>> instruction.
>>
>> Basically, just do a "union" of the ORC data for all the alternatives.
>>
>> Now, objtool should still verify that the instruction pointers for
>> alternatives are unique - or that they share the same ORC unwinder
>> information if they are not.
>>
>> But in cases like this, when the instruction boundaries are different,
>> things should "just work", with no need for any special cases.
>>
>> Hmm?
>
> Yeah, that might work.  Objtool already knows about alternatives, so it
> might not be too hard.  I'll try it.

But this one's not an actual alternative, right?  It's a pv op.

I would advocate that we make it an alternative after all.  I frickin'
hate the PV irq ops.  It would look roughly like this:

ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
X86_FEATURE_GODDAMN_PV_IRQ_OPS

(The obvious syntax error and the naming should probably be fixed.
Also, this needs to live in an #ifdef because it needs to build on
kernels with pv support.  It should also properly register itself as a
pv patch site.)

Semi-serious question: can we maybe delete lguest and 32-bit Xen PV
support some time soon?  As far as I know, 32-bit Xen PV *hosts* are
all EOL and have no security support, 32-bit Xen PV guest dom0 may not
work (I've never tried, but it would certainly be nutty on a 64-bit
hypervisor), and lguest is, um, not seriously maintained any more. [1]

[1] A while back I complained that I couldn't get lguest to boot.
Someone replied with multiple workarounds for known bugs that make it
not boot.  I have a hard time believing that anyone uses it for
anything other than trying to test that it still works.


* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-08 20:09                     ` Andy Lutomirski
@ 2017-08-08 22:00                       ` Josh Poimboeuf
  2017-08-09  8:49                       ` Juergen Gross
  2017-09-27 21:08                       ` Josh Poimboeuf
  2 siblings, 0 replies; 38+ messages in thread
From: Josh Poimboeuf @ 2017-08-08 22:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Levin, Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra, Mike Galbraith, Borislav Petkov

On Tue, Aug 08, 2017 at 01:09:08PM -0700, Andy Lutomirski wrote:
> >> c) just add ORC data for the alternative statically and _unconditionally_.
> >>
> >> No runtime registration. Just an unconditional entry for the
> >> particular IP that comes after the "pushfq". It cannot match the
> >> "callq" instruction, since it would be in the middle of that
> >> instruction.
> >>
> >> Basically, just do a "union" of the ORC data for all the alternatives.
> >>
> >> Now, objtool should still verify that the instruction pointers for
> >> alternatives are unique - or that they share the same ORC unwinder
> >> information if they are not.
> >>
> >> But in cases like this, when the instruction boundaires are different,
> >> things should "just work", with no need for any special cases.
> >>
> >> Hmm?
> >
> > Yeah, that might work.  Objtool already knows about alternatives, so it
> > might not be too hard.  I'll try it.
> 
> But this one's not an actual alternative, right?  It's a pv op.

Ah, right.  Objtool doesn't know about paravirt patching, unfortunately.

> I would advocate that we make it an alternative after all.  I frickin'
> hate the PV irq ops.  It would look roughly like this:
> 
> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
> X86_FEATURE_GODDAMN_PV_IRQ_OPS
> 
> (The obvious syntax error and the naming should probably be fixed.
> Also, this needs to live in an #ifdef because it needs to build on
> kernels with pv support.  It should also properly register itself as a
> pv patch site.)

Yeah, that would be really nice, assuming it's possible.  Otherwise I'll
need to teach objtool about the paravirt patches.

-- 
Josh


* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-08 20:09                     ` Andy Lutomirski
  2017-08-08 22:00                       ` Josh Poimboeuf
@ 2017-08-09  8:49                       ` Juergen Gross
  2017-08-09  9:16                         ` Peter Zijlstra
  2017-08-09 16:11                         ` Andy Lutomirski
  2017-09-27 21:08                       ` Josh Poimboeuf
  2 siblings, 2 replies; 38+ messages in thread
From: Juergen Gross @ 2017-08-09  8:49 UTC (permalink / raw)
  To: Andy Lutomirski, Josh Poimboeuf
  Cc: Linus Torvalds, Levin, Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra, Mike Galbraith

On 08/08/17 22:09, Andy Lutomirski wrote:
> On Tue, Aug 8, 2017 at 12:13 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> On Tue, Aug 08, 2017 at 12:03:51PM -0700, Linus Torvalds wrote:
>>> On Tue, Aug 8, 2017 at 11:58 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>
>>>> Take for example the lock_is_held_type() function.  In vmlinux, it has
>>>> the following instruction:
>>>>
>>>>   callq *0xffffffff85a94880 (pv_irq_ops.save_fl)
>>>>
>>>> At runtime, that instruction is patched and replaced with a fast inline
>>>> version of arch_local_save_flags() which eliminates the call:
>>>>
>>>>   pushfq
>>>>   pop %rax
>>>>
>>>> The problem is when an interrupt hits after the push:
>>>>
>>>>   pushfq
>>>>   --- irq ---
>>>>   pop %rax
>>>
>>> That should actually be something easily fixable, for an odd reason:
>>> the instruction boundaries are different.
>>>
>>>> I'm not sure what the solution should be.  It will probably need to be
>>>> one of the following:
>>>>
>>>>   a) either don't allow runtime "alternative" patches to mess with the
>>>>      stack pointer (objtool could enforce this); or
>>>>
>>>>   b) come up with some way to register such patches with the ORC
>>>>      unwinder at runtime.
>>>
>>> c) just add ORC data for the alternative statically and _unconditionally_.
>>>
>>> No runtime registration. Just an unconditional entry for the
>>> particular IP that comes after the "pushfq". It cannot match the
>>> "callq" instruction, since it would be in the middle of that
>>> instruction.
>>>
>>> Basically, just do a "union" of the ORC data for all the alternatives.
>>>
>>> Now, objtool should still verify that the instruction pointers for
>>> alternatives are unique - or that they share the same ORC unwinder
>>> information if they are not.
>>>
>>> But in cases like this, when the instruction boundaries are different,
>>> things should "just work", with no need for any special cases.
>>>
>>> Hmm?
>>
>> Yeah, that might work.  Objtool already knows about alternatives, so it
>> might not be too hard.  I'll try it.
> 
> But this one's not an actual alternative, right?  It's a pv op.
> 
> I would advocate that we make it an alternative after all.  I frickin'
> hate the PV irq ops.  It would look roughly like this:
> 
> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
> X86_FEATURE_GODDAMN_PV_IRQ_OPS

You are aware that at least some of the Xen irq pvops functionality is
patched inline? Your modification would slow down pv guests quite a
bit, I guess.

> (The obvious syntax error and the naming should probably be fixed.
> Also, this needs to live in an #ifdef because it needs to build on
> kernels with pv support.  It should also properly register itself as a
> pv patch site.)
> 
> Semi-serious question: can we maybe delete lguest and 32-bit Xen PV
> support some time soon?  As far as I know, 32-bit Xen PV *hosts* are
> all EOL and have no security support, 32-bit Xen PV guest dom0 may not
> work (I've never tried, but it would certainly be nutty on a 64-bit
> hypervisor), and lguest is, um, not seriously maintained any more. [1]

Hmm, I suggested dropping lguest support about 3 months ago and got no
response.  OTOH there was no objection either. :-)

Regarding 32 bit Xen PV guests: even 32 bit dom0 is supposed to work.
In case you want to drop support in Linux you might ask that question
on xen-devel@lists.xenproject.org (with PVH support for guests nearly
complete your chances might be >0 to succeed).


Juergen


* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-09  8:49                       ` Juergen Gross
@ 2017-08-09  9:16                         ` Peter Zijlstra
  2017-08-09  9:24                           ` Juergen Gross
  2017-08-09 16:11                         ` Andy Lutomirski
  1 sibling, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2017-08-09  9:16 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On Wed, Aug 09, 2017 at 10:49:43AM +0200, Juergen Gross wrote:
> > ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
> > X86_FEATURE_GODDAMN_PV_IRQ_OPS
> 
> You are aware that at least some of the Xen irq pvops functionality is
> patched inline? Your modification would slow down pv guests quite a
> bit, I guess.

Where does that live? I know of the inline patching for native, but
didn't know the guests did any of that too.


* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-09  9:16                         ` Peter Zijlstra
@ 2017-08-09  9:24                           ` Juergen Gross
  2017-08-09  9:35                             ` Peter Zijlstra
  0 siblings, 1 reply; 38+ messages in thread
From: Juergen Gross @ 2017-08-09  9:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On 09/08/17 11:16, Peter Zijlstra wrote:
> On Wed, Aug 09, 2017 at 10:49:43AM +0200, Juergen Gross wrote:
>>> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
>>> X86_FEATURE_GODDAMN_PV_IRQ_OPS
>>
>> You are aware that at least some of the Xen irq pvops functionality is
>> patched inline? Your modification would slow down pv guests quite a
>> bit, I guess.
> 
> Where does that live? I know of the inline patching for native, but
> didn't know the guests did any of that too.

See arch/x86/xen/enlighten_pv.c xen_patch().


Juergen


* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-09  9:24                           ` Juergen Gross
@ 2017-08-09  9:35                             ` Peter Zijlstra
  2017-08-09  9:55                               ` Juergen Gross
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2017-08-09  9:35 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On Wed, Aug 09, 2017 at 11:24:07AM +0200, Juergen Gross wrote:
> On 09/08/17 11:16, Peter Zijlstra wrote:
> > On Wed, Aug 09, 2017 at 10:49:43AM +0200, Juergen Gross wrote:
> >>> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
> >>> X86_FEATURE_GODDAMN_PV_IRQ_OPS
> >>
> >> You are aware that at least some of the Xen irq pvops functionality is
> >> patched inline? Your modification would slow down pv guests quite a
> >> bit, I guess.
> > 
> > Where does that live? I know of the inline patching for native, but
> > didn't know the guests did any of that too.
> 
> See arch/x86/xen/enlighten_pv.c xen_patch().

'obvious' name that :-) I see that the actual code that's patched in
lives in xen-asm.S which unlike the native case doesn't appear to have
its own section. So that might make things even more difficult.

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-09  9:35                             ` Peter Zijlstra
@ 2017-08-09  9:55                               ` Juergen Gross
  2017-08-09 20:15                                 ` Josh Poimboeuf
  0 siblings, 1 reply; 38+ messages in thread
From: Juergen Gross @ 2017-08-09  9:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On 09/08/17 11:35, Peter Zijlstra wrote:
> On Wed, Aug 09, 2017 at 11:24:07AM +0200, Juergen Gross wrote:
>> On 09/08/17 11:16, Peter Zijlstra wrote:
>>> On Wed, Aug 09, 2017 at 10:49:43AM +0200, Juergen Gross wrote:
>>>>> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
>>>>> X86_FEATURE_GODDAMN_PV_IRQ_OPS
>>>>
>>>> You are aware that at least some of the Xen irq pvops functionality is
>>>> patched inline? Your modification would slow down pv guests quite a
>>>> bit, I guess.
>>>
>>> Where does that live? I know of the inline patching for native, but
>>> didn't know the guests did any of that too.
>>
>> See arch/x86/xen/enlighten_pv.c xen_patch().
> 
> 'obvious' name that :-) I see that the actual code that's patched in
> lives in xen-asm.S which unlike the native case doesn't appear to have
> its own section. So that might make things even more difficult.

I don't see why this couldn't be changed.


Juergen

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-09  8:49                       ` Juergen Gross
  2017-08-09  9:16                         ` Peter Zijlstra
@ 2017-08-09 16:11                         ` Andy Lutomirski
  2017-08-09 17:52                           ` Juergen Gross
  1 sibling, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2017-08-09 16:11 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andy Lutomirski, Josh Poimboeuf, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra, Mike Galbraith

On Wed, Aug 9, 2017 at 1:49 AM, Juergen Gross <jgross@suse.com> wrote:
> On 08/08/17 22:09, Andy Lutomirski wrote:
>> On Tue, Aug 8, 2017 at 12:13 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> On Tue, Aug 08, 2017 at 12:03:51PM -0700, Linus Torvalds wrote:
>>>> On Tue, Aug 8, 2017 at 11:58 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>>
>>>>> Take for example the lock_is_held_type() function.  In vmlinux, it has
>>>>> the following instruction:
>>>>>
>>>>>   callq *0xffffffff85a94880 (pv_irq_ops.save_fl)
>>>>>
>>>>> At runtime, that instruction is patched and replaced with a fast inline
>>>>> version of arch_local_save_flags() which eliminates the call:
>>>>>
>>>>>   pushfq
>>>>>   pop %rax
>>>>>
>>>>> The problem is when an interrupt hits after the push:
>>>>>
>>>>>   pushfq
>>>>>   --- irq ---
>>>>>   pop %rax
>>>>
>>>> That should actually be something easily fixable, for an odd reason:
>>>> the instruction boundaries are different.
>>>>
>>>>> I'm not sure what the solution should be.  It will probably need to be
>>>>> one of the following:
>>>>>
>>>>>   a) either don't allow runtime "alternative" patches to mess with the
>>>>>      stack pointer (objtool could enforce this); or
>>>>>
>>>>>   b) come up with some way to register such patches with the ORC
>>>>>      unwinder at runtime.
>>>>
>>>> c) just add ORC data for the alternative statically and _unconditionally_.
>>>>
>>>> No runtime registration. Just an unconditional entry for the
>>>> particular IP that comes after the "pushfq". It cannot match the
>>>> "callq" instruction, since it would be in the middle of that
>>>> instruction.
>>>>
>>>> Basically, just do a "union" of the ORC data for all the alternatives.
>>>>
>>>> Now, objtool should still verify that the instruction pointers for
>>>> alternatives are unique - or that they share the same ORC unwinder
>>>> information if they are not.
>>>>
>>>> But in cases like this, when the instruction boundaries are different,
>>>> things should "just work", with no need for any special cases.
>>>>
>>>> Hmm?
>>>
>>> Yeah, that might work.  Objtool already knows about alternatives, so it
>>> might not be too hard.  I'll try it.
>>
>> But this one's not an actual alternative, right?  It's a pv op.
>>
>> I would advocate that we make it an alternative after all.  I frickin'
>> hate the PV irq ops.  It would look roughly like this:
>>
>> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
>> X86_FEATURE_GODDAMN_PV_IRQ_OPS
>
> You are aware that at least some of the Xen irq pvops functionality is
> patched inline? Your modification would slow down pv guests quite a
> bit, I guess.

Yes, but what I had in mind was having both the alternative *and* the
paravirt patch entry.  We'd obviously have to make sure to run
alternatives before paravirt patching, but that's possibly already the
case.

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-09 16:11                         ` Andy Lutomirski
@ 2017-08-09 17:52                           ` Juergen Gross
  0 siblings, 0 replies; 38+ messages in thread
From: Juergen Gross @ 2017-08-09 17:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, Linus Torvalds, Levin, Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra, Mike Galbraith

On 09/08/17 18:11, Andy Lutomirski wrote:
> On Wed, Aug 9, 2017 at 1:49 AM, Juergen Gross <jgross@suse.com> wrote:
>> On 08/08/17 22:09, Andy Lutomirski wrote:
>>> On Tue, Aug 8, 2017 at 12:13 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>> On Tue, Aug 08, 2017 at 12:03:51PM -0700, Linus Torvalds wrote:
>>>>> On Tue, Aug 8, 2017 at 11:58 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>>>
>>>>>> Take for example the lock_is_held_type() function.  In vmlinux, it has
>>>>>> the following instruction:
>>>>>>
>>>>>>   callq *0xffffffff85a94880 (pv_irq_ops.save_fl)
>>>>>>
>>>>>> At runtime, that instruction is patched and replaced with a fast inline
>>>>>> version of arch_local_save_flags() which eliminates the call:
>>>>>>
>>>>>>   pushfq
>>>>>>   pop %rax
>>>>>>
>>>>>> The problem is when an interrupt hits after the push:
>>>>>>
>>>>>>   pushfq
>>>>>>   --- irq ---
>>>>>>   pop %rax
>>>>>
>>>>> That should actually be something easily fixable, for an odd reason:
>>>>> the instruction boundaries are different.
>>>>>
>>>>>> I'm not sure what the solution should be.  It will probably need to be
>>>>>> one of the following:
>>>>>>
>>>>>>   a) either don't allow runtime "alternative" patches to mess with the
>>>>>>      stack pointer (objtool could enforce this); or
>>>>>>
>>>>>>   b) come up with some way to register such patches with the ORC
>>>>>>      unwinder at runtime.
>>>>>
>>>>> c) just add ORC data for the alternative statically and _unconditionally_.
>>>>>
>>>>> No runtime registration. Just an unconditional entry for the
>>>>> particular IP that comes after the "pushfq". It cannot match the
>>>>> "callq" instruction, since it would be in the middle of that
>>>>> instruction.
>>>>>
>>>>> Basically, just do a "union" of the ORC data for all the alternatives.
>>>>>
>>>>> Now, objtool should still verify that the instruction pointers for
>>>>> alternatives are unique - or that they share the same ORC unwinder
>>>>> information if they are not.
>>>>>
>>>>> But in cases like this, when the instruction boundaries are different,
>>>>> things should "just work", with no need for any special cases.
>>>>>
>>>>> Hmm?
>>>>
>>>> Yeah, that might work.  Objtool already knows about alternatives, so it
>>>> might not be too hard.  I'll try it.
>>>
>>> But this one's not an actual alternative, right?  It's a pv op.
>>>
>>> I would advocate that we make it an alternative after all.  I frickin'
>>> hate the PV irq ops.  It would look roughly like this:
>>>
>>> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
>>> X86_FEATURE_GODDAMN_PV_IRQ_OPS
>>
>> You are aware that at least some of the Xen irq pvops functionality is
>> patched inline? Your modification would slow down pv guests quite a
>> bit, I guess.
> 
> Yes, but what I had in mind was having both the alternative *and* the
> paravirt patch entry.  We'd obviously have to make sure to run
> alternatives before paravirt patching, but that's possibly already the
> case.

So instead of having the "callq *pv_irq_ops.save_fl" as initial code you
would end up with the "pushfq; popq %rax" until the alternatives are
applied.

I don't think this will work. In the end you are not allowed to use any
irq ops in a Xen guest until the alternatives have been applied. And I
think that happens rather late compared to the first usage of any irq
ops.

And if you swap oldinstr and newinstr in the ALTERNATIVE usage above,
you end up with exactly the same as today's pvops, just with a different
patching mechanism for the bare metal case. And you would need more
table entries (pvops and alternatives) for the same functionality.

Or do I miss something here?


Juergen

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-09  9:55                               ` Juergen Gross
@ 2017-08-09 20:15                                 ` Josh Poimboeuf
  2017-08-10  7:05                                   ` Juergen Gross
  0 siblings, 1 reply; 38+ messages in thread
From: Josh Poimboeuf @ 2017-08-09 20:15 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Peter Zijlstra, Andy Lutomirski, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On Wed, Aug 09, 2017 at 11:55:35AM +0200, Juergen Gross wrote:
> On 09/08/17 11:35, Peter Zijlstra wrote:
> > On Wed, Aug 09, 2017 at 11:24:07AM +0200, Juergen Gross wrote:
> >> On 09/08/17 11:16, Peter Zijlstra wrote:
> >>> On Wed, Aug 09, 2017 at 10:49:43AM +0200, Juergen Gross wrote:
> >>>>> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
> >>>>> X86_FEATURE_GODDAMN_PV_IRQ_OPS
> >>>>
> >>>> You are aware that at least some of the Xen irq pvops functionality is
> >>>> patched inline? Your modification would slow down pv guests quite a
> >>>> bit, I guess.
> >>>
> >>> Where does that live? I know of the inline patching for native, but
> >>> didn't know the guests did any of that too.
> >>
> >> See arch/x86/xen/enlighten_pv.c xen_patch().
> > 
> > 'obvious' name that :-) I see that the actual code that's patched in
> > lives in xen-asm.S which unlike the native case doesn't appear to have
> > its own section. So that might make things even more difficult.
> 
> I don't see why this couldn't be changed.

I'm wondering why xen_patch() even exists.  The main difference between
xen_patch() and native_patch() seems to be that xen_patch() does some
relocs when doing an inline patch after calling paravirt_patch_insns().

But I can't see how that code path would ever run, because the
replacement functions are all larger than the size of the call
instruction to be replaced (7 bytes).  So they would never fit, and
instead the paravirt_patch_default() case would always run.  Or am I
missing something?

If we could get rid of the hypervisor-specific patching functions
(pv_init_ops) -- including possibly removing the lguest and vsmp code,
if nobody cares about them anymore -- that might make it easier to
consolidate all the patching things into a single place.

For example, maybe the paravirt patching could just use alternatives
under the hood somehow (insert hand waving).

-- 
Josh

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-09 20:15                                 ` Josh Poimboeuf
@ 2017-08-10  7:05                                   ` Juergen Gross
  2017-08-10 14:09                                     ` Josh Poimboeuf
  0 siblings, 1 reply; 38+ messages in thread
From: Juergen Gross @ 2017-08-10  7:05 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Andy Lutomirski, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On 09/08/17 22:15, Josh Poimboeuf wrote:
> On Wed, Aug 09, 2017 at 11:55:35AM +0200, Juergen Gross wrote:
>> On 09/08/17 11:35, Peter Zijlstra wrote:
>>> On Wed, Aug 09, 2017 at 11:24:07AM +0200, Juergen Gross wrote:
>>>> On 09/08/17 11:16, Peter Zijlstra wrote:
>>>>> On Wed, Aug 09, 2017 at 10:49:43AM +0200, Juergen Gross wrote:
>>>>>>> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
>>>>>>> X86_FEATURE_GODDAMN_PV_IRQ_OPS
>>>>>>
>>>>>> You are aware that at least some of the Xen irq pvops functionality is
>>>>>> patched inline? Your modification would slow down pv guests quite a
>>>>>> bit, I guess.
>>>>>
>>>>> Where does that live? I know of the inline patching for native, but
>>>>> didn't know the guests did any of that too.
>>>>
>>>> See arch/x86/xen/enlighten_pv.c xen_patch().
>>>
>>> 'obvious' name that :-) I see that the actual code that's patched in
>>> lives in xen-asm.S which unlike the native case doesn't appear to have
>>> its own section. So that might make things even more difficult.
>>
>> I don't see why this couldn't be changed.
> 
> I'm wondering why xen_patch() even exists.  The main difference between
> xen_patch() and native_patch() seems to be that xen_patch() does some
> relocs when doing an inline patch after calling paravirt_patch_insns().
> 
> But I can't see how that code path would ever run, because the
> replacement functions are all larger than the size of the call
> instruction to be replaced (7 bytes).  So they would never fit, and
> instead the paravirt_patch_default() case would always run.  Or am I
> missing something?

Hmm, interesting. Just checked it and it seems you are right.

> If we could get rid of the hypervisor-specific patching functions
> (pv_init_ops) -- including possibly removing the lguest and vsmp code,
> if nobody cares about them anymore -- that might make it easier to
> consolidate all the patching things into a single place.

I'll send some patches to:

- remove xen_patch()
- remove lguest
- remove vsmp

If nobody objects to applying those patches, we can possibly simplify
some more code.

I'd love that. :-)


Juergen

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-10  7:05                                   ` Juergen Gross
@ 2017-08-10 14:09                                     ` Josh Poimboeuf
  2017-08-10 14:24                                       ` Juergen Gross
  2017-08-10 14:42                                       ` Josh Poimboeuf
  0 siblings, 2 replies; 38+ messages in thread
From: Josh Poimboeuf @ 2017-08-10 14:09 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Peter Zijlstra, Andy Lutomirski, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On Thu, Aug 10, 2017 at 09:05:19AM +0200, Juergen Gross wrote:
> > I'm wondering why xen_patch() even exists.  The main difference between
> > xen_patch() and native_patch() seems to be that xen_patch() does some
> > relocs when doing an inline patch after calling paravirt_patch_insns().
> > 
> > But I can't see how that code path would ever run, because the
> > replacement functions are all larger than the size of the call
> > instruction to be replaced (7 bytes).  So they would never fit, and
> > instead the paravirt_patch_default() case would always run.  Or am I
> > missing something?
> 
> Hmm, interesting. Just checked it and it seems you are right.
> 
> > If we could get rid of the hypervisor-specific patching functions
> > (pv_init_ops) -- including possibly removing the lguest and vsmp code,
> > if nobody cares about them anymore -- that might make it easier to
> > consolidate all the patching things into a single place.
> 
> I'll send some patches to:
> 
> - remove xen_patch()
> - remove lguest
> - remove vsmp
> 
> In case nobody objects to apply those patches we can possibly simplify
> some more code.
> 
> I'd love that. :-)

Well, I might have spoken too soon about getting rid of vsmp.  The
scalemp.com domain still exists.  The code hasn't changed much in three
years, but maybe it's simple enough that it hasn't needed to change.

Also, looking at the lguest mailing list, there seem to have been at
least a few people trying lguest out in the past year or so.

Even if we couldn't get rid of vsmp or lguest, I wonder if the PVOP_CALL
stuff could be reworked to something like the following:

static inline notrace unsigned long arch_local_save_flags(void)
{
	return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
			    "pushfq; popq %rax", CPU_FEATURE_NATIVE,
			    "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
			    "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
			    "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST);
} 

Which would eventually translate to something like:

asm volatile(ALTERNATIVE_4("call *pv_irq_ops.save_fl",
			   "pushfq; popq %rax", CPU_FEATURE_NATIVE,
			   "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
			   "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
			   "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST
			   : ... pvop clobber stuff ... );

where ALTERNATIVE_4 is a logical extension of ALTERNATIVE_2 and
CPU_FEATURE_NATIVE would always be set.

It might need some more macro magic, but if it worked I think it would
be a lot clearer than the current voodoo.

Thoughts?

-- 
Josh

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-10 14:09                                     ` Josh Poimboeuf
@ 2017-08-10 14:24                                       ` Juergen Gross
  2017-08-10 14:39                                         ` Josh Poimboeuf
  2017-08-10 14:42                                       ` Josh Poimboeuf
  1 sibling, 1 reply; 38+ messages in thread
From: Juergen Gross @ 2017-08-10 14:24 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Andy Lutomirski, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On 10/08/17 16:09, Josh Poimboeuf wrote:
> On Thu, Aug 10, 2017 at 09:05:19AM +0200, Juergen Gross wrote:
>>> I'm wondering why xen_patch() even exists.  The main difference between
>>> xen_patch() and native_patch() seems to be that xen_patch() does some
>>> relocs when doing an inline patch after calling paravirt_patch_insns().
>>>
>>> But I can't see how that code path would ever run, because the
>>> replacement functions are all larger than the size of the call
>>> instruction to be replaced (7 bytes).  So they would never fit, and
>>> instead the paravirt_patch_default() case would always run.  Or am I
>>> missing something?
>>
>> Hmm, interesting. Just checked it and it seems you are right.
>>
>>> If we could get rid of the hypervisor-specific patching functions
>>> (pv_init_ops) -- including possibly removing the lguest and vsmp code,
>>> if nobody cares about them anymore -- that might make it easier to
>>> consolidate all the patching things into a single place.
>>
>> I'll send some patches to:
>>
>> - remove xen_patch()
>> - remove lguest
>> - remove vsmp
>>
>> In case nobody objects to apply those patches we can possibly simplify
>> some more code.
>>
>> I'd love that. :-)
> 
> Well, I might have spoken too soon about getting rid of vsmp.  The
> scalemp.com domain still exists.  The code hasn't changed much in three
> years, but maybe it's simple enough that it hasn't needed to change.

Let's see. In my experience, asking whether some code can be removed
almost never gets an answer. Sending a patch which actually removes the
stuff much more often results in objections. :-)

> Also, looking at the lguest mailing list, there seem to have been at
> least a few people trying lguest out in the past year or so.

Well, yes. The question here is whether there is a _need_ for lguest,
or whether people tried it just out of curiosity.

In the end it is 32 bit only and you can easily test boot code via
KVM, Xen or qemu.

> Even if we couldn't get rid of vsmp or lguest, I wonder if the PVOP_CALL
> stuff could be reworked to something like the following:
> 
> static inline notrace unsigned long arch_local_save_flags(void)
> {
> 	return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
> 			    "pushfq; popq %rax", CPU_FEATURE_NATIVE,
> 			    "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
> 			    "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
> 			    "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST);
> } 
> 
> Which would eventually translate to something like:
> 
> asm volatile(ALTERNATIVE_4("call *pv_irq_ops.save_fl",
> 			   "pushfq; popq %rax", CPU_FEATURE_NATIVE,
> 			   "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
> 			   "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
> 			   "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST
> 			   : ... pvop clobber stuff ... );
> 
> where ALTERNATIVE_4 is a logical extension of ALTERNATIVE_2 and
> CPU_FEATURE_NATIVE would always be set.
> 
> It might need some more macro magic, but if it worked I think it would
> be a lot clearer than the current voodoo.
> 
> Thoughts?

Hmm, this would change the current approach of pvops completely:
instead of letting each user of pvops (xen, lguest, vsmp, ...) set the
functions it needs, you'd have to modify the core definition of each
pvops function for each user. Or would you want to let Xen, lguest etc.
opt in for pvops and generate the above code at build time from some
templates?


Juergen

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-10 14:24                                       ` Juergen Gross
@ 2017-08-10 14:39                                         ` Josh Poimboeuf
  2017-08-10 14:59                                           ` Juergen Gross
  0 siblings, 1 reply; 38+ messages in thread
From: Josh Poimboeuf @ 2017-08-10 14:39 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Peter Zijlstra, Andy Lutomirski, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On Thu, Aug 10, 2017 at 04:24:58PM +0200, Juergen Gross wrote:
> >> I'll send some patches to:
> >>
> >> - remove xen_patch()
> >> - remove lguest
> >> - remove vsmp
> >>
> >> In case nobody objects to apply those patches we can possibly simplify
> >> some more code.
> >>
> >> I'd love that. :-)
> > 
> > Well, I might have spoken too soon about getting rid of vsmp.  The
> > scalemp.com domain still exists.  The code hasn't changed much in three
> > years, but maybe it's simple enough that it hasn't needed to change.
> 
> Lets see. I have made the experience that asking whether some code can
> be removed almost never get answers. Sending a patch which actually
> removes the stuff results much more often in objections. :-)
> 
> > Also, looking at the lguest mailing list, there seem to have been at
> > least a few people trying lguest out in the past year or so.
> 
> Well, yes. The question is here whether there is a _need_ for lguest
> or was it just out of curiosity?
> 
> In the end it is 32 bit only and you can easily test boot code via
> KVM, Xen or qemu.

Good points.  I'm all for removing code, so you have no objections from
me :-)

> > Even if we couldn't get rid of vsmp or lguest, I wonder if the PVOP_CALL
> > stuff could be reworked to something like the following:
> > 
> > static inline notrace unsigned long arch_local_save_flags(void)
> > {
> > 	return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
> > 			    "pushfq; popq %rax", CPU_FEATURE_NATIVE,
> > 			    "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
> > 			    "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
> > 			    "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST);
> > } 
> > 
> > Which would eventually translate to something like:
> > 
> > asm volatile(ALTERNATIVE_4("call *pv_irq_ops.save_fl",
> > 			   "pushfq; popq %rax", CPU_FEATURE_NATIVE,
> > 			   "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
> > 			   "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
> > 			   "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST
> > 			   : ... pvop clobber stuff ... );
> > 
> > where ALTERNATIVE_4 is a logical extension of ALTERNATIVE_2 and
> > CPU_FEATURE_NATIVE would always be set.
> > 
> > It might need some more macro magic, but if it worked I think it would
> > be a lot clearer than the current voodoo.
> > 
> > Thoughts?
> 
> Hmm, this would modify the current approach of pvops completely: instead
> of letting each user of pvops (xen, lguest, vsmp, ...) set the functions
> it is needing, you'd have to modify the core definition of each pvops
> function for each user.

Right.  The callers (arch_local_save_flags, etc) would have to know
about the different hypervisors' functions.  But this knowledge could be
hidden in inline functions and/or macros, so I don't see it being too
much of a problem.

The upsides are that the behavior is much clearer (IMO), and we could
get rid of the .parainstructions stuff altogether.

> Or would you want to let Xen, lguest etc. opt in
> for pvops and generate above code at build time from some templates?

I'm not sure what you mean, can you clarify?

-- 
Josh

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-10 14:09                                     ` Josh Poimboeuf
  2017-08-10 14:24                                       ` Juergen Gross
@ 2017-08-10 14:42                                       ` Josh Poimboeuf
  1 sibling, 0 replies; 38+ messages in thread
From: Josh Poimboeuf @ 2017-08-10 14:42 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Peter Zijlstra, Andy Lutomirski, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On Thu, Aug 10, 2017 at 09:09:03AM -0500, Josh Poimboeuf wrote:
> static inline notrace unsigned long arch_local_save_flags(void)
> {
> 	return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
> 			    "pushfq; popq %rax", CPU_FEATURE_NATIVE,
> 			    "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
> 			    "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
> 			    "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST);
> } 

Just a few clarifications on this idea:

It would probably be better to have the PVOP macros do the function name
translation, so maybe it would be something like this instead:

 	return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
 			    "pushfq; popq %rax", CPU_FEATURE_NATIVE,
 			    xen_save_fl, CPU_FEATURE_XEN,
 			    vsmp_save_fl, CPU_FEATURE_VSMP,
 			    lguest_save_fl, CPU_FEATURE_LGUEST);

One issue is that it would fail to link if CONFIG_XEN or
CONFIG_LGUEST_GUEST isn't set.  However I've seen some crazy macro magic
which lets you detect the number of variable arguments.  So if we got
that to work, it could be:

 	return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
 			    "pushfq; popq %rax", CPU_FEATURE_NATIVE,
#ifdef CONFIG_XEN
 			    xen_save_fl, CPU_FEATURE_XEN,
#endif
 			    vsmp_save_fl, CPU_FEATURE_VSMP,
#ifdef CONFIG_LGUEST_GUEST
 			    lguest_save_fl, CPU_FEATURE_LGUEST,
#endif
			    );

And then maybe the PVOP_CALL macros could detect the number of arguments
and call the corresponding version of ALTERNATIVE_X.  Macro fun :-)

> Which would eventually translate to something like:
> 
> asm volatile(ALTERNATIVE_4("call *pv_irq_ops.save_fl",
> 			   "pushfq; popq %rax", CPU_FEATURE_NATIVE,
> 			   "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
> 			   "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
> 			   "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST
> 			   : ... pvop clobber stuff ... );
> 
> where ALTERNATIVE_4 is a logical extension of ALTERNATIVE_2 and
> CPU_FEATURE_NATIVE would always be set.
> 
> It might need some more macro magic, but if it worked I think it would
> be a lot clearer than the current voodoo.
> 
> Thoughts?

-- 
Josh

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-10 14:39                                         ` Josh Poimboeuf
@ 2017-08-10 14:59                                           ` Juergen Gross
  2017-08-10 15:49                                             ` Josh Poimboeuf
  0 siblings, 1 reply; 38+ messages in thread
From: Juergen Gross @ 2017-08-10 14:59 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Andy Lutomirski, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On 10/08/17 16:39, Josh Poimboeuf wrote:
> On Thu, Aug 10, 2017 at 04:24:58PM +0200, Juergen Gross wrote:
>>>> I'll send some patches to:
>>>>
>>>> - remove xen_patch()
>>>> - remove lguest
>>>> - remove vsmp
>>>>
>>>> In case nobody objects to apply those patches we can possibly simplify
>>>> some more code.
>>>>
>>>> I'd love that. :-)
>>>
>>> Well, I might have spoken too soon about getting rid of vsmp.  The
>>> scalemp.com domain still exists.  The code hasn't changed much in three
>>> years, but maybe it's simple enough that it hasn't needed to change.
>>
>> Lets see. I have made the experience that asking whether some code can
>> be removed almost never get answers. Sending a patch which actually
>> removes the stuff results much more often in objections. :-)
>>
>>> Also, looking at the lguest mailing list, there seem to have been at
>>> least a few people trying lguest out in the past year or so.
>>
>> Well, yes. The question is here whether there is a _need_ for lguest
>> or was it just out of curiosity?
>>
>> In the end it is 32 bit only and you can easily test boot code via
>> KVM, Xen or qemu.
> 
> Good points.  I'm all for removing code, so you have no objections from
> me :-)
> 
>>> Even if we couldn't get rid of vsmp or lguest, I wonder if the PVOP_CALL
>>> stuff could be reworked to something like the following:
>>>
>>> static inline notrace unsigned long arch_local_save_flags(void)
>>> {
>>> 	return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
>>> 			    "pushfq; popq %rax", CPU_FEATURE_NATIVE,
>>> 			    "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
>>> 			    "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
>>> 			    "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST);
>>> } 
>>>
>>> Which would eventually translate to something like:
>>>
>>> asm volatile(ALTERNATIVE_4("call *pv_irq_ops.save_fl",
>>> 			   "pushfq; popq %rax", CPU_FEATURE_NATIVE,
>>> 			   "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
>>> 			   "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
>>> 			   "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST
>>> 			   : ... pvop clobber stuff ... );
>>>
>>> where ALTERNATIVE_4 is a logical extension of ALTERNATIVE_2 and
>>> CPU_FEATURE_NATIVE would always be set.
>>>
>>> It might need some more macro magic, but if it worked I think it would
>>> be a lot clearer than the current voodoo.
>>>
>>> Thoughts?
>>
>> Hmm, this would modify the current approach of pvops completely: instead
>> of letting each user of pvops (xen, lguest, vsmp, ...) set the functions
>> it needs, you'd have to modify the core definition of each pvops
>> function for each user.
> 
> Right.  The callers (arch_local_save_flags, etc) would have to know
> about the different hypervisors' functions.  But this knowledge could be
> hidden in inline functions and/or macros, so I don't see it being too
> much of a problem.
> 
> The upsides are that the behavior is much clearer (IMO), and we could
> get rid of the .parainstructions stuff altogether.
> 
>> Or would you want to let Xen, lguest etc. opt in
>> for pvops and generate above code at build time from some templates?
> 
> I'm not sure what you mean, can you clarify?

It shouldn't be too much work to let each pvops user have a file in a
common paravirt directory containing the needed information to create:

static inline notrace unsigned long arch_local_save_flags(void)
{
    return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
	"pushfq; popq %rax", CPU_FEATURE_NATIVE,
	"call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
	"call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
	"call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST);
}

and all other needed functions at build time. It could look e.g. like
(for xen: xen.pv):

@@feature CPU_FEATURE_XEN
PV_IRQ_OPS_SAVE_FL "call __raw_callee_save_xen_save_fl"

and the pre-processor could be used to assemble all configured users
(pvops.pv):

#ifdef CONFIG_XEN_PV
#include "xen.pv"
#endif
#ifdef CONFIG_LGUEST
#include "lguest.pv"
#endif

The resulting file would then be mangled by e.g. a python or awk script
to a header containing macro definitions like:

#define PV_IRQ_OPS_SAVE_FL \
   "pushfq; popq %rax", CPU_FEATURE_NATIVE, \
	"call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN, \
	"call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP, \
	"call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST

which can then be used in paravirt.h:

static inline notrace unsigned long arch_local_save_flags(void)
{
    return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
                        PV_IRQ_OPS_SAVE_FL);
}


Juergen

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-10 14:59                                           ` Juergen Gross
@ 2017-08-10 15:49                                             ` Josh Poimboeuf
  0 siblings, 0 replies; 38+ messages in thread
From: Josh Poimboeuf @ 2017-08-10 15:49 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Peter Zijlstra, Andy Lutomirski, Linus Torvalds, Levin,
	Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Mike Galbraith

On Thu, Aug 10, 2017 at 04:59:36PM +0200, Juergen Gross wrote:
> On 10/08/17 16:39, Josh Poimboeuf wrote:
> > On Thu, Aug 10, 2017 at 04:24:58PM +0200, Juergen Gross wrote:
> >>>> I'll send some patches to:
> >>>>
> >>>> - remove xen_patch()
> >>>> - remove lguest
> >>>> - remove vsmp
> >>>>
> >>>> In case nobody objects to applying those patches we can possibly simplify
> >>>> some more code.
> >>>>
> >>>> I'd love that. :-)
> >>>
> >>> Well, I might have spoken too soon about getting rid of vsmp.  The
> >>> scalemp.com domain still exists.  The code hasn't changed much in three
> >>> years, but maybe it's simple enough that it hasn't needed to change.
> >>
> >> Let's see. In my experience, asking whether some code can be removed
> >> almost never gets answers. Sending a patch which actually removes the
> >> stuff much more often results in objections. :-)
> >>
> >>> Also, looking at the lguest mailing list, there seem to have been at
> >>> least a few people trying lguest out in the past year or so.
> >>
> >> Well, yes. The question here is whether there is a _need_ for lguest,
> >> or whether it was just tried out of curiosity.
> >>
> >> In the end it is 32 bit only and you can easily test boot code via
> >> KVM, Xen or qemu.
> > 
> > Good points.  I'm all for removing code, so you have no objections from
> > me :-)
> > 
> >>> Even if we couldn't get rid of vsmp or lguest, I wonder if the PVOP_CALL
> >>> stuff could be reworked to something like the following:
> >>>
> >>> static inline notrace unsigned long arch_local_save_flags(void)
> >>> {
> >>> 	return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
> >>> 			    "pushfq; popq %rax", CPU_FEATURE_NATIVE,
> >>> 			    "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
> >>> 			    "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
> >>> 			    "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST);
> >>> } 
> >>>
> >>> Which would eventually translate to something like:
> >>>
> >>> asm volatile(ALTERNATIVE_4("call *pv_irq_ops.save_fl",
> >>> 			   "pushfq; popq %rax", CPU_FEATURE_NATIVE,
> >>> 			   "call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
> >>> 			   "call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
> >>> 			   "call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST
> >>> 			   : ... pvop clobber stuff ... );
> >>>
> >>> where ALTERNATIVE_4 is a logical extension of ALTERNATIVE_2 and
> >>> CPU_FEATURE_NATIVE would always be set.
> >>>
> >>> It might need some more macro magic, but if it worked I think it would
> >>> be a lot clearer than the current voodoo.
> >>>
> >>> Thoughts?
> >>
> >> Hmm, this would modify the current approach of pvops completely: instead
> >> of letting each user of pvops (xen, lguest, vsmp, ...) set the functions
> >> it needs, you'd have to modify the core definition of each pvops
> >> function for each user.
> > 
> > Right.  The callers (arch_local_save_flags, etc) would have to know
> > about the different hypervisors' functions.  But this knowledge could be
> > hidden in inline functions and/or macros, so I don't see it being too
> > much of a problem.
> > 
> > The upsides are that the behavior is much clearer (IMO), and we could
> > get rid of the .parainstructions stuff altogether.
> > 
> >> Or would you want to let Xen, lguest etc. opt in
> >> for pvops and generate above code at build time from some templates?
> > 
> > I'm not sure what you mean, can you clarify?
> 
> It shouldn't be too much work to let each pvops user have a file in a
> common paravirt directory containing the needed information to create:
> 
> static inline notrace unsigned long arch_local_save_flags(void)
> {
>     return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
> 	"pushfq; popq %rax", CPU_FEATURE_NATIVE,
> 	"call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN,
> 	"call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP,
> 	"call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST);
> }
> 
> and all other needed functions at build time. It could look e.g. like
> (for xen: xen.pv):
> 
> @@feature CPU_FEATURE_XEN
> PV_IRQ_OPS_SAVE_FL "call __raw_callee_save_xen_save_fl"
> 
> and the pre-processor could be used to assemble all configured users
> (pvops.pv):
> 
> #ifdef CONFIG_XEN_PV
> #include "xen.pv"
> #endif
> #ifdef CONFIG_LGUEST
> #include "lguest.pv"
> #endif
> 
> The resulting file would then be mangled by e.g. a python or awk script
> to a header containing macro definitions like:
> 
> #define PV_IRQ_OPS_SAVE_FL \
>    "pushfq; popq %rax", CPU_FEATURE_NATIVE, \
> 	"call __raw_callee_save_xen_save_fl", CPU_FEATURE_XEN, \
> 	"call __raw_callee_save_vsmp_save_fl", CPU_FEATURE_VSMP, \
> 	"call __raw_callee_save_lguest_save_fl", CPU_FEATURE_LGUEST
> 
> which can then be used in paravirt.h:
> 
> static inline notrace unsigned long arch_local_save_flags(void)
> {
>     return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl,
>                         PV_IRQ_OPS_SAVE_FL);
> }

That could work, though I'd prefer the code-based approach because I get
the feeling it would be less obtuse.  I can play around with it, though
it may be a few weeks.  Feel free to delete code in the meantime :-)

-- 
Josh

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-08-08 20:09                     ` Andy Lutomirski
  2017-08-08 22:00                       ` Josh Poimboeuf
  2017-08-09  8:49                       ` Juergen Gross
@ 2017-09-27 21:08                       ` Josh Poimboeuf
  2017-09-28  6:03                         ` Juergen Gross
  2 siblings, 1 reply; 38+ messages in thread
From: Josh Poimboeuf @ 2017-09-27 21:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Levin, Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra, Mike Galbraith, Juergen Gross

On Tue, Aug 08, 2017 at 01:09:08PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 8, 2017 at 12:13 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Tue, Aug 08, 2017 at 12:03:51PM -0700, Linus Torvalds wrote:
> >> On Tue, Aug 8, 2017 at 11:58 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >
> >> > Take for example the lock_is_held_type() function.  In vmlinux, it has
> >> > the following instruction:
> >> >
> >> >   callq *0xffffffff85a94880 (pv_irq_ops.save_fl)
> >> >
> >> > At runtime, that instruction is patched and replaced with a fast inline
> >> > version of arch_local_save_flags() which eliminates the call:
> >> >
> >> >   pushfq
> >> >   pop %rax
> >> >
> >> > The problem is when an interrupt hits after the push:
> >> >
> >> >   pushfq
> >> >   --- irq ---
> >> >   pop %rax
> >>
> >> That should actually be something easily fixable, for an odd reason:
> >> the instruction boundaries are different.
> >>
> >> > I'm not sure what the solution should be.  It will probably need to be
> >> > one of the following:
> >> >
> >> >   a) either don't allow runtime "alternative" patches to mess with the
> >> >      stack pointer (objtool could enforce this); or
> >> >
> >> >   b) come up with some way to register such patches with the ORC
> >> >      unwinder at runtime.
> >>
> >> c) just add ORC data for the alternative statically and _unconditionally_.
> >>
> >> No runtime registration. Just an unconditional entry for the
> >> particular IP that comes after the "pushfq". It cannot match the
> >> "callq" instruction, since it would be in the middle of that
> >> instruction.
> >>
> >> Basically, just do a "union" of the ORC data for all the alternatives.
> >>
> >> Now, objtool should still verify that the instruction pointers for
> >> alternatives are unique - or that they share the same ORC unwinder
> >> information if they are not.
> >>
>>> But in cases like this, when the instruction boundaries are different,
> >> things should "just work", with no need for any special cases.
> >>
> >> Hmm?
> >
> > Yeah, that might work.  Objtool already knows about alternatives, so it
> > might not be too hard.  I'll try it.
> 
> But this one's not an actual alternative, right?  It's a pv op.
> 
> I would advocate that we make it an alternative after all.  I frickin'
> hate the PV irq ops.  It would look roughly like this:
> 
> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
> X86_FEATURE_GODDAMN_PV_IRQ_OPS
> 
> (The obvious syntax error and the naming should probably be fixed.
> Also, this needs to live in an #ifdef because it needs to build on
> kernels with pv support.  It should also properly register itself as a
> pv patch site.)

I've got a prototype of the above working, where vmlinux shows:

  pushfq
  pop    %rax
  nop
  nop
  nop
  nop
  nop

instead of:

  callq  *0xffffffff81e3a400 (pv_irq_ops.save_fl)

Which is nice because the vmlinux disassembly now matches the most
common runtime cases (everything except Xen and vsmp).  And it also
fixes the upthread objtool issue.

The only slight issue with the patches is that hypervisors need access
to the pv ops much earlier than when alternatives are applied.  So I had
to add a new .pv_alternatives section for these pv ops alternatives, so
they can be patched very early, if running in a hypervisor.

Will clean up the code and post something relatively soon.

-- 
Josh

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-09-27 21:08                       ` Josh Poimboeuf
@ 2017-09-28  6:03                         ` Juergen Gross
  2017-09-28 13:55                           ` Josh Poimboeuf
  0 siblings, 1 reply; 38+ messages in thread
From: Juergen Gross @ 2017-09-28  6:03 UTC (permalink / raw)
  To: Josh Poimboeuf, Andy Lutomirski
  Cc: Linus Torvalds, Levin, Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra, Mike Galbraith

On 27/09/17 23:08, Josh Poimboeuf wrote:
> On Tue, Aug 08, 2017 at 01:09:08PM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 8, 2017 at 12:13 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> On Tue, Aug 08, 2017 at 12:03:51PM -0700, Linus Torvalds wrote:
>>>> On Tue, Aug 8, 2017 at 11:58 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>>
>>>>> Take for example the lock_is_held_type() function.  In vmlinux, it has
>>>>> the following instruction:
>>>>>
>>>>>   callq *0xffffffff85a94880 (pv_irq_ops.save_fl)
>>>>>
>>>>> At runtime, that instruction is patched and replaced with a fast inline
>>>>> version of arch_local_save_flags() which eliminates the call:
>>>>>
>>>>>   pushfq
>>>>>   pop %rax
>>>>>
>>>>> The problem is when an interrupt hits after the push:
>>>>>
>>>>>   pushfq
>>>>>   --- irq ---
>>>>>   pop %rax
>>>>
>>>> That should actually be something easily fixable, for an odd reason:
>>>> the instruction boundaries are different.
>>>>
>>>>> I'm not sure what the solution should be.  It will probably need to be
>>>>> one of the following:
>>>>>
>>>>>   a) either don't allow runtime "alternative" patches to mess with the
>>>>>      stack pointer (objtool could enforce this); or
>>>>>
>>>>>   b) come up with some way to register such patches with the ORC
>>>>>      unwinder at runtime.
>>>>
>>>> c) just add ORC data for the alternative statically and _unconditionally_.
>>>>
>>>> No runtime registration. Just an unconditional entry for the
>>>> particular IP that comes after the "pushfq". It cannot match the
>>>> "callq" instruction, since it would be in the middle of that
>>>> instruction.
>>>>
>>>> Basically, just do a "union" of the ORC data for all the alternatives.
>>>>
>>>> Now, objtool should still verify that the instruction pointers for
>>>> alternatives are unique - or that they share the same ORC unwinder
>>>> information if they are not.
>>>>
> >>>> But in cases like this, when the instruction boundaries are different,
>>>> things should "just work", with no need for any special cases.
>>>>
>>>> Hmm?
>>>
>>> Yeah, that might work.  Objtool already knows about alternatives, so it
>>> might not be too hard.  I'll try it.
>>
>> But this one's not an actual alternative, right?  It's a pv op.
>>
>> I would advocate that we make it an alternative after all.  I frickin'
>> hate the PV irq ops.  It would look roughly like this:
>>
>> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
>> X86_FEATURE_GODDAMN_PV_IRQ_OPS
>>
>> (The obvious syntax error and the naming should probably be fixed.
>> Also, this needs to live in an #ifdef because it needs to build on
>> kernels with pv support.  It should also properly register itself as a
>> pv patch site.)
> 
> I've got a prototype of the above working, where vmlinux shows:
> 
>   pushfq
>   pop    %rax
>   nop
>   nop
>   nop
>   nop
>   nop
> 
> instead of:
> 
>   callq  *0xffffffff81e3a400 (pv_irq_ops.save_fl)
> 
> Which is nice because the vmlinux disassembly now matches the most
> common runtime cases (everything except Xen and vsmp).  And it also
> fixes the upthread objtool issue.
> 
> The only slight issue with the patches is that hypervisors need access
> to the pv ops much earlier than when alternatives are applied.  So I had
> to add a new .pv_alternatives section for these pv ops alternatives, so
> they can be patched very early, if running in a hypervisor.
> 
> Will clean up the code and post something relatively soon.
> 

Are you combining alternatives and pvops then? I'm asking because in an
up and running system under Xen the callq *... will be replaced with a
much faster "call xen_save_fl". This should still be the case after
your patch.


Juergen

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-09-28  6:03                         ` Juergen Gross
@ 2017-09-28 13:55                           ` Josh Poimboeuf
  2017-09-28 13:58                             ` Juergen Gross
  0 siblings, 1 reply; 38+ messages in thread
From: Josh Poimboeuf @ 2017-09-28 13:55 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andy Lutomirski, Linus Torvalds, Levin, Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra, Mike Galbraith

On Thu, Sep 28, 2017 at 08:03:26AM +0200, Juergen Gross wrote:
> On 27/09/17 23:08, Josh Poimboeuf wrote:
> > On Tue, Aug 08, 2017 at 01:09:08PM -0700, Andy Lutomirski wrote:
> >> On Tue, Aug 8, 2017 at 12:13 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >>> On Tue, Aug 08, 2017 at 12:03:51PM -0700, Linus Torvalds wrote:
> >>>> On Tue, Aug 8, 2017 at 11:58 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >>>>>
> >>>>> Take for example the lock_is_held_type() function.  In vmlinux, it has
> >>>>> the following instruction:
> >>>>>
> >>>>>   callq *0xffffffff85a94880 (pv_irq_ops.save_fl)
> >>>>>
> >>>>> At runtime, that instruction is patched and replaced with a fast inline
> >>>>> version of arch_local_save_flags() which eliminates the call:
> >>>>>
> >>>>>   pushfq
> >>>>>   pop %rax
> >>>>>
> >>>>> The problem is when an interrupt hits after the push:
> >>>>>
> >>>>>   pushfq
> >>>>>   --- irq ---
> >>>>>   pop %rax
> >>>>
> >>>> That should actually be something easily fixable, for an odd reason:
> >>>> the instruction boundaries are different.
> >>>>
> >>>>> I'm not sure what the solution should be.  It will probably need to be
> >>>>> one of the following:
> >>>>>
> >>>>>   a) either don't allow runtime "alternative" patches to mess with the
> >>>>>      stack pointer (objtool could enforce this); or
> >>>>>
> >>>>>   b) come up with some way to register such patches with the ORC
> >>>>>      unwinder at runtime.
> >>>>
> >>>> c) just add ORC data for the alternative statically and _unconditionally_.
> >>>>
> >>>> No runtime registration. Just an unconditional entry for the
> >>>> particular IP that comes after the "pushfq". It cannot match the
> >>>> "callq" instruction, since it would be in the middle of that
> >>>> instruction.
> >>>>
> >>>> Basically, just do a "union" of the ORC data for all the alternatives.
> >>>>
> >>>> Now, objtool should still verify that the instruction pointers for
> >>>> alternatives are unique - or that they share the same ORC unwinder
> >>>> information if they are not.
> >>>>
> >>>> But in cases like this, when the instruction boundaries are different,
> >>>> things should "just work", with no need for any special cases.
> >>>>
> >>>> Hmm?
> >>>
> >>> Yeah, that might work.  Objtool already knows about alternatives, so it
> >>> might not be too hard.  I'll try it.
> >>
> >> But this one's not an actual alternative, right?  It's a pv op.
> >>
> >> I would advocate that we make it an alternative after all.  I frickin'
> >> hate the PV irq ops.  It would look roughly like this:
> >>
> >> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
> >> X86_FEATURE_GODDAMN_PV_IRQ_OPS
> >>
> >> (The obvious syntax error and the naming should probably be fixed.
> >> Also, this needs to live in an #ifdef because it needs to build on
> >> kernels with pv support.  It should also properly register itself as a
> >> pv patch site.)
> > 
> > I've got a prototype of the above working, where vmlinux shows:
> > 
> >   pushfq
> >   pop    %rax
> >   nop
> >   nop
> >   nop
> >   nop
> >   nop
> > 
> > instead of:
> > 
> >   callq  *0xffffffff81e3a400 (pv_irq_ops.save_fl)
> > 
> > Which is nice because the vmlinux disassembly now matches the most
> > common runtime cases (everything except Xen and vsmp).  And it also
> > fixes the upthread objtool issue.
> > 
> > The only slight issue with the patches is that hypervisors need access
> > to the pv ops much earlier than when alternatives are applied.  So I had
> > to add a new .pv_alternatives section for these pv ops alternatives, so
> > they can be patched very early, if running in a hypervisor.
> > 
> > Will clean up the code and post something relatively soon.
> > 
> 
> Are you combining alternatives and pvops then? I'm asking because in an
> up and running system under Xen the callq *... will be replaced with a
> much faster "call xen_save_fl". This should still be the case after
> your patch.

Right, it's not combining alternatives and pv ops, it's just adding
another step.  So first the

  pushfq; pop %rax

is replaced with

  callq *pv_irq_ops.save_fl

and then later, after the xen ops structs are finalized, it's replaced
with

  callq xen_save_fl

-- 
Josh

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/2] x86/unwind: add ORC unwinder
  2017-09-28 13:55                           ` Josh Poimboeuf
@ 2017-09-28 13:58                             ` Juergen Gross
  0 siblings, 0 replies; 38+ messages in thread
From: Juergen Gross @ 2017-09-28 13:58 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Linus Torvalds, Levin, Alexander (Sasha Levin),
	x86, linux-kernel, live-patching, Jiri Slaby, Ingo Molnar,
	H. Peter Anvin, Peter Zijlstra, Mike Galbraith

On 28/09/17 15:55, Josh Poimboeuf wrote:
> On Thu, Sep 28, 2017 at 08:03:26AM +0200, Juergen Gross wrote:
>> On 27/09/17 23:08, Josh Poimboeuf wrote:
>>> On Tue, Aug 08, 2017 at 01:09:08PM -0700, Andy Lutomirski wrote:
>>>> On Tue, Aug 8, 2017 at 12:13 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>> On Tue, Aug 08, 2017 at 12:03:51PM -0700, Linus Torvalds wrote:
>>>>>> On Tue, Aug 8, 2017 at 11:58 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>>>>
>>>>>>> Take for example the lock_is_held_type() function.  In vmlinux, it has
>>>>>>> the following instruction:
>>>>>>>
>>>>>>>   callq *0xffffffff85a94880 (pv_irq_ops.save_fl)
>>>>>>>
>>>>>>> At runtime, that instruction is patched and replaced with a fast inline
>>>>>>> version of arch_local_save_flags() which eliminates the call:
>>>>>>>
>>>>>>>   pushfq
>>>>>>>   pop %rax
>>>>>>>
>>>>>>> The problem is when an interrupt hits after the push:
>>>>>>>
>>>>>>>   pushfq
>>>>>>>   --- irq ---
>>>>>>>   pop %rax
>>>>>>
>>>>>> That should actually be something easily fixable, for an odd reason:
>>>>>> the instruction boundaries are different.
>>>>>>
>>>>>>> I'm not sure what the solution should be.  It will probably need to be
>>>>>>> one of the following:
>>>>>>>
>>>>>>>   a) either don't allow runtime "alternative" patches to mess with the
>>>>>>>      stack pointer (objtool could enforce this); or
>>>>>>>
>>>>>>>   b) come up with some way to register such patches with the ORC
>>>>>>>      unwinder at runtime.
>>>>>>
>>>>>> c) just add ORC data for the alternative statically and _unconditionally_.
>>>>>>
>>>>>> No runtime registration. Just an unconditional entry for the
>>>>>> particular IP that comes after the "pushfq". It cannot match the
>>>>>> "callq" instruction, since it would be in the middle of that
>>>>>> instruction.
>>>>>>
>>>>>> Basically, just do a "union" of the ORC data for all the alternatives.
>>>>>>
>>>>>> Now, objtool should still verify that the instruction pointers for
>>>>>> alternatives are unique - or that they share the same ORC unwinder
>>>>>> information if they are not.
>>>>>>
>>>>>> But in cases like this, when the instruction boundaries are different,
>>>>>> things should "just work", with no need for any special cases.
>>>>>>
>>>>>> Hmm?
>>>>>
>>>>> Yeah, that might work.  Objtool already knows about alternatives, so it
>>>>> might not be too hard.  I'll try it.
>>>>
>>>> But this one's not an actual alternative, right?  It's a pv op.
>>>>
>>>> I would advocate that we make it an alternative after all.  I frickin'
>>>> hate the PV irq ops.  It would look roughly like this:
>>>>
>>>> ALTERNATIVE "pushfq; popq %rax", "callq *pv_irq_ops.save_fl",
>>>> X86_FEATURE_GODDAMN_PV_IRQ_OPS
>>>>
>>>> (The obvious syntax error and the naming should probably be fixed.
>>>> Also, this needs to live in an #ifdef because it needs to build on
>>>> kernels with pv support.  It should also properly register itself as a
>>>> pv patch site.)
>>>
>>> I've got a prototype of the above working, where vmlinux shows:
>>>
>>>   pushfq
>>>   pop    %rax
>>>   nop
>>>   nop
>>>   nop
>>>   nop
>>>   nop
>>>
>>> instead of:
>>>
>>>   callq  *0xffffffff81e3a400 (pv_irq_ops.save_fl)
>>>
>>> Which is nice because the vmlinux disassembly now matches the most
>>> common runtime cases (everything except Xen and vsmp).  And it also
>>> fixes the upthread objtool issue.
>>>
>>> The only slight issue with the patches is that hypervisors need access
>>> to the pv ops much earlier than when alternatives are applied.  So I had
>>> to add a new .pv_alternatives section for these pv ops alternatives, so
>>> they can be patched very early, if running in a hypervisor.
>>>
>>> Will clean up the code and post something relatively soon.
>>>
>>
>> Are you combining alternatives and pvops then? I'm asking because in an
>> up and running system under Xen the callq *... will be replaced with a
>> much faster "call xen_save_fl". This should still be the case after
>> your patch.
> 
> Right, it's not combining alternatives and pv ops, it's just adding
> another step.  So first the
> 
>   pushfq; pop %rax
> 
> is replaced with
> 
>   callq *pv_irq_ops.save_fl
> 
> and then later, after the xen ops structs are finalized, it's replaced
> with
> 
>   callq xen_save_fl
> 

Okay, thanks for confirming.


Juergen

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2017-09-28 16:26 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-24 23:36 [PATCH v4 0/2] ORC unwinder Josh Poimboeuf
2017-07-24 23:36 ` [PATCH v4 1/2] x86/unwind: add " Josh Poimboeuf
2017-07-26 12:10   ` [tip:x86/asm] x86/unwind: Add the " tip-bot for Josh Poimboeuf
2017-07-28 16:48   ` [PATCH v4 1/2] x86/unwind: add " Levin, Alexander (Sasha Levin)
2017-07-28 17:52     ` Josh Poimboeuf
2017-07-28 18:29       ` Levin, Alexander (Sasha Levin)
2017-07-28 18:57         ` Josh Poimboeuf
2017-07-28 19:59           ` Levin, Alexander (Sasha Levin)
2017-07-29  3:54             ` Josh Poimboeuf
2017-08-08 18:58               ` Josh Poimboeuf
2017-08-08 19:03                 ` Linus Torvalds
2017-08-08 19:13                   ` Josh Poimboeuf
2017-08-08 20:09                     ` Andy Lutomirski
2017-08-08 22:00                       ` Josh Poimboeuf
2017-08-09  8:49                       ` Juergen Gross
2017-08-09  9:16                         ` Peter Zijlstra
2017-08-09  9:24                           ` Juergen Gross
2017-08-09  9:35                             ` Peter Zijlstra
2017-08-09  9:55                               ` Juergen Gross
2017-08-09 20:15                                 ` Josh Poimboeuf
2017-08-10  7:05                                   ` Juergen Gross
2017-08-10 14:09                                     ` Josh Poimboeuf
2017-08-10 14:24                                       ` Juergen Gross
2017-08-10 14:39                                         ` Josh Poimboeuf
2017-08-10 14:59                                           ` Juergen Gross
2017-08-10 15:49                                             ` Josh Poimboeuf
2017-08-10 14:42                                       ` Josh Poimboeuf
2017-08-09 16:11                         ` Andy Lutomirski
2017-08-09 17:52                           ` Juergen Gross
2017-09-27 21:08                       ` Josh Poimboeuf
2017-09-28  6:03                         ` Juergen Gross
2017-09-28 13:55                           ` Josh Poimboeuf
2017-09-28 13:58                             ` Juergen Gross
2017-07-24 23:36 ` [PATCH v4 2/2] x86/kconfig: make it easier to switch to the new " Josh Poimboeuf
2017-07-25  9:10   ` Ingo Molnar
2017-07-25 13:54     ` Josh Poimboeuf
2017-07-26 12:11       ` [tip:x86/asm] x86/kconfig: Consolidate unwinders into multiple choice selection tip-bot for Josh Poimboeuf
2017-07-26 12:10   ` [tip:x86/asm] x86/kconfig: Make it easier to switch to the new ORC unwinder tip-bot for Josh Poimboeuf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).