All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")
@ 2018-01-04  9:10 Paul Turner
  2018-01-04  9:12 ` Paul Turner
                   ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Paul Turner @ 2018-01-04  9:10 UTC (permalink / raw)
  To: LKML, Linus Torvalds, Greg Kroah-Hartman, Woodhouse, David,
	Tim Chen, Dave Hansen, tglx, Kees Cook, Rik van Riel,
	Peter Zijlstra, Andy Lutomirski, Jiri Kosina, gnomes

Apologies for the discombobulation around today's disclosure.  Obviously the
original goal was to communicate this a little more coherently, but the
unscheduled advances in the disclosure disrupted the efforts to pull this
together more cleanly.

I wanted to open discussion the "retpoline" approach and and define its
requirements so that we can separate the core
details from questions regarding any particular implementation thereof.

As a starting point, a full write-up describing the approach is available at:
  https://support.google.com/faqs/answer/7625886

The 30 second version is:
Returns are a special type of indirect branch.  As function returns are intended
to pair with function calls, processors often implement dedicated return stack
predictors.  The choice of this branch prediction allows us to generate an
indirect branch in which speculative execution is intentionally redirected into
a controlled location by a return stack target that we control.  Preventing
branch target injections (also known as "Spectre") against these binaries.

On the targets (Intel Xeon) we have measured so far, cost is within cycles of a
"native" indirect branch for which branch prediction hardware has been disabled.
This is unfortunately measurable -- from 3 cycles on average to about 30.
However the cost is largely mitigated for many workloads since the kernel uses
comparatively few indirect branches (versus say, a C++ binary).  With some
effort we have the average overall overhead within the 0-1.5% range for our
internal workloads, including some particularly high packet processing engines.

There are several components, the majority of which are independent of kernel
modifications:

(1) A compiler supporting retpoline transformations.
(1a) Optionally: annotations for hand-coded indirect jmps, so that they may be
    made compatible with (1).
    [ Note: The only known indirect jmp which is not safe to convert, is the
      early virtual address check in head entry. ]
(2) Kernel modifications for preventing return-stack underflow (see document
    above).
   The key points where this occurs are:
   - Context switches (into protected targets)
   - interrupt return (we return into potentially unwinding execution)
   - sleep state exit (flushes cashes)
   - guest exit.
  (These can be run-time gated, a full refill costs 30-45 cycles.)
(3) Optional: Optimizations so that direct branches can be used for hot kernel
   indirects. While as discussed above, kernel execution generally depends on
   fewer indirect branches, there are a few places (in particular, the
   networking stack) where we have chained sequences of indirects on hot paths.
(4) More general support for guarding against RSB underflow in an affected
    target.  While this is harder to exploit and may not be required for many
    users, the approaches we have used here are not generally applicable.
    Further discussion is required.

With respect to the what these deltas mean for an unmodified kernel:
 (1a) At minimum annotation only.  More complicated, config and
run-time gated options are also possigble.
 (2) Trivially run-time & config gated.
 (3) The de-virtualizing of these branches improves performance in both the
     retpoline and non-retpoline cases.

For an out of the box kernel that is reasonably protected, (1)-(3) are required.

I apologize that this does not come with a clean set of patches, merging the
things that we and Intel have looked at here.  That was one of the original
goals for this week.  Strictly speaking, I think that Andi, David, and I have
a fair amount of merging and clean-up to do here.  This is an attempt
to keep discussion of the fundamentals at least independent of that.

I'm trying to keep the above reasonably compact/dense.  I'm happy to expand on
any details in sub-threads.  I'll also link back some of the other compiler work
which is landing for (1).

Thanks,

- Paul

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")
  2018-01-04  9:10 [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre") Paul Turner
@ 2018-01-04  9:12 ` Paul Turner
  2018-01-04  9:24 ` Paul Turner
  2018-01-04  9:30 ` Woodhouse, David
  2 siblings, 0 replies; 79+ messages in thread
From: Paul Turner @ 2018-01-04  9:12 UTC (permalink / raw)
  To: LKML, Linus Torvalds, Greg Kroah-Hartman, Woodhouse, David,
	Tim Chen, Dave Hansen, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes, Thomas Gleixner

On Thu, Jan 4, 2018 at 1:10 AM, Paul Turner <pjt@google.com> wrote:
> Apologies for the discombobulation around today's disclosure.  Obviously the
> original goal was to communicate this a little more coherently, but the
> unscheduled advances in the disclosure disrupted the efforts to pull this
> together more cleanly.
>
> I wanted to open discussion the "retpoline" approach and and define its
> requirements so that we can separate the core
> details from questions regarding any particular implementation thereof.
>
> As a starting point, a full write-up describing the approach is available at:
>   https://support.google.com/faqs/answer/7625886
>
> The 30 second version is:
> Returns are a special type of indirect branch.  As function returns are intended
> to pair with function calls, processors often implement dedicated return stack
> predictors.  The choice of this branch prediction allows us to generate an
> indirect branch in which speculative execution is intentionally redirected into
> a controlled location by a return stack target that we control.  Preventing
> branch target injections (also known as "Spectre") against these binaries.
>
> On the targets (Intel Xeon) we have measured so far, cost is within cycles of a
> "native" indirect branch for which branch prediction hardware has been disabled.
> This is unfortunately measurable -- from 3 cycles on average to about 30.
> However the cost is largely mitigated for many workloads since the kernel uses
> comparatively few indirect branches (versus say, a C++ binary).  With some
> effort we have the average overall overhead within the 0-1.5% range for our
> internal workloads, including some particularly high packet processing engines.
>
> There are several components, the majority of which are independent of kernel
> modifications:
>
> (1) A compiler supporting retpoline transformations.
> (1a) Optionally: annotations for hand-coded indirect jmps, so that they may be
>     made compatible with (1).
>     [ Note: The only known indirect jmp which is not safe to convert, is the
>       early virtual address check in head entry. ]
> (2) Kernel modifications for preventing return-stack underflow (see document
>     above).
>    The key points where this occurs are:
>    - Context switches (into protected targets)
>    - interrupt return (we return into potentially unwinding execution)
>    - sleep state exit (flushes cashes)
>    - guest exit.
>   (These can be run-time gated, a full refill costs 30-45 cycles.)
> (3) Optional: Optimizations so that direct branches can be used for hot kernel
>    indirects. While as discussed above, kernel execution generally depends on
>    fewer indirect branches, there are a few places (in particular, the
>    networking stack) where we have chained sequences of indirects on hot paths.
> (4) More general support for guarding against RSB underflow in an affected
>     target.  While this is harder to exploit and may not be required for many
>     users, the approaches we have used here are not generally applicable.
>     Further discussion is required.
>
> With respect to the what these deltas mean for an unmodified kernel:
>  (1a) At minimum annotation only.  More complicated, config and
> run-time gated options are also possigble.
>  (2) Trivially run-time & config gated.
>  (3) The de-virtualizing of these branches improves performance in both the
>      retpoline and non-retpoline cases.
>
> For an out of the box kernel that is reasonably protected, (1)-(3) are required.
>
> I apologize that this does not come with a clean set of patches, merging the
> things that we and Intel have looked at here.  That was one of the original
> goals for this week.  Strictly speaking, I think that Andi, David, and I have
> a fair amount of merging and clean-up to do here.  This is an attempt
> to keep discussion of the fundamentals at least independent of that.
>
> I'm trying to keep the above reasonably compact/dense.  I'm happy to expand on
> any details in sub-threads.  I'll also link back some of the other compiler work
> which is landing for (1).
>
> Thanks,
>
> - Paul

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")
  2018-01-04  9:10 [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre") Paul Turner
  2018-01-04  9:12 ` Paul Turner
@ 2018-01-04  9:24 ` Paul Turner
  2018-01-04  9:48   ` Greg Kroah-Hartman
  2018-01-04  9:30 ` Woodhouse, David
  2 siblings, 1 reply; 79+ messages in thread
From: Paul Turner @ 2018-01-04  9:24 UTC (permalink / raw)
  To: LKML, Linus Torvalds, Greg Kroah-Hartman, Woodhouse, David,
	Tim Chen, Dave Hansen, tglx, Kees Cook, Rik van Riel,
	Peter Zijlstra, Andy Lutomirski, Jiri Kosina, gnomes

On Thu, Jan 4, 2018 at 1:10 AM, Paul Turner <pjt@google.com> wrote:
> Apologies for the discombobulation around today's disclosure.  Obviously the
> original goal was to communicate this a little more coherently, but the
> unscheduled advances in the disclosure disrupted the efforts to pull this
> together more cleanly.
>
> I wanted to open discussion the "retpoline" approach and and define its
> requirements so that we can separate the core
> details from questions regarding any particular implementation thereof.
>
> As a starting point, a full write-up describing the approach is available at:
>   https://support.google.com/faqs/answer/7625886
>
> The 30 second version is:
> Returns are a special type of indirect branch.  As function returns are intended
> to pair with function calls, processors often implement dedicated return stack
> predictors.  The choice of this branch prediction allows us to generate an
> indirect branch in which speculative execution is intentionally redirected into
> a controlled location by a return stack target that we control.  Preventing
> branch target injections (also known as "Spectre") against these binaries.
>
> On the targets (Intel Xeon) we have measured so far, cost is within cycles of a
> "native" indirect branch for which branch prediction hardware has been disabled.
> This is unfortunately measurable -- from 3 cycles on average to about 30.
> However the cost is largely mitigated for many workloads since the kernel uses
> comparatively few indirect branches (versus say, a C++ binary).  With some
> effort we have the average overall overhead within the 0-1.5% range for our
> internal workloads, including some particularly high packet processing engines.
>
> There are several components, the majority of which are independent of kernel
> modifications:
>
> (1) A compiler supporting retpoline transformations.

An implementation for LLVM is available at:
  https://reviews.llvm.org/D41723

> (1a) Optionally: annotations for hand-coded indirect jmps, so that they may be
>     made compatible with (1).
>     [ Note: The only known indirect jmp which is not safe to convert, is the
>       early virtual address check in head entry. ]
> (2) Kernel modifications for preventing return-stack underflow (see document
>     above).
>    The key points where this occurs are:
>    - Context switches (into protected targets)
>    - interrupt return (we return into potentially unwinding execution)
>    - sleep state exit (flushes cashes)
>    - guest exit.
>   (These can be run-time gated, a full refill costs 30-45 cycles.)
> (3) Optional: Optimizations so that direct branches can be used for hot kernel
>    indirects. While as discussed above, kernel execution generally depends on
>    fewer indirect branches, there are a few places (in particular, the
>    networking stack) where we have chained sequences of indirects on hot paths.
> (4) More general support for guarding against RSB underflow in an affected
>     target.  While this is harder to exploit and may not be required for many
>     users, the approaches we have used here are not generally applicable.
>     Further discussion is required.
>
> With respect to the what these deltas mean for an unmodified kernel:

Sorry this should have been, a kernel that does not care about this protection.

It has been a long day :-).

>  (1a) At minimum annotation only.  More complicated, config and
> run-time gated options are also possigble.
>  (2) Trivially run-time & config gated.
>  (3) The de-virtualizing of these branches improves performance in both the
>      retpoline and non-retpoline cases.
>
> For an out of the box kernel that is reasonably protected, (1)-(3) are required.
>
> I apologize that this does not come with a clean set of patches, merging the
> things that we and Intel have looked at here.  That was one of the original
> goals for this week.  Strictly speaking, I think that Andi, David, and I have
> a fair amount of merging and clean-up to do here.  This is an attempt
> to keep discussion of the fundamentals at least independent of that.
>
> I'm trying to keep the above reasonably compact/dense.  I'm happy to expand on
> any details in sub-threads.  I'll also link back some of the other compiler work
> which is landing for (1).
>
> Thanks,
>
> - Paul

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")
  2018-01-04  9:10 [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre") Paul Turner
  2018-01-04  9:12 ` Paul Turner
  2018-01-04  9:24 ` Paul Turner
@ 2018-01-04  9:30 ` Woodhouse, David
  2018-01-04 14:36   ` [PATCH v3 01/13] x86/retpoline: Add initial retpoline support David Woodhouse
                     ` (13 more replies)
  2 siblings, 14 replies; 79+ messages in thread
From: Woodhouse, David @ 2018-01-04  9:30 UTC (permalink / raw)
  To: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

[-- Attachment #1: Type: text/plain, Size: 1130 bytes --]

On Thu, 2018-01-04 at 01:10 -0800, Paul Turner wrote:
> Apologies for the discombobulation around today's disclosure.  Obviously the
> original goal was to communicate this a little more coherently, but the
> unscheduled advances in the disclosure disrupted the efforts to pull this
> together more cleanly.
> 
> I wanted to open discussion the "retpoline" approach and and define its
> requirements so that we can separate the core
> details from questions regarding any particular implementation thereof.
> 
> As a starting point, a full write-up describing the approach is available at:
>   https://support.google.com/faqs/answer/7625886

Note that (ab)using 'ret' in this way is incompatible with CET on
upcoming processors. HJ added a -mno-indirect-branch-register option to
the latest round of GCC patches, which puts the branch target in a
register instead of on the stack. My kernel patches (which I'm about to
reconcile with Andi's tweaks and post) do the same.

That means that in the cases where at runtime we want to ALTERNATIVE
out the retpoline, it just turns back into a bare 'jmp *\reg'.


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")
  2018-01-04  9:24 ` Paul Turner
@ 2018-01-04  9:48   ` Greg Kroah-Hartman
  2018-01-04  9:56     ` Woodhouse, David
  0 siblings, 1 reply; 79+ messages in thread
From: Greg Kroah-Hartman @ 2018-01-04  9:48 UTC (permalink / raw)
  To: Paul Turner
  Cc: LKML, Linus Torvalds, Woodhouse, David, Tim Chen, Dave Hansen,
	tglx, Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, gnomes

On Thu, Jan 04, 2018 at 01:24:41AM -0800, Paul Turner wrote:
> On Thu, Jan 4, 2018 at 1:10 AM, Paul Turner <pjt@google.com> wrote:
> > Apologies for the discombobulation around today's disclosure.  Obviously the
> > original goal was to communicate this a little more coherently, but the
> > unscheduled advances in the disclosure disrupted the efforts to pull this
> > together more cleanly.
> >
> > I wanted to open discussion the "retpoline" approach and and define its
> > requirements so that we can separate the core
> > details from questions regarding any particular implementation thereof.
> >
> > As a starting point, a full write-up describing the approach is available at:
> >   https://support.google.com/faqs/answer/7625886
> >
> > The 30 second version is:
> > Returns are a special type of indirect branch.  As function returns are intended
> > to pair with function calls, processors often implement dedicated return stack
> > predictors.  The choice of this branch prediction allows us to generate an
> > indirect branch in which speculative execution is intentionally redirected into
> > a controlled location by a return stack target that we control.  Preventing
> > branch target injections (also known as "Spectre") against these binaries.
> >
> > On the targets (Intel Xeon) we have measured so far, cost is within cycles of a
> > "native" indirect branch for which branch prediction hardware has been disabled.
> > This is unfortunately measurable -- from 3 cycles on average to about 30.
> > However the cost is largely mitigated for many workloads since the kernel uses
> > comparatively few indirect branches (versus say, a C++ binary).  With some
> > effort we have the average overall overhead within the 0-1.5% range for our
> > internal workloads, including some particularly high packet processing engines.
> >
> > There are several components, the majority of which are independent of kernel
> > modifications:
> >
> > (1) A compiler supporting retpoline transformations.
> 
> An implementation for LLVM is available at:
>   https://reviews.llvm.org/D41723

Nice, thanks for the link and the write up.  There is also a patch for
gcc floating around somewhere, does anyone have the link for that?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")
  2018-01-04  9:48   ` Greg Kroah-Hartman
@ 2018-01-04  9:56     ` Woodhouse, David
  0 siblings, 0 replies; 79+ messages in thread
From: Woodhouse, David @ 2018-01-04  9:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Paul Turner
  Cc: LKML, Linus Torvalds, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

[-- Attachment #1: Type: text/plain, Size: 390 bytes --]

On Thu, 2018-01-04 at 10:48 +0100, Greg Kroah-Hartman wrote:
> 
> Nice, thanks for the link and the write up.  There is also a patch for
> gcc floating around somewhere, does anyone have the link for that?

http://git.infradead.org/users/dwmw2/gcc-retpoline.git/shortlog/refs/heads/gcc-7_2_0-retpoline-20171219

I put packages for Fedora 27 at ftp://ftp.infradead.org/pub/retpoline/

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04  9:30 ` Woodhouse, David
@ 2018-01-04 14:36   ` David Woodhouse
  2018-01-04 18:03     ` Linus Torvalds
                       ` (2 more replies)
  2018-01-04 14:36   ` [PATCH v3 02/13] x86/retpoline/crypto: Convert crypto assembler indirect jumps David Woodhouse
                     ` (12 subsequent siblings)
  13 siblings, 3 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:36 UTC (permalink / raw)
  To: ak
  Cc: David Woodhouse, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
the corresponding thunks. Provide assembler macros for invoking the thunks
in the same way that GCC does, from native and inline assembler.

This adds an X86_BUG_NO_RETPOLINE "feature" for runtime patching out
of the thunks. This is a placeholder for now; the patches which support
the new Intel/AMD microcode features will flesh out the precise conditions
under which we disable the retpoline and do other things instead.

[Andi Kleen: Rename the macros and add CONFIG_RETPOLINE option]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/Kconfig                   |  8 ++++++
 arch/x86/Makefile                  | 10 ++++++++
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/lib/Makefile              |  1 +
 arch/x86/lib/retpoline.S           | 50 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 70 insertions(+)
 create mode 100644 arch/x86/lib/retpoline.S

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d4fc98c50378..8b0facfa35be 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -429,6 +429,14 @@ config GOLDFISH
        def_bool y
        depends on X86_GOLDFISH
 
+config RETPOLINE
+	bool "Avoid speculative indirect branches in kernel"
+	default y
+	help
+	  Compile kernel with the retpoline compiler options to guard against
+	  kernel to user data leaks by avoiding speculative indirect
+	  branches. Requires a new enough compiler. The kernel may run slower.
+
 config INTEL_RDT
 	bool "Intel Resource Director Technology support"
 	default n
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 3e73bc255e4e..f772b3fef202 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -230,6 +230,16 @@ KBUILD_CFLAGS += -Wno-sign-compare
 #
 KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
 
+# Avoid indirect branches in kernel to deal with Spectre
+ifdef CONFIG_RETPOLINE
+    RETPOLINE_CFLAGS += $(call cc-option,-mindirect-branch=thunk-extern -mindirect-branch-register)
+    ifneq ($(RETPOLINE_CFLAGS),)
+	KBUILD_CFLAGS += $(RETPOLINE_CFLAGS) -DRETPOLINE
+    else
+	$(warning Retpoline not supported in compiler. System may be insecure.)
+    endif
+endif
+
 archscripts: scripts_basic
 	$(Q)$(MAKE) $(build)=arch/x86/tools relocs
 
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 07cdd1715705..900fa7016d3f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -342,5 +342,6 @@
 #define X86_BUG_MONITOR			X86_BUG(12) /* IPI required to wake up remote CPU */
 #define X86_BUG_AMD_E400		X86_BUG(13) /* CPU is among the affected by Erratum 400 */
 #define X86_BUG_CPU_INSECURE		X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
+#define X86_BUG_NO_RETPOLINE		X86_BUG(15) /* Placeholder: disable retpoline branch thunks */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 7b181b61170e..f23934bbaf4e 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -26,6 +26,7 @@ lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
 lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
+lib-$(CONFIG_RETPOLINE) += retpoline.o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
 
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
new file mode 100644
index 000000000000..bbdda5cc136e
--- /dev/null
+++ b/arch/x86/lib/retpoline.S
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/stringify.h>
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+#include <asm/cpufeatures.h>
+#include <asm/alternative-asm.h>
+
+.macro THUNK sp reg
+	.section .text.__x86.indirect_thunk.\reg
+
+ENTRY(__x86.indirect_thunk.\reg)
+	CFI_STARTPROC
+	ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
+1:
+	lfence
+	jmp	1b
+2:
+	mov	%\reg, (%\sp)
+	ret
+	CFI_ENDPROC
+ENDPROC(__x86.indirect_thunk.\reg)
+.endm
+
+#ifdef CONFIG_64BIT
+.irp reg rax rbx rcx rdx rsi rdi rbp r8 r9 r10 r11 r12 r13 r14 r15
+	THUNK rsp \reg
+.endr
+#else
+.irp reg eax ebx ecx edx esi edi ebp
+	THUNK esp \reg
+.endr
+
+/*
+ * Also provide the original ret-equivalent retpoline for i386 because it's
+ * so register-starved, and we don't care about CET compatibility here.
+ */
+ENTRY(__x86.indirect_thunk)
+	CFI_STARTPROC
+	ALTERNATIVE "call 2f", "ret", X86_BUG_NO_RETPOLINE
+1:
+	lfence
+	jmp	1b
+2:
+	lea	4(%esp), %esp
+	ret
+	CFI_ENDPROC
+ENDPROC(__x86.indirect_thunk)
+
+#endif
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 02/13] x86/retpoline/crypto: Convert crypto assembler indirect jumps
  2018-01-04  9:30 ` Woodhouse, David
  2018-01-04 14:36   ` [PATCH v3 01/13] x86/retpoline: Add initial retpoline support David Woodhouse
@ 2018-01-04 14:36   ` David Woodhouse
  2018-01-04 14:37   ` [PATCH v3 03/13] x86/retpoline/entry: Convert entry " David Woodhouse
                     ` (11 subsequent siblings)
  13 siblings, 0 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:36 UTC (permalink / raw)
  To: ak
  Cc: David Woodhouse, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

Convert all indirect jumps in crypto assembler code to use non-speculative
sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/crypto/aesni-intel_asm.S            | 5 +++--
 arch/x86/crypto/camellia-aesni-avx-asm_64.S  | 3 ++-
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 3 ++-
 arch/x86/crypto/crc32c-pcl-intel-asm_64.S    | 3 ++-
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 16627fec80b2..074c13767c9f 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -32,6 +32,7 @@
 #include <linux/linkage.h>
 #include <asm/inst.h>
 #include <asm/frame.h>
+#include <asm/nospec-branch.h>
 
 /*
  * The following macros are used to move an (un)aligned 16 byte value to/from
@@ -2884,7 +2885,7 @@ ENTRY(aesni_xts_crypt8)
 	pxor INC, STATE4
 	movdqu IV, 0x30(OUTP)
 
-	call *%r11
+	NOSPEC_CALL r11
 
 	movdqu 0x00(OUTP), INC
 	pxor INC, STATE1
@@ -2929,7 +2930,7 @@ ENTRY(aesni_xts_crypt8)
 	_aesni_gf128mul_x_ble()
 	movups IV, (IVP)
 
-	call *%r11
+	NOSPEC_CALL r11
 
 	movdqu 0x40(OUTP), INC
 	pxor INC, STATE1
diff --git a/arch/x86/crypto/camellia-aesni-avx-asm_64.S b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
index f7c495e2863c..98a717ba5e1a 100644
--- a/arch/x86/crypto/camellia-aesni-avx-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
@@ -17,6 +17,7 @@
 
 #include <linux/linkage.h>
 #include <asm/frame.h>
+#include <asm/nospec-branch.h>
 
 #define CAMELLIA_TABLE_BYTE_LEN 272
 
@@ -1227,7 +1228,7 @@ camellia_xts_crypt_16way:
 	vpxor 14 * 16(%rax), %xmm15, %xmm14;
 	vpxor 15 * 16(%rax), %xmm15, %xmm15;
 
-	call *%r9;
+	NOSPEC_CALL r9;
 
 	addq $(16 * 16), %rsp;
 
diff --git a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
index eee5b3982cfd..99d09d3166a5 100644
--- a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
@@ -12,6 +12,7 @@
 
 #include <linux/linkage.h>
 #include <asm/frame.h>
+#include <asm/nospec-branch.h>
 
 #define CAMELLIA_TABLE_BYTE_LEN 272
 
@@ -1343,7 +1344,7 @@ camellia_xts_crypt_32way:
 	vpxor 14 * 32(%rax), %ymm15, %ymm14;
 	vpxor 15 * 32(%rax), %ymm15, %ymm15;
 
-	call *%r9;
+	NOSPEC_CALL r9;
 
 	addq $(16 * 32), %rsp;
 
diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index 7a7de27c6f41..05178b44317d 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -45,6 +45,7 @@
 
 #include <asm/inst.h>
 #include <linux/linkage.h>
+#include <asm/nospec-branch.h>
 
 ## ISCSI CRC 32 Implementation with crc32 and pclmulqdq Instruction
 
@@ -172,7 +173,7 @@ continue_block:
 	movzxw  (bufp, %rax, 2), len
 	lea	crc_array(%rip), bufp
 	lea     (bufp, len, 1), bufp
-	jmp     *bufp
+	NOSPEC_JMP bufp
 
 	################################################################
 	## 2a) PROCESS FULL BLOCKS:
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 03/13] x86/retpoline/entry: Convert entry assembler indirect jumps
  2018-01-04  9:30 ` Woodhouse, David
  2018-01-04 14:36   ` [PATCH v3 01/13] x86/retpoline: Add initial retpoline support David Woodhouse
  2018-01-04 14:36   ` [PATCH v3 02/13] x86/retpoline/crypto: Convert crypto assembler indirect jumps David Woodhouse
@ 2018-01-04 14:37   ` David Woodhouse
  2018-01-04 14:46     ` Dave Hansen
  2018-01-04 14:37   ` [PATCH v3 04/13] x86/retpoline/ftrace: Convert ftrace " David Woodhouse
                     ` (10 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:37 UTC (permalink / raw)
  To: ak
  Cc: David Woodhouse, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

Convert indirect jumps in core 32/64bit entry assembler code to use
non-speculative sequences when CONFIG_RETPOLINE is enabled.

KPTI complicates this a little; the one in entry_SYSCALL_64_trampoline
can't just jump to the thunk because the thunk isn't mapped. So it gets
its own copy of the thunk, inline.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/entry/entry_32.S |  5 +++--
 arch/x86/entry/entry_64.S | 20 ++++++++++++++++----
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index ace8f321a5a1..abd1e5dd487d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -44,6 +44,7 @@
 #include <asm/asm.h>
 #include <asm/smap.h>
 #include <asm/frame.h>
+#include <asm/nospec-branch.h>
 
 	.section .entry.text, "ax"
 
@@ -290,7 +291,7 @@ ENTRY(ret_from_fork)
 
 	/* kernel thread */
 1:	movl	%edi, %eax
-	call	*%ebx
+	NOSPEC_CALL ebx
 	/*
 	 * A kernel thread is allowed to return here after successfully
 	 * calling do_execve().  Exit to userspace to complete the execve()
@@ -919,7 +920,7 @@ common_exception:
 	movl	%ecx, %es
 	TRACE_IRQS_OFF
 	movl	%esp, %eax			# pt_regs pointer
-	call	*%edi
+	NOSPEC_CALL edi
 	jmp	ret_from_exception
 END(common_exception)
 
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f048e384ff54..9e449701115a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -37,6 +37,7 @@
 #include <asm/pgtable_types.h>
 #include <asm/export.h>
 #include <asm/frame.h>
+#include <asm/nospec-branch.h>
 #include <linux/err.h>
 
 #include "calling.h"
@@ -191,7 +192,17 @@ ENTRY(entry_SYSCALL_64_trampoline)
 	 */
 	pushq	%rdi
 	movq	$entry_SYSCALL_64_stage2, %rdi
-	jmp	*%rdi
+	/*
+	 * Open-code the retpoline from retpoline.S, because we can't
+	 * just jump to it directly.
+	 */
+	ALTERNATIVE "call 2f", "jmp *%rdi", X86_BUG_NO_RETPOLINE
+1:
+	lfence
+	jmp	1b
+2:
+	mov	%rdi, (%rsp)
+	ret
 END(entry_SYSCALL_64_trampoline)
 
 	.popsection
@@ -270,7 +281,8 @@ entry_SYSCALL_64_fastpath:
 	 * It might end up jumping to the slow path.  If it jumps, RAX
 	 * and all argument registers are clobbered.
 	 */
-	call	*sys_call_table(, %rax, 8)
+	movq	sys_call_table(, %rax, 8), %rax
+	NOSPEC_CALL rax
 .Lentry_SYSCALL_64_after_fastpath_call:
 
 	movq	%rax, RAX(%rsp)
@@ -442,7 +454,7 @@ ENTRY(stub_ptregs_64)
 	jmp	entry_SYSCALL64_slow_path
 
 1:
-	jmp	*%rax				/* Called from C */
+	NOSPEC_JMP rax				/* Called from C */
 END(stub_ptregs_64)
 
 .macro ptregs_stub func
@@ -521,7 +533,7 @@ ENTRY(ret_from_fork)
 1:
 	/* kernel thread */
 	movq	%r12, %rdi
-	call	*%rbx
+	NOSPEC_CALL rbx
 	/*
 	 * A kernel thread is allowed to return here after successfully
 	 * calling do_execve().  Exit to userspace to complete the execve()
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 04/13] x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
  2018-01-04  9:30 ` Woodhouse, David
                     ` (2 preceding siblings ...)
  2018-01-04 14:37   ` [PATCH v3 03/13] x86/retpoline/entry: Convert entry " David Woodhouse
@ 2018-01-04 14:37   ` David Woodhouse
  2018-01-04 14:37   ` [PATCH v3 05/13] x86/retpoline/hyperv: Convert " David Woodhouse
                     ` (9 subsequent siblings)
  13 siblings, 0 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:37 UTC (permalink / raw)
  To: ak
  Cc: David Woodhouse, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

Convert all indirect jumps in ftrace assembler code to use non-speculative
sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kernel/ftrace_32.S | 6 ++++--
 arch/x86/kernel/ftrace_64.S | 8 ++++----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/ftrace_32.S b/arch/x86/kernel/ftrace_32.S
index b6c6468e10bc..eb9a56fbda02 100644
--- a/arch/x86/kernel/ftrace_32.S
+++ b/arch/x86/kernel/ftrace_32.S
@@ -8,6 +8,7 @@
 #include <asm/segment.h>
 #include <asm/export.h>
 #include <asm/ftrace.h>
+#include <asm/nospec-branch.h>
 
 #ifdef CC_USING_FENTRY
 # define function_hook	__fentry__
@@ -197,7 +198,8 @@ ftrace_stub:
 	movl	0x4(%ebp), %edx
 	subl	$MCOUNT_INSN_SIZE, %eax
 
-	call	*ftrace_trace_function
+	movl	ftrace_trace_function, %ecx
+	NOSPEC_CALL ecx
 
 	popl	%edx
 	popl	%ecx
@@ -241,5 +243,5 @@ return_to_handler:
 	movl	%eax, %ecx
 	popl	%edx
 	popl	%eax
-	jmp	*%ecx
+	NOSPEC_JMP ecx
 #endif
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index c832291d948a..d4611d8bdbbf 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -7,7 +7,7 @@
 #include <asm/ptrace.h>
 #include <asm/ftrace.h>
 #include <asm/export.h>
-
+#include <asm/nospec-branch.h>
 
 	.code64
 	.section .entry.text, "ax"
@@ -286,8 +286,8 @@ trace:
 	 * ip and parent ip are used and the list function is called when
 	 * function tracing is enabled.
 	 */
-	call   *ftrace_trace_function
-
+	movq ftrace_trace_function, %r8
+	NOSPEC_CALL r8
 	restore_mcount_regs
 
 	jmp fgraph_trace
@@ -329,5 +329,5 @@ GLOBAL(return_to_handler)
 	movq 8(%rsp), %rdx
 	movq (%rsp), %rax
 	addq $24, %rsp
-	jmp *%rdi
+	NOSPEC_JMP rdi
 #endif
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 05/13] x86/retpoline/hyperv: Convert assembler indirect jumps
  2018-01-04  9:30 ` Woodhouse, David
                     ` (3 preceding siblings ...)
  2018-01-04 14:37   ` [PATCH v3 04/13] x86/retpoline/ftrace: Convert ftrace " David Woodhouse
@ 2018-01-04 14:37   ` David Woodhouse
  2018-01-04 14:37   ` [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall " David Woodhouse
                     ` (8 subsequent siblings)
  13 siblings, 0 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:37 UTC (permalink / raw)
  To: ak
  Cc: David Woodhouse, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

Convert all indirect jumps in hyperv inline asm code to use non-speculative
sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/mshyperv.h | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 5400add2885b..532ab441f39a 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -7,6 +7,7 @@
 #include <linux/nmi.h>
 #include <asm/io.h>
 #include <asm/hyperv.h>
+#include <asm/nospec-branch.h>
 
 /*
  * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent
@@ -186,10 +187,11 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
 		return U64_MAX;
 
 	__asm__ __volatile__("mov %4, %%r8\n"
-			     "call *%5"
+			     NOSPEC_CALL
 			     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
 			       "+c" (control), "+d" (input_address)
-			     :  "r" (output_address), "m" (hv_hypercall_pg)
+			     :  "r" (output_address),
+				THUNK_TARGET(hv_hypercall_pg)
 			     : "cc", "memory", "r8", "r9", "r10", "r11");
 #else
 	u32 input_address_hi = upper_32_bits(input_address);
@@ -200,13 +202,13 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
 	if (!hv_hypercall_pg)
 		return U64_MAX;
 
-	__asm__ __volatile__("call *%7"
+	__asm__ __volatile__(NOSPEC_CALL
 			     : "=A" (hv_status),
 			       "+c" (input_address_lo), ASM_CALL_CONSTRAINT
 			     : "A" (control),
 			       "b" (input_address_hi),
 			       "D"(output_address_hi), "S"(output_address_lo),
-			       "m" (hv_hypercall_pg)
+			       THUNK_TARGET(hv_hypercall_pg)
 			     : "cc", "memory");
 #endif /* !x86_64 */
 	return hv_status;
@@ -227,10 +229,10 @@ static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)
 
 #ifdef CONFIG_X86_64
 	{
-		__asm__ __volatile__("call *%4"
+		__asm__ __volatile__(NOSPEC_CALL
 				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
 				       "+c" (control), "+d" (input1)
-				     : "m" (hv_hypercall_pg)
+				     : THUNK_TARGET(hv_hypercall_pg)
 				     : "cc", "r8", "r9", "r10", "r11");
 	}
 #else
@@ -238,13 +240,13 @@ static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)
 		u32 input1_hi = upper_32_bits(input1);
 		u32 input1_lo = lower_32_bits(input1);
 
-		__asm__ __volatile__ ("call *%5"
+		__asm__ __volatile__ (NOSPEC_CALL
 				      : "=A"(hv_status),
 					"+c"(input1_lo),
 					ASM_CALL_CONSTRAINT
 				      :	"A" (control),
 					"b" (input1_hi),
-					"m" (hv_hypercall_pg)
+					THUNK_TARGET(hv_hypercall_pg)
 				      : "cc", "edi", "esi");
 	}
 #endif
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps
  2018-01-04  9:30 ` Woodhouse, David
                     ` (4 preceding siblings ...)
  2018-01-04 14:37   ` [PATCH v3 05/13] x86/retpoline/hyperv: Convert " David Woodhouse
@ 2018-01-04 14:37   ` David Woodhouse
  2018-01-04 15:10     ` Juergen Gross
  2018-01-04 15:54     ` Juergen Gross
  2018-01-04 14:37   ` [PATCH v3 07/13] x86/retpoline/checksum32: Convert assembler " David Woodhouse
                     ` (7 subsequent siblings)
  13 siblings, 2 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:37 UTC (permalink / raw)
  To: ak
  Cc: David Woodhouse, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

Convert indirect call in Xen hypercall to use non-speculative sequence,
when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/xen/hypercall.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index 7cb282e9e587..393c0048c63e 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -44,6 +44,7 @@
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/smap.h>
+#include <asm/nospec-branch.h>
 
 #include <xen/interface/xen.h>
 #include <xen/interface/sched.h>
@@ -217,9 +218,9 @@ privcmd_call(unsigned call,
 	__HYPERCALL_5ARG(a1, a2, a3, a4, a5);
 
 	stac();
-	asm volatile("call *%[call]"
+	asm volatile(NOSPEC_CALL
 		     : __HYPERCALL_5PARAM
-		     : [call] "a" (&hypercall_page[call])
+		     : [thunk_target] "a" (&hypercall_page[call])
 		     : __HYPERCALL_CLOBBER5);
 	clac();
 
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 07/13] x86/retpoline/checksum32: Convert assembler indirect jumps
  2018-01-04  9:30 ` Woodhouse, David
                     ` (5 preceding siblings ...)
  2018-01-04 14:37   ` [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall " David Woodhouse
@ 2018-01-04 14:37   ` David Woodhouse
  2018-01-04 14:37   ` [PATCH v3 08/13] x86/alternatives: Add missing \n at end of ALTERNATIVE inline asm David Woodhouse
                     ` (6 subsequent siblings)
  13 siblings, 0 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:37 UTC (permalink / raw)
  To: ak
  Cc: David Woodhouse, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

Convert all indirect jumps in 32bit checksum assembler code to use
non-speculative sequences when CONFIG_RETPOLINE is enabled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/lib/checksum_32.S | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/lib/checksum_32.S b/arch/x86/lib/checksum_32.S
index 4d34bb548b41..d5ef7cc0efd8 100644
--- a/arch/x86/lib/checksum_32.S
+++ b/arch/x86/lib/checksum_32.S
@@ -29,7 +29,8 @@
 #include <asm/errno.h>
 #include <asm/asm.h>
 #include <asm/export.h>
-				
+#include <asm/nospec-branch.h>
+
 /*
  * computes a partial checksum, e.g. for TCP/UDP fragments
  */
@@ -156,7 +157,7 @@ ENTRY(csum_partial)
 	negl %ebx
 	lea 45f(%ebx,%ebx,2), %ebx
 	testl %esi, %esi
-	jmp *%ebx
+	NOSPEC_JMP ebx
 
 	# Handle 2-byte-aligned regions
 20:	addw (%esi), %ax
@@ -439,7 +440,7 @@ ENTRY(csum_partial_copy_generic)
 	andl $-32,%edx
 	lea 3f(%ebx,%ebx), %ebx
 	testl %esi, %esi 
-	jmp *%ebx
+	NOSPEC_JMP ebx
 1:	addl $64,%esi
 	addl $64,%edi 
 	SRC(movb -32(%edx),%bl)	; SRC(movb (%edx),%bl)
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 08/13] x86/alternatives: Add missing \n at end of ALTERNATIVE inline asm
  2018-01-04  9:30 ` Woodhouse, David
                     ` (6 preceding siblings ...)
  2018-01-04 14:37   ` [PATCH v3 07/13] x86/retpoline/checksum32: Convert assembler " David Woodhouse
@ 2018-01-04 14:37   ` David Woodhouse
  2018-01-05 13:04     ` [tip:x86/pti] x86/alternatives: Add missing '\n' " tip-bot for David Woodhouse
  2018-01-04 14:37   ` [PATCH v3 09/13] x86/retpoline/irq32: Convert assembler indirect jumps David Woodhouse
                     ` (5 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:37 UTC (permalink / raw)
  To: ak
  Cc: David Woodhouse, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

Where an ALTERNATIVE is used in the middle of an inline asm block, this
would otherwise lead to the following instruction being appended directly
to the trailing ".popsection", and a failed compile.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/alternative.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index dbfd0854651f..cf5961ca8677 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -140,7 +140,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
 	".popsection\n"							\
 	".pushsection .altinstr_replacement, \"ax\"\n"			\
 	ALTINSTR_REPLACEMENT(newinstr, feature, 1)			\
-	".popsection"
+	".popsection\n"
 
 #define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)\
 	OLDINSTR_2(oldinstr, 1, 2)					\
@@ -151,7 +151,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
 	".pushsection .altinstr_replacement, \"ax\"\n"			\
 	ALTINSTR_REPLACEMENT(newinstr1, feature1, 1)			\
 	ALTINSTR_REPLACEMENT(newinstr2, feature2, 2)			\
-	".popsection"
+	".popsection\n"
 
 /*
  * Alternative instructions for different CPU types or capabilities.
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 09/13] x86/retpoline/irq32: Convert assembler indirect jumps
  2018-01-04  9:30 ` Woodhouse, David
                     ` (7 preceding siblings ...)
  2018-01-04 14:37   ` [PATCH v3 08/13] x86/alternatives: Add missing \n at end of ALTERNATIVE inline asm David Woodhouse
@ 2018-01-04 14:37   ` David Woodhouse
  2018-01-04 14:37   ` [PATCH v3 10/13] x86/retpoline/pvops: " David Woodhouse
                     ` (4 subsequent siblings)
  13 siblings, 0 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:37 UTC (permalink / raw)
  To: ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes, David Woodhouse

From: Andi Kleen <ak@linux.intel.com>

Convert all indirect jumps in 32bit irq inline asm code to use
non speculative sequences.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kernel/irq_32.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index a83b3346a0e1..e1e58f738c3d 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -20,6 +20,7 @@
 #include <linux/mm.h>
 
 #include <asm/apic.h>
+#include <asm/nospec-branch.h>
 
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
 
@@ -55,11 +56,11 @@ DEFINE_PER_CPU(struct irq_stack *, softirq_stack);
 static void call_on_stack(void *func, void *stack)
 {
 	asm volatile("xchgl	%%ebx,%%esp	\n"
-		     "call	*%%edi		\n"
+		     NOSPEC_CALL
 		     "movl	%%ebx,%%esp	\n"
 		     : "=b" (stack)
 		     : "0" (stack),
-		       "D"(func)
+		       [thunk_target] "D"(func)
 		     : "memory", "cc", "edx", "ecx", "eax");
 }
 
@@ -95,11 +96,11 @@ static inline int execute_on_irq_stack(int overflow, struct irq_desc *desc)
 		call_on_stack(print_stack_overflow, isp);
 
 	asm volatile("xchgl	%%ebx,%%esp	\n"
-		     "call	*%%edi		\n"
+		     NOSPEC_CALL
 		     "movl	%%ebx,%%esp	\n"
 		     : "=a" (arg1), "=b" (isp)
 		     :  "0" (desc),   "1" (isp),
-			"D" (desc->handle_irq)
+			[thunk_target] "D" (desc->handle_irq)
 		     : "memory", "cc", "ecx");
 	return 1;
 }
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps
  2018-01-04  9:30 ` Woodhouse, David
                     ` (8 preceding siblings ...)
  2018-01-04 14:37   ` [PATCH v3 09/13] x86/retpoline/irq32: Convert assembler indirect jumps David Woodhouse
@ 2018-01-04 14:37   ` David Woodhouse
  2018-01-04 15:02     ` Juergen Gross
  2018-01-04 14:37   ` [PATCH v3 11/13] retpoline/taint: Taint kernel for missing retpoline in compiler David Woodhouse
                     ` (3 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:37 UTC (permalink / raw)
  To: ak
  Cc: David Woodhouse, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

Convert pvops invocations to use non-speculative call sequences, when
CONFIG_RETPOLINE is enabled.

There is scope for future optimisation here — once the pvops methods are
actually set, we could just turn the damn things into *direct* jumps.
But this is perfectly sufficient for now, without that added complexity.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/paravirt_types.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 6ec54d01972d..54b735b8ae12 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -336,11 +336,17 @@ extern struct pv_lock_ops pv_lock_ops;
 #define PARAVIRT_PATCH(x)					\
 	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
 
+#define paravirt_clobber(clobber)		\
+	[paravirt_clobber] "i" (clobber)
+#ifdef CONFIG_RETPOLINE
+#define paravirt_type(op)				\
+	[paravirt_typenum] "i" (PARAVIRT_PATCH(op)),	\
+	[paravirt_opptr] "r" ((op))
+#else
 #define paravirt_type(op)				\
 	[paravirt_typenum] "i" (PARAVIRT_PATCH(op)),	\
 	[paravirt_opptr] "i" (&(op))
-#define paravirt_clobber(clobber)		\
-	[paravirt_clobber] "i" (clobber)
+#endif
 
 /*
  * Generate some code, and mark it as patchable by the
@@ -392,7 +398,11 @@ int paravirt_disable_iospace(void);
  * offset into the paravirt_patch_template structure, and can therefore be
  * freely converted back into a structure offset.
  */
+#ifdef CONFIG_RETPOLINE
+#define PARAVIRT_CALL	"call __x86.indirect_thunk.%V[paravirt_opptr];"
+#else
 #define PARAVIRT_CALL	"call *%c[paravirt_opptr];"
+#endif
 
 /*
  * These macros are intended to wrap calls through one of the paravirt
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 11/13] retpoline/taint: Taint kernel for missing retpoline in compiler
  2018-01-04  9:30 ` Woodhouse, David
                     ` (9 preceding siblings ...)
  2018-01-04 14:37   ` [PATCH v3 10/13] x86/retpoline/pvops: " David Woodhouse
@ 2018-01-04 14:37   ` David Woodhouse
  2018-01-04 22:06     ` Justin Forbes
  2018-01-04 14:37   ` [PATCH v3 12/13] retpoline/objtool: Disable some objtool warnings David Woodhouse
                     ` (2 subsequent siblings)
  13 siblings, 1 reply; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:37 UTC (permalink / raw)
  To: ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes, David Woodhouse

From: Andi Kleen <ak@linux.intel.com>

When the kernel or a module hasn't been compiled with a retpoline
aware compiler, print a warning and set a taint flag.

For modules it is checked at compile time, however it cannot
check assembler or other non compiled objects used in the module link.

Due to lack of better letter it uses taint option 'Z'

v2: Change warning message
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 Documentation/admin-guide/tainted-kernels.rst |  3 +++
 arch/x86/kernel/setup.c                       |  6 ++++++
 include/linux/kernel.h                        |  4 +++-
 kernel/module.c                               | 11 ++++++++++-
 kernel/panic.c                                |  1 +
 scripts/mod/modpost.c                         |  9 +++++++++
 6 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index 1df03b5cb02f..800261b6bd6f 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -52,6 +52,9 @@ characters, each representing a particular tainted value.
 
  16) ``K`` if the kernel has been live patched.
 
+ 17) ``Z`` if the x86 kernel or a module hasn't been compiled with
+     a retpoline aware compiler and may be vulnerable to data leaks.
+
 The primary reason for the **'Tainted: '** string is to tell kernel
 debuggers if this is a clean kernel or if anything unusual has
 occurred.  Tainting is permanent: even if an offending module is
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 8af2e8d0c0a1..cc880b46b756 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1296,6 +1296,12 @@ void __init setup_arch(char **cmdline_p)
 #endif
 
 	unwind_init();
+
+#ifndef RETPOLINE
+	add_taint(TAINT_NO_RETPOLINE, LOCKDEP_STILL_OK);
+	pr_warn("No support for retpoline in kernel compiler\n");
+	pr_warn("System may be vulnerable to data leaks.\n");
+#endif
 }
 
 #ifdef CONFIG_X86_32
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index ce51455e2adf..fbb4d3baffcc 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -550,7 +550,9 @@ extern enum system_states {
 #define TAINT_SOFTLOCKUP		14
 #define TAINT_LIVEPATCH			15
 #define TAINT_AUX			16
-#define TAINT_FLAGS_COUNT		17
+#define TAINT_NO_RETPOLINE		17
+
+#define TAINT_FLAGS_COUNT		18
 
 struct taint_flag {
 	char c_true;	/* character printed when tainted */
diff --git a/kernel/module.c b/kernel/module.c
index dea01ac9cb74..92db3f59a29a 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3028,7 +3028,16 @@ static int check_modinfo(struct module *mod, struct load_info *info, int flags)
 				mod->name);
 		add_taint_module(mod, TAINT_OOT_MODULE, LOCKDEP_STILL_OK);
 	}
-
+#ifdef RETPOLINE
+	if (!get_modinfo(info, "retpoline")) {
+		if (!test_taint(TAINT_NO_RETPOLINE)) {
+			pr_warn("%s: loading module not compiled with retpoline compiler.\n",
+				mod->name);
+			pr_warn("Kernel may be vulnerable to data leaks.\n");
+		}
+		add_taint_module(mod, TAINT_NO_RETPOLINE, LOCKDEP_STILL_OK);
+	}
+#endif
 	if (get_modinfo(info, "staging")) {
 		add_taint_module(mod, TAINT_CRAP, LOCKDEP_STILL_OK);
 		pr_warn("%s: module is from the staging directory, the quality "
diff --git a/kernel/panic.c b/kernel/panic.c
index 2cfef408fec9..6686c67b6e4b 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -325,6 +325,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
 	{ 'L', ' ', false },	/* TAINT_SOFTLOCKUP */
 	{ 'K', ' ', true },	/* TAINT_LIVEPATCH */
 	{ 'X', ' ', true },	/* TAINT_AUX */
+	{ 'Z', ' ', true },	/* TAINT_NO_RETPOLINE */
 };
 
 /**
diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index f51cf977c65b..6510536c06df 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -2165,6 +2165,14 @@ static void add_intree_flag(struct buffer *b, int is_intree)
 		buf_printf(b, "\nMODULE_INFO(intree, \"Y\");\n");
 }
 
+/* Cannot check for assembler */
+static void add_retpoline(struct buffer *b)
+{
+	buf_printf(b, "\n#ifdef RETPOLINE\n");
+	buf_printf(b, "MODULE_INFO(retpoline, \"Y\");\n");
+	buf_printf(b, "#endif\n");
+}
+
 static void add_staging_flag(struct buffer *b, const char *name)
 {
 	static const char *staging_dir = "drivers/staging";
@@ -2506,6 +2514,7 @@ int main(int argc, char **argv)
 		err |= check_modname_len(mod);
 		add_header(&buf, mod);
 		add_intree_flag(&buf, !external_module);
+		add_retpoline(&buf);
 		add_staging_flag(&buf, mod->name);
 		err |= add_versions(&buf, mod);
 		add_depends(&buf, mod, modules);
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 12/13] retpoline/objtool: Disable some objtool warnings
  2018-01-04  9:30 ` Woodhouse, David
                     ` (10 preceding siblings ...)
  2018-01-04 14:37   ` [PATCH v3 11/13] retpoline/taint: Taint kernel for missing retpoline in compiler David Woodhouse
@ 2018-01-04 14:37   ` David Woodhouse
  2018-01-04 14:37   ` [PATCH v3 13/13] retpoline: Attempt to quiten objtool warning for unreachable code David Woodhouse
  2018-01-04 16:18   ` [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre") Andy Lutomirski
  13 siblings, 0 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:37 UTC (permalink / raw)
  To: ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes, David Woodhouse

From: Andi Kleen <ak@linux.intel.com>

With the indirect call thunk enabled compiler two objtool
warnings are triggered very frequently and make the build
very noisy.

I don't see a good way to avoid them, so just disable them
for now.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 tools/objtool/check.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 9b341584eb1b..435c71f944dc 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -503,8 +503,13 @@ static int add_call_destinations(struct objtool_file *file)
 			insn->call_dest = find_symbol_by_offset(insn->sec,
 								dest_off);
 			if (!insn->call_dest) {
+#if 0
+				/* Compilers with -mindirect-branch=thunk-extern trigger
+				 * this everywhere on x86. Disable for now.
+				 */
 				WARN_FUNC("can't find call dest symbol at offset 0x%lx",
 					  insn->sec, insn->offset, dest_off);
+#endif
 				return -1;
 			}
 		} else if (rela->sym->type == STT_SECTION) {
@@ -1716,8 +1721,14 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 					return 1;
 
 			} else if (func && has_modified_stack_frame(&state)) {
+#if 0
+				/* Compilers with -mindirect-branch=thunk-extern trigger
+				 * this everywhere on x86. Disable for now.
+				 */
+
 				WARN_FUNC("sibling call from callable instruction with modified stack frame",
 					  sec, insn->offset);
+#endif
 				return 1;
 			}
 
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 13/13] retpoline: Attempt to quiten objtool warning for unreachable code
  2018-01-04  9:30 ` Woodhouse, David
                     ` (11 preceding siblings ...)
  2018-01-04 14:37   ` [PATCH v3 12/13] retpoline/objtool: Disable some objtool warnings David Woodhouse
@ 2018-01-04 14:37   ` David Woodhouse
  2018-01-04 16:18   ` [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre") Andy Lutomirski
  13 siblings, 0 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 14:37 UTC (permalink / raw)
  To: ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes, jpoimboe

From: Andi Kleen <ak@linux.intel.com>

The speculative jump trampoline has to contain unreachable code.
objtool keeps complaining

arch/x86/lib/retpoline.o: warning: objtool: __x86.indirect_thunk()+0x8: unreachable instruction

I tried to fix it here by adding ASM_UNREACHABLE annotation (after
supporting them for pure assembler), but it still complains.
Seems like a objtool bug?

So it doesn't actually fix the warning oyet.
Of course it's just a warning so the kernel will still work fine.

Perhaps Josh can figure it out

Cc: jpoimboe@redhat.com
Not-Signed-off-by: Andi Kleen <ak@linux.intel.com>
Not-Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/lib/retpoline.S |  5 +++++
 include/linux/compiler.h | 10 +++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index bbdda5cc136e..8112feaea6ff 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -2,6 +2,7 @@
 
 #include <linux/stringify.h>
 #include <linux/linkage.h>
+#include <linux/compiler.h>
 #include <asm/dwarf2.h>
 #include <asm/cpufeatures.h>
 #include <asm/alternative-asm.h>
@@ -14,7 +15,9 @@ ENTRY(__x86.indirect_thunk.\reg)
 	ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
 1:
 	lfence
+	ASM_UNREACHABLE
 	jmp	1b
+	ASM_UNREACHABLE
 2:
 	mov	%\reg, (%\sp)
 	ret
@@ -40,7 +43,9 @@ ENTRY(__x86.indirect_thunk)
 	ALTERNATIVE "call 2f", "ret", X86_BUG_NO_RETPOLINE
 1:
 	lfence
+	ASM_UNREACHABLE
 	jmp	1b
+	ASM_UNREACHABLE
 2:
 	lea	4(%esp), %esp
 	ret
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 52e611ab9a6c..cfba91acc79a 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -269,7 +269,15 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 
 #endif /* __KERNEL__ */
 
-#endif /* __ASSEMBLY__ */
+#else /* __ASSEMBLY__ */
+
+#define ASM_UNREACHABLE \
+	999:						\
+	.pushsection .discard.unreachable;		\
+	.long 999b - .;					\
+	.popsection
+
+#endif /* !__ASSEMBLY__ */
 
 /* Compile time object size, -1 for unknown */
 #ifndef __compiletime_object_size
-- 
2.14.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 03/13] x86/retpoline/entry: Convert entry assembler indirect jumps
  2018-01-04 14:37   ` [PATCH v3 03/13] x86/retpoline/entry: Convert entry " David Woodhouse
@ 2018-01-04 14:46     ` Dave Hansen
  2018-01-04 14:49       ` Woodhouse, David
  0 siblings, 1 reply; 79+ messages in thread
From: Dave Hansen @ 2018-01-04 14:46 UTC (permalink / raw)
  To: David Woodhouse, ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	tglx, Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, gnomes

On 01/04/2018 06:37 AM, David Woodhouse wrote:
> KPTI complicates this a little; the one in entry_SYSCALL_64_trampoline
> can't just jump to the thunk because the thunk isn't mapped. So it gets
> its own copy of the thunk, inline.

This one call site isn't too painful, of course.

But, is there anything keeping us from just sticking the thunk in the
entry text section where it would be available while still in the
trampoline?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 03/13] x86/retpoline/entry: Convert entry assembler indirect jumps
  2018-01-04 14:46     ` Dave Hansen
@ 2018-01-04 14:49       ` Woodhouse, David
  0 siblings, 0 replies; 79+ messages in thread
From: Woodhouse, David @ 2018-01-04 14:49 UTC (permalink / raw)
  To: Dave Hansen, ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	tglx, Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, gnomes

[-- Attachment #1: Type: text/plain, Size: 734 bytes --]

On Thu, 2018-01-04 at 06:46 -0800, Dave Hansen wrote:
> On 01/04/2018 06:37 AM, David Woodhouse wrote:
> > KPTI complicates this a little; the one in entry_SYSCALL_64_trampoline
> > can't just jump to the thunk because the thunk isn't mapped. So it gets
> > its own copy of the thunk, inline.
> 
> This one call site isn't too painful, of course.
> 
> But, is there anything keeping us from just sticking the thunk in the
> entry text section where it would be available while still in the
> trampoline?

Not really. Since we have a thunk per register and the trampoline is
limited in size, if you wanted to put just the *one* thunk that's
needed into the trampoline then it's slightly icky, but all perfectly
feasible.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps
  2018-01-04 14:37   ` [PATCH v3 10/13] x86/retpoline/pvops: " David Woodhouse
@ 2018-01-04 15:02     ` Juergen Gross
  2018-01-04 15:12       ` Woodhouse, David
                         ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Juergen Gross @ 2018-01-04 15:02 UTC (permalink / raw)
  To: David Woodhouse, ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

On 04/01/18 15:37, David Woodhouse wrote:
> Convert pvops invocations to use non-speculative call sequences, when
> CONFIG_RETPOLINE is enabled.
> 
> There is scope for future optimisation here — once the pvops methods are
> actually set, we could just turn the damn things into *direct* jumps.
> But this is perfectly sufficient for now, without that added complexity.

I don't see the need to modify the pvops calls.

All indirect calls are replaced by either direct calls or other code
long before any user code is active.

For modules the replacements are in place before the module is being
used.


Juergen

> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/include/asm/paravirt_types.h | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 6ec54d01972d..54b735b8ae12 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -336,11 +336,17 @@ extern struct pv_lock_ops pv_lock_ops;
>  #define PARAVIRT_PATCH(x)					\
>  	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
>  
> +#define paravirt_clobber(clobber)		\
> +	[paravirt_clobber] "i" (clobber)
> +#ifdef CONFIG_RETPOLINE
> +#define paravirt_type(op)				\
> +	[paravirt_typenum] "i" (PARAVIRT_PATCH(op)),	\
> +	[paravirt_opptr] "r" ((op))
> +#else
>  #define paravirt_type(op)				\
>  	[paravirt_typenum] "i" (PARAVIRT_PATCH(op)),	\
>  	[paravirt_opptr] "i" (&(op))
> -#define paravirt_clobber(clobber)		\
> -	[paravirt_clobber] "i" (clobber)
> +#endif
>  
>  /*
>   * Generate some code, and mark it as patchable by the
> @@ -392,7 +398,11 @@ int paravirt_disable_iospace(void);
>   * offset into the paravirt_patch_template structure, and can therefore be
>   * freely converted back into a structure offset.
>   */
> +#ifdef CONFIG_RETPOLINE
> +#define PARAVIRT_CALL	"call __x86.indirect_thunk.%V[paravirt_opptr];"
> +#else
>  #define PARAVIRT_CALL	"call *%c[paravirt_opptr];"
> +#endif
>  
>  /*
>   * These macros are intended to wrap calls through one of the paravirt
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps
  2018-01-04 14:37   ` [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall " David Woodhouse
@ 2018-01-04 15:10     ` Juergen Gross
  2018-01-04 15:18       ` David Woodhouse
  2018-01-04 15:54     ` Juergen Gross
  1 sibling, 1 reply; 79+ messages in thread
From: Juergen Gross @ 2018-01-04 15:10 UTC (permalink / raw)
  To: David Woodhouse, ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

On 04/01/18 15:37, David Woodhouse wrote:
> Convert indirect call in Xen hypercall to use non-speculative sequence,
> when CONFIG_RETPOLINE is enabled.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/include/asm/xen/hypercall.h | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
> index 7cb282e9e587..393c0048c63e 100644
> --- a/arch/x86/include/asm/xen/hypercall.h
> +++ b/arch/x86/include/asm/xen/hypercall.h
> @@ -44,6 +44,7 @@
>  #include <asm/page.h>
>  #include <asm/pgtable.h>
>  #include <asm/smap.h>
> +#include <asm/nospec-branch.h>

Where does this file come from? It isn't added in any of the patches.


Juergen

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps
  2018-01-04 15:02     ` Juergen Gross
@ 2018-01-04 15:12       ` Woodhouse, David
  2018-01-04 15:18       ` Andrew Cooper
  2018-01-04 16:37       ` Andi Kleen
  2 siblings, 0 replies; 79+ messages in thread
From: Woodhouse, David @ 2018-01-04 15:12 UTC (permalink / raw)
  To: Juergen Gross, ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

[-- Attachment #1: Type: text/plain, Size: 449 bytes --]

On Thu, 2018-01-04 at 16:02 +0100, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
> > Convert pvops invocations to use non-speculative call sequences, when
> > CONFIG_RETPOLINE is enabled.
> > 
> > There is scope for future optimisation here — once the pvops methods are
> > actually set, we could just turn the damn things into *direct* jumps.
> > But this is perfectly sufficient for now, without that added complexity=

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps
  2018-01-04 15:02     ` Juergen Gross
  2018-01-04 15:12       ` Woodhouse, David
@ 2018-01-04 15:18       ` Andrew Cooper
  2018-01-04 16:04         ` Juergen Gross
  2018-01-04 16:37       ` Andi Kleen
  2 siblings, 1 reply; 79+ messages in thread
From: Andrew Cooper @ 2018-01-04 15:18 UTC (permalink / raw)
  To: Juergen Gross, David Woodhouse, ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

On 04/01/18 15:02, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
>> Convert pvops invocations to use non-speculative call sequences, when
>> CONFIG_RETPOLINE is enabled.
>>
>> There is scope for future optimisation here — once the pvops methods are
>> actually set, we could just turn the damn things into *direct* jumps.
>> But this is perfectly sufficient for now, without that added complexity.
> I don't see the need to modify the pvops calls.
>
> All indirect calls are replaced by either direct calls or other code
> long before any user code is active.
>
> For modules the replacements are in place before the module is being
> used.

When booting virtualised, sibling hyperthreads can arrange VM-to-VM SP2
attacks.

One mitigation though is to consider if there is any interesting data to
leak that early during boot.

~Andrew

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps
  2018-01-04 15:10     ` Juergen Gross
@ 2018-01-04 15:18       ` David Woodhouse
  0 siblings, 0 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 15:18 UTC (permalink / raw)
  To: Juergen Gross, ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

[-- Attachment #1: Type: text/plain, Size: 984 bytes --]

On Thu, 2018-01-04 at 16:10 +0100, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
> > 
> > Convert indirect call in Xen hypercall to use non-speculative sequence,
> > when CONFIG_RETPOLINE is enabled.
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >  arch/x86/include/asm/xen/hypercall.h | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/xen/hypercall.h
> > b/arch/x86/include/asm/xen/hypercall.h
> > index 7cb282e9e587..393c0048c63e 100644
> > --- a/arch/x86/include/asm/xen/hypercall.h
> > +++ b/arch/x86/include/asm/xen/hypercall.h
> > @@ -44,6 +44,7 @@
> >  #include <asm/page.h>
> >  #include <asm/pgtable.h>
> >  #include <asm/smap.h>
> > +#include <asm/nospec-branch.h>

> Where does this file come from? It isn't added in any of the patches

Dammit, sorry. Fixed now:

http://git.infradead.org/users/dwmw2/linux-retpoline.git/commitdiff/be7e80781

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall indirect jumps
  2018-01-04 14:37   ` [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall " David Woodhouse
  2018-01-04 15:10     ` Juergen Gross
@ 2018-01-04 15:54     ` Juergen Gross
  1 sibling, 0 replies; 79+ messages in thread
From: Juergen Gross @ 2018-01-04 15:54 UTC (permalink / raw)
  To: David Woodhouse, ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

On 04/01/18 15:37, David Woodhouse wrote:
> Convert indirect call in Xen hypercall to use non-speculative sequence,
> when CONFIG_RETPOLINE is enabled.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps
  2018-01-04 15:18       ` Andrew Cooper
@ 2018-01-04 16:04         ` Juergen Gross
  0 siblings, 0 replies; 79+ messages in thread
From: Juergen Gross @ 2018-01-04 16:04 UTC (permalink / raw)
  To: Andrew Cooper, David Woodhouse, ak
  Cc: Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, tglx, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

On 04/01/18 16:18, Andrew Cooper wrote:
> On 04/01/18 15:02, Juergen Gross wrote:
>> On 04/01/18 15:37, David Woodhouse wrote:
>>> Convert pvops invocations to use non-speculative call sequences, when
>>> CONFIG_RETPOLINE is enabled.
>>>
>>> There is scope for future optimisation here — once the pvops methods are
>>> actually set, we could just turn the damn things into *direct* jumps.
>>> But this is perfectly sufficient for now, without that added complexity.
>> I don't see the need to modify the pvops calls.
>>
>> All indirect calls are replaced by either direct calls or other code
>> long before any user code is active.
>>
>> For modules the replacements are in place before the module is being
>> used.
> 
> When booting virtualised, sibling hyperthreads can arrange VM-to-VM SP2
> attacks.
> 
> One mitigation though is to consider if there is any interesting data to
> leak that early during boot.

Right. And if you are able to detect a booting VM in the other
hyperthread, obtain enough information about its kernel layout and
extract the information via statistical methods in the very short time
frame of the boot before pvops patching takes place. Not to forget the
vast amount of data the booting VM will pull through the caches making
side channel attacks a rather flaky endeavor...

I'd opt for leaving pvops calls untouched. The Reviewed-by: I gave for
the patch was just for its correctness. :-)


Juergen

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")
  2018-01-04  9:30 ` Woodhouse, David
                     ` (12 preceding siblings ...)
  2018-01-04 14:37   ` [PATCH v3 13/13] retpoline: Attempt to quiten objtool warning for unreachable code David Woodhouse
@ 2018-01-04 16:18   ` Andy Lutomirski
  2018-01-04 16:24     ` David Woodhouse
  2018-01-05 10:49     ` Paul Turner
  13 siblings, 2 replies; 79+ messages in thread
From: Andy Lutomirski @ 2018-01-04 16:18 UTC (permalink / raw)
  To: Woodhouse, David, Van De Ven, Arjan
  Cc: linux-kernel, tim.c.chen, peterz, torvalds, tglx, riel, keescook,
	gnomes, pjt, dave.hansen, jikos, gregkh

On Thu, Jan 4, 2018 at 1:30 AM, Woodhouse, David <dwmw@amazon.co.uk> wrote:
> On Thu, 2018-01-04 at 01:10 -0800, Paul Turner wrote:
>> Apologies for the discombobulation around today's disclosure.  Obviously the
>> original goal was to communicate this a little more coherently, but the
>> unscheduled advances in the disclosure disrupted the efforts to pull this
>> together more cleanly.
>>
>> I wanted to open discussion the "retpoline" approach and and define its
>> requirements so that we can separate the core
>> details from questions regarding any particular implementation thereof.
>>
>> As a starting point, a full write-up describing the approach is available at:
>>   https://support.google.com/faqs/answer/7625886
>
> Note that (ab)using 'ret' in this way is incompatible with CET on
> upcoming processors. HJ added a -mno-indirect-branch-register option to
> the latest round of GCC patches, which puts the branch target in a
> register instead of on the stack. My kernel patches (which I'm about to
> reconcile with Andi's tweaks and post) do the same.
>
> That means that in the cases where at runtime we want to ALTERNATIVE
> out the retpoline, it just turns back into a bare 'jmp *\reg'.
>
>

I hate to say this, but I think Intel should postpone CET until the
dust settles.  Intel should also consider a hardware-protected stack
that is only accessible with PUSH, POP, CALL, RET, and a new MOVSTACK
instruction.  That, by itself, would give considerable protection.
But we still need JMP_NO_SPECULATE.  Or, better yet, get the CPU to
stop leaking data during speculative execution.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")
  2018-01-04 16:18   ` [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre") Andy Lutomirski
@ 2018-01-04 16:24     ` David Woodhouse
  2018-01-05 10:49     ` Paul Turner
  1 sibling, 0 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 16:24 UTC (permalink / raw)
  To: Andy Lutomirski, Van De Ven, Arjan
  Cc: linux-kernel, tim.c.chen, peterz, torvalds, tglx, riel, keescook,
	gnomes, pjt, dave.hansen, jikos, gregkh

[-- Attachment #1: Type: text/plain, Size: 514 bytes --]

On Thu, 2018-01-04 at 08:18 -0800, Andy Lutomirski wrote:
> I hate to say this, but I think Intel should postpone CET until the
> dust settles.

CET isn't a *problem* for retpoline. We've had a CET-compatible version
for a while now, and I posted it earlier. It's just that Andi was
working from an older version of my patches.

Of course, there's a school of thought that says that Intel should
postpone *everything* until this is all fixed sanely, but there's
nothing special about CET in that respect.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 10/13] x86/retpoline/pvops: Convert assembler indirect jumps
  2018-01-04 15:02     ` Juergen Gross
  2018-01-04 15:12       ` Woodhouse, David
  2018-01-04 15:18       ` Andrew Cooper
@ 2018-01-04 16:37       ` Andi Kleen
  2 siblings, 0 replies; 79+ messages in thread
From: Andi Kleen @ 2018-01-04 16:37 UTC (permalink / raw)
  To: Juergen Gross
  Cc: David Woodhouse, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

On Thu, Jan 04, 2018 at 04:02:06PM +0100, Juergen Gross wrote:
> On 04/01/18 15:37, David Woodhouse wrote:
> > Convert pvops invocations to use non-speculative call sequences, when
> > CONFIG_RETPOLINE is enabled.
> > 
> > There is scope for future optimisation here — once the pvops methods are
> > actually set, we could just turn the damn things into *direct* jumps.
> > But this is perfectly sufficient for now, without that added complexity.
> 
> I don't see the need to modify the pvops calls.
> 
> All indirect calls are replaced by either direct calls or other code
> long before any user code is active.
> 
> For modules the replacements are in place before the module is being
> used.

Agreed. This shouldn't be needed.

-Andi

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04 14:36   ` [PATCH v3 01/13] x86/retpoline: Add initial retpoline support David Woodhouse
@ 2018-01-04 18:03     ` Linus Torvalds
  2018-01-04 19:32       ` Woodhouse, David
  2018-01-04 18:17     ` Alexei Starovoitov
  2018-01-05 12:54     ` Thomas Gleixner
  2 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2018-01-04 18:03 UTC (permalink / raw)
  To: David Woodhouse
  Cc: ak, Paul Turner, LKML, Greg Kroah-Hartman, Tim Chen, Dave Hansen,
	Thomas Gleixner, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

David,
 these are all marked as spam, because your emails have screwed up
DKIM. You used

    From: David Woodhouse <dwmw@amazon.co.uk>

but then you used infradead as a mailer, so it has the DKIM signature
from infradead, not from Amazon.co.uk.

The DKIM signature does pass for infradead, but amazon dmarc - quite
reasonably - wants the from to match.

End result:

       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=QUARANTINE)
header.from=amazon.co.uk

and everything was in spam.

Please don't do this. There's enough spam in the world that we don't
need people mis-configuring their emails and making real emails look
like spam too.

               Linus

On Thu, Jan 4, 2018 at 6:36 AM, David Woodhouse <dwmw@amazon.co.uk> wrote:
> Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
> the corresponding thunks. Provide assembler macros for invoking the thunks
> in the same way that GCC does, from native and inline assembler.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04 14:36   ` [PATCH v3 01/13] x86/retpoline: Add initial retpoline support David Woodhouse
  2018-01-04 18:03     ` Linus Torvalds
@ 2018-01-04 18:17     ` Alexei Starovoitov
  2018-01-04 18:25       ` Linus Torvalds
  2018-01-04 18:40       ` Andi Kleen
  2018-01-05 12:54     ` Thomas Gleixner
  2 siblings, 2 replies; 79+ messages in thread
From: Alexei Starovoitov @ 2018-01-04 18:17 UTC (permalink / raw)
  To: David Woodhouse
  Cc: ak, Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman,
	Tim Chen, Dave Hansen, tglx, Kees Cook, Rik van Riel,
	Peter Zijlstra, Andy Lutomirski, Jiri Kosina, gnomes

On Thu, Jan 04, 2018 at 02:36:58PM +0000, David Woodhouse wrote:
> Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
> the corresponding thunks. Provide assembler macros for invoking the thunks
> in the same way that GCC does, from native and inline assembler.
> 
> This adds an X86_BUG_NO_RETPOLINE "feature" for runtime patching out
> of the thunks. This is a placeholder for now; the patches which support
> the new Intel/AMD microcode features will flesh out the precise conditions
> under which we disable the retpoline and do other things instead.
> 
> [Andi Kleen: Rename the macros and add CONFIG_RETPOLINE option]
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
...
> +.macro THUNK sp reg
> +	.section .text.__x86.indirect_thunk.\reg
> +
> +ENTRY(__x86.indirect_thunk.\reg)
> +	CFI_STARTPROC
> +	ALTERNATIVE "call 2f", __stringify(jmp *%\reg), X86_BUG_NO_RETPOLINE
> +1:
> +	lfence
> +	jmp	1b
> +2:
> +	mov	%\reg, (%\sp)
> +	ret
> +	CFI_ENDPROC
> +ENDPROC(__x86.indirect_thunk.\reg)

Clearly Paul's approach to retpoline without lfence is faster.
I'm guessing it wasn't shared with amazon/intel until now and
this set of patches going to adopt it, right?

Paul, could you share a link to a set of alternative gcc patches
that do retpoline similar to llvm diff ?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04 18:17     ` Alexei Starovoitov
@ 2018-01-04 18:25       ` Linus Torvalds
  2018-01-04 18:36         ` Alexei Starovoitov
  2018-01-05 10:40         ` Paul Turner
  2018-01-04 18:40       ` Andi Kleen
  1 sibling, 2 replies; 79+ messages in thread
From: Linus Torvalds @ 2018-01-04 18:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Woodhouse, Andi Kleen, Paul Turner, LKML,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
	Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, One Thousand Gnomes

On Thu, Jan 4, 2018 at 10:17 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Clearly Paul's approach to retpoline without lfence is faster.
> I'm guessing it wasn't shared with amazon/intel until now and
> this set of patches going to adopt it, right?
>
> Paul, could you share a link to a set of alternative gcc patches
> that do retpoline similar to llvm diff ?

What is the alternative approach? Is it literally just doing a

      call 1f
1:    mov real_target,(%rsp)
       ret

on the assumption that the "ret" will always just predict to that "1"
due to the call stack?

                Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04 18:25       ` Linus Torvalds
@ 2018-01-04 18:36         ` Alexei Starovoitov
  2018-01-04 19:27           ` David Woodhouse
  2018-01-05 10:40         ` Paul Turner
  1 sibling, 1 reply; 79+ messages in thread
From: Alexei Starovoitov @ 2018-01-04 18:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Andi Kleen, Paul Turner, LKML,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
	Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, One Thousand Gnomes

On Thu, Jan 04, 2018 at 10:25:35AM -0800, Linus Torvalds wrote:
> On Thu, Jan 4, 2018 at 10:17 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Clearly Paul's approach to retpoline without lfence is faster.
> > I'm guessing it wasn't shared with amazon/intel until now and
> > this set of patches going to adopt it, right?
> >
> > Paul, could you share a link to a set of alternative gcc patches
> > that do retpoline similar to llvm diff ?
> 
> What is the alternative approach? Is it literally just doing a
> 
>       call 1f
> 1:    mov real_target,(%rsp)
>        ret
> 
> on the assumption that the "ret" will always just predict to that "1"
> due to the call stack?

Pretty much.
Paul's writeup: https://support.google.com/faqs/answer/7625886
tldr: jmp *%r11 gets converted to:
call set_up_target;
capture_spec:
  pause;
  jmp capture_spec;
set_up_target:
  mov %r11, (%rsp);
  ret;
where capture_spec part will be looping speculatively.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04 18:17     ` Alexei Starovoitov
  2018-01-04 18:25       ` Linus Torvalds
@ 2018-01-04 18:40       ` Andi Kleen
  2018-01-05 10:32         ` Paul Turner
  1 sibling, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2018-01-04 18:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Woodhouse, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

> Clearly Paul's approach to retpoline without lfence is faster.
> I'm guessing it wasn't shared with amazon/intel until now and
> this set of patches going to adopt it, right?
> 
> Paul, could you share a link to a set of alternative gcc patches
> that do retpoline similar to llvm diff ?

I don't think it's a good idea to use any sequence not signed
off by CPU designers and extensively tested. 

While another one may work for most tests, it could always fail in
some corner case.

Right now we have the more heavy weight one and I would
suggest to stay with that one for now. Then worry
about more optimizations later.

Correctness first.

-Andi

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04 18:36         ` Alexei Starovoitov
@ 2018-01-04 19:27           ` David Woodhouse
  2018-01-05 10:28             ` Paul Turner
  0 siblings, 1 reply; 79+ messages in thread
From: David Woodhouse @ 2018-01-04 19:27 UTC (permalink / raw)
  To: Alexei Starovoitov, Linus Torvalds
  Cc: Andi Kleen, Paul Turner, LKML, Greg Kroah-Hartman, Tim Chen,
	Dave Hansen, Thomas Gleixner, Kees Cook, Rik van Riel,
	Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	One Thousand Gnomes

[-- Attachment #1: Type: text/plain, Size: 900 bytes --]

On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> 
> Pretty much.
> Paul's writeup: https://support.google.com/faqs/answer/7625886
> tldr: jmp *%r11 gets converted to:
> call set_up_target;
> capture_spec:
>   pause;
>   jmp capture_spec;
> set_up_target:
>   mov %r11, (%rsp);
>   ret;
> where capture_spec part will be looping speculatively.

That is almost identical to what's in my latest patch set, except that
the capture_spec loop has 'lfence' instead of 'pause'.

As Andi says, I'd want to see explicit approval from the CPU architects
for making that change.

We've already had false starts there — for a long time, Intel thought
that a much simpler option with an lfence after the register load was
sufficient, and then eventually worked out that in some rare cases it
wasn't. While AMD still seem to think it *is* sufficient for them,
apparently.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04 18:03     ` Linus Torvalds
@ 2018-01-04 19:32       ` Woodhouse, David
  0 siblings, 0 replies; 79+ messages in thread
From: Woodhouse, David @ 2018-01-04 19:32 UTC (permalink / raw)
  To: torvalds
  Cc: linux-kernel, tim.c.chen, peterz, tglx, ak, riel, keescook,
	gnomes, pjt, dave.hansen, luto, jikos, gregkh


[-- Attachment #1.1: Type: text/plain, Size: 469 bytes --]

On Thu, 2018-01-04 at 10:03 -0800, Linus Torvalds wrote:
> 
> but then you used infradead as a mailer, so it has the DKIM signature
> from infradead, not from Amazon.co.uk.
> 
> The DKIM signature does pass for infradead, but amazon dmarc - quite
> reasonably - wants the from to match.

Apologies. In the struggle to get it sent without the corporate mail
setup adding an HTML disclaimer to it, I forgot about DKIM. I'll make
sure I get that right next time.

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

[-- Attachment #2.1: Type: text/plain, Size: 197 bytes --]




Amazon Web Services UK Limited. Registered in England and Wales with registration number 08650665 and which has its registered office at 60 Holborn Viaduct, London EC1A 2FD, United Kingdom.

[-- Attachment #2.2: Type: text/html, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 11/13] retpoline/taint: Taint kernel for missing retpoline in compiler
  2018-01-04 14:37   ` [PATCH v3 11/13] retpoline/taint: Taint kernel for missing retpoline in compiler David Woodhouse
@ 2018-01-04 22:06     ` Justin Forbes
  0 siblings, 0 replies; 79+ messages in thread
From: Justin Forbes @ 2018-01-04 22:06 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Andi Kleen, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
	Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, Alan Cox

On Thu, Jan 4, 2018 at 8:37 AM, David Woodhouse <dwmw@amazon.co.uk> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> When the kernel or a module hasn't been compiled with a retpoline
> aware compiler, print a warning and set a taint flag.
>
> For modules it is checked at compile time, however it cannot
> check assembler or other non compiled objects used in the module link.
>
> Due to lack of better letter it uses taint option 'Z'
>

Is taint really the right thing to do here? Why not just do pr_info?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04 19:27           ` David Woodhouse
@ 2018-01-05 10:28             ` Paul Turner
  2018-01-05 10:55               ` David Woodhouse
  2018-01-05 11:26               ` Paolo Bonzini
  0 siblings, 2 replies; 79+ messages in thread
From: Paul Turner @ 2018-01-05 10:28 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Alexei Starovoitov, Linus Torvalds, Andi Kleen, LKML,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
	Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, One Thousand Gnomes

On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > 
> > Pretty much.
> > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > tldr: jmp *%r11 gets converted to:
> > call set_up_target;
> > capture_spec:
> >   pause;
> >   jmp capture_spec;
> > set_up_target:
> >   mov %r11, (%rsp);
> >   ret;
> > where capture_spec part will be looping speculatively.
> 
> That is almost identical to what's in my latest patch set, except that
> the capture_spec loop has 'lfence' instead of 'pause'.

When choosing this sequence I benchmarked several alternatives here, including
(nothing, nops, fences, and other serializing instructions such as cpuid).

The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
it was chosen.

  "pause; jmp" 33.231 cycles/call 9.517 ns/call
  "lfence; jmp" 33.354 cycles/call 9.552 ns/call

(Timings are for a complete retpolined indirect branch.)
> 
> As Andi says, I'd want to see explicit approval from the CPU architects
> for making that change.

Beyond guaranteeing that speculative execution is constrained, the choice of
sequence here is a performance detail and not one of correctness.

> 
> We've already had false starts there — for a long time, Intel thought
> that a much simpler option with an lfence after the register load was
> sufficient, and then eventually worked out that in some rare cases it
> wasn't. While AMD still seem to think it *is* sufficient for them,
> apparently.

As an interesting aside, that speculation proceeds beyond lfence can be
trivially proven using the timings above.  In fact, if we substitute only:
  "lfence" (with no jmp)

We see:
  29.573 cycles/call 8.469 ns/call

Now, the only way for this timing to be different, is if speculation beyond the
lfence was executed differently.

That said, this is a negative result, it does suggest that the jmp is
contributing a larger than realized cost to our speculative loop.  We can likely
shave off some additional time with some unrolling.  I did try this previously,
but did not see results above the noise floor; it seems worth trying this again;
will take a look tomorrow.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04 18:40       ` Andi Kleen
@ 2018-01-05 10:32         ` Paul Turner
  0 siblings, 0 replies; 79+ messages in thread
From: Paul Turner @ 2018-01-05 10:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexei Starovoitov, David Woodhouse, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, tglx, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

On Thu, Jan 04, 2018 at 10:40:23AM -0800, Andi Kleen wrote:
> > Clearly Paul's approach to retpoline without lfence is faster.
> > I'm guessing it wasn't shared with amazon/intel until now and
> > this set of patches going to adopt it, right?
> > 
> > Paul, could you share a link to a set of alternative gcc patches
> > that do retpoline similar to llvm diff ?
> 
> I don't think it's a good idea to use any sequence not signed
> off by CPU designers and extensively tested. 

I can confirm that "pause; jmp" has been previously reviewed by your side.

That said, again, per the other email, once we have guaranteed that speculative
execution will reach this point, beyond the guarantee that it does something
"safe" the choice of sequence here is a performance detail rather than
correctness.

> 
> While another one may work for most tests, it could always fail in
> some corner case.
> 
> Right now we have the more heavy weight one and I would
> suggest to stay with that one for now. Then worry
> about more optimizations later.
> 
> Correctness first.
> 
> -Andi

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04 18:25       ` Linus Torvalds
  2018-01-04 18:36         ` Alexei Starovoitov
@ 2018-01-05 10:40         ` Paul Turner
  1 sibling, 0 replies; 79+ messages in thread
From: Paul Turner @ 2018-01-05 10:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexei Starovoitov, David Woodhouse, Andi Kleen, LKML,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
	Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, One Thousand Gnomes

On Thu, Jan 04, 2018 at 10:25:35AM -0800, Linus Torvalds wrote:
> On Thu, Jan 4, 2018 at 10:17 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Clearly Paul's approach to retpoline without lfence is faster.

Using pause rather than lfence does not represent a fundamental difference here.

A protected indirect branch is always adding ~25-30 cycles of overhead.

That this can be avoided in practice is a function of two key factors:
(1) Kernel code uses fewer indirect branches.
(2) The overhead can be avoided for hot indirect branches via devirtualization.
  e.g. the semantic equivalent of,
    if (ptr == foo)
      foo();
    else
      (*ptr)();
  Allowing foo() to be called directly, even though it was provided as an
  indirect.

> > I'm guessing it wasn't shared with amazon/intel until now and
> > this set of patches going to adopt it, right?
> >
> > Paul, could you share a link to a set of alternative gcc patches
> > that do retpoline similar to llvm diff ?
> 
> What is the alternative approach? Is it literally just doing a
> 
>       call 1f
> 1:    mov real_target,(%rsp)
>        ret
> 
> on the assumption that the "ret" will always just predict to that "1"
> due to the call stack?
> 
>                 Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")
  2018-01-04 16:18   ` [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre") Andy Lutomirski
  2018-01-04 16:24     ` David Woodhouse
@ 2018-01-05 10:49     ` Paul Turner
  2018-01-05 11:43       ` Woodhouse, David
  1 sibling, 1 reply; 79+ messages in thread
From: Paul Turner @ 2018-01-05 10:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Woodhouse, David, Van De Ven, Arjan, linux-kernel, tim.c.chen,
	peterz, torvalds, tglx, riel, keescook, gnomes, dave.hansen,
	jikos, gregkh

On Thu, Jan 04, 2018 at 08:18:57AM -0800, Andy Lutomirski wrote:
> On Thu, Jan 4, 2018 at 1:30 AM, Woodhouse, David <dwmw@amazon.co.uk> wrote:
> > On Thu, 2018-01-04 at 01:10 -0800, Paul Turner wrote:
> >> Apologies for the discombobulation around today's disclosure.  Obviously the
> >> original goal was to communicate this a little more coherently, but the
> >> unscheduled advances in the disclosure disrupted the efforts to pull this
> >> together more cleanly.
> >>
> >> I wanted to open discussion the "retpoline" approach and and define its
> >> requirements so that we can separate the core
> >> details from questions regarding any particular implementation thereof.
> >>
> >> As a starting point, a full write-up describing the approach is available at:
> >>   https://support.google.com/faqs/answer/7625886
> >
> > Note that (ab)using 'ret' in this way is incompatible with CET on
> > upcoming processors. HJ added a -mno-indirect-branch-register option to
> > the latest round of GCC patches, which puts the branch target in a
> > register instead of on the stack. My kernel patches (which I'm about to
> > reconcile with Andi's tweaks and post) do the same.
> >
> > That means that in the cases where at runtime we want to ALTERNATIVE
> > out the retpoline, it just turns back into a bare 'jmp *\reg'.
> >
> >
> 
> I hate to say this, but I think Intel should postpone CET until the
> dust settles.  Intel should also consider a hardware-protected stack
> that is only accessible with PUSH, POP, CALL, RET, and a new MOVSTACK
> instruction.  That, by itself, would give considerable protection.
> But we still need JMP_NO_SPECULATE.  Or, better yet, get the CPU to
> stop leaking data during speculative execution.

Echoing Andy's thoughts, but from a slightly different angle:

1) BTI is worse than the current classes of return attack.  Given this,
   considered as a binary choice, it's equivalent to the current state of the
   world (e.g. no CET).
2) CET will not be "free".  I suspect in its initial revisions it will be more
   valuable for protecting end-users then enterprise workloads (cost is not
   observable for interactive workloads because there's tons of headroom in the
   first place).

While the potential incompatibility is unfortunate; I'm not sure it makes a
significant adoption to the adoption rate of CET.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 10:28             ` Paul Turner
@ 2018-01-05 10:55               ` David Woodhouse
  2018-01-05 11:19                 ` Paul Turner
  2018-01-05 11:25                 ` Paul Turner
  2018-01-05 11:26               ` Paolo Bonzini
  1 sibling, 2 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-05 10:55 UTC (permalink / raw)
  To: Paul Turner
  Cc: Alexei Starovoitov, Linus Torvalds, Andi Kleen, LKML,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
	Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, One Thousand Gnomes

[-- Attachment #1: Type: text/plain, Size: 1764 bytes --]

On Fri, 2018-01-05 at 02:28 -0800, Paul Turner wrote:
> On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> > On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > > 
> > > Pretty much.
> > > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > > tldr: jmp *%r11 gets converted to:
> > > call set_up_target;
> > > capture_spec:
> > >   pause;
> > >   jmp capture_spec;
> > > set_up_target:
> > >   mov %r11, (%rsp);
> > >   ret;
> > > where capture_spec part will be looping speculatively.
> > 
> > That is almost identical to what's in my latest patch set, except that
> > the capture_spec loop has 'lfence' instead of 'pause'.
> 
> When choosing this sequence I benchmarked several alternatives here, including
> (nothing, nops, fences, and other serializing instructions such as cpuid).
> 
> The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> it was chosen.
> 
>   "pause; jmp" 33.231 cycles/call 9.517 ns/call
>   "lfence; jmp" 33.354 cycles/call 9.552 ns/call
> 
> (Timings are for a complete retpolined indirect branch.)

Yeah, I studiously ignored you here and went with only what Intel had
*assured* me was correct and put into the GCC patches, rather than
chasing those 35 picoseconds ;)

The GCC patch set already had about four different variants over time,
with associated "oh shit, that one doesn't actually work; try this".
What we have in my patch set is precisely what GCC emits at the moment.

I'm all for optimising it further, but maybe not this week.

Other than that, is there any other development from your side that I
haven't captured in the latest (v4) series?
http://git.infradead.org/users/dwmw2/linux-retpoline.git/

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 10:55               ` David Woodhouse
@ 2018-01-05 11:19                 ` Paul Turner
  2018-01-05 11:25                 ` Paul Turner
  1 sibling, 0 replies; 79+ messages in thread
From: Paul Turner @ 2018-01-05 11:19 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Alexei Starovoitov, Linus Torvalds, Andi Kleen, LKML,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
	Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, One Thousand Gnomes

On Fri, Jan 05, 2018 at 10:55:38AM +0000, David Woodhouse wrote:
> On Fri, 2018-01-05 at 02:28 -0800, Paul Turner wrote:
> > On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> > > On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > > > 
> > > > Pretty much.
> > > > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > > > tldr: jmp *%r11 gets converted to:
> > > > call set_up_target;
> > > > capture_spec:
> > > >   pause;
> > > >   jmp capture_spec;
> > > > set_up_target:
> > > >   mov %r11, (%rsp);
> > > >   ret;
> > > > where capture_spec part will be looping speculatively.
> > > 
> > > That is almost identical to what's in my latest patch set, except that
> > > the capture_spec loop has 'lfence' instead of 'pause'.
> > 
> > When choosing this sequence I benchmarked several alternatives here, including
> > (nothing, nops, fences, and other serializing instructions such as cpuid).
> > 
> > The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> > it was chosen.
> > 
> >   "pause; jmp" 33.231 cycles/call 9.517 ns/call
> >   "lfence; jmp" 33.354 cycles/call 9.552 ns/call
> > 
> > (Timings are for a complete retpolined indirect branch.)
> 
> Yeah, I studiously ignored you here and went with only what Intel had
> *assured* me was correct and put into the GCC patches, rather than
> chasing those 35 picoseconds ;)
> 
> The GCC patch set already had about four different variants over time,
> with associated "oh shit, that one doesn't actually work; try this".
> What we have in my patch set is precisely what GCC emits at the moment.

While I can't speak to the various iterations of the gcc patches, I can confirm
that the details originally provided were reviewed by Intel.

> 
> I'm all for optimising it further, but maybe not this week.
> 
> Other than that, is there any other development from your side that I
> haven't captured in the latest (v4) series?
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 10:55               ` David Woodhouse
  2018-01-05 11:19                 ` Paul Turner
@ 2018-01-05 11:25                 ` Paul Turner
  1 sibling, 0 replies; 79+ messages in thread
From: Paul Turner @ 2018-01-05 11:25 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Alexei Starovoitov, Linus Torvalds, Andi Kleen, LKML,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
	Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, One Thousand Gnomes

On Fri, Jan 05, 2018 at 10:55:38AM +0000, David Woodhouse wrote:
> On Fri, 2018-01-05 at 02:28 -0800, Paul Turner wrote:
> > On Thu, Jan 04, 2018 at 07:27:58PM +0000, David Woodhouse wrote:
> > > On Thu, 2018-01-04 at 10:36 -0800, Alexei Starovoitov wrote:
> > > > 
> > > > Pretty much.
> > > > Paul's writeup: https://support.google.com/faqs/answer/7625886
> > > > tldr: jmp *%r11 gets converted to:
> > > > call set_up_target;
> > > > capture_spec:
> > > >   pause;
> > > >   jmp capture_spec;
> > > > set_up_target:
> > > >   mov %r11, (%rsp);
> > > >   ret;
> > > > where capture_spec part will be looping speculatively.
> > > 
> > > That is almost identical to what's in my latest patch set, except that
> > > the capture_spec loop has 'lfence' instead of 'pause'.
> > 
> > When choosing this sequence I benchmarked several alternatives here, including
> > (nothing, nops, fences, and other serializing instructions such as cpuid).
> > 
> > The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> > it was chosen.
> > 
> >   "pause; jmp" 33.231 cycles/call 9.517 ns/call
> >   "lfence; jmp" 33.354 cycles/call 9.552 ns/call
> > 
> > (Timings are for a complete retpolined indirect branch.)
> 
> Yeah, I studiously ignored you here and went with only what Intel had
> *assured* me was correct and put into the GCC patches, rather than
> chasing those 35 picoseconds ;)

It's also notable here that while the difference is small in terms of absolute
values, it's likely due to reduced variation:

I would expect:
- pause to be extremely consistent in its timings
- pause and lfence to be close on their average timings, particularly in a
  micro-benchmark.

Which suggests that the difference may be larger in the occasional cases that
you are getting "unlucky" and seeing some other uarch interaction in the lfence
path.
> 
> The GCC patch set already had about four different variants over time,
> with associated "oh shit, that one doesn't actually work; try this".
> What we have in my patch set is precisely what GCC emits at the moment.
> 
> I'm all for optimising it further, but maybe not this week.
> 
> Other than that, is there any other development from your side that I
> haven't captured in the latest (v4) series?
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 10:28             ` Paul Turner
  2018-01-05 10:55               ` David Woodhouse
@ 2018-01-05 11:26               ` Paolo Bonzini
  2018-01-05 12:20                 ` Paul Turner
  1 sibling, 1 reply; 79+ messages in thread
From: Paolo Bonzini @ 2018-01-05 11:26 UTC (permalink / raw)
  To: Paul Turner, David Woodhouse
  Cc: Alexei Starovoitov, Linus Torvalds, Andi Kleen, LKML,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
	Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, One Thousand Gnomes

On 05/01/2018 11:28, Paul Turner wrote:
> 
> The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
> it was chosen.
> 
>   "pause; jmp" 33.231 cycles/call 9.517 ns/call
>   "lfence; jmp" 33.354 cycles/call 9.552 ns/call

Do you have timings for a non-retpolined indirect branch with the
predictor suppressed via IBRS=1?  So at least we can compute the break
even point.

Paolo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")
  2018-01-05 10:49     ` Paul Turner
@ 2018-01-05 11:43       ` Woodhouse, David
  0 siblings, 0 replies; 79+ messages in thread
From: Woodhouse, David @ 2018-01-05 11:43 UTC (permalink / raw)
  To: pjt, luto
  Cc: linux-kernel, arjan.van.de.ven, peterz, tim.c.chen, torvalds,
	tglx, riel, keescook, gnomes, dave.hansen, jikos, gregkh


[-- Attachment #1.1: Type: text/plain, Size: 343 bytes --]

On Fri, 2018-01-05 at 02:49 -0800, Paul Turner wrote:
> 
> 
> While the potential incompatibility is unfortunate; I'm not sure it makes a
> significant adoption to the adoption rate of CET.

There is no incompatibility. Nothing will have CET that doesn't have
IBRS_ATT (and hence the retpoline alternative'd out into a bare jmp
again).

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

[-- Attachment #2.1: Type: text/plain, Size: 197 bytes --]




Amazon Web Services UK Limited. Registered in England and Wales with registration number 08650665 and which has its registered office at 60 Holborn Viaduct, London EC1A 2FD, United Kingdom.

[-- Attachment #2.2: Type: text/html, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 11:26               ` Paolo Bonzini
@ 2018-01-05 12:20                 ` Paul Turner
  0 siblings, 0 replies; 79+ messages in thread
From: Paul Turner @ 2018-01-05 12:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Woodhouse, Alexei Starovoitov, Linus Torvalds, Andi Kleen,
	LKML, Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner,
	Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski,
	Jiri Kosina, One Thousand Gnomes

On Fri, Jan 5, 2018 at 3:26 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 05/01/2018 11:28, Paul Turner wrote:
>>
>> The "pause; jmp" sequence proved minutely faster than "lfence;jmp" which is why
>> it was chosen.
>>
>>   "pause; jmp" 33.231 cycles/call 9.517 ns/call
>>   "lfence; jmp" 33.354 cycles/call 9.552 ns/call
>
> Do you have timings for a non-retpolined indirect branch with the
> predictor suppressed via IBRS=1?  So at least we can compute the break
> even point.

The data I collected here previously had the run-time cost as a wash.
On Skylake, an IBRS=1 and a retpolined indirect branch had cost within
a few cycles.

The costs to consider when making a choice here are:

- The transition overheads.  This is how frequently will you be
switching in and out of protected code (as IBRS needs to be enabled
and disabled at these boundaries).
- The frequency at which you will be executing protected code on one
sibling, and unprotected code on another (enabling IBRS may affect
sibling execution, depending on SKU)
- The implementation cost (retpoline requires auditing/rebuilding your
target, while IBRS can be used out of the box).


>
> Paolo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-04 14:36   ` [PATCH v3 01/13] x86/retpoline: Add initial retpoline support David Woodhouse
  2018-01-04 18:03     ` Linus Torvalds
  2018-01-04 18:17     ` Alexei Starovoitov
@ 2018-01-05 12:54     ` Thomas Gleixner
  2018-01-05 13:01       ` Juergen Gross
  2018-01-05 13:56       ` Woodhouse, David
  2 siblings, 2 replies; 79+ messages in thread
From: Thomas Gleixner @ 2018-01-05 12:54 UTC (permalink / raw)
  To: David Woodhouse
  Cc: ak, Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman,
	Tim Chen, Dave Hansen, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

On Thu, 4 Jan 2018, David Woodhouse wrote:
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 07cdd1715705..900fa7016d3f 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -342,5 +342,6 @@
>  #define X86_BUG_MONITOR			X86_BUG(12) /* IPI required to wake up remote CPU */
>  #define X86_BUG_AMD_E400		X86_BUG(13) /* CPU is among the affected by Erratum 400 */
>  #define X86_BUG_CPU_INSECURE		X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
> +#define X86_BUG_NO_RETPOLINE		X86_BUG(15) /* Placeholder: disable retpoline branch thunks */

I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
tells the kernel, that the CPU is affected by variant 3.

If the kernel detects that and has PTI support then it sets the 'pti'
feature bit which tells that the mitigation is in place.

So what we really want is

   X86_BUG_MELTDOWN_V1/2/3

which get set when the CPU is affected by a particular variant and then
have feature flags

   X86_FEATURE_RETPOLINE
   X86_FEATURE_IBRS
   X86_FEATURE_NOSPEC

or whatever it takes to signal that a mitigation is in place. Then we
depend all actions on those feature flags very much in the way we do for
FEATURE_PTI.

If CPUs come along which are not affected by a particular variant the BUG
flag does not get set.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 12:54     ` Thomas Gleixner
@ 2018-01-05 13:01       ` Juergen Gross
  2018-01-05 13:03         ` Thomas Gleixner
  2018-01-05 13:56       ` Woodhouse, David
  1 sibling, 1 reply; 79+ messages in thread
From: Juergen Gross @ 2018-01-05 13:01 UTC (permalink / raw)
  To: Thomas Gleixner, David Woodhouse
  Cc: ak, Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman,
	Tim Chen, Dave Hansen, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

On 05/01/18 13:54, Thomas Gleixner wrote:
> On Thu, 4 Jan 2018, David Woodhouse wrote:
>> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>> index 07cdd1715705..900fa7016d3f 100644
>> --- a/arch/x86/include/asm/cpufeatures.h
>> +++ b/arch/x86/include/asm/cpufeatures.h
>> @@ -342,5 +342,6 @@
>>  #define X86_BUG_MONITOR			X86_BUG(12) /* IPI required to wake up remote CPU */
>>  #define X86_BUG_AMD_E400		X86_BUG(13) /* CPU is among the affected by Erratum 400 */
>>  #define X86_BUG_CPU_INSECURE		X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
>> +#define X86_BUG_NO_RETPOLINE		X86_BUG(15) /* Placeholder: disable retpoline branch thunks */
> 
> I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
> should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
> tells the kernel, that the CPU is affected by variant 3.

MELTDOWN is variant 3.

> 
> If the kernel detects that and has PTI support then it sets the 'pti'
> feature bit which tells that the mitigation is in place.
> 
> So what we really want is
> 
>    X86_BUG_MELTDOWN_V1/2/3

X86_BUG_MELTDOWN, X86_BUG_SPECTRE_V1, X86_BUG_SPECTRE_V2


Juergen

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 13:01       ` Juergen Gross
@ 2018-01-05 13:03         ` Thomas Gleixner
  0 siblings, 0 replies; 79+ messages in thread
From: Thomas Gleixner @ 2018-01-05 13:03 UTC (permalink / raw)
  To: Juergen Gross
  Cc: David Woodhouse, ak, Paul Turner, LKML, Linus Torvalds,
	Greg Kroah-Hartman, Tim Chen, Dave Hansen, Kees Cook,
	Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina,
	gnomes

On Fri, 5 Jan 2018, Juergen Gross wrote:
> On 05/01/18 13:54, Thomas Gleixner wrote:
> > On Thu, 4 Jan 2018, David Woodhouse wrote:
> >> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> >> index 07cdd1715705..900fa7016d3f 100644
> >> --- a/arch/x86/include/asm/cpufeatures.h
> >> +++ b/arch/x86/include/asm/cpufeatures.h
> >> @@ -342,5 +342,6 @@
> >>  #define X86_BUG_MONITOR			X86_BUG(12) /* IPI required to wake up remote CPU */
> >>  #define X86_BUG_AMD_E400		X86_BUG(13) /* CPU is among the affected by Erratum 400 */
> >>  #define X86_BUG_CPU_INSECURE		X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
> >> +#define X86_BUG_NO_RETPOLINE		X86_BUG(15) /* Placeholder: disable retpoline branch thunks */
> > 
> > I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
> > should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
> > tells the kernel, that the CPU is affected by variant 3.
> 
> MELTDOWN is variant 3.
> 
> > 
> > If the kernel detects that and has PTI support then it sets the 'pti'
> > feature bit which tells that the mitigation is in place.
> > 
> > So what we really want is
> > 
> >    X86_BUG_MELTDOWN_V1/2/3
> 
> X86_BUG_MELTDOWN, X86_BUG_SPECTRE_V1, X86_BUG_SPECTRE_V2

Right. I'm confused as always :)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [tip:x86/pti] x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm
  2018-01-04 14:37   ` [PATCH v3 08/13] x86/alternatives: Add missing \n at end of ALTERNATIVE inline asm David Woodhouse
@ 2018-01-05 13:04     ` tip-bot for David Woodhouse
  0 siblings, 0 replies; 79+ messages in thread
From: tip-bot for David Woodhouse @ 2018-01-05 13:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dave.hansen, mingo, hpa, tim.c.chen, linux-kernel, torvalds,
	luto, gregkh, pjt, dwmw, jikos, peterz, tglx, keescook, riel

Commit-ID:  b9e705ef7cfaf22db0daab91ad3cd33b0fa32eb9
Gitweb:     https://git.kernel.org/tip/b9e705ef7cfaf22db0daab91ad3cd33b0fa32eb9
Author:     David Woodhouse <dwmw@amazon.co.uk>
AuthorDate: Thu, 4 Jan 2018 14:37:05 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 5 Jan 2018 14:01:15 +0100

x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm

Where an ALTERNATIVE is used in the middle of an inline asm block, this
would otherwise lead to the following instruction being appended directly
to the trailing ".popsection", and a failed compile.

Fixes: 9cebed423c84 ("x86, alternative: Use .pushsection/.popsection")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: ak@linux.intel.com
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180104143710.8961-8-dwmw@amazon.co.uk
---
 arch/x86/include/asm/alternative.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index dbfd085..cf5961c 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -140,7 +140,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
 	".popsection\n"							\
 	".pushsection .altinstr_replacement, \"ax\"\n"			\
 	ALTINSTR_REPLACEMENT(newinstr, feature, 1)			\
-	".popsection"
+	".popsection\n"
 
 #define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)\
 	OLDINSTR_2(oldinstr, 1, 2)					\
@@ -151,7 +151,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
 	".pushsection .altinstr_replacement, \"ax\"\n"			\
 	ALTINSTR_REPLACEMENT(newinstr1, feature1, 1)			\
 	ALTINSTR_REPLACEMENT(newinstr2, feature2, 2)			\
-	".popsection"
+	".popsection\n"
 
 /*
  * Alternative instructions for different CPU types or capabilities.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 12:54     ` Thomas Gleixner
  2018-01-05 13:01       ` Juergen Gross
@ 2018-01-05 13:56       ` Woodhouse, David
  2018-01-05 16:41         ` Woodhouse, David
  1 sibling, 1 reply; 79+ messages in thread
From: Woodhouse, David @ 2018-01-05 13:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: ak, Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman,
	Tim Chen, Dave Hansen, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

[-- Attachment #1: Type: text/plain, Size: 2014 bytes --]

On Fri, 2018-01-05 at 13:54 +0100, Thomas Gleixner wrote:
> On Thu, 4 Jan 2018, David Woodhouse wrote:
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index 07cdd1715705..900fa7016d3f 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -342,5 +342,6 @@
> >  #define X86_BUG_MONITOR                      X86_BUG(12) /* IPI required to wake up remote CPU */
> >  #define X86_BUG_AMD_E400             X86_BUG(13) /* CPU is among the affected by Erratum 400 */
> >  #define X86_BUG_CPU_INSECURE         X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
> > +#define X86_BUG_NO_RETPOLINE         X86_BUG(15) /* Placeholder: disable retpoline branch thunks */
> 
> I think this is the wrong approach. We have X86_BUG_CPU_INSECURE, which now
> should be renamed to X86_BUG_CPU_MELTDOWN_V3 or something like that. It
> tells the kernel, that the CPU is affected by variant 3.

As it says, it's a placeholder. The actual conditions depend on whether
we decide to use IBRS or not, which also depends on whether we find
IBRS_ATT or whether we're on Skylake+.

The IBRS patch series should be updating it.

> So what we really want is
> 
>    X86_BUG_MELTDOWN_V1/2/3
> 
> which get set when the CPU is affected by a particular variant and then
> have feature flags
> 
>    X86_FEATURE_RETPOLINE
>    X86_FEATURE_IBRS
>    X86_FEATURE_NOSPEC

At some point during this whole painful mess, I had come to the
conclusion that having relocations in altinstr didn't work, and that's
why I had X86_xx_NO_RETPOLINE instead of X86_xx_RETPOLINE. I now think
that something else was wrong when I was testing that, and relocs in
altinstr do work. So sure, X86_FEATURE_RETPOLINE ought to work. I can
change that round, and it's simpler for the IBRS patch set to take it
into account and set it when appropriate.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 13:56       ` Woodhouse, David
@ 2018-01-05 16:41         ` Woodhouse, David
  2018-01-05 16:45           ` Borislav Petkov
  0 siblings, 1 reply; 79+ messages in thread
From: Woodhouse, David @ 2018-01-05 16:41 UTC (permalink / raw)
  To: Thomas Gleixner, Borislav Petkov
  Cc: ak, Paul Turner, LKML, Linus Torvalds, Greg Kroah-Hartman,
	Tim Chen, Dave Hansen, Kees Cook, Rik van Riel, Peter Zijlstra,
	Andy Lutomirski, Jiri Kosina, gnomes

[-- Attachment #1: Type: text/plain, Size: 967 bytes --]

On Fri, 2018-01-05 at 13:56 +0000, Woodhouse, David wrote:
> 
> At some point during this whole painful mess, I had come to the
> conclusion that having relocations in altinstr didn't work, and that's
> why I had X86_xx_NO_RETPOLINE instead of X86_xx_RETPOLINE. I now think
> that something else was wrong when I was testing that, and relocs in
> altinstr do work. So sure, X86_FEATURE_RETPOLINE ought to work. I can
> change that round, and it's simpler for the IBRS patch set to take it
> into account and set it when appropriate.

+bpetkov

Nope, alternatives are broken. Only a jmp as the *first* opcode of
altinstr gets handled by recompute_jump(), while any subsequent insn is
just copied untouched.

To fix that and handle every instruction, the alternative code would
need to know about instruction lengths. I think we need to stick with
the inverted X86_FEATURE_NO_RETPOLINE flag for the moment, and not tie
it to a complex bugfix there.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 16:41         ` Woodhouse, David
@ 2018-01-05 16:45           ` Borislav Petkov
  2018-01-05 17:08             ` Josh Poimboeuf
  2018-01-05 17:12             ` Woodhouse, David
  0 siblings, 2 replies; 79+ messages in thread
From: Borislav Petkov @ 2018-01-05 16:45 UTC (permalink / raw)
  To: Woodhouse, David
  Cc: tglx, linux-kernel, tim.c.chen, peterz, torvalds, ak, riel,
	keescook, gnomes, pjt, dave.hansen, luto, jikos, gregkh

On Fri, Jan 05, 2018 at 04:41:46PM +0000, Woodhouse, David wrote:
> Nope, alternatives are broken. Only a jmp as the *first* opcode of
> altinstr gets handled by recompute_jump(), while any subsequent insn is
> just copied untouched.

Not broken - simply no one needed it until now. I'm looking into it.
Looks like the insn decoder might come in handy finally.

:-)

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 16:45           ` Borislav Petkov
@ 2018-01-05 17:08             ` Josh Poimboeuf
  2018-01-06  0:30               ` Borislav Petkov
  2018-01-05 17:12             ` Woodhouse, David
  1 sibling, 1 reply; 79+ messages in thread
From: Josh Poimboeuf @ 2018-01-05 17:08 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Woodhouse, David, tglx, linux-kernel, tim.c.chen, peterz,
	torvalds, ak, riel, keescook, gnomes, pjt, dave.hansen, luto,
	jikos, gregkh

On Fri, Jan 05, 2018 at 05:45:06PM +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 04:41:46PM +0000, Woodhouse, David wrote:
> > Nope, alternatives are broken. Only a jmp as the *first* opcode of
> > altinstr gets handled by recompute_jump(), while any subsequent insn is
> > just copied untouched.
> 
> Not broken - simply no one needed it until now. I'm looking into it.
> Looks like the insn decoder might come in handy finally.
> 
> :-)

I seem to recall that we also discussed the need for this for converting
pvops to use alternatives, though the "why" is eluding me at the moment.

-- 
Josh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 16:45           ` Borislav Petkov
  2018-01-05 17:08             ` Josh Poimboeuf
@ 2018-01-05 17:12             ` Woodhouse, David
  2018-01-05 17:28               ` Linus Torvalds
  1 sibling, 1 reply; 79+ messages in thread
From: Woodhouse, David @ 2018-01-05 17:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tglx, linux-kernel, tim.c.chen, peterz, torvalds, ak, riel,
	keescook, gnomes, pjt, dave.hansen, luto, jikos, gregkh

[-- Attachment #1: Type: text/plain, Size: 793 bytes --]

On Fri, 2018-01-05 at 17:45 +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 04:41:46PM +0000, Woodhouse, David wrote:
> > Nope, alternatives are broken. Only a jmp as the *first* opcode of
> > altinstr gets handled by recompute_jump(), while any subsequent insn is
> > just copied untouched.
> 
> Not broken - simply no one needed it until now. I'm looking into it.
> Looks like the insn decoder might come in handy finally.#

I typed 'jmp __x86.indirect_thunk' and it actually jumped to an address
which I believe is (__x86.indirect_thunk + &altinstr - &oldinstr).
Which made me sad, and took a while to debug.

Let's call it "not working for this use case", and I'll stick with
X86_FEATURE_NO_RETPOLINE for now, so the jump can go in the oldinstr
case not in altinstr.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 17:12             ` Woodhouse, David
@ 2018-01-05 17:28               ` Linus Torvalds
  2018-01-05 17:48                 ` David Woodhouse
                                   ` (3 more replies)
  0 siblings, 4 replies; 79+ messages in thread
From: Linus Torvalds @ 2018-01-05 17:28 UTC (permalink / raw)
  To: Woodhouse, David
  Cc: bp, linux-kernel, peterz, tim.c.chen, tglx, ak, riel, keescook,
	gnomes, pjt, dave.hansen, luto, jikos, gregkh

On Fri, Jan 5, 2018 at 9:12 AM, Woodhouse, David <dwmw@amazon.co.uk> wrote:
>
> I typed 'jmp __x86.indirect_thunk' and it actually jumped to an address
> which I believe is (__x86.indirect_thunk + &altinstr - &oldinstr).
> Which made me sad, and took a while to debug.

Yes, I would suggest against expecting altinstructions to have
relocation information. They are generated in a different place, so..

That said, I honestly like the inline version (the one that is in the
google paper first) of the retpoline more than the out-of-line one.
And that one shouldn't have any relocagtion issues, because all the
offsets are relative.

We want to use that one for the entry stub anyway, can't we just
standardize on that one for all our assembly?

If the *compiler* uses the out-of-line version, that's a separate
thing. But for our asm cases, let's just make it all be the inline
case, ok?

It also should simplify the whole target generation. None of this
silly "__x86.indirect_thunk.\reg" crap with different targets for
different register choices.

Hmm?

                     Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 17:28               ` Linus Torvalds
@ 2018-01-05 17:48                 ` David Woodhouse
  2018-01-05 18:05                 ` Andi Kleen
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-05 17:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: bp, linux-kernel, peterz, tim.c.chen, tglx, ak, riel, keescook,
	gnomes, pjt, dave.hansen, luto, jikos, gregkh

[-- Attachment #1: Type: text/plain, Size: 1709 bytes --]

On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
> 
> Yes, I would suggest against expecting altinstructions to have
> relocation information. They are generated in a different place, so..
> 
> That said, I honestly like the inline version (the one that is in the
> google paper first) of the retpoline more than the out-of-line one.
> And that one shouldn't have any relocagtion issues, because all the
> offsets are relative.
> 
> We want to use that one for the entry stub anyway, can't we just
> standardize on that one for all our assembly?

Sure, that's just a case of tweaking the macros a little to do it
inline instead of jumping to the thunk. I thunk judicious use of
__stringify() has resolved any issues I might have had with that
approach in the first place. However...

> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?
> 
> It also should simplify the whole target generation. None of this
> silly "__x86.indirect_thunk.\reg" crap with different targets for
> different register choices.

If we're going to let the compiler use the out-of-line version (which
we *want* to because otherwise we don't get to ALTERNATIVE it away as
appropriate), then we still need to emit the various
__x86.indirect_thunk.\reg thunks for the compiler to use anyway.

At that point, there isn't a *huge* benefit to doing the retpoline
inline in our own asm, when it might as well just be a jump to the
thunks which exist anyway. But neither do I have an argument *against*
doing so, so I'll happily tweak the NOSPEC_CALL/NOSPEC_JMP macros
accordingly if you really want...

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 17:28               ` Linus Torvalds
  2018-01-05 17:48                 ` David Woodhouse
@ 2018-01-05 18:05                 ` Andi Kleen
  2018-01-05 20:32                 ` Woodhouse, David
  2018-01-05 22:00                 ` Woodhouse, David
  3 siblings, 0 replies; 79+ messages in thread
From: Andi Kleen @ 2018-01-05 18:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Woodhouse, David, bp, linux-kernel, peterz, tim.c.chen, tglx,
	riel, keescook, gnomes, pjt, dave.hansen, luto, jikos, gregkh

> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?

Should be a simple change.

> 
> It also should simplify the whole target generation. None of this
> silly "__x86.indirect_thunk.\reg" crap with different targets for
> different register choices.
> 
> Hmm?

We need the different thunks anyways for the compiler generated code.
So it won't go away.

-Andi

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 17:28               ` Linus Torvalds
  2018-01-05 17:48                 ` David Woodhouse
  2018-01-05 18:05                 ` Andi Kleen
@ 2018-01-05 20:32                 ` Woodhouse, David
  2018-01-05 21:11                   ` Brian Gerst
  2018-01-05 22:00                 ` Woodhouse, David
  3 siblings, 1 reply; 79+ messages in thread
From: Woodhouse, David @ 2018-01-05 20:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: bp, linux-kernel, peterz, tim.c.chen, tglx, ak, riel, keescook,
	gnomes, pjt, dave.hansen, luto, jikos, gregkh

[-- Attachment #1: Type: text/plain, Size: 5203 bytes --]

On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
> 
> Yes, I would suggest against expecting altinstructions to have
> relocation information. They are generated in a different place, so..
> 
> That said, I honestly like the inline version (the one that is in the
> google paper first) of the retpoline more than the out-of-line one.
> And that one shouldn't have any relocation issues, because all the
> offsets are relative.

Note that the *only* issue with the relocation is that it pushes me to
use X86_FEATURE_NO_RETPOLINE for my feature instead of
X86_FEATURE_RETPOLINE as might be natural. And actually there's a
motivation to do that anyway, because of the way three-way alternatives
interact.

With the existing negative flag I can do 

 ALTERNATIVE_2(retpoline, K8: lfence+jmp; NO_RETPOLINE: jmp)

But if I invert it, I think I need two feature flags to get the same functionality — X86_FEATURE_RETPOLINE and X86_FEATURE_RETPOLINE_AMD:

 ALTERNATIVE_2(jmp, RETPOLINE: retpoline, RETPOLINE_AMD: lfence+jmp)

So I was completely prepared to live with the slightly unnatural
inverse logic of the feature flag. But since you asked...

> We want to use that one for the entry stub anyway, can't we just
> standardize on that one for all our assembly?
> 
> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?

OK.... it starts off looking a bit like this. You're right; with the
caveats above it will let me invert the logic to X86_FEATURE_RETPOLINE
because the alternatives mechanism no longer needs to adjust any part
of the retpoline code path when it's in 'altinstr'.

And it does let me use a simple NOSPEC_JMP in the entry trampoline
instead of open-coding it again, which is nice.

But the first pass of it, below, is fugly as hell. I'll take another
look at *using* the ALTERNATIVE_2 macro instead of reimplementing it
for NOSPEC_CALL, but I strongly suspect that's just going to land me
with a fairly unreadable __stringify(jmp;call;lfence;jmp;call;mov;ret)
monstrosity all on a single line. Assembler macros are... brittle.

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 76f94bbacaec..8f7e1129f493 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -188,17 +188,7 @@ ENTRY(entry_SYSCALL_64_trampoline)
 	 */
 	pushq	%rdi
 	movq	$entry_SYSCALL_64_stage2, %rdi
-	/*
-	 * Open-code the retpoline from retpoline.S, because we can't
-	 * just jump to it directly.
-	 */
-	ALTERNATIVE "call 2f", "jmp *%rdi", X86_FEATURE_NO_RETPOLINE
-1:
-	lfence
-	jmp	1b
-2:
-	mov	%rdi, (%rsp)
-	ret
+	NOSPEC_JMP rdi
 END(entry_SYSCALL_64_trampoline)
 
 	.popsection
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index eced0dfaddc9..1c8312ff186a 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -14,15 +14,54 @@
  */
 .macro NOSPEC_JMP reg:req
 #ifdef CONFIG_RETPOLINE
-	ALTERNATIVE __stringify(jmp __x86.indirect_thunk.\reg), __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
+	ALTERNATIVE_2 "call 1112f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
+1111:
+	lfence
+	jmp	1111b
+1112:
+	mov	%\reg, (%_ASM_SP)
+	ret
 #else
-	jmp *%\reg
+	jmp	*%\reg
 #endif
 .endm
 
+/*
+ * Even __stringify() on the arguments doesn't really make it nice to use
+ * the existing ALTERNATIVE_2 macro here. So open-code our own version...
+ */
 .macro NOSPEC_CALL reg:req
 #ifdef CONFIG_RETPOLINE
-	ALTERNATIVE __stringify(call __x86.indirect_thunk.\reg), __stringify(call *%\reg), X86_FEATURE_NO_RETPOLINE
+140:
+	jmp	1113f
+1110:
+	call	1112f
+1111:
+	lfence
+	jmp	1111b
+1112:
+	mov	%\reg, (%_ASM_SP)
+	ret
+1113:
+	call	1110b
+141:
+	.skip -((alt_max_short(new_len1, new_len2) - (old_len)) > 0) * \
+		(alt_max_short(new_len1, new_len2) - (old_len)),0x90
+142:
+
+	.pushsection .altinstructions,"a"
+	altinstruction_entry 140b,143f,X86_FEATURE_K8,142b-140b,144f-143f,142b-141b
+	altinstruction_entry 140b,144f,X86_FEATURE_NO_RETPOLINE,142b-140b,145f-144f,142b-141b
+	.popsection
+
+	.pushsection .altinstr_replacement,"ax"
+143:
+	lfence
+	call	*%\reg
+144:
+	call	*%\reg
+145:
+	.popsection
 #else
 	call *%\reg
 #endif
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index 2a4b1f09eb84..5c15e4307da5 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -6,19 +6,14 @@
 #include <asm/cpufeatures.h>
 #include <asm/alternative-asm.h>
 #include <asm/export.h>
+#include <asm/nospec-branch.h>
 
 .macro THUNK sp reg
 	.section .text.__x86.indirect_thunk.\reg
 
 ENTRY(__x86.indirect_thunk.\reg)
 	CFI_STARTPROC
-	ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
-1:
-	lfence
-	jmp	1b
-2:
-	mov	%\reg, (%\sp)
-	ret
+	NOSPEC_JMP \reg
 	CFI_ENDPROC
 ENDPROC(__x86.indirect_thunk.\reg)
 EXPORT_SYMBOL(__x86.indirect_thunk.\reg)

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 20:32                 ` Woodhouse, David
@ 2018-01-05 21:11                   ` Brian Gerst
  2018-01-05 22:16                     ` Woodhouse, David
  0 siblings, 1 reply; 79+ messages in thread
From: Brian Gerst @ 2018-01-05 21:11 UTC (permalink / raw)
  To: Woodhouse, David
  Cc: Linus Torvalds, bp, linux-kernel, peterz, tim.c.chen, tglx, ak,
	riel, keescook, gnomes, pjt, dave.hansen, luto, jikos, gregkh

On Fri, Jan 5, 2018 at 3:32 PM, Woodhouse, David <dwmw@amazon.co.uk> wrote:
> On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
>>
>> Yes, I would suggest against expecting altinstructions to have
>> relocation information. They are generated in a different place, so..
>>
>> That said, I honestly like the inline version (the one that is in the
>> google paper first) of the retpoline more than the out-of-line one.
>> And that one shouldn't have any relocation issues, because all the
>> offsets are relative.
>
> Note that the *only* issue with the relocation is that it pushes me to
> use X86_FEATURE_NO_RETPOLINE for my feature instead of
> X86_FEATURE_RETPOLINE as might be natural. And actually there's a
> motivation to do that anyway, because of the way three-way alternatives
> interact.
>
> With the existing negative flag I can do
>
>  ALTERNATIVE_2(retpoline, K8: lfence+jmp; NO_RETPOLINE: jmp)
>
> But if I invert it, I think I need two feature flags to get the same functionality — X86_FEATURE_RETPOLINE and X86_FEATURE_RETPOLINE_AMD:
>
>  ALTERNATIVE_2(jmp, RETPOLINE: retpoline, RETPOLINE_AMD: lfence+jmp)

Another way to do it is with two consecutive alternatives:

ALTERNATIVE(NOP, K8: lfence)
ALTERNATIVE(jmp indirect, RETPOLINE: jmp thunk)

This also avoids the issue with the relocation of the jmp target when
the replacement is more than one instruction.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 17:28               ` Linus Torvalds
                                   ` (2 preceding siblings ...)
  2018-01-05 20:32                 ` Woodhouse, David
@ 2018-01-05 22:00                 ` Woodhouse, David
  2018-01-05 22:06                   ` Borislav Petkov
  2018-01-05 23:50                   ` Linus Torvalds
  3 siblings, 2 replies; 79+ messages in thread
From: Woodhouse, David @ 2018-01-05 22:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: bp, linux-kernel, peterz, tim.c.chen, tglx, ak, riel, keescook,
	gnomes, pjt, dave.hansen, luto, jikos, gregkh

[-- Attachment #1: Type: text/plain, Size: 8768 bytes --]

On Fri, 2018-01-05 at 09:28 -0800, Linus Torvalds wrote:
> That said, I honestly like the inline version (the one that is in the
> google paper first) of the retpoline more than the out-of-line one.
> And that one shouldn't have any relocagtion issues, because all the
> offsets are relative.
> 
> We want to use that one for the entry stub anyway, can't we just
> standardize on that one for all our assembly?
> 
> If the *compiler* uses the out-of-line version, that's a separate
> thing. But for our asm cases, let's just make it all be the inline
> case, ok?

OK, this one looks saner, and I think I've tested all the 32/64 bit
retpoline/amd/none permutations. Pushed to
http://git.infradead.org/users/dwmw2/linux-retpoline.git/

If this matches what you were thinking, I'll refactor the series in the
morning to do it this way.

From cfddb3bcae1524da52e782398da2809ec8faa200 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Fri, 5 Jan 2018 21:50:41 +0000
Subject: [PATCH] x86/retpoline: Clean up inverted X86_FEATURE_NO_RETPOLINE
 logic

If we make the thunks inline so they don't need relocations to branch to
__x86.indirect_thunk.xxx then we don't fall foul of the alternatives
handling, and we can have the retpoline variant in 'altinstr'. This
means that we can use X86_FEATURE_RETPOLINE which is more natural.

Unfortunately, it does mean that the X86_FEATURE_K8 trick doesn't work
any more, so we need an additional X86_FEATURE_RETPOLINE_AMD for that.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/entry/entry_64.S            | 12 +--------
 arch/x86/include/asm/cpufeatures.h   |  3 ++-
 arch/x86/include/asm/nospec-branch.h | 50 +++++++++++++++++++++++++++---------
 arch/x86/kernel/cpu/common.c         |  5 ++++
 arch/x86/kernel/cpu/intel.c          |  3 ++-
 arch/x86/lib/retpoline.S             | 17 +++++-------
 6 files changed, 54 insertions(+), 36 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 76f94bbacaec..8f7e1129f493 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -188,17 +188,7 @@ ENTRY(entry_SYSCALL_64_trampoline)
 	 */
 	pushq	%rdi
 	movq	$entry_SYSCALL_64_stage2, %rdi
-	/*
-	 * Open-code the retpoline from retpoline.S, because we can't
-	 * just jump to it directly.
-	 */
-	ALTERNATIVE "call 2f", "jmp *%rdi", X86_FEATURE_NO_RETPOLINE
-1:
-	lfence
-	jmp	1b
-2:
-	mov	%rdi, (%rsp)
-	ret
+	NOSPEC_JMP rdi
 END(entry_SYSCALL_64_trampoline)
 
 	.popsection
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 2d916fd13bf9..6f10edabbf82 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -203,7 +203,8 @@
 #define X86_FEATURE_PROC_FEEDBACK	( 7*32+ 9) /* AMD ProcFeedbackInterface */
 #define X86_FEATURE_SME			( 7*32+10) /* AMD Secure Memory Encryption */
 #define X86_FEATURE_PTI			( 7*32+11) /* Kernel Page Table Isolation enabled */
-#define X86_FEATURE_NO_RETPOLINE	( 7*32+12) /* Retpoline mitigation for Spectre variant 2 */
+#define X86_FEATURE_RETPOLINE		( 7*32+12) /* Intel Retpoline mitigation for Spectre variant 2 */
+#define X86_FEATURE_RETPOLINE_AMD	( 7*32+13) /* AMD Retpoline mitigation for Spectre variant 2 */
 #define X86_FEATURE_INTEL_PPIN		( 7*32+14) /* Intel Processor Inventory Number */
 #define X86_FEATURE_INTEL_PT		( 7*32+15) /* Intel Processor Trace */
 #define X86_FEATURE_AVX512_4VNNIW	( 7*32+16) /* AVX-512 Neural Network Instructions */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index eced0dfaddc9..6e92edf64e53 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -8,23 +8,44 @@
 #include <asm/cpufeatures.h>
 
 #ifdef __ASSEMBLY__
+
 /*
- * The asm code uses CONFIG_RETPOLINE; this part will happen even if
- * the toolchain isn't retpoline-capable.
+ * These are the bare retpoline primitives for indirect jmp and call.
+ * Do not use these directly.
  */
+.macro RETPOLINE_JMP reg:req
+	call	1112f
+1111:	lfence
+	jmp	1111b
+1112:	mov	%\reg, (%_ASM_SP)
+	ret
+.endm
+
+.macro RETPOLINE_CALL reg:req
+	jmp	1113f
+1110:	RETPOLINE_JMP \reg
+1113:	call	1110b
+.endm
+
+
+
 .macro NOSPEC_JMP reg:req
 #ifdef CONFIG_RETPOLINE
-	ALTERNATIVE __stringify(jmp __x86.indirect_thunk.\reg), __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
+	ALTERNATIVE_2 __stringify(jmp *%\reg),				\
+		__stringify(RETPOLINE_JMP \reg), X86_FEATURE_RETPOLINE,	\
+		__stringify(lfence; jmp *%\reg), X86_FEATURE_RETPOLINE_AMD
 #else
-	jmp *%\reg
+	jmp	*%\reg
 #endif
 .endm
 
 .macro NOSPEC_CALL reg:req
 #ifdef CONFIG_RETPOLINE
-	ALTERNATIVE __stringify(call __x86.indirect_thunk.\reg), __stringify(call *%\reg), X86_FEATURE_NO_RETPOLINE
+	ALTERNATIVE_2 __stringify(call *%\reg),				\
+		__stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE,\
+		__stringify(lfence; call *%\reg), X86_FEATURE_RETPOLINE_AMD
 #else
-	call *%\reg
+	call	*%\reg
 #endif
 .endm
 
@@ -36,16 +57,21 @@
  */
 #if defined(CONFIG_X86_64) && defined(RETPOLINE)
 #  define NOSPEC_CALL ALTERNATIVE(				\
+	"call *%[thunk_target]\n",				\
 	"call __x86.indirect_thunk.%V[thunk_target]\n",		\
-	"call *%[thunk_target]\n", X86_FEATURE_NO_RETPOLINE)
+	X86_FEATURE_RETPOLINE)
 #  define THUNK_TARGET(addr) [thunk_target] "r" (addr)
 #elif defined(CONFIG_X86_64) && defined(CONFIG_RETPOLINE)
 # define NOSPEC_CALL ALTERNATIVE(				\
-	"       jmp 1221f; "					\
-	"1222:  push %[thunk_target];"				\
-	"       jmp __x86.indirect_thunk;"			\
-	"1221:  call 1222b;\n",					\
-	"call *%[thunk_target]\n", X86_FEATURE_NO_RETPOLINE)
+	"call	*%[thunk_target]\n",				\
+	"       jmp    1113f; "					\
+	"1110:  call   1112f; "					\
+	"1111:	lfence; "					\
+	"       jmp    1111b; "					\
+	"1112:	movl   %[thunk_target], (%esp); "		\
+	"       ret; "						\
+	"1113:  call   1110b;\n",				\
+	X86_FEATURE_RETPOLINE)
 # define THUNK_TARGET(addr) [thunk_target] "rm" (addr)
 #else
 # define NOSPEC_CALL "call *%[thunk_target]\n"
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 372ba3fb400f..40e6e54d8501 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -904,6 +904,11 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 
 	setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
 	setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
+#ifdef CONFIG_RETPOLINE
+	setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+	if (c->x86_vendor == X86_VENDOR_AMD)
+		setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
+#endif
 
 	fpu__init_system(c);
 
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index e1812d07b53e..35e123e5f413 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -35,7 +35,8 @@
 static int __init noretpoline_setup(char *__unused)
 {
 	pr_info("Retpoline runtime disabled\n");
-	setup_force_cpu_cap(X86_FEATURE_NO_RETPOLINE);
+	setup_clear_cpu_cap(X86_FEATURE_RETPOLINE);
+	setup_clear_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
 	return 1;
 }
 __setup("noretpoline", noretpoline_setup);
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index 2a4b1f09eb84..90d9a1589a54 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -6,19 +6,14 @@
 #include <asm/cpufeatures.h>
 #include <asm/alternative-asm.h>
 #include <asm/export.h>
+#include <asm/nospec-branch.h>
 
-.macro THUNK sp reg
+.macro THUNK reg
 	.section .text.__x86.indirect_thunk.\reg
 
 ENTRY(__x86.indirect_thunk.\reg)
 	CFI_STARTPROC
-	ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
-1:
-	lfence
-	jmp	1b
-2:
-	mov	%\reg, (%\sp)
-	ret
+	NOSPEC_JMP \reg
 	CFI_ENDPROC
 ENDPROC(__x86.indirect_thunk.\reg)
 EXPORT_SYMBOL(__x86.indirect_thunk.\reg)
@@ -26,11 +21,11 @@ EXPORT_SYMBOL(__x86.indirect_thunk.\reg)
 
 #ifdef CONFIG_64BIT
 .irp reg rax rbx rcx rdx rsi rdi rbp r8 r9 r10 r11 r12 r13 r14 r15
-	THUNK rsp \reg
+	THUNK \reg
 .endr
 #else
 .irp reg eax ebx ecx edx esi edi ebp
-	THUNK esp \reg
+	THUNK \reg
 .endr
 
 /*
@@ -39,7 +34,7 @@ EXPORT_SYMBOL(__x86.indirect_thunk.\reg)
  */
 ENTRY(__x86.indirect_thunk)
 	CFI_STARTPROC
-	ALTERNATIVE "call 2f", "ret", X86_FEATURE_NO_RETPOLINE
+	ALTERNATIVE "ret", "call 2f", X86_FEATURE_RETPOLINE
 1:
 	lfence
 	jmp	1b
-- 
2.14.3


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 22:00                 ` Woodhouse, David
@ 2018-01-05 22:06                   ` Borislav Petkov
  2018-01-05 23:50                   ` Linus Torvalds
  1 sibling, 0 replies; 79+ messages in thread
From: Borislav Petkov @ 2018-01-05 22:06 UTC (permalink / raw)
  To: Woodhouse, David
  Cc: torvalds, linux-kernel, peterz, tim.c.chen, tglx, ak, riel,
	keescook, gnomes, pjt, dave.hansen, luto, jikos, gregkh

On Fri, Jan 05, 2018 at 10:00:19PM +0000, Woodhouse, David wrote:
> OK, this one looks saner, and I think I've tested all the 32/64 bit

Dunno, I think Brian's suggestion will make this even simpler:

ALTERNATIVE(NOP, K8: lfence)
ALTERNATIVE(jmp indirect, RETPOLINE: jmp thunk)

Hmm?

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 21:11                   ` Brian Gerst
@ 2018-01-05 22:16                     ` Woodhouse, David
  2018-01-05 22:43                       ` Borislav Petkov
  0 siblings, 1 reply; 79+ messages in thread
From: Woodhouse, David @ 2018-01-05 22:16 UTC (permalink / raw)
  To: brgerst
  Cc: linux-kernel, peterz, tim.c.chen, torvalds, tglx, bp, ak, riel,
	keescook, gnomes, pjt, dave.hansen, luto, jikos, gregkh


[-- Attachment #1.1: Type: text/plain, Size: 415 bytes --]

On Fri, 2018-01-05 at 16:11 -0500, Brian Gerst wrote:
> 
> Another way to do it is with two consecutive alternatives:
> 
> ALTERNATIVE(NOP, K8: lfence)
> ALTERNATIVE(jmp indirect, RETPOLINE: jmp thunk)
> 
> This also avoids the issue with the relocation of the jmp target when
> the replacement is more than one instruction.

You'd still want a RETPOLINE_AMD flag to enable that lfence; it's not
just K8.

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

[-- Attachment #2.1: Type: text/plain, Size: 197 bytes --]




Amazon Web Services UK Limited. Registered in England and Wales with registration number 08650665 and which has its registered office at 60 Holborn Viaduct, London EC1A 2FD, United Kingdom.

[-- Attachment #2.2: Type: text/html, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 22:16                     ` Woodhouse, David
@ 2018-01-05 22:43                       ` Borislav Petkov
  0 siblings, 0 replies; 79+ messages in thread
From: Borislav Petkov @ 2018-01-05 22:43 UTC (permalink / raw)
  To: Woodhouse, David
  Cc: brgerst, linux-kernel, peterz, tim.c.chen, torvalds, tglx, ak,
	riel, keescook, gnomes, pjt, dave.hansen, luto, jikos, gregkh

On Fri, Jan 05, 2018 at 10:16:54PM +0000, Woodhouse, David wrote:
> You'd still want a RETPOLINE_AMD flag to enable that lfence; it's not
> just K8.

I think you're forgetting that we set K8 on everything >= K8 on AMD. So
this:

+       if (c->x86_vendor == X86_VENDOR_AMD)
+               setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);

can just as well be dropped and X86_FEATURE_K8 used everywhere.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 22:00                 ` Woodhouse, David
  2018-01-05 22:06                   ` Borislav Petkov
@ 2018-01-05 23:50                   ` Linus Torvalds
  2018-01-06 10:53                     ` Woodhouse, David
  1 sibling, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2018-01-05 23:50 UTC (permalink / raw)
  To: Woodhouse, David
  Cc: linux-kernel, peterz, tim.c.chen, bp, tglx, ak, riel, keescook,
	gnomes, pjt, dave.hansen, luto, jikos, gregkh

On Fri, Jan 5, 2018 at 2:00 PM, Woodhouse, David <dwmw@amazon.co.uk> wrote:
> +.macro RETPOLINE_JMP reg:req
> +       call    1112f
> +1111:  lfence
> +       jmp     1111b
> +1112:  mov     %\reg, (%_ASM_SP)
> +       ret
> +.endm
> +
> +.macro RETPOLINE_CALL reg:req
> +       jmp     1113f
> +1110:  RETPOLINE_JMP \reg
> +1113:  call    1110b
> +.endm

Why do these still force a register name?

Is it because this is the only user?

> index 2a4b1f09eb84..90d9a1589a54 100644
> --- a/arch/x86/lib/retpoline.S
> +++ b/arch/x86/lib/retpoline.S
> @@ -6,19 +6,14 @@
>  #include <asm/cpufeatures.h>
>  #include <asm/alternative-asm.h>
>  #include <asm/export.h>
> +#include <asm/nospec-branch.h>
>
> -.macro THUNK sp reg
> +.macro THUNK reg
>         .section .text.__x86.indirect_thunk.\reg
>
>  ENTRY(__x86.indirect_thunk.\reg)
>         CFI_STARTPROC
> -       ALTERNATIVE_2 "call 2f", __stringify(lfence;jmp *%\reg), X86_FEATURE_K8, __stringify(jmp *%\reg), X86_FEATURE_NO_RETPOLINE
> -1:
> -       lfence
> -       jmp     1b
> -2:
> -       mov     %\reg, (%\sp)
> -       ret
> +       NOSPEC_JMP \reg
>         CFI_ENDPROC

Can we just move those macros into retpoline.S because using them
outside that shouldn't happen?

                 Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 17:08             ` Josh Poimboeuf
@ 2018-01-06  0:30               ` Borislav Petkov
  2018-01-06  8:23                 ` David Woodhouse
  2018-01-08  5:06                 ` Josh Poimboeuf
  0 siblings, 2 replies; 79+ messages in thread
From: Borislav Petkov @ 2018-01-06  0:30 UTC (permalink / raw)
  To: Josh Poimboeuf, Woodhouse, David
  Cc: tglx, linux-kernel, tim.c.chen, peterz, torvalds, ak, riel,
	keescook, gnomes, pjt, dave.hansen, luto, jikos, gregkh

On Fri, Jan 05, 2018 at 11:08:06AM -0600, Josh Poimboeuf wrote:
> I seem to recall that we also discussed the need for this for converting
> pvops to use alternatives, though the "why" is eluding me at the moment.

Ok, here's something which seems to work in my VM here. I'll continue
playing with it tomorrow. Josh, if you have some example sequences for
me to try, send them my way pls.

Anyway, here's an example:

alternative("", "xor %%rdi, %%rdi; jmp startup_64", X86_FEATURE_K8);

which did this:

[    0.921013] apply_alternatives: feat: 3*32+4, old: (ffffffff81027429, len: 8), repl: (ffffffff824759d2, len: 8), pad: 8
[    0.924002] ffffffff81027429: old_insn: 90 90 90 90 90 90 90 90
[    0.928003] ffffffff824759d2: rpl_insn: 48 31 ff e9 26 a6 b8 fe
[    0.930212] process_jumps: repl[0]: 0x48
[    0.932002] process_jumps: insn len: 3
[    0.932814] process_jumps: repl[0]: 0xe9
[    0.934003] recompute_jump: o_dspl: 0xfeb8a626
[    0.934914] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8bd7
[    0.936001] recompute_jump: final displ: 0xfffd8bd2, JMP 0xffffffff81000000
[    0.937240] process_jumps: insn len: 5
[    0.938053] ffffffff81027429: final_insn: e9 d2 8b fd ff a6 b8 fe

Apparently our insn decoder is smart enough to parse the insn and get
its length, so I can use that. It jumps over the first 3-byte XOR and
than massaged the following 5-byte jump.

---
From: Borislav Petkov <bp@suse.de>
Date: Fri, 5 Jan 2018 20:32:58 +0100
Subject: [PATCH] WIP

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/alternative.c | 51 +++++++++++++++++++++++++++++++------------
 1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index dbaf14d69ebd..14a855789a50 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -21,6 +21,7 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 #include <asm/fixmap.h>
+#include <asm/insn.h>
 
 int __read_mostly alternatives_patched;
 
@@ -281,24 +282,24 @@ static inline bool is_jmp(const u8 opcode)
 }
 
 static void __init_or_module
-recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
+recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 repl_len, u8 *insnbuf)
 {
 	u8 *next_rip, *tgt_rip;
 	s32 n_dspl, o_dspl;
-	int repl_len;
 
-	if (a->replacementlen != 5)
+	if (repl_len != 5)
 		return;
 
-	o_dspl = *(s32 *)(insnbuf + 1);
+	o_dspl = *(s32 *)(repl_insn + 1);
 
 	/* next_rip of the replacement JMP */
-	next_rip = repl_insn + a->replacementlen;
+	next_rip = repl_insn + repl_len;
+
 	/* target rip of the replacement JMP */
 	tgt_rip  = next_rip + o_dspl;
 	n_dspl = tgt_rip - orig_insn;
 
-	DPRINTK("target RIP: %p, new_displ: 0x%x", tgt_rip, n_dspl);
+	DPRINTK("target RIP: %px, new_displ: 0x%x", tgt_rip, n_dspl);
 
 	if (tgt_rip - orig_insn >= 0) {
 		if (n_dspl - 2 <= 127)
@@ -337,6 +338,29 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
 		n_dspl, (unsigned long)orig_insn + n_dspl + repl_len);
 }
 
+static void __init_or_module process_jumps(struct alt_instr *a, u8 *insnbuf)
+{
+	u8 *repl  = (u8 *)&a->repl_offset  + a->repl_offset;
+	u8 *instr = (u8 *)&a->instr_offset + a->instr_offset;
+	struct insn insn;
+	int i = 0;
+
+	if (!a->replacementlen)
+		return;
+
+	while (i < a->replacementlen) {
+		kernel_insn_init(&insn, repl, a->replacementlen);
+
+		insn_get_length(&insn);
+
+		if (is_jmp(repl[0]))
+			recompute_jump(a, instr, repl, insn.length, insnbuf);
+
+		i    += insn.length;
+		repl += insn.length;
+	}
+}
+
 /*
  * "noinline" to cause control flow change and thus invalidate I$ and
  * cause refetch after modification.
@@ -352,7 +376,7 @@ static void __init_or_module noinline optimize_nops(struct alt_instr *a, u8 *ins
 	add_nops(instr + (a->instrlen - a->padlen), a->padlen);
 	local_irq_restore(flags);
 
-	DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ",
+	DUMP_BYTES(instr, a->instrlen, "%px: [%d:%d) optimized NOPs: ",
 		   instr, a->instrlen - a->padlen, a->padlen);
 }
 
@@ -373,7 +397,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 	u8 *instr, *replacement;
 	u8 insnbuf[MAX_PATCH_LEN];
 
-	DPRINTK("alt table %p -> %p", start, end);
+	DPRINTK("alt table %px -> %px", start, end);
 	/*
 	 * The scan order should be from start to end. A later scanned
 	 * alternative code can overwrite previously scanned alternative code.
@@ -397,14 +421,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 			continue;
 		}
 
-		DPRINTK("feat: %d*32+%d, old: (%p, len: %d), repl: (%p, len: %d), pad: %d",
+		DPRINTK("feat: %d*32+%d, old: (%px, len: %d), repl: (%px, len: %d), pad: %d",
 			a->cpuid >> 5,
 			a->cpuid & 0x1f,
 			instr, a->instrlen,
 			replacement, a->replacementlen, a->padlen);
 
-		DUMP_BYTES(instr, a->instrlen, "%p: old_insn: ", instr);
-		DUMP_BYTES(replacement, a->replacementlen, "%p: rpl_insn: ", replacement);
+		DUMP_BYTES(instr, a->instrlen, "%px: old_insn: ", instr);
+		DUMP_BYTES(replacement, a->replacementlen, "%px: rpl_insn: ", replacement);
 
 		memcpy(insnbuf, replacement, a->replacementlen);
 		insnbuf_sz = a->replacementlen;
@@ -422,15 +446,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 				(unsigned long)instr + *(s32 *)(insnbuf + 1) + 5);
 		}
 
-		if (a->replacementlen && is_jmp(replacement[0]))
-			recompute_jump(a, instr, replacement, insnbuf);
+		process_jumps(a, insnbuf);
 
 		if (a->instrlen > a->replacementlen) {
 			add_nops(insnbuf + a->replacementlen,
 				 a->instrlen - a->replacementlen);
 			insnbuf_sz += a->instrlen - a->replacementlen;
 		}
-		DUMP_BYTES(insnbuf, insnbuf_sz, "%p: final_insn: ", instr);
+		DUMP_BYTES(insnbuf, insnbuf_sz, "%px: final_insn: ", instr);
 
 		text_poke_early(instr, insnbuf, insnbuf_sz);
 	}
-- 
2.13.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-06  0:30               ` Borislav Petkov
@ 2018-01-06  8:23                 ` David Woodhouse
  2018-01-06 17:02                   ` Borislav Petkov
  2018-01-08  5:06                 ` Josh Poimboeuf
  1 sibling, 1 reply; 79+ messages in thread
From: David Woodhouse @ 2018-01-06  8:23 UTC (permalink / raw)
  To: Borislav Petkov, Josh Poimboeuf
  Cc: tglx, linux-kernel, tim.c.chen, peterz, torvalds, ak, riel,
	keescook, gnomes, pjt, dave.hansen, luto, jikos, gregkh

[-- Attachment #1: Type: text/plain, Size: 1918 bytes --]

On Sat, 2018-01-06 at 01:30 +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 11:08:06AM -0600, Josh Poimboeuf wrote:
> > I seem to recall that we also discussed the need for this for converting
> > pvops to use alternatives, though the "why" is eluding me at the moment.
> 
> Ok, here's something which seems to work in my VM here. I'll continue
> playing with it tomorrow. Josh, if you have some example sequences for
> me to try, send them my way pls.
> 
> Anyway, here's an example:
> 
> alternative("", "xor %%rdi, %%rdi; jmp startup_64", X86_FEATURE_K8);
> 
> which did this:
> 
> [    0.921013] apply_alternatives: feat: 3*32+4, old: (ffffffff81027429, len: 8), repl: (ffffffff824759d2, len: 8), pad: 8
> [    0.924002] ffffffff81027429: old_insn: 90 90 90 90 90 90 90 90
> [    0.928003] ffffffff824759d2: rpl_insn: 48 31 ff e9 26 a6 b8 fe
> [    0.930212] process_jumps: repl[0]: 0x48
> [    0.932002] process_jumps: insn len: 3
> [    0.932814] process_jumps: repl[0]: 0xe9
> [    0.934003] recompute_jump: o_dspl: 0xfeb8a626
> [    0.934914] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8bd7
> [    0.936001] recompute_jump: final displ: 0xfffd8bd2, JMP 0xffffffff81000000
> [    0.937240] process_jumps: insn len: 5
> [    0.938053] ffffffff81027429: final_insn: e9 d2 8b fd ff a6 b8 fe
> 
> Apparently our insn decoder is smart enough to parse the insn and get
> its length, so I can use that. It jumps over the first 3-byte XOR and
> than massaged the following 5-byte jump.

Thanks. From code inspection, I couldn't see that it was smart enough
*not* to process a relative jump in the 'altinstr' section which was
jumping to a target *within* that same altinstr section, and thus
didn't need to be touched at all. Does this work?

alternative("", "xor %%rdi, %%rdi; jmp 2f; 2: jmp startup_64", X86_FEATURE_K8);

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-05 23:50                   ` Linus Torvalds
@ 2018-01-06 10:53                     ` Woodhouse, David
  0 siblings, 0 replies; 79+ messages in thread
From: Woodhouse, David @ 2018-01-06 10:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, peterz, tim.c.chen, bp, tglx, ak, riel, keescook,
	gnomes, pjt, dave.hansen, luto, jikos, gregkh

[-- Attachment #1: Type: text/plain, Size: 929 bytes --]

On Fri, 2018-01-05 at 15:50 -0800, Linus Torvalds wrote:
> 
> > +
> > +.macro RETPOLINE_CALL reg:req
> > +       jmp     1113f
> > +1110:  RETPOLINE_JMP \reg
> > +1113:  call    1110b
> > +.endm


(Note that RETPOLINE_CALL is purely internal to nospec-branch.h, used
only from the NOSPEC_CALL macro to make the ALTERNATIVE_2 invocation
les
s ugly than it would have been. But the same applies to NOSPEC_CALL
which is the one that gets used in our .S files including retpoline.S)

> Why do these still force a register name?
> 
> Is it because this is the only user?

By "register name" are you asking why it's invoked as
"NOSPEC_CALL rbx" instead of "NOSPEC_CALL %rbx" with the percent sign?

I think I can probably clean that up now, and turn a lot of %\reg
occurrences into just \reg. The only remaining %\reg would be left in
retpoline.S inside the THUNK macro used from .irp.





[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-06  8:23                 ` David Woodhouse
@ 2018-01-06 17:02                   ` Borislav Petkov
  2018-01-07  9:40                     ` David Woodhouse
  0 siblings, 1 reply; 79+ messages in thread
From: Borislav Petkov @ 2018-01-06 17:02 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Josh Poimboeuf, tglx, linux-kernel, tim.c.chen, peterz, torvalds,
	ak, riel, keescook, gnomes, pjt, dave.hansen, luto, jikos,
	gregkh

On Sat, Jan 06, 2018 at 08:23:21AM +0000, David Woodhouse wrote:
> Thanks. From code inspection, I couldn't see that it was smart enough
> *not* to process a relative jump in the 'altinstr' section which was
> jumping to a target *within* that same altinstr section, and thus
> didn't need to be touched at all. Does this work?
> 
> alternative("", "xor %%rdi, %%rdi; jmp 2f; 2: jmp startup_64", X86_FEATURE_K8);

So this is fine because it gets turned into a two-byte jump:

[    0.816005] apply_alternatives: feat: 3*32+4, old: (ffffffff810273c9, len: 10), repl: (ffffffff824759d2, len: 10), pad: 10
[    0.820001] ffffffff810273c9: old_insn: 90 90 90 90 90 90 90 90 90 90
[    0.821247] ffffffff824759d2: rpl_insn: 48 31 ff eb 00 e9 24 a6 b8 fe
[    0.822455] process_jumps: insn start 0x48, at 0, len: 3
[    0.823496] process_jumps: insn start 0xeb, at 3, len: 2
[    0.824002] process_jumps: insn start 0xe9, at 5, len: 5
[    0.825120] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8c37
[    0.826567] recompute_jump: final displ: 0xfffd8c32, JMP 0xffffffff81000000
[    0.828001] ffffffff810273c9: final_insn: e9 32 8c fd ff e9 24 a6 b8 fe

i.e., notice the "eb 00" thing.

Which, when copied into the kernel proper, will simply work as it is
a small offset which, when referring to other code which gets copied
*together* with it, should work. I.e., we're not changing the offsets
during the copy so all good.

It becomes more tricky when you force a 5-byte jump:

        alternative("", "xor %%rdi, %%rdi; .byte 0xe9; .long 2f - .altinstr_replacement; 2: jmp startup_64", X86_FEATURE_K8);

because then you need to know whether the offset is within the
.altinstr_replacement section itself or it is meant to be an absolute
offset like jmp startup_64 or within another section.

On that I need to sleep more to figure out what a reliable way to do it,
would be. I mean, not that we need it now. If all we care is two-byte
offsets, those should work now. The stress being on "should".

The current version:

---
From: Borislav Petkov <bp@suse.de>
Date: Fri, 5 Jan 2018 20:32:58 +0100
Subject: [PATCH] WIP

Signed-off-by: Borislav Petkov <bp@suse.de>
---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index dbaf14d69ebd..0cb4f886e6d7 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -21,6 +21,7 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 #include <asm/fixmap.h>
+#include <asm/insn.h>
 
 int __read_mostly alternatives_patched;
 
@@ -280,25 +281,35 @@ static inline bool is_jmp(const u8 opcode)
 	return opcode == 0xeb || opcode == 0xe9;
 }
 
+/**
+ * @orig_insn:	pointer to the original insn
+ * @repl_insn:	pointer to the replacement insn
+ * @repl_len:	length of the replacement insn
+ * @insnbuf:	buffer we're working on massaging the insns
+ * @i_off:	offset within the buffer
+ */
 static void __init_or_module
-recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
+recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 repl_len,
+	       u8 *insnbuf, u8 i_off)
 {
 	u8 *next_rip, *tgt_rip;
 	s32 n_dspl, o_dspl;
-	int repl_len;
 
-	if (a->replacementlen != 5)
+	if (repl_len != 5)
 		return;
 
-	o_dspl = *(s32 *)(insnbuf + 1);
+	o_dspl = *(s32 *)(repl_insn + 1);
+
+	DPRINTK("o_dspl: 0x%x, orig_insn: %px", o_dspl, orig_insn);
 
 	/* next_rip of the replacement JMP */
-	next_rip = repl_insn + a->replacementlen;
+	next_rip = repl_insn + repl_len;
+
 	/* target rip of the replacement JMP */
 	tgt_rip  = next_rip + o_dspl;
 	n_dspl = tgt_rip - orig_insn;
 
-	DPRINTK("target RIP: %p, new_displ: 0x%x", tgt_rip, n_dspl);
+	DPRINTK("target RIP: %px, new_displ: 0x%x", tgt_rip, n_dspl);
 
 	if (tgt_rip - orig_insn >= 0) {
 		if (n_dspl - 2 <= 127)
@@ -316,8 +327,8 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
 two_byte_jmp:
 	n_dspl -= 2;
 
-	insnbuf[0] = 0xeb;
-	insnbuf[1] = (s8)n_dspl;
+	insnbuf[i_off] = 0xeb;
+	insnbuf[i_off + 1] = (s8)n_dspl;
 	add_nops(insnbuf + 2, 3);
 
 	repl_len = 2;
@@ -326,8 +337,8 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
 five_byte_jmp:
 	n_dspl -= 5;
 
-	insnbuf[0] = 0xe9;
-	*(s32 *)&insnbuf[1] = n_dspl;
+	insnbuf[i_off] = 0xe9;
+	*(s32 *)&insnbuf[i_off + 1] = n_dspl;
 
 	repl_len = 5;
 
@@ -337,6 +348,32 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
 		n_dspl, (unsigned long)orig_insn + n_dspl + repl_len);
 }
 
+static void __init_or_module process_jumps(struct alt_instr *a, u8 *insnbuf)
+{
+	u8 *repl  = (u8 *)&a->repl_offset  + a->repl_offset;
+	u8 *instr = (u8 *)&a->instr_offset + a->instr_offset;
+	struct insn insn;
+	int i = 0;
+
+	if (!a->replacementlen)
+		return;
+
+	while (i < a->replacementlen) {
+		kernel_insn_init(&insn, repl, a->replacementlen);
+
+		insn_get_length(&insn);
+
+		DPRINTK("insn start 0x%x, at %d, len: %d", repl[0], i, insn.length);
+
+		if (is_jmp(repl[0]))
+			recompute_jump(a, instr, repl, insn.length, insnbuf, i);
+
+		i     += insn.length;
+		repl  += insn.length;
+		instr += insn.length;
+	}
+}
+
 /*
  * "noinline" to cause control flow change and thus invalidate I$ and
  * cause refetch after modification.
@@ -352,7 +389,7 @@ static void __init_or_module noinline optimize_nops(struct alt_instr *a, u8 *ins
 	add_nops(instr + (a->instrlen - a->padlen), a->padlen);
 	local_irq_restore(flags);
 
-	DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ",
+	DUMP_BYTES(instr, a->instrlen, "%px: [%d:%d) optimized NOPs: ",
 		   instr, a->instrlen - a->padlen, a->padlen);
 }
 
@@ -373,7 +410,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 	u8 *instr, *replacement;
 	u8 insnbuf[MAX_PATCH_LEN];
 
-	DPRINTK("alt table %p -> %p", start, end);
+	DPRINTK("alt table %px -> %px", start, end);
 	/*
 	 * The scan order should be from start to end. A later scanned
 	 * alternative code can overwrite previously scanned alternative code.
@@ -397,14 +434,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 			continue;
 		}
 
-		DPRINTK("feat: %d*32+%d, old: (%p, len: %d), repl: (%p, len: %d), pad: %d",
+		DPRINTK("feat: %d*32+%d, old: (%px, len: %d), repl: (%px, len: %d), pad: %d",
 			a->cpuid >> 5,
 			a->cpuid & 0x1f,
 			instr, a->instrlen,
 			replacement, a->replacementlen, a->padlen);
 
-		DUMP_BYTES(instr, a->instrlen, "%p: old_insn: ", instr);
-		DUMP_BYTES(replacement, a->replacementlen, "%p: rpl_insn: ", replacement);
+		DUMP_BYTES(instr, a->instrlen, "%px: old_insn: ", instr);
+		DUMP_BYTES(replacement, a->replacementlen, "%px: rpl_insn: ", replacement);
 
 		memcpy(insnbuf, replacement, a->replacementlen);
 		insnbuf_sz = a->replacementlen;
@@ -422,15 +459,14 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 				(unsigned long)instr + *(s32 *)(insnbuf + 1) + 5);
 		}
 
-		if (a->replacementlen && is_jmp(replacement[0]))
-			recompute_jump(a, instr, replacement, insnbuf);
+		process_jumps(a, insnbuf);
 
 		if (a->instrlen > a->replacementlen) {
 			add_nops(insnbuf + a->replacementlen,
 				 a->instrlen - a->replacementlen);
 			insnbuf_sz += a->instrlen - a->replacementlen;
 		}
-		DUMP_BYTES(insnbuf, insnbuf_sz, "%p: final_insn: ", instr);
+		DUMP_BYTES(insnbuf, insnbuf_sz, "%px: final_insn: ", instr);
 
 		text_poke_early(instr, insnbuf, insnbuf_sz);
 	}
-- 
2.13.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-06 17:02                   ` Borislav Petkov
@ 2018-01-07  9:40                     ` David Woodhouse
  2018-01-07 11:46                       ` Borislav Petkov
  0 siblings, 1 reply; 79+ messages in thread
From: David Woodhouse @ 2018-01-07  9:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Josh Poimboeuf, tglx, linux-kernel, tim.c.chen, peterz, torvalds,
	ak, riel, keescook, gnomes, pjt, dave.hansen, luto, jikos,
	gregkh

[-- Attachment #1: Type: text/plain, Size: 3834 bytes --]

On Sat, 2018-01-06 at 18:02 +0100, Borislav Petkov wrote:
> On Sat, Jan 06, 2018 at 08:23:21AM +0000, David Woodhouse wrote:
> > Thanks. From code inspection, I couldn't see that it was smart enough
> > *not* to process a relative jump in the 'altinstr' section which was
> > jumping to a target *within* that same altinstr section, and thus
> > didn't need to be touched at all. Does this work?
> > 
> > alternative("", "xor %%rdi, %%rdi; jmp 2f; 2: jmp startup_64", X86_FEATURE_K8);
> 
> So this is fine because it gets turned into a two-byte jump:
> 
> [    0.816005] apply_alternatives: feat: 3*32+4, old: (ffffffff810273c9, len: 10), repl: (ffffffff824759d2, len: 10), pad: 10
> [    0.820001] ffffffff810273c9: old_insn: 90 90 90 90 90 90 90 90 90 90
> [    0.821247] ffffffff824759d2: rpl_insn: 48 31 ff eb 00 e9 24 a6 b8 fe
> [    0.822455] process_jumps: insn start 0x48, at 0, len: 3
> [    0.823496] process_jumps: insn start 0xeb, at 3, len: 2
> [    0.824002] process_jumps: insn start 0xe9, at 5, len: 5
> [    0.825120] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8c37
> [    0.826567] recompute_jump: final displ: 0xfffd8c32, JMP 0xffffffff81000000
> [    0.828001] ffffffff810273c9: final_insn: e9 32 8c fd ff e9 24 a6 b8 fe
> 
> i.e., notice the "eb 00" thing.
> 
> Which, when copied into the kernel proper, will simply work as it is
> a small offset which, when referring to other code which gets copied
> *together* with it, should work. I.e., we're not changing the offsets
> during the copy so all good.
> 
> It becomes more tricky when you force a 5-byte jump:
> 
>         alternative("", "xor %%rdi, %%rdi; .byte 0xe9; .long 2f - .altinstr_replacement; 2: jmp startup_64", X86_FEATURE_K8);
> 
> because then you need to know whether the offset is within the
> .altinstr_replacement section itself or it is meant to be an absolute
> offset like jmp startup_64 or within another section.

Right, so it all tends to work out OK purely by virtue of the fact that
oldinstr and altinstr end up far enough apart in the image that they're
5-byte jumps. Which isn't perfect but we've lived with worse.

I'm relatively pleased that we've managed to eliminate this as a
dependency for inverting the X86_FEATURE_RETPOLINE logic though, by
following Linus' suggestion to just emit the thunk inline instead of
calling the same one as GCC.

The other fun one for alternatives is in entry_64.S, where we really
need the return address of the call instruction to be *precisely* the 
.Lentry_SYSCALL_64_after_fastpath_call label, so we have to eschew the
normal NOSPEC_CALL there:

	/*
	 * This call instruction is handled specially in stub_ptregs_64.
	 * It might end up jumping to the slow path.  If it jumps, RAX
	 * and all argument registers are clobbered.
	 */
#ifdef CONFIG_RETPOLINE
	movq	sys_call_table(, %rax, 8), %rax
	call	__x86.indirect_thunk.rax
#else
	call	*sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:

Now it's not like an unconditional branch to the out-of-line thunk is
really going to be much of a problem, even in the case where that out-
of-line thunk is alternative'd into a bare 'jmp *%rax'. But it would be
slightly nicer to avoid it.

At the moment though, it's really hard to ensure that the 'call'
instruction that leaves its address on the stack is right at the end.

Am I missing a trick there, other than manually inserting leading NOPs
(instead of the automatic trailing ones) to ensure that everything
lines up, and making assumptions about how the assembler will encode
instructions (not that it has *much* choice, but it has some).

On the whole I think it's fine as it is, but if you have a simple fix
then that would be nice.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-07  9:40                     ` David Woodhouse
@ 2018-01-07 11:46                       ` Borislav Petkov
  2018-01-07 12:21                         ` David Woodhouse
  0 siblings, 1 reply; 79+ messages in thread
From: Borislav Petkov @ 2018-01-07 11:46 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Josh Poimboeuf, tglx, linux-kernel, tim.c.chen, peterz, torvalds,
	ak, riel, keescook, gnomes, pjt, dave.hansen, luto, jikos,
	gregkh

On Sun, Jan 07, 2018 at 09:40:42AM +0000, David Woodhouse wrote:
> Right, so it all tends to work out OK purely by virtue of the fact that
> oldinstr and altinstr end up far enough apart in the image that they're
> 5-byte jumps. Which isn't perfect but we've lived with worse.

Well, the reference point is important. And I don't think we've done
more involved things than jumping back to something in .text proper.
However, I think I know how to fix this so that arbitrary jump offsets
would work but I need to talk to our gcc guys first.

If the jump is close enough for 2 bytes, then it should work as long as
the offset to the target doesn't change.

The main thing recompute_jumps() does is turn 5-byte jumps - which gas
creates because the jump target is in .text but the jump itself is in
.altinstr_replacement - into 2-byte ones. Because when you copy the jump
back into .text, the offset might fit in a signed byte all of a sudden.

There are still some nasties with forcing 5-byte jumps but I think I
know how to fix those. Stay tuned...

> I'm relatively pleased that we've managed to eliminate this as a
> dependency for inverting the X86_FEATURE_RETPOLINE logic though, by
> following Linus' suggestion to just emit the thunk inline instead of
> calling the same one as GCC.
> 
> The other fun one for alternatives is in entry_64.S, where we really
> need the return address of the call instruction to be *precisely* the 
> .Lentry_SYSCALL_64_after_fastpath_call label, so we have to eschew the
> normal NOSPEC_CALL there:

So CALL, as the doc says, pushes the offset of the *next* insn onto the
stack and branches to the target address.

So I'm thinking, as long as the next insn doesn't move and gcc doesn't
pad anything, you're fine.

However, I suspect that I'm missing something else here and I guess I'll
have more clue if I look at the whole thing. So can you point me to your
current branch so that I can take a look at the code?

Thx.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-07 11:46                       ` Borislav Petkov
@ 2018-01-07 12:21                         ` David Woodhouse
  2018-01-07 14:03                           ` Borislav Petkov
  0 siblings, 1 reply; 79+ messages in thread
From: David Woodhouse @ 2018-01-07 12:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Josh Poimboeuf, tglx, linux-kernel, tim.c.chen, peterz, torvalds,
	ak, riel, keescook, gnomes, pjt, dave.hansen, luto, jikos,
	gregkh

[-- Attachment #1: Type: text/plain, Size: 2200 bytes --]

On Sun, 2018-01-07 at 12:46 +0100, Borislav Petkov wrote:
> 
> > 
> > The other fun one for alternatives is in entry_64.S, where we really
> > need the return address of the call instruction to be *precisely* the 
> > .Lentry_SYSCALL_64_after_fastpath_call label, so we have to eschew the
> > normal NOSPEC_CALL there:
> 
> So CALL, as the doc says, pushes the offset of the *next* insn onto the
> stack and branches to the target address.
> 
> So I'm thinking, as long as the next insn doesn't move and gcc doesn't
> pad anything, you're fine.
> 
> However, I suspect that I'm missing something else here and I guess I'll
> have more clue if I look at the whole thing. So can you point me to your
> current branch so that I can take a look at the code?

http://git.infradead.org/users/dwmw2/linux-retpoline.git

In particular, this call site in entry_64.S:
http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/entry/entry_64.S#l270

It's still just unconditionally calling the out-of-line thunk and not
using ALTERNATIVE in the CONFIG_RETPOLINE case. I can't just use the
NOSPEC_CALL macro from
http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/include/asm/nospec-branch.h#l46
because of that requirement that the return address (on the stack) for
the CALL instruction must be precisely at the end, in all cases.

Each of the three alternatives *does* end with the CALL, it's just that
for the two which are shorter than the full retpoline one, they'll get
padded with NOPs at the end, so the return address on the stack *won't*
be what's expected.

Explicitly padding the alternatives with leading NOPs so that they are
all precisely the same length would work, and if the alternatives
mechanism were to pad the shorter ones with leading NOPs instead of
trailing NOPs that would *also* work (but be fairly difficult
especially to do that for oldinstr).

I'm not sure I *see* a simple answer, and it isn't really that bad to
just do what GCC is doing and unconditionally call the out-of-line
thunk. So feel free to just throw your hands up in horror and say "no,
we can't cope with that" :)

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-07 12:21                         ` David Woodhouse
@ 2018-01-07 14:03                           ` Borislav Petkov
  2018-01-08 21:50                             ` David Woodhouse
  0 siblings, 1 reply; 79+ messages in thread
From: Borislav Petkov @ 2018-01-07 14:03 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Josh Poimboeuf, tglx, linux-kernel, tim.c.chen, peterz, torvalds,
	ak, riel, keescook, gnomes, pjt, dave.hansen, luto, jikos,
	gregkh

On Sun, Jan 07, 2018 at 12:21:29PM +0000, David Woodhouse wrote:
> http://git.infradead.org/users/dwmw2/linux-retpoline.git
> 
> In particular, this call site in entry_64.S:
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/entry/entry_64.S#l270
> 
> It's still just unconditionally calling the out-of-line thunk and not
> using ALTERNATIVE in the CONFIG_RETPOLINE case. I can't just use the
> NOSPEC_CALL macro from
> http://git.infradead.org/users/dwmw2/linux-retpoline.git/blob/0f5c54a36e:/arch/x86/include/asm/nospec-branch.h#l46
> because of that requirement that the return address (on the stack) for
> the CALL instruction must be precisely at the end, in all cases.

Thanks, this makes it all clear.

> Each of the three alternatives *does* end with the CALL, it's just that
> for the two which are shorter than the full retpoline one, they'll get
> padded with NOPs at the end, so the return address on the stack *won't*
> be what's expected.

Right.

> Explicitly padding the alternatives with leading NOPs so that they are
> all precisely the same length would work, and if the alternatives
> mechanism were to pad the shorter ones with leading NOPs instead of
> trailing NOPs that would *also* work (but be fairly difficult
> especially to do that for oldinstr).

Right, so doing this reliably would need adding flags to struct
alt_instr - and this need has arisen before and we've managed not to do
it :). And I'm not really persuaded we need it now either because this
would be a one-off case where we need the padding in the front.

> I'm not sure I *see* a simple answer, and it isn't really that bad to
> just do what GCC is doing and unconditionally call the out-of-line
> thunk. So feel free to just throw your hands up in horror and say "no,
> we can't cope with that" :)

Right, or we can do something like this - diff ontop of yours below. We used to
do this before the automatic padding - explicit NOP padding by hand :-)

First split the ALTERNATIVE_2 into two alternative calls, as Brian suggested:

        ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE_AMD

and then the second:

        ALTERNATIVE __stringify(ASM_NOP8; ASM_NOP8; ASM_NOP3; call *\reg), \
                     __stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE

with frontal NOP padding.

When compiled, it looks like this:

ffffffff818002e2:       90                      nop
ffffffff818002e3:       90                      nop
ffffffff818002e4:       90                      nop

these first three bytes become LFENCE on AMD. On Intel, it gets optimized:

[    0.042954] ffffffff818002e2: [0:3) optimized NOPs: 0f 1f 00

Then:

ffffffff818002e5:       66 66 66 90             data16 data16 xchg %ax,%ax
ffffffff818002e9:       66 66 66 90             data16 data16 xchg %ax,%ax
ffffffff818002ed:       66 66 66 90             data16 data16 xchg %ax,%ax
ffffffff818002f1:       66 66 66 90             data16 data16 xchg %ax,%ax
ffffffff818002f5:       66 66 90                data16 xchg %ax,%ax
ffffffff818002f8:       ff d3                   callq  *%rbx

This thing remains like this on AMD and on Intel gets turned into:

[    0.044004] apply_alternatives: feat: 7*32+12, old: (ffffffff818002e5, len: 21), repl: (ffffffff82219edc, len: 21), pad: 0
[    0.045967] ffffffff818002e5: old_insn: 66 66 66 90 66 66 66 90 66 66 66 90 66 66 66 90 66 66 90 ff d3
[    0.048006] ffffffff82219edc: rpl_insn: eb 0e e8 04 00 00 00 f3 90 eb fc 48 89 1c 24 c3 e8 ed ff ff ff
[    0.050581] ffffffff818002e5: final_insn: eb 0e e8 04 00 00 00 f3 90 eb fc 48 89 1c 24 c3 e8 ed ff ff ff
[    0.052003] ffffffff8180123b: [0:3) optimized NOPs: 0f 1f 00


ffffffff818002e5:       eb 0e                   jmp ffffffff818002f5
ffffffff818002e7:       e8 04 00 00 00          callq ffffffff818002f0
ffffffff818002ec:       f3 90                   pause
ffffffff818002ee:       eb fc                   jmp ffffffff818002ec
ffffffff818002f0:       48 89 1c 24             mov %rbx,(%rsp)
ffffffff818002f4:       c3                      ret
ffffffff818002f5:       e8 ed ff ff ff          callq ffffffff818002e7

so that in both cases, the CALL remains last.

My fear is if some funky compiler changes the sizes of the insns in
RETPOLINE_CALL/JMP and then the padding becomes wrong. But looking at the
labels, they're all close so you have a 2-byte jmp already and the

call    1112f

should be ok. The MOV is reg,(reg) which should not change size of 4...

But I'm remaining cautious here.

And we'd need to do the respective thing for NOSPEC_JMP.

Btw, I've moved the setting of X86_FEATURE_RETPOLINE and
X86_FEATURE_RETPOLINE_AMD to the respective intel.c and amd.c files. I
think this is what we want.

And if we're doing two different RETPOLINE flavors, then we should call
X86_FEATURE_RETPOLINE
X86_FEATURE_RETPOLINE_INTEL.

Does that whole thing make some sense...?

---
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index c4d08fa29d6c..70825ce909ba 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -6,6 +6,7 @@
 #include <asm/alternative.h>
 #include <asm/alternative-asm.h>
 #include <asm/cpufeatures.h>
+#include <asm/nops.h>
 
 #ifdef __ASSEMBLY__
 
@@ -43,11 +44,15 @@
 #endif
 .endm
 
+/*
+ * RETPOLINE_CALL is 21 bytes so pad the front with 19 bytes + 2 bytes for the
+ * CALL insn so that all CALL instructions remain last.
+ */
 .macro NOSPEC_CALL reg:req
 #ifdef CONFIG_RETPOLINE
-	ALTERNATIVE_2 __stringify(call *\reg),				\
-		__stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE,\
-		__stringify(lfence; call *\reg), X86_FEATURE_RETPOLINE_AMD
+	ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE_AMD
+	ALTERNATIVE __stringify(ASM_NOP8; ASM_NOP8; ASM_NOP3; call *\reg), \
+		     __stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE
 #else
 	call	*\reg
 #endif
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index b221fe507640..938111e77dd0 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -554,6 +554,8 @@ static void bsp_init_amd(struct cpuinfo_x86 *c)
 		rdmsrl(MSR_FAM10H_NODE_ID, value);
 		nodes_per_socket = ((value >> 3) & 7) + 1;
 	}
+
+	setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
 }
 
 static void early_init_amd(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cfa5042232a2..372ba3fb400f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -904,9 +904,6 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 
 	setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
 	setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
-#ifdef CONFIG_RETPOLINE
-	setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
-#endif
 
 	fpu__init_system(c);
 
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 35e123e5f413..a9e00edaba46 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -524,6 +524,13 @@ static void init_intel_misc_features(struct cpuinfo_x86 *c)
 	wrmsrl(MSR_MISC_FEATURES_ENABLES, msr);
 }
 
+static void bsp_init_intel(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_RETPOLINE
+	setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+#endif
+}
+
 static void init_intel(struct cpuinfo_x86 *c)
 {
 	unsigned int l2 = 0;
@@ -893,6 +900,7 @@ static const struct cpu_dev intel_cpu_dev = {
 	.c_detect_tlb	= intel_detect_tlb,
 	.c_early_init   = early_init_intel,
 	.c_init		= init_intel,
+	.c_bsp_init	= bsp_init_intel,
 	.c_bsp_resume	= intel_bsp_resume,
 	.c_x86_vendor	= X86_VENDOR_INTEL,
 };

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-06  0:30               ` Borislav Petkov
  2018-01-06  8:23                 ` David Woodhouse
@ 2018-01-08  5:06                 ` Josh Poimboeuf
  2018-01-08  7:55                   ` Woodhouse, David
  1 sibling, 1 reply; 79+ messages in thread
From: Josh Poimboeuf @ 2018-01-08  5:06 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Woodhouse, David, tglx, linux-kernel, tim.c.chen, peterz,
	torvalds, ak, riel, keescook, gnomes, pjt, dave.hansen, luto,
	jikos, gregkh

On Sat, Jan 06, 2018 at 01:30:59AM +0100, Borislav Petkov wrote:
> On Fri, Jan 05, 2018 at 11:08:06AM -0600, Josh Poimboeuf wrote:
> > I seem to recall that we also discussed the need for this for converting
> > pvops to use alternatives, though the "why" is eluding me at the moment.
> 
> Ok, here's something which seems to work in my VM here. I'll continue
> playing with it tomorrow. Josh, if you have some example sequences for
> me to try, send them my way pls.

Here's the use case I had in mind before.  With paravirt,

  ENABLE_INTERRUPTS(CLBR_NONE)

becomes

  push	%rax
  call	*pv_irq_ops.irq_enable
  pop	%rax

and I wanted to apply those instructions with an alternative.  It
doesn't work currently because the 'call' isn't first.

-- 
Josh

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-08  5:06                 ` Josh Poimboeuf
@ 2018-01-08  7:55                   ` Woodhouse, David
  0 siblings, 0 replies; 79+ messages in thread
From: Woodhouse, David @ 2018-01-08  7:55 UTC (permalink / raw)
  To: Josh Poimboeuf, Borislav Petkov
  Cc: tglx, linux-kernel, tim.c.chen, peterz, torvalds, ak, riel,
	keescook, gnomes, pjt, dave.hansen, luto, jikos, gregkh

[-- Attachment #1: Type: text/plain, Size: 978 bytes --]

On Sun, 2018-01-07 at 23:06 -0600, Josh Poimboeuf wrote:
> 
> Here's the use case I had in mind before.  With paravirt,
> 
>   ENABLE_INTERRUPTS(CLBR_NONE)
> 
> becomes
> 
>   push  %rax
>   call  *pv_irq_ops.irq_enable
>   pop   %rax
> 
> and I wanted to apply those instructions with an alternative.  It
> doesn't work currently because the 'call' isn't first.

I believe Borislav has made that work... however, if you're literally
doing the above then you'd be introducing new indirect branches which
is precisely what I've been trying to eliminate.

I believe I was told to stop prodding at pvops and just trust that they
all get turned into *direct* jumps at runtime. For example the above
call would not be literally 'call *pv_irq_ops.irq_enable', because by
the time the pvops are patched we'd *know* the final value of the
irq_enable method, and we'd turn it into a *direct* call instead.

Do I need to start looking at pvops again?

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 01/13] x86/retpoline: Add initial retpoline support
  2018-01-07 14:03                           ` Borislav Petkov
@ 2018-01-08 21:50                             ` David Woodhouse
  0 siblings, 0 replies; 79+ messages in thread
From: David Woodhouse @ 2018-01-08 21:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Josh Poimboeuf, tglx, linux-kernel, tim.c.chen, peterz, torvalds,
	ak, riel, keescook, gnomes, pjt, dave.hansen, luto, jikos,
	gregkh

[-- Attachment #1: Type: text/plain, Size: 1167 bytes --]

On Sun, 2018-01-07 at 15:03 +0100, Borislav Petkov wrote:
> 
> My fear is if some funky compiler changes the sizes of the insns in
> RETPOLINE_CALL/JMP and then the padding becomes wrong. But looking at the
> labels, they're all close so you have a 2-byte jmp already and the
> 
> call    1112f
> 
> should be ok. The MOV is reg,(reg) which should not change size of 4...
> 
> But I'm remaining cautious here.

Right. I forget the specifics, but I've *watched* LLVM break carefully
hand-crafted asm code by emitting 4-byte variants when we expected 2-
byte, etc.

On the whole, I'm sufficiently unhappy with making such assumptions,
that I think the cure is worse than the disease. We can live with that
*one* out-of-line call to the thunk in the syscall case, and that was
the *only* one that really needed the call to be at the end.

Note that in the alternative case there, we don't even need to load it
into a register at all. We could do our own alternatives specially for
that case, and hand-tune the lengths only for them. But *with* a sanity
check to break the build on mismatch.

I don't think it's worth it at this point though.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2018-01-08 21:50 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-04  9:10 [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre") Paul Turner
2018-01-04  9:12 ` Paul Turner
2018-01-04  9:24 ` Paul Turner
2018-01-04  9:48   ` Greg Kroah-Hartman
2018-01-04  9:56     ` Woodhouse, David
2018-01-04  9:30 ` Woodhouse, David
2018-01-04 14:36   ` [PATCH v3 01/13] x86/retpoline: Add initial retpoline support David Woodhouse
2018-01-04 18:03     ` Linus Torvalds
2018-01-04 19:32       ` Woodhouse, David
2018-01-04 18:17     ` Alexei Starovoitov
2018-01-04 18:25       ` Linus Torvalds
2018-01-04 18:36         ` Alexei Starovoitov
2018-01-04 19:27           ` David Woodhouse
2018-01-05 10:28             ` Paul Turner
2018-01-05 10:55               ` David Woodhouse
2018-01-05 11:19                 ` Paul Turner
2018-01-05 11:25                 ` Paul Turner
2018-01-05 11:26               ` Paolo Bonzini
2018-01-05 12:20                 ` Paul Turner
2018-01-05 10:40         ` Paul Turner
2018-01-04 18:40       ` Andi Kleen
2018-01-05 10:32         ` Paul Turner
2018-01-05 12:54     ` Thomas Gleixner
2018-01-05 13:01       ` Juergen Gross
2018-01-05 13:03         ` Thomas Gleixner
2018-01-05 13:56       ` Woodhouse, David
2018-01-05 16:41         ` Woodhouse, David
2018-01-05 16:45           ` Borislav Petkov
2018-01-05 17:08             ` Josh Poimboeuf
2018-01-06  0:30               ` Borislav Petkov
2018-01-06  8:23                 ` David Woodhouse
2018-01-06 17:02                   ` Borislav Petkov
2018-01-07  9:40                     ` David Woodhouse
2018-01-07 11:46                       ` Borislav Petkov
2018-01-07 12:21                         ` David Woodhouse
2018-01-07 14:03                           ` Borislav Petkov
2018-01-08 21:50                             ` David Woodhouse
2018-01-08  5:06                 ` Josh Poimboeuf
2018-01-08  7:55                   ` Woodhouse, David
2018-01-05 17:12             ` Woodhouse, David
2018-01-05 17:28               ` Linus Torvalds
2018-01-05 17:48                 ` David Woodhouse
2018-01-05 18:05                 ` Andi Kleen
2018-01-05 20:32                 ` Woodhouse, David
2018-01-05 21:11                   ` Brian Gerst
2018-01-05 22:16                     ` Woodhouse, David
2018-01-05 22:43                       ` Borislav Petkov
2018-01-05 22:00                 ` Woodhouse, David
2018-01-05 22:06                   ` Borislav Petkov
2018-01-05 23:50                   ` Linus Torvalds
2018-01-06 10:53                     ` Woodhouse, David
2018-01-04 14:36   ` [PATCH v3 02/13] x86/retpoline/crypto: Convert crypto assembler indirect jumps David Woodhouse
2018-01-04 14:37   ` [PATCH v3 03/13] x86/retpoline/entry: Convert entry " David Woodhouse
2018-01-04 14:46     ` Dave Hansen
2018-01-04 14:49       ` Woodhouse, David
2018-01-04 14:37   ` [PATCH v3 04/13] x86/retpoline/ftrace: Convert ftrace " David Woodhouse
2018-01-04 14:37   ` [PATCH v3 05/13] x86/retpoline/hyperv: Convert " David Woodhouse
2018-01-04 14:37   ` [PATCH v3 06/13] x86/retpoline/xen: Convert Xen hypercall " David Woodhouse
2018-01-04 15:10     ` Juergen Gross
2018-01-04 15:18       ` David Woodhouse
2018-01-04 15:54     ` Juergen Gross
2018-01-04 14:37   ` [PATCH v3 07/13] x86/retpoline/checksum32: Convert assembler " David Woodhouse
2018-01-04 14:37   ` [PATCH v3 08/13] x86/alternatives: Add missing \n at end of ALTERNATIVE inline asm David Woodhouse
2018-01-05 13:04     ` [tip:x86/pti] x86/alternatives: Add missing '\n' " tip-bot for David Woodhouse
2018-01-04 14:37   ` [PATCH v3 09/13] x86/retpoline/irq32: Convert assembler indirect jumps David Woodhouse
2018-01-04 14:37   ` [PATCH v3 10/13] x86/retpoline/pvops: " David Woodhouse
2018-01-04 15:02     ` Juergen Gross
2018-01-04 15:12       ` Woodhouse, David
2018-01-04 15:18       ` Andrew Cooper
2018-01-04 16:04         ` Juergen Gross
2018-01-04 16:37       ` Andi Kleen
2018-01-04 14:37   ` [PATCH v3 11/13] retpoline/taint: Taint kernel for missing retpoline in compiler David Woodhouse
2018-01-04 22:06     ` Justin Forbes
2018-01-04 14:37   ` [PATCH v3 12/13] retpoline/objtool: Disable some objtool warnings David Woodhouse
2018-01-04 14:37   ` [PATCH v3 13/13] retpoline: Attempt to quiten objtool warning for unreachable code David Woodhouse
2018-01-04 16:18   ` [RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre") Andy Lutomirski
2018-01-04 16:24     ` David Woodhouse
2018-01-05 10:49     ` Paul Turner
2018-01-05 11:43       ` Woodhouse, David

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.