All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] x86/entry: clear registers to sanitize speculative usages
@ 2018-02-03 23:21 Dan Williams
  2018-02-03 23:21 ` [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels Dan Williams
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Dan Williams @ 2018-02-03 23:21 UTC (permalink / raw)
  To: tglx
  Cc: Andi Kleen, x86, linux-kernel, Ingo Molnar, luto, H. Peter Anvin,
	torvalds

At entry userspace may have populated callee saved registers with values
that could be useful in a speculative execution attack. Clear them to
minimize the kernel's attack surface.

Note, this is done to make it harder to find / manipulate exploitable
sequences in the kernel.

The clearing is limited to the 64-bit 'extra' registers since those are
the most likely to survive with user populated values deep into the call
chain. Normal register pressure likely clobbers values in the lower
registers and the 32-bit case.

As for cycle impact on my Sandy Bridge test system it can handle the xor
sequence at 3.5 instructions per cycle.

---

Andi Kleen (2):
      x86/entry: Clear registers for 64bit exceptions/interrupts
      x86/entry: Clear registers for compat syscalls

Dan Williams (1):
      x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels


 arch/x86/entry/calling.h         |   17 +++++++++++++++++
 arch/x86/entry/entry_64.S        |    6 ++++++
 arch/x86/entry/entry_64_compat.S |    3 +++
 3 files changed, 26 insertions(+)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-03 23:21 [PATCH 0/3] x86/entry: clear registers to sanitize speculative usages Dan Williams
@ 2018-02-03 23:21 ` Dan Williams
  2018-02-04  0:14   ` Andy Lutomirski
  2018-02-04 13:01   ` Brian Gerst
  2018-02-03 23:21 ` [PATCH 2/3] x86/entry: Clear registers for 64bit exceptions/interrupts Dan Williams
  2018-02-03 23:21 ` [PATCH 3/3] x86/entry: Clear registers for compat syscalls Dan Williams
  2 siblings, 2 replies; 18+ messages in thread
From: Dan Williams @ 2018-02-03 23:21 UTC (permalink / raw)
  To: tglx
  Cc: Andi Kleen, x86, linux-kernel, Ingo Molnar, luto, H. Peter Anvin,
	torvalds

At entry userspace may have populated the extra registers outside the
syscall calling convention with values that could be useful in a
speculative execution attack. Clear them to minimize the kernel's attack
surface. Note, this only clears the extra registers and not the unused
registers for syscalls less than 6 arguments since those registers are
likely to be clobbered well before their values could be put to use
under speculation.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/entry/calling.h  |   17 +++++++++++++++++
 arch/x86/entry/entry_64.S |    1 +
 2 files changed, 18 insertions(+)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 3f48f695d5e6..daee2d19e73d 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -147,6 +147,23 @@ For 32-bit we have the following conventions - kernel is built with
 	UNWIND_HINT_REGS offset=\offset
 	.endm
 
+	/*
+	 * Sanitize extra registers of values that a speculation attack
+	 * might want to exploit. In the CONFIG_FRAME_POINTER=y case,
+	 * the expectation is that %ebp will be clobbered before it
+	 * could be used.
+	 */
+	.macro CLEAR_EXTRA_REGS_NOSPEC
+	xorq %r15, %r15
+	xorq %r14, %r14
+	xorq %r13, %r13
+	xorq %r12, %r12
+	xorl %ebx, %ebx
+#ifndef CONFIG_FRAME_POINTER
+	xorl %ebp, %ebp
+#endif
+	.endm
+
 	.macro POP_EXTRA_REGS
 	popq %r15
 	popq %r14
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index c752abe89d80..46260e951da6 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -247,6 +247,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	TRACE_IRQS_OFF
 
 	/* IRQs are off. */
+	CLEAR_EXTRA_REGS_NOSPEC
 	movq	%rsp, %rdi
 	call	do_syscall_64		/* returns with IRQs disabled */
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/3] x86/entry: Clear registers for 64bit exceptions/interrupts
  2018-02-03 23:21 [PATCH 0/3] x86/entry: clear registers to sanitize speculative usages Dan Williams
  2018-02-03 23:21 ` [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels Dan Williams
@ 2018-02-03 23:21 ` Dan Williams
  2018-02-03 23:21 ` [PATCH 3/3] x86/entry: Clear registers for compat syscalls Dan Williams
  2 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2018-02-03 23:21 UTC (permalink / raw)
  To: tglx; +Cc: Andi Kleen, torvalds, linux-kernel, luto

From: Andi Kleen <ak@linux.intel.com>

Clear the 'extra' registers on entering the 64bit kernel for exceptions
and interrupts. The common registers are not cleared since they are
likely clobbered well before they can be exploited in a speculative
execution attack.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/entry/entry_64.S |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 46260e951da6..d73eedf1eb47 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -563,6 +563,7 @@ END(irq_entries_start)
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
+	CLEAR_EXTRA_REGS_NOSPEC
 	ENCODE_FRAME_POINTER
 
 	testb	$3, CS(%rsp)
@@ -1121,6 +1122,7 @@ ENTRY(xen_failsafe_callback)
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
+	CLEAR_EXTRA_REGS_NOSPEC
 	ENCODE_FRAME_POINTER
 	jmp	error_exit
 END(xen_failsafe_callback)
@@ -1166,6 +1168,7 @@ ENTRY(paranoid_entry)
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
+	CLEAR_EXTRA_REGS_NOSPEC
 	ENCODE_FRAME_POINTER 8
 	movl	$1, %ebx
 	movl	$MSR_GS_BASE, %ecx
@@ -1218,6 +1221,7 @@ ENTRY(error_entry)
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
+	CLEAR_EXTRA_REGS_NOSPEC
 	ENCODE_FRAME_POINTER 8
 	xorl	%ebx, %ebx
 	testb	$3, CS+8(%rsp)
@@ -1416,6 +1420,7 @@ ENTRY(nmi)
 	pushq	%r14		/* pt_regs->r14 */
 	pushq	%r15		/* pt_regs->r15 */
 	UNWIND_HINT_REGS
+	CLEAR_EXTRA_REGS_NOSPEC
 	ENCODE_FRAME_POINTER
 
 	/*

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/3] x86/entry: Clear registers for compat syscalls
  2018-02-03 23:21 [PATCH 0/3] x86/entry: clear registers to sanitize speculative usages Dan Williams
  2018-02-03 23:21 ` [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels Dan Williams
  2018-02-03 23:21 ` [PATCH 2/3] x86/entry: Clear registers for 64bit exceptions/interrupts Dan Williams
@ 2018-02-03 23:21 ` Dan Williams
  2 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2018-02-03 23:21 UTC (permalink / raw)
  To: tglx
  Cc: Andi Kleen, x86, linux-kernel, Ingo Molnar, luto, H. Peter Anvin,
	torvalds

From: Andi Kleen <ak@linux.intel.com>

At entry userspace may have populated registers with values that could
be useful in a speculative execution attack. Clear them to minimize the
kernel's attack surface.

[djbw: rename the macro, only clear the extra registers]
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/entry/entry_64_compat.S |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 98d5358e4041..f55b018a580b 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -95,6 +95,7 @@ ENTRY(entry_SYSENTER_compat)
 	pushq   $0			/* pt_regs->r14 = 0 */
 	pushq   $0			/* pt_regs->r15 = 0 */
 	cld
+	CLEAR_EXTRA_REGS_NOSPEC
 
 	/*
 	 * SYSENTER doesn't filter flags, so we need to clear NT and AC
@@ -223,6 +224,7 @@ GLOBAL(entry_SYSCALL_compat_after_hwframe)
 	pushq   $0			/* pt_regs->r13 = 0 */
 	pushq   $0			/* pt_regs->r14 = 0 */
 	pushq   $0			/* pt_regs->r15 = 0 */
+	CLEAR_EXTRA_REGS_NOSPEC
 
 	/*
 	 * User mode is traced as though IRQs are on, and SYSENTER
@@ -348,6 +350,7 @@ ENTRY(entry_INT80_compat)
 	pushq   %r14                    /* pt_regs->r14 */
 	pushq   %r15                    /* pt_regs->r15 */
 	cld
+	CLEAR_EXTRA_REGS_NOSPEC
 
 	/*
 	 * User mode is traced as though IRQs are on, and the interrupt

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-03 23:21 ` [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels Dan Williams
@ 2018-02-04  0:14   ` Andy Lutomirski
  2018-02-04  1:25     ` Dan Williams
  2018-02-04 13:01   ` Brian Gerst
  1 sibling, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2018-02-04  0:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Thomas Gleixner, Andi Kleen, X86 ML, LKML, Ingo Molnar,
	Andrew Lutomirski, H. Peter Anvin, Linus Torvalds

On Sat, Feb 3, 2018 at 11:21 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> At entry userspace may have populated the extra registers outside the
> syscall calling convention with values that could be useful in a
> speculative execution attack. Clear them to minimize the kernel's attack
> surface. Note, this only clears the extra registers and not the unused
> registers for syscalls less than 6 arguments since those registers are
> likely to be clobbered well before their values could be put to use
> under speculation.
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: Andy Lutomirski <luto@kernel.org>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Reported-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/x86/entry/calling.h  |   17 +++++++++++++++++
>  arch/x86/entry/entry_64.S |    1 +
>  2 files changed, 18 insertions(+)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 3f48f695d5e6..daee2d19e73d 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -147,6 +147,23 @@ For 32-bit we have the following conventions - kernel is built with
>         UNWIND_HINT_REGS offset=\offset
>         .endm
>
> +       /*
> +        * Sanitize extra registers of values that a speculation attack
> +        * might want to exploit. In the CONFIG_FRAME_POINTER=y case,
> +        * the expectation is that %ebp will be clobbered before it
> +        * could be used.
> +        */
> +       .macro CLEAR_EXTRA_REGS_NOSPEC
> +       xorq %r15, %r15
> +       xorq %r14, %r14
> +       xorq %r13, %r13
> +       xorq %r12, %r12
> +       xorl %ebx, %ebx
> +#ifndef CONFIG_FRAME_POINTER
> +       xorl %ebp, %ebp
> +#endif
> +       .endm
> +

Can we make the clears only happen if we have CONFIG_RETPOLINE?  Or is
there maybe some reason why we want this even without retpolines on?

>         .macro POP_EXTRA_REGS
>         popq %r15
>         popq %r14
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index c752abe89d80..46260e951da6 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -247,6 +247,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>         TRACE_IRQS_OFF
>
>         /* IRQs are off. */
> +       CLEAR_EXTRA_REGS_NOSPEC

Please put the clears before TRACE_IRQS_OFF to protect users that use tracing.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-04  0:14   ` Andy Lutomirski
@ 2018-02-04  1:25     ` Dan Williams
  2018-02-04  1:29       ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-02-04  1:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Andi Kleen, X86 ML, LKML, Ingo Molnar,
	H. Peter Anvin, Linus Torvalds

On Sat, Feb 3, 2018 at 4:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Sat, Feb 3, 2018 at 11:21 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> At entry userspace may have populated the extra registers outside the
>> syscall calling convention with values that could be useful in a
>> speculative execution attack. Clear them to minimize the kernel's attack
>> surface. Note, this only clears the extra registers and not the unused
>> registers for syscalls less than 6 arguments since those registers are
>> likely to be clobbered well before their values could be put to use
>> under speculation.
>>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Reported-by: Andi Kleen <ak@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  arch/x86/entry/calling.h  |   17 +++++++++++++++++
>>  arch/x86/entry/entry_64.S |    1 +
>>  2 files changed, 18 insertions(+)
>>
>> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
>> index 3f48f695d5e6..daee2d19e73d 100644
>> --- a/arch/x86/entry/calling.h
>> +++ b/arch/x86/entry/calling.h
>> @@ -147,6 +147,23 @@ For 32-bit we have the following conventions - kernel is built with
>>         UNWIND_HINT_REGS offset=\offset
>>         .endm
>>
>> +       /*
>> +        * Sanitize extra registers of values that a speculation attack
>> +        * might want to exploit. In the CONFIG_FRAME_POINTER=y case,
>> +        * the expectation is that %ebp will be clobbered before it
>> +        * could be used.
>> +        */
>> +       .macro CLEAR_EXTRA_REGS_NOSPEC
>> +       xorq %r15, %r15
>> +       xorq %r14, %r14
>> +       xorq %r13, %r13
>> +       xorq %r12, %r12
>> +       xorl %ebx, %ebx
>> +#ifndef CONFIG_FRAME_POINTER
>> +       xorl %ebp, %ebp
>> +#endif
>> +       .endm
>> +
>
> Can we make the clears only happen if we have CONFIG_RETPOLINE?  Or is
> there maybe some reason why we want this even without retpolines on?

We have the other Spectre variant1 mitigations on by default. I'm not
opposed to adding a config to turn them all off, but I think we should
be consistent either way, and I don't think CONFIG_RETPOLINE is the
right config to gate those.

>>         .macro POP_EXTRA_REGS
>>         popq %r15
>>         popq %r14
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index c752abe89d80..46260e951da6 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -247,6 +247,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>>         TRACE_IRQS_OFF
>>
>>         /* IRQs are off. */
>> +       CLEAR_EXTRA_REGS_NOSPEC
>
> Please put the clears before TRACE_IRQS_OFF to protect users that use tracing.

Ok.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-04  1:25     ` Dan Williams
@ 2018-02-04  1:29       ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2018-02-04  1:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Thomas Gleixner, Andi Kleen, X86 ML, LKML,
	Ingo Molnar, H. Peter Anvin, Linus Torvalds

On Sun, Feb 4, 2018 at 1:25 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sat, Feb 3, 2018 at 4:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Sat, Feb 3, 2018 at 11:21 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> At entry userspace may have populated the extra registers outside the
>>> syscall calling convention with values that could be useful in a
>>> speculative execution attack. Clear them to minimize the kernel's attack
>>> surface. Note, this only clears the extra registers and not the unused
>>> registers for syscalls less than 6 arguments since those registers are
>>> likely to be clobbered well before their values could be put to use
>>> under speculation.
>>>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>> Cc: x86@kernel.org
>>> Cc: Andy Lutomirski <luto@kernel.org>
>>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>>> Reported-by: Andi Kleen <ak@linux.intel.com>
>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>> ---
>>>  arch/x86/entry/calling.h  |   17 +++++++++++++++++
>>>  arch/x86/entry/entry_64.S |    1 +
>>>  2 files changed, 18 insertions(+)
>>>
>>> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
>>> index 3f48f695d5e6..daee2d19e73d 100644
>>> --- a/arch/x86/entry/calling.h
>>> +++ b/arch/x86/entry/calling.h
>>> @@ -147,6 +147,23 @@ For 32-bit we have the following conventions - kernel is built with
>>>         UNWIND_HINT_REGS offset=\offset
>>>         .endm
>>>
>>> +       /*
>>> +        * Sanitize extra registers of values that a speculation attack
>>> +        * might want to exploit. In the CONFIG_FRAME_POINTER=y case,
>>> +        * the expectation is that %ebp will be clobbered before it
>>> +        * could be used.
>>> +        */
>>> +       .macro CLEAR_EXTRA_REGS_NOSPEC
>>> +       xorq %r15, %r15
>>> +       xorq %r14, %r14
>>> +       xorq %r13, %r13
>>> +       xorq %r12, %r12
>>> +       xorl %ebx, %ebx
>>> +#ifndef CONFIG_FRAME_POINTER
>>> +       xorl %ebp, %ebp
>>> +#endif
>>> +       .endm
>>> +
>>
>> Can we make the clears only happen if we have CONFIG_RETPOLINE?  Or is
>> there maybe some reason why we want this even without retpolines on?
>
> We have the other Spectre variant1 mitigations on by default. I'm not
> opposed to adding a config to turn them all off, but I think we should
> be consistent either way, and I don't think CONFIG_RETPOLINE is the
> right config to gate those.

Fair enough.

>
>>>         .macro POP_EXTRA_REGS
>>>         popq %r15
>>>         popq %r14
>>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>>> index c752abe89d80..46260e951da6 100644
>>> --- a/arch/x86/entry/entry_64.S
>>> +++ b/arch/x86/entry/entry_64.S
>>> @@ -247,6 +247,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>>>         TRACE_IRQS_OFF
>>>
>>>         /* IRQs are off. */
>>> +       CLEAR_EXTRA_REGS_NOSPEC
>>
>> Please put the clears before TRACE_IRQS_OFF to protect users that use tracing.
>
> Ok.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-03 23:21 ` [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels Dan Williams
  2018-02-04  0:14   ` Andy Lutomirski
@ 2018-02-04 13:01   ` Brian Gerst
  2018-02-04 17:42     ` Dan Williams
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Gerst @ 2018-02-04 13:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Thomas Gleixner, Andi Kleen, the arch/x86 maintainers,
	Linux Kernel Mailing List, Ingo Molnar, Andy Lutomirski,
	H. Peter Anvin, Linus Torvalds

On Sat, Feb 3, 2018 at 6:21 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> At entry userspace may have populated the extra registers outside the
> syscall calling convention with values that could be useful in a
> speculative execution attack. Clear them to minimize the kernel's attack
> surface. Note, this only clears the extra registers and not the unused
> registers for syscalls less than 6 arguments since those registers are
> likely to be clobbered well before their values could be put to use
> under speculation.
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: Andy Lutomirski <luto@kernel.org>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Reported-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/x86/entry/calling.h  |   17 +++++++++++++++++
>  arch/x86/entry/entry_64.S |    1 +
>  2 files changed, 18 insertions(+)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 3f48f695d5e6..daee2d19e73d 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -147,6 +147,23 @@ For 32-bit we have the following conventions - kernel is built with
>         UNWIND_HINT_REGS offset=\offset
>         .endm
>
> +       /*
> +        * Sanitize extra registers of values that a speculation attack
> +        * might want to exploit. In the CONFIG_FRAME_POINTER=y case,
> +        * the expectation is that %ebp will be clobbered before it
> +        * could be used.
> +        */
> +       .macro CLEAR_EXTRA_REGS_NOSPEC
> +       xorq %r15, %r15
> +       xorq %r14, %r14
> +       xorq %r13, %r13
> +       xorq %r12, %r12
> +       xorl %ebx, %ebx
> +#ifndef CONFIG_FRAME_POINTER
> +       xorl %ebp, %ebp
> +#endif
> +       .endm
> +
>         .macro POP_EXTRA_REGS
>         popq %r15
>         popq %r14
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index c752abe89d80..46260e951da6 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -247,6 +247,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>         TRACE_IRQS_OFF
>
>         /* IRQs are off. */
> +       CLEAR_EXTRA_REGS_NOSPEC
>         movq    %rsp, %rdi
>         call    do_syscall_64           /* returns with IRQs disabled */
>
>

Now that the fast syscall path is gone, all regs (except RSP
obviously) are dead after being saved to pt_regs.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-04 13:01   ` Brian Gerst
@ 2018-02-04 17:42     ` Dan Williams
  2018-02-04 18:40       ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2018-02-04 17:42 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Thomas Gleixner, Andi Kleen, the arch/x86 maintainers,
	Linux Kernel Mailing List, Ingo Molnar, Andy Lutomirski,
	H. Peter Anvin, Linus Torvalds

On Sun, Feb 4, 2018 at 5:01 AM, Brian Gerst <brgerst@gmail.com> wrote:
> On Sat, Feb 3, 2018 at 6:21 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> At entry userspace may have populated the extra registers outside the
>> syscall calling convention with values that could be useful in a
>> speculative execution attack. Clear them to minimize the kernel's attack
>> surface. Note, this only clears the extra registers and not the unused
>> registers for syscalls less than 6 arguments since those registers are
>> likely to be clobbered well before their values could be put to use
>> under speculation.
>>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Reported-by: Andi Kleen <ak@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  arch/x86/entry/calling.h  |   17 +++++++++++++++++
>>  arch/x86/entry/entry_64.S |    1 +
>>  2 files changed, 18 insertions(+)
>>
>> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
>> index 3f48f695d5e6..daee2d19e73d 100644
>> --- a/arch/x86/entry/calling.h
>> +++ b/arch/x86/entry/calling.h
>> @@ -147,6 +147,23 @@ For 32-bit we have the following conventions - kernel is built with
>>         UNWIND_HINT_REGS offset=\offset
>>         .endm
>>
>> +       /*
>> +        * Sanitize extra registers of values that a speculation attack
>> +        * might want to exploit. In the CONFIG_FRAME_POINTER=y case,
>> +        * the expectation is that %ebp will be clobbered before it
>> +        * could be used.
>> +        */
>> +       .macro CLEAR_EXTRA_REGS_NOSPEC
>> +       xorq %r15, %r15
>> +       xorq %r14, %r14
>> +       xorq %r13, %r13
>> +       xorq %r12, %r12
>> +       xorl %ebx, %ebx
>> +#ifndef CONFIG_FRAME_POINTER
>> +       xorl %ebp, %ebp
>> +#endif
>> +       .endm
>> +
>>         .macro POP_EXTRA_REGS
>>         popq %r15
>>         popq %r14
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index c752abe89d80..46260e951da6 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -247,6 +247,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>>         TRACE_IRQS_OFF
>>
>>         /* IRQs are off. */
>> +       CLEAR_EXTRA_REGS_NOSPEC
>>         movq    %rsp, %rdi
>>         call    do_syscall_64           /* returns with IRQs disabled */
>>
>>
>
> Now that the fast syscall path is gone, all regs (except RSP
> obviously) are dead after being saved to pt_regs.

They're saved, but not dead afaics. The concern is that when the CPU
starts speculating it could be in a code path that has not yet touched
the extra registers. Once that happens userspace could potentially
load data from an arbitrary address.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-04 17:42     ` Dan Williams
@ 2018-02-04 18:40       ` Linus Torvalds
  2018-02-05 16:26         ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2018-02-04 18:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Brian Gerst, Thomas Gleixner, Andi Kleen,
	the arch/x86 maintainers, Linux Kernel Mailing List, Ingo Molnar,
	Andy Lutomirski, H. Peter Anvin

On Sun, Feb 4, 2018 at 9:42 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sun, Feb 4, 2018 at 5:01 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>
>> Now that the fast syscall path is gone, all regs (except RSP
>> obviously) are dead after being saved to pt_regs.
>
> They're saved, but not dead afaics.

Actually, they _are_ dead with the slow-path - it will reload them off
the ptregs pointer instead.

However, exactly because the slow-path will just reload the
user-supplied values, there's no point in clearing them anyway.  Those
values will be live in the system call sequence anyway.

So even for the slow system call case, there's no point in clearing
them. If we ever end up only reloading the required values (ie we push
the 'struct ptregs' pointer lower down into the syscall stack), at
that point we might want to clear those registers, because many system
calls will reload only a small subset.

But as the commit message says, the system call argument registers are
also likely to be aggressively clobbered unless used, since the low
registers are preferred for code generation (smaller code, and many of
them are special anyway in various ways and have forced uses for
shifts, function arguments, or just are special in general like %rax).

So the actual argument registers tend to not be an issue anyway.

               Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-04 18:40       ` Linus Torvalds
@ 2018-02-05 16:26         ` Ingo Molnar
  2018-02-05 16:38           ` Andy Lutomirski
  2018-02-05 20:05           ` Ingo Molnar
  0 siblings, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2018-02-05 16:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Williams, Brian Gerst, Thomas Gleixner, Andi Kleen,
	the arch/x86 maintainers, Linux Kernel Mailing List, Ingo Molnar,
	Andy Lutomirski, H. Peter Anvin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [...]
> 
> But as the commit message says, the system call argument registers are
> also likely to be aggressively clobbered unless used, since the low
> registers are preferred for code generation (smaller code, and many of
> them are special anyway in various ways and have forced uses for
> shifts, function arguments, or just are special in general like %rax).
> 
> So the actual argument registers tend to not be an issue anyway.

Btw., to underline these arguments, here's some statistical data about actual 
register usage the x86 kernel.

I picked the latest upstream kernel and did a statistical analysis of the 
disassembly of an 'allyesconfig' 64-bit build.

Here is the histogram of GP register usage, ordered by a calculated "average 
per-function usage ratio" (last column):

  # nr of =y .config options:  9553
  # nr of functions:           249340
  # nr of instructions:        20223765
  # nr of register uses:       33413619

  register | # of uses | avg uses per fn
  --------------------------------------
      %r11 |     21564 |       0.1
      %r10 |     65499 |       0.3
      %r9  |    162040 |       0.6
      %r8  |    292779 |       1.2
      %rcx |    860528 |       3.5
      %r15 |   1414816 |       5.7
      %r14 |   1597952 |       6.4
      %rsi |   1636660 |       6.6
      %rdx |   1798109 |       7.2
      %r13 |   1829557 |       7.3
      %r12 |   2301476 |       9.2
      %rbp |   3156682 |      12.7
      %rbx |   4451880 |      17.9
      %rdi |   4747951 |      19.0
      %rax |   5370191 |      21.5

Here is the same histogram for a distro kernel (Fedora) config based build, with 
all =m modules changed to =y and thus built into the vmlinux for easier analysis:

  # nr of =y .config options:  4871
  # nr of functions:           190477
  # nr of instructions:        10329411
  # nr of register uses:       16907185

  register | # of uses | avg uses per fn
  --------------------------------------
      %r11 |     64135 |       0.3
      %r10 |    113366 |       0.6
      %r9  |    196269 |       1.0
      %r8  |    314812 |       1.7
      %r15 |    404789 |       2.1
      %r14 |    461266 |       2.4
      %r13 |    569654 |       3.0
      %r12 |    763973 |       4.0
      %rcx |    920477 |       4.8
      %rbp |   1161700 |       6.1
      %rsi |   1257150 |       6.6
      %rdi |   1625617 |       8.5
      %rdx |   1667365 |       8.8
      %rbx |   1739660 |       9.1
      %rax |   3826187 |      20.1

Finally here's the histogram of a 'defconfig' build - which should be 
representative of 'device specific kernel builds':

  # nr of =y .config options:  1255
  # nr of functions:           45490
  # nr of instructions:        1963956
  # nr of register uses:       3183680

  register | # of uses | avg uses per fn
  --------------------------------------
      %r11 |     11608 |       0.3
      %r10 |     23398 |       0.5
      %r9  |     37431 |       0.8
      %r8  |     56140 |       1.2
      %r15 |     77468 |       1.7
      %r14 |     89285 |       2.0
      %r13 |    111665 |       2.5
      %r12 |    151977 |       3.3
      %rcx |    166425 |       3.7
      %rsi |    226536 |       5.0
      %rbp |    238286 |       5.2
      %rdi |    306709 |       6.7
      %rdx |    313569 |       6.9
      %rbx |    349496 |       7.7
      %rax |    728036 |      16.0

(Note the various caveats listed further below.)

These three builds I believe provide representative members of a wide spectrum of 
kernel options used in practice: from everything-enabled, through distro-enabled 
to device-specific minimal kernels.

There's a consistent pattern in these histograms: the least used registers are 
R11, R10, R9 and R8. Registers R12-R15 are used almost as frequently as some of 
the GP registers (!).

In practice R11-R10 is probably the most vulnerable ones to attack: their use is 
at least 1-2 orders of magnitude less common than that of the more common general 
purpose registers.

So I submit that we should probably extend the register clearing/sanitization to 
R10 and R11 as well, because while they are technically caller-saved and freely 
clobberable, in practice they don't get clobbered all that often and there might 
be various code paths into complex system calls where these R10/R11 values survive 
just fine and can be used in Spectre gadgets.

Thanks,

	Ingo

P.S.:

List of caveats/notes:

Note #1:
  I collapsed all 32-bit register users which zero-extend by default.
  I did not collapse 8 and 16 bit uses as they don't automatically clobber the 
  higher bits. )

Note #2:
  This histogram does not make a distinction between read and write uses of 
  registers.

Note #3:
  I did not include implicit register clobbering, only those registers that are 
  explicitly listed in the disassembly. In the overwhelming majority of cases the 
  affected registers are listed though, so the real numbers should be very close
  though.

Note #4:
  The 'avg uses per fn' number is over-estimates the real uses per function,
  because I counted total number of uses, not rounded down to a per function 
  register usage heat-map. I believe this does not change the _ordering_ of the 
  register usage histograms, so it's a valid simplification.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-05 16:26         ` Ingo Molnar
@ 2018-02-05 16:38           ` Andy Lutomirski
  2018-02-05 18:29             ` Ingo Molnar
  2018-02-05 20:05           ` Ingo Molnar
  1 sibling, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2018-02-05 16:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Dan Williams, Brian Gerst, Thomas Gleixner,
	Andi Kleen, the arch/x86 maintainers, Linux Kernel Mailing List,
	Ingo Molnar, Andy Lutomirski, H. Peter Anvin

On Mon, Feb 5, 2018 at 4:26 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> [...]
>>
>> But as the commit message says, the system call argument registers are
>> also likely to be aggressively clobbered unless used, since the low
>> registers are preferred for code generation (smaller code, and many of
>> them are special anyway in various ways and have forced uses for
>> shifts, function arguments, or just are special in general like %rax).
>>
>> So the actual argument registers tend to not be an issue anyway.
>

>
> So I submit that we should probably extend the register clearing/sanitization to
> R10 and R11 as well, because while they are technically caller-saved and freely
> clobberable, in practice they don't get clobbered all that often and there might
> be various code paths into complex system calls where these R10/R11 values survive
> just fine and can be used in Spectre gadgets.


Maybe R11, but we have to be careful, since R11 is used as scratch
space in a bunch of the asm.  Clearing R10 is mostly useless in the
syscall path because we'll just unconditionally reload it in
do_syscall_64().  If we manage to change the way syscall wrappers
work, then we can think about clearing R10 and maybe even more regs.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-05 16:38           ` Andy Lutomirski
@ 2018-02-05 18:29             ` Ingo Molnar
  2018-02-05 18:47               ` Brian Gerst
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2018-02-05 18:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Dan Williams, Brian Gerst, Thomas Gleixner,
	Andi Kleen, the arch/x86 maintainers, Linux Kernel Mailing List,
	Ingo Molnar, H. Peter Anvin


* Andy Lutomirski <luto@kernel.org> wrote:

> [...] Clearing R10 is mostly useless in the syscall path because we'll just 
> unconditionally reload it in do_syscall_64().

AFAICS do_syscall_64() doesn't touch R10 at all. So how does it reload R10?

In fact do_syscall_64() as a C function does not touch R10, R11, R12, R13, R14, 
R15 - it passes their values through.

What am I missing?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-05 18:29             ` Ingo Molnar
@ 2018-02-05 18:47               ` Brian Gerst
  2018-02-05 19:48                 ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Gerst @ 2018-02-05 18:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Linus Torvalds, Dan Williams, Thomas Gleixner,
	Andi Kleen, the arch/x86 maintainers, Linux Kernel Mailing List,
	Ingo Molnar, H. Peter Anvin

On Mon, Feb 5, 2018 at 1:29 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> [...] Clearing R10 is mostly useless in the syscall path because we'll just
>> unconditionally reload it in do_syscall_64().
>
> AFAICS do_syscall_64() doesn't touch R10 at all. So how does it reload R10?
>
> In fact do_syscall_64() as a C function does not touch R10, R11, R12, R13, R14,
> R15 - it passes their values through.
>
> What am I missing?

The syscall ABI uses R10 for the 4th argument instead of RCX, because
RCX gets clobbered by the SYSCALL instruction for RIP.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-05 18:47               ` Brian Gerst
@ 2018-02-05 19:48                 ` Ingo Molnar
  2018-02-05 20:25                   ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2018-02-05 19:48 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, Linus Torvalds, Dan Williams, Thomas Gleixner,
	Andi Kleen, the arch/x86 maintainers, Linux Kernel Mailing List,
	Ingo Molnar, H. Peter Anvin


* Brian Gerst <brgerst@gmail.com> wrote:

> On Mon, Feb 5, 2018 at 1:29 PM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Andy Lutomirski <luto@kernel.org> wrote:
> >
> >> [...] Clearing R10 is mostly useless in the syscall path because we'll just
> >> unconditionally reload it in do_syscall_64().
> >
> > AFAICS do_syscall_64() doesn't touch R10 at all. So how does it reload R10?
> >
> > In fact do_syscall_64() as a C function does not touch R10, R11, R12, R13, R14,
> > R15 - it passes their values through.
> >
> > What am I missing?
> 
> The syscall ABI uses R10 for the 4th argument instead of RCX, because
> RCX gets clobbered by the SYSCALL instruction for RIP.

But we only reload the syscall-entry value of R10 it into RCX (4th C function 
argument):

                regs->ax = sys_call_table[nr](
                        regs->di, regs->si, regs->dx,
                        regs->r10, regs->r8, regs->r9);

while RCX is a clobbered register, so in practice, while it will be briefly 
present in do_syscall_64() and the high level syscall functions, the value in RCX 
will be cleared from RCX in the overwhelming majority of cases.

But the real R10 will survive much longer, because it's only used in a very small 
minority of the C functions!

So my point: if we clear R10 (and R11) from the _real_ registers, we can stop 
propagating these user controlled values further into the kernel.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-05 16:26         ` Ingo Molnar
  2018-02-05 16:38           ` Andy Lutomirski
@ 2018-02-05 20:05           ` Ingo Molnar
  2018-02-06  8:48             ` Ingo Molnar
  1 sibling, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2018-02-05 20:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Williams, Brian Gerst, Thomas Gleixner, Andi Kleen,
	the arch/x86 maintainers, Linux Kernel Mailing List, Ingo Molnar,
	Andy Lutomirski, H. Peter Anvin


* Ingo Molnar <mingo@kernel.org> wrote:

> Btw., to underline these arguments, here's some statistical data about actual 
> register usage the x86 kernel.

So this caveat was bothering me:

> List of caveats/notes:
> 
> Note #4:
>   The 'avg uses per fn' number is over-estimates the real uses per function,
>   because I counted total number of uses, not rounded down to a per function 
>   register usage heat-map. I believe this does not change the _ordering_ of the 
>   register usage histograms, so it's a valid simplification.

as averaging all uses over the functions over-estimates the per function usage of 
registers significantly, so I implemented a real, per function register usage 
tracking.

For the x86 defconfig kernel the results are:

  r11: used in   1704 fns, not used in  43310 fns, usage ratio:    3.8%
  r10: used in   3809 fns, not used in  41205 fns, usage ratio:    8.5%
  r15: used in   6599 fns, not used in  38415 fns, usage ratio:   14.7%
   r9: used in   8120 fns, not used in  36894 fns, usage ratio:   18.0%
  r14: used in   9243 fns, not used in  35771 fns, usage ratio:   20.5%
   r8: used in  12614 fns, not used in  32400 fns, usage ratio:   28.0%
  r13: used in  12708 fns, not used in  32306 fns, usage ratio:   28.2%
  r12: used in  17144 fns, not used in  27870 fns, usage ratio:   38.1%
  rbp: used in  23289 fns, not used in  21725 fns, usage ratio:   51.7%
  rcx: used in  23897 fns, not used in  21117 fns, usage ratio:   53.1%
  rbx: used in  29226 fns, not used in  15788 fns, usage ratio:   64.9%
  rdx: used in  33205 fns, not used in  11809 fns, usage ratio:   73.8%
  rsi: used in  35415 fns, not used in   9599 fns, usage ratio:   78.7%
  rdi: used in  40628 fns, not used in   4386 fns, usage ratio:   90.3%
  rax: used in  43120 fns, not used in   1894 fns, usage ratio:   95.8%

As a comparison, here's the previous estimate:

  # nr of =y .config options:  1255
  # nr of functions:           45490
  # nr of instructions:        1963956
  # nr of register uses:       3183680

  register | # of uses | avg uses per fn
  --------------------------------------
      %r11 |     11608 |       0.3
      %r10 |     23398 |       0.5
      %r9  |     37431 |       0.8
      %r8  |     56140 |       1.2
      %r15 |     77468 |       1.7
      %r14 |     89285 |       2.0
      %r13 |    111665 |       2.5
      %r12 |    151977 |       3.3
      %rcx |    166425 |       3.7
      %rsi |    226536 |       5.0
      %rbp |    238286 |       5.2
      %rdi |    306709 |       6.7
      %rdx |    313569 |       6.9
      %rbx |    349496 |       7.7
      %rax |    728036 |      16.0

The ordering of the registers is similar, but it's not the same - and the actual 
percentages changed drastically.

In particular R10 and R11 are used only in a small minority of the functions:

  r11: used in   1704 fns, not used in  43310 fns, usage ratio:    3.8%
  r10: used in   3809 fns, not used in  41205 fns, usage ratio:    8.5%

and if deeper inside the kernel a Spectre-gadget uses R10 or R11 then chances are 
high that user-space R10/R11 values will be passed through to it without 
disturbance from other C functions.

So I think these numbers increase the urgency of clearing R10/R11 as well.

Note that this too is still only a statistical estimate, the _real_ impact could 
only be judged if we created a call graph of the whole kernel, with 100% coverage, 
and analyzed register use through this tree.

The other registers:

  r15: used in   6599 fns, not used in  38415 fns, usage ratio:   14.7%
   r9: used in   8120 fns, not used in  36894 fns, usage ratio:   18.0%
  r14: used in   9243 fns, not used in  35771 fns, usage ratio:   20.5%
   r8: used in  12614 fns, not used in  32400 fns, usage ratio:   28.0%
  r13: used in  12708 fns, not used in  32306 fns, usage ratio:   28.2%
  r12: used in  17144 fns, not used in  27870 fns, usage ratio:   38.1%

R12-R15 will be cleared with tis patchset, so that angle should be handled, which 
leaves us:

   r9: used in   8120 fns, not used in  36894 fns, usage ratio:   18.0%
   r8: used in  12614 fns, not used in  32400 fns, usage ratio:   28.0%

These are arguments #5 and #6 of the x86-64 C calling convention ABI. In practice 
they will be overwritten in many code paths - especially they will be overwritten 
in some of the richer ABIs of Linux: ioctls, socketcalls and the other high level 
demux system calls.

R8 is also spontaneously used by GCC once a function's size grows about ~130 
instructions. For R9 the average threshold is about ~160 instructions.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-05 19:48                 ` Ingo Molnar
@ 2018-02-05 20:25                   ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2018-02-05 20:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Brian Gerst, Andy Lutomirski, Linus Torvalds, Dan Williams,
	Thomas Gleixner, Andi Kleen, the arch/x86 maintainers,
	Linux Kernel Mailing List, Ingo Molnar, H. Peter Anvin

On Mon, Feb 5, 2018 at 7:48 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Brian Gerst <brgerst@gmail.com> wrote:
>
>> On Mon, Feb 5, 2018 at 1:29 PM, Ingo Molnar <mingo@kernel.org> wrote:
>> >
>> > * Andy Lutomirski <luto@kernel.org> wrote:
>> >
>> >> [...] Clearing R10 is mostly useless in the syscall path because we'll just
>> >> unconditionally reload it in do_syscall_64().
>> >
>> > AFAICS do_syscall_64() doesn't touch R10 at all. So how does it reload R10?
>> >
>> > In fact do_syscall_64() as a C function does not touch R10, R11, R12, R13, R14,
>> > R15 - it passes their values through.
>> >
>> > What am I missing?
>>
>> The syscall ABI uses R10 for the 4th argument instead of RCX, because
>> RCX gets clobbered by the SYSCALL instruction for RIP.
>
> But we only reload the syscall-entry value of R10 it into RCX (4th C function
> argument):
>
>                 regs->ax = sys_call_table[nr](
>                         regs->di, regs->si, regs->dx,
>                         regs->r10, regs->r8, regs->r9);
>
> while RCX is a clobbered register, so in practice, while it will be briefly
> present in do_syscall_64() and the high level syscall functions, the value in RCX
> will be cleared from RCX in the overwhelming majority of cases.
>
> But the real R10 will survive much longer, because it's only used in a very small
> minority of the C functions!

Yes, indeed, brain fart on my part.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels
  2018-02-05 20:05           ` Ingo Molnar
@ 2018-02-06  8:48             ` Ingo Molnar
  0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2018-02-06  8:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Williams, Brian Gerst, Thomas Gleixner, Andi Kleen,
	the arch/x86 maintainers, Linux Kernel Mailing List, Ingo Molnar,
	Andy Lutomirski, H. Peter Anvin


* Ingo Molnar <mingo@kernel.org> wrote:

> [...] so I implemented a real, per function register usage tracking.
> 
> For the x86 defconfig kernel the results are:
> 
>   r11: used in   1704 fns, not used in  43310 fns, usage ratio:    3.8%
>   r10: used in   3809 fns, not used in  41205 fns, usage ratio:    8.5%
>   r15: used in   6599 fns, not used in  38415 fns, usage ratio:   14.7%
>    r9: used in   8120 fns, not used in  36894 fns, usage ratio:   18.0%
>   r14: used in   9243 fns, not used in  35771 fns, usage ratio:   20.5%
>    r8: used in  12614 fns, not used in  32400 fns, usage ratio:   28.0%
>   r13: used in  12708 fns, not used in  32306 fns, usage ratio:   28.2%
>   r12: used in  17144 fns, not used in  27870 fns, usage ratio:   38.1%
>   rbp: used in  23289 fns, not used in  21725 fns, usage ratio:   51.7%
>   rcx: used in  23897 fns, not used in  21117 fns, usage ratio:   53.1%
>   rbx: used in  29226 fns, not used in  15788 fns, usage ratio:   64.9%
>   rdx: used in  33205 fns, not used in  11809 fns, usage ratio:   73.8%
>   rsi: used in  35415 fns, not used in   9599 fns, usage ratio:   78.7%
>   rdi: used in  40628 fns, not used in   4386 fns, usage ratio:   90.3%
>   rax: used in  43120 fns, not used in   1894 fns, usage ratio:   95.8%

So here's the next (and probably final) chapter of x86-64 register allocation 
statistics: out of curiosity I let this analysis run overnight on all 4 kernel 
configs, to see the register usage patterns of the distro and allyesconfig kernels 
as well.

Here's all the per function register allocation probabilities in a single table:

  REG      allnoconfig       localconfig      distroconfig      allyesconfig
  --------------------------------------------------------------------------
  rax:           94.6%             95.8%             94.3%             96.2%
  rbx:           46.9%             64.9%             67.6%             90.4%
  rcx:           47.8%             53.1%             57.9%             52.7%
  rdx:           66.0%             73.8%             76.0%             74.3%
  rbp:           36.2%             51.7%             55.5%             81.5%
  rsi:           64.8%             78.7%             81.3%             85.0%
  rdi:           79.9%             90.3%             92.1%             94.3%
   r8:           21.9%             28.0%             31.9%             29.7%
   r9:           13.9%             18.0%             20.4%             18.3%
  r10:            9.3%              8.5%              8.4%              4.7%
  r11:            4.9%              3.8%              4.5%              1.6%
  r12:           25.6%             38.1%             42.4%             69.3%
  r13:           18.3%             28.2%             31.5%             57.1%
  r14:           13.3%             20.5%             22.8%             46.1%
  r15:            9.3%             14.7%             16.4%             36.6%

These numbers underline the overall conclusions that we have reached so far:

 - We should clear all of R10-R15 in syscalls and R8-R15 in parameter-less
   entries (IRQs, NMIs, exceptions, etc.) - like the latest series from Dan does.

 - We should probably strive to clear R8-R9 for system calls that don't use it -
   which is ~98% of them. In particular R9 with its comparatively low (~20%)
   allocation probability could survive deep into the kernel: 5-deep call chains
   still have a ~30% chance to have R9 intact - and call chains as deep as 10 
   could still realistically have a ~10% residual probability to have R9 intact.
   We don't do this yet.

 - Smaller kernels are statistically easier to attack via Spectre, as long as the
   gadget is present in the smaller kernel. In particular heavily stripped down
   64-bit kernels might be attackable via R8-R9 (21%,14%) and also RBP (36%) to a 
   certain degree. This means that the RBP clearing introduced by this series is 
   very much relevant: because RBP is not part of the C function call calling 
   arguments ABI its allocation frequency is much lower than that of other GP 
   registers. Unfortunately R8/R9 values will survive through system calls, 
   because we restore them in do_syscall_64().

There's a somewhat surprising pattern as well: the register allocation probability 
of R10 and R11 _decreases_ as the kernel gets more complex. For all other 
registers the allocation probability increases with increasing kernel complexity, 
which is intuitive: larger functions with higher register pressure will use more 
registers.

So this result is counter-intuitive - my best guess is that it's some sort of GCC 
register allocation artifact. Here's the comparison of the code generation of a 
distro versus an allyesconfig kernel:

  #                             distro-config        allyesconfig
  #
  # nr of =y .config options:            4871                9553
  # nr of functions:                   190477              249340
  # nr of instructions:              10329411            20223765
  # nr of register uses:             16907185            33413619
  #
  # instructions per function:             54                  81
  #
  #
  # r10 used in:                        15404 fns           11300 fns
  # r10 not used in:                   167714 fns          228114 fns
  # r10 usage ratio:                      8.4%                4.7%
  #
  # r11 used in:                         8224 fns            3876 fns
  # r11 not used in:                   174894 fns          235538 fns
  # r11 usage ratio:                     4.5%                 1.6%

I don't know which kernel option (out of thousands) causes R10/R11 to be used much 
less frequently in a significantly larger kernel.

Note that even the absolute count of functions with R10/R11 use decreases in the 
allyesconfig kernel, so I don't think it can be caused by the extra 
instrumentation bloat of features like CONFIG_GCOV_KERNEL=y.

The basic inlining and optimization settings are the same and neither has 
branch-instrumentation enabled:

  #                           distro-config       allyesconfig
  #
  CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y       CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
  CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y            CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
  # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set        # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
  CONFIG_OPTIMIZE_INLINING=y                      CONFIG_OPTIMIZE_INLINING=y

  CONFIG_BRANCH_PROFILE_NONE=y                    CONFIG_BRANCH_PROFILE_NONE=y

While no-one will build and boot an allyesconfig kernel (other than me), the 
numbers are still indicative: we should keep in mind the possibility that a Linux 
distro enabling seemingly benign non-default kernel options can lower the 
allocation probability of R10/R11 significantly.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-02-06  8:48 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-03 23:21 [PATCH 0/3] x86/entry: clear registers to sanitize speculative usages Dan Williams
2018-02-03 23:21 ` [PATCH 1/3] x86/entry: Clear extra registers beyond syscall arguments for 64bit kernels Dan Williams
2018-02-04  0:14   ` Andy Lutomirski
2018-02-04  1:25     ` Dan Williams
2018-02-04  1:29       ` Andy Lutomirski
2018-02-04 13:01   ` Brian Gerst
2018-02-04 17:42     ` Dan Williams
2018-02-04 18:40       ` Linus Torvalds
2018-02-05 16:26         ` Ingo Molnar
2018-02-05 16:38           ` Andy Lutomirski
2018-02-05 18:29             ` Ingo Molnar
2018-02-05 18:47               ` Brian Gerst
2018-02-05 19:48                 ` Ingo Molnar
2018-02-05 20:25                   ` Andy Lutomirski
2018-02-05 20:05           ` Ingo Molnar
2018-02-06  8:48             ` Ingo Molnar
2018-02-03 23:21 ` [PATCH 2/3] x86/entry: Clear registers for 64bit exceptions/interrupts Dan Williams
2018-02-03 23:21 ` [PATCH 3/3] x86/entry: Clear registers for compat syscalls Dan Williams

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.