* [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
@ 2019-03-18  9:41 Elena Reshetova
  2019-03-18 20:15 ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Elena Reshetova @ 2019-03-18  9:41 UTC (permalink / raw)
  To: luto
  Cc: luto, jpoimboe, keescook, jannh, enrico.perla, mingo, bp, tglx,
	linux-kernel, peterz, gregkh, Elena Reshetova

If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
the kernel stack offset is randomized upon each
entry to a system call, after the fixed location
of the pt_regs struct.

This feature is based on the original idea from
the PaX's RANDKSTACK feature:
https://pax.grsecurity.net/docs/randkstack.txt
All credit for the original idea goes to the PaX team.
However, the design and implementation of
RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
feature (see below).

Reasoning for the feature:

This feature aims to make various stack-based
attacks that rely on a deterministic stack
structure considerably harder.
There have been many such attacks in the past
[1],[2],[3] (just to name a few), and as Linux
kernel stack protections have been constantly
improving (vmap-based stack allocation with
guard pages, removal of thread_info, STACKLEAK),
attackers have to find new ways for their
exploits to work.

It is important to note that we currently cannot show
a concrete attack that would be stopped by this new
feature (given that other existing stack protections
are enabled), so this is an attempt to be on a proactive
side vs. catching up with existing successful exploits.

The main idea is that since the stack offset is
randomized upon each system call, it is very hard
for an attacker to reliably land in any particular
place on the thread stack when an attack is
performed. Also, since randomization is performed
*after* pt_regs, the ptrace-based approach of
discovering the randomization offset during a
long-running syscall should not be possible.

[1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
[2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
[3] googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html

Design description:

During most of the kernel's execution, it runs on the "thread
stack", which is allocated in fork.c/dup_task_struct() and stored in
a per-task variable (tsk->stack). Since the stack grows downward,
the stack top can always be calculated using the task_top_of_stack(tsk)
function, which essentially returns the address tsk->stack + stack
size. When VMAP_STACK is enabled, the thread stack is allocated from
vmalloc space.
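The relationship described above can be sketched in userspace C. Note this is an editorial illustration: the struct and the THREAD_SIZE value below are simplified stand-ins, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's task_struct and THREAD_SIZE;
 * names follow the description above, not the real kernel layout. */
#define THREAD_SIZE (16UL * 1024)   /* assumed thread stack size */

struct task_struct {
	void *stack;   /* base of the thread stack, set at fork time */
};

/* The stack grows downward, so its top is simply base + size. */
static inline unsigned long task_top_of_stack(struct task_struct *tsk)
{
	return (unsigned long)tsk->stack + THREAD_SIZE;
}
```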

The thread stack is pretty deterministic in its structure: it is fixed
in size, and upon every entry from userspace to the kernel on a
syscall, the thread stack starts being constructed from an
address fetched from the per-cpu cpu_current_top_of_stack variable.
The first element pushed to the thread stack is the pt_regs struct,
which stores all required CPU registers and syscall parameters.

The goal of the RANDOMIZE_KSTACK_OFFSET feature is to add a random
offset between the pt_regs that has been pushed to the stack and the
rest of the thread stack (used during syscall processing) every time
a process issues a syscall. The source of randomness can be taken
either from rdtsc or rdrand, with the performance implications listed
below. The value of the random offset is stored in a callee-saved
register (currently r15), and the maximum size of the random offset
is defined by the __MAX_STACK_RANDOM_OFFSET value, which currently
equals 0xFF0.

As a result, this patch introduces 8 bits of randomness
(bits 4-11 are randomized; bits 0-3 must be zero due to stack
alignment) after the pt_regs location on the thread stack.
The amount of randomness can be adjusted based on how much
stack space we wish to (or can) trade for security.
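A minimal model of the offset math described above. The mask value matches __MAX_STACK_RANDOM_OFFSET from the patch; the function name and everything else here are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define __MAX_STACK_RANDOM_OFFSET 0xFF0UL

/* Mask the entropy source down to bits 4-11. Bits 0-3 stay zero
 * (16-byte alignment), so the result is one of 256 possible
 * offsets: 0x000, 0x010, ..., 0xFF0. */
static inline unsigned long kstack_random_offset(uint64_t entropy)
{
	return entropy & __MAX_STACK_RANDOM_OFFSET;
}
```

Sweeping all 4096 low-bit entropy values yields exactly 256 distinct, 16-byte-aligned offsets, which is where the "8 bits of randomness" figure comes from.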

The main issue with this approach is that it slightly breaks the
processing of the last frame in the unwinder, so I have made a simple
fix to the frame pointer unwinder (I guess the others should be fixed
similarly) and to the stack dump functionality to "jump" over the
random hole at the end. My way of solving this is probably far from
ideal, so I would really appreciate feedback on how to improve it.

Performance:

1) lmbench: ./lat_syscall -N 1000000 null
    base:                   Simple syscall: 0.1774 microseconds
    random_offset (rdtsc):  Simple syscall: 0.1803 microseconds
    random_offset (rdrand): Simple syscall: 0.3702 microseconds

2) Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
    base:                   10000000 loops in 1.62224s = 162.22 nsec / loop
    random_offset (rdtsc):  10000000 loops in 1.64660s = 164.66 nsec / loop
    random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
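From the numbers above, the relative cost works out to roughly 1.5% per syscall for rdtsc and about 117% for rdrand. A trivial helper (illustrative only) to reproduce the arithmetic:

```c
#include <assert.h>

/* Relative overhead of a variant versus the baseline, in percent,
 * using the ns/loop figures from timing_test_64 above. */
static double overhead_pct(double base_ns, double variant_ns)
{
	return (variant_ns - base_ns) / base_ns * 100.0;
}
```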

Comparison to grsecurity RANDKSTACK feature:

The RANDKSTACK feature randomizes the location of the stack start
(cpu_current_top_of_stack), i.e. the location of the pt_regs structure
itself on the stack. Initially this patch followed the same approach,
but during recent discussions [4] it was determined to be of little
value since, if ptrace functionality is available to an attacker, they
can use the PTRACE_PEEKUSR/PTRACE_POKEUSR API to read/write different
offsets in the pt_regs struct, observe the cache behavior of the
pt_regs accesses, and figure out the random stack offset.

Another big difference is that randomization is done upon
syscall entry rather than exit, as RANDKSTACK does.

Also, as a result of the above two differences, the implementation
of RANDKSTACK and RANDOMIZE_KSTACK_OFFSET has nothing in common.

[4] https://www.openwall.com/lists/kernel-hardening/2019/02/08/6

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
---
 arch/Kconfig                   | 15 +++++++++++++++
 arch/x86/Kconfig               |  1 +
 arch/x86/entry/calling.h       | 14 ++++++++++++++
 arch/x86/entry/entry_64.S      |  6 ++++++
 arch/x86/include/asm/frame.h   |  3 +++
 arch/x86/kernel/dumpstack.c    | 10 +++++++++-
 arch/x86/kernel/unwind_frame.c |  9 ++++++++-
 7 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4cfb6de48f79..9a2557b0cfce 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -808,6 +808,21 @@ config VMAP_STACK
 	  the stack to map directly to the KASAN shadow map using a formula
 	  that is incorrect if the stack is in vmalloc space.
 
+config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stack
+	  offset randomization.
+
+config RANDOMIZE_KSTACK_OFFSET
+	default n
+	bool "Randomize kernel stack offset on syscall entry"
+	depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	help
+	  Enable this if you want the kernel stack offset to be randomized
+	  upon each syscall entry. This causes the kernel stack (after
+	  pt_regs) to have a randomized offset on each system call.
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ade12ec4224b..5edcae945b73 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -131,6 +131,7 @@ config X86
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
 	select HAVE_ARCH_VMAP_STACK		if X86_64
+	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET  if X86_64
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index efb0d1b1f15f..68502645d812 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -345,6 +345,20 @@ For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
+.macro RANDOMIZE_KSTACK
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	/* prepare a random offset in rax */
+	pushq %rax
+	xorq  %rax, %rax
+	ALTERNATIVE "rdtsc", "rdrand %rax", X86_FEATURE_RDRAND
+	andq  $__MAX_STACK_RANDOM_OFFSET, %rax
+
+	/* store offset in r15 */
+	movq  %rax, %r15
+	popq  %rax
+#endif
+.endm
+
 /*
  * This does 'call enter_from_user_mode' unless we can avoid it based on
  * kernel config or using the static jump infrastructure.
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..0816ec680c21 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -167,13 +167,19 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 
 	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
 
+	RANDOMIZE_KSTACK		/* stores randomized offset in r15 */
+
 	TRACE_IRQS_OFF
 
 	/* IRQs are off. */
 	movq	%rax, %rdi
 	movq	%rsp, %rsi
+	sub 	%r15, %rsp          /* subtract random offset from rsp */
 	call	do_syscall_64		/* returns with IRQs disabled */
 
+	/* need to restore the gap */
+	add 	%r15, %rsp       /* add random offset back to rsp */
+
 	TRACE_IRQS_IRETQ		/* we're about to change IF */
 
 	/*
diff --git a/arch/x86/include/asm/frame.h b/arch/x86/include/asm/frame.h
index 5cbce6fbb534..e1bb91504f6e 100644
--- a/arch/x86/include/asm/frame.h
+++ b/arch/x86/include/asm/frame.h
@@ -4,6 +4,9 @@
 
 #include <asm/asm.h>
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+#define __MAX_STACK_RANDOM_OFFSET 0xFF0
+#endif
 /*
  * These are stack frame creation macros.  They should be used by every
  * callable non-leaf asm function to make kernel stack traces more reliable.
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 2b5886401e5f..4146a4c3e9c6 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -192,7 +192,6 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 */
 	for ( ; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
 		const char *stack_name;
-
 		if (get_stack_info(stack, task, &stack_info, &visit_mask)) {
 			/*
 			 * We weren't on a valid stack.  It's possible that
@@ -224,6 +223,9 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		 */
 		for (; stack < stack_info.end; stack++) {
 			unsigned long real_addr;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+			unsigned long left_gap;
+#endif
 			int reliable = 0;
 			unsigned long addr = READ_ONCE_NOCHECK(*stack);
 			unsigned long *ret_addr_p =
@@ -272,6 +274,12 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 			regs = unwind_get_entry_regs(&state, &partial);
 			if (regs)
 				show_regs_if_on_stack(&stack_info, regs, partial);
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+			left_gap = (unsigned long)regs - (unsigned long)stack;
+			/* if we reached the last frame, jump over the random gap */
+			if (left_gap < __MAX_STACK_RANDOM_OFFSET)
+				stack = (unsigned long *)regs--;
+#endif
 		}
 
 		if (stack_name)
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index 3dc26f95d46e..656f36b1f1b3 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -98,7 +98,14 @@ static inline unsigned long *last_frame(struct unwind_state *state)
 
 static bool is_last_frame(struct unwind_state *state)
 {
-	return state->bp == last_frame(state);
+	if (state->bp == last_frame(state))
+		return true;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	if ((last_frame(state) - state->bp) < __MAX_STACK_RANDOM_OFFSET)
+		return true;
+#endif
+	return false;
+
 }
 
 #ifdef CONFIG_X86_32
-- 
2.17.1



* Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-18  9:41 [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall Elena Reshetova
@ 2019-03-18 20:15 ` Andy Lutomirski
  2019-03-18 21:07   ` Kees Cook
                     ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Andy Lutomirski @ 2019-03-18 20:15 UTC (permalink / raw)
  To: Elena Reshetova
  Cc: Andrew Lutomirski, Josh Poimboeuf, Kees Cook, Jann Horn, Perla,
	Enrico, Ingo Molnar, Borislav Petkov, Thomas Gleixner, LKML,
	Peter Zijlstra, Greg KH

On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
<elena.reshetova@intel.com> wrote:
>
> If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> the kernel stack offset is randomized upon each
> entry to a system call after fixed location of pt_regs
> struct.
>
> This feature is based on the original idea from
> the PaX's RANDKSTACK feature:
> https://pax.grsecurity.net/docs/randkstack.txt
> All the credits for the original idea goes to the PaX team.
> However, the design and implementation of
> RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
> feature (see below).
>
> Reasoning for the feature:
>
> This feature aims to make considerably harder various
> stack-based attacks that rely on deterministic stack
> structure.
> We have had many of such attacks in past [1],[2],[3]
> (just to name few), and as Linux kernel stack protections
> have been constantly improving (vmap-based stack
> allocation with guard pages, removal of thread_info,
> STACKLEAK), attackers have to find new ways for their
> exploits to work.
>
> It is important to note that we currently cannot show
> a concrete attack that would be stopped by this new
> feature (given that other existing stack protections
> are enabled), so this is an attempt to be on a proactive
> side vs. catching up with existing successful exploits.
>
> The main idea is that since the stack offset is
> randomized upon each system call, it is very hard for
> attacker to reliably land in any particular place on
> the thread stack when attack is performed.
> Also, since randomization is performed *after* pt_regs,
> the ptrace-based approach to discover randomization
> offset during a long-running syscall should not be
> possible.
>
> [1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> [2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> [3] googleprojectzero.blogspot.com/2016/06/exploiting-
> recursion-in-linux-kernel_20.html
>
> Design description:
>
> During most of the kernel's execution, it runs on the "thread
> stack", which is allocated at fork.c/dup_task_struct() and stored in
> a per-task variable (tsk->stack). Since stack is growing downward,
> the stack top can be always calculated using task_top_of_stack(tsk)
> function, which essentially returns an address of tsk->stack + stack
> size. When VMAP_STACK is enabled, the thread stack is allocated from
> vmalloc space.
>
> Thread stack is pretty deterministic on its structure - fixed in size,
> and upon every entry from a userspace to kernel on a
> syscall the thread stack is started to be constructed from an
> address fetched from a per-cpu cpu_current_top_of_stack variable.
> The first element to be pushed to the thread stack is the pt_regs struct
> that stores all required CPU registers and sys call parameters.
>
> The goal of RANDOMIZE_KSTACK_OFFSET feature is to add a random offset
> after the pt_regs has been pushed to the stack and the rest of thread
> stack (used during the syscall processing) every time a process issues
> a syscall. The source of randomness can be taken either from rdtsc or
> rdrand with performance implications listed below. The value of random
> offset is stored in a callee-saved register (r15 currently) and the
> maximum size of random offset is defined by __MAX_STACK_RANDOM_OFFSET
> value, which currently equals to 0xFF0.
>
> As a result this patch introduces 8 bits of randomness
> (bits 4 - 11 are randomized, bits 0-3 must be zero due to stack alignment)
> after pt_regs location on the thread stack.
> The amount of randomness can be adjusted based on how much of the
> stack space we wish/can trade for security.

Why do you need four zero bits at the bottom?  x86_64 Linux only
maintains 8 byte stack alignment.

>
> The main issue with this approach is that it slightly breaks the
> processing of last frame in the unwinder, so I have made a simple
> fix to the frame pointer unwinder (I guess others should be fixed
> similarly) and stack dump functionality to "jump" over the random hole
> at the end. My way of solving this is probably far from ideal,
> so I would really appreciate feedback on how to improve it.

That's probably a question for Josh :)

Another way to do the dirty work would be to do:

    char *ptr = alloca(offset);
    asm volatile ("" :: "m" (*ptr));

in do_syscall_64() and adjust compiler flags as needed to avoid warnings.  Hmm.

>
> Performance:
>
> 1) lmbench: ./lat_syscall -N 1000000 null
>     base:                     Simple syscall: 0.1774 microseconds
>     random_offset (rdtsc):     Simple syscall: 0.1803 microseconds
>     random_offset (rdrand): Simple syscall: 0.3702 microseconds
>
> 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
>     base:                     10000000 loops in 1.62224s = 162.22 nsec / loop
>     random_offset (rdtsc):     10000000 loops in 1.64660s = 164.66 nsec / loop
>     random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
>

Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
RDRAND is awful.  I had hoped for better.

So perhaps we need a little percpu buffer that collects 64 bits of
randomness at a time, shifts out the needed bits, and refills the
buffer when we run out.
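That refill scheme might look something like the following sketch. The entropy source here is a fixed stand-in value purely to make the consumption order visible, and the per-CPU aspect is omitted:

```c
#include <assert.h>
#include <stdint.h>

/* One pool per CPU in the real thing; a single one suffices here. */
struct entropy_pool {
	uint64_t bits;
	unsigned int avail;   /* bits still unconsumed */
};

static uint64_t refill_count;   /* for illustration: how often we refill */

static uint64_t refill(void)    /* stand-in for rdrand/prandom/etc. */
{
	refill_count++;
	return 0x0123456789ABCDEFULL;
}

/* Hand out 8 bits per syscall; refill only when the pool runs dry,
 * amortizing one expensive entropy read over eight syscalls. */
static unsigned long next_kstack_offset(struct entropy_pool *pool)
{
	unsigned long r;

	if (pool->avail < 8) {
		pool->bits = refill();
		pool->avail = 64;
	}
	r = pool->bits & 0xFF;
	pool->bits >>= 8;
	pool->avail -= 8;
	return r << 4;   /* bits 4-11, matching the 0xFF0 mask */
}
```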

>  /*
>   * This does 'call enter_from_user_mode' unless we can avoid it based on
>   * kernel config or using the static jump infrastructure.
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 1f0efdb7b629..0816ec680c21 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -167,13 +167,19 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>
>         PUSH_AND_CLEAR_REGS rax=$-ENOSYS
>
> +       RANDOMIZE_KSTACK                /* stores randomized offset in r15 */
> +
>         TRACE_IRQS_OFF
>
>         /* IRQs are off. */
>         movq    %rax, %rdi
>         movq    %rsp, %rsi
> +       sub     %r15, %rsp          /* substitute random offset from rsp */
>         call    do_syscall_64           /* returns with IRQs disabled */
>
> +       /* need to restore the gap */
> +       add     %r15, %rsp       /* add random offset back to rsp */

Off the top of my head, the nicer way to approach this would be to
change this such that mov %rbp, %rsp; popq %rbp or something like that
will do the trick.  Then the unwinder could just see it as a regular
frame.  Maybe Josh will have a better idea.


* Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-18 20:15 ` Andy Lutomirski
@ 2019-03-18 21:07   ` Kees Cook
  2019-03-26 10:35     ` Reshetova, Elena
  2019-03-18 23:31   ` Josh Poimboeuf
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Kees Cook @ 2019-03-18 21:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Elena Reshetova, Josh Poimboeuf, Jann Horn, Perla, Enrico,
	Ingo Molnar, Borislav Petkov, Thomas Gleixner, LKML,
	Peter Zijlstra, Greg KH

On Mon, Mar 18, 2019 at 1:16 PM Andy Lutomirski <luto@kernel.org> wrote:
> On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
> <elena.reshetova@intel.com> wrote:
> > Performance:
> >
> > 1) lmbench: ./lat_syscall -N 1000000 null
> >     base:                     Simple syscall: 0.1774 microseconds
> >     random_offset (rdtsc):     Simple syscall: 0.1803 microseconds
> >     random_offset (rdrand): Simple syscall: 0.3702 microseconds
> >
> > 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> >     base:                     10000000 loops in 1.62224s = 162.22 nsec / loop
> >     random_offset (rdtsc):     10000000 loops in 1.64660s = 164.66 nsec / loop
> >     random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
> >
>
> Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
> RDRAND is awful.  I had hoped for better.

RDRAND can also fail.

> So perhaps we need a little percpu buffer that collects 64 bits of
> randomness at a time, shifts out the needed bits, and refills the
> buffer when we run out.

I'd like to avoid saving the _exact_ details of where the next offset
will be, but if nothing else works, this should be okay. We can use 8
bits at a time and call prandom_u32() every 4th call. Something like
prandom_bytes(), but where it doesn't throw away the unused bytes.
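One possible shape for this, with a stand-in for prandom_u32(); the names and details are illustrative, not a proposed implementation:

```c
#include <assert.h>
#include <stdint.h>

static unsigned int rng_calls;

static uint32_t fake_prandom_u32(void)   /* stand-in for prandom_u32() */
{
	rng_calls++;
	return 0xA1B2C3D4u;
}

/* prandom_bytes()-style consumer that does not throw away unused
 * bytes: one 32-bit draw feeds four syscalls' worth of 8-bit offsets. */
static unsigned long next_offset_u32(void)
{
	static uint32_t pool;
	static unsigned int used = 4;   /* bytes consumed; 4 == empty */

	if (used == 4) {
		pool = fake_prandom_u32();
		used = 0;
	}
	unsigned long r = pool & 0xFF;
	pool >>= 8;
	used++;
	return r << 4;   /* bits 4-11, as in the patch */
}
```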

-- 
Kees Cook


* Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-18 20:15 ` Andy Lutomirski
  2019-03-18 21:07   ` Kees Cook
@ 2019-03-18 23:31   ` Josh Poimboeuf
  2019-03-20 12:10     ` Reshetova, Elena
  2019-03-20 11:12   ` David Laight
  2019-03-20 12:04   ` Reshetova, Elena
  3 siblings, 1 reply; 22+ messages in thread
From: Josh Poimboeuf @ 2019-03-18 23:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Elena Reshetova, Kees Cook, Jann Horn, Perla, Enrico,
	Ingo Molnar, Borislav Petkov, Thomas Gleixner, LKML,
	Peter Zijlstra, Greg KH

On Mon, Mar 18, 2019 at 01:15:44PM -0700, Andy Lutomirski wrote:
> On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
> <elena.reshetova@intel.com> wrote:
> >
> > If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> > the kernel stack offset is randomized upon each
> > entry to a system call after fixed location of pt_regs
> > struct.
> >
> > This feature is based on the original idea from
> > the PaX's RANDKSTACK feature:
> > https://pax.grsecurity.net/docs/randkstack.txt
> > All the credits for the original idea goes to the PaX team.
> > However, the design and implementation of
> > RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
> > feature (see below).
> >
> > Reasoning for the feature:
> >
> > This feature aims to make considerably harder various
> > stack-based attacks that rely on deterministic stack
> > structure.
> > We have had many of such attacks in past [1],[2],[3]
> > (just to name few), and as Linux kernel stack protections
> > have been constantly improving (vmap-based stack
> > allocation with guard pages, removal of thread_info,
> > STACKLEAK), attackers have to find new ways for their
> > exploits to work.
> >
> > It is important to note that we currently cannot show
> > a concrete attack that would be stopped by this new
> > feature (given that other existing stack protections
> > are enabled), so this is an attempt to be on a proactive
> > side vs. catching up with existing successful exploits.
> >
> > The main idea is that since the stack offset is
> > randomized upon each system call, it is very hard for
> > attacker to reliably land in any particular place on
> > the thread stack when attack is performed.
> > Also, since randomization is performed *after* pt_regs,
> > the ptrace-based approach to discover randomization
> > offset during a long-running syscall should not be
> > possible.
> >
> > [1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> > [2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> > [3] googleprojectzero.blogspot.com/2016/06/exploiting-
> > recursion-in-linux-kernel_20.html

Now that thread_info is off the stack, and vmap stack guard pages exist,
it's not clear to me what the benefit is.

> > The main issue with this approach is that it slightly breaks the
> > processing of last frame in the unwinder, so I have made a simple
> > fix to the frame pointer unwinder (I guess others should be fixed
> > similarly) and stack dump functionality to "jump" over the random hole
> > at the end. My way of solving this is probably far from ideal,
> > so I would really appreciate feedback on how to improve it.
> 
> That's probably a question for Josh :)
> 
> Another way to do the dirty work would be to do:
> 
>     char *ptr = alloca(offset);
>     asm volatile ("" :: "m" (*ptr));
> 
> in do_syscall_64() and adjust compiler flags as needed to avoid warnings.  Hmm.

I like the alloca() idea a lot.  If you do the stack adjustment in C,
then everything should just work, with no custom hacks in entry code or
the unwinders.

> >  /*
> >   * This does 'call enter_from_user_mode' unless we can avoid it based on
> >   * kernel config or using the static jump infrastructure.
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 1f0efdb7b629..0816ec680c21 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -167,13 +167,19 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
> >
> >         PUSH_AND_CLEAR_REGS rax=$-ENOSYS
> >
> > +       RANDOMIZE_KSTACK                /* stores randomized offset in r15 */
> > +
> >         TRACE_IRQS_OFF
> >
> >         /* IRQs are off. */
> >         movq    %rax, %rdi
> >         movq    %rsp, %rsi
> > +       sub     %r15, %rsp          /* substitute random offset from rsp */
> >         call    do_syscall_64           /* returns with IRQs disabled */
> >
> > +       /* need to restore the gap */
> > +       add     %r15, %rsp       /* add random offset back to rsp */
> 
> Off the top of my head, the nicer way to approach this would be to
> change this such that mov %rbp, %rsp; popq %rbp or something like that
> will do the trick.  Then the unwinder could just see it as a regular
> frame.  Maybe Josh will have a better idea.

Yes, we could probably do something like that.  Though I think I'd much
rather do the alloca() thing.  

-- 
Josh


* RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-18 20:15 ` Andy Lutomirski
  2019-03-18 21:07   ` Kees Cook
  2019-03-18 23:31   ` Josh Poimboeuf
@ 2019-03-20 11:12   ` David Laight
  2019-03-20 14:51     ` Andy Lutomirski
  2019-03-20 12:04   ` Reshetova, Elena
  3 siblings, 1 reply; 22+ messages in thread
From: David Laight @ 2019-03-20 11:12 UTC (permalink / raw)
  To: 'Andy Lutomirski', Elena Reshetova
  Cc: Josh Poimboeuf, Kees Cook, Jann Horn, Perla, Enrico, Ingo Molnar,
	Borislav Petkov, Thomas Gleixner, LKML, Peter Zijlstra, Greg KH

From: Andy Lutomirski
> Sent: 18 March 2019 20:16
...
> > As a result this patch introduces 8 bits of randomness
> > (bits 4 - 11 are randomized, bits 0-3 must be zero due to stack alignment)
> > after pt_regs location on the thread stack.
> > The amount of randomness can be adjusted based on how much of the
> > stack space we wish/can trade for security.
> 
> Why do you need four zero bits at the bottom?  x86_64 Linux only
> maintains 8 byte stack alignment.

ISTR that the gcc developers arbitrarily changed the alignment
a few years ago.
If the stack is only 8 byte aligned and you allocate a variable that
requires 16 byte alignment you need gcc to generate the extra stack
frame to align the stack.
I don't remember seeing the relevant gcc options on the linux
gcc command lines.

	David



* RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-18 20:15 ` Andy Lutomirski
                     ` (2 preceding siblings ...)
  2019-03-20 11:12   ` David Laight
@ 2019-03-20 12:04   ` Reshetova, Elena
  3 siblings, 0 replies; 22+ messages in thread
From: Reshetova, Elena @ 2019-03-20 12:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, Kees Cook, Jann Horn, Perla, Enrico, Ingo Molnar,
	Borislav Petkov, Thomas Gleixner, LKML, Peter Zijlstra, Greg KH

Something is really weird with my Intel mail: it only now delivered
all the messages in one go, and I was thinking I wasn't getting any feedback...

> > If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> > the kernel stack offset is randomized upon each
> > entry to a system call after fixed location of pt_regs
> > struct.
> >
> > This feature is based on the original idea from
> > the PaX's RANDKSTACK feature:
> > https://pax.grsecurity.net/docs/randkstack.txt
> > All the credits for the original idea goes to the PaX team.
> > However, the design and implementation of
> > RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
> > feature (see below).
> >
> > Reasoning for the feature:
> >
> > This feature aims to make considerably harder various
> > stack-based attacks that rely on deterministic stack
> > structure.
> > We have had many of such attacks in past [1],[2],[3]
> > (just to name few), and as Linux kernel stack protections
> > have been constantly improving (vmap-based stack
> > allocation with guard pages, removal of thread_info,
> > STACKLEAK), attackers have to find new ways for their
> > exploits to work.
> >
> > It is important to note that we currently cannot show
> > a concrete attack that would be stopped by this new
> > feature (given that other existing stack protections
> > are enabled), so this is an attempt to be on a proactive
> > side vs. catching up with existing successful exploits.
> >
> > The main idea is that since the stack offset is
> > randomized upon each system call, it is very hard for
> > attacker to reliably land in any particular place on
> > the thread stack when attack is performed.
> > Also, since randomization is performed *after* pt_regs,
> > the ptrace-based approach to discover randomization
> > offset during a long-running syscall should not be
> > possible.
> >
> > [1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> > [2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> > [3] googleprojectzero.blogspot.com/2016/06/exploiting-
> > recursion-in-linux-kernel_20.html
> >
> > Design description:
> >
> > During most of the kernel's execution, it runs on the "thread
> > stack", which is allocated at fork.c/dup_task_struct() and stored in
> > a per-task variable (tsk->stack). Since stack is growing downward,
> > the stack top can be always calculated using task_top_of_stack(tsk)
> > function, which essentially returns an address of tsk->stack + stack
> > size. When VMAP_STACK is enabled, the thread stack is allocated from
> > vmalloc space.
> >
> > Thread stack is pretty deterministic on its structure - fixed in size,
> > and upon every entry from a userspace to kernel on a
> > syscall the thread stack is started to be constructed from an
> > address fetched from a per-cpu cpu_current_top_of_stack variable.
> > The first element to be pushed to the thread stack is the pt_regs struct
> > that stores all required CPU registers and sys call parameters.
> >
> > The goal of RANDOMIZE_KSTACK_OFFSET feature is to add a random offset
> > after the pt_regs has been pushed to the stack and the rest of thread
> > stack (used during the syscall processing) every time a process issues
> > a syscall. The source of randomness can be taken either from rdtsc or
> > rdrand with performance implications listed below. The value of random
> > offset is stored in a callee-saved register (r15 currently) and the
> > maximum size of random offset is defined by __MAX_STACK_RANDOM_OFFSET
> > value, which currently equals to 0xFF0.
> >
> > As a result this patch introduces 8 bits of randomness
> > (bits 4 - 11 are randomized, bits 0-3 must be zero due to stack alignment)
> > after pt_regs location on the thread stack.
> > The amount of randomness can be adjusted based on how much of the
> > stack space we wish/can trade for security.
> 
> Why do you need four zero bits at the bottom?  x86_64 Linux only
> maintains 8 byte stack alignment.

I have to check this: it looked to me like this is needed to avoid
alignment issues, but maybe that is my mistake.

> >
> > The main issue with this approach is that it slightly breaks the
> > processing of last frame in the unwinder, so I have made a simple
> > fix to the frame pointer unwinder (I guess others should be fixed
> > similarly) and stack dump functionality to "jump" over the random hole
> > at the end. My way of solving this is probably far from ideal,
> > so I would really appreciate feedback on how to improve it.
> 
> That's probably a question for Josh :)
> 
> Another way to do the dirty work would be to do:
> 
>     char *ptr = alloca(offset);
>     asm volatile ("" :: "m" (*ptr));
> 
> in do_syscall_64() and adjust compiler flags as needed to avoid warnings.  Hmm.

I was hoping to get away with assembly-only and minimal
changes, but if this approach seems better to you and Josh,
then I guess I can do it this way.

> 
> >
> > Performance:
> >
> > 1) lmbench: ./lat_syscall -N 1000000 null
> >     base:                     Simple syscall: 0.1774 microseconds
> >     random_offset (rdtsc):     Simple syscall: 0.1803 microseconds
> >     random_offset (rdrand): Simple syscall: 0.3702 microseconds
> >
> > 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> >     base:                     10000000 loops in 1.62224s = 162.22 nsec / loop
> >     random_offset (rdtsc):     10000000 loops in 1.64660s = 164.66 nsec / loop
> >     random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
> >
> 
> Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
> RDRAND is awful.  I had hoped for better.

Yes, it is very, very slow. I actually didn't believe my measurements
at first, thinking that it cannot be so much slower just because of a
one-instruction difference, but it looks like it can...

> 
> So perhaps we need a little percpu buffer that collects 64 bits of
> randomness at a time, shifts out the needed bits, and refills the
> buffer when we run out.

Hm... We might have to refill pretty often on syscall-hungry
workloads. If we need 8 bits for each syscall, then we will refill
every 8 syscalls, which is of course better than on each one, but is
it an acceptable penalty? And then there is also the storage issue
for our offset bits, as Kees mentioned.

> 
> >  /*
> >   * This does 'call enter_from_user_mode' unless we can avoid it based on
> >   * kernel config or using the static jump infrastructure.
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 1f0efdb7b629..0816ec680c21 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -167,13 +167,19 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
> >
> >         PUSH_AND_CLEAR_REGS rax=$-ENOSYS
> >
> > +       RANDOMIZE_KSTACK                /* stores randomized offset in r15 */
> > +
> >         TRACE_IRQS_OFF
> >
> >         /* IRQs are off. */
> >         movq    %rax, %rdi
> >         movq    %rsp, %rsi
> > +       sub     %r15, %rsp          /* subtract random offset from rsp */
> >         call    do_syscall_64           /* returns with IRQs disabled */
> >
> > +       /* need to restore the gap */
> > +       add     %r15, %rsp       /* add random offset back to rsp */
> 
> Off the top of my head, the nicer way to approach this would be to
> change this such that mov %rbp, %rsp; popq %rbp or something like that
> will do the trick.  Then the unwinder could just see it as a regular
> frame.  Maybe Josh will have a better idea.

I tried it with rbp, but I could not get it working as with the other
callee-saved registers. But since the alloca method seems to be
preferable, maybe it is not worth investigating this further.

Best Regards,
Elena.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-18 23:31   ` Josh Poimboeuf
@ 2019-03-20 12:10     ` Reshetova, Elena
  0 siblings, 0 replies; 22+ messages in thread
From: Reshetova, Elena @ 2019-03-20 12:10 UTC (permalink / raw)
  To: Josh Poimboeuf, Andy Lutomirski
  Cc: Kees Cook, Jann Horn, Perla, Enrico, Ingo Molnar,
	Borislav Petkov, Thomas Gleixner, LKML, Peter Zijlstra, Greg KH

> On Mon, Mar 18, 2019 at 01:15:44PM -0700, Andy Lutomirski wrote:
> > On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
> > <elena.reshetova@intel.com> wrote:
> > >
> > > If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> > > the kernel stack offset is randomized upon each
> > > entry to a system call after fixed location of pt_regs
> > > struct.
> > >
> > > This feature is based on the original idea from
> > > the PaX's RANDKSTACK feature:
> > > https://pax.grsecurity.net/docs/randkstack.txt
> > > All the credits for the original idea goes to the PaX team.
> > > However, the design and implementation of
> > > RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
> > > feature (see below).
> > >
> > > Reasoning for the feature:
> > >
> > > This feature aims to make considerably harder various
> > > stack-based attacks that rely on deterministic stack
> > > structure.
> > > We have had many of such attacks in past [1],[2],[3]
> > > (just to name few), and as Linux kernel stack protections
> > > have been constantly improving (vmap-based stack
> > > allocation with guard pages, removal of thread_info,
> > > STACKLEAK), attackers have to find new ways for their
> > > exploits to work.
> > >
> > > It is important to note that we currently cannot show
> > > a concrete attack that would be stopped by this new
> > > feature (given that other existing stack protections
> > > are enabled), so this is an attempt to be on a proactive
> > > side vs. catching up with existing successful exploits.
> > >
> > > The main idea is that since the stack offset is
> > > randomized upon each system call, it is very hard for
> > > attacker to reliably land in any particular place on
> > > the thread stack when attack is performed.
> > > Also, since randomization is performed *after* pt_regs,
> > > the ptrace-based approach to discover randomization
> > > offset during a long-running syscall should not be
> > > possible.
> > >
> > > [1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> > > [2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> > > [3] googleprojectzero.blogspot.com/2016/06/exploiting-
> > > recursion-in-linux-kernel_20.html
> 
> Now that thread_info is off the stack, and vmap stack guard pages exist,
> it's not clear to me what the benefit is.

Yes, as it says above, this is an attempt to be proactive vs. reactive.
We cannot show a concrete attack now that would succeed with the vmap
stack enabled, thread_info removed and other protections enabled.
However, the fact remains that the kernel thread stack is still very
deterministic, and this property has been utilized many times in attacks.
We don't know where creative attackers will go next and what they
can use to mount the next kernel stack-based attack, but I think it is just
a question of time. I don't believe we can claim that the Linux kernel
thread stack is currently immune to attacks.

So, if we can add a protection that is not invasive, either in code or in performance,
and which might make an attacker's life considerably harder, why not add it?

> 
> > > The main issue with this approach is that it slightly breaks the
> > > processing of last frame in the unwinder, so I have made a simple
> > > fix to the frame pointer unwinder (I guess others should be fixed
> > > similarly) and stack dump functionality to "jump" over the random hole
> > > at the end. My way of solving this is probably far from ideal,
> > > so I would really appreciate feedback on how to improve it.
> >
> > That's probably a question for Josh :)
> >
> > Another way to do the dirty work would be to do:
> >
> >     char *ptr = alloca(offset);
> >     asm volatile ("" :: "m" (*ptr));
> >
> > in do_syscall_64() and adjust compiler flags as needed to avoid warnings.  Hmm.
> 
> I like the alloca() idea a lot.  If you do the stack adjustment in C,
> then everything should just work, with no custom hacks in entry code or
> the unwinders.

Ok, so maybe this is what I am going to try next then. 

Best Regards,
Elena.


* Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-20 11:12   ` David Laight
@ 2019-03-20 14:51     ` Andy Lutomirski
  0 siblings, 0 replies; 22+ messages in thread
From: Andy Lutomirski @ 2019-03-20 14:51 UTC (permalink / raw)
  To: David Laight
  Cc: Andy Lutomirski, Elena Reshetova, Josh Poimboeuf, Kees Cook,
	Jann Horn, Perla, Enrico, Ingo Molnar, Borislav Petkov,
	Thomas Gleixner, LKML, Peter Zijlstra, Greg KH


> On Mar 20, 2019, at 4:12 AM, David Laight <David.Laight@aculab.com> wrote:
> 
> From: Andy Lutomirski
>> Sent: 18 March 2019 20:16
> ...
>>> As a result this patch introduces 8 bits of randomness
>>> (bits 4 - 11 are randomized, bits 0-3 must be zero due to stack alignment)
>>> after pt_regs location on the thread stack.
>>> The amount of randomness can be adjusted based on how much of the
>>> stack space we wish/can trade for security.
>> 
>> Why do you need four zero bits at the bottom?  x86_64 Linux only
>> maintains 8 byte stack alignment.
> 
> ISTR that the gcc developers arbitrarily changed the alignment
> a few years ago.
> If the stack is only 8 byte aligned and you allocate a variable that
> requires 16 byte alignment you need gcc to generate the extra stack
> frame to align the stack.
> I don't remember seeing the relevant gcc options on the linux
> gcc command lines.
> 


On older gcc, you *can’t* set the relevant command line options because gcc was daft.  So we just crossed our fingers and hoped for the best.  On newer gcc, we set the options.  Fortunately, 32-byte stack variable alignment works regardless.

AFAIK x86_64 Linux has never aligned the stack to 16 bytes.


* RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-18 21:07   ` Kees Cook
@ 2019-03-26 10:35     ` Reshetova, Elena
  2019-03-27  4:31       ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Reshetova, Elena @ 2019-03-26 10:35 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski
  Cc: Josh Poimboeuf, Jann Horn, Perla, Enrico, Ingo Molnar,
	Borislav Petkov, Thomas Gleixner, LKML, Peter Zijlstra, Greg KH

> On Mon, Mar 18, 2019 at 1:16 PM Andy Lutomirski <luto@kernel.org> wrote:
> > On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
> > <elena.reshetova@intel.com> wrote:
> > > Performance:
> > >
> > > 1) lmbench: ./lat_syscall -N 1000000 null
> > >     base:                     Simple syscall: 0.1774 microseconds
> > >     random_offset (rdtsc):     Simple syscall: 0.1803 microseconds
> > >     random_offset (rdrand): Simple syscall: 0.3702 microseconds
> > >
> > > 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> > >     base:                     10000000 loops in 1.62224s = 162.22 nsec / loop
> > >     random_offset (rdtsc):     10000000 loops in 1.64660s = 164.66 nsec / loop
> > >     random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
> > >
> >
> > Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
> > RDRAND is awful.  I had hoped for better.
> 
> RDRAND can also fail.
> 
> > So perhaps we need a little percpu buffer that collects 64 bits of
> > randomness at a time, shifts out the needed bits, and refills the
> > buffer when we run out.
> 
> I'd like to avoid saving the _exact_ details of where the next offset
> will be, but if nothing else works, this should be okay. We can use 8
> bits at a time and call prandom_u32() every 4th call. Something like
> prandom_bytes(), but where it doesn't throw away the unused bytes.

Actually, I think this would make the end result even worse security-wise
than simply using rdtsc() on every syscall. Saving the randomness in a percpu
buffer, which is probably easily accessible and can be probed if needed,
would supply an attacker with much more knowledge about the next 3-4
random offsets than what he would get if we use the "weak" rdtsc. Given
that for a successful exploit an attacker needs to get the stack aligned
only once, having knowledge of the next 3-4 offsets sounds like a present to an
exploit writer...  Additionally, it creates complexity around the code that I
have trouble justifying with the "security" argument, because of the above...

I have the patch with alloca() and rdtsc() working now; I can post it
(albeit it is very simple), but I am really hesitant about adding the percpu
buffer randomness storage to it...

Best Regards,
Elena.


* Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-26 10:35     ` Reshetova, Elena
@ 2019-03-27  4:31       ` Andy Lutomirski
  2019-03-28 15:45         ` Kees Cook
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2019-03-27  4:31 UTC (permalink / raw)
  To: Reshetova, Elena
  Cc: Kees Cook, Andy Lutomirski, Josh Poimboeuf, Jann Horn, Perla,
	Enrico, Ingo Molnar, Borislav Petkov, Thomas Gleixner, LKML,
	Peter Zijlstra, Greg KH

On Tue, Mar 26, 2019 at 3:35 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
>
> > On Mon, Mar 18, 2019 at 1:16 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
> > > <elena.reshetova@intel.com> wrote:
> > > > Performance:
> > > >
> > > > 1) lmbench: ./lat_syscall -N 1000000 null
> > > >     base:                     Simple syscall: 0.1774 microseconds
> > > >     random_offset (rdtsc):     Simple syscall: 0.1803 microseconds
> > > >     random_offset (rdrand): Simple syscall: 0.3702 microseconds
> > > >
> > > > 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> > > >     base:                     10000000 loops in 1.62224s = 162.22 nsec / loop
> > > >     random_offset (rdtsc):     10000000 loops in 1.64660s = 164.66 nsec / loop
> > > >     random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
> > > >
> > >
> > > Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
> > > RDRAND is awful.  I had hoped for better.
> >
> > RDRAND can also fail.
> >
> > > So perhaps we need a little percpu buffer that collects 64 bits of
> > > randomness at a time, shifts out the needed bits, and refills the
> > > buffer when we run out.
> >
> > I'd like to avoid saving the _exact_ details of where the next offset
> > will be, but if nothing else works, this should be okay. We can use 8
> > bits at a time and call prandom_u32() every 4th call. Something like
> > prandom_bytes(), but where it doesn't throw away the unused bytes.
>
> Actually I think this would make the end result even worse security-wise
> than simply using rdtsc() on every syscall. Saving the randomness in percpu
> buffer, which is probably easily accessible and can be probed if needed,
> would supply attacker with much more knowledge about the next 3-4
> random offsets that what he would get if we use "weak" rdtsc. Given
> that for a successful exploit, an attacker would need to have stack aligned
> once only, having a knowledge of 3-4 next offsets sounds like a present to an
> exploit writer...  Additionally it creates complexity around the code that I
> have issues justifying with "security" argument because of above...
>
> I have the patch now with alloca() and rdtsc() working, I can post it
> (albeit it is very simple), but I am really hesitating on adding the percpu
> buffer randomness storage to it...
>

Hmm.  I guess it depends on what types of attack you care about.  I
bet that, if you do a bunch of iterations of mfence;rdtsc;syscall,
you'll discover that the offset between the user rdtsc and the
syscall's rdtsc has several values that occur with high probability.

--Andy


* Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-27  4:31       ` Andy Lutomirski
@ 2019-03-28 15:45         ` Kees Cook
  2019-03-28 16:29           ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Kees Cook @ 2019-03-28 15:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Reshetova, Elena, Josh Poimboeuf, Jann Horn, Perla, Enrico,
	Ingo Molnar, Borislav Petkov, Thomas Gleixner, LKML,
	Peter Zijlstra, Greg KH

On Tue, Mar 26, 2019 at 9:31 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Tue, Mar 26, 2019 at 3:35 AM Reshetova, Elena
> <elena.reshetova@intel.com> wrote:
> >
> > > On Mon, Mar 18, 2019 at 1:16 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
> > > > <elena.reshetova@intel.com> wrote:
> > > > > Performance:
> > > > >
> > > > > 1) lmbench: ./lat_syscall -N 1000000 null
> > > > >     base:                     Simple syscall: 0.1774 microseconds
> > > > >     random_offset (rdtsc):     Simple syscall: 0.1803 microseconds
> > > > >     random_offset (rdrand): Simple syscall: 0.3702 microseconds
> > > > >
> > > > > 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> > > > >     base:                     10000000 loops in 1.62224s = 162.22 nsec / loop
> > > > >     random_offset (rdtsc):     10000000 loops in 1.64660s = 164.66 nsec / loop
> > > > >     random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
> > > > >
> > > >
> > > > Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
> > > > RDRAND is awful.  I had hoped for better.
> > >
> > > RDRAND can also fail.
> > >
> > > > So perhaps we need a little percpu buffer that collects 64 bits of
> > > > randomness at a time, shifts out the needed bits, and refills the
> > > > buffer when we run out.
> > >
> > > I'd like to avoid saving the _exact_ details of where the next offset
> > > will be, but if nothing else works, this should be okay. We can use 8
> > > bits at a time and call prandom_u32() every 4th call. Something like
> > > prandom_bytes(), but where it doesn't throw away the unused bytes.
> >
> > Actually I think this would make the end result even worse security-wise
> > than simply using rdtsc() on every syscall. Saving the randomness in percpu
> > buffer, which is probably easily accessible and can be probed if needed,
> > would supply attacker with much more knowledge about the next 3-4
> > random offsets that what he would get if we use "weak" rdtsc. Given
> > that for a successful exploit, an attacker would need to have stack aligned
> > once only, having a knowledge of 3-4 next offsets sounds like a present to an
> > exploit writer...  Additionally it creates complexity around the code that I
> > have issues justifying with "security" argument because of above...

That certainly solidifies my concern against saving randomness. :)

> > I have the patch now with alloca() and rdtsc() working, I can post it
> > (albeit it is very simple), but I am really hesitating on adding the percpu
> > buffer randomness storage to it...
> >
>
> Hmm.  I guess it depends on what types of attack you care about.  I
> bet that, if you do a bunch of iterations of mfence;rdtsc;syscall,
> you'll discover that the offset between the user rdtsc and the
> syscall's rdtsc has several values that occur with high probability.

How about rdtsc xor with the middle word of the stack canary? (to
avoid the 0-byte) Something like:

    rdtsc
    xorl [%gs:...canary....], %rax
    andq  $__MAX_STACK_RANDOM_OFFSET, %rax

I need to look at the right way to reference the canary during that
code. Andy might know off the top of his head. :)

-Kees

-- 
Kees Cook


* Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-28 15:45         ` Kees Cook
@ 2019-03-28 16:29           ` Andy Lutomirski
  2019-03-28 16:47             ` Kees Cook
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2019-03-28 16:29 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Reshetova, Elena, Josh Poimboeuf, Jann Horn,
	Perla, Enrico, Ingo Molnar, Borislav Petkov, Thomas Gleixner,
	LKML, Peter Zijlstra, Greg KH



> On Mar 28, 2019, at 8:45 AM, Kees Cook <keescook@chromium.org> wrote:
> 
>> On Tue, Mar 26, 2019 at 9:31 PM Andy Lutomirski <luto@kernel.org> wrote:
>> 
>> On Tue, Mar 26, 2019 at 3:35 AM Reshetova, Elena
>> <elena.reshetova@intel.com> wrote:
>>> 
>>>>> On Mon, Mar 18, 2019 at 1:16 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>> On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
>>>>> <elena.reshetova@intel.com> wrote:
>>>>>> Performance:
>>>>>> 
>>>>>> 1) lmbench: ./lat_syscall -N 1000000 null
>>>>>>    base:                     Simple syscall: 0.1774 microseconds
>>>>>>    random_offset (rdtsc):     Simple syscall: 0.1803 microseconds
>>>>>>    random_offset (rdrand): Simple syscall: 0.3702 microseconds
>>>>>> 
>>>>>> 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
>>>>>>    base:                     10000000 loops in 1.62224s = 162.22 nsec / loop
>>>>>>    random_offset (rdtsc):     10000000 loops in 1.64660s = 164.66 nsec / loop
>>>>>>    random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
>>>>>> 
>>>>> 
>>>>> Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
>>>>> RDRAND is awful.  I had hoped for better.
>>>> 
>>>> RDRAND can also fail.
>>>> 
>>>>> So perhaps we need a little percpu buffer that collects 64 bits of
>>>>> randomness at a time, shifts out the needed bits, and refills the
>>>>> buffer when we run out.
>>>> 
>>>> I'd like to avoid saving the _exact_ details of where the next offset
>>>> will be, but if nothing else works, this should be okay. We can use 8
>>>> bits at a time and call prandom_u32() every 4th call. Something like
>>>> prandom_bytes(), but where it doesn't throw away the unused bytes.
>>> 
>>> Actually I think this would make the end result even worse security-wise
>>> than simply using rdtsc() on every syscall. Saving the randomness in percpu
>>> buffer, which is probably easily accessible and can be probed if needed,
>>> would supply attacker with much more knowledge about the next 3-4
>>> random offsets that what he would get if we use "weak" rdtsc. Given
>>> that for a successful exploit, an attacker would need to have stack aligned
>>> once only, having a knowledge of 3-4 next offsets sounds like a present to an
>>> exploit writer...  Additionally it creates complexity around the code that I
>>> have issues justifying with "security" argument because of above...
> 
> That certainly solidifies my concern against saving randomness. :)
> 
>>> I have the patch now with alloca() and rdtsc() working, I can post it
>>> (albeit it is very simple), but I am really hesitating on adding the percpu
>>> buffer randomness storage to it...
>>> 
>> 
>> Hmm.  I guess it depends on what types of attack you care about.  I
>> bet that, if you do a bunch of iterations of mfence;rdtsc;syscall,
>> you'll discover that the offset between the user rdtsc and the
>> syscall's rdtsc has several values that occur with high probability.
> 
> How about rdtsc xor with the middle word of the stack canary? (to
> avoid the 0-byte) Something like:
> 
>    rdtsc
>    xorl [%gs:...canary....], %rax
>    andq  $__MAX_STACK_RANDOM_OFFSET, %rax
> 
> I need to look at the right way to reference the canary during that
> code. Andy might know off the top of his head. :)
> 

Doesn’t this just leak some of the canary to user code through side channels?


* Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-28 16:29           ` Andy Lutomirski
@ 2019-03-28 16:47             ` Kees Cook
  2019-03-29  7:50               ` Reshetova, Elena
  0 siblings, 1 reply; 22+ messages in thread
From: Kees Cook @ 2019-03-28 16:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Reshetova, Elena, Josh Poimboeuf, Jann Horn,
	Perla, Enrico, Ingo Molnar, Borislav Petkov, Thomas Gleixner,
	LKML, Peter Zijlstra, Greg KH

On Thu, Mar 28, 2019 at 9:29 AM Andy Lutomirski <luto@amacapital.net> wrote:
> Doesn’t this just leak some of the canary to user code through side channels?

Erf, yes, good point. Let's just use prandom and be done with it.

-- 
Kees Cook


* RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-28 16:47             ` Kees Cook
@ 2019-03-29  7:50               ` Reshetova, Elena
  0 siblings, 0 replies; 22+ messages in thread
From: Reshetova, Elena @ 2019-03-29  7:50 UTC (permalink / raw)
  To: 'Kees Cook', Andy Lutomirski
  Cc: Andy Lutomirski, Josh Poimboeuf, Jann Horn, Perla, Enrico,
	Ingo Molnar, Borislav Petkov, Thomas Gleixner, LKML,
	Peter Zijlstra, Greg KH

> On Thu, Mar 28, 2019 at 9:29 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > Doesn’t this just leak some of the canary to user code through side channels?
> 
> Erf, yes, good point. Let's just use prandom and be done with it.

And here I have some numbers on this. Actually, prandom turned out to be pretty
fast, even when called on every syscall. See the numbers below:

1) lmbench: ./lat_syscall -N 1000000 null
    base:                                              Simple syscall: 0.1774 microseconds
    random_offset (prandom_u32() every syscall):     Simple syscall: 0.1822 microseconds
    random_offset (prandom_u32() every 4th syscall): Simple syscall: 0.1844 microseconds

2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
    base:                                              10000000 loops in 1.62224s = 162.22 nsec / loop
    random_offset (prandom_u32() every syscall):     10000000 loops in 1.66260s = 166.26 nsec / loop
    random_offset (prandom_u32() every 4th syscall): 10000000 loops in 1.69300s = 169.30 nsec / loop

The second case is when prandom is called only once in 4 syscalls and the unused random
bits are preserved in a per-cpu buffer. As you can see, it is actually slower (modulo my
maybe-not-so-optimized code in prandom, see below) vs. calling it every time, so I would
vote for calling it every time, saving on the hassle, and avoiding additional code in
prandom.

And below is what I was calling instead of prandom_u32() to preserve random bits
(net_rand_state_buffer is a new per-cpu buffer I added to save the unused random bits).
I didn't include the check for bytes >= sizeof(u32), since this was just a PoC to test
the base speed, but for the generic case it would be needed.

+void prandom_bytes_preserve(void *buf, size_t bytes)
+{
+    u32 *buffer = &get_cpu_var(net_rand_state_buffer);
+    u8 *ptr = buf;
+
+    /* Refill the per-cpu buffer only once it has been exhausted.
+     * NB: as noted above, no handling yet for bytes >= sizeof(u32). */
+    if (!(*buffer)) {
+        struct rnd_state *state = &get_cpu_var(net_rand_state);
+
+        *buffer = prandom_u32_state(state);
+        put_cpu_var(net_rand_state);
+    }
+    while (bytes > 0) {
+        *ptr++ = (u8) *buffer;
+        bytes--;
+        *buffer >>= BITS_PER_BYTE;
+    }
+    put_cpu_var(net_rand_state_buffer);
+}

I will send the first version of the patch (calling prandom_u32() every time)
shortly, if anyone wants to double-check the performance implications.

Best Regards,
Elena.


* Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-04-05 10:14       ` Reshetova, Elena
@ 2019-04-05 13:14         ` Andy Lutomirski
  0 siblings, 0 replies; 22+ messages in thread
From: Andy Lutomirski @ 2019-04-05 13:14 UTC (permalink / raw)
  To: Reshetova, Elena
  Cc: Kees Cook, Andy Lutomirski, Kernel Hardening, Josh Poimboeuf,
	Jann Horn, Perla, Enrico, Ingo Molnar, Borislav Petkov,
	Thomas Gleixner, Peter Zijlstra, Greg KH



On Apr 5, 2019, at 4:14 AM, Reshetova, Elena <elena.reshetova@intel.com> wrote:

>> On Thu, Apr 4, 2019 at 4:41 AM Reshetova, Elena
>> <elena.reshetova@intel.com> wrote:
>>> What I still don't fully understand here (due to my little knowledge of
>>> compilers) and afraid of is that the asm code that alloca generates (see my version)
>>> and the alignment might differ on the different targets, etc.
>> 
>> I guess it's possible, but for x86_64, since appears to be consistent.
> 
> So, yes, I double checked this now with just printing all possible offsets I get for rsp
> from do_syscall_64, it is indeed 33 different offsets, so it is indeed more like 5 bits of entropy. 
> We can increase it, if we want and people are ok with losing a bit more stack space. 
> 
>> 
>>> If you tried it on yours, can you send me the asm code that it produced for you?
>>> Is it different from mine?
>> 
>> You can compare compiler outputs here. Here's gcc vs clang for this code:
>> https://godbolt.org/z/WJSbN8
>> You can adjust compiler versions, etc.
> 
> Oh, this is handy! Thank you for the link! 
> 
> 
> So, should I resend to lkml (with some cosmetic fixes) or how to proceed with this?
> I will also update the randomness bit info. 
> 
> 

Go ahead and send a new version, please.


* RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-04-04 17:03     ` Kees Cook
@ 2019-04-05 10:14       ` Reshetova, Elena
  2019-04-05 13:14         ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Reshetova, Elena @ 2019-04-05 10:14 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Kernel Hardening, Andy Lutomirski,
	Josh Poimboeuf, Jann Horn, Perla, Enrico, Ingo Molnar,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Greg KH

> On Thu, Apr 4, 2019 at 4:41 AM Reshetova, Elena
> <elena.reshetova@intel.com> wrote:
> > What I still don't fully understand here (due to my little knowledge of
> > compilers) and afraid of is that the asm code that alloca generates (see my version)
> > and the alignment might differ on the different targets, etc.
> 
> I guess it's possible, but for x86_64, since appears to be consistent.

So, yes, I double-checked this now by simply printing all possible offsets I get for rsp
from do_syscall_64: there are indeed 33 different offsets, so it is more like 5 bits of entropy.
We can increase it if we want, and if people are ok with losing a bit more stack space.
 
> 
> > If you tried it on yours, can you send me the asm code that it produced for you?
> > Is it different from mine?
> 
> You can compare compiler outputs here. Here's gcc vs clang for this code:
> https://godbolt.org/z/WJSbN8
> You can adjust compiler versions, etc.

Oh, this is handy! Thank you for the link! 


So, should I resend to lkml (with some cosmetic fixes), or how should we proceed with this?
I will also update the randomness bits info.

Best Regards,
Elena.


* Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-04-04 11:41   ` Reshetova, Elena
@ 2019-04-04 17:03     ` Kees Cook
  2019-04-05 10:14       ` Reshetova, Elena
  0 siblings, 1 reply; 22+ messages in thread
From: Kees Cook @ 2019-04-04 17:03 UTC (permalink / raw)
  To: Reshetova, Elena
  Cc: Andy Lutomirski, Kernel Hardening, Andy Lutomirski,
	Josh Poimboeuf, Jann Horn, Perla, Enrico, Ingo Molnar,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Greg KH

On Thu, Apr 4, 2019 at 4:41 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
> What I still don't fully understand here (due to my little knowledge of
> compilers) and afraid of is that the asm code that alloca generates (see my version)
> and the alignment might differ on the different targets, etc.

I guess it's possible, but for x86_64 it appears to be consistent.

> If you tried it on yours, can you send me the asm code that it produced for you?
> Is it different from mine?

You can compare compiler outputs here. Here's gcc vs clang for this code:
https://godbolt.org/z/WJSbN8
You can adjust compiler versions, etc.

-- 
Kees Cook


* RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-04-03 21:17 ` Kees Cook
@ 2019-04-04 11:41   ` Reshetova, Elena
  2019-04-04 17:03     ` Kees Cook
  0 siblings, 1 reply; 22+ messages in thread
From: Reshetova, Elena @ 2019-04-04 11:41 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Kernel Hardening, Andy Lutomirski,
	Josh Poimboeuf, Jann Horn, Perla, Enrico, Ingo Molnar,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Greg KH

> On Fri, Mar 29, 2019 at 1:14 AM Elena Reshetova
> <elena.reshetova@intel.com> wrote:
> > diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> > index 7bc105f47d21..28cb3687bf82 100644
> > --- a/arch/x86/entry/common.c
> > +++ b/arch/x86/entry/common.c
> > @@ -32,6 +32,10 @@
> >  #include <linux/uaccess.h>
> >  #include <asm/cpufeature.h>
> >
> > +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> > +#include <linux/random.h>
> > +#endif
> > +
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/syscalls.h>
> >
> > @@ -269,10 +273,22 @@ __visible inline void syscall_return_slowpath(struct
> pt_regs *regs)
> >  }
> >
> >  #ifdef CONFIG_X86_64
> > +
> > +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> > +void *alloca(size_t size);
> > +#endif
> > +
> >  __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
> >  {
> >         struct thread_info *ti;
> >
> > +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> > +       size_t offset = ((size_t)prandom_u32()) % 256;
> > +       char *ptr = alloca(offset);
> > +
> > +       asm volatile("":"=m"(*ptr));
> > +#endif
> > +
> >         enter_from_user_mode();
> >         local_irq_enable();
> >         ti = current_thread_info();
> 
> Well this is delightfully short! 

Yes :) Looks like when you are allowed to use forbidden APIs, life suddenly
becomes much easier :) 

> The alloca() definition could even be
> moved up after the #include of random.h, just to reduce the number of
> #ifdef lines, too.

Sure, can do this. 

> I patched getpid() to report stack locations for a
> given pid, just to get a sense of the entropy. On 10,000 getpid()
> calls I see counts like:
> 
>     229  ffffa58240697dbc
>     294  ffffa58240697dc4
>     315  ffffa58240697dcc
>     298  ffffa58240697dd4
>     335  ffffa58240697ddc
>     311  ffffa58240697de4
>     295  ffffa58240697dec
>     303  ffffa58240697df4
>     334  ffffa58240697dfc
>     331  ffffa58240697e04
>     321  ffffa58240697e0c
>     298  ffffa58240697e14
>     290  ffffa58240697e1c
>     306  ffffa58240697e24
>     308  ffffa58240697e2c
>     325  ffffa58240697e34
>     301  ffffa58240697e3c
>     336  ffffa58240697e44
>     328  ffffa58240697e4c
>     326  ffffa58240697e54
>     314  ffffa58240697e5c
>     305  ffffa58240697e64
>     315  ffffa58240697e6c
>     325  ffffa58240697e74
>     287  ffffa58240697e7c
>     319  ffffa58240697e84
>     309  ffffa58240697e8c
>     329  ffffa58240697e94
>     311  ffffa58240697e9c
>     306  ffffa58240697ea4
>     313  ffffa58240697eac
>     289  ffffa58240697eb4
>      94  ffffa58240697ebc
> 
> So it looks more like 5 bits of entropy in practice (here are 33
> unique stack locations), but that still looks good to me.

What I still don't fully understand here (due to my limited knowledge of
compilers), and am afraid of, is that the asm code that alloca generates (see my version)
and the alignment might differ on different targets, etc.
If you tried it on yours, can you send me the asm code that it produced for you?
Is it different from mine? 

> 
> Can you send the next version with a CC to lkml too?

I was thinking of not spamming lkml before we reach some agreement here, but
I can do it if people believe this is the right way. 

Getting Andy's feedback on this version first would be great! 

Best Regards,
Elena.


* Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-29  8:13 Elena Reshetova
@ 2019-04-03 21:17 ` Kees Cook
  2019-04-04 11:41   ` Reshetova, Elena
  0 siblings, 1 reply; 22+ messages in thread
From: Kees Cook @ 2019-04-03 21:17 UTC (permalink / raw)
  To: Elena Reshetova
  Cc: Andy Lutomirski, Kernel Hardening, Andy Lutomirski,
	Josh Poimboeuf, Jann Horn, Perla, Enrico, Ingo Molnar,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Greg KH

On Fri, Mar 29, 2019 at 1:14 AM Elena Reshetova
<elena.reshetova@intel.com> wrote:
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 7bc105f47d21..28cb3687bf82 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -32,6 +32,10 @@
>  #include <linux/uaccess.h>
>  #include <asm/cpufeature.h>
>
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> +#include <linux/random.h>
> +#endif
> +
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/syscalls.h>
>
> @@ -269,10 +273,22 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
>  }
>
>  #ifdef CONFIG_X86_64
> +
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> +void *alloca(size_t size);
> +#endif
> +
>  __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>  {
>         struct thread_info *ti;
>
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> +       size_t offset = ((size_t)prandom_u32()) % 256;
> +       char *ptr = alloca(offset);
> +
> +       asm volatile("":"=m"(*ptr));
> +#endif
> +
>         enter_from_user_mode();
>         local_irq_enable();
>         ti = current_thread_info();

Well this is delightfully short! The alloca() definition could even be
moved up after the #include of random.h, just to reduce the number of
#ifdef lines, too. I patched getpid() to report stack locations for a
given pid, just to get a sense of the entropy. On 10,000 getpid()
calls I see counts like:

    229  ffffa58240697dbc
    294  ffffa58240697dc4
    315  ffffa58240697dcc
    298  ffffa58240697dd4
    335  ffffa58240697ddc
    311  ffffa58240697de4
    295  ffffa58240697dec
    303  ffffa58240697df4
    334  ffffa58240697dfc
    331  ffffa58240697e04
    321  ffffa58240697e0c
    298  ffffa58240697e14
    290  ffffa58240697e1c
    306  ffffa58240697e24
    308  ffffa58240697e2c
    325  ffffa58240697e34
    301  ffffa58240697e3c
    336  ffffa58240697e44
    328  ffffa58240697e4c
    326  ffffa58240697e54
    314  ffffa58240697e5c
    305  ffffa58240697e64
    315  ffffa58240697e6c
    325  ffffa58240697e74
    287  ffffa58240697e7c
    319  ffffa58240697e84
    309  ffffa58240697e8c
    329  ffffa58240697e94
    311  ffffa58240697e9c
    306  ffffa58240697ea4
    313  ffffa58240697eac
    289  ffffa58240697eb4
     94  ffffa58240697ebc

So it looks more like 5 bits of entropy in practice (here are 33
unique stack locations), but that still looks good to me.

Can you send the next version with a CC to lkml too?

Andy, Thomas, how does this look to you?

-- 
Kees Cook


* [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
@ 2019-03-29  8:13 Elena Reshetova
  2019-04-03 21:17 ` Kees Cook
  0 siblings, 1 reply; 22+ messages in thread
From: Elena Reshetova @ 2019-03-29  8:13 UTC (permalink / raw)
  To: luto
  Cc: kernel-hardening, luto, jpoimboe, keescook, jannh, enrico.perla,
	mingo, bp, tglx, peterz, gregkh, Elena Reshetova

If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
the kernel stack offset is randomized upon each
entry to a system call, after the fixed location
of the pt_regs struct.

This feature is based on the original idea from
the PaX's RANDKSTACK feature:
https://pax.grsecurity.net/docs/randkstack.txt
All credit for the original idea goes to the PaX team.
However, the design and implementation of
RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
feature (see below).

Reasoning for the feature:

This feature aims to make various stack-based
attacks that rely on a deterministic stack
structure considerably harder.
We have had many such attacks in the past [1],[2],[3]
(just to name a few), and as Linux kernel stack protections
have been constantly improving (vmap-based stack
allocation with guard pages, removal of thread_info,
STACKLEAK), attackers have to find new ways for their
exploits to work.

It is important to note that we currently cannot show
a concrete attack that would be stopped by this new
feature (given that other existing stack protections
are enabled), so this is an attempt to be on a proactive
side vs. catching up with existing successful exploits.

The main idea is that since the stack offset is
randomized upon each system call, it is very hard for
an attacker to reliably land in any particular place on
the thread stack when an attack is performed.
Also, since randomization is performed *after* pt_regs,
the ptrace-based approach of discovering the randomized
offset during a long-running syscall should not be
possible.

[1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
[2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
[3] googleprojectzero.blogspot.com/2016/06/exploiting-
recursion-in-linux-kernel_20.html

Design description:

During most of the kernel's execution, it runs on the "thread
stack", which is allocated at fork.c/dup_task_struct() and stored in
a per-task variable (tsk->stack). Since the stack grows downward,
the stack top can always be calculated using the task_top_of_stack(tsk)
function, which essentially returns the address of tsk->stack + stack
size. When VMAP_STACK is enabled, the thread stack is allocated from
vmalloc space.

The thread stack is pretty deterministic in its structure: it is
fixed in size, and upon every entry from userspace to the kernel on
a syscall the thread stack starts to be constructed from an
address fetched from a per-cpu cpu_current_top_of_stack variable.
The first element to be pushed to the thread stack is the pt_regs struct
that stores all required CPU registers and syscall parameters.

The goal of the RANDOMIZE_KSTACK_OFFSET feature is to add a random
offset between pt_regs and the rest of the thread stack (used during
syscall processing) every time a process issues a syscall. The source
of randomness is the prandom_u32() pseudo-random generator (not
cryptographically secure). The offset is added using an alloca() call,
since that helps avoid changes to the assembly syscall entry code and
the unwinder. I am not greatly happy about the generated assembly code
(but I don't know how to force gcc to produce anything better):

...
	size_t offset = ((size_t)prandom_u32()) % 256;
	char * ptr = alloca(offset);
0xffffffff8100426d add    $0x16,%rax
0xffffffff81004271 and    $0x1f8,%eax
0xffffffff81004276 sub    %rax,%rsp
0xffffffff81004279 lea    0xf(%rsp),%rax
0xffffffff8100427e and    $0xfffffffffffffff0,%rax
	asm volatile("":"=m"(*ptr));
...

As a result of the above gcc-produced code, this patch introduces 6 bits
of randomness (bits 3-8 are randomized; bits 0-2 are zero due to stack
alignment) after the pt_regs location on the thread stack.
The amount of randomness can be adjusted based on how much of the
stack space we wish/can trade for security.

Performance:

1) lmbench: ./lat_syscall -N 1000000 null
    base:                                        Simple syscall: 0.1774 microseconds
    random_offset (prandom_u32() every syscall): Simple syscall: 0.1822 microseconds

2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
    base:                                        10000000 loops in 1.62224s = 162.22 nsec / loop
    random_offset (prandom_u32() every syscall): 10000000 loops in 1.64660s = 166.26 nsec / loop

Comparison to grsecurity RANDKSTACK feature:

The RANDKSTACK feature randomizes the location of the stack start
(cpu_current_top_of_stack), i.e. the location of the pt_regs structure
itself on the stack. Initially this patch followed the same approach,
but during recent discussions [4] that was determined
to be of little value since, if ptrace functionality is available
to an attacker, he can use the PTRACE_PEEKUSR/PTRACE_POKEUSR API to read/write
different offsets in the pt_regs struct, observe the cache
behavior of the pt_regs accesses, and figure out the random stack offset.

Another big difference is that randomization is done upon
syscall entry rather than upon exit, as is done in RANDKSTACK.

Also, as a result of the above two differences, the implementation
of RANDKSTACK and RANDOMIZE_KSTACK_OFFSET has nothing in common.

[4] https://www.openwall.com/lists/kernel-hardening/2019/02/08/6

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
---
 arch/Kconfig            | 15 +++++++++++++++
 arch/x86/Kconfig        |  1 +
 arch/x86/entry/common.c | 16 ++++++++++++++++
 3 files changed, 32 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4cfb6de48f79..9a2557b0cfce 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -808,6 +808,21 @@ config VMAP_STACK
 	  the stack to map directly to the KASAN shadow map using a formula
 	  that is incorrect if the stack is in vmalloc space.
 
+config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stack
+	  offset randomization.
+
+config RANDOMIZE_KSTACK_OFFSET
+	default n
+	bool "Randomize kernel stack offset on syscall entry"
+	depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	help
+	  Enable this if you want to randomize the kernel stack offset upon
+	  each syscall entry. This causes the kernel stack (after pt_regs) to
+	  have a randomized offset upon executing each system call.
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ade12ec4224b..5edcae945b73 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -131,6 +131,7 @@ config X86
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
 	select HAVE_ARCH_VMAP_STACK		if X86_64
+	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET  if X86_64
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7bc105f47d21..28cb3687bf82 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -32,6 +32,10 @@
 #include <linux/uaccess.h>
 #include <asm/cpufeature.h>
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+#include <linux/random.h>
+#endif
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
@@ -269,10 +273,22 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 }
 
 #ifdef CONFIG_X86_64
+
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+void *alloca(size_t size);
+#endif
+
 __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
 	struct thread_info *ti;
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	size_t offset = ((size_t)prandom_u32()) % 256;
+	char *ptr = alloca(offset);
+
+	asm volatile("":"=m"(*ptr));
+#endif
+
 	enter_from_user_mode();
 	local_irq_enable();
 	ti = current_thread_info();
-- 
2.17.1


* RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
  2019-03-20  7:27 Elena Reshetova
@ 2019-03-20  7:29 ` Reshetova, Elena
  0 siblings, 0 replies; 22+ messages in thread
From: Reshetova, Elena @ 2019-03-20  7:29 UTC (permalink / raw)
  To: luto
  Cc: kernel-hardening, luto, jpoimboe, keescook, jannh, Perla, Enrico,
	mingo, bp, tglx, peterz, gregkh

My apologies for the double posting: I just realized today that I used my other template to send this RFC, so it went to lkml and not kernel-hardening, where it should have gone in the first place. 

> -----Original Message-----
> From: Reshetova, Elena
> Sent: Wednesday, March 20, 2019 9:27 AM
> To: luto@kernel.org
> Cc: kernel-hardening@lists.openwall.com; luto@amacapital.net;
> jpoimboe@redhat.com; keescook@chromium.org; jannh@google.com; Perla,
> Enrico <enrico.perla@intel.com>; mingo@redhat.com; bp@alien8.de;
> tglx@linutronix.de; peterz@infradead.org; gregkh@linuxfoundation.org; Reshetova,
> Elena <elena.reshetova@intel.com>
> Subject: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
> 
> If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> the kernel stack offset is randomized upon each
> entry to a system call after fixed location of pt_regs
> struct.
> 
> This feature is based on the original idea from
> the PaX's RANDKSTACK feature:
> https://pax.grsecurity.net/docs/randkstack.txt
> All the credits for the original idea goes to the PaX team.
> However, the design and implementation of
> RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
> feature (see below).
> 
> Reasoning for the feature:
> 
> This feature aims to make considerably harder various
> stack-based attacks that rely on deterministic stack
> structure.
> We have had many of such attacks in past [1],[2],[3]
> (just to name few), and as Linux kernel stack protections
> have been constantly improving (vmap-based stack
> allocation with guard pages, removal of thread_info,
> STACKLEAK), attackers have to find new ways for their
> exploits to work.
> 
> It is important to note that we currently cannot show
> a concrete attack that would be stopped by this new
> feature (given that other existing stack protections
> are enabled), so this is an attempt to be on a proactive
> side vs. catching up with existing successful exploits.
> 
> The main idea is that since the stack offset is
> randomized upon each system call, it is very hard for
> attacker to reliably land in any particular place on
> the thread stack when attack is performed.
> Also, since randomization is performed *after* pt_regs,
> the ptrace-based approach to discover randomization
> offset during a long-running syscall should not be
> possible.
> 
> [1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> [2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> [3] googleprojectzero.blogspot.com/2016/06/exploiting-
> recursion-in-linux-kernel_20.html
> 
> Design description:
> 
> During most of the kernel's execution, it runs on the "thread
> stack", which is allocated at fork.c/dup_task_struct() and stored in
> a per-task variable (tsk->stack). Since stack is growing downward,
> the stack top can be always calculated using task_top_of_stack(tsk)
> function, which essentially returns an address of tsk->stack + stack
> size. When VMAP_STACK is enabled, the thread stack is allocated from
> vmalloc space.
> 
> Thread stack is pretty deterministic on its structure - fixed in size,
> and upon every entry from a userspace to kernel on a
> syscall the thread stack is started to be constructed from an
> address fetched from a per-cpu cpu_current_top_of_stack variable.
> The first element to be pushed to the thread stack is the pt_regs struct
> that stores all required CPU registers and sys call parameters.
> 
> The goal of RANDOMIZE_KSTACK_OFFSET feature is to add a random offset
> after the pt_regs has been pushed to the stack and the rest of thread
> stack (used during the syscall processing) every time a process issues
> a syscall. The source of randomness can be taken either from rdtsc or
> rdrand with performance implications listed below. The value of random
> offset is stored in a callee-saved register (r15 currently) and the
> maximum size of random offset is defined by __MAX_STACK_RANDOM_OFFSET
> value, which currently equals to 0xFF0.
> 
> As a result this patch introduces 8 bits of randomness
> (bits 4 - 11 are randomized, bits 0-3 must be zero due to stack alignment)
> after pt_regs location on the thread stack.
> The amount of randomness can be adjusted based on how much of the
> stack space we wish/can trade for security.
> 
> The main issue with this approach is that it slightly breaks the
> processing of last frame in the unwinder, so I have made a simple
> fix to the frame pointer unwinder (I guess others should be fixed
> similarly) and stack dump functionality to "jump" over the random hole
> at the end. My way of solving this is probably far from ideal,
> so I would really appreciate feedback on how to improve it.
> 
> Performance:
> 
> 1) lmbench: ./lat_syscall -N 1000000 null
>     base:                     Simple syscall: 0.1774 microseconds
>     random_offset (rdtsc):     Simple syscall: 0.1803 microseconds
>     random_offset (rdrand): Simple syscall: 0.3702 microseconds
> 
> 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
>     base:                     10000000 loops in 1.62224s = 162.22 nsec / loop
>     random_offset (rdtsc):     10000000 loops in 1.64660s = 164.66 nsec / loop
>     random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
> 
> Comparison to grsecurity RANDKSTACK feature:
> 
> RANDKSTACK feature randomizes the location of the stack start
> (cpu_current_top_of_stack), i.e. location of pt_regs structure
> itself on the stack. Initially this patch followed the same approach,
> but during the recent discussions [4], it has been determined
> to be of a little value since, if ptrace functionality is available
> for an attacker, he can use PTRACE_PEEKUSR/PTRACE_POKEUSR api to read/write
> different offsets in the pt_regs struct, observe the cache
> behavior of the pt_regs accesses, and figure out the random stack offset.
> 
> Another big difference is that randomization is done upon
> syscall entry and not the exit, as with RANDKSTACK.
> 
> Also, as a result of the above two differences, the implementation
> of RANDKSTACK and RANDOMIZE_KSTACK_OFFSET has nothing in common.
> 
> [4] https://www.openwall.com/lists/kernel-hardening/2019/02/08/6
> 
> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
> ---
>  arch/Kconfig                   | 15 +++++++++++++++
>  arch/x86/Kconfig               |  1 +
>  arch/x86/entry/calling.h       | 14 ++++++++++++++
>  arch/x86/entry/entry_64.S      |  6 ++++++
>  arch/x86/include/asm/frame.h   |  3 +++
>  arch/x86/kernel/dumpstack.c    | 10 +++++++++-
>  arch/x86/kernel/unwind_frame.c |  9 ++++++++-
>  7 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 4cfb6de48f79..9a2557b0cfce 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -808,6 +808,21 @@ config VMAP_STACK
>  	  the stack to map directly to the KASAN shadow map using a formula
>  	  that is incorrect if the stack is in vmalloc space.
> 
> +config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> +	def_bool n
> +	help
> +	  An arch should select this symbol if it can support kernel stack
> +	  offset randomization.
> +
> +config RANDOMIZE_KSTACK_OFFSET
> +	default n
> +	bool "Randomize kernel stack offset on syscall entry"
> +	depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> +	help
> +	  Enable this if you want the randomize kernel stack offset upon
> +	  each syscall entry. This causes kernel stack (after pt_regs) to
> +	  have a randomized offset upon executing each system call.
> +
>  config ARCH_OPTIONAL_KERNEL_RWX
>  	def_bool n
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index ade12ec4224b..5edcae945b73 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -131,6 +131,7 @@ config X86
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
>  	select HAVE_ARCH_VMAP_STACK		if X86_64
> +	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET  if X86_64
>  	select HAVE_ARCH_WITHIN_STACK_FRAMES
>  	select HAVE_CMPXCHG_DOUBLE
>  	select HAVE_CMPXCHG_LOCAL
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index efb0d1b1f15f..68502645d812 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -345,6 +345,20 @@ For 32-bit we have the following conventions - kernel is
> built with
>  #endif
>  .endm
> 
> +.macro RANDOMIZE_KSTACK
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> +	/* prepare a random offset in rax */
> +	pushq %rax
> +	xorq  %rax, %rax
> +	ALTERNATIVE "rdtsc", "rdrand %rax", X86_FEATURE_RDRAND
> +	andq  $__MAX_STACK_RANDOM_OFFSET, %rax
> +
> +	/* store offset in r15 */
> +	movq  %rax, %r15
> +	popq  %rax
> +#endif
> +.endm
> +
>  /*
>   * This does 'call enter_from_user_mode' unless we can avoid it based on
>   * kernel config or using the static jump infrastructure.
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 1f0efdb7b629..0816ec680c21 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -167,13 +167,19 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
> 
>  	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
> 
> +	RANDOMIZE_KSTACK		/* stores randomized
> offset in r15 */
> +
>  	TRACE_IRQS_OFF
> 
>  	/* IRQs are off. */
>  	movq	%rax, %rdi
>  	movq	%rsp, %rsi
> +	sub 	%r15, %rsp          /* substitute random offset from rsp
> */
>  	call	do_syscall_64		/* returns with IRQs
> disabled */
> 
> +	/* need to restore the gap */
> +	add 	%r15, %rsp       /* add random offset back to rsp */
> +
>  	TRACE_IRQS_IRETQ		/* we're about to
> change IF */
> 
>  	/*
> diff --git a/arch/x86/include/asm/frame.h b/arch/x86/include/asm/frame.h
> index 5cbce6fbb534..e1bb91504f6e 100644
> --- a/arch/x86/include/asm/frame.h
> +++ b/arch/x86/include/asm/frame.h
> @@ -4,6 +4,9 @@
> 
>  #include <asm/asm.h>
> 
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> +#define __MAX_STACK_RANDOM_OFFSET 0xFF0
> +#endif
>  /*
>   * These are stack frame creation macros.  They should be used by every
>   * callable non-leaf asm function to make kernel stack traces more reliable.
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index 2b5886401e5f..4146a4c3e9c6 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -192,7 +192,6 @@ void show_trace_log_lvl(struct task_struct *task, struct
> pt_regs *regs,
>  	 */
>  	for ( ; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
>  		const char *stack_name;
> -
>  		if (get_stack_info(stack, task, &stack_info,
> &visit_mask)) {
>  			/*
>  			 * We weren't on a valid stack.  It's
> possible that
> @@ -224,6 +223,9 @@ void show_trace_log_lvl(struct task_struct *task, struct
> pt_regs *regs,
>  		 */
>  		for (; stack < stack_info.end; stack++) {
>  			unsigned long real_addr;
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> +			unsigned long left_gap;
> +#endif
>  			int reliable = 0;
>  			unsigned long addr =
> READ_ONCE_NOCHECK(*stack);
>  			unsigned long *ret_addr_p =
> @@ -272,6 +274,12 @@ void show_trace_log_lvl(struct task_struct *task, struct
> pt_regs *regs,
>  			regs = unwind_get_entry_regs(&state,
> &partial);
>  			if (regs)
> 
> 	show_regs_if_on_stack(&stack_info, regs, partial);
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> +			left_gap = (unsigned long)regs -
> (unsigned long)stack;
> +			/* if we reached last frame, jump over
> the random gap*/
> +			if (left_gap <
> __MAX_STACK_RANDOM_OFFSET)
> +				stack = (unsigned long
> *)regs--;
> +#endif
>  		}
> 
>  		if (stack_name)
> diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> index 3dc26f95d46e..656f36b1f1b3 100644
> --- a/arch/x86/kernel/unwind_frame.c
> +++ b/arch/x86/kernel/unwind_frame.c
> @@ -98,7 +98,14 @@ static inline unsigned long *last_frame(struct unwind_state
> *state)
> 
>  static bool is_last_frame(struct unwind_state *state)
>  {
> -	return state->bp == last_frame(state);
> +	if (state->bp == last_frame(state))
> +		return true;
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> +	if ((last_frame(state) - state->bp) < __MAX_STACK_RANDOM_OFFSET)
> +		return true;
> +#endif
> +	return false;
> +
>  }
> 
>  #ifdef CONFIG_X86_32
> --
> 2.17.1


* [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
@ 2019-03-20  7:27 Elena Reshetova
  2019-03-20  7:29 ` Reshetova, Elena
  0 siblings, 1 reply; 22+ messages in thread
From: Elena Reshetova @ 2019-03-20  7:27 UTC (permalink / raw)
  To: luto
  Cc: kernel-hardening, luto, jpoimboe, keescook, jannh, enrico.perla,
	mingo, bp, tglx, peterz, gregkh, Elena Reshetova

If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
the kernel stack offset is randomized upon each
entry to a system call, after the fixed location
of the pt_regs struct.

This feature is based on the original idea from
the PaX's RANDKSTACK feature:
https://pax.grsecurity.net/docs/randkstack.txt
All credit for the original idea goes to the PaX team.
However, the design and implementation of
RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
feature (see below).

Reasoning for the feature:

This feature aims to make various stack-based
attacks that rely on a deterministic stack
structure considerably harder.
We have had many such attacks in the past [1],[2],[3]
(just to name a few), and as Linux kernel stack protections
have been constantly improving (vmap-based stack
allocation with guard pages, removal of thread_info,
STACKLEAK), attackers have to find new ways for their
exploits to work.

It is important to note that we currently cannot show
a concrete attack that would be stopped by this new
feature (given that other existing stack protections
are enabled), so this is an attempt to be on a proactive
side vs. catching up with existing successful exploits.

The main idea is that since the stack offset is
randomized upon each system call, it is very hard for
an attacker to reliably land in any particular place on
the thread stack when an attack is performed.
Also, since randomization is performed *after* pt_regs,
the ptrace-based approach of discovering the randomized
offset during a long-running syscall should not be
possible.

[1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
[2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
[3] googleprojectzero.blogspot.com/2016/06/exploiting-
recursion-in-linux-kernel_20.html

Design description:

During most of the kernel's execution, it runs on the "thread
stack", which is allocated at fork.c/dup_task_struct() and stored in
a per-task variable (tsk->stack). Since the stack grows downward,
the stack top can always be calculated using the task_top_of_stack(tsk)
function, which essentially returns the address of tsk->stack + stack
size. When VMAP_STACK is enabled, the thread stack is allocated from
vmalloc space.

The thread stack is pretty deterministic in its structure: it is
fixed in size, and upon every entry from userspace to the kernel on
a syscall the thread stack starts to be constructed from an
address fetched from a per-cpu cpu_current_top_of_stack variable.
The first element to be pushed to the thread stack is the pt_regs struct
that stores all required CPU registers and syscall parameters.

The goal of the RANDOMIZE_KSTACK_OFFSET feature is to add a random
offset between pt_regs and the rest of the thread stack (used during
syscall processing) every time a process issues a syscall. The source
of randomness can be taken either from rdtsc or rdrand, with the
performance implications listed below. The value of the random
offset is stored in a callee-saved register (currently r15) and the
maximum size of the random offset is defined by the __MAX_STACK_RANDOM_OFFSET
value, which currently equals 0xFF0.

As a result, this patch introduces 8 bits of randomness
(bits 4-11 are randomized; bits 0-3 must be zero due to stack alignment)
after the pt_regs location on the thread stack.
The amount of randomness can be adjusted based on how much of the
stack space we wish/can trade for security.

The main issue with this approach is that it slightly breaks the
processing of the last frame in the unwinder, so I have made a simple
fix to the frame pointer unwinder (I guess others should be fixed
similarly) and to the stack dump functionality to "jump" over the random
hole at the end. My way of solving this is probably far from ideal,
so I would really appreciate feedback on how to improve it.

Performance:

1) lmbench: ./lat_syscall -N 1000000 null
    base:                     Simple syscall: 0.1774 microseconds
    random_offset (rdtsc):     Simple syscall: 0.1803 microseconds
    random_offset (rdrand): Simple syscall: 0.3702 microseconds

2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
    base:                     10000000 loops in 1.62224s = 162.22 nsec / loop
    random_offset (rdtsc):     10000000 loops in 1.64660s = 164.66 nsec / loop
    random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop

Comparison to grsecurity RANDKSTACK feature:

The RANDKSTACK feature randomizes the location of the stack start
(cpu_current_top_of_stack), i.e. the location of the pt_regs structure
itself on the stack. Initially this patch followed the same approach,
but during recent discussions [4] that was determined
to be of little value since, if ptrace functionality is available
to an attacker, he can use the PTRACE_PEEKUSR/PTRACE_POKEUSR API to read/write
different offsets in the pt_regs struct, observe the cache
behavior of the pt_regs accesses, and figure out the random stack offset.

Another big difference is that randomization is done upon
syscall entry rather than exit, as RANDKSTACK does.

Also, as a result of the above two differences, the implementation
of RANDKSTACK and RANDOMIZE_KSTACK_OFFSET has nothing in common.

[4] https://www.openwall.com/lists/kernel-hardening/2019/02/08/6

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
---
 arch/Kconfig                   | 15 +++++++++++++++
 arch/x86/Kconfig               |  1 +
 arch/x86/entry/calling.h       | 14 ++++++++++++++
 arch/x86/entry/entry_64.S      |  6 ++++++
 arch/x86/include/asm/frame.h   |  3 +++
 arch/x86/kernel/dumpstack.c    | 10 +++++++++-
 arch/x86/kernel/unwind_frame.c |  9 ++++++++-
 7 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4cfb6de48f79..9a2557b0cfce 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -808,6 +808,21 @@ config VMAP_STACK
 	  the stack to map directly to the KASAN shadow map using a formula
 	  that is incorrect if the stack is in vmalloc space.
 
+config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stack
+	  offset randomization.
+
+config RANDOMIZE_KSTACK_OFFSET
+	default n
+	bool "Randomize kernel stack offset on syscall entry"
+	depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	help
+	  Enable this if you want a randomized kernel stack offset upon
+	  each syscall entry. This causes the kernel stack (after pt_regs)
+	  to have a randomized offset upon executing each system call.
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ade12ec4224b..5edcae945b73 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -131,6 +131,7 @@ config X86
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
 	select HAVE_ARCH_VMAP_STACK		if X86_64
+	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET  if X86_64
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index efb0d1b1f15f..68502645d812 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -345,6 +345,20 @@ For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
+.macro RANDOMIZE_KSTACK
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	/* prepare a random offset in rax */
+	pushq %rax
+	xorq  %rax, %rax
+	ALTERNATIVE "rdtsc", "rdrand %rax", X86_FEATURE_RDRAND
+	andq  $__MAX_STACK_RANDOM_OFFSET, %rax
+
+	/* store offset in r15 */
+	movq  %rax, %r15
+	popq  %rax
+#endif
+.endm
+
 /*
  * This does 'call enter_from_user_mode' unless we can avoid it based on
  * kernel config or using the static jump infrastructure.
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..0816ec680c21 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -167,13 +167,19 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 
 	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
 
+	RANDOMIZE_KSTACK		/* stores randomized offset in r15 */
+
 	TRACE_IRQS_OFF
 
 	/* IRQs are off. */
 	movq	%rax, %rdi
 	movq	%rsp, %rsi
+	sub 	%r15, %rsp          /* subtract random offset from rsp */
 	call	do_syscall_64		/* returns with IRQs disabled */
 
+	/* need to restore the gap */
+	add 	%r15, %rsp       /* add random offset back to rsp */
+
 	TRACE_IRQS_IRETQ		/* we're about to change IF */
 
 	/*
diff --git a/arch/x86/include/asm/frame.h b/arch/x86/include/asm/frame.h
index 5cbce6fbb534..e1bb91504f6e 100644
--- a/arch/x86/include/asm/frame.h
+++ b/arch/x86/include/asm/frame.h
@@ -4,6 +4,9 @@
 
 #include <asm/asm.h>
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+#define __MAX_STACK_RANDOM_OFFSET 0xFF0
+#endif
 /*
  * These are stack frame creation macros.  They should be used by every
  * callable non-leaf asm function to make kernel stack traces more reliable.
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 2b5886401e5f..4146a4c3e9c6 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -192,7 +192,6 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 */
 	for ( ; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
 		const char *stack_name;
-
 		if (get_stack_info(stack, task, &stack_info, &visit_mask)) {
 			/*
 			 * We weren't on a valid stack.  It's possible that
@@ -224,6 +223,9 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		 */
 		for (; stack < stack_info.end; stack++) {
 			unsigned long real_addr;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+			unsigned long left_gap;
+#endif
 			int reliable = 0;
 			unsigned long addr = READ_ONCE_NOCHECK(*stack);
 			unsigned long *ret_addr_p =
@@ -272,6 +274,12 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 			regs = unwind_get_entry_regs(&state, &partial);
 			if (regs)
 				show_regs_if_on_stack(&stack_info, regs, partial);
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+			left_gap = (unsigned long)regs - (unsigned long)stack;
+			/* if we reached the last frame, jump over the random gap */
+			if (left_gap < __MAX_STACK_RANDOM_OFFSET)
+				stack = (unsigned long *)regs--;
+#endif
 		}
 
 		if (stack_name)
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index 3dc26f95d46e..656f36b1f1b3 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -98,7 +98,14 @@ static inline unsigned long *last_frame(struct unwind_state *state)
 
 static bool is_last_frame(struct unwind_state *state)
 {
-	return state->bp == last_frame(state);
+	if (state->bp == last_frame(state))
+		return true;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	if ((last_frame(state) - state->bp) < __MAX_STACK_RANDOM_OFFSET)
+		return true;
+#endif
+	return false;
+
 }
 
 #ifdef CONFIG_X86_32
-- 
2.17.1
