[GIT,PULL] Stack randomization fix
diff mbox series

Message ID 20210515073453.GA78379@gmail.com
State Accepted
Commit 2515dd6ce8e545b0b2eece84920048ef9ed846c4
Headers show
Series
  • [GIT,PULL] Stack randomization fix
Related show

Commit Message

Ingo Molnar May 15, 2021, 7:34 a.m. UTC
Linus,

Please pull the latest core/urgent git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-urgent-2021-05-15

   # HEAD: 2515dd6ce8e545b0b2eece84920048ef9ed846c4 stack: Replace "o" output with "r" input constraint

Fix an assembly constraint that affected LLVM up to version 12.

 Thanks,

	Ingo

------------------>
Nick Desaulniers (1):
      stack: Replace "o" output with "r" input constraint


 include/linux/randomize_kstack.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Linus Torvalds May 15, 2021, 5:13 p.m. UTC | #1
On Sat, May 15, 2021 at 12:35 AM Ingo Molnar <mingo@kernel.org> wrote:
>
> --- a/include/linux/randomize_kstack.h
> +++ b/include/linux/randomize_kstack.h
> @@ -38,7 +38,7 @@ void *__builtin_alloca(size_t size);
>                 /* Keep allocation even after "ptr" loses scope. */     \
> -               asm volatile("" : "=o"(*ptr) :: "memory");              \
> +               asm volatile("" :: "r"(ptr) : "memory");                \
>         }                                                               \

Side note: at some point, a compiler will (correctly) decide that
regardless of the inline asm here, the scope of that allocation is
only that one block.

To be actually reliable, I suspect that add_random_kstack_offset()
should return that pointer as a cookie, and then users should have a

     end_random_kstack_offset(cookie);

at the end of the function that did the add_random_kstack_offset().

And that end_random_kstack_offset() should have *another* "asm
volatile" that uses the pointer.

Note that you need both of the asm volatile barriers: without the
first one, the compiler might decide to move the allocation down, and
without the second one, the compiler might decide the allocation is
out of scope and can be undone before the actual system call or
whatever is done.

Honestly, I don't personally much care (I think this is all stupid
overhead, and think clearing the stack is much better), and I suspect
that current compilers always end up cleaning up alloca at function
exit, so this probably all _works_ as-is. But it's technically fragile
and wrong.

          Linus
pr-tracker-bot@kernel.org May 15, 2021, 5:55 p.m. UTC | #2
The pull request you sent on Sat, 15 May 2021 09:34:53 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-urgent-2021-05-15

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/91b7a0f0637c14ce0d252111cf9bca3830e16593

Thank you!
Ingo Molnar May 16, 2021, 7:29 a.m. UTC | #3
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, May 15, 2021 at 12:35 AM Ingo Molnar <mingo@kernel.org> wrote:
> >
> > --- a/include/linux/randomize_kstack.h
> > +++ b/include/linux/randomize_kstack.h
> > @@ -38,7 +38,7 @@ void *__builtin_alloca(size_t size);
> >                 /* Keep allocation even after "ptr" loses scope. */     \
> > -               asm volatile("" : "=o"(*ptr) :: "memory");              \
> > +               asm volatile("" :: "r"(ptr) : "memory");                \
> >         }                                                               \
> 
> Side note: at some point, a compiler will (correctly) decide that
> regardless of the inline asm here, the scope of that allocation is
> only that one block.
> 
> To be actually reliable, I suspect that add_random_kstack_offset()
> should return that pointer as a cookie, and then users should have a
> 
>      end_random_kstack_offset(cookie);
> 
> at the end of the function that did the add_random_kstack_offset().
> 
> And that end_random_kstack_offset() should have *another* "asm
> volatile" that uses the pointer.

Indeed.

> Note that you need both of the asm volatile barriers: without the first 
> one, the compiler might decide to move the allocation down, and without 
> the second one, the compiler might decide the allocation is out of scope 
> and can be undone before the actual system call or whatever is done.
>
> Honestly, I don't personally much care (I think this is all stupid 
> overhead, and think clearing the stack is much better), and I suspect 
> that current compilers always end up cleaning up alloca at function exit, 
> so this probably all _works_ as-is. But it's technically fragile and 
> wrong.

Agreed.

What probably saves this sequence in practice is that it's:

 - Rare & localized to syscall entry points.

 - Freeing of stack allocation can usually be merged with the freeing of 
   the local stack at function epilogue at zero cost, on most platforms:

    - on x86 with frame pointers, LEAVE will clean up automatically

    - on x86 without frame pointers, the offset of the allocation can be 
      added to the stack fixup addition.

 - On x86 we actually use -fno-omit-frame-pointer for the entry code 
   explicitly, but it also appears from a couple of quick tests that GCC 
   will use frame pointers and LEAVE cleanup when alloca is used in a 
   function - even with -fomit-frame-pointers:

I tried to coax GCC into breaking the stack with:

   void dummy_dummy(void)
   {
	u32 offset = raw_cpu_read(kstack_offset);
	u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));

	asm volatile("" :: "r"(ptr) : "memory");
   }

But GCC switched to frame pointers:

ffffffff810850f0 <dummy_dummy>:
ffffffff810850f0:       55                      push   %rbp
ffffffff810850f1:       48 89 e5                mov    %rsp,%rbp
ffffffff810850f4:       48 83 ec 08             sub    $0x8,%rsp
ffffffff810850f8:       65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
ffffffff810850ff:       00 00 
ffffffff81085101:       48 89 45 f8             mov    %rax,-0x8(%rbp)
ffffffff81085105:       31 c0                   xor    %eax,%eax
ffffffff81085107:       65 8b 05 b2 c4 f8 7e    mov    %gs:0x7ef8c4b2(%rip),%eax        # 115c0 <kstack_offset>
ffffffff8108510e:       25 ff 03 00 00          and    $0x3ff,%eax
ffffffff81085113:       48 83 c0 0f             add    $0xf,%rax
ffffffff81085117:       25 f8 07 00 00          and    $0x7f8,%eax
ffffffff8108511c:       48 29 c4                sub    %rax,%rsp
ffffffff8108511f:       48 8d 44 24 0f          lea    0xf(%rsp),%rax
ffffffff81085124:       48 83 e0 f0             and    $0xfffffffffffffff0,%rax
ffffffff81085128:       48 8b 45 f8             mov    -0x8(%rbp),%rax
ffffffff8108512c:       65 48 2b 04 25 28 00    sub    %gs:0x28,%rax
ffffffff81085133:       00 00 
ffffffff81085135:       75 02                   jne    ffffffff81085139 <dummy_dummy+0x49>
ffffffff81085137:       c9                      leave  
ffffffff81085138:       c3                      ret    
ffffffff81085139:       e8 a2 b2 b2 00          call   ffffffff81bb03e0 <__stack_chk_fail>
ffffffff8108513e:       66 90                   xchg   %ax,%ax

... side note: the overhead of this whole sequence is rather horrible, 
executed for every system call. :-/ This is made worse by 
CONFIG_STACKPROTECTOR_STRONG=y - which most big Linux distros have enabled.

Without stackprotector we get:

ffffffff81080330 <dummy_dummy>:
ffffffff81080330:       55                      push   %rbp
ffffffff81080331:       65 8b 05 88 12 f9 7e    mov    %gs:0x7ef91288(%rip),%eax        # 115c0 <kstack_offset>
ffffffff81080338:       25 ff 03 00 00          and    $0x3ff,%eax
ffffffff8108033d:       48 83 c0 0f             add    $0xf,%rax
ffffffff81080341:       48 89 e5                mov    %rsp,%rbp
ffffffff81080344:       25 f8 07 00 00          and    $0x7f8,%eax
ffffffff81080349:       48 29 c4                sub    %rax,%rsp
ffffffff8108034c:       48 8d 44 24 0f          lea    0xf(%rsp),%rax
ffffffff81080351:       48 83 e0 f0             and    $0xfffffffffffffff0,%rax
ffffffff81080355:       c9                      leave  
ffffffff81080356:       c3                      ret    

Which is still quite a bit longer than it probably should be, IMO. Since we 
are relying on assembly anyway, we don't we force frame pointers explicitly 
and do this in assembly? The key sequence should only be something like:

       65 8b 05 88 12 f9 7e    mov    %gs:0x7ef91288(%rip),%eax        # 115c0 <kstack_offset>
       48 29 c4                sub    %rax,%rsp

There's no fundamental reason for all the masking games IMO.

Anyway, in terms of fragility, the 'usually' and 'most' qualifiers & our 
reliance on compiler handling of alloca() are not guarantees & increase the 
fragility of this sequence...

Should be straightforward to switch this to a begin/end sequence, and 
shouldn't impact code generation, in principle.

Thanks,

	Ingo
David Laight May 17, 2021, 9:12 a.m. UTC | #4
From: Linus Torvalds
> Sent: 15 May 2021 18:13
> 
> On Sat, May 15, 2021 at 12:35 AM Ingo Molnar <mingo@kernel.org> wrote:
> >
> > --- a/include/linux/randomize_kstack.h
> > +++ b/include/linux/randomize_kstack.h
> > @@ -38,7 +38,7 @@ void *__builtin_alloca(size_t size);
> >                 /* Keep allocation even after "ptr" loses scope. */     \
> > -               asm volatile("" : "=o"(*ptr) :: "memory");              \
> > +               asm volatile("" :: "r"(ptr) : "memory");                \
> >         }                                                               \
> 
> Side note: at some point, a compiler will (correctly) decide that
> regardless of the inline asm here, the scope of that allocation is
> only that one block.
> 
> To be actually reliable, I suspect that add_random_kstack_offset()
> should return that pointer as a cookie, and then users should have a
> 
>      end_random_kstack_offset(cookie);
> 
> at the end of the function that did the add_random_kstack_offset().

Hmmm... would that even help?
The scope of the alloca() block is still the code block in which it
is created - so you are still using a pointer to stale stack.

The alloca() would have to have function scope to remain valid
throughout the entire system call.
Then the simple asm statement that just saves the pointer ought
to be good enough since the compiler must assume that can be read
much later in the syscall code.

I've thought of an alternative approach.
Instead of using alloca() save the offset in a per-cpu location.
In the next system call use the saved offset to adjust the
stack pointer in the asm wrapper.

It has to be easier to adjust the stack pointer very early
in the syscall entry path?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Kees Cook May 21, 2021, 6:12 p.m. UTC | #5
On Sun, May 16, 2021 at 09:29:39AM +0200, Ingo Molnar wrote:
> Without stackprotector we get:
> 
> ffffffff81080330 <dummy_dummy>:
> ffffffff81080330:       55                      push   %rbp
> ffffffff81080331:       65 8b 05 88 12 f9 7e    mov    %gs:0x7ef91288(%rip),%eax        # 115c0 <kstack_offset>
> ffffffff81080338:       25 ff 03 00 00          and    $0x3ff,%eax
> ffffffff8108033d:       48 83 c0 0f             add    $0xf,%rax
> ffffffff81080341:       48 89 e5                mov    %rsp,%rbp
> ffffffff81080344:       25 f8 07 00 00          and    $0x7f8,%eax
> ffffffff81080349:       48 29 c4                sub    %rax,%rsp
> ffffffff8108034c:       48 8d 44 24 0f          lea    0xf(%rsp),%rax
> ffffffff81080351:       48 83 e0 f0             and    $0xfffffffffffffff0,%rax
> ffffffff81080355:       c9                      leave  
> ffffffff81080356:       c3                      ret    
> 
> Which is still quite a bit longer than it probably should be, IMO. Since we 
> are relying on assembly anyway, we don't we force frame pointers explicitly 
> and do this in assembly? The key sequence should only be something like:
> 
>        65 8b 05 88 12 f9 7e    mov    %gs:0x7ef91288(%rip),%eax        # 115c0 <kstack_offset>
>        48 29 c4                sub    %rax,%rsp
> 
> There's no fundamental reason for all the masking games IMO.

Mainly the mask is for enforcing stack alignment (and the compiler
does it). The top-level mask is to limit the resulting entropy while
keeping the rest of the entropy for mixing the per-cpu variable.

However, the compile almost entirely fails to optimize the masking:

>        25 ff 03 00 00          and    $0x3ff,%eax
>        48 83 c0 0f             add    $0xf,%rax
>        25 f8 07 00 00          and    $0x7f8,%eax

This should just be and $0x3f0, I suspect (I need to double-check the
rounding up it wants to do with the "add"...)

Luckily, while long, it is very fast.

Patch
diff mbox series

diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
index fd80fab663a9..bebc911161b6 100644
--- a/include/linux/randomize_kstack.h
+++ b/include/linux/randomize_kstack.h
@@ -38,7 +38,7 @@  void *__builtin_alloca(size_t size);
 		u32 offset = raw_cpu_read(kstack_offset);		\
 		u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));	\
 		/* Keep allocation even after "ptr" loses scope. */	\
-		asm volatile("" : "=o"(*ptr) :: "memory");		\
+		asm volatile("" :: "r"(ptr) : "memory");		\
 	}								\
 } while (0)