From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Borislav Petkov <bp@alien8.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Nick Desaulniers <ndesaulniers@google.com>,
	Kees Cook <keescook@chromium.org>,
	Elena Reshetova <elena.reshetova@intel.com>
Subject: Re: [GIT PULL] Stack randomization fix
Date: Sun, 16 May 2021 09:29:39 +0200
Message-ID: <YKDJ40DUdY1Oy+FJ@gmail.com>
In-Reply-To: <CAHk-=whpKm_bCDui-VcRwJWVPDPCFKY_oqRACpTff5zXNr8MjQ@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, May 15, 2021 at 12:35 AM Ingo Molnar <mingo@kernel.org> wrote:
> >
> > --- a/include/linux/randomize_kstack.h
> > +++ b/include/linux/randomize_kstack.h
> > @@ -38,7 +38,7 @@ void *__builtin_alloca(size_t size);
> >                 /* Keep allocation even after "ptr" loses scope. */     \
> > -               asm volatile("" : "=o"(*ptr) :: "memory");              \
> > +               asm volatile("" :: "r"(ptr) : "memory");                \
> >         }                                                               \
> 
> Side note: at some point, a compiler will (correctly) decide that
> regardless of the inline asm here, the scope of that allocation is
> only that one block.
> 
> To be actually reliable, I suspect that add_random_kstack_offset()
> should return that pointer as a cookie, and then users should have a
> 
>      end_random_kstack_offset(cookie);
> 
> at the end of the function that did the add_random_kstack_offset().
> 
> And that end_random_kstack_offset() should have *another* "asm
> volatile" that uses the pointer.

Indeed.

> Note that you need both of the asm volatile barriers: without the first 
> one, the compiler might decide to move the allocation down, and without 
> the second one, the compiler might decide the allocation is out of scope 
> and can be undone before the actual system call or whatever is done.
>
> Honestly, I don't personally much care (I think this is all stupid 
> overhead, and think clearing the stack is much better), and I suspect 
> that current compilers always end up cleaning up alloca at function exit, 
> so this probably all _works_ as-is. But it's technically fragile and 
> wrong.

Agreed.
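
I.e. something along these lines, perhaps - a rough, untested sketch, where 
the end_random_kstack_offset() name and the cookie type are made up, and the 
static-key check is left out for brevity:

   #define add_random_kstack_offset() ({				\
	u32 __offset = raw_cpu_read(kstack_offset);			\
	u8 *__kstack_ptr = __builtin_alloca(KSTACK_OFFSET_MAX(__offset)); \
	/* Don't let the compiler move or elide the allocation: */	\
	asm volatile("" :: "r"(__kstack_ptr) : "memory");		\
	__kstack_ptr;							\
   })

   /* Use the cookie again so the allocation stays live until this point: */
   #define end_random_kstack_offset(cookie) \
	asm volatile("" :: "r"(cookie) : "memory")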

What probably saves the current sequence in practice is that it's:

 - Rare & localized to syscall entry points.

 - Freeing the alloca() allocation can usually be merged with the freeing of 
   the local stack frame in the function epilogue, at zero cost on most 
   platforms:

    - on x86 with frame pointers, LEAVE will clean up automatically

    - on x86 without frame pointers, the size of the allocation can simply 
      be folded into the stack-pointer fixup addition in the epilogue.

 - On x86 we actually use -fno-omit-frame-pointer for the entry code 
   explicitly, but a couple of quick tests also suggest that GCC will use 
   frame pointers and LEAVE cleanup whenever alloca() is used in a 
   function - even with -fomit-frame-pointer:

I tried to coax GCC into breaking the stack with:

   void dummy_dummy(void)
   {
	u32 offset = raw_cpu_read(kstack_offset);
	u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));

	asm volatile("" :: "r"(ptr) : "memory");
   }

But GCC switched to frame pointers:

ffffffff810850f0 <dummy_dummy>:
ffffffff810850f0:       55                      push   %rbp
ffffffff810850f1:       48 89 e5                mov    %rsp,%rbp
ffffffff810850f4:       48 83 ec 08             sub    $0x8,%rsp
ffffffff810850f8:       65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
ffffffff810850ff:       00 00 
ffffffff81085101:       48 89 45 f8             mov    %rax,-0x8(%rbp)
ffffffff81085105:       31 c0                   xor    %eax,%eax
ffffffff81085107:       65 8b 05 b2 c4 f8 7e    mov    %gs:0x7ef8c4b2(%rip),%eax        # 115c0 <kstack_offset>
ffffffff8108510e:       25 ff 03 00 00          and    $0x3ff,%eax
ffffffff81085113:       48 83 c0 0f             add    $0xf,%rax
ffffffff81085117:       25 f8 07 00 00          and    $0x7f8,%eax
ffffffff8108511c:       48 29 c4                sub    %rax,%rsp
ffffffff8108511f:       48 8d 44 24 0f          lea    0xf(%rsp),%rax
ffffffff81085124:       48 83 e0 f0             and    $0xfffffffffffffff0,%rax
ffffffff81085128:       48 8b 45 f8             mov    -0x8(%rbp),%rax
ffffffff8108512c:       65 48 2b 04 25 28 00    sub    %gs:0x28,%rax
ffffffff81085133:       00 00 
ffffffff81085135:       75 02                   jne    ffffffff81085139 <dummy_dummy+0x49>
ffffffff81085137:       c9                      leave  
ffffffff81085138:       c3                      ret    
ffffffff81085139:       e8 a2 b2 b2 00          call   ffffffff81bb03e0 <__stack_chk_fail>
ffffffff8108513e:       66 90                   xchg   %ax,%ax

... side note: the overhead of this whole sequence, executed for every system 
call, is rather horrible. :-/ This is made worse by 
CONFIG_STACKPROTECTOR_STRONG=y - which most big Linux distros have enabled.

Without stackprotector we get:

ffffffff81080330 <dummy_dummy>:
ffffffff81080330:       55                      push   %rbp
ffffffff81080331:       65 8b 05 88 12 f9 7e    mov    %gs:0x7ef91288(%rip),%eax        # 115c0 <kstack_offset>
ffffffff81080338:       25 ff 03 00 00          and    $0x3ff,%eax
ffffffff8108033d:       48 83 c0 0f             add    $0xf,%rax
ffffffff81080341:       48 89 e5                mov    %rsp,%rbp
ffffffff81080344:       25 f8 07 00 00          and    $0x7f8,%eax
ffffffff81080349:       48 29 c4                sub    %rax,%rsp
ffffffff8108034c:       48 8d 44 24 0f          lea    0xf(%rsp),%rax
ffffffff81080351:       48 83 e0 f0             and    $0xfffffffffffffff0,%rax
ffffffff81080355:       c9                      leave  
ffffffff81080356:       c3                      ret    

Which is still quite a bit longer than it probably should be, IMO. Since we 
are relying on assembly anyway, why don't we force frame pointers explicitly 
and do this in assembly? The key sequence should only be something like:

       65 8b 05 88 12 f9 7e    mov    %gs:0x7ef91288(%rip),%eax        # 115c0 <kstack_offset>
       48 29 c4                sub    %rax,%rsp

There's no fundamental reason for all the masking games IMO.

Anyway, in terms of fragility: the 'usually' and 'most' qualifiers above, and 
our reliance on the compiler's handling of alloca(), are not guarantees and 
increase the fragility of this sequence...

Should be straightforward to switch this to a begin/end sequence, and 
shouldn't impact code generation, in principle.
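
E.g. the syscall entry points would then do something like this (sketch only, 
with the surrounding entry code elided):

	u8 *kstack_cookie = add_random_kstack_offset();

	/* ... the usual entry work & syscall dispatch ... */

	end_random_kstack_offset(kstack_cookie);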

Thanks,

	Ingo
