RE: [PATCH] x86/entry/64: randomize kernel stack offset upon syscall

From: "Reshetova, Elena" <elena.reshetova@intel.com>
To: Kees Cook <keescook@chromium.org>, Ingo Molnar <mingo@kernel.org>,
	"Andy Lutomirski" <luto@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>,
	David Laight <David.Laight@aculab.com>,
	Eric Biggers <ebiggers3@gmail.com>,
	"ebiggers@google.com" <ebiggers@google.com>,
	"herbert@gondor.apana.org.au" <herbert@gondor.apana.org.au>,
	Peter Zijlstra <peterz@infradead.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"jpoimboe@redhat.com" <jpoimboe@redhat.com>,
	"jannh@google.com" <jannh@google.com>,
	"Perla, Enrico" <enrico.perla@intel.com>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"bp@alien8.de" <bp@alien8.de>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"Edgecombe, Rick P" <rick.p.edgecombe@intel.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: RE: [PATCH] x86/entry/64: randomize kernel stack offset upon syscall
Date: Mon, 29 Jul 2019 11:41:11 +0000	[thread overview]
Message-ID: <2236FBA76BA1254E88B949DDB74E612BA4D4BFCA@IRSMSX102.ger.corp.intel.com> (raw)
In-Reply-To: <201905291136.FD61FF42@keescook>

Ingo, Andy,

I want to summarize here the data (including the performance numbers)
and reasoning for the in-stack randomization feature. I have organized
it in a simple set of Q&A below.

Q: Why do we need in-stack per-syscall randomization when we already have
all known attack vectors covered with existing protections?

A: First, we have all known (to us) attack vectors covered *only* given that all
existing related protections are enabled.  While we do expect the vmalloc-stack
allocation with guard pages, as well as VLA-free kernel to be the default setup
for most of the kernels we concerned about, this cannot be said about GCC-plugin
based features, such as STACKLEAK. The reason for this is a) STACKLEAK requires
GCC plugin enablement during compilation and this is not a default case now for
many distros. b) STACKLEAK can introduce very visible performance overhead.
Please see the microbenchmarks below for the cases you have asked me to measure
before. The performance hit can be 79% for simple syscall and I believe even
bigger for complex syscalls where a lot of stack needs to be cleaned. 

Q: What is the performance impact both micro benchmark and on real workloads?

A: Here is the table summarizing the perf numbers depending on subroutine used
for obtaining random bits and STACKLEAK numbers for comparison:

1. Microbenchmarking:

lmbench: ./lat_syscall -N 1000000 null:

CONFIG_PAGE_TABLE_ISOLATION=y:

base:                                0.1737
rdtsc:                               0.1803 (+3.89%)
get_random_bytes (4096 byte buffer): 0.1811 (+4.26%)
RDRAND (every 10th syscall):         0.1910 (+9.96%)
STACKLEAK:                           0.2150 (+23.78%)

CONFIG_PAGE_TABLE_ISOLATION=n:

base:                                0.0522
rdtsc:                               0.0556 (+6.51%)
get_random_bytes (4096 byte buffer): 0.0599 (+14.75%)
RDRAND (every 10th syscall):         0.0706 (+35.25%)
STACKLEAK:                           0.0935 (+79.12%)

So, at least one thing that follows from above numbers is that STACKLEAK
performance is much worse even for simple syscall case (more complex
syscalls likely cause even more perf hit since stack must be cleaned deeper).
However, this hasn't prevented STACKLEAK upstream merge and STACKLEAK is much
more complex feature both code side and code logic...

2. Kernel compilation time (example of real workload):

Base (CONFIG_PAGE_TABLE_ISOLATION=n):

real: 31m17,601s
user: 224m56,144s
sys: 18m26,946s

get_random_bytes (4096 byte buffer, CONFIG_PAGE_TABLE_ISOLATION=n):

real: 31m22,684s (+0.271% increase)
user: 225m4,766s
sys: 18m27,113s

STACKLEAK (CONFIG_PAGE_TABLE_ISOLATION=n):

real: 31m23,289s (+0.303% increase)
user: 224m24,001s
sys: 19m36,882s

Q: Is there a way to drastically improve the microbenchmark results
for get_random_bytes() without decreasing security?

A: As far as I can see no, but I am not a perf optimization expert.
My measurements and profiling showed that biggest load comes from
extract_crng function and underlying arch_get_random_long() 
(basically calling RDRAND for x86) and chacha_block function (core 
chacha functionality and permutation itself). So, unless the permutation
is rewritten or additional seeding is dropped (happens automatically
if arch does not support RDRAND), there is no way in my understanding to 
improve things drastically. See traces below.

perf (4096 isolation off, get_random_byte, arch_get_random_long not inlined):

   98.94%     0.00%  lat_syscall  [unknown]         [k] 0x48fffffe27e80f74
            |
            ---0x48fffffe27e80f74
               |          
               |--97.73%--__GI___getppid
               |          |          
               |          |--53.29%--entry_SYSCALL_64_after_hwframe
               |          |          |          
               |          |          |--29.73%--do_syscall_64
               |          |          |          |          
               |          |          |          |--11.05%--__x64_sys_getppid
               |          |          |          |          |          
               |          |          |          |           --10.64%--__task_pid_nr_ns
               |          |          |          |          
               |          |          |           --9.44%--random_get_byte
               |          |          |                     |          
               |          |          |                      --8.08%--get_random_bytes
               |          |          |                                |          
               |          |          |                                 --7.80%--_extract_crng.constprop.45
               |          |          |                                           |          
               |          |          |                                           |--4.95%--arch_get_random_long
               |          |          |                                           |          
               |          |          |                                            --2.39%--chacha_block

    17.22%  [kernel]          [k] syscall_return_via_sysret
    15.32%  libc-2.27.so      [.] __GI___getppid
    13.33%  [kernel]          [k] __x64_sys_getppid
    10.04%  [kernel]          [k] __task_pid_nr_ns
     8.60%  [kernel]          [k] do_syscall_64
     7.94%  [kernel]          [k] entry_SYSCALL_64
     7.08%  [kernel]          [k] entry_SYSCALL_64_after_hwframe
     4.45%  [kernel]          [k] _extract_crng.constprop.46
     2.28%  [kernel]          [k] chacha_permute
     1.92%  [kernel]          [k] __indirect_thunk_start
     1.69%  [kernel]          [k] random_get_byte
     0.75%  lat_syscall       [.] do_getppid
     0.50%  [kernel]          [k] fw_domains_get_with_fallback
     0.43%  lat_syscall       [.] getppid@plt

   PerfTop:    5877 irqs/sec  kernel:78.6%  exact: 100.0% [4000Hz cycles:ppp],  (all, 8 CPUs)
---------------------------------------------------------------------------------------------
Showing cycles:ppp for _extract_crng.constprop.46
  Events  Pcnt (>=5%)
 Percent | Source code & Disassembly of kcore for cycles:ppp (2104 samples, percent: local period)
--------------------------------------------------------------------------------------------------
    0.00 :   ffffffff9abd1a62:       mov    $0xa,%edx
   97.94 :   ffffffff9abd1a67:       rdrand %rax

Q: So, what is the resulting benefit of in-stack per-syscall randomization?

A: Randomizing kernel stack per each syscall removes one of the main property
that attracted attackers to kernel thread stack for couple of last decades - the
deterministic stack structure.
Reliability is one of the most important feature for kernel exploits (not so
much for userspace ones, since they always can survive the fact that a process
can be terminated). An attacker cannot afford to panic the kernel, because it
normally means that the attack target is dead and at best he can continue from
square one after the target reboots and intrusion detection systems are already
alert. So, removing the feature of reliably predicting what would be the thread
stack layout upon the subsequent syscall is a bug hurdle for an attacker in
constructing the exploit chain that involves kernel stack.
It basically means that a class of stack-based attacks that do some preparation
(or information discovery) work in a first syscall and then utilize this
knowledge on a subsequent syscall cannot be reliable anymore (unless target is
pt_regs since we don't change pt_regs position), which means that they are much
less attractive for attackers.

The in-stack randomization is really a very small change both code wise and
logic wise.
It does not affect real workloads and does not require enablement of other
features (such as GCC plugins).
So, I think we should really reconsider its inclusion.