From: Jann Horn <jann@thejh.net>
To: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@intel.com>,
	"kernel-hardening@lists.openwall.com"
	<kernel-hardening@lists.openwall.com>,
	Elena Reshetova <elena.reshetova@intel.com>,
	Hans Liljestrand <ishkamiel@gmail.com>,
	David Windsor <dwindsor@gmail.com>
Subject: Re: [kernel-hardening] [RFC PATCH 12/13] x86: x86 implementation for HARDENED_ATOMIC
Date: Tue, 4 Oct 2016 21:48:48 +0200	[thread overview]
Message-ID: <20161004194848.GI2040@pc.thejh.net> (raw)
In-Reply-To: <CAGXu5j+6dBJ17UK-=DH7Vz-8qJSPnQ=CCyq4mjAiUJHEd9HfnA@mail.gmail.com>


On Tue, Oct 04, 2016 at 11:51:09AM -0700, Kees Cook wrote:
> On Tue, Oct 4, 2016 at 5:41 AM, Jann Horn <jann@thejh.net> wrote:
> > On Mon, Oct 03, 2016 at 12:27:01PM -0700, Dave Hansen wrote:
> >> On 10/02/2016 11:41 PM, Elena Reshetova wrote:
> >> >  static __always_inline void atomic_add(int i, atomic_t *v)
> >> >  {
> >> > -   asm volatile(LOCK_PREFIX "addl %1,%0"
> >> > +   asm volatile(LOCK_PREFIX "addl %1,%0\n"
> >> > +
> >> > +#ifdef CONFIG_HARDENED_ATOMIC
> >> > +                "jno 0f\n"
> >> > +                LOCK_PREFIX "subl %1,%0\n"
> >> > +                "int $4\n0:\n"
> >> > +                _ASM_EXTABLE(0b, 0b)
> >> > +#endif
> >> > +
> >> > +                : "+m" (v->counter)
> >> > +                : "ir" (i));
> >> > +}
> >>
> >> Rather than doing all this assembly and exception stuff, could we just do:
> >>
> >> static __always_inline void atomic_add(int i, atomic_t *v)
> >> {
> >>       if (!atomic_add_unless(v, i, INT_MAX))
> >>               BUG_ON_OVERFLOW_FOO()...
> >> }
> >>
> >> That way, there's also no transient state where somebody can have
> >> observed the overflow before it is fixed up.  Granted, this
> >> cmpxchg-based operation _is_ more expensive than the fast-path locked addl.
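A minimal userspace sketch of this check-before-add idea, with GCC's
__atomic builtins standing in for the kernel primitives (the function name
and the abort() stand-in for BUG are mine, and it generalizes the INT_MAX
check to arbitrary increments):

#include <stdlib.h>

static void atomic_add_checked(int i, int *v) {
  int old = __atomic_load_n(v, __ATOMIC_RELAXED);
  for (;;) {
    int new;
    // refuse the add instead of fixing it up afterwards: the counter
    // never holds an overflowed value, so there is no transient state
    if (__builtin_add_overflow(old, i, &new))
      abort(); // the kernel would BUG/Oops here
    if (__atomic_compare_exchange_n(v, &old, new, 0,
                                    __ATOMIC_RELAXED, __ATOMIC_RELAXED))
      return;
    // cmpxchg failed: old has been refreshed with the current value, retry
  }
}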
> >
> > I think we need some numbers, so I copy-pasted a bunch of kernel code together
> > so that I could benchmark this stuff in userspace without having a full kernel
> > implementation of refcounting protection. My code is at the bottom of
> > the mail - please test this on other CPUs; these are just numbers from my
> > machine.
> >
> > The following numbers are from tests on a
> > "Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz" CPU.
> >
> > First, I'm testing the single-threaded (uncontended) case:
> >
> > $ gcc -o atomic_user_test atomic_user_test.c -std=gnu99 -Wall -pthread -ggdb
> > $ time ./atomic_user_test 1 1 1000000000 # single-threaded, no protection
> > real    0m9.281s
> > user    0m9.251s
> > sys     0m0.004s
> > $ time ./atomic_user_test 1 2 1000000000 # single-threaded, racy protection
> > real    0m9.385s
> > user    0m9.365s
> > sys     0m0.003s
> > $ time ./atomic_user_test 1 3 1000000000 # single-threaded, cmpxchg protection
> > real    0m12.399s
> > user    0m12.375s
> > sys     0m0.000s
> >
> > The cmpxchg protection is roughly 30% slower than the racy one: it needs
> > about 12.4ns per operation, or around 45 cycles per operation at 3.6GHz.
> > (Probably a bit less in practice, since the loop itself also costs some
> > time.) My guess is that this wouldn't be noticeable.
> >
> > Now, I'm testing the multi-threaded (contended) case, with two threads that
> > only try to increment the same counter over and over again - so this is a
> > pretty extreme worst-case microbenchmark:
> >
> > $ time ./atomic_user_test 2 1 1000000000 # multi-threaded, no protection
> > real    0m9.550s
> > user    0m18.988s
> > sys     0m0.000s
> > $ time ./atomic_user_test 2 2 1000000000 # multi-threaded, racy protection
> > real    0m9.249s
> > user    0m18.430s
> > sys     0m0.004s
> > $ time ./atomic_user_test 2 3 1000000000 # multi-threaded, cmpxchg protection
> > real    1m47.331s
> > user    3m34.390s
> > sys     0m0.024s
> >
> > Here, the cmpxchg-protected counter is around 11 times as slow as the
> > unprotected counter, averaging around 107ns per increment. That's around
> > 380 cycles per increment.
> >
> > I guess we probably don't care all that much about the few extra cycles in
> > the uncontended case.
> > So I think the big question now is how important the performance of the
> > high-contention case is.
> 
> What I find quite exciting about these benchmarks is that they're the
> absolute worst case: the process is doing nothing but atomic
> operations (which won't be the case for general kernel workloads), and
> the fact that the racy protection is basically lost in the noise is
> great.

Yeah, I agree.
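
For comparison, here is roughly what the racy (add-then-fix-up) variant
boils down to, again as a userspace sketch with GCC builtins - the
fetch_sub plus abort() pair stands in for the kernel's
"jno 0f; subl; int $4" sequence, and the window between the two atomic
operations is exactly the transient state Dave mentioned:

#include <stdlib.h>

static void atomic_add_racy(int i, int *v) {
  int old = __atomic_fetch_add(v, i, __ATOMIC_RELAXED);
  int new;
  if (__builtin_add_overflow(old, i, &new)) {
    // transient window: other CPUs can observe the overflowed
    // value until this subtract lands
    __atomic_fetch_sub(v, i, __ATOMIC_RELAXED);
    abort(); // the kernel raises int $4 and Oopses here
  }
}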


> Now, cmpxchg looks bad here, but is, again, in the worst-case
> environment. I'll be curious to see kernel workloads with it, though.

Me too.

Btw: I just noticed that __fget() uses atomic_long_inc_not_zero(), which
is already implemented with cmpxchg on x86. So every time a multithreaded
process has multiple threads that interact with the same fd a lot, this
already creates racing cmpxchg loops today, and nobody seems to have
complained sufficiently loudly so far. (And "perf top" confirms that
doing pread() in a loop in two threads spends way less time in
fget() if the threads use different fds. I'm not going to give you exact
numbers from my system because I have all kinds of debug crap turned on,
but I'll put my test code at the bottom of this mail if someone wants to
play with it.)
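
That pattern is easy to see in miniature - a sketch of an
inc_not_zero-style cmpxchg loop (a userspace rendition with GCC builtins,
not the kernel's actual code):

#include <stdbool.h>

// Increment *v unless it is zero; returns false if it was zero.
// This is the cmpxchg-loop shape behind atomic_long_inc_not_zero(),
// which is why contended __fget() already pays the cmpxchg-loop cost.
static bool inc_not_zero(long *v) {
  long old = __atomic_load_n(v, __ATOMIC_RELAXED);
  while (old != 0) {
    if (__atomic_compare_exchange_n(v, &old, old + 1, false,
                                    __ATOMIC_RELAXED, __ATOMIC_RELAXED))
      return true;
    // on failure, old now holds the current value; retry
  }
  return false;
}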

> As for the "racy" part, I'm not too concerned about it. (I would like
> it not to race, but given a choice, I would rather this protection were
> enabled by default.) As-is, with two threads fighting for a race, it
> can be hard to win the race, and even on a success, one thread will
> still Oops, so it's still better than what we have today: no
> notification of failure and an exploitable condition. (And, frankly,
> since racing and _failing_ means both threads Oops, attempting the
> race would be extremely noisy on systems that aren't already
> set to panic on the first Oops...)
> 
> Just to make sure I'm not imagining the wrong thing, the race looks like this:
> 
> CPU0          CPU1
> inc->0
>               inc->1
> dec->0
> Oops
>               carry on

Yup, exactly.

In my test with an artificial worst-realistic-case bug, described in
https://bugs.chromium.org/p/project-zero/issues/detail?id=856, it was
possible to get around the racy protection within seconds. Of course,
the normal cost of overflowing a reference counter comes on top of
that, and if you look at the logs while the attack is running, it is
indeed going to be very obvious - but I think that realistically, on
most systems, nobody is actually watching dmesg and looking for
oopses. Either you have panic_on_oops=1, or the attack is successful
and the attacker wipes the logs.


====================
#define _GNU_SOURCE
#include <pthread.h>
#include <err.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int evfd = -1;

void *childfn(void *arg) {
  (void)arg;
  // change "#if 0" to "#if 1" to let the threads use separate FDs
#if 0
  int evfd = eventfd(0, 0); // shadows the global: this thread gets its own fd
  if (evfd == -1)
    err(1, "eventfd");
#endif
  char c = 'X';
  while (1) {
    // fails every time (eventfd doesn't support pread), but still
    // takes and drops the fd reference in __fget()/fput()
    pread(evfd, &c, 1, 0);
  }
  return NULL;
}

int main(void) {
  evfd = eventfd(0, 0);
  if (evfd == -1)
    err(1, "eventfd");
  pthread_t thread;
  if (pthread_create(&thread, NULL, childfn, NULL))
    err(1, "pthread_create");
  childfn(NULL);
}
====================
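
For completeness, this is how I'd build and watch it - assuming the file
is saved as evfd_test.c (the name is mine):

$ gcc -O2 -pthread -o evfd_test evfd_test.c
$ ./evfd_test &
$ perf top -p $!

With the #if 0 flipped to #if 1, the fget() samples should drop noticeably.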

