Re: eBPF / seccomp globals?

From: Kees Cook <keescook@chromium.org>
To: Michael Tirado <mtirado418@gmail.com>,
	Andy Lutomirski <luto@amacapital.net>
Cc: LKML <linux-kernel@vger.kernel.org>, Will Drewry <wad@chromium.org>
Subject: Re: eBPF / seccomp globals?
Date: Tue, 29 Sep 2015 16:44:42 -0700	[thread overview]
Message-ID: <CAGXu5jKJuf80L7sAPOb-hbc_wC5e_0nEpQpfhg401XhZH4A2Pw@mail.gmail.com> (raw)
In-Reply-To: <CAMkWEXM2pyuOx8VhGLQUQU=7jLuB4ifNydu4egK2k-HnWf6h1w@mail.gmail.com>

On Thu, Sep 10, 2015 at 2:55 PM, Michael Tirado <mtirado418@gmail.com> wrote:
> On Fri, Sep 4, 2015 at 8:37 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> Do you still need file capabilities with the availability of the new
>> ambient capabilities?
>>
>> https://s3hh.wordpress.com/2015/07/25/ambient-capabilities/
>> http://thread.gmane.org/gmane.linux.kernel.lsm/24034
>
> Ah.. thanks for the info on this, my launcher program could use ambient
> capabilities if whoever invoked it already has that capability. I am trying
> to have the new environment explicitly defined as a white list, and avoid
> any type of privilege escalation not already granted by root user either
> by filesystem mechanisms (setuid / file caps) or inheritable caps.
>
> I would still like to be able to launch programs with file capabilities since we
> can lock those down with capability bounding set, and maybe even setuid
> binaries too (with a hefty warning message). This rules out LD_PRELOAD
> for me, and also some linkers may not support it at all.
>
>
>
>> On the TODO list is
>> doing deep argument inspection, but it is not an easy thing to get
>> right. :)
>
> Yes, please do not rush such a thing!!  It might even be a can of worms
> not worth opening.
>
>
>
> In case anyone is wondering what I am doing for-now(tm) while waiting for
> eBPF map support, or some other way to deal with this problem: I have crafted
> a very hacky patch to work around the issue that will allow 2 system calls to
> pass through before the filter program is run. I'm lazily using google
> webmail so,
> sorry if the tabs are missing :(

I

>
>
>
> From: Michael R. Tirado <mtirado418@gmail.com>
> Date: Thu, 10 Sep 2015 08:28:41 +0000
> Subject: [PATCH] Add new seccomp filter mode + flag to allow two syscalls to
>  pass before the filter is run. allows a launcher program to setuid(drop caps)
>  and exec if those two privileges are not granted in seccomp filter whitelist.
>
>  DISCLAIMER:
>  I am doing this as a quick temporary workaround to this complex problem.
>  Also, there may be a more efficient way to implement it instead of
>  branching in the filter loop.
> ---
>  include/linux/seccomp.h      |  2 +-
>  include/uapi/linux/seccomp.h |  2 ++
>  kernel/seccomp.c             | 23 ++++++++++++++++++++---
>  3 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index a19ddac..5547448c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -3,7 +3,7 @@
>
>  #include <uapi/linux/seccomp.h>
>
> -#define SECCOMP_FILTER_FLAG_MASK    (SECCOMP_FILTER_FLAG_TSYNC)
> +#define SECCOMP_FILTER_FLAG_MASK    (SECCOMP_FILTER_FLAG_TSYNC |
> SECCOMP_FILTER_FLAG_DEFERRED)
>
>  #ifdef CONFIG_SECCOMP
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 0f238a4..43a8fb8 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -9,6 +9,7 @@
>  #define SECCOMP_MODE_DISABLED    0 /* seccomp is not in use. */
>  #define SECCOMP_MODE_STRICT    1 /* uses hard-coded filter. */
>  #define SECCOMP_MODE_FILTER    2 /* uses user-supplied filter. */
> +#define SECCOMP_MODE_FILTER_DEFERRED 3 /* sets filter mode + deferred flag */
>
>  /* Valid operations for seccomp syscall. */
>  #define SECCOMP_SET_MODE_STRICT    0
> @@ -16,6 +17,7 @@
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>  #define SECCOMP_FILTER_FLAG_TSYNC    1
> +#define SECCOMP_FILTER_FLAG_DEFERRED    2 /* grant two unfiltered syscalls */
>
>  /*
>   * All BPF programs must return a 32-bit value.
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 245df6b..dc2a5af 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -58,6 +58,7 @@ struct seccomp_filter {
>      atomic_t usage;
>      struct seccomp_filter *prev;
>      struct bpf_prog *prog;
> +    unsigned int deferred;
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -196,7 +197,12 @@ static u32 seccomp_run_filters(struct seccomp_data *sd)
>       * value always takes priority (ignoring the DATA).
>       */
>      for (; f; f = f->prev) {
> -        u32 cur_ret = BPF_PROG_RUN(f->prog, (void *)sd);
> +        u32 cur_ret;
> +        if (unlikely(f->deferred)) {
> +            --f->deferred;
> +            continue;
> +        }
> +        cur_ret = BPF_PROG_RUN(f->prog, (void *)sd);

I do like the idea of a deferred filters, but I wouldn't want them
checked this way (it adds a fixed cost to all filters). I would rather
that a separate list of filters exist, waiting for something like
"exec" to trigger them getting added to the actual filter list. In
fact, probably you'd want something like "prepare" and "cancel" too,
then you could "prepare", fork, exec. Then the child would get the
prepared filters added, and the parent could call "cancel" to drop the
prepared filters from itself.

Andy, Will, do you see issues with a class of "deferred" filters that
would be attached after exec?

>
>          if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
>              ret = cur_ret;
> @@ -444,6 +450,14 @@ static long seccomp_attach_filter(unsigned int flags,
>      }
>
>      /*
> +     * in certain cases we may wish to defer filtering, and allow some
> +     * syscalls. eg, a launcher program will setuid(drop caps) then exec.
> +     */
> +    if (flags & SECCOMP_FILTER_FLAG_DEFERRED) {
> +        filter->deferred = 2;
> +    }
> +
> +    /*
>       * If there is an existing filter, make it the prev and don't drop its
>       * task reference.
>       */
> @@ -838,6 +852,7 @@ long prctl_set_seccomp(unsigned long seccomp_mode,
> char __user *filter)
>  {
>      unsigned int op;
>      char __user *uargs;
> +    unsigned int flags = 0;
>
>      switch (seccomp_mode) {
>      case SECCOMP_MODE_STRICT:
> @@ -849,6 +864,9 @@ long prctl_set_seccomp(unsigned long seccomp_mode,
> char __user *filter)
>           */
>          uargs = NULL;
>          break;
> +        /* set flag, older kernels lack seccomp syscall */
> +    case SECCOMP_MODE_FILTER_DEFERRED:
> +        flags = SECCOMP_FILTER_FLAG_DEFERRED;
>      case SECCOMP_MODE_FILTER:
>          op = SECCOMP_SET_MODE_FILTER;
>          uargs = filter;
> @@ -857,6 +875,5 @@ long prctl_set_seccomp(unsigned long seccomp_mode,
> char __user *filter)
>          return -EINVAL;
>      }
>
> -    /* prctl interface doesn't have flags, so they are always zero. */
> -    return do_seccomp(op, 0, uargs);
> +    return do_seccomp(op, flags, uargs);

Sorry, no, this isn't allowed, as the comment above it says. New flags
must happen only via the seccomp syscall.

>  }
> --
> 1.8.4

-Kees

-- 
Kees Cook
Chrome OS Security