linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jann Horn <jannh@google.com>
To: Kees Cook <keescook@chromium.org>
Cc: kernel list <linux-kernel@vger.kernel.org>,
	Christian Brauner <christian@brauner.io>,
	Sargun Dhillon <sargun@sargun.me>,
	Tycho Andersen <tycho@tycho.ws>,
	"zhujianwei (C)" <zhujianwei7@huawei.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Andy Lutomirski <luto@kernel.org>, Will Drewry <wad@chromium.org>,
	Shuah Khan <shuah@kernel.org>, Matt Denton <mpdenton@google.com>,
	Chris Palmer <palmer@google.com>,
	Jeffrey Vander Stoep <jeffv@google.com>,
	Aleksa Sarai <cyphar@cyphar.com>,
	Hehuazhen <hehuazhen@huawei.com>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	Linux Containers <containers@lists.linux-foundation.org>,
	linux-security-module <linux-security-module@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>
Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps
Date: Tue, 16 Jun 2020 14:14:47 +0200	[thread overview]
Message-ID: <CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com> (raw)
In-Reply-To: <20200616074934.1600036-5-keescook@chromium.org>

On Tue, Jun 16, 2020 at 9:49 AM Kees Cook <keescook@chromium.org> wrote:
> One of the most common pain points with seccomp filters has been dealing
> with the overhead of processing the filters, especially for "always allow"
> or "always reject" cases. While BPF is extremely fast[1], it will always
> have overhead associated with it. Additionally, due to seccomp's design,
> filters are layered, which means processing time goes up as the number
> of filters attached goes up.
>
> In the past, efforts have been focused on making filter execution complete
> in a shorter amount of time. For example, filters were rewritten from
> using linear if/then/else syscall search to using balanced binary trees,
> or moving tests for syscalls common to the process's workload to the
> front of the filter. However, there are limits to this, especially when
> some processes are dealing with tens of filters[2], or when some
> architectures have a less efficient BPF engine[3].
>
> The most common use of seccomp, constructing syscall block/allow-lists,
> where syscalls that are always allowed or always rejected (without regard
> to any arguments), also tends to produce the most pathological runtime
> problems, in that a large number of syscall checks in the filter need
> to be performed to come to a determination.
>
> In order to optimize these cases from O(n) to O(1), seccomp can
> use bitmaps to immediately determine the desired action. A critical
> observation in the prior paragraph bears repeating: the common case for
> syscall tests do not check arguments. For any given filter, there is a
> constant mapping from the combination of architecture and syscall to the
> seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
> there is a single architecture.). As such, it is possible to construct
> a mapping of arch/syscall to action, which can be updated as new filters
> are attached to a process.
>
> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly. If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result.
>
> During syscall action probing, in order to determine whether other members
> of struct seccomp_data are being accessed during a filter execution,
> the struct is placed across a page boundary with the "arch" and "nr"
> members in the first page, and everything else in the second page. The
> "page accessed" flag is cleared in the second page's PTE, and the filter
> is run. If the "page accessed" flag appears as set after running the
> filter, we can determine that the filter looked beyond the "arch" and
> "nr" members, and exclude that syscall from the constant action bitmaps.
>
> For architectures to support this optimization, they must declare
> their architectures for seccomp to see (via SECCOMP_ARCH and
> SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient
> CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()),
> and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig.

Wouldn't it be simpler to use a function that can run a subset of
seccomp cBPF and bails out on anything that indicates that a syscall's
handling is complex or on instructions it doesn't understand? For
syscalls that have a fixed policy, a typical seccomp filter doesn't
even use any of the BPF_ALU ops, the scratch space, or the X register;
it just uses something like the following set of operations, which is
easy to emulate without much code:

BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K

Something like (completely untested):

/*
 * Try to statically determine whether @filter will always return a fixed result
 * when run for syscall @nr under architecture @arch.
 * Returns true if the result could be determined; if so, the result will be
 * stored in @action.
 */
static bool seccomp_check_syscall(struct sock_filter *filter, unsigned int arch,
                                  unsigned int nr, unsigned int *action)
{
  int pc;
  unsigned int reg_value = 0;

  for (pc = 0; 1; pc++) {
    struct sock_filter *insn = &filter[pc];
    u16 code = insn->code;
    u32 k = insn->k;

    switch (code) {
    case BPF_LD | BPF_W | BPF_ABS:
      if (k == offsetof(struct seccomp_data, nr)) {
        reg_value = nr;
      } else if (k == offsetof(struct seccomp_data, arch)) {
        reg_value = arch;
      } else {
        return false; /* can't optimize (non-constant value load) */
      }
      break;
    case BPF_RET | BPF_K:
      *action = insn->k;
      return true; /* success: reached return with constant values only */
    case BPF_JMP | BPF_JA:
      pc += insn->k;
      break;
    case BPF_JMP | BPF_JEQ | BPF_K:
    case BPF_JMP | BPF_JGE | BPF_K:
    case BPF_JMP | BPF_JGT | BPF_K:
    default:
      if (BPF_CLASS(code) == BPF_JMP && BPF_SRC(code) == BPF_K) {
        u16 op = BPF_OP(code);
        bool op_res;

        switch (op) {
        case BPF_JEQ:
          op_res = reg_value == k;
          break;
        case BPF_JGE:
          op_res = reg_value >= k;
          break;
        case BPF_JGT:
          op_res = reg_value > k;
          break;
        default:
          return false; /* can't optimize (unknown insn) */
        }

        pc += op_res ? insn->jt : insn->jf;
        break;
      }
      return false; /* can't optimize (unknown insn) */
    }
  }
}

That way, you won't need any of this complicated architecture-specific stuff.

  reply	other threads:[~2020-06-16 12:15 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-16  7:49 [RFC][PATCH 0/8] seccomp: Implement constant action bitmaps Kees Cook
2020-06-16  7:49 ` [PATCH 1/8] selftests/seccomp: Improve calibration loop Kees Cook
2020-06-16  7:49 ` [PATCH 2/8] seccomp: Use pr_fmt Kees Cook
2020-06-16  7:49 ` [PATCH 3/8] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE Kees Cook
2020-06-16 16:56   ` Andy Lutomirski
2020-06-17 15:25   ` Jann Horn
2020-06-17 15:29     ` Andy Lutomirski
2020-06-17 15:31       ` Jann Horn
2020-06-16  7:49 ` [PATCH 4/8] seccomp: Implement constant action bitmaps Kees Cook
2020-06-16 12:14   ` Jann Horn [this message]
2020-06-16 15:48     ` Kees Cook
2020-06-16 18:36       ` Jann Horn
2020-06-16 18:49         ` Kees Cook
2020-06-16 21:13         ` Andy Lutomirski
2020-06-16 14:40   ` Dave Hansen
2020-06-16 16:01     ` Kees Cook
2020-06-16  7:49 ` [PATCH 5/8] selftests/seccomp: Compare bitmap vs filter overhead Kees Cook
2020-06-16  7:49 ` [PATCH 6/8] x86: Provide API for local kernel TLB flushing Kees Cook
2020-06-16 16:59   ` Andy Lutomirski
2020-06-16 18:37     ` Kees Cook
2020-06-16  7:49 ` [PATCH 7/8] x86: Enable seccomp constant action bitmaps Kees Cook
2020-06-16  7:49 ` [PATCH 8/8] [DEBUG] seccomp: Report bitmap coverage ranges Kees Cook
2020-06-16 17:01 ` [RFC][PATCH 0/8] seccomp: Implement constant action bitmaps Andy Lutomirski
2020-06-16 18:35   ` Kees Cook

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com' \
    --to=jannh@google.com \
    --cc=christian@brauner.io \
    --cc=containers@lists.linux-foundation.org \
    --cc=cyphar@cyphar.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=hehuazhen@huawei.com \
    --cc=jeffv@google.com \
    --cc=keescook@chromium.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-security-module@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mpdenton@google.com \
    --cc=palmer@google.com \
    --cc=sargun@sargun.me \
    --cc=shuah@kernel.org \
    --cc=tycho@tycho.ws \
    --cc=wad@chromium.org \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    --cc=zhujianwei7@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).