archive mirror
 help / color / mirror / Atom feed
From: Jann Horn <>
To: Kees Cook <>
Cc: kernel list <>,
	Christian Brauner <>,
	Sargun Dhillon <>,
	Tycho Andersen <>,
	"zhujianwei (C)" <>,
	Dave Hansen <>,
	Matthew Wilcox <>,
	Andy Lutomirski <>, Will Drewry <>,
	Shuah Khan <>, Matt Denton <>,
	Chris Palmer <>,
	Jeffrey Vander Stoep <>,
	Aleksa Sarai <>,
	Hehuazhen <>,
	"the arch/x86 maintainers" <>,
	Linux Containers <>,
	linux-security-module <>,
	Linux API <>
Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps
Date: Tue, 16 Jun 2020 14:14:47 +0200	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Tue, Jun 16, 2020 at 9:49 AM Kees Cook <> wrote:
> One of the most common pain points with seccomp filters has been dealing
> with the overhead of processing the filters, especially for "always allow"
> or "always reject" cases. While BPF is extremely fast[1], it will always
> have overhead associated with it. Additionally, due to seccomp's design,
> filters are layered, which means processing time goes up as the number
> of filters attached goes up.
> In the past, efforts have been focused on making filter execution complete
> in a shorter amount of time. For example, filters were rewritten from
> using linear if/then/else syscall search to using balanced binary trees,
> or moving tests for syscalls common to the process's workload to the
> front of the filter. However, there are limits to this, especially when
> some processes are dealing with tens of filters[2], or when some
> architectures have a less efficient BPF engine[3].
> The most common use of seccomp, constructing syscall block/allow-lists,
> where syscalls that are always allowed or always rejected (without regard
> to any arguments), also tends to produce the most pathological runtime
> problems, in that a large number of syscall checks in the filter need
> to be performed to come to a determination.
> In order to optimize these cases from O(n) to O(1), seccomp can
> use bitmaps to immediately determine the desired action. A critical
> observation in the prior paragraph bears repeating: the common case for
> syscall tests do not check arguments. For any given filter, there is a
> constant mapping from the combination of architecture and syscall to the
> seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
> there is a single architecture.). As such, it is possible to construct
> a mapping of arch/syscall to action, which can be updated as new filters
> are attached to a process.
> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly. If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result.
> During syscall action probing, in order to determine whether other members
> of struct seccomp_data are being accessed during a filter execution,
> the struct is placed across a page boundary with the "arch" and "nr"
> members in the first page, and everything else in the second page. The
> "page accessed" flag is cleared in the second page's PTE, and the filter
> is run. If the "page accessed" flag appears as set after running the
> filter, we can determine that the filter looked beyond the "arch" and
> "nr" members, and exclude that syscall from the constant action bitmaps.
> For architectures to support this optimization, they must declare
> their architectures for seccomp to see (via SECCOMP_ARCH and
> SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient
> CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()),
> and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig.

Wouldn't it be simpler to use a function that can run a subset of
seccomp cBPF and bails out on anything that indicates that a syscall's
handling is complex or on instructions it doesn't understand? For
syscalls that have a fixed policy, a typical seccomp filter doesn't
even use any of the BPF_ALU ops, the scratch space, or the X register;
it just uses something like the following set of operations, which is
easy to emulate without much code:


Something like (completely untested):

 * Try to statically determine whether @filter will always return a fixed result
 * when run for syscall @nr under architecture @arch.
 * Returns true if the result could be determined; if so, the result will be
 * stored in @action.
static bool seccomp_check_syscall(struct sock_filter *filter, unsigned int arch,
                                  unsigned int nr, unsigned int *action)
  int pc;
  unsigned int reg_value = 0;

  for (pc = 0; 1; pc++) {
    struct sock_filter *insn = &filter[pc];
    u16 code = insn->code;
    u32 k = insn->k;

    switch (code) {
    case BPF_LD | BPF_W | BPF_ABS:
      if (k == offsetof(struct seccomp_data, nr)) {
        reg_value = nr;
      } else if (k == offsetof(struct seccomp_data, arch)) {
        reg_value = arch;
      } else {
        return false; /* can't optimize (non-constant value load) */
    case BPF_RET | BPF_K:
      *action = insn->k;
      return true; /* success: reached return with constant values only */
    case BPF_JMP | BPF_JA:
      pc += insn->k;
    case BPF_JMP | BPF_JEQ | BPF_K:
    case BPF_JMP | BPF_JGE | BPF_K:
    case BPF_JMP | BPF_JGT | BPF_K:
      if (BPF_CLASS(code) == BPF_JMP && BPF_SRC(code) == BPF_K) {
        u16 op = BPF_OP(code);
        bool op_res;

        switch (op) {
        case BPF_JEQ:
          op_res = reg_value == k;
        case BPF_JGE:
          op_res = reg_value >= k;
        case BPF_JGT:
          op_res = reg_value > k;
          return false; /* can't optimize (unknown insn) */

        pc += op_res ? insn->jt : insn->jf;
      return false; /* can't optimize (unknown insn) */

That way, you won't need any of this complicated architecture-specific stuff.

  reply	other threads:[~2020-06-16 12:15 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-16  7:49 [RFC][PATCH 0/8] seccomp: Implement constant action bitmaps Kees Cook
2020-06-16  7:49 ` [PATCH 1/8] selftests/seccomp: Improve calibration loop Kees Cook
2020-06-16  7:49 ` [PATCH 2/8] seccomp: Use pr_fmt Kees Cook
2020-06-16  7:49 ` [PATCH 3/8] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE Kees Cook
2020-06-16 16:56   ` Andy Lutomirski
2020-06-17 15:25   ` Jann Horn
2020-06-17 15:29     ` Andy Lutomirski
2020-06-17 15:31       ` Jann Horn
2020-06-16  7:49 ` [PATCH 4/8] seccomp: Implement constant action bitmaps Kees Cook
2020-06-16 12:14   ` Jann Horn [this message]
2020-06-16 15:48     ` Kees Cook
2020-06-16 18:36       ` Jann Horn
2020-06-16 18:49         ` Kees Cook
2020-06-16 21:13         ` Andy Lutomirski
2020-06-16 14:40   ` Dave Hansen
2020-06-16 16:01     ` Kees Cook
2020-06-16  7:49 ` [PATCH 5/8] selftests/seccomp: Compare bitmap vs filter overhead Kees Cook
2020-06-16  7:49 ` [PATCH 6/8] x86: Provide API for local kernel TLB flushing Kees Cook
2020-06-16 16:59   ` Andy Lutomirski
2020-06-16 18:37     ` Kees Cook
2020-06-16  7:49 ` [PATCH 7/8] x86: Enable seccomp constant action bitmaps Kees Cook
2020-06-16  7:49 ` [PATCH 8/8] [DEBUG] seccomp: Report bitmap coverage ranges Kees Cook
2020-06-16 17:01 ` [RFC][PATCH 0/8] seccomp: Implement constant action bitmaps Andy Lutomirski
2020-06-16 18:35   ` Kees Cook

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='' \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).