From: Jann Horn <jannh@google.com>
To: Kees Cook <keescook@chromium.org>
Cc: kernel list <linux-kernel@vger.kernel.org>,
Christian Brauner <christian@brauner.io>,
Sargun Dhillon <sargun@sargun.me>,
Tycho Andersen <tycho@tycho.ws>,
"zhujianwei (C)" <zhujianwei7@huawei.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
Matthew Wilcox <willy@infradead.org>,
Andy Lutomirski <luto@kernel.org>, Will Drewry <wad@chromium.org>,
Shuah Khan <shuah@kernel.org>, Matt Denton <mpdenton@google.com>,
Chris Palmer <palmer@google.com>,
Jeffrey Vander Stoep <jeffv@google.com>,
Aleksa Sarai <cyphar@cyphar.com>,
Hehuazhen <hehuazhen@huawei.com>,
"the arch/x86 maintainers" <x86@kernel.org>,
Linux Containers <containers@lists.linux-foundation.org>,
linux-security-module <linux-security-module@vger.kernel.org>,
Linux API <linux-api@vger.kernel.org>
Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps
Date: Tue, 16 Jun 2020 14:14:47 +0200
Message-ID: <CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com>
In-Reply-To: <20200616074934.1600036-5-keescook@chromium.org>

On Tue, Jun 16, 2020 at 9:49 AM Kees Cook <keescook@chromium.org> wrote:
> One of the most common pain points with seccomp filters has been dealing
> with the overhead of processing the filters, especially for "always allow"
> or "always reject" cases. While BPF is extremely fast[1], it will always
> have overhead associated with it. Additionally, due to seccomp's design,
> filters are layered, which means processing time goes up as the number
> of filters attached goes up.
>
> In the past, efforts have been focused on making filter execution complete
> in a shorter amount of time. For example, filters were rewritten from
> using linear if/then/else syscall search to using balanced binary trees,
> or moving tests for syscalls common to the process's workload to the
> front of the filter. However, there are limits to this, especially when
> some processes are dealing with tens of filters[2], or when some
> architectures have a less efficient BPF engine[3].
>
> The most common use of seccomp, constructing syscall block/allow-lists,
> where syscalls are always allowed or always rejected (without regard
> to any arguments), also tends to produce the most pathological runtime
> problems, in that a large number of syscall checks in the filter need
> to be performed to come to a determination.
>
> In order to optimize these cases from O(n) to O(1), seccomp can
> use bitmaps to immediately determine the desired action. A critical
> observation in the prior paragraph bears repeating: the common case for
> syscall tests does not check arguments. For any given filter, there is a
> constant mapping from the combination of architecture and syscall to the
> seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
> there is a single architecture.) As such, it is possible to construct
> a mapping of arch/syscall to action, which can be updated as new filters
> are attached to a process.
>
> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data other than the "arch"
> and "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly. If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result.
>
> During syscall action probing, in order to determine whether other members
> of struct seccomp_data are being accessed during a filter execution,
> the struct is placed across a page boundary with the "arch" and "nr"
> members in the first page, and everything else in the second page. The
> "page accessed" flag is cleared in the second page's PTE, and the filter
> is run. If the "page accessed" flag appears as set after running the
> filter, we can determine that the filter looked beyond the "arch" and
> "nr" members, and exclude that syscall from the constant action bitmaps.
>
> For architectures to support this optimization, they must declare
> their architectures for seccomp to see (via SECCOMP_ARCH and
> SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient
> CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()),
> and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig.

Wouldn't it be simpler to use a function that can run a subset of
seccomp cBPF and bail out on anything that indicates that a syscall's
handling is complex, or on instructions it doesn't understand? For
syscalls that have a fixed policy, a typical seccomp filter doesn't
even use any of the BPF_ALU ops, the scratch space, or the X register;
it just uses something like the following set of operations, which is
easy to emulate without much code:

    BPF_LD | BPF_W | BPF_ABS
    BPF_JMP | BPF_JEQ | BPF_K
    BPF_JMP | BPF_JGE | BPF_K
    BPF_JMP | BPF_JGT | BPF_K
    BPF_JMP | BPF_JA
    BPF_RET | BPF_K

Something like (completely untested):

/*
 * Try to statically determine whether @filter will always return a fixed result
 * when run for syscall @nr under architecture @arch.
 * Returns true if the result could be determined; if so, the result will be
 * stored in @action.
 */
static bool seccomp_check_syscall(struct sock_filter *filter, unsigned int arch,
				  unsigned int nr, unsigned int *action)
{
	int pc;
	unsigned int reg_value = 0;

	for (pc = 0; 1; pc++) {
		struct sock_filter *insn = &filter[pc];
		u16 code = insn->code;
		u32 k = insn->k;

		switch (code) {
		case BPF_LD | BPF_W | BPF_ABS:
			if (k == offsetof(struct seccomp_data, nr)) {
				reg_value = nr;
			} else if (k == offsetof(struct seccomp_data, arch)) {
				reg_value = arch;
			} else {
				return false; /* can't optimize (non-constant value load) */
			}
			break;
		case BPF_RET | BPF_K:
			*action = insn->k;
			return true; /* success: reached return with constant values only */
		case BPF_JMP | BPF_JA:
			pc += insn->k;
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
		case BPF_JMP | BPF_JGE | BPF_K:
		case BPF_JMP | BPF_JGT | BPF_K:
		default:
			if (BPF_CLASS(code) == BPF_JMP && BPF_SRC(code) == BPF_K) {
				u16 op = BPF_OP(code);
				bool op_res;

				switch (op) {
				case BPF_JEQ:
					op_res = reg_value == k;
					break;
				case BPF_JGE:
					op_res = reg_value >= k;
					break;
				case BPF_JGT:
					op_res = reg_value > k;
					break;
				default:
					return false; /* can't optimize (unknown insn) */
				}

				pc += op_res ? insn->jt : insn->jf;
				break;
			}
			return false; /* can't optimize (unknown insn) */
		}
	}
}
That way, you won't need any of this complicated architecture-specific stuff.
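
To make the idea easier to poke at outside the kernel, here is a rough
userspace sketch of the same emulation, plus a loop that turns its results
into an "always allow" bitmap. Everything prefixed demo_/DEMO_ is
hypothetical and exists only for this illustration: the 64-entry table size,
the sample filter, and the single-word bitmap are assumptions, not anything
from the patch series. Unlike the in-kernel sketch above, this version takes
the filter length and checks it explicitly, because a userspace harness
cannot assume the program has already passed the kernel's cBPF validation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define DEMO_NR_SYSCALLS 64	/* hypothetical table size, fits one u64 */

/* Emulate the load/jump/return subset; false means "not provably constant". */
static bool demo_check_syscall(const struct sock_filter *filter, size_t len,
			       uint32_t arch, uint32_t nr, uint32_t *action)
{
	uint32_t a = 0;		/* emulated accumulator */
	size_t pc;

	for (pc = 0; pc < len; pc++) {
		const struct sock_filter *insn = &filter[pc];
		uint32_t k = insn->k;
		bool taken;

		switch (insn->code) {
		case BPF_LD | BPF_W | BPF_ABS:
			if (k == offsetof(struct seccomp_data, nr))
				a = nr;
			else if (k == offsetof(struct seccomp_data, arch))
				a = arch;
			else
				return false;	/* reads args or ip */
			continue;
		case BPF_RET | BPF_K:
			*action = k;
			return true;
		case BPF_JMP | BPF_JA:
			pc += k;
			continue;
		case BPF_JMP | BPF_JEQ | BPF_K:
			taken = (a == k);
			break;
		case BPF_JMP | BPF_JGE | BPF_K:
			taken = (a >= k);
			break;
		case BPF_JMP | BPF_JGT | BPF_K:
			taken = (a > k);
			break;
		default:
			return false;	/* instruction outside the subset */
		}
		pc += taken ? insn->jt : insn->jf;
	}
	return false;		/* ran off the end without a return */
}

/* Set bit nr when the filter provably returns SECCOMP_RET_ALLOW for nr. */
static unsigned int demo_build_allow_bitmap(const struct sock_filter *filter,
					    size_t len, uint32_t arch,
					    uint64_t *allow)
{
	unsigned int nr, count = 0;

	assert(len > 0);
	*allow = 0;
	for (nr = 0; nr < DEMO_NR_SYSCALLS; nr++) {
		uint32_t action;

		if (demo_check_syscall(filter, len, arch, nr, &action) &&
		    action == SECCOMP_RET_ALLOW) {
			*allow |= UINT64_C(1) << nr;
			count++;
		}
	}
	return count;
}

/* Sample policy: nr 0 always allowed, nr 1 depends on args[0], rest EPERM. */
static const struct sock_filter demo_filter[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 3, 0),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 0, 3),
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | 1),
};
#define DEMO_FILTER_LEN (sizeof(demo_filter) / sizeof(demo_filter[0]))
```

Calling demo_build_allow_bitmap(demo_filter, DEMO_FILTER_LEN,
AUDIT_ARCH_X86_64, &allow) should leave only bit 0 set: syscall 1 is rejected
by the emulator because its path loads args, and every other number resolves
to a constant SECCOMP_RET_ERRNO result rather than an allow. A real
implementation would presumably also want to record constant non-allow
actions and handle stacked filters; this sketch does neither.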