From: Jann Horn <jannh@google.com>
To: Kees Cook <keescook@chromium.org>
Cc: YiFei Zhu <yifeifz2@illinois.edu>,
Christian Brauner <christian.brauner@ubuntu.com>,
Tycho Andersen <tycho@tycho.pizza>,
Andy Lutomirski <luto@amacapital.net>,
Will Drewry <wad@chromium.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Giuseppe Scrivano <gscrivan@redhat.com>,
Tobin Feldman-Fitzthum <tobin@ibm.com>,
Dimitrios Skarlatos <dskarlat@cs.cmu.edu>,
Valentin Rothberg <vrothber@redhat.com>,
Hubertus Franke <frankeh@us.ibm.com>,
Jack Chen <jianyan2@illinois.edu>,
Josep Torrellas <torrella@illinois.edu>,
Tianyin Xu <tyxu@illinois.edu>, bpf <bpf@vger.kernel.org>,
Linux Containers <containers@lists.linux-foundation.org>,
Linux API <linux-api@vger.kernel.org>,
kernel list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
Date: Thu, 24 Sep 2020 02:25:03 +0200 [thread overview]
Message-ID: <CAG48ez0d80fOSTyn5QbH33WPz5UkzJJOo+V8of7YMR8pVQxumw@mail.gmail.com> (raw)
In-Reply-To: <20200923232923.3142503-4-keescook@chromium.org>
On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> One of the most common pain points with seccomp filters has been dealing
> with the overhead of processing the filters, especially for "always allow"
> or "always reject" cases.
The "always reject" cases don't need to be fast, in particular not the
kill_thread/kill_process ones. Nobody's going to have "process kills
itself by executing a forbidden syscall" on a critical hot codepath.
> While BPF is extremely fast[1], it will always
> have overhead associated with it. Additionally, due to seccomp's design,
> filters are layered, which means processing time goes up as the number
> of filters attached goes up.
[...]
> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly. If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result. (This logic comes
> in the next patch.)
>
> [1] https://lore.kernel.org/bpf/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/
> [2] https://lore.kernel.org/bpf/20200601101137.GA121847@gardel-login/
> [3] https://lore.kernel.org/bpf/717a06e7f35740ccb4c70470ec70fb2f@huawei.com/
>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
> include/linux/seccomp.h | 18 ++++
> kernel/seccomp.c | 207 +++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 221 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 0be20bc81ea9..96df2f899e3d 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -25,6 +25,17 @@
> #define SECCOMP_ARCH_IS_MULTIPLEX 3
> #define SECCOMP_ARCH_IS_UNKNOWN 0xff
>
> +/* When no bits are set for a syscall, filters are run. */
> +struct seccomp_bitmaps {
> +#ifdef SECCOMP_ARCH
> + /* "allow" are initialized to set and only ever get cleared. */
> + DECLARE_BITMAP(allow, NR_syscalls);
This bitmap makes sense.
The "NR_syscalls" part assumes that the compat syscall tables will not
be bigger than the native syscall table, right? I guess that's usually
mostly true nowadays, thanks to the syscall table unification...
(might be worth a comment though)
> + /* These are initialized to clear and only ever get set. */
> + DECLARE_BITMAP(kill_thread, NR_syscalls);
> + DECLARE_BITMAP(kill_process, NR_syscalls);
I don't think these bitmaps make sense, this is not part of any fastpath.
(However, a "which syscalls have a fixed result" bitmap might make
sense if we want to export the list of permitted syscalls as a text
file in procfs, as I mentioned over at
<https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/>.)
> +#endif
> +};
> +
> struct seccomp_filter;
> /**
> * struct seccomp - the state of a seccomp'ed process
> @@ -45,6 +56,13 @@ struct seccomp {
> #endif
> atomic_t filter_count;
> struct seccomp_filter *filter;
> + struct seccomp_bitmaps native;
> +#ifdef CONFIG_COMPAT
> + struct seccomp_bitmaps compat;
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> + struct seccomp_bitmaps multiplex;
> +#endif
Why do we have one bitmap per thread (in struct seccomp) instead of
putting the bitmap for a given filter and all its ancestors into the
seccomp_filter?
> };
>
> #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 0a3ff8eb8aea..111a238bc532 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -318,7 +318,7 @@ static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr)
>
> #ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> if (syscall_arch == SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH) {
> - seccomp_arch |= (sd->nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
> + seccomp_arch |= (syscall_nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
> SECCOMP_MULTIPLEXED_SYSCALL_TABLE_SHIFT;
This belongs over into patch 1.
> }
> #endif
> @@ -559,6 +559,21 @@ static inline void seccomp_sync_threads(unsigned long flags)
> atomic_set(&thread->seccomp.filter_count,
> atomic_read(&thread->seccomp.filter_count));
>
> + /* Copy syscall filter bitmaps. */
> + memcpy(&thread->seccomp.native,
> + &caller->seccomp.native,
> + sizeof(caller->seccomp.native));
> +#ifdef CONFIG_COMPAT
> + memcpy(&thread->seccomp.compat,
> + &caller->seccomp.compat,
> + sizeof(caller->seccomp.compat));
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> + memcpy(&thread->seccomp.multiplex,
> + &caller->seccomp.multiplex,
> + sizeof(caller->seccomp.multiplex));
> +#endif
This part wouldn't be necessary if the bitmasks were part of the
seccomp_filter...
> /*
> * Don't let an unprivileged task work around
> * the no_new_privs restriction by creating
> @@ -661,6 +676,114 @@ seccomp_prepare_user_filter(const char __user *user_filter)
> return filter;
> }
>
> +static inline bool sd_touched(pte_t *ptep)
> +{
> + return !!pte_young(*(READ_ONCE(ptep)));
> +}
I think this is left over from the previous version and should've been removed?
[...]
> +/*
> + * Walk everyone syscall combination for this arch/mask combo and update
nit: "Walk every possible", or something like that
> + * the bitmaps with any results.
> + */
> +static void seccomp_update_bitmap(struct seccomp_filter *filter,
> + void *pagepair, u32 arch, u32 mask,
> + struct seccomp_bitmaps *bitmaps)
[...]
> @@ -970,6 +1097,65 @@ static int seccomp_do_user_notification(int this_syscall,
> return -1;
> }
>
> +#ifdef SECCOMP_ARCH
> +static inline bool __bypass_filter(struct seccomp_bitmaps *bitmaps,
> + u32 nr, u32 *filter_ret)
> +{
> + if (nr < NR_syscalls) {
> + if (test_bit(nr, bitmaps->allow)) {
> + *filter_ret = SECCOMP_RET_ALLOW;
> + return true;
> + }
> + if (test_bit(nr, bitmaps->kill_process)) {
> + *filter_ret = SECCOMP_RET_KILL_PROCESS;
> + return true;
> + }
> + if (test_bit(nr, bitmaps->kill_thread)) {
> + *filter_ret = SECCOMP_RET_KILL_THREAD;
> + return true;
> + }
The checks against ->kill_process and ->kill_thread won't make
anything faster, but since they will run in the fastpath, they'll
probably actually contribute to making things *slower*.
> + }
> + return false;
> +}
> +
> +static inline u32 check_syscall(const struct seccomp_data *sd,
> + struct seccomp_filter **match)
> +{
> + u32 filter_ret = SECCOMP_RET_KILL_PROCESS;
> + u8 arch = seccomp_get_arch(sd->arch, sd->nr);
> +
> + switch (arch) {
> + case SECCOMP_ARCH_IS_NATIVE:
> + if (__bypass_filter(¤t->seccomp.native, sd->nr, &filter_ret))
> + return filter_ret;
> + break;
> +#ifdef CONFIG_COMPAT
> + case SECCOMP_ARCH_IS_COMPAT:
> + if (__bypass_filter(¤t->seccomp.compat, sd->nr, &filter_ret))
> + return filter_ret;
> + break;
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> + case SECCOMP_ARCH_IS_MULTIPLEX:
> + if (__bypass_filter(¤t->seccomp.multiplex, sd->nr, &filter_ret))
> + return filter_ret;
> + break;
> +#endif
> + default:
> + WARN_ON_ONCE(1);
> + return filter_ret;
> + };
> +
> + return seccomp_run_filters(sd, match);
> +}
You could write this in a less repetitive way, and especially if we
get rid of the kill_* masks, also more compact:
static inline u32 check_syscall(const struct seccomp_data *sd,
struct seccomp_filter **match)
{
struct seccomp_bitmaps *bitmaps;
u32 filter_ret;
switch (arch) {
case SECCOMP_ARCH_IS_NATIVE:
bitmaps = ¤t->seccomp.native;
break;
#ifdef CONFIG_COMPAT
case SECCOMP_ARCH_IS_COMPAT:
bitmaps = ¤t->seccomp.compat;
break;
#endif
#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
case SECCOMP_ARCH_IS_MULTIPLEX:
bitmaps = ¤t->seccomp.multiplex;
break;
#endif
default:
WARN_ON_ONCE(1);
return SECCOMP_RET_KILL_PROCESS;
}
if ((unsigned)sd->nr < __NR_syscalls && test_bit(sd->nr, bitmaps->allow))
return SECCOMP_RET_ALLOW;
return seccomp_run_filters(sd, match);
}
[...]
> @@ -1625,12 +1812,24 @@ static long seccomp_set_mode_filter(unsigned int flags,
> mutex_lock_killable(¤t->signal->cred_guard_mutex))
> goto out_put_fd;
>
> + /*
> + * This memory will be needed for bitmap testing, but we'll
> + * be holding a spinlock at that point. Do the allocation
> + * (and free) outside of the lock.
> + *
> + * Alternative: we could do the bitmap update before attach
> + * to avoid spending too much time under lock.
> + */
> + pagepair = vzalloc(PAGE_SIZE * 2);
> + if (!pagepair)
> + goto out_put_fd;
> +
[...]
> - ret = seccomp_attach_filter(flags, prepared);
> + ret = seccomp_attach_filter(flags, prepared, pagepair);
You probably intended to rip this stuff back out? AFAIU the vzalloc()
stuff is a remnant from the old version that relied on MMU trickery.
next prev parent reply other threads:[~2020-09-24 0:25 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-09-23 23:29 [PATCH v1 0/6] seccomp: Implement constant action bitmaps Kees Cook
2020-09-23 23:29 ` [PATCH 1/6] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE Kees Cook
2020-09-24 0:41 ` Jann Horn
2020-09-24 7:11 ` Kees Cook
2020-09-23 23:29 ` [PATCH 2/6] x86: Enable seccomp architecture tracking Kees Cook
2020-09-24 0:45 ` Jann Horn
2020-09-24 7:12 ` Kees Cook
2020-09-23 23:29 ` [PATCH 3/6] seccomp: Implement constant action bitmaps Kees Cook
2020-09-24 0:25 ` Jann Horn [this message]
2020-09-24 7:36 ` Kees Cook
2020-09-24 8:07 ` YiFei Zhu
2020-09-24 8:15 ` Kees Cook
2020-09-24 8:22 ` YiFei Zhu
2020-09-24 12:28 ` Jann Horn
2020-09-24 12:37 ` David Laight
2020-09-24 12:56 ` Jann Horn
[not found] ` <DM6PR11MB271492D0565E91475D949F5DEF390@DM6PR11MB2714.namprd11.prod.outlook.com>
2020-09-24 0:36 ` YiFei Zhu
2020-09-24 7:38 ` Kees Cook
2020-09-24 7:51 ` YiFei Zhu
2020-09-23 23:29 ` [PATCH 4/6] seccomp: Emulate basic filters for constant action results Kees Cook
2020-09-23 23:47 ` Jann Horn
2020-09-24 7:46 ` Kees Cook
2020-09-24 15:28 ` Paul Moore
2020-09-24 19:52 ` Kees Cook
2020-09-24 20:46 ` Paul Moore
2020-09-24 21:35 ` Kees Cook
2020-09-23 23:29 ` [PATCH 5/6] selftests/seccomp: Compare bitmap vs filter overhead Kees Cook
2020-09-23 23:29 ` [PATCH 6/6] [DEBUG] seccomp: Report bitmap coverage ranges Kees Cook
2020-09-24 13:40 ` [PATCH v1 0/6] seccomp: Implement constant action bitmaps Rasmus Villemoes
2020-09-24 13:58 ` YiFei Zhu
2020-09-25 5:56 ` Rasmus Villemoes
2020-09-25 7:07 ` YiFei Zhu
2020-09-26 18:11 ` YiFei Zhu
2020-09-28 20:04 ` Kees Cook
2020-09-28 20:16 ` YiFei Zhu
2020-09-24 14:05 ` Jann Horn
2020-09-24 18:57 ` Andrea Arcangeli
2020-09-24 19:18 ` Jann Horn
[not found] ` <9dbe8e3bbdad43a1872202ff38c34ca2@DM5PR11MB1692.namprd11.prod.outlook.com>
2020-09-24 19:48 ` Tianyin Xu
2020-09-24 20:00 ` Kees Cook
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAG48ez0d80fOSTyn5QbH33WPz5UkzJJOo+V8of7YMR8pVQxumw@mail.gmail.com \
--to=jannh@google.com \
--cc=aarcange@redhat.com \
--cc=bpf@vger.kernel.org \
--cc=christian.brauner@ubuntu.com \
--cc=containers@lists.linux-foundation.org \
--cc=dskarlat@cs.cmu.edu \
--cc=frankeh@us.ibm.com \
--cc=gscrivan@redhat.com \
--cc=jianyan2@illinois.edu \
--cc=keescook@chromium.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@amacapital.net \
--cc=tobin@ibm.com \
--cc=torrella@illinois.edu \
--cc=tycho@tycho.pizza \
--cc=tyxu@illinois.edu \
--cc=vrothber@redhat.com \
--cc=wad@chromium.org \
--cc=yifeifz2@illinois.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).