From: Jann Horn
Date: Thu, 24 Sep 2020 02:25:03 +0200
Subject: Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
In-Reply-To: <20200923232923.3142503-4-keescook@chromium.org>
To: Kees Cook
Cc: YiFei Zhu, Christian Brauner, Tycho Andersen, Andy Lutomirski,
    Will Drewry, Andrea Arcangeli, Giuseppe Scrivano,
    Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg,
    Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu,
    bpf, Linux Containers, Linux API, kernel list

On Thu, Sep 24, 2020 at 1:29 AM Kees Cook wrote:
> One of the most common pain points with seccomp filters has been dealing
> with the overhead of processing the filters, especially for "always allow"
> or "always reject" cases.

The "always reject" cases don't need to be fast, in particular not the
kill_thread/kill_process ones. Nobody's going to have "process kills
itself by executing a forbidden syscall" on a critical hot codepath.

> While BPF is extremely fast[1], it will always
> have overhead associated with it. Additionally, due to seccomp's design,
> filters are layered, which means processing time goes up as the number
> of filters attached goes up.
[...]
> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly. If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result. (This logic comes
> in the next patch.)
>
> [1] https://lore.kernel.org/bpf/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/
> [2] https://lore.kernel.org/bpf/20200601101137.GA121847@gardel-login/
> [3] https://lore.kernel.org/bpf/717a06e7f35740ccb4c70470ec70fb2f@huawei.com/
>
> Signed-off-by: Kees Cook
> ---
>  include/linux/seccomp.h |  18 ++++
>  kernel/seccomp.c        | 207 +++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 221 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 0be20bc81ea9..96df2f899e3d 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -25,6 +25,17 @@
>  #define SECCOMP_ARCH_IS_MULTIPLEX	3
>  #define SECCOMP_ARCH_IS_UNKNOWN	0xff
>
> +/* When no bits are set for a syscall, filters are run. */
> +struct seccomp_bitmaps {
> +#ifdef SECCOMP_ARCH
> +	/* "allow" are initialized to set and only ever get cleared. */
> +	DECLARE_BITMAP(allow, NR_syscalls);

This bitmap makes sense. The "NR_syscalls" part assumes that the compat
syscall tables will not be bigger than the native syscall table, right?
I guess that's usually mostly true nowadays, thanks to the syscall table
unification... (might be worth a comment though)

> +	/* These are initialized to clear and only ever get set. */
> +	DECLARE_BITMAP(kill_thread, NR_syscalls);
> +	DECLARE_BITMAP(kill_process, NR_syscalls);

I don't think these bitmaps make sense, this is not part of any fastpath.
(However, a "which syscalls have a fixed result" bitmap might make sense
if we want to export the list of permitted syscalls as a text file in
procfs, as I mentioned over at .)
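To sketch what I mean by that procfs export (rough illustration only: the
show function name, reading current's bitmap directly, and the hook-up into
/proc/<pid>/ are all made up here, not something from this patch), the read
side could be little more than a walk over the allow bitmap:

#include <linux/bitmap.h>
#include <linux/sched.h>
#include <linux/seccomp.h>
#include <linux/seq_file.h>

/* Hypothetical seq_file show handler for a per-task "seccomp_allowed"
 * procfs file; looking up the right task (instead of using current) and
 * the locking around it are omitted from this sketch. */
static int seccomp_allowed_show(struct seq_file *m, void *v)
{
	struct seccomp_bitmaps *bitmaps = &current->seccomp.native;
	unsigned int nr;

	/* One line per syscall nr with a constant SECCOMP_RET_ALLOW result. */
	for_each_set_bit(nr, bitmaps->allow, NR_syscalls)
		seq_printf(m, "%u\n", nr);

	return 0;
}

That would only cover "definitely allowed" syscalls, not the other fixed
results, but it only needs the allow bitmap.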
> +#endif
> +};
> +
>  struct seccomp_filter;
>  /**
>   * struct seccomp - the state of a seccomp'ed process
> @@ -45,6 +56,13 @@ struct seccomp {
>  #endif
>  	atomic_t filter_count;
>  	struct seccomp_filter *filter;
> +	struct seccomp_bitmaps native;
> +#ifdef CONFIG_COMPAT
> +	struct seccomp_bitmaps compat;
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> +	struct seccomp_bitmaps multiplex;
> +#endif

Why do we have one bitmap per thread (in struct seccomp) instead of
putting the bitmap for a given filter and all its ancestors into the
seccomp_filter?

>  };
>
>  #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 0a3ff8eb8aea..111a238bc532 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -318,7 +318,7 @@ static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr)
>
>  #ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
>  	if (syscall_arch == SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH) {
> -		seccomp_arch |= (sd->nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
> +		seccomp_arch |= (syscall_nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
>  				SECCOMP_MULTIPLEXED_SYSCALL_TABLE_SHIFT;

This belongs over into patch 1.

>  	}
>  #endif
> @@ -559,6 +559,21 @@ static inline void seccomp_sync_threads(unsigned long flags)
>  		atomic_set(&thread->seccomp.filter_count,
>  			   atomic_read(&thread->seccomp.filter_count));
>
> +		/* Copy syscall filter bitmaps. */
> +		memcpy(&thread->seccomp.native,
> +		       &caller->seccomp.native,
> +		       sizeof(caller->seccomp.native));
> +#ifdef CONFIG_COMPAT
> +		memcpy(&thread->seccomp.compat,
> +		       &caller->seccomp.compat,
> +		       sizeof(caller->seccomp.compat));
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> +		memcpy(&thread->seccomp.multiplex,
> +		       &caller->seccomp.multiplex,
> +		       sizeof(caller->seccomp.multiplex));
> +#endif

This part wouldn't be necessary if the bitmasks were part of the
seccomp_filter...

>  		/*
>  		 * Don't let an unprivileged task work around
>  		 * the no_new_privs restriction by creating
> @@ -661,6 +676,114 @@ seccomp_prepare_user_filter(const char __user *user_filter)
>  	return filter;
>  }
>
> +static inline bool sd_touched(pte_t *ptep)
> +{
> +	return !!pte_young(*(READ_ONCE(ptep)));
> +}

I think this is left over from the previous version and should've been
removed?

[...]

> +/*
> + * Walk everyone syscall combination for this arch/mask combo and update

nit: "Walk every possible", or something like that

> + * the bitmaps with any results.
> + */
> +static void seccomp_update_bitmap(struct seccomp_filter *filter,
> +				  void *pagepair, u32 arch, u32 mask,
> +				  struct seccomp_bitmaps *bitmaps)

[...]

> @@ -970,6 +1097,65 @@ static int seccomp_do_user_notification(int this_syscall,
>  	return -1;
>  }
>
> +#ifdef SECCOMP_ARCH
> +static inline bool __bypass_filter(struct seccomp_bitmaps *bitmaps,
> +				   u32 nr, u32 *filter_ret)
> +{
> +	if (nr < NR_syscalls) {
> +		if (test_bit(nr, bitmaps->allow)) {
> +			*filter_ret = SECCOMP_RET_ALLOW;
> +			return true;
> +		}
> +		if (test_bit(nr, bitmaps->kill_process)) {
> +			*filter_ret = SECCOMP_RET_KILL_PROCESS;
> +			return true;
> +		}
> +		if (test_bit(nr, bitmaps->kill_thread)) {
> +			*filter_ret = SECCOMP_RET_KILL_THREAD;
> +			return true;
> +		}

The checks against ->kill_process and ->kill_thread won't make anything
faster, but since they will run in the fastpath, they'll probably
actually contribute to making things *slower*.
> +	}
> +	return false;
> +}
> +
> +static inline u32 check_syscall(const struct seccomp_data *sd,
> +				struct seccomp_filter **match)
> +{
> +	u32 filter_ret = SECCOMP_RET_KILL_PROCESS;
> +	u8 arch = seccomp_get_arch(sd->arch, sd->nr);
> +
> +	switch (arch) {
> +	case SECCOMP_ARCH_IS_NATIVE:
> +		if (__bypass_filter(&current->seccomp.native, sd->nr, &filter_ret))
> +			return filter_ret;
> +		break;
> +#ifdef CONFIG_COMPAT
> +	case SECCOMP_ARCH_IS_COMPAT:
> +		if (__bypass_filter(&current->seccomp.compat, sd->nr, &filter_ret))
> +			return filter_ret;
> +		break;
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> +	case SECCOMP_ARCH_IS_MULTIPLEX:
> +		if (__bypass_filter(&current->seccomp.multiplex, sd->nr, &filter_ret))
> +			return filter_ret;
> +		break;
> +#endif
> +	default:
> +		WARN_ON_ONCE(1);
> +		return filter_ret;
> +	};
> +
> +	return seccomp_run_filters(sd, match);
> +}

You could write this in a less repetitive way, and especially if we get
rid of the kill_* masks, also more compact:

static inline u32 check_syscall(const struct seccomp_data *sd,
                                struct seccomp_filter **match)
{
        struct seccomp_bitmaps *bitmaps;
        u8 arch = seccomp_get_arch(sd->arch, sd->nr);

        switch (arch) {
        case SECCOMP_ARCH_IS_NATIVE:
                bitmaps = &current->seccomp.native;
                break;
#ifdef CONFIG_COMPAT
        case SECCOMP_ARCH_IS_COMPAT:
                bitmaps = &current->seccomp.compat;
                break;
#endif
#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
        case SECCOMP_ARCH_IS_MULTIPLEX:
                bitmaps = &current->seccomp.multiplex;
                break;
#endif
        default:
                WARN_ON_ONCE(1);
                return SECCOMP_RET_KILL_PROCESS;
        }

        if ((unsigned)sd->nr < NR_syscalls && test_bit(sd->nr, bitmaps->allow))
                return SECCOMP_RET_ALLOW;

        return seccomp_run_filters(sd, match);
}

[...]

> @@ -1625,12 +1812,24 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	    mutex_lock_killable(&current->signal->cred_guard_mutex))
>  		goto out_put_fd;
>
> +	/*
> +	 * This memory will be needed for bitmap testing, but we'll
> +	 * be holding a spinlock at that point. Do the allocation
> +	 * (and free) outside of the lock.
> +	 *
> +	 * Alternative: we could do the bitmap update before attach
> +	 * to avoid spending too much time under lock.
> +	 */
> +	pagepair = vzalloc(PAGE_SIZE * 2);
> +	if (!pagepair)
> +		goto out_put_fd;
> +
[...]
> -	ret = seccomp_attach_filter(flags, prepared);
> +	ret = seccomp_attach_filter(flags, prepared, pagepair);

You probably intended to rip this stuff back out? AFAIU the vzalloc()
stuff is a remnant from the old version that relied on MMU trickery.
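And to make the "put the bitmaps into the seccomp_filter" question from
above a bit more concrete, here is a very rough sketch of the kind of
layout I mean (the allow_native field, the helper names, and the omitted
emulation step are all made up for illustration; only the native table
is shown):

#include <linux/bitmap.h>
#include <linux/seccomp.h>

struct seccomp_filter {
	/* ... existing members (refcount, prog, ...) unchanged ... */
	struct seccomp_filter *prev;
	/* Syscalls that this filter *and all of its ancestors* always allow. */
	DECLARE_BITMAP(allow_native, NR_syscalls);
};

/* Hypothetical: clear the bit for every syscall that this filter does not
 * constant-allow, by emulating the filter for each syscall nr. */
static void seccomp_clear_non_constant_allows(struct seccomp_filter *filter);

/* Run once at attach time, before the new filter becomes visible. */
static void seccomp_init_allow_bitmap(struct seccomp_filter *filter)
{
	/* Inherit the parent's accumulated bitmap (or all-ones for the
	 * first filter), then knock out what this filter doesn't allow. */
	if (filter->prev)
		bitmap_copy(filter->allow_native, filter->prev->allow_native,
			    NR_syscalls);
	else
		bitmap_fill(filter->allow_native, NR_syscalls);

	seccomp_clear_non_constant_allows(filter);
}

Then the fastpath would look at current->seccomp.filter->allow_native
(one extra pointer chase), and seccomp_sync_threads() wouldn't need any
of the memcpy()s above.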