From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753096Ab2ALIxj (ORCPT ); Thu, 12 Jan 2012 03:53:39 -0500 Received: from youngberry.canonical.com ([91.189.89.112]:57492 "EHLO youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752910Ab2ALIxf (ORCPT ); Thu, 12 Jan 2012 03:53:35 -0500 Date: Thu, 12 Jan 2012 09:53:16 +0100 From: Serge Hallyn To: Will Drewry Cc: linux-kernel@vger.kernel.org, keescook@chromium.org, john.johansen@canonical.com, coreyb@linux.vnet.ibm.com, pmoore@redhat.com, eparis@redhat.com, djm@mindrot.org, torvalds@linux-foundation.org, segoon@openwall.com, rostedt@goodmis.org, jmorris@namei.org, scarybeasts@gmail.com, avi@redhat.com, penberg@cs.helsinki.fi, viro@zeniv.linux.org.uk, luto@mit.edu, mingo@elte.hu, akpm@linux-foundation.org, khilman@ti.com, borislav.petkov@amd.com, amwang@redhat.com, oleg@redhat.com, ak@linux.intel.com, eric.dumazet@gmail.com, gregkh@suse.de, dhowells@redhat.com, daniel.lezcano@free.fr, linux-fsdevel@vger.kernel.org, linux-security-module@vger.kernel.org, olofj@chromium.org, mhalcrow@google.com, dlaor@redhat.com Subject: Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF Message-ID: <20120112085316.GB16518@peqn> References: <1326302710-9427-1-git-send-email-wad@chromium.org> <1326302710-9427-2-git-send-email-wad@chromium.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1326302710-9427-2-git-send-email-wad@chromium.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Quoting Will Drewry (wad@chromium.org): > This patch adds support for seccomp mode 2. This mode enables dynamic > enforcement of system call filtering policy in the kernel as specified > by a userland task. The policy is expressed in terms of a BPF program, > as is used for userland-exposed socket filtering. Instead of network > data, the BPF program is evaluated over struct user_regs_struct at the > time of the system call (as retrieved using regviews). > > A filter program may be installed by a userland task by calling > prctl(PR_ATTACH_SECCOMP_FILTER, &fprog); > where fprog is of type struct sock_fprog. > > If the first filter program allows subsequent prctl(2) calls, then > additional filter programs may be attached. All attached programs > must be evaluated before a system call will be allowed to proceed. > > To avoid CONFIG_COMPAT related landmines, once a filter program is > installed using specific is_compat_task() and current->personality, it > is not allowed to make system calls or attach additional filters which > use a different combination of is_compat_task() and > current->personality. > > Filter programs may _only_ cross the execve(2) barrier if last filter > program was attached by a task with CAP_SYS_ADMIN capabilities in its > user namespace. Once a task-local filter program is attached from a > process without privileges, execve will fail. This ensures that only > privileged parent task can affect its privileged children (e.g., setuid > binary). > > There are a number of benefits to this approach. A few of which are > as follows: > - BPF has been exposed to userland for a long time. > - Userland already knows its ABI: expected register layout and system > call numbers. > - Full register information is provided which may be relevant for > certain syscalls (fork, rt_sigreturn) or for other userland > filtering tactics (checking the PC). > - No time-of-check-time-of-use vulnerable data accesses are possible. > > This patch includes its own BPF evaluator, but relies on the > net/core/filter.c BPF checking code. It is possible to share > evaluators, but the performance sensitive nature of the network > filtering path makes it an iterative optimization which (I think :) can > be tackled separately via separate patchsets. (And at some point sharing > BPF JIT code!) > > Signed-off-by: Will Drewry Hey Will, A few comments below, but otherwise Acked-by: Serge Hallyn thanks, -serge > --- > fs/exec.c | 5 + > include/linux/prctl.h | 3 + > include/linux/seccomp.h | 70 +++++- > kernel/Makefile | 1 + > kernel/fork.c | 4 + > kernel/seccomp.c | 8 + > kernel/seccomp_filter.c | 639 +++++++++++++++++++++++++++++++++++++++++++++++ > kernel/sys.c | 4 + > security/Kconfig | 12 + > 9 files changed, 743 insertions(+), 3 deletions(-) > create mode 100644 kernel/seccomp_filter.c > > diff --git a/fs/exec.c b/fs/exec.c > index 3625464..e9cc89c 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -44,6 +44,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename, > if (retval) > goto out_ret; > > + retval = seccomp_check_exec(); > + if (retval) > + goto out_ret; > + > retval = -ENOMEM; > bprm = kzalloc(sizeof(*bprm), GFP_KERNEL); > if (!bprm) > diff --git a/include/linux/prctl.h b/include/linux/prctl.h > index a3baeb2..15e2460 100644 > --- a/include/linux/prctl.h > +++ b/include/linux/prctl.h > @@ -64,6 +64,9 @@ > #define PR_GET_SECCOMP 21 > #define PR_SET_SECCOMP 22 > > +/* Set process seccomp filters */ > +#define PR_ATTACH_SECCOMP_FILTER 36 > + > /* Get/set the capability bounding set (as per security/commoncap.c) */ > #define PR_CAPBSET_READ 23 > #define PR_CAPBSET_DROP 24 > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h > index cc7a4e9..99d163e 100644 > --- a/include/linux/seccomp.h > +++ b/include/linux/seccomp.h > @@ -5,9 +5,28 @@ > #ifdef CONFIG_SECCOMP > > #include > +#include > #include > > -typedef struct { int mode; } seccomp_t; > +struct seccomp_filter; > +/** > + * struct seccomp_struct - the state of a seccomp'ed process > + * > + * @mode: > + * if this is 0, seccomp is not in use. > + * is 1, the process is under standard seccomp rules. > + * is 2, the process is only allowed to make system calls where > + * associated filters evaluate successfully. > + * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER. > + * @filter must only be accessed from the context of current as there > + * is no guard. > + */ > +typedef struct seccomp_struct { > + int mode; > +#ifdef CONFIG_SECCOMP_FILTER > + struct seccomp_filter *filter; > +#endif > +} seccomp_t; > > extern void __secure_computing(int); > static inline void secure_computing(int this_syscall) > @@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s) > > #include > > -typedef struct { } seccomp_t; > - > +typedef struct seccomp_struct { } seccomp_t; > #define secure_computing(x) do { } while (0) > > static inline long prctl_get_seccomp(void) > @@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s) > > #endif /* CONFIG_SECCOMP */ > > +#ifdef CONFIG_SECCOMP_FILTER > + > +#define seccomp_filter_init_task(_tsk) do { \ > + (_tsk)->seccomp.filter = NULL; \ > +} while (0); > + > +/* No locking is needed here because the task_struct will > + * have no parallel consumers. > + */ > +#define seccomp_filter_free_task(_tsk) do { \ > + put_seccomp_filter((_tsk)->seccomp.filter); \ > +} while (0); > + > +extern int seccomp_check_exec(void); > + > +extern long prctl_attach_seccomp_filter(char __user *); > + > +extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *); > +extern void put_seccomp_filter(struct seccomp_filter *); > + > +extern int seccomp_test_filters(int); > +extern void seccomp_filter_log_failure(int); > +extern void seccomp_filter_fork(struct task_struct *child, > + struct task_struct *parent); > + > +#else /* CONFIG_SECCOMP_FILTER */ > + > +#include > + > +struct seccomp_filter { }; > +#define seccomp_filter_init_task(_tsk) do { } while (0); > +#define seccomp_filter_fork(_tsk, _orig) do { } while (0); > +#define seccomp_filter_free_task(_tsk) do { } while (0); > + > +static inline int seccomp_check_exec(void) > +{ > + return 0; > +} > + > + > +static inline long prctl_attach_seccomp_filter(char __user *a2) > +{ > + return -ENOSYS; > +} > + > +#endif /* CONFIG_SECCOMP_FILTER */ > #endif /* _LINUX_SECCOMP_H */ > diff --git a/kernel/Makefile b/kernel/Makefile > index e898c5b..0584090 100644 > --- a/kernel/Makefile > +++ b/kernel/Makefile > @@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o > obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o > obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ > obj-$(CONFIG_SECCOMP) += seccomp.o > +obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o > obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o > obj-$(CONFIG_TREE_RCU) += rcutree.o > obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o > diff --git a/kernel/fork.c b/kernel/fork.c > index da4a6a1..cc1d628 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -34,6 +34,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk) > free_thread_info(tsk->stack); > rt_mutex_debug_task_free(tsk); > ftrace_graph_exit_task(tsk); > + seccomp_filter_free_task(tsk); > free_task_struct(tsk); > } > EXPORT_SYMBOL(free_task); > @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, > /* Perform scheduler related setup. Assign this task to a CPU. */ > sched_fork(p); > > + seccomp_filter_init_task(p); > retval = perf_event_init_task(p); > if (retval) > goto bad_fork_cleanup_policy; > @@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, > if (clone_flags & CLONE_THREAD) > threadgroup_fork_read_unlock(current); > perf_event_fork(p); > + seccomp_filter_fork(p, current); > return p; > > bad_fork_free_pid: > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 57d4b13..78719be 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -47,6 +47,14 @@ void __secure_computing(int this_syscall) > return; > } while (*++syscall); > break; > +#ifdef CONFIG_SECCOMP_FILTER > + case 2: > + if (seccomp_test_filters(this_syscall) == 0) > + return; > + > + seccomp_filter_log_failure(this_syscall); > + break; > +#endif > default: > BUG(); > } > diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c > new file mode 100644 > index 0000000..4770847 > --- /dev/null > +++ b/kernel/seccomp_filter.c > @@ -0,0 +1,639 @@ > +/* bpf program-based system call filtering > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with this program; if not, write to the Free Software > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. > + * > + * Copyright (C) 2011 The Chromium OS Authors > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > + > +/** > + * struct seccomp_filter - container for seccomp BPF programs > + * > + * @usage: reference count to manage the object lifetime. > + * get/put helpers should be used when accessing an instance > + * outside of a lifetime-guarded section. In general, this > + * is only needed for handling filters shared across tasks. > + * @creator: pointer to the pid that created this filter > + * @parent: pointer to the ancestor which this filter will be composed with. > + * @flags: provide information about filter from creation time. > + * @personality: personality of the process at filter creation time. > + * @insns: the BPF program instructions to evaluate > + * @count: the number of instructions in the program. > + * > + * seccomp_filter objects should never be modified after being attached > + * to a task_struct (other than @usage). > + */ > +struct seccomp_filter { > + struct kref usage; > + struct pid *creator; > + struct seccomp_filter *parent; > + struct { > + uint32_t admin:1, /* can allow execve */ > + compat:1, /* CONFIG_COMPAT */ > + __reserved:30; > + } flags; > + int personality; > + unsigned short count; /* Instruction count */ > + struct sock_filter insns[0]; > +}; > + > +static unsigned int seccomp_run_filter(const u8 *buf, > + const size_t buflen, > + const struct sock_filter *); > + > +/** > + * seccomp_filter_alloc - allocates a new filter object > + * @padding: size of the insns[0] array in bytes > + * > + * The @padding should be a multiple of > + * sizeof(struct sock_filter). > + * > + * Returns ERR_PTR on error or an allocated object. > + */ > +static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding) > +{ > + struct seccomp_filter *f; > + unsigned long bpf_blocks = padding / sizeof(struct sock_filter); > + > + /* Drop oversized requests. */ > + if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS) > + return ERR_PTR(-EINVAL); > + > + /* Padding should always be in sock_filter increments. */ > + BUG_ON(padding % sizeof(struct sock_filter)); I still think the BUG_ON here is harsh given that the progsize is passed in by userspace. Was there a reason not to return -EINVAL here? > + > + f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL); > + if (!f) > + return ERR_PTR(-ENOMEM); > + kref_init(&f->usage); > + f->creator = get_task_pid(current, PIDTYPE_PID); > + f->count = bpf_blocks; > + return f; > +} > + > +/** > + * seccomp_filter_free - frees the allocated filter. > + * @filter: NULL or live object to be completely destructed. > + */ > +static void seccomp_filter_free(struct seccomp_filter *filter) > +{ > + if (!filter) > + return; > + put_seccomp_filter(filter->parent); > + put_pid(filter->creator); > + kfree(filter); > +} > + > +static void __put_seccomp_filter(struct kref *kref) > +{ > + struct seccomp_filter *orig = > + container_of(kref, struct seccomp_filter, usage); > + seccomp_filter_free(orig); > +} > + > +void seccomp_filter_log_failure(int syscall) > +{ > + pr_info("%s[%d]: system call %d blocked at 0x%lx\n", > + current->comm, task_pid_nr(current), syscall, > + KSTK_EIP(current)); > +} > + > +/* put_seccomp_filter - decrements the ref count of @orig and may free. */ > +void put_seccomp_filter(struct seccomp_filter *orig) > +{ > + if (!orig) > + return; > + kref_put(&orig->usage, __put_seccomp_filter); > +} > + > +/* get_seccomp_filter - increments the reference count of @orig. */ > +struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig) > +{ > + if (!orig) > + return NULL; > + kref_get(&orig->usage); > + return orig; > +} > + > +static int seccomp_check_personality(struct seccomp_filter *filter) > +{ > + if (filter->personality != current->personality) > + return -EACCES; > +#ifdef CONFIG_COMPAT > + if (filter->flags.compat != (!!(is_compat_task()))) > + return -EACCES; > +#endif > + return 0; > +} > + > +static const struct user_regset * > +find_prstatus(const struct user_regset_view *view) > +{ > + const struct user_regset *regset; > + int n; > + > + /* Skip 0. */ > + for (n = 1; n < view->n; ++n) { > + regset = view->regsets + n; > + if (regset->core_note_type == NT_PRSTATUS) > + return regset; > + } > + > + return NULL; > +} > + > +/** > + * seccomp_get_regs - returns a pointer to struct user_regs_struct > + * @scratch: preallocated storage of size @available > + * @available: pointer to the size of scratch. > + * > + * Returns NULL if the registers cannot be acquired or copied. > + * Returns a populated pointer to @scratch by default. > + * Otherwise, returns a pointer to a a u8 array containing the struct > + * user_regs_struct appropriate for the task personality. The pointer > + * may be to the beginning of @scratch or to an externally managed data > + * structure. On success, @available should be updated with the > + * valid region size of the returned pointer. > + * > + * If the architecture overrides the linkage, then the pointer may pointer to > + * another location. > + */ > +__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available) > +{ > + /* regset is usually returned based on task personality, not current > + * system call convention. This behavior makes it unsafe to execute > + * BPF programs over regviews if is_compat_task or the personality > + * have changed since the program was installed. > + */ > + const struct user_regset_view *view = task_user_regset_view(current); > + const struct user_regset *regset = &view->regsets[0]; > + size_t scratch_size = *available; > + if (regset->core_note_type != NT_PRSTATUS) { > + /* The architecture should override this method for speed. */ > + regset = find_prstatus(view); > + if (!regset) > + return NULL; > + } > + *available = regset->n * regset->size; > + /* Make sure the scratch space isn't exceeded. */ > + if (*available > scratch_size) > + *available = scratch_size; > + if (regset->get(current, regset, 0, *available, scratch, NULL)) > + return NULL; > + return scratch; > +} > + > +/** > + * seccomp_test_filters - tests 'current' against the given syscall > + * @syscall: number of the system call to test > + * > + * Returns 0 on ok and non-zero on error/failure. > + */ > +int seccomp_test_filters(int syscall) > +{ > + struct seccomp_filter *filter; > + u8 regs_tmp[sizeof(struct user_regs_struct)], *regs; > + size_t regs_size = sizeof(struct user_regs_struct); > + int ret = -EACCES; > + > + filter = current->seccomp.filter; /* uses task ref */ > + if (!filter) > + goto out; > + > + /* All filters in the list are required to share the same system call > + * convention so only the first filter is ever checked. > + */ > + if (seccomp_check_personality(filter)) > + goto out; > + > + /* Grab the user_regs_struct. Normally, regs == ®s_tmp, but > + * that is not mandatory. E.g., it may return a point to > + * task_pt_regs(current). NULL checking is mandatory. > + */ > + regs = seccomp_get_regs(regs_tmp, ®s_size); > + if (!regs) > + goto out; > + > + /* Only allow a system call if it is allowed in all ancestors. */ > + ret = 0; > + for ( ; filter != NULL; filter = filter->parent) { > + /* Allowed if return value is the size of the data supplied. */ > + if (seccomp_run_filter(regs, regs_size, filter->insns) != > + regs_size) > + ret = -EACCES; > + } > +out: > + return ret; > +} > + > +/** > + * seccomp_attach_filter: Attaches a seccomp filter to current. > + * @fprog: BPF program to install > + * > + * Context: User context only. This function may sleep on allocation and > + * operates on current. current must be attempting a system call > + * when this is called (usually prctl). > + * > + * This function may be called repeatedly to install additional filters. > + * Every filter successfully installed will be evaluated (in reverse order) > + * for each system call the thread makes. > + * > + * Returns 0 on success or an errno on failure. > + */ > +long seccomp_attach_filter(struct sock_fprog *fprog) > +{ > + struct seccomp_filter *filter = NULL; > + /* Note, len is a short so overflow should be impossible. */ > + unsigned long fp_size = fprog->len * sizeof(struct sock_filter); > + long ret = -EPERM; > + > + /* Allocate a new seccomp_filter */ > + filter = seccomp_filter_alloc(fp_size); > + if (IS_ERR(filter)) { > + ret = PTR_ERR(filter); > + goto out; > + } > + > + /* Lock the process personality and calling convention. */ > +#ifdef CONFIG_COMPAT > + if (is_compat_task()) > + filter->flags.compat = 1; > +#endif > + filter->personality = current->personality; > + > + /* Auditing is not needed since the capability wasn't requested */ > + if (security_real_capable_noaudit(current, current_user_ns(), > + CAP_SYS_ADMIN) == 0) > + filter->flags.admin = 1; > + > + /* Copy the instructions from fprog. */ > + ret = -EFAULT; > + if (copy_from_user(filter->insns, fprog->filter, fp_size)) > + goto out; > + > + /* Check the fprog */ > + ret = sk_chk_filter(filter->insns, filter->count); > + if (ret) > + goto out; > + > + /* If there is an existing filter, make it the parent > + * and reuse the existing task-based ref. > + */ > + filter->parent = current->seccomp.filter; > + > + /* Force all filters to use one system call convention. */ > + ret = -EINVAL; > + if (filter->parent) { > + if (filter->parent->flags.compat != filter->flags.compat) > + goto out; > + if (filter->parent->personality != filter->personality) > + goto out; > + } > + > + /* Double claim the new filter so we can release it below simplifying > + * the error paths earlier. > + */ > + ret = 0; > + get_seccomp_filter(filter); > + current->seccomp.filter = filter; > + /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */ > + if (!current->seccomp.mode) { > + current->seccomp.mode = 2; > + set_thread_flag(TIF_SECCOMP); > + } > + > +out: > + put_seccomp_filter(filter); /* for get or task, on err */ > + return ret; > +} > + > +long prctl_attach_seccomp_filter(char __user *user_filter) > +{ > + struct sock_fprog fprog; > + long ret = -EINVAL; > + > + ret = -EFAULT; > + if (!user_filter) > + goto out; > + > + if (copy_from_user(&fprog, user_filter, sizeof(fprog))) > + goto out; > + > + ret = seccomp_attach_filter(&fprog); > +out: > + return ret; > +} > + > +/** > + * seccomp_check_exec: determines if exec is allowed for current > + * Returns 0 if allowed. > + */ > +int seccomp_check_exec(void) > +{ > + if (current->seccomp.mode != 2) > + return 0; > + /* We can rely on the task refcount for the filter. */ > + if (!current->seccomp.filter) > + return -EPERM; > + /* The last attached filter set for the process is checked. It must > + * have been installed with CAP_SYS_ADMIN capabilities. This comment is confusing. By 'It must' you mean that if not, it's denied. But if I didn't know better I would read that as "we can't get to this code unless". Can you change it to something like "Exec is refused unless the filter was installed with CAP_SYS_ADMIN privilege"? > + */ > + if (current->seccomp.filter->flags.admin) > + return 0; > + return -EPERM; > +} > + > +/* seccomp_filter_fork: manages inheritance on fork > + * @child: forkee > + * @parent: forker > + * Ensures that @child inherit a seccomp_filter iff seccomp is enabled > + * and the set of filters is marked as 'enabled'. > + */ > +void seccomp_filter_fork(struct task_struct *child, > + struct task_struct *parent) > +{ > + if (!parent->seccomp.mode) > + return; > + child->seccomp.mode = parent->seccomp.mode; > + child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter); > +} > + > +/* Returns a pointer to the BPF evaluator after checking the offset and size > + * boundaries. The signature almost matches the signature from > + * net/core/filter.c with the hopes of sharing code in the future. > + */ > +static const void *load_pointer(const u8 *buf, size_t buflen, > + int offset, size_t size, > + void *unused) > +{ > + if (offset >= buflen) > + goto fail; > + if (offset < 0) > + goto fail; > + if (size > buflen - offset) > + goto fail; > + return buf + offset; > +fail: > + return NULL; > +} > + > +/** > + * seccomp_run_filter - evaluate BPF (over user_regs_struct) > + * @buf: buffer to execute the filter over > + * @buflen: length of the buffer > + * @fentry: filter to apply > + * > + * Decode and apply filter instructions to the buffer. > + * Return length to keep, 0 for none. @buf is a regset we are > + * filtering, @filter is the array of filter instructions. > + * Because all jumps are guaranteed to be before last instruction, > + * and last instruction guaranteed to be a RET, we dont need to check > + * flen. > + * > + * See core/net/filter.c as this is nearly an exact copy. > + * At some point, it would be nice to merge them to take advantage of > + * optimizations (like JIT). > + * > + * A successful filter must return the full length of the data. Anything less > + * will currently result in a seccomp failure. In the future, it may be > + * possible to use that for hard filtering registers on the fly so it is > + * ideal for consumers to return 0 on intended failure. > + */ > +static unsigned int seccomp_run_filter(const u8 *buf, > + const size_t buflen, > + const struct sock_filter *fentry) > +{ > + const void *ptr; > + u32 A = 0; /* Accumulator */ > + u32 X = 0; /* Index Register */ > + u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */ > + u32 tmp; > + int k; > + > + /* > + * Process array of filter instructions. > + */ > + for (;; fentry++) { > +#if defined(CONFIG_X86_32) > +#define K (fentry->k) > +#else > + const u32 K = fentry->k; > +#endif > + > + switch (fentry->code) { > + case BPF_S_ALU_ADD_X: > + A += X; > + continue; > + case BPF_S_ALU_ADD_K: > + A += K; > + continue; > + case BPF_S_ALU_SUB_X: > + A -= X; > + continue; > + case BPF_S_ALU_SUB_K: > + A -= K; > + continue; > + case BPF_S_ALU_MUL_X: > + A *= X; > + continue; > + case BPF_S_ALU_MUL_K: > + A *= K; > + continue; > + case BPF_S_ALU_DIV_X: > + if (X == 0) > + return 0; > + A /= X; > + continue; > + case BPF_S_ALU_DIV_K: > + A = reciprocal_divide(A, K); > + continue; > + case BPF_S_ALU_AND_X: > + A &= X; > + continue; > + case BPF_S_ALU_AND_K: > + A &= K; > + continue; > + case BPF_S_ALU_OR_X: > + A |= X; > + continue; > + case BPF_S_ALU_OR_K: > + A |= K; > + continue; > + case BPF_S_ALU_LSH_X: > + A <<= X; > + continue; > + case BPF_S_ALU_LSH_K: > + A <<= K; > + continue; > + case BPF_S_ALU_RSH_X: > + A >>= X; > + continue; > + case BPF_S_ALU_RSH_K: > + A >>= K; > + continue; > + case BPF_S_ALU_NEG: > + A = -A; > + continue; > + case BPF_S_JMP_JA: > + fentry += K; > + continue; > + case BPF_S_JMP_JGT_K: > + fentry += (A > K) ? fentry->jt : fentry->jf; > + continue; > + case BPF_S_JMP_JGE_K: > + fentry += (A >= K) ? fentry->jt : fentry->jf; > + continue; > + case BPF_S_JMP_JEQ_K: > + fentry += (A == K) ? fentry->jt : fentry->jf; > + continue; > + case BPF_S_JMP_JSET_K: > + fentry += (A & K) ? fentry->jt : fentry->jf; > + continue; > + case BPF_S_JMP_JGT_X: > + fentry += (A > X) ? fentry->jt : fentry->jf; > + continue; > + case BPF_S_JMP_JGE_X: > + fentry += (A >= X) ? fentry->jt : fentry->jf; > + continue; > + case BPF_S_JMP_JEQ_X: > + fentry += (A == X) ? fentry->jt : fentry->jf; > + continue; > + case BPF_S_JMP_JSET_X: > + fentry += (A & X) ? fentry->jt : fentry->jf; > + continue; > + case BPF_S_LD_W_ABS: > + k = K; > +load_w: > + ptr = load_pointer(buf, buflen, k, 4, &tmp); > + if (ptr != NULL) { > + /* Note, unlike on network data, values are not > + * byte swapped. > + */ > + A = *(const u32 *)ptr; > + continue; > + } > + return 0; > + case BPF_S_LD_H_ABS: > + k = K; > +load_h: > + ptr = load_pointer(buf, buflen, k, 2, &tmp); > + if (ptr != NULL) { > + A = *(const u16 *)ptr; > + continue; > + } > + return 0; > + case BPF_S_LD_B_ABS: > + k = K; > +load_b: > + ptr = load_pointer(buf, buflen, k, 1, &tmp); > + if (ptr != NULL) { > + A = *(const u8 *)ptr; > + continue; > + } > + return 0; > + case BPF_S_LD_W_LEN: > + A = buflen; > + continue; > + case BPF_S_LDX_W_LEN: > + X = buflen; > + continue; > + case BPF_S_LD_W_IND: > + k = X + K; > + goto load_w; > + case BPF_S_LD_H_IND: > + k = X + K; > + goto load_h; > + case BPF_S_LD_B_IND: > + k = X + K; > + goto load_b; > + case BPF_S_LDX_B_MSH: > + ptr = load_pointer(buf, buflen, K, 1, &tmp); > + if (ptr != NULL) { > + X = (*(u8 *)ptr & 0xf) << 2; > + continue; > + } > + return 0; > + case BPF_S_LD_IMM: > + A = K; > + continue; > + case BPF_S_LDX_IMM: > + X = K; > + continue; > + case BPF_S_LD_MEM: > + A = mem[K]; > + continue; > + case BPF_S_LDX_MEM: > + X = mem[K]; > + continue; > + case BPF_S_MISC_TAX: > + X = A; > + continue; > + case BPF_S_MISC_TXA: > + A = X; > + continue; > + case BPF_S_RET_K: > + return K; > + case BPF_S_RET_A: > + return A; > + case BPF_S_ST: > + mem[K] = A; > + continue; > + case BPF_S_STX: > + mem[K] = X; > + continue; > + case BPF_S_ANC_PROTOCOL: > + case BPF_S_ANC_PKTTYPE: > + case BPF_S_ANC_IFINDEX: > + case BPF_S_ANC_MARK: > + case BPF_S_ANC_QUEUE: > + case BPF_S_ANC_HATYPE: > + case BPF_S_ANC_RXHASH: > + case BPF_S_ANC_CPU: > + case BPF_S_ANC_NLATTR: > + case BPF_S_ANC_NLATTR_NEST: > + /* ignored */ > + continue; > + default: > + WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n", > + fentry->code, fentry->jt, > + fentry->jf, fentry->k); > + return 0; > + } > + } > + > + return 0; > +} > diff --git a/kernel/sys.c b/kernel/sys.c > index 481611f..77f2eda 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > case PR_SET_SECCOMP: > error = prctl_set_seccomp(arg2); > break; > + case PR_ATTACH_SECCOMP_FILTER: > + error = prctl_attach_seccomp_filter((char __user *) > + arg2); > + break; > case PR_GET_TSC: > error = GET_TSC_CTL(arg2); > break; > diff --git a/security/Kconfig b/security/Kconfig > index 51bd5a0..77b1106 100644 > --- a/security/Kconfig > +++ b/security/Kconfig > @@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT > > If you are unsure how to answer this question, answer N. > > +config SECCOMP_FILTER > + bool "Enable seccomp-based system call filtering" > + select SECCOMP > + depends on EXPERIMENTAL > + help > + This kernel feature expands CONFIG_SECCOMP to allow computing > + in environments with reduced kernel access dictated by a system > + call filter, expressed in BPF, installed by the application itself > + through prctl(2). > + > + See Documentation/prctl/seccomp_filter.txt for more detail. > + > config SECURITY > bool "Enable different security models" > depends on SYSFS > -- > 1.7.5.4 >