* [PATCH v6 1/3] seccomp: kill the seccomp_t typedef
@ 2012-01-28 22:11 Will Drewry
2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Will Drewry @ 2012-01-28 22:11 UTC
To: linux-kernel
Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
corbet, alan, indan, mcgrathr
Replaces the seccomp_t typedef with seccomp_struct to match modern
kernel style.
Signed-off-by: Will Drewry <wad@chromium.org>
---
include/linux/sched.h | 2 +-
include/linux/seccomp.h | 10 ++++++----
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4032ec1..288b5cb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1418,7 +1418,7 @@ struct task_struct {
uid_t loginuid;
unsigned int sessionid;
#endif
- seccomp_t seccomp;
+ struct seccomp_struct seccomp;
/* Thread group tracking */
u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..171ab66 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -7,7 +7,9 @@
#include <linux/thread_info.h>
#include <asm/seccomp.h>
-typedef struct { int mode; } seccomp_t;
+struct seccomp_struct {
+ int mode;
+};
extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -19,7 +21,7 @@ static inline void secure_computing(int this_syscall)
extern long prctl_get_seccomp(void);
extern long prctl_set_seccomp(unsigned long);
-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp_struct *s)
{
return s->mode;
}
@@ -28,7 +30,7 @@ static inline int seccomp_mode(seccomp_t *s)
#include <linux/errno.h>
-typedef struct { } seccomp_t;
+struct seccomp_struct { };
#define secure_computing(x) do { } while (0)
@@ -42,7 +44,7 @@ static inline long prctl_set_seccomp(unsigned long arg2)
return -EINVAL;
}
-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp_struct *s)
{
return 0;
}
--
1.7.5.4
* [PATCH v6 2/3] seccomp_filters: system call filtering using BPF
2012-01-28 22:11 [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Will Drewry
@ 2012-01-28 22:11 ` Will Drewry
2012-01-31 14:13 ` Eduardo Otubo
2012-02-02 15:32 ` Serge E. Hallyn
2012-01-28 22:11 ` [PATCH v6 3/3] Documentation: prctl/seccomp_filter Will Drewry
2012-02-02 15:29 ` [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Serge E. Hallyn
2 siblings, 2 replies; 13+ messages in thread
From: Will Drewry @ 2012-01-28 22:11 UTC
To: linux-kernel
Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
corbet, alan, indan, mcgrathr
[This patch depends on luto@mit.edu's no_new_privs patch:
https://lkml.org/lkml/2012/1/12/446
]
This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task. The policy is expressed in terms of a Berkeley
Packet Filter program, as is used for userland-exposed socket filtering.
Instead of network data, the BPF program is evaluated over struct
seccomp_filter_data at the time of the system call.
A filter program may be installed by a userland task by calling
prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.
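As a rough illustration of what such a program could look like, the
sketch below builds an allowlist filter: it loads syscall_nr from offset 0
of struct seccomp_filter_data and compares it against each permitted
number. Everything prefixed ex_/EX_ is hypothetical and not part of this
patch. The instruction layout mirrors struct sock_filter, the opcode
values are the classic BPF ones from <linux/filter.h>, and the return
values are the ones this patch names SECCOMP_BPF_ALLOW and
SECCOMP_BPF_DENY.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Classic BPF opcode fields; values match <linux/filter.h>. */
#define EX_BPF_LD  0x00
#define EX_BPF_JMP 0x05
#define EX_BPF_RET 0x06
#define EX_BPF_W   0x00
#define EX_BPF_ABS 0x20
#define EX_BPF_JEQ 0x10
#define EX_BPF_K   0x00

/* Values this patch defines as SECCOMP_BPF_ALLOW / SECCOMP_BPF_DENY. */
#define EX_ALLOW 0xFFFFFFFFu
#define EX_DENY  0u

/* Local mirror of struct sock_filter so the sketch needs no kernel headers. */
struct ex_insn {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

static struct ex_insn ex_mk(uint16_t code, uint8_t jt, uint8_t jf, uint32_t k)
{
	struct ex_insn insn = { code, jt, jf, k };
	return insn;
}

/*
 * Writes a program that allows exactly the syscall numbers in nrs[0..n).
 * Layout: [0] load syscall_nr, [1..n] one compare per number,
 * [n+1] RET DENY, [n+2] RET ALLOW.
 * Returns the instruction count, or 0 if the request cannot be encoded.
 */
static size_t ex_build_allowlist(struct ex_insn *out, size_t cap,
				 const uint32_t *nrs, size_t n)
{
	size_t i;

	assert(out != NULL && nrs != NULL);
	/* jt is 8 bits wide, so the first compare can skip at most 255. */
	if (n == 0 || n > 255 || cap < n + 3)
		return 0;

	/* A = 32-bit word at offset 0, i.e. syscall_nr. */
	out[0] = ex_mk(EX_BPF_LD | EX_BPF_W | EX_BPF_ABS, 0, 0, 0);
	for (i = 0; i < n; i++)
		/* On a match, skip n - i instructions to land on RET ALLOW. */
		out[1 + i] = ex_mk(EX_BPF_JMP | EX_BPF_JEQ | EX_BPF_K,
				   (uint8_t)(n - i), 0, nrs[i]);
	out[n + 1] = ex_mk(EX_BPF_RET | EX_BPF_K, 0, 0, EX_DENY);
	out[n + 2] = ex_mk(EX_BPF_RET | EX_BPF_K, 0, 0, EX_ALLOW);
	return n + 3;
}
```

With this patch applied, the resulting array and its length would go into
a struct sock_fprog and be handed to prctl(PR_ATTACH_SECCOMP_FILTER, ...).
Note that a real policy would also have to allow whatever the task needs
in order to exit cleanly.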
If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached. All attached programs
must be evaluated before a system call will be allowed to proceed.
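The composition rule can be sketched in plain C as follows. This is only
a model of the behaviour described above, not the kernel code: the ex_
names are made up, and a function pointer stands in for each attached
BPF program. A system call passes only if every filter in the chain
returns the allow value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define EX_ALLOW 0xFFFFFFFFu	/* SECCOMP_BPF_ALLOW in this patch */

/* One attached filter; parent points at the filter attached before it. */
struct ex_filter {
	const struct ex_filter *parent;
	uint32_t (*run)(int syscall_nr);	/* stand-in for a BPF program */
};

/* Returns 0 if every filter in the chain allows nr, -EACCES otherwise. */
static int ex_test_filters(const struct ex_filter *newest, int nr)
{
	const struct ex_filter *f;
	int ret = 0;

	if (newest == NULL)
		return -EACCES;
	/* Walk from the most recently attached filter back to the first. */
	for (f = newest; f != NULL; f = f->parent) {
		assert(f->run != NULL);
		if (f->run(nr) != EX_ALLOW)
			ret = -EACCES;
	}
	return ret;
}

/* Two toy policies used to exercise the chain. */
static uint32_t ex_allow_below_100(int nr)
{
	return nr < 100 ? EX_ALLOW : 0;
}

static uint32_t ex_allow_even(int nr)
{
	return (nr % 2 == 0) ? EX_ALLOW : 0;
}
```

Chaining ex_allow_even on top of ex_allow_below_100 allows call 4 but
rejects both 5 and 102, since a later filter can only narrow what an
earlier one permits.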
To avoid CONFIG_COMPAT related landmines, once a filter program is
installed under a specific is_compat_task() value, the task is not
allowed to make system calls using the alternate entry point.
Filter programs will be inherited across fork/clone and execve. However,
the installation of filters must be preceded by setting 'no_new_privs'
to ensure that unprivileged tasks cannot attach filters that affect
privileged tasks (e.g., a setuid binary). Tasks with CAP_SYS_ADMIN
in their namespace may install inheritable filters without setting
the no_new_privs bit.
This approach has a number of benefits. A few of them are as follows:
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: system call numbers and desired
arguments.
- No time-of-check-time-of-use vulnerable data accesses are possible.
- System call arguments are loaded on demand only, to minimize copying
required for system call number-only policy decisions.
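The on-demand loading in the last point can be modelled roughly as
below. This is a simplified sketch rather than the code in this patch:
the ex_ names are hypothetical, ex_fetch_args() stands in for
syscall_get_arguments() and fills in made-up values, and a counter is
added purely to make the laziness observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct seccomp_filter_data. */
struct ex_lazy_data {
	int32_t syscall_nr;
	uint32_t reserved;
	uint64_t args[6];
};

struct ex_lazy_meta {
	struct ex_lazy_data data;
	bool has_args;		/* true once args[] has been populated */
	unsigned int fetches;	/* counts how often the args were copied */
};

/* Stand-in for syscall_get_arguments(): fills args with made-up values. */
static void ex_fetch_args(struct ex_lazy_meta *m)
{
	size_t i;

	for (i = 0; i < 6; i++)
		m->data.args[i] = 0x1000u + i;
	m->has_args = true;
	m->fetches++;
}

/*
 * Bounds-checks a load of size bytes at offset and returns a pointer to
 * the data, or NULL. The arguments are copied only the first time a load
 * reaches past the syscall number into args[].
 */
static const void *ex_load_pointer(struct ex_lazy_meta *m, size_t offset,
				   size_t size)
{
	assert(m != NULL);
	if (size == 0 || offset >= sizeof(m->data) ||
	    size > sizeof(m->data) - offset)
		return NULL;
	if (!m->has_args &&
	    offset + size > offsetof(struct ex_lazy_data, args))
		ex_fetch_args(m);
	return (const uint8_t *)&m->data + offset;
}
```

A filter that only inspects syscall_nr never triggers the copy, while
any number of argument loads triggers it exactly once per system call.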
This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code. It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)
v6: - fix memory leak on attach compat check failure
- require no_new_privs || CAP_SYS_ADMIN prior to filter
installation. (luto@mit.edu)
- s/seccomp_struct_/seccomp_/ for macros/functions
(amwang@redhat.com)
- cleaned up Kconfig (amwang@redhat.com)
- on block, note if the call was compat (so the # means something)
v5: - uses syscall_get_arguments
(indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
- uses union-based arg storage with hi/lo struct to
handle endianness. Compromises between the two alternate
proposals to minimize extra arg shuffling and account for
endianness assuming userspace uses offsetof().
(mcgrathr@chromium.org, indan@nul.nu)
- update Kconfig description
- add include/seccomp_filter.h and add its installation
- (naive) on-demand syscall argument loading
- drop seccomp_t (eparis@redhat.com)
v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
- now uses current->no_new_privs
(luto@mit.edu,torvalds@linux-foundation.org)
- assign names to seccomp modes (rdunlap@xenotime.net)
- fix style issues (rdunlap@xenotime.net)
- reworded Kconfig entry (rdunlap@xenotime.net)
v3: - macros to inline (oleg@redhat.com)
- init_task behavior fixed (oleg@redhat.com)
- drop creator entry and extra NULL check (oleg@redhat.com)
- alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
- adds tentative use of "always_unprivileged" as per
torvalds@linux-foundation.org and luto@mit.edu
v2: - (patch 2 only)
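To make the v5 note about union-based argument storage more concrete,
here is a small sketch of how userspace might compute load offsets with
offsetof() so that the same filter source works on either byte order.
The ex_/EX_ names are hypothetical, the layout is a local copy modelled
on the header added by this patch, and byte order detection assumes a
GCC/Clang-style compiler that predefines __BYTE_ORDER__.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumes a GCC/Clang-style compiler that predefines __BYTE_ORDER__. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define EX_ENDIAN_SWAP(x, y) y, x
#else
#define EX_ENDIAN_SWAP(x, y) x, y
#endif

/* Modelled on union seccomp_filter_arg: lo32/hi32 overlay the u64. */
union ex_arg {
	struct {
		uint32_t EX_ENDIAN_SWAP(lo32, hi32);
	};	/* anonymous struct, needs C11 or a GNU dialect */
	uint64_t u64;
};

/* Modelled on struct seccomp_filter_data. */
struct ex_filter_data {
	int32_t syscall_nr;
	uint32_t reserved;
	union ex_arg args[6];
};

/* Offset of the low 32 bits of argument index, for a 32-bit BPF load. */
static uint32_t ex_arg_lo_offset(unsigned int index)
{
	assert(index < 6);
	return (uint32_t)(offsetof(struct ex_filter_data, args) +
			  index * sizeof(union ex_arg) +
			  offsetof(union ex_arg, lo32));
}

/* Offset of the high 32 bits of argument index. */
static uint32_t ex_arg_hi_offset(unsigned int index)
{
	assert(index < 6);
	return (uint32_t)(offsetof(struct ex_filter_data, args) +
			  index * sizeof(union ex_arg) +
			  offsetof(union ex_arg, hi32));
}
```

Because the member order is swapped with the byte order, lo32 always
aliases the low half of u64, so a filter that loads from
ex_arg_lo_offset(n) sees the same value on little- and big-endian
machines.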
Signed-off-by: Will Drewry <wad@chromium.org>
---
include/linux/Kbuild | 1 +
include/linux/prctl.h | 3 +
include/linux/seccomp.h | 63 ++++
include/linux/seccomp_filter.h | 79 +++++
kernel/Makefile | 1 +
kernel/fork.c | 4 +
kernel/seccomp.c | 10 +-
kernel/seccomp_filter.c | 627 ++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +
security/Kconfig | 20 ++
10 files changed, 811 insertions(+), 1 deletions(-)
create mode 100644 include/linux/seccomp_filter.h
create mode 100644 kernel/seccomp_filter.c
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index c94e717..5659454 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -330,6 +330,7 @@ header-y += scc.h
header-y += sched.h
header-y += screen_info.h
header-y += sdla.h
+header-y += seccomp_filter.h
header-y += securebits.h
header-y += selinux_netlink.h
header-y += sem.h
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index 7ddc7f1..b8c4beb 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -114,4 +114,7 @@
# define PR_SET_MM_START_BRK 6
# define PR_SET_MM_BRK 7
+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER 37
+
#endif /* _LINUX_PRCTL_H */
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 171ab66..d3b896b 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,10 +5,29 @@
#ifdef CONFIG_SECCOMP
#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>
+/* Valid values of seccomp_struct.mode */
+#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */
+#define SECCOMP_MODE_STRICT 1 /* uses hard-coded seccomp.c rules. */
+#define SECCOMP_MODE_FILTER 2 /* system call access determined by filter. */
+
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode: indicates one of the valid values above for controlled
+ * system calls available to a process.
+ * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
+ * @filter must only be accessed from the context of current as there
+ * is no guard.
+ */
struct seccomp_struct {
int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+ struct seccomp_filter *filter;
+#endif
};
extern void __secure_computing(int);
@@ -51,4 +70,48 @@ static inline int seccomp_mode(struct seccomp_struct *s)
#endif /* CONFIG_SECCOMP */
+#ifdef CONFIG_SECCOMP_FILTER
+
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_fork(struct seccomp_struct *child,
+ const struct seccomp_struct *parent);
+
+static inline void seccomp_init_task(struct seccomp_struct *seccomp)
+{
+ seccomp->mode = SECCOMP_MODE_DISABLED;
+ seccomp->filter = NULL;
+}
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+static inline void seccomp_free_task(struct seccomp_struct *seccomp)
+{
+ put_seccomp_filter(seccomp->filter);
+ seccomp->filter = NULL;
+}
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+/* Macros consume the unused dereference by the caller. */
+#define seccomp_init_task(_seccomp) do { } while (0)
+#define seccomp_fork(_tsk, _orig) do { } while (0)
+#define seccomp_free_task(_seccomp) do { } while (0)
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+ return -ENOSYS;
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
#endif /* _LINUX_SECCOMP_H */
diff --git a/include/linux/seccomp_filter.h b/include/linux/seccomp_filter.h
new file mode 100644
index 0000000..3ecd641
--- /dev/null
+++ b/include/linux/seccomp_filter.h
@@ -0,0 +1,79 @@
+/*
+ * Seccomp-based system call filtering data structures and definitions.
+ *
+ * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ *
+ * This copyrighted material is made available to anyone wishing to use,
+ * modify, copy, or redistribute it subject to the terms and conditions
+ * of the GNU General Public License v.2.
+ *
+ */
+
+#ifndef __LINUX_SECCOMP_FILTER_H__
+#define __LINUX_SECCOMP_FILTER_H__
+
+#include <asm/byteorder.h>
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+/*
+ * Keep the contents of this file similar to linux/filter.h:
+ * struct sock_filter and sock_fprog and versions.
+ * Custom naming exists solely if divergence is ever needed.
+ */
+
+/*
+ * Current version of the filter code architecture.
+ */
+#define SECCOMP_BPF_MAJOR_VERSION 1
+#define SECCOMP_BPF_MINOR_VERSION 1
+
+struct seccomp_filter_block { /* Filter block */
+ __u16 code; /* Actual filter code */
+ __u8 jt; /* Jump true */
+ __u8 jf; /* Jump false */
+ __u32 k; /* Generic multiuse field */
+};
+
+struct seccomp_fprog { /* Required for SO_ATTACH_FILTER. */
+ unsigned short len; /* Number of filter blocks */
+ struct seccomp_filter_block __user *filter;
+};
+
+/* Ensure the u32 ordering is consistent with platform byte order. */
+#if defined(__LITTLE_ENDIAN)
+#define SECCOMP_ENDIAN_SWAP(x, y) x, y
+#elif defined(__BIG_ENDIAN)
+#define SECCOMP_ENDIAN_SWAP(x, y) y, x
+#else
+#error edit for your odd arch byteorder.
+#endif
+
+/* System call argument layout for the filter data. */
+union seccomp_filter_arg {
+ struct {
+ __u32 SECCOMP_ENDIAN_SWAP(lo32, hi32);
+ };
+ __u64 u64;
+};
+
+/*
+ * Expected data the BPF program will execute over.
+ * Endianness will be arch specific, but the values will be
+ * swapped, as above, to allow for consistent BPF programs.
+ */
+struct seccomp_filter_data {
+ int syscall_nr;
+ __u32 __reserved;
+ union seccomp_filter_arg args[6];
+};
+
+#undef SECCOMP_ENDIAN_SWAP
+
+/*
+ * Defined valid return values for the BPF program.
+ */
+#define SECCOMP_BPF_ALLOW 0xFFFFFFFF
+#define SECCOMP_BPF_DENY 0
+
+#endif /* __LINUX_SECCOMP_FILTER_H__ */
diff --git a/kernel/Makefile b/kernel/Makefile
index 2d9de86..fd81bac 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -78,6 +78,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index 051f090..0007933 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ seccomp_free_task(&tsk->seccomp);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -1093,6 +1095,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;
ftrace_graph_init_task(p);
+ seccomp_init_task(&p->seccomp);
rt_mutex_init_task(p);
@@ -1376,6 +1379,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (clone_flags & CLONE_THREAD)
threadgroup_change_end(current);
perf_event_fork(p);
+ seccomp_fork(&p->seccomp, &current->seccomp);
trace_task_newtask(p, clone_flags);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index e8d76c5..a045dd4 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -37,7 +37,7 @@ void __secure_computing(int this_syscall)
int * syscall;
switch (mode) {
- case 1:
+ case SECCOMP_MODE_STRICT:
syscall = mode1_syscalls;
#ifdef CONFIG_COMPAT
if (is_compat_task())
@@ -48,6 +48,14 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case SECCOMP_MODE_FILTER:
+ if (seccomp_test_filters(this_syscall) == 0)
+ return;
+
+ seccomp_filter_log_failure(this_syscall);
+ break;
+#endif
default:
BUG();
}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..0e2e56c
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,627 @@
+/*
+ * linux/kernel/seccomp_filter.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ *
+ * Extends linux/kernel/seccomp.c to allow tasks to install system call
+ * filters using a Berkeley Packet Filter program which is executed over
+ * struct seccomp_filter_data.
+ */
+
+#include <asm/syscall.h>
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/seccomp_filter.h>
+#include <linux/security.h>
+#include <linux/seccomp.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ * get/put helpers should be used when accessing an instance
+ * outside of a lifetime-guarded section. In general, this
+ * is only needed for handling filters shared across tasks.
+ * @parent: pointer to the ancestor which this filter will be composed with.
+ * @insns: the BPF program instructions to evaluate
+ * @count: the number of instructions in the program.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+ struct kref usage;
+ struct seccomp_filter *parent;
+ struct {
+ uint32_t compat:1;
+ } flags;
+ unsigned short count; /* Instruction count */
+ struct sock_filter insns[0];
+};
+
+/*
+ * struct seccomp_filter_metadata - BPF data wrapper
+ * @data: data accessible to the BPF program.
+ * @has_args: indicates that the args have been lazily populated.
+ *
+ * Used by seccomp_load_pointer.
+ */
+struct seccomp_filter_metadata {
+ struct seccomp_filter_data data;
+ bool has_args;
+};
+
+static unsigned int seccomp_run_filter(void *, uint32_t,
+ const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ * @padding: size of the insns[0] array in bytes
+ *
+ * The @padding should be a multiple of
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+ struct seccomp_filter *f;
+ unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+ /* Drop oversized requests. */
+ if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+ return ERR_PTR(-EINVAL);
+
+ /* Padding should always be in sock_filter increments. */
+ if (padding % sizeof(struct sock_filter))
+ return ERR_PTR(-EINVAL);
+
+ f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+ if (!f)
+ return ERR_PTR(-ENOMEM);
+ kref_init(&f->usage);
+ f->count = bpf_blocks;
+ return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ * @filter: NULL or live object to be completely destructed.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+ if (!filter)
+ return;
+ put_seccomp_filter(filter->parent);
+ kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+ struct seccomp_filter *orig =
+ container_of(kref, struct seccomp_filter, usage);
+ seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+ int compat = 0;
+#ifdef CONFIG_COMPAT
+ compat = is_compat_task();
+#endif
+ pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
+ current->comm, task_pid_nr(current),
+ (compat ? "compat " : ""),
+ syscall, KSTK_EIP(current));
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return;
+ kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return NULL;
+ kref_get(&orig->usage);
+ return orig;
+}
+
+#if BITS_PER_LONG == 32
+static inline unsigned long *seccomp_filter_data_arg(
+ struct seccomp_filter_data *data, int index)
+{
+ /* Avoid inconsistent hi contents. */
+ data->args[index].hi32 = 0;
+ return (unsigned long *) &(data->args[index].lo32);
+}
+#elif BITS_PER_LONG == 64
+static inline unsigned long *seccomp_filter_data_arg(
+ struct seccomp_filter_data *data, int index)
+{
+ return (unsigned long *) &(data->args[index].u64);
+}
+#else
+#error Unknown BITS_PER_LONG.
+#endif
+
+/**
+ * seccomp_load_pointer: checks and returns a pointer to the requested offset
+ * @data: pointer to the struct seccomp_filter_metadata holding the data
+ * to index into
+ * @offset: offset to return data from
+ * @size: size of the data to retrieve at offset
+ * @buffer: unused placeholder which net/core/filter.c uses for temporary
+ * storage. Ideally, the two code paths can be merged.
+ *
+ * Returns a pointer into the filter data for the BPF evaluator after
+ * checking the offset and size boundaries, or NULL if a check fails.
+ */
+static inline void *seccomp_load_pointer(void *data, int offset, size_t size,
+ void *buffer)
+{
+ struct seccomp_filter_metadata *metadata = data;
+ int arg;
+ if (offset >= sizeof(metadata->data))
+ goto fail;
+ if (offset < 0)
+ goto fail;
+ if (size > sizeof(metadata->data) - offset)
+ goto fail;
+ if (metadata->has_args)
+ goto pass;
+ /* No argument data touched. */
+ if (offset + size - 1 < offsetof(struct seccomp_filter_data, args))
+ goto pass;
+ for (arg = 0; arg < ARRAY_SIZE(metadata->data.args); ++arg)
+ syscall_get_arguments(current, task_pt_regs(current), arg, 1,
+ seccomp_filter_data_arg(&metadata->data, arg));
+ metadata->has_args = true;
+pass:
+ return ((__u8 *)(&metadata->data)) + offset;
+fail:
+ return NULL;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+ int ret = -EACCES;
+ struct seccomp_filter *filter;
+ struct seccomp_filter_metadata metadata;
+
+ filter = current->seccomp.filter; /* uses task ref */
+ if (!filter)
+ goto out;
+
+ metadata.data.syscall_nr = syscall;
+ metadata.has_args = false;
+
+#ifdef CONFIG_COMPAT
+ if (filter->flags.compat != !!(is_compat_task()))
+ goto out;
+#endif
+
+ /* Only allow a system call if it is allowed in all ancestors. */
+ ret = 0;
+ for ( ; filter != NULL; filter = filter->parent) {
+ /* Allowed if return value is SECCOMP_BPF_ALLOW */
+ if (seccomp_run_filter(&metadata, sizeof(metadata.data),
+ filter->insns) != SECCOMP_BPF_ALLOW)
+ ret = -EACCES;
+ }
+out:
+ return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+ struct seccomp_filter *filter = NULL;
+ /* Note, len is a short so overflow should be impossible. */
+ unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+ long ret = -EPERM;
+
+ /* Allocate a new seccomp_filter */
+ filter = seccomp_filter_alloc(fp_size);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+
+ /* Copy the instructions from fprog. */
+ ret = -EFAULT;
+ if (copy_from_user(filter->insns, fprog->filter, fp_size))
+ goto out;
+
+ /* Check the fprog */
+ ret = sk_chk_filter(filter->insns, filter->count);
+ if (ret)
+ goto out;
+
+ /*
+ * Installing a seccomp filter requires that the task
+ * have CAP_SYS_ADMIN in its namespace or be running with
+ * no_new_privs. This avoids scenarios where unprivileged
+ * tasks can affect the behavior of privileged children.
+ */
+ ret = -EACCES;
+ if (!current->no_new_privs &&
+ security_capable_noaudit(current_cred(), current_user_ns(),
+ CAP_SYS_ADMIN) != 0)
+ goto out;
+
+ /*
+ * If there is an existing filter, make it the parent
+ * and reuse the existing task-based ref.
+ */
+ filter->parent = current->seccomp.filter;
+
+#ifdef CONFIG_COMPAT
+ /* Disallow changing system calling conventions after the fact. */
+ filter->flags.compat = !!(is_compat_task());
+
+ if (filter->parent &&
+ filter->parent->flags.compat != filter->flags.compat)
+ goto out;
+#endif
+
+ /*
+ * Double claim the new filter so we can release it below simplifying
+ * the error paths earlier.
+ */
+ ret = 0;
+ get_seccomp_filter(filter);
+ current->seccomp.filter = filter;
+ /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+ if (current->seccomp.mode == SECCOMP_MODE_DISABLED) {
+ current->seccomp.mode = SECCOMP_MODE_FILTER;
+ set_thread_flag(TIF_SECCOMP);
+ }
+
+out:
+ put_seccomp_filter(filter); /* for get or task, on err */
+ return ret;
+}
+
+#ifdef CONFIG_COMPAT
+/* This should be kept in sync with net/compat.c which changes infrequently. */
+struct compat_sock_fprog {
+ u16 len;
+ compat_uptr_t filter; /* struct sock_filter */
+};
+
+static long compat_attach_seccomp_filter(char __user *optval)
+{
+ struct compat_sock_fprog __user *fprog32 =
+ (struct compat_sock_fprog __user *)optval;
+ struct sock_fprog __user *kfprog =
+ compat_alloc_user_space(sizeof(struct sock_fprog));
+ compat_uptr_t ptr;
+ u16 len;
+
+ if (!access_ok(VERIFY_READ, fprog32, sizeof(*fprog32)) ||
+ !access_ok(VERIFY_WRITE, kfprog, sizeof(struct sock_fprog)) ||
+ __get_user(len, &fprog32->len) ||
+ __get_user(ptr, &fprog32->filter) ||
+ __put_user(len, &kfprog->len) ||
+ __put_user(compat_ptr(ptr), &kfprog->filter))
+ return -EFAULT;
+
+ return seccomp_attach_filter(kfprog);
+}
+#endif
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+ struct sock_fprog fprog;
+ long ret = -EINVAL;
+ ret = -EFAULT;
+ if (!user_filter)
+ goto out;
+
+#ifdef CONFIG_COMPAT
+ if (is_compat_task())
+ return compat_attach_seccomp_filter(user_filter);
+#endif
+
+ if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+ goto out;
+
+ ret = seccomp_attach_filter(&fprog);
+out:
+ return ret;
+}
+
+/**
+ * seccomp_fork: manages inheritance on fork
+ * @child: forkee's seccomp_struct
+ * @parent: forker's seccomp_struct
+ *
+ * Ensures that @child inherits seccomp mode and state iff
+ * seccomp filtering is in use.
+ */
+void seccomp_fork(struct seccomp_struct *child,
+ const struct seccomp_struct *parent)
+{
+ child->mode = parent->mode;
+ if (parent->mode != SECCOMP_MODE_FILTER)
+ return;
+ child->filter = get_seccomp_filter(parent->filter);
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF
+ * @data: opaque buffer to execute the filter over
+ * @datalen: length of the buffer
+ * @fentry: filter to apply
+ *
+ * Decode and apply filter instructions to the buffer. Returns the RET
+ * value, or 0 on failure. @data is the seccomp_filter_metadata being
+ * filtered and @fentry is the array of filter instructions. All jumps are
+ * guaranteed to be before the last instruction, and the last instruction
+ * is guaranteed to be a RET, so we don't need to check the program length.
+ *
+ * See net/core/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ */
+static unsigned int seccomp_run_filter(void *data, uint32_t datalen,
+ const struct sock_filter *fentry)
+{
+ const void *ptr;
+ u32 A = 0; /* Accumulator */
+ u32 X = 0; /* Index Register */
+ u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */
+ u32 tmp;
+ int k;
+
+ /*
+ * Process array of filter instructions.
+ */
+ for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define K (fentry->k)
+#else
+ const u32 K = fentry->k;
+#endif
+
+ switch (fentry->code) {
+ case BPF_S_ALU_ADD_X:
+ A += X;
+ continue;
+ case BPF_S_ALU_ADD_K:
+ A += K;
+ continue;
+ case BPF_S_ALU_SUB_X:
+ A -= X;
+ continue;
+ case BPF_S_ALU_SUB_K:
+ A -= K;
+ continue;
+ case BPF_S_ALU_MUL_X:
+ A *= X;
+ continue;
+ case BPF_S_ALU_MUL_K:
+ A *= K;
+ continue;
+ case BPF_S_ALU_DIV_X:
+ if (X == 0)
+ return 0;
+ A /= X;
+ continue;
+ case BPF_S_ALU_DIV_K:
+ A = reciprocal_divide(A, K);
+ continue;
+ case BPF_S_ALU_AND_X:
+ A &= X;
+ continue;
+ case BPF_S_ALU_AND_K:
+ A &= K;
+ continue;
+ case BPF_S_ALU_OR_X:
+ A |= X;
+ continue;
+ case BPF_S_ALU_OR_K:
+ A |= K;
+ continue;
+ case BPF_S_ALU_LSH_X:
+ A <<= X;
+ continue;
+ case BPF_S_ALU_LSH_K:
+ A <<= K;
+ continue;
+ case BPF_S_ALU_RSH_X:
+ A >>= X;
+ continue;
+ case BPF_S_ALU_RSH_K:
+ A >>= K;
+ continue;
+ case BPF_S_ALU_NEG:
+ A = -A;
+ continue;
+ case BPF_S_JMP_JA:
+ fentry += K;
+ continue;
+ case BPF_S_JMP_JGT_K:
+ fentry += (A > K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_K:
+ fentry += (A >= K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_K:
+ fentry += (A == K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_K:
+ fentry += (A & K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGT_X:
+ fentry += (A > X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_X:
+ fentry += (A >= X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_X:
+ fentry += (A == X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_X:
+ fentry += (A & X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_LD_W_ABS:
+ k = K;
+load_w:
+ ptr = seccomp_load_pointer(data, k, 4, &tmp);
+ if (ptr != NULL) {
+ /*
+ * Assume load_pointer did any byte swapping.
+ */
+ A = *(const u32 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_H_ABS:
+ k = K;
+load_h:
+ ptr = seccomp_load_pointer(data, k, 2, &tmp);
+ if (ptr != NULL) {
+ A = *(const u16 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_B_ABS:
+ k = K;
+load_b:
+ ptr = seccomp_load_pointer(data, k, 1, &tmp);
+ if (ptr != NULL) {
+ A = *(const u8 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_W_LEN:
+ A = datalen;
+ continue;
+ case BPF_S_LDX_W_LEN:
+ X = datalen;
+ continue;
+ case BPF_S_LD_W_IND:
+ k = X + K;
+ goto load_w;
+ case BPF_S_LD_H_IND:
+ k = X + K;
+ goto load_h;
+ case BPF_S_LD_B_IND:
+ k = X + K;
+ goto load_b;
+ case BPF_S_LDX_B_MSH:
+ ptr = seccomp_load_pointer(data, K, 1, &tmp);
+ if (ptr != NULL) {
+ X = (*(u8 *)ptr & 0xf) << 2;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_IMM:
+ A = K;
+ continue;
+ case BPF_S_LDX_IMM:
+ X = K;
+ continue;
+ case BPF_S_LD_MEM:
+ A = mem[K];
+ continue;
+ case BPF_S_LDX_MEM:
+ X = mem[K];
+ continue;
+ case BPF_S_MISC_TAX:
+ X = A;
+ continue;
+ case BPF_S_MISC_TXA:
+ A = X;
+ continue;
+ case BPF_S_RET_K:
+ return K;
+ case BPF_S_RET_A:
+ return A;
+ case BPF_S_ST:
+ mem[K] = A;
+ continue;
+ case BPF_S_STX:
+ mem[K] = X;
+ continue;
+ case BPF_S_ANC_PROTOCOL:
+ case BPF_S_ANC_PKTTYPE:
+ case BPF_S_ANC_IFINDEX:
+ case BPF_S_ANC_MARK:
+ case BPF_S_ANC_QUEUE:
+ case BPF_S_ANC_HATYPE:
+ case BPF_S_ANC_RXHASH:
+ case BPF_S_ANC_CPU:
+ case BPF_S_ANC_NLATTR:
+ case BPF_S_ANC_NLATTR_NEST:
+ continue;
+ default:
+ WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
+ fentry->code, fentry->jt,
+ fentry->jf, fentry->k);
+ return 0;
+ }
+ }
+
+ return 0;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 4070153..8e43f70 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1901,6 +1901,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_SECCOMP:
error = prctl_set_seccomp(arg2);
break;
+ case PR_ATTACH_SECCOMP_FILTER:
+ error = prctl_attach_seccomp_filter((char __user *)
+ arg2);
+ break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..3c55d36 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,26 @@ config SECURITY_DMESG_RESTRICT
If you are unsure how to answer this question, answer N.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ select SECCOMP
+ help
+ This option provides support for limiting the accessibility of
+ system calls at a task-level using a dynamically defined policy.
+
+ System call filtering policy is expressed as a Berkeley Packet
+ Filter program. The program is attached using prctl(2) and
+ cannot be detached. Once attached, the filter program will
+ evaluate each system call the task makes, along with its
+ arguments. Its output determines if the system call may proceed.
+ If the system call is disallowed, the task will be terminated
+ immediately.
+
+ Dynamically limiting system call access aids software in the
+ creation of secure computation environments.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SECURITY
bool "Enable different security models"
depends on SYSFS
--
1.7.5.4
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v6 3/3] Documentation: prctl/seccomp_filter
2012-01-28 22:11 [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Will Drewry
2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
@ 2012-01-28 22:11 ` Will Drewry
2012-01-30 22:47 ` Corey Bryant
2012-02-02 15:29 ` [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Serge E. Hallyn
2 siblings, 1 reply; 13+ messages in thread
From: Will Drewry @ 2012-01-28 22:11 UTC (permalink / raw)
To: linux-kernel
Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
corbet, alan, indan, mcgrathr
Documents how system call filtering using Berkeley Packet
Filter programs works and how it may be used.
Includes an example for x86 (32-bit) and a semi-generic
example using an example code generator.
v6: - tweak the language to note the requirement of
PR_SET_NO_NEW_PRIVS being called prior to use. (luto@mit.edu)
v5: - update sample to use system call arguments
- adds a "fancy" example using a macro-based generator
- cleaned up bpf in the sample
- update docs to mention arguments
- fix prctl value (eparis@redhat.com)
- language cleanup (rdunlap@xenotime.net)
v4: - update for no_new_privs use
- minor tweaks
v3: - call out BPF <-> Berkeley Packet Filter (rdunlap@xenotime.net)
- document use of tentative always-unprivileged
- guard sample compilation for i386 and x86_64
v2: - move code to samples (corbet@lwn.net)
Signed-off-by: Will Drewry <wad@chromium.org>
---
Documentation/prctl/seccomp_filter.txt | 100 +++++++++++++++
samples/Makefile | 2 +-
samples/seccomp/Makefile | 27 ++++
samples/seccomp/bpf-direct.c | 77 +++++++++++
samples/seccomp/bpf-fancy.c | 95 ++++++++++++++
samples/seccomp/bpf-helper.c | 89 +++++++++++++
samples/seccomp/bpf-helper.h | 219 ++++++++++++++++++++++++++++++++
7 files changed, 608 insertions(+), 1 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt
create mode 100644 samples/seccomp/Makefile
create mode 100644 samples/seccomp/bpf-direct.c
create mode 100644 samples/seccomp/bpf-fancy.c
create mode 100644 samples/seccomp/bpf-helper.c
create mode 100644 samples/seccomp/bpf-helper.h
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..4ad7649
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,100 @@
+ Seccomp filtering
+ =================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter for
+incoming system calls. The filter is expressed as a Berkeley Packet
+Filter (BPF) program, as with socket filters, except that the data
+operated on is related to the system call being made: system call
+number, and the system call arguments. This allows for expressive
+filtering of system calls using a filter program language with a long
+history of being exposed to userland and a straightforward data set.
+
+Additionally, BPF makes it impossible for users of seccomp to fall prey
+to time-of-check-time-of-use (TOCTOU) attacks that are common in system
+call interposition frameworks. BPF programs may not dereference
+pointers, which constrains all filters to solely evaluating the system
+call arguments directly.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing. Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but it is not directly set by
+the consuming process. The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set, and it is enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+ Allows the specification of a new filter using a BPF program.
+ The BPF program will be executed over struct seccomp_filter_data
+ reflecting the system call number, arguments, and other
+ metadata. To allow a system call, SECCOMP_BPF_ALLOW must be
+ returned. At present, all other return values result in the
+ system call being blocked, but it is recommended to return
+ SECCOMP_BPF_DENY in those cases. This will allow for future
+ custom return values to be introduced, if ever desired.
+
+ Usage:
+ prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+ The 'prog' argument is a pointer to a struct seccomp_fprog which will
+ contain the filter program. If the program is invalid, the call
+ will return -1 and set errno to EINVAL.
+
+ Note, is_compat_task is also tracked for the @prog. This means
+ that once set the calling task will have all of its system calls
+ blocked if it switches its system call ABI.
+
+ If fork/clone and execve are allowed by @prog, any child processes will
+ be constrained to the same filters and system call ABI as the parent.
+
+ Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
+ run with CAP_SYS_ADMIN privileges in its namespace. If these are not
+ true, -EACCES will be returned. This requirement ensures that filter
+ programs cannot be applied to child processes with greater privileges
+ than the task that installed them.
+
+ Additionally, if prctl(2) is allowed by the attached filter,
+ additional filters may be layered on which will increase evaluation
+ time, but allow for further decreasing the attack surface during
+ execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Example
+-------
+
+The samples/seccomp/ directory contains both a 32-bit specific example
+and a more generic example of a higher level macro interface for BPF
+program generation.
+
+Adding architecture support
+---------------------------
+
+Any platform with seccomp support will support seccomp filters as long
+as CONFIG_SECCOMP_FILTER is enabled and the architecture has implemented
+syscall_get_arguments.
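The return-value rules in the Usage section can be sketched as a small decision function. The constants copy the values used by the ALLOW and DENY macros in samples/seccomp/bpf-helper.h. The function names, and the rule that every layered filter must allow the call, are assumptions made for illustration; the text above only says that layering further decreases the attack surface.

```c
#include <stddef.h>
#include <stdint.h>

/* Values copied from the ALLOW/DENY macros in bpf-helper.h. */
#define SKETCH_BPF_ALLOW 0xFFFFFFFFu
#define SKETCH_BPF_DENY  0u

/*
 * Hypothetical model: results[] holds the return value of each attached
 * filter for one system call. Returns 1 if the call may proceed and 0 if
 * it is blocked. Assumes every layered filter must return ALLOW.
 */
int sketch_syscall_allowed(const uint32_t *results, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		/* Anything other than ALLOW blocks the call. */
		if (results[i] != SKETCH_BPF_ALLOW)
			return 0;
	}
	return 1;
}

/* Convenience wrapper for the common single-filter case. */
int sketch_single_filter_allowed(uint32_t result)
{
	return sketch_syscall_allowed(&result, 1);
}
```

Returning SKETCH_BPF_DENY explicitly, rather than an arbitrary non-ALLOW value, matches the recommendation in the text and keeps other values free for future use.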
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
# Makefile for Linux samples code
obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..0298c6f
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,27 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+hostprogs-y := bpf-fancy
+bpf-fancy-objs := bpf-fancy.o bpf-helper.o
+
+HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
+HOSTCFLAGS_bpf-helper.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-helper.o += -idirafter $(objtree)/include
+
+# bpf-direct.c is x86-only.
+ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
+# List of programs to build
+hostprogs-y += bpf-direct
+bpf-direct-objs := bpf-direct.o
+endif
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
+ifeq ($(KBUILD_BUILDHOST),x86_64)
+HOSTCFLAGS_bpf-direct.o += -m32
+HOSTLOADLIBES_bpf-direct += -m32
+endif
diff --git a/samples/seccomp/bpf-direct.c b/samples/seccomp/bpf-direct.c
new file mode 100644
index 0000000..d799244
--- /dev/null
+++ b/samples/seccomp/bpf-direct.c
@@ -0,0 +1,77 @@
+/*
+ * 32-bit seccomp filter example with BPF macros
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/seccomp_filter.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+# define PR_ATTACH_SECCOMP_FILTER 37
+#endif
+
+#define syscall_arg(_n) (offsetof(struct seccomp_filter_data, args[_n].lo32))
+#define nr (offsetof(struct seccomp_filter_data, syscall_nr))
+
+static int install_filter(void)
+{
+ struct seccomp_filter_block filter[] = {
+ /* Grab the system call number */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, nr),
+ /* Jump table for the allowed syscalls */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
+
+ /* Check that read is only using stdin. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
+
+ /* Check that write is only using stdout/stderr */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
+
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_ALLOW),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_DENY),
+ };
+ struct seccomp_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) (_c), sizeof((_c))
+int main(int argc, char **argv)
+{
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO,
+ payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ return 0;
+}
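The jump offsets in install_filter() are easier to check with a tiny userspace evaluator. The sketch below is not the kernel interpreter: the opcode enum, the two-word data layout (word 0 is the system call number, word 1 is the first argument) and all sketch_ and SK_ names are simplifications invented here. The i386 system call numbers are hard-coded so that it builds on any host.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the BPF opcodes used by bpf-direct.c. */
enum sketch_op { SK_LD, SK_JEQ, SK_RET };

struct sketch_insn {
	enum sketch_op op;
	uint8_t jt;	/* relative jump if the comparison is true */
	uint8_t jf;	/* relative jump if the comparison is false */
	uint32_t k;	/* word index, constant, or return value */
};

#define SK_ALLOW 0xFFFFFFFFu
#define SK_DENY  0u

/* i386 system call numbers, hard-coded for portability of the sketch. */
enum { SK_EXIT = 1, SK_READ = 3, SK_WRITE = 4, SK_SIGRETURN = 119,
       SK_RT_SIGRETURN = 173, SK_EXIT_GROUP = 252 };

/* Same shape and offsets as the filter[] array in install_filter(). */
static const struct sketch_insn sketch_prog[] = {
	{ SK_LD,  0,  0, 0 },			/* A = syscall number */
	{ SK_JEQ, 10, 0, SK_RT_SIGRETURN },
	{ SK_JEQ, 9,  0, SK_SIGRETURN },
	{ SK_JEQ, 8,  0, SK_EXIT_GROUP },
	{ SK_JEQ, 7,  0, SK_EXIT },
	{ SK_JEQ, 1,  0, SK_READ },
	{ SK_JEQ, 2,  6, SK_WRITE },
	{ SK_LD,  0,  0, 1 },			/* read: A = arg 0 */
	{ SK_JEQ, 3,  4, 0 },			/* stdin only */
	{ SK_LD,  0,  0, 1 },			/* write: A = arg 0 */
	{ SK_JEQ, 1,  0, 1 },			/* stdout */
	{ SK_JEQ, 0,  1, 2 },			/* stderr */
	{ SK_RET, 0,  0, SK_ALLOW },
	{ SK_RET, 0,  0, SK_DENY },
};

/* Evaluate the program for one (syscall number, first argument) pair. */
uint32_t sketch_eval(uint32_t nr, uint32_t arg0)
{
	const uint32_t data[2] = { nr, arg0 };
	const size_t len = sizeof(sketch_prog) / sizeof(sketch_prog[0]);
	uint32_t A = 0;
	size_t pc = 0;

	while (pc < len) {
		const struct sketch_insn *in = &sketch_prog[pc++];

		switch (in->op) {
		case SK_LD:
			A = data[in->k];
			break;
		case SK_JEQ:
			/* Offsets are relative to the next instruction. */
			pc += (A == in->k) ? in->jt : in->jf;
			break;
		case SK_RET:
			return in->k;
		}
	}
	return SK_DENY;
}
```

With this model, sketch_eval(SK_WRITE, 3) walks the table to the write check, fails both descriptor comparisons and lands on the DENY return, while sketch_eval(SK_READ, 0) reaches the ALLOW return.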
diff --git a/samples/seccomp/bpf-fancy.c b/samples/seccomp/bpf-fancy.c
new file mode 100644
index 0000000..1318b1a
--- /dev/null
+++ b/samples/seccomp/bpf-fancy.c
@@ -0,0 +1,95 @@
+/*
+ * Seccomp BPF example using a macro-based generator.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <linux/seccomp_filter.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#include "bpf-helper.h"
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+# define PR_ATTACH_SECCOMP_FILTER 37
+#endif
+
+int main(int argc, char **argv)
+{
+ struct bpf_labels l = { .count = 0 };
+ static const char msg1[] = "Please type something: ";
+ static const char msg2[] = "You typed: ";
+ char buf[256];
+ struct seccomp_filter_block filter[] = {
+ LOAD_SYSCALL_NR,
+ SYSCALL(__NR_exit, ALLOW),
+ SYSCALL(__NR_exit_group, ALLOW),
+ SYSCALL(__NR_write, JUMP(&l, write_fd)),
+ SYSCALL(__NR_read, JUMP(&l, read)),
+ DENY, /* Don't passthrough into a label */
+
+ LABEL(&l, read),
+ ARG(0),
+ JNE(STDIN_FILENO, DENY),
+ ARG(1),
+ JNE((unsigned long)buf, DENY),
+ ARG(2),
+ JGE(sizeof(buf), DENY),
+ ALLOW,
+
+ LABEL(&l, write_fd),
+ ARG(0),
+ JEQ(STDOUT_FILENO, JUMP(&l, write_buf)),
+ JEQ(STDERR_FILENO, JUMP(&l, write_buf)),
+ DENY,
+
+ LABEL(&l, write_buf),
+ ARG(1),
+ JEQ((unsigned long)msg1, JUMP(&l, msg1_len)),
+ JEQ((unsigned long)msg2, JUMP(&l, msg2_len)),
+ JEQ((unsigned long)buf, JUMP(&l, buf_len)),
+ DENY,
+
+ LABEL(&l, msg1_len),
+ ARG(2),
+ JLT(sizeof(msg1), ALLOW),
+ DENY,
+
+ LABEL(&l, msg2_len),
+ ARG(2),
+ JLT(sizeof(msg2), ALLOW),
+ DENY,
+
+ LABEL(&l, buf_len),
+ ARG(2),
+ JLT(sizeof(buf), ALLOW),
+ DENY,
+ };
+ struct seccomp_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ ssize_t bytes;
+ bpf_resolve_jumps(&l, filter, sizeof(filter)/sizeof(*filter));
+
+ if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ syscall(__NR_write, STDOUT_FILENO, msg1, strlen(msg1));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf)-1);
+ bytes = (bytes > 0 ? bytes : 0);
+ syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2));
+ syscall(__NR_write, STDERR_FILENO, buf, bytes);
+ /* Now get killed */
+ syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2)+2);
+ return 0;
+}
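The "Now get killed" write at the end of main() is rejected by the length check, not by the descriptor or buffer checks. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the comparison models the intended "less than" meaning of JLT(sizeof(msg2), ALLOW).

```c
#include <stddef.h>
#include <string.h>

/* Same text as msg2 in bpf-fancy.c. */
static const char sketch_msg2[] = "You typed: ";

/* Models JLT(limit, ALLOW): allow only if len < limit. */
int sketch_len_allowed(size_t len, size_t limit)
{
	return len < limit;
}

/* The ordinary write of msg2 passes strlen(msg2) as the length. */
int sketch_normal_write_allowed(void)
{
	return sketch_len_allowed(strlen(sketch_msg2), sizeof(sketch_msg2));
}

/* The final write passes strlen(msg2) + 2, which exceeds sizeof(msg2). */
int sketch_final_write_allowed(void)
{
	return sketch_len_allowed(strlen(sketch_msg2) + 2,
				  sizeof(sketch_msg2));
}
```

Here strlen(msg2) is 11 and sizeof(msg2) is 12, so the normal write passes and the final write, with a length of 13, falls through to DENY.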
diff --git a/samples/seccomp/bpf-helper.c b/samples/seccomp/bpf-helper.c
new file mode 100644
index 0000000..e1b6bc7
--- /dev/null
+++ b/samples/seccomp/bpf-helper.c
@@ -0,0 +1,89 @@
+/*
+ * Seccomp BPF helper functions
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include "bpf-helper.h"
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+ struct seccomp_filter_block *filter, size_t count)
+{
+ struct seccomp_filter_block *begin = filter;
+ __u8 insn = count - 1;
+
+ if (count < 1)
+ return -1;
+ /*
+ * Walk it once, backwards, to build the label table and do fixups.
+ * Since backward jumps are disallowed by BPF, this is easy.
+ */
+ filter += insn;
+ for (; filter >= begin; --insn, --filter) {
+ if (filter->code != (BPF_JMP+BPF_JA))
+ continue;
+ switch ((filter->jt<<8)|filter->jf) {
+ case (JUMP_JT<<8)|JUMP_JF:
+ if (labels->labels[filter->k].location == 0xffffffff) {
+ fprintf(stderr, "Unresolved label: '%s'\n",
+ labels->labels[filter->k].label);
+ return 1;
+ }
+ filter->k = labels->labels[filter->k].location -
+ (insn + 1);
+ filter->jt = 0;
+ filter->jf = 0;
+ continue;
+ case (LABEL_JT<<8)|LABEL_JF:
+ if (labels->labels[filter->k].location != 0xffffffff) {
+ fprintf(stderr, "Duplicate label use: '%s'\n",
+ labels->labels[filter->k].label);
+ return 1;
+ }
+ labels->labels[filter->k].location = insn;
+ filter->k = 0; /* fall through */
+ filter->jt = 0;
+ filter->jf = 0;
+ continue;
+ }
+ }
+ return 0;
+}
+
+/* Simple lookup table for labels. */
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label)
+{
+ struct __bpf_label *begin = labels->labels, *end;
+ int id;
+ if (labels->count == 0) {
+ begin->label = label;
+ begin->location = 0xffffffff;
+ labels->count++;
+ return 0;
+ }
+ end = begin + labels->count;
+ for (id = 0; begin < end; ++begin, ++id) {
+ if (!strcmp(label, begin->label))
+ return id;
+ }
+ begin->label = label;
+ begin->location = 0xffffffff;
+ labels->count++;
+ return id;
+}
+
+void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count)
+{
+ struct seccomp_filter_block *end = filter + count;
+ for ( ; filter < end; ++filter)
+ printf("{ code=%u,jt=%u,jf=%u,k=%u },\n",
+ filter->code, filter->jt, filter->jf, filter->k);
+}
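The backward walk in bpf_resolve_jumps() can be sketched on a reduced instruction type. Everything prefixed rs_ or RS_ below is hypothetical. The offset formula, location - (insn + 1), and the 0xffffffff "unset" marker are taken from the code above.

```c
#include <stddef.h>
#include <stdint.h>

#define RS_UNSET 0xffffffffu

enum rs_kind { RS_OTHER, RS_JUMP, RS_LABEL };

struct rs_slot {
	enum rs_kind kind;
	uint32_t id;	/* label id for RS_JUMP and RS_LABEL */
	uint32_t k;	/* resolved relative offset */
};

/*
 * Walk the program backwards. A label records its own index; a jump met
 * later in the walk (so earlier in the program) turns that index into an
 * offset relative to the instruction after the jump.
 * Returns 0 on success and -1 on an unresolved or duplicate label.
 */
int rs_resolve(struct rs_slot *prog, size_t count,
	       uint32_t *loc, size_t nlabels)
{
	size_t i;

	for (i = 0; i < nlabels; i++)
		loc[i] = RS_UNSET;

	for (i = count; i-- > 0; ) {
		struct rs_slot *s = &prog[i];

		if (s->kind == RS_OTHER)
			continue;
		if (s->id >= nlabels)
			return -1;
		if (s->kind == RS_LABEL) {
			if (loc[s->id] != RS_UNSET)
				return -1;	/* duplicate label */
			loc[s->id] = (uint32_t)i;
			s->k = 0;		/* label becomes a no-op jump */
		} else {
			if (loc[s->id] == RS_UNSET)
				return -1;	/* backward or missing target */
			s->k = loc[s->id] - (uint32_t)(i + 1);
		}
	}
	return 0;
}

/* Demo: jump at index 0 to a label at index 3; returns the resolved k. */
uint32_t rs_demo_forward(void)
{
	struct rs_slot prog[] = {
		{ RS_JUMP, 0, 0 }, { RS_OTHER, 0, 0 },
		{ RS_OTHER, 0, 0 }, { RS_LABEL, 0, 0 },
	};
	uint32_t loc[1];

	if (rs_resolve(prog, 4, loc, 1) != 0)
		return RS_UNSET;
	return prog[0].k;
}

/* Demo: a label placed before its jump cannot be resolved. */
int rs_demo_backward(void)
{
	struct rs_slot prog[] = { { RS_LABEL, 0, 0 }, { RS_JUMP, 0, 0 } };
	uint32_t loc[1];

	return rs_resolve(prog, 2, loc, 1);
}
```

A single backward pass suffices because every label a jump may target lies after it, so the label has already been recorded by the time the walk reaches the jump.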
diff --git a/samples/seccomp/bpf-helper.h b/samples/seccomp/bpf-helper.h
new file mode 100644
index 0000000..92b94ec
--- /dev/null
+++ b/samples/seccomp/bpf-helper.h
@@ -0,0 +1,219 @@
+/*
+ * Example wrapper around BPF macros.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ *
+ * No guarantees are provided with respect to the correctness
+ * or functionality of this code.
+ */
+#ifndef __BPF_HELPER_H__
+#define __BPF_HELPER_H__
+
+#include <asm/bitsperlong.h> /* for __BITS_PER_LONG */
+#include <linux/filter.h>
+#include <linux/seccomp_filter.h> /* for seccomp_filter_data.arg */
+#include <linux/types.h>
+#include <linux/unistd.h>
+#include <stddef.h>
+
+#define BPF_LABELS_MAX 256
+struct bpf_labels {
+ int count;
+ struct __bpf_label {
+ const char *label;
+ __u32 location;
+ } labels[BPF_LABELS_MAX];
+};
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+ struct seccomp_filter_block *filter, size_t count);
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label);
+void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count);
+
+#define JUMP_JT 0xff
+#define JUMP_JF 0xff
+#define LABEL_JT 0xfe
+#define LABEL_JF 0xfe
+
+#define ALLOW \
+ BPF_STMT(BPF_RET+BPF_K, 0xFFFFFFFF)
+#define DENY \
+ BPF_STMT(BPF_RET+BPF_K, 0)
+#define JUMP(labels, label) \
+ BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+ JUMP_JT, JUMP_JF)
+#define LABEL(labels, label) \
+ BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+ LABEL_JT, LABEL_JF)
+#define SYSCALL(nr, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (nr), 0, 1), \
+ jt
+
+/* Lame, but just an example */
+#define FIND_LABEL(labels, label) seccomp_bpf_label((labels), #label)
+
+#define EXPAND(...) __VA_ARGS__
+/* Map all width-sensitive operations */
+#if __BITS_PER_LONG == 32
+
+#define JEQ(x, jt) JEQ32(x, EXPAND(jt))
+#define JNE(x, jt) JNE32(x, EXPAND(jt))
+#define JGT(x, jt) JGT32(x, EXPAND(jt))
+#define JLT(x, jt) JLT32(x, EXPAND(jt))
+#define JGE(x, jt) JGE32(x, EXPAND(jt))
+#define JLE(x, jt) JLE32(x, EXPAND(jt))
+#define JA(x, jt) JA32(x, EXPAND(jt))
+#define ARG(i) ARG_32(i)
+
+#elif __BITS_PER_LONG == 64
+
+#define JEQ(x, jt) \
+ JEQ64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+ ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JGT(x, jt) \
+ JGT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+ ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JGE(x, jt) \
+ JGE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+ ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JNE(x, jt) \
+ JNE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+ ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JLT(x, jt) \
+ JLT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+ ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JLE(x, jt) \
+ JLE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+ ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+
+#define JA(x, jt) \
+ JA64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+ ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define ARG(i) ARG_64(i)
+
+#else
+#error __BITS_PER_LONG value unusable.
+#endif
+
+/* Loads the arg into A */
+#define ARG_32(idx) \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_filter_data, args[(idx)].lo32))
+
+/* Loads hi into A; lo is saved in M[0] and hi in M[1] */
+#define ARG_64(idx) \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_filter_data, args[(idx)].lo32)), \
+ BPF_STMT(BPF_ST, 0), /* lo -> M[0] */ \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_filter_data, args[(idx)].hi32)), \
+ BPF_STMT(BPF_ST, 1) /* hi -> M[1] */
+
+#define JEQ32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 0, 1), \
+ jt
+
+#define JNE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 1, 0), \
+ jt
+
+/* Checks hi (already in A) first, then swaps in lo from M[0]. */
+#define JEQ64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JNE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JA32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (value), 0, 1), \
+ jt
+
+#define JA64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (hi), 3, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 0, 1), \
+ jt
+
+#define JLT32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 1, 0), \
+ jt
+
+/* Shortcut: pass immediately if arg.hi > hi. */
+#define JGE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLT64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (hi), 0, 4), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGT32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
+ jt
+
+#define JLE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 1, 0), \
+ jt
+
+/* Check arg.hi > hi first, then compare the lo halves */
+#define JGT64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 6, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define LOAD_SYSCALL_NR \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_filter_data, syscall_nr))
+
+#endif /* __BPF_HELPER_H__ */
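The 64-bit macros compare an argument as two 32-bit halves because the classic BPF accumulator is 32 bits wide. The sketch below states the intended result of a few of those comparisons in plain C, as a reference to check the jump offsets against. The cmp_ names are hypothetical, and shifts are used instead of union seccomp_filter_arg so that the sketch makes no byte-order assumption.

```c
#include <stdint.h>

uint32_t cmp_lo32(uint64_t v) { return (uint32_t)(v & 0xffffffffu); }
uint32_t cmp_hi32(uint64_t v) { return (uint32_t)(v >> 32); }

/* arg == want: both halves must match. */
int cmp_jeq64(uint64_t arg, uint64_t want)
{
	if (cmp_hi32(arg) != cmp_hi32(want))
		return 0;
	return cmp_lo32(arg) == cmp_lo32(want);
}

/* arg != want: a mismatch in either half is enough. */
int cmp_jne64(uint64_t arg, uint64_t want)
{
	if (cmp_hi32(arg) != cmp_hi32(want))
		return 1;
	return cmp_lo32(arg) != cmp_lo32(want);
}

/* arg >= want: hi decides unless the hi halves are equal. */
int cmp_jge64(uint64_t arg, uint64_t want)
{
	if (cmp_hi32(arg) > cmp_hi32(want))
		return 1;
	if (cmp_hi32(arg) != cmp_hi32(want))
		return 0;
	return cmp_lo32(arg) >= cmp_lo32(want);
}

/* arg < want: hi decides unless the hi halves are equal. */
int cmp_jlt64(uint64_t arg, uint64_t want)
{
	if (cmp_hi32(arg) < cmp_hi32(want))
		return 1;
	if (cmp_hi32(arg) != cmp_hi32(want))
		return 0;
	return cmp_lo32(arg) < cmp_lo32(want);
}
```

Each function follows the same order as the macros: test the hi half first, and load the lo half only when the hi halves are equal.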
--
1.7.5.4
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v6 3/3] Documentation: prctl/seccomp_filter
2012-01-28 22:11 ` [PATCH v6 3/3] Documentation: prctl/seccomp_filter Will Drewry
@ 2012-01-30 22:47 ` Corey Bryant
2012-01-30 22:52 ` Will Drewry
0 siblings, 1 reply; 13+ messages in thread
From: Corey Bryant @ 2012-01-30 22:47 UTC (permalink / raw)
To: Will Drewry
Cc: linux-kernel, keescook, john.johansen, serge.hallyn, pmoore,
eparis, djm, torvalds, segoon, rostedt, jmorris, scarybeasts,
avi, penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
corbet, alan, indan, mcgrathr
On 01/28/2012 05:11 PM, Will Drewry wrote:
> Documents how system call filtering using Berkeley Packet
> Filter programs works and how it may be used.
> Includes an example for x86 (32-bit) and a semi-generic
> example using an example code generator.
>
> v6: - tweak the language to note the requirement of
> PR_SET_NO_NEW_PRIVS being called prior to use. (luto@mit.edu)
> v5: - update sample to use system call arguments
> - adds a "fancy" example using a macro-based generator
> - cleaned up bpf in the sample
> - update docs to mention arguments
> - fix prctl value (eparis@redhat.com)
> - language cleanup (rdunlap@xenotime.net)
> v4: - update for no_new_privs use
> - minor tweaks
> v3: - call out BPF<-> Berkeley Packet Filter (rdunlap@xenotime.net)
> - document use of tentative always-unprivileged
> - guard sample compilation for i386 and x86_64
> v2: - move code to samples (corbet@lwn.net)
>
> Signed-off-by: Will Drewry<wad@chromium.org>
> ---
> Documentation/prctl/seccomp_filter.txt | 100 +++++++++++++++
> samples/Makefile | 2 +-
> samples/seccomp/Makefile | 27 ++++
> samples/seccomp/bpf-direct.c | 77 +++++++++++
> samples/seccomp/bpf-fancy.c | 95 ++++++++++++++
> samples/seccomp/bpf-helper.c | 89 +++++++++++++
> samples/seccomp/bpf-helper.h | 219 ++++++++++++++++++++++++++++++++
> 7 files changed, 608 insertions(+), 1 deletions(-)
> create mode 100644 Documentation/prctl/seccomp_filter.txt
> create mode 100644 samples/seccomp/Makefile
> create mode 100644 samples/seccomp/bpf-direct.c
> create mode 100644 samples/seccomp/bpf-fancy.c
> create mode 100644 samples/seccomp/bpf-helper.c
> create mode 100644 samples/seccomp/bpf-helper.h
>
> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
> new file mode 100644
> index 0000000..4ad7649
> --- /dev/null
> +++ b/Documentation/prctl/seccomp_filter.txt
> @@ -0,0 +1,100 @@
> + Seccomp filtering
> + =================
> +
> +Introduction
> +------------
> +
> +A large number of system calls are exposed to every userland process
> +with many of them going unused for the entire lifetime of the process.
> +As system calls change and mature, bugs are found and eradicated. A
> +certain subset of userland applications benefit by having a reduced set
> +of available system calls. The resulting set reduces the total kernel
> +surface exposed to the application. System call filtering is meant for
> +use with those applications.
> +
> +Seccomp filtering provides a means for a process to specify a filter for
> +incoming system calls. The filter is expressed as a Berkeley Packet
> +Filter (BPF) program, as with socket filters, except that the data
> +operated on is related to the system call being made: system call
> +number, and the system call arguments. This allows for expressive
> +filtering of system calls using a filter program language with a long
> +history of being exposed to userland and a straightforward data set.
> +
> +Additionally, BPF makes it impossible for users of seccomp to fall prey
> +to time-of-check-time-of-use (TOCTOU) attacks that are common in system
> +call interposition frameworks. BPF programs may not dereference
> +pointers which constrains all filters to solely evaluating the system
> +call arguments directly.
> +
> +What it isn't
> +-------------
> +
> +System call filtering isn't a sandbox. It provides a clearly defined
> +mechanism for minimizing the exposed kernel surface. Beyond that,
> +policy for logical behavior and information flow should be managed with
> +a combination of other system hardening techniques and, potentially, an
> +LSM of your choosing. Expressive, dynamic filters provide further options down
> +this path (avoiding pathological sizes or selecting which of the multiplexed
> +system calls in socketcall() is allowed, for instance) which could be
> +construed, incorrectly, as a more complete sandboxing solution.
> +
> +Usage
> +-----
> +
> +An additional seccomp mode is added, but they are not directly set by
> +the consuming process. The new mode, '2', is only available if
> +CONFIG_SECCOMP_FILTER is set and enabled using prctl with the
> +PR_ATTACH_SECCOMP_FILTER argument.
> +
> +Interacting with seccomp filters is done using one prctl(2) call.
> +
> +PR_ATTACH_SECCOMP_FILTER:
> + Allows the specification of a new filter using a BPF program.
> + The BPF program will be executed over struct seccomp_filter_data
> + reflecting the system call number, arguments, and other
> + metadata, To allow a system call, SECCOMP_BPF_ALLOW must be
> + returned. At present, all other return values result in the
> + system call being blocked, but it is recommended to return
> + SECCOMP_BPF_DENY in those cases. This will allow for future
> + custom return values to be introduced, if ever desired.
> +
> + Usage:
> + prctl(PR_ATTACH_SECCOMP_FILTER, prog);
> +
> + The 'prog' argument is a pointer to a struct sock_fprog which will
> + contain the filter program. If the program is invalid, the call
> + will return -1 and set errno to EINVAL.
> +
> + Note, is_compat_task is also tracked for the @prog. This means
> + that once set the calling task will have all of its system calls
> + blocked if it switches its system call ABI.
> +
> + If fork/clone and execve are allowed by @prog, any child processes will
> + be constrained to the same filters and system call ABI as the parent.
> +
> + Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
> + run with CAP_SYS_ADMIN privileges in its namespace. If these are not
> + true, -EACCES will be returned. This requirement ensures that filter
> + programs cannot be applied to child processes with greater privileges
> + than the task that installed them.
> +
> + Additionally, if prctl(2) is allowed by the attached filter,
> + additional filters may be layered on which will increase evaluation
> + time, but allow for further decreasing the attack surface during
> + execution of a process.
> +
> +The above call returns 0 on success and non-zero on error.
> +
> +Example
> +-------
> +
> +The samples/seccomp/ directory contains both a 32-bit specific example
> +and a more generic example of a higher level macro interface for BPF
> +program generation.
> +
> +Adding architecture support
> +-----------------------
> +
> +Any platform with seccomp support will support seccomp filters as long
> +as CONFIG_SECCOMP_FILTER is enabled and the architecture has implemented
> +syscall_get_arguments.
> diff --git a/samples/Makefile b/samples/Makefile
> index 6280817..f29b19c 100644
> --- a/samples/Makefile
> +++ b/samples/Makefile
> @@ -1,4 +1,4 @@
> # Makefile for Linux samples code
>
> obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
> - hw_breakpoint/ kfifo/ kdb/ hidraw/
> + hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
> diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
> new file mode 100644
> index 0000000..0298c6f
> --- /dev/null
> +++ b/samples/seccomp/Makefile
> @@ -0,0 +1,27 @@
> +# kbuild trick to avoid linker error. Can be omitted if a module is built.
> +obj- := dummy.o
> +
> +hostprogs-y := bpf-fancy
> +bpf-fancy-objs := bpf-fancy.o bpf-helper.o
> +
> +HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
> +HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
> +HOSTCFLAGS_bpf-helper.o += -I$(objtree)/usr/include
> +HOSTCFLAGS_bpf-helper.o += -idirafter $(objtree)/include
> +
> +# bpf-direct.c is x86-only.
> +ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
> +# List of programs to build
> +hostprogs-y += bpf-direct
> +bpf-direct-objs := bpf-direct.o
> +endif
> +
> +# Tell kbuild to always build the programs
> +always := $(hostprogs-y)
> +
> +HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
> +HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
> +ifeq ($(KBUILD_BUILDHOST),x86_64)
> +HOSTCFLAGS_bpf-direct.o += -m32
> +HOSTLOADLIBES_bpf-direct += -m32
> +endif
> diff --git a/samples/seccomp/bpf-direct.c b/samples/seccomp/bpf-direct.c
> new file mode 100644
> index 0000000..d799244
> --- /dev/null
> +++ b/samples/seccomp/bpf-direct.c
> @@ -0,0 +1,77 @@
> +/*
> + * 32-bit seccomp filter example with BPF macros
> + *
> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + * Author: Will Drewry <wad@chromium.org>
> + *
> + * The code may be used by anyone for any purpose,
> + * and can serve as a starting point for developing
> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
> + */
> +
> +#include <linux/filter.h>
> +#include <linux/ptrace.h>
> +#include <linux/seccomp_filter.h>
> +#include <linux/unistd.h>
> +#include <stdio.h>
> +#include <stddef.h>
> +#include <sys/prctl.h>
> +#include <unistd.h>
> +
> +#ifndef PR_ATTACH_SECCOMP_FILTER
> +# define PR_ATTACH_SECCOMP_FILTER 37
> +#endif
> +
> +#define syscall_arg(_n) (offsetof(struct seccomp_filter_data, args[_n].lo32))
> +#define nr (offsetof(struct seccomp_filter_data, syscall_nr))
> +
> +static int install_filter(void)
> +{
> + struct seccomp_filter_block filter[] = {
> + /* Grab the system call number */
> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, nr),
> + /* Jump table for the allowed syscalls */
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
> +
> + /* Check that read is only using stdin. */
> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
> +
> + /* Check that write is only using stdout/stderr */
> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
> +
> + BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_ALLOW),
> + BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_DENY),
> + };
> + struct seccomp_fprog prog = {
> + .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
> + .filter = filter,
> + };
> + if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
> + perror("prctl");
> + return 1;
> + }
> + return 0;
> +}
> +
> +#define payload(_c) (_c), sizeof((_c))
> +int main(int argc, char **argv)
> +{
> + char buf[4096];
> + ssize_t bytes = 0;
> + if (install_filter())
> + return 1;
> + syscall(__NR_write, STDOUT_FILENO,
> + payload("OHAI! WHAT IS YOUR NAME? "));
> + bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
> + syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
> + syscall(__NR_write, STDOUT_FILENO, buf, bytes);
> + return 0;
> +}
> diff --git a/samples/seccomp/bpf-fancy.c b/samples/seccomp/bpf-fancy.c
> new file mode 100644
> index 0000000..1318b1a
> --- /dev/null
> +++ b/samples/seccomp/bpf-fancy.c
> @@ -0,0 +1,95 @@
> +/*
> + * Seccomp BPF example using a macro-based generator.
> + *
> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + * Author: Will Drewry <wad@chromium.org>
> + *
> + * The code may be used by anyone for any purpose,
> + * and can serve as a starting point for developing
> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
> + */
> +
> +#include <linux/seccomp_filter.h>
> +#include <linux/unistd.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <sys/prctl.h>
> +#include <unistd.h>
> +
> +#include "bpf-helper.h"
> +
> +#ifndef PR_ATTACH_SECCOMP_FILTER
> +# define PR_ATTACH_SECCOMP_FILTER 37
> +#endif
> +
> +int main(int argc, char **argv)
> +{
> + struct bpf_labels l;
> + static const char msg1[] = "Please type something: ";
> + static const char msg2[] = "You typed: ";
> + char buf[256];
> + struct seccomp_filter_block filter[] = {
> + LOAD_SYSCALL_NR,
> + SYSCALL(__NR_exit, ALLOW),
> + SYSCALL(__NR_exit_group, ALLOW),
> + SYSCALL(__NR_write, JUMP(&l, write_fd)),
> + SYSCALL(__NR_read, JUMP(&l, read)),
> + DENY, /* Don't passthrough into a label */
> +
> + LABEL(&l, read),
> + ARG(0),
> + JNE(STDIN_FILENO, DENY),
> + ARG(1),
> + JNE((unsigned long)buf, DENY),
> + ARG(2),
> + JGE(sizeof(buf), DENY),
> + ALLOW,
> +
> + LABEL(&l, write_fd),
> + ARG(0),
> + JEQ(STDOUT_FILENO, JUMP(&l, write_buf)),
> + JEQ(STDERR_FILENO, JUMP(&l, write_buf)),
> + DENY,
> +
> + LABEL(&l, write_buf),
> + ARG(1),
> + JEQ((unsigned long)msg1, JUMP(&l, msg1_len)),
> + JEQ((unsigned long)msg2, JUMP(&l, msg2_len)),
> + JEQ((unsigned long)buf, JUMP(&l, buf_len)),
> + DENY,
> +
> + LABEL(&l, msg1_len),
> + ARG(2),
> + JLT(sizeof(msg1), ALLOW),
> + DENY,
> +
> + LABEL(&l, msg2_len),
> + ARG(2),
> + JLT(sizeof(msg2), ALLOW),
> + DENY,
> +
> + LABEL(&l, buf_len),
> + ARG(2),
> + JLT(sizeof(buf), ALLOW),
> + DENY,
> + };
> + struct seccomp_fprog prog = {
> + .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
> + .filter = filter,
> + };
> + ssize_t bytes;
> + bpf_resolve_jumps(&l, filter, sizeof(filter)/sizeof(*filter));
> +
> + if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
> + perror("prctl");
> + return 1;
> + }
> + syscall(__NR_write, STDOUT_FILENO, msg1, strlen(msg1));
> + bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf)-1);
> + bytes = (bytes > 0 ? bytes : 0);
> + syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2));
> + syscall(__NR_write, STDERR_FILENO, buf, bytes);
> + /* Now get killed */
> + syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2)+2);
> + return 0;
> +}
> diff --git a/samples/seccomp/bpf-helper.c b/samples/seccomp/bpf-helper.c
> new file mode 100644
> index 0000000..e1b6bc7
> --- /dev/null
> +++ b/samples/seccomp/bpf-helper.c
> @@ -0,0 +1,89 @@
> +/*
> + * Seccomp BPF helper functions
> + *
> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + * Author: Will Drewry <wad@chromium.org>
> + *
> + * The code may be used by anyone for any purpose,
> + * and can serve as a starting point for developing
> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
> + */
> +
> +#include <stdio.h>
> +#include <string.h>
> +
> +#include "bpf-helper.h"
> +
> +int bpf_resolve_jumps(struct bpf_labels *labels,
> + struct seccomp_filter_block *filter, size_t count)
> +{
> + struct seccomp_filter_block *begin = filter;
> + __u8 insn = count - 1;
> +
> + if (count < 1)
> + return -1;
> + /*
> + * Walk it once, backwards, to build the label table and do fixups.
> + * Since backward jumps are disallowed by BPF, this is easy.
> + */
> + filter += insn;
> + for (; filter >= begin; --insn, --filter) {
> + if (filter->code != (BPF_JMP+BPF_JA))
> + continue;
> + switch ((filter->jt<<8)|filter->jf) {
> + case (JUMP_JT<<8)|JUMP_JF:
> + if (labels->labels[filter->k].location == 0xffffffff) {
> + fprintf(stderr, "Unresolved label: '%s'\n",
> + labels->labels[filter->k].label);
> + return 1;
> + }
> + filter->k = labels->labels[filter->k].location -
> + (insn + 1);
> + filter->jt = 0;
> + filter->jf = 0;
> + continue;
> + case (LABEL_JT<<8)|LABEL_JF:
> + if (labels->labels[filter->k].location != 0xffffffff) {
> + fprintf(stderr, "Duplicate label use: '%s'\n",
> + labels->labels[filter->k].label);
> + return 1;
> + }
> + labels->labels[filter->k].location = insn;
> + filter->k = 0; /* fall through */
> + filter->jt = 0;
> + filter->jf = 0;
> + continue;
> + }
> + }
> + return 0;
> +}
> +
> +/* Simple lookup table for labels. */
> +__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label)
> +{
> + struct __bpf_label *begin = labels->labels, *end;
> + int id;
> + if (labels->count == 0) {
> + begin->label = label;
> + begin->location = 0xffffffff;
> + labels->count++;
> + return 0;
> + }
> + end = begin + labels->count;
> + for (id = 0; begin < end; ++begin, ++id) {
> + if (!strcmp(label, begin->label))
> + return id;
> + }
> + begin->label = label;
> + begin->location = 0xffffffff;
> + labels->count++;
> + return id;
> +}
> +
> +void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count)
> +{
> + struct seccomp_filter_block *end = filter + count;
> + for ( ; filter < end; ++filter)
> + printf("{ code=%u,jt=%u,jf=%u,k=%u },\n",
> + filter->code, filter->jt, filter->jf, filter->k);
> +}
> diff --git a/samples/seccomp/bpf-helper.h b/samples/seccomp/bpf-helper.h
> new file mode 100644
> index 0000000..92b94ec
> --- /dev/null
> +++ b/samples/seccomp/bpf-helper.h
> @@ -0,0 +1,219 @@
> +/*
> + * Example wrapper around BPF macros.
> + *
> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + * Author: Will Drewry <wad@chromium.org>
> + *
> + * The code may be used by anyone for any purpose,
> + * and can serve as a starting point for developing
> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
> + *
> + * No guarantees are provided with respect to the correctness
> + * or functionality of this code.
> + */
> +#ifndef __BPF_HELPER_H__
> +#define __BPF_HELPER_H__
> +
> +#include <asm/bitsperlong.h> /* for __BITS_PER_LONG */
> +#include <linux/filter.h>
> +#include <linux/seccomp_filter.h> /* for seccomp_filter_data.arg */
> +#include <linux/types.h>
> +#include <linux/unistd.h>
> +#include <stddef.h>
> +
> +#define BPF_LABELS_MAX 256
> +struct bpf_labels {
> + int count;
> + struct __bpf_label {
> + const char *label;
> + __u32 location;
> + } labels[BPF_LABELS_MAX];
> +};
> +
> +int bpf_resolve_jumps(struct bpf_labels *labels,
> + struct seccomp_filter_block *filter, size_t count);
> +__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label);
> +void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count);
> +
> +#define JUMP_JT 0xff
> +#define JUMP_JF 0xff
> +#define LABEL_JT 0xfe
> +#define LABEL_JF 0xfe
> +
> +#define ALLOW \
> + BPF_STMT(BPF_RET+BPF_K, 0xFFFFFFFF)
> +#define DENY \
> + BPF_STMT(BPF_RET+BPF_K, 0)
> +#define JUMP(labels, label) \
> + BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
> + JUMP_JT, JUMP_JF)
> +#define LABEL(labels, label) \
> + BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
> + LABEL_JT, LABEL_JF)
> +#define SYSCALL(nr, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (nr), 0, 1), \
> + jt
> +
> +/* Lame, but just an example */
> +#define FIND_LABEL(labels, label) seccomp_bpf_label((labels), #label)
> +
> +#define EXPAND(...) __VA_ARGS__
> +/* Map all width-sensitive operations */
> +#if __BITS_PER_LONG == 32
> +
> +#define JEQ(x, jt) JEQ32(x, EXPAND(jt))
> +#define JNE(x, jt) JNE32(x, EXPAND(jt))
> +#define JGT(x, jt) JGT32(x, EXPAND(jt))
> +#define JLT(x, jt) JLT32(x, EXPAND(jt))
> +#define JGE(x, jt) JGE32(x, EXPAND(jt))
> +#define JLE(x, jt) JLE32(x, EXPAND(jt))
> +#define JA(x, jt) JA32(x, EXPAND(jt))
> +#define ARG(i) ARG_32(i)
> +
> +#elif __BITS_PER_LONG == 64
> +
> +#define JEQ(x, jt) \
> + JEQ64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> + EXPAND(jt))
> +#define JGT(x, jt) \
> + JGT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> + EXPAND(jt))
> +#define JGE(x, jt) \
> + JGE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> + EXPAND(jt))
> +#define JNE(x, jt) \
> + JNE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> + EXPAND(jt))
> +#define JLT(x, jt) \
> + JLT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> + EXPAND(jt))
> +#define JLE(x, jt) \
> + JLE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> + EXPAND(jt))
> +
> +#define JA(x, jt) \
> + JA64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> + EXPAND(jt))
> +#define ARG(i) ARG_64(i)
> +
> +#else
> +#error __BITS_PER_LONG value unusable.
> +#endif
> +
> +/* Loads the arg into A */
> +#define ARG_32(idx) \
> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
> + offsetof(struct seccomp_filter_data, args[(idx)].lo32))
> +
> +/* Loads hi into A and lo in X */
> +#define ARG_64(idx) \
> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
> + offsetof(struct seccomp_filter_data, args[(idx)].lo32)), \
> + BPF_STMT(BPF_ST, 0), /* lo -> M[0] */ \
> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
> + offsetof(struct seccomp_filter_data, args[(idx)].hi32)), \
> + BPF_STMT(BPF_ST, 1) /* hi -> M[1] */
> +
> +#define JEQ32(value, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 0, 1), \
> + jt
> +
> +#define JNE32(value, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 1, 0), \
> + jt
> +
> +/* Checks the lo, then swaps to check the hi. A=lo,X=hi */
> +#define JEQ64(lo, hi, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 0, 2), \
> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> + jt, \
> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JNE64(lo, hi, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 5, 0), \
> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 2, 0), \
> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> + jt, \
> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JA32(value, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (value), 0, 1), \
> + jt
> +
> +#define JA64(lo, hi, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (hi), 3, 0), \
> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> + BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (lo), 0, 2), \
> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> + jt, \
> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JGE32(value, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 0, 1), \
> + jt
> +
> +#define JLT32(value, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 1, 0), \
> + jt
> +
> +/* Shortcut checking if hi > arg.hi. */
> +#define JGE64(lo, hi, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> + BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 0, 2), \
> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> + jt, \
> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JLT64(lo, hi, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (hi), 0, 4), \
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> + jt, \
> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JGT32(value, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
> + jt
> +
> +#define JLE32(value, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
> + jt
Should the true/false offsets be reversed here?
Thanks for all the work on this. We're looking forward to using it with
QEMU.
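
A minimal sketch of the reversal being asked about, assuming JLE32 is meant to run jt when A <= value. As posted, JLE32 expands to the same instruction as JGT32. The opcode values below are local stand-ins that mirror classic BPF in <linux/filter.h>; struct insn, run(), posted_allows() and fixed_allows() are hypothetical names for illustration only and are not part of the patch:

```c
#include <stdint.h>

/* Stand-in definitions; values mirror classic BPF in <linux/filter.h>. */
#define BPF_JMP 0x05
#define BPF_RET 0x06
#define BPF_JGT 0x20
#define BPF_K   0x00

/* Hypothetical stand-in for the filter block layout (code, jt, jf, k). */
struct insn { uint16_t code; uint8_t jt, jf; uint32_t k; };
#define BPF_STMT(c, v)       { (uint16_t)(c), 0, 0, (v) }
#define BPF_JUMP(c, v, t, f) { (uint16_t)(c), (t), (f), (v) }

#define ALLOW BPF_STMT(BPF_RET+BPF_K, 0xFFFFFFFF)
#define DENY  BPF_STMT(BPF_RET+BPF_K, 0)

/* As posted: identical to JGT32, so jt runs when A > value. */
#define JLE32_POSTED(value, jt) \
    BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
    jt

/* Assumed fix: skip jt when A > value, run it when A <= value. */
#define JLE32_FIXED(value, jt) \
    BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 1, 0), \
    jt

/* Toy interpreter: understands only RET+K and JGT+K. */
static uint32_t run(const struct insn *p, uint32_t a)
{
    for (;;) {
        if (p->code == (BPF_RET+BPF_K))
            return p->k;
        /* JGT: advance by jt if A > k, otherwise by jf. */
        p += 1 + ((a > p->k) ? p->jt : p->jf);
    }
}

/* Both programs are meant to express "A <= 5 ? ALLOW : DENY". */
int posted_allows(uint32_t a)
{
    const struct insn prog[] = { JLE32_POSTED(5, ALLOW), DENY };
    return run(prog, a) != 0;
}

int fixed_allows(uint32_t a)
{
    const struct insn prog[] = { JLE32_FIXED(5, ALLOW), DENY };
    return run(prog, a) != 0;
}
```

With the posted offsets, an accumulator of 3 is denied and 7 is allowed, which is the JGT behaviour; with the offsets swapped, 3 and 5 are allowed and 7 is denied.
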
> +
> +/* Check hi > args.hi first, then do the GE checking */
> +#define JGT64(lo, hi, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 0, 2), \
> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> + jt, \
> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JLE64(lo, hi, jt) \
> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 6, 0), \
> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), \
> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> + jt, \
> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define LOAD_SYSCALL_NR \
> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
> + offsetof(struct seccomp_filter_data, syscall_nr))
> +
> +#endif /* __BPF_HELPER_H__ */
--
Regards,
Corey
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v6 3/3] Documentation: prctl/seccomp_filter
2012-01-30 22:47 ` Corey Bryant
@ 2012-01-30 22:52 ` Will Drewry
0 siblings, 0 replies; 13+ messages in thread
From: Will Drewry @ 2012-01-30 22:52 UTC (permalink / raw)
To: Corey Bryant
Cc: linux-kernel, keescook, john.johansen, serge.hallyn, pmoore,
eparis, djm, torvalds, segoon, rostedt, jmorris, scarybeasts,
avi, penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
corbet, alan, indan, mcgrathr
On Mon, Jan 30, 2012 at 4:47 PM, Corey Bryant <coreyb@linux.vnet.ibm.com> wrote:
>
>
> On 01/28/2012 05:11 PM, Will Drewry wrote:
>>
>> Documents how system call filtering using Berkeley Packet
>> Filter programs works and how it may be used.
>> Includes an example for x86 (32-bit) and a semi-generic
>> example using an example code generator.
>>
>> v6: - tweak the language to note the requirement of
>> PR_SET_NO_NEW_PRIVS being called prior to use. (luto@mit.edu)
>> v5: - update sample to use system call arguments
>> - adds a "fancy" example using a macro-based generator
>> - cleaned up bpf in the sample
>> - update docs to mention arguments
>> - fix prctl value (eparis@redhat.com)
>> - language cleanup (rdunlap@xenotime.net)
>> v4: - update for no_new_privs use
>> - minor tweaks
>> v3: - call out BPF <-> Berkeley Packet Filter (rdunlap@xenotime.net)
>> - document use of tentative always-unprivileged
>> - guard sample compilation for i386 and x86_64
>> v2: - move code to samples (corbet@lwn.net)
>>
>> Signed-off-by: Will Drewry <wad@chromium.org>
>> ---
>> Documentation/prctl/seccomp_filter.txt | 100 +++++++++++++++
>> samples/Makefile | 2 +-
>> samples/seccomp/Makefile | 27 ++++
>> samples/seccomp/bpf-direct.c | 77 +++++++++++
>> samples/seccomp/bpf-fancy.c | 95 ++++++++++++++
>> samples/seccomp/bpf-helper.c | 89 +++++++++++++
>> samples/seccomp/bpf-helper.h | 219 ++++++++++++++++++++++++++++++++
>> 7 files changed, 608 insertions(+), 1 deletions(-)
>> create mode 100644 Documentation/prctl/seccomp_filter.txt
>> create mode 100644 samples/seccomp/Makefile
>> create mode 100644 samples/seccomp/bpf-direct.c
>> create mode 100644 samples/seccomp/bpf-fancy.c
>> create mode 100644 samples/seccomp/bpf-helper.c
>> create mode 100644 samples/seccomp/bpf-helper.h
>>
>> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
>> new file mode 100644
>> index 0000000..4ad7649
>> --- /dev/null
>> +++ b/Documentation/prctl/seccomp_filter.txt
>> @@ -0,0 +1,100 @@
>> + Seccomp filtering
>> + =================
>> +
>> +Introduction
>> +------------
>> +
>> +A large number of system calls are exposed to every userland process
>> +with many of them going unused for the entire lifetime of the process.
>> +As system calls change and mature, bugs are found and eradicated. A
>> +certain subset of userland applications benefit by having a reduced set
>> +of available system calls. The resulting set reduces the total kernel
>> +surface exposed to the application. System call filtering is meant for
>> +use with those applications.
>> +
>> +Seccomp filtering provides a means for a process to specify a filter for
>> +incoming system calls. The filter is expressed as a Berkeley Packet
>> +Filter (BPF) program, as with socket filters, except that the data
>> +operated on is related to the system call being made: system call
>> +number, and the system call arguments. This allows for expressive
>> +filtering of system calls using a filter program language with a long
>> +history of being exposed to userland and a straightforward data set.
>> +
>> +Additionally, BPF makes it impossible for users of seccomp to fall prey
>> +to time-of-check-time-of-use (TOCTOU) attacks that are common in system
>> +call interposition frameworks. BPF programs may not dereference
>> +pointers which constrains all filters to solely evaluating the system
>> +call arguments directly.
>> +
>> +What it isn't
>> +-------------
>> +
>> +System call filtering isn't a sandbox. It provides a clearly defined
>> +mechanism for minimizing the exposed kernel surface. Beyond that,
>> +policy for logical behavior and information flow should be managed with
>> +a combination of other system hardening techniques and, potentially, an
>> +LSM of your choosing. Expressive, dynamic filters provide further options down
>> +this path (avoiding pathological sizes or selecting which of the multiplexed
>> +system calls in socketcall() is allowed, for instance) which could be
>> +construed, incorrectly, as a more complete sandboxing solution.
>> +
>> +Usage
>> +-----
>> +
>> +An additional seccomp mode is added, but it is not directly set by
>> +the consuming process. The new mode, '2', is only available if
>> +CONFIG_SECCOMP_FILTER is set and enabled using prctl with the
>> +PR_ATTACH_SECCOMP_FILTER argument.
>> +
>> +Interacting with seccomp filters is done using one prctl(2) call.
>> +
>> +PR_ATTACH_SECCOMP_FILTER:
>> + Allows the specification of a new filter using a BPF program.
>> + The BPF program will be executed over struct seccomp_filter_data
>> + reflecting the system call number, arguments, and other
>> + metadata. To allow a system call, SECCOMP_BPF_ALLOW must be
>> + returned. At present, all other return values result in the
>> + system call being blocked, but it is recommended to return
>> + SECCOMP_BPF_DENY in those cases. This will allow for future
>> + custom return values to be introduced, if ever desired.
>> +
>> + Usage:
>> + prctl(PR_ATTACH_SECCOMP_FILTER, prog);
>> +
>> + The 'prog' argument is a pointer to a struct sock_fprog which will
>> + contain the filter program. If the program is invalid, the call
>> + will return -1 and set errno to EINVAL.
>> +
>> + Note, is_compat_task is also tracked for the @prog. This means
>> + that once set the calling task will have all of its system calls
>> + blocked if it switches its system call ABI.
>> +
>> + If fork/clone and execve are allowed by @prog, any child processes will
>> + be constrained to the same filters and system call ABI as the parent.
>> +
>> + Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
>> + run with CAP_SYS_ADMIN privileges in its namespace. If these are not
>> + true, -EACCES will be returned. This requirement ensures that filter
>> + programs cannot be applied to child processes with greater privileges
>> + than the task that installed them.
>> +
>> + Additionally, if prctl(2) is allowed by the attached filter,
>> + additional filters may be layered on which will increase evaluation
>> + time, but allow for further decreasing the attack surface during
>> + execution of a process.
>> +
>> +The above call returns 0 on success and non-zero on error.
>> +
>> +Example
>> +-------
>> +
>> +The samples/seccomp/ directory contains both a 32-bit specific example
>> +and a more generic example of a higher level macro interface for BPF
>> +program generation.
>> +
>> +Adding architecture support
>> +-----------------------
>> +
>> +Any platform with seccomp support will support seccomp filters as long
>> +as CONFIG_SECCOMP_FILTER is enabled and the architecture has implemented
>> +syscall_get_arguments.
>> diff --git a/samples/Makefile b/samples/Makefile
>> index 6280817..f29b19c 100644
>> --- a/samples/Makefile
>> +++ b/samples/Makefile
>> @@ -1,4 +1,4 @@
>> # Makefile for Linux samples code
>>
>> obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
>> - hw_breakpoint/ kfifo/ kdb/ hidraw/
>> + hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
>> diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
>> new file mode 100644
>> index 0000000..0298c6f
>> --- /dev/null
>> +++ b/samples/seccomp/Makefile
>> @@ -0,0 +1,27 @@
>> +# kbuild trick to avoid linker error. Can be omitted if a module is built.
>> +obj- := dummy.o
>> +
>> +hostprogs-y := bpf-fancy
>> +bpf-fancy-objs := bpf-fancy.o bpf-helper.o
>> +
>> +HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
>> +HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
>> +HOSTCFLAGS_bpf-helper.o += -I$(objtree)/usr/include
>> +HOSTCFLAGS_bpf-helper.o += -idirafter $(objtree)/include
>> +
>> +# bpf-direct.c is x86-only.
>> +ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
>> +# List of programs to build
>> +hostprogs-y += bpf-direct
>> +bpf-direct-objs := bpf-direct.o
>> +endif
>> +
>> +# Tell kbuild to always build the programs
>> +always := $(hostprogs-y)
>> +
>> +HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
>> +HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
>> +ifeq ($(KBUILD_BUILDHOST),x86_64)
>> +HOSTCFLAGS_bpf-direct.o += -m32
>> +HOSTLOADLIBES_bpf-direct += -m32
>> +endif
>> diff --git a/samples/seccomp/bpf-direct.c b/samples/seccomp/bpf-direct.c
>> new file mode 100644
>> index 0000000..d799244
>> --- /dev/null
>> +++ b/samples/seccomp/bpf-direct.c
>> @@ -0,0 +1,77 @@
>> +/*
>> + * 32-bit seccomp filter example with BPF macros
>> + *
>> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> + * Author: Will Drewry <wad@chromium.org>
>> + *
>> + * The code may be used by anyone for any purpose,
>> + * and can serve as a starting point for developing
>> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
>> + */
>> +
>> +#include <linux/filter.h>
>> +#include <linux/ptrace.h>
>> +#include <linux/seccomp_filter.h>
>> +#include <linux/unistd.h>
>> +#include <stdio.h>
>> +#include <stddef.h>
>> +#include <sys/prctl.h>
>> +#include <unistd.h>
>> +
>> +#ifndef PR_ATTACH_SECCOMP_FILTER
>> +# define PR_ATTACH_SECCOMP_FILTER 37
>> +#endif
>> +
>> +#define syscall_arg(_n) (offsetof(struct seccomp_filter_data, args[_n].lo32))
>> +#define nr (offsetof(struct seccomp_filter_data, syscall_nr))
>> +
>> +static int install_filter(void)
>> +{
>> + struct seccomp_filter_block filter[] = {
>> + /* Grab the system call number */
>> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, nr),
>> + /* Jump table for the allowed syscalls */
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
>> +
>> + /* Check that read is only using stdin. */
>> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
>> +
>> + /* Check that write is only using stdout/stderr */
>> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
>> +
>> + BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_ALLOW),
>> + BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_DENY),
>> + };
>> + struct seccomp_fprog prog = {
>> + .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
>> + .filter = filter,
>> + };
>> + if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
>> + perror("prctl");
>> + return 1;
>> + }
>> + return 0;
>> +}
>> +
>> +#define payload(_c) (_c), sizeof((_c))
>> +int main(int argc, char **argv)
>> +{
>> + char buf[4096];
>> + ssize_t bytes = 0;
>> + if (install_filter())
>> + return 1;
>> + syscall(__NR_write, STDOUT_FILENO,
>> + payload("OHAI! WHAT IS YOUR NAME? "));
>> + bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
>> + syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
>> + syscall(__NR_write, STDOUT_FILENO, buf, bytes);
>> + return 0;
>> +}
>> diff --git a/samples/seccomp/bpf-fancy.c b/samples/seccomp/bpf-fancy.c
>> new file mode 100644
>> index 0000000..1318b1a
>> --- /dev/null
>> +++ b/samples/seccomp/bpf-fancy.c
>> @@ -0,0 +1,95 @@
>> +/*
>> + * Seccomp BPF example using a macro-based generator.
>> + *
>> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> + * Author: Will Drewry <wad@chromium.org>
>> + *
>> + * The code may be used by anyone for any purpose,
>> + * and can serve as a starting point for developing
>> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
>> + */
>> +
>> +#include <linux/seccomp_filter.h>
>> +#include <linux/unistd.h>
>> +#include <stdio.h>
>> +#include <string.h>
>> +#include <sys/prctl.h>
>> +#include <unistd.h>
>> +
>> +#include "bpf-helper.h"
>> +
>> +#ifndef PR_ATTACH_SECCOMP_FILTER
>> +# define PR_ATTACH_SECCOMP_FILTER 37
>> +#endif
>> +
>> +int main(int argc, char **argv)
>> +{
>> + struct bpf_labels l;
>> + static const char msg1[] = "Please type something: ";
>> + static const char msg2[] = "You typed: ";
>> + char buf[256];
>> + struct seccomp_filter_block filter[] = {
>> + LOAD_SYSCALL_NR,
>> + SYSCALL(__NR_exit, ALLOW),
>> + SYSCALL(__NR_exit_group, ALLOW),
>> + SYSCALL(__NR_write, JUMP(&l, write_fd)),
>> + SYSCALL(__NR_read, JUMP(&l, read)),
>> + DENY, /* Don't passthrough into a label */
>> +
>> + LABEL(&l, read),
>> + ARG(0),
>> + JNE(STDIN_FILENO, DENY),
>> + ARG(1),
>> + JNE((unsigned long)buf, DENY),
>> + ARG(2),
>> + JGE(sizeof(buf), DENY),
>> + ALLOW,
>> +
>> + LABEL(&l, write_fd),
>> + ARG(0),
>> + JEQ(STDOUT_FILENO, JUMP(&l, write_buf)),
>> + JEQ(STDERR_FILENO, JUMP(&l, write_buf)),
>> + DENY,
>> +
>> + LABEL(&l, write_buf),
>> + ARG(1),
>> + JEQ((unsigned long)msg1, JUMP(&l, msg1_len)),
>> + JEQ((unsigned long)msg2, JUMP(&l, msg2_len)),
>> + JEQ((unsigned long)buf, JUMP(&l, buf_len)),
>> + DENY,
>> +
>> + LABEL(&l, msg1_len),
>> + ARG(2),
>> + JLT(sizeof(msg1), ALLOW),
>> + DENY,
>> +
>> + LABEL(&l, msg2_len),
>> + ARG(2),
>> + JLT(sizeof(msg2), ALLOW),
>> + DENY,
>> +
>> + LABEL(&l, buf_len),
>> + ARG(2),
>> + JLT(sizeof(buf), ALLOW),
>> + DENY,
>> + };
>> + struct seccomp_fprog prog = {
>> + .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
>> + .filter = filter,
>> + };
>> + ssize_t bytes;
>> + bpf_resolve_jumps(&l, filter, sizeof(filter)/sizeof(*filter));
>> +
>> + if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
>> + perror("prctl");
>> + return 1;
>> + }
>> + syscall(__NR_write, STDOUT_FILENO, msg1, strlen(msg1));
>> + bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf)-1);
>> + bytes = (bytes > 0 ? bytes : 0);
>> + syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2));
>> + syscall(__NR_write, STDERR_FILENO, buf, bytes);
>> + /* Now get killed */
>> + syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2)+2);
>> + return 0;
>> +}
>> diff --git a/samples/seccomp/bpf-helper.c b/samples/seccomp/bpf-helper.c
>> new file mode 100644
>> index 0000000..e1b6bc7
>> --- /dev/null
>> +++ b/samples/seccomp/bpf-helper.c
>> @@ -0,0 +1,89 @@
>> +/*
>> + * Seccomp BPF helper functions
>> + *
>> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> + * Author: Will Drewry <wad@chromium.org>
>> + *
>> + * The code may be used by anyone for any purpose,
>> + * and can serve as a starting point for developing
>> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
>> + */
>> +
>> +#include <stdio.h>
>> +#include <string.h>
>> +
>> +#include "bpf-helper.h"
>> +
>> +int bpf_resolve_jumps(struct bpf_labels *labels,
>> + struct seccomp_filter_block *filter, size_t count)
>> +{
>> + struct seccomp_filter_block *begin = filter;
>> + __u8 insn = count - 1;
>> +
>> + if (count< 1)
>> + return -1;
>> + /*
>> + * Walk it once, backwards, to build the label table and do fixups.
>> + * Since backward jumps are disallowed by BPF, this is easy.
>> + */
>> + filter += insn;
>> + for (; filter>= begin; --insn, --filter) {
>> + if (filter->code != (BPF_JMP+BPF_JA))
>> + continue;
>> + switch ((filter->jt<<8)|filter->jf) {
>> + case (JUMP_JT<<8)|JUMP_JF:
>> + if (labels->labels[filter->k].location == 0xffffffff) {
>> + fprintf(stderr, "Unresolved label: '%s'\n",
>> + labels->labels[filter->k].label);
>> + return 1;
>> + }
>> + filter->k = labels->labels[filter->k].location -
>> + (insn + 1);
>> + filter->jt = 0;
>> + filter->jf = 0;
>> + continue;
>> + case (LABEL_JT<<8)|LABEL_JF:
>> + if (labels->labels[filter->k].location != 0xffffffff) {
>> + fprintf(stderr, "Duplicate label use: '%s'\n",
>> + labels->labels[filter->k].label);
>> + return 1;
>> + }
>> + labels->labels[filter->k].location = insn;
>> + filter->k = 0; /* fall through */
>> + filter->jt = 0;
>> + filter->jf = 0;
>> + continue;
>> + }
>> + }
>> + return 0;
>> +}
>> +
>> +/* Simple lookup table for labels. */
>> +__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label)
>> +{
>> + struct __bpf_label *begin = labels->labels, *end;
>> + int id;
>> + if (labels->count == 0) {
>> + begin->label = label;
>> + begin->location = 0xffffffff;
>> + labels->count++;
>> + return 0;
>> + }
>> + end = begin + labels->count;
>> + for (id = 0; begin< end; ++begin, ++id) {
>> + if (!strcmp(label, begin->label))
>> + return id;
>> + }
>> + begin->label = label;
>> + begin->location = 0xffffffff;
>> + labels->count++;
>> + return id;
>> +}
>> +
>> +void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count)
>> +{
>> + struct seccomp_filter_block *end = filter + count;
>> + for ( ; filter< end; ++filter)
>> + printf("{ code=%u,jt=%u,jf=%u,k=%u },\n",
>> + filter->code, filter->jt, filter->jf, filter->k);
>> +}
>> diff --git a/samples/seccomp/bpf-helper.h b/samples/seccomp/bpf-helper.h
>> new file mode 100644
>> index 0000000..92b94ec
>> --- /dev/null
>> +++ b/samples/seccomp/bpf-helper.h
>> @@ -0,0 +1,219 @@
>> +/*
>> + * Example wrapper around BPF macros.
>> + *
>> + * Copyright (c) 2012 The Chromium OS Authors<chromium-os-dev@chromium.org>
>> + * Author: Will Drewry<wad@chromium.org>
>> + *
>> + * The code may be used by anyone for any purpose,
>> + * and can serve as a starting point for developing
>> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
>> + *
>> + * No guarantees are provided with respect to the correctness
>> + * or functionality of this code.
>> + */
>> +#ifndef __BPF_HELPER_H__
>> +#define __BPF_HELPER_H__
>> +
>> +#include<asm/bitsperlong.h> /* for __BITS_PER_LONG */
>> +#include<linux/filter.h>
>> +#include<linux/seccomp_filter.h> /* for seccomp_filter_data.arg */
>> +#include<linux/types.h>
>> +#include<linux/unistd.h>
>> +#include<stddef.h>
>> +
>> +#define BPF_LABELS_MAX 256
>> +struct bpf_labels {
>> + int count;
>> + struct __bpf_label {
>> + const char *label;
>> + __u32 location;
>> + } labels[BPF_LABELS_MAX];
>> +};
>> +
>> +int bpf_resolve_jumps(struct bpf_labels *labels,
>> + struct seccomp_filter_block *filter, size_t count);
>> +__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label);
>> +void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count);
>> +
>> +#define JUMP_JT 0xff
>> +#define JUMP_JF 0xff
>> +#define LABEL_JT 0xfe
>> +#define LABEL_JF 0xfe
>> +
>> +#define ALLOW \
>> + BPF_STMT(BPF_RET+BPF_K, 0xFFFFFFFF)
>> +#define DENY \
>> + BPF_STMT(BPF_RET+BPF_K, 0)
>> +#define JUMP(labels, label) \
>> + BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
>> + JUMP_JT, JUMP_JF)
>> +#define LABEL(labels, label) \
>> + BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
>> + LABEL_JT, LABEL_JF)
>> +#define SYSCALL(nr, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (nr), 0, 1), \
>> + jt
>> +
>> +/* Lame, but just an example */
>> +#define FIND_LABEL(labels, label) seccomp_bpf_label((labels), #label)
>> +
>> +#define EXPAND(...) __VA_ARGS__
>> +/* Map all width-sensitive operations */
>> +#if __BITS_PER_LONG == 32
>> +
>> +#define JEQ(x, jt) JEQ32(x, EXPAND(jt))
>> +#define JNE(x, jt) JNE32(x, EXPAND(jt))
>> +#define JGT(x, jt) JGT32(x, EXPAND(jt))
>> +#define JLT(x, jt) JLT32(x, EXPAND(jt))
>> +#define JGE(x, jt) JGE32(x, EXPAND(jt))
>> +#define JLE(x, jt) JLE32(x, EXPAND(jt))
>> +#define JA(x, jt) JA32(x, EXPAND(jt))
>> +#define ARG(i) ARG_32(i)
>> +
>> +#elif __BITS_PER_LONG == 64
>> +
>> +#define JEQ(x, jt) \
>> + JEQ64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> + EXPAND(jt))
>> +#define JGT(x, jt) \
>> + JGT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> + EXPAND(jt))
>> +#define JGE(x, jt) \
>> + JGE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> + EXPAND(jt))
>> +#define JNE(x, jt) \
>> + JNE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> + EXPAND(jt))
>> +#define JLT(x, jt) \
>> + JLT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> + EXPAND(jt))
>> +#define JLE(x, jt) \
>> + JLE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> + EXPAND(jt))
>> +
>> +#define JA(x, jt) \
>> + JA64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> + ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> + EXPAND(jt))
>> +#define ARG(i) ARG_64(i)
>> +
>> +#else
>> +#error __BITS_PER_LONG value unusable.
>> +#endif
>> +
>> +/* Loads the arg into A */
>> +#define ARG_32(idx) \
>> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
>> + offsetof(struct seccomp_filter_data, args[(idx)].lo32))
>> +
>> +/* Loads hi into A and lo in X */
>> +#define ARG_64(idx) \
>> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
>> + offsetof(struct seccomp_filter_data, args[(idx)].lo32)), \
>> + BPF_STMT(BPF_ST, 0), /* lo -> M[0] */ \
>> + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
>> + offsetof(struct seccomp_filter_data, args[(idx)].hi32)), \
>> + BPF_STMT(BPF_ST, 1) /* hi -> M[1] */
>> +
>> +#define JEQ32(value, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 0, 1), \
>> + jt
>> +
>> +#define JNE32(value, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 1, 0), \
>> + jt
>> +
>> +/* Checks the lo, then swaps to check the hi. A=lo,X=hi */
>> +#define JEQ64(lo, hi, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
>> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 0, 2), \
>> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
>> + jt, \
>> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
>> +
>> +#define JNE64(lo, hi, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 5, 0), \
>> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 2, 0), \
>> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
>> + jt, \
>> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
>> +
>> +#define JA32(value, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (value), 0, 1), \
>> + jt
>> +
>> +#define JA64(lo, hi, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (hi), 3, 0), \
>> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
>> + BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (lo), 0, 2), \
>> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
>> + jt, \
>> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
>> +
>> +#define JGE32(value, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 0, 1), \
>> + jt
>> +
>> +#define JLT32(value, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 1, 0), \
>> + jt
>> +
>> +/* Shortcut checking if hi> arg.hi. */
>> +#define JGE64(lo, hi, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
>> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
>> + BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 0, 2), \
>> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
>> + jt, \
>> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
>> +
>> +#define JLT64(lo, hi, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (hi), 0, 4), \
>> + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
>> + BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
>> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
>> + BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
>> + jt, \
>> + BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
>> +
>> +#define JGT32(value, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
>> + jt
>> +
>> +#define JLE32(value, jt) \
>> + BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
>> + jt
>
>
> Should the true/false offsets be reversed here?
Looks that way :)
> Thanks for all the work on this. We're looking forward to using it with
> QEMU.
Definitely - thanks!
will
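[Editorial sketch, not part of the posted patch.] The JLE32 macro questioned
above has the same 0/1 offsets as JGT32, so it takes jt when A > value. Below
is a minimal sketch of the reversed form, assuming the stock BPF_JUMP/BPF_STMT
macros from <linux/filter.h>. The names JLE32_FIXED, run_fragment and
allow_if_le are hypothetical, and the tiny evaluator handles only the two
opcodes used here.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>

/*
 * "A <= value" is the negation of "A > value": when JGT is true we skip
 * over jt (offset 1), and when it is false we fall into jt (offset 0).
 */
#define JLE32_FIXED(value, jt) \
	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 1, 0), \
	jt

/* Toy evaluator: understands only JGT-with-constant and RET-with-constant. */
static __u32 run_fragment(const struct sock_filter *f, size_t n, __u32 a)
{
	size_t pc = 0;

	while (pc < n) {
		const struct sock_filter *insn = &f[pc++];

		switch (insn->code) {
		case BPF_JMP+BPF_JGT+BPF_K:
			pc += (a > insn->k) ? insn->jt : insn->jf;
			break;
		case BPF_RET+BPF_K:
			return insn->k;
		default:
			return 0;	/* unknown opcode: treat as deny */
		}
	}
	return 0;
}

/* Returns 0xFFFFFFFF (allow) when a <= limit, 0 (deny) otherwise. */
static __u32 allow_if_le(__u32 a, __u32 limit)
{
	struct sock_filter prog[] = {
		JLE32_FIXED(limit, BPF_STMT(BPF_RET+BPF_K, 0xFFFFFFFF)),
		BPF_STMT(BPF_RET+BPF_K, 0),
	};

	return run_fragment(prog, sizeof(prog) / sizeof(prog[0]), a);
}
```

With the offsets as posted (0, 1), the same fragment would allow only values
strictly greater than the limit, which is the JGT32 behaviour.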
* Re: [PATCH v6 2/3] seccomp_filters: system call filtering using BPF
2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
@ 2012-01-31 14:13 ` Eduardo Otubo
2012-01-31 15:20 ` Will Drewry
2012-02-02 15:32 ` Serge E. Hallyn
1 sibling, 1 reply; 13+ messages in thread
From: Eduardo Otubo @ 2012-01-31 14:13 UTC (permalink / raw)
To: Will Drewry
Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
olofj, mhalcrow, dlaor, corbet, alan, indan, mcgrathr
On Sat, Jan 28, 2012 at 04:11:54PM -0600, Will Drewry wrote:
> [This patch depends on luto@mit.edu's no_new_privs patch:
> https://lkml.org/lkml/2012/1/12/446
> ]
Will,
I know you clearly pointed to luto@mit.edu's first no_new_privs
patch, but I couldn't resist testing with the latest (3rd) version
of the patch [0], which defines PR_GET_NO_NEW_PRIVS as 37, as you can
see here [1]. The compilation then breaks here:
CC kernel/sys.o
kernel/sys.c: In function ‘sys_prctl’:
kernel/sys.c:1975: error: duplicate case value
kernel/sys.c:1904: error: previously used here
make[1]: *** [kernel/sys.o] Error 1
make: *** [kernel] Error 2
I just changed the value of PR_ATTACH_SECCOMP_FILTER to 38 and
everything went fine. Do you see any problem with changing this value?
Regards,
[0] - https://git.kernel.org/?p=linux/kernel/git/luto/linux.git;a=heads
[1] -
https://git.kernel.org/?p=linux/kernel/git/luto/linux.git;a=blobdiff;f=include/linux/prctl.h;h=a6b5ac9cfe560eeb277646fbe338ae2b14c46caf;hp=7ddc7f1b480fd41318d94c0a39c8e2ff80f9c5f8;hb=7102b0e278af50d27b5d61d1be5faaba1b0a091e;hpb=acb42a3b611d7ad4cb173c3b37674b549df2ffeb
--
Eduardo Otubo
Software Engineer
Linux Technology Center
IBM Systems & Technology Group
Mobile: +55 19 8135 0885
eotubo@linux.vnet.ibm.com
* Re: [PATCH v6 2/3] seccomp_filters: system call filtering using BPF
2012-01-31 14:13 ` Eduardo Otubo
@ 2012-01-31 15:20 ` Will Drewry
0 siblings, 0 replies; 13+ messages in thread
From: Will Drewry @ 2012-01-31 15:20 UTC (permalink / raw)
To: Eduardo Otubo
Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
olofj, mhalcrow, dlaor, corbet, alan, indan, mcgrathr
On Tue, Jan 31, 2012 at 7:13 AM, Eduardo Otubo <otubo@linux.vnet.ibm.com> wrote:
> On Sat, Jan 28, 2012 at 04:11:54PM -0600, Will Drewry wrote:
>> [This patch depends on luto@mit.edu's no_new_privs patch:
>> https://lkml.org/lkml/2012/1/12/446
>> ]
>
> Will,
>
> I know you clearly pointed to luto@mit.edu's first no_new_privs
> patch, but I couldn't resist testing with the latest (3rd) version
> of the patch [0], which defines PR_GET_NO_NEW_PRIVS as 37, as you can
> see here [1]. The compilation then breaks here:
>
> CC kernel/sys.o
> kernel/sys.c: In function ‘sys_prctl’:
> kernel/sys.c:1975: error: duplicate case value
> kernel/sys.c:1904: error: previously used here
> make[1]: *** [kernel/sys.o] Error 1
> make: *** [kernel] Error 2
>
> I just changed the value of PR_ATTACH_SECCOMP_FILTER to 38 and
> everything went fine. Do you see any problem with changing this value?
Should be fine -- in the next version, I won't be adding a new PR_
define at all. Feel free to change it to whatever compiles -- the
code only uses the define name for access. Sorry for the collision -
I posted the last rev without the latest from luto.
Cheers!
will
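[Editorial sketch, not part of the thread.] The build failure discussed above
is the usual "duplicate case value" error: two option constants with the same
number reach the same switch. A minimal illustration follows. All DEMO_*
names and demo_prctl() are hypothetical, and the numbers simply mirror the
37/38 values mentioned in the messages; callers only ever use the symbolic
names, which is why renumbering is harmless.

```c
#include <assert.h>

/* Hypothetical option numbers, for illustration only. */
#define DEMO_PR_GET_NO_NEW_PRIVS	37
#define DEMO_PR_ATTACH_SECCOMP_FILTER	38	/* moved from the colliding 37 */

/* Catch a future collision with a readable message instead of a case error. */
_Static_assert(DEMO_PR_GET_NO_NEW_PRIVS != DEMO_PR_ATTACH_SECCOMP_FILTER,
	       "prctl option numbers must be unique");

/* Stand-in for the option dispatch in a prctl-style handler. */
static int demo_prctl(int option)
{
	switch (option) {
	case DEMO_PR_GET_NO_NEW_PRIVS:
		return 1;
	case DEMO_PR_ATTACH_SECCOMP_FILTER:
		return 2;
	default:
		return -1;	/* unknown option */
	}
}
```

If both defines were 37, the compiler would reject the second case label,
much like the kernel/sys.c output quoted above.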
> Regards,
>
> [0] - https://git.kernel.org/?p=linux/kernel/git/luto/linux.git;a=heads
> [1] -
> https://git.kernel.org/?p=linux/kernel/git/luto/linux.git;a=blobdiff;f=include/linux/prctl.h;h=a6b5ac9cfe560eeb277646fbe338ae2b14c46caf;hp=7ddc7f1b480fd41318d94c0a39c8e2ff80f9c5f8;hb=7102b0e278af50d27b5d61d1be5faaba1b0a091e;hpb=acb42a3b611d7ad4cb173c3b37674b549df2ffeb
>
> --
> Eduardo Otubo
> Software Engineer
> Linux Technology Center
> IBM Systems & Technology Group
> Mobile: +55 19 8135 0885
> eotubo@linux.vnet.ibm.com
>
* Re: [PATCH v6 1/3] seccomp: kill the seccomp_t typedef
2012-01-28 22:11 [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Will Drewry
2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
2012-01-28 22:11 ` [PATCH v6 3/3] Documentation: prctl/seccomp_filter Will Drewry
@ 2012-02-02 15:29 ` Serge E. Hallyn
2012-02-03 23:16 ` Will Drewry
2 siblings, 1 reply; 13+ messages in thread
From: Serge E. Hallyn @ 2012-02-02 15:29 UTC (permalink / raw)
To: Will Drewry
Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
corbet, alan, indan, mcgrathr
Quoting Will Drewry (wad@chromium.org):
> Replaces the seccomp_t typedef with seccomp_struct to match modern
> kernel style.
(sorry, I'm a bit behind on list)
You were going to switch this to 'struct seccomp' right?
> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
> include/linux/sched.h | 2 +-
> include/linux/seccomp.h | 10 ++++++----
> 2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4032ec1..288b5cb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1418,7 +1418,7 @@ struct task_struct {
> uid_t loginuid;
> unsigned int sessionid;
> #endif
> - seccomp_t seccomp;
> + struct seccomp_struct seccomp;
>
> /* Thread group tracking */
> u32 parent_exec_id;
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index cc7a4e9..171ab66 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -7,7 +7,9 @@
> #include <linux/thread_info.h>
> #include <asm/seccomp.h>
>
> -typedef struct { int mode; } seccomp_t;
> +struct seccomp_struct {
> + int mode;
> +};
>
> extern void __secure_computing(int);
> static inline void secure_computing(int this_syscall)
> @@ -19,7 +21,7 @@ static inline void secure_computing(int this_syscall)
> extern long prctl_get_seccomp(void);
> extern long prctl_set_seccomp(unsigned long);
>
> -static inline int seccomp_mode(seccomp_t *s)
> +static inline int seccomp_mode(struct seccomp_struct *s)
> {
> return s->mode;
> }
> @@ -28,7 +30,7 @@ static inline int seccomp_mode(seccomp_t *s)
>
> #include <linux/errno.h>
>
> -typedef struct { } seccomp_t;
> +struct seccomp_struct { };
>
> #define secure_computing(x) do { } while (0)
>
> @@ -42,7 +44,7 @@ static inline long prctl_set_seccomp(unsigned long arg2)
> return -EINVAL;
> }
>
> -static inline int seccomp_mode(seccomp_t *s)
> +static inline int seccomp_mode(struct seccomp_struct *s)
> {
> return 0;
> }
> --
> 1.7.5.4
>
* Re: [PATCH v6 2/3] seccomp_filters: system call filtering using BPF
2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
2012-01-31 14:13 ` Eduardo Otubo
@ 2012-02-02 15:32 ` Serge E. Hallyn
2012-02-03 23:14 ` Will Drewry
1 sibling, 1 reply; 13+ messages in thread
From: Serge E. Hallyn @ 2012-02-02 15:32 UTC (permalink / raw)
To: Will Drewry
Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
corbet, alan, indan, mcgrathr
Quoting Will Drewry (wad@chromium.org):
> [This patch depends on luto@mit.edu's no_new_privs patch:
> https://lkml.org/lkml/2012/1/12/446
> ]
>
> This patch adds support for seccomp mode 2. This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task. The policy is expressed in terms of a Berkeley
> Packet Filter program, as is used for userland-exposed socket filtering.
> Instead of network data, the BPF program is evaluated over struct
> seccomp_filter_data at the time of the system call.
>
> A filter program may be installed by a userland task by calling
> prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
> where fprog is of type struct sock_fprog.
>
> If the first filter program allows subsequent prctl(2) calls, then
> additional filter programs may be attached. All attached programs
> must be evaluated before a system call will be allowed to proceed.
>
> To avoid CONFIG_COMPAT related landmines, once a filter program is
> installed using specific is_compat_task() value, it is not allowed to
> make system calls using the alternate entry point.
>
> Filter programs will be inherited across fork/clone and execve, however
> the installation of filters must be preceded by setting 'no_new_privs'
> to ensure that unprivileged tasks cannot attach filters that affect
> privileged tasks (e.g., setuid binary). Tasks with CAP_SYS_ADMIN
> in their namespace may install inheritable filters without setting
> the no_new_privs bit.
>
> There are a number of benefits to this approach. A few of which are
> as follows:
> - BPF has been exposed to userland for a long time.
> - Userland already knows its ABI: system call numbers and desired
> arguments
> - No time-of-check-time-of-use vulnerable data accesses are possible.
> - system call arguments are loaded on demand only to minimize copying
> required for system call number-only policy decisions.
>
> This patch includes its own BPF evaluator, but relies on the
> net/core/filter.c BPF checking code. It is possible to share
> evaluators, but the performance sensitive nature of the network
> filtering path makes it an iterative optimization which (I think :) can
> be tackled separately via separate patchsets. (And at some point sharing
> BPF JIT code!)
>
> v6: - fix memory leak on attach compat check failure
> - require no_new_privs || CAP_SYS_ADMIN prior to filter
> installation. (luto@mit.edu)
> - s/seccomp_struct_/seccomp_/ for macros/functions
> (amwang@redhat.com)
> - cleaned up Kconfig (amwang@redhat.com)
> - on block, note if the call was compat (so the # means something)
> v5: - uses syscall_get_arguments
> (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
> - uses union-based arg storage with hi/lo struct to
> handle endianness. Compromises between the two alternate
> proposals to minimize extra arg shuffling and account for
> endianness assuming userspace uses offsetof().
> (mcgrathr@chromium.org, indan@nul.nu)
> - update Kconfig description
> - add include/seccomp_filter.h and add its installation
> - (naive) on-demand syscall argument loading
> - drop seccomp_t (eparis@redhat.com)
> v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
> - now uses current->no_new_privs
> (luto@mit.edu,torvalds@linux-foundation.com)
> - assign names to seccomp modes (rdunlap@xenotime.net)
> - fix style issues (rdunlap@xenotime.net)
> - reworded Kconfig entry (rdunlap@xenotime.net)
> v3: - macros to inline (oleg@redhat.com)
> - init_task behavior fixed (oleg@redhat.com)
> - drop creator entry and extra NULL check (oleg@redhat.com)
> - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
> - adds tentative use of "always_unprivileged" as per
> torvalds@linux-foundation.org and luto@mit.edu
> v2: - (patch 2 only)
>
> Signed-off-by: Will Drewry <wad@chromium.org>
Hi Will,
as far as I can tell based on changelog I suspect you could have
kept my Acked-by (from v3?). However, I'll wait until your next
submission (as I see there were a few change requests), and do a
final complete new review of that.
Thanks for continuing to push on this.
-serge
* Re: [PATCH v6 2/3] seccomp_filters: system call filtering using BPF
2012-02-02 15:32 ` Serge E. Hallyn
@ 2012-02-03 23:14 ` Will Drewry
0 siblings, 0 replies; 13+ messages in thread
From: Will Drewry @ 2012-02-03 23:14 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
corbet, alan, indan, mcgrathr
On Thu, Feb 2, 2012 at 7:32 AM, Serge E. Hallyn
<serge.hallyn@canonical.com> wrote:
> Quoting Will Drewry (wad@chromium.org):
>> [This patch depends on luto@mit.edu's no_new_privs patch:
>> https://lkml.org/lkml/2012/1/12/446
>> ]
>>
>> This patch adds support for seccomp mode 2. This mode enables dynamic
>> enforcement of system call filtering policy in the kernel as specified
>> by a userland task. The policy is expressed in terms of a Berkeley
>> Packet Filter program, as is used for userland-exposed socket filtering.
>> Instead of network data, the BPF program is evaluated over struct
>> seccomp_filter_data at the time of the system call.
>>
>> A filter program may be installed by a userland task by calling
>> prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
>> where fprog is of type struct sock_fprog.
>>
>> If the first filter program allows subsequent prctl(2) calls, then
>> additional filter programs may be attached. All attached programs
>> must be evaluated before a system call will be allowed to proceed.
>>
>> To avoid CONFIG_COMPAT related landmines, once a filter program is
>> installed using specific is_compat_task() value, it is not allowed to
>> make system calls using the alternate entry point.
>>
>> Filter programs will be inherited across fork/clone and execve, however
>> the installation of filters must be preceded by setting 'no_new_privs'
>> to ensure that unprivileged tasks cannot attach filters that affect
>> privileged tasks (e.g., setuid binary). Tasks with CAP_SYS_ADMIN
>> in their namespace may install inheritable filters without setting
>> the no_new_privs bit.
>>
>> There are a number of benefits to this approach. A few of which are
>> as follows:
>> - BPF has been exposed to userland for a long time.
>> - Userland already knows its ABI: system call numbers and desired
>> arguments
>> - No time-of-check-time-of-use vulnerable data accesses are possible.
>> - system call arguments are loaded on demand only to minimize copying
>> required for system call number-only policy decisions.
>>
>> This patch includes its own BPF evaluator, but relies on the
>> net/core/filter.c BPF checking code. It is possible to share
>> evaluators, but the performance sensitive nature of the network
>> filtering path makes it an iterative optimization which (I think :) can
>> be tackled separately via separate patchsets. (And at some point sharing
>> BPF JIT code!)
>>
>> v6: - fix memory leak on attach compat check failure
>> - require no_new_privs || CAP_SYS_ADMIN prior to filter
>> installation. (luto@mit.edu)
>> - s/seccomp_struct_/seccomp_/ for macros/functions
>> (amwang@redhat.com)
>> - cleaned up Kconfig (amwang@redhat.com)
>> - on block, note if the call was compat (so the # means something)
>> v5: - uses syscall_get_arguments
>> (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
>> - uses union-based arg storage with hi/lo struct to
>> handle endianness. Compromises between the two alternate
>> proposals to minimize extra arg shuffling and account for
>> endianness assuming userspace uses offsetof().
>> (mcgrathr@chromium.org, indan@nul.nu)
>> - update Kconfig description
>> - add include/seccomp_filter.h and add its installation
>> - (naive) on-demand syscall argument loading
>> - drop seccomp_t (eparis@redhat.com)
>> v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
>> - now uses current->no_new_privs
>> (luto@mit.edu,torvalds@linux-foundation.com)
>> - assign names to seccomp modes (rdunlap@xenotime.net)
>> - fix style issues (rdunlap@xenotime.net)
>> - reworded Kconfig entry (rdunlap@xenotime.net)
>> v3: - macros to inline (oleg@redhat.com)
>> - init_task behavior fixed (oleg@redhat.com)
>> - drop creator entry and extra NULL check (oleg@redhat.com)
>> - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
>> - adds tentative use of "always_unprivileged" as per
>> torvalds@linux-foundation.org and luto@mit.edu
>> v2: - (patch 2 only)
>>
>> Signed-off-by: Will Drewry <wad@chromium.org>
>
> Hi Will,
>
> as far as I can tell based on changelog I suspect you could have
> kept my Acked-by (from v3?). However, I'll wait until your next
> submission (as I see there were a few change requests), and do a
> final complete new review of that.
Thanks, Serge! I just failed at the proper protocol and didn't mean
to not include your Acked-by. However, I am changing a fair amount
of the internals this time around, so I'll be happy to have another
full review.
> Thanks for continuing to push on this.
Definitely! I've been traveling this week, so it's been a bit slow
going, but I hope to have the next rev up early next week if not
sooner.
Cheers!
will
* Re: [PATCH v6 1/3] seccomp: kill the seccomp_t typedef
2012-02-02 15:29 ` [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Serge E. Hallyn
@ 2012-02-03 23:16 ` Will Drewry
2012-02-04 1:05 ` Linus Torvalds
0 siblings, 1 reply; 13+ messages in thread
From: Will Drewry @ 2012-02-03 23:16 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
corbet, alan, indan, mcgrathr
On Thu, Feb 2, 2012 at 7:29 AM, Serge E. Hallyn
<serge.hallyn@canonical.com> wrote:
> Quoting Will Drewry (wad@chromium.org):
>> Replaces the seccomp_t typedef with seccomp_struct to match modern
>> kernel style.
>
> (sorry, I'm a bit behind on list)
>
> You were going to switch this to 'struct seccomp' right?
I wasn't sure if
task_struct {
...
struct seccomp seccomp;
}
was as ideal. I've noticed that almost all of the duplicate names in
the task struct use redundancy to differentiate the naming, but I'm
happy enough to rename if appropriate.
>> Signed-off-by: Will Drewry <wad@chromium.org>
>> ---
>> include/linux/sched.h | 2 +-
>> include/linux/seccomp.h | 10 ++++++----
>> 2 files changed, 7 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 4032ec1..288b5cb 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1418,7 +1418,7 @@ struct task_struct {
>> uid_t loginuid;
>> unsigned int sessionid;
>> #endif
>> - seccomp_t seccomp;
>> + struct seccomp_struct seccomp;
>>
>> /* Thread group tracking */
>> u32 parent_exec_id;
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index cc7a4e9..171ab66 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -7,7 +7,9 @@
>> #include <linux/thread_info.h>
>> #include <asm/seccomp.h>
>>
>> -typedef struct { int mode; } seccomp_t;
>> +struct seccomp_struct {
>> + int mode;
>> +};
>>
>> extern void __secure_computing(int);
>> static inline void secure_computing(int this_syscall)
>> @@ -19,7 +21,7 @@ static inline void secure_computing(int this_syscall)
>> extern long prctl_get_seccomp(void);
>> extern long prctl_set_seccomp(unsigned long);
>>
>> -static inline int seccomp_mode(seccomp_t *s)
>> +static inline int seccomp_mode(struct seccomp_struct *s)
>> {
>> return s->mode;
>> }
>> @@ -28,7 +30,7 @@ static inline int seccomp_mode(seccomp_t *s)
>>
>> #include <linux/errno.h>
>>
>> -typedef struct { } seccomp_t;
>> +struct seccomp_struct { };
>>
>> #define secure_computing(x) do { } while (0)
>>
>> @@ -42,7 +44,7 @@ static inline long prctl_set_seccomp(unsigned long arg2)
>> return -EINVAL;
>> }
>>
>> -static inline int seccomp_mode(seccomp_t *s)
>> +static inline int seccomp_mode(struct seccomp_struct *s)
>> {
>> return 0;
>> }
>> --
>> 1.7.5.4
>>
* Re: [PATCH v6 1/3] seccomp: kill the seccomp_t typedef
2012-02-03 23:16 ` Will Drewry
@ 2012-02-04 1:05 ` Linus Torvalds
2012-02-06 16:13 ` Will Drewry
0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2012-02-04 1:05 UTC (permalink / raw)
To: Will Drewry
Cc: Serge E. Hallyn, linux-kernel, keescook, john.johansen, coreyb,
pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
corbet, alan, indan, mcgrathr
On Fri, Feb 3, 2012 at 3:16 PM, Will Drewry <wad@chromium.org> wrote:
>
> task_struct {
> ...
> struct seccomp seccomp;
> }
>
> was as ideal. I've noticed that almost all of the duplicate names in
> the task struct use redundancy to differentiate the naming, but I'm
> happy enough to rename if appropriate.
The redundant "struct xyz_struct" naming is traditional, but we try to
avoid it these days. The reason for it is that I long long ago was a
bit confused about the C namespace rules, so for the longest time I
made struct names unique for no really good reason. The struct/union
namespace is separate from the other namespaces, so trying to make
things unique really has no good reason.
And obviously "struct task_struct" is one of those very old things,
and then the "struct xyz_struct" naming kind of spread from there.
I think "struct seccomp" is fine, and even if "struct x x" looks a bit
odd, it's at least _less_ repetition than "struct x_struct x" which is
just really repetitive.
That said, just to make "grep" easier, please do the whole "struct
xyz" always together, and always with just a single space in between
them, so that
git grep "struct xyz"
does the right thing. And for the same reason, when declaring a
struct, people should always use "struct xyz {", with that exact
spacing. The exact details of spacing obviously has no semantic
meaning, but making it easy to grep for use and for definition is
really convenient.
Linus
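[Editorial sketch, not part of the message.] The namespace point above can be
checked with a few lines of C: struct tags live in their own namespace, so a
tag, a member and a local variable may all be called "seccomp" without
clashing. The names struct task, demo_mode() and the mode value 2 are
illustrative assumptions, not taken from the patch.

```c
#include <assert.h>

struct seccomp {
	int mode;
};

struct task {
	struct seccomp seccomp;	/* tag "seccomp" and member "seccomp" coexist */
};

static int seccomp_mode(struct seccomp *s)
{
	return s->mode;
}

static int demo_mode(void)
{
	/* A variable may share the tag's name too; "struct" disambiguates. */
	struct seccomp seccomp = { .mode = 2 };
	struct task t = { .seccomp = seccomp };

	return seccomp_mode(&t.seccomp);
}
```

Writing the declaration as "struct seccomp {" and each use as
"struct seccomp" with a single space keeps both findable with
git grep "struct seccomp".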
* Re: [PATCH v6 1/3] seccomp: kill the seccomp_t typedef
2012-02-04 1:05 ` Linus Torvalds
@ 2012-02-06 16:13 ` Will Drewry
0 siblings, 0 replies; 13+ messages in thread
From: Will Drewry @ 2012-02-06 16:13 UTC (permalink / raw)
To: Linus Torvalds
Cc: Serge E. Hallyn, linux-kernel, keescook, john.johansen, coreyb,
pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
corbet, alan, indan, mcgrathr
On Fri, Feb 3, 2012 at 7:05 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Feb 3, 2012 at 3:16 PM, Will Drewry <wad@chromium.org> wrote:
>>
>> task_struct {
>> ...
>> struct seccomp seccomp;
>> }
>>
>> was as ideal. I've noticed that almost all of the duplicate names in
>> the task struct use redundancy to differentiate the naming, but I'm
>> happy enough to rename if appropriate.
>
> The redundant "struct xyz_struct" naming is traditional, but we try to
> avoid it these days. The reason for it is that I long long ago was a
> bit confused about the C namespace rules, so for the longest time I
> made struct names unique for no really good reason. The struct/union
> namespace is separate from the other namespaces, so trying to make
> things unique really has no good reason.
>
> And obviously "struct task_struct" is one of those very old things,
> and then the "struct xyz_struct" naming kind of spread from there.
>
> I think "struct seccomp" is fine, and even if "struct x x" looks a bit
> odd, it's at least _less_ repetition than "struct x_struct x" which is
> just really repetitive.
>
> That said, just to make "grep" easier, please do the whole "struct
> xyz" always together, and always with just a single space in between
> them, so that
>
> git grep "struct xyz"
>
> does the right thing. And for the same reason, when declaring a
> struct, people should always use "struct xyz {", with that exact
> spacing. The exact details of spacing obviously have no semantic
> meaning, but making it easy to grep for use and for definition is
> really convenient.
Thanks for the background and explanation!
will
Thread overview: 13+ messages
2012-01-28 22:11 [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Will Drewry
2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
2012-01-31 14:13 ` Eduardo Otubo
2012-01-31 15:20 ` Will Drewry
2012-02-02 15:32 ` Serge E. Hallyn
2012-02-03 23:14 ` Will Drewry
2012-01-28 22:11 ` [PATCH v6 3/3] Documentation: prctl/seccomp_filter Will Drewry
2012-01-30 22:47 ` Corey Bryant
2012-01-30 22:52 ` Will Drewry
2012-02-02 15:29 ` [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Serge E. Hallyn
2012-02-03 23:16 ` Will Drewry
2012-02-04 1:05 ` Linus Torvalds
2012-02-06 16:13 ` Will Drewry