linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 1/3] seccomp: kill the seccomp_t typedef
@ 2012-01-27 23:24 Will Drewry
  2012-01-27 23:24 ` [PATCH v5 2/3] seccomp_filters: system call filtering using BPF Will Drewry
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Will Drewry @ 2012-01-27 23:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

Replaces the seccomp_t typedef with seccomp_struct to match modern
kernel style.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 include/linux/sched.h   |    2 +-
 include/linux/seccomp.h |   10 ++++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4032ec1..288b5cb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1418,7 +1418,7 @@ struct task_struct {
 	uid_t loginuid;
 	unsigned int sessionid;
 #endif
-	seccomp_t seccomp;
+	struct seccomp_struct seccomp;
 
 /* Thread group tracking */
    	u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..171ab66 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -7,7 +7,9 @@
 #include <linux/thread_info.h>
 #include <asm/seccomp.h>
 
-typedef struct { int mode; } seccomp_t;
+struct seccomp_struct {
+	int mode;
+};
 
 extern void __secure_computing(int);
 static inline void secure_computing(int this_syscall)
@@ -19,7 +21,7 @@ static inline void secure_computing(int this_syscall)
 extern long prctl_get_seccomp(void);
 extern long prctl_set_seccomp(unsigned long);
 
-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp_struct *s)
 {
 	return s->mode;
 }
@@ -28,7 +30,7 @@ static inline int seccomp_mode(seccomp_t *s)
 
 #include <linux/errno.h>
 
-typedef struct { } seccomp_t;
+struct seccomp_struct { };
 
 #define secure_computing(x) do { } while (0)
 
@@ -42,7 +44,7 @@ static inline long prctl_set_seccomp(unsigned long arg2)
 	return -EINVAL;
 }
 
-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp_struct *s)
 {
 	return 0;
 }
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-27 23:24 [PATCH v5 1/3] seccomp: kill the seccomp_t typedef Will Drewry
@ 2012-01-27 23:24 ` Will Drewry
  2012-01-28 11:21   ` Cong Wang
                     ` (2 more replies)
  2012-01-27 23:24 ` [PATCH v5 3/3] Documentation: prctl/seccomp_filter Will Drewry
  2012-01-27 23:41 ` [PATCH v5 1/3] seccomp: kill the seccomp_t typedef Greg KH
  2 siblings, 3 replies; 26+ messages in thread
From: Will Drewry @ 2012-01-27 23:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

[This patch depends on luto@mit.edu's no_new_privs patch:
 https://lkml.org/lkml/2012/1/12/446
]

This patch adds support for seccomp mode 2.  This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task.  The policy is expressed in terms of a Berkeley
Packet Filter program, as is used for userland-exposed socket filtering.
Instead of network data, the BPF program is evaluated over struct
seccomp_filter_data at the time of the system call.

A filter program may be installed by a userland task by calling
  prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.

If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached.  All attached programs
must be evaluated before a system call will be allowed to proceed.

To avoid CONFIG_COMPAT related landmines, once a filter program is
installed using specific is_compat_task() value, it is not allowed to
make system calls using the alternate entry point.

Filter programs will be inherited across fork/clone and execve.
However, if the task attaching the filter is unprivileged
(!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task.  This
ensures that unprivileged tasks cannot attach filters that affect
privileged tasks (e.g., setuid binary).

There are a number of benefits to this approach. A few of which are
as follows:
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: system call numbers and desired
  arguments
- No time-of-check-time-of-use vulnerable data accesses are possible.
- system call arguments are loaded on demand only to minimize copying
  required for system call number-only policy decisions.

This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code.  It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)

 v5: - uses syscall_get_arguments
       (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
     - uses union-based arg storage with hi/lo struct to
       handle endianness.  Compromises between the two alternate
       proposals to minimize extra arg shuffling and account for
       endianness assuming userspace uses offsetof().
       (mcgrathr@chromium.org, indan@nul.nu)
     - update Kconfig description
     - add include/seccomp_filter.h and add its installation
     - (naive) on-demand syscall argument loading
     - drop seccomp_t (eparis@redhat.com)
     - adds proper compat prctl call copying
 v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
     - now uses current->no_new_privs
         (luto@mit.edu,torvalds@linux-foundation.com)
     - assign names to seccomp modes (rdunlap@xenotime.net)
     - fix style issues (rdunlap@xenotime.net)
     - reworded Kconfig entry (rdunlap@xenotime.net)
 v3: - macros to inline (oleg@redhat.com)
     - init_task behavior fixed (oleg@redhat.com)
     - drop creator entry and extra NULL check (oleg@redhat.com)
     - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
     - adds tentative use of "always_unprivileged" as per
       torvalds@linux-foundation.org and luto@mit.edu
 v2: - (patch 2 only)

Signed-off-by: Will Drewry <wad@chromium.org>
---
 include/linux/Kbuild           |    1 +
 include/linux/prctl.h          |    3 +
 include/linux/seccomp.h        |   63 ++++
 include/linux/seccomp_filter.h |   79 +++++
 kernel/Makefile                |    1 +
 kernel/fork.c                  |    4 +
 kernel/seccomp.c               |   10 +-
 kernel/seccomp_filter.c        |  620 ++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c                   |    4 +
 security/Kconfig               |   20 ++
 10 files changed, 804 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/seccomp_filter.h
 create mode 100644 kernel/seccomp_filter.c

diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index c94e717..5659454 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -330,6 +330,7 @@ header-y += scc.h
 header-y += sched.h
 header-y += screen_info.h
 header-y += sdla.h
+header-y += seccomp_filter.h
 header-y += securebits.h
 header-y += selinux_netlink.h
 header-y += sem.h
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index 7ddc7f1..b8c4beb 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -114,4 +114,7 @@
 # define PR_SET_MM_START_BRK		6
 # define PR_SET_MM_BRK			7
 
+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER	37
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 171ab66..3992bb6 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,10 +5,29 @@
 #ifdef CONFIG_SECCOMP
 
 #include <linux/thread_info.h>
+#include <linux/types.h>
 #include <asm/seccomp.h>
 
+/* Valid values of seccomp_struct.mode */
+#define SECCOMP_MODE_DISABLED	0 /* seccomp is not in use. */
+#define SECCOMP_MODE_STRICT	1 /* uses hard-coded seccomp.c rules. */
+#define SECCOMP_MODE_FILTER	2 /* system call access determined by filter. */
+
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode:  indicates one of the valid values above for controlled
+ *         system calls available to a process.
+ * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
+ *          @filter must only be accessed from the context of current as there
+ *          is no guard.
+ */
 struct seccomp_struct {
 	int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+	struct seccomp_filter *filter;
+#endif
 };
 
 extern void __secure_computing(int);
@@ -51,4 +70,48 @@ static inline int seccomp_mode(struct seccomp_struct *s)
 
 #endif /* CONFIG_SECCOMP */
 
+#ifdef CONFIG_SECCOMP_FILTER
+
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_struct_fork(struct seccomp_struct *child,
+				const struct seccomp_struct *parent);
+
+static inline void seccomp_struct_init_task(struct seccomp_struct *seccomp)
+{
+	seccomp->mode = SECCOMP_MODE_DISABLED;
+	seccomp->filter = NULL;
+}
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+static inline void seccomp_struct_free_task(struct seccomp_struct *seccomp)
+{
+	put_seccomp_filter(seccomp->filter);
+	seccomp->filter = NULL;
+}
+
+#else  /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+/* Macros consume the unused dereference by the caller. */
+#define seccomp_struct_init_task(_seccomp) do { } while (0);
+#define seccomp_struct_fork(_tsk, _orig) do { } while (0);
+#define seccomp_struct_free_task(_seccomp) do { } while (0);
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+	return -ENOSYS;
+}
+
+#endif  /* CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/linux/seccomp_filter.h b/include/linux/seccomp_filter.h
new file mode 100644
index 0000000..3ecd641
--- /dev/null
+++ b/include/linux/seccomp_filter.h
@@ -0,0 +1,79 @@
+/*
+ * Secomp-based system call filtering data structures and definitions.
+ *
+ * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ *
+ * This copyrighted material is made available to anyone wishing to use,
+ * modify, copy, or redistribute it subject to the terms and conditions
+ * of the GNU General Public License v.2.
+ *
+ */
+
+#ifndef __LINUX_SECCOMP_FILTER_H__
+#define __LINUX_SECCOMP_FILTER_H__
+
+#include <asm/byteorder.h>
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+/*
+ *	Keep the contents of this file similar to linux/filter.h:
+ *	  struct sock_filter and sock_fprog and versions.
+ *	Custom naming exists solely if divergence is ever needed.
+ */
+
+/*
+ * Current version of the filter code architecture.
+ */
+#define SECCOMP_BPF_MAJOR_VERSION 1
+#define SECCOMP_BPF_MINOR_VERSION 1
+
+struct seccomp_filter_block {	/* Filter block */
+	__u16	code;   /* Actual filter code */
+	__u8	jt;	/* Jump true */
+	__u8	jf;	/* Jump false */
+	__u32	k;      /* Generic multiuse field */
+};
+
+struct seccomp_fprog {	/* Required for SO_ATTACH_FILTER. */
+	unsigned short		len;	/* Number of filter blocks */
+	struct seccomp_filter_block __user *filter;
+};
+
+/* Ensure the u32 ordering is consistent with platform byte order. */
+#if defined(__LITTLE_ENDIAN)
+#define SECCOMP_ENDIAN_SWAP(x, y) x, y
+#elif defined(__BIG_ENDIAN)
+#define SECCOMP_ENDIAN_SWAP(x, y) y, x
+#else
+#error edit for your odd arch byteorder.
+#endif
+
+/* System call argument layout for the filter data. */
+union seccomp_filter_arg {
+	struct {
+		__u32 SECCOMP_ENDIAN_SWAP(lo32, hi32);
+	};
+	__u64 u64;
+};
+
+/*
+ *	Expected data the BPF program will execute over.
+ *	Endianness will be arch specific, but the values will be
+ *	swapped, as above, to allow for consistent BPF programs.
+ */
+struct seccomp_filter_data {
+	int syscall_nr;
+	__u32 __reserved;
+	union seccomp_filter_arg args[6];
+};
+
+#undef SECCOMP_ENDIAN_SWAP
+
+/*
+ * Defined valid return values for the BPF program.
+ */
+#define SECCOMP_BPF_ALLOW	0xFFFFFFFF
+#define SECCOMP_BPF_DENY	0
+
+#endif /* __LINUX_SECCOMP_FILTER_H__ */
diff --git a/kernel/Makefile b/kernel/Makefile
index 2d9de86..fd81bac 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -78,6 +78,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TREE_RCU) += rcutree.o
 obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index 051f090..f312edb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
 #include <linux/cgroup.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
+#include <linux/seccomp.h>
 #include <linux/swap.h>
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
+	seccomp_struct_free_task(&tsk->seccomp);
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -1093,6 +1095,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto fork_out;
 
 	ftrace_graph_init_task(p);
+	seccomp_struct_init_task(&p->seccomp);
 
 	rt_mutex_init_task(p);
 
@@ -1376,6 +1379,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_change_end(current);
 	perf_event_fork(p);
+	seccomp_struct_fork(&p->seccomp, &current->seccomp);
 
 	trace_task_newtask(p, clone_flags);
 
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index e8d76c5..a045dd4 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -37,7 +37,7 @@ void __secure_computing(int this_syscall)
 	int * syscall;
 
 	switch (mode) {
-	case 1:
+	case SECCOMP_MODE_STRICT:
 		syscall = mode1_syscalls;
 #ifdef CONFIG_COMPAT
 		if (is_compat_task())
@@ -48,6 +48,14 @@ void __secure_computing(int this_syscall)
 				return;
 		} while (*++syscall);
 		break;
+#ifdef CONFIG_SECCOMP_FILTER
+	case SECCOMP_MODE_FILTER:
+		if (seccomp_test_filters(this_syscall) == 0)
+			return;
+
+		seccomp_filter_log_failure(this_syscall);
+		break;
+#endif
 	default:
 		BUG();
 	}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..e57219e
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,620 @@
+/*
+ * linux/kernel/seccomp_filter.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ *
+ * Extends linux/kernel/seccomp.c to allow tasks to install system call
+ * filters using a Berkeley Packet Filter program which is executed over
+ * struct seccomp_filter_data.
+ */
+
+#include <asm/syscall.h>
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/seccomp_filter.h>
+#include <linux/security.h>
+#include <linux/seccomp.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ *         get/put helpers should be used when accessing an instance
+ *         outside of a lifetime-guarded section.  In general, this
+ *         is only needed for handling filters shared across tasks.
+ * @parent: pointer to the ancestor which this filter will be composed with.
+ * @insns: the BPF program instructions to evaluate
+ * @count: the number of instructions in the program.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+	struct kref usage;
+	struct seccomp_filter *parent;
+	struct {
+		uint32_t compat:1;
+	} flags;
+	unsigned short count;  /* Instruction count */
+	struct sock_filter insns[0];
+};
+
+/*
+ * struct seccomp_filter_metadata - BPF data wrapper
+ * @data: data accessible to the BPF program.
+ * @has_args: indicates that the args have been lazily populated.
+ *
+ * Used by seccomp_load_pointer.
+ */
+struct seccomp_filter_metadata {
+	struct seccomp_filter_data data;
+	bool has_args;
+};
+
+static unsigned int seccomp_run_filter(void *, uint32_t,
+				       const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ * @padding: size of the insns[0] array in bytes
+ *
+ * The @padding should be a multiple of
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+	struct seccomp_filter *f;
+	unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+	/* Drop oversized requests. */
+	if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+		return ERR_PTR(-EINVAL);
+
+	/* Padding should always be in sock_filter increments. */
+	if (padding % sizeof(struct sock_filter))
+		return ERR_PTR(-EINVAL);
+
+	f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+	if (!f)
+		return ERR_PTR(-ENOMEM);
+	kref_init(&f->usage);
+	f->count = bpf_blocks;
+	return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ * @filter: NULL or live object to be completely destructed.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+	if (!filter)
+		return;
+	put_seccomp_filter(filter->parent);
+	kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+	struct seccomp_filter *orig =
+		container_of(kref, struct seccomp_filter, usage);
+	seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+	pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
+		current->comm, task_pid_nr(current), syscall,
+		KSTK_EIP(current));
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+	if (!orig)
+		return;
+	kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+	if (!orig)
+		return NULL;
+	kref_get(&orig->usage);
+	return orig;
+}
+
+#if BITS_PER_LONG == 32
+static inline unsigned long *seccomp_filter_data_arg(
+				struct seccomp_filter_data *data, int index)
+{
+	/* Avoid inconsistent hi contents. */
+	data->args[index].hi32 = 0;
+	return (unsigned long *) &(data->args[index].lo32);
+}
+#elif BITS_PER_LONG == 64
+static inline unsigned long *seccomp_filter_data_arg(
+				struct seccomp_filter_data *data, int index)
+{
+	return (unsigned long *) &(data->args[index].u64);
+}
+#else
+#error Unknown BITS_PER_LONG.
+#endif
+
+/**
+ * seccomp_load_pointer: checks and returns a pointer to the requested offset
+ * @buf: u8 array to index into
+ * @buflen: length of the @buf array
+ * @offset: offset to return data from
+ * @size: size of the data to retrieve at offset
+ * @unused: placeholder which net/core/filter.c uses for for temporary
+ *          storage.  Ideally, the two code paths can be merged.
+ *
+ * Returns a pointer to the BPF evaluator after checking the offset and size
+ * boundaries.
+ */
+static inline void *seccomp_load_pointer(void *data, int offset, size_t size,
+					 void *buffer)
+{
+	struct seccomp_filter_metadata *metadata = data;
+	int arg;
+	if (offset >= sizeof(metadata->data))
+		goto fail;
+	if (offset < 0)
+		goto fail;
+	if (size > sizeof(metadata->data) - offset)
+		goto fail;
+	if (metadata->has_args)
+		goto pass;
+	/* No argument data touched. */
+	if (offset + size - 1 < offsetof(struct seccomp_filter_data, args))
+		goto pass;
+	for (arg = 0; arg < ARRAY_SIZE(metadata->data.args); ++arg)
+		syscall_get_arguments(current, task_pt_regs(current), arg, 1,
+			seccomp_filter_data_arg(&metadata->data, arg));
+	metadata->has_args = true;
+pass:
+	return ((__u8 *)(&metadata->data)) + offset;
+fail:
+	return NULL;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+	int ret = -EACCES;
+	struct seccomp_filter *filter;
+	struct seccomp_filter_metadata metadata;
+
+	filter = current->seccomp.filter; /* uses task ref */
+	if (!filter)
+		goto out;
+
+	metadata.data.syscall_nr = syscall;
+	metadata.has_args = false;
+
+#ifdef CONFIG_COMPAT
+	if (filter->flags.compat != !!(is_compat_task()))
+		goto out;
+#endif
+
+	/* Only allow a system call if it is allowed in all ancestors. */
+	ret = 0;
+	for ( ; filter != NULL; filter = filter->parent) {
+		/* Allowed if return value is SECCOMP_BPF_ALLOW */
+		if (seccomp_run_filter(&metadata, sizeof(metadata.data),
+					filter->insns) != SECCOMP_BPF_ALLOW)
+			ret = -EACCES;
+	}
+out:
+	return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Context: User context only. This function may sleep on allocation and
+ *          operates on current. current must be attempting a system call
+ *          when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+	struct seccomp_filter *filter = NULL;
+	/* Note, len is a short so overflow should be impossible. */
+	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+	long ret = -EPERM;
+
+	/* Allocate a new seccomp_filter */
+	filter = seccomp_filter_alloc(fp_size);
+	if (IS_ERR(filter)) {
+		ret = PTR_ERR(filter);
+		goto out;
+	}
+
+	/* Copy the instructions from fprog. */
+	ret = -EFAULT;
+	if (copy_from_user(filter->insns, fprog->filter, fp_size))
+		goto out;
+
+	/* Check the fprog */
+	ret = sk_chk_filter(filter->insns, filter->count);
+	if (ret)
+		goto out;
+
+	/*
+	 * If a process lacks CAP_SYS_ADMIN in its namespace, force
+	 * this process and all descendents to run with no_new_privs.
+	 * A privileged process will need to set this bit independently,
+	 * if desired.
+	 */
+	if (security_capable_noaudit(current_cred(), current_user_ns(),
+				     CAP_SYS_ADMIN) != 0)
+		current->no_new_privs = 1;
+
+	/*
+	 * If there is an existing filter, make it the parent
+	 * and reuse the existing task-based ref.
+	 */
+	filter->parent = current->seccomp.filter;
+
+#ifdef CONFIG_COMPAT
+	/* Disallow changing system calling conventions after the fact. */
+	filter->flags.compat = !!(is_compat_task());
+
+	if (filter->parent &&
+	    filter->parent->flags.compat != filter->flags.compat)
+		return -EACCES;
+#endif
+
+	/*
+	 * Double claim the new filter so we can release it below simplifying
+	 * the error paths earlier.
+	 */
+	ret = 0;
+	get_seccomp_filter(filter);
+	current->seccomp.filter = filter;
+	/* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+	if (current->seccomp.mode == SECCOMP_MODE_DISABLED) {
+		current->seccomp.mode = SECCOMP_MODE_FILTER;
+		set_thread_flag(TIF_SECCOMP);
+	}
+
+out:
+	put_seccomp_filter(filter);  /* for get or task, on err */
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+/* This should be kept in sync with net/compat.c which changes infrequently. */
+struct compat_sock_fprog {
+	u16 len;
+	compat_uptr_t filter;	/* struct sock_filter */
+};
+
+static long compat_attach_seccomp_filter(char __user *optval)
+{
+	struct compat_sock_fprog __user *fprog32 =
+		(struct compat_sock_fprog __user *)optval;
+	struct sock_fprog __user *kfprog =
+		compat_alloc_user_space(sizeof(struct sock_fprog));
+	compat_uptr_t ptr;
+	u16 len;
+
+	if (!access_ok(VERIFY_READ, fprog32, sizeof(*fprog32)) ||
+	    !access_ok(VERIFY_WRITE, kfprog, sizeof(struct sock_fprog)) ||
+	    __get_user(len, &fprog32->len) ||
+	    __get_user(ptr, &fprog32->filter) ||
+	    __put_user(len, &kfprog->len) ||
+	    __put_user(compat_ptr(ptr), &kfprog->filter))
+		return -EFAULT;
+
+	return seccomp_attach_filter(kfprog);
+}
+#endif
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+	struct sock_fprog fprog;
+	long ret = -EINVAL;
+	ret = -EFAULT;
+	if (!user_filter)
+		goto out;
+
+#ifdef CONFIG_COMPAT
+	if (is_compat_task())
+		return compat_attach_seccomp_filter(user_filter);
+#endif
+
+	if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+		goto out;
+
+	ret = seccomp_attach_filter(&fprog);
+out:
+	return ret;
+}
+
+/**
+ * seccomp_struct_fork: manages inheritance on fork
+ * @child: forkee's seccomp_struct
+ * @parent: forker's seccomp_struct
+ *
+ * Ensures that @child inherits seccomp mode and state iff
+ * seccomp filtering is in use.
+ */
+void seccomp_struct_fork(struct seccomp_struct *child,
+			 const struct seccomp_struct *parent)
+{
+	child->mode = parent->mode;
+	if (parent->mode != SECCOMP_MODE_FILTER)
+		return;
+	child->filter = get_seccomp_filter(parent->filter);
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF
+ *	@buf: opaque buffer to execute the filter over
+ *	@buflen: length of the buffer
+ *	@fentry: filter to apply
+ *
+ * Decode and apply filter instructions to the buffer.  Return length to
+ * keep, 0 for none. @buf is a seccomp_filter_metadata we are filtering,
+ * @filter is the array of filter instructions.  Because all jumps are
+ * guaranteed to be before last instruction, and last instruction
+ * guaranteed to be a RET, we dont need to check flen.
+ *
+ * See core/net/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ */
+static unsigned int seccomp_run_filter(void *data, uint32_t datalen,
+			  const struct sock_filter *fentry)
+{
+	const void *ptr;
+	u32 A = 0;			/* Accumulator */
+	u32 X = 0;			/* Index Register */
+	u32 mem[BPF_MEMWORDS];		/* Scratch Memory Store */
+	u32 tmp;
+	int k;
+
+	/*
+	 * Process array of filter instructions.
+	 */
+	for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define	K (fentry->k)
+#else
+		const u32 K = fentry->k;
+#endif
+
+		switch (fentry->code) {
+		case BPF_S_ALU_ADD_X:
+			A += X;
+			continue;
+		case BPF_S_ALU_ADD_K:
+			A += K;
+			continue;
+		case BPF_S_ALU_SUB_X:
+			A -= X;
+			continue;
+		case BPF_S_ALU_SUB_K:
+			A -= K;
+			continue;
+		case BPF_S_ALU_MUL_X:
+			A *= X;
+			continue;
+		case BPF_S_ALU_MUL_K:
+			A *= K;
+			continue;
+		case BPF_S_ALU_DIV_X:
+			if (X == 0)
+				return 0;
+			A /= X;
+			continue;
+		case BPF_S_ALU_DIV_K:
+			A = reciprocal_divide(A, K);
+			continue;
+		case BPF_S_ALU_AND_X:
+			A &= X;
+			continue;
+		case BPF_S_ALU_AND_K:
+			A &= K;
+			continue;
+		case BPF_S_ALU_OR_X:
+			A |= X;
+			continue;
+		case BPF_S_ALU_OR_K:
+			A |= K;
+			continue;
+		case BPF_S_ALU_LSH_X:
+			A <<= X;
+			continue;
+		case BPF_S_ALU_LSH_K:
+			A <<= K;
+			continue;
+		case BPF_S_ALU_RSH_X:
+			A >>= X;
+			continue;
+		case BPF_S_ALU_RSH_K:
+			A >>= K;
+			continue;
+		case BPF_S_ALU_NEG:
+			A = -A;
+			continue;
+		case BPF_S_JMP_JA:
+			fentry += K;
+			continue;
+		case BPF_S_JMP_JGT_K:
+			fentry += (A > K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGE_K:
+			fentry += (A >= K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JEQ_K:
+			fentry += (A == K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JSET_K:
+			fentry += (A & K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGT_X:
+			fentry += (A > X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGE_X:
+			fentry += (A >= X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JEQ_X:
+			fentry += (A == X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JSET_X:
+			fentry += (A & X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_LD_W_ABS:
+			k = K;
+load_w:
+			ptr = seccomp_load_pointer(data, k, 4, &tmp);
+			if (ptr != NULL) {
+				/*
+				 * Assume load_pointer did any byte swapping.
+				 */
+				A = *(const u32 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_H_ABS:
+			k = K;
+load_h:
+			ptr = seccomp_load_pointer(data, k, 2, &tmp);
+			if (ptr != NULL) {
+				A = *(const u16 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_B_ABS:
+			k = K;
+load_b:
+			ptr = seccomp_load_pointer(data, k, 1, &tmp);
+			if (ptr != NULL) {
+				A = *(const u8 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_W_LEN:
+			A = datalen;
+			continue;
+		case BPF_S_LDX_W_LEN:
+			X = datalen;
+			continue;
+		case BPF_S_LD_W_IND:
+			k = X + K;
+			goto load_w;
+		case BPF_S_LD_H_IND:
+			k = X + K;
+			goto load_h;
+		case BPF_S_LD_B_IND:
+			k = X + K;
+			goto load_b;
+		case BPF_S_LDX_B_MSH:
+			ptr = seccomp_load_pointer(data, K, 1, &tmp);
+			if (ptr != NULL) {
+				X = (*(u8 *)ptr & 0xf) << 2;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_IMM:
+			A = K;
+			continue;
+		case BPF_S_LDX_IMM:
+			X = K;
+			continue;
+		case BPF_S_LD_MEM:
+			A = mem[K];
+			continue;
+		case BPF_S_LDX_MEM:
+			X = mem[K];
+			continue;
+		case BPF_S_MISC_TAX:
+			X = A;
+			continue;
+		case BPF_S_MISC_TXA:
+			A = X;
+			continue;
+		case BPF_S_RET_K:
+			return K;
+		case BPF_S_RET_A:
+			return A;
+		case BPF_S_ST:
+			mem[K] = A;
+			continue;
+		case BPF_S_STX:
+			mem[K] = X;
+			continue;
+		case BPF_S_ANC_PROTOCOL:
+		case BPF_S_ANC_PKTTYPE:
+		case BPF_S_ANC_IFINDEX:
+		case BPF_S_ANC_MARK:
+		case BPF_S_ANC_QUEUE:
+		case BPF_S_ANC_HATYPE:
+		case BPF_S_ANC_RXHASH:
+		case BPF_S_ANC_CPU:
+		case BPF_S_ANC_NLATTR:
+		case BPF_S_ANC_NLATTR_NEST:
+			continue;
+		default:
+			WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
+				       fentry->code, fentry->jt,
+				       fentry->jf, fentry->k);
+			return 0;
+		}
+	}
+
+	return 0;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 4070153..8e43f70 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1901,6 +1901,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		case PR_SET_SECCOMP:
 			error = prctl_set_seccomp(arg2);
 			break;
+		case PR_ATTACH_SECCOMP_FILTER:
+			error = prctl_attach_seccomp_filter((char __user *)
+								arg2);
+			break;
 		case PR_GET_TSC:
 			error = GET_TSC_CTL(arg2);
 			break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..e1ffed8 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,26 @@ config SECURITY_DMESG_RESTRICT
 
 	  If you are unsure how to answer this question, answer N.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	select SECCOMP
+	help
+	  This option provide support for limiting the accessibility of
+	  systems calls at a task-level using a dynamically defined policy.
+
+	  System call filtering policy is expressed by the user using
+	  a Berkeley Packet Filter program.  The program is attached using
+	  prctl(2).  For every system call the task makes, its number,
+	  arguments, and other metadata will be evaluated by the attached
+	  filter program.  The result determines if the system call may
+	  may proceed or if the task should be terminated.
+
+	  This behavior is meant to aid security-conscious software in
+	  its ability to minimize the risk of running potentially
+	  risky code.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config SECURITY
 	bool "Enable different security models"
 	depends on SYSFS
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 3/3] Documentation: prctl/seccomp_filter
  2012-01-27 23:24 [PATCH v5 1/3] seccomp: kill the seccomp_t typedef Will Drewry
  2012-01-27 23:24 ` [PATCH v5 2/3] seccomp_filters: system call filtering using BPF Will Drewry
@ 2012-01-27 23:24 ` Will Drewry
  2012-01-27 23:41 ` [PATCH v5 1/3] seccomp: kill the seccomp_t typedef Greg KH
  2 siblings, 0 replies; 26+ messages in thread
From: Will Drewry @ 2012-01-27 23:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

Documents how system call filtering using Berkeley Packet
Filter programs works and how it may be used.
Includes an example for x86 (32-bit) and a semi-generic
example using an example code generator.

v5: - update sample to use system call arguments
    - adds a "fancy" example using a macro-based generator
    - cleaned up bpf in the sample
    - update docs to mention arguments
    - fix prctl value (eparis@redhat.com)
    - language cleanup (rdunlap@xenotime.net)
v4: - update for no_new_privs use
    - minor tweaks
v3: - call out BPF <-> Berkeley Packet Filter (rdunlap@xenotime.net)
    - document use of tentative always-unprivileged
    - guard sample compilation for i386 and x86_64
v2: - move code to samples (corbet@lwn.net)

Signed-off-by: Will Drewry <wad@chromium.org>
---
 Documentation/prctl/seccomp_filter.txt |   97 ++++++++++++++
 samples/Makefile                       |    2 +-
 samples/seccomp/Makefile               |   27 ++++
 samples/seccomp/bpf-direct.c           |   77 +++++++++++
 samples/seccomp/bpf-fancy.c            |   95 ++++++++++++++
 samples/seccomp/bpf-helper.c           |   89 +++++++++++++
 samples/seccomp/bpf-helper.h           |  217 ++++++++++++++++++++++++++++++++
 7 files changed, 603 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/prctl/seccomp_filter.txt
 create mode 100644 samples/seccomp/Makefile
 create mode 100644 samples/seccomp/bpf-direct.c
 create mode 100644 samples/seccomp/bpf-fancy.c
 create mode 100644 samples/seccomp/bpf-helper.c
 create mode 100644 samples/seccomp/bpf-helper.h

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..f2c1325
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,97 @@
+		Seccomp filtering
+		=================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated.  A
+certain subset of userland applications benefit by having a reduced set
+of available system calls.  The resulting set reduces the total kernel
+surface exposed to the application.  System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter for
+incoming system calls.  The filter is expressed as a Berkeley Packet
+Filter (BPF) program, as with socket filters, except that the data
+operated on is related to the system call being made: system call
+number, and the system call arguments.  This allows for expressive
+filtering of system calls using a filter program language with a long
+history of being exposed to userland and a straightforward data set.
+
+Additionally, BPF makes it impossible for users of seccomp to fall prey
+to time-of-check-time-of-use (TOCTOU) attacks that are common in system
+call interposition frameworks.  BPF programs may not dereference
+pointers which constrains all filters to solely evaluating the system
+call arguments directly.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox.  It provides a clearly defined
+mechanism for minimizing the exposed kernel surface.  Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing.  Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but they are not directly set by
+the consuming process.  The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set and enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+	Allows the specification of a new filter using a BPF program.
+	The BPF program will be executed over struct seccomp_filter_data
+	reflecting the system call number, arguments, and other
+	metadata, To allow a system call, SECCOMP_BPF_ALLOW must be
+	returned.  At present, all other return values result in the
+	system call being blocked, but it is recommended to return
+	SECCOMP_BPF_DENY in those cases.  This will allow for future
+	custom return values to be introduced, if ever desired.
+
+	Usage:
+		prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+	The 'prog' argument is a pointer to a struct sock_fprog which will
+	contain the filter program.  If the program is invalid, the call
+	will return -1 and set errno to EINVAL.
+
+	Note, is_compat_task is also tracked for the @prog.  This means
+	that once set the calling task will have all of its system calls
+	blocked if it switches its system call ABI.
+
+	If fork/clone and execve are allowed by @prog, any child processes will
+	be constrained to the same filters and system call ABI as the parent.
+
+	When called from an unprivileged process (lacking CAP_SYS_ADMIN), the
+	"no_new_privs" bit is enabled for the process.
+
+	Additionally, if prctl(2) is allowed by the attached filter,
+	additional filters may be layered on which will increase evaluation
+	time, but allow for further decreasing the attack surface during
+	execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Example
+-------
+
+The samples/seccomp/ directory contains both a 32-bit specific example
+and a more generic example of a higher level macro interface for BPF
+program generation.
+
+Adding architecture support
+-----------------------
+
+Any platform with seccomp support will support seccomp filters as long
+as CONFIG_SECCOMP_FILTER is enabled and the architecture has implemented
+syscall_get_arguments.
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
 # Makefile for Linux samples code
 
 obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ tracepoints/ trace_events/ \
-			   hw_breakpoint/ kfifo/ kdb/ hidraw/
+			   hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..0298c6f
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,27 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+hostprogs-y := bpf-fancy
+bpf-fancy-objs := bpf-fancy.o bpf-helper.o
+
+HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
+HOSTCFLAGS_bpf-helper.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-helper.o += -idirafter $(objtree)/include
+
+# bpf-direct.c is x86-only.
+ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
+# List of programs to build
+hostprogs-y += bpf-direct
+bpf-direct-objs := bpf-direct.o
+endif
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
+ifeq ($(KBUILD_BUILDHOST),x86_64)
+HOSTCFLAGS_bpf-direct.o += -m32
+HOSTLOADLIBES_bpf-direct += -m32
+endif
diff --git a/samples/seccomp/bpf-direct.c b/samples/seccomp/bpf-direct.c
new file mode 100644
index 0000000..d799244
--- /dev/null
+++ b/samples/seccomp/bpf-direct.c
@@ -0,0 +1,77 @@
+/*
+ * 32-bit seccomp filter example with BPF macros
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/seccomp_filter.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+#	define PR_ATTACH_SECCOMP_FILTER 37
+#endif
+
+#define syscall_arg(_n) (offsetof(struct seccomp_filter_data, args[_n].lo32))
+#define nr (offsetof(struct seccomp_filter_data, syscall_nr))
+
+static int install_filter(void)
+{
+	struct seccomp_filter_block filter[] = {
+		/* Grab the system call number */
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS, nr),
+		/* Jump table for the allowed syscalls */
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
+
+		/* Check that read is only using stdin. */
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
+
+		/* Check that write is only using stdout/stderr */
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
+
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_ALLOW),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_DENY),
+	};
+	struct seccomp_fprog prog = {
+		.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+		.filter = filter,
+	};
+	if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+		perror("prctl");
+		return 1;
+	}
+	return 0;
+}
+
+#define payload(_c) (_c), sizeof((_c))
+int main(int argc, char **argv)
+{
+	char buf[4096];
+	ssize_t bytes = 0;
+	if (install_filter())
+		return 1;
+	syscall(__NR_write, STDOUT_FILENO,
+		payload("OHAI! WHAT IS YOUR NAME? "));
+	bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+	syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+	syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+	return 0;
+}
diff --git a/samples/seccomp/bpf-fancy.c b/samples/seccomp/bpf-fancy.c
new file mode 100644
index 0000000..1318b1a
--- /dev/null
+++ b/samples/seccomp/bpf-fancy.c
@@ -0,0 +1,95 @@
+/*
+ * Seccomp BPF example using a macro-based generator.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <linux/seccomp_filter.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#include "bpf-helper.h"
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+#	define PR_ATTACH_SECCOMP_FILTER 37
+#endif
+
+int main(int argc, char **argv)
+{
+	struct bpf_labels l;
+	static const char msg1[] = "Please type something: ";
+	static const char msg2[] = "You typed: ";
+	char buf[256];
+	struct seccomp_filter_block filter[] = {
+		LOAD_SYSCALL_NR,
+		SYSCALL(__NR_exit, ALLOW),
+		SYSCALL(__NR_exit_group, ALLOW),
+		SYSCALL(__NR_write, JUMP(&l, write_fd)),
+		SYSCALL(__NR_read, JUMP(&l, read)),
+		DENY,  /* Don't passthrough into a label */
+
+		LABEL(&l, read),
+		ARG(0),
+		JNE(STDIN_FILENO, DENY),
+		ARG(1),
+		JNE((unsigned long)buf, DENY),
+		ARG(2),
+		JGE(sizeof(buf), DENY),
+		ALLOW,
+
+		LABEL(&l, write_fd),
+		ARG(0),
+		JEQ(STDOUT_FILENO, JUMP(&l, write_buf)),
+		JEQ(STDERR_FILENO, JUMP(&l, write_buf)),
+		DENY,
+
+		LABEL(&l, write_buf),
+		ARG(1),
+		JEQ((unsigned long)msg1, JUMP(&l, msg1_len)),
+		JEQ((unsigned long)msg2, JUMP(&l, msg2_len)),
+		JEQ((unsigned long)buf, JUMP(&l, buf_len)),
+		DENY,
+
+		LABEL(&l, msg1_len),
+		ARG(2),
+		JLT(sizeof(msg1), ALLOW),
+		DENY,
+
+		LABEL(&l, msg2_len),
+		ARG(2),
+		JLT(sizeof(msg2), ALLOW),
+		DENY,
+
+		LABEL(&l, buf_len),
+		ARG(2),
+		JLT(sizeof(buf), ALLOW),
+		DENY,
+	};
+	struct seccomp_fprog prog = {
+		.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+		.filter = filter,
+	};
+	ssize_t bytes;
+	bpf_resolve_jumps(&l, filter, sizeof(filter)/sizeof(*filter));
+
+	if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+		perror("prctl");
+		return 1;
+	}
+	syscall(__NR_write, STDOUT_FILENO, msg1, strlen(msg1));
+	bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf)-1);
+	bytes = (bytes > 0 ? bytes : 0);
+	syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2));
+	syscall(__NR_write, STDERR_FILENO, buf, bytes);
+	/* Now get killed */
+	syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2)+2);
+	return 0;
+}
diff --git a/samples/seccomp/bpf-helper.c b/samples/seccomp/bpf-helper.c
new file mode 100644
index 0000000..e1b6bc7
--- /dev/null
+++ b/samples/seccomp/bpf-helper.c
@@ -0,0 +1,89 @@
+/*
+ * Seccomp BPF helper functions
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include "bpf-helper.h"
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+		      struct seccomp_filter_block *filter, size_t count)
+{
+	struct seccomp_filter_block *begin = filter;
+	__u8 insn = count - 1;
+
+	if (count < 1)
+		return -1;
+	/*
+	* Walk it once, backwards, to build the label table and do fixups.
+	* Since backward jumps are disallowed by BPF, this is easy.
+	*/
+	filter += insn;
+	for (; filter >= begin; --insn, --filter) {
+		if (filter->code != (BPF_JMP+BPF_JA))
+			continue;
+		switch ((filter->jt<<8)|filter->jf) {
+		case (JUMP_JT<<8)|JUMP_JF:
+			if (labels->labels[filter->k].location == 0xffffffff) {
+				fprintf(stderr, "Unresolved label: '%s'\n",
+					labels->labels[filter->k].label);
+				return 1;
+			}
+			filter->k = labels->labels[filter->k].location -
+				    (insn + 1);
+			filter->jt = 0;
+			filter->jf = 0;
+			continue;
+		case (LABEL_JT<<8)|LABEL_JF:
+			if (labels->labels[filter->k].location != 0xffffffff) {
+				fprintf(stderr, "Duplicate label use: '%s'\n",
+					labels->labels[filter->k].label);
+				return 1;
+			}
+			labels->labels[filter->k].location = insn;
+			filter->k = 0; /* fall through */
+			filter->jt = 0;
+			filter->jf = 0;
+			continue;
+		}
+	}
+	return 0;
+}
+
+/* Simple lookup table for labels. */
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label)
+{
+	struct __bpf_label *begin = labels->labels, *end;
+	int id;
+	if (labels->count == 0) {
+		begin->label = label;
+		begin->location = 0xffffffff;
+		labels->count++;
+		return 0;
+	}
+	end = begin + labels->count;
+	for (id = 0; begin < end; ++begin, ++id) {
+		if (!strcmp(label, begin->label))
+			return id;
+	}
+	begin->label = label;
+	begin->location = 0xffffffff;
+	labels->count++;
+	return id;
+}
+
+void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count)
+{
+	struct seccomp_filter_block *end = filter + count;
+	for ( ; filter < end; ++filter)
+		printf("{ code=%u,jt=%u,jf=%u,k=%u },\n",
+			filter->code, filter->jt, filter->jf, filter->k);
+}
diff --git a/samples/seccomp/bpf-helper.h b/samples/seccomp/bpf-helper.h
new file mode 100644
index 0000000..6b40e3e
--- /dev/null
+++ b/samples/seccomp/bpf-helper.h
@@ -0,0 +1,217 @@
+/*
+ * Example wrapper around BPF macros.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ *
+ * No guarantees are provided with respect to the correctness
+ * or functionality of this code.
+ */
+#ifndef __BPF_HELPER_H__
+#define __BPF_HELPER_H__
+
+#include <asm/bitsperlong.h>	/* for __BITS_PER_LONG */
+#include <linux/filter.h>
+#include <linux/seccomp_filter.h>	/* for seccomp_filter_data.arg */
+#include <linux/types.h>
+#include <linux/unistd.h>
+#include <stddef.h>
+
+#define BPF_LABELS_MAX 256
+struct bpf_labels {
+	int count;
+	struct __bpf_label {
+		const char *label;
+		__u32 location;
+	} labels[BPF_LABELS_MAX];
+};
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+		      struct seccomp_filter_block *filter, size_t count);
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label);
+void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count);
+
+#define JUMP_JT 0xff
+#define JUMP_JF 0xff
+#define LABEL_JT 0xfe
+#define LABEL_JF 0xfe
+
+#define ALLOW \
+	BPF_STMT(BPF_RET+BPF_K, 0xFFFFFFFF)
+#define DENY \
+	BPF_STMT(BPF_RET+BPF_K, 0)
+#define JUMP(labels, label) \
+	BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL(labels, label), JUMP_JT, JUMP_JF)
+#define LABEL(labels, label) \
+	BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL(labels, label), LABEL_JT, LABEL_JF)
+#define SYSCALL(nr, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1), \
+	jt
+
+/* Lame, but just an example */
+#define FIND_LABEL(labels, label) seccomp_bpf_label(labels, #label)
+
+#define EXPAND(...) __VA_ARGS__
+/* Map all width-sensitive operations */
+#if __BITS_PER_LONG == 32
+
+#define JEQ(x, jt) JEQ32(x, EXPAND(jt))
+#define JNE(x, jt) JNE32(x, EXPAND(jt))
+#define JGT(x, jt) JGT32(x, EXPAND(jt))
+#define JLT(x, jt) JLT32(x, EXPAND(jt))
+#define JGE(x, jt) JGE32(x, EXPAND(jt))
+#define JLE(x, jt) JLE32(x, EXPAND(jt))
+#define JA(x, jt) JA32(x, EXPAND(jt))
+#define ARG(i) ARG_32(i)
+
+#elif __BITS_PER_LONG == 64
+
+#define JEQ(x, jt) \
+	JEQ64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+#define JGT(x, jt) \
+	JGT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+#define JGE(x, jt) \
+	JGE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+#define JNE(x, jt) \
+	JNE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+#define JLT(x, jt) \
+	JLT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+#define JLE(x, jt) \
+	JLE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+
+#define JA(x, jt) \
+	JA64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	       ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	       EXPAND(jt))
+#define ARG(i) ARG_64(i)
+
+#else
+#error __BITS_PER_LONG value unusable.
+#endif
+
+/* Loads the arg into A */
+#define ARG_32(idx) \
+	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+		offsetof(struct seccomp_filter_data, args[(idx)].lo32))
+
+/* Loads hi into A and lo in X */
+#define ARG_64(idx) \
+	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+	  offsetof(struct seccomp_filter_data, args[(idx)].lo32)), \
+	BPF_STMT(BPF_ST, 0), /* lo -> M[0] */ \
+	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+	  offsetof(struct seccomp_filter_data, args[(idx)].hi32)), \
+	BPF_STMT(BPF_ST, 1) /* hi -> M[1] */
+
+#define JEQ32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 0, 1), \
+	jt
+
+#define JNE32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 1, 0), \
+	jt
+
+/* Checks the lo, then swaps to check the hi. A=lo,X=hi */
+#define JEQ64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 0, 2), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JNE64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 5, 0), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 2, 0), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JA32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (value), 0, 1), \
+	jt
+
+#define JA64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (hi), 3, 0), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (lo), 0, 2), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGE32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 0, 1), \
+	jt
+
+#define JLT32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 1, 0), \
+	jt
+
+/* Shortcut checking if hi > arg.hi. */
+#define JGE64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 0, 2), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLT64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (hi), 0, 4), \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGT32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
+	jt
+
+#define JLE32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
+	jt
+
+/* Check hi > args.hi first, then do the GE checking */
+#define JGT64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 0, 2), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLE64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 6, 0), \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define LOAD_SYSCALL_NR \
+	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+		 offsetof(struct seccomp_filter_data, syscall_nr))
+
+#endif  /* __BPF_HELPER_H__ */
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/3] seccomp: kill the seccomp_t typedef
  2012-01-27 23:24 [PATCH v5 1/3] seccomp: kill the seccomp_t typedef Will Drewry
  2012-01-27 23:24 ` [PATCH v5 2/3] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-27 23:24 ` [PATCH v5 3/3] Documentation: prctl/seccomp_filter Will Drewry
@ 2012-01-27 23:41 ` Greg KH
  2012-01-27 23:47   ` Will Drewry
  2 siblings, 1 reply; 26+ messages in thread
From: Greg KH @ 2012-01-27 23:41 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, corbet, alan, indan, mcgrathr

On Fri, Jan 27, 2012 at 05:24:39PM -0600, Will Drewry wrote:
> Replaces the seccomp_t typedef with seccomp_struct to match modern
> kernel style.
> 
> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
>  include/linux/sched.h   |    2 +-
>  include/linux/seccomp.h |   10 ++++++----
>  2 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4032ec1..288b5cb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1418,7 +1418,7 @@ struct task_struct {
>  	uid_t loginuid;
>  	unsigned int sessionid;
>  #endif
> -	seccomp_t seccomp;
> +	struct seccomp_struct seccomp;

Isn't 'struct seccomp_struct' a bit redundant?

How about a simple 'struct seccomp' instead?

That is, if this is even needed...

greg k-h

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/3] seccomp: kill the seccomp_t typedef
  2012-01-27 23:41 ` [PATCH v5 1/3] seccomp: kill the seccomp_t typedef Greg KH
@ 2012-01-27 23:47   ` Will Drewry
  2012-01-28  0:38     ` Steven Rostedt
  2012-01-28 11:13     ` Cong Wang
  0 siblings, 2 replies; 26+ messages in thread
From: Will Drewry @ 2012-01-27 23:47 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, corbet, alan, indan, mcgrathr

On Fri, Jan 27, 2012 at 5:41 PM, Greg KH <gregkh@suse.de> wrote:
> On Fri, Jan 27, 2012 at 05:24:39PM -0600, Will Drewry wrote:
>> Replaces the seccomp_t typedef with seccomp_struct to match modern
>> kernel style.
>>
>> Signed-off-by: Will Drewry <wad@chromium.org>
>> ---
>>  include/linux/sched.h   |    2 +-
>>  include/linux/seccomp.h |   10 ++++++----
>>  2 files changed, 7 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 4032ec1..288b5cb 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1418,7 +1418,7 @@ struct task_struct {
>>       uid_t loginuid;
>>       unsigned int sessionid;
>>  #endif
>> -     seccomp_t seccomp;
>> +     struct seccomp_struct seccomp;
>
> Isn't 'struct seccomp_struct' a bit redundant?
>
> How about a simple 'struct seccomp' instead?

Works for me - I can't recall why that seemed to make sense (other
than the user of similar redundant names elsewhere).

> That is, if this is even needed...

I'd like to think it is :)

thanks!
will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/3] seccomp: kill the seccomp_t typedef
  2012-01-27 23:47   ` Will Drewry
@ 2012-01-28  0:38     ` Steven Rostedt
  2012-01-28 11:13     ` Cong Wang
  1 sibling, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2012-01-28  0:38 UTC (permalink / raw)
  To: Will Drewry
  Cc: Greg KH, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, corbet, alan, indan, mcgrathr

On Fri, 2012-01-27 at 17:47 -0600, Will Drewry wrote:

> > Isn't 'struct seccomp_struct' a bit redundant?
> >
> > How about a simple 'struct seccomp' instead?
> 
> Works for me - I can't recall why that seemed to make sense (other
> than the user of similar redundant names elsewhere).
> 

You mean like... struct task_struct?

;-)

-- Steve



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/3] seccomp: kill the seccomp_t typedef
  2012-01-27 23:47   ` Will Drewry
  2012-01-28  0:38     ` Steven Rostedt
@ 2012-01-28 11:13     ` Cong Wang
  2012-01-28 21:06       ` Will Drewry
  1 sibling, 1 reply; 26+ messages in thread
From: Cong Wang @ 2012-01-28 11:13 UTC (permalink / raw)
  To: Will Drewry
  Cc: Greg KH, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, oleg, ak, eric.dumazet, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, corbet, alan, indan, mcgrathr

On Fri, 2012-01-27 at 17:47 -0600, Will Drewry wrote:
> On Fri, Jan 27, 2012 at 5:41 PM, Greg KH <gregkh@suse.de> wrote:
> > On Fri, Jan 27, 2012 at 05:24:39PM -0600, Will Drewry wrote:
> >> Replaces the seccomp_t typedef with seccomp_struct to match modern
> >> kernel style.
> >>
> >> Signed-off-by: Will Drewry <wad@chromium.org>
> >> ---
> >>  include/linux/sched.h   |    2 +-
> >>  include/linux/seccomp.h |   10 ++++++----
> >>  2 files changed, 7 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index 4032ec1..288b5cb 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -1418,7 +1418,7 @@ struct task_struct {
> >>       uid_t loginuid;
> >>       unsigned int sessionid;
> >>  #endif
> >> -     seccomp_t seccomp;
> >> +     struct seccomp_struct seccomp;
> >
> > Isn't 'struct seccomp_struct' a bit redundant?
> >
> > How about a simple 'struct seccomp' instead?
> 
> Works for me - I can't recall why that seemed to make sense (other
> than the user of similar redundant names elsewhere).

seccomp_struct for a type is okay, but you also have:

+#define seccomp_struct_init_task(_seccomp) do { } while (0);
+#define seccomp_struct_fork(_tsk, _orig) do { } while (0);
+#define seccomp_struct_free_task(_seccomp) do { } while (0);

in patch 2/3, "struct" in these function/macro names is redundant.

Thanks.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-27 23:24 ` [PATCH v5 2/3] seccomp_filters: system call filtering using BPF Will Drewry
@ 2012-01-28 11:21   ` Cong Wang
  2012-01-28 20:13     ` Will Drewry
  2012-01-28 16:28   ` Andrew Lutomirski
  2012-01-29  4:39   ` Indan Zupancic
  2 siblings, 1 reply; 26+ messages in thread
From: Cong Wang @ 2012-01-28 11:21 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, oleg, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, corbet, alan, indan, mcgrathr

On Fri, 2012-01-27 at 17:24 -0600, Will Drewry wrote:
> +config SECCOMP_FILTER
> +       bool "Enable seccomp-based system call filtering"
> +       select SECCOMP

Is 'depends on SECCOMP' better?

> +       help
> +         This option provide support for limiting the accessibility
> of

s/provide/provides/

> +         systems calls at a task-level using a dynamically defined
> policy.


s/systems/system/

> +
> +         System call filtering policy is expressed by the user using
> +         a Berkeley Packet Filter program.  The program is attached
> using

s/the user using//

> +         prctl(2).  For every system call the task makes, its number,
> +         arguments, and other metadata will be evaluated by the
> attached
> +         filter program.  The result determines if the system call
> may
> +         may proceed or if the task should be terminated.

One more "may"... and "be proceeded"

> +
> +         This behavior is meant to aid security-conscious software in
> +         its ability to minimize the risk of running potentially
> +         risky code.
> +
> +         See Documentation/prctl/seccomp_filter.txt for more detail.
> + 


Thanks.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-27 23:24 ` [PATCH v5 2/3] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-28 11:21   ` Cong Wang
@ 2012-01-28 16:28   ` Andrew Lutomirski
  2012-01-28 19:50     ` Will Drewry
  2012-01-29  4:39   ` Indan Zupancic
  2 siblings, 1 reply; 26+ messages in thread
From: Andrew Lutomirski @ 2012-01-28 16:28 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, indan, mcgrathr

On Fri, Jan 27, 2012 at 3:24 PM, Will Drewry <wad@chromium.org> wrote:
> Filter programs will be inherited across fork/clone and execve.
> However, if the task attaching the filter is unprivileged
> (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task.  This
> ensures that unprivileged tasks cannot attach filters that affect
> privileged tasks (e.g., setuid binary).

This makes me nervous -- I don't think that the behavior of any new
API should be different depending on privilege level -- adding a
privilege should just make things work that would otherwise fail.  You
might end up with bugs where a program is completely safe if run
without CAP_SYS_ADMIN but, if run with CAP_SYS_ADMIN, bad things
happen.  (The behavior of setuid(geteuid()) is an example of this
problem.)

One way to fix it is to make setting a filter program fail unless
capable(CAP_SYS_ADMIN ) || no_new_privs.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-28 16:28   ` Andrew Lutomirski
@ 2012-01-28 19:50     ` Will Drewry
  0 siblings, 0 replies; 26+ messages in thread
From: Will Drewry @ 2012-01-28 19:50 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, indan, mcgrathr

On Sat, Jan 28, 2012 at 8:28 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> On Fri, Jan 27, 2012 at 3:24 PM, Will Drewry <wad@chromium.org> wrote:
>> Filter programs will be inherited across fork/clone and execve.
>> However, if the task attaching the filter is unprivileged
>> (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task.  This
>> ensures that unprivileged tasks cannot attach filters that affect
>> privileged tasks (e.g., setuid binary).
>
> This makes me nervous -- I don't think that the behavior of any new
> API should be different depending on privilege level -- adding a
> privilege should just make things work that would otherwise fail.  You
> might end up with bugs where a program is completely safe if run
> without CAP_SYS_ADMIN but, if run with CAP_SYS_ADMIN, bad things
> happen.  (The behavior of setuid(geteuid()) is an example of this
> problem.)
>
> One way to fix it is to make setting a filter program fail unless
> capable(CAP_SYS_ADMIN ) || no_new_privs.

Good call.  I'll upgrade the interaction with no_new_privs to be
explicit in the next revision.  It is certainly more transparent and
removes the risk of unintended consequences.

thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-28 11:21   ` Cong Wang
@ 2012-01-28 20:13     ` Will Drewry
  0 siblings, 0 replies; 26+ messages in thread
From: Will Drewry @ 2012-01-28 20:13 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, oleg, ak, eric.dumazet, gregkh, dhowells,
	daniel.lezcano, linux-fsdevel, linux-security-module, olofj,
	mhalcrow, dlaor, corbet, alan, indan, mcgrathr

On Sat, Jan 28, 2012 at 3:21 AM, Cong Wang <amwang@redhat.com> wrote:
> On Fri, 2012-01-27 at 17:24 -0600, Will Drewry wrote:
>> +config SECCOMP_FILTER
>> +       bool "Enable seccomp-based system call filtering"
>> +       select SECCOMP
>
> Is 'depends on SECCOMP' better?

Either way is fine for me.  I chose select so that SECCOMP_FILTER
wouldn't be hidden if SECCOMP was off when they hit the security menu.

>
>> +       help
>> +         This option provide support for limiting the accessibility
>> of
>
> s/provide/provides/
>
>> +         systems calls at a task-level using a dynamically defined
>> policy.
>
>
> s/systems/system/
>
>> +
>> +         System call filtering policy is expressed by the user using
>> +         a Berkeley Packet Filter program.  The program is attached
>> using
>
> s/the user using//
>
>> +         prctl(2).  For every system call the task makes, its number,
>> +         arguments, and other metadata will be evaluated by the
>> attached
>> +         filter program.  The result determines if the system call
>> may
>> +         may proceed or if the task should be terminated.
>
> One more "may"... and "be proceeded"
>
>> +
>> +         This behavior is meant to aid security-conscious software in
>> +         its ability to minimize the risk of running potentially
>> +         risky code.
>> +
>> +         See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>
>
> Thanks.

Thanks!  Cleaned up, as recommended, and slightly reworded.  I'll
include the updates in the next salvo.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/3] seccomp: kill the seccomp_t typedef
  2012-01-28 11:13     ` Cong Wang
@ 2012-01-28 21:06       ` Will Drewry
  0 siblings, 0 replies; 26+ messages in thread
From: Will Drewry @ 2012-01-28 21:06 UTC (permalink / raw)
  To: LKML, Greg KH
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov, oleg,
	ak, eric.dumazet, dhowells, daniel.lezcano, linux-fsdevel,
	linux-security-module, olofj, mhalcrow, dlaor, corbet, alan,
	indan, mcgrathr

On Sat, Jan 28, 2012 at 5:13 AM, Cong Wang <amwang@redhat.com> wrote:
> On Fri, 2012-01-27 at 17:47 -0600, Will Drewry wrote:
>> On Fri, Jan 27, 2012 at 5:41 PM, Greg KH <gregkh@suse.de> wrote:
>> > On Fri, Jan 27, 2012 at 05:24:39PM -0600, Will Drewry wrote:
>> >> Replaces the seccomp_t typedef with seccomp_struct to match modern
>> >> kernel style.
>> >>
>> >> Signed-off-by: Will Drewry <wad@chromium.org>
>> >> ---
>> >>  include/linux/sched.h   |    2 +-
>> >>  include/linux/seccomp.h |   10 ++++++----
>> >>  2 files changed, 7 insertions(+), 5 deletions(-)
>> >>
>> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> >> index 4032ec1..288b5cb 100644
>> >> --- a/include/linux/sched.h
>> >> +++ b/include/linux/sched.h
>> >> @@ -1418,7 +1418,7 @@ struct task_struct {
>> >>       uid_t loginuid;
>> >>       unsigned int sessionid;
>> >>  #endif
>> >> -     seccomp_t seccomp;
>> >> +     struct seccomp_struct seccomp;
>> >
>> > Isn't 'struct seccomp_struct' a bit redundant?
>> >
>> > How about a simple 'struct seccomp' instead?
>>
>> Works for me - I can't recall why that seemed to make sense (other
>> than the user of similar redundant names elsewhere).
>
> seccomp_struct for a type is okay, but you also have:

Thanks. Looking again, I see the main benefit of seccomp_struct over
seccomp being that it help avoids error.  Ignoring task_struct :),
signal_struct and sighand_structs are a good example.  Since the
member name is "signal" and "sighand", I think this bit of redundancy
helps disambiuguate, but perhaps I'm just being overly conservative.

I'll leave it as is unless Greg or someone else feels strongly :)

> +#define seccomp_struct_init_task(_seccomp) do { } while (0);
> +#define seccomp_struct_fork(_tsk, _orig) do { } while (0);
> +#define seccomp_struct_free_task(_seccomp) do { } while (0);
>
> in patch 2/3, "struct" in these function/macro names is redundant.

Fixed - thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-27 23:24 ` [PATCH v5 2/3] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-28 11:21   ` Cong Wang
  2012-01-28 16:28   ` Andrew Lutomirski
@ 2012-01-29  4:39   ` Indan Zupancic
  2012-01-29 20:27     ` Will Drewry
  2 siblings, 1 reply; 26+ messages in thread
From: Indan Zupancic @ 2012-01-29  4:39 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, wad, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

Hello,

Feedback below. I know you send out v6, but I'm not going to play
eternal catch-up, so here it is.

On Sat, January 28, 2012 00:24, Will Drewry wrote:
> [This patch depends on luto@mit.edu's no_new_privs patch:
>  https://lkml.org/lkml/2012/1/12/446
> ]
>
> This patch adds support for seccomp mode 2.  This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task.  The policy is expressed in terms of a Berkeley
> Packet Filter program, as is used for userland-exposed socket filtering.
> Instead of network data, the BPF program is evaluated over struct
> seccomp_filter_data at the time of the system call.
>
> A filter program may be installed by a userland task by calling
>   prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
> where fprog is of type struct sock_fprog.
>
> If the first filter program allows subsequent prctl(2) calls, then
> additional filter programs may be attached.  All attached programs
> must be evaluated before a system call will be allowed to proceed.
>
> To avoid CONFIG_COMPAT related landmines, once a filter program is
> installed using specific is_compat_task() value, it is not allowed to
> make system calls using the alternate entry point.
>
> Filter programs will be inherited across fork/clone and execve.
> However, if the task attaching the filter is unprivileged
> (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task.  This
> ensures that unprivileged tasks cannot attach filters that affect
> privileged tasks (e.g., setuid binary).
>
> There are a number of benefits to this approach. A few of which are
> as follows:
> - BPF has been exposed to userland for a long time.
> - Userland already knows its ABI: system call numbers and desired
>   arguments
> - No time-of-check-time-of-use vulnerable data accesses are possible.
> - system call arguments are loaded on demand only to minimize copying
>   required for system call number-only policy decisions.
>
> This patch includes its own BPF evaluator, but relies on the
> net/core/filter.c BPF checking code.  It is possible to share
> evaluators, but the performance sensitive nature of the network
> filtering path makes it an iterative optimization which (I think :) can
> be tackled separately via separate patchsets. (And at some point sharing
> BPF JIT code!)
>
>  v5: - uses syscall_get_arguments
>        (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
>      - uses union-based arg storage with hi/lo struct to
>        handle endianness.  Compromises between the two alternate
>        proposals to minimize extra arg shuffling and account for
>        endianness assuming userspace uses offsetof().

Why would user space use offsetof()? Do you mean when assembling the BPF
code? So this basically means 32-bit and 64-bit filters are different,
even when they could be the same. Hm.

>        (mcgrathr@chromium.org, indan@nul.nu)
>      - update Kconfig description
>      - add include/seccomp_filter.h and add its installation
>      - (naive) on-demand syscall argument loading
>      - drop seccomp_t (eparis@redhat.com)
>      - adds proper compat prctl call copying
>  v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
>      - now uses current->no_new_privs
>          (luto@mit.edu,torvalds@linux-foundation.com)
>      - assign names to seccomp modes (rdunlap@xenotime.net)
>      - fix style issues (rdunlap@xenotime.net)
>      - reworded Kconfig entry (rdunlap@xenotime.net)
>  v3: - macros to inline (oleg@redhat.com)
>      - init_task behavior fixed (oleg@redhat.com)
>      - drop creator entry and extra NULL check (oleg@redhat.com)
>      - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
>      - adds tentative use of "always_unprivileged" as per
>        torvalds@linux-foundation.org and luto@mit.edu
>  v2: - (patch 2 only)
>
> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
>  include/linux/Kbuild           |    1 +
>  include/linux/prctl.h          |    3 +
>  include/linux/seccomp.h        |   63 ++++
>  include/linux/seccomp_filter.h |   79 +++++

Just stuff it into seccomp.h? If this is added, then this will be seccomp,
and the old mode can be seen as a standard, very restrictive filter.

>  kernel/Makefile                |    1 +
>  kernel/fork.c                  |    4 +
>  kernel/seccomp.c               |   10 +-
>  kernel/seccomp_filter.c        |  620 ++++++++++++++++++++++++++++++++++++++++

Same here. There isn't much to seccomp.c, why not add the filter code there?

If you insist on keeping it separate, then why call it seccomp at all?
Just call it syscall_filter or something.

>  kernel/sys.c                   |    4 +
>  security/Kconfig               |   20 ++
>  10 files changed, 804 insertions(+), 1 deletions(-)

What is the difference in kernel binary size with and without this?
If it's too big I think the BPF filtering code should be shared with
the networking version from the start.

Actually, that is something that is wanted anyway, so why not use the
networking code from the start? That should reduce the code amount a
lot. So I think if this gets merged the consolidated version should be
merged. Only downside would be that networking support is needed for
this feature, but without networking there shouldn't be much need for
this feature anyway.

>  create mode 100644 include/linux/seccomp_filter.h
>  create mode 100644 kernel/seccomp_filter.c
>
> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> index c94e717..5659454 100644
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -330,6 +330,7 @@ header-y += scc.h
>  header-y += sched.h
>  header-y += screen_info.h
>  header-y += sdla.h
> +header-y += seccomp_filter.h
>  header-y += securebits.h
>  header-y += selinux_netlink.h
>  header-y += sem.h
> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
> index 7ddc7f1..b8c4beb 100644
> --- a/include/linux/prctl.h
> +++ b/include/linux/prctl.h
> @@ -114,4 +114,7 @@
>  # define PR_SET_MM_START_BRK		6
>  # define PR_SET_MM_BRK			7
>
> +/* Set process seccomp filters */
> +#define PR_ATTACH_SECCOMP_FILTER	37
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 171ab66..3992bb6 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,10 +5,29 @@
>  #ifdef CONFIG_SECCOMP
>
>  #include <linux/thread_info.h>
> +#include <linux/types.h>
>  #include <asm/seccomp.h>
>
> +/* Valid values of seccomp_struct.mode */
> +#define SECCOMP_MODE_DISABLED	0 /* seccomp is not in use. */
> +#define SECCOMP_MODE_STRICT	1 /* uses hard-coded seccomp.c rules. */
> +#define SECCOMP_MODE_FILTER	2 /* system call access determined by filter. */
> +
> +struct seccomp_filter;
> +/**
> + * struct seccomp_struct - the state of a seccomp'ed process
> + *
> + * @mode:  indicates one of the valid values above for controlled
> + *         system calls available to a process.
> + * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
> + *          @filter must only be accessed from the context of current as there
> + *          is no guard.
> + */
>  struct seccomp_struct {
>  	int mode;
> +#ifdef CONFIG_SECCOMP_FILTER
> +	struct seccomp_filter *filter;
> +#endif

Why a separate config option for this? Just depend on SECCOMP and networking?

>  };
>
>  extern void __secure_computing(int);
> @@ -51,4 +70,48 @@ static inline int seccomp_mode(struct seccomp_struct *s)
>
>  #endif /* CONFIG_SECCOMP */
>
> +#ifdef CONFIG_SECCOMP_FILTER
> +
> +
> +extern long prctl_attach_seccomp_filter(char __user *);
> +
> +extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
> +extern void put_seccomp_filter(struct seccomp_filter *);
> +
> +extern int seccomp_test_filters(int);
> +extern void seccomp_filter_log_failure(int);
> +extern void seccomp_struct_fork(struct seccomp_struct *child,
> +				const struct seccomp_struct *parent);

Lousy function name, what is it doing?

> +
> +static inline void seccomp_struct_init_task(struct seccomp_struct *seccomp)
> +{
> +	seccomp->mode = SECCOMP_MODE_DISABLED;
> +	seccomp->filter = NULL;
> +}

This doesn't actually do anything, just get rid of it and/or adapt seccomp_struct_fork()?

And it seems wrong to do this at fork time considering that the mode is supposed
to keep the mode across forks, so it seems safer to update it to the correct
values immediately.

> +
> +/* No locking is needed here because the task_struct will
> + * have no parallel consumers.
> + */
> +static inline void seccomp_struct_free_task(struct seccomp_struct *seccomp)
> +{
> +	put_seccomp_filter(seccomp->filter);
> +	seccomp->filter = NULL;
> +}

How many functions do you need to free one object?

I'm sure you can get rid of a couple, starting with this one. Just use
put_seccomp_filter() directly, the NULL is useless as task_struct is
going away.

> +
> +#else  /* CONFIG_SECCOMP_FILTER */
> +
> +#include <linux/errno.h>
> +
> +struct seccomp_filter { };
> +/* Macros consume the unused dereference by the caller. */
> +#define seccomp_struct_init_task(_seccomp) do { } while (0);
> +#define seccomp_struct_fork(_tsk, _orig) do { } while (0);
> +#define seccomp_struct_free_task(_seccomp) do { } while (0);
> +
> +static inline long prctl_attach_seccomp_filter(char __user *a2)
> +{
> +	return -ENOSYS;
> +}
> +
> +#endif  /* CONFIG_SECCOMP_FILTER */
>  #endif /* _LINUX_SECCOMP_H */
> diff --git a/include/linux/seccomp_filter.h b/include/linux/seccomp_filter.h
> new file mode 100644
> index 0000000..3ecd641
> --- /dev/null
> +++ b/include/linux/seccomp_filter.h
> @@ -0,0 +1,79 @@
> +/*
> + * Secomp-based system call filtering data structures and definitions.
> + *
> + * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>

(Not my expertise, but I'm not sure if that makes legal sense.)

> + *
> + * This copyrighted material is made available to anyone wishing to use,
> + * modify, copy, or redistribute it subject to the terms and conditions
> + * of the GNU General Public License v.2.
> + *
> + */
> +
> +#ifndef __LINUX_SECCOMP_FILTER_H__
> +#define __LINUX_SECCOMP_FILTER_H__
> +
> +#include <asm/byteorder.h>
> +#include <linux/compiler.h>
> +#include <linux/types.h>
> +
> +/*
> + *	Keep the contents of this file similar to linux/filter.h:
> + *	  struct sock_filter and sock_fprog and versions.
> + *	Custom naming exists solely if divergence is ever needed.
> + */
> +
> +/*
> + * Current version of the filter code architecture.
> + */
> +#define SECCOMP_BPF_MAJOR_VERSION 1
> +#define SECCOMP_BPF_MINOR_VERSION 1

If it's a fork of the current networking code it should follow their version
numbering too. But these defines aren't used anywhere..?

> +
> +struct seccomp_filter_block {	/* Filter block */
> +	__u16	code;   /* Actual filter code */
> +	__u8	jt;	/* Jump true */
> +	__u8	jf;	/* Jump false */
> +	__u32	k;      /* Generic multiuse field */
> +};
> +
> +struct seccomp_fprog {	/* Required for SO_ATTACH_FILTER. */
> +	unsigned short		len;	/* Number of filter blocks */
> +	struct seccomp_filter_block __user *filter;
> +};
> +
> +/* Ensure the u32 ordering is consistent with platform byte order. */
> +#if defined(__LITTLE_ENDIAN)
> +#define SECCOMP_ENDIAN_SWAP(x, y) x, y
> +#elif defined(__BIG_ENDIAN)
> +#define SECCOMP_ENDIAN_SWAP(x, y) y, x
> +#else
> +#error edit for your odd arch byteorder.
> +#endif
> +
> +/* System call argument layout for the filter data. */
> +union seccomp_filter_arg {
> +	struct {
> +		__u32 SECCOMP_ENDIAN_SWAP(lo32, hi32);
> +	};
> +	__u64 u64;
> +};

Ugh. What was the advantage of doing this again? This looks like optimising
at the wrong level to me.

It doesn't make the ABI simpler, it doesn't make writing filters easier,
it doesn't make your code nicer, and I really doubt it makes it run
anything faster either. It's just obfuscating the code.

If you don't want to hide endianness then don't provide those lo32 and hi32
structure members and just make it a u64. That's at least very clear. But
swapping them around depending on endianness is not good, because it's only
confusing things.

But if you have to do it, then at least do:

#if defined(__LITTLE_ENDIAN)
union seccomp_filter_arg {
	struct {
		__u32 lo32;
		__u32 hi32;
	};
	__u64 u64;
};
#else
union seccomp_filter_arg {
	struct {
		__u32 hi32;
		__u32 lo32;
	};
	__u64 u64;
};
#endif

Instead of hiding what you're doing.

> +
> +/*
> + *	Expected data the BPF program will execute over.
> + *	Endianness will be arch specific, but the values will be
> + *	swapped, as above, to allow for consistent BPF programs.

So now what? You need platform dependent BPF programs depending on
the arch's endianness? Just because you want to stuff one 64-bit
longs directly into two 32-bit ints? That saves how much, two
instructions?

> + */
> +struct seccomp_filter_data {
> +	int syscall_nr;
> +	__u32 __reserved;
> +	union seccomp_filter_arg args[6];
> +};
> +
> +#undef SECCOMP_ENDIAN_SWAP
> +
> +/*
> + * Defined valid return values for the BPF program.
> + */
> +#define SECCOMP_BPF_ALLOW	0xFFFFFFFF
> +#define SECCOMP_BPF_DENY	0

This doesn't make sense. Please be consistent with C and swap it around.

> +
> +#endif /* __LINUX_SECCOMP_FILTER_H__ */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 2d9de86..fd81bac 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -78,6 +78,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>  obj-$(CONFIG_SECCOMP) += seccomp.o
> +obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 051f090..f312edb 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -34,6 +34,7 @@
>  #include <linux/cgroup.h>
>  #include <linux/security.h>
>  #include <linux/hugetlb.h>
> +#include <linux/seccomp.h>
>  #include <linux/swap.h>
>  #include <linux/syscalls.h>
>  #include <linux/jiffies.h>
> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>  	free_thread_info(tsk->stack);
>  	rt_mutex_debug_task_free(tsk);
>  	ftrace_graph_exit_task(tsk);
> +	seccomp_struct_free_task(&tsk->seccomp);
>  	free_task_struct(tsk);
>  }
>  EXPORT_SYMBOL(free_task);
> @@ -1093,6 +1095,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  		goto fork_out;
>
>  	ftrace_graph_init_task(p);
> +	seccomp_struct_init_task(&p->seccomp);

If you just do a get on the filter's krefs here then cleanup is done correctly
by free_task(). So just call seccomp_struct_fork() or get_seccomp_filter() here.

>
>  	rt_mutex_init_task(p);
>
> @@ -1376,6 +1379,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	if (clone_flags & CLONE_THREAD)
>  		threadgroup_change_end(current);
>  	perf_event_fork(p);
> +	seccomp_struct_fork(&p->seccomp, &current->seccomp);

This can be moved above, before the first error goto.

>
>  	trace_task_newtask(p, clone_flags);
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index e8d76c5..a045dd4 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -37,7 +37,7 @@ void __secure_computing(int this_syscall)
>  	int * syscall;
>
>  	switch (mode) {
> -	case 1:
> +	case SECCOMP_MODE_STRICT:
>  		syscall = mode1_syscalls;
>  #ifdef CONFIG_COMPAT
>  		if (is_compat_task())
> @@ -48,6 +48,14 @@ void __secure_computing(int this_syscall)
>  				return;
>  		} while (*++syscall);
>  		break;
> +#ifdef CONFIG_SECCOMP_FILTER
> +	case SECCOMP_MODE_FILTER:
> +		if (seccomp_test_filters(this_syscall) == 0)

Get rid of the '== 0' and have proper return values instead.

> +			return;
> +

No need for an empty line here.

> +		seccomp_filter_log_failure(this_syscall);
> +		break;
> +#endif
>  	default:
>  		BUG();
>  	}
> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
> new file mode 100644
> index 0000000..e57219e
> --- /dev/null
> +++ b/kernel/seccomp_filter.c
> @@ -0,0 +1,620 @@
> +/*
> + * linux/kernel/seccomp_filter.c
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + *
> + * Extends linux/kernel/seccomp.c to allow tasks to install system call
> + * filters using a Berkeley Packet Filter program which is executed over
> + * struct seccomp_filter_data.
> + */
> +
> +#include <asm/syscall.h>
> +
> +#include <linux/capability.h>
> +#include <linux/compat.h>
> +#include <linux/err.h>
> +#include <linux/errno.h>
> +#include <linux/rculist.h>
> +#include <linux/filter.h>
> +#include <linux/kallsyms.h>
> +#include <linux/kref.h>
> +#include <linux/module.h>
> +#include <linux/pid.h>
> +#include <linux/prctl.h>
> +#include <linux/ptrace.h>
> +#include <linux/ratelimit.h>
> +#include <linux/reciprocal_div.h>
> +#include <linux/regset.h>
> +#include <linux/seccomp.h>
> +#include <linux/seccomp_filter.h>
> +#include <linux/security.h>
> +#include <linux/seccomp.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/user.h>
> +
> +
> +/**
> + * struct seccomp_filter - container for seccomp BPF programs
> + *
> + * @usage: reference count to manage the object lifetime.
> + *         get/put helpers should be used when accessing an instance
> + *         outside of a lifetime-guarded section.  In general, this
> + *         is only needed for handling filters shared across tasks.
> + * @parent: pointer to the ancestor which this filter will be composed with.

I guess the code will be clearer than the comment. Parent task or filter?
Is it a reverse linked list or something? I'm confused.

> + * @insns: the BPF program instructions to evaluate
> + * @count: the number of instructions in the program.
> + *
> + * seccomp_filter objects should never be modified after being attached
> + * to a task_struct (other than @usage).
> + */
> +struct seccomp_filter {
> +	struct kref usage;
> +	struct seccomp_filter *parent;
> +	struct {
> +		uint32_t compat:1;
> +	} flags;

This flag isn't documented above.

> +	unsigned short count;  /* Instruction count */

Why a short?

> +	struct sock_filter insns[0];

Newer gcc supports insns[], but I don't know what the kernel convention is.

> +};
> +
> +/*
> + * struct seccomp_filter_metadata - BPF data wrapper
> + * @data: data accessible to the BPF program.
> + * @has_args: indicates that the args have been lazily populated.
> + *
> + * Used by seccomp_load_pointer.
> + */
> +struct seccomp_filter_metadata {
> +	struct seccomp_filter_data data;
> +	bool has_args;
> +};

Why use 'bool' instead of a nifty struct { uint32_t : 1 }?

It seems that has_args is something that should be stored elsewhere,
it doesn't seem to warrant its own structure.

> +
> +static unsigned int seccomp_run_filter(void *, uint32_t,
> +				       const struct sock_filter *);

It almosts fits, just put it on one line.

> +
> +/**
> + * seccomp_filter_alloc - allocates a new filter object
> + * @padding: size of the insns[0] array in bytes
> + *
> + * The @padding should be a multiple of
> + * sizeof(struct sock_filter).

It's not padding, why call it such? Why not 'count' or 'bpf_blocks'?
That would reduce the sizeof(struct sock_filter)s to one.

> + *
> + * Returns ERR_PTR on error or an allocated object.
> + */
> +static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
> +{
> +	struct seccomp_filter *f;
> +	unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
> +
> +	/* Drop oversized requests. */
> +	if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
> +		return ERR_PTR(-EINVAL);
> +
> +	/* Padding should always be in sock_filter increments. */
> +	if (padding % sizeof(struct sock_filter))
> +		return ERR_PTR(-EINVAL);

...then this isn't needed.

> +
> +	f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
> +	if (!f)
> +		return ERR_PTR(-ENOMEM);
> +	kref_init(&f->usage);
> +	f->count = bpf_blocks;
> +	return f;
> +}
> +
> +/**
> + * seccomp_filter_free - frees the allocated filter.
> + * @filter: NULL or live object to be completely destructed.
> + */
> +static void seccomp_filter_free(struct seccomp_filter *filter)
> +{
> +	if (!filter)
> +		return;
> +	put_seccomp_filter(filter->parent);

Why do put on the parent here?

This function is only called once below by __put_seccomp_filter, perhaps merge them?

> +	kfree(filter);
> +}
> +
> +static void __put_seccomp_filter(struct kref *kref)
> +{
> +	struct seccomp_filter *orig =
> +		container_of(kref, struct seccomp_filter, usage);
> +	seccomp_filter_free(orig);

I find it unexpected that __put_seccomp_filter() calls put_seccomp_filter(),
why have two versions then? I have the feeling all this can be streamlined
more.

> +}
> +
> +void seccomp_filter_log_failure(int syscall)
> +{
> +	pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
> +		current->comm, task_pid_nr(current), syscall,
> +		KSTK_EIP(current));
> +}
> +
> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */
> +void put_seccomp_filter(struct seccomp_filter *orig)
> +{
> +	if (!orig)
> +		return;
> +	kref_put(&orig->usage, __put_seccomp_filter);
> +}
> +
> +/* get_seccomp_filter - increments the reference count of @orig. */
> +struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
> +{
> +	if (!orig)
> +		return NULL;
> +	kref_get(&orig->usage);
> +	return orig;
> +}

This function should be merged with seccomp_struct_fork(), it's pretty much only
called there.

> +
> +#if BITS_PER_LONG == 32
> +static inline unsigned long *seccomp_filter_data_arg(
> +				struct seccomp_filter_data *data, int index)
> +{
> +	/* Avoid inconsistent hi contents. */
> +	data->args[index].hi32 = 0;
> +	return (unsigned long *) &(data->args[index].lo32);
> +}
> +#elif BITS_PER_LONG == 64
> +static inline unsigned long *seccomp_filter_data_arg(
> +				struct seccomp_filter_data *data, int index)
> +{
> +	return (unsigned long *) &(data->args[index].u64);

No need for any casts, is there? And this looks a bit untidy.


> +}
> +#else
> +#error Unknown BITS_PER_LONG.
> +#endif
> +
> +/**
> + * seccomp_load_pointer: checks and returns a pointer to the requested offset
> + * @buf: u8 array to index into
> + * @buflen: length of the @buf array
> + * @offset: offset to return data from
> + * @size: size of the data to retrieve at offset
> + * @unused: placeholder which net/core/filter.c uses for for temporary
> + *          storage.  Ideally, the two code paths can be merged.

The documentation doesn't matches the actual code?

> + *
> + * Returns a pointer to the BPF evaluator after checking the offset and size
> + * boundaries.
> + */
> +static inline void *seccomp_load_pointer(void *data, int offset, size_t size,
> +					 void *buffer)
> +{
> +	struct seccomp_filter_metadata *metadata = data;
> +	int arg;
> +	if (offset >= sizeof(metadata->data))
> +		goto fail;
> +	if (offset < 0)
> +		goto fail;

Perhaps use unsigned offset?

> +	if (size > sizeof(metadata->data) - offset)
> +		goto fail;
> +	if (metadata->has_args)
> +		goto pass;
> +	/* No argument data touched. */
> +	if (offset + size - 1 < offsetof(struct seccomp_filter_data, args))
> +		goto pass;
> +	for (arg = 0; arg < ARRAY_SIZE(metadata->data.args); ++arg)
> +		syscall_get_arguments(current, task_pt_regs(current), arg, 1,
> +			seccomp_filter_data_arg(&metadata->data, arg));
> +	metadata->has_args = true;

Wasn't the idea that arguments would be loaded on demand?

And as arguments are usually only checked once and not twice, caching the
values isn't really needed. So perhaps get rid of the (meta)data structure
and just have a switch that returns the right data directly depending on
the offset? It seems the over all code would be simpler that way.

> +pass:
> +	return ((__u8 *)(&metadata->data)) + offset;
> +fail:
> +	return NULL;
> +}
> +
> +/**
> + * seccomp_test_filters - tests 'current' against the given syscall
> + * @syscall: number of the system call to test
> + *
> + * Returns 0 on ok and non-zero on error/failure.
> + */
> +int seccomp_test_filters(int syscall)
> +{
> +	int ret = -EACCES;
> +	struct seccomp_filter *filter;
> +	struct seccomp_filter_metadata metadata;
> +
> +	filter = current->seccomp.filter; /* uses task ref */
> +	if (!filter)
> +		goto out;

Actually, shouldn't having no filter with SECCOMP_MODE_FILTER set be impossible?
It seems better to assure that instead of needlessly testing for it every time.

> +
> +	metadata.data.syscall_nr = syscall;
> +	metadata.has_args = false;
> +
> +#ifdef CONFIG_COMPAT
> +	if (filter->flags.compat != !!(is_compat_task()))
> +		goto out;
> +#endif
> +
> +	/* Only allow a system call if it is allowed in all ancestors. */
> +	ret = 0;
> +	for ( ; filter != NULL; filter = filter->parent) {
> +		/* Allowed if return value is SECCOMP_BPF_ALLOW */
> +		if (seccomp_run_filter(&metadata, sizeof(metadata.data),
> +					filter->insns) != SECCOMP_BPF_ALLOW)
> +			ret = -EACCES;

Why keep checking filters if one of them fails? Just return -EACCES?

> +	}
> +out:
> +	return ret;
> +}
> +
> +/**
> + * seccomp_attach_filter: Attaches a seccomp filter to current.
> + * @fprog: BPF program to install
> + *
> + * Context: User context only. This function may sleep on allocation and
> + *          operates on current. current must be attempting a system call
> + *          when this is called (usually prctl).
> + *
> + * This function may be called repeatedly to install additional filters.
> + * Every filter successfully installed will be evaluated (in reverse order)
> + * for each system call the thread makes.
> + *
> + * Returns 0 on success or an errno on failure.
> + */
> +long seccomp_attach_filter(struct sock_fprog *fprog)
> +{
> +	struct seccomp_filter *filter = NULL;
> +	/* Note, len is a short so overflow should be impossible. */
> +	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);

That's a crude way of doing it. The max is BPF_MAXINSNS and that is checked a
bit late by sk_chk_filter(). But perhaps while you're consolidating the code
with the networking filter code, you could change their code to check that
first before trying to allocate the given amount.

> +	long ret = -EPERM;
> +
> +	/* Allocate a new seccomp_filter */
> +	filter = seccomp_filter_alloc(fp_size);

Actually, your seccomp_filter_alloc() checks for BPF_MAXINSNS already, and if
you change that one as I recommended and you pass on fprog->len directly, no
overflow can happen.

> +	if (IS_ERR(filter)) {
> +		ret = PTR_ERR(filter);
> +		goto out;
> +	}
> +
> +	/* Copy the instructions from fprog. */
> +	ret = -EFAULT;
> +	if (copy_from_user(filter->insns, fprog->filter, fp_size))
> +		goto out;
> +
> +	/* Check the fprog */
> +	ret = sk_chk_filter(filter->insns, filter->count);
> +	if (ret)
> +		goto out;
> +
> +	/*
> +	 * If a process lacks CAP_SYS_ADMIN in its namespace, force
> +	 * this process and all descendents to run with no_new_privs.
> +	 * A privileged process will need to set this bit independently,
> +	 * if desired.
> +	 */
> +	if (security_capable_noaudit(current_cred(), current_user_ns(),
> +				     CAP_SYS_ADMIN) != 0)
> +		current->no_new_privs = 1;
> +
> +	/*
> +	 * If there is an existing filter, make it the parent
> +	 * and reuse the existing task-based ref.

You're not reusing the existing task-based ref, are you?

You can't reuse it anyway. Perhaps and old leftover comment?

> +	 */
> +	filter->parent = current->seccomp.filter;

I think this needs a more verbose comment, here or elsewhere explaining how
important it is that new filters are added to the start and old filters are
never removed, or else list corruption can happen. This way assures that
forward pointers are always valid, and that's the reason to use a singly
linked list instead of a standard doubly linked list.

> +#ifdef CONFIG_COMPAT
> +	/* Disallow changing system calling conventions after the fact. */
> +	filter->flags.compat = !!(is_compat_task());
> +
> +	if (filter->parent &&
> +	    filter->parent->flags.compat != filter->flags.compat)
> +		return -EACCES;

This shouldn't be possible and checking it here is too late if it somehow
happened anyway. (And you're leaking memory because current->seccomp.filter
isn't updated yet.)

> +#endif
> +
> +	/*
> +	 * Double claim the new filter so we can release it below simplifying
> +	 * the error paths earlier.
> +	 */

No it doesn't.

> +	ret = 0;
> +	get_seccomp_filter(filter);
> +	current->seccomp.filter = filter;
> +	/* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
> +	if (current->seccomp.mode == SECCOMP_MODE_DISABLED) {
> +		current->seccomp.mode = SECCOMP_MODE_FILTER;
> +		set_thread_flag(TIF_SECCOMP);
> +	}
> +

Just add a return 0; here, much simpler and removes 6 lines.

> +out:
> +	put_seccomp_filter(filter);  /* for get or task, on err */
> +	return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +/* This should be kept in sync with net/compat.c which changes infrequently. */
> +struct compat_sock_fprog {
> +	u16 len;
> +	compat_uptr_t filter;	/* struct sock_filter */
> +};
> +
> +static long compat_attach_seccomp_filter(char __user *optval)
> +{
> +	struct compat_sock_fprog __user *fprog32 =
> +		(struct compat_sock_fprog __user *)optval;
> +	struct sock_fprog __user *kfprog =
> +		compat_alloc_user_space(sizeof(struct sock_fprog));
> +	compat_uptr_t ptr;
> +	u16 len;
> +
> +	if (!access_ok(VERIFY_READ, fprog32, sizeof(*fprog32)) ||
> +	    !access_ok(VERIFY_WRITE, kfprog, sizeof(struct sock_fprog)) ||
> +	    __get_user(len, &fprog32->len) ||
> +	    __get_user(ptr, &fprog32->filter) ||
> +	    __put_user(len, &kfprog->len) ||
> +	    __put_user(compat_ptr(ptr), &kfprog->filter))
> +		return -EFAULT;
> +
> +	return seccomp_attach_filter(kfprog);
> +}
> +#endif
> +
> +long prctl_attach_seccomp_filter(char __user *user_filter)
> +{
> +	struct sock_fprog fprog;
> +	long ret = -EINVAL;
> +	ret = -EFAULT;

Why not set it to -EFAULT directly?

> +	if (!user_filter)
> +		goto out;
> +
> +#ifdef CONFIG_COMPAT
> +	if (is_compat_task())
> +		return compat_attach_seccomp_filter(user_filter);
> +#endif
> +
> +	if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
> +		goto out;
> +
> +	ret = seccomp_attach_filter(&fprog);
> +out:
> +	return ret;
> +}
> +
> +/**
> + * seccomp_struct_fork: manages inheritance on fork
> + * @child: forkee's seccomp_struct
> + * @parent: forker's seccomp_struct
> + *
> + * Ensures that @child inherits seccomp mode and state iff

'if'

> + * seccomp filtering is in use.
> + */
> +void seccomp_struct_fork(struct seccomp_struct *child,
> +			 const struct seccomp_struct *parent)
> +{
> +	child->mode = parent->mode;
> +	if (parent->mode != SECCOMP_MODE_FILTER)
> +		return;
> +	child->filter = get_seccomp_filter(parent->filter);

I think how you manage the life times of the filters is very clever and that it
needs some commenting because it is more subtle than it seems at first glance.
The rules seem to be:

- The list is sorted from new to old.

- Never remove old filters.

- The life-time of a group of filters (all filters set by one process between
  fork calls) is managed by the one at the head of the group.

This forms a tree with the head filters at forking points with the filters
freed from the leave direction.

> +}
> +
> +/**
> + * seccomp_run_filter - evaluate BPF
> + *	@buf: opaque buffer to execute the filter over
> + *	@buflen: length of the buffer

Out of date.

> + *	@fentry: filter to apply
> + *
> + * Decode and apply filter instructions to the buffer.  Return length to
> + * keep, 0 for none.

This doesn't make sense in the context of system calls. Just return 0 for
allowing the system call and -1 to deny it.

> @buf is a seccomp_filter_metadata we are filtering,
> + * @filter is the array of filter instructions.  Because all jumps are
> + * guaranteed to be before last instruction, and last instruction
> + * guaranteed to be a RET, we dont need to check flen.
> + *
> + * See core/net/filter.c as this is nearly an exact copy.
> + * At some point, it would be nice to merge them to take advantage of
> + * optimizations (like JIT).
> + */
> +static unsigned int seccomp_run_filter(void *data, uint32_t datalen,
> +			  const struct sock_filter *fentry)
> +{
> +	const void *ptr;
> +	u32 A = 0;			/* Accumulator */
> +	u32 X = 0;			/* Index Register */
> +	u32 mem[BPF_MEMWORDS];		/* Scratch Memory Store */
> +	u32 tmp;
> +	int k;
> +
> +	/*
> +	 * Process array of filter instructions.
> +	 */
> +	for (;; fentry++) {
> +#if defined(CONFIG_X86_32)
> +#define	K (fentry->k)
> +#else
> +		const u32 K = fentry->k;
> +#endif
> +
> +		switch (fentry->code) {
> +		case BPF_S_ALU_ADD_X:
> +			A += X;
> +			continue;
> +		case BPF_S_ALU_ADD_K:
> +			A += K;
> +			continue;
> +		case BPF_S_ALU_SUB_X:
> +			A -= X;
> +			continue;
> +		case BPF_S_ALU_SUB_K:
> +			A -= K;
> +			continue;
> +		case BPF_S_ALU_MUL_X:
> +			A *= X;
> +			continue;
> +		case BPF_S_ALU_MUL_K:
> +			A *= K;
> +			continue;
> +		case BPF_S_ALU_DIV_X:
> +			if (X == 0)
> +				return 0;
> +			A /= X;
> +			continue;
> +		case BPF_S_ALU_DIV_K:
> +			A = reciprocal_divide(A, K);
> +			continue;
> +		case BPF_S_ALU_AND_X:
> +			A &= X;
> +			continue;
> +		case BPF_S_ALU_AND_K:
> +			A &= K;
> +			continue;
> +		case BPF_S_ALU_OR_X:
> +			A |= X;
> +			continue;
> +		case BPF_S_ALU_OR_K:
> +			A |= K;
> +			continue;
> +		case BPF_S_ALU_LSH_X:
> +			A <<= X;
> +			continue;
> +		case BPF_S_ALU_LSH_K:
> +			A <<= K;
> +			continue;
> +		case BPF_S_ALU_RSH_X:
> +			A >>= X;
> +			continue;
> +		case BPF_S_ALU_RSH_K:
> +			A >>= K;
> +			continue;
> +		case BPF_S_ALU_NEG:
> +			A = -A;
> +			continue;
> +		case BPF_S_JMP_JA:
> +			fentry += K;
> +			continue;
> +		case BPF_S_JMP_JGT_K:
> +			fentry += (A > K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JGE_K:
> +			fentry += (A >= K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JEQ_K:
> +			fentry += (A == K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JSET_K:
> +			fentry += (A & K) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JGT_X:
> +			fentry += (A > X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JGE_X:
> +			fentry += (A >= X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JEQ_X:
> +			fentry += (A == X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_JMP_JSET_X:
> +			fentry += (A & X) ? fentry->jt : fentry->jf;
> +			continue;
> +		case BPF_S_LD_W_ABS:
> +			k = K;
> +load_w:
> +			ptr = seccomp_load_pointer(data, k, 4, &tmp);
> +			if (ptr != NULL) {
> +				/*
> +				 * Assume load_pointer did any byte swapping.
> +				 */
> +				A = *(const u32 *)ptr;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_H_ABS:
> +			k = K;
> +load_h:
> +			ptr = seccomp_load_pointer(data, k, 2, &tmp);
> +			if (ptr != NULL) {
> +				A = *(const u16 *)ptr;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_B_ABS:
> +			k = K;
> +load_b:
> +			ptr = seccomp_load_pointer(data, k, 1, &tmp);
> +			if (ptr != NULL) {
> +				A = *(const u8 *)ptr;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_W_LEN:
> +			A = datalen;
> +			continue;
> +		case BPF_S_LDX_W_LEN:
> +			X = datalen;
> +			continue;
> +		case BPF_S_LD_W_IND:
> +			k = X + K;
> +			goto load_w;
> +		case BPF_S_LD_H_IND:
> +			k = X + K;
> +			goto load_h;
> +		case BPF_S_LD_B_IND:
> +			k = X + K;
> +			goto load_b;
> +		case BPF_S_LDX_B_MSH:
> +			ptr = seccomp_load_pointer(data, K, 1, &tmp);
> +			if (ptr != NULL) {
> +				X = (*(u8 *)ptr & 0xf) << 2;
> +				continue;
> +			}
> +			return 0;
> +		case BPF_S_LD_IMM:
> +			A = K;
> +			continue;
> +		case BPF_S_LDX_IMM:
> +			X = K;
> +			continue;
> +		case BPF_S_LD_MEM:
> +			A = mem[K];
> +			continue;
> +		case BPF_S_LDX_MEM:
> +			X = mem[K];
> +			continue;
> +		case BPF_S_MISC_TAX:
> +			X = A;
> +			continue;
> +		case BPF_S_MISC_TXA:
> +			A = X;
> +			continue;
> +		case BPF_S_RET_K:
> +			return K;
> +		case BPF_S_RET_A:
> +			return A;
> +		case BPF_S_ST:
> +			mem[K] = A;
> +			continue;
> +		case BPF_S_STX:
> +			mem[K] = X;
> +			continue;
> +		case BPF_S_ANC_PROTOCOL:
> +		case BPF_S_ANC_PKTTYPE:
> +		case BPF_S_ANC_IFINDEX:
> +		case BPF_S_ANC_MARK:
> +		case BPF_S_ANC_QUEUE:
> +		case BPF_S_ANC_HATYPE:
> +		case BPF_S_ANC_RXHASH:
> +		case BPF_S_ANC_CPU:
> +		case BPF_S_ANC_NLATTR:
> +		case BPF_S_ANC_NLATTR_NEST:
> +			continue;
> +		default:
> +			WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
> +				       fentry->code, fentry->jt,
> +				       fentry->jf, fentry->k);
> +			return 0;
> +		}
> +	}
> +
> +	return 0;
> +}

All in all it seems fairly easy to consolidate this with the networking code.
The only big change would be that the load_pointer() function pointer needs to
be passed, that way the same code can be used for both versions. The filter
checking code needs be expanded to check for network-only ops and reject the
filter if it's a seccomp one.

> diff --git a/kernel/sys.c b/kernel/sys.c
> index 4070153..8e43f70 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1901,6 +1901,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned
> long, arg3,
>  		case PR_SET_SECCOMP:
>  			error = prctl_set_seccomp(arg2);
>  			break;
> +		case PR_ATTACH_SECCOMP_FILTER:
> +			error = prctl_attach_seccomp_filter((char __user *)
> +								arg2);

Please one line.

Actually, wouldn't it make more sense to use PR_SET_SECCOMP and check arg2 to
see if it's filtering, and then get the filter itself from arg3 or something?

Or just stop calling it seccomp of course. PR_ATTACH_SYSCALL_FILTER?

> +			break;
>  		case PR_GET_TSC:
>  			error = GET_TSC_CTL(arg2);
>  			break;
> diff --git a/security/Kconfig b/security/Kconfig
> index 51bd5a0..e1ffed8 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -84,6 +84,26 @@ config SECURITY_DMESG_RESTRICT
>
>  	  If you are unsure how to answer this question, answer N.
>
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	select SECCOMP
> +	help
> +	  This option provide support for limiting the accessibility of
> +	  systems calls at a task-level using a dynamically defined policy.
> +
> +	  System call filtering policy is expressed by the user using
> +	  a Berkeley Packet Filter program.  The program is attached using
> +	  prctl(2).  For every system call the task makes, its number,
> +	  arguments, and other metadata will be evaluated by the attached
> +	  filter program.  The result determines if the system call may
> +	  may proceed or if the task should be terminated.
> +
> +	  This behavior is meant to aid security-conscious software in
> +	  its ability to minimize the risk of running potentially
> +	  risky code.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config SECURITY
>  	bool "Enable different security models"
>  	depends on SYSFS
> --
> 1.7.5.4
>
>

Greetings,

Indan



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-29  4:39   ` Indan Zupancic
@ 2012-01-29 20:27     ` Will Drewry
  2012-01-30  2:37       ` Indan Zupancic
  0 siblings, 1 reply; 26+ messages in thread
From: Will Drewry @ 2012-01-29 20:27 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

On Sat, Jan 28, 2012 at 10:39 PM, Indan Zupancic <indan@nul.nu> wrote:
> Hello,
>
> Feedback below. I know you send out v6, but I'm not going to play
> eternal catch-up, so here it is.

The change is minimal - so thanks for the comments on any version!

> On Sat, January 28, 2012 00:24, Will Drewry wrote:
>> [This patch depends on luto@mit.edu's no_new_privs patch:
>>  https://lkml.org/lkml/2012/1/12/446
>> ]
>>
>> This patch adds support for seccomp mode 2.  This mode enables dynamic
>> enforcement of system call filtering policy in the kernel as specified
>> by a userland task.  The policy is expressed in terms of a Berkeley
>> Packet Filter program, as is used for userland-exposed socket filtering.
>> Instead of network data, the BPF program is evaluated over struct
>> seccomp_filter_data at the time of the system call.
>>
>> A filter program may be installed by a userland task by calling
>>   prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
>> where fprog is of type struct sock_fprog.
>>
>> If the first filter program allows subsequent prctl(2) calls, then
>> additional filter programs may be attached.  All attached programs
>> must be evaluated before a system call will be allowed to proceed.
>>
>> To avoid CONFIG_COMPAT related landmines, once a filter program is
>> installed using specific is_compat_task() value, it is not allowed to
>> make system calls using the alternate entry point.
>>
>> Filter programs will be inherited across fork/clone and execve.
>> However, if the task attaching the filter is unprivileged
>> (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task.  This
>> ensures that unprivileged tasks cannot attach filters that affect
>> privileged tasks (e.g., setuid binary).
>>
>> There are a number of benefits to this approach. A few of which are
>> as follows:
>> - BPF has been exposed to userland for a long time.
>> - Userland already knows its ABI: system call numbers and desired
>>   arguments
>> - No time-of-check-time-of-use vulnerable data accesses are possible.
>> - system call arguments are loaded on demand only to minimize copying
>>   required for system call number-only policy decisions.
>>
>> This patch includes its own BPF evaluator, but relies on the
>> net/core/filter.c BPF checking code.  It is possible to share
>> evaluators, but the performance sensitive nature of the network
>> filtering path makes it an iterative optimization which (I think :) can
>> be tackled separately via separate patchsets. (And at some point sharing
>> BPF JIT code!)
>>
>>  v5: - uses syscall_get_arguments
>>        (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
>>      - uses union-based arg storage with hi/lo struct to
>>        handle endianness.  Compromises between the two alternate
>>        proposals to minimize extra arg shuffling and account for
>>        endianness assuming userspace uses offsetof().
>
> Why would user space use offsetof()? Do you mean when assembling the BPF
> code? So this basically means 32-bit and 64-bit filters are different,
> even when they could be the same. Hm.

Not at all.  Take a peek at the union seccomp_filter_data.  What it
does is allows reference by hi32/lo32 be endian-accurate, but they are
union'd with u64.  So if you want portable definitions, you use
offsetof(), but if you know your endiannness and don't care about
cross-compiling, then you can just reference by the offset of the u64
and index into it as your endianness allows.

This avoids a pointless extra copy to split u32 and u64 without
leaving endianness pain entirely for userspace. I think it's a pretty
sane compromise.  32-bit programs will only ever care about lo32.
64-bit will need to check both hi32 and lo32 or manually index into
the u64 member.

>
>>        (mcgrathr@chromium.org, indan@nul.nu)
>>      - update Kconfig description
>>      - add include/seccomp_filter.h and add its installation
>>      - (naive) on-demand syscall argument loading
>>      - drop seccomp_t (eparis@redhat.com)
>>      - adds proper compat prctl call copying
>>  v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
>>      - now uses current->no_new_privs
>>          (luto@mit.edu,torvalds@linux-foundation.com)
>>      - assign names to seccomp modes (rdunlap@xenotime.net)
>>      - fix style issues (rdunlap@xenotime.net)
>>      - reworded Kconfig entry (rdunlap@xenotime.net)
>>  v3: - macros to inline (oleg@redhat.com)
>>      - init_task behavior fixed (oleg@redhat.com)
>>      - drop creator entry and extra NULL check (oleg@redhat.com)
>>      - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
>>      - adds tentative use of "always_unprivileged" as per
>>        torvalds@linux-foundation.org and luto@mit.edu
>>  v2: - (patch 2 only)
>>
>> Signed-off-by: Will Drewry <wad@chromium.org>
>> ---
>>  include/linux/Kbuild           |    1 +
>>  include/linux/prctl.h          |    3 +
>>  include/linux/seccomp.h        |   63 ++++
>>  include/linux/seccomp_filter.h |   79 +++++
>
> Just stuff it into seccomp.h?

Most of it is in there, but by defining a special data set for running
filters over, I needed to export the definition to userspace.
seccomp_filter.h contains only ABI relevant declarations.  The rest of
the in-kernel stuff is in seccomp.h (which is not copied during a
header installation).

> If this is added, then this will be seccomp,
> and the old mode can be seen as a standard, very restrictive filter.

This is totally true.  I originally did not pursue this because I
would have needed 1 hardcoded BPF filter per arch.  With the new
syntax, I believe it will be possible to define a 32-bit filter and a
64-bit filter (and a compat filter) in one central place. I will start
on that for v7.  I hope it is doable now, because it is nice to keep
it all locked to one path.

The only consideration is that it might make sense to do that after
seccomp_filter has been merged (in my dreams :) and given time to
bake.  But it'll be a good exercise to do upfront, so I'll go ahead.
If it seems too hasty, I can always respin it without the merging.

>>  kernel/Makefile                |    1 +
>>  kernel/fork.c                  |    4 +
>>  kernel/seccomp.c               |   10 +-
>>  kernel/seccomp_filter.c        |  620 ++++++++++++++++++++++++++++++++++++++++
>
> Same here. There isn't much to seccomp.c, why not add the filter code there?
>
> If you insist on keeping it separate, then why call it seccomp at all?
> Just call it syscall_filter or something.

Works for me.

>>  kernel/sys.c                   |    4 +
>>  security/Kconfig               |   20 ++
>>  10 files changed, 804 insertions(+), 1 deletions(-)
>
> What is the difference in kernel binary size with and without this?
> If it's too big I think the BPF filtering code should be shared with
> the networking version from the start.

The network filtering code adds one function that is a switch
statement over the possible valid values.  This doesn't duplicate the
checking code.

> Actually, that is something that is wanted anyway, so why not use the
> networking code from the start? That should reduce the code amount a
> lot. So I think if this gets merged the consolidated version should be
> merged. Only downside would be that networking support is needed for
> this feature, but without networking there shouldn't be much need for
> this feature anyway.

The reason that I am not merging upfront is because the load_pointer
and byte swapping behavior is hard-coded in the current versions.  I
believe it will be possible to add a layer of abstraction that doesn't
induce additional overhead, but I didn't want to force that first.  If
that will really be a blocker, then I will go ahead and do that first.
 I was just worried that the potential performance impact discussions
could lead this patch to a dead end before it got a chance.


>>  create mode 100644 include/linux/seccomp_filter.h
>>  create mode 100644 kernel/seccomp_filter.c
>>
>> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
>> index c94e717..5659454 100644
>> --- a/include/linux/Kbuild
>> +++ b/include/linux/Kbuild
>> @@ -330,6 +330,7 @@ header-y += scc.h
>>  header-y += sched.h
>>  header-y += screen_info.h
>>  header-y += sdla.h
>> +header-y += seccomp_filter.h
>>  header-y += securebits.h
>>  header-y += selinux_netlink.h
>>  header-y += sem.h
>> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
>> index 7ddc7f1..b8c4beb 100644
>> --- a/include/linux/prctl.h
>> +++ b/include/linux/prctl.h
>> @@ -114,4 +114,7 @@
>>  # define PR_SET_MM_START_BRK         6
>>  # define PR_SET_MM_BRK                       7
>>
>> +/* Set process seccomp filters */
>> +#define PR_ATTACH_SECCOMP_FILTER     37
>> +
>>  #endif /* _LINUX_PRCTL_H */
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index 171ab66..3992bb6 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -5,10 +5,29 @@
>>  #ifdef CONFIG_SECCOMP
>>
>>  #include <linux/thread_info.h>
>> +#include <linux/types.h>
>>  #include <asm/seccomp.h>
>>
>> +/* Valid values of seccomp_struct.mode */
>> +#define SECCOMP_MODE_DISABLED        0 /* seccomp is not in use. */
>> +#define SECCOMP_MODE_STRICT  1 /* uses hard-coded seccomp.c rules. */
>> +#define SECCOMP_MODE_FILTER  2 /* system call access determined by filter. */
>> +
>> +struct seccomp_filter;
>> +/**
>> + * struct seccomp_struct - the state of a seccomp'ed process
>> + *
>> + * @mode:  indicates one of the valid values above for controlled
>> + *         system calls available to a process.
>> + * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
>> + *          @filter must only be accessed from the context of current as there
>> + *          is no guard.
>> + */
>>  struct seccomp_struct {
>>       int mode;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     struct seccomp_filter *filter;
>> +#endif
>
> Why a separate config option for this? Just depend on SECCOMP and networking?

Right now it requires asm/syscall.h which CONFIG_SECCOMP doesn't.
Some arches still haven't implemented that file so I thought it would
make sense to keep it separate.

Would it make sense to start with a config option then merge later or
should it be done upfront?

>>  };
>>
>>  extern void __secure_computing(int);
>> @@ -51,4 +70,48 @@ static inline int seccomp_mode(struct seccomp_struct *s)
>>
>>  #endif /* CONFIG_SECCOMP */
>>
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +
>> +
>> +extern long prctl_attach_seccomp_filter(char __user *);
>> +
>> +extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
>> +extern void put_seccomp_filter(struct seccomp_filter *);
>> +
>> +extern int seccomp_test_filters(int);
>> +extern void seccomp_filter_log_failure(int);
>> +extern void seccomp_struct_fork(struct seccomp_struct *child,
>> +                             const struct seccomp_struct *parent);
>
> Lousy function name, what is it doing?

Now it is called seccomp_fork() and it is called by fork when a task
fork()s :)  Naming suggestions welcome!


>> +
>> +static inline void seccomp_struct_init_task(struct seccomp_struct *seccomp)
>> +{
>> +     seccomp->mode = SECCOMP_MODE_DISABLED;
>> +     seccomp->filter = NULL;
>> +}
>
> This doesn't actually do anything, just get rid of it and/or adapt seccomp_struct_fork()?
>
> And it seems wrong to do this at fork time considering that the mode is supposed
> to keep the mode across forks, so it seems safer to update it to the correct
> values immediately.

True - mode and filter should already be zeroed.  I'll remove it. It
persists from when I had a mutex in the older patch series.

>> +
>> +/* No locking is needed here because the task_struct will
>> + * have no parallel consumers.
>> + */
>> +static inline void seccomp_struct_free_task(struct seccomp_struct *seccomp)
>> +{
>> +     put_seccomp_filter(seccomp->filter);
>> +     seccomp->filter = NULL;
>> +}
>
> How many functions do you need to free one object?

At least three :)

> I'm sure you can get rid of a couple, starting with this one. Just use
> put_seccomp_filter() directly, the NULL is useless as task_struct is
> going away.

Cool - I'll do that.

>> +
>> +#else  /* CONFIG_SECCOMP_FILTER */
>> +
>> +#include <linux/errno.h>
>> +
>> +struct seccomp_filter { };
>> +/* Macros consume the unused dereference by the caller. */
>> +#define seccomp_struct_init_task(_seccomp) do { } while (0);
>> +#define seccomp_struct_fork(_tsk, _orig) do { } while (0);
>> +#define seccomp_struct_free_task(_seccomp) do { } while (0);
>> +
>> +static inline long prctl_attach_seccomp_filter(char __user *a2)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +#endif  /* CONFIG_SECCOMP_FILTER */
>>  #endif /* _LINUX_SECCOMP_H */
>> diff --git a/include/linux/seccomp_filter.h b/include/linux/seccomp_filter.h
>> new file mode 100644
>> index 0000000..3ecd641
>> --- /dev/null
>> +++ b/include/linux/seccomp_filter.h
>> @@ -0,0 +1,79 @@
>> +/*
>> + * Secomp-based system call filtering data structures and definitions.
>> + *
>> + * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
>
> (Not my expertise, but I'm not sure if that makes legal sense.)

Weird, eh :)  That's what my work falls under when I do it on paid
time. If it'd be helpful, I can add an Author: line.  If it's too
weird still, I believe I can use Google Inc. instead. But, IANAL, etc.

>> + *
>> + * This copyrighted material is made available to anyone wishing to use,
>> + * modify, copy, or redistribute it subject to the terms and conditions
>> + * of the GNU General Public License v.2.
>> + *
>> + */
>> +
>> +#ifndef __LINUX_SECCOMP_FILTER_H__
>> +#define __LINUX_SECCOMP_FILTER_H__
>> +
>> +#include <asm/byteorder.h>
>> +#include <linux/compiler.h>
>> +#include <linux/types.h>
>> +
>> +/*
>> + *   Keep the contents of this file similar to linux/filter.h:
>> + *     struct sock_filter and sock_fprog and versions.
>> + *   Custom naming exists solely if divergence is ever needed.
>> + */
>> +
>> +/*
>> + * Current version of the filter code architecture.
>> + */
>> +#define SECCOMP_BPF_MAJOR_VERSION 1
>> +#define SECCOMP_BPF_MINOR_VERSION 1
>
> If it's a fork of the current networking code it should follow their version
> numbering too. But these defines aren't used anywhere..?

Neither are the networking code defines :)  They are exported to
userspace but I don't believe they've ever changed.

>> +
>> +struct seccomp_filter_block {        /* Filter block */
>> +     __u16   code;   /* Actual filter code */
>> +     __u8    jt;     /* Jump true */
>> +     __u8    jf;     /* Jump false */
>> +     __u32   k;      /* Generic multiuse field */
>> +};
>> +
>> +struct seccomp_fprog {       /* Required for SO_ATTACH_FILTER. */
>> +     unsigned short          len;    /* Number of filter blocks */
>> +     struct seccomp_filter_block __user *filter;
>> +};
>> +
>> +/* Ensure the u32 ordering is consistent with platform byte order. */
>> +#if defined(__LITTLE_ENDIAN)
>> +#define SECCOMP_ENDIAN_SWAP(x, y) x, y
>> +#elif defined(__BIG_ENDIAN)
>> +#define SECCOMP_ENDIAN_SWAP(x, y) y, x
>> +#else
>> +#error edit for your odd arch byteorder.
>> +#endif
>> +
>> +/* System call argument layout for the filter data. */
>> +union seccomp_filter_arg {
>> +     struct {
>> +             __u32 SECCOMP_ENDIAN_SWAP(lo32, hi32);
>> +     };
>> +     __u64 u64;
>> +};
>
> Ugh. What was the advantage of doing this again? This looks like optimising
> at the wrong level to me.

I was optimizing for competing opinions it actually working.  If doing
it your way will make people happy, then that's fine with me:

struct {
  __u32 lo32[6];
  __u32 hi32[6];
}

It just means two copies for every system call argument instead of
one, but who knows, maybe the compiler will magic it away.
Regardless, any argument access will always be slightly slower than
just syscall number, so it doesn't hurt if the extra copy doesn't go
away.

> it doesn't make your code nicer, and I really doubt it makes it run
> anything faster either. It's just obfuscating the code.

It makes the filters marginally easier, but eh. You can write a 32-bit
filter that checks lo32 and you know it will index into the u64 at the
correct location regardless of endianness.  However, since it is all
being done in the header anyway, it's really just being done in
userspace.  In the 32-bit kernel code path, I just ensure that hi32 is
zeroed out.

I just want an abi that is functional across platforms that people are
happy enough to merge.  The one I posted is fine with me and so is
your proposal.  So if no one else has strong opinions, I'll just use
yours :)


> If you don't want to hide endianness then don't provide those lo32 and hi32
> structure members and just make it a u64. That's at least very clear. But
> swapping them around depending on endianness is not good, because it's only
> confusing things.
>
> But if you have to do it, then at least do:
>
> #if defined(__LITTLE_ENDIAN)
> union seccomp_filter_arg {
>        struct {
>                __u32 lo32;
>                __u32 hi32;
>        };
>        __u64 u64;
> };
> #else
> union seccomp_filter_arg {
>        struct {
>                __u32 hi32;
>                __u32 lo32;
>        };
>        __u64 u64;
> };
> #endif
>
> Instead of hiding what you're doing.

I copied the aio_abi.h file. I'm happy to do it more explicitly, but
since no one appears to have a strong opinion and your proposal would
allow for seamless addition of a mythical 128-bit argument someday, it
seems like its the way to go.

>> +
>> +/*
>> + *   Expected data the BPF program will execute over.
>> + *   Endianness will be arch specific, but the values will be
>> + *   swapped, as above, to allow for consistent BPF programs.
>
> So now what? You need platform dependent BPF programs depending on
> the arch's endianness? Just because you want to stuff one 64-bit
> longs directly into two 32-bit ints? That saves how much, two
> instructions?

I was also attempting to compromise between aesthetics, but unless I
hear any continued dissent against your proposal, I'll just go with
it.

>> + */
>> +struct seccomp_filter_data {
>> +     int syscall_nr;
>> +     __u32 __reserved;
>> +     union seccomp_filter_arg args[6];
>> +};
>> +
>> +#undef SECCOMP_ENDIAN_SWAP
>> +
>> +/*
>> + * Defined valid return values for the BPF program.
>> + */
>> +#define SECCOMP_BPF_ALLOW    0xFFFFFFFF
>> +#define SECCOMP_BPF_DENY     0
>
> This doesn't make sense. Please be consistent with C and swap it around.

Nope.  The bpf engine returns non-zero for success and 0 for failure.
If I plan to share code with the networking stack, then 0 will always
be the failure mode.  It indicates the amount of the packet to accept.
 If swapping it around becomes mandatory, then sharing code with the
network implementation will be even harder.

>> +
>> +#endif /* __LINUX_SECCOMP_FILTER_H__ */
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index 2d9de86..fd81bac 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -78,6 +78,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>>  obj-$(CONFIG_SECCOMP) += seccomp.o
>> +obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
>>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 051f090..f312edb 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -34,6 +34,7 @@
>>  #include <linux/cgroup.h>
>>  #include <linux/security.h>
>>  #include <linux/hugetlb.h>
>> +#include <linux/seccomp.h>
>>  #include <linux/swap.h>
>>  #include <linux/syscalls.h>
>>  #include <linux/jiffies.h>
>> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>>       free_thread_info(tsk->stack);
>>       rt_mutex_debug_task_free(tsk);
>>       ftrace_graph_exit_task(tsk);
>> +     seccomp_struct_free_task(&tsk->seccomp);
>>       free_task_struct(tsk);
>>  }
>>  EXPORT_SYMBOL(free_task);
>> @@ -1093,6 +1095,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>               goto fork_out;
>>
>>       ftrace_graph_init_task(p);
>> +     seccomp_struct_init_task(&p->seccomp);
>
> If you just do a get on the filter's krefs here then cleanup is done correctly
> by free_task(). So just call seccomp_struct_fork() or get_seccomp_filter() here.

Since init_task just NULLs it all out, then later seccomp_fork() does
the get from the parent, I'm not sure I need this function at all any
more.  Am I misreading?

>>
>>       rt_mutex_init_task(p);
>>
>> @@ -1376,6 +1379,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>       if (clone_flags & CLONE_THREAD)
>>               threadgroup_change_end(current);
>>       perf_event_fork(p);
>> +     seccomp_struct_fork(&p->seccomp, &current->seccomp);
>
> This can be moved above, before the first error goto.

As in where seccomp_struct_init_task is or somewhere else?  (Where
seccomp_init_task is would make sense.)

>>
>>       trace_task_newtask(p, clone_flags);
>>
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index e8d76c5..a045dd4 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -37,7 +37,7 @@ void __secure_computing(int this_syscall)
>>       int * syscall;
>>
>>       switch (mode) {
>> -     case 1:
>> +     case SECCOMP_MODE_STRICT:
>>               syscall = mode1_syscalls;
>>  #ifdef CONFIG_COMPAT
>>               if (is_compat_task())
>> @@ -48,6 +48,14 @@ void __secure_computing(int this_syscall)
>>                               return;
>>               } while (*++syscall);
>>               break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case SECCOMP_MODE_FILTER:
>> +             if (seccomp_test_filters(this_syscall) == 0)
>
> Get rid of the '== 0' and have proper return values instead.

0 is success/allow as normal here.  I can change it to

  if (!seccomp_test_filters(this_syscall))
    return;

if that'd read better?

>> +                     return;
>> +
>
> No need for an empty line here.

I'll drop it.

>> +             seccomp_filter_log_failure(this_syscall);
>> +             break;
>> +#endif
>>       default:
>>               BUG();
>>       }
>> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
>> new file mode 100644
>> index 0000000..e57219e
>> --- /dev/null
>> +++ b/kernel/seccomp_filter.c
>> @@ -0,0 +1,620 @@
>> +/*
>> + * linux/kernel/seccomp_filter.c
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>> + *
>> + * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> + *
>> + * Extends linux/kernel/seccomp.c to allow tasks to install system call
>> + * filters using a Berkeley Packet Filter program which is executed over
>> + * struct seccomp_filter_data.
>> + */
>> +
>> +#include <asm/syscall.h>
>> +
>> +#include <linux/capability.h>
>> +#include <linux/compat.h>
>> +#include <linux/err.h>
>> +#include <linux/errno.h>
>> +#include <linux/rculist.h>
>> +#include <linux/filter.h>
>> +#include <linux/kallsyms.h>
>> +#include <linux/kref.h>
>> +#include <linux/module.h>
>> +#include <linux/pid.h>
>> +#include <linux/prctl.h>
>> +#include <linux/ptrace.h>
>> +#include <linux/ratelimit.h>
>> +#include <linux/reciprocal_div.h>
>> +#include <linux/regset.h>
>> +#include <linux/seccomp.h>
>> +#include <linux/seccomp_filter.h>
>> +#include <linux/security.h>
>> +#include <linux/seccomp.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/user.h>
>> +
>> +
>> +/**
>> + * struct seccomp_filter - container for seccomp BPF programs
>> + *
>> + * @usage: reference count to manage the object lifetime.
>> + *         get/put helpers should be used when accessing an instance
>> + *         outside of a lifetime-guarded section.  In general, this
>> + *         is only needed for handling filters shared across tasks.
>> + * @parent: pointer to the ancestor which this filter will be composed with.
>
> I guess the code will be clearer than the comment. Parent task or filter?
> Is it a reverse linked list or something? I'm confused.

I'll try to rewrite it.  The parent filter is whatever filter was
installed on the task at the time the current filter was installed.
It may have been inherited or be part of a sequence of filters
installed by the same task.

>> + * @insns: the BPF program instructions to evaluate
>> + * @count: the number of instructions in the program.
>> + *
>> + * seccomp_filter objects should never be modified after being attached
>> + * to a task_struct (other than @usage).
>> + */
>> +struct seccomp_filter {
>> +     struct kref usage;
>> +     struct seccomp_filter *parent;
>> +     struct {
>> +             uint32_t compat:1;
>> +     } flags;
>
> This flag isn't documented above.

Oops dropped it for a rev - I'll re-add it. Thanks!

>> +     unsigned short count;  /* Instruction count */
>
> Why a short?

Matches sock_fprog - I can go larger if that make sense.

>> +     struct sock_filter insns[0];
>
> Newer gcc supports insns[], but I don't know what the kernel convention is.

Yeah - I just lifted what I saw. I can recheck to see if the
conventions are more clearly spelled out for this. (checkpatch is
happy.)

>> +};
>> +
>> +/*
>> + * struct seccomp_filter_metadata - BPF data wrapper
>> + * @data: data accessible to the BPF program.
>> + * @has_args: indicates that the args have been lazily populated.
>> + *
>> + * Used by seccomp_load_pointer.
>> + */
>> +struct seccomp_filter_metadata {
>> +     struct seccomp_filter_data data;
>> +     bool has_args;
>> +};
>
> Why use 'bool' instead of a nifty struct { uint32_t : 1 }?
>
> It seems that has_args is something that should be stored elsewhere,
> it doesn't seem to warrant its own structure.

Well it would naturally live inside the BPF evaluator since the
loading is done in load_pointer.  This struct is just used to marshal
the state around.

>> +
>> +static unsigned int seccomp_run_filter(void *, uint32_t,
>> +                                    const struct sock_filter *);
>
> It almosts fits, just put it on one line.

I'm just doing what checkpatch suggests :)  Would it read better if I
bumped all the args down a line?

>> +
>> +/**
>> + * seccomp_filter_alloc - allocates a new filter object
>> + * @padding: size of the insns[0] array in bytes
>> + *
>> + * The @padding should be a multiple of
>> + * sizeof(struct sock_filter).
>
> It's not padding, why call it such? Why not 'count' or 'bpf_blocks'?
> That would reduce the sizeof(struct sock_filter)s to one.

Either work - I'll change it.

>> + *
>> + * Returns ERR_PTR on error or an allocated object.
>> + */
>> +static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
>> +{
>> +     struct seccomp_filter *f;
>> +     unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
>> +
>> +     /* Drop oversized requests. */
>> +     if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
>> +             return ERR_PTR(-EINVAL);
>> +
>> +     /* Padding should always be in sock_filter increments. */
>> +     if (padding % sizeof(struct sock_filter))
>> +             return ERR_PTR(-EINVAL);
>
> ...then this isn't needed.
>
>> +
>> +     f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
>> +     if (!f)
>> +             return ERR_PTR(-ENOMEM);
>> +     kref_init(&f->usage);
>> +     f->count = bpf_blocks;
>> +     return f;
>> +}
>> +
>> +/**
>> + * seccomp_filter_free - frees the allocated filter.
>> + * @filter: NULL or live object to be completely destructed.
>> + */
>> +static void seccomp_filter_free(struct seccomp_filter *filter)
>> +{
>> +     if (!filter)
>> +             return;
>> +     put_seccomp_filter(filter->parent);
>
> Why do put on the parent here?

filters may be shared across tasks and there may be multiple filters
per-task.  The references keep them alive beyond the life of the
originating task_struct or if they are not hanging off of the
task_struct directly.  In order for them to ever die, the reference to
them needs to be dropped and that's done here.

> This function is only called once below by __put_seccomp_filter, perhaps merge them?

Cool I'll do that.

>> +     kfree(filter);
>> +}
>> +
>> +static void __put_seccomp_filter(struct kref *kref)
>> +{
>> +     struct seccomp_filter *orig =
>> +             container_of(kref, struct seccomp_filter, usage);
>> +     seccomp_filter_free(orig);
>
> I find it unexpected that __put_seccomp_filter() calls put_seccomp_filter(),
> why have two versions then? I have the feeling all this can be streamlined
> more.

So I can either make all callers use kref_put(&filter->usage,
__put_seccomp_filter) or I can have one extra layer of indirection:
put_seccomp_filter.  I can definitely ditch the extra free wrapper
though.  I can also make put_seccomp_filter into a static inline if
that'd be preferable -- or just use kref everywhere.

>> +}
>> +
>> +void seccomp_filter_log_failure(int syscall)
>> +{
>> +     pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
>> +             current->comm, task_pid_nr(current), syscall,
>> +             KSTK_EIP(current));
>> +}
>> +
>> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */
>> +void put_seccomp_filter(struct seccomp_filter *orig)
>> +{
>> +     if (!orig)
>> +             return;
>> +     kref_put(&orig->usage, __put_seccomp_filter);
>> +}
>> +
>> +/* get_seccomp_filter - increments the reference count of @orig. */
>> +struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
>> +{
>> +     if (!orig)
>> +             return NULL;
>> +     kref_get(&orig->usage);
>> +     return orig;
>> +}
>
> This function should be merged with seccomp_struct_fork(), it's pretty much only
> called there.

It's used primarily during filter attach, but if these wrappers are
bothersome, I can just use kref_* directly everywhere.

>> +
>> +#if BITS_PER_LONG == 32
>> +static inline unsigned long *seccomp_filter_data_arg(
>> +                             struct seccomp_filter_data *data, int index)
>> +{
>> +     /* Avoid inconsistent hi contents. */
>> +     data->args[index].hi32 = 0;
>> +     return (unsigned long *) &(data->args[index].lo32);
>> +}
>> +#elif BITS_PER_LONG == 64
>> +static inline unsigned long *seccomp_filter_data_arg(
>> +                             struct seccomp_filter_data *data, int index)
>> +{
>> +     return (unsigned long *) &(data->args[index].u64);
>
> No need for any casts, is there? And this looks a bit untidy.

Yup - gcc isn't a fan even if the size matches.  Won't matter with the
new change though.

>
>> +}
>> +#else
>> +#error Unknown BITS_PER_LONG.
>> +#endif
>> +
>> +/**
>> + * seccomp_load_pointer: checks and returns a pointer to the requested offset
>> + * @buf: u8 array to index into
>> + * @buflen: length of the @buf array
>> + * @offset: offset to return data from
>> + * @size: size of the data to retrieve at offset
>> + * @unused: placeholder which net/core/filter.c uses for for temporary
>> + *          storage.  Ideally, the two code paths can be merged.
>
> The documentation doesn't matches the actual code?

Thanks - sorry.  Last minute changes that I failed to fix the comment for.

>> + *
>> + * Returns a pointer to the BPF evaluator after checking the offset and size
>> + * boundaries.
>> + */
>> +static inline void *seccomp_load_pointer(void *data, int offset, size_t size,
>> +                                      void *buffer)
>> +{
>> +     struct seccomp_filter_metadata *metadata = data;
>> +     int arg;
>> +     if (offset >= sizeof(metadata->data))
>> +             goto fail;
>> +     if (offset < 0)
>> +             goto fail;
>
> Perhaps use unsigned offset?

I'm using what the bpf code specifies already.  Otherwise I would use
an unsigned int. I guess I could cast.

>> +     if (size > sizeof(metadata->data) - offset)
>> +             goto fail;
>> +     if (metadata->has_args)
>> +             goto pass;
>> +     /* No argument data touched. */
>> +     if (offset + size - 1 < offsetof(struct seccomp_filter_data, args))
>> +             goto pass;
>> +     for (arg = 0; arg < ARRAY_SIZE(metadata->data.args); ++arg)
>> +             syscall_get_arguments(current, task_pt_regs(current), arg, 1,
>> +                     seccomp_filter_data_arg(&metadata->data, arg));
>> +     metadata->has_args = true;
>
> Wasn't the idea that arguments would be loaded on demand?

Yes and they are and then cached.

> And as arguments are usually only checked once and not twice, caching the
> values isn't really needed. So perhaps get rid of the (meta)data structure
> and just have a switch that returns the right data directly depending on
> the offset? It seems the over all code would be simpler that way.

Not really. You can make a case statement per range of offsets that
overlap with an argument, but that doesn't account for size.  You
could load a long across two arguments.  We're not enforcing alignment
as the networking code doesn't either.  I could have it fail if the
request is not 32-bit aligned.  I don't prefer that, but it would
simplify the code.  It's certainly easier to change the ABI later to
provide unaligned access then it is to make it aligned later, so I can
change to aligned requests only that return the appropriate 32-bit
slice of the data.

>> +pass:
>> +     return ((__u8 *)(&metadata->data)) + offset;
>> +fail:
>> +     return NULL;
>> +}
>> +
>> +/**
>> + * seccomp_test_filters - tests 'current' against the given syscall
>> + * @syscall: number of the system call to test
>> + *
>> + * Returns 0 on ok and non-zero on error/failure.
>> + */
>> +int seccomp_test_filters(int syscall)
>> +{
>> +     int ret = -EACCES;
>> +     struct seccomp_filter *filter;
>> +     struct seccomp_filter_metadata metadata;
>> +
>> +     filter = current->seccomp.filter; /* uses task ref */
>> +     if (!filter)
>> +             goto out;
>
> Actually, shouldn't having no filter with SECCOMP_MODE_FILTER set be impossible?

Yeah.

> It seems better to assure that instead of needlessly testing for it every time.

I guess so. I personally prefer this style of defensive checking, but
I realize that's at odds with the kernel style. I'll drop it.

>> +
>> +     metadata.data.syscall_nr = syscall;
>> +     metadata.has_args = false;
>> +
>> +#ifdef CONFIG_COMPAT
>> +     if (filter->flags.compat != !!(is_compat_task()))
>> +             goto out;
>> +#endif
>> +
>> +     /* Only allow a system call if it is allowed in all ancestors. */
>> +     ret = 0;
>> +     for ( ; filter != NULL; filter = filter->parent) {
>> +             /* Allowed if return value is SECCOMP_BPF_ALLOW */
>> +             if (seccomp_run_filter(&metadata, sizeof(metadata.data),
>> +                                     filter->insns) != SECCOMP_BPF_ALLOW)
>> +                     ret = -EACCES;
>
> Why keep checking filters if one of them fails? Just return -EACCES?

Why optimize for a failure mode?  Since fails are terminal, I don't
see any reason to rush :) That said, this was more relevant when I had
ptrace integration that depended on knowing which filter the failure
occurred in. I'll change it.

>> +     }
>> +out:
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_attach_filter: Attaches a seccomp filter to current.
>> + * @fprog: BPF program to install
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called (usually prctl).
>> + *
>> + * This function may be called repeatedly to install additional filters.
>> + * Every filter successfully installed will be evaluated (in reverse order)
>> + * for each system call the thread makes.
>> + *
>> + * Returns 0 on success or an errno on failure.
>> + */
>> +long seccomp_attach_filter(struct sock_fprog *fprog)
>> +{
>> +     struct seccomp_filter *filter = NULL;
>> +     /* Note, len is a short so overflow should be impossible. */
>> +     unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>
> That's a crude way of doing it. The max is BPF_MAXINSNS and that is checked a
> bit late by sk_chk_filter(). But perhaps while you're consolidating the code
> with the networking filter code, you could change their code to check that
> first before trying to allocate the given amount.

Sure - I can send those patches independently and see if they are
accepted.  It doesn't change the fact that it can't overflow because
fprog->len is a short and the sizeof struct sock_filter is less than
65535.

>> +     long ret = -EPERM;
>> +
>> +     /* Allocate a new seccomp_filter */
>> +     filter = seccomp_filter_alloc(fp_size);
>
> Actually, your seccomp_filter_alloc() checks for BPF_MAXINSNS already, and if
> you change that one as I recommended and you pass on fprog->len directly, no
> overflow can happen.

Works for me!

>> +     if (IS_ERR(filter)) {
>> +             ret = PTR_ERR(filter);
>> +             goto out;
>> +     }
>> +
>> +     /* Copy the instructions from fprog. */
>> +     ret = -EFAULT;
>> +     if (copy_from_user(filter->insns, fprog->filter, fp_size))
>> +             goto out;
>> +
>> +     /* Check the fprog */
>> +     ret = sk_chk_filter(filter->insns, filter->count);
>> +     if (ret)
>> +             goto out;
>> +
>> +     /*
>> +      * If a process lacks CAP_SYS_ADMIN in its namespace, force
>> +      * this process and all descendents to run with no_new_privs.
>> +      * A privileged process will need to set this bit independently,
>> +      * if desired.
>> +      */
>> +     if (security_capable_noaudit(current_cred(), current_user_ns(),
>> +                                  CAP_SYS_ADMIN) != 0)
>> +             current->no_new_privs = 1;
>> +
>> +     /*
>> +      * If there is an existing filter, make it the parent
>> +      * and reuse the existing task-based ref.
>
> You're not reusing the existing task-based ref, are you?
>
> You can't reuse it anyway. Perhaps and old leftover comment?

I am reusing it.  When current->seccomp.filter is installed -- via
fork or attach -- a get_ is done.  That is for the task.  Here, we are
bumping current->seccomp.filter to filter->parent without calling
put_.  This reuses the task ref for the older filter.  The new filter
has a ref already from during alloc and that is transferred for use by
the task.

>> +      */
>> +     filter->parent = current->seccomp.filter;
>
> I think this needs a more verbose comment, here or elsewhere explaining how
> important it is that new filters are added to the start and old filters are
> never removed, or else list corruption can happen. This way assures that
> forward pointers are always valid, and that's the reason to use a singly
> linked list instead of a standard doubly linked list.

Cool - I can do that.

>> +#ifdef CONFIG_COMPAT
>> +     /* Disallow changing system calling conventions after the fact. */
>> +     filter->flags.compat = !!(is_compat_task());
>> +
>> +     if (filter->parent &&
>> +         filter->parent->flags.compat != filter->flags.compat)
>> +             return -EACCES;
>
> This shouldn't be possible and checking it here is too late if it somehow
> happened anyway.

Yeah you're right. Since mismatched-compat syscalls are blocked, we'd
never hit this point.  Thanks!

> (And you're leaking memory because current->seccomp.filter
> isn't updated yet.)

Fixed in v6 :)

>> +#endif
>> +
>> +     /*
>> +      * Double claim the new filter so we can release it below simplifying
>> +      * the error paths earlier.
>> +      */
>
> No it doesn't.

True enough. It did when I was doing ftrace setup. :/

>> +     ret = 0;
>> +     get_seccomp_filter(filter);
>> +     current->seccomp.filter = filter;
>> +     /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
>> +     if (current->seccomp.mode == SECCOMP_MODE_DISABLED) {
>> +             current->seccomp.mode = SECCOMP_MODE_FILTER;
>> +             set_thread_flag(TIF_SECCOMP);
>> +     }
>> +
>
> Just add a return 0; here, much simpler and removes 6 lines.
>
>> +out:
>> +     put_seccomp_filter(filter);  /* for get or task, on err */
>> +     return ret;
>> +}
>> +
>> +#ifdef CONFIG_COMPAT
>> +/* This should be kept in sync with net/compat.c which changes infrequently. */
>> +struct compat_sock_fprog {
>> +     u16 len;
>> +     compat_uptr_t filter;   /* struct sock_filter */
>> +};
>> +
>> +static long compat_attach_seccomp_filter(char __user *optval)
>> +{
>> +     struct compat_sock_fprog __user *fprog32 =
>> +             (struct compat_sock_fprog __user *)optval;
>> +     struct sock_fprog __user *kfprog =
>> +             compat_alloc_user_space(sizeof(struct sock_fprog));
>> +     compat_uptr_t ptr;
>> +     u16 len;
>> +
>> +     if (!access_ok(VERIFY_READ, fprog32, sizeof(*fprog32)) ||
>> +         !access_ok(VERIFY_WRITE, kfprog, sizeof(struct sock_fprog)) ||
>> +         __get_user(len, &fprog32->len) ||
>> +         __get_user(ptr, &fprog32->filter) ||
>> +         __put_user(len, &kfprog->len) ||
>> +         __put_user(compat_ptr(ptr), &kfprog->filter))
>> +             return -EFAULT;
>> +
>> +     return seccomp_attach_filter(kfprog);
>> +}
>> +#endif
>> +
>> +long prctl_attach_seccomp_filter(char __user *user_filter)
>> +{
>> +     struct sock_fprog fprog;
>> +     long ret = -EINVAL;
>> +     ret = -EFAULT;
>
> Why not set it to -EFAULT directly?

No reason. Looks like I cut out some code and didn't fix it.

>> +     if (!user_filter)
>> +             goto out;
>> +
>> +#ifdef CONFIG_COMPAT
>> +     if (is_compat_task())
>> +             return compat_attach_seccomp_filter(user_filter);
>> +#endif
>> +
>> +     if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
>> +             goto out;
>> +
>> +     ret = seccomp_attach_filter(&fprog);
>> +out:
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_struct_fork: manages inheritance on fork
>> + * @child: forkee's seccomp_struct
>> + * @parent: forker's seccomp_struct
>> + *
>> + * Ensures that @child inherits seccomp mode and state iff
>
> 'if'

I guess they are equivalent for this purpose.

>> + * seccomp filtering is in use.
>> + */
>> +void seccomp_struct_fork(struct seccomp_struct *child,
>> +                      const struct seccomp_struct *parent)
>> +{
>> +     child->mode = parent->mode;
>> +     if (parent->mode != SECCOMP_MODE_FILTER)
>> +             return;
>> +     child->filter = get_seccomp_filter(parent->filter);
>
> I think how you manage the life times of the filters is very clever and that it
> needs some commenting because it is more subtle than it seems at first glance.
> The rules seem to be:
>
> - The list is sorted from new to old.
>
> - Never remove old filters.
>
> - The life-time of a group of filters (all filters set by one process between
>  fork calls) is managed by the one at the head of the group.
>
> This forms a tree with the head filters at forking points with the filters
> freed from the leave direction.

Exactly - sorry for it being poorly documented.  I will attempt to
document it more clearly in the next revision (probably using some of
your text above).

>> +}
>> +
>> +/**
>> + * seccomp_run_filter - evaluate BPF
>> + *   @buf: opaque buffer to execute the filter over
>> + *   @buflen: length of the buffer
>
> Out of date.

Doh. Thanks.

>> + *   @fentry: filter to apply
>> + *
>> + * Decode and apply filter instructions to the buffer.  Return length to
>> + * keep, 0 for none.
>
> This doesn't make sense in the context of system calls. Just return 0 for
> allowing the system call and -1 to deny it.

I know but all the shared error paths in the evaluator will return 0.
So on merge with the networking evaluator the return value for errors
needs to be abstracted too.  It'd be nice to not require more changes,
but it might be doable, I don't know.

>> @buf is a seccomp_filter_metadata we are filtering,
>> + * @filter is the array of filter instructions.  Because all jumps are
>> + * guaranteed to be before last instruction, and last instruction
>> + * guaranteed to be a RET, we dont need to check flen.
>> + *
>> + * See core/net/filter.c as this is nearly an exact copy.
>> + * At some point, it would be nice to merge them to take advantage of
>> + * optimizations (like JIT).
>> + */
>> +static unsigned int seccomp_run_filter(void *data, uint32_t datalen,
>> +                       const struct sock_filter *fentry)
>> +{
>> +     const void *ptr;
>> +     u32 A = 0;                      /* Accumulator */
>> +     u32 X = 0;                      /* Index Register */
>> +     u32 mem[BPF_MEMWORDS];          /* Scratch Memory Store */
>> +     u32 tmp;
>> +     int k;
>> +
>> +     /*
>> +      * Process array of filter instructions.
>> +      */
>> +     for (;; fentry++) {
>> +#if defined(CONFIG_X86_32)
>> +#define      K (fentry->k)
>> +#else
>> +             const u32 K = fentry->k;
>> +#endif
>> +
>> +             switch (fentry->code) {
>> +             case BPF_S_ALU_ADD_X:
>> +                     A += X;
>> +                     continue;
>> +             case BPF_S_ALU_ADD_K:
>> +                     A += K;
>> +                     continue;
>> +             case BPF_S_ALU_SUB_X:
>> +                     A -= X;
>> +                     continue;
>> +             case BPF_S_ALU_SUB_K:
>> +                     A -= K;
>> +                     continue;
>> +             case BPF_S_ALU_MUL_X:
>> +                     A *= X;
>> +                     continue;
>> +             case BPF_S_ALU_MUL_K:
>> +                     A *= K;
>> +                     continue;
>> +             case BPF_S_ALU_DIV_X:
>> +                     if (X == 0)
>> +                             return 0;
>> +                     A /= X;
>> +                     continue;
>> +             case BPF_S_ALU_DIV_K:
>> +                     A = reciprocal_divide(A, K);
>> +                     continue;
>> +             case BPF_S_ALU_AND_X:
>> +                     A &= X;
>> +                     continue;
>> +             case BPF_S_ALU_AND_K:
>> +                     A &= K;
>> +                     continue;
>> +             case BPF_S_ALU_OR_X:
>> +                     A |= X;
>> +                     continue;
>> +             case BPF_S_ALU_OR_K:
>> +                     A |= K;
>> +                     continue;
>> +             case BPF_S_ALU_LSH_X:
>> +                     A <<= X;
>> +                     continue;
>> +             case BPF_S_ALU_LSH_K:
>> +                     A <<= K;
>> +                     continue;
>> +             case BPF_S_ALU_RSH_X:
>> +                     A >>= X;
>> +                     continue;
>> +             case BPF_S_ALU_RSH_K:
>> +                     A >>= K;
>> +                     continue;
>> +             case BPF_S_ALU_NEG:
>> +                     A = -A;
>> +                     continue;
>> +             case BPF_S_JMP_JA:
>> +                     fentry += K;
>> +                     continue;
>> +             case BPF_S_JMP_JGT_K:
>> +                     fentry += (A > K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JGE_K:
>> +                     fentry += (A >= K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JEQ_K:
>> +                     fentry += (A == K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JSET_K:
>> +                     fentry += (A & K) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JGT_X:
>> +                     fentry += (A > X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JGE_X:
>> +                     fentry += (A >= X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JEQ_X:
>> +                     fentry += (A == X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_JMP_JSET_X:
>> +                     fentry += (A & X) ? fentry->jt : fentry->jf;
>> +                     continue;
>> +             case BPF_S_LD_W_ABS:
>> +                     k = K;
>> +load_w:
>> +                     ptr = seccomp_load_pointer(data, k, 4, &tmp);
>> +                     if (ptr != NULL) {
>> +                             /*
>> +                              * Assume load_pointer did any byte swapping.
>> +                              */
>> +                             A = *(const u32 *)ptr;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_H_ABS:
>> +                     k = K;
>> +load_h:
>> +                     ptr = seccomp_load_pointer(data, k, 2, &tmp);
>> +                     if (ptr != NULL) {
>> +                             A = *(const u16 *)ptr;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_B_ABS:
>> +                     k = K;
>> +load_b:
>> +                     ptr = seccomp_load_pointer(data, k, 1, &tmp);
>> +                     if (ptr != NULL) {
>> +                             A = *(const u8 *)ptr;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_W_LEN:
>> +                     A = datalen;
>> +                     continue;
>> +             case BPF_S_LDX_W_LEN:
>> +                     X = datalen;
>> +                     continue;
>> +             case BPF_S_LD_W_IND:
>> +                     k = X + K;
>> +                     goto load_w;
>> +             case BPF_S_LD_H_IND:
>> +                     k = X + K;
>> +                     goto load_h;
>> +             case BPF_S_LD_B_IND:
>> +                     k = X + K;
>> +                     goto load_b;
>> +             case BPF_S_LDX_B_MSH:
>> +                     ptr = seccomp_load_pointer(data, K, 1, &tmp);
>> +                     if (ptr != NULL) {
>> +                             X = (*(u8 *)ptr & 0xf) << 2;
>> +                             continue;
>> +                     }
>> +                     return 0;
>> +             case BPF_S_LD_IMM:
>> +                     A = K;
>> +                     continue;
>> +             case BPF_S_LDX_IMM:
>> +                     X = K;
>> +                     continue;
>> +             case BPF_S_LD_MEM:
>> +                     A = mem[K];
>> +                     continue;
>> +             case BPF_S_LDX_MEM:
>> +                     X = mem[K];
>> +                     continue;
>> +             case BPF_S_MISC_TAX:
>> +                     X = A;
>> +                     continue;
>> +             case BPF_S_MISC_TXA:
>> +                     A = X;
>> +                     continue;
>> +             case BPF_S_RET_K:
>> +                     return K;
>> +             case BPF_S_RET_A:
>> +                     return A;
>> +             case BPF_S_ST:
>> +                     mem[K] = A;
>> +                     continue;
>> +             case BPF_S_STX:
>> +                     mem[K] = X;
>> +                     continue;
>> +             case BPF_S_ANC_PROTOCOL:
>> +             case BPF_S_ANC_PKTTYPE:
>> +             case BPF_S_ANC_IFINDEX:
>> +             case BPF_S_ANC_MARK:
>> +             case BPF_S_ANC_QUEUE:
>> +             case BPF_S_ANC_HATYPE:
>> +             case BPF_S_ANC_RXHASH:
>> +             case BPF_S_ANC_CPU:
>> +             case BPF_S_ANC_NLATTR:
>> +             case BPF_S_ANC_NLATTR_NEST:
>> +                     continue;
>> +             default:
>> +                     WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
>> +                                    fentry->code, fentry->jt,
>> +                                    fentry->jf, fentry->k);
>> +                     return 0;
>> +             }
>> +     }
>> +
>> +     return 0;
>> +}
>
> All in all it seems fairly easy to consolidate this with the networking code.
> The only big change would be that the load_pointer() function pointer needs to
> be passed, that way the same code can be used for both versions. The filter
> checking code needs be expanded to check for network-only ops and reject the
> filter if it's a seccomp one.

Also byteswapping.  I think this can be done by adding pointer to the
accumulator to load_pointer, then let that do the byteswapping too.  I
guess I'll go ahead and try my hand at a merged patch and add it to
the end of the next series.  I wasn't planning on also attempting JIT
integration yet though ;)

>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 4070153..8e43f70 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1901,6 +1901,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned
>> long, arg3,
>>               case PR_SET_SECCOMP:
>>                       error = prctl_set_seccomp(arg2);
>>                       break;
>> +             case PR_ATTACH_SECCOMP_FILTER:
>> +                     error = prctl_attach_seccomp_filter((char __user *)
>> +                                                             arg2);
>
> Please one line.

checkpatch was unhappy :)

> Actually, wouldn't it make more sense to use PR_SET_SECCOMP and check arg2 to
> see if it's filtering, and then get the filter itself from arg3 or something?

This is definitely preferable. Maybe something as simple as:
  prctl(PR_SET_SECCOMP, 2, fprog);
I'll make the changes - thanks!

> Or just stop calling it seccomp of course. PR_ATTACH_SYSCALL_FILTER?

Fine with me too, but it seems to make logical sense to keep it paired
with seccomp.  Especially if mode=1 just becomes a hardcoded bpf.

>> +                     break;
>>               case PR_GET_TSC:
>>                       error = GET_TSC_CTL(arg2);
>>                       break;
>> diff --git a/security/Kconfig b/security/Kconfig
>> index 51bd5a0..e1ffed8 100644
>> --- a/security/Kconfig
>> +++ b/security/Kconfig
>> @@ -84,6 +84,26 @@ config SECURITY_DMESG_RESTRICT
>>
>>         If you are unsure how to answer this question, answer N.
>>
>> +config SECCOMP_FILTER
>> +     bool "Enable seccomp-based system call filtering"
>> +     select SECCOMP
>> +     help
>> +       This option provide support for limiting the accessibility of
>> +       systems calls at a task-level using a dynamically defined policy.
>> +
>> +       System call filtering policy is expressed by the user using
>> +       a Berkeley Packet Filter program.  The program is attached using
>> +       prctl(2).  For every system call the task makes, its number,
>> +       arguments, and other metadata will be evaluated by the attached
>> +       filter program.  The result determines if the system call may
>> +       may proceed or if the task should be terminated.
>> +
>> +       This behavior is meant to aid security-conscious software in
>> +       its ability to minimize the risk of running potentially
>> +       risky code.
>> +
>> +       See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>>  config SECURITY
>>       bool "Enable different security models"
>>       depends on SYSFS
>> --
>> 1.7.5.4
>>
>>
>
> Greetings,

Thanks for the comprehensive review and feedback!
will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-29 20:27     ` Will Drewry
@ 2012-01-30  2:37       ` Indan Zupancic
  2012-01-30 22:26         ` Will Drewry
  0 siblings, 1 reply; 26+ messages in thread
From: Indan Zupancic @ 2012-01-30  2:37 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

> On Sat, Jan 28, 2012 at 10:39 PM, Indan Zupancic <indan@nul.nu> wrote:
>> Hello,
>>
>> Feedback below. I know you send out v6, but I'm not going to play
>> eternal catch-up, so here it is.
>
> The change is minimal - so thanks for the comments on any version!

You're welcome!

>> On Sat, January 28, 2012 00:24, Will Drewry wrote:
>>> [This patch depends on luto@mit.edu's no_new_privs patch:
>>>  https://lkml.org/lkml/2012/1/12/446
>>> ]
>>>
>>> This patch adds support for seccomp mode 2.  This mode enables dynamic
>>> enforcement of system call filtering policy in the kernel as specified
>>> by a userland task.  The policy is expressed in terms of a Berkeley
>>> Packet Filter program, as is used for userland-exposed socket filtering.
>>> Instead of network data, the BPF program is evaluated over struct
>>> seccomp_filter_data at the time of the system call.
>>>
>>> A filter program may be installed by a userland task by calling
>>>   prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
>>> where fprog is of type struct sock_fprog.
>>>
>>> If the first filter program allows subsequent prctl(2) calls, then
>>> additional filter programs may be attached.  All attached programs
>>> must be evaluated before a system call will be allowed to proceed.
>>>
>>> To avoid CONFIG_COMPAT related landmines, once a filter program is
>>> installed using specific is_compat_task() value, it is not allowed to
>>> make system calls using the alternate entry point.
>>>
>>> Filter programs will be inherited across fork/clone and execve.
>>> However, if the task attaching the filter is unprivileged
>>> (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task.  This
>>> ensures that unprivileged tasks cannot attach filters that affect
>>> privileged tasks (e.g., setuid binary).
>>>
>>> There are a number of benefits to this approach. A few of which are
>>> as follows:
>>> - BPF has been exposed to userland for a long time.
>>> - Userland already knows its ABI: system call numbers and desired
>>>   arguments
>>> - No time-of-check-time-of-use vulnerable data accesses are possible.
>>> - system call arguments are loaded on demand only to minimize copying
>>>   required for system call number-only policy decisions.
>>>
>>> This patch includes its own BPF evaluator, but relies on the
>>> net/core/filter.c BPF checking code.  It is possible to share
>>> evaluators, but the performance sensitive nature of the network
>>> filtering path makes it an iterative optimization which (I think :) can
>>> be tackled separately via separate patchsets. (And at some point sharing
>>> BPF JIT code!)
>>>
>>>  v5: - uses syscall_get_arguments
>>>        (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
>>>      - uses union-based arg storage with hi/lo struct to
>>>        handle endianness.  Compromises between the two alternate
>>>        proposals to minimize extra arg shuffling and account for
>>>        endianness assuming userspace uses offsetof().
>>
>> Why would user space use offsetof()? Do you mean when assembling the BPF
>> code? So this basically means 32-bit and 64-bit filters are different,
>> even when they could be the same. Hm.
>
> Not at all.  Take a peek at the union seccomp_filter_data.  What it
> does is allows reference by hi32/lo32 be endian-accurate, but they are
> union'd with u64.  So if you want portable definitions, you use
> offsetof(), but if you know your endiannness and don't care about
> cross-compiling, then you can just reference by the offset of the u64
> and index into it as your endianness allows.
>
> This avoids a pointless extra copy to split u32 and u64 without
> leaving endianness pain entirely for userspace. I think it's a pretty
> sane compromise.  32-bit programs will only ever care about lo32.
> 64-bit will need to check both hi32 and lo32 or manually index into
> the u64 member.

The problem with this is that it makes BPF programs non-portable.
This means you can't easily distribute them in binary form. Worse,
they will often work fine, except they use the wrong data.

The structure layout is part of the ABI, by doing it this way you
have two ABI versions instead of just one. I don't think it's worth
the future trouble.

Actually, now I think about it, the current networking BPF programs
aren't binary portable either, because of the endianness of 32-bit
integers. It's still cleaner to have one structure layout instead
of two though.

>
>>
>>>        (mcgrathr@chromium.org, indan@nul.nu)
>>>      - update Kconfig description
>>>      - add include/seccomp_filter.h and add its installation
>>>      - (naive) on-demand syscall argument loading
>>>      - drop seccomp_t (eparis@redhat.com)
>>>      - adds proper compat prctl call copying
>>>  v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
>>>      - now uses current->no_new_privs
>>>          (luto@mit.edu,torvalds@linux-foundation.com)
>>>      - assign names to seccomp modes (rdunlap@xenotime.net)
>>>      - fix style issues (rdunlap@xenotime.net)
>>>      - reworded Kconfig entry (rdunlap@xenotime.net)
>>>  v3: - macros to inline (oleg@redhat.com)
>>>      - init_task behavior fixed (oleg@redhat.com)
>>>      - drop creator entry and extra NULL check (oleg@redhat.com)
>>>      - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
>>>      - adds tentative use of "always_unprivileged" as per
>>>        torvalds@linux-foundation.org and luto@mit.edu
>>>  v2: - (patch 2 only)
>>>
>>> Signed-off-by: Will Drewry <wad@chromium.org>
>>> ---
>>>  include/linux/Kbuild           |    1 +
>>>  include/linux/prctl.h          |    3 +
>>>  include/linux/seccomp.h        |   63 ++++
>>>  include/linux/seccomp_filter.h |   79 +++++
>>
>> Just stuff it into seccomp.h?
>
> Most of it is in there, but by defining a special data set for running
> filters over, I needed to export the definition to userspace.
> seccomp_filter.h contains only ABI relevant declarations.  The rest of
> the in-kernel stuff is in seccomp.h (which is not copied during a
> header installation).

Makes sense.

>> If this is added, then this will be seccomp,
>> and the old mode can be seen as a standard, very restrictive filter.
>
> This is totally true.  I originally did not pursue this because I
> would have needed 1 hardcoded BPF filter per arch.  With the new
> syntax, I believe it will be possible to define a 32-bit filter and a
> 64-bit filter (and a compat filter) in one central place. I will start
> on that for v7.  I hope it is doable now, because it is nice to keep
> it all locked to one path.
>
> The only consideration is that it might make sense to do that after
> seccomp_filter has been merged (in my dreams :) and given time to
> bake.  But it'll be a good exercise to do upfront, so I'll go ahead.
> If it seems too hasty, I can always respin it without the merging.

I only meant conceptually, I don't think it's a good idea to actually
do that with a hard-coded filter!

The point of the original seccomp was that it is so simple that it
is absolutely secure. BPF filtering still has room for errors.

>>>  kernel/Makefile                |    1 +
>>>  kernel/fork.c                  |    4 +
>>>  kernel/seccomp.c               |   10 +-
>>>  kernel/seccomp_filter.c        |  620 ++++++++++++++++++++++++++++++++++++++++
>>
>> Same here. There isn't much to seccomp.c, why not add the filter code there?
>>
>> If you insist on keeping it separate, then why call it seccomp at all?
>> Just call it syscall_filter or something.
>
> Works for me.
>
>>>  kernel/sys.c                   |    4 +
>>>  security/Kconfig               |   20 ++
>>>  10 files changed, 804 insertions(+), 1 deletions(-)
>>
>> What is the difference in kernel binary size with and without this?
>> If it's too big I think the BPF filtering code should be shared with
>> the networking version from the start.
>
> The network filtering code adds one function that is a switch
> statement over the possible valid values.  This doesn't duplicate the
> checking code.

Yes, but that doesn't answer my question.

The advantage of using the network code would be to reduce the only impact
this has for people not using it: Kernel binary size. So if you want it to
be small enough that it's not worth a separate config option, then it has
to be tiny.

>> Actually, that is something that is wanted anyway, so why not use the
>> networking code from the start? That should reduce the code amount a
>> lot. So I think if this gets merged the consolidated version should be
>> merged. Only downside would be that networking support is needed for
>> this feature, but without networking there shouldn't be much need for
>> this feature anyway.
>
> The reason that I am not merging upfront is because the load_pointer
> and byte swapping behavior is hard-coded in the current versions.  I

Just move the byte swapping into the networking load_pointer(), and
add an extra function pointer argument which tells which load_pointer()
function to call.

> believe it will be possible to add a layer of abstraction that doesn't
> induce additional overhead, but I didn't want to force that first.  If
> that will really be a blocker, then I will go ahead and do that first.
> I was just worried that the potential performance impact discussions
> could lead this patch to a dead end before it got a chance.

To know how much it is needed depends on what the difference in kernel
size is with and without CONFIG_SECCOMP_FILTER. If it's 4kB then you
may have a good point arguing that it may not be worth consolidating.
But considering that you already depend on the networking checking code,
it just seems cleaner to share everything and to make it explicit. This
avoids problems in the future when people change the networking checking
code without being aware that it is being used elsewhere too.

And performance differences can and should be measured. You don't want
to accidentally slow down all system calls, so it's good to double
check that you don't.

>
>>>  create mode 100644 include/linux/seccomp_filter.h
>>>  create mode 100644 kernel/seccomp_filter.c
>>>
>>> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
>>> index c94e717..5659454 100644
>>> --- a/include/linux/Kbuild
>>> +++ b/include/linux/Kbuild
>>> @@ -330,6 +330,7 @@ header-y += scc.h
>>>  header-y += sched.h
>>>  header-y += screen_info.h
>>>  header-y += sdla.h
>>> +header-y += seccomp_filter.h
>>>  header-y += securebits.h
>>>  header-y += selinux_netlink.h
>>>  header-y += sem.h
>>> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
>>> index 7ddc7f1..b8c4beb 100644
>>> --- a/include/linux/prctl.h
>>> +++ b/include/linux/prctl.h
>>> @@ -114,4 +114,7 @@
>>>  # define PR_SET_MM_START_BRK         6
>>>  # define PR_SET_MM_BRK                       7
>>>
>>> +/* Set process seccomp filters */
>>> +#define PR_ATTACH_SECCOMP_FILTER     37
>>> +
>>>  #endif /* _LINUX_PRCTL_H */
>>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>>> index 171ab66..3992bb6 100644
>>> --- a/include/linux/seccomp.h
>>> +++ b/include/linux/seccomp.h
>>> @@ -5,10 +5,29 @@
>>>  #ifdef CONFIG_SECCOMP
>>>
>>>  #include <linux/thread_info.h>
>>> +#include <linux/types.h>
>>>  #include <asm/seccomp.h>
>>>
>>> +/* Valid values of seccomp_struct.mode */
>>> +#define SECCOMP_MODE_DISABLED        0 /* seccomp is not in use. */
>>> +#define SECCOMP_MODE_STRICT  1 /* uses hard-coded seccomp.c rules. */
>>> +#define SECCOMP_MODE_FILTER  2 /* system call access determined by filter. */
>>> +
>>> +struct seccomp_filter;
>>> +/**
>>> + * struct seccomp_struct - the state of a seccomp'ed process
>>> + *
>>> + * @mode:  indicates one of the valid values above for controlled
>>> + *         system calls available to a process.
>>> + * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
>>> + *          @filter must only be accessed from the context of current as there
>>> + *          is no guard.
>>> + */
>>>  struct seccomp_struct {
>>>       int mode;
>>> +#ifdef CONFIG_SECCOMP_FILTER
>>> +     struct seccomp_filter *filter;
>>> +#endif
>>
>> Why a separate config option for this? Just depend on SECCOMP and networking?
>
> Right now it requires asm/syscall.h which CONFIG_SECCOMP doesn't.
> Some arches still haven't implemented that file so I thought it would
> make sense to keep it separate.

I think you should go on a quest to make sure (almost) all archs have that file,
before this patch can be merged. At least the archs that have ptrace support.

> Would it make sense to start with a config option then merge later or
> should it be done upfront?

Yes, I think that would make sense. But I also think no one will enable it
if it's a separate option because it's just very obscure (especially because
no one uses it yet). So if it's good enough it should be enabled by default,
at least for the archs that are supported. If it's coupled to SECCOMP, you
can bundle it with that (and networking).

>>>  };
>>>
>>>  extern void __secure_computing(int);
>>> @@ -51,4 +70,48 @@ static inline int seccomp_mode(struct seccomp_struct *s)
>>>
>>>  #endif /* CONFIG_SECCOMP */
>>>
>>> +#ifdef CONFIG_SECCOMP_FILTER
>>> +
>>> +
>>> +extern long prctl_attach_seccomp_filter(char __user *);
>>> +
>>> +extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
>>> +extern void put_seccomp_filter(struct seccomp_filter *);
>>> +
>>> +extern int seccomp_test_filters(int);
>>> +extern void seccomp_filter_log_failure(int);
>>> +extern void seccomp_struct_fork(struct seccomp_struct *child,
>>> +                             const struct seccomp_struct *parent);
>>
>> Lousy function name, what is it doing?
>
> Now it is called seccomp_fork() and it is called by fork when a task
> fork()s :)  Naming suggestions welcome!

That's better. My mind automatically associated the 'struct' with the fork,
which didn't make sense.

>
>>> +
>>> +static inline void seccomp_struct_init_task(struct seccomp_struct *seccomp)
>>> +{
>>> +     seccomp->mode = SECCOMP_MODE_DISABLED;
>>> +     seccomp->filter = NULL;
>>> +}
>>
>> This doesn't actually do anything, just get rid of it and/or adapt
seccomp_struct_fork()?
>>
>> And it seems wrong to do this at fork time considering that the mode is supposed
>> to keep the mode across forks, so it seems safer to update it to the correct
>> values immediately.
>
> True - mode and filter should already be zeroed.  I'll remove it. It
> persists from when I had a mutex in the older patch series.
>
>>> +
>>> +/* No locking is needed here because the task_struct will
>>> + * have no parallel consumers.
>>> + */
>>> +static inline void seccomp_struct_free_task(struct seccomp_struct *seccomp)
>>> +{
>>> +     put_seccomp_filter(seccomp->filter);
>>> +     seccomp->filter = NULL;
>>> +}
>>
>> How many functions do you need to free one object?
>
> At least three :)

I think you can get away with two. ;-)

>> I'm sure you can get rid of a couple, starting with this one. Just use
>> put_seccomp_filter() directly, the NULL is useless as task_struct is
>> going away.
>
> Cool - I'll do that.
>
>>> +
>>> +#else  /* CONFIG_SECCOMP_FILTER */
>>> +
>>> +#include <linux/errno.h>
>>> +
>>> +struct seccomp_filter { };
>>> +/* Macros consume the unused dereference by the caller. */
>>> +#define seccomp_struct_init_task(_seccomp) do { } while (0);
>>> +#define seccomp_struct_fork(_tsk, _orig) do { } while (0);
>>> +#define seccomp_struct_free_task(_seccomp) do { } while (0);
>>> +
>>> +static inline long prctl_attach_seccomp_filter(char __user *a2)
>>> +{
>>> +     return -ENOSYS;
>>> +}
>>> +
>>> +#endif  /* CONFIG_SECCOMP_FILTER */
>>>  #endif /* _LINUX_SECCOMP_H */
>>> diff --git a/include/linux/seccomp_filter.h b/include/linux/seccomp_filter.h
>>> new file mode 100644
>>> index 0000000..3ecd641
>>> --- /dev/null
>>> +++ b/include/linux/seccomp_filter.h
>>> @@ -0,0 +1,79 @@
>>> +/*
>>> + * Secomp-based system call filtering data structures and definitions.
>>> + *
>>> + * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
>>
>> (Not my expertise, but I'm not sure if that makes legal sense.)
>
> Weird, eh :)  That's what my work falls under when I do it on paid
> time. If it'd be helpful, I can add an Author: line.  If it's too
> weird still, I believe I can use Google Inc. instead. But, IANAL, etc.

The thing is, a name identifies someone or something. "Chromium OS Authors"
doesn't. A company name works because legally it is much the same as a
person. It seems doubtful to me that you can assign copyright to an abstract
group of people. But if Google's lawyers think it's okay then go for it.

>>> + *
>>> + * This copyrighted material is made available to anyone wishing to use,
>>> + * modify, copy, or redistribute it subject to the terms and conditions
>>> + * of the GNU General Public License v.2.
>>> + *
>>> + */
>>> +
>>> +#ifndef __LINUX_SECCOMP_FILTER_H__
>>> +#define __LINUX_SECCOMP_FILTER_H__
>>> +
>>> +#include <asm/byteorder.h>
>>> +#include <linux/compiler.h>
>>> +#include <linux/types.h>
>>> +
>>> +/*
>>> + *   Keep the contents of this file similar to linux/filter.h:
>>> + *     struct sock_filter and sock_fprog and versions.
>>> + *   Custom naming exists solely if divergence is ever needed.
>>> + */
>>> +
>>> +/*
>>> + * Current version of the filter code architecture.
>>> + */
>>> +#define SECCOMP_BPF_MAJOR_VERSION 1
>>> +#define SECCOMP_BPF_MINOR_VERSION 1
>>
>> If it's a fork of the current networking code it should follow their version
>> numbering too. But these defines aren't used anywhere..?
>
> Neither are the networking code defines :)  They are exported to
> userspace but I don't believe they've ever changed.

Probably because they are useless as they can only be checked at compile time,
while the kernel version they are run on can be different. But if you want to
reuse the existing tools to create BPF filters, being compatible with the
networking code would be handy.

>>> +
>>> +struct seccomp_filter_block {        /* Filter block */
>>> +     __u16   code;   /* Actual filter code */
>>> +     __u8    jt;     /* Jump true */
>>> +     __u8    jf;     /* Jump false */
>>> +     __u32   k;      /* Generic multiuse field */
>>> +};
>>> +
>>> +struct seccomp_fprog {       /* Required for SO_ATTACH_FILTER. */
>>> +     unsigned short          len;    /* Number of filter blocks */
>>> +     struct seccomp_filter_block __user *filter;
>>> +};

Oh wait, I missed this before: Just use the existing networking filter.h
instead of redefining anything, and only add seccomp specific stuff to
your file, the seccomp_filter_data.

>>> +
>>> +/* Ensure the u32 ordering is consistent with platform byte order. */
>>> +#if defined(__LITTLE_ENDIAN)
>>> +#define SECCOMP_ENDIAN_SWAP(x, y) x, y
>>> +#elif defined(__BIG_ENDIAN)
>>> +#define SECCOMP_ENDIAN_SWAP(x, y) y, x
>>> +#else
>>> +#error edit for your odd arch byteorder.
>>> +#endif
>>> +
>>> +/* System call argument layout for the filter data. */
>>> +union seccomp_filter_arg {
>>> +     struct {
>>> +             __u32 SECCOMP_ENDIAN_SWAP(lo32, hi32);
>>> +     };
>>> +     __u64 u64;
>>> +};
>>
>> Ugh. What was the advantage of doing this again? This looks like optimising
>> at the wrong level to me.
>
> I was optimizing for competing opinions it actually working.  If doing
> it your way will make people happy, then that's fine with me:
>
> struct {
>   __u32 lo32[6];
>   __u32 hi32[6];
> }

No need to stuff it in a separate structure then, just something like:

struct seccomp_filter_data {
	int syscall_nr;
	unsigned int arg_lo[6];
	unsigned int arg_hi[6];
	int __reserved[3];
};

I'm not sure what the advantage of using __u32 would be, I think we can count on
unsigned int always being 32 bits?

(Is my email client mangling tabs or is yours?)

> It just means two copies for every system call argument instead of
> one, but who knows, maybe the compiler will magic it away.
> Regardless, any argument access will always be slightly slower than
> just syscall number, so it doesn't hurt if the extra copy doesn't go
> away.

True. If you add 64 bit support to BPF then your way is much better of
course. But as long as it is 32-bit I don't think it makes much sense
to make it more complicated than necessary.

>> it doesn't make your code nicer, and I really doubt it makes it run
>> anything faster either. It's just obfuscating the code.
>
> It makes the filters marginally easier, but eh. You can write a 32-bit
> filter that checks lo32 and you know it will index into the u64 at the
> correct location regardless of endianness.  However, since it is all
> being done in the header anyway, it's really just being done in
> userspace.  In the 32-bit kernel code path, I just ensure that hi32 is
> zeroed out.

Your way it's a lot harder to see what the actual data layout is, including
offsets. My way the offset to the low bits is arg_nr * sizeof(int), if
counting args at 1 instead of 0, which is usually done with args (at least
by gcc).

> I just want an abi that is functional across platforms that people are
> happy enough to merge.  The one I posted is fine with me and so is
> your proposal.  So if no one else has strong opinions, I'll just use
> yours :)

I'm just giving my opinion of course, I would like to know what other
people prefer as well.

>
>> If you don't want to hide endianness then don't provide those lo32 and hi32
>> structure members and just make it a u64. That's at least very clear. But
>> swapping them around depending on endianness is not good, because it's only
>> confusing things.
>>
>> But if you have to do it, then at least do:
>>
>> #if defined(__LITTLE_ENDIAN)
>> union seccomp_filter_arg {
>>        struct {
>>                __u32 lo32;
>>                __u32 hi32;
>>        };
>>        __u64 u64;
>> };
>> #else
>> union seccomp_filter_arg {
>>        struct {
>>                __u32 hi32;
>>                __u32 lo32;
>>        };
>>        __u64 u64;
>> };
>> #endif
>>
>> Instead of hiding what you're doing.
>
> I copied the aio_abi.h file.

Ah, okay. For them it was just one field of a big structure though, and
hopefully it was because of historical reasons or something...

> I'm happy to do it more explicitly, but
> since no one appears to have a strong opinion and your proposal would
> allow for seamless addition of a mythical 128-bit argument someday, it
> seems like its the way to go.

If the difference is just one copy for the upper 32 bits then I think
my way is better. But if you can somehow make it much faster when the
upper and lower halves are adjacent then go for it.

>>> +
>>> +/*
>>> + *   Expected data the BPF program will execute over.
>>> + *   Endianness will be arch specific, but the values will be
>>> + *   swapped, as above, to allow for consistent BPF programs.
>>
>> So now what? You need platform dependent BPF programs depending on
>> the arch's endianness? Just because you want to stuff one 64-bit
>> longs directly into two 32-bit ints? That saves how much, two
>> instructions?
>
> I was also attempting to compromise between aesthetics, but unless I
> hear any continued dissent against your proposal, I'll just go with
> it.
>
>>> + */
>>> +struct seccomp_filter_data {
>>> +     int syscall_nr;
>>> +     __u32 __reserved;
>>> +     union seccomp_filter_arg args[6];
>>> +};
>>> +
>>> +#undef SECCOMP_ENDIAN_SWAP
>>> +
>>> +/*
>>> + * Defined valid return values for the BPF program.
>>> + */
>>> +#define SECCOMP_BPF_ALLOW    0xFFFFFFFF
>>> +#define SECCOMP_BPF_DENY     0
>>
>> This doesn't make sense. Please be consistent with C and swap it around.
>
> Nope.  The bpf engine returns non-zero for success and 0 for failure.
> If I plan to share code with the networking stack, then 0 will always
> be the failure mode.  It indicates the amount of the packet to accept.
> If swapping it around becomes mandatory, then sharing code with the
> network implementation will be even harder.

Gah! You're right.

>>> +
>>> +#endif /* __LINUX_SECCOMP_FILTER_H__ */
>>> diff --git a/kernel/Makefile b/kernel/Makefile
>>> index 2d9de86..fd81bac 100644
>>> --- a/kernel/Makefile
>>> +++ b/kernel/Makefile
>>> @@ -78,6 +78,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>>>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>>>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>>>  obj-$(CONFIG_SECCOMP) += seccomp.o
>>> +obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
>>>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>>>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>>>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 051f090..f312edb 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -34,6 +34,7 @@
>>>  #include <linux/cgroup.h>
>>>  #include <linux/security.h>
>>>  #include <linux/hugetlb.h>
>>> +#include <linux/seccomp.h>
>>>  #include <linux/swap.h>
>>>  #include <linux/syscalls.h>
>>>  #include <linux/jiffies.h>
>>> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>>>       free_thread_info(tsk->stack);
>>>       rt_mutex_debug_task_free(tsk);
>>>       ftrace_graph_exit_task(tsk);
>>> +     seccomp_struct_free_task(&tsk->seccomp);
>>>       free_task_struct(tsk);
>>>  }
>>>  EXPORT_SYMBOL(free_task);
>>> @@ -1093,6 +1095,7 @@ static struct task_struct *copy_process(unsigned long
clone_flags,
>>>               goto fork_out;
>>>
>>>       ftrace_graph_init_task(p);
>>> +     seccomp_struct_init_task(&p->seccomp);
>>
>> If you just do a get on the filter's krefs here then cleanup is done correctly
>> by free_task(). So just call seccomp_struct_fork() or get_seccomp_filter() here.
>
> Since init_task just NULLs it all out, then later seccomp_fork() does
> the get from the parent, I'm not sure I need this function at all any
> more.  Am I misreading?

That's what I tried to say, yes.

>>>
>>>       rt_mutex_init_task(p);
>>>
>>> @@ -1376,6 +1379,7 @@ static struct task_struct *copy_process(unsigned long
clone_flags,
>>>       if (clone_flags & CLONE_THREAD)
>>>               threadgroup_change_end(current);
>>>       perf_event_fork(p);
>>> +     seccomp_struct_fork(&p->seccomp, &current->seccomp);
>>
>> This can be moved above, before the first error goto.
>
> As in where seccomp_struct_init_task is or somewhere else?  (Where
> seccomp_init_task is would make sense.)

Around there yes. It needs to  be always executed, before free_task() may be called
because of an error path.

>>>
>>>       trace_task_newtask(p, clone_flags);
>>>
>>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>>> index e8d76c5..a045dd4 100644
>>> --- a/kernel/seccomp.c
>>> +++ b/kernel/seccomp.c
>>> @@ -37,7 +37,7 @@ void __secure_computing(int this_syscall)
>>>       int * syscall;
>>>
>>>       switch (mode) {
>>> -     case 1:
>>> +     case SECCOMP_MODE_STRICT:
>>>               syscall = mode1_syscalls;
>>>  #ifdef CONFIG_COMPAT
>>>               if (is_compat_task())
>>> @@ -48,6 +48,14 @@ void __secure_computing(int this_syscall)
>>>                               return;
>>>               } while (*++syscall);
>>>               break;
>>> +#ifdef CONFIG_SECCOMP_FILTER
>>> +     case SECCOMP_MODE_FILTER:
>>> +             if (seccomp_test_filters(this_syscall) == 0)
>>
>> Get rid of the '== 0' and have proper return values instead.
>
> 0 is success/allow as normal here.  I can change it to

Oh, that's unexpected.

>  if (!seccomp_test_filters(this_syscall))
>    return;
>
> if that'd read better?

Considering you're going to add more return values later, I would let
seccomp_test_filters() return the filter code and stuff it in a temporary
variable and work on that instead, and use SECCOMP_BPF_DENY/ALLOW to make
clear what's happening. It's a bit more code, but it's clearer how we got
to our decision.

And perhaps rename seccomp_test_filters() to seccomp_run_filters(), for
consistency with the networking code.

But do whatever you prefer, it doesn't matter much.

>>> +                     return;
>>> +
>>
>> No need for an empty line here.
>
> I'll drop it.
>
>>> +             seccomp_filter_log_failure(this_syscall);
>>> +             break;
>>> +#endif
>>>       default:
>>>               BUG();
>>>       }
>>> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
>>> new file mode 100644
>>> index 0000000..e57219e
>>> --- /dev/null
>>> +++ b/kernel/seccomp_filter.c
>>> @@ -0,0 +1,620 @@
>>> +/*
>>> + * linux/kernel/seccomp_filter.c
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License as published by
>>> + * the Free Software Foundation; either version 2 of the License, or
>>> + * (at your option) any later version.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>> + * GNU General Public License for more details.
>>> + *
>>> + * You should have received a copy of the GNU General Public License
>>> + * along with this program; if not, write to the Free Software
>>> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>>> + *
>>> + * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
>>> + *
>>> + * Extends linux/kernel/seccomp.c to allow tasks to install system call
>>> + * filters using a Berkeley Packet Filter program which is executed over
>>> + * struct seccomp_filter_data.
>>> + */
>>> +
>>> +#include <asm/syscall.h>
>>> +
>>> +#include <linux/capability.h>
>>> +#include <linux/compat.h>
>>> +#include <linux/err.h>
>>> +#include <linux/errno.h>
>>> +#include <linux/rculist.h>
>>> +#include <linux/filter.h>
>>> +#include <linux/kallsyms.h>
>>> +#include <linux/kref.h>
>>> +#include <linux/module.h>
>>> +#include <linux/pid.h>
>>> +#include <linux/prctl.h>
>>> +#include <linux/ptrace.h>
>>> +#include <linux/ratelimit.h>
>>> +#include <linux/reciprocal_div.h>
>>> +#include <linux/regset.h>
>>> +#include <linux/seccomp.h>
>>> +#include <linux/seccomp_filter.h>
>>> +#include <linux/security.h>
>>> +#include <linux/seccomp.h>
>>> +#include <linux/sched.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/uaccess.h>
>>> +#include <linux/user.h>
>>> +
>>> +
>>> +/**
>>> + * struct seccomp_filter - container for seccomp BPF programs
>>> + *
>>> + * @usage: reference count to manage the object lifetime.
>>> + *         get/put helpers should be used when accessing an instance
>>> + *         outside of a lifetime-guarded section.  In general, this
>>> + *         is only needed for handling filters shared across tasks.
>>> + * @parent: pointer to the ancestor which this filter will be composed with.
>>
>> I guess the code will be clearer than the comment. Parent task or filter?
>> Is it a reverse linked list or something? I'm confused.
>
> I'll try to rewrite it.  The parent filter is whatever filter was
> installed on the task at the time the current filter was installed.
> It may have been inherited or be part of a sequence of filters
> installed by the same task.

I think I would just call it 'next' or 'prev', but that's me. 'parent' is slightly
confusing in task context. The filter may come from the parent process, but it could
as well be installed by the same process.

>>> + * @insns: the BPF program instructions to evaluate
>>> + * @count: the number of instructions in the program.
>>> + *
>>> + * seccomp_filter objects should never be modified after being attached
>>> + * to a task_struct (other than @usage).
>>> + */
>>> +struct seccomp_filter {
>>> +     struct kref usage;
>>> +     struct seccomp_filter *parent;
>>> +     struct {
>>> +             uint32_t compat:1;
>>> +     } flags;
>>
>> This flag isn't documented above.
>
> Oops dropped it for a rev - I'll re-add it. Thanks!

And in case you missed my comment below, change it to a bool?

>>> +     unsigned short count;  /* Instruction count */
>>
>> Why a short?
>
> Matches sock_fprog - I can go larger if that make sense.

Nah, I missed that.

>>> +     struct sock_filter insns[0];
>>
>> Newer gcc supports insns[], but I don't know what the kernel convention is.
>
> Yeah - I just lifted what I saw. I can recheck to see if the
> conventions are more clearly spelled out for this. (checkpatch is
> happy.)
>
>>> +};
>>> +
>>> +/*
>>> + * struct seccomp_filter_metadata - BPF data wrapper
>>> + * @data: data accessible to the BPF program.
>>> + * @has_args: indicates that the args have been lazily populated.
>>> + *
>>> + * Used by seccomp_load_pointer.
>>> + */
>>> +struct seccomp_filter_metadata {
>>> +     struct seccomp_filter_data data;
>>> +     bool has_args;
>>> +};
>>
>> Why use 'bool' instead of a nifty struct { uint32_t : 1 }?
>>
>> It seems that has_args is something that should be stored elsewhere,
>> it doesn't seem to warrant its own structure.
>
> Well it would naturally live inside the BPF evaluator since the
> loading is done in load_pointer.  This struct is just used to marshal
> the state around.

I think load_pointer() is better off being stateless.

>>> +
>>> +static unsigned int seccomp_run_filter(void *, uint32_t,
>>> +                                    const struct sock_filter *);
>>
>> It almosts fits, just put it on one line.
>
> I'm just doing what checkpatch suggests :)  Would it read better if I
> bumped all the args down a line?

No, just ignore checkpatch and put it on one line.

If you can't live with the warning, change the return value to int or
drop the seccomp_ bit, as it's a local static function anyway. Or just
move the function above its first use so you don't need this.

>>> +
>>> +/**
>>> + * seccomp_filter_alloc - allocates a new filter object
>>> + * @padding: size of the insns[0] array in bytes
>>> + *
>>> + * The @padding should be a multiple of
>>> + * sizeof(struct sock_filter).
>>
>> It's not padding, why call it such? Why not 'count' or 'bpf_blocks'?
>> That would reduce the sizeof(struct sock_filter)s to one.
>
> Either work - I'll change it.
>
>>> + *
>>> + * Returns ERR_PTR on error or an allocated object.
>>> + */
>>> +static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
>>> +{
>>> +     struct seccomp_filter *f;
>>> +     unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
>>> +
>>> +     /* Drop oversized requests. */
>>> +     if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
>>> +             return ERR_PTR(-EINVAL);
>>> +
>>> +     /* Padding should always be in sock_filter increments. */
>>> +     if (padding % sizeof(struct sock_filter))
>>> +             return ERR_PTR(-EINVAL);
>>
>> ...then this isn't needed.
>>
>>> +
>>> +     f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
>>> +     if (!f)
>>> +             return ERR_PTR(-ENOMEM);
>>> +     kref_init(&f->usage);
>>> +     f->count = bpf_blocks;
>>> +     return f;
>>> +}
>>> +
>>> +/**
>>> + * seccomp_filter_free - frees the allocated filter.
>>> + * @filter: NULL or live object to be completely destructed.
>>> + */
>>> +static void seccomp_filter_free(struct seccomp_filter *filter)
>>> +{
>>> +     if (!filter)
>>> +             return;
>>> +     put_seccomp_filter(filter->parent);
>>
>> Why do put on the parent here?
>
> filters may be shared across tasks and there may be multiple filters
> per-task.  The references keep them alive beyond the life of the
> originating task_struct or if they are not hanging off of the
> task_struct directly.  In order for them to ever die, the reference to
> them needs to be dropped and that's done here.

Yeah, I figured that out later on. It's quite nifty, but I find the recursion
within kref_put() slightly worrisome. Perhaps the code would be cleaner if this
was avoided and done differently, but I can't think of a good alternative. I'll
wait for the new version to see if I can find a way.

>> This function is only called once below by __put_seccomp_filter, perhaps merge them?
>
> Cool I'll do that.
>
>>> +     kfree(filter);
>>> +}
>>> +
>>> +static void __put_seccomp_filter(struct kref *kref)
>>> +{
>>> +     struct seccomp_filter *orig =
>>> +             container_of(kref, struct seccomp_filter, usage);
>>> +     seccomp_filter_free(orig);
>>
>> I find it unexpected that __put_seccomp_filter() calls put_seccomp_filter(),
>> why have two versions then? I have the feeling all this can be streamlined
>> more.
>
> So I can either make all callers use kref_put(&filter->usage,
> __put_seccomp_filter) or I can have one extra layer of indirection:
> put_seccomp_filter.  I can definitely ditch the extra free wrapper
> though.  I can also make put_seccomp_filter into a static inline if
> that'd be preferable -- or just use kref everywhere.

Just ditch seccomp_filter_free() and make put_seccomp_filter() a static
inline. You want one extra if check at most when this isn't being used.

>>> +}
>>> +
>>> +void seccomp_filter_log_failure(int syscall)
>>> +{
>>> +     pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
>>> +             current->comm, task_pid_nr(current), syscall,
>>> +             KSTK_EIP(current));
>>> +}
>>> +
>>> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */
>>> +void put_seccomp_filter(struct seccomp_filter *orig)
>>> +{
>>> +     if (!orig)
>>> +             return;
>>> +     kref_put(&orig->usage, __put_seccomp_filter);
>>> +}
>>> +
>>> +/* get_seccomp_filter - increments the reference count of @orig. */
>>> +struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
>>> +{
>>> +     if (!orig)
>>> +             return NULL;
>>> +     kref_get(&orig->usage);
>>> +     return orig;
>>> +}
>>
>> This function should be merged with seccomp_struct_fork(), it's pretty much only
>> called there.
>
> It's used primarily during filter attach, but if these wrappers are
> bothersome, I can just use kref_* directly everywhere.

That may be going too far to the other side of things. Too many wrappers are
bothersome, but if they avoid code duplication and make things clearer, they
are good. It's a balance.

>
>>> +
>>> +#if BITS_PER_LONG == 32
>>> +static inline unsigned long *seccomp_filter_data_arg(
>>> +                             struct seccomp_filter_data *data, int index)
>>> +{
>>> +     /* Avoid inconsistent hi contents. */
>>> +     data->args[index].hi32 = 0;
>>> +     return (unsigned long *) &(data->args[index].lo32);
>>> +}
>>> +#elif BITS_PER_LONG == 64
>>> +static inline unsigned long *seccomp_filter_data_arg(
>>> +                             struct seccomp_filter_data *data, int index)
>>> +{
>>> +     return (unsigned long *) &(data->args[index].u64);
>>
>> No need for any casts, is there? And this looks a bit untidy.
>
> Yup - gcc isn't a fan even if the size matches.  Won't matter with the
> new change though.

Strange.

>>
>>> +}
>>> +#else
>>> +#error Unknown BITS_PER_LONG.
>>> +#endif
>>> +
>>> +/**
>>> + * seccomp_load_pointer: checks and returns a pointer to the requested offset
>>> + * @buf: u8 array to index into
>>> + * @buflen: length of the @buf array
>>> + * @offset: offset to return data from
>>> + * @size: size of the data to retrieve at offset
>>> + * @unused: placeholder which net/core/filter.c uses for for temporary
>>> + *          storage.  Ideally, the two code paths can be merged.
>>
>> The documentation doesn't matches the actual code?
>
> Thanks - sorry.  Last minute changes that I failed to fix the comment for.
>
>>> + *
>>> + * Returns a pointer to the BPF evaluator after checking the offset and size
>>> + * boundaries.
>>> + */
>>> +static inline void *seccomp_load_pointer(void *data, int offset, size_t size,
>>> +                                      void *buffer)
>>> +{
>>> +     struct seccomp_filter_metadata *metadata = data;
>>> +     int arg;
>>> +     if (offset >= sizeof(metadata->data))
>>> +             goto fail;
>>> +     if (offset < 0)
>>> +             goto fail;
>>
>> Perhaps use unsigned offset?
>
> I'm using what the bpf code specifies already.  Otherwise I would use
> an unsigned int. I guess I could cast.

The networking code uses negative indices to get to ancillary data.

I would make it unsigned from the start or make just the 'offset' argument unsigned.

>>> +     if (size > sizeof(metadata->data) - offset)
>>> +             goto fail;
>>> +     if (metadata->has_args)
>>> +             goto pass;
>>> +     /* No argument data touched. */
>>> +     if (offset + size - 1 < offsetof(struct seccomp_filter_data, args))
>>> +             goto pass;
>>> +     for (arg = 0; arg < ARRAY_SIZE(metadata->data.args); ++arg)
>>> +             syscall_get_arguments(current, task_pt_regs(current), arg, 1,
>>> +                     seccomp_filter_data_arg(&metadata->data, arg));
>>> +     metadata->has_args = true;
>>
>> Wasn't the idea that arguments would be loaded on demand?
>
> Yes and they are and then cached.

I mean individually loaded on demand, not all or nothing. Usually only one
or two args are checked, and never rechecked, so caching doesn't make sense.

>> And as arguments are usually only checked once and not twice, caching the
>> values isn't really needed. So perhaps get rid of the (meta)data structure
>> and just have a switch that returns the right data directly depending on
>> the offset? It seems the over all code would be simpler that way.
>
> Not really. You can make a case statement per range of offsets that
> overlap with an argument, but that doesn't account for size.  You
> could load a long across two arguments.  We're not enforcing alignment
> as the networking code doesn't either.  I could have it fail if the
> request is not 32-bit aligned.  I don't prefer that, but it would
> simplify the code.  It's certainly easier to change the ABI later to
> provide unaligned access then it is to make it aligned later, so I can
> change to aligned requests only that return the appropriate 32-bit
> slice of the data.

I would go for forcing alignment, unaligned accesses on arguments don't make
much sense. Only unaligned access that makes sense is reading the two middle
bytes of an int. So I would force all accesses to fall within 32 bits. This
avoids the unaligned loading problem and would make load_pointer() stateless.

>>> +pass:
>>> +     return ((__u8 *)(&metadata->data)) + offset;
>>> +fail:
>>> +     return NULL;
>>> +}
>>> +
>>> +/**
>>> + * seccomp_test_filters - tests 'current' against the given syscall
>>> + * @syscall: number of the system call to test
>>> + *
>>> + * Returns 0 on ok and non-zero on error/failure.
>>> + */
>>> +int seccomp_test_filters(int syscall)
>>> +{
>>> +     int ret = -EACCES;
>>> +     struct seccomp_filter *filter;
>>> +     struct seccomp_filter_metadata metadata;
>>> +
>>> +     filter = current->seccomp.filter; /* uses task ref */
>>> +     if (!filter)
>>> +             goto out;
>>
>> Actually, shouldn't having no filter with SECCOMP_MODE_FILTER set be impossible?
>
> Yeah.
>
>> It seems better to assure that instead of needlessly testing for it every time.
>
> I guess so. I personally prefer this style of defensive checking, but
> I realize that's at odds with the kernel style. I'll drop it.

I'm all for defensive checking, but in this case it's way too late to save you,
because if you messed up, 'filter' is more likely pointing to stale data than to
NULL.

>>> +
>>> +     metadata.data.syscall_nr = syscall;
>>> +     metadata.has_args = false;
>>> +
>>> +#ifdef CONFIG_COMPAT
>>> +     if (filter->flags.compat != !!(is_compat_task()))
>>> +             goto out;
>>> +#endif
>>> +
>>> +     /* Only allow a system call if it is allowed in all ancestors. */
>>> +     ret = 0;
>>> +     for ( ; filter != NULL; filter = filter->parent) {
>>> +             /* Allowed if return value is SECCOMP_BPF_ALLOW */
>>> +             if (seccomp_run_filter(&metadata, sizeof(metadata.data),
>>> +                                     filter->insns) != SECCOMP_BPF_ALLOW)
>>> +                     ret = -EACCES;
>>
>> Why keep checking filters if one of them fails? Just return -EACCES?
>
> Why optimize for a failure mode?  Since fails are terminal, I don't
> see any reason to rush :) That said, this was more relevant when I had
> ptrace integration that depended on knowing which filter the failure
> occurred in. I'll change it.

I'd guess that would be easier if any failure would break the loop.
But I don't think there's a good way to tell ptrace which filter
it is because there is no way to identify them. And filters may be
installed before the ptrace process, so numbering them doesn't always
work either. And ptrace has all the same info as the filters, so it
could figure it out for itself.

>>> +     }
>>> +out:
>>> +     return ret;
>>> +}
>>> +
>>> +/**
>>> + * seccomp_attach_filter: Attaches a seccomp filter to current.
>>> + * @fprog: BPF program to install
>>> + *
>>> + * Context: User context only. This function may sleep on allocation and
>>> + *          operates on current. current must be attempting a system call
>>> + *          when this is called (usually prctl).
>>> + *
>>> + * This function may be called repeatedly to install additional filters.
>>> + * Every filter successfully installed will be evaluated (in reverse order)
>>> + * for each system call the thread makes.
>>> + *
>>> + * Returns 0 on success or an errno on failure.
>>> + */
>>> +long seccomp_attach_filter(struct sock_fprog *fprog)
>>> +{
>>> +     struct seccomp_filter *filter = NULL;
>>> +     /* Note, len is a short so overflow should be impossible. */
>>> +     unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>>
>> That's a crude way of doing it. The max is BPF_MAXINSNS and that is checked a
>> bit late by sk_chk_filter(). But perhaps while you're consolidating the code
>> with the networking filter code, you could change their code to check that
>> first before trying to allocate the given amount.
>
> Sure - I can send those patches independently and see if they are
> accepted.  It doesn't change the fact that it can't overflow because
> fprog->len is a short and the sizeof struct sock_filter is less than
> 65535.

Yeah, never mind, I missed that networking uses a short too.

>>> +     long ret = -EPERM;
>>> +
>>> +     /* Allocate a new seccomp_filter */
>>> +     filter = seccomp_filter_alloc(fp_size);
>>
>> Actually, your seccomp_filter_alloc() checks for BPF_MAXINSNS already, and if
>> you change that one as I recommended and you pass on fprog->len directly, no
>> overflow can happen.
>
> Works for me!
>
>>> +     if (IS_ERR(filter)) {
>>> +             ret = PTR_ERR(filter);
>>> +             goto out;
>>> +     }
>>> +
>>> +     /* Copy the instructions from fprog. */
>>> +     ret = -EFAULT;
>>> +     if (copy_from_user(filter->insns, fprog->filter, fp_size))
>>> +             goto out;
>>> +
>>> +     /* Check the fprog */
>>> +     ret = sk_chk_filter(filter->insns, filter->count);
>>> +     if (ret)
>>> +             goto out;
>>> +
>>> +     /*
>>> +      * If a process lacks CAP_SYS_ADMIN in its namespace, force
>>> +      * this process and all descendents to run with no_new_privs.
>>> +      * A privileged process will need to set this bit independently,
>>> +      * if desired.
>>> +      */
>>> +     if (security_capable_noaudit(current_cred(), current_user_ns(),
>>> +                                  CAP_SYS_ADMIN) != 0)
>>> +             current->no_new_privs = 1;
>>> +
>>> +     /*
>>> +      * If there is an existing filter, make it the parent
>>> +      * and reuse the existing task-based ref.
>>
>> You're not reusing the existing task-based ref, are you?
>>
>> You can't reuse it anyway. Perhaps and old leftover comment?
>
> I am reusing it.  When current->seccomp.filter is installed -- via
> fork or attach -- a get_ is done.  That is for the task.  Here, we are
> bumping current->seccomp.filter to filter->parent without calling
> put_.  This reuses the task ref for the older filter.  The new filter
> has a ref already from during alloc and that is transferred for use by
> the task.

I would expect "reusing" to mean that you start with the old reference count,
or use an existing kref instead of a new one. Neither happens. What you're
doing seems more like just using it.

>
>>> +      */
>>> +     filter->parent = current->seccomp.filter;
>>
>> I think this needs a more verbose comment, here or elsewhere explaining how
>> important it is that new filters are added to the start and old filters are
>> never removed, or else list corruption can happen. This way assures that
>> forward pointers are always valid, and that's the reason to use a singly
>> linked list instead of a standard doubly linked list.
>
> Cool - I can do that.
>
>>> +#ifdef CONFIG_COMPAT
>>> +     /* Disallow changing system calling conventions after the fact. */
>>> +     filter->flags.compat = !!(is_compat_task());
>>> +
>>> +     if (filter->parent &&
>>> +         filter->parent->flags.compat != filter->flags.compat)
>>> +             return -EACCES;
>>
>> This shouldn't be possible and checking it here is too late if it somehow
>> happened anyway.
>
> Yeah you're right. Since mismatched-compat syscalls are blocked, we'd
> never hit this point.  Thanks!
>
>> (And you're leaking memory because current->seccomp.filter
>> isn't updated yet.)
>
> Fixed in v6 :)
>
>>> +#endif
>>> +
>>> +     /*
>>> +      * Double claim the new filter so we can release it below simplifying
>>> +      * the error paths earlier.
>>> +      */
>>
>> No it doesn't.
>
> True enough. It did when I was doing ftrace setup. :/
>
>>> +     ret = 0;
>>> +     get_seccomp_filter(filter);
>>> +     current->seccomp.filter = filter;
>>> +     /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
>>> +     if (current->seccomp.mode == SECCOMP_MODE_DISABLED) {
>>> +             current->seccomp.mode = SECCOMP_MODE_FILTER;
>>> +             set_thread_flag(TIF_SECCOMP);
>>> +     }
>>> +
>>
>> Just add a return 0; here, much simpler and removes 6 lines.
>>
>>> +out:
>>> +     put_seccomp_filter(filter);  /* for get or task, on err */
>>> +     return ret;
>>> +}
>>> +
>>> +#ifdef CONFIG_COMPAT
>>> +/* This should be kept in sync with net/compat.c which changes infrequently. */
>>> +struct compat_sock_fprog {
>>> +     u16 len;
>>> +     compat_uptr_t filter;   /* struct sock_filter */
>>> +};
>>> +
>>> +static long compat_attach_seccomp_filter(char __user *optval)
>>> +{
>>> +     struct compat_sock_fprog __user *fprog32 =
>>> +             (struct compat_sock_fprog __user *)optval;
>>> +     struct sock_fprog __user *kfprog =
>>> +             compat_alloc_user_space(sizeof(struct sock_fprog));
>>> +     compat_uptr_t ptr;
>>> +     u16 len;
>>> +
>>> +     if (!access_ok(VERIFY_READ, fprog32, sizeof(*fprog32)) ||
>>> +         !access_ok(VERIFY_WRITE, kfprog, sizeof(struct sock_fprog)) ||
>>> +         __get_user(len, &fprog32->len) ||
>>> +         __get_user(ptr, &fprog32->filter) ||
>>> +         __put_user(len, &kfprog->len) ||
>>> +         __put_user(compat_ptr(ptr), &kfprog->filter))
>>> +             return -EFAULT;
>>> +
>>> +     return seccomp_attach_filter(kfprog);
>>> +}
>>> +#endif
>>> +
>>> +long prctl_attach_seccomp_filter(char __user *user_filter)
>>> +{
>>> +     struct sock_fprog fprog;
>>> +     long ret = -EINVAL;
>>> +     ret = -EFAULT;
>>
>> Why not set it to -EFAULT directly?
>
> No reason. Looks like I cut out some code and didn't fix it.
>
>>> +     if (!user_filter)
>>> +             goto out;
>>> +
>>> +#ifdef CONFIG_COMPAT
>>> +     if (is_compat_task())
>>> +             return compat_attach_seccomp_filter(user_filter);
>>> +#endif
>>> +
>>> +     if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
>>> +             goto out;
>>> +
>>> +     ret = seccomp_attach_filter(&fprog);
>>> +out:
>>> +     return ret;
>>> +}
>>> +
>>> +/**
>>> + * seccomp_struct_fork: manages inheritance on fork
>>> + * @child: forkee's seccomp_struct
>>> + * @parent: forker's seccomp_struct
>>> + *
>>> + * Ensures that @child inherits seccomp mode and state iff
>>
>> 'if'
>
> I guess they are equivalent for this purpose.

Ah, I mistook it for a typo. Never mind.

>>> + * seccomp filtering is in use.
>>> + */
>>> +void seccomp_struct_fork(struct seccomp_struct *child,
>>> +                      const struct seccomp_struct *parent)
>>> +{
>>> +     child->mode = parent->mode;
>>> +     if (parent->mode != SECCOMP_MODE_FILTER)
>>> +             return;
>>> +     child->filter = get_seccomp_filter(parent->filter);
>>
>> I think how you manage the life times of the filters is very clever and that it
>> needs some commenting because it is more subtle than it seems at first glance.
>> The rules seem to be:
>>
>> - The list is sorted from new to old.
>>
>> - Never remove old filters.
>>
>> - The life-time of a group of filters (all filters set by one process between
>>  fork calls) is managed by the one at the head of the group.
>>
>> This forms a tree with the head filters at forking points with the filters
>> freed from the leave direction.
>
> Exactly - sorry for it being poorly documented.  I will attempt to
> document it more clearly in the next revision (probably using some of
> your text above).

No problem, that's what reviews are for!

>>> +}
>>> +
>>> +/**
>>> + * seccomp_run_filter - evaluate BPF
>>> + *   @buf: opaque buffer to execute the filter over
>>> + *   @buflen: length of the buffer
>>
>> Out of date.
>
> Doh. Thanks.
>
>>> + *   @fentry: filter to apply
>>> + *
>>> + * Decode and apply filter instructions to the buffer.  Return length to
>>> + * keep, 0 for none.
>>
>> This doesn't make sense in the context of system calls. Just return 0 for
>> allowing the system call and -1 to deny it.
>
> I know but all the shared error paths in the evaluator will return 0.
> So on merge with the networking evaluator the return value for errors
> needs to be abstracted too.  It'd be nice to not require more changes,
> but it might be doable, I don't know.

Yeah, I didn't expect it to be so deeply ingrained.

>>> @buf is a seccomp_filter_metadata we are filtering,
>>> + * @filter is the array of filter instructions.  Because all jumps are
>>> + * guaranteed to be before last instruction, and last instruction
>>> + * guaranteed to be a RET, we dont need to check flen.
>>> + *
>>> + * See core/net/filter.c as this is nearly an exact copy.
>>> + * At some point, it would be nice to merge them to take advantage of
>>> + * optimizations (like JIT).
>>> + */
>>> +static unsigned int seccomp_run_filter(void *data, uint32_t datalen,
>>> +                       const struct sock_filter *fentry)
>>> +{
>>> +     const void *ptr;
>>> +     u32 A = 0;                      /* Accumulator */
>>> +     u32 X = 0;                      /* Index Register */
>>> +     u32 mem[BPF_MEMWORDS];          /* Scratch Memory Store */
>>> +     u32 tmp;
>>> +     int k;
>>> +
>>> +     /*
>>> +      * Process array of filter instructions.
>>> +      */
>>> +     for (;; fentry++) {
>>> +#if defined(CONFIG_X86_32)
>>> +#define      K (fentry->k)
>>> +#else
>>> +             const u32 K = fentry->k;
>>> +#endif
>>> +
>>> +             switch (fentry->code) {
>>> +             case BPF_S_ALU_ADD_X:
>>> +                     A += X;
>>> +                     continue;
>>> +             case BPF_S_ALU_ADD_K:
>>> +                     A += K;
>>> +                     continue;
>>> +             case BPF_S_ALU_SUB_X:
>>> +                     A -= X;
>>> +                     continue;
>>> +             case BPF_S_ALU_SUB_K:
>>> +                     A -= K;
>>> +                     continue;
>>> +             case BPF_S_ALU_MUL_X:
>>> +                     A *= X;
>>> +                     continue;
>>> +             case BPF_S_ALU_MUL_K:
>>> +                     A *= K;
>>> +                     continue;
>>> +             case BPF_S_ALU_DIV_X:
>>> +                     if (X == 0)
>>> +                             return 0;
>>> +                     A /= X;
>>> +                     continue;
>>> +             case BPF_S_ALU_DIV_K:
>>> +                     A = reciprocal_divide(A, K);
>>> +                     continue;
>>> +             case BPF_S_ALU_AND_X:
>>> +                     A &= X;
>>> +                     continue;
>>> +             case BPF_S_ALU_AND_K:
>>> +                     A &= K;
>>> +                     continue;
>>> +             case BPF_S_ALU_OR_X:
>>> +                     A |= X;
>>> +                     continue;
>>> +             case BPF_S_ALU_OR_K:
>>> +                     A |= K;
>>> +                     continue;
>>> +             case BPF_S_ALU_LSH_X:
>>> +                     A <<= X;
>>> +                     continue;
>>> +             case BPF_S_ALU_LSH_K:
>>> +                     A <<= K;
>>> +                     continue;
>>> +             case BPF_S_ALU_RSH_X:
>>> +                     A >>= X;
>>> +                     continue;
>>> +             case BPF_S_ALU_RSH_K:
>>> +                     A >>= K;
>>> +                     continue;
>>> +             case BPF_S_ALU_NEG:
>>> +                     A = -A;
>>> +                     continue;
>>> +             case BPF_S_JMP_JA:
>>> +                     fentry += K;
>>> +                     continue;
>>> +             case BPF_S_JMP_JGT_K:
>>> +                     fentry += (A > K) ? fentry->jt : fentry->jf;
>>> +                     continue;
>>> +             case BPF_S_JMP_JGE_K:
>>> +                     fentry += (A >= K) ? fentry->jt : fentry->jf;
>>> +                     continue;
>>> +             case BPF_S_JMP_JEQ_K:
>>> +                     fentry += (A == K) ? fentry->jt : fentry->jf;
>>> +                     continue;
>>> +             case BPF_S_JMP_JSET_K:
>>> +                     fentry += (A & K) ? fentry->jt : fentry->jf;
>>> +                     continue;
>>> +             case BPF_S_JMP_JGT_X:
>>> +                     fentry += (A > X) ? fentry->jt : fentry->jf;
>>> +                     continue;
>>> +             case BPF_S_JMP_JGE_X:
>>> +                     fentry += (A >= X) ? fentry->jt : fentry->jf;
>>> +                     continue;
>>> +             case BPF_S_JMP_JEQ_X:
>>> +                     fentry += (A == X) ? fentry->jt : fentry->jf;
>>> +                     continue;
>>> +             case BPF_S_JMP_JSET_X:
>>> +                     fentry += (A & X) ? fentry->jt : fentry->jf;
>>> +                     continue;
>>> +             case BPF_S_LD_W_ABS:
>>> +                     k = K;
>>> +load_w:
>>> +                     ptr = seccomp_load_pointer(data, k, 4, &tmp);
>>> +                     if (ptr != NULL) {
>>> +                             /*
>>> +                              * Assume load_pointer did any byte swapping.
>>> +                              */
>>> +                             A = *(const u32 *)ptr;
>>> +                             continue;
>>> +                     }
>>> +                     return 0;
>>> +             case BPF_S_LD_H_ABS:
>>> +                     k = K;
>>> +load_h:
>>> +                     ptr = seccomp_load_pointer(data, k, 2, &tmp);
>>> +                     if (ptr != NULL) {
>>> +                             A = *(const u16 *)ptr;
>>> +                             continue;
>>> +                     }
>>> +                     return 0;
>>> +             case BPF_S_LD_B_ABS:
>>> +                     k = K;
>>> +load_b:
>>> +                     ptr = seccomp_load_pointer(data, k, 1, &tmp);
>>> +                     if (ptr != NULL) {
>>> +                             A = *(const u8 *)ptr;
>>> +                             continue;
>>> +                     }
>>> +                     return 0;
>>> +             case BPF_S_LD_W_LEN:
>>> +                     A = datalen;
>>> +                     continue;
>>> +             case BPF_S_LDX_W_LEN:
>>> +                     X = datalen;
>>> +                     continue;
>>> +             case BPF_S_LD_W_IND:
>>> +                     k = X + K;
>>> +                     goto load_w;
>>> +             case BPF_S_LD_H_IND:
>>> +                     k = X + K;
>>> +                     goto load_h;
>>> +             case BPF_S_LD_B_IND:
>>> +                     k = X + K;
>>> +                     goto load_b;
>>> +             case BPF_S_LDX_B_MSH:
>>> +                     ptr = seccomp_load_pointer(data, K, 1, &tmp);
>>> +                     if (ptr != NULL) {
>>> +                             X = (*(u8 *)ptr & 0xf) << 2;
>>> +                             continue;
>>> +                     }
>>> +                     return 0;
>>> +             case BPF_S_LD_IMM:
>>> +                     A = K;
>>> +                     continue;
>>> +             case BPF_S_LDX_IMM:
>>> +                     X = K;
>>> +                     continue;
>>> +             case BPF_S_LD_MEM:
>>> +                     A = mem[K];
>>> +                     continue;
>>> +             case BPF_S_LDX_MEM:
>>> +                     X = mem[K];
>>> +                     continue;
>>> +             case BPF_S_MISC_TAX:
>>> +                     X = A;
>>> +                     continue;
>>> +             case BPF_S_MISC_TXA:
>>> +                     A = X;
>>> +                     continue;
>>> +             case BPF_S_RET_K:
>>> +                     return K;
>>> +             case BPF_S_RET_A:
>>> +                     return A;
>>> +             case BPF_S_ST:
>>> +                     mem[K] = A;
>>> +                     continue;
>>> +             case BPF_S_STX:
>>> +                     mem[K] = X;
>>> +                     continue;
>>> +             case BPF_S_ANC_PROTOCOL:
>>> +             case BPF_S_ANC_PKTTYPE:
>>> +             case BPF_S_ANC_IFINDEX:
>>> +             case BPF_S_ANC_MARK:
>>> +             case BPF_S_ANC_QUEUE:
>>> +             case BPF_S_ANC_HATYPE:
>>> +             case BPF_S_ANC_RXHASH:
>>> +             case BPF_S_ANC_CPU:
>>> +             case BPF_S_ANC_NLATTR:
>>> +             case BPF_S_ANC_NLATTR_NEST:
>>> +                     continue;
>>> +             default:
>>> +                     WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
>>> +                                    fentry->code, fentry->jt,
>>> +                                    fentry->jf, fentry->k);
>>> +                     return 0;
>>> +             }
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>
>> All in all it seems fairly easy to consolidate this with the networking code.
>> The only big change would be that the load_pointer() function pointer needs to
>> be passed, that way the same code can be used for both versions. The filter
>> checking code needs be expanded to check for network-only ops and reject the
>> filter if it's a seccomp one.
>
> Also byteswapping.  I think this can be done by adding pointer to the
> accumulator to load_pointer, then let that do the byteswapping too.  I

I would change load_pointer() to return the (byteswapped) value instead.
It gets all the info it needs to know what to do via 'size'. Oh bugger,
that messes up the error checking. Oh well.

> guess I'll go ahead and try my hand at a merged patch and add it to
> the end of the next series.  I wasn't planning on also attempting JIT
> integration yet though ;)

It's a nice follow-up project for someone after this is merged.

>>> diff --git a/kernel/sys.c b/kernel/sys.c
>>> index 4070153..8e43f70 100644
>>> --- a/kernel/sys.c
>>> +++ b/kernel/sys.c
>>> @@ -1901,6 +1901,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2,
unsigned
>>> long, arg3,
>>>               case PR_SET_SECCOMP:
>>>                       error = prctl_set_seccomp(arg2);
>>>                       break;
>>> +             case PR_ATTACH_SECCOMP_FILTER:
>>> +                     error = prctl_attach_seccomp_filter((char __user *)
>>> +                                                             arg2);
>>
>> Please one line.
>
> checkpatch was unhappy :)

Now I'm unhappy.

You're not the first one doing that though, so it seems the function needs a

char __user* arg2_p = (char __user *)arg2;

at the top and let everyone happily use that, as it's always arg2 that needs casting.

>
>> Actually, wouldn't it make more sense to use PR_SET_SECCOMP and check arg2 to
>> see if it's filtering, and then get the filter itself from arg3 or something?
>
> This is definitely preferable. Maybe something as simple as:
>   prctl(PR_SET_SECCOMP, 2, fprog);
> I'll make the changes - thanks!
>
>> Or just stop calling it seccomp of course. PR_ATTACH_SYSCALL_FILTER?
>
> Fine with me too, but it seems to make logical sense to keep it paired
> with seccomp.  Especially if mode=1 just becomes a hardcoded bpf.

Either way is fine with me. (But don't use bpf for mode=1.)

>>> +                     break;
>>>               case PR_GET_TSC:
>>>                       error = GET_TSC_CTL(arg2);
>>>                       break;
>>> diff --git a/security/Kconfig b/security/Kconfig
>>> index 51bd5a0..e1ffed8 100644
>>> --- a/security/Kconfig
>>> +++ b/security/Kconfig
>>> @@ -84,6 +84,26 @@ config SECURITY_DMESG_RESTRICT
>>>
>>>         If you are unsure how to answer this question, answer N.
>>>
>>> +config SECCOMP_FILTER
>>> +     bool "Enable seccomp-based system call filtering"
>>> +     select SECCOMP
>>> +     help
>>> +       This option provide support for limiting the accessibility of
>>> +       systems calls at a task-level using a dynamically defined policy.
>>> +
>>> +       System call filtering policy is expressed by the user using
>>> +       a Berkeley Packet Filter program.  The program is attached using
>>> +       prctl(2).  For every system call the task makes, its number,
>>> +       arguments, and other metadata will be evaluated by the attached
>>> +       filter program.  The result determines if the system call may
>>> +       may proceed or if the task should be terminated.
>>> +
>>> +       This behavior is meant to aid security-conscious software in
>>> +       its ability to minimize the risk of running potentially
>>> +       risky code.
>>> +
>>> +       See Documentation/prctl/seccomp_filter.txt for more detail.
>>> +
>>>  config SECURITY
>>>       bool "Enable different security models"
>>>       depends on SYSFS
>>> --
>>> 1.7.5.4
>>>
>>>
>>
>> Greetings,
>
> Thanks for the comprehensive review and feedback!
> will

Cheers,

Indan



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-30  2:37       ` Indan Zupancic
@ 2012-01-30 22:26         ` Will Drewry
  2012-01-30 22:29           ` Andrew Lutomirski
  2012-01-31  1:42           ` Indan Zupancic
  0 siblings, 2 replies; 26+ messages in thread
From: Will Drewry @ 2012-01-30 22:26 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

On Sun, Jan 29, 2012 at 8:37 PM, Indan Zupancic <indan@nul.nu> wrote:
>> On Sat, Jan 28, 2012 at 10:39 PM, Indan Zupancic <indan@nul.nu> wrote:
...
> Actually, now I think about it, the current networking BPF programs
> aren't binary portable either, because of the endianness of 32-bit
> integers. It's still cleaner to have one structure layout instead
> of two though.

I agree. I'm changing it all out now.
...
>>> If this is added, then this will be seccomp,
>>> and the old mode can be seen as a standard, very restrictive filter.
>>
>> This is totally true.  I originally did not pursue this because I
>> would have needed 1 hardcoded BPF filter per arch.  With the new
>> syntax, I believe it will be possible to define a 32-bit filter and a
>> 64-bit filter (and a compat filter) in one central place.
>
> I only meant conceptually, I don't think it's a good idea to actually
> do that with a hard-coded filter!

No fun :)


>> The network filtering code adds one function that is a switch
>> statement over the possible valid values.  This doesn't duplicate the
>> checking code.
>
> Yes, but that doesn't answer my question.

Sorry - it added approx 861 bytes.

> The advantage of using the network code would be to reduce the only impact
> this has for people not using it: Kernel binary size. So if you want it to
> be small enough that it's not worth a separate config option, then it has
> to be tiny.

I'm still in the process of reworking based on the discussions we've
had, but I've already dropped 3kB of bloat.  Relying on
net/core/filter.c dropped off the 861 and only increased its size 292
bytes.  Merging with seccomp.c and cleaning up the functions is
helping frop the rest.  The current changes to seccomp.o add 2148
bytes to a x86_32 build.  Hopefully when I'm done cleaning up, it'll
be smaller.

Do you think something along the lines of 2 kB is sane for a config-less change?

...
>> The reason that I am not merging upfront is because the load_pointer
>> and byte swapping behavior is hard-coded in the current versions.  I
>
> Just move the byte swapping into the networking load_pointer(), and
> add an extra function pointer argument which tells which load_pointer()
> function to call.

Doing exactly that.  I've been tinkering with the best way to minimize
the impact to the existing BPF evaluator.  Right now, I'm adding a
very small number of new instructions to the sk_buff specific code
path, but I haven't yet benchmarked - just disasssembled.

>> believe it will be possible to add a layer of abstraction that doesn't
>> induce additional overhead, but I didn't want to force that first.  If
>> that will really be a blocker, then I will go ahead and do that first.
>> I was just worried that the potential performance impact discussions
>> could lead this patch to a dead end before it got a chance.
>
> To know how much it is needed depends on what the difference in kernel
> size is with and without CONFIG_SECCOMP_FILTER. If it's 4kB then you
> may have a good point arguing that it may not be worth consolidating.
> But considering that you already depend on the networking checking code,
> it just seems cleaner to share everything and to make it explicit. This
> avoids problems in the future when people change the networking checking
> code without being aware that it is being used elsewhere too.

I agree. I will post the next series with a proposed integration. If
there is a lot of resistance, then the difference will be going from a
2kB changes to a 3kB change.

> And performance differences can and should be measured. You don't want
> to accidentally slow down all system calls, so it's good to double
> check that you don't.

Agreed.

>> Right now it requires asm/syscall.h which CONFIG_SECCOMP doesn't.
>> Some arches still haven't implemented that file so I thought it would
>> make sense to keep it separate.
>
> I think you should go on a quest to make sure (almost) all archs have that file,
> before this patch can be merged. At least the archs that have ptrace support.

I'm an idiot.  CONFIG_HAVE_ARCH_TRACEHOOK covers asm/syscall.h

So I have two choices:
1. allow seccomp with filtering on these platforms by fail if an
argument is accessed
2. return ENOSYS when a filter is attempted to be installed on
platforms with no tracehook support.

I think #2 is the nicest user contract, but #1 allows working filters
even on less hospitable arches even if they can't user arguments
(yet).  I'm coding it up as #2, but it is easy to switch to #1.


>> Would it make sense to start with a config option then merge later or
>> should it be done upfront?
>
> Yes, I think that would make sense. But I also think no one will enable it
> if it's a separate option because it's just very obscure (especially because
> no one uses it yet). So if it's good enough it should be enabled by default,
> at least for the archs that are supported. If it's coupled to SECCOMP, you
> can bundle it with that (and networking).

I'm attempting to couple it now. If there are concerns, but not about
code size, then the config option could just control whether it
returns -ENOSYS or not.


>>>> +extern void seccomp_struct_fork(struct seccomp_struct *child,
>>>> +                             const struct seccomp_struct *parent);
>>>
>>> Lousy function name, what is it doing?
>>
>> Now it is called seccomp_fork() and it is called by fork when a task
>> fork()s :)  Naming suggestions welcome!
>
> That's better. My mind automatically associated the 'struct' with the fork,
> which didn't make sense.

After cleaning things up, I was going to change it to copy_seccomp()
to match other copy_process-called functions.  I can keep it
seccomp_fork() or whatever though :)

>> Neither are the networking code defines :)  They are exported to
>> userspace but I don't believe they've ever changed.
>
> Probably because they are useless as they can only be checked at compile time,
> while the kernel version they are run on can be different. But if you want to
> reuse the existing tools to create BPF filters, being compatible with the
> networking code would be handy.

I'm just going to try to go all out with a network merge right now. We'll see.

> Oh wait, I missed this before: Just use the existing networking filter.h
> instead of redefining anything, and only add seccomp specific stuff to
> your file, the seccomp_filter_data.

Doing that and I'll rename seccomp_filter_data to just seccomp_data.

> struct seccomp_filter_data {
>        int syscall_nr;
>        unsigned int arg_lo[6];
>        unsigned int arg_hi[6];
>        int __reserved[3];
> };
>
> I'm not sure what the advantage of using __u32 would be, I think we can count on
> unsigned int always being 32 bits?

Sure - bpf uses u32 as its accumulator value, so I was sticking with
the matching type. sock_filters also use u32 for k.

Since they should be the same, it doesn't matter, but I liked keeping
them the same.  I can change them if it is necessary.

> (Is my email client mangling tabs or is yours?)

Probably mine - sorry :/

>> It just means two copies for every system call argument instead of
>> one, but who knows, maybe the compiler will magic it away.
>> Regardless, any argument access will always be slightly slower than
>> just syscall number, so it doesn't hurt if the extra copy doesn't go
>> away.
>
> True. If you add 64 bit support to BPF then your way is much better of
> course. But as long as it is 32-bit I don't think it makes much sense
> to make it more complicated than necessary.

Yeah and BPF going 64-bit would be a complete change of user space
ABI, so if it went that route, it could also include an alternate
struct definition.

> Your way it's a lot harder to see what the actual data layout is, including
> offsets. My way the offset to the low bits is arg_nr * sizeof(int), if
> counting args at 1 instead of 0, which is usually done with args (at least
> by gcc).

Yeah -- it is cleaner across the board I think.

>> I just want an abi that is functional across platforms that people are
>> happy enough to merge.  The one I posted is fine with me and so is
>> your proposal.  So if no one else has strong opinions, I'll just use
>> yours :)
>
> I'm just giving my opinion of course, I would like to know what other
> people prefer as well.

It's appreciated.  Hopefully once the next changes settle down, we'll
see some more opinions.

...
>>>> +     seccomp_struct_fork(&p->seccomp, &current->seccomp);
>>>
>>> This can be moved above, before the first error goto.
>>
>> As in where seccomp_struct_init_task is or somewhere else?  (Where
>> seccomp_init_task is would make sense.)
>
> Around there yes. It needs to  be always executed, before free_task() may be called
> because of an error path.

Makes sense - thanks!

...
>>  if (!seccomp_test_filters(this_syscall))
>>    return;
>>
>> if that'd read better?
>
> Considering you're going to add more return values later, I would let
> seccomp_test_filters() return the filter code and stuff it in a temporary
> variable and work on that instead, and use SECCOMP_BPF_DENY/ALLOW to make
> clear what's happening. It's a bit more code, but it's clearer how we got
> to our decision.
>
> And perhaps rename seccomp_test_filters() to seccomp_run_filters(), for
> consistency with the networking code.
>
> But do whatever you prefer, it doesn't matter much.

Sounds good. I'll bubble it up some of the way, then if new error
codes come, those can be wired up easily.

>>> I guess the code will be clearer than the comment. Parent task or filter?
>>> Is it a reverse linked list or something? I'm confused.
>>
>> I'll try to rewrite it.  The parent filter is whatever filter was
>> installed on the task at the time the current filter was installed.
>> It may have been inherited or be part of a sequence of filters
>> installed by the same task.
>
> I think I would just call it 'next' or 'prev', but that's me. 'parent' is slightly
> confusing in task context. The filter may come from the parent process, but it could
> as well be installed by the same process.

Agreed.

...
>>>> +     struct {
>>>> +             uint32_t compat:1;
>>>> +     } flags;
>>>
>>> This flag isn't documented above.
>>
>> Oops dropped it for a rev - I'll re-add it. Thanks!
>
> And in case you missed my comment below, change it to a bool?

Thanks - I'll do that.

...
>>>> +struct seccomp_filter_metadata {
>>>> +     struct seccomp_filter_data data;
>>>> +     bool has_args;
>>>> +};
>>>
>>> Why use 'bool' instead of a nifty struct { uint32_t : 1 }?
>>>
>>> It seems that has_args is something that should be stored elsewhere,
>>> it doesn't seem to warrant its own structure.
>>
>> Well it would naturally live inside the BPF evaluator since the
>> loading is done in load_pointer.  This struct is just used to marshal
>> the state around.
>
> I think load_pointer() is better off being stateless.

Agreed and it is now -- exactly as you described -- simple switch statement.

>>>> +
>>>> +static unsigned int seccomp_run_filter(void *, uint32_t,
>>>> +                                    const struct sock_filter *);
>>>
>>> It almosts fits, just put it on one line.
>>
>> I'm just doing what checkpatch suggests :)  Would it read better if I
>> bumped all the args down a line?
>
> No, just ignore checkpatch and put it on one line.

Hehe

> If you can't live with the warning, change the return value to int or
> drop the seccomp_ bit, as it's a local static function anyway. Or just
> move the function above its first use so you don't need this.

It doesn't matter if I used the networking one, but I'll reorganize
the files to avoid needless prototypes.

...
>>>> +static void seccomp_filter_free(struct seccomp_filter *filter)
>>>> +{
>>>> +     if (!filter)
>>>> +             return;
>>>> +     put_seccomp_filter(filter->parent);
>>>
>>> Why do put on the parent here?
>>
>> filters may be shared across tasks and there may be multiple filters
>> per-task.  The references keep them alive beyond the life of the
>> originating task_struct or if they are not hanging off of the
>> task_struct directly.  In order for them to ever die, the reference to
>> them needs to be dropped and that's done here.
>
> Yeah, I figured that out later on. It's quite nifty, but I find the recursion
> within kref_put() slightly worrisome. Perhaps the code would be cleaner if this
> was avoided and done differently, but I can't think of a good alternative. I'll
> wait for the new version to see if I can find a way.

Thanks - sure.  Since kref_put is just an atomic_dec_and_test followed
by a call to the function pointer, I wasn't too concerned.  Without
changing how the relationships are handled, I'm not sure how to
approach it differently (and still avoid races). IIRC, this isn't much
different to how namespaces work, they just use their own atomic
counters.

>> I'm using what the bpf code specifies already.  Otherwise I would use
>> an unsigned int. I guess I could cast.
>
> The networking code uses negative indices to get to ancillary data.
>
> I would make it unsigned from the start or make just the 'offset' argument unsigned.

Makes sense.  I just cast it locally unsigned and let the compiler
handle it.  (And for my current build it even saved an extra 20 bytes
:)

>>> Wasn't the idea that arguments would be loaded on demand?
>>
>> Yes and they are and then cached.
>
> I mean individually loaded on demand, not all or nothing. Usually only one
> or two args are checked, and never rechecked, so caching doesn't make sense.

I've done it that way now. I still need to test both the functionality
and the performance.

>>> And as arguments are usually only checked once and not twice, caching the
>>> values isn't really needed. So perhaps get rid of the (meta)data structure
>>> and just have a switch that returns the right data directly depending on
>>> the offset? It seems the over all code would be simpler that way.
>>
>> Not really. You can make a case statement per range of offsets that
>> overlap with an argument, but that doesn't account for size.  You
>> could load a long across two arguments.  We're not enforcing alignment
>> as the networking code doesn't either.  I could have it fail if the
>> request is not 32-bit aligned.  I don't prefer that, but it would
>> simplify the code.  It's certainly easier to change the ABI later to
>> provide unaligned access then it is to make it aligned later, so I can
>> change to aligned requests only that return the appropriate 32-bit
>> slice of the data.
>
> I would go for forcing alignment, unaligned accesses on arguments don't make
> much sense. Only unaligned access that makes sense is reading the two middle
> bytes of an int. So I would force all accesses to fall within 32 bits. This
> avoids the unaligned loading problem and would make load_pointer() stateless.

Works for me and how I've changed it now.  It's always easy to relax
this restriction later if it is really needed.

...
>>>> +     /* Only allow a system call if it is allowed in all ancestors. */
>>>> +     ret = 0;
>>>> +     for ( ; filter != NULL; filter = filter->parent) {
>>>> +             /* Allowed if return value is SECCOMP_BPF_ALLOW */
>>>> +             if (seccomp_run_filter(&metadata, sizeof(metadata.data),
>>>> +                                     filter->insns) != SECCOMP_BPF_ALLOW)
>>>> +                     ret = -EACCES;
>>>
>>> Why keep checking filters if one of them fails? Just return -EACCES?
>>
>> Why optimize for a failure mode?  Since fails are terminal, I don't
>> see any reason to rush :) That said, this was more relevant when I had
>> ptrace integration that depended on knowing which filter the failure
>> occurred in. I'll change it.
>
> I'd guess that would be easier if any failure would break the loop.
> But I don't think there's a good way to tell ptrace which filter
> it is because there is no way to identify them. And filters may be
> installed before the ptrace process, so numbering them doesn't always
> work either. And ptrace has all the same info as the filters, so it
> could figure it out for itself.

I'll clarify a bit.  My original ptrace integration worked such that a
tracer may only intercept syscall failures if it attached prior to the
failing filter being installed.  I did it this way to avoid using
ptrace to escape system call filtering.  However, since I don't have
that as part of the patch series, it doesn't make sense to keep it. (I
tracked a tracer pid_struct in the filters.)  If it needs to come back
with later patchsets, then it can be tackled then!

...
>>>> +     /*
>>>> +      * If there is an existing filter, make it the parent
>>>> +      * and reuse the existing task-based ref.
>>>
>>> You're not reusing the existing task-based ref, are you?
>>>
>>> You can't reuse it anyway. Perhaps and old leftover comment?
>>
>> I am reusing it.  When current->seccomp.filter is installed -- via
>> fork or attach -- a get_ is done.  That is for the task.  Here, we are
>> bumping current->seccomp.filter to filter->parent without calling
>> put_.  This reuses the task ref for the older filter.  The new filter
>> has a ref already from during alloc and that is transferred for use by
>> the task.
>
> I would expect "reusing" to mean that you start with the old reference count,
> or use an existing kref instead of a new one. Neither happens. What you're
> doing seems more like just using it.

Works for me either way.

...
>>> All in all it seems fairly easy to consolidate this with the networking code.
>>> The only big change would be that the load_pointer() function pointer needs to
>>> be passed, that way the same code can be used for both versions. The filter
>>> checking code needs be expanded to check for network-only ops and reject the
>>> filter if it's a seccomp one.
>>
>> Also byteswapping.  I think this can be done by adding pointer to the
>> accumulator to load_pointer, then let that do the byteswapping too.  I
>
> I would change load_pointer() to return the (byteswapped) value instead.
> It gets all the info it needs to know what to do via 'size'. Oh bugger,
> that messes up the error checking. Oh well.

Yeah - you end up needing to return something to indicate success or
failure along with populating the accumulator.  I've added a generic
call into the evaluator with a fallback to the network path - to keep
it on as fast a path as possible.


Thanks!
will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-30 22:26         ` Will Drewry
@ 2012-01-30 22:29           ` Andrew Lutomirski
  2012-01-30 22:42             ` Will Drewry
  2012-01-30 22:59             ` Indan Zupancic
  2012-01-31  1:42           ` Indan Zupancic
  1 sibling, 2 replies; 26+ messages in thread
From: Andrew Lutomirski @ 2012-01-30 22:29 UTC (permalink / raw)
  To: Will Drewry
  Cc: Indan Zupancic, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

[not quoting anything because this is a more general question]

How portable across architectures is this supposed to be?  At least
syscall numbering varies widely, and calling conventions can be
different (x86_64 has four of them, for example).  For all I know,
argument order might be different in some cases.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-30 22:29           ` Andrew Lutomirski
@ 2012-01-30 22:42             ` Will Drewry
  2012-01-30 22:59             ` Indan Zupancic
  1 sibling, 0 replies; 26+ messages in thread
From: Will Drewry @ 2012-01-30 22:42 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Indan Zupancic, linux-kernel, keescook, john.johansen,
	serge.hallyn, coreyb, pmoore, eparis, djm, torvalds, segoon,
	rostedt, jmorris, scarybeasts, avi, penberg, viro, mingo, akpm,
	khilman, borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

On Mon, Jan 30, 2012 at 4:29 PM, Andrew Lutomirski <luto@mit.edu> wrote:
> [not quoting anything because this is a more general question]
>
> How portable across architectures is this supposed to be?  At least
> syscall numbering varies widely, and calling conventions can be
> different (x86_64 has four of them, for example).  For all I know,
> argument order might be different in some cases.

>From my perspective, the raw filter ABI shouldn't strive to be
arch-independent.  However, it'd be really nice to minimize the user
pain, but I don't know that there is any guarantee that any given
syscall will exist on all arches nor that it will treat its arguments
the same way.

Since glibc already does mapping of syscall args and numbers, we know
that userspace can fill in any arch-specific gaps, but it is nice if
the filters "just work" for the majority of cases if __NR_* defines
are used.

cheers!
will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-30 22:29           ` Andrew Lutomirski
  2012-01-30 22:42             ` Will Drewry
@ 2012-01-30 22:59             ` Indan Zupancic
  1 sibling, 0 replies; 26+ messages in thread
From: Indan Zupancic @ 2012-01-30 22:59 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

On Mon, January 30, 2012 23:29, Andrew Lutomirski wrote:
> [not quoting anything because this is a more general question]
>
> How portable across architectures is this supposed to be?  At least
> syscall numbering varies widely, and calling conventions can be
> different (x86_64 has four of them, for example).  For all I know,
> argument order might be different in some cases.

Apart from the syscall numbering, it is surprisingly consistent.
For e.g. x86 about 190 out of 300/340 system calls are the same,
so about 2/3. The 1/3 that are different are mostly 32-bit only
variants of systemcalls, or obsolete ones. So for the syscall
arguments actually checked, the number is much higher.

Different argument ordering for the same syscall between archs
is very rare.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-30 22:26         ` Will Drewry
  2012-01-30 22:29           ` Andrew Lutomirski
@ 2012-01-31  1:42           ` Indan Zupancic
  2012-01-31 11:04             ` Will Drewry
  1 sibling, 1 reply; 26+ messages in thread
From: Indan Zupancic @ 2012-01-31  1:42 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

On Mon, January 30, 2012 23:26, Will Drewry wrote:
> Do you think something along the lines of 2 kB is sane for a config-less change?

Yes, especially if there is some way to get rid of it anyway,
like disabling SECCOMP or some option under CONFIG_EMBEDDED.
But it seems you need at least a hidden config option which
depends on the stuff you need.

>
> Doing exactly that.  I've been tinkering with the best way to minimize
> the impact to the existing BPF evaluator.  Right now, I'm adding a
> very small number of new instructions to the sk_buff specific code
> path, but I haven't yet benchmarked - just disasssembled.

I would do all the checking in sk_chk_filter(), so you know that when
you do run the filter, you can't hit the sk_buff paths. This doesn't
cause any slow down for the networking path.

> I agree. I will post the next series with a proposed integration. If
> there is a lot of resistance, then the difference will be going from a
> 2kB changes to a 3kB change.

Let's see how it goes.

>> I think you should go on a quest to make sure (almost) all archs have that file,
>> before this patch can be merged. At least the archs that have ptrace support.
>
> I'm an idiot.  CONFIG_HAVE_ARCH_TRACEHOOK covers asm/syscall.h
>
> So I have two choices:
> 1. allow seccomp with filtering on these platforms by fail if an
> argument is accessed
> 2. return ENOSYS when a filter is attempted to be installed on
> platforms with no tracehook support.

I vote for:

3. Add tracehook support to all archs.

Maybe not all archs, but at least some more. That way, every time someone
adds something tracehook specific, more archs support it.

syscall.h has no TRACEHOOK defines or anything though.

Only syscall_rollback() looks tricky. I have no clue what the difference
between syscall_get_error() and syscall_get_return_value() is. But you
only need to add syscall_get_nr() and syscall_[gs]et_arguments(), which
should be possible for all archs.

How many archs don't support tracehook?

>
> I think #2 is the nicest user contract, but #1 allows working filters
> even on less hospitable arches even if they can't user arguments
> (yet).  I'm coding it up as #2, but it is easy to switch to #1.

If you don't support the arch, don't compile any code at all, and
let prctl(2) return EINVAL. You don't want to return ENOSYS.

> I'm attempting to couple it now. If there are concerns, but not about
> code size, then the config option could just control whether it
> returns -ENOSYS or not.

If there are other concerns then it shouldn't be merged.

> After cleaning things up, I was going to change it to copy_seccomp()
> to match other copy_process-called functions.  I can keep it
> seccomp_fork() or whatever though :)

Nah.

> I'm just going to try to go all out with a network merge right now. We'll see.

Good.

>> Oh wait, I missed this before: Just use the existing networking filter.h
>> instead of redefining anything, and only add seccomp specific stuff to
>> your file, the seccomp_filter_data.
>
> Doing that and I'll rename seccomp_filter_data to just seccomp_data.

Good.

>> struct seccomp_filter_data {
>>        int syscall_nr;
>>        unsigned int arg_lo[6];
>>        unsigned int arg_hi[6];
>>        int __reserved[3];
>> };
>>
>> I'm not sure what the advantage of using __u32 would be, I think we can count on
>> unsigned int always being 32 bits?
>
> Sure - bpf uses u32 as its accumulator value, so I was sticking with
> the matching type. sock_filters also use u32 for k.
>
> Since they should be the same, it doesn't matter, but I liked keeping
> them the same.  I can change them if it is necessary.

Nah.

>> (Is my email client mangling tabs or is yours?)
>
> Probably mine - sorry :/

Mine didn't use to do it, so I was quite confused by the whitespace changes.

> Yeah and BPF going 64-bit would be a complete change of user space
> ABI, so if it went that route, it could also include an alternate
> struct definition.

True.

>> I'm just giving my opinion of course, I would like to know what other
>> people prefer as well.
>
> It's appreciated.  Hopefully once the next changes settle down, we'll
> see some more opinions.

Looking forward to it!

>> Yeah, I figured that out later on. It's quite nifty, but I find the recursion
>> within kref_put() slightly worrisome. Perhaps the code would be cleaner if this
>> was avoided and done differently, but I can't think of a good alternative. I'll
>> wait for the new version to see if I can find a way.
>
> Thanks - sure.  Since kref_put is just an atomic_dec_and_test followed
> by a call to the function pointer, I wasn't too concerned.  Without
> changing how the relationships are handled, I'm not sure how to
> approach it differently (and still avoid races). IIRC, this isn't much
> different to how namespaces work, they just use their own atomic
> counters.

Well, the thing is, this recursion is controlled by user space depending
on how many filters they have installed. What is preventing them to force
you out of stack?

So perhaps add at least some arbitrary filter limit to avoid this?

> I've done it that way now. I still need to test both the functionality
> and the performance.

Considering the alternative is ptrace, I wouldn't worry too much about
performance. But I think the networking people would really dislike a
slowdown, so focus on that first.

>> I would go for forcing alignment, unaligned accesses on arguments don't make
>> much sense. Only unaligned access that makes sense is reading the two middle
>> bytes of an int. So I would force all accesses to fall within 32 bits. This
>> avoids the unaligned loading problem and would make load_pointer() stateless.
>
> Works for me and how I've changed it now.  It's always easy to relax
> this restriction later if it is really needed.

Indeed.

> I'll clarify a bit.  My original ptrace integration worked such that a
> tracer may only intercept syscall failures if it attached prior to the
> failing filter being installed.  I did it this way to avoid using
> ptrace to escape system call filtering.  However, since I don't have
> that as part of the patch series, it doesn't make sense to keep it. (I
> tracked a tracer pid_struct in the filters.)  If it needs to come back
> with later patchsets, then it can be tackled then!

The problem of that is that filters can be shared between processes with
different ptracers. And you have all the hassle of keeping it up to date.

I think seccomp should always come first and just trust ptrace. This
because it can deny all ptrace() calls for filtered tasks, so the only
untrusted tasks doing ptrace() are outside of seccomp's filtering control.
And those could do the same system calls themselves.

The case where there is one task being filtered and allowed to do ptrace,
but not some other syscall, ptracing another filtered task which isn't
allowed to do ptrace, but allowed to do that other syscall, is quite far
fetched I think. If you really want to handle this, then you could run
the ptracer's filters against the tracee's post-ptrace syscall state.
This is best done in the ptracer's context, just before continuing the
system call. (You really want Oleg's SIKILL immediate patch then.)

What about:

1) Seccomp filters can deny a syscall by killing the task.

2) Seccomp can deny a syscall and let it return an error value.

   I know you're not fond of this one. It's just a performance
   optimisation as sometimes a lot of denied but harmless syscalls
   are called by glibc all the time, like getxattr(). Hardcoding
   the kill policy seems wrong when it can be avoided. If this is
   too hard then skip it, but if it's easy to add then please do.
   I'll take a look at this with your next patch version.

3) Seccomp can allow a syscall to proceed normally.

4) Seccomp can set a hint to skip ptrace syscall events for this syscall.
   A filter sets this by returning a specific value.

5) Ptrace always gets a syscall event when it asked for it.

6) Ptrace can set an option to honour seccomp's hint and to not get all
   syscall events.

This way all seccomp needs to do is to set some flags which ptrace can check.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-31  1:42           ` Indan Zupancic
@ 2012-01-31 11:04             ` Will Drewry
  2012-01-31 15:18               ` Will Drewry
  2012-02-01  1:36               ` Indan Zupancic
  0 siblings, 2 replies; 26+ messages in thread
From: Will Drewry @ 2012-01-31 11:04 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

On Mon, Jan 30, 2012 at 7:42 PM, Indan Zupancic <indan@nul.nu> wrote:
> On Mon, January 30, 2012 23:26, Will Drewry wrote:
>> Do you think something along the lines of 2 kB is sane for a config-less change?
>
> Yes, especially if there is some way to get rid of it anyway,
> like disabling SECCOMP or some option under CONFIG_EMBEDDED.
> But it seems you need at least a hidden config option which
> depends on the stuff you need.

Disabling SECCOMP would definitely do it.

>>
>> Doing exactly that.  I've been tinkering with the best way to minimize
>> the impact to the existing BPF evaluator.  Right now, I'm adding a
>> very small number of new instructions to the sk_buff specific code
>> path, but I haven't yet benchmarked - just disasssembled.
>
> I would do all the checking in sk_chk_filter(), so you know that when
> you do run the filter, you can't hit the sk_buff paths. This doesn't
> cause any slow down for the networking path.

Ah sorry - I was referring to the intrusion of a load_pointer function
pointer.  I want to leave the current networking path as untouched as
possible.  For checking, I agree -- a quick change to sk_chk_filter or
even just a small helper function that scans for any codes in the
ancillary range will do the trick.


>> I agree. I will post the next series with a proposed integration. If
>> there is a lot of resistance, then the difference will be going from a
>> 2kB changes to a 3kB change.
>
> Let's see how it goes.
>
>>> I think you should go on a quest to make sure (almost) all archs have that file,
>>> before this patch can be merged. At least the archs that have ptrace support.
>>
>> I'm an idiot.  CONFIG_HAVE_ARCH_TRACEHOOK covers asm/syscall.h
>>
>> So I have two choices:
>> 1. allow seccomp with filtering on these platforms by fail if an
>> argument is accessed
>> 2. return ENOSYS when a filter is attempted to be installed on
>> platforms with no tracehook support.
>
> I vote for:
>
> 3. Add tracehook support to all archs.

I don't see these #3 as mutually exclusive :)  tracehook requires:
- task_pt_regs()          in asm/processor.h or asm/ptrace.h
- arch_has_single_step()  if there is hardware single-step support
- arch_has_block_step()   if there is hardware block-step support
- asm/syscall.h           supplying asm-generic/syscall.h interface
- linux/regset.h          user_regset interfaces
- CORE_DUMP_USE_REGSET    #define'd in linux/elf.h
-TIF_SYSCALL_TRACE       calls tracehook_report_syscall_{entry,exit}
- TIF_NOTIFY_RESUME       calls tracehook_notify_resume()
- signal delivery         calls tracehook_signal_handler()

> Maybe not all archs, but at least some more. That way, every time someone
> adds something tracehook specific, more archs support it.

Well the other arch I want this on specifically for my purposes is
arm, but someone recently posted a partial asm/syscall.h for arm, but
I didn't see that one go anywhere just yet.  (I know syscall_get_nr
can be tricky on arm because it doesn't (didn't) have a way of
tracking in-syscall state.)

ref: https://lkml.org/lkml/2011/12/1/131

> syscall.h has no TRACEHOOK defines or anything though.

Nope - it is just part of what is expected.

> Only syscall_rollback() looks tricky. I have no clue what the difference
> between syscall_get_error() and syscall_get_return_value() is. But you
> only need to add syscall_get_nr() and syscall_[gs]et_arguments(), which
> should be possible for all archs.

It seems even syscall_get_nr can have some wrinkles :)

ref: http://lkml.indiana.edu/hypermail/linux/kernel/0906.3/00096.html

> How many archs don't support tracehook?

14 out of 26.  However, 5 of those still have asm/syscall.h

>> I think #2 is the nicest user contract, but #1 allows working filters
>> even on less hospitable arches even if they can't user arguments
>> (yet).  I'm coding it up as #2, but it is easy to switch to #1.
>
> If you don't support the arch, don't compile any code at all, and
> let prctl(2) return EINVAL. You don't want to return ENOSYS.

I was thinking of the inline prctl handler, but EINVAL makes sense.

>>> Yeah, I figured that out later on. It's quite nifty, but I find the recursion
>>> within kref_put() slightly worrisome. Perhaps the code would be cleaner if this
>>> was avoided and done differently, but I can't think of a good alternative. I'll
>>> wait for the new version to see if I can find a way.
>>
>> Thanks - sure.  Since kref_put is just an atomic_dec_and_test followed
>> by a call to the function pointer, I wasn't too concerned.  Without
>> changing how the relationships are handled, I'm not sure how to
>> approach it differently (and still avoid races). IIRC, this isn't much
>> different to how namespaces work, they just use their own atomic
>> counters.
>
> Well, the thing is, this recursion is controlled by user space depending
> on how many filters they have installed. What is preventing them to force
> you out of stack?

Hrm true.  The easiest option is to just convert it to iterative by
not using kref_t, but I'll look more closely.

> So perhaps add at least some arbitrary filter limit to avoid this?

Definitely possible -- probably as a sysctl.  I'm not quite sure what
number makes sense yet, but I'll look at breaking the recursion first.
 Thanks!

>> I'll clarify a bit.  My original ptrace integration worked such that a
>> tracer may only intercept syscall failures if it attached prior to the
>> failing filter being installed.  I did it this way to avoid using
>> ptrace to escape system call filtering.  However, since I don't have
>> that as part of the patch series, it doesn't make sense to keep it. (I
>> tracked a tracer pid_struct in the filters.)  If it needs to come back
>> with later patchsets, then it can be tackled then!
>
> The problem of that is that filters can be shared between processes with
> different ptracers. And you have all the hassle of keeping it up to date.
>
> I think seccomp should always come first and just trust ptrace. This
> because it can deny all ptrace() calls for filtered tasks, so the only
> untrusted tasks doing ptrace() are outside of seccomp's filtering control.
> And those could do the same system calls themselves.
>
> The case where there is one task being filtered and allowed to do ptrace,
> but not some other syscall, ptracing another filtered task which isn't
> allowed to do ptrace, but allowed to do that other syscall, is quite far
> fetched I think. If you really want to handle this, then you could run
> the ptracer's filters against the tracee's post-ptrace syscall state.
> This is best done in the ptracer's context, just before continuing the
> system call. (You really want Oleg's SIKILL immediate patch then.)
>
> What about:
>
> 1) Seccomp filters can deny a syscall by killing the task.
>
> 2) Seccomp can deny a syscall and let it return an error value.
>
>   I know you're not fond of this one. It's just a performance
>   optimisation as sometimes a lot of denied but harmless syscalls
>   are called by glibc all the time, like getxattr(). Hardcoding
>   the kill policy seems wrong when it can be avoided. If this is
>   too hard then skip it, but if it's easy to add then please do.
>   I'll take a look at this with your next patch version.

It's easy on x86 harder on other arches.  I would suggest saving
changing the __secure_computing signature until after the core
functionality lands, but that's just me.

> 3) Seccomp can allow a syscall to proceed normally.
>
> 4) Seccomp can set a hint to skip ptrace syscall events for this syscall.
>   A filter sets this by returning a specific value.
>
> 5) Ptrace always gets a syscall event when it asked for it.
>
> 6) Ptrace can set an option to honour seccomp's hint and to not get all
>   syscall events.
>
> This way all seccomp needs to do is to set some flags which ptrace can check.

I like the use of flags/options to trigger ptrace handling.  If I were
to stack rank these for pursuit after the core functionality lands,
it'd be to add #6 (and its deps) then #2.  With #6, #2 can be
simulated (by having a supervisor that changes the syscall number to
-1), but that is much less ideal than just returning SECCOMP_ERROR
instead of SECCOMP_ALLOW/DENY and letting an error code get bubbled
up.

thanks!
will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-31 11:04             ` Will Drewry
@ 2012-01-31 15:18               ` Will Drewry
  2012-02-01  1:36               ` Indan Zupancic
  1 sibling, 0 replies; 26+ messages in thread
From: Will Drewry @ 2012-01-31 15:18 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

On Tue, Jan 31, 2012 at 4:04 AM, Will Drewry <wad@chromium.org> wrote:
> On Mon, Jan 30, 2012 at 7:42 PM, Indan Zupancic <indan@nul.nu> wrote:
>> On Mon, January 30, 2012 23:26, Will Drewry wrote:
>>> Do you think something along the lines of 2 kB is sane for a config-less change?
>>
>> Yes, especially if there is some way to get rid of it anyway,
>> like disabling SECCOMP or some option under CONFIG_EMBEDDED.
>> But it seems you need at least a hidden config option which
>> depends on the stuff you need.
>
> Disabling SECCOMP would definitely do it.
>
>>>
>>> Doing exactly that.  I've been tinkering with the best way to minimize
>>> the impact to the existing BPF evaluator.  Right now, I'm adding a
>>> very small number of new instructions to the sk_buff specific code
>>> path, but I haven't yet benchmarked - just disasssembled.
>>
>> I would do all the checking in sk_chk_filter(), so you know that when
>> you do run the filter, you can't hit the sk_buff paths. This doesn't
>> cause any slow down for the networking path.
>
> Ah sorry - I was referring to the intrusion of a load_pointer function
> pointer.  I want to leave the current networking path as untouched as
> possible.  For checking, I agree -- a quick change to sk_chk_filter or
> even just a small helper function that scans for any codes in the
> ancillary range will do the trick.

I see now. I can do all the fixup in sk_chk_filter. Clever!  Sorry for
not catching on faster :)

>
>>> I agree. I will post the next series with a proposed integration. If
>>> there is a lot of resistance, then the difference will be going from a
>>> 2kB changes to a 3kB change.
>>
>> Let's see how it goes.
>>
>>>> I think you should go on a quest to make sure (almost) all archs have that file,
>>>> before this patch can be merged. At least the archs that have ptrace support.
>>>
>>> I'm an idiot.  CONFIG_HAVE_ARCH_TRACEHOOK covers asm/syscall.h
>>>
>>> So I have two choices:
>>> 1. allow seccomp with filtering on these platforms by fail if an
>>> argument is accessed
>>> 2. return ENOSYS when a filter is attempted to be installed on
>>> platforms with no tracehook support.
>>
>> I vote for:
>>
>> 3. Add tracehook support to all archs.
>
> I don't see these #3 as mutually exclusive :)  tracehook requires:
> - task_pt_regs()          in asm/processor.h or asm/ptrace.h
> - arch_has_single_step()  if there is hardware single-step support
> - arch_has_block_step()   if there is hardware block-step support
> - asm/syscall.h           supplying asm-generic/syscall.h interface
> - linux/regset.h          user_regset interfaces
> - CORE_DUMP_USE_REGSET    #define'd in linux/elf.h
> -TIF_SYSCALL_TRACE       calls tracehook_report_syscall_{entry,exit}
> - TIF_NOTIFY_RESUME       calls tracehook_notify_resume()
> - signal delivery         calls tracehook_signal_handler()
>
>> Maybe not all archs, but at least some more. That way, every time someone
>> adds something tracehook specific, more archs support it.
>
> Well the other arch I want this on specifically for my purposes is
> arm, but someone recently posted a partial asm/syscall.h for arm, but
> I didn't see that one go anywhere just yet.  (I know syscall_get_nr
> can be tricky on arm because it doesn't (didn't) have a way of
> tracking in-syscall state.)
>
> ref: https://lkml.org/lkml/2011/12/1/131
>
>> syscall.h has no TRACEHOOK defines or anything though.
>
> Nope - it is just part of what is expected.
>
>> Only syscall_rollback() looks tricky. I have no clue what the difference
>> between syscall_get_error() and syscall_get_return_value() is. But you
>> only need to add syscall_get_nr() and syscall_[gs]et_arguments(), which
>> should be possible for all archs.
>
> It seems even syscall_get_nr can have some wrinkles :)
>
> ref: http://lkml.indiana.edu/hypermail/linux/kernel/0906.3/00096.html
>
>> How many archs don't support tracehook?
>
> 14 out of 26.  However, 5 of those still have asm/syscall.h
>
>>> I think #2 is the nicest user contract, but #1 allows working filters
>>> even on less hospitable arches even if they can't user arguments
>>> (yet).  I'm coding it up as #2, but it is easy to switch to #1.
>>
>> If you don't support the arch, don't compile any code at all, and
>> let prctl(2) return EINVAL. You don't want to return ENOSYS.
>
> I was thinking of the inline prctl handler, but EINVAL makes sense.
>
>>>> Yeah, I figured that out later on. It's quite nifty, but I find the recursion
>>>> within kref_put() slightly worrisome. Perhaps the code would be cleaner if this
>>>> was avoided and done differently, but I can't think of a good alternative. I'll
>>>> wait for the new version to see if I can find a way.
>>>
>>> Thanks - sure.  Since kref_put is just an atomic_dec_and_test followed
>>> by a call to the function pointer, I wasn't too concerned.  Without
>>> changing how the relationships are handled, I'm not sure how to
>>> approach it differently (and still avoid races). IIRC, this isn't much
>>> different to how namespaces work, they just use their own atomic
>>> counters.
>>
>> Well, the thing is, this recursion is controlled by user space depending
>> on how many filters they have installed. What is preventing them to force
>> you out of stack?
>
> Hrm true.  The easiest option is to just convert it to iterative by
> not using kref_t, but I'll look more closely.
>
>> So perhaps add at least some arbitrary filter limit to avoid this?
>
> Definitely possible -- probably as a sysctl.  I'm not quite sure what
> number makes sense yet, but I'll look at breaking the recursion first.
>  Thanks!
>
>>> I'll clarify a bit.  My original ptrace integration worked such that a
>>> tracer may only intercept syscall failures if it attached prior to the
>>> failing filter being installed.  I did it this way to avoid using
>>> ptrace to escape system call filtering.  However, since I don't have
>>> that as part of the patch series, it doesn't make sense to keep it. (I
>>> tracked a tracer pid_struct in the filters.)  If it needs to come back
>>> with later patchsets, then it can be tackled then!
>>
>> The problem of that is that filters can be shared between processes with
>> different ptracers. And you have all the hassle of keeping it up to date.
>>
>> I think seccomp should always come first and just trust ptrace. This
>> because it can deny all ptrace() calls for filtered tasks, so the only
>> untrusted tasks doing ptrace() are outside of seccomp's filtering control.
>> And those could do the same system calls themselves.
>>
>> The case where there is one task being filtered and allowed to do ptrace,
>> but not some other syscall, ptracing another filtered task which isn't
>> allowed to do ptrace, but allowed to do that other syscall, is quite far
>> fetched I think. If you really want to handle this, then you could run
>> the ptracer's filters against the tracee's post-ptrace syscall state.
>> This is best done in the ptracer's context, just before continuing the
>> system call. (You really want Oleg's SIKILL immediate patch then.)
>>
>> What about:
>>
>> 1) Seccomp filters can deny a syscall by killing the task.
>>
>> 2) Seccomp can deny a syscall and let it return an error value.
>>
>>   I know you're not fond of this one. It's just a performance
>>   optimisation as sometimes a lot of denied but harmless syscalls
>>   are called by glibc all the time, like getxattr(). Hardcoding
>>   the kill policy seems wrong when it can be avoided. If this is
>>   too hard then skip it, but if it's easy to add then please do.
>>   I'll take a look at this with your next patch version.
>
> It's easy on x86 harder on other arches.  I would suggest saving
> changing the __secure_computing signature until after the core
> functionality lands, but that's just me.
>
>> 3) Seccomp can allow a syscall to proceed normally.
>>
>> 4) Seccomp can set a hint to skip ptrace syscall events for this syscall.
>>   A filter sets this by returning a specific value.
>>
>> 5) Ptrace always gets a syscall event when it asked for it.
>>
>> 6) Ptrace can set an option to honour seccomp's hint and to not get all
>>   syscall events.
>>
>> This way all seccomp needs to do is to set some flags which ptrace can check.
>
> I like the use of flags/options to trigger ptrace handling.  If I were
> to stack rank these for pursuit after the core functionality lands,
> it'd be to add #6 (and its deps) then #2.  With #6, #2 can be
> simulated (by having a supervisor that changes the syscall number to
> -1), but that is much less ideal than just returning SECCOMP_ERROR
> instead of SECCOMP_ALLOW/DENY and letting an error code get bubbled
> up.
>
> thanks!
> will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-01-31 11:04             ` Will Drewry
  2012-01-31 15:18               ` Will Drewry
@ 2012-02-01  1:36               ` Indan Zupancic
  2012-02-01  9:02                 ` Will Drewry
  2012-02-01 17:34                 ` Roland McGrath
  1 sibling, 2 replies; 26+ messages in thread
From: Indan Zupancic @ 2012-02-01  1:36 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

On Tue, January 31, 2012 12:04, Will Drewry wrote:
> On Mon, Jan 30, 2012 at 7:42 PM, Indan Zupancic <indan@nul.nu> wrote:
>>> I vote for:
>>>
>>> 3. Add tracehook support to all archs.
>
> I don't see these #3 as mutually exclusive :)

They are if you really add tracehook support to all archs. ;-)

> tracehook requires:
> - task_pt_regs()          in asm/processor.h or asm/ptrace.h
> - arch_has_single_step()  if there is hardware single-step support
> - arch_has_block_step()   if there is hardware block-step support
> - asm/syscall.h           supplying asm-generic/syscall.h interface
> - linux/regset.h          user_regset interfaces
> - CORE_DUMP_USE_REGSET    #define'd in linux/elf.h
> -TIF_SYSCALL_TRACE       calls tracehook_report_syscall_{entry,exit}
> - TIF_NOTIFY_RESUME       calls tracehook_notify_resume()
> - signal delivery         calls tracehook_signal_handler()

Okay, that's a bit fuzzier than I expected. I suppose the archs implement
some of that in another way currently?

>>> Maybe not all archs, but at least some more. That way, every time someone
>>> adds something tracehook specific, more archs support it.
>
> Well the other arch I want this on specifically for my purposes is
> arm, but someone recently posted a partial asm/syscall.h for arm, but
> I didn't see that one go anywhere just yet.  (I know syscall_get_nr
> can be tricky on arm because it doesn't (didn't) have a way of
> tracking in-syscall state.)
>
> ref: https://lkml.org/lkml/2011/12/1/131

That totally ignores OABI, it should depend on CONFIG_AEABI and
on !CONFIG_OABI_COMPAT.

>>> syscall.h has no TRACEHOOK defines or anything though.
>
> Nope - it is just part of what is expected.
>
>>> Only syscall_rollback() looks tricky. I have no clue what the difference
>>> between syscall_get_error() and syscall_get_return_value() is. But you
>>> only need to add syscall_get_nr() and syscall_[gs]et_arguments(), which
>>> should be possible for all archs.
>
> It seems even syscall_get_nr can have some wrinkles :)
>
> ref: http://lkml.indiana.edu/hypermail/linux/kernel/0906.3/00096.html

That's 2009! I wonder why no progress happened since then.

At least for SECCOMP you get the syscall nr directly as a parameter,
so you at least don't actually need syscall_get_nr(). Same for ptrace.
That doesn't help for /proc/$PID/syscall or coredumps though.

One solution I can think of for the OABI compat problem is to let the
OABI entry path store the syscall nr in thread_info->syscall and set
an OABI task flag somewhere. Then syscall_get_nr() can check that to
decide between r7 or the stored value. This would help /proc and doesn't
cause any overhead for non-compat syscalls.

Another ugly way of doing it would be to store the trap instruction for
OABI calls, and when user space does a PTRACE_PEEK_TEXT on that it returns
the saved instruction instead of rereading the actual memory. This avoids
races and lets current user space work without modifications. Problem is
that user space can't easily check if it's running on a fixed kernel.

All in all I propose to only support this stuff for kernels not having OABI
support, as that keeps things nice and simple, which was probably the point
of moving to a new ARM ABI anyway. Tough, but that's what you get when you
change ABIs.

>>> How many archs don't support tracehook?
>
> 14 out of 26.  However, 5 of those still have asm/syscall.h

That's a lot of tricky work.

>>> Well, the thing is, this recursion is controlled by user space depending
>>> on how many filters they have installed. What is preventing them to force
>>> you out of stack?
>
> Hrm true.  The easiest option is to just convert it to iterative by
> not using kref_t, but I'll look more closely.

Probably.

>>> So perhaps add at least some arbitrary filter limit to avoid this?
>
> Definitely possible -- probably as a sysctl.  I'm not quite sure what
> number makes sense yet, but I'll look at breaking the recursion first.
>  Thanks!

Please no sysctl. The point of arbitrary limits is that they are
arbitrarily high, but low enough to avoid any problems. A limit
of 1K filters should be plenty for user space, but still put a
limit on the total (memory) overhead of filters.

>>>> I'll clarify a bit.  My original ptrace integration worked such that a
>>>> tracer may only intercept syscall failures if it attached prior to the
>>>> failing filter being installed.  I did it this way to avoid using
>>>> ptrace to escape system call filtering.  However, since I don't have
>>>> that as part of the patch series, it doesn't make sense to keep it. (I
>>>> tracked a tracer pid_struct in the filters.)  If it needs to come back
>>>> with later patchsets, then it can be tackled then!
>>>
>>> The problem of that is that filters can be shared between processes with
>>> different ptracers. And you have all the hassle of keeping it up to date.
>>>
>>> I think seccomp should always come first and just trust ptrace. This
>>> because it can deny all ptrace() calls for filtered tasks, so the only
>>> untrusted tasks doing ptrace() are outside of seccomp's filtering control.
>>> And those could do the same system calls themselves.
>>>
>>> The case where there is one task being filtered and allowed to do ptrace,
>>> but not some other syscall, ptracing another filtered task which isn't
>>> allowed to do ptrace, but allowed to do that other syscall, is quite far
>>> fetched I think. If you really want to handle this, then you could run
>>> the ptracer's filters against the tracee's post-ptrace syscall state.
>>> This is best done in the ptracer's context, just before continuing the
>>> system call. (You really want Oleg's SIKILL immediate patch then.)
>>>
>>> What about:
>>>
>>> 1) Seccomp filters can deny a syscall by killing the task.
>>>
>>> 2) Seccomp can deny a syscall and let it return an error value.
>>>
>>>   I know you're not fond of this one. It's just a performance
>>>   optimisation as sometimes a lot of denied but harmless syscalls
>>>   are called by glibc all the time, like getxattr(). Hardcoding
>>>   the kill policy seems wrong when it can be avoided. If this is
>>>   too hard then skip it, but if it's easy to add then please do.
>>>   I'll take a look at this with your next patch version.
>
> It's easy on x86 harder on other arches.  I would suggest saving
> changing the __secure_computing signature until after the core
> functionality lands, but that's just me.

The only problem of that is that you will have two versions of seccomp
filtering in the wild: One that does support this, and one that doesn't.

It seems cleaner to consolidate the seccomp entry path with the ptrace path,
and do the seccomp check first in there. This saves a few instructions on
some archs for the seccomp check too and would simplify the entry code of
most archs.

Ptrace is expected to change the syscall nr already, so that can be used
to set it to -1 or something. A different return value than -ENOSYS can
be set in the syscall exit path, if required. All this can be shared with
PTRACE_SYSEMU.

I don't know.

>>> 3) Seccomp can allow a syscall to proceed normally.
>>>
>>> 4) Seccomp can set a hint to skip ptrace syscall events for this syscall.
>>>   A filter sets this by returning a specific value.
>>>
>>> 5) Ptrace always gets a syscall event when it asked for it.
>>>
>>> 6) Ptrace can set an option to honour seccomp's hint and to not get all
>>>   syscall events.
>>>
>>> This way all seccomp needs to do is to set some flags which ptrace can check.
>
> I like the use of flags/options to trigger ptrace handling.  If I were
> to stack rank these for pursuit after the core functionality lands,
> it'd be to add #6 (and its deps) then #2.  With #6, #2 can be
> simulated (by having a supervisor that changes the syscall number to
> -1), but that is much less ideal than just returning SECCOMP_ERROR
> instead of SECCOMP_ALLOW/DENY and letting an error code get bubbled
> up.

But it would require two filter versions and hence make it more difficult
to generally support BPF syscall filtering.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-02-01  1:36               ` Indan Zupancic
@ 2012-02-01  9:02                 ` Will Drewry
  2012-02-01 10:56                   ` Indan Zupancic
  2012-02-01 17:34                 ` Roland McGrath
  1 sibling, 1 reply; 26+ messages in thread
From: Will Drewry @ 2012-02-01  9:02 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

On Tue, Jan 31, 2012 at 5:36 PM, Indan Zupancic <indan@nul.nu> wrote:
> On Tue, January 31, 2012 12:04, Will Drewry wrote:
>> On Mon, Jan 30, 2012 at 7:42 PM, Indan Zupancic <indan@nul.nu> wrote:
>>>> I vote for:
>>>>
>>>> 3. Add tracehook support to all archs.
>>
>> I don't see these #3 as mutually exclusive :)
>
> They are if you really add tracehook support to all archs. ;-)
>
>> tracehook requires:
>> - task_pt_regs()          in asm/processor.h or asm/ptrace.h
>> - arch_has_single_step()  if there is hardware single-step support
>> - arch_has_block_step()   if there is hardware block-step support
>> - asm/syscall.h           supplying asm-generic/syscall.h interface
>> - linux/regset.h          user_regset interfaces
>> - CORE_DUMP_USE_REGSET    #define'd in linux/elf.h
>> -TIF_SYSCALL_TRACE       calls tracehook_report_syscall_{entry,exit}
>> - TIF_NOTIFY_RESUME       calls tracehook_notify_resume()
>> - signal delivery         calls tracehook_signal_handler()
>
> Okay, that's a bit fuzzier than I expected. I suppose the archs implement
> some of that in another way currently?
>
>>>> Maybe not all archs, but at least some more. That way, every time someone
>>>> adds something tracehook specific, more archs support it.
>>
>> Well the other arch I want this on specifically for my purposes is
>> arm, but someone recently posted a partial asm/syscall.h for arm, but
>> I didn't see that one go anywhere just yet.  (I know syscall_get_nr
>> can be tricky on arm because it doesn't (didn't) have a way of
>> tracking in-syscall state.)
>>
>> ref: https://lkml.org/lkml/2011/12/1/131
>
> That totally ignores OABI, it should depend on CONFIG_AEABI and
> on !CONFIG_OABI_COMPAT.
>
>>>> syscall.h has no TRACEHOOK defines or anything though.
>>
>> Nope - it is just part of what is expected.
>>
>>>> Only syscall_rollback() looks tricky. I have no clue what the difference
>>>> between syscall_get_error() and syscall_get_return_value() is. But you
>>>> only need to add syscall_get_nr() and syscall_[gs]et_arguments(), which
>>>> should be possible for all archs.
>>
>> It seems even syscall_get_nr can have some wrinkles :)
>>
>> ref: http://lkml.indiana.edu/hypermail/linux/kernel/0906.3/00096.html
>
> That's 2009! I wonder why no progress happened since then.
>
> At least for SECCOMP you get the syscall nr directly as a parameter,
> so you at least don't actually need syscall_get_nr(). Same for ptrace.
> That doesn't help for /proc/$PID/syscall or coredumps though.
>
> One solution I can think of for the OABI compat problem is to let the
> OABI entry path store the syscall nr in thread_info->syscall and set
> an OABI task flag somewhere. Then syscall_get_nr() can check that to
> decide between r7 or the stored value. This would help /proc and doesn't
> cause any overhead for non-compat syscalls.
>
> Another ugly way of doing it would be to store the trap instruction for
> OABI calls, and when user space does a PTRACE_PEEK_TEXT on that it returns
> the saved instruction instead of rereading the actual memory. This avoids
> races and lets current user space work without modifications. Problem is
> that user space can't easily check if it's running on a fixed kernel.
>
> All in all I propose to only support this stuff for kernels not having OABI
> support, as that keeps things nice and simple, which was probably the point
> of moving to a new ARM ABI anyway. Tough, but that's what you get when you
> change ABIs.
>
>>>> How many archs don't support tracehook?
>>
>> 14 out of 26.  However, 5 of those still have asm/syscall.h
>
> That's a lot of tricky work.
>
>>>> Well, the thing is, this recursion is controlled by user space depending
>>>> on how many filters they have installed. What is preventing them to force
>>>> you out of stack?
>>
>> Hrm true.  The easiest option is to just convert it to iterative by
>> not using kref_t, but I'll look more closely.
>
> Probably.
>
>>>> So perhaps add at least some arbitrary filter limit to avoid this?
>>
>> Definitely possible -- probably as a sysctl.  I'm not quite sure what
>> number makes sense yet, but I'll look at breaking the recursion first.
>>  Thanks!
>
> Please no sysctl. The point of arbitrary limits is that they are
> arbitrarily high, but low enough to avoid any problems. A limit
> of 1K filters should be plenty for user space, but still put a
> limit on the total (memory) overhead of filters.

I'd love to avoid a limit at all, but picking something in a sane
range makes perfect sense.

>>>>> I'll clarify a bit.  My original ptrace integration worked such that a
>>>>> tracer may only intercept syscall failures if it attached prior to the
>>>>> failing filter being installed.  I did it this way to avoid using
>>>>> ptrace to escape system call filtering.  However, since I don't have
>>>>> that as part of the patch series, it doesn't make sense to keep it. (I
>>>>> tracked a tracer pid_struct in the filters.)  If it needs to come back
>>>>> with later patchsets, then it can be tackled then!
>>>>
>>>> The problem of that is that filters can be shared between processes with
>>>> different ptracers. And you have all the hassle of keeping it up to date.
>>>>
>>>> I think seccomp should always come first and just trust ptrace. This
>>>> because it can deny all ptrace() calls for filtered tasks, so the only
>>>> untrusted tasks doing ptrace() are outside of seccomp's filtering control.
>>>> And those could do the same system calls themselves.
>>>>
>>>> The case where there is one task being filtered and allowed to do ptrace,
>>>> but not some other syscall, ptracing another filtered task which isn't
>>>> allowed to do ptrace, but allowed to do that other syscall, is quite far
>>>> fetched I think. If you really want to handle this, then you could run
>>>> the ptracer's filters against the tracee's post-ptrace syscall state.
>>>> This is best done in the ptracer's context, just before continuing the
>>>> system call. (You really want Oleg's SIKILL immediate patch then.)
>>>>
>>>> What about:
>>>>
>>>> 1) Seccomp filters can deny a syscall by killing the task.
>>>>
>>>> 2) Seccomp can deny a syscall and let it return an error value.
>>>>
>>>>   I know you're not fond of this one. It's just a performance
>>>>   optimisation as sometimes a lot of denied but harmless syscalls
>>>>   are called by glibc all the time, like getxattr(). Hardcoding
>>>>   the kill policy seems wrong when it can be avoided. If this is
>>>>   too hard then skip it, but if it's easy to add then please do.
>>>>   I'll take a look at this with your next patch version.
>>
>> It's easy on x86 harder on other arches.  I would suggest saving
>> changing the __secure_computing signature until after the core
>> functionality lands, but that's just me.
>
> The only problem of that is that you will have two versions of seccomp
> filtering in the wild: One that does support this, and one that doesn't.

Hrm.  It'd be the addition of a return value which is essentially an
expansion of the possible filters.  It wouldn't make the existing
filters any less valid -- it's just that they wouldn't return errors.

However, I realize there's a desire to have all the pieces in place
upfront. I'll see how much work this really pans out to being. On x86,
it's easy.  Some other arches, a little less so, but probably not too
bad.

> It seems cleaner to consolidate the seccomp entry path with the ptrace path,
> and do the seccomp check first in there. This saves a few instructions on
> some archs for the seccomp check too and would simplify the entry code of
> most archs.
> Ptrace is expected to change the syscall nr already, so that can be used
> to set it to -1 or something. A different return value than -ENOSYS can
> be set in the syscall exit path, if required. All this can be shared with
> PTRACE_SYSEMU.

That's roughly how it is on x86 now except seccomp is after all the
slow-path copy stuff.  It'd be cool to bump it up in front of that
work then pass through its return value.  I'll poke around at this and
look at the use of __secure_computing on the other arches to make sure
I understand the impact.  Maybe it is easier than I first thought it'd
be.

>>>> 3) Seccomp can allow a syscall to proceed normally.
>>>>
>>>> 4) Seccomp can set a hint to skip ptrace syscall events for this syscall.
>>>>   A filter sets this by returning a specific value.
>>>>
>>>> 5) Ptrace always gets a syscall event when it asked for it.
>>>>
>>>> 6) Ptrace can set an option to honour seccomp's hint and to not get all
>>>>   syscall events.
>>>>
>>>> This way all seccomp needs to do is to set some flags which ptrace can check.
>>
>> I like the use of flags/options to trigger ptrace handling.  If I were
>> to stack rank these for pursuit after the core functionality lands,
>> it'd be to add #6 (and its deps) then #2.  With #6, #2 can be
>> simulated (by having a supervisor that changes the syscall number to
>> -1), but that is much less ideal than just returning SECCOMP_ERROR
>> instead of SECCOMP_ALLOW/DENY and letting an error code get bubbled
>> up.
>
> But it would require two filter versions and hence make it more difficult
> to generally support BPF syscall filtering.

True.  I'm not quite sure that it makes sense to have the BPF program
decide what the tracer sees or doesn't see.  I can see it as a nice
optimization for a sandbox implementation, but I could see something
similar being done purely by letting a tracer catch disallowed
syscalls (possibly via it registering an option indicating it wants
seccomp_events). It wouldn't be quite as flexible, but it would avoid
the simple filter becoming a more complex piece of logic and if a
returned error were allowed (that didn't trigger ptrace), you'd not
see a ridiculous amount of overhead from pointless syscalls.  I'm not
sure though.

I'll poke at the ptrace bits too and see if one approach seems to fit
better than another.

thanks!
will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-02-01  9:02                 ` Will Drewry
@ 2012-02-01 10:56                   ` Indan Zupancic
  0 siblings, 0 replies; 26+ messages in thread
From: Indan Zupancic @ 2012-02-01 10:56 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, mcgrathr

On Wed, February 1, 2012 10:02, Will Drewry wrote:
> Hrm.  It'd be the addition of a return value which is essentially an
> expansion of the possible filters.  It wouldn't make the existing
> filters any less valid -- it's just that they wouldn't return errors.

True, but there will be a gap of who knows how many years where people
try to run new filters on older (stable) kernels without return support.
It's just an extra cost on top of supporting BPF at all, and creating
filters isn't easy enough that creating another version if the return
one doesn't work is exactly fun.

> However, I realize there's a desire to have all the pieces in place
> upfront. I'll see how much work this really pans out to being. On x86,
> it's easy.  Some other arches, a little less so, but probably not too
> bad.

Please send the updated basic and consolidated version first, I'd like to
see how it ended up looking like.

> That's roughly how it is on x86 now except seccomp is after all the
> slow-path copy stuff.  It'd be cool to bump it up in front of that
> work then pass through its return value.  I'll poke around at this and
> look at the use of __secure_computing on the other arches to make sure
> I understand the impact.  Maybe it is easier than I first thought it'd
> be.

Normal seccomp can be in the fast path, but I thought you need to go
through the slow path to have the registers saved at all sometimes.
I would prefer not slowing down the fast path with seccomp stuff.

> True.  I'm not quite sure that it makes sense to have the BPF program
> decide what the tracer sees or doesn't see.

Only if ptrace agrees, with setting a special option. (And if multiple
filters are installed, they all have to set the ignore flag before ptrace
doesn't see it.)

> I can see it as a nice
> optimization for a sandbox implementation, but I could see something
> similar being done purely by letting a tracer catch disallowed
> syscalls (possibly via it registering an option indicating it wants
> seccomp_events).

If filters can't return errors, tracers never get to see disallowed events.
So all they can do is allow it and let the pracer block it. If you add a
ptrace option to overrule seccomp decisions, then ptrace and seccomp only
get more tangled without much gain.

> It wouldn't be quite as flexible, but it would avoid
> the simple filter becoming a more complex piece of logic and if a
> returned error were allowed (that didn't trigger ptrace), you'd not
> see a ridiculous amount of overhead from pointless syscalls.  I'm not
> sure though.

It's not just that, it's also that not all filters want to kill the task
when it tries to do make a forbidden system call. BPF is more flexible
than the very limited and controlled original seccomp.

> I'll poke at the ptrace bits too and see if one approach seems to fit
> better than another.

Let's do it step by step. That the final patch may have all features
doesn't mean you have to write it all at once. So first get the next
version out, then focus on supporting more archs and see how hard it
would be to have an error return.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF
  2012-02-01  1:36               ` Indan Zupancic
  2012-02-01  9:02                 ` Will Drewry
@ 2012-02-01 17:34                 ` Roland McGrath
  1 sibling, 0 replies; 26+ messages in thread
From: Roland McGrath @ 2012-02-01 17:34 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Will Drewry, linux-kernel, keescook, john.johansen, serge.hallyn,
	coreyb, pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan

On Tue, Jan 31, 2012 at 5:36 PM, Indan Zupancic <indan@nul.nu> wrote:
>> ref: http://lkml.indiana.edu/hypermail/linux/kernel/0906.3/00096.html
>
> That's 2009! I wonder why no progress happened since then.

At that time I did most of all the CONFIG_HAVE_ARCH_TRACEHOOK work for ARM.
The ARM kernel maintainers just weren't interested in paying attention, and
I gave up on hounding them.  Since then, only a subset of what I did before
has actually been done and merged.  (e.g. there is now some user_regset
support, but less than what I implemented and posted originally--so for
some ARM variants there is still register information only accessible via
old ptrace calls but not in core dumps and no PTRACE_GETREGSET support.
The really trivial stuff like using tracehook_report_* still hasn't gone
in, though I posted that code back then too.)


Thanks,
Roland

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2012-02-01 17:34 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-27 23:24 [PATCH v5 1/3] seccomp: kill the seccomp_t typedef Will Drewry
2012-01-27 23:24 ` [PATCH v5 2/3] seccomp_filters: system call filtering using BPF Will Drewry
2012-01-28 11:21   ` Cong Wang
2012-01-28 20:13     ` Will Drewry
2012-01-28 16:28   ` Andrew Lutomirski
2012-01-28 19:50     ` Will Drewry
2012-01-29  4:39   ` Indan Zupancic
2012-01-29 20:27     ` Will Drewry
2012-01-30  2:37       ` Indan Zupancic
2012-01-30 22:26         ` Will Drewry
2012-01-30 22:29           ` Andrew Lutomirski
2012-01-30 22:42             ` Will Drewry
2012-01-30 22:59             ` Indan Zupancic
2012-01-31  1:42           ` Indan Zupancic
2012-01-31 11:04             ` Will Drewry
2012-01-31 15:18               ` Will Drewry
2012-02-01  1:36               ` Indan Zupancic
2012-02-01  9:02                 ` Will Drewry
2012-02-01 10:56                   ` Indan Zupancic
2012-02-01 17:34                 ` Roland McGrath
2012-01-27 23:24 ` [PATCH v5 3/3] Documentation: prctl/seccomp_filter Will Drewry
2012-01-27 23:41 ` [PATCH v5 1/3] seccomp: kill the seccomp_t typedef Greg KH
2012-01-27 23:47   ` Will Drewry
2012-01-28  0:38     ` Steven Rostedt
2012-01-28 11:13     ` Cong Wang
2012-01-28 21:06       ` Will Drewry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).