* [PATCH v6 1/3] seccomp: kill the seccomp_t typedef
@ 2012-01-28 22:11 Will Drewry
  2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Will Drewry @ 2012-01-28 22:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

Replaces the seccomp_t typedef with seccomp_struct to match modern
kernel style.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 include/linux/sched.h   |    2 +-
 include/linux/seccomp.h |   10 ++++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4032ec1..288b5cb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1418,7 +1418,7 @@ struct task_struct {
 	uid_t loginuid;
 	unsigned int sessionid;
 #endif
-	seccomp_t seccomp;
+	struct seccomp_struct seccomp;
 
 /* Thread group tracking */
    	u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..171ab66 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -7,7 +7,9 @@
 #include <linux/thread_info.h>
 #include <asm/seccomp.h>
 
-typedef struct { int mode; } seccomp_t;
+struct seccomp_struct {
+	int mode;
+};
 
 extern void __secure_computing(int);
 static inline void secure_computing(int this_syscall)
@@ -19,7 +21,7 @@ static inline void secure_computing(int this_syscall)
 extern long prctl_get_seccomp(void);
 extern long prctl_set_seccomp(unsigned long);
 
-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp_struct *s)
 {
 	return s->mode;
 }
@@ -28,7 +30,7 @@ static inline int seccomp_mode(seccomp_t *s)
 
 #include <linux/errno.h>
 
-typedef struct { } seccomp_t;
+struct seccomp_struct { };
 
 #define secure_computing(x) do { } while (0)
 
@@ -42,7 +44,7 @@ static inline long prctl_set_seccomp(unsigned long arg2)
 	return -EINVAL;
 }
 
-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp_struct *s)
 {
 	return 0;
 }
-- 
1.7.5.4



* [PATCH v6 2/3] seccomp_filters: system call filtering using BPF
  2012-01-28 22:11 [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Will Drewry
@ 2012-01-28 22:11 ` Will Drewry
  2012-01-31 14:13   ` Eduardo Otubo
  2012-02-02 15:32   ` Serge E. Hallyn
  2012-01-28 22:11 ` [PATCH v6 3/3] Documentation: prctl/seccomp_filter Will Drewry
  2012-02-02 15:29 ` [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Serge E. Hallyn
  2 siblings, 2 replies; 13+ messages in thread
From: Will Drewry @ 2012-01-28 22:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

[This patch depends on luto@mit.edu's no_new_privs patch:
 https://lkml.org/lkml/2012/1/12/446
]

This patch adds support for seccomp mode 2.  This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task.  The policy is expressed in terms of a Berkeley
Packet Filter program, as is used for userland-exposed socket filtering.
Instead of network data, the BPF program is evaluated over struct
seccomp_filter_data at the time of the system call.
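
As a rough illustration of what "evaluated over struct
seccomp_filter_data" means for a filter author, the sketch below mirrors
the layout from include/linux/seccomp_filter.h in userspace and computes
the byte offsets a BPF load would use.  The helper names and the use of
the compiler's __BYTE_ORDER__ macros are assumptions made for this
example; they are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the patch's layout; assumes a GCC/Clang-style
 * __BYTE_ORDER__ macro and C11 anonymous structs. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define SECCOMP_ENDIAN_SWAP(x, y) x, y
#else
#define SECCOMP_ENDIAN_SWAP(x, y) y, x
#endif

union seccomp_filter_arg {
	struct {
		uint32_t SECCOMP_ENDIAN_SWAP(lo32, hi32);
	};
	uint64_t u64;
};

struct seccomp_filter_data {
	int syscall_nr;
	uint32_t reserved;	/* __reserved in the patch */
	union seccomp_filter_arg args[6];
};

/* Hypothetical helper: offset of the low 32 bits of argument n. */
size_t arg_lo32_offset(unsigned int n)
{
	return offsetof(struct seccomp_filter_data, args)
	       + n * sizeof(union seccomp_filter_arg)
	       + offsetof(union seccomp_filter_arg, lo32);
}

/* Hypothetical helper: offset of the high 32 bits of argument n. */
size_t arg_hi32_offset(unsigned int n)
{
	return offsetof(struct seccomp_filter_data, args)
	       + n * sizeof(union seccomp_filter_arg)
	       + offsetof(union seccomp_filter_arg, hi32);
}
```

A filter would load the system call number from offset 0 and use
offsets such as arg_lo32_offset(0) for argument checks, so the same
source builds correct programs on either byte order.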

A filter program may be installed by a userland task by calling
  prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.
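
A minimal sketch of what such a program might look like follows.  It
defines local types with the same layout as struct sock_filter and
struct sock_fprog so that it builds without kernel headers; the type
names, the build_prog() helper, and the choice of allowed system calls
(x86_64 numbers for read, write and exit) are assumptions for
illustration only.  The sketch stops short of calling prctl(), since
PR_ATTACH_SECCOMP_FILTER is only defined by this patch series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as struct sock_filter / struct sock_fprog. */
struct filter_insn {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

struct filter_prog {
	unsigned short len;
	struct filter_insn *filter;
};

/* Classic BPF encodings, values as in linux/filter.h. */
#define OP_LD_W_ABS	(0x00 | 0x00 | 0x20)	/* BPF_LD | BPF_W | BPF_ABS */
#define OP_JEQ_K	(0x05 | 0x10 | 0x00)	/* BPF_JMP | BPF_JEQ | BPF_K */
#define OP_RET_K	(0x06 | 0x00)		/* BPF_RET | BPF_K */

#define SECCOMP_BPF_ALLOW	0xFFFFFFFFu
#define SECCOMP_BPF_DENY	0u

/* x86_64 system call numbers, used here only as an example. */
enum { NR_READ = 0, NR_WRITE = 1, NR_EXIT = 60 };

static struct filter_insn allow_rw_exit[] = {
	{ OP_LD_W_ABS, 0, 0, 0 },		/* A = syscall_nr (offset 0) */
	{ OP_JEQ_K, 3, 0, NR_READ },		/* match: skip 3, to ALLOW */
	{ OP_JEQ_K, 2, 0, NR_WRITE },		/* match: skip 2, to ALLOW */
	{ OP_JEQ_K, 1, 0, NR_EXIT },		/* match: skip 1, to ALLOW */
	{ OP_RET_K, 0, 0, SECCOMP_BPF_DENY },
	{ OP_RET_K, 0, 0, SECCOMP_BPF_ALLOW },
};

/* Hypothetical helper: wrap the instruction array in an fprog. */
struct filter_prog build_prog(void)
{
	struct filter_prog prog = {
		.len = sizeof(allow_rw_exit) / sizeof(allow_rw_exit[0]),
		.filter = allow_rw_exit,
	};
	return prog;
}
```

With the patch applied, the resulting structure would be passed as the
second argument of prctl(PR_ATTACH_SECCOMP_FILTER, ...).  Note that this
example program does not allow prctl(2), so no further filters could be
attached after it.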

If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached.  Every attached program is
evaluated for each system call, and the call proceeds only if all of
them allow it.
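
The composition rule can be sketched in userspace as below.  This is a
toy model, not the evaluator in this patch: it handles only three
opcodes, runs over a bare system call number, and assumes each program
is well formed and ends in a RET (which sk_chk_filter() would enforce in
the kernel).  All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct insn {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

#define OP_LD_W_ABS	0x20
#define OP_JEQ_K	0x15
#define OP_RET_K	0x06
#define ALLOW		0xFFFFFFFFu
#define DENY		0u

/* Each attached filter points at the one attached before it. */
struct filter {
	const struct filter *parent;
	const struct insn *insns;
};

/* Toy evaluator: only the three opcodes above are understood. */
static uint32_t run_filter(const struct insn *pc, const void *data, size_t len)
{
	uint32_t a = 0;

	for (;; pc++) {
		switch (pc->code) {
		case OP_LD_W_ABS:
			if (pc->k > len || len - pc->k < 4)
				return DENY;
			memcpy(&a, (const uint8_t *)data + pc->k, 4);
			break;
		case OP_JEQ_K:
			pc += (a == pc->k) ? pc->jt : pc->jf;
			break;
		case OP_RET_K:
			return pc->k;
		default:
			return DENY;
		}
	}
}

/* Returns 0 only if every filter in the chain returns ALLOW. */
static int test_filters(const struct filter *newest, int syscall_nr)
{
	int ret = 0;

	if (!newest)
		return -1;
	for (const struct filter *f = newest; f; f = f->parent)
		if (run_filter(f->insns, &syscall_nr, sizeof(syscall_nr)) != ALLOW)
			ret = -1;
	return ret;
}

static const struct insn allow_0_1[] = {
	{ OP_LD_W_ABS, 0, 0, 0 }, { OP_JEQ_K, 2, 0, 0 }, { OP_JEQ_K, 1, 0, 1 },
	{ OP_RET_K, 0, 0, DENY }, { OP_RET_K, 0, 0, ALLOW },
};
static const struct insn allow_1_2[] = {
	{ OP_LD_W_ABS, 0, 0, 0 }, { OP_JEQ_K, 2, 0, 1 }, { OP_JEQ_K, 1, 0, 2 },
	{ OP_RET_K, 0, 0, DENY }, { OP_RET_K, 0, 0, ALLOW },
};
static const struct filter first = { NULL, allow_0_1 };
static const struct filter second = { &first, allow_1_2 };
```

In this model, attaching the second filter narrows the permitted set to
the intersection of the two programs: only system call 1 passes both.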

To avoid CONFIG_COMPAT related landmines, once a filter program is
installed under a specific is_compat_task() value, the task is not
allowed to make system calls using the alternate entry point.

Filter programs are inherited across fork/clone and execve.  However,
the installation of filters must be preceded by setting 'no_new_privs'
to ensure that unprivileged tasks cannot attach filters that affect
privileged tasks (e.g., a setuid binary).  Tasks with CAP_SYS_ADMIN
in their namespace may install inheritable filters without setting
the no_new_privs bit.
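
The installation rule reduces to a single predicate.  The function
below is a hypothetical userspace stand-in: its two parameters model
current->no_new_privs and the result of the CAP_SYS_ADMIN capability
test, and the -EACCES return matches what seccomp_attach_filter() in
this patch reports on refusal.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the permission check: attaching is refused
 * only when the task has neither no_new_privs nor CAP_SYS_ADMIN.
 */
int may_attach_filter(bool no_new_privs, bool has_cap_sys_admin)
{
	if (!no_new_privs && !has_cap_sys_admin)
		return -EACCES;
	return 0;
}
```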

This approach has a number of benefits, including:
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: system call numbers and desired
  arguments.
- No time-of-check-time-of-use vulnerable data accesses are possible.
- System call arguments are loaded on demand only, to minimize the
  copying required for system call number-only policy decisions.
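
The on-demand loading in the last point can be sketched as follows.
This is a simplified userspace model of the idea behind
seccomp_load_pointer(): fetch_args() is a made-up stand-in for
syscall_get_arguments(), the argument values are invented, and the
fetches counter exists only so the laziness can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct data {
	int32_t syscall_nr;
	uint32_t reserved;
	uint64_t args[6];
};

struct metadata {
	struct data data;
	bool has_args;		/* set once the arguments are populated */
	unsigned int fetches;	/* example-only instrumentation */
};

/* Stand-in for syscall_get_arguments(); the values are invented. */
static void fetch_args(struct metadata *m)
{
	for (size_t i = 0; i < 6; i++)
		m->data.args[i] = 100 + i;
	m->fetches++;
}

/*
 * Bounds-check the access, then populate the arguments only if the
 * requested range reaches past the system call number and padding.
 */
const void *load_pointer(struct metadata *m, size_t offset, size_t size)
{
	if (offset >= sizeof(m->data) || size > sizeof(m->data) - offset)
		return NULL;
	if (!m->has_args && offset + size > offsetof(struct data, args)) {
		fetch_args(m);
		m->has_args = true;
	}
	return (const uint8_t *)&m->data + offset;
}
```

A policy that only inspects the system call number never triggers the
argument fetch, and one that inspects arguments pays for it once per
system call.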

This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code.  It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)

 v6: - fix memory leak on attach compat check failure
     - require no_new_privs || CAP_SYS_ADMIN prior to filter
       installation. (luto@mit.edu)
     - s/seccomp_struct_/seccomp_/ for macros/functions
       (amwang@redhat.com)
     - cleaned up Kconfig (amwang@redhat.com)
     - on block, note if the call was compat (so the # means something)
 v5: - uses syscall_get_arguments
       (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
     - uses union-based arg storage with hi/lo struct to
       handle endianness.  Compromises between the two alternate
       proposals to minimize extra arg shuffling and account for
       endianness assuming userspace uses offsetof().
       (mcgrathr@chromium.org, indan@nul.nu)
     - update Kconfig description
     - add include/seccomp_filter.h and add its installation
     - (naive) on-demand syscall argument loading
     - drop seccomp_t (eparis@redhat.com)
 v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
     - now uses current->no_new_privs
         (luto@mit.edu,torvalds@linux-foundation.org)
     - assign names to seccomp modes (rdunlap@xenotime.net)
     - fix style issues (rdunlap@xenotime.net)
     - reworded Kconfig entry (rdunlap@xenotime.net)
 v3: - macros to inline (oleg@redhat.com)
     - init_task behavior fixed (oleg@redhat.com)
     - drop creator entry and extra NULL check (oleg@redhat.com)
     - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
     - adds tentative use of "always_unprivileged" as per
       torvalds@linux-foundation.org and luto@mit.edu
 v2: - (patch 2 only)

Signed-off-by: Will Drewry <wad@chromium.org>
---
 include/linux/Kbuild           |    1 +
 include/linux/prctl.h          |    3 +
 include/linux/seccomp.h        |   63 ++++
 include/linux/seccomp_filter.h |   79 +++++
 kernel/Makefile                |    1 +
 kernel/fork.c                  |    4 +
 kernel/seccomp.c               |   10 +-
 kernel/seccomp_filter.c        |  627 ++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c                   |    4 +
 security/Kconfig               |   20 ++
 10 files changed, 811 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/seccomp_filter.h
 create mode 100644 kernel/seccomp_filter.c

diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index c94e717..5659454 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -330,6 +330,7 @@ header-y += scc.h
 header-y += sched.h
 header-y += screen_info.h
 header-y += sdla.h
+header-y += seccomp_filter.h
 header-y += securebits.h
 header-y += selinux_netlink.h
 header-y += sem.h
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index 7ddc7f1..b8c4beb 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -114,4 +114,7 @@
 # define PR_SET_MM_START_BRK		6
 # define PR_SET_MM_BRK			7
 
+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER	37
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 171ab66..d3b896b 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,10 +5,29 @@
 #ifdef CONFIG_SECCOMP
 
 #include <linux/thread_info.h>
+#include <linux/types.h>
 #include <asm/seccomp.h>
 
+/* Valid values of seccomp_struct.mode */
+#define SECCOMP_MODE_DISABLED	0 /* seccomp is not in use. */
+#define SECCOMP_MODE_STRICT	1 /* uses hard-coded seccomp.c rules. */
+#define SECCOMP_MODE_FILTER	2 /* system call access determined by filter. */
+
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode:  indicates one of the valid values above for controlled
+ *         system calls available to a process.
+ * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
+ *          @filter must only be accessed from the context of current as there
+ *          is no guard.
+ */
 struct seccomp_struct {
 	int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+	struct seccomp_filter *filter;
+#endif
 };
 
 extern void __secure_computing(int);
@@ -51,4 +70,48 @@ static inline int seccomp_mode(struct seccomp_struct *s)
 
 #endif /* CONFIG_SECCOMP */
 
+#ifdef CONFIG_SECCOMP_FILTER
+
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_fork(struct seccomp_struct *child,
+			 const struct seccomp_struct *parent);
+
+static inline void seccomp_init_task(struct seccomp_struct *seccomp)
+{
+	seccomp->mode = SECCOMP_MODE_DISABLED;
+	seccomp->filter = NULL;
+}
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+static inline void seccomp_free_task(struct seccomp_struct *seccomp)
+{
+	put_seccomp_filter(seccomp->filter);
+	seccomp->filter = NULL;
+}
+
+#else  /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+/* Macros consume the unused dereference by the caller. */
+#define seccomp_init_task(_seccomp) do { } while (0)
+#define seccomp_fork(_tsk, _orig) do { } while (0)
+#define seccomp_free_task(_seccomp) do { } while (0)
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+	return -ENOSYS;
+}
+
+#endif  /* CONFIG_SECCOMP_FILTER */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/linux/seccomp_filter.h b/include/linux/seccomp_filter.h
new file mode 100644
index 0000000..3ecd641
--- /dev/null
+++ b/include/linux/seccomp_filter.h
@@ -0,0 +1,79 @@
+/*
+ * Seccomp-based system call filtering data structures and definitions.
+ *
+ * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ *
+ * This copyrighted material is made available to anyone wishing to use,
+ * modify, copy, or redistribute it subject to the terms and conditions
+ * of the GNU General Public License v.2.
+ *
+ */
+
+#ifndef __LINUX_SECCOMP_FILTER_H__
+#define __LINUX_SECCOMP_FILTER_H__
+
+#include <asm/byteorder.h>
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+/*
+ *	Keep the contents of this file similar to linux/filter.h:
+ *	  struct sock_filter and sock_fprog and versions.
+ *	Custom naming exists solely if divergence is ever needed.
+ */
+
+/*
+ * Current version of the filter code architecture.
+ */
+#define SECCOMP_BPF_MAJOR_VERSION 1
+#define SECCOMP_BPF_MINOR_VERSION 1
+
+struct seccomp_filter_block {	/* Filter block */
+	__u16	code;   /* Actual filter code */
+	__u8	jt;	/* Jump true */
+	__u8	jf;	/* Jump false */
+	__u32	k;      /* Generic multiuse field */
+};
+
+struct seccomp_fprog {	/* Required for PR_ATTACH_SECCOMP_FILTER. */
+	unsigned short		len;	/* Number of filter blocks */
+	struct seccomp_filter_block __user *filter;
+};
+
+/* Ensure the u32 ordering is consistent with platform byte order. */
+#if defined(__LITTLE_ENDIAN)
+#define SECCOMP_ENDIAN_SWAP(x, y) x, y
+#elif defined(__BIG_ENDIAN)
+#define SECCOMP_ENDIAN_SWAP(x, y) y, x
+#else
+#error edit for your odd arch byteorder.
+#endif
+
+/* System call argument layout for the filter data. */
+union seccomp_filter_arg {
+	struct {
+		__u32 SECCOMP_ENDIAN_SWAP(lo32, hi32);
+	};
+	__u64 u64;
+};
+
+/*
+ *	Expected data the BPF program will execute over.
+ *	Endianness will be arch specific, but the values will be
+ *	swapped, as above, to allow for consistent BPF programs.
+ */
+struct seccomp_filter_data {
+	int syscall_nr;
+	__u32 __reserved;
+	union seccomp_filter_arg args[6];
+};
+
+#undef SECCOMP_ENDIAN_SWAP
+
+/*
+ * Defined valid return values for the BPF program.
+ */
+#define SECCOMP_BPF_ALLOW	0xFFFFFFFF
+#define SECCOMP_BPF_DENY	0
+
+#endif /* __LINUX_SECCOMP_FILTER_H__ */
diff --git a/kernel/Makefile b/kernel/Makefile
index 2d9de86..fd81bac 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -78,6 +78,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TREE_RCU) += rcutree.o
 obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index 051f090..0007933 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
 #include <linux/cgroup.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
+#include <linux/seccomp.h>
 #include <linux/swap.h>
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
+	seccomp_free_task(&tsk->seccomp);
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -1093,6 +1095,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto fork_out;
 
 	ftrace_graph_init_task(p);
+	seccomp_init_task(&p->seccomp);
 
 	rt_mutex_init_task(p);
 
@@ -1376,6 +1379,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_change_end(current);
 	perf_event_fork(p);
+	seccomp_fork(&p->seccomp, &current->seccomp);
 
 	trace_task_newtask(p, clone_flags);
 
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index e8d76c5..a045dd4 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -37,7 +37,7 @@ void __secure_computing(int this_syscall)
 	int * syscall;
 
 	switch (mode) {
-	case 1:
+	case SECCOMP_MODE_STRICT:
 		syscall = mode1_syscalls;
 #ifdef CONFIG_COMPAT
 		if (is_compat_task())
@@ -48,6 +48,14 @@ void __secure_computing(int this_syscall)
 				return;
 		} while (*++syscall);
 		break;
+#ifdef CONFIG_SECCOMP_FILTER
+	case SECCOMP_MODE_FILTER:
+		if (seccomp_test_filters(this_syscall) == 0)
+			return;
+
+		seccomp_filter_log_failure(this_syscall);
+		break;
+#endif
 	default:
 		BUG();
 	}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..0e2e56c
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,627 @@
+/*
+ * linux/kernel/seccomp_filter.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ *
+ * Extends linux/kernel/seccomp.c to allow tasks to install system call
+ * filters using a Berkeley Packet Filter program which is executed over
+ * struct seccomp_filter_data.
+ */
+
+#include <asm/syscall.h>
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/seccomp_filter.h>
+#include <linux/security.h>
+#include <linux/seccomp.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ *         get/put helpers should be used when accessing an instance
+ *         outside of a lifetime-guarded section.  In general, this
+ *         is only needed for handling filters shared across tasks.
+ * @parent: pointer to the ancestor which this filter will be composed with.
+ * @insns: the BPF program instructions to evaluate
+ * @count: the number of instructions in the program.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+	struct kref usage;
+	struct seccomp_filter *parent;
+	struct {
+		uint32_t compat:1;
+	} flags;
+	unsigned short count;  /* Instruction count */
+	struct sock_filter insns[0];
+};
+
+/*
+ * struct seccomp_filter_metadata - BPF data wrapper
+ * @data: data accessible to the BPF program.
+ * @has_args: indicates that the args have been lazily populated.
+ *
+ * Used by seccomp_load_pointer.
+ */
+struct seccomp_filter_metadata {
+	struct seccomp_filter_data data;
+	bool has_args;
+};
+
+static unsigned int seccomp_run_filter(void *, uint32_t,
+				       const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ * @padding: size of the insns[0] array in bytes
+ *
+ * The @padding should be a multiple of
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+	struct seccomp_filter *f;
+	unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+	/* Drop oversized requests. */
+	if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+		return ERR_PTR(-EINVAL);
+
+	/* Padding should always be in sock_filter increments. */
+	if (padding % sizeof(struct sock_filter))
+		return ERR_PTR(-EINVAL);
+
+	f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+	if (!f)
+		return ERR_PTR(-ENOMEM);
+	kref_init(&f->usage);
+	f->count = bpf_blocks;
+	return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ * @filter: NULL or live object to be completely destructed.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+	if (!filter)
+		return;
+	put_seccomp_filter(filter->parent);
+	kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+	struct seccomp_filter *orig =
+		container_of(kref, struct seccomp_filter, usage);
+	seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+	int compat = 0;
+#ifdef CONFIG_COMPAT
+	compat = is_compat_task();
+#endif
+	pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
+		current->comm, task_pid_nr(current),
+		(compat ? "compat " : ""),
+		syscall, KSTK_EIP(current));
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+	if (!orig)
+		return;
+	kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+	if (!orig)
+		return NULL;
+	kref_get(&orig->usage);
+	return orig;
+}
+
+#if BITS_PER_LONG == 32
+static inline unsigned long *seccomp_filter_data_arg(
+				struct seccomp_filter_data *data, int index)
+{
+	/* Avoid inconsistent hi contents. */
+	data->args[index].hi32 = 0;
+	return (unsigned long *) &(data->args[index].lo32);
+}
+#elif BITS_PER_LONG == 64
+static inline unsigned long *seccomp_filter_data_arg(
+				struct seccomp_filter_data *data, int index)
+{
+	return (unsigned long *) &(data->args[index].u64);
+}
+#else
+#error Unknown BITS_PER_LONG.
+#endif
+
+/**
+ * seccomp_load_pointer: checks and returns a pointer to the requested offset
+ * @data: struct seccomp_filter_metadata wrapping the data to index into;
+ *        its length is fixed at sizeof(struct seccomp_filter_data)
+ * @offset: offset to return data from
+ * @size: size of the data to retrieve at offset
+ * @buffer: unused placeholder which net/core/filter.c uses for temporary
+ *          storage.  Ideally, the two code paths can be merged.
+ *
+ * Returns a pointer for the BPF evaluator to read from after checking the
+ * offset and size boundaries, or NULL if the access is out of bounds.
+ */
+static inline void *seccomp_load_pointer(void *data, int offset, size_t size,
+					 void *buffer)
+{
+	struct seccomp_filter_metadata *metadata = data;
+	int arg;
+	if (offset >= sizeof(metadata->data))
+		goto fail;
+	if (offset < 0)
+		goto fail;
+	if (size > sizeof(metadata->data) - offset)
+		goto fail;
+	if (metadata->has_args)
+		goto pass;
+	/* No argument data touched. */
+	if (offset + size - 1 < offsetof(struct seccomp_filter_data, args))
+		goto pass;
+	for (arg = 0; arg < ARRAY_SIZE(metadata->data.args); ++arg)
+		syscall_get_arguments(current, task_pt_regs(current), arg, 1,
+			seccomp_filter_data_arg(&metadata->data, arg));
+	metadata->has_args = true;
+pass:
+	return ((__u8 *)(&metadata->data)) + offset;
+fail:
+	return NULL;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+	int ret = -EACCES;
+	struct seccomp_filter *filter;
+	struct seccomp_filter_metadata metadata;
+
+	filter = current->seccomp.filter; /* uses task ref */
+	if (!filter)
+		goto out;
+
+	metadata.data.syscall_nr = syscall;
+	metadata.has_args = false;
+
+#ifdef CONFIG_COMPAT
+	if (filter->flags.compat != !!(is_compat_task()))
+		goto out;
+#endif
+
+	/* Only allow a system call if it is allowed in all ancestors. */
+	ret = 0;
+	for ( ; filter != NULL; filter = filter->parent) {
+		/* Allowed if return value is SECCOMP_BPF_ALLOW */
+		if (seccomp_run_filter(&metadata, sizeof(metadata.data),
+					filter->insns) != SECCOMP_BPF_ALLOW)
+			ret = -EACCES;
+	}
+out:
+	return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Context: User context only. This function may sleep on allocation and
+ *          operates on current. current must be attempting a system call
+ *          when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+	struct seccomp_filter *filter = NULL;
+	/* Note, len is a short so overflow should be impossible. */
+	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+	long ret = -EPERM;
+
+	/* Allocate a new seccomp_filter */
+	filter = seccomp_filter_alloc(fp_size);
+	if (IS_ERR(filter)) {
+		ret = PTR_ERR(filter);
+		goto out;
+	}
+
+	/* Copy the instructions from fprog. */
+	ret = -EFAULT;
+	if (copy_from_user(filter->insns, fprog->filter, fp_size))
+		goto out;
+
+	/* Check the fprog */
+	ret = sk_chk_filter(filter->insns, filter->count);
+	if (ret)
+		goto out;
+
+	/*
+	 * Installing a seccomp filter requires that the task
+	 * have CAP_SYS_ADMIN in its namespace or be running with
+	 * no_new_privs.  This avoids scenarios where unprivileged
+	 * tasks can affect the behavior of privileged children.
+	 */
+	ret = -EACCES;
+	if (!current->no_new_privs &&
+	    security_capable_noaudit(current_cred(), current_user_ns(),
+				     CAP_SYS_ADMIN) != 0)
+		goto out;
+
+#ifdef CONFIG_COMPAT
+	/* Disallow changing system calling conventions after the fact. */
+	filter->flags.compat = !!(is_compat_task());
+
+	if (current->seccomp.filter &&
+	    current->seccomp.filter->flags.compat != filter->flags.compat)
+		goto out;
+#endif
+
+	/*
+	 * Make any existing filter the parent, reusing the task-based ref.
+	 * Doing so after the last failure path keeps that ref on error.
+	 */
+	filter->parent = current->seccomp.filter;
+
+	/*
+	 * Double claim the new filter so we can release it below simplifying
+	 * the error paths earlier.
+	 */
+	ret = 0;
+	get_seccomp_filter(filter);
+	current->seccomp.filter = filter;
+	/* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+	if (current->seccomp.mode == SECCOMP_MODE_DISABLED) {
+		current->seccomp.mode = SECCOMP_MODE_FILTER;
+		set_thread_flag(TIF_SECCOMP);
+	}
+
+out:
+	put_seccomp_filter(filter);  /* for get or task, on err */
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+/* This should be kept in sync with net/compat.c which changes infrequently. */
+struct compat_sock_fprog {
+	u16 len;
+	compat_uptr_t filter;	/* struct sock_filter */
+};
+
+static long compat_attach_seccomp_filter(char __user *optval)
+{
+	struct compat_sock_fprog __user *fprog32 =
+		(struct compat_sock_fprog __user *)optval;
+	struct sock_fprog __user *kfprog =
+		compat_alloc_user_space(sizeof(struct sock_fprog));
+	compat_uptr_t ptr;
+	u16 len;
+
+	if (!access_ok(VERIFY_READ, fprog32, sizeof(*fprog32)) ||
+	    !access_ok(VERIFY_WRITE, kfprog, sizeof(struct sock_fprog)) ||
+	    __get_user(len, &fprog32->len) ||
+	    __get_user(ptr, &fprog32->filter) ||
+	    __put_user(len, &kfprog->len) ||
+	    __put_user(compat_ptr(ptr), &kfprog->filter))
+		return -EFAULT;
+
+	return seccomp_attach_filter(kfprog);
+}
+#endif
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+	struct sock_fprog fprog;
+	long ret = -EFAULT;
+
+	if (!user_filter)
+		goto out;
+
+#ifdef CONFIG_COMPAT
+	if (is_compat_task())
+		return compat_attach_seccomp_filter(user_filter);
+#endif
+
+	if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+		goto out;
+
+	ret = seccomp_attach_filter(&fprog);
+out:
+	return ret;
+}
+
+/**
+ * seccomp_fork: manages inheritance on fork
+ * @child: forkee's seccomp_struct
+ * @parent: forker's seccomp_struct
+ *
+ * Ensures that @child inherits seccomp mode and state iff
+ * seccomp filtering is in use.
+ */
+void seccomp_fork(struct seccomp_struct *child,
+			 const struct seccomp_struct *parent)
+{
+	child->mode = parent->mode;
+	if (parent->mode != SECCOMP_MODE_FILTER)
+		return;
+	child->filter = get_seccomp_filter(parent->filter);
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF
+ *	@buf: opaque buffer to execute the filter over
+ *	@buflen: length of the buffer
+ *	@fentry: filter to apply
+ *
+ * Decode and apply filter instructions to the buffer.  Return length to
+ * keep, 0 for none. @buf is a seccomp_filter_metadata we are filtering,
+ * @filter is the array of filter instructions.  Because all jumps are
+ * guaranteed to be before last instruction, and last instruction
+ * guaranteed to be a RET, we dont need to check flen.
+ *
+ * See core/net/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ */
+static unsigned int seccomp_run_filter(void *data, uint32_t datalen,
+			  const struct sock_filter *fentry)
+{
+	const void *ptr;
+	u32 A = 0;			/* Accumulator */
+	u32 X = 0;			/* Index Register */
+	u32 mem[BPF_MEMWORDS];		/* Scratch Memory Store */
+	u32 tmp;
+	int k;
+
+	/*
+	 * Process array of filter instructions.
+	 */
+	for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define	K (fentry->k)
+#else
+		const u32 K = fentry->k;
+#endif
+
+		switch (fentry->code) {
+		case BPF_S_ALU_ADD_X:
+			A += X;
+			continue;
+		case BPF_S_ALU_ADD_K:
+			A += K;
+			continue;
+		case BPF_S_ALU_SUB_X:
+			A -= X;
+			continue;
+		case BPF_S_ALU_SUB_K:
+			A -= K;
+			continue;
+		case BPF_S_ALU_MUL_X:
+			A *= X;
+			continue;
+		case BPF_S_ALU_MUL_K:
+			A *= K;
+			continue;
+		case BPF_S_ALU_DIV_X:
+			if (X == 0)
+				return 0;
+			A /= X;
+			continue;
+		case BPF_S_ALU_DIV_K:
+			A = reciprocal_divide(A, K);
+			continue;
+		case BPF_S_ALU_AND_X:
+			A &= X;
+			continue;
+		case BPF_S_ALU_AND_K:
+			A &= K;
+			continue;
+		case BPF_S_ALU_OR_X:
+			A |= X;
+			continue;
+		case BPF_S_ALU_OR_K:
+			A |= K;
+			continue;
+		case BPF_S_ALU_LSH_X:
+			A <<= X;
+			continue;
+		case BPF_S_ALU_LSH_K:
+			A <<= K;
+			continue;
+		case BPF_S_ALU_RSH_X:
+			A >>= X;
+			continue;
+		case BPF_S_ALU_RSH_K:
+			A >>= K;
+			continue;
+		case BPF_S_ALU_NEG:
+			A = -A;
+			continue;
+		case BPF_S_JMP_JA:
+			fentry += K;
+			continue;
+		case BPF_S_JMP_JGT_K:
+			fentry += (A > K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGE_K:
+			fentry += (A >= K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JEQ_K:
+			fentry += (A == K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JSET_K:
+			fentry += (A & K) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGT_X:
+			fentry += (A > X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JGE_X:
+			fentry += (A >= X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JEQ_X:
+			fentry += (A == X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_JMP_JSET_X:
+			fentry += (A & X) ? fentry->jt : fentry->jf;
+			continue;
+		case BPF_S_LD_W_ABS:
+			k = K;
+load_w:
+			ptr = seccomp_load_pointer(data, k, 4, &tmp);
+			if (ptr != NULL) {
+				/*
+				 * Assume load_pointer did any byte swapping.
+				 */
+				A = *(const u32 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_H_ABS:
+			k = K;
+load_h:
+			ptr = seccomp_load_pointer(data, k, 2, &tmp);
+			if (ptr != NULL) {
+				A = *(const u16 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_B_ABS:
+			k = K;
+load_b:
+			ptr = seccomp_load_pointer(data, k, 1, &tmp);
+			if (ptr != NULL) {
+				A = *(const u8 *)ptr;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_W_LEN:
+			A = datalen;
+			continue;
+		case BPF_S_LDX_W_LEN:
+			X = datalen;
+			continue;
+		case BPF_S_LD_W_IND:
+			k = X + K;
+			goto load_w;
+		case BPF_S_LD_H_IND:
+			k = X + K;
+			goto load_h;
+		case BPF_S_LD_B_IND:
+			k = X + K;
+			goto load_b;
+		case BPF_S_LDX_B_MSH:
+			ptr = seccomp_load_pointer(data, K, 1, &tmp);
+			if (ptr != NULL) {
+				X = (*(u8 *)ptr & 0xf) << 2;
+				continue;
+			}
+			return 0;
+		case BPF_S_LD_IMM:
+			A = K;
+			continue;
+		case BPF_S_LDX_IMM:
+			X = K;
+			continue;
+		case BPF_S_LD_MEM:
+			A = mem[K];
+			continue;
+		case BPF_S_LDX_MEM:
+			X = mem[K];
+			continue;
+		case BPF_S_MISC_TAX:
+			X = A;
+			continue;
+		case BPF_S_MISC_TXA:
+			A = X;
+			continue;
+		case BPF_S_RET_K:
+			return K;
+		case BPF_S_RET_A:
+			return A;
+		case BPF_S_ST:
+			mem[K] = A;
+			continue;
+		case BPF_S_STX:
+			mem[K] = X;
+			continue;
+		case BPF_S_ANC_PROTOCOL:
+		case BPF_S_ANC_PKTTYPE:
+		case BPF_S_ANC_IFINDEX:
+		case BPF_S_ANC_MARK:
+		case BPF_S_ANC_QUEUE:
+		case BPF_S_ANC_HATYPE:
+		case BPF_S_ANC_RXHASH:
+		case BPF_S_ANC_CPU:
+		case BPF_S_ANC_NLATTR:
+		case BPF_S_ANC_NLATTR_NEST:
+			continue;
+		default:
+			WARN_RATELIMIT(1, "Unknown code:%u jt:%u jf:%u k:%u\n",
+				       fentry->code, fentry->jt,
+				       fentry->jf, fentry->k);
+			return 0;
+		}
+	}
+
+	return 0;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 4070153..8e43f70 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1901,6 +1901,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		case PR_SET_SECCOMP:
 			error = prctl_set_seccomp(arg2);
 			break;
+		case PR_ATTACH_SECCOMP_FILTER:
+			error = prctl_attach_seccomp_filter((char __user *)
+								arg2);
+			break;
 		case PR_GET_TSC:
 			error = GET_TSC_CTL(arg2);
 			break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..3c55d36 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,26 @@ config SECURITY_DMESG_RESTRICT
 
 	  If you are unsure how to answer this question, answer N.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	select SECCOMP
+	help
+	  This option provides support for limiting the accessibility of
+	  system calls at the task level using a dynamically defined policy.
+
+	  System call filtering policy is expressed as a Berkeley Packet
+	  Filter (BPF) program.  The program is attached using prctl(2)
+	  and cannot be detached.  Once attached, the filter program
+	  evaluates every system call the task makes, along with its
+	  arguments.  Its output determines whether the system call may
+	  proceed.  If the system call is disallowed, the task is
+	  terminated immediately.
+
+	  Dynamically limiting system call access aids software in the
+	  creation of secure computation environments.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config SECURITY
 	bool "Enable different security models"
 	depends on SYSFS
-- 
1.7.5.4
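
A note on the jump cases in the interpreter above: jt and jf are counted
from the instruction that follows the jump, because the loop's own
increment still runs after each "continue".  A minimal userspace sketch of
that rule, assuming classic BPF semantics; struct insn, run() and the OP_*
opcodes are made-up names for illustration, not the kernel's BPF_S_* values
or types:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative opcodes only; not the kernel's BPF_S_* values. */
enum { OP_JEQ_K, OP_JSET_K, OP_RET_K };

/* Simplified stand-in for the classic BPF {code, jt, jf, k} tuple. */
struct insn {
	uint16_t code;
	uint8_t jt;	/* extra instructions to skip when the test is true */
	uint8_t jf;	/* extra instructions to skip when the test is false */
	uint32_t k;
};

/*
 * Runs prog with accumulator a.  Returns the operand of the first RET
 * reached, or 0 if control leaves the program or hits an unknown opcode.
 */
static uint32_t run(const struct insn *prog, size_t len, uint32_t a)
{
	size_t pc;

	assert(prog != NULL);
	for (pc = 0; pc < len; pc++) {
		const struct insn *in = &prog[pc];

		switch (in->code) {
		case OP_JEQ_K:
			/* The loop's pc++ still runs after continue. */
			pc += (a == in->k) ? in->jt : in->jf;
			continue;
		case OP_JSET_K:
			pc += (a & in->k) ? in->jt : in->jf;
			continue;
		case OP_RET_K:
			return in->k;
		default:
			return 0;
		}
	}
	return 0;
}
```

With the three-instruction program { JEQ 5 jt=1 jf=0; RET 0; RET 0xffffffff },
run() reaches the second RET only when a == 5, and the first RET otherwise.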


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 3/3] Documentation: prctl/seccomp_filter
  2012-01-28 22:11 [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Will Drewry
  2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
@ 2012-01-28 22:11 ` Will Drewry
  2012-01-30 22:47   ` Corey Bryant
  2012-02-02 15:29 ` [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Serge E. Hallyn
  2 siblings, 1 reply; 13+ messages in thread
From: Will Drewry @ 2012-01-28 22:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: keescook, john.johansen, serge.hallyn, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, wad, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

Documents how system call filtering using Berkeley Packet
Filter programs works and how it may be used.
Includes an example for x86 (32-bit) and a semi-generic
example using an example code generator.

v6: - tweak the language to note the requirement of
      PR_SET_NO_NEW_PRIVS being called prior to use. (luto@mit.edu)
v5: - update sample to use system call arguments
    - adds a "fancy" example using a macro-based generator
    - cleaned up bpf in the sample
    - update docs to mention arguments
    - fix prctl value (eparis@redhat.com)
    - language cleanup (rdunlap@xenotime.net)
v4: - update for no_new_privs use
    - minor tweaks
v3: - call out BPF <-> Berkeley Packet Filter (rdunlap@xenotime.net)
    - document use of tentative always-unprivileged
    - guard sample compilation for i386 and x86_64
v2: - move code to samples (corbet@lwn.net)

Signed-off-by: Will Drewry <wad@chromium.org>
---
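
As a rough illustration of what the generator's fixup pass has to do: one
backward walk records where each label sits, then rewrites every jump that
names a label into a relative offset counted from the instruction after the
jump.  This is a simplified sketch, not the helper's actual interface;
struct slot, enum kind, UNRESOLVED and resolve() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNRESOLVED 0xffffffffu

/* Hypothetical, simplified instruction kinds. */
enum kind { K_OTHER, K_JUMP, K_LABEL };

/* For K_JUMP and K_LABEL, k holds a label id until fixup rewrites it. */
struct slot {
	enum kind kind;
	uint32_t k;
};

/*
 * location[] needs one entry per label id, each preset to UNRESOLVED.
 * Returns 0 on success, -1 on a duplicate label or on a jump whose label
 * has not been seen yet (which includes every backward jump).
 */
static int resolve(struct slot *prog, size_t count, uint32_t *location)
{
	size_t i;

	assert(prog != NULL && location != NULL);
	/* Last slot first, so a label is seen before jumps that target it. */
	for (i = count; i-- > 0; ) {
		struct slot *s = &prog[i];

		if (s->kind == K_LABEL) {
			if (location[s->k] != UNRESOLVED)
				return -1;
			location[s->k] = (uint32_t)i;
			s->k = 0;	/* a jump by 0 just falls through */
		} else if (s->kind == K_JUMP) {
			if (location[s->k] == UNRESOLVED)
				return -1;
			/* Offset is relative to the next instruction. */
			s->k = location[s->k] - (uint32_t)(i + 1);
		}
	}
	return 0;
}
```

For { JUMP L0; other; LABEL L0; other } the jump ends up with k == 1, i.e.
it skips exactly the one instruction between itself and the label.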
 Documentation/prctl/seccomp_filter.txt |  100 +++++++++++++++
 samples/Makefile                       |    2 +-
 samples/seccomp/Makefile               |   27 ++++
 samples/seccomp/bpf-direct.c           |   77 +++++++++++
 samples/seccomp/bpf-fancy.c            |   95 ++++++++++++++
 samples/seccomp/bpf-helper.c           |   89 +++++++++++++
 samples/seccomp/bpf-helper.h           |  219 ++++++++++++++++++++++++++++++++
 7 files changed, 608 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/prctl/seccomp_filter.txt
 create mode 100644 samples/seccomp/Makefile
 create mode 100644 samples/seccomp/bpf-direct.c
 create mode 100644 samples/seccomp/bpf-fancy.c
 create mode 100644 samples/seccomp/bpf-helper.c
 create mode 100644 samples/seccomp/bpf-helper.h

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..4ad7649
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,100 @@
+		Seccomp filtering
+		=================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process,
+and many of them go unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated.  A
+certain subset of userland applications benefits from having a reduced
+set of available system calls.  The resulting set reduces the total
+kernel surface exposed to the application.  System call filtering is
+meant for use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter for
+incoming system calls.  The filter is expressed as a Berkeley Packet
+Filter (BPF) program, as with socket filters, except that the data
+operated on is related to the system call being made: system call
+number, and the system call arguments.  This allows for expressive
+filtering of system calls using a filter program language with a long
+history of being exposed to userland and a straightforward data set.
+
+Additionally, BPF makes it impossible for users of seccomp to fall prey
+to time-of-check-time-of-use (TOCTOU) attacks that are common in system
+call interposition frameworks.  BPF programs may not dereference
+pointers, which constrains all filters to evaluating the system call
+arguments directly.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox.  It provides a clearly defined
+mechanism for minimizing the exposed kernel surface.  Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing.  Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but it is not set directly by
+the consuming process.  The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set, and it is enabled by calling prctl(2)
+with the PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+	Allows the specification of a new filter using a BPF program.
+	The BPF program will be executed over struct seccomp_filter_data
+	reflecting the system call number, arguments, and other
+	metadata.  To allow a system call, SECCOMP_BPF_ALLOW must be
+	returned.  At present, all other return values result in the
+	system call being blocked, but it is recommended to return
+	SECCOMP_BPF_DENY in those cases.  This will allow for future
+	custom return values to be introduced, if ever desired.
+
+	Usage:
+		prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+	The 'prog' argument is a pointer to a struct seccomp_fprog which
+	contains the filter program.  If the program is invalid, the call
+	will return -1 and set errno to EINVAL.
+
+	Note, is_compat_task is also tracked for the @prog.  This means
+	that once set the calling task will have all of its system calls
+	blocked if it switches its system call ABI.
+
+	If fork/clone and execve are allowed by @prog, any child processes will
+	be constrained to the same filters and system call ABI as the parent.
+
+	Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
+	run with CAP_SYS_ADMIN privileges in its namespace.  If neither is
+	true, -EACCES will be returned.  This requirement ensures that filter
+	programs cannot be applied to child processes with greater privileges
+	than the task that installed them.
+
+	Additionally, if prctl(2) is allowed by the attached filter,
+	further filters may be layered on.  Each one increases evaluation
+	time, but allows the attack surface to be reduced further during
+	execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Example
+-------
+
+The samples/seccomp/ directory contains both a 32-bit specific example
+and a more generic example of a higher level macro interface for BPF
+program generation.
+
+Adding architecture support
+---------------------------
+
+Any platform with seccomp support will support seccomp filters as long
+as CONFIG_SECCOMP_FILTER is enabled and the architecture has implemented
+syscall_get_arguments.
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
 # Makefile for Linux samples code
 
 obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ tracepoints/ trace_events/ \
-			   hw_breakpoint/ kfifo/ kdb/ hidraw/
+			   hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..0298c6f
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,27 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+hostprogs-y := bpf-fancy
+bpf-fancy-objs := bpf-fancy.o bpf-helper.o
+
+HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
+HOSTCFLAGS_bpf-helper.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-helper.o += -idirafter $(objtree)/include
+
+# bpf-direct.c is x86-only.
+ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
+# List of programs to build
+hostprogs-y += bpf-direct
+bpf-direct-objs := bpf-direct.o
+endif
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
+ifeq ($(KBUILD_BUILDHOST),x86_64)
+HOSTCFLAGS_bpf-direct.o += -m32
+HOSTLOADLIBES_bpf-direct += -m32
+endif
diff --git a/samples/seccomp/bpf-direct.c b/samples/seccomp/bpf-direct.c
new file mode 100644
index 0000000..d799244
--- /dev/null
+++ b/samples/seccomp/bpf-direct.c
@@ -0,0 +1,77 @@
+/*
+ * 32-bit seccomp filter example with BPF macros
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/seccomp_filter.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+#	define PR_ATTACH_SECCOMP_FILTER 37
+#endif
+
+#define syscall_arg(_n) (offsetof(struct seccomp_filter_data, args[_n].lo32))
+#define nr (offsetof(struct seccomp_filter_data, syscall_nr))
+
+static int install_filter(void)
+{
+	struct seccomp_filter_block filter[] = {
+		/* Grab the system call number */
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS, nr),
+		/* Jump table for the allowed syscalls */
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
+
+		/* Check that read is only using stdin. */
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
+
+		/* Check that write is only using stdout/stderr */
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
+
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_ALLOW),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_DENY),
+	};
+	struct seccomp_fprog prog = {
+		.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+		.filter = filter,
+	};
+	if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+		perror("prctl");
+		return 1;
+	}
+	return 0;
+}
+
+#define payload(_c) (_c), sizeof((_c))
+int main(int argc, char **argv)
+{
+	char buf[4096];
+	ssize_t bytes = 0;
+	if (install_filter())
+		return 1;
+	syscall(__NR_write, STDOUT_FILENO,
+		payload("OHAI! WHAT IS YOUR NAME? "));
+	bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+	syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+	syscall(__NR_write, STDOUT_FILENO, buf, bytes > 0 ? bytes : 0);
+	return 0;
+}
diff --git a/samples/seccomp/bpf-fancy.c b/samples/seccomp/bpf-fancy.c
new file mode 100644
index 0000000..1318b1a
--- /dev/null
+++ b/samples/seccomp/bpf-fancy.c
@@ -0,0 +1,95 @@
+/*
+ * Seccomp BPF example using a macro-based generator.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <linux/seccomp_filter.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#include "bpf-helper.h"
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+#	define PR_ATTACH_SECCOMP_FILTER 37
+#endif
+
+int main(int argc, char **argv)
+{
+	struct bpf_labels l = { .count = 0 };
+	static const char msg1[] = "Please type something: ";
+	static const char msg2[] = "You typed: ";
+	char buf[256];
+	struct seccomp_filter_block filter[] = {
+		LOAD_SYSCALL_NR,
+		SYSCALL(__NR_exit, ALLOW),
+		SYSCALL(__NR_exit_group, ALLOW),
+		SYSCALL(__NR_write, JUMP(&l, write_fd)),
+		SYSCALL(__NR_read, JUMP(&l, read)),
+		DENY,  /* Don't passthrough into a label */
+
+		LABEL(&l, read),
+		ARG(0),
+		JNE(STDIN_FILENO, DENY),
+		ARG(1),
+		JNE((unsigned long)buf, DENY),
+		ARG(2),
+		JGE(sizeof(buf), DENY),
+		ALLOW,
+
+		LABEL(&l, write_fd),
+		ARG(0),
+		JEQ(STDOUT_FILENO, JUMP(&l, write_buf)),
+		JEQ(STDERR_FILENO, JUMP(&l, write_buf)),
+		DENY,
+
+		LABEL(&l, write_buf),
+		ARG(1),
+		JEQ((unsigned long)msg1, JUMP(&l, msg1_len)),
+		JEQ((unsigned long)msg2, JUMP(&l, msg2_len)),
+		JEQ((unsigned long)buf, JUMP(&l, buf_len)),
+		DENY,
+
+		LABEL(&l, msg1_len),
+		ARG(2),
+		JLT(sizeof(msg1), ALLOW),
+		DENY,
+
+		LABEL(&l, msg2_len),
+		ARG(2),
+		JLT(sizeof(msg2), ALLOW),
+		DENY,
+
+		LABEL(&l, buf_len),
+		ARG(2),
+		JLT(sizeof(buf), ALLOW),
+		DENY,
+	};
+	struct seccomp_fprog prog = {
+		.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+		.filter = filter,
+	};
+	ssize_t bytes;
+	bpf_resolve_jumps(&l, filter, sizeof(filter)/sizeof(*filter));
+
+	if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+		perror("prctl");
+		return 1;
+	}
+	syscall(__NR_write, STDOUT_FILENO, msg1, strlen(msg1));
+	bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf)-1);
+	bytes = (bytes > 0 ? bytes : 0);
+	syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2));
+	syscall(__NR_write, STDERR_FILENO, buf, bytes);
+	/* Now get killed */
+	syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2)+2);
+	return 0;
+}
diff --git a/samples/seccomp/bpf-helper.c b/samples/seccomp/bpf-helper.c
new file mode 100644
index 0000000..e1b6bc7
--- /dev/null
+++ b/samples/seccomp/bpf-helper.c
@@ -0,0 +1,89 @@
+/*
+ * Seccomp BPF helper functions
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include "bpf-helper.h"
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+		      struct seccomp_filter_block *filter, size_t count)
+{
+	struct seccomp_filter_block *begin = filter;
+	size_t insn = count - 1;
+
+	if (count < 1)
+		return -1;
+	/*
+	 * Walk it once, backwards, to build the label table and do fixups.
+	 * Since backward jumps are disallowed by BPF, this is easy.
+	 */
+	filter += insn;
+	for (; filter >= begin; --insn, --filter) {
+		if (filter->code != (BPF_JMP+BPF_JA))
+			continue;
+		switch ((filter->jt<<8)|filter->jf) {
+		case (JUMP_JT<<8)|JUMP_JF:
+			if (labels->labels[filter->k].location == 0xffffffff) {
+				fprintf(stderr, "Unresolved label: '%s'\n",
+					labels->labels[filter->k].label);
+				return 1;
+			}
+			filter->k = labels->labels[filter->k].location -
+				    (insn + 1);
+			filter->jt = 0;
+			filter->jf = 0;
+			continue;
+		case (LABEL_JT<<8)|LABEL_JF:
+			if (labels->labels[filter->k].location != 0xffffffff) {
+				fprintf(stderr, "Duplicate label use: '%s'\n",
+					labels->labels[filter->k].label);
+				return 1;
+			}
+			labels->labels[filter->k].location = insn;
+			filter->k = 0; /* fall through */
+			filter->jt = 0;
+			filter->jf = 0;
+			continue;
+		}
+	}
+	return 0;
+}
+
+/* Simple lookup table for labels. */
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label)
+{
+	struct __bpf_label *begin = labels->labels, *end;
+	int id;
+	if (labels->count == 0) {
+		begin->label = label;
+		begin->location = 0xffffffff;
+		labels->count++;
+		return 0;
+	}
+	end = begin + labels->count;
+	for (id = 0; begin < end; ++begin, ++id) {
+		if (!strcmp(label, begin->label))
+			return id;
+	}
+	begin->label = label;
+	begin->location = 0xffffffff;
+	labels->count++;
+	return id;
+}
+
+void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count)
+{
+	struct seccomp_filter_block *end = filter + count;
+	for ( ; filter < end; ++filter)
+		printf("{ code=%u,jt=%u,jf=%u,k=%u },\n",
+			filter->code, filter->jt, filter->jf, filter->k);
+}
diff --git a/samples/seccomp/bpf-helper.h b/samples/seccomp/bpf-helper.h
new file mode 100644
index 0000000..92b94ec
--- /dev/null
+++ b/samples/seccomp/bpf-helper.h
@@ -0,0 +1,219 @@
+/*
+ * Example wrapper around BPF macros.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ * Author: Will Drewry <wad@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ *
+ * No guarantees are provided with respect to the correctness
+ * or functionality of this code.
+ */
+#ifndef __BPF_HELPER_H__
+#define __BPF_HELPER_H__
+
+#include <asm/bitsperlong.h>	/* for __BITS_PER_LONG */
+#include <linux/filter.h>
+#include <linux/seccomp_filter.h>	/* for seccomp_filter_data.arg */
+#include <linux/types.h>
+#include <linux/unistd.h>
+#include <stddef.h>
+
+#define BPF_LABELS_MAX 256
+struct bpf_labels {
+	int count;
+	struct __bpf_label {
+		const char *label;
+		__u32 location;
+	} labels[BPF_LABELS_MAX];
+};
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+		      struct seccomp_filter_block *filter, size_t count);
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label);
+void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count);
+
+#define JUMP_JT 0xff
+#define JUMP_JF 0xff
+#define LABEL_JT 0xfe
+#define LABEL_JF 0xfe
+
+#define ALLOW \
+	BPF_STMT(BPF_RET+BPF_K, 0xFFFFFFFF)
+#define DENY \
+	BPF_STMT(BPF_RET+BPF_K, 0)
+#define JUMP(labels, label) \
+	BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+		 JUMP_JT, JUMP_JF)
+#define LABEL(labels, label) \
+	BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+		 LABEL_JT, LABEL_JF)
+#define SYSCALL(nr, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (nr), 0, 1), \
+	jt
+
+/* Lame, but just an example */
+#define FIND_LABEL(labels, label) seccomp_bpf_label((labels), #label)
+
+#define EXPAND(...) __VA_ARGS__
+/* Map all width-sensitive operations */
+#if __BITS_PER_LONG == 32
+
+#define JEQ(x, jt) JEQ32(x, EXPAND(jt))
+#define JNE(x, jt) JNE32(x, EXPAND(jt))
+#define JGT(x, jt) JGT32(x, EXPAND(jt))
+#define JLT(x, jt) JLT32(x, EXPAND(jt))
+#define JGE(x, jt) JGE32(x, EXPAND(jt))
+#define JLE(x, jt) JLE32(x, EXPAND(jt))
+#define JA(x, jt) JA32(x, EXPAND(jt))
+#define ARG(i) ARG_32(i)
+
+#elif __BITS_PER_LONG == 64
+
+#define JEQ(x, jt) \
+	JEQ64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+#define JGT(x, jt) \
+	JGT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+#define JGE(x, jt) \
+	JGE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+#define JNE(x, jt) \
+	JNE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+#define JLT(x, jt) \
+	JLT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+#define JLE(x, jt) \
+	JLE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	      EXPAND(jt))
+
+#define JA(x, jt) \
+	JA64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
+	       ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
+	       EXPAND(jt))
+#define ARG(i) ARG_64(i)
+
+#else
+#error __BITS_PER_LONG value unusable.
+#endif
+
+/* Loads the arg into A */
+#define ARG_32(idx) \
+	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+		offsetof(struct seccomp_filter_data, args[(idx)].lo32))
+
+/* Stores lo in M[0] and hi in M[1]; hi is left in A */
+#define ARG_64(idx) \
+	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+	  offsetof(struct seccomp_filter_data, args[(idx)].lo32)), \
+	BPF_STMT(BPF_ST, 0), /* lo -> M[0] */ \
+	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+	  offsetof(struct seccomp_filter_data, args[(idx)].hi32)), \
+	BPF_STMT(BPF_ST, 1) /* hi -> M[1] */
+
+#define JEQ32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 0, 1), \
+	jt
+
+#define JNE32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 1, 0), \
+	jt
+
+/* Checks hi (in A) first, then swaps in lo from M[0] to check it. */
+#define JEQ64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 0, 2), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JNE64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 2, 0), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JA32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (value), 0, 1), \
+	jt
+
+#define JA64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (hi), 3, 0), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (lo), 0, 2), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGE32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 0, 1), \
+	jt
+
+#define JLT32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 1, 0), \
+	jt
+
+/* Shortcut: passes immediately if arg.hi > hi. */
+#define JGE64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 0, 2), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLT64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (hi), 0, 4), \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 2, 0), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGT32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
+	jt
+
+#define JLE32(value, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 1, 0), \
+	jt
+
+/* Check arg.hi > hi first, then do the GT check on lo */
+#define JGT64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 0, 2), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLE64(lo, hi, jt) \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 6, 0), \
+	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), \
+	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
+	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+	jt, \
+	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define LOAD_SYSCALL_NR \
+	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+		 offsetof(struct seccomp_filter_data, syscall_nr))
+
+#endif  /* __BPF_HELPER_H__ */
-- 
1.7.5.4
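
The 64-bit macros in bpf-helper.h compare an argument one 32-bit half at a
time, since the classic BPF accumulator is 32 bits wide.  The decision
order used by JGE64 (high words first, low words only when the high words
are equal) can be checked in plain C.  ge64_by_halves() and check_ge64()
below are hypothetical helpers written for this note, not part of the
patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when arg >= value, using only 32-bit comparisons on the
 * two halves of each operand.
 */
static bool ge64_by_halves(uint64_t arg, uint64_t value)
{
	uint32_t arg_lo = (uint32_t)arg;
	uint32_t arg_hi = (uint32_t)(arg >> 32);
	uint32_t val_lo = (uint32_t)value;
	uint32_t val_hi = (uint32_t)(value >> 32);

	if (arg_hi > val_hi)
		return true;	/* shortcut: the high word already decides */
	if (arg_hi != val_hi)
		return false;	/* arg_hi < val_hi */
	return arg_lo >= val_lo;
}

/* Cross-checks the split comparison against the native 64-bit one. */
static void check_ge64(uint64_t arg, uint64_t value)
{
	assert(ge64_by_halves(arg, value) == (arg >= value));
}
```

Calling check_ge64() with operands on either side of a 32-bit boundary,
such as 0xffffffff and 0x100000000, exercises both the shortcut and the
low-word path.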


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/3] Documentation: prctl/seccomp_filter
  2012-01-28 22:11 ` [PATCH v6 3/3] Documentation: prctl/seccomp_filter Will Drewry
@ 2012-01-30 22:47   ` Corey Bryant
  2012-01-30 22:52     ` Will Drewry
  0 siblings, 1 reply; 13+ messages in thread
From: Corey Bryant @ 2012-01-30 22:47 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, pmoore,
	eparis, djm, torvalds, segoon, rostedt, jmorris, scarybeasts,
	avi, penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr



On 01/28/2012 05:11 PM, Will Drewry wrote:
> Documents how system call filtering using Berkeley Packet
> Filter programs works and how it may be used.
> Includes an example for x86 (32-bit) and a semi-generic
> example using an example code generator.
>
> v6: - tweak the language to note the requirement of
>        PR_SET_NO_NEW_PRIVS being called prior to use. (luto@mit.edu)
> v5: - update sample to use system call arguments
>      - adds a "fancy" example using a macro-based generator
>      - cleaned up bpf in the sample
>      - update docs to mention arguments
>      - fix prctl value (eparis@redhat.com)
>      - language cleanup (rdunlap@xenotime.net)
> v4: - update for no_new_privs use
>      - minor tweaks
> v3: - call out BPF <-> Berkeley Packet Filter (rdunlap@xenotime.net)
>      - document use of tentative always-unprivileged
>      - guard sample compilation for i386 and x86_64
> v2: - move code to samples (corbet@lwn.net)
>
> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
>   Documentation/prctl/seccomp_filter.txt |  100 +++++++++++++++
>   samples/Makefile                       |    2 +-
>   samples/seccomp/Makefile               |   27 ++++
>   samples/seccomp/bpf-direct.c           |   77 +++++++++++
>   samples/seccomp/bpf-fancy.c            |   95 ++++++++++++++
>   samples/seccomp/bpf-helper.c           |   89 +++++++++++++
>   samples/seccomp/bpf-helper.h           |  219 ++++++++++++++++++++++++++++++++
>   7 files changed, 608 insertions(+), 1 deletions(-)
>   create mode 100644 Documentation/prctl/seccomp_filter.txt
>   create mode 100644 samples/seccomp/Makefile
>   create mode 100644 samples/seccomp/bpf-direct.c
>   create mode 100644 samples/seccomp/bpf-fancy.c
>   create mode 100644 samples/seccomp/bpf-helper.c
>   create mode 100644 samples/seccomp/bpf-helper.h
>
> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
> new file mode 100644
> index 0000000..4ad7649
> --- /dev/null
> +++ b/Documentation/prctl/seccomp_filter.txt
> @@ -0,0 +1,100 @@
> +		Seccomp filtering
> +		=================
> +
> +Introduction
> +------------
> +
> +A large number of system calls are exposed to every userland process
> +with many of them going unused for the entire lifetime of the process.
> +As system calls change and mature, bugs are found and eradicated.  A
> +certain subset of userland applications benefit by having a reduced set
> +of available system calls.  The resulting set reduces the total kernel
> +surface exposed to the application.  System call filtering is meant for
> +use with those applications.
> +
> +Seccomp filtering provides a means for a process to specify a filter for
> +incoming system calls.  The filter is expressed as a Berkeley Packet
> +Filter (BPF) program, as with socket filters, except that the data
> +operated on is related to the system call being made: system call
> +number, and the system call arguments.  This allows for expressive
> +filtering of system calls using a filter program language with a long
> +history of being exposed to userland and a straightforward data set.
> +
> +Additionally, BPF makes it impossible for users of seccomp to fall prey
> +to time-of-check-time-of-use (TOCTOU) attacks that are common in system
> +call interposition frameworks.  BPF programs may not dereference
> +pointers which constrains all filters to solely evaluating the system
> +call arguments directly.
> +
> +What it isn't
> +-------------
> +
> +System call filtering isn't a sandbox.  It provides a clearly defined
> +mechanism for minimizing the exposed kernel surface.  Beyond that,
> +policy for logical behavior and information flow should be managed with
> +a combination of other system hardening techniques and, potentially, an
> +LSM of your choosing.  Expressive, dynamic filters provide further options down
> +this path (avoiding pathological sizes or selecting which of the multiplexed
> +system calls in socketcall() is allowed, for instance) which could be
> +construed, incorrectly, as a more complete sandboxing solution.
> +
> +Usage
> +-----
> +
> +An additional seccomp mode is added, but it is not set directly by
> +the consuming process.  The new mode, '2', is only available if
> +CONFIG_SECCOMP_FILTER is set, and it is enabled by calling prctl(2)
> +with the PR_ATTACH_SECCOMP_FILTER argument.
> +
> +Interacting with seccomp filters is done using one prctl(2) call.
> +
> +PR_ATTACH_SECCOMP_FILTER:
> +	Allows the specification of a new filter using a BPF program.
> +	The BPF program will be executed over struct seccomp_filter_data
> +	reflecting the system call number, arguments, and other
> +	metadata.  To allow a system call, SECCOMP_BPF_ALLOW must be
> +	returned.  At present, all other return values result in the
> +	system call being blocked, but it is recommended to return
> +	SECCOMP_BPF_DENY in those cases.  This will allow for future
> +	custom return values to be introduced, if ever desired.
> +
> +	Usage:
> +		prctl(PR_ATTACH_SECCOMP_FILTER, prog);
> +
> +	The 'prog' argument is a pointer to a struct sock_fprog which will
> +	contain the filter program.  If the program is invalid, the call
> +	will return -1 and set errno to EINVAL.
> +
> +	Note, is_compat_task is also tracked for the @prog.  This means
> +	that once set the calling task will have all of its system calls
> +	blocked if it switches its system call ABI.
> +
> +	If fork/clone and execve are allowed by @prog, any child processes will
> +	be constrained to the same filters and system call ABI as the parent.
> +
> +	Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
> +	run with CAP_SYS_ADMIN privileges in its namespace.  If these are not
> +	true, -EACCES will be returned.  This requirement ensures that filter
> +	programs cannot be applied to child processes with greater privileges
> +	than the task that installed them.
> +
> +	Additionally, if prctl(2) is allowed by the attached filter,
> +	additional filters may be layered on which will increase evaluation
> +	time, but allow for further decreasing the attack surface during
> +	execution of a process.
> +
> +The above call returns 0 on success and non-zero on error.
> +
> +Example
> +-------
> +
> +The samples/seccomp/ directory contains both a 32-bit specific example
> +and a more generic example of a higher level macro interface for BPF
> +program generation.
> +
> +Adding architecture support
> +---------------------------
> +
> +Any platform with seccomp support will support seccomp filters as long
> +as CONFIG_SECCOMP_FILTER is enabled and the architecture has implemented
> +syscall_get_arguments.
> diff --git a/samples/Makefile b/samples/Makefile
> index 6280817..f29b19c 100644
> --- a/samples/Makefile
> +++ b/samples/Makefile
> @@ -1,4 +1,4 @@
>   # Makefile for Linux samples code
>
>   obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ tracepoints/ trace_events/ \
> -			   hw_breakpoint/ kfifo/ kdb/ hidraw/
> +			   hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
> diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
> new file mode 100644
> index 0000000..0298c6f
> --- /dev/null
> +++ b/samples/seccomp/Makefile
> @@ -0,0 +1,27 @@
> +# kbuild trick to avoid linker error. Can be omitted if a module is built.
> +obj- := dummy.o
> +
> +hostprogs-y := bpf-fancy
> +bpf-fancy-objs := bpf-fancy.o bpf-helper.o
> +
> +HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
> +HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
> +HOSTCFLAGS_bpf-helper.o += -I$(objtree)/usr/include
> +HOSTCFLAGS_bpf-helper.o += -idirafter $(objtree)/include
> +
> +# bpf-direct.c is x86-only.
> +ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
> +# List of programs to build
> +hostprogs-y += bpf-direct
> +bpf-direct-objs := bpf-direct.o
> +endif
> +
> +# Tell kbuild to always build the programs
> +always := $(hostprogs-y)
> +
> +HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
> +HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
> +ifeq ($(KBUILD_BUILDHOST),x86_64)
> +HOSTCFLAGS_bpf-direct.o += -m32
> +HOSTLOADLIBES_bpf-direct += -m32
> +endif
> diff --git a/samples/seccomp/bpf-direct.c b/samples/seccomp/bpf-direct.c
> new file mode 100644
> index 0000000..d799244
> --- /dev/null
> +++ b/samples/seccomp/bpf-direct.c
> @@ -0,0 +1,77 @@
> +/*
> + * 32-bit seccomp filter example with BPF macros
> + *
> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + * Author: Will Drewry <wad@chromium.org>
> + *
> + * The code may be used by anyone for any purpose,
> + * and can serve as a starting point for developing
> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
> + */
> +
> +#include <linux/filter.h>
> +#include <linux/ptrace.h>
> +#include <linux/seccomp_filter.h>
> +#include <linux/unistd.h>
> +#include <stdio.h>
> +#include <stddef.h>
> +#include <sys/prctl.h>
> +#include <unistd.h>
> +
> +#ifndef PR_ATTACH_SECCOMP_FILTER
> +#	define PR_ATTACH_SECCOMP_FILTER 37
> +#endif
> +
> +#define syscall_arg(_n) (offsetof(struct seccomp_filter_data, args[_n].lo32))
> +#define nr (offsetof(struct seccomp_filter_data, syscall_nr))
> +
> +static int install_filter(void)
> +{
> +	struct seccomp_filter_block filter[] = {
> +		/* Grab the system call number */
> +		BPF_STMT(BPF_LD+BPF_W+BPF_ABS, nr),
> +		/* Jump table for the allowed syscalls */
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
> +
> +		/* Check that read is only using stdin. */
> +		BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
> +
> +		/* Check that write is only using stdout/stderr */
> +		BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
> +
> +		BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_ALLOW),
> +		BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_DENY),
> +	};
> +	struct seccomp_fprog prog = {
> +		.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
> +		.filter = filter,
> +	};
> +	if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
> +		perror("prctl");
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +#define payload(_c) (_c), sizeof((_c))
> +int main(int argc, char **argv)
> +{
> +	char buf[4096];
> +	ssize_t bytes = 0;
> +	if (install_filter())
> +		return 1;
> +	syscall(__NR_write, STDOUT_FILENO,
> +		payload("OHAI! WHAT IS YOUR NAME? "));
> +	bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
> +	syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
> +	syscall(__NR_write, STDOUT_FILENO, buf, bytes);
> +	return 0;
> +}
> diff --git a/samples/seccomp/bpf-fancy.c b/samples/seccomp/bpf-fancy.c
> new file mode 100644
> index 0000000..1318b1a
> --- /dev/null
> +++ b/samples/seccomp/bpf-fancy.c
> @@ -0,0 +1,95 @@
> +/*
> + * Seccomp BPF example using a macro-based generator.
> + *
> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + * Author: Will Drewry <wad@chromium.org>
> + *
> + * The code may be used by anyone for any purpose,
> + * and can serve as a starting point for developing
> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
> + */
> +
> +#include <linux/seccomp_filter.h>
> +#include <linux/unistd.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <sys/prctl.h>
> +#include <unistd.h>
> +
> +#include "bpf-helper.h"
> +
> +#ifndef PR_ATTACH_SECCOMP_FILTER
> +#	define PR_ATTACH_SECCOMP_FILTER 37
> +#endif
> +
> +int main(int argc, char **argv)
> +{
> +	struct bpf_labels l;
> +	static const char msg1[] = "Please type something: ";
> +	static const char msg2[] = "You typed: ";
> +	char buf[256];
> +	struct seccomp_filter_block filter[] = {
> +		LOAD_SYSCALL_NR,
> +		SYSCALL(__NR_exit, ALLOW),
> +		SYSCALL(__NR_exit_group, ALLOW),
> +		SYSCALL(__NR_write, JUMP(&l, write_fd)),
> +		SYSCALL(__NR_read, JUMP(&l, read)),
> +		DENY,  /* Don't passthrough into a label */
> +
> +		LABEL(&l, read),
> +		ARG(0),
> +		JNE(STDIN_FILENO, DENY),
> +		ARG(1),
> +		JNE((unsigned long)buf, DENY),
> +		ARG(2),
> +		JGE(sizeof(buf), DENY),
> +		ALLOW,
> +
> +		LABEL(&l, write_fd),
> +		ARG(0),
> +		JEQ(STDOUT_FILENO, JUMP(&l, write_buf)),
> +		JEQ(STDERR_FILENO, JUMP(&l, write_buf)),
> +		DENY,
> +
> +		LABEL(&l, write_buf),
> +		ARG(1),
> +		JEQ((unsigned long)msg1, JUMP(&l, msg1_len)),
> +		JEQ((unsigned long)msg2, JUMP(&l, msg2_len)),
> +		JEQ((unsigned long)buf, JUMP(&l, buf_len)),
> +		DENY,
> +
> +		LABEL(&l, msg1_len),
> +		ARG(2),
> +		JLT(sizeof(msg1), ALLOW),
> +		DENY,
> +
> +		LABEL(&l, msg2_len),
> +		ARG(2),
> +		JLT(sizeof(msg2), ALLOW),
> +		DENY,
> +
> +		LABEL(&l, buf_len),
> +		ARG(2),
> +		JLT(sizeof(buf), ALLOW),
> +		DENY,
> +	};
> +	struct seccomp_fprog prog = {
> +		.len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
> +		.filter = filter,
> +	};
> +	ssize_t bytes;
> +	bpf_resolve_jumps(&l, filter, sizeof(filter)/sizeof(*filter));
> +
> +	if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
> +		perror("prctl");
> +		return 1;
> +	}
> +	syscall(__NR_write, STDOUT_FILENO, msg1, strlen(msg1));
> +	bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf)-1);
> +	bytes = (bytes > 0 ? bytes : 0);
> +	syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2));
> +	syscall(__NR_write, STDERR_FILENO, buf, bytes);
> +	/* Now get killed */
> +	syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2)+2);
> +	return 0;
> +}
> diff --git a/samples/seccomp/bpf-helper.c b/samples/seccomp/bpf-helper.c
> new file mode 100644
> index 0000000..e1b6bc7
> --- /dev/null
> +++ b/samples/seccomp/bpf-helper.c
> @@ -0,0 +1,89 @@
> +/*
> + * Seccomp BPF helper functions
> + *
> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + * Author: Will Drewry <wad@chromium.org>
> + *
> + * The code may be used by anyone for any purpose,
> + * and can serve as a starting point for developing
> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
> + */
> +
> +#include <stdio.h>
> +#include <string.h>
> +
> +#include "bpf-helper.h"
> +
> +int bpf_resolve_jumps(struct bpf_labels *labels,
> +		      struct seccomp_filter_block *filter, size_t count)
> +{
> +	struct seccomp_filter_block *begin = filter;
> +	__u8 insn = count - 1;
> +
> +	if (count < 1)
> +		return -1;
> +	/*
> +	* Walk it once, backwards, to build the label table and do fixups.
> +	* Since backward jumps are disallowed by BPF, this is easy.
> +	*/
> +	filter += insn;
> +	for (; filter >= begin; --insn, --filter) {
> +		if (filter->code != (BPF_JMP+BPF_JA))
> +			continue;
> +		switch ((filter->jt<<8)|filter->jf) {
> +		case (JUMP_JT<<8)|JUMP_JF:
> +			if (labels->labels[filter->k].location == 0xffffffff) {
> +				fprintf(stderr, "Unresolved label: '%s'\n",
> +					labels->labels[filter->k].label);
> +				return 1;
> +			}
> +			filter->k = labels->labels[filter->k].location -
> +				    (insn + 1);
> +			filter->jt = 0;
> +			filter->jf = 0;
> +			continue;
> +		case (LABEL_JT<<8)|LABEL_JF:
> +			if (labels->labels[filter->k].location != 0xffffffff) {
> +				fprintf(stderr, "Duplicate label use: '%s'\n",
> +					labels->labels[filter->k].label);
> +				return 1;
> +			}
> +			labels->labels[filter->k].location = insn;
> +			filter->k = 0; /* fall through */
> +			filter->jt = 0;
> +			filter->jf = 0;
> +			continue;
> +		}
> +	}
> +	return 0;
> +}
> +
> +/* Simple lookup table for labels. */
> +__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label)
> +{
> +	struct __bpf_label *begin = labels->labels, *end;
> +	int id;
> +	if (labels->count == 0) {
> +		begin->label = label;
> +		begin->location = 0xffffffff;
> +		labels->count++;
> +		return 0;
> +	}
> +	end = begin + labels->count;
> +	for (id = 0; begin < end; ++begin, ++id) {
> +		if (!strcmp(label, begin->label))
> +			return id;
> +	}
> +	begin->label = label;
> +	begin->location = 0xffffffff;
> +	labels->count++;
> +	return id;
> +}
> +
> +void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count)
> +{
> +	struct seccomp_filter_block *end = filter + count;
> +	for ( ; filter < end; ++filter)
> +		printf("{ code=%u,jt=%u,jf=%u,k=%u },\n",
> +			filter->code, filter->jt, filter->jf, filter->k);
> +}
> diff --git a/samples/seccomp/bpf-helper.h b/samples/seccomp/bpf-helper.h
> new file mode 100644
> index 0000000..92b94ec
> --- /dev/null
> +++ b/samples/seccomp/bpf-helper.h
> @@ -0,0 +1,219 @@
> +/*
> + * Example wrapper around BPF macros.
> + *
> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
> + * Author: Will Drewry <wad@chromium.org>
> + *
> + * The code may be used by anyone for any purpose,
> + * and can serve as a starting point for developing
> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
> + *
> + * No guarantees are provided with respect to the correctness
> + * or functionality of this code.
> + */
> +#ifndef __BPF_HELPER_H__
> +#define __BPF_HELPER_H__
> +
> +#include <asm/bitsperlong.h>	/* for __BITS_PER_LONG */
> +#include <linux/filter.h>
> +#include <linux/seccomp_filter.h>	/* for seccomp_filter_data.arg */
> +#include <linux/types.h>
> +#include <linux/unistd.h>
> +#include <stddef.h>
> +
> +#define BPF_LABELS_MAX 256
> +struct bpf_labels {
> +	int count;
> +	struct __bpf_label {
> +		const char *label;
> +		__u32 location;
> +	} labels[BPF_LABELS_MAX];
> +};
> +
> +int bpf_resolve_jumps(struct bpf_labels *labels,
> +		      struct seccomp_filter_block *filter, size_t count);
> +__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label);
> +void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count);
> +
> +#define JUMP_JT 0xff
> +#define JUMP_JF 0xff
> +#define LABEL_JT 0xfe
> +#define LABEL_JF 0xfe
> +
> +#define ALLOW \
> +	BPF_STMT(BPF_RET+BPF_K, 0xFFFFFFFF)
> +#define DENY \
> +	BPF_STMT(BPF_RET+BPF_K, 0)
> +#define JUMP(labels, label) \
> +	BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
> +		 JUMP_JT, JUMP_JF)
> +#define LABEL(labels, label) \
> +	BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
> +		 LABEL_JT, LABEL_JF)
> +#define SYSCALL(nr, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (nr), 0, 1), \
> +	jt
> +
> +/* Lame, but just an example */
> +#define FIND_LABEL(labels, label) seccomp_bpf_label((labels), #label)
> +
> +#define EXPAND(...) __VA_ARGS__
> +/* Map all width-sensitive operations */
> +#if __BITS_PER_LONG == 32
> +
> +#define JEQ(x, jt) JEQ32(x, EXPAND(jt))
> +#define JNE(x, jt) JNE32(x, EXPAND(jt))
> +#define JGT(x, jt) JGT32(x, EXPAND(jt))
> +#define JLT(x, jt) JLT32(x, EXPAND(jt))
> +#define JGE(x, jt) JGE32(x, EXPAND(jt))
> +#define JLE(x, jt) JLE32(x, EXPAND(jt))
> +#define JA(x, jt) JA32(x, EXPAND(jt))
> +#define ARG(i) ARG_32(i)
> +
> +#elif __BITS_PER_LONG == 64
> +
> +#define JEQ(x, jt) \
> +	JEQ64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> +	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> +	      EXPAND(jt))
> +#define JGT(x, jt) \
> +	JGT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> +	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> +	      EXPAND(jt))
> +#define JGE(x, jt) \
> +	JGE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> +	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> +	      EXPAND(jt))
> +#define JNE(x, jt) \
> +	JNE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> +	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> +	      EXPAND(jt))
> +#define JLT(x, jt) \
> +	JLT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> +	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> +	      EXPAND(jt))
> +#define JLE(x, jt) \
> +	JLE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> +	      ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> +	      EXPAND(jt))
> +
> +#define JA(x, jt) \
> +	JA64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
> +	       ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
> +	       EXPAND(jt))
> +#define ARG(i) ARG_64(i)
> +
> +#else
> +#error __BITS_PER_LONG value unusable.
> +#endif
> +
> +/* Loads the arg into A */
> +#define ARG_32(idx) \
> +	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
> +		offsetof(struct seccomp_filter_data, args[(idx)].lo32))
> +
> +/* Stores lo in M[0] and hi in M[1]; A is left holding hi */
> +#define ARG_64(idx) \
> +	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
> +	  offsetof(struct seccomp_filter_data, args[(idx)].lo32)), \
> +	BPF_STMT(BPF_ST, 0), /* lo -> M[0] */ \
> +	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
> +	  offsetof(struct seccomp_filter_data, args[(idx)].hi32)), \
> +	BPF_STMT(BPF_ST, 1) /* hi -> M[1] */
> +
> +#define JEQ32(value, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 0, 1), \
> +	jt
> +
> +#define JNE32(value, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 1, 0), \
> +	jt
> +
> +/* Checks the lo, then swaps to check the hi. A=lo,X=hi */
> +#define JEQ64(lo, hi, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> +	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 0, 2), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> +	jt, \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JNE64(lo, hi, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 5, 0), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> +	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 2, 0), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> +	jt, \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JA32(value, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (value), 0, 1), \
> +	jt
> +
> +#define JA64(lo, hi, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (hi), 3, 0), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> +	BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (lo), 0, 2), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> +	jt, \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JGE32(value, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 0, 1), \
> +	jt
> +
> +#define JLT32(value, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 1, 0), \
> +	jt
> +
> +/* Shortcut checking if hi > arg.hi. */
> +#define JGE64(lo, hi, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
> +	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> +	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 0, 2), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> +	jt, \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JLT64(lo, hi, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (hi), 0, 4), \
> +	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> +	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> +	jt, \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JGT32(value, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
> +	jt
> +
> +#define JLE32(value, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
> +	jt

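Should the true/false offsets be reversed here?  As posted, JLE32 expands to
the same BPF_JGT instruction with the same (0, 1) offsets as JGT32 above.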

Thanks for all the work on this.  We're looking forward to using it with 
QEMU.

> +
> +/* Check hi > args.hi first, then do the GE checking */
> +#define JGT64(lo, hi, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
> +	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> +	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 0, 2), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> +	jt, \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define JLE64(lo, hi, jt) \
> +	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 6, 0), \
> +	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
> +	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
> +	jt, \
> +	BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
> +
> +#define LOAD_SYSCALL_NR \
> +	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
> +		 offsetof(struct seccomp_filter_data, syscall_nr))
> +
> +#endif  /* __BPF_HELPER_H__ */


-- 
Regards,
Corey



* Re: [PATCH v6 3/3] Documentation: prctl/seccomp_filter
  2012-01-30 22:47   ` Corey Bryant
@ 2012-01-30 22:52     ` Will Drewry
  0 siblings, 0 replies; 13+ messages in thread
From: Will Drewry @ 2012-01-30 22:52 UTC (permalink / raw)
  To: Corey Bryant
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, pmoore,
	eparis, djm, torvalds, segoon, rostedt, jmorris, scarybeasts,
	avi, penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

On Mon, Jan 30, 2012 at 4:47 PM, Corey Bryant <coreyb@linux.vnet.ibm.com> wrote:
>
>
> On 01/28/2012 05:11 PM, Will Drewry wrote:
>>
>> Documents how system call filtering using Berkeley Packet
>> Filter programs works and how it may be used.
>> Includes an example for x86 (32-bit) and a semi-generic
>> example using an example code generator.
>>
>> v6: - tweak the language to note the requirement of
>>       PR_SET_NO_NEW_PRIVS being called prior to use. (luto@mit.edu)
>> v5: - update sample to use system call arguments
>>     - adds a "fancy" example using a macro-based generator
>>     - cleaned up bpf in the sample
>>     - update docs to mention arguments
>>     - fix prctl value (eparis@redhat.com)
>>     - language cleanup (rdunlap@xenotime.net)
>> v4: - update for no_new_privs use
>>     - minor tweaks
>> v3: - call out BPF <-> Berkeley Packet Filter (rdunlap@xenotime.net)
>>     - document use of tentative always-unprivileged
>>     - guard sample compilation for i386 and x86_64
>> v2: - move code to samples (corbet@lwn.net)
>>
>> Signed-off-by: Will Drewry <wad@chromium.org>
>> ---
>>  Documentation/prctl/seccomp_filter.txt |  100 +++++++++++++++
>>  samples/Makefile                       |    2 +-
>>  samples/seccomp/Makefile               |   27 ++++
>>  samples/seccomp/bpf-direct.c           |   77 +++++++++++
>>  samples/seccomp/bpf-fancy.c            |   95 ++++++++++++++
>>  samples/seccomp/bpf-helper.c           |   89 +++++++++++++
>>  samples/seccomp/bpf-helper.h           |  219 ++++++++++++++++++++++++++++++++
>>  7 files changed, 608 insertions(+), 1 deletions(-)
>>  create mode 100644 Documentation/prctl/seccomp_filter.txt
>>  create mode 100644 samples/seccomp/Makefile
>>  create mode 100644 samples/seccomp/bpf-direct.c
>>  create mode 100644 samples/seccomp/bpf-fancy.c
>>  create mode 100644 samples/seccomp/bpf-helper.c
>>  create mode 100644 samples/seccomp/bpf-helper.h
>>
>> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
>> new file mode 100644
>> index 0000000..4ad7649
>> --- /dev/null
>> +++ b/Documentation/prctl/seccomp_filter.txt
>> @@ -0,0 +1,100 @@
>> +               Seccomp filtering
>> +               =================
>> +
>> +Introduction
>> +------------
>> +
>> +A large number of system calls are exposed to every userland process
>> +with many of them going unused for the entire lifetime of the process.
>> +As system calls change and mature, bugs are found and eradicated.  A
>> +certain subset of userland applications benefit by having a reduced set
>> +of available system calls.  The resulting set reduces the total kernel
>> +surface exposed to the application.  System call filtering is meant for
>> +use with those applications.
>> +
>> +Seccomp filtering provides a means for a process to specify a filter for
>> +incoming system calls.  The filter is expressed as a Berkeley Packet
>> +Filter (BPF) program, as with socket filters, except that the data
>> +operated on is related to the system call being made: system call
>> +number, and the system call arguments.  This allows for expressive
>> +filtering of system calls using a filter program language with a long
>> +history of being exposed to userland and a straightforward data set.
>> +
>> +Additionally, BPF makes it impossible for users of seccomp to fall prey
>> +to time-of-check-time-of-use (TOCTOU) attacks that are common in system
>> +call interposition frameworks.  BPF programs may not dereference
>> +pointers which constrains all filters to solely evaluating the system
>> +call arguments directly.
>> +
>> +What it isn't
>> +-------------
>> +
>> +System call filtering isn't a sandbox.  It provides a clearly defined
>> +mechanism for minimizing the exposed kernel surface.  Beyond that,
>> +policy for logical behavior and information flow should be managed with
>> +a combination of other system hardening techniques and, potentially, an
>> +LSM of your choosing.  Expressive, dynamic filters provide further options down
>> +this path (avoiding pathological sizes or selecting which of the multiplexed
>> +system calls in socketcall() is allowed, for instance) which could be
>> +construed, incorrectly, as a more complete sandboxing solution.
>> +
>> +Usage
>> +-----
>> +
>> +An additional seccomp mode is added, but it is not directly set by
>> +the consuming process.  The new mode, '2', is only available if
>> +CONFIG_SECCOMP_FILTER is set and enabled using prctl with the
>> +PR_ATTACH_SECCOMP_FILTER argument.
>> +
>> +Interacting with seccomp filters is done using one prctl(2) call.
>> +
>> +PR_ATTACH_SECCOMP_FILTER:
>> +       Allows the specification of a new filter using a BPF program.
>> +       The BPF program will be executed over struct seccomp_filter_data
>> +       reflecting the system call number, arguments, and other
>> +       metadata.  To allow a system call, SECCOMP_BPF_ALLOW must be
>> +       returned.  At present, all other return values result in the
>> +       system call being blocked, but it is recommended to return
>> +       SECCOMP_BPF_DENY in those cases.  This will allow for future
>> +       custom return values to be introduced, if ever desired.
>> +
>> +       Usage:
>> +               prctl(PR_ATTACH_SECCOMP_FILTER, prog);
>> +
>> +       The 'prog' argument is a pointer to a struct sock_fprog which will
>> +       contain the filter program.  If the program is invalid, the call
>> +       will return -1 and set errno to EINVAL.
>> +
>> +       Note, is_compat_task is also tracked for the @prog.  This means
>> +       that once set the calling task will have all of its system calls
>> +       blocked if it switches its system call ABI.
>> +
>> +       If fork/clone and execve are allowed by @prog, any child processes will
>> +       be constrained to the same filters and system call ABI as the parent.
>> +
>> +       Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
>> +       run with CAP_SYS_ADMIN privileges in its namespace.  If these are not
>> +       true, -EACCES will be returned.  This requirement ensures that filter
>> +       programs cannot be applied to child processes with greater privileges
>> +       than the task that installed them.
>> +
>> +       Additionally, if prctl(2) is allowed by the attached filter,
>> +       additional filters may be layered on which will increase evaluation
>> +       time, but allow for further decreasing the attack surface during
>> +       execution of a process.
>> +
>> +The above call returns 0 on success and non-zero on error.
>> +
>> +Example
>> +-------
>> +
>> +The samples/seccomp/ directory contains both a 32-bit specific example
>> +and a more generic example of a higher level macro interface for BPF
>> +program generation.
>> +
>> +Adding architecture support
>> +---------------------------
>> +
>> +Any platform with seccomp support will support seccomp filters as long
>> +as CONFIG_SECCOMP_FILTER is enabled and the architecture has implemented
>> +syscall_get_arguments.
>> diff --git a/samples/Makefile b/samples/Makefile
>> index 6280817..f29b19c 100644
>> --- a/samples/Makefile
>> +++ b/samples/Makefile
>> @@ -1,4 +1,4 @@
>>  # Makefile for Linux samples code
>>
>>  obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
>> -                          hw_breakpoint/ kfifo/ kdb/ hidraw/
>> +                          hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
>> diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
>> new file mode 100644
>> index 0000000..0298c6f
>> --- /dev/null
>> +++ b/samples/seccomp/Makefile
>> @@ -0,0 +1,27 @@
>> +# kbuild trick to avoid linker error. Can be omitted if a module is built.
>> +obj- := dummy.o
>> +
>> +hostprogs-y := bpf-fancy
>> +bpf-fancy-objs := bpf-fancy.o bpf-helper.o
>> +
>> +HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
>> +HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
>> +HOSTCFLAGS_bpf-helper.o += -I$(objtree)/usr/include
>> +HOSTCFLAGS_bpf-helper.o += -idirafter $(objtree)/include
>> +
>> +# bpf-direct.c is x86-only.
>> +ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
>> +# List of programs to build
>> +hostprogs-y += bpf-direct
>> +bpf-direct-objs := bpf-direct.o
>> +endif
>> +
>> +# Tell kbuild to always build the programs
>> +always := $(hostprogs-y)
>> +
>> +HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
>> +HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
>> +ifeq ($(KBUILD_BUILDHOST),x86_64)
>> +HOSTCFLAGS_bpf-direct.o += -m32
>> +HOSTLOADLIBES_bpf-direct += -m32
>> +endif
>> diff --git a/samples/seccomp/bpf-direct.c b/samples/seccomp/bpf-direct.c
>> new file mode 100644
>> index 0000000..d799244
>> --- /dev/null
>> +++ b/samples/seccomp/bpf-direct.c
>> @@ -0,0 +1,77 @@
>> +/*
>> + * 32-bit seccomp filter example with BPF macros
>> + *
>> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> + * Author: Will Drewry <wad@chromium.org>
>> + *
>> + * The code may be used by anyone for any purpose,
>> + * and can serve as a starting point for developing
>> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
>> + */
>> +
>> +#include <linux/filter.h>
>> +#include <linux/ptrace.h>
>> +#include <linux/seccomp_filter.h>
>> +#include <linux/unistd.h>
>> +#include <stdio.h>
>> +#include <stddef.h>
>> +#include <sys/prctl.h>
>> +#include <unistd.h>
>> +
>> +#ifndef PR_ATTACH_SECCOMP_FILTER
>> +#      define PR_ATTACH_SECCOMP_FILTER 37
>> +#endif
>> +
>> +#define syscall_arg(_n) (offsetof(struct seccomp_filter_data, args[_n].lo32))
>> +#define nr (offsetof(struct seccomp_filter_data, syscall_nr))
>> +
>> +static int install_filter(void)
>> +{
>> +       struct seccomp_filter_block filter[] = {
>> +               /* Grab the system call number */
>> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS, nr),
>> +               /* Jump table for the allowed syscalls */
>> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
>> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
>> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
>> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
>> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
>> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
>> +
>> +               /* Check that read is only using stdin. */
>> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
>> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
>> +
>> +               /* Check that write is only using stdout/stderr */
>> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
>> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
>> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
>> +
>> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_ALLOW),
>> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_BPF_DENY),
>> +       };
>> +       struct seccomp_fprog prog = {
>> +               .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
>> +               .filter = filter,
>> +       };
>> +       if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
>> +               perror("prctl");
>> +               return 1;
>> +       }
>> +       return 0;
>> +}
>> +
>> +#define payload(_c) (_c), sizeof((_c))
>> +int main(int argc, char **argv)
>> +{
>> +       char buf[4096];
>> +       ssize_t bytes = 0;
>> +       if (install_filter())
>> +               return 1;
>> +       syscall(__NR_write, STDOUT_FILENO,
>> +               payload("OHAI! WHAT IS YOUR NAME? "));
>> +       bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
>> +       syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
>> +       syscall(__NR_write, STDOUT_FILENO, buf, bytes);
>> +       return 0;
>> +}
>> diff --git a/samples/seccomp/bpf-fancy.c b/samples/seccomp/bpf-fancy.c
>> new file mode 100644
>> index 0000000..1318b1a
>> --- /dev/null
>> +++ b/samples/seccomp/bpf-fancy.c
>> @@ -0,0 +1,95 @@
>> +/*
>> + * Seccomp BPF example using a macro-based generator.
>> + *
>> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> + * Author: Will Drewry <wad@chromium.org>
>> + *
>> + * The code may be used by anyone for any purpose,
>> + * and can serve as a starting point for developing
>> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
>> + */
>> +
>> +#include<linux/seccomp_filter.h>
>> +#include<linux/unistd.h>
>> +#include<stdio.h>
>> +#include<string.h>
>> +#include<sys/prctl.h>
>> +#include<unistd.h>
>> +
>> +#include "bpf-helper.h"
>> +
>> +#ifndef PR_ATTACH_SECCOMP_FILTER
>> +#      define PR_ATTACH_SECCOMP_FILTER 37
>> +#endif
>> +
>> +int main(int argc, char **argv)
>> +{
>> +       struct bpf_labels l;
>> +       static const char msg1[] = "Please type something: ";
>> +       static const char msg2[] = "You typed: ";
>> +       char buf[256];
>> +       struct seccomp_filter_block filter[] = {
>> +               LOAD_SYSCALL_NR,
>> +               SYSCALL(__NR_exit, ALLOW),
>> +               SYSCALL(__NR_exit_group, ALLOW),
>> +               SYSCALL(__NR_write, JUMP(&l, write_fd)),
>> +               SYSCALL(__NR_read, JUMP(&l, read)),
>> +               DENY,  /* Don't passthrough into a label */
>> +
>> +               LABEL(&l, read),
>> +               ARG(0),
>> +               JNE(STDIN_FILENO, DENY),
>> +               ARG(1),
>> +               JNE((unsigned long)buf, DENY),
>> +               ARG(2),
>> +               JGE(sizeof(buf), DENY),
>> +               ALLOW,
>> +
>> +               LABEL(&l, write_fd),
>> +               ARG(0),
>> +               JEQ(STDOUT_FILENO, JUMP(&l, write_buf)),
>> +               JEQ(STDERR_FILENO, JUMP(&l, write_buf)),
>> +               DENY,
>> +
>> +               LABEL(&l, write_buf),
>> +               ARG(1),
>> +               JEQ((unsigned long)msg1, JUMP(&l, msg1_len)),
>> +               JEQ((unsigned long)msg2, JUMP(&l, msg2_len)),
>> +               JEQ((unsigned long)buf, JUMP(&l, buf_len)),
>> +               DENY,
>> +
>> +               LABEL(&l, msg1_len),
>> +               ARG(2),
>> +               JLT(sizeof(msg1), ALLOW),
>> +               DENY,
>> +
>> +               LABEL(&l, msg2_len),
>> +               ARG(2),
>> +               JLT(sizeof(msg2), ALLOW),
>> +               DENY,
>> +
>> +               LABEL(&l, buf_len),
>> +               ARG(2),
>> +               JLT(sizeof(buf), ALLOW),
>> +               DENY,
>> +       };
>> +       struct seccomp_fprog prog = {
>> +               .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
>> +               .filter = filter,
>> +       };
>> +       ssize_t bytes;
>> +       bpf_resolve_jumps(&l, filter, sizeof(filter)/sizeof(*filter));
>> +
>> +       if (prctl(PR_ATTACH_SECCOMP_FILTER,&prog)) {
>> +               perror("prctl");
>> +               return 1;
>> +       }
>> +       syscall(__NR_write, STDOUT_FILENO, msg1, strlen(msg1));
>> +       bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf)-1);
>> +       bytes = (bytes>  0 ? bytes : 0);
>> +       syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2));
>> +       syscall(__NR_write, STDERR_FILENO, buf, bytes);
>> +       /* Now get killed */
>> +       syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2)+2);
>> +       return 0;
>> +}
>> diff --git a/samples/seccomp/bpf-helper.c b/samples/seccomp/bpf-helper.c
>> new file mode 100644
>> index 0000000..e1b6bc7
>> --- /dev/null
>> +++ b/samples/seccomp/bpf-helper.c
>> @@ -0,0 +1,89 @@
>> +/*
>> + * Seccomp BPF helper functions
>> + *
>> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> + * Author: Will Drewry<wad@chromium.org>
>> + *
>> + * The code may be used by anyone for any purpose,
>> + * and can serve as a starting point for developing
>> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
>> + */
>> +
>> +#include<stdio.h>
>> +#include<string.h>
>> +
>> +#include "bpf-helper.h"
>> +
>> +int bpf_resolve_jumps(struct bpf_labels *labels,
>> +                     struct seccomp_filter_block *filter, size_t count)
>> +{
>> +       struct seccomp_filter_block *begin = filter;
>> +       __u8 insn = count - 1;
>> +
>> +       if (count<  1)
>> +               return -1;
>> +       /*
>> +       * Walk it once, backwards, to build the label table and do fixups.
>> +       * Since backward jumps are disallowed by BPF, this is easy.
>> +       */
>> +       filter += insn;
>> +       for (; filter>= begin; --insn, --filter) {
>> +               if (filter->code != (BPF_JMP+BPF_JA))
>> +                       continue;
>> +               switch ((filter->jt<<8)|filter->jf) {
>> +               case (JUMP_JT<<8)|JUMP_JF:
>> +                       if (labels->labels[filter->k].location == 0xffffffff) {
>> +                               fprintf(stderr, "Unresolved label: '%s'\n",
>> +                                       labels->labels[filter->k].label);
>> +                               return 1;
>> +                       }
>> +                       filter->k = labels->labels[filter->k].location -
>> +                                   (insn + 1);
>> +                       filter->jt = 0;
>> +                       filter->jf = 0;
>> +                       continue;
>> +               case (LABEL_JT<<8)|LABEL_JF:
>> +                       if (labels->labels[filter->k].location != 0xffffffff) {
>> +                               fprintf(stderr, "Duplicate label use: '%s'\n",
>> +                                       labels->labels[filter->k].label);
>> +                               return 1;
>> +                       }
>> +                       labels->labels[filter->k].location = insn;
>> +                       filter->k = 0; /* fall through */
>> +                       filter->jt = 0;
>> +                       filter->jf = 0;
>> +                       continue;
>> +               }
>> +       }
>> +       return 0;
>> +}
>> +
>> +/* Simple lookup table for labels. */
>> +__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label)
>> +{
>> +       struct __bpf_label *begin = labels->labels, *end;
>> +       int id;
>> +       if (labels->count == 0) {
>> +               begin->label = label;
>> +               begin->location = 0xffffffff;
>> +               labels->count++;
>> +               return 0;
>> +       }
>> +       end = begin + labels->count;
>> +       for (id = 0; begin<  end; ++begin, ++id) {
>> +               if (!strcmp(label, begin->label))
>> +                       return id;
>> +       }
>> +       begin->label = label;
>> +       begin->location = 0xffffffff;
>> +       labels->count++;
>> +       return id;
>> +}
>> +
>> +void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count)
>> +{
>> +       struct seccomp_filter_block *end = filter + count;
>> +       for ( ; filter<  end; ++filter)
>> +               printf("{ code=%u,jt=%u,jf=%u,k=%u },\n",
>> +                       filter->code, filter->jt, filter->jf, filter->k);
>> +}
>> diff --git a/samples/seccomp/bpf-helper.h b/samples/seccomp/bpf-helper.h
>> new file mode 100644
>> index 0000000..92b94ec
>> --- /dev/null
>> +++ b/samples/seccomp/bpf-helper.h
>> @@ -0,0 +1,219 @@
>> +/*
>> + * Example wrapper around BPF macros.
>> + *
>> + * Copyright (c) 2012 The Chromium OS Authors <chromium-os-dev@chromium.org>
>> + * Author: Will Drewry<wad@chromium.org>
>> + *
>> + * The code may be used by anyone for any purpose,
>> + * and can serve as a starting point for developing
>> + * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
>> + *
>> + * No guarantees are provided with respect to the correctness
>> + * or functionality of this code.
>> + */
>> +#ifndef __BPF_HELPER_H__
>> +#define __BPF_HELPER_H__
>> +
>> +#include<asm/bitsperlong.h>    /* for __BITS_PER_LONG */
>> +#include<linux/filter.h>
>> +#include<linux/seccomp_filter.h>       /* for seccomp_filter_data.arg */
>> +#include<linux/types.h>
>> +#include<linux/unistd.h>
>> +#include<stddef.h>
>> +
>> +#define BPF_LABELS_MAX 256
>> +struct bpf_labels {
>> +       int count;
>> +       struct __bpf_label {
>> +               const char *label;
>> +               __u32 location;
>> +       } labels[BPF_LABELS_MAX];
>> +};
>> +
>> +int bpf_resolve_jumps(struct bpf_labels *labels,
>> +                     struct seccomp_filter_block *filter, size_t count);
>> +__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label);
>> +void seccomp_bpf_print(struct seccomp_filter_block *filter, size_t count);
>> +
>> +#define JUMP_JT 0xff
>> +#define JUMP_JF 0xff
>> +#define LABEL_JT 0xfe
>> +#define LABEL_JF 0xfe
>> +
>> +#define ALLOW \
>> +       BPF_STMT(BPF_RET+BPF_K, 0xFFFFFFFF)
>> +#define DENY \
>> +       BPF_STMT(BPF_RET+BPF_K, 0)
>> +#define JUMP(labels, label) \
>> +       BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
>> +                JUMP_JT, JUMP_JF)
>> +#define LABEL(labels, label) \
>> +       BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
>> +                LABEL_JT, LABEL_JF)
>> +#define SYSCALL(nr, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (nr), 0, 1), \
>> +       jt
>> +
>> +/* Lame, but just an example */
>> +#define FIND_LABEL(labels, label) seccomp_bpf_label((labels), #label)
>> +
>> +#define EXPAND(...) __VA_ARGS__
>> +/* Map all width-sensitive operations */
>> +#if __BITS_PER_LONG == 32
>> +
>> +#define JEQ(x, jt) JEQ32(x, EXPAND(jt))
>> +#define JNE(x, jt) JNE32(x, EXPAND(jt))
>> +#define JGT(x, jt) JGT32(x, EXPAND(jt))
>> +#define JLT(x, jt) JLT32(x, EXPAND(jt))
>> +#define JGE(x, jt) JGE32(x, EXPAND(jt))
>> +#define JLE(x, jt) JLE32(x, EXPAND(jt))
>> +#define JA(x, jt) JA32(x, EXPAND(jt))
>> +#define ARG(i) ARG_32(i)
>> +
>> +#elif __BITS_PER_LONG == 64
>> +
>> +#define JEQ(x, jt) \
>> +       JEQ64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> +             ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> +             EXPAND(jt))
>> +#define JGT(x, jt) \
>> +       JGT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> +             ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> +             EXPAND(jt))
>> +#define JGE(x, jt) \
>> +       JGE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> +             ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> +             EXPAND(jt))
>> +#define JNE(x, jt) \
>> +       JNE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> +             ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> +             EXPAND(jt))
>> +#define JLT(x, jt) \
>> +       JLT64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> +             ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> +             EXPAND(jt))
>> +#define JLE(x, jt) \
>> +       JLE64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> +             ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> +             EXPAND(jt))
>> +
>> +#define JA(x, jt) \
>> +       JA64(((union seccomp_filter_arg){.u64 = (x)}).lo32, \
>> +              ((union seccomp_filter_arg){.u64 = (x)}).hi32, \
>> +              EXPAND(jt))
>> +#define ARG(i) ARG_64(i)
>> +
>> +#else
>> +#error __BITS_PER_LONG value unusable.
>> +#endif
>> +
>> +/* Loads the arg into A */
>> +#define ARG_32(idx) \
>> +       BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
>> +               offsetof(struct seccomp_filter_data, args[(idx)].lo32))
>> +
>> +/* Loads hi into A and lo in X */
>> +#define ARG_64(idx) \
>> +       BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
>> +         offsetof(struct seccomp_filter_data, args[(idx)].lo32)), \
>> +       BPF_STMT(BPF_ST, 0), /* lo ->  M[0] */ \
>> +       BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
>> +         offsetof(struct seccomp_filter_data, args[(idx)].hi32)), \
>> +       BPF_STMT(BPF_ST, 1) /* hi ->  M[1] */
>> +
>> +#define JEQ32(value, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 0, 1), \
>> +       jt
>> +
>> +#define JNE32(value, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 1, 0), \
>> +       jt
>> +
>> +/* Checks the lo, then swaps to check the hi. A=lo,X=hi */
>> +#define JEQ64(lo, hi, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
>> +       BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 0, 2), \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
>> +       jt, \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
>> +
>> +#define JNE64(lo, hi, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 5, 0), \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
>> +       BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 2, 0), \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
>> +       jt, \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
>> +
>> +#define JA32(value, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (value), 0, 1), \
>> +       jt
>> +
>> +#define JA64(lo, hi, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (hi), 3, 0), \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
>> +       BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (lo), 0, 2), \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
>> +       jt, \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
>> +
>> +#define JGE32(value, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 0, 1), \
>> +       jt
>> +
>> +#define JLT32(value, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 1, 0), \
>> +       jt
>> +
>> +/* Shortcut checking if hi>  arg.hi. */
>> +#define JGE64(lo, hi, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
>> +       BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
>> +       BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 0, 2), \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
>> +       jt, \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
>> +
>> +#define JLT64(lo, hi, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (hi), 0, 4), \
>> +       BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
>> +       BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
>> +       jt, \
>> +       BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
>> +
>> +#define JGT32(value, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
>> +       jt
>> +
>> +#define JLE32(value, jt) \
>> +       BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
>> +       jt
>
>
> Should the true/false offsets be reversed here?

Looks that way :)
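
To spell out the point for anyone reading along: A <= value is the negation of A > value, so with BPF_JGT the jt expression has to sit on the false branch. Below is a minimal sketch of a corrected macro. It assumes mainline's struct sock_filter and the BPF_JUMP/BPF_STMT macros from <linux/filter.h> in place of this patch's struct seccomp_filter_block. JLE32_FIXED, run_toy() and le_allowed() are hypothetical names, and the toy evaluator understands only the two opcodes used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>	/* struct sock_filter, BPF_JUMP, BPF_STMT */

static_assert(sizeof(struct sock_filter) == 8,
	      "classic BPF instructions are 8 bytes");

/*
 * Hypothetical corrected variant: when A > value, skip over "jt";
 * otherwise (A <= value) fall through into it.
 */
#define JLE32_FIXED(value, jt) \
	BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 1, 0), \
	jt

/* Toy evaluator (assumption: only JGT+K and RET+K are ever present). */
static uint32_t run_toy(const struct sock_filter *f, size_t len, uint32_t a)
{
	size_t pc = 0;

	while (pc < len) {
		const struct sock_filter *insn = &f[pc++];

		if (insn->code == (BPF_RET | BPF_K))
			return insn->k;
		if (insn->code == (BPF_JMP | BPF_JGT | BPF_K))
			pc += (a > insn->k) ? insn->jt : insn->jf;
	}
	return 0;	/* ran off the end: treat as deny */
}

/* Returns 1 when a <= limit and 0 otherwise, as decided by the filter. */
static uint32_t le_allowed(uint32_t a, uint32_t limit)
{
	struct sock_filter filter[] = {
		JLE32_FIXED(limit, BPF_STMT(BPF_RET+BPF_K, 1)),	/* a <= limit */
		BPF_STMT(BPF_RET+BPF_K, 0),			/* a >  limit */
	};

	return run_toy(filter, sizeof(filter) / sizeof(filter[0]), a);
}
```

Calling le_allowed(a, limit) from a small main() with values on either side of the limit is enough to see the branch sense. With the offsets as posted (0, 1), the same filter would behave like JGT32 instead.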

> Thanks for all the work on this.  We're looking forward to using it with
> QEMU.

Definitely - thanks!
will


* Re: [PATCH v6 2/3] seccomp_filters: system call filtering using BPF
  2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
@ 2012-01-31 14:13   ` Eduardo Otubo
  2012-01-31 15:20     ` Will Drewry
  2012-02-02 15:32   ` Serge E. Hallyn
  1 sibling, 1 reply; 13+ messages in thread
From: Eduardo Otubo @ 2012-01-31 14:13 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, indan, mcgrathr

On Sat, Jan 28, 2012 at 04:11:54PM -0600, Will Drewry wrote:
> [This patch depends on luto@mit.edu's no_new_privs patch:
>  https://lkml.org/lkml/2012/1/12/446
> ]

Will,

I know you clearly said to use luto@mit.edu's first no_new_privs
patch, but I couldn't resist testing it with the latest (3rd) version
of the patch [0], which defines PR_GET_NO_NEW_PRIVS as 37, as you can see
here [1]. The compilation then breaks here:

     CC      kernel/sys.o
   kernel/sys.c: In function ‘sys_prctl’:
   kernel/sys.c:1975: error: duplicate case value
   kernel/sys.c:1904: error: previously used here
   make[1]: *** [kernel/sys.o] Error 1
   make: *** [kernel] Error 2

I just changed the value of PR_ATTACH_SECCOMP_FILTER to 38 and
everything went fine. Do you see any problems with changing this value?

Regards,

[0] - https://git.kernel.org/?p=linux/kernel/git/luto/linux.git;a=heads
[1] -
https://git.kernel.org/?p=linux/kernel/git/luto/linux.git;a=blobdiff;f=include/linux/prctl.h;h=a6b5ac9cfe560eeb277646fbe338ae2b14c46caf;hp=7ddc7f1b480fd41318d94c0a39c8e2ff80f9c5f8;hb=7102b0e278af50d27b5d61d1be5faaba1b0a091e;hpb=acb42a3b611d7ad4cb173c3b37674b549df2ffeb

-- 
Eduardo Otubo
Software Engineer
Linux Technology Center
IBM Systems & Technology Group
Mobile: +55 19 8135 0885 
eotubo@linux.vnet.ibm.com



* Re: [PATCH v6 2/3] seccomp_filters: system call filtering using BPF
  2012-01-31 14:13   ` Eduardo Otubo
@ 2012-01-31 15:20     ` Will Drewry
  0 siblings, 0 replies; 13+ messages in thread
From: Will Drewry @ 2012-01-31 15:20 UTC (permalink / raw)
  To: Eduardo Otubo
  Cc: linux-kernel, keescook, john.johansen, serge.hallyn, coreyb,
	pmoore, eparis, djm, torvalds, segoon, rostedt, jmorris,
	scarybeasts, avi, penberg, viro, luto, mingo, akpm, khilman,
	borislav.petkov, amwang, oleg, ak, eric.dumazet, gregkh,
	dhowells, daniel.lezcano, linux-fsdevel, linux-security-module,
	olofj, mhalcrow, dlaor, corbet, alan, indan, mcgrathr

On Tue, Jan 31, 2012 at 7:13 AM, Eduardo Otubo <otubo@linux.vnet.ibm.com> wrote:
> On Sat, Jan 28, 2012 at 04:11:54PM -0600, Will Drewry wrote:
>> [This patch depends on luto@mit.edu's no_new_privs patch:
>>  https://lkml.org/lkml/2012/1/12/446
>> ]
>
> Will,
>
> I know you clearly said to use luto@mit.edu's first no_new_privs
> patch, but I couldn't resist testing it with the latest (3rd) version
> of the patch [0], which defines PR_GET_NO_NEW_PRIVS as 37, as you can see
> here [1]. The compilation then breaks here:
>
>     CC      kernel/sys.o
>   kernel/sys.c: In function ‘sys_prctl’:
>   kernel/sys.c:1975: error: duplicate case value
>   kernel/sys.c:1904: error: previously used here
>   make[1]: *** [kernel/sys.o] Error 1
>   make: *** [kernel] Error 2
>
> I just changed the value of PR_ATTACH_SECCOMP_FILTER to 38 and
> everything went fine. Do you see any problems with changing this value?

Should be fine -- in the next version, I won't be adding a new PR_
define at all.  Feel free to change it to whatever compiles -- the
code only uses the define name for access.  Sorry for the collision -
I posted the last rev without the latest from luto.
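
To illustrate why the build broke and why renumbering is harmless, here is a toy sketch. The EX_ names, the numbers and ex_prctl_dispatch() are hypothetical stand-ins, not the kernel's sys_prctl(). Two case labels with the same value are a compile error, and callers that refer to the option by name never notice the value changing.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical option numbers, for illustration only. */
#define EX_PR_GET_NO_NEW_PRIVS		37
#define EX_PR_ATTACH_SECCOMP_FILTER	38	/* 37 here would be a duplicate case */

/* Catches the collision with a readable message before the switch does. */
static_assert(EX_PR_GET_NO_NEW_PRIVS != EX_PR_ATTACH_SECCOMP_FILTER,
	      "prctl option numbers must be unique");

/* Toy dispatcher: returns a small tag per option, -EINVAL if unknown. */
static int ex_prctl_dispatch(int option)
{
	switch (option) {
	case EX_PR_GET_NO_NEW_PRIVS:
		return 1;
	case EX_PR_ATTACH_SECCOMP_FILTER:
		return 2;
	default:
		return -EINVAL;
	}
}
```

One caveat, assuming the samples are built against headers that lack the define: their fallback "#define PR_ATTACH_SECCOMP_FILTER 37" would then need the same renumbering.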

Cheers!
will

> Regards,
>
> [0] - https://git.kernel.org/?p=linux/kernel/git/luto/linux.git;a=heads
> [1] -
> https://git.kernel.org/?p=linux/kernel/git/luto/linux.git;a=blobdiff;f=include/linux/prctl.h;h=a6b5ac9cfe560eeb277646fbe338ae2b14c46caf;hp=7ddc7f1b480fd41318d94c0a39c8e2ff80f9c5f8;hb=7102b0e278af50d27b5d61d1be5faaba1b0a091e;hpb=acb42a3b611d7ad4cb173c3b37674b549df2ffeb
>
> --
> Eduardo Otubo
> Software Engineer
> Linux Technology Center
> IBM Systems & Technology Group
> Mobile: +55 19 8135 0885
> eotubo@linux.vnet.ibm.com
>


* Re: [PATCH v6 1/3] seccomp: kill the seccomp_t typedef
  2012-01-28 22:11 [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Will Drewry
  2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-28 22:11 ` [PATCH v6 3/3] Documentation: prctl/seccomp_filter Will Drewry
@ 2012-02-02 15:29 ` Serge E. Hallyn
  2012-02-03 23:16   ` Will Drewry
  2 siblings, 1 reply; 13+ messages in thread
From: Serge E. Hallyn @ 2012-02-02 15:29 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

Quoting Will Drewry (wad@chromium.org):
> Replaces the seccomp_t typedef with seccomp_struct to match modern
> kernel style.

(sorry, I'm a bit behind on list)

You were going to switch this to 'struct seccomp' right?

> Signed-off-by: Will Drewry <wad@chromium.org>
> ---
>  include/linux/sched.h   |    2 +-
>  include/linux/seccomp.h |   10 ++++++----
>  2 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4032ec1..288b5cb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1418,7 +1418,7 @@ struct task_struct {
>  	uid_t loginuid;
>  	unsigned int sessionid;
>  #endif
> -	seccomp_t seccomp;
> +	struct seccomp_struct seccomp;
>  
>  /* Thread group tracking */
>     	u32 parent_exec_id;
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index cc7a4e9..171ab66 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -7,7 +7,9 @@
>  #include <linux/thread_info.h>
>  #include <asm/seccomp.h>
>  
> -typedef struct { int mode; } seccomp_t;
> +struct seccomp_struct {
> +	int mode;
> +};
>  
>  extern void __secure_computing(int);
>  static inline void secure_computing(int this_syscall)
> @@ -19,7 +21,7 @@ static inline void secure_computing(int this_syscall)
>  extern long prctl_get_seccomp(void);
>  extern long prctl_set_seccomp(unsigned long);
>  
> -static inline int seccomp_mode(seccomp_t *s)
> +static inline int seccomp_mode(struct seccomp_struct *s)
>  {
>  	return s->mode;
>  }
> @@ -28,7 +30,7 @@ static inline int seccomp_mode(seccomp_t *s)
>  
>  #include <linux/errno.h>
>  
> -typedef struct { } seccomp_t;
> +struct seccomp_struct { };
>  
>  #define secure_computing(x) do { } while (0)
>  
> @@ -42,7 +44,7 @@ static inline long prctl_set_seccomp(unsigned long arg2)
>  	return -EINVAL;
>  }
>  
> -static inline int seccomp_mode(seccomp_t *s)
> +static inline int seccomp_mode(struct seccomp_struct *s)
>  {
>  	return 0;
>  }
> -- 
> 1.7.5.4
> 


* Re: [PATCH v6 2/3] seccomp_filters: system call filtering using BPF
  2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
  2012-01-31 14:13   ` Eduardo Otubo
@ 2012-02-02 15:32   ` Serge E. Hallyn
  2012-02-03 23:14     ` Will Drewry
  1 sibling, 1 reply; 13+ messages in thread
From: Serge E. Hallyn @ 2012-02-02 15:32 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

Quoting Will Drewry (wad@chromium.org):
> [This patch depends on luto@mit.edu's no_new_privs patch:
>  https://lkml.org/lkml/2012/1/12/446
> ]
> 
> This patch adds support for seccomp mode 2.  This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task.  The policy is expressed in terms of a Berkeley
> Packet Filter program, as is used for userland-exposed socket filtering.
> Instead of network data, the BPF program is evaluated over struct
> seccomp_filter_data at the time of the system call.
> 
> A filter program may be installed by a userland task by calling
>   prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
> where fprog is of type struct sock_fprog.
> 
> If the first filter program allows subsequent prctl(2) calls, then
> additional filter programs may be attached.  All attached programs
> must be evaluated before a system call will be allowed to proceed.
> 
> To avoid CONFIG_COMPAT related landmines, once a filter program is
> installed using specific is_compat_task() value, it is not allowed to
> make system calls using the alternate entry point.
> 
> Filter programs will be inherited across fork/clone and execve, however
> the installation of filters must be preceded by setting 'no_new_privs'
> to ensure that unprivileged tasks cannot attach filters that affect
> privileged tasks (e.g., setuid binary).  Tasks with CAP_SYS_ADMIN
> in their namespace may install inheritable filters without setting
> the no_new_privs bit.
> 
> There are a number of benefits to this approach. A few of which are
> as follows:
> - BPF has been exposed to userland for a long time.
> - Userland already knows its ABI: system call numbers and desired
>   arguments
> - No time-of-check-time-of-use vulnerable data accesses are possible.
> - system call arguments are loaded on demand only to minimize copying
>   required for system call number-only policy decisions.
> 
> This patch includes its own BPF evaluator, but relies on the
> net/core/filter.c BPF checking code.  It is possible to share
> evaluators, but the performance sensitive nature of the network
> filtering path makes it an iterative optimization which (I think :) can
> be tackled separately via separate patchsets. (And at some point sharing
> BPF JIT code!)
> 
>  v6: - fix memory leak on attach compat check failure
>      - require no_new_privs || CAP_SYS_ADMIN prior to filter
>        installation. (luto@mit.edu)
>      - s/seccomp_struct_/seccomp_/ for macros/functions
>        (amwang@redhat.com)
>      - cleaned up Kconfig (amwang@redhat.com)
>      - on block, note if the call was compat (so the # means something)
>  v5: - uses syscall_get_arguments
>        (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
>      - uses union-based arg storage with hi/lo struct to
>        handle endianness.  Compromises between the two alternate
>        proposals to minimize extra arg shuffling and account for
>        endianness assuming userspace uses offsetof().
>        (mcgrathr@chromium.org, indan@nul.nu)
>      - update Kconfig description
>      - add include/seccomp_filter.h and add its installation
>      - (naive) on-demand syscall argument loading
>      - drop seccomp_t (eparis@redhat.com)
>  v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
>      - now uses current->no_new_privs
>          (luto@mit.edu,torvalds@linux-foundation.com)
>      - assign names to seccomp modes (rdunlap@xenotime.net)
>      - fix style issues (rdunlap@xenotime.net)
>      - reworded Kconfig entry (rdunlap@xenotime.net)
>  v3: - macros to inline (oleg@redhat.com)
>      - init_task behavior fixed (oleg@redhat.com)
>      - drop creator entry and extra NULL check (oleg@redhat.com)
>      - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
>      - adds tentative use of "always_unprivileged" as per
>        torvalds@linux-foundation.org and luto@mit.edu
>  v2: - (patch 2 only)
> 
> Signed-off-by: Will Drewry <wad@chromium.org>

Hi Will,

as far as I can tell based on changelog I suspect you could have
kept my Acked-by (from v3?).  However, I'll wait until your next
submission (as I see there were a few change requests), and do a
final complete new review of that.

Thanks for continuing to push on this.

-serge
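
For readers unfamiliar with the v5 changelog item quoted above (union-based arg storage with a hi/lo struct), here is a rough sketch of the idea. union ex_filter_arg is a hypothetical mirror, not the definition in this series' linux/seccomp_filter.h, and the byte-order test relies on the GCC/Clang predefined __BYTE_ORDER__ macro. The aim is that filters address lo32/hi32 by name via offsetof(), so the same filter source works on either endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the 32-bit halves are ordered to overlay u64 correctly. */
union ex_filter_arg {
	struct {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
		uint32_t hi32;
		uint32_t lo32;
#else
		uint32_t lo32;
		uint32_t hi32;
#endif
	};
	uint64_t u64;
};

static_assert(sizeof(union ex_filter_arg) == sizeof(uint64_t),
	      "the two halves must exactly cover the 64-bit value");

/* Low 32 bits of a 64-bit syscall argument, read through the union. */
static uint32_t ex_arg_lo32(uint64_t v)
{
	return ((union ex_filter_arg){ .u64 = v }).lo32;
}

/* High 32 bits of a 64-bit syscall argument, read through the union. */
static uint32_t ex_arg_hi32(uint64_t v)
{
	return ((union ex_filter_arg){ .u64 = v }).hi32;
}
```

A filter would then load offsetof(..., lo32) or offsetof(..., hi32) rather than hard-coding 0 or 4, which is what the ARG_32/ARG_64 helpers in the samples do against struct seccomp_filter_data.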


* Re: [PATCH v6 2/3] seccomp_filters: system call filtering using BPF
  2012-02-02 15:32   ` Serge E. Hallyn
@ 2012-02-03 23:14     ` Will Drewry
  0 siblings, 0 replies; 13+ messages in thread
From: Will Drewry @ 2012-02-03 23:14 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

On Thu, Feb 2, 2012 at 7:32 AM, Serge E. Hallyn
<serge.hallyn@canonical.com> wrote:
> Quoting Will Drewry (wad@chromium.org):
>> [This patch depends on luto@mit.edu's no_new_privs patch:
>>  https://lkml.org/lkml/2012/1/12/446
>> ]
>>
>> This patch adds support for seccomp mode 2.  This mode enables dynamic
>> enforcement of system call filtering policy in the kernel as specified
>> by a userland task.  The policy is expressed in terms of a Berkeley
>> Packet Filter program, as is used for userland-exposed socket filtering.
>> Instead of network data, the BPF program is evaluated over struct
>> seccomp_filter_data at the time of the system call.
>>
>> A filter program may be installed by a userland task by calling
>>   prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
>> where fprog is of type struct sock_fprog.
>>
>> If the first filter program allows subsequent prctl(2) calls, then
>> additional filter programs may be attached.  All attached programs
>> must be evaluated before a system call will be allowed to proceed.
>>
>> To avoid CONFIG_COMPAT related landmines, once a filter program is
>> installed using specific is_compat_task() value, it is not allowed to
>> make system calls using the alternate entry point.
>>
>> Filter programs will be inherited across fork/clone and execve, however
>> the installation of filters must be preceded by setting 'no_new_privs'
>> to ensure that unprivileged tasks cannot attach filters that affect
>> privileged tasks (e.g., setuid binary).  Tasks with CAP_SYS_ADMIN
>> in their namespace may install inheritable filters without setting
>> the no_new_privs bit.
>>
>> There are a number of benefits to this approach. A few of which are
>> as follows:
>> - BPF has been exposed to userland for a long time.
>> - Userland already knows its ABI: system call numbers and desired
>>   arguments
>> - No time-of-check-time-of-use vulnerable data accesses are possible.
>> - system call arguments are loaded on demand only to minimize copying
>>   required for system call number-only policy decisions.
>>
>> This patch includes its own BPF evaluator, but relies on the
>> net/core/filter.c BPF checking code.  It is possible to share
>> evaluators, but the performance sensitive nature of the network
>> filtering path makes it an iterative optimization which (I think :) can
>> be tackled separately via separate patchsets. (And at some point sharing
>> BPF JIT code!)
>>
>>  v6: - fix memory leak on attach compat check failure
>>      - require no_new_privs || CAP_SYS_ADMIN prior to filter
>>        installation. (luto@mit.edu)
>>      - s/seccomp_struct_/seccomp_/ for macros/functions
>>        (amwang@redhat.com)
>>      - cleaned up Kconfig (amwang@redhat.com)
>>      - on block, note if the call was compat (so the # means something)
>>  v5: - uses syscall_get_arguments
>>        (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
>>      - uses union-based arg storage with hi/lo struct to
>>        handle endianness.  Compromises between the two alternate
>>        proposals to minimize extra arg shuffling and account for
>>        endianness assuming userspace uses offsetof().
>>        (mcgrathr@chromium.org, indan@nul.nu)
>>      - update Kconfig description
>>      - add include/seccomp_filter.h and add its installation
>>      - (naive) on-demand syscall argument loading
>>      - drop seccomp_t (eparis@redhat.com)
>>  v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
>>      - now uses current->no_new_privs
>>          (luto@mit.edu,torvalds@linux-foundation.com)
>>      - assign names to seccomp modes (rdunlap@xenotime.net)
>>      - fix style issues (rdunlap@xenotime.net)
>>      - reworded Kconfig entry (rdunlap@xenotime.net)
>>  v3: - macros to inline (oleg@redhat.com)
>>      - init_task behavior fixed (oleg@redhat.com)
>>      - drop creator entry and extra NULL check (oleg@redhat.com)
>>      - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
>>      - adds tentative use of "always_unprivileged" as per
>>        torvalds@linux-foundation.org and luto@mit.edu
>>  v2: - (patch 2 only)
>>
>> Signed-off-by: Will Drewry <wad@chromium.org>
>
> Hi Will,
>
> as far as I can tell based on changelog I suspect you could have
> kept my Acked-by (from v3?).  However, I'll wait until your next
> submission (as I see there were a few change requests), and do a
> final complete new review of that.

Thanks, Serge!  I just failed at the proper protocol and didn't mean
to not include your Acked-by.   However, I am changing a fair amount
of the internals this time around, so I'll be happy to have another
full review.

> Thanks for continuing to push on this.

Definitely! I've been traveling this week, so it's been a bit slow
going, but I hope to have the next rev up early next week if not
sooner.

Cheers!
will


* Re: [PATCH v6 1/3] seccomp: kill the seccomp_t typedef
  2012-02-02 15:29 ` [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Serge E. Hallyn
@ 2012-02-03 23:16   ` Will Drewry
  2012-02-04  1:05     ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Will Drewry @ 2012-02-03 23:16 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: linux-kernel, keescook, john.johansen, coreyb, pmoore, eparis,
	djm, torvalds, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

On Thu, Feb 2, 2012 at 7:29 AM, Serge E. Hallyn
<serge.hallyn@canonical.com> wrote:
> Quoting Will Drewry (wad@chromium.org):
>> Replaces the seccomp_t typedef with seccomp_struct to match modern
>> kernel style.
>
> (sorry, I'm a bit behind on list)
>
> You were going to switch this to 'struct seccomp' right?

I wasn't sure if

task_struct {
 ...
 struct seccomp seccomp;
}

was as ideal.  I've noticed that almost all of the duplicate names in
the task struct use redundancy to differentiate the naming, but I'm
happy enough to rename if appropriate.


>> Signed-off-by: Will Drewry <wad@chromium.org>
>> ---
>>  include/linux/sched.h   |    2 +-
>>  include/linux/seccomp.h |   10 ++++++----
>>  2 files changed, 7 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 4032ec1..288b5cb 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1418,7 +1418,7 @@ struct task_struct {
>>       uid_t loginuid;
>>       unsigned int sessionid;
>>  #endif
>> -     seccomp_t seccomp;
>> +     struct seccomp_struct seccomp;
>>
>>  /* Thread group tracking */
>>       u32 parent_exec_id;
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index cc7a4e9..171ab66 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -7,7 +7,9 @@
>>  #include <linux/thread_info.h>
>>  #include <asm/seccomp.h>
>>
>> -typedef struct { int mode; } seccomp_t;
>> +struct seccomp_struct {
>> +     int mode;
>> +};
>>
>>  extern void __secure_computing(int);
>>  static inline void secure_computing(int this_syscall)
>> @@ -19,7 +21,7 @@ static inline void secure_computing(int this_syscall)
>>  extern long prctl_get_seccomp(void);
>>  extern long prctl_set_seccomp(unsigned long);
>>
>> -static inline int seccomp_mode(seccomp_t *s)
>> +static inline int seccomp_mode(struct seccomp_struct *s)
>>  {
>>       return s->mode;
>>  }
>> @@ -28,7 +30,7 @@ static inline int seccomp_mode(seccomp_t *s)
>>
>>  #include <linux/errno.h>
>>
>> -typedef struct { } seccomp_t;
>> +struct seccomp_struct { };
>>
>>  #define secure_computing(x) do { } while (0)
>>
>> @@ -42,7 +44,7 @@ static inline long prctl_set_seccomp(unsigned long arg2)
>>       return -EINVAL;
>>  }
>>
>> -static inline int seccomp_mode(seccomp_t *s)
>> +static inline int seccomp_mode(struct seccomp_struct *s)
>>  {
>>       return 0;
>>  }
>> --
>> 1.7.5.4
>>


* Re: [PATCH v6 1/3] seccomp: kill the seccomp_t typedef
  2012-02-03 23:16   ` Will Drewry
@ 2012-02-04  1:05     ` Linus Torvalds
  2012-02-06 16:13       ` Will Drewry
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2012-02-04  1:05 UTC (permalink / raw)
  To: Will Drewry
  Cc: Serge E. Hallyn, linux-kernel, keescook, john.johansen, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

On Fri, Feb 3, 2012 at 3:16 PM, Will Drewry <wad@chromium.org> wrote:
>
> task_struct {
>  ...
>  struct seccomp seccomp;
> }
>
> was as ideal.  I've noticed that almost all of the duplicate names in
> the task struct use redundancy to differentiate the naming, but I'm
> happy enough to rename if appropriate.

The redundant "struct xyz_struct" naming is traditional, but we try to
avoid it these days. The reason for it is that I long long ago was a
bit confused about the C namespace rules, so for the longest time I
made struct names unique for no really good reason. The struct/union
namespace is separate from the other namespaces, so trying to make
things unique really has no good reason.

And obviously "struct task_struct" is one of those very old things,
and then the "struct xyz_struct" naming kind of spread from there.

I think "struct seccomp" is fine, and even if "struct x x" looks a bit
odd, it's at least _less_ repetition than "struct x_struct x" which is
just really repetitive.
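
A minimal sketch of the namespace point, assuming trimmed-down stand-ins
rather than the real kernel definitions (struct task and namespace_demo()
are hypothetical names used only for illustration). It shows that a struct
tag, a member, and a local variable can all be called "seccomp" without
colliding, because C keeps tags, members, and ordinary identifiers in
separate namespaces:

```c
#include <assert.h>

/* Simplified stand-in for the seccomp state; not the kernel's definition. */
struct seccomp {
	int mode;
};

/* Hypothetical, heavily trimmed stand-in for task_struct. */
struct task {
	/* The tag "seccomp" and the member "seccomp" live in different namespaces. */
	struct seccomp seccomp;
};

static inline int seccomp_mode(struct seccomp *s)
{
	return s->mode;
}

/* Hypothetical helper that exercises all three uses of the name. */
static int namespace_demo(void)
{
	struct task t = { .seccomp = { .mode = 1 } };
	/* An ordinary identifier may also share its name with the tag. */
	struct seccomp seccomp = t.seccomp;

	assert(seccomp.mode == seccomp_mode(&t.seccomp));
	return seccomp.mode;
}
```

Calling namespace_demo() returns the mode stored in the embedded struct,
so nothing is gained by appending "_struct" to the tag just to keep the
names apart.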

That said, just to make "grep" easier, please do the whole "struct
xyz" always together, and always with just a single space in between
them, so that

   git grep "struct xyz"

does the right thing. And for the same reason, when declaring a
struct, people should always use "struct xyz {", with that exact
spacing. The exact details of spacing obviously have no semantic
meaning, but making it easy to grep for use and for definition is
really convenient.

                   Linus


* Re: [PATCH v6 1/3] seccomp: kill the seccomp_t typedef
  2012-02-04  1:05     ` Linus Torvalds
@ 2012-02-06 16:13       ` Will Drewry
  0 siblings, 0 replies; 13+ messages in thread
From: Will Drewry @ 2012-02-06 16:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Serge E. Hallyn, linux-kernel, keescook, john.johansen, coreyb,
	pmoore, eparis, djm, segoon, rostedt, jmorris, scarybeasts, avi,
	penberg, viro, luto, mingo, akpm, khilman, borislav.petkov,
	amwang, oleg, ak, eric.dumazet, gregkh, dhowells, daniel.lezcano,
	linux-fsdevel, linux-security-module, olofj, mhalcrow, dlaor,
	corbet, alan, indan, mcgrathr

On Fri, Feb 3, 2012 at 7:05 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Feb 3, 2012 at 3:16 PM, Will Drewry <wad@chromium.org> wrote:
>>
>> task_struct {
>>  ...
>>  struct seccomp seccomp;
>> }
>>
>> was as ideal.  I've noticed that almost all of the duplicate names in
>> the task struct use redundancy to differentiate the naming, but I'm
>> happy enough to rename if appropriate.
>
> The redundant "struct xyz_struct" naming is traditional, but we try to
> avoid it these days. The reason for it is that I long long ago was a
> bit confused about the C namespace rules, so for the longest time I
> made struct names unique for no really good reason. The struct/union
> namespace is separate from the other namespaces, so trying to make
> things unique really has no good reason.
>
> And obviously "struct task_struct" is one of those very old things,
> and then the "struct xyz_struct" naming kind of spread from there.
>
> I think "struct seccomp" is fine, and even if "struct x x" looks a bit
> odd, it's at least _less_ repetition than "struct x_struct x" which is
> just really repetitive.
>
> That said, just to make "grep" easier, please do the whole "struct
> xyz" always together, and always with just a single space in between
> them, so that
>
>   git grep "struct xyz"
>
> does the right thing. And for the same reason, when declaring a
> struct, people should always use "struct xyz {", with that exact
> spacing. The exact details of spacing obviously have no semantic
> meaning, but making it easy to grep for use and for definition is
> really convenient.

Thanks for the background and explanation!
will


end of thread, other threads:[~2012-02-06 16:13 UTC | newest]

Thread overview: 13+ messages
2012-01-28 22:11 [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Will Drewry
2012-01-28 22:11 ` [PATCH v6 2/3] seccomp_filters: system call filtering using BPF Will Drewry
2012-01-31 14:13   ` Eduardo Otubo
2012-01-31 15:20     ` Will Drewry
2012-02-02 15:32   ` Serge E. Hallyn
2012-02-03 23:14     ` Will Drewry
2012-01-28 22:11 ` [PATCH v6 3/3] Documentation: prctl/seccomp_filter Will Drewry
2012-01-30 22:47   ` Corey Bryant
2012-01-30 22:52     ` Will Drewry
2012-02-02 15:29 ` [PATCH v6 1/3] seccomp: kill the seccomp_t typedef Serge E. Hallyn
2012-02-03 23:16   ` Will Drewry
2012-02-04  1:05     ` Linus Torvalds
2012-02-06 16:13       ` Will Drewry
